What CodeClanker checks (and why each dimension matters)
A canonical reference for the nine production-readiness dimensions we score. If you want to know what CodeClanker actually inspects in your repo and why a low score in one dimension can sink a launch — start here.
CodeClanker scores your repository across nine production-readiness dimensions on a 0–100 scale. The numbers are not arbitrary — each dimension represents a specific category of failure that takes down AI-built MVPs once they meet real users. This page explains what we look at for each one, what evidence we use to score it, and why the dimension matters.
Two things to know before you read on:
- Most AI-built MVPs land below 30 overall. That is the calibration, not a bug. The free scan exists to surface the gap between "Cursor shipped this in a weekend" and "this survives 10,000 strangers."
- Some checks are deterministic, some are interpretive. CVE detection, secret scanning, committed env files, and license classification are deterministic — they come from real tooling (OSV.dev, regex over file contents, GitHub metadata). Dimension scoring is interpretive: a large language model reads the deterministic findings, your file tree, and key config files, then scores each dimension against an explicit rubric. The deterministic side is sketched just below.
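To make the deterministic side concrete, here is a minimal sketch of a CVE lookup against the public OSV.dev query API. The endpoint and request shape are OSV.dev's documented API; the surrounding function is illustrative, not CodeClanker's actual implementation.

```ts
// Sketch: ask OSV.dev for known vulnerabilities in one npm dependency.
// Requires a fetch-capable runtime (Node 18+). Illustrative only.
interface OsvVuln {
  id: string; // e.g. "GHSA-..." or "CVE-..."
  summary?: string;
}

async function knownVulns(name: string, version: string): Promise<OsvVuln[]> {
  const res = await fetch("https://api.osv.dev/v1/query", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ version, package: { name, ecosystem: "npm" } }),
  });
  const data = (await res.json()) as { vulns?: OsvVuln[] };
  return data.vulns ?? []; // empty list means no known advisories
}

// knownVulns("axios", "1.10.0").then((v) => console.log(v.length, "advisories"));
```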
Code quality
Whether the codebase looks like something a senior engineer would inherit and ship from, or whether it looks like an AI generated a feature and stopped.
What we look at: presence of TypeScript and strict mode, ESLint and Prettier configuration, how much of the code is boilerplate or default scaffolding (CRA, Vite starter, Next.js example), file organization (clear separation of concerns vs. everything in src/App.js), import patterns, named exports vs. default sprawl, and whether the obvious shortcuts — any, // @ts-ignore, untyped responses — have been used everywhere.
Why it matters: code quality is the friction coefficient on every future change. Low code quality is fine for a one-week prototype. It is fatal for an app you are about to take to paying users, because the next ten features compound the mess. Customers do not see code quality directly, but they see the bugs and missed features that low code quality produces over time.
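As a sketch of how two of these signals can be checked mechanically (illustrative heuristics, not the real rubric):

```ts
import { readFileSync } from "node:fs";

// Assumes a comment-free tsconfig.json; real tsconfigs allow JSONC,
// so a production check would need a JSONC-tolerant parser.
function hasStrictMode(tsconfigPath: string): boolean {
  const config = JSON.parse(readFileSync(tsconfigPath, "utf8"));
  return config.compilerOptions?.strict === true;
}

// Counts the escape hatches that quietly turn TypeScript back into JavaScript.
function escapeHatchCount(source: string): number {
  const anyTypes = source.match(/:\s*any\b/g) ?? [];
  const tsIgnores = source.match(/\/\/\s*@ts-ignore/g) ?? [];
  return anyTypes.length + tsIgnores.length;
}
```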
QA & testing
Whether anyone — human, CI, or AI — has any way to know that a change did not break something.
What we look at: count of test files matching common patterns (*.test.ts, *.spec.js, tests/ directories), whether tests are real or just default scaffolding (a CRA app has App.test.js by default — that is not a real test), presence of a CI workflow that actually runs tests, integration vs. unit balance, and whether any end-to-end coverage exists (Playwright, Cypress, Selenium).
Why it matters: without tests, regressions ship silently. The first paying customer notices a bug your AI introduced three commits ago. Refactors become guessing exercises. Onboarding new contributors — including future-you returning to the code in three months — is dramatically slower. AI-built apps tend to score very low here because LLMs generate features faster than tests, and most founders never come back to add them.
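A rough sketch of the counting step, assuming Node 18.17+ for recursive directory listing. The filename patterns are a first pass only; the real scoring also checks whether the tests do anything.

```ts
import { readdirSync } from "node:fs";

// Matches *.test.ts / *.spec.jsx style files and test(s)/ directories.
const TEST_FILE = /\.(test|spec)\.[jt]sx?$|(^|\/)tests?\//;

function countTestFiles(repoRoot: string): number {
  const paths = readdirSync(repoRoot, { recursive: true }) as string[];
  return paths.filter(
    // Skip vendored code; a real scan would also discount default
    // scaffolding like CRA's App.test.js.
    (p) => !p.includes("node_modules") && TEST_FILE.test(p)
  ).length;
}
```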
Security
Whether a half-decent attacker can take your app or your customer data.
What we look at: known CVEs in declared dependencies (real check via OSV.dev), hardcoded secrets in committed files (regex scan for AWS keys, GitHub tokens, OpenAI keys, Anthropic keys, Slack tokens, Stripe keys, Google API keys, JWTs, private key blocks, database URLs with credentials), .env / .env.local / .env.production files committed to git (they should not exist), authentication patterns visible in the code, evidence of input validation, and rate limiting on public endpoints.
Why it matters: a single committed Stripe key takes you from "indie hacker" to "explaining to your bank why $40,000 of fraudulent charges hit your account." A vulnerable axios version is one lazy supply-chain attacker away from compromise. The deterministic checks here are real: when CodeClanker reports axios@1.10.0 with 17 CVEs, that is a real OSV.dev result, not a guess.
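The patterns below are simplified versions of the kinds of rules a secret scan uses. The real rules cover more providers and work harder to avoid false positives; these shapes are illustrative.

```ts
// Simplified, illustrative secret patterns — not CodeClanker's exact rules.
const SECRET_PATTERNS: Record<string, RegExp> = {
  awsAccessKeyId: /\bAKIA[0-9A-Z]{16}\b/,
  githubToken: /\bghp_[A-Za-z0-9]{36}\b/,
  stripeLiveSecretKey: /\bsk_live_[A-Za-z0-9]{24,}\b/,
  privateKeyBlock: /-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----/,
  dbUrlWithCredentials: /\b(?:postgres(?:ql)?|mysql|mongodb):\/\/[^\s:@]+:[^\s@]+@/,
};

// Returns the names of the pattern families that matched in a file.
function findSecretHits(fileText: string): string[] {
  return Object.entries(SECRET_PATTERNS)
    .filter(([, pattern]) => pattern.test(fileText))
    .map(([name]) => name);
}
```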
Maintainability
Whether someone other than the original author can understand the project well enough to ship a change without breaking everything.
What we look at: README depth (default CRA / Vite README = bad signal), license clarity (declared vs. unknown vs. copyleft), CHANGELOG, contributing guide, architecture decision records, code comments where they matter (the "why," not the "what"), dependency hygiene (lockfile present, dependencies pinned), and how brittle the project's setup looks (does npm install && npm run dev just work, or is there an unwritten "first-time setup" tribal knowledge step?).
Why it matters: if you ever need to bring in a contractor, hire your first engineer, or pick the project up six months later, every missing piece of documentation is a tax. AI-built MVPs typically have zero project-specific documentation because the LLM generated code, not docs.
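Two of the cheap signals above, sketched. The README heuristic and its exact trigger string are illustrative guesses, not the real rubric:

```ts
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Any of the three mainstream npm lockfiles counts.
function hasLockfile(repoRoot: string): boolean {
  return ["package-lock.json", "yarn.lock", "pnpm-lock.yaml"].some((f) =>
    existsSync(join(repoRoot, f))
  );
}

// Heuristic: the default CRA README heading is a bad signal; no README is worse.
function readmeLooksDefault(repoRoot: string): boolean {
  const path = join(repoRoot, "README.md");
  if (!existsSync(path)) return true;
  return readFileSync(path, "utf8").includes(
    "Getting Started with Create React App"
  );
}
```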
Observability
Whether you find out about production failures before your users tell you on Twitter.
What we look at: error tracking (Sentry, Bugsnag, Rollbar), structured logging (pino, winston, JSON logs), metrics endpoints (Prometheus, OpenTelemetry, custom /metrics routes), health checks beyond the framework default, request tracing, and any evidence that someone has thought about "how do we debug a 2 AM incident?"
Why it matters: without observability, you have no idea if your app is broken until a user emails you. With observability, you find out within seconds. The cost of an outage scales with how long it takes you to notice it. AI-built apps almost universally score under 20 here because no LLM has ever volunteered "here is the Sentry integration."
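For a sense of scale, the kind of integration this dimension rewards is often a handful of lines. A minimal sketch using the Sentry Node SDK, with SENTRY_DSN as a placeholder environment variable:

```ts
import * as Sentry from "@sentry/node";

// Minimal error tracking: uncaught errors now reach a dashboard
// instead of vanishing into a server log nobody reads.
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampleRate: 0.1, // trace 10% of requests for performance visibility
});
```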
Performance & scale
Whether the app holds up when a thousand users hit it at once, instead of nine.
What we look at: caching strategy (CDN config, in-memory caches, Redis), database query patterns (presence of N+1 patterns in code, indices in migrations), bundle size and code-splitting (lazy imports, dynamic chunks), image optimization, evidence of load testing, and any explicit performance budgets.
Why it matters: performance problems do not exist in a five-user demo. They exist the day a viral tweet sends 10,000 people to your app, or the day your dashboard query crosses 50 projects per user. AI-generated code optimizes for "looks correct," not "scales." That is fine until it isn't.
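Here is the N+1 shape next to its batched fix. The db.query helper is hypothetical, standing in for any parameterized query client:

```ts
// Hypothetical query helper, for illustration only.
interface Db {
  query(sql: string, params: unknown[]): Promise<unknown[]>;
}

// N+1: one round trip per project. Invisible with 5 rows, painful with 50,000.
async function ownersSlow(db: Db, projectIds: number[]) {
  const owners: unknown[] = [];
  for (const id of projectIds) {
    owners.push(await db.query("SELECT owner FROM projects WHERE id = $1", [id]));
  }
  return owners;
}

// Batched: one round trip regardless of list size (Postgres ANY syntax).
async function ownersFast(db: Db, projectIds: number[]) {
  return db.query("SELECT id, owner FROM projects WHERE id = ANY($1)", [projectIds]);
}
```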
Architecture & tech debt
Whether the project's structure can absorb the next ten features, or whether it is going to require a rewrite first.
What we look at: separation of concerns (UI vs. data layer vs. business logic vs. side effects), boundaries between services if there are multiple, database schema design and migration strategy, backup strategy, redundancy, choice of stack relative to what the project is doing (is this a static site dressed as a SaaS?), and whether the architecture choices match the team's ability to operate them.
Why it matters: architecture debt compounds. A one-week shortcut becomes a six-month rewrite. The dashboard query that "works fine for now" becomes the reason your second hire spends their first month fighting infrastructure. Catch it now while it is one file, not when it is the whole codebase.
DevOps & CI/CD
Whether deployments are automated and safe, or whether shipping a new feature still involves manually copying files.
What we look at: CI workflows (.github/workflows/, GitLab CI, CircleCI, Buildkite), automated testing on PRs, automated deploys, environment promotion strategy (staging → prod), secret management (Doppler, Vault, hosted env vars), Docker / container definitions, infrastructure-as-code (Terraform, Pulumi, CDK), rollback strategy, and whether anyone can ship without bricking it.
Why it matters: manual deploys are how you ship a broken main branch to production at 11 PM. CI gates prevent the change that breaks types from ever reaching users. AI-built MVPs frequently have a Dockerfile (because the LLM included one) but no CI workflow (because nobody asked).
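That last pattern is cheap to detect. A presence-check sketch, illustrative rather than the real detector:

```ts
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Does the repo have at least one GitHub Actions workflow file?
function hasGithubCi(repoRoot: string): boolean {
  const dir = join(repoRoot, ".github", "workflows");
  return (
    existsSync(dir) &&
    readdirSync(dir).some((f) => f.endsWith(".yml") || f.endsWith(".yaml"))
  );
}

// The classic AI-built-MVP signature: containerized, but nothing gates a deploy.
function dockerfileButNoCi(repoRoot: string): boolean {
  return existsSync(join(repoRoot, "Dockerfile")) && !hasGithubCi(repoRoot);
}
```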
Cost & infrastructure waste
Whether the project's hosting setup will quietly drain your runway, or whether someone has thought about cost.
What we look at: visible deployment configuration (Vercel, Netlify, Railway, Fly, Render — each has different cost profiles), evidence of resource limits, autoscaling configuration, idle resources (database replicas, queues, workers), CDN strategy, license implications for self-hosting (copyleft licenses, the AGPL in particular, can force you to open-source your derivative — that affects monetization), and obvious "spending money to look fancy" patterns (Kubernetes for an app two users will hit, Postgres replicas a startup will not need for a year).
Why it matters: cost per user is the metric every B2B SaaS lives or dies on. Idle infrastructure is silent runway burn. A free-tier-friendly setup keeps an indie hacker afloat; over-engineered infrastructure burns cash for no user benefit. The cheapest fix is the one you make before you have customers depending on it.
How the overall score is calculated
The overall score is the rounded mean of the nine dimensions. There is no weighting magic. If you want to know which dimension hurt your overall the most, look for the lowest individual score and assume that is the one to fix first.
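In code, the whole calculation is a few lines of arithmetic:

```ts
// Overall score: a plain rounded mean of the nine dimension scores.
function overallScore(dimensionScores: number[]): number {
  const sum = dimensionScores.reduce((total, s) => total + s, 0);
  return Math.round(sum / dimensionScores.length);
}

// One strong dimension barely moves a weak board:
overallScore([12, 18, 25, 10, 8, 22, 15, 30, 85]); // => 25
```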
What the score numbers actually mean
- 0–25: critical gap. The repo would break or get exploited within days under real traffic. This is where most vibe-coded MVPs land — it is a wake-up call, not an insult.
- 26–45: significant gap. Not production-grade for any paying customer.
- 46–65: shippable for a soft launch only, with named caveats.
- 66–80: production-grade with minor polish needed.
- 81–100: enterprise-ready. Justified only by multiple concrete observed signals — real test coverage, real CI, real observability stack, etc.
The point of the score is not to make you feel bad. It is to give you a checklist before reality does. Every dimension has a fix. The free scan tells you which fix matters most.
Want to see where your repo lands?
Free 60-second scan. Drop a public GitHub URL. No signup.
Run a free scan →