How We Vibe-Coded an AI Code Review Tool (and Where We Stopped)

The problem

We’re a small team. Five people building an AI platform. At any given time, we have 14 active repos. That means a lot of PRs and not enough humans to review them well.

Reviews were taking too long. Important stuff was getting skimmed. We were shipping code faster than we could verify it.

So we built our own code review bot. It started as a side project to fix an internal pain point. Now it’s a product. Surmado Code Review. $15/month for 100 PRs.

Here’s how we actually built it. What was vibe-coded, what wasn’t, and the framework we’d recommend to anyone prototyping an AI pipeline.

Phase 1: Vibe-coded inception

The first version was a summary agent. That’s it. Open a PR, get a plain-language summary of what changed. We vibe-coded the whole thing. Fast, scrappy, Claude doing the heavy lifting.

Then we started asking: can it catch bugs too?

That’s where things got interesting.

Phase 2: The forward-backward pass

This is the part we think is actually worth sharing. It’s a framework you can use on any AI pipeline. Not just code review.

We borrowed a concept from ML training: the forward-backward pass. But instead of updating a model’s weights, we were tuning an entire pipeline as though it were the model.

Here’s how it works.

Forward pass. You define a set of controlled inputs. In our case, PRs with known bugs we deliberately planted. You run them through your pipeline and record what it catches and what it misses.

Backward pass. You look at the misses. You tweak the pipeline. The prompts, the structure, the review logic. Then you rerun the experiment. Did it catch the bug this time? Did it introduce false positives?

Repeat. Each cycle is cheap and fast. You’re not retraining a model. You’re iterating on code, prompts, and architecture with immediate signal on whether it worked.
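Here’s a minimal sketch of one cycle in Python. It’s illustrative, not the production harness: the `review` function is a keyword-matching stand-in for a real pipeline call, and the case names, bug labels, and rule format are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    name: str
    diff: str
    planted: frozenset  # bug labels deliberately planted in this PR


def review(diff: str, rules: dict) -> set:
    """Stand-in for the real pipeline: flag a label when its marker appears in the diff."""
    return {label for label, marker in rules.items() if marker in diff}


def forward_pass(cases, rules):
    """Run every controlled input and record catches, misses, and false positives."""
    report = {}
    for case in cases:
        flagged = review(case.diff, rules)
        report[case.name] = {
            "caught": flagged & case.planted,
            "missed": case.planted - flagged,
            "false_positives": flagged - case.planted,
        }
    return report


# Hypothetical controlled inputs: two PRs with planted bugs, one clean PR.
CASES = [
    Case("pr-sql", 'cur.execute("SELECT * FROM t WHERE id=" + uid)', frozenset({"sql-injection"})),
    Case("pr-secret", 'API_KEY = "sk-test-123"', frozenset({"hardcoded-secret"})),
    Case("pr-clean", "total = sum(prices)", frozenset()),
]

# Cycle 1: the pipeline only knows one rule, so the planted secret is missed.
rules_v1 = {"sql-injection": "execute("}
first = forward_pass(CASES, rules_v1)

# Backward pass: look at the misses, tweak the pipeline, rerun.
rules_v2 = {**rules_v1, "hardcoded-secret": "API_KEY ="}
second = forward_pass(CASES, rules_v2)
```

The clean PR matters as much as the buggy ones. It’s the case that tells you whether a tweak bought its catch with a false positive.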

The key insight: you’re turning Claude Code into a synthetic user. You give it an expected input and an expected output. You let it write additional test inputs and evaluate the results against your pipeline. That’s your test harness.

Test harnesses are easy to vibe-code. As long as you have an active API to call (something deployed, containerized, or even just running locally), you can set this up in an afternoon. You define the types of bugs you want your pipeline to detect, generate test cases, run them, and iterate.

This was our v1 and v2. We’d have 10 different theories about how to structure the review logic. Instead of debating them, we’d test all 10 through the harness and let the results decide.
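Letting the results decide can be as simple as scoring each variant on the same cases and sorting. The sketch below assumes a ranking rule we picked for illustration (most planted bugs caught, then fewest false positives); the three variants and the cases are hypothetical stand-ins for competing review-logic theories.

```python
def score(variant, cases):
    """Return (bugs caught, false positives) for one variant over all cases."""
    caught = false_positives = 0
    for diff, planted in cases:
        flagged = variant(diff)
        caught += len(flagged & planted)
        false_positives += len(flagged - planted)
    return caught, false_positives


def rank(variants, cases):
    """Order variant names: most bugs caught first, fewest false positives as tie-break."""
    results = {name: score(fn, cases) for name, fn in variants.items()}
    order = sorted(results, key=lambda n: (-results[n][0], results[n][1]))
    return order, results


# Hypothetical cases: (diff text, set of planted bug labels).
CASES = [
    ("items[len(items)]", {"off-by-one"}),
    ("if user = None:", {"assignment-in-condition"}),
    ("return items[0]", set()),
]

# Hypothetical variants, each a function from diff text to flagged labels.
VARIANTS = {
    "flag-everything": lambda d: {"off-by-one", "assignment-in-condition"},
    "index-only": lambda d: {"off-by-one"} if "[len(" in d else set(),
    "both-checks": lambda d: ({"off-by-one"} if "[len(" in d else set())
    | ({"assignment-in-condition"} if " = None:" in d else set()),
}

order, results = rank(VARIANTS, CASES)
```

Here "flag-everything" catches as many bugs as "both-checks" but loses on false positives, which is the kind of trade-off a debate tends to miss and a harness makes obvious.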

Phase 3: Where vibe coding hit the ceiling

Through all that experimentation, we found the real unlock: standards grounding.

The review agent got dramatically better when it had a well-written standards document to reference. Our team’s actual coding standards, conventions, and architectural decisions.

If you’re vibe coding, you probably already have a CLAUDE.md or .cursorrules. That’s a writing contract. It tells the agent what to produce. But most teams don’t have a review contract. There’s no equivalent file telling anyone, human or machine, what’s acceptable to merge. That asymmetry is where drift comes from.

We wrote a STANDARDS.MD for each repo. Once that file existed, the review agent had something real to check against. Not generic best practices. Our specific opinions about how we build.
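To make the idea concrete, here is a rough sketch of grounding a review prompt in that file. The prompt wording, the function names, and the choice to refuse a review when no standards exist are assumptions for illustration, not a description of how the product is wired.

```python
from pathlib import Path


def load_standards(repo_root: str, filename: str = "STANDARDS.MD") -> str:
    """Read the repo's standards file, or return an empty string if it is absent."""
    path = Path(repo_root) / filename
    return path.read_text(encoding="utf-8") if path.exists() else ""


def build_review_prompt(standards: str, diff: str) -> str:
    """Put the team's standards ahead of the diff so the reviewer checks against them."""
    if not standards.strip():
        # Assumed policy for this sketch: no review contract, no review.
        raise ValueError("no standards document found for this repo")
    return (
        "Review the diff against the team standards below. "
        "Flag only violations of these standards and cite the rule you relied on.\n\n"
        "=== STANDARDS ===\n"
        f"{standards.strip()}\n\n"
        "=== DIFF ===\n"
        f"{diff.strip()}\n"
    )


# Usage with inline text, so the example runs without a real repo.
prompt = build_review_prompt(
    standards="1. No bare except clauses.\n2. Public functions need type hints.",
    diff="+ try:\n+     run()\n+ except:\n+     pass",
)
```

Asking the reviewer to cite the rule it relied on is one way to keep findings tied to your opinions instead of generic advice.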

But here’s the thing. The standards document itself couldn’t be vibe-coded. The way it was formatted, the way it was deployed into the pipeline, the specific language and structure. All of that required careful human judgment. We wrote it, revised it, tested how the agent interpreted it, and revised it again. That loop was human-driven.

Productization was entirely hand-built. Every piece of it.

Authentication and payments. Wired up deliberately. We care about protecting customer money.

Failure states and refunds. Designed by hand. When something goes wrong, we need to know exactly what happens and why.

Privacy. Non-negotiable. Customer code goes through our pipeline. That demands deliberate architecture, not generated code.

Integration with Scout. Scout is our AI research analyst. The wiring between Surmado Code Review and Scout was careful, tested, and intentional.

We also spent significant effort on the mix of models in the pipeline: being deliberate about which model handles what, and balancing cost against quality. That kind of decision-making doesn’t lend itself to vibe coding. It requires understanding your margins, your customers’ expectations, and the failure modes of each model tier.
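The margin side of that decision is plain arithmetic. At $15/month for 100 PRs, each PR brings in $0.15, and every model call comes out of that. In the sketch below, only the $0.15 comes from our pricing; the tier names, the per-call costs, and the step-to-tier routing are invented placeholders.

```python
PRICE_PER_PR = 15 / 100  # $15/month for 100 PRs

# Hypothetical tiers and per-call costs in dollars; placeholders, not real prices.
TIER_COST = {"small": 0.005, "large": 0.06}

# Hypothetical routing of pipeline steps to model tiers.
ROUTES = {"summarize": "small", "triage": "small", "bug_review": "large"}


def cost_per_pr(routes, tier_cost):
    """Sum the model cost of every step that runs on a single PR."""
    return sum(tier_cost[tier] for tier in routes.values())


def margin_per_pr(routes, tier_cost, price=PRICE_PER_PR):
    """Revenue per PR minus model cost per PR."""
    return price - cost_per_pr(routes, tier_cost)


# Compare the mixed routing with sending every step to the large tier.
all_large = {step: "large" for step in ROUTES}
mixed_margin = margin_per_pr(ROUTES, TIER_COST)
all_large_margin = margin_per_pr(all_large, TIER_COST)
```

With these made-up numbers, the mixed routing leaves a positive margin and the all-large routing loses money on every PR. The real numbers are different, but the shape of the check is the same.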

None of that was delegated to the machine.

Phase 4: Human review and polish

After the vibe-coded prototype and the agentic testing loop got us to a strong foundation, the rest was a lot of human review. Reading outputs. Comparing against what a senior engineer would actually flag. Tuning the standards document. Killing false positives.

We dogfooded it across all 14 of our internal repos. By the time we launched v7, it had been through thousands of real PRs on real code that we cared about.

The result: a 3x improvement in time-to-merge across those repos. PRs stopped sitting. The senior engineers stopped catching the same five things in every review. The junior devs stopped guessing at conventions. The agent-authored PRs stopped drifting into shapes nobody on the team would have written.

My dev team made me release it. They were tired of telling founder friends that no, you can’t have it, it’s internal. So we productized it.

The framework (steal this)

If you’re building any AI pipeline, code review or otherwise, here’s the pattern.

1. Vibe-code your prototype. Get the basic flow working. Don’t overthink the architecture. Let AI do the scaffolding.

2. Build an agentic test harness. Define your expected inputs and outputs. Let Claude Code act as a synthetic user. This is the forward-backward pass. It’s fast to set up if you have an API to call, and it lets you test 10 ideas in the time it would take to manually evaluate one.

3. Let the data pick the winners. Don’t debate which approach is better. Run them all through the harness. The signal is immediate.

4. Find the grounding mechanism. For us, it was a standards document. For you, it might be a style guide, a rubric, a set of rules, or a reference dataset. The thing that makes your pipeline consistent and opinionated. That’s your real differentiator, and it almost certainly requires human craft.

5. Hand-build what protects your customers. Auth, payments, privacy, failure states. These are not vibe-coding territory. These are the things that earn trust. Do them deliberately.

The honest takeaway

Vibe coding is a phenomenal prototyping and experimentation framework. The agentic testing loop (the forward-backward pass pattern) let us move faster in early development than we ever could have otherwise. We went from “what if we built a code review bot” to a working prototype catching real bugs in a matter of days.

But the things that make a product trustworthy required humans making deliberate decisions with full context. The standards grounding. The model selection. The payment flows. The privacy architecture.

Vibe-code your prototyping and experimentation layers. Hand-craft what protects your customers. That’s the line. We think it’s the right one.

Try it on your own repos

Surmado Code Review. $15/month for 100 PRs. Built by a five-person team. Dogfooded across 14 repos.

See how Surmado Code Review works or start a review on your next PR.

