AI Feedback Loops
How engineering teams can use objectives, verifiers, evidence, and human decisions to make AI-generated code reviewable rather than merely plausible.

AI feedback loops matter because a generated patch is not the same thing as reviewed engineering work. A model can produce code that looks plausible, passes some checks, and still fails the objective the team actually cared about.
That gap is the subject of Part 2 of this series.
Part 1 focused on the assignment layer: the shared artifact that turns private intent into a visible task, spec, or ticket. That layer answers: what are we asking the system to change, what are the constraints, and how should success be judged?
The next layer is evidence. Once the work comes back, the team needs a feedback loop that can answer a different question: did this change prove the thing the assignment claimed it would prove?
Reviewable AI work is not just code with a human reviewer attached. It is work with a clear objective, a matching verifier, returned evidence, and a human decision about the remaining risk.
The short version is a loop: objective, verifier, evidence, decision, and assignment update. Without that loop, the reviewer is forced to infer too much from the diff. With it, the reviewer can judge a claim.
AI makes weak objectives more expensive
Engineering teams already use feedback loops. Unit tests, CI, staging checks, monitoring, canaries, feature flags, and A/B tests are not new.
The AI-specific problem is that agents can produce plausible work quickly against the wrong target.
A human who misunderstands a ticket usually exposes some of that misunderstanding during discussion, implementation, or review. An agent may instead generate a confident patch, add tests that fit its own interpretation, and return a result that looks complete at first glance. The output can be tidy while the objective is wrong.
Common failure modes include:
- The agent optimizes for the literal wording of a task while missing the product intent.
- The agent adds tests that confirm its implementation instead of protecting the desired behavior.
- The agent makes broad cleanup changes that are unrelated to the objective.
- The agent gets a green CI run even though CI does not check the important boundary.
- The agent fixes the visible symptom while leaving the production failure mode untouched.
This is why “the code looks reasonable” is not a strong enough review standard. The review has to start from the objective and work forward to the evidence.
A feedback loop is objective plus verifier
A feedback loop is only useful if it knows what it is checking.
“Improve checkout” is not a reviewable objective. It could mean lower latency, fewer failed payments, a clearer UI, better conversion, fewer support tickets, or less code complexity. Each goal needs a different verifier.
More useful objectives look like this:
- Add validation so malformed payloads return a clear 400 response.
- Refactor this parser without changing its behavior for the current fixture set.
- Keep /accounts/:id returning 200 for active and archived accounts while preserving 404 for missing accounts.
- Reduce p95 latency for this endpoint without increasing 5xx responses.
- Deploy a new onboarding variant and compare activation against the current flow.
The verifier should follow from the objective:
- Correctness claims need examples, unit tests, or property checks.
- Compatibility claims need contract tests or request-level checks.
- Repository health claims need CI, builds, and static analysis.
- Reliability claims need error rates, latency, and rollback criteria.
- Product outcome claims need production measurement, experiments, or metric comparisons.
The important discipline is choosing the loop before treating the work as done. Otherwise the team can end up with evidence for the wrong claim.
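The pairing of claim and verifier listed above can be written down as a small lookup. This is a minimal sketch in Python, not a prescribed tool: the names CLAIM_VERIFIERS and verifiers_for are hypothetical, and the entries simply restate the list above.

```python
# Hypothetical lookup: claim kind -> verifiers, restating the list in the text.
CLAIM_VERIFIERS = {
    "correctness": ["examples", "unit tests", "property checks"],
    "compatibility": ["contract tests", "request-level checks"],
    "repository health": ["CI", "builds", "static analysis"],
    "reliability": ["error rates", "latency", "rollback criteria"],
    "product outcome": ["production measurement", "experiments", "metric comparisons"],
}


def verifiers_for(claim_kind: str) -> list[str]:
    """Return the verifiers for a claim kind, or fail loudly if none was chosen."""
    key = claim_kind.strip().lower()
    if key not in CLAIM_VERIFIERS:
        # An unknown claim kind means the loop was never chosen for this work.
        raise ValueError(f"no verifier chosen for claim kind: {claim_kind!r}")
    return CLAIM_VERIFIERS[key]
```

Raising on an unknown claim kind is a deliberate choice in this sketch: it forces the question "what kind of claim is this?" before the work is treated as done.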
Local loops check narrow behavior
The smallest useful feedback loops run close to the code:
- unit tests
- type checks
- linting
- formatting checks
- snapshot tests
- fixture-based examples
- property tests where the input space matters
These loops are a good fit for parsing, validation, formatting, edge-case handling, small refactors, pure functions, and narrow bug fixes.
For AI-assisted work, local loops are most useful when the assignment includes concrete examples. Instead of asking an agent to “fix validation,” the assignment can say:
“Add validation so an empty email returns ‘email is required’, an invalid email returns ‘email is invalid’, and existing valid inputs keep passing. Add tests for those cases.”
Now the review is grounded. The reviewer can inspect the examples, the implementation, and the test result. If the agent invents a different error message, changes valid input behavior, or adds tests that do not cover the requested cases, the failure is visible.
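A local loop for that assignment could look roughly like the following Python sketch. The function name validate_email, the return convention (an error message or None), and the deliberately simple pattern are assumptions made for illustration; only the two error messages come from the assignment.

```python
import re
from typing import Optional

# Deliberately simple pattern for illustration; production email rules are messier.
_EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_email(email: str) -> Optional[str]:
    """Return an error message, or None when the email is acceptable."""
    candidate = email.strip()
    if not candidate:
        return "email is required"
    if not _EMAIL_PATTERN.match(candidate):
        return "email is invalid"
    return None


def test_requested_cases() -> None:
    # One assertion per case named in the assignment.
    assert validate_email("") == "email is required"
    assert validate_email("not-an-email") == "email is invalid"
    assert validate_email("user@example.com") is None
```

Each assertion maps to one sentence of the assignment, so a reviewer can check coverage by reading the test next to the ticket.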
Local loops are fast, but they are narrow. A function can pass its tests and still fail when wired into a route, job, migration, cache, or external dependency. Passing the local loop should mean “this narrow claim has evidence,” not “the whole change is safe.”
Repository loops check shared codebase health
Once a change affects the shared codebase, the feedback loop needs to move beyond the local file.
Repository-level loops include:
- full test suites
- builds
- CI workflows
- static analysis
- dependency checks
- migration checks
- packaging checks
- security scanners
These loops answer a repository-level question: does the codebase still satisfy the checks the team has agreed to run before accepting changes?
This matters for AI because agents can make wide edits. A patch that looks reasonable in one file may break another package, generated type, build step, fixture, deployment artifact, or dependency constraint. CI is the shared checkpoint that catches known risks across the repo.
But CI should not be treated as a universal quality stamp. It only checks what the workflow encodes. If the objective was “preserve the public API contract,” but CI never exercises that contract, a green build is not enough evidence. If the objective was “reduce memory usage,” but the pipeline does not measure memory, CI cannot answer the important question.
The practical move is to name the CI claim explicitly:
- Objective: preserve public API contract while changing account lookup behavior
- Verifier: unit tests plus endpoint tests in CI
- Evidence: CI passed, including the account contract suite
- Remaining risk: no production traffic validation yet
This keeps CI in the right role. It is evidence for a defined claim, not a substitute for thinking.
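One way to keep such a claim attached to a change is a small record with one field per part of the loop. The Python below is a sketch under assumptions: the class name CIClaim, its field names, and the missing_fields helper are hypothetical and not part of any CI system.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CIClaim:
    objective: str
    verifier: str
    evidence: str
    remaining_risk: str

    def missing_fields(self) -> list[str]:
        """Names of fields left blank, so an incomplete claim is easy to spot."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


claim = CIClaim(
    objective="preserve public API contract while changing account lookup behavior",
    verifier="unit tests plus endpoint tests in CI",
    evidence="CI passed, including the account contract suite",
    remaining_risk="no production traffic validation yet",
)
```

A reviewer, or a pre-merge script, could refuse a change whose claim has any blank field, which makes "green build, no stated objective" visible as a gap rather than a pass.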
Integration loops check boundaries
Many engineering changes fail at the boundary between components.
The code compiles. Unit tests pass. The service starts. Then a consumer sends a request the implementation did not expect, a queue payload changes shape, a database migration drops an assumption, or an external integration behaves differently from the mock.
Integration feedback loops are designed for those boundaries:
- API contract tests
- endpoint smoke tests
- consumer/provider tests
- database read/write checks
- job and queue processing checks
- migration dry runs
- staging environment checks
Consider a task like this:
“Change the account lookup endpoint to support archived users. Existing requests to /accounts/:id must
continue returning 200 for active accounts, archived accounts should return 200 with archived: true, and
missing accounts should keep returning 404.”
That assignment has three objective checks:
- Active accounts still return 200.
- Archived accounts return 200 with the expected flag.
- Missing accounts still return 404.
The verifier should operate at the request boundary, not only inside the function the agent edited. The returned evidence should show those request shapes were checked. If the agent only added unit tests for a helper function, the review should catch the mismatch: the work may have local evidence, but it does not yet prove the endpoint claim.
This is the difference between “tests were added” and “the right behavior was verified.”
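The three request-level checks can be sketched as follows. This Python example is illustrative only: get_account is a hypothetical in-memory stand-in for the handler behind /accounts/:id, and the account data is invented. In a real project the same checks would go through the service's HTTP test client or a staging endpoint.

```python
from typing import Optional

# Hypothetical data standing in for the real account store.
ACCOUNTS = {
    "a1": {"id": "a1", "archived": False},
    "a2": {"id": "a2", "archived": True},
}


def get_account(account_id: str) -> tuple[int, Optional[dict]]:
    """Stand-in for the handler behind /accounts/:id; returns (status, body)."""
    account = ACCOUNTS.get(account_id)
    if account is None:
        return 404, None
    return 200, dict(account)


def check_account_lookup_contract() -> dict[str, bool]:
    """Run the three request-level checks and report each result by name."""
    active_status, _ = get_account("a1")
    archived_status, archived_body = get_account("a2")
    missing_status, _ = get_account("missing")
    return {
        "active returns 200": active_status == 200,
        "archived returns 200 with archived: true": (
            archived_status == 200
            and archived_body is not None
            and archived_body["archived"] is True
        ),
        "missing returns 404": missing_status == 404,
    }
```

Returning named results rather than a single boolean lets the evidence show which of the three objective checks passed, which is what the reviewer needs to compare against the assignment.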
Production loops check real-world claims
Some objectives cannot be fully verified before production.
Performance goals, ranking changes, infrastructure behavior, onboarding flows, recommendation logic, and conversion improvements often depend on real traffic. In those cases, the feedback loop has to extend into production with guardrails.
Production feedback loops include:
- canary deployments
- feature flags
- A/B tests
- endpoint health checks
- synthetic monitoring
- error-rate monitoring
- latency monitoring
- business metric checks
- rollback criteria
This does not mean every AI-generated change should go straight to production. It means the team should be honest about where the objective can actually be verified.
If the objective is “reduce p95 latency for search results without increasing error rate,” unit tests are not the final feedback loop. They may still matter, but the decisive evidence comes from realistic load, production monitoring, or a controlled rollout.
If the objective is “the new onboarding variant improves activation,” code review cannot settle the question. The team needs an experiment design, a metric, a comparison window, and a decision rule.
Production loops need safety boundaries:
- Who sees the change first?
- What metric would trigger rollback?
- How long does the experiment need to run?
- Which error or latency budget must not be exceeded?
- What result is strong enough to keep, revert, or iterate?
Without those boundaries, “test it in production” is not a feedback loop. It is just exposure.
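Those boundaries can be made explicit as a decision rule that is written before the rollout starts. The Python sketch below is one possible shape; the Guardrails fields, the threshold values in the usage note, and the three outcome names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    max_error_rate: float       # e.g. 0.01 means 1% of requests may fail
    max_p95_latency_ms: float   # latency budget that must not be exceeded
    min_requests: int           # below this, the sample is too small to judge


def rollout_decision(requests: int, errors: int, p95_latency_ms: float,
                     guardrails: Guardrails) -> str:
    """Return 'wait', 'rollback', or 'continue' for one canary observation."""
    if requests < guardrails.min_requests:
        return "wait"
    error_rate = errors / requests
    if error_rate > guardrails.max_error_rate:
        return "rollback"
    if p95_latency_ms > guardrails.max_p95_latency_ms:
        return "rollback"
    return "continue"
```

For example, with Guardrails(max_error_rate=0.01, max_p95_latency_ms=300.0, min_requests=1000), an observation of 2000 requests with 40 errors yields "rollback". This sketch waits on small samples; a real rule might also roll back early when errors are severe, which is a choice the service owner should make.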
One example across the levels
Suppose the assignment is:
“Improve the account lookup endpoint so archived accounts are returned correctly, without breaking active account lookups or increasing endpoint latency.”
That is not one feedback loop. It is several claims:
- Correctness: archived accounts return the expected response.
- Compatibility: active and missing account behavior remains unchanged.
- Repository health: the codebase still builds and passes shared checks.
- Performance: endpoint latency does not regress.
- Production safety: real traffic does not show a new error pattern.
A reviewable plan might look like this:
- Local loop: unit tests, type checks, fixtures
- Repository loop: builds, CI, static analysis
- Integration loop: contract tests, smoke tests, migrations
- Production loop: canaries, monitoring, experiments
The exact tools do not matter as much as the alignment. Each verifier checks a specific claim. The reviewer can see what was proven, what was not proven, and where human judgment is still required.
That is what makes the work reviewable.
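The alignment between claims and verifiers can also be kept as data, so that unproven claims are easy to list. The Python below is a sketch: the ClaimCheck type, the PLAN entries, and the specific pairing of each claim with a level and verifier are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ClaimCheck:
    claim: str
    level: str          # local, repository, integration, or production
    verifier: str
    evidence: str = ""  # stays empty until the verifier has returned something


# Hypothetical plan for the account lookup assignment.
PLAN = [
    ClaimCheck("archived accounts return the expected response",
               "local", "unit tests and fixtures"),
    ClaimCheck("active and missing account behavior is unchanged",
               "integration", "contract tests"),
    ClaimCheck("codebase still builds and passes shared checks",
               "repository", "CI"),
    ClaimCheck("endpoint latency does not regress",
               "production", "latency monitoring during a canary"),
    ClaimCheck("real traffic shows no new error pattern",
               "production", "error-rate monitoring during a canary"),
]


def unproven(plan: list[ClaimCheck]) -> list[str]:
    """Claims that still have no returned evidence."""
    return [check.claim for check in plan if not check.evidence.strip()]
```

Before any verifier has run, unproven(PLAN) lists all five claims. As evidence comes back, the list shrinks, and whatever remains is the part that still needs human judgment.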
Reviewers should review the evidence layer
AI code review often starts in the wrong place. The reviewer opens the diff and tries to infer whether the work is safe.
The diff matters, but it should not be the whole review surface. A stronger AI code review workflow asks:
- What objective was this change trying to satisfy?
- Which verifier checked that objective?
- What evidence came back?
- Did the agent add evidence for the actual objective or for a narrower interpretation?
- What did the feedback loop not check?
- Is the remaining risk acceptable?
This changes the review from vibe-checking generated code to judging a claim.
The claim might be small: “This parser handles the new date format.” The evidence might be tests. The remaining risk might be low.
The claim might be larger: “This deployment strategy reduces rollout failures.” The evidence might include staging checks, canary metrics, and rollback behavior. The remaining risk may require a decision from someone who owns the service.
The useful question is not “Do I trust the AI?” It is “Does the evidence support the objective?”
The loop should update the assignment layer
Feedback loops are not only gates. They are learning mechanisms.
Sometimes execution reveals that the original objective was incomplete. A test fails because the task missed an edge case. An endpoint check reveals that one consumer depends on undocumented behavior. A canary shows that latency improved but error rate increased. An A/B test improves activation while hurting retention.
That information should update the shared assignment layer:
- Add the missed edge case to acceptance criteria.
- Promote the endpoint check into the contract suite.
- Add a rollback threshold to the rollout plan.
- Document the constraint that was previously implicit.
- Split the next iteration into a narrower objective.
For the archived account example, the feedback loop might teach the team that one internal consumer expects archived accounts to be hidden. The right response is not only to patch the code. It is to update the spec so that the expectation becomes an explicit, documented constraint.
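As an illustration only, a spec update of that kind could be recorded like this. The Python sketch assumes the spec is stored as a plain dictionary; the SPEC contents, the add_constraint helper, and the wording of the new constraint are hypothetical, and how the consumer is actually shielded from archived accounts is left unspecified.

```python
import copy

# Hypothetical spec for the account lookup assignment.
SPEC = {
    "objective": "Return archived accounts from /accounts/:id with archived: true",
    "constraints": [
        {"text": "Active accounts keep returning 200",
         "learned_from": "original assignment"},
        {"text": "Missing accounts keep returning 404",
         "learned_from": "original assignment"},
    ],
}


def add_constraint(spec: dict, text: str, learned_from: str) -> dict:
    """Return a copy of the spec with one more constraint and its origin."""
    updated = copy.deepcopy(spec)  # leave the earlier version untouched
    updated.setdefault("constraints", []).append(
        {"text": text, "learned_from": learned_from}
    )
    return updated


UPDATED_SPEC = add_constraint(
    SPEC,
    text="Archived accounts stay hidden for the internal consumer that expects them hidden",
    learned_from="feedback loop: internal consumer check",
)
```

Recording where each constraint came from keeps the lesson traceable: a later reader can see that the rule exists because a check found it, not because someone guessed.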
Now the next agent, reviewer, or human developer gets a better assignment than the previous one. The loop has improved the system’s memory.
That is the operating model: assignments define intent, feedback loops return evidence, and evidence updates the next assignment.
A compact checklist for reviewable AI work
Before execution:
- Define the objective in one sentence.
- Name the kind of claim: correctness, compatibility, performance, reliability, security, user behavior, or maintainability.
- Choose the verifier that can actually check that claim.
- State what evidence should come back.
- State what would make the change unacceptable.
During review:
- Compare the returned work against the objective.
- Check whether the verifier matched the claim.
- Look for evidence the agent optimized for a narrower or different objective.
- Identify what was not verified.
- Decide whether the remaining risk is acceptable.
After review:
- Add missing examples to tests or contracts.
- Update specs, runbooks, rollout rules, or monitoring thresholds.
- Narrow the next assignment if the first objective was too broad.
- Preserve what the loop taught the team.
This checklist is useful for human work too. AI just makes the need more obvious because agents can produce plausible output faster than teams can build trust in it.
Conclusion: evidence makes AI work reviewable
The future of AI-assisted engineering is not only better code generation. It is better loops around generated work.
Good assignments define the objective. Good feedback loops verify the objective at the right level: local tests for isolated behavior, CI for shared repository health, integration checks for boundaries, and production metrics for outcomes that only real traffic can validate.
When those loops are missing, AI work arrives as a diff that humans have to interpret from scratch. When the loops are clear, AI work arrives with evidence.
That is the difference between generated code and reviewable engineering work.
The next question is what should become routine. Once objectives and feedback loops are explicit, teams can decide which tasks should stay manual, which should become repeatable AI-assisted routines, and where human approval should remain the final gate.