Forkline Engineering

AI Feedback Loops

How engineering teams can use objectives, verifiers, evidence, and human decisions to turn AI-generated code into reviewable work instead of merely plausible output.

June 29, 2026 · 12 min read

Tags: ai-engineering, ai-coding-agents, code-review, ci-cd, engineering-workflow, ai-runners

Reviewable AI work needs evidence loops around generated code, not only a human looking at the diff.

AI feedback loops matter because a generated patch is not the same thing as reviewed engineering work. A model can produce code that looks plausible, passes some checks, and still fails the objective the team actually cared about.

That gap is the subject of Part 2 of this series.

Part 1 focused on the assignment layer: the shared artifact that turns private intent into a visible task, spec, or ticket. That layer answers: what are we asking the system to change, what are the constraints, and how should success be judged?

The next layer is evidence. Once the work comes back, the team needs a feedback loop that can answer a different question: did this change prove the thing the assignment claimed it would prove?

Reviewable AI work is not just code with a human reviewer attached. It is work with a clear objective, a matching verifier, returned evidence, and a human decision about the remaining risk.

The short version is a loop: objective, verifier, evidence, decision, and assignment update. Without that loop, the reviewer is forced to infer too much from the diff. With it, the reviewer can judge a claim.

FIG 01 Evidence loop (the claim moves clockwise through review)

01 Objective: The claim the change should prove
02 Verifier: The check matched to that claim
03 Evidence: The result returned to reviewers
04 Decision: Human judgment on remaining risk
05 Assignment update: What the loop taught the team

Reviewable AI work connects objective, verifier, evidence, decision, and the next assignment update.

AI makes weak objectives more expensive

Engineering teams already use feedback loops. Unit tests, CI, staging checks, monitoring, canaries, feature flags, and A/B tests are not new.

The AI-specific problem is that agents can produce plausible work quickly against the wrong target.

A human who misunderstands a ticket usually exposes some of that misunderstanding during discussion, implementation, or review. An agent may instead generate a confident patch, add tests that fit its own interpretation, and return a result that looks complete at first glance. The output can be tidy while the objective is wrong.

Common failure modes include:

The agent optimizes for the literal wording of a task while missing the product intent.
The agent adds tests that confirm its implementation instead of protecting the desired behavior.
The agent makes broad cleanup changes that are unrelated to the objective.
The agent gets a green CI run even though CI does not check the important boundary.
The agent fixes the visible symptom while leaving the production failure mode untouched.

This is why “the code looks reasonable” is not a strong enough review standard. The review has to start from the objective and work forward to the evidence.

A feedback loop is objective plus verifier

A feedback loop is only useful if it knows what it is checking.

“Improve checkout” is not a reviewable objective. It could mean lower latency, fewer failed payments, a clearer UI, better conversion, fewer support tickets, or less code complexity. Each goal needs a different verifier.

More useful objectives look like this:

Add validation so malformed payloads return a clear 400 response.
Refactor this parser without changing its behavior for the current fixture set.
Keep /accounts/:id returning 200 for active and archived accounts while preserving 404 for missing accounts.
Reduce p95 latency for this endpoint without increasing 5xx responses.
Deploy a new onboarding variant and compare activation against the current flow.

The verifier should follow from the objective:

Correctness claims need examples, unit tests, or property checks.
Compatibility claims need contract tests or request-level checks.
Repository health claims need CI, builds, and static analysis.
Reliability claims need error rates, latency, and rollback criteria.
Product outcome claims need production measurement, experiments, or metric comparisons.

The important discipline is choosing the loop before treating the work as done. Otherwise the team can end up with evidence for the wrong claim.
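As a minimal sketch, the pairing of claim kinds and verifiers listed above can be written down as data, so a mismatch is caught before the work is treated as done. The names `ClaimKind`, `VERIFIERS`, and `verifier_matches` are illustrative assumptions, not an established taxonomy or library.

```python
from enum import Enum


class ClaimKind(Enum):
    CORRECTNESS = "correctness"
    COMPATIBILITY = "compatibility"
    REPOSITORY_HEALTH = "repository health"
    RELIABILITY = "reliability"
    PRODUCT_OUTCOME = "product outcome"


# Mirrors the list in the article; the labels are free-form strings
# chosen for this sketch, not identifiers from any real tool.
VERIFIERS = {
    ClaimKind.CORRECTNESS: {"examples", "unit tests", "property checks"},
    ClaimKind.COMPATIBILITY: {"contract tests", "request-level checks"},
    ClaimKind.REPOSITORY_HEALTH: {"CI", "builds", "static analysis"},
    ClaimKind.RELIABILITY: {"error rates", "latency", "rollback criteria"},
    ClaimKind.PRODUCT_OUTCOME: {
        "production measurement",
        "experiments",
        "metric comparisons",
    },
}


def verifier_matches(kind: ClaimKind, verifier: str) -> bool:
    """Return True when the verifier is one the claim kind can rely on."""
    return verifier in VERIFIERS[kind]
```

With a table like this, a plan that offers unit tests as proof of a product outcome claim fails the match, which is the "evidence for the wrong claim" case.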

Local loops check narrow behavior

The smallest useful feedback loops run close to the code:

unit tests
type checks
linting
formatting checks
snapshot tests
fixture-based examples
property tests where the input space matters

These loops are a good fit for parsing, validation, formatting, edge-case handling, small refactors, pure functions, and narrow bug fixes.

For AI-assisted work, local loops are most useful when the assignment includes concrete examples. Instead of asking an agent to “fix validation,” the assignment can say:

“Add validation so an empty email returns ‘email is required’, an invalid email returns ‘email is invalid’, and existing valid inputs keep passing. Add tests for those cases.”

Now the review is grounded. The reviewer can inspect the examples, the implementation, and the test result. If the agent invents a different error message, changes valid input behavior, or adds tests that do not cover the requested cases, the failure is visible.
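One way the returned work for that assignment might look is sketched below. The function name `validate_email` and the deliberately loose pattern are assumptions made for illustration; only the two error messages and the "valid inputs keep passing" requirement come from the assignment.

```python
import re
from typing import Optional

# Deliberately loose pattern (an assumption for this sketch): one "@",
# no whitespace, and at least one dot in the domain part.
_EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_email(email: str) -> Optional[str]:
    """Return the error message for a bad email, or None if it is acceptable."""
    candidate = email.strip()
    if not candidate:
        return "email is required"
    if not _EMAIL_PATTERN.match(candidate):
        return "email is invalid"
    return None


# The cases named in the assignment, written as plain test functions
# that a runner such as pytest would collect.
def test_empty_email_is_required():
    assert validate_email("") == "email is required"


def test_malformed_email_is_invalid():
    assert validate_email("not-an-email") == "email is invalid"


def test_valid_email_still_passes():
    assert validate_email("dev@example.com") is None
```

Each test maps to one sentence of the assignment, so a reviewer can check coverage by reading the test names against the request.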

Local loops are fast, but they are narrow. A function can pass its tests and still fail when wired into a route, job, migration, cache, or external dependency. Passing the local loop should mean “this narrow claim has evidence,” not “the whole change is safe.”

Repository loops check shared codebase health

Once a change affects the shared codebase, the feedback loop needs to move beyond the local file.

Repository-level loops include:

full test suites
builds
CI workflows
static analysis
dependency checks
migration checks
packaging checks
security scanners

These loops answer a repository-level question: does the codebase still satisfy the checks the team has agreed to run before accepting changes?

This matters for AI because agents can make wide edits. A patch that looks reasonable in one file may break another package, generated type, build step, fixture, deployment artifact, or dependency constraint. CI is the shared checkpoint that catches known risks across the repo.

But CI should not be treated as a universal quality stamp. It only checks what the workflow encodes. If the objective was “preserve the public API contract,” but CI never exercises that contract, a green build is not enough evidence. If the objective was “reduce memory usage,” but the pipeline does not measure memory, CI cannot answer the important question.

The practical move is to name the CI claim explicitly:

FIG 02 CI claim: Repository loop, public API contract check

Objective (constraint): preserve public API contract while changing account lookup behavior
Verifier (selected): unit tests plus endpoint tests in CI
Evidence (passed): CI passed, including the account contract suite
Known gap (open): no production traffic validation yet
Decision: review the diff with CI evidence, but do not claim production validation

CI is useful evidence when the claim is explicit: what contract must stay true, which checks prove it, and which risk remains outside the loop.

This keeps CI in the right role. It is evidence for a defined claim, not a substitute for thinking.
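A named CI claim can also travel with the change as a small record. The following is a sketch under assumptions: `CiClaim` and `review_note` are hypothetical names, and the field values are taken from the account lookup example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CiClaim:
    """One explicit claim that a CI run is being used as evidence for."""

    objective: str
    verifiers: List[str]
    evidence: List[str]
    known_gaps: List[str] = field(default_factory=list)

    def review_note(self) -> str:
        """Render the claim so the gap is as visible as the green check."""
        lines = [
            f"Objective: {self.objective}",
            f"Verifier: {', '.join(self.verifiers)}",
            f"Evidence: {', '.join(self.evidence)}",
            f"Known gap: {', '.join(self.known_gaps) or 'none recorded'}",
        ]
        return "\n".join(lines)


claim = CiClaim(
    objective="preserve public API contract while changing account lookup behavior",
    verifiers=["unit tests", "endpoint tests in CI"],
    evidence=["CI passed, including the account contract suite"],
    known_gaps=["no production traffic validation yet"],
)
print(claim.review_note())
```

Keeping the known gap in the same record as the passing evidence makes it harder for a green build to be read as more than it proves.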

Integration loops check boundaries

Many engineering changes fail at the boundary between components.

The code compiles. Unit tests pass. The service starts. Then a consumer sends a request the implementation did not expect, a queue payload changes shape, a database migration drops an assumption, or an external integration behaves differently from the mock.

Integration feedback loops are designed for those boundaries:

API contract tests
endpoint smoke tests
consumer/provider tests
database read/write checks
job and queue processing checks
migration dry runs
staging environment checks

Consider a task like this:

“Change the account lookup endpoint to support archived users. Existing requests to /accounts/:id must continue returning 200 for active accounts, archived accounts should return 200 with archived: true, and missing accounts should keep returning 404.”

That assignment has three objective checks:

Active accounts still return 200.
Archived accounts return 200 with the expected flag.
Missing accounts still return 404.

The verifier should operate at the request boundary, not only inside the function the agent edited. The returned evidence should show those request shapes were checked. If the agent only added unit tests for a helper function, the review should catch the mismatch: the work may have local evidence, but it does not yet prove the endpoint claim.

This is the difference between “tests were added” and “the right behavior was verified.”
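As a rough sketch of a verifier that works at the request boundary, the three checks can be run against whatever sends requests to the endpoint. Everything here is hypothetical: `handle_request` is a toy in-memory stand-in for the real service, and `ACCOUNTS` and `check_account_contract` are names invented for the example. In practice `send` would wrap an HTTP client pointed at a test or staging deployment.

```python
from typing import Callable, Dict, List, Tuple

Response = Tuple[int, Dict[str, object]]

# Hypothetical in-memory data standing in for the real account store.
ACCOUNTS: Dict[str, Dict[str, object]] = {
    "active-1": {"id": "active-1", "archived": False},
    "archived-1": {"id": "archived-1", "archived": True},
}


def handle_request(method: str, path: str) -> Response:
    """Toy stand-in for the GET /accounts/:id endpoint."""
    parts = path.strip("/").split("/")
    if method != "GET" or len(parts) != 2 or parts[0] != "accounts":
        return 404, {"error": "not found"}
    account = ACCOUNTS.get(parts[1])
    if account is None:
        return 404, {"error": "account not found"}
    return 200, dict(account)


def check_account_contract(send: Callable[[str, str], Response]) -> List[str]:
    """Exercise the three request shapes and return any failures."""
    failures = []
    status, _ = send("GET", "/accounts/active-1")
    if status != 200:
        failures.append(f"active account returned {status}, expected 200")
    status, body = send("GET", "/accounts/archived-1")
    if status != 200 or body.get("archived") is not True:
        failures.append("archived account did not return 200 with archived: true")
    status, _ = send("GET", "/accounts/does-not-exist")
    if status != 404:
        failures.append(f"missing account returned {status}, expected 404")
    return failures
```

Because the check takes the sender as a parameter, the same three request shapes can be reused against a local handler, a staging URL, or a canary, which keeps the evidence tied to the endpoint claim rather than to a helper function.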

Production loops check real-world claims

Some objectives cannot be fully verified before production.

Performance goals, ranking changes, infrastructure behavior, onboarding flows, recommendation logic, and conversion improvements often depend on real traffic. In those cases, the feedback loop has to extend into production with guardrails.

Production feedback loops include:

canary deployments
feature flags
A/B tests
endpoint health checks
synthetic monitoring
error-rate monitoring
latency monitoring
business metric checks
rollback criteria

This does not mean every AI-generated change should go straight to production. It means the team should be honest about where the objective can actually be verified.

If the objective is “reduce p95 latency for search results without increasing error rate,” unit tests are not the final feedback loop. They may still matter, but the decisive evidence comes from realistic load, production monitoring, or a controlled rollout.

If the objective is “the new onboarding variant improves activation,” code review cannot settle the question. The team needs an experiment design, a metric, a comparison window, and a decision rule.

Production loops need safety boundaries:

Who sees the change first?
What metric would trigger rollback?
How long does the experiment need to run?
Which error or latency budget must not be exceeded?
What result is strong enough to keep, revert, or iterate?

Without those boundaries, “test it in production” is not a feedback loop. It is just exposure.
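Those boundaries can be made concrete as a decision rule that is written down before the rollout starts. The sketch below is an assumption-heavy illustration: `RolloutGuardrails`, `CanaryWindow`, and `decide` are hypothetical names, the thresholds are placeholders, and a real rollout would usually also compare against a baseline and allow an early rollback on a severe error spike.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutGuardrails:
    max_error_rate: float      # highest acceptable fraction of failed requests
    max_p95_latency_ms: float  # latency budget that must not be exceeded
    min_requests: int          # traffic needed before any decision is made


@dataclass(frozen=True)
class CanaryWindow:
    requests: int
    errors: int
    p95_latency_ms: float


def decide(window: CanaryWindow, guardrails: RolloutGuardrails) -> str:
    """Return "hold", "rollback", or "expand" for one observation window."""
    if window.requests == 0 or window.requests < guardrails.min_requests:
        return "hold"  # not enough traffic to judge either way
    error_rate = window.errors / window.requests
    if error_rate > guardrails.max_error_rate:
        return "rollback"
    if window.p95_latency_ms > guardrails.max_p95_latency_ms:
        return "rollback"
    return "expand"


# Placeholder thresholds; a team would set these from its own budgets.
guardrails = RolloutGuardrails(
    max_error_rate=0.01, max_p95_latency_ms=300.0, min_requests=1000
)
print(decide(CanaryWindow(requests=2000, errors=10, p95_latency_ms=250.0), guardrails))
```

The value of writing the rule first is that "keep, revert, or iterate" stops being a judgment made after seeing the numbers.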

One example across the levels

Suppose the assignment is:

“Improve the account lookup endpoint so archived accounts are returned correctly, without breaking active account lookups or increasing endpoint latency.”

That is not one feedback loop. It is several claims:

Correctness: archived accounts return the expected response.
Compatibility: active and missing account behavior remains unchanged.
Repository health: the codebase still builds and passes shared checks.
Performance: endpoint latency does not regress.
Production safety: real traffic does not show a new error pattern.

FIG 02 Verifier ladder

Local (narrow behavior): unit tests, type checks, fixtures
Repository (shared codebase health): builds, CI, static analysis
Integration (boundary behavior): contract tests, smoke tests, migrations
Production (real-world outcome): canaries, monitoring, experiments

Each feedback loop answers a different claim. A green local test is useful evidence, but not proof of production behavior.

A reviewable plan might look like this:

FIG 03 Review artifact: Archived account lookup plan

Objective
Support archived account lookup: Return archived accounts correctly without breaking active and missing account behavior.

Verifiers
Local verifier (narrow behavior): Unit tests for active, archived, and missing cases.
Integration verifier (request boundary): Endpoint smoke tests for the same request shapes.
Repository verifier (codebase health): CI build and full test suite.
Production verifier (real-world safety): Canary with 5xx and latency monitoring.

Decision and memory
Decision rule (merge, then expand): Merge after tests pass. Expand rollout only if canary error rate and latency stay within threshold.
Assignment update (preserve the lesson): Document archived account response behavior and add endpoint cases to the contract suite.

A reviewable plan names the objective, matches each verifier to a claim, and keeps the final decision explicit.

The exact tools do not matter as much as the alignment. Each verifier checks a specific claim. The reviewer can see what was proven, what was not proven, and where human judgment is still required.

That is what makes the work reviewable.
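To make "what was proven and what was not" something a reviewer can read at a glance, the plan can be held as a list of claims with their verifier and returned evidence. This is a sketch, assuming hypothetical names (`Check`, `unproven`) and made-up evidence strings for the account lookup example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Check:
    claim: str
    verifier: str
    evidence: Optional[str] = None  # None means nothing has come back yet


def unproven(plan: List[Check]) -> List[str]:
    """List the claims that still have no returned evidence."""
    return [check.claim for check in plan if check.evidence is None]


# Evidence strings are invented for illustration.
plan = [
    Check("correctness", "unit tests", "archived case passes"),
    Check("compatibility", "endpoint smoke tests", "active and missing unchanged"),
    Check("repository health", "CI build and full test suite", "CI green"),
    Check("performance", "canary latency monitoring"),
    Check("production safety", "canary 5xx monitoring"),
]

print(unproven(plan))
```

Here the last two claims come back as unproven, which matches a change that has been merged but not yet rolled out: the remaining risk is named instead of implied.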

Reviewers should review the evidence layer

AI code review often starts in the wrong place. The reviewer opens the diff and tries to infer whether the work is safe.

The diff matters, but it should not be the whole review surface. A stronger AI code review workflow asks:

What objective was this change trying to satisfy?
Which verifier checked that objective?
What evidence came back?
Did the agent add evidence for the actual objective or for a narrower interpretation?
What did the feedback loop not check?
Is the remaining risk acceptable?

This changes the review from vibe-checking generated code to judging a claim.

The claim might be small: “This parser handles the new date format.” The evidence might be tests. The remaining risk might be low.

The claim might be larger: “This deployment strategy reduces rollout failures.” The evidence might include staging checks, canary metrics, and rollback behavior. The remaining risk may require a decision from someone who owns the service.

The useful question is not “Do I trust the AI?” It is “Does the evidence support the objective?”

The loop should update the assignment layer

Feedback loops are not only gates. They are learning mechanisms.

Sometimes execution reveals that the original objective was incomplete. A test fails because the task missed an edge case. An endpoint check reveals that one consumer depends on undocumented behavior. A canary shows that latency improved but error rate increased. An A/B test improves activation while hurting retention.

That information should update the shared assignment layer:

Add the missed edge case to acceptance criteria.
Promote the endpoint check into the contract suite.
Add a rollback threshold to the rollout plan.
Document the constraint that was previously implicit.
Split the next iteration into a narrower objective.

For the archived account example, the feedback loop might teach the team that one internal consumer expects archived accounts to be hidden. The right response is not only to patch the code. It is to update the spec:

FIG 04 Spec update: Archived account lookup

Requirement: Public lookup returns archived accounts with archived: true.
Constraint: Internal billing lookup keeps excluding archived accounts by default.
Validation: Contract tests cover both public and billing lookup behavior.

The feedback loop updates the assignment with the behavior, the preserved constraint, and the check that protects both paths.

Now the next agent, reviewer, or human developer gets a better assignment than the previous one. The loop has improved the system’s memory.

That is the operating model: assignments define intent, feedback loops return evidence, and evidence updates the next assignment.
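A small sketch of that last step, where a finding from the loop becomes part of the next assignment, might look like the following. `Assignment`, `Finding`, and `apply_finding` are hypothetical names; the strings come from the archived account example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Assignment:
    objective: str
    requirements: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    validations: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Finding:
    """Something the feedback loop revealed that the assignment did not state."""

    constraint: str
    validation: str


def apply_finding(assignment: Assignment, finding: Finding) -> Assignment:
    """Return a new assignment that records the constraint and its check."""
    return Assignment(
        objective=assignment.objective,
        requirements=list(assignment.requirements),
        constraints=assignment.constraints + [finding.constraint],
        validations=assignment.validations + [finding.validation],
    )


original = Assignment(
    objective="Support archived account lookup",
    requirements=["Public lookup returns archived accounts with archived: true."],
)
updated = apply_finding(
    original,
    Finding(
        constraint="Internal billing lookup keeps excluding archived accounts by default.",
        validation="Contract tests cover both public and billing lookup behavior.",
    ),
)
```

Returning a new record instead of editing the old one keeps the earlier assignment available for comparison, so the team can see what the loop added.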

A compact checklist for reviewable AI work

Before execution:

Define the objective in one sentence.
Name the kind of claim: correctness, compatibility, performance, reliability, security, user behavior, or maintainability.
Choose the verifier that can actually check that claim.
State what evidence should come back.
State what would make the change unacceptable.

During review:

Compare the returned work against the objective.
Check whether the verifier matched the claim.
Look for evidence the agent optimized for a narrower or different objective.
Identify what was not verified.
Decide whether the remaining risk is acceptable.

After review:

Add missing examples to tests or contracts.
Update specs, runbooks, rollout rules, or monitoring thresholds.
Narrow the next assignment if the first objective was too broad.
Preserve what the loop taught the team.

This checklist is useful for human work too. AI just makes the need more obvious because agents can produce plausible output faster than teams can build trust in it.

Conclusion: evidence makes AI work reviewable

The future of AI-assisted engineering is not only better code generation. It is better loops around generated work.

Good assignments define the objective. Good feedback loops verify the objective at the right level: local tests for isolated behavior, CI for shared repository health, integration checks for boundaries, and production metrics for outcomes that only real traffic can validate.

When those loops are missing, AI work arrives as a diff that humans have to interpret from scratch. When the loops are clear, AI work arrives with evidence.

That is the difference between generated code and reviewable engineering work.

The next question is what should become routine. Once objectives and feedback loops are explicit, teams can decide which tasks should stay manual, which should become repeatable AI-assisted routines, and where human approval should remain the final gate.

