
How to Add a Human-in-the-Loop Step to Any AI Agent Without Breaking Latency

The architecture for inserting human approval checkpoints into AI agents — async queuing, state persistence, resume endpoints, and the patterns that keep P95 latency acceptable when a human is in the critical path.

AI Engineering · 6 min read · 2026-04-27

Every production AI agent eventually needs a human in the loop. The legal team needs to approve the contract before it goes to the client. The finance team needs to sign off on the refund before the agent processes it. The manager needs to review the AI-drafted performance review before it lands in the employee's inbox.

The problem is that humans do not respond in milliseconds. If your agent is synchronous — waiting for a human response before continuing — a single approval that takes 45 minutes causes a 45-minute timeout in your application. This article covers the architecture that avoids that: how to pause an agent, persist its state, wait for a human decision asynchronously, and resume exactly where it stopped.

The naive approach and why it fails

The most common mistake is inserting a blocking HTTP call or a polling loop into the agent's execution path. The agent reaches the approval step, sends a Slack message to the approver, and then polls a database row every 30 seconds waiting for an approval flag to be set.
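That polling loop looks roughly like this. It is a hypothetical sketch of the anti-pattern, not code to copy: the `checkFlag` callback stands in for the database read, and the Slack notification is elided.

```typescript
// ANTI-PATTERN: a blocking poll inside the agent's execution path.
// The whole wait happens inside one server thread / function invocation.
async function waitForApprovalBlocking(
  approvalId: string,
  checkFlag: (id: string) => Promise<boolean>, // reads the approval flag row
  timeoutMs = 30 * 60_000,                     // the workflow fails after this
  intervalMs = 30_000,                         // poll every 30 seconds
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await checkFlag(approvalId)) return true; // approved
    // This sleep is the problem: the invocation stays open for the entire wait
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // timed out: the approval request is now orphaned
}
```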

This works in a demo. In production it creates three problems. First, you are holding a server thread (or a serverless function invocation) open during the entire wait period — at 30-minute average approval time, that is expensive and fragile. Second, if the polling function crashes or the server restarts, the state is lost and the approval request is orphaned. Third, the experience for the approver is bad: they have to act within whatever timeout your polling function sets or the workflow fails.

The async queue pattern: decouple agent execution from human time

The correct model is to treat the human approval as an external event that resumes a paused workflow, not as a wait loop inside a running workflow.

Here is the architecture:

  1. Agent reaches checkpoint. The agent has completed some work and needs approval before continuing. At this point, it serializes its current state — the inputs, the intermediate outputs, the next step it would take if approved — and writes it to a durable store (Postgres, Redis with persistence, or DynamoDB). It assigns a unique approval_id to this state snapshot.

  2. Notification is sent asynchronously. The agent posts a message to Slack (or email, or your internal admin UI) containing the approval context and a link to an approval endpoint: https://yourapp.com/api/agent/approve?id=<approval_id>&action=approve. The agent's execution function then returns immediately with a status of pending_approval. The caller (your API endpoint or webhook handler) returns a 200 response.

  3. The approver acts when ready. When the approver clicks Approve or Reject, they hit your resume endpoint. The endpoint loads the persisted state by approval_id, validates the approver's identity, and enqueues a job to resume the agent.

  4. Agent resumes from persisted state. The job runner picks up the queued job, rehydrates the agent state, and continues execution from the checkpoint. The agent has no knowledge of how long it waited — from its perspective, it received an input and is processing it.
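The four steps above can be sketched as a minimal in-memory lifecycle. This is illustrative only: a `Map` stands in for the durable store, the Slack notification and job queue are elided, and all names are assumptions.

```typescript
import { randomUUID } from "crypto";

type CheckpointStatus = "pending" | "processing" | "approved" | "rejected";

interface Checkpoint {
  id: string;
  status: CheckpointStatus;
  state: unknown;     // serialized agent state (history, tool results, outputs)
  nextAction: string; // the computed next step, not the logic that computes it
}

const store = new Map<string, Checkpoint>(); // stand-in for Postgres/Redis/DynamoDB

// Step 1: the agent reaches the checkpoint, serializes its state, and pauses.
function pauseForApproval(state: unknown, nextAction: string): string {
  const id = randomUUID();
  store.set(id, { id, status: "pending", state, nextAction });
  // Step 2 would send the notification here, then return immediately;
  // the caller responds 200 with { status: "pending_approval", approvalId: id }.
  return id;
}

// Steps 3-4: the approver's decision arrives; resume from the snapshot.
function resolveApproval(id: string, approve: boolean): Checkpoint | undefined {
  const cp = store.get(id);
  if (!cp || cp.status !== "pending") return undefined; // already handled: no-op
  cp.status = approve ? "approved" : "rejected";
  return cp; // a worker would rehydrate cp.state and execute cp.nextAction
}
```

Note that the agent never waits: `pauseForApproval` returns immediately, and the human's decision arrives later as a separate call.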

What to persist in the state snapshot

The state snapshot needs to contain everything the agent needs to resume without any additional context from the caller. In practice this means:

  • The full agent state at the time of the checkpoint (conversation history, tool call results, intermediate outputs)
  • The next action the agent would take if approved (not the logic to compute it — the computed result itself, serialized)
  • The rejection behavior (what the agent should do if the human says no — roll back, notify, escalate)
  • The approver identity or role (so you can validate that the person clicking Approve is authorized for this class of decision)
  • An expiry timestamp (what happens if nobody approves within 48 hours — automatic rejection, escalation, or timeout alert)
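One possible TypeScript shape for this snapshot, mirroring the list above. The field names are illustrative, not a prescribed schema.

```typescript
interface ApprovalCheckpoint {
  approvalId: string;
  agentState: {
    conversationHistory: unknown[];               // messages and tool call results
    intermediateOutputs: Record<string, unknown>;
  };
  nextAction: { name: string; args: Record<string, unknown> }; // computed, serialized
  onReject: "rollback" | "notify" | "escalate";   // rejection behavior
  approverRole: string;                           // who may approve this class of decision
  expiresAt: string;                              // ISO timestamp; expiry policy applies after this
}

// Expiry check a monitoring job might run against pending checkpoints.
function isExpired(cp: ApprovalCheckpoint, now: Date = new Date()): boolean {
  return now.getTime() > Date.parse(cp.expiresAt);
}
```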

Avoid storing anything that will change by the time the agent resumes. If your agent was about to write to a record, do not store the record ID and assume it still exists — by the time approval comes, it may have been deleted. Store the intent and re-validate at resume time.
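A sketch of that re-validation step, assuming a hypothetical `findRecord` lookup injected by the caller: the snapshot stores the intent (how to re-find the record plus the change to apply), and the resume path checks the record still exists before acting.

```typescript
interface WriteIntent {
  table: string;
  recordKey: { field: string; value: string }; // how to re-find the record
  patch: Record<string, unknown>;              // the change to apply if it still exists
}

function revalidate(
  intent: WriteIntent,
  findRecord: (table: string, field: string, value: string) => unknown,
): { ok: true; record: unknown } | { ok: false; reason: string } {
  const record = findRecord(intent.table, intent.recordKey.field, intent.recordKey.value);
  if (!record) {
    // The record was deleted while the approval was pending: fail cleanly
    return { ok: false, reason: "target_missing" };
  }
  return { ok: true, record };
}
```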

Resume endpoint design

The resume endpoint is the critical path. It needs to be idempotent (clicking Approve twice should not run the next step twice), authenticated (anonymous links are a security problem), and fast (the approver should see a confirmation immediately, not wait for the next agent step to complete).

// POST /api/agent/resume
export async function POST(req: Request) {
  const { approvalId, action, approverId } = await req.json();

  // Authenticate before touching the checkpoint.
  // (isAuthorizedApprover is app-specific: session check, role check, etc.)
  if (!(await isAuthorizedApprover(approverId, approvalId))) {
    return Response.json({ error: "Not authorized" }, { status: 403 });
  }

  // Atomic claim: only the first request moves pending -> processing,
  // so a double-click or a retried request is a no-op. A separate
  // find-then-update would race under concurrent clicks.
  const claimed = await db.agentCheckpoints.updateMany({
    where: { id: approvalId, status: "pending" },
    data: { status: "processing", resolvedBy: approverId, resolvedAt: new Date() },
  });
  if (claimed.count === 0) {
    return Response.json({ error: "Invalid or already processed" }, { status: 400 });
  }

  // Enqueue the resume job — do not run it inline
  await queue.enqueue("agent.resume", { checkpointId: approvalId, action });

  return Response.json({ message: "Decision recorded, processing..." });
}

The key line is the last one: enqueue the resume job, return immediately. Do not await the job inline. The approver gets an instant confirmation; the agent continues in the background.
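The worker side of that queue might look like the following sketch. The `load`, `runAgent`, and `markDone` dependencies are assumptions, injected so the handler stays testable; the real versions would read the checkpoint table and call your agent runtime.

```typescript
type ResumeJob = { checkpointId: string; action: "approve" | "reject" };

interface ResumeDeps {
  load: (id: string) => Promise<{ state: unknown; nextAction: string } | null>;
  runAgent: (state: unknown, nextAction: string) => Promise<void>;
  markDone: (id: string, status: "completed" | "rejected" | "failed") => Promise<void>;
}

// Job handler: runs in the background after the resume endpoint has returned.
async function handleResume(job: ResumeJob, deps: ResumeDeps): Promise<void> {
  const cp = await deps.load(job.checkpointId);
  if (!cp) return; // checkpoint vanished; nothing to resume

  if (job.action === "reject") {
    // The snapshot's rejection behavior (rollback/notify/escalate) runs here
    await deps.markDone(job.checkpointId, "rejected");
    return;
  }

  try {
    // Rehydrate the state and continue exactly where the agent stopped
    await deps.runAgent(cp.state, cp.nextAction);
    await deps.markDone(job.checkpointId, "completed");
  } catch {
    await deps.markDone(job.checkpointId, "failed"); // retry/alert policy lives here
  }
}
```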

Latency impact and how to measure it

Adding a human checkpoint makes your latency profile bimodal. Requests that do not hit the checkpoint keep the same P95 latency as before. Requests that do hit the checkpoint have a latency distribution that matches human response times — median 20–40 minutes during business hours, potentially hours or days if the approver is in a different time zone.

The metric to track is not average latency (which is meaningless for async workflows) but time-to-resolution by checkpoint type: the elapsed time from when the approval request was sent to when the agent resumed. Set SLA targets per checkpoint type and alert when they are breached. A contract approval that has been pending for 6 hours with no action is a process problem, not a code problem — but your monitoring needs to surface it.
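The metric can be computed directly from the checkpoint timestamps. A sketch with illustrative SLA thresholds (the checkpoint type names and limits are assumptions):

```typescript
// Per-type SLA limits in minutes; values are illustrative, not recommendations.
const slaMinutes: Record<string, number> = {
  contract_approval: 360, // 6 hours
  refund_approval: 60,    // 1 hour
};

// Time-to-resolution: elapsed time from notification sent to agent resumed.
function timeToResolutionMinutes(sentAt: Date, resumedAt: Date): number {
  return (resumedAt.getTime() - sentAt.getTime()) / 60_000;
}

// For still-pending checkpoints, pass the current time as the second argument
// to decide whether to fire an alert.
function breachedSla(checkpointType: string, sentAt: Date, now: Date): boolean {
  const limit = slaMinutes[checkpointType];
  if (limit === undefined) return false; // no SLA configured for this type
  return timeToResolutionMinutes(sentAt, now) > limit;
}
```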

The one pattern that breaks this architecture

Nested human-in-the-loop checkpoints — where approval of the first step triggers a workflow that itself requires a second approval — are hard. Each checkpoint needs its own state snapshot, its own approval_id, and its own resume path. If the second approval depends on the outcome of the first, you now have a dependency graph in your approval workflow, not just a single checkpoint.

The practical fix: flatten the approval whenever possible. If you need two humans to approve sequentially, can you gather both approvals simultaneously? If the first approval changes what the second approver sees, consider whether a single approval with both parties in the same Slack thread eliminates the dependency. Most multi-step approvals can be collapsed to one step with a richer context surface.

For the workflows where nested approvals are unavoidable, LangGraph's checkpoint system handles this natively with its persistence layer and interrupt mechanism — it is one of the genuinely good reasons to reach for the framework rather than build your own state machine.

For cost planning on agent workflows with approval steps, the AI Agent Cost Estimator lets you model the per-run cost including the LLM calls that happen before and after the checkpoint.

Related reading

Token Budgets for AI Features: How I Give PMs a Real Answer in 10 Minutes

A repeatable method for calculating token costs per AI feature — the inputs PMs need, how to build a per-feature cost model, and the three assumptions that kill every estimate if you get them wrong.


OpenAI Assistants API vs. Building Your Own Loop: What Actually Broke for Us

A firsthand account of shipping with the OpenAI Assistants API and then migrating to a custom agent loop — the specific failure modes, the latency problem, the debugging wall, and when each approach is actually the right choice.


Why Your n8n AI Workflow Silently Breaks at 3 AM (And the 4 Observability Hooks That Catch It)

The four failure modes that kill n8n AI workflows overnight — credential expiry, LLM rate limits, webhook timeouts, and data schema drift — and the specific observability hooks that catch them before your client notices.


LangGraph vs. n8n vs. a Custom State Machine: Honest Trade-offs After Shipping All Three

A firsthand comparison of LangGraph, n8n, and custom state machines for AI agent orchestration — covering where each breaks down, what it costs to debug, and which to reach for based on your actual constraints.

