Engineering with AI agents

Hi,

For more than a year, I have been using coding agents, mostly Claude Code. During that time, I have barely written code myself. Instead, my role has shifted toward planning, reviewing and understanding code. I also use these agents for studying, brainstorming and problem-solving. I definitely struggle with the blurred boundary between my own work and orchestrated LLM work. I feel like a code DJ, mixing ideas and samples from someone else.

One topic that still feels underexplored in many discussions about AI agents is engineering. Agentic design matters, but so does the underlying infrastructure. In practice, AI agents are distributed systems. They rely on repeated API calls to LLMs, coordinate state across multiple steps, handle failures and have to deal with uncertainty, latency and partial results. To understand these problems better, I use Claude Code and Codex to generate an example project, which I then analyze in detail. My goal is to understand the engineering problems behind the code. The agent is mocked in this example.

As a first example, I use my ticketflow project, which is built with Temporal.io. Temporal is a platform for building reliable workflows. It guarantees that workflows can resume from exactly the state where they left off, even if a worker process crashes.

The project implements a simple ticket system where a mock agent resolves support tickets with a conditional human-in-the-loop step. I use this project for studying distributed systems, connecting patterns in the code to ideas from Martin Kleppmann’s Designing Data-Intensive Applications (DDIA from here on). In particular, I use it to understand reliability, derived data, asynchronous processing, idempotent side effects and observability.

The goal is to understand which problems these patterns solve, which trade-offs they introduce and how they apply to AI engineering.

Project introduction

A support ticket is not a good fit for a single request/response handler. The system has to call an unreliable agent, retry transient failures, wait for human approval, execute a refund, send a reply and answer status checks while the process is still running.

I picked Temporal because each ticket needs a durable orchestration with a sequence of decisions and side effects that may span multiple steps, failures and waiting periods. The ticket flow is as follows: classify the ticket, draft a reply, wait for human approval, then either send the reply and execute the approved action or reject the proposed resolution. The flow also includes fallback queues in case an agent is not available.

The service is split into four components:

Temporal server: Stores workflow state and schedules tasks.
Python Temporal worker: Polls the ticketflow task queue and runs workflows and fast side-effect activities.
LLM Temporal worker: Polls the ticketflow-agent queue with a shared rate limit and the agent-fallback queue.
FastAPI app: Accepts ticket requests and starts, queries and updates Temporal workflows.

SQLite is used as a read model. Temporal deletes workflow history after a retention period and querying a workflow needs a live worker to replay it. Once history is gone or no worker is up, Temporal returns NOT_FOUND/unavailable. As a fallback, each workflow writes its final result to SQLite on completion, so the result stays queryable even when Temporal cannot answer. This ensures data availability while leveraging Temporal’s durability for the core workflow execution.

This gives the project its first connection to DDIA (Chapter 1: Reliable, Scalable and Maintainable Applications). Reliability is not only about avoiding faults, but also about deciding which faults the system can tolerate. The four components do not fail in the same way. If the API is down, no new tickets can be created and status checks fail, but running workflows continue. If a worker is down, work waits in Temporal and resumes when the worker returns. If the LLM worker is overloaded or unavailable, agent calls can wait, time out and move to a fallback queue. If the Temporal server is down, the system has lost its durable coordinator. So a dead worker is a fault the system can absorb, while a dead Temporal server is a system outage. Separating the LLM worker is not required for correctness, but it creates an independent failure domain for a rate-limited dependency.

Durable execution: remembering where the ticket is

The first problem is not the agent call itself. The first problem is remembering where the ticket is when the process running it disappears. A ticket can be in one of several states:

waiting for an LLM
retrying after a transient failure
waiting for human approval
waiting for long-running execution
executing a refund
sending a final reply

If this process lives only in a Python request handler or an in-memory background task, a worker restart turns into a bug: the ticket may get stuck, be repeated, be forgotten or be completed twice.

This needs durable execution, where the ticket progress is recorded independently of the currently running process. Temporal solves this by treating the workflow as a replayed state machine. Temporal workflow code looks like normal control flow but its state is reconstructed from an event history.

The ticket workflow follows the ticket process until it is finished. Local fields such as _status, _draft and _decision are reconstructed by replaying the event history.
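
The replay idea can be sketched without Temporal at all. This is a minimal illustration and not Temporal's actual replay mechanism: the event names and the `apply` and `replay` functions are hypothetical, and only the field names mirror the workflow fields mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TicketState:
    status: str = "received"
    draft: Optional[str] = None
    decision: Optional[str] = None


def apply(state: TicketState, event: dict) -> TicketState:
    """Apply one recorded event to the in-memory state."""
    kind = event["type"]
    if kind == "draft_created":
        state.draft = event["text"]
        state.status = "awaiting_approval"
    elif kind == "approval_received":
        state.decision = event["decision"]
        state.status = "approved" if event["decision"] == "approve" else "rejected"
    return state


def replay(history: list) -> TicketState:
    """Rebuild the state from scratch by applying every event in order."""
    state = TicketState()
    for event in history:
        state = apply(state, event)
    return state


history = [
    {"type": "draft_created", "text": "We are sorry about the delay."},
    {"type": "approval_received", "decision": "approve"},
]

# A crashed worker loses its memory. A new worker replays the same history
# and arrives at the same state.
before_crash = replay(history)
after_restart = replay(history)
```

The important property is that `apply` is deterministic: the same history always produces the same state, which is why the in-memory fields do not need to be persisted separately.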

Another important point is that waiting for human approval also looks like ordinary async code:

        try:
            await workflow.wait_condition(
                lambda: self._decision is not None, timeout=APPROVAL_TIMEOUT
            )
        except asyncio.TimeoutError:
            ...  # timeout handling is elided in this excerpt

This does not hold a worker thread for the full APPROVAL_TIMEOUT. Temporal records a durable timer and can resume the workflow later. The worker process can disappear in the meantime, but the workflow state still exists in Temporal’s event history.

Approval enters the workflow through an update, which is called by the corresponding HTTP request:

    @workflow.update
    async def submit_approval(self, decision: ApprovalDecision) -> TicketStatus:
        """Accept a human decision and return the resulting status."""
        self._decision = decision
        await workflow.wait_condition(
            lambda: self._status != TicketStatus.AWAITING_APPROVAL
        )
        return self._status

This is useful because approval is not just an external HTTP request. It is a state transition inside the workflow. The workflow can validate the decision, reject duplicate or late approvals and return the resulting status to the caller.

A potential bug is that a late approval could arrive while the workflow is already finishing an escalation. In that case the workflow is logically done, but the terminal status has not been written yet because the final activities are still running. The fix is to set the terminal status before running those final activities:

    async def _finish(
        self, *, reply_text: str, refund: bool, status: TicketStatus
    ) -> TicketResult:
        if self._ticket is None:
            raise ApplicationError("workflow has no ticket", non_retryable=True)
        # Set the terminal status before the final activities so the approval
        # validator rejects updates that arrive while they are still running.
        self._set_status(status)

A useful lesson is that races inside workflow logic often become ordering bugs in a single logical thread. The fix is usually not a lock. The fix is to put the state transition in the right place in the event history. This fix leads to another state-modeling problem: the workflow needs to distinguish between “no more approvals accepted” and “all final side effects completed”, since the final escalation activities can also fail.
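
The ordering fix can be illustrated with a plain Python class. This is a simplified sketch and not the project's workflow code: `TicketFlow`, `LateApprovalError` and the status strings are hypothetical stand-ins, and the late approval is simulated by calling it from inside the final steps.

```python
class LateApprovalError(Exception):
    """Raised when an approval arrives after the ticket stopped accepting them."""


class TicketFlow:
    def __init__(self) -> None:
        self.status = "awaiting_approval"
        self.decision = None

    def submit_approval(self, decision: str) -> None:
        # Validator: only accept a decision while the ticket is waiting for one.
        if self.status != "awaiting_approval":
            raise LateApprovalError(f"ticket is already {self.status}")
        self.decision = decision

    def finish(self, status: str, final_steps) -> None:
        # Close the approval window first, then run the final side effects.
        self.status = status
        for step in final_steps:
            step()


flow = TicketFlow()
rejected = []


def late_approval() -> None:
    # Simulates an approval that arrives while final activities are running.
    try:
        flow.submit_approval("approve")
    except LateApprovalError as exc:
        rejected.append(str(exc))


flow.finish("escalated", final_steps=[late_approval])
```

If `finish` set the status after the loop instead, the late approval would pass the validator and overwrite the decision of a ticket that was already escalated.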

This connects directly to DDIA (Chapter 7: Transactions). A Temporal workflow behaves like a small replicated state machine. Deterministic code plus an ordered event history makes it possible to reproduce the same state after a crash. The history also plays a role similar to a write-ahead log. Before the workflow can safely continue, the important decisions and events must be recorded durably.

Timeouts, rate limits, backpressure and fallback routing

Another problem is deciding how long the workflow should wait for an event like an agent/LLM response. This sounds like one question, but in a distributed system it is several different questions.

Timeouts

Did the task wait too long before any worker picked it up?
Did the worker start the task but take too long to finish?
Did the worker stop making progress while the task was running?

Temporal answers these questions with three different timeout types for an activity:

schedule_to_start: how long an activity task may wait in the task queue before a worker picks it up.
start_to_close: how long one activity attempt may run after a worker has started it.
heartbeat_timeout: how long a running activity may go without sending a heartbeat.

AGENT_ACTIVITY_TIMEOUT = timedelta(minutes=2)
AGENT_HEARTBEAT_TIMEOUT = timedelta(seconds=30)
AGENT_SCHEDULE_TO_START_S = config.AGENT_SCHEDULE_TO_START_S

...

async def _execute_agent_activity(self, activity_method, *args, **kwargs):
    primary_options = {
        **kwargs,
        "task_queue": AGENT_TASK_QUEUE,
        "schedule_to_start_timeout": timedelta(seconds=AGENT_SCHEDULE_TO_START_S),
        "start_to_close_timeout": AGENT_ACTIVITY_TIMEOUT,
        "heartbeat_timeout": AGENT_HEARTBEAT_TIMEOUT,
        "retry_policy": SINGLE_ATTEMPT_RETRY_POLICY,
    }

These timeouts turn vague waiting into different failure signals. “The task was never picked up” is different from “the task started but did not finish” and also from “the task started but stopped proving liveness.”

Rate limits and backpressure

Those timeouts are implemented from the workflow’s point of view. But there is another issue: throughput. Even if every activity eventually succeeds, the system should not start unlimited LLM calls. This matters because the LLM provider has a global rate limit and each worker process also has limited local capacity.

So Temporal workers need to answer two different questions:

How many LLM calls may the whole system start per second?
How many activities can this worker handle at the same time?

max_concurrent_activities=config.AGENT_MAX_CONCURRENT,
max_task_queue_activities_per_second=config.AGENT_MAX_PER_SECOND,

max_concurrent_activities limits how many activities one worker process can run at the same time. If this value is 5, one worker can run up to five activities concurrently. If I start three workers with the same setting, the system runs up to fifteen activities concurrently.

max_task_queue_activities_per_second limits how many activities may be started from the task queue per second. This protects the shared LLM budget. If three workers poll the same LLM task queue, the rate limit still belongs to the task queue. It should not become three times larger just because there are three workers.

So the concurrency limit protects the worker host. The task-queue rate limit protects the shared LLM budget. Scaling workers can increase capacity, but it should not accidentally exceed the provider’s global rate limit.
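
The difference between the two limits shows up in a small calculation. This is only an arithmetic sketch: `AgentQueueLimits` is a hypothetical helper, the value of 5 comes from the example above, and the rate of 2 starts per second is an assumed number, not the project's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentQueueLimits:
    # Stand-in for max_concurrent_activities (enforced per worker process).
    max_concurrent_per_worker: int
    # Stand-in for max_task_queue_activities_per_second (shared by the queue).
    max_starts_per_second: float

    def max_in_flight(self, workers: int) -> int:
        """Upper bound on activities running at once; grows with workers."""
        return workers * self.max_concurrent_per_worker

    def max_starts(self, seconds: float) -> float:
        """Upper bound on activities started in a window; ignores worker count."""
        return self.max_starts_per_second * seconds


limits = AgentQueueLimits(max_concurrent_per_worker=5, max_starts_per_second=2.0)

one_worker = limits.max_in_flight(workers=1)      # 5
three_workers = limits.max_in_flight(workers=3)   # 15
starts_per_minute = limits.max_starts(seconds=60)  # 120.0, for any worker count
```

Adding workers raises the in-flight bound but leaves the start rate untouched, which is the property that keeps the provider's rate limit safe when scaling out.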

A task queue also makes backpressure visible. Backpressure happens when there are more incoming requests than the system can handle. In this project, that means there are more tickets than the LLM workers can process.

Temporal Task Queues are polled by Workers. If the LLM worker is stopped in the middle of a batch, the agent activities are not lost. They stay pending in Temporal’s task queue. When the worker starts again, it continues polling and the workflows can resume.

So Temporal’s task queue acts like the message queue for the agent step. No separate broker is needed here. The growing queue of pending activity tasks is the backpressure signal.

Fallback routing

The schedule_to_start timeout is also used as a routing signal. If an agent activity waits too long in the primary task queue, the workflow treats this as “the primary LLM worker cannot get to this task in time.” Instead of failing the whole ticket, it reruns the agent call on the fallback queue.

try:
    return await workflow.execute_activity_method(
        activity_method,
        *args,
        **primary_options,
    )
except ActivityError as exc:
    if _is_schedule_to_start_timeout(exc):
        return await _execute_fallback_agent_activity(
            activity_method, *args, **kwargs
        )

This makes the fallback queue a degraded mode. In this project, the fallback model is cheaper and reports lower confidence, as you can see in the mock agent:

@classmethod
def fallback(cls, seed: int | None = None) -> "MockAgent":
    """Create a fast, reliable, lower-confidence fallback agent."""
    return cls(
        seed=seed,
        failure_rate=0.0,
        refund_rate=0.0,
        latency_range=(0.0, 0.0),
        confidence_range=(0.0, 0.6),
        model="fallback",
    )

That means the ticket can still make progress, but the result is less trusted. The cost of fallback is therefore visible in the workflow as lower-confidence results that are more likely to require human approval.

There is another trade-off in this fallback design. Normally, Temporal can own the retry loop when an activity fails. The retry policy decides when to try again and the workflow code does not need to manage individual attempts. Still, the fallback routing needs more control. The workflow does not only want to retry the same activity. It wants to inspect the failure and switch from the primary task queue to the fallback task queue when the primary worker cannot pick up the task in time. For that reason, automatic retries are limited, and the workflow owns the retry decision itself.

This moves some retry state from Temporal’s retry policy into application code. The benefit is explicit routing control. The cost is that the workflow becomes more complex.

This connects directly to DDIA (Chapter 8: The Trouble with Distributed Systems and Chapter 11: Stream Processing). In a distributed system, waiting is not neutral. A task may be slow because a worker is overloaded, a queue is growing or a dependency is unavailable. In this project, schedule_to_start makes overload visible, the task queue absorbs backpressure and the fallback queue turns a blocked primary worker into degraded service instead of a full failure. Temporal provides the durable queue, but the application still has to decide what “too slow” means and what should happen next.

Idempotency: exactly-once as a useful lie

The goal is to make failed tasks safe to retry without applying the same effect twice. In this project, the dangerous case is a crash during a refund. A naive retry could refund the customer twice.

The refund activity writes to SQLite. If the activity succeeds but the worker crashes before Temporal receives the acknowledgement, Temporal may run the activity again. The second attempt reaches the same database, but the refund effect is keyed by ticket_id. Because the ticket_id is unique and the activity uses INSERT OR IGNORE, the duplicate refund is ignored. The customer is refunded once.

That is the practical version of exactly-once. The code may run more than once, but the external effect happens once. Temporal gives at-least-once execution and idempotency makes the effect effectively-once.

The activity does not ask, “did I run before?” Instead, the effect is modeled with an idempotency key. refund_attempts logs every attempt, while refunds records the business effect. Both are committed in one transaction, so the log of what was tried and the record of what happened do not disagree.
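
The pattern can be sketched with the standard sqlite3 module. This is a minimal illustration under assumptions, not the project's schema: the column names, the integer `amount_cents` and the `execute_refund` function are hypothetical, while the table names, the ticket_id key and INSERT OR IGNORE follow the description above.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS refunds (
            ticket_id TEXT PRIMARY KEY,      -- idempotency key
            amount_cents INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS refund_attempts (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            ticket_id TEXT NOT NULL,
            applied INTEGER NOT NULL         -- 1 if this attempt created the refund
        );
        """
    )


def execute_refund(conn: sqlite3.Connection, ticket_id: str, amount_cents: int) -> bool:
    """Return True if this call applied the refund, False if it was a duplicate."""
    with conn:  # one transaction: both writes commit or roll back together
        cur = conn.execute(
            "INSERT OR IGNORE INTO refunds (ticket_id, amount_cents) VALUES (?, ?)",
            (ticket_id, amount_cents),
        )
        applied = cur.rowcount == 1  # 0 rows changed means the key already existed
        conn.execute(
            "INSERT INTO refund_attempts (ticket_id, applied) VALUES (?, ?)",
            (ticket_id, int(applied)),
        )
    return applied


conn = sqlite3.connect(":memory:")
init_db(conn)

first = execute_refund(conn, "T-1", 2500)   # original attempt
second = execute_refund(conn, "T-1", 2500)  # retry after a lost acknowledgement

refund_rows = conn.execute("SELECT COUNT(*) FROM refunds").fetchone()[0]
attempt_rows = conn.execute("SELECT COUNT(*) FROM refund_attempts").fetchone()[0]
```

After both calls there is one row in refunds and two rows in refund_attempts, so the retry is recorded without producing a second business effect.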

This is DDIA’s point about retries and uncertainty. The caller cannot always know whether the callee did nothing, did the work and crashed or did the work and lost the reply. Idempotency does not remove that uncertainty. It makes it harmless.

Payloads outlive code: schema evolution in workflow histories

The next problem is that workflow data lives longer than the code that created it. A Temporal workflow may run for hours, days or longer. During that time, the code can change, but the workflow history still contains payloads written by the old version.

I ran an experiment and added a new required field to the Classification model while a workflow was still in flight. When Temporal tried to replay that workflow’s history, the old Classification payload no longer matched the new model. Validation failed and the replay raised an error. The workflow was stuck and could never reach its next activity.

The lesson is a versioning rule: only add fields with defaults, never add required fields, and don’t remove or rename existing ones. Each kind of change breaks a different kind of compatibility. A new required field breaks backward compatibility, because new code can no longer read old histories that don’t contain it. Removing or renaming a field breaks forward compatibility, because workers still running the old code expect the old name in payloads written by the new code. A rename also breaks backward compatibility for payloads already in flight that still use the old name.

The project follows that rule in practice. The model field was added as model: str = "primary", so old payloads that predate the field still deserialize cleanly. Regression tests pin this down by deserializing old-shape payloads against the new models, so a future breaking change fails a test instead of silently bricking a live workflow.
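
A regression check of this kind can be sketched with a plain dataclass. This is an assumption-laden illustration: the project's real Classification model and its validation library are not shown here, the `category` and `confidence` fields and the `decode` helper are hypothetical, and only the `model` field with its "primary" default comes from the text above.

```python
from dataclasses import dataclass, fields


@dataclass
class Classification:
    category: str
    confidence: float
    model: str = "primary"  # added later; the default keeps old payloads valid


def decode(payload: dict) -> Classification:
    """Build a Classification, ignoring keys this version does not know."""
    known = {f.name for f in fields(Classification)}
    return Classification(**{k: v for k, v in payload.items() if k in known})


# Shape written before the model field existed.
old_payload = {"category": "refund", "confidence": 0.9}
# Shape written by the current code.
new_payload = {"category": "refund", "confidence": 0.4, "model": "fallback"}

old = decode(old_payload)
new = decode(new_payload)
```

If `model` had no default, `decode(old_payload)` would raise a TypeError for the missing argument, which is the same kind of failure as the stuck replay in the experiment.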

This is DDIA (Chapter 4: Encoding and Evolution) in practice, where encoded data outlives the code that wrote it.

Derived data and two read paths

The system gets asked two different kinds of questions and they do not need the same storage:

Which tickets are waiting for approval right now?
What happened to ticket X?

For the first question, the ticketflow API queries Temporal’s visibility store, which is similar to querying an index. The workflow writes each state change into a Temporal search attribute, so the API does not need to open every workflow to find tickets in a certain state.

The second question is historical and ticket-specific. The workflow is the natural source of truth, but a live query only works while the history exists and a worker can replay it. So when a workflow finishes, it writes its final outcome once to the SQLite read model.
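
The second read path can be sketched as a lookup with a fallback. This is a simplified sketch, not the project's API code: `WorkflowUnavailable`, `get_ticket_status`, the `ticket_results` table and the injected `live_query` callable are hypothetical stand-ins for the Temporal query and the SQLite read model.

```python
import sqlite3
from typing import Callable, Optional


class WorkflowUnavailable(Exception):
    """Stand-in for 'history is gone' or 'no worker can replay it'."""


def get_ticket_status(
    ticket_id: str,
    live_query: Callable[[str], str],
    conn: sqlite3.Connection,
) -> Optional[str]:
    """Prefer the live workflow; fall back to the stored final result."""
    try:
        return live_query(ticket_id)
    except WorkflowUnavailable:
        row = conn.execute(
            "SELECT status FROM ticket_results WHERE ticket_id = ?",
            (ticket_id,),
        ).fetchone()
        return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ticket_results (ticket_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)
# T-1 finished earlier and wrote its final outcome to the read model.
conn.execute("INSERT INTO ticket_results VALUES (?, ?)", ("T-1", "resolved"))


def live_query(ticket_id: str) -> str:
    # Only T-2 is still running; every other workflow can no longer be queried.
    if ticket_id == "T-2":
        return "awaiting_approval"
    raise WorkflowUnavailable(ticket_id)


running = get_ticket_status("T-2", live_query, conn)
finished = get_ticket_status("T-1", live_query, conn)
unknown = get_ticket_status("T-9", live_query, conn)
```

The running ticket is answered by the live query, the finished one by the read model, and a ticket known to neither yields None.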

This is DDIA’s (Chapter 11: Stream Processing) argument for materialized views and CQRS at toy scale. You don’t serve every read from the source of truth. You derive read-optimized projections from a log and route each question to the view built for it, while accepting that the views can lag and must be rebuildable from the log.

Where this would break in production

The project is useful as a lab, but it is not production-ready. A reliable system should be clear about the faults it handles and the faults it does not handle yet.

The first gap is heartbeats. The agent activities heartbeat before and after the model call, but not during it. For the mock agent this is fine. For a real LLM call that takes longer than the heartbeat timeout, the activity could be treated as dead even though the request is still running. The production fix would be a periodic heartbeat loop around the long-running call.
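
Such a heartbeat loop could look roughly like the following asyncio sketch. It is not the project's code: `with_heartbeat` and `slow_model_call` are hypothetical, and the `heartbeat` callable is injected so the example runs without Temporal. Inside a real activity it would be the SDK's heartbeat call, with an interval well below the heartbeat timeout.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_heartbeat(
    call: Awaitable[T],
    heartbeat: Callable[[], None],
    interval_s: float,
) -> T:
    """Await `call` while invoking `heartbeat` every `interval_s` seconds."""

    async def beat() -> None:
        while True:
            heartbeat()
            await asyncio.sleep(interval_s)

    beater = asyncio.create_task(beat())
    try:
        return await call
    finally:
        beater.cancel()
        # Wait for the cancellation to finish so no task is left dangling.
        await asyncio.gather(beater, return_exceptions=True)


async def slow_model_call() -> str:
    await asyncio.sleep(0.05)  # stand-in for a long LLM request
    return "draft reply"


beats = []
result = asyncio.run(
    with_heartbeat(slow_model_call(), lambda: beats.append(1), interval_s=0.01)
)
```

The heartbeat task is cancelled in the finally block, so it stops both when the call succeeds and when it raises.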

The second gap is the fallback path. The primary agent call has a schedule_to_start budget, so the workflow can notice when the primary queue is stuck and route to fallback. But the fallback path should also have an escape hatch. If both LLM workers are down, tickets should not hang forever.

The third gap is the manual retry loop. It gives the workflow explicit routing control, but it also moves retry logic from Temporal’s retry policy into application code. That is a deliberate trade-off, not a free abstraction.

The fourth gap is deployment. The project uses Temporal’s dev server and Docker Compose with persisted state in a volume. That is useful for experiments, but it is not the same as a production Temporal deployment with real persistence, backups and high availability.

There are also ordinary application gaps, such as refund_amount being a float and the approval endpoint lacking authentication. Those were out of scope for the exercise, but they would not be acceptable production defaults.

This connects back to DDIA (Chapter 1: Reliable, Scalable and Maintainable Applications): reliability is not a vague promise that nothing will fail. It starts with knowing your fault assumptions: which failures the system absorbs, which ones degrade service and which ones are still outages.

Conclusion

This project focuses on the system design aspect of AI engineering. I used Temporal as the orchestrator and together with DDIA it helped me understand the difference between writing an application and engineering a reliable system. DDIA provides the conceptual background and Temporal provides a concrete framework for the implementation.

I want to build more systems like this to develop a better understanding of system design in AI engineering, and also of the strengths and weaknesses of different frameworks and tools. At the same time, it is important not to forget the ML part. LLM calls are probabilistic and non-deterministic. They are not just ordinary API calls, even if they are often wrapped to look like them.

That distinction matters. The engineering layer can make an AI system more reliable around the model with retries, timeouts, idempotency, queues, observability and human approval. But it does not make the model itself deterministic or correct. Good AI engineering has to hold both ideas at once: treat the surrounding system like a distributed system and treat the model output like an uncertain prediction.

Thank you for your attention.
