How We Built a Solution That Runs Long-Lived LLM Agents

by Peter Wang

Introduction

Most cloud platforms—AWS, GCP, Azure—are optimized for stateless web apps or short-lived serverless functions. But deploying long-lived, stateful LLM agents is another beast entirely. You need durability, resilience, and observability. When we tried to push our own multi-agent AI system to production, we hit walls: the infrastructure work was not only complex and time-consuming, but also unstable.

So we built Agentainer-Lab (GitHub), a local runtime architecture designed specifically for long-running autonomous agents.

Website: Agentainer.io

(Users who sign up for early access will get to try the production-grade service for free.)

  1. The Problem with Naive Docker Setups

Here’s what most developers try first:

# docker-compose.yml
services:
  agent:
    image: my-agent-image:latest
    ports:
      - "5000:5000"
    restart: always

This works—until it doesn’t. Here’s what goes wrong:

  • No snapshot of internal agent state after restart

  • Restart loops silently fail if Docker crashes

  • No observability/logging without extra setup

  • No clean API endpoint mapping per agent

We needed something better.

  2. Core Requirements

  • 24/7 runtime

  • Resilient auto-restart

  • Dynamic agent API mounting

  • Redis for runtime memory

  • PostgreSQL for long-term snapshots

  • Native Docker support (no K8s locally)

  3. Supervisor: the Go-Based Agent Manager

At the heart of Agentainer-Lab is the supervisor. It’s a Go service that listens to Docker events and acts as an agent lifecycle orchestrator.

func watchDockerEvents(ctx context.Context) {
  cli, err := client.NewClientWithOpts(client.FromEnv)
  if err != nil {
    log.Fatalf("docker client: %v", err)
  }
  // The Docker SDK returns an event channel and an error channel;
  // both need to be drained.
  events, errs := cli.Events(ctx, types.EventsOptions{})
  for {
    select {
    case msg := <-events:
      // A "die" event means an agent container exited, cleanly or not.
      if msg.Type == "container" && msg.Action == "die" {
        handleAgentCrash(msg.Actor.ID)
      }
    case err := <-errs:
      log.Printf("event stream error: %v", err)
      return
    }
  }
}

func handleAgentCrash(containerID string) {
  agentID := lookupAgentID(containerID)
  latestSnapshot := loadFromPostgres(agentID) // most recent row in agent_snapshots
  restartAgent(agentID, latestSnapshot)
}

This lets us handle crashes gracefully, and more importantly, lets us track them.

  4. State Architecture: Redis + PostgreSQL

We split memory usage by time horizon:

  • Redis: stores ephemeral data like agent heartbeat, in-flight tokens, retry flags.

  • Postgres: stores agent code metadata and full snapshots.

CREATE TABLE agent_snapshots (
  id UUID PRIMARY KEY,
  agent_id UUID NOT NULL,
  snapshot JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

Snapshots can be saved periodically from within the agent logic or externally via /snapshot API calls.

  5. API Routing

We use Gin (Go framework) to dynamically expose each agent via REST or gRPC endpoints.

// A wildcard (*path) matches nested routes like /agent-123/tasks/run,
// which a single-segment :path parameter would not.
router.POST("/:agentId/*path", handleAgentRequest)

func handleAgentRequest(c *gin.Context) {
  agentId := c.Param("agentId")
  route := c.Param("path") // wildcard param includes the leading slash
  forwardToAgent(agentId, route, c.Request)
}

This gives us clean routes like:

POST /agent-123/process

And internally reroutes the request to the correct container’s port.
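One way to implement that rerouting is the standard library's reverse proxy. The sketch below is an assumption about how it could work, not the actual Agentainer-Lab code: the in-memory port map would in practice be populated by the supervisor at deploy time, and the signature is adapted to include the response writer, which the proxy needs.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// agentPorts maps agent IDs to the host port their container listens on.
var agentPorts = map[string]int{"agent-123": 6000}

// buildTargetURL computes the container-local URL a request should be
// rewritten to.
func buildTargetURL(agentID, path string) (string, error) {
	port, ok := agentPorts[agentID]
	if !ok {
		return "", fmt.Errorf("unknown agent %q", agentID)
	}
	return fmt.Sprintf("http://localhost:%d%s", port, path), nil
}

// forwardToAgent proxies the incoming request to the agent's container.
func forwardToAgent(w http.ResponseWriter, r *http.Request, agentID, path string) {
	target, err := buildTargetURL(agentID, path)
	if err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	u, _ := url.Parse(target)
	proxy := httputil.NewSingleHostReverseProxy(&url.URL{Scheme: u.Scheme, Host: u.Host})
	r.URL.Path = u.Path
	proxy.ServeHTTP(w, r)
}
```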

  6. Docker Runtime Per Agent

When we deploy an agent:

docker run \
  --name agent-abc \
  -p 6000:5000 \
  -e AGENT_ID=abc \
  -v agent_data:/app/data \
  my-agent:latest

Each agent gets:

  • Dedicated network port

  • Tokenized API key

  • Isolated volume

  • Retry + restart config

This creates isolation without requiring a full orchestrator like K8s.
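Since each agent occupies its own host port, the supervisor needs a small port allocator. A minimal sketch (the port range and bookkeeping are assumptions for illustration):

```go
package main

import "fmt"

// PortAllocator hands out host ports from a fixed range, one per agent.
type PortAllocator struct {
	next, max int
	assigned  map[string]int
}

func NewPortAllocator(start, max int) *PortAllocator {
	return &PortAllocator{next: start, max: max, assigned: map[string]int{}}
}

// Allocate returns the agent's existing port if it already has one,
// otherwise the next free port in the range.
func (p *PortAllocator) Allocate(agentID string) (int, error) {
	if port, ok := p.assigned[agentID]; ok {
		return port, nil
	}
	if p.next > p.max {
		return 0, fmt.Errorf("port range exhausted")
	}
	port := p.next
	p.next++
	p.assigned[agentID] = port
	return port, nil
}
```

The allocated port becomes the `-p {port}:5000` mapping in the `docker run` command above.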

  7. Crash Recovery Flow

When Docker fires a die event:

  1. We detect it in the supervisor

  2. Check Redis → mark agent as unhealthy

  3. Pull latest snapshot from Postgres

  4. Re-spin container with restored snapshot loaded via startup command or /restore endpoint
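Step 4 amounts to handing the saved snapshot back to the fresh container. A hedged sketch of building that `/restore` call — the endpoint name and payload shape are illustrative assumptions:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// buildRestoreRequest constructs the POST that re-injects the latest
// Postgres snapshot into a freshly restarted agent container.
// baseURL is the agent's endpoint, e.g. http://localhost:6000.
func buildRestoreRequest(baseURL string, snapshot []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, fmt.Sprintf("%s/restore", baseURL), bytes.NewReader(snapshot))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}
```

The supervisor fires this request once the new container reports healthy, closing the recovery loop.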

  8. Sample Use Case

Let’s say you build a scheduling agent that sends email summaries every day at 9 AM. It reads feeds, generates text using GPT, and emails via SendGrid.

The agent logic handles time-based triggers itself. But you still need:

  • Persistent runtime

  • Logging

  • Crash resilience

  • Daily summary logs (stored in Redis)

All of this is managed automatically by Agentainer-Lab, and you can restart the agent with a single Docker call.

9. Known Limitations

  • No container pool yet (one agent = one container)

  • Limited snapshot versioning (for now)

  • No inter-agent messaging (coming soon)

  • Basic WebSocket logging only (Grafana/log aggregation later)

10. Future Features (Soon available on Agentainer.io)

  • Auto-scaling: Agents scale with workload via load balancing across containers; because state lives in our persistence layer rather than in any single container, all instances of an agent share the same memory.

  • Message bus: Enables efficient multi-agent communication internally on the platform.

  • Enhanced metrics/logs: A production-grade metrics and logging dashboard (similar to DataDog).

  • Team workspace: A workspace where you can see what others are working on and manage all agents in one place.

  • Flexible database: Connect any external or internal database of your choice, per agent.

  • White-labeling: Use your own domain for agent endpoints. For example, https://{yourDomain}/agents/{age....

-----------

Agentainer-Lab is a developer-first runtime designed to make agent deployment seamless—not just for devs, but for coding agents themselves. We’ve removed 99% of the ops work required to ship long-lived AI workloads.

GitHub: Agentainer-Lab

Website (Signup for early access): Agentainer.io

