How We Built a Solution That Runs Long-Lived LLM Agents

by Peter Wang

Introduction

Most cloud platforms—AWS, GCP, Azure—are optimized for stateless web apps or short-lived serverless functions. But deploying long-lived, stateful LLM agents is another beast entirely. You need durability, resilience, and observability. When we tried to push our own multi-agent AI system to production, we hit walls: the infrastructure work was not only complex and time-consuming, but also unstable.

So we built Agentainer-Lab (GitHub), a local runtime architecture designed specifically for long-running autonomous agents.

Website: Agentainer.io

(Users who sign up for early access will get to try the production-grade service for free.)

  1. The Problem with Naive Docker Setups

Here’s what most developers try first:

# docker-compose.yml
services:
  agent:
    image: my-agent-image:latest
    ports:
      - "5000:5000"
    restart: always

This works—until it doesn’t. Here’s what goes wrong:

  • No snapshot of internal agent state after restart

  • Restart loops silently fail if Docker crashes

  • No observability/logging without extra setup

  • No clean API endpoint mapping per agent

We needed something better.

  2. Core Requirements

  • 24/7 runtime

  • Resilient auto-restart

  • Dynamic agent API mounting

  • Redis for runtime memory

  • PostgreSQL for long-term snapshots

  • Native Docker support (no K8s locally)

  3. Supervisor: the Go-Based Agent Manager

At the heart of Agentainer-Lab is the supervisor. It’s a Go service that listens to Docker events and acts as an agent lifecycle orchestrator.

func watchDockerEvents(ctx context.Context) {
  cli, err := client.NewClientWithOpts(client.FromEnv)
  if err != nil {
    log.Fatalf("docker client: %v", err)
  }
  // The Docker SDK returns an event channel and an error channel;
  // both need to be drained.
  events, errs := cli.Events(ctx, types.EventsOptions{})
  for {
    select {
    case msg := <-events:
      // A "die" event means an agent container exited, cleanly or not.
      if msg.Type == "container" && msg.Action == "die" {
        handleAgentCrash(msg.Actor.ID)
      }
    case err := <-errs:
      log.Printf("event stream error: %v", err)
      return
    }
  }
}

func handleAgentCrash(containerID string) {
  agentID := lookupAgentID(containerID)
  latestSnapshot := loadFromPostgres(agentID) // most recent row in agent_snapshots
  restartAgent(agentID, latestSnapshot)
}

This lets us handle crashes gracefully, and more importantly, lets us track them.

  4. State Architecture: Redis + PostgreSQL

We split memory usage by time horizon:

  • Redis: stores ephemeral data like agent heartbeat, in-flight tokens, retry flags.

  • Postgres: stores agent code metadata and full snapshots.

CREATE TABLE agent_snapshots (
  id UUID PRIMARY KEY,
  agent_id UUID NOT NULL,
  snapshot JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

Snapshots can be saved periodically from within the agent logic or externally via /snapshot API calls.

  5. API Routing

We use Gin (Go framework) to dynamically expose each agent via REST or gRPC endpoints.

// A wildcard (*path) matches nested routes like /agent-123/tasks/run,
// which a single-segment :path parameter would not.
router.POST("/:agentId/*path", handleAgentRequest)

func handleAgentRequest(c *gin.Context) {
  agentId := c.Param("agentId")
  route := c.Param("path") // wildcard param includes the leading slash
  forwardToAgent(agentId, route, c.Request)
}

This gives us clean routes like:

POST /agent-123/process

And internally reroutes the request to the correct container’s port.
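One way to implement that rerouting is the standard library's reverse proxy. The sketch below is an assumption about how it could work, not the actual Agentainer-Lab code: the in-memory port map would in practice be populated by the supervisor at deploy time, and the signature is adapted to include the response writer, which the proxy needs.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// agentPorts maps agent IDs to the host port their container listens on.
var agentPorts = map[string]int{"agent-123": 6000}

// buildTargetURL computes the container-local URL a request should be
// rewritten to.
func buildTargetURL(agentID, path string) (string, error) {
	port, ok := agentPorts[agentID]
	if !ok {
		return "", fmt.Errorf("unknown agent %q", agentID)
	}
	return fmt.Sprintf("http://localhost:%d%s", port, path), nil
}

// forwardToAgent proxies the incoming request to the agent's container.
func forwardToAgent(w http.ResponseWriter, r *http.Request, agentID, path string) {
	target, err := buildTargetURL(agentID, path)
	if err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	u, _ := url.Parse(target)
	proxy := httputil.NewSingleHostReverseProxy(&url.URL{Scheme: u.Scheme, Host: u.Host})
	r.URL.Path = u.Path
	proxy.ServeHTTP(w, r)
}
```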

  6. Docker Runtime Per Agent

When we deploy an agent:

docker run \
  --name agent-abc \
  -p 6000:5000 \
  -e AGENT_ID=abc \
  -v agent_data:/app/data \
  my-agent:latest

Each agent gets:

  • Dedicated network port

  • Tokenized API key

  • Isolated volume

  • Retry + restart config

This creates isolation without requiring a full orchestrator like K8s.
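Since each agent occupies its own host port, the supervisor needs a small port allocator. A minimal sketch (the port range and bookkeeping are assumptions for illustration):

```go
package main

import "fmt"

// PortAllocator hands out host ports from a fixed range, one per agent.
type PortAllocator struct {
	next, max int
	assigned  map[string]int
}

func NewPortAllocator(start, max int) *PortAllocator {
	return &PortAllocator{next: start, max: max, assigned: map[string]int{}}
}

// Allocate returns the agent's existing port if it already has one,
// otherwise the next free port in the range.
func (p *PortAllocator) Allocate(agentID string) (int, error) {
	if port, ok := p.assigned[agentID]; ok {
		return port, nil
	}
	if p.next > p.max {
		return 0, fmt.Errorf("port range exhausted")
	}
	port := p.next
	p.next++
	p.assigned[agentID] = port
	return port, nil
}
```

The allocated port becomes the `-p {port}:5000` mapping in the `docker run` command above.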

  7. Crash Recovery Flow

When Docker fires a die event:

  1. We detect it in the supervisor

  2. Check Redis → mark agent as unhealthy

  3. Pull latest snapshot from Postgres

  4. Re-spin container with restored snapshot loaded via startup command or /restore endpoint
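Step 4 amounts to handing the saved snapshot back to the fresh container. A hedged sketch of building that `/restore` call — the endpoint name and payload shape are illustrative assumptions:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// buildRestoreRequest constructs the POST that re-injects the latest
// Postgres snapshot into a freshly restarted agent container.
// baseURL is the agent's endpoint, e.g. http://localhost:6000.
func buildRestoreRequest(baseURL string, snapshot []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, fmt.Sprintf("%s/restore", baseURL), bytes.NewReader(snapshot))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}
```

The supervisor fires this request once the new container reports healthy, closing the recovery loop.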

  8. Sample Use Case

Let’s say you build a scheduling agent that sends email summaries every day at 9 AM. It reads feeds, generates text using GPT, and emails via SendGrid.

The agent logic handles time-based triggers itself. But you still need:

  • Persistent runtime

  • Logging

  • Crash resilience

  • Daily summary logs (stored in Redis)

All of this is managed automatically by Agentainer-Lab, and you can restart the agent with a single Docker call.

9. Known Limitations

  • No container pool yet (one agent = one container)

  • Limited snapshot versioning (for now)

  • No inter-agent messaging (coming soon)

  • Basic WebSocket logging only (Grafana/log aggregation later)

10. Future Features (Soon available on Agentainer.io)

  • Auto-scaling: Agents scale with workload via load balancing across containers; because state lives in our persistence layer rather than in any single container, all instances of an agent share the same memory.

  • Message bus: Enables efficient multi-agent communication internally on the platform.

  • Enhanced metrics/logs: A production-grade metrics and logging dashboard (similar to DataDog).

  • Team workspace: A workspace where you can see what others are working on and manage all agents in one place.

  • Flexible database: Connect any external or internal database of your choice, per agent.

  • White-labeling: Use your own domain for agent endpoints. For example, https://{yourDomain}/agents/{age....

-----------

Agentainer-Lab is a developer-first runtime designed to make agent deployment seamless—not just for devs, but for coding agents themselves. We’ve removed 99% of the ops work required to ship long-lived AI workloads.

GitHub: Agentainer-Lab

Website (Signup for early access): Agentainer.io

