TL;DR

We migrated our AI agent infrastructure from Python to Rust. The results:

  • 10x reduction in p99 latency
  • 50x reduction in memory usage
  • Zero runtime errors in production
  • Infinite developer happiness (okay, that one's subjective)

Here's exactly how we did it.

The Problem

Our Python-based agent orchestrator was hitting walls. With 50+ concurrent agents, we were seeing:

p50 latency: 120ms
p99 latency: 2.4s
Memory usage: 8GB+
Error rate: 0.3% (mostly OOM kills)

For an AI system that needs to feel responsive, 2.4 second tail latencies are unacceptable.

Why Rust?

Three reasons:

  1. Zero-cost abstractions — Express high-level concepts without runtime overhead
  2. Fearless concurrency — The borrow checker prevents data races at compile time
  3. Predictable performance — No GC pauses, no runtime surprises

The Architecture

Here's our agent orchestration core:

use tokio::sync::mpsc;
use sqlx::PgPool;

pub struct AgentOrchestrator {
    pool: PgPool,
    agents: HashMap<AgentId, AgentHandle>,
    message_rx: mpsc::Receiver<AgentMessage>,
}

impl AgentOrchestrator {
    pub async fn run(&mut self) {
        while let Some(msg) = self.message_rx.recv().await {
            match msg {
                AgentMessage::Task(task) => {
                    self.dispatch_task(task).await;
                }
                AgentMessage::Result(result) => {
                    self.handle_result(result).await;
                }
            }
        }
    }

    async fn dispatch_task(&self, task: AgentTask) {
        let agent = self.agents.get(&task.agent_id)
            .expect("Agent not found");

        agent.send(task).await
            .expect("Failed to send task");
    }
}

The key insight: Rust's async/await with Tokio gives us Python-like ergonomics with C-like performance.

Database Layer

We use SQLx for compile-time checked SQL:

pub async fn get_agent_state(
    pool: &PgPool,
    agent_id: &AgentId,
) -> Result<AgentState, sqlx::Error> {
    sqlx::query_as!(
        AgentState,
        r#"
        SELECT id, status, last_heartbeat, task_count
        FROM agents
        WHERE id = $1
        "#,
        agent_id.as_str()
    )
    .fetch_one(pool)
    .await
}

If the query is wrong, it fails at compile time. Not in production at 3 AM.

The Results

After migration:

p50 latency: 12ms  (10x improvement)
p99 latency: 48ms  (50x improvement)
Memory usage: 150MB (53x improvement)
Error rate: 0.0%

Lessons Learned

1. Start with the hot path

Don't rewrite everything. Identify the performance-critical code and migrate that first.

2. Keep Python for experimentation

We still use Python for rapid prototyping. Rust is for production.

3. Invest in testing

Rust's type system catches many bugs, but not all. We have extensive integration tests.

4. Embrace the learning curve

Rust is hard at first. After the first month, productivity exceeds Python.

Open Source

We're open-sourcing this stack at tyingshoelaces. The core orchestrator, MCP integration, and agent runtime are all available.


Want to discuss AI infrastructure? Find me on Twitter @tyingshoelaces or check out tyingshoelaces.com.