After 300+ revisions of Kane running on LangGraph, here are the production lessons that cost us weekends.
Lesson 1: The Firestore Singleton Rule
Kane runs as a single long-lived process on Cloud Run. Every tool, every skill, every endpoint shares the same Firestore instance.
Firestore's settings() can only be called once per client, and only before any other operation. If a tool module calls db.settings() after the main process already has, it throws "Firestore has already been initialized". Route every module through a shared getDb() helper. Always.
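A minimal sketch of that helper. The Firestore type here is a stand-in so the example runs without firebase-admin; in the real helper, createDb() would be admin.initializeApp() plus admin.firestore(), and getDb is a hypothetical name, not Kane's actual code.

```typescript
// Stand-in for the firebase-admin Firestore client: settings() may only
// be called once, which is exactly the constraint that bites tool modules.
interface Firestore {
  settings(opts: Record<string, unknown>): void;
}

let instance: Firestore | null = null;
let settingsCalls = 0;

function createDb(): Firestore {
  return {
    settings() {
      settingsCalls += 1;
      if (settingsCalls > 1) {
        throw new Error("Firestore has already been initialized");
      }
    },
  };
}

// Every tool, skill, and endpoint imports this instead of constructing
// its own client. settings() runs exactly once, before first use.
function getDb(): Firestore {
  if (!instance) {
    instance = createDb();
    instance.settings({ ignoreUndefinedProperties: true });
  }
  return instance;
}
```

Because the module-level `instance` lives in the single long-running Cloud Run process, every import sees the same client and the second settings() call never happens.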
Lesson 2: Checkpoints Are Cheap Until They're Not
LangGraph checkpoints are great for conversation continuity. But if you're checkpointing every tool call in a long-running pipeline (like our Morning Wire), you'll end up with 50+ checkpoint writes per article, each as a Firestore document. At scale, your checkpoint collection becomes your largest cost center.
Solution: Only checkpoint at meaningful state transitions, not every step.
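One way to enforce that rule, sketched generically. `saveCheckpoint` is a hypothetical stand-in for whatever your LangGraph checkpointer persists to Firestore, and the `isTransition` flag is an assumption about how you mark meaningful state boundaries; the point is that the write happens per transition, not per step.

```typescript
// A pipeline step either is or is not a meaningful state transition.
type Step = { name: string; isTransition: boolean };

// Each call here stands in for one Firestore document write.
const checkpointWrites: string[] = [];

function saveCheckpoint(stepName: string): void {
  checkpointWrites.push(stepName);
}

function runPipeline(steps: Step[]): void {
  for (const step of steps) {
    // ...execute the step's tool call here...
    if (step.isTransition) {
      // Persist only at boundaries you'd actually resume from,
      // not after every tool call.
      saveCheckpoint(step.name);
    }
  }
}
```

For a Morning Wire-style run, that turns 50+ document writes per article into a handful, one per stage you would actually want to resume from.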
Lesson 3: Tool Routing > Intent Classification
We originally had a two-brain architecture: Slack would classify intent, then route to LangGraph. This added latency and a failure mode for zero benefit.
Now everything goes straight to LangGraph. The LLM is better at routing than any regex or classifier we could build.
Lesson 4: Rate Limiting Is Your Responsibility
LangGraph won't rate-limit your Vertex AI calls. If your graph has 10 parallel tool calls that each call Gemini, you'll hit 429s immediately.
Solution: Exponential backoff on every LLM call. 2s, 4s, 8s. Non-negotiable.
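A minimal backoff wrapper matching that schedule. It assumes rate-limit errors expose a numeric `status` of 429 (true of Google API client errors, but verify for your SDK); the `sleep` parameter is injectable so the delays are testable without waiting.

```typescript
// Retry a call on 429s with fixed exponential delays: 2s, 4s, 8s.
// Any other error, or exhausting the schedule, rethrows immediately.
async function withBackoff<T>(
  call: () => Promise<T>,
  delays: number[] = [2000, 4000, 8000],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= delays.length) throw err;
      await sleep(delays[attempt]); // 2s, then 4s, then 8s
    }
  }
}
```

Wrap every Gemini call site in this, including the ones inside parallel tool calls; the 429s arrive precisely when the graph fans out.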
