After 300+ revisions of Kane running on LangGraph, here are the production lessons that cost us weekends.
Lesson 1: The Firestore Singleton Rule
Kane runs as a single long-lived process on Cloud Run. Every tool, every skill, every endpoint shares the same Firestore instance.
Firestore's settings() can only be called once per client, and only before any other operation. If a tool module calls db.settings() after the main process already has, it throws "Firestore has already been initialized". Route every module through a shared getDb() helper. Always.
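A minimal sketch of that helper. The Firestore type here is a stand-in so the example runs without firebase-admin; in the real helper, createDb() would be admin.initializeApp() plus admin.firestore(), and getDb is a hypothetical name, not Kane's actual code.

```typescript
// Stand-in for the firebase-admin Firestore client: settings() may only
// be called once, which is exactly the constraint that bites tool modules.
interface Firestore {
  settings(opts: Record<string, unknown>): void;
}

let instance: Firestore | null = null;
let settingsCalls = 0;

function createDb(): Firestore {
  return {
    settings() {
      settingsCalls += 1;
      if (settingsCalls > 1) {
        throw new Error("Firestore has already been initialized");
      }
    },
  };
}

// Every tool, skill, and endpoint imports this instead of constructing
// its own client. settings() runs exactly once, before first use.
function getDb(): Firestore {
  if (!instance) {
    instance = createDb();
    instance.settings({ ignoreUndefinedProperties: true });
  }
  return instance;
}
```

Because the module-level `instance` lives in the single long-running Cloud Run process, every import sees the same client and the second settings() call never happens.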
Lesson 2: Checkpoints Are Cheap Until They're Not
LangGraph checkpoints are great for conversation continuity. But if you're checkpointing every tool call in a long-running pipeline (like our Morning Wire), you'll end up with 50+ checkpoint writes per article, each as a Firestore document. At scale, your checkpoint collection becomes your largest cost center.
Solution: Only checkpoint at meaningful state transitions, not every step.
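One way to enforce that rule, sketched generically. `saveCheckpoint` is a hypothetical stand-in for whatever your LangGraph checkpointer persists to Firestore, and the `isTransition` flag is an assumption about how you mark meaningful state boundaries; the point is that the write happens per transition, not per step.

```typescript
// A pipeline step either is or is not a meaningful state transition.
type Step = { name: string; isTransition: boolean };

// Each call here stands in for one Firestore document write.
const checkpointWrites: string[] = [];

function saveCheckpoint(stepName: string): void {
  checkpointWrites.push(stepName);
}

function runPipeline(steps: Step[]): void {
  for (const step of steps) {
    // ...execute the step's tool call here...
    if (step.isTransition) {
      // Persist only at boundaries you'd actually resume from,
      // not after every tool call.
      saveCheckpoint(step.name);
    }
  }
}
```

For a Morning Wire-style run, that turns 50+ document writes per article into a handful, one per stage you would actually want to resume from.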
Lesson 3: Tool Routing > Intent Classification
We originally had a two-brain architecture: Slack would classify intent, then route to LangGraph. This added latency and a failure mode for zero benefit.
Now everything goes straight to LangGraph. The LLM is better at routing than any regex or classifier we could build.
Lesson 4: Rate Limiting Is Your Responsibility
LangGraph won't rate-limit your Vertex AI calls. If your graph has 10 parallel tool calls that each call Gemini, you'll hit 429s immediately.
Solution: Exponential backoff on every LLM call. 2s, 4s, 8s. Non-negotiable.
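A minimal backoff wrapper matching that schedule. It assumes rate-limit errors expose a numeric `status` of 429 (true of Google API client errors, but verify for your SDK); the `sleep` parameter is injectable so the delays are testable without waiting.

```typescript
// Retry a call on 429s with fixed exponential delays: 2s, 4s, 8s.
// Any other error, or exhausting the schedule, rethrows immediately.
async function withBackoff<T>(
  call: () => Promise<T>,
  delays: number[] = [2000, 4000, 8000],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= delays.length) throw err;
      await sleep(delays[attempt]); // 2s, then 4s, then 8s
    }
  }
}
```

Wrap every Gemini call site in this, including the ones inside parallel tool calls; the 429s arrive precisely when the graph fans out.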
