Building Enterprise Agentic AI Systems: Lessons from Production Voice Agents
A deep dive into architecting production-grade conversational AI systems with real-time voice, LLM-powered analysis, and intelligent automation. Insights from building AI call center platforms.
After building multiple production AI systems that handle real customer conversations, I’ve learned that the gap between a demo and a production-ready agentic AI is enormous. This post shares architectural patterns and lessons from building enterprise voice AI platforms.
What is Agentic AI?
Agentic AI refers to AI systems that can autonomously perform tasks, make decisions, and take actions on behalf of users. Unlike simple chatbots that respond to queries, agentic systems:
- Act autonomously within defined boundaries
- Integrate with external systems (databases, APIs, payment gateways)
- Make real-time decisions based on context
- Trigger downstream workflows (dispatch, follow-ups, escalations)
Architecture Overview
Here’s the high-level architecture I’ve used for production voice AI systems:
```
┌────────────────────────────────────────────────────────┐
│                      CLIENT LAYER                      │
├────────────────────────────────────────────────────────┤
│  React SPA + ElevenLabs Voice Widget (WebRTC)          │
│  Real-time Transcript WebSocket + Analytics Dashboard  │
└────────────────────────────────────────────────────────┘
                            ↓
┌────────────────────────────────────────────────────────┐
│                    FASTAPI GATEWAY                     │
│          Async REST API + WebSocket (Uvicorn)          │
├────────────────────────────────────────────────────────┤
│  /api/calls     - Call lifecycle management            │
│  /api/webhooks  - ElevenLabs callbacks                 │
│  /ws/transcript - Real-time streaming                  │
└────────────────────────────────────────────────────────┘
                            ↓
┌────────────────────────────────────────────────────────┐
│                     SERVICE LAYER                      │
├────────────────────────────────────────────────────────┤
│  ElevenLabsService - Conversational AI orchestration   │
│  OpenRouterService - Multi-model LLM analysis          │
│  RoutingService    - Intelligent dispatch/assignment   │
│  SchedulerService  - Automated follow-ups              │
└────────────────────────────────────────────────────────┘
                            ↓
┌────────────────────────────────────────────────────────┐
│                       DATA LAYER                       │
├────────────────────────────────────────────────────────┤
│  PostgreSQL (AsyncPG) + JSONB for flexible schemas     │
│  ElevenLabs API + OpenRouter API                       │
└────────────────────────────────────────────────────────┘
```
Key Technical Decisions
1. Async-First Backend
Every I/O operation must be non-blocking. A single call involves:
- Database queries for customer context
- WebSocket connections to voice AI
- HTTP calls to LLM APIs
- Real-time transcript streaming
```python
from fastapi import APIRouter, Depends
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

router = APIRouter()

@router.post("/api/calls/start")
async def start_call(
    request: CallStartRequest,
    db: AsyncSession = Depends(get_db),
):
    # Non-blocking database lookup
    result = await db.execute(
        select(Customer).where(Customer.id == request.customer_id)
    )
    customer = result.scalar_one_or_none()

    # Async WebSocket session with ElevenLabs
    conversation = await elevenlabs_service.create_conversation(
        customer_context=build_context(customer)
    )
    return CallStartResponse(conversation_id=conversation.id)
```
2. Dynamic Prompt Engineering
Static prompts fail in production. Every conversation needs context injection:
```python
from typing import List

def build_system_prompt(customer: Customer, history: List[Call]) -> str:
    # Customer-specific context
    context = f"""
CUSTOMER INFORMATION:
- Name: {customer.name}
- Outstanding Amount: ${customer.outstanding_amount:,.2f}
- Days Overdue: {customer.days_overdue}
- Risk Level: {customer.risk_category}

PREVIOUS INTERACTIONS:
- Total Calls: {len(history)}
- Last Contact: {history[0].date if history else 'Never'}
- Active Promise: {customer.active_promise or 'None'}
"""
    # Stage-specific strategy
    strategy = STAGE_STRATEGIES[customer.collection_stage]
    return BASE_PROMPT.format(
        context=context,
        strategy=strategy,
        tone=strategy.tone,
        approach=strategy.approach,
    )
```
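For reference, here is a minimal sketch of what the strategy table behind `STAGE_STRATEGIES` might look like. The `CollectionStage` values, field names, and strategy text are illustrative assumptions, not the production values:

```python
from dataclasses import dataclass
from enum import Enum

class CollectionStage(Enum):  # hypothetical stages, for illustration only
    EARLY = "early"
    MID = "mid"
    LATE = "late"

@dataclass(frozen=True)
class StageStrategy:
    tone: str
    approach: str

    def __str__(self) -> str:
        return f"Tone: {self.tone}. Approach: {self.approach}."

# Illustrative entries only - real strategies are tuned per portfolio
STAGE_STRATEGIES = {
    CollectionStage.EARLY: StageStrategy(
        tone="friendly and helpful",
        approach="gentle reminder; offer self-service payment options",
    ),
    CollectionStage.MID: StageStrategy(
        tone="professional and firm",
        approach="seek a concrete promise-to-pay with amount and date",
    ),
    CollectionStage.LATE: StageStrategy(
        tone="formal and direct",
        approach="explain consequences; escalate to a human if needed",
    ),
}
```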
Critical insight: The prompt must explicitly tell the AI what information it already has. Otherwise, it will redundantly ask for details already on file.
```
REGISTERED VEHICLE (ALREADY ON FILE - DO NOT ASK):
- Vehicle: 2022 Honda Civic, Silver
- **IMPORTANT**: Just confirm "Is this about your Honda Civic?"
- Do NOT ask for make, model, or color - we have it!
```
3. Multi-Model LLM Pipeline
Different tasks need different models:
| Task | Model | Why |
|---|---|---|
| Real-time conversation | Qwen 30B / GPT-4o | Low latency, good reasoning |
| Sentiment analysis | GPT-4o-mini | Fast, accurate emotion detection |
| Promise extraction | GPT-5.1 / gpt-oss-120b / Gemini / Claude Sonnet 4.5 | Best structured-output reasoning |
| Comprehensive analysis | GPT-5.1 / gpt-oss-120b / Gemini / Claude Sonnet 4.5 | Complex multi-factor analysis |
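Post-call analysis runs through a thin client around OpenRouter's chat completions endpoint, requesting JSON output and validating it into a typed result: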
```python
import httpx
from typing import List

class OpenRouterService:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.client = httpx.AsyncClient(timeout=60.0)

    async def analyze_call_comprehensive(
        self,
        transcript: List[dict],
        customer_context: dict,
    ) -> CallAnalysisResult:
        prompt = self._build_analysis_prompt(transcript, customer_context)
        response = await self.client.post(
            "https://openrouter.ai/api/v1/chat/completions",
            # OpenRouter authenticates with a standard Bearer token
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": "anthropic/claude-sonnet-4.5",
                "messages": [{"role": "user", "content": prompt}],
                "response_format": {"type": "json_object"},
            },
        )
        response.raise_for_status()
        return CallAnalysisResult.model_validate_json(
            response.json()["choices"][0]["message"]["content"]
        )
```
4. Structured Output Extraction
LLMs return unstructured text. Production systems need structured data:
```python
from datetime import date
from typing import Optional
from pydantic import BaseModel

# SentimentAnalysis, IntentAnalysis, and NextActionRecommendation
# are sibling models defined alongside these.

class PromiseExtraction(BaseModel):
    promise_made: bool
    promised_amount: Optional[float] = None
    promised_date: Optional[date] = None
    payment_method: Optional[str] = None
    confidence: int  # 0-100
    reasoning: str

class CallAnalysisResult(BaseModel):
    sentiment: SentimentAnalysis
    intent: IntentAnalysis
    promise: PromiseExtraction
    next_action: NextActionRecommendation
    summary: str
```
Key pattern: Always include a confidence score and reasoning field. This enables:
- Filtering low-confidence extractions
- Human review of uncertain cases
- Debugging and improvement
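For example, a simple gate can route uncertain extractions to a human instead of acting on them. The threshold and the helper functions below are illustrative assumptions:

```python
# Illustrative threshold - tune against labeled production data
CONFIDENCE_THRESHOLD = 75

async def handle_promise(extraction: PromiseExtraction) -> None:
    if not extraction.promise_made:
        return
    if extraction.confidence >= CONFIDENCE_THRESHOLD:
        await record_promise(extraction)      # hypothetical helper
    else:
        # Low confidence: park for human review instead of acting on it
        await enqueue_for_review(extraction)  # hypothetical helper
```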
5. Business Rules in LLM Prompts
Instead of post-processing LLM output, inject business rules into the prompt:
```python
BUSINESS_RULES = """
Apply these rules for next_action recommendations (PRIORITY ORDER):

1. FIRST-TIME DEFAULTERS (1 EMI overdue):
   - HIGHEST PRIORITY: max 3-day callback
   - If customer requests later, override and flag has_business_override=true

2. CRITICAL AMOUNTS (>$10,000):
   - Max 7-day follow-up regardless of customer preference
   - Urgency: immediate

3. HIGH-RISK CUSTOMERS:
   - Max 14-day follow-up cap
   - Always schedule a specific time, never "any time"

When overriding customer requests, set:
- has_business_override: true
- override_reason: "Specific rule that applied"
"""
```
The LLM applies rules and explains overrides in its response.
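The recommendation model can then carry the override flag explicitly. A minimal sketch of what `NextActionRecommendation` (referenced earlier) might include; fields beyond those named in the prompt are assumptions:

```python
from typing import Optional
from pydantic import BaseModel

class NextActionRecommendation(BaseModel):
    action: str                        # e.g. "schedule_callback"
    follow_up_days: int
    urgency: str                       # e.g. "immediate" or "normal"
    has_business_override: bool = False
    override_reason: Optional[str] = None  # populated when a rule overrides
```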
6. Meaningful Call Detection
Don’t waste LLM API calls on incomplete interactions:
```python
from typing import List

def is_meaningful_call(transcript: List[dict], duration: int) -> bool:
    # Minimum conversation depth: at least 2 full exchanges
    if len(transcript) < 4:
        return False
    # Minimum duration in seconds
    if duration < 30:
        return False
    # Both parties must speak
    speakers = {entry.get("speaker") for entry in transcript}
    if "agent" not in speakers or "customer" not in speakers:
        return False
    return True
```
7. Real-Time Transcript Streaming
WebSocket architecture for live transcripts:
```python
from typing import Dict, List
from fastapi import WebSocket

class WebSocketManager:
    def __init__(self):
        self.connections: Dict[str, WebSocket] = {}
        self.transcripts: Dict[str, List[dict]] = {}

    async def broadcast_transcript(self, call_id: str, entry: dict):
        # Store for persistence (creates the list on first entry)
        self.transcripts.setdefault(call_id, []).append(entry)
        # Broadcast to the connected client, if any
        if call_id in self.connections:
            await self.connections[call_id].send_json({
                "type": "transcript",
                "data": entry,
            })
```
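Wiring this up, the WebSocket endpoint might look like the sketch below; the single shared manager instance and the replay-on-connect behavior are simplifications:

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
manager = WebSocketManager()

@app.websocket("/ws/transcript/{call_id}")
async def transcript_ws(websocket: WebSocket, call_id: str):
    await websocket.accept()
    manager.connections[call_id] = websocket
    # Replay anything the client missed before it connected
    for entry in manager.transcripts.get(call_id, []):
        await websocket.send_json({"type": "transcript", "data": entry})
    try:
        # Keep the socket open; this feed is server-to-client only
        while True:
            await websocket.receive_text()
    except WebSocketDisconnect:
        manager.connections.pop(call_id, None)
```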
Frontend receives real-time updates:
```tsx
import { useEffect, useState } from "react";

// Entry shape assumed from the backend's transcript entries
interface TranscriptEntry {
  speaker: string;
  text: string;
  timestamp: string;
}

const useTranscriptWebSocket = (callId: string) => {
  const [transcript, setTranscript] = useState<TranscriptEntry[]>([]);

  useEffect(() => {
    const ws = new WebSocket(`ws://api/ws/transcript/${callId}`);
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.type === "transcript") {
        setTranscript((prev) => [...prev, data.data]);
      }
    };
    return () => ws.close();
  }, [callId]);

  return transcript;
};
```
Conversation Flow Design
Inbound Call Flow
```
1. Call Initiated
   └─ Phone/Web → ElevenLabs Voice API

2. Context Building
   ├─ Lookup customer by phone
   ├─ Fetch last N calls for history
   ├─ Load active promises/pending issues
   └─ Generate dynamic system prompt

3. AI Agent Conversation
   ├─ Context-aware greeting
   ├─ Issue identification
   ├─ Information gathering (only what's missing)
   ├─ Solution/dispatch/payment flow
   └─ Confirmation and wrap-up

4. Post-Call Processing
   ├─ Meaningful call check
   ├─ LLM analysis (sentiment, intent, extraction)
   ├─ Database updates
   ├─ Trigger downstream actions
   └─ Schedule follow-ups
```
Critical Conversation Rules
Through production testing, I’ve learned these rules are essential:
- **Never end calls abruptly.** After providing help: "Is there anything else I can help with?" Wait for an explicit "goodbye" before ending.
- **Confirmation loops for critical data.** AI: "Just to confirm, you're at 123 Main Street - is that correct?" Customer: "Yes" / "No, it's..."
- **Response length limits.** Keep responses under 25 words. Ask ONE question maximum per turn.
- **Don't ask for known information.** If a vehicle is on file: "Is this about your Honda Civic?" NOT: "What vehicle do you have?"
Intelligent Routing Algorithm
For dispatch/assignment scenarios, use multi-factor scoring:
```python
from typing import Optional

def find_best_match(request: ServiceRequest) -> Optional[Agent]:
    candidates = get_available_agents()
    scored = []
    for agent in candidates:
        # Weighted multi-factor score: proximity dominates
        score = (
            calculate_distance_score(agent, request) * 0.6
            + calculate_specialization_score(agent, request) * 0.3
            + calculate_availability_score(agent) * 0.1
        )
        # Emergency boost for matching specialists
        if request.is_emergency and request.issue_type in agent.specializations:
            score *= 1.2
        scored.append((agent, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[0][0] if scored else None
```
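The component scores should normalize to a 0-1 range so the weights stay meaningful. As one example, a distance score could decay linearly with great-circle distance; this is a hypothetical sketch assuming `lat`/`lon` attributes and a 50 km service radius:

```python
import math

MAX_RADIUS_KM = 50.0  # assumed service radius

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    # Great-circle distance between two (lat, lon) points in kilometers
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def calculate_distance_score(agent, request) -> float:
    # 1.0 when co-located, decaying linearly to 0.0 at the radius edge
    d = haversine_km(agent.lat, agent.lon, request.lat, request.lon)
    return max(0.0, 1.0 - d / MAX_RADIUS_KM)
```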
Automated Follow-Up System
Background scheduler for automated outreach:
```python
from apscheduler.schedulers.asyncio import AsyncIOScheduler

class SchedulerService:
    def __init__(self):
        self.scheduler = AsyncIOScheduler()
        # Process due follow-ups every 5 minutes
        self.scheduler.add_job(
            self.process_due_followups,
            'interval',
            minutes=5,
        )
        # Daily risk score updates at 03:00
        self.scheduler.add_job(
            self.update_risk_scores,
            'cron',
            hour=3,
        )

    async def process_due_followups(self):
        due = await self.get_due_followups()
        for followup in due:
            await self.outbound_service.initiate_call(
                customer_id=followup.customer_id,
                context=self.build_followup_context(followup),
            )
```
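The scheduler still has to be started once an event loop exists; with FastAPI, a lifespan hook is a natural place. A minimal sketch:

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

scheduler_service = SchedulerService()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start jobs once the event loop is running
    scheduler_service.scheduler.start()
    yield
    scheduler_service.scheduler.shutdown()

app = FastAPI(lifespan=lifespan)
```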
Data Model Design
JSONB for Flexible Extraction
Different calls extract different data. Use JSONB:
```python
from sqlalchemy import Column, Enum, Float, ForeignKey, Integer
from sqlalchemy.dialects.postgresql import JSONB, UUID

class CallRecord(Base):
    __tablename__ = "call_records"

    id = Column(UUID, primary_key=True)
    customer_id = Column(UUID, ForeignKey("customers.id"))

    # Structured fields
    status = Column(Enum(CallStatus))
    duration_seconds = Column(Integer)
    sentiment_score = Column(Float)

    # Flexible JSONB fields
    transcript = Column(JSONB)       # Full conversation
    extracted_data = Column(JSONB)   # Issue, location, vehicle, etc.
    analysis_result = Column(JSONB)  # Full LLM analysis

# Example extracted_data:
# {
#   "issue": {"type": "flat_tire", "severity": "normal"},
#   "location": {"address": "123 Main St", "landmark": "near gas station"},
#   "promise": {"amount": 5000, "date": "2024-12-25"}
# }
```
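JSONB stays queryable from SQLAlchemy, so flexible extraction doesn't cost you filtering. For example, assuming the model above and an `AsyncSession` named `db`:

```python
from sqlalchemy import select

# Find calls where the extracted issue type is "flat_tire"
stmt = select(CallRecord).where(
    CallRecord.extracted_data["issue"]["type"].astext == "flat_tire"
)
flat_tire_calls = (await db.execute(stmt)).scalars().all()
```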
Production Lessons
1. Webhook Reliability
ElevenLabs sends webhooks on conversation end. Always implement:
- Signature verification (HMAC-SHA256)
- Idempotency (handle duplicate webhooks)
- Fallback polling for missed webhooks
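A minimal sketch of the first two items follows. The header name and in-memory dedup set are assumptions; check the provider's current webhook docs for the exact signing scheme, and back idempotency with a persistent store:

```python
import hashlib
import hmac

from fastapi import HTTPException, Request

processed_ids: set[str] = set()  # use a persistent store in production

async def handle_webhook(request: Request, secret: str) -> dict:
    body = await request.body()
    # 1. Verify the HMAC-SHA256 signature before trusting the payload
    signature = request.headers.get("x-signature", "")  # assumed header name
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise HTTPException(status_code=401, detail="Bad signature")

    # 2. Idempotency: drop webhooks we have already processed
    event = await request.json()
    event_id = event.get("conversation_id", "")
    if event_id in processed_ids:
        return {"status": "duplicate"}
    processed_ids.add(event_id)
    return event
```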
2. Cost Management
Each call generates multiple API calls:
- Voice AI (per-minute pricing)
- LLM analysis (per-token pricing)
Implement meaningful call detection to avoid wasting analysis on dropped calls.
3. Error Recovery
WebSocket connections drop. Implement:
- Auto-reconnect with exponential backoff
- Session state recovery
- Graceful degradation (store locally, sync later)
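A generic reconnect loop with exponential backoff and jitter looks like this sketch, where `connect` is whatever async factory opens your socket:

```python
import asyncio
import random

async def connect_with_backoff(connect, max_delay: float = 30.0):
    """Retry an async connect() callable until it succeeds."""
    delay = 1.0
    while True:
        try:
            return await connect()
        except OSError as exc:
            # Jittered exponential backoff avoids thundering-herd reconnects
            sleep_for = delay + random.uniform(0, delay / 2)
            print(f"connect failed ({exc}); retrying in {sleep_for:.1f}s")
            await asyncio.sleep(sleep_for)
            delay = min(delay * 2, max_delay)
```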
4. Observability
Log everything:
- Full transcripts with timestamps
- LLM prompts and responses
- Latency metrics per component
- Extraction confidence scores
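A lightweight way to get per-component latency without a full tracing stack is a timing decorator; a sketch using stdlib logging:

```python
import functools
import logging
import time

logger = logging.getLogger("voice_agent")

def timed(component: str):
    """Log the latency of an async component call."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.info("%s took %.1f ms", component, elapsed_ms)
        return wrapper
    return decorator

@timed("llm_analysis")
async def analyze(transcript):
    ...
```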
Results and Impact
These systems achieve:
- 24/7 autonomous operation without human agents
- Sub-second response latency for natural conversations
- 90%+ intent detection accuracy with structured extraction
- Automated follow-up scheduling based on business rules
- Real-time analytics for operational visibility
Tech Stack Summary
| Layer | Technology |
|---|---|
| Backend | FastAPI, SQLAlchemy 2.0, AsyncPG |
| Frontend | React 18, TypeScript, TailwindCSS |
| Voice AI | ElevenLabs Conversational API |
| LLM | OpenRouter (Claude, GPT-4o, Qwen) |
| Real-time | WebSockets, WebRTC |
| Database | PostgreSQL 17 with JSONB |
| Deployment | Docker Compose, Nginx |
Conclusion
Building production agentic AI requires thinking beyond the conversation itself. You need:
- Dynamic context injection for personalized interactions
- Multi-model orchestration for different analysis tasks
- Business rules integration in LLM prompts
- Structured output extraction with confidence scoring
- Automated workflows for follow-ups and escalations
- Robust error handling for production reliability
The future of enterprise AI isn’t chatbots answering questions. It’s autonomous agents that understand context, make decisions, and drive outcomes.
These insights come from building AI-powered call center platforms handling real customer interactions in roadside assistance and financial services domains.