Building Enterprise Agentic AI Systems: Lessons from Production Voice Agents
A deep dive into architecting production-grade conversational AI systems with real-time voice, LLM-powered analysis, and intelligent automation. Insights from building AI call center platforms.
After building multiple production AI systems that handle real customer conversations, I’ve learned that the gap between a demo and a production-ready agentic AI is enormous. This post shares architectural patterns and lessons from building enterprise voice AI platforms.
What is Agentic AI?
Agentic AI refers to AI systems that can autonomously perform tasks, make decisions, and take actions on behalf of users. Unlike simple chatbots that respond to queries, agentic systems:
- Act autonomously within defined boundaries
- Integrate with external systems (databases, APIs, payment gateways)
- Make real-time decisions based on context
- Trigger downstream workflows (dispatch, follow-ups, escalations)
Architecture Overview
Here’s the high-level architecture I’ve used for production voice AI systems:
```
┌────────────────────────────────────────────────────────┐
│                      CLIENT LAYER                      │
├────────────────────────────────────────────────────────┤
│  React SPA + ElevenLabs Voice Widget (WebRTC)          │
│  Real-time Transcript WebSocket + Analytics Dashboard  │
└────────────────────────────────────────────────────────┘
                            ↓
┌────────────────────────────────────────────────────────┐
│                    FASTAPI GATEWAY                     │
│          Async REST API + WebSocket (Uvicorn)          │
├────────────────────────────────────────────────────────┤
│  /api/calls     - Call lifecycle management            │
│  /api/webhooks  - ElevenLabs callbacks                 │
│  /ws/transcript - Real-time streaming                  │
└────────────────────────────────────────────────────────┘
                            ↓
┌────────────────────────────────────────────────────────┐
│                     SERVICE LAYER                      │
├────────────────────────────────────────────────────────┤
│  ElevenLabsService - Conversational AI orchestration   │
│  OpenRouterService - Multi-model LLM analysis          │
│  RoutingService    - Intelligent dispatch/assignment   │
│  SchedulerService  - Automated follow-ups              │
└────────────────────────────────────────────────────────┘
                            ↓
┌────────────────────────────────────────────────────────┐
│                       DATA LAYER                       │
├────────────────────────────────────────────────────────┤
│  PostgreSQL (AsyncPG) + JSONB for flexible schemas     │
│  ElevenLabs API + OpenRouter API                       │
└────────────────────────────────────────────────────────┘
```
Key Technical Decisions
1. Async-First Backend
Every I/O operation must be non-blocking. A single call involves:
- Database queries for customer context
- WebSocket connections to voice AI
- HTTP calls to LLM APIs
- Real-time transcript streaming
```python
from fastapi import APIRouter, Depends
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

router = APIRouter()

@router.post("/api/calls/start")
async def start_call(
    request: CallStartRequest,
    db: AsyncSession = Depends(get_db),
):
    # Non-blocking database lookup
    result = await db.execute(
        select(Customer).where(Customer.id == request.customer_id)
    )
    customer = result.scalar_one_or_none()

    # Async WebSocket session with ElevenLabs
    conversation = await elevenlabs_service.create_conversation(
        customer_context=build_context(customer)
    )
    return CallStartResponse(conversation_id=conversation.id)
```
2. Dynamic Prompt Engineering
Static prompts fail in production. Every conversation needs context injection:
```python
from typing import List

def build_system_prompt(customer: Customer, history: List[Call]) -> str:
    # Customer-specific context
    context = f"""
CUSTOMER INFORMATION:
- Name: {customer.name}
- Outstanding Amount: ${customer.outstanding_amount:,.2f}
- Days Overdue: {customer.days_overdue}
- Risk Level: {customer.risk_category}

PREVIOUS INTERACTIONS:
- Total Calls: {len(history)}
- Last Contact: {history[0].date if history else 'Never'}
- Active Promise: {customer.active_promise or 'None'}
"""
    # Stage-specific strategy
    strategy = STAGE_STRATEGIES[customer.collection_stage]
    return BASE_PROMPT.format(
        context=context,
        strategy=strategy,
        tone=strategy.tone,
        approach=strategy.approach,
    )
```
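For reference, here is a minimal sketch of what the strategy table behind `STAGE_STRATEGIES` might look like. The `CollectionStage` values, field names, and strategy text are illustrative assumptions, not the production values:

```python
from dataclasses import dataclass
from enum import Enum

class CollectionStage(Enum):  # hypothetical stages, for illustration only
    EARLY = "early"
    MID = "mid"
    LATE = "late"

@dataclass(frozen=True)
class StageStrategy:
    tone: str
    approach: str

    def __str__(self) -> str:
        return f"Tone: {self.tone}. Approach: {self.approach}."

# Illustrative entries only - real strategies are tuned per portfolio
STAGE_STRATEGIES = {
    CollectionStage.EARLY: StageStrategy(
        tone="friendly and helpful",
        approach="gentle reminder; offer self-service payment options",
    ),
    CollectionStage.MID: StageStrategy(
        tone="professional and firm",
        approach="seek a concrete promise-to-pay with amount and date",
    ),
    CollectionStage.LATE: StageStrategy(
        tone="formal and direct",
        approach="explain consequences; escalate to a human if needed",
    ),
}
```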
Critical insight: The prompt must explicitly tell the AI what information it already has. Otherwise, it will redundantly ask for details already on file.
```
REGISTERED VEHICLE (ALREADY ON FILE - DO NOT ASK):
- Vehicle: 2022 Honda Civic, Silver
- **IMPORTANT**: Just confirm "Is this about your Honda Civic?"
- Do NOT ask for make, model, or color - we have it!
```
3. Multi-Model LLM Pipeline
Different tasks need different models:
| Task | Model | Why |
|---|---|---|
| Real-time conversation | Qwen 30B / GPT-4o | Low latency, good reasoning |
| Sentiment analysis | GPT-4o-mini | Fast, accurate emotion detection |
| Promise extraction | GPT-5.1 / gpt-oss-120b / Gemini / Claude Sonnet 4.5 | Best structured-output reasoning |
| Comprehensive analysis | GPT-5.1 / gpt-oss-120b / Gemini / Claude Sonnet 4.5 | Complex multi-factor analysis |
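Post-call analysis runs through a thin client around OpenRouter's chat completions endpoint, requesting JSON output and validating it into a typed result: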
```python
import httpx
from typing import List

class OpenRouterService:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.client = httpx.AsyncClient(timeout=60.0)

    async def analyze_call_comprehensive(
        self,
        transcript: List[dict],
        customer_context: dict,
    ) -> CallAnalysisResult:
        prompt = self._build_analysis_prompt(transcript, customer_context)
        response = await self.client.post(
            "https://openrouter.ai/api/v1/chat/completions",
            # OpenRouter authenticates with a standard Bearer token
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": "anthropic/claude-sonnet-4.5",
                "messages": [{"role": "user", "content": prompt}],
                "response_format": {"type": "json_object"},
            },
        )
        response.raise_for_status()
        return CallAnalysisResult.model_validate_json(
            response.json()["choices"][0]["message"]["content"]
        )
```
4. Structured Output Extraction
LLMs return unstructured text. Production systems need structured data:
```python
from datetime import date
from typing import Optional
from pydantic import BaseModel

# SentimentAnalysis, IntentAnalysis, and NextActionRecommendation
# are sibling models defined alongside these.

class PromiseExtraction(BaseModel):
    promise_made: bool
    promised_amount: Optional[float] = None
    promised_date: Optional[date] = None
    payment_method: Optional[str] = None
    confidence: int  # 0-100
    reasoning: str

class CallAnalysisResult(BaseModel):
    sentiment: SentimentAnalysis
    intent: IntentAnalysis
    promise: PromiseExtraction
    next_action: NextActionRecommendation
    summary: str
```
Key pattern: Always include a confidence score and reasoning field. This enables:
- Filtering low-confidence extractions
- Human review of uncertain cases
- Debugging and improvement
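For example, a simple gate can route uncertain extractions to a human instead of acting on them. The threshold and the helper functions below are illustrative assumptions:

```python
# Illustrative threshold - tune against labeled production data
CONFIDENCE_THRESHOLD = 75

async def handle_promise(extraction: PromiseExtraction) -> None:
    if not extraction.promise_made:
        return
    if extraction.confidence >= CONFIDENCE_THRESHOLD:
        await record_promise(extraction)      # hypothetical helper
    else:
        # Low confidence: park for human review instead of acting on it
        await enqueue_for_review(extraction)  # hypothetical helper
```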
5. Business Rules in LLM Prompts
Instead of post-processing LLM output, inject business rules into the prompt:
```python
BUSINESS_RULES = """
Apply these rules for next_action recommendations (PRIORITY ORDER):

1. FIRST-TIME DEFAULTERS (1 EMI overdue):
   - HIGHEST PRIORITY: max 3-day callback
   - If customer requests later, override and flag has_business_override=true

2. CRITICAL AMOUNTS (>$10,000):
   - Max 7-day follow-up regardless of customer preference
   - Urgency: immediate

3. HIGH-RISK CUSTOMERS:
   - Max 14-day follow-up cap
   - Always schedule a specific time, never "any time"

When overriding customer requests, set:
- has_business_override: true
- override_reason: "Specific rule that applied"
"""
```
The LLM applies rules and explains overrides in its response.
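The recommendation model can then carry the override flag explicitly. A minimal sketch of what `NextActionRecommendation` (referenced earlier) might include; fields beyond those named in the prompt are assumptions:

```python
from typing import Optional
from pydantic import BaseModel

class NextActionRecommendation(BaseModel):
    action: str                        # e.g. "schedule_callback"
    follow_up_days: int
    urgency: str                       # e.g. "immediate" or "normal"
    has_business_override: bool = False
    override_reason: Optional[str] = None  # populated when a rule overrides
```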
6. Meaningful Call Detection
Don’t waste LLM API calls on incomplete interactions:
```python
from typing import List

def is_meaningful_call(transcript: List[dict], duration: int) -> bool:
    # Minimum conversation depth: at least 2 full exchanges
    if len(transcript) < 4:
        return False
    # Minimum duration in seconds
    if duration < 30:
        return False
    # Both parties must speak
    speakers = {entry.get("speaker") for entry in transcript}
    if "agent" not in speakers or "customer" not in speakers:
        return False
    return True
```
7. Real-Time Transcript Streaming
WebSocket architecture for live transcripts:
```python
from typing import Dict, List
from fastapi import WebSocket

class WebSocketManager:
    def __init__(self):
        self.connections: Dict[str, WebSocket] = {}
        self.transcripts: Dict[str, List[dict]] = {}

    async def broadcast_transcript(self, call_id: str, entry: dict):
        # Store for persistence (creates the list on first entry)
        self.transcripts.setdefault(call_id, []).append(entry)
        # Broadcast to the connected client, if any
        if call_id in self.connections:
            await self.connections[call_id].send_json({
                "type": "transcript",
                "data": entry,
            })
```
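Wiring this up, the WebSocket endpoint might look like the sketch below; the single shared manager instance and the replay-on-connect behavior are simplifications:

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
manager = WebSocketManager()

@app.websocket("/ws/transcript/{call_id}")
async def transcript_ws(websocket: WebSocket, call_id: str):
    await websocket.accept()
    manager.connections[call_id] = websocket
    # Replay anything the client missed before it connected
    for entry in manager.transcripts.get(call_id, []):
        await websocket.send_json({"type": "transcript", "data": entry})
    try:
        # Keep the socket open; this feed is server-to-client only
        while True:
            await websocket.receive_text()
    except WebSocketDisconnect:
        manager.connections.pop(call_id, None)
```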
Frontend receives real-time updates:
```tsx
import { useEffect, useState } from "react";

// Entry shape assumed from the backend's transcript entries
interface TranscriptEntry {
  speaker: string;
  text: string;
  timestamp: string;
}

const useTranscriptWebSocket = (callId: string) => {
  const [transcript, setTranscript] = useState<TranscriptEntry[]>([]);

  useEffect(() => {
    const ws = new WebSocket(`ws://api/ws/transcript/${callId}`);
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.type === "transcript") {
        setTranscript((prev) => [...prev, data.data]);
      }
    };
    return () => ws.close();
  }, [callId]);

  return transcript;
};
```
Conversation Flow Design
Inbound Call Flow
```
1. Call Initiated
   └─ Phone/Web → ElevenLabs Voice API

2. Context Building
   ├─ Lookup customer by phone
   ├─ Fetch last N calls for history
   ├─ Load active promises/pending issues
   └─ Generate dynamic system prompt

3. AI Agent Conversation
   ├─ Context-aware greeting
   ├─ Issue identification
   ├─ Information gathering (only what's missing)
   ├─ Solution/dispatch/payment flow
   └─ Confirmation and wrap-up

4. Post-Call Processing
   ├─ Meaningful call check
   ├─ LLM analysis (sentiment, intent, extraction)
   ├─ Database updates
   ├─ Trigger downstream actions
   └─ Schedule follow-ups
```
Critical Conversation Rules
Through production testing, I’ve learned these rules are essential:
- **Never end calls abruptly.** After providing help: "Is there anything else I can help with?" Wait for an explicit "goodbye" before ending.
- **Confirmation loops for critical data.** AI: "Just to confirm, you're at 123 Main Street - is that correct?" Customer: "Yes" / "No, it's..."
- **Response length limits.** Keep responses under 25 words. Ask ONE question maximum per turn.
- **Don't ask for known information.** If a vehicle is on file: "Is this about your Honda Civic?" NOT: "What vehicle do you have?"
Intelligent Routing Algorithm
For dispatch/assignment scenarios, use multi-factor scoring:
```python
from typing import Optional

def find_best_match(request: ServiceRequest) -> Optional[Agent]:
    candidates = get_available_agents()
    scored = []
    for agent in candidates:
        # Weighted multi-factor score: proximity dominates
        score = (
            calculate_distance_score(agent, request) * 0.6
            + calculate_specialization_score(agent, request) * 0.3
            + calculate_availability_score(agent) * 0.1
        )
        # Emergency boost for matching specialists
        if request.is_emergency and request.issue_type in agent.specializations:
            score *= 1.2
        scored.append((agent, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[0][0] if scored else None
```
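The component scores should normalize to a 0-1 range so the weights stay meaningful. As one example, a distance score could decay linearly with great-circle distance; this is a hypothetical sketch assuming `lat`/`lon` attributes and a 50 km service radius:

```python
import math

MAX_RADIUS_KM = 50.0  # assumed service radius

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    # Great-circle distance between two (lat, lon) points in kilometers
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def calculate_distance_score(agent, request) -> float:
    # 1.0 when co-located, decaying linearly to 0.0 at the radius edge
    d = haversine_km(agent.lat, agent.lon, request.lat, request.lon)
    return max(0.0, 1.0 - d / MAX_RADIUS_KM)
```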
Automated Follow-Up System
Background scheduler for automated outreach:
```python
from apscheduler.schedulers.asyncio import AsyncIOScheduler

class SchedulerService:
    def __init__(self):
        self.scheduler = AsyncIOScheduler()
        # Process due follow-ups every 5 minutes
        self.scheduler.add_job(
            self.process_due_followups,
            'interval',
            minutes=5,
        )
        # Daily risk score updates at 03:00
        self.scheduler.add_job(
            self.update_risk_scores,
            'cron',
            hour=3,
        )

    async def process_due_followups(self):
        due = await self.get_due_followups()
        for followup in due:
            await self.outbound_service.initiate_call(
                customer_id=followup.customer_id,
                context=self.build_followup_context(followup),
            )
```
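The scheduler still has to be started once an event loop exists; with FastAPI, a lifespan hook is a natural place. A minimal sketch:

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

scheduler_service = SchedulerService()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start jobs once the event loop is running
    scheduler_service.scheduler.start()
    yield
    scheduler_service.scheduler.shutdown()

app = FastAPI(lifespan=lifespan)
```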
Data Model Design
JSONB for Flexible Extraction
Different calls extract different data. Use JSONB:
```python
from sqlalchemy import Column, Enum, Float, ForeignKey, Integer
from sqlalchemy.dialects.postgresql import JSONB, UUID

class CallRecord(Base):
    __tablename__ = "call_records"

    id = Column(UUID, primary_key=True)
    customer_id = Column(UUID, ForeignKey("customers.id"))

    # Structured fields
    status = Column(Enum(CallStatus))
    duration_seconds = Column(Integer)
    sentiment_score = Column(Float)

    # Flexible JSONB fields
    transcript = Column(JSONB)       # Full conversation
    extracted_data = Column(JSONB)   # Issue, location, vehicle, etc.
    analysis_result = Column(JSONB)  # Full LLM analysis

# Example extracted_data:
# {
#   "issue": {"type": "flat_tire", "severity": "normal"},
#   "location": {"address": "123 Main St", "landmark": "near gas station"},
#   "promise": {"amount": 5000, "date": "2024-12-25"}
# }
```
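JSONB stays queryable from SQLAlchemy, so flexible extraction doesn't cost you filtering. For example, assuming the model above and an `AsyncSession` named `db`:

```python
from sqlalchemy import select

# Find calls where the extracted issue type is "flat_tire"
stmt = select(CallRecord).where(
    CallRecord.extracted_data["issue"]["type"].astext == "flat_tire"
)
flat_tire_calls = (await db.execute(stmt)).scalars().all()
```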
Production Lessons
1. Webhook Reliability
ElevenLabs sends webhooks on conversation end. Always implement:
- Signature verification (HMAC-SHA256)
- Idempotency (handle duplicate webhooks)
- Fallback polling for missed webhooks
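A minimal sketch of the first two items follows. The header name and in-memory dedup set are assumptions; check the provider's current webhook docs for the exact signing scheme, and back idempotency with a persistent store:

```python
import hashlib
import hmac

from fastapi import HTTPException, Request

processed_ids: set[str] = set()  # use a persistent store in production

async def handle_webhook(request: Request, secret: str) -> dict:
    body = await request.body()
    # 1. Verify the HMAC-SHA256 signature before trusting the payload
    signature = request.headers.get("x-signature", "")  # assumed header name
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise HTTPException(status_code=401, detail="Bad signature")

    # 2. Idempotency: drop webhooks we have already processed
    event = await request.json()
    event_id = event.get("conversation_id", "")
    if event_id in processed_ids:
        return {"status": "duplicate"}
    processed_ids.add(event_id)
    return event
```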
2. Cost Management
Each call generates multiple API calls:
- Voice AI (per-minute pricing)
- LLM analysis (per-token pricing)
Implement meaningful call detection to avoid wasting analysis on dropped calls.
3. Error Recovery
WebSocket connections drop. Implement:
- Auto-reconnect with exponential backoff
- Session state recovery
- Graceful degradation (store locally, sync later)
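A generic reconnect loop with exponential backoff and jitter looks like this sketch, where `connect` is whatever async factory opens your socket:

```python
import asyncio
import random

async def connect_with_backoff(connect, max_delay: float = 30.0):
    """Retry an async connect() callable until it succeeds."""
    delay = 1.0
    while True:
        try:
            return await connect()
        except OSError as exc:
            # Jittered exponential backoff avoids thundering-herd reconnects
            sleep_for = delay + random.uniform(0, delay / 2)
            print(f"connect failed ({exc}); retrying in {sleep_for:.1f}s")
            await asyncio.sleep(sleep_for)
            delay = min(delay * 2, max_delay)
```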
4. Observability
Log everything:
- Full transcripts with timestamps
- LLM prompts and responses
- Latency metrics per component
- Extraction confidence scores
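A lightweight way to get per-component latency without a full tracing stack is a timing decorator; a sketch using stdlib logging:

```python
import functools
import logging
import time

logger = logging.getLogger("voice_agent")

def timed(component: str):
    """Log the latency of an async component call."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.info("%s took %.1f ms", component, elapsed_ms)
        return wrapper
    return decorator

@timed("llm_analysis")
async def analyze(transcript):
    ...
```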
Results and Impact
These systems achieve:
- 24/7 autonomous operation without human agents
- Sub-second response latency for natural conversations
- 90%+ intent detection accuracy with structured extraction
- Automated follow-up scheduling based on business rules
- Real-time analytics for operational visibility
Tech Stack Summary
| Layer | Technology |
|---|---|
| Backend | FastAPI, SQLAlchemy 2.0, AsyncPG |
| Frontend | React 18, TypeScript, TailwindCSS |
| Voice AI | ElevenLabs Conversational API |
| LLM | OpenRouter (Claude, GPT-4o, Qwen) |
| Real-time | WebSockets, WebRTC |
| Database | PostgreSQL 17 with JSONB |
| Deployment | Docker Compose, Nginx |
Conclusion
Building production agentic AI requires thinking beyond the conversation itself. You need:
- Dynamic context injection for personalized interactions
- Multi-model orchestration for different analysis tasks
- Business rules integration in LLM prompts
- Structured output extraction with confidence scoring
- Automated workflows for follow-ups and escalations
- Robust error handling for production reliability
The future of enterprise AI isn’t chatbots answering questions. It’s autonomous agents that understand context, make decisions, and drive outcomes.
These insights come from building AI-powered call center platforms handling real customer interactions in roadside assistance and financial services domains.