How YuVoice Handles 2.5 Crore Calls a Month Without Downtime

Deep dive into the architecture and engineering behind YuVoice's ability to process 2.5 crore voice AI calls monthly for Indian banks with 99.95% uptime. Covers scalability, fault tolerance, and real-time processing.

YuVerse Team

Published June 3, 2026 · Updated July 3, 2026 · 16 min read

Processing 2.5 crore (25 million) voice conversations every month for India's most demanding financial institutions cannot be achieved through brute force alone. It requires a fundamentally different architectural approach from what works at pilot scale or even at moderate production volumes.

When a bank's customer service depends entirely on your voice AI platform — when every dropped call, every second of latency, and every misunderstood word translates directly to customer frustration, lost revenue, or compliance violations — the engineering standards change dramatically. This isn't about building a voice bot that works. It's about building voice infrastructure that never stops working.

This article examines the architectural principles, engineering practices, and operational disciplines that enable YuVoice to maintain 99.95% uptime while processing over 80,000 concurrent conversations during peak periods — salary days, festival seasons, and market events that spike banking call volumes by 3-5x overnight.

The Scale Challenge: What 2.5 Crore Monthly Calls Actually Means

Breaking Down the Numbers

2.5 crore calls per month translates to:

833,000+ calls per day on average
About 34,700 calls per hour sustained
About 580 calls per minute around the clock
Peak hours: 80,000+ concurrent conversations (10 AM - 2 PM on salary days)

Each call involves:

Real-time audio streaming (bidirectional)
Speech-to-text conversion (within 200ms)
Natural language understanding (within 100ms)
Banking system API calls (within 300ms)
Response generation (within 100ms)
Text-to-speech synthesis (within 200ms)

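Total latency budget per conversational turn: under 800ms, to feel natural to the customer. The per-stage ceilings above add up to 900ms if run strictly back to back, so meeting the target depends on stages overlapping, for example streaming ASR while the customer is still speaking.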
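The stage budgets can be tallied against the turn target. The sketch below is illustrative only: the stage names are ours, and treating NLU as fully overlapped is an assumption drawn from the NLU section later in this article.

```python
# Per-stage latency ceilings in milliseconds, as listed above.
STAGE_BUDGET_MS = {
    "asr": 200,
    "nlu": 100,
    "banking_api": 300,
    "response_generation": 100,
    "tts": 200,
}

TURN_TARGET_MS = 800


def sequential_total(budgets):
    """Worst case: every stage waits for the previous one to finish."""
    return sum(budgets.values())


def critical_path_total(budgets, overlapped):
    """Total when the named stages run alongside other work and so add
    nothing to perceived latency (an assumption, not a measurement)."""
    return sum(ms for stage, ms in budgets.items() if stage not in overlapped)


sequential = sequential_total(STAGE_BUDGET_MS)
with_overlap = critical_path_total(STAGE_BUDGET_MS, overlapped={"nlu"})
print(f"sequential: {sequential} ms, with NLU overlapped: {with_overlap} ms")
```

Run back to back, the ceilings exceed the target; hiding even one stage behind another brings the worst case to the edge of it, which is why the pipeline leans on streaming and parallelism.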

Why This Is Hard

At this scale, problems that are invisible at small volumes become existential:

Tail Latency: If 99% of calls have acceptable latency but 1% don't, that's 8,330 customers per day experiencing poor service. At banking scale, 1% failure means thousands of complaints.

Cascade Failures: A single overloaded component can create backpressure that propagates through the entire system. A slow database query that adds 500ms can cause upstream timeouts that drop thousands of calls simultaneously.

Data Gravity: 2.5 crore calls a month generate roughly 1.8 petabytes of raw audio data annually (about 150 TB a month at 5-8 MB per call). Storage, retrieval, and processing of this volume requires purpose-built data infrastructure.

Model Serving: Running 12+ language models simultaneously, each requiring GPU inference for speech recognition and synthesis, demands a model-serving infrastructure as sophisticated as any hyperscaler's ML platform.

Architectural Principles

Principle 1: Stateless Processing with Distributed State

The core processing pipeline is stateless — any component can process any request without depending on local state. Conversation state (what has been discussed, what information has been collected) is stored in a distributed, replicated state store.

Why this matters: If a processing node fails mid-call, another node can seamlessly continue the conversation from the last state checkpoint. The customer never knows a failover occurred.

Implementation: Redis Cluster with cross-AZ replication serves as the conversation state store. Each conversational turn updates the distributed state atomically. If the processing node handling a call fails, the telephony layer detects the absence of heartbeats within 2 seconds and reconnects the call to a healthy node that loads the conversation state and continues.
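The handoff can be illustrated with a minimal sketch. It uses an in-memory dictionary as a stand-in for the replicated state store, and the class and field names are hypothetical, not YuVoice's actual code.

```python
import json

HEARTBEAT_TIMEOUT_S = 2.0  # detection window described above


class InMemoryStateStore:
    """Stand-in for the replicated store; a real one lives on the network."""

    def __init__(self):
        self._data = {}

    def save(self, call_id, state):
        # Write the whole state as one value so a turn is never half-saved.
        self._data[call_id] = json.dumps(state)

    def load(self, call_id):
        raw = self._data.get(call_id)
        return json.loads(raw) if raw is not None else {"turn": 0, "slots": {}}


class ProcessingNode:
    """Holds no per-call state of its own; everything comes from the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle_turn(self, call_id, slot, value):
        state = self.store.load(call_id)
        state["slots"][slot] = value
        state["turn"] += 1
        state["last_node"] = self.name
        self.store.save(call_id, state)
        return state


def is_node_dead(last_heartbeat, now):
    """True once heartbeats have been missing longer than the timeout."""
    return now - last_heartbeat > HEARTBEAT_TIMEOUT_S


store = InMemoryStateStore()
node_a = ProcessingNode("node-a", store)
node_b = ProcessingNode("node-b", store)

node_a.handle_turn("call-1", "intent", "balance_enquiry")
# node-a stops sending heartbeats; node-b picks up the same call.
final_state = node_b.handle_turn("call-1", "account_type", "savings")
print(final_state)
```

Because the second node reads everything it needs from the store, it continues at turn 2 with the intent collected by the first node still in place.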

Principle 2: Microservices with Independent Scaling

Each component of the voice AI pipeline is an independent service that scales independently based on its specific load pattern:

Telephony Gateway: Scales based on concurrent call count
ASR Service: Scales based on audio minutes being processed
NLU Service: Scales based on text utterances per second
Dialog Engine: Scales based on active conversation count
Banking Integration Service: Scales based on API calls per second
TTS Service: Scales based on text characters being synthesised
Recording Service: Scales based on audio bytes per second

Why this matters: During peak hours, the ASR service may need 3x more capacity while the TTS service only needs 1.5x more (because customers do most of the talking early in a call). Independent scaling avoids over-provisioning.
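A rough sketch of per-service sizing follows. The capacity-per-replica figures and the 25% headroom are hypothetical numbers chosen for illustration, not YuVoice's real values.

```python
import math


def replicas_needed(load, capacity_per_replica, headroom=0.25):
    """Replicas required to serve `load` while keeping spare headroom."""
    return math.ceil(load * (1 + headroom) / capacity_per_replica)


# Hypothetical baseline: each service sees 1,000 units of its own metric
# and one replica handles 100 units.
baseline = replicas_needed(1000, 100)
asr_at_peak = replicas_needed(3000, 100)   # ASR load grows 3x
tts_at_peak = replicas_needed(1500, 100)   # TTS load grows 1.5x
print(baseline, asr_at_peak, tts_at_peak)
```

Scaling the two services together to the ASR figure would double the TTS fleet for no benefit, which is the over-provisioning that independent scaling avoids.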

Principle 3: Multi-Region Active-Active

YuVoice operates across multiple regions within India, in an active-active configuration:

Primary: Mumbai region
Secondary: Hyderabad region
DR: Chennai region

All three regions actively process traffic. A region failure doesn't trigger a "failover" (which implies downtime) — traffic is simply redistributed to the remaining active regions within seconds.
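Redistribution can be thought of as renormalising routing weights over the regions that remain healthy. In this sketch the region keys and the 50/30/20 split are assumptions for illustration.

```python
def redistribute(weights, failed):
    """Return routing shares over healthy regions, summing to 1.0."""
    healthy = {region: w for region, w in weights.items() if region not in failed}
    total = sum(healthy.values())
    if total == 0:
        raise RuntimeError("no healthy regions left")
    return {region: w / total for region, w in healthy.items()}


# Hypothetical steady-state weights.
weights = {"mumbai": 50, "hyderabad": 30, "chennai": 20}

normal = redistribute(weights, failed=set())
after_failure = redistribute(weights, failed={"mumbai"})
print(normal)
print(after_failure)
```

The surviving regions keep their relative proportions, so neither absorbs a disproportionate share of the displaced traffic.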

Why this matters for Indian banking: RBI requires data to remain within India, but doesn't mandate single-point architecture. Multi-region within India satisfies data residency while providing geographic redundancy.

Principle 4: Circuit Breakers and Graceful Degradation

When dependent systems experience issues, the platform degrades gracefully rather than failing completely:

Banking System Slow:

If the CBS takes >2 seconds (vs normal 300ms), the voice bot acknowledges the delay: "I'm looking up your account details. One moment please."
If the CBS is completely unavailable, the bot offers alternatives: "I'm unable to access account details right now. I can take your request and call you back within 30 minutes, or connect you to an agent."

ASR Accuracy Drop:

If background noise causes recognition confidence to drop below threshold, the bot asks for confirmation rather than acting on uncertain input
If sustained accuracy issues are detected for a specific call, automatic noise cancellation is amplified or the call is routed to a human agent

TTS Service Degraded:

If the primary TTS engine has latency issues, automatic fallback to secondary TTS engine (potentially lower quality but functional)
Pre-generated audio for common responses (greetings, confirmations) serves as ultra-fast fallback
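A circuit breaker with a fallback path can be sketched in a few lines. The thresholds and names here are illustrative assumptions, and a production breaker would also need thread safety and metrics.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures, then allows a probe after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a probe through once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_fallback(breaker, primary, fallback):
    """Try the primary dependency unless the breaker is open."""
    if not breaker.allow():
        return fallback()
    try:
        result = primary()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result


def slow_cbs():
    raise TimeoutError("core banking system did not answer in time")


def offer_callback():
    return "I can take your request and call you back, or connect you to an agent."


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
replies = [call_with_fallback(breaker, slow_cbs, offer_callback) for _ in range(2)]
print(replies[-1], "| breaker open:", not breaker.allow())
```

Once the breaker is open, calls skip the struggling dependency entirely and go straight to the fallback, which is what stops a slow backend from tying up resources across thousands of calls.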

Principle 5: Zero-Downtime Deployments

Model updates, feature releases, and bug fixes deploy without interrupting a single active call:

Canary Deployment: New versions receive 1% of new calls (not existing calls). If metrics degrade, automatic rollback occurs before more customers are affected.

Blue-Green for Models: New language models deploy alongside existing models. Traffic is gradually shifted (1% → 10% → 50% → 100%) with automatic rollback triggers on accuracy metrics.

Active Call Protection: Calls in progress always complete on the version they started with. Version switches only happen for new calls.
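Canary routing combined with active call protection can be sketched as version pinning at call start. The hashing scheme and the version labels are assumptions made for this example.

```python
import hashlib


def pick_version(call_id, canary_percent, stable="v1", canary="v2"):
    """Map a call ID to a stable bucket in 0..9999 and compare to the canary share."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000
    return canary if bucket < canary_percent * 100 else stable


class CallRouter:
    """Chooses a version when a call starts and keeps it until the call ends."""

    def __init__(self, canary_percent):
        self.canary_percent = canary_percent
        self._pinned = {}

    def version_for(self, call_id):
        if call_id not in self._pinned:
            self._pinned[call_id] = pick_version(call_id, self.canary_percent)
        return self._pinned[call_id]

    def end_call(self, call_id):
        self._pinned.pop(call_id, None)


router = CallRouter(canary_percent=0)
before = router.version_for("call-in-progress")

# Rollout advances while the first call is still active.
router.canary_percent = 100
during = router.version_for("call-in-progress")
new_call = router.version_for("call-started-later")
print(before, during, new_call)
```

The in-progress call stays on the version it started with even after the rollout reaches 100%, while the call that starts later lands on the new version.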

Deep Dive: The Real-Time Processing Pipeline

Audio Ingestion

When a customer connects:

SIP INVITE received → Call setup (50ms)
Audio stream established → Bidirectional RTP flow
Audio buffered in 20ms frames → Streamed to ASR
Voice Activity Detection (VAD) → Distinguishes speech from silence/noise
Acoustic preprocessing → Noise cancellation, echo removal, gain normalisation
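The framing and voice activity steps can be illustrated with a simple energy-based detector. This is a stand-in under stated assumptions (8 kHz, 16-bit mono PCM, a fixed RMS threshold); production VAD is typically model-based and more robust to noise.

```python
import math
import struct

SAMPLE_RATE_HZ = 8000   # narrowband telephony audio (assumption)
FRAME_MS = 20
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 160 samples


def frames(pcm16):
    """Split 16-bit little-endian mono PCM bytes into whole 20 ms frames."""
    frame_bytes = SAMPLES_PER_FRAME * 2
    for start in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
        yield pcm16[start:start + frame_bytes]


def rms(frame):
    """Root-mean-square amplitude of one frame."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(frame, threshold=500.0):
    """Crude energy gate: louder than the threshold counts as speech."""
    return rms(frame) > threshold


silence = bytes(SAMPLES_PER_FRAME * 2 * 3)  # three frames of zeros
tone = struct.pack(
    "<%dh" % SAMPLES_PER_FRAME,
    *[int(8000 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ))
      for n in range(SAMPLES_PER_FRAME)],
)
decisions = [is_speech(f) for f in frames(silence + tone)]
print(decisions)
```

Only frames flagged as speech need to go on to recognition, which saves ASR capacity during the silent stretches of a call.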

At peak, the audio ingestion layer handles:

80,000+ simultaneous bidirectional audio streams
8 million audio frames per second (80K calls × 2 directions × 50 frames/sec, since 20ms frames arrive 50 times a second)
Roughly 10 Gbps aggregate audio payload bandwidth, assuming 64 kbps G.711 per stream (160,000 streams × 64 kbps)

Speech Recognition at Scale

The ASR subsystem processes audio frames into text hypotheses:

Model Architecture: Custom-trained Conformer models for each Indian language, optimised for banking vocabulary and Indian telephony acoustic conditions.

Streaming Recognition: Text hypotheses are generated incrementally as the customer speaks — not waiting for them to finish. This enables:

Early intent detection (understanding the request before the customer finishes speaking)
Predictive backend queries (starting to fetch account info as soon as the account is mentioned, before the full request is complete)

GPU Fleet Management: ASR inference runs on a fleet of NVIDIA GPUs. Each GPU handles 40-60 simultaneous ASR sessions depending on the language model size. At peak (80K concurrent calls), that works out to roughly 1,300-2,000 GPUs active for ASR alone.

Batching and Efficiency: Audio frames from multiple calls are batched for GPU inference, maximising throughput per GPU. Dynamic batching adjusts batch size based on current load to optimise the latency-throughput trade-off.
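One common way to implement dynamic batching is to dispatch a batch when it is full or when its oldest item has waited too long. The sketch below simulates that policy over arrival timestamps; the policy and parameter values are assumptions, not a description of YuVoice's scheduler.

```python
def form_batches(arrivals_ms, max_batch, max_wait_ms):
    """Group frame arrival times (ms, ascending) into inference batches.

    A batch closes when it reaches max_batch items, or when the next
    arrival shows its oldest item has already waited over max_wait_ms.
    """
    batches, current = [], []
    for t in arrivals_ms:
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == max_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


busy = form_batches([0, 1, 2, 3, 20, 21], max_batch=4, max_wait_ms=5)
quiet = form_batches([0, 10, 11], max_batch=4, max_wait_ms=5)
print(busy)
print(quiet)
```

Under heavy load batches fill quickly and throughput per GPU rises; under light load the wait cap keeps latency bounded at the cost of smaller batches.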

Natural Language Understanding

Once the ASR produces a transcript:

Intent Classification: Multi-label classification across 500+ banking intents
Entity Extraction: Named entity recognition for accounts, amounts, dates, products
Coreference Resolution: "it" → the account mentioned earlier
Sentiment Scoring: Real-time emotion tracking
Context Integration: Combining current utterance with conversation history

Latency: NLU processing completes in 50-100ms (not on the critical path for response latency — it runs in parallel with end-of-speech detection).

Dialog Management and Action Execution

The Dialog Engine:

Updates conversation state with new information from NLU
Evaluates dialog policy — what should happen next?
Executes backend actions if needed (API calls to banking systems)
Generates response strategy — what to say and how

Banking API Calls: The platform maintains persistent connection pools to banking systems (CBS, LOS, CRM). Connection pooling eliminates per-call TCP handshake overhead. Average API response time: 150-300ms.

Parallel Execution: When multiple pieces of information are needed (e.g., account balance AND recent transactions), queries execute in parallel — not sequentially.
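Parallel backend queries map naturally onto `asyncio.gather`. In this sketch the two fetch functions are hypothetical stand-ins that simulate latency with a sleep and return made-up data.

```python
import asyncio
import time


async def fetch_balance(account_id):
    await asyncio.sleep(0.05)  # simulated backend latency
    return {"account": account_id, "balance": 12500.0}


async def fetch_recent_transactions(account_id):
    await asyncio.sleep(0.05)  # simulated backend latency
    return [{"amount": -499.0}, {"amount": 30000.0}]


async def gather_account_view(account_id):
    # Both requests are in flight at once; total wait is about the slower one.
    balance, transactions = await asyncio.gather(
        fetch_balance(account_id),
        fetch_recent_transactions(account_id),
    )
    return {"balance": balance["balance"], "transactions": transactions}


start = time.perf_counter()
view = asyncio.run(gather_account_view("ACC-1"))
elapsed_ms = (time.perf_counter() - start) * 1000
print(view, f"({elapsed_ms:.0f} ms)")
```

Run sequentially, the two calls would take about twice as long, which matters when the whole turn has a sub-second budget.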

Response Generation and Speech Synthesis

Template Selection: Choose appropriate response template based on context
Dynamic Content Insertion: Fill in real-time data (balances, dates, amounts)
Language-Specific Formatting: Amounts in Indian format, dates in DD-MM-YYYY
TTS Synthesis: Convert text to natural-sounding speech in the customer's language
Audio Streaming: Stream synthesised audio back to the customer
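The language-specific formatting step can be made concrete. This sketch groups digits in the Indian system (last three digits, then pairs) and formats dates as DD-MM-YYYY; the function names are ours.

```python
from datetime import date


def format_indian_amount(amount):
    """Format a number with Indian digit grouping, e.g. 12,34,567.50."""
    sign = "-" if amount < 0 else ""
    whole, frac = f"{abs(amount):.2f}".split(".")
    if len(whole) > 3:
        head, tail = whole[:-3], whole[-3:]
        groups = []
        while len(head) > 2:
            groups.insert(0, head[-2:])
            head = head[:-2]
        if head:
            groups.insert(0, head)
        whole = ",".join(groups + [tail])
    return f"{sign}{whole}.{frac}"


def format_date(d):
    """Format a date as DD-MM-YYYY."""
    return d.strftime("%d-%m-%Y")


print(format_indian_amount(1234567.5))
print(format_indian_amount(25000000))
print(format_date(date(2026, 6, 3)))
```

The second example is 2.5 crore, which the Indian grouping renders as 2,50,00,000 rather than the Western 25,000,000.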

TTS at Scale: Neural TTS models generate audio in real time. The TTS fleet handles:

80,000+ concurrent synthesis streams at peak
Average response length: 3-5 seconds of audio
Latency from text-ready to first audio byte: <200ms

Handling Peak Load: The Salary Day Challenge

The Pattern

Indian banking call volumes exhibit extreme periodicity:

1st of month: Salary credit queries, EMI deductions, rent transfers → 3-4x normal volume
Last week of month: Payment reminders, bill payments, balance management → 2-3x normal
Diwali/Festival periods: Card limit requests, travel-related queries → 2-3x normal
Monday mornings: Weekend transaction queries → 1.5-2x normal
Budget day / RBI announcements: Policy-related queries → 2-3x spike for 2-3 hours

Pre-Scaling Strategy

YuVoice uses predictive scaling based on:

Historical Patterns: ML models trained on 24 months of call volume data predict next-day and next-week volumes with 92%+ accuracy.

Calendar Awareness: Known events (salary days, festivals, RBI meetings) trigger pre-scaling 2 hours before expected spike.

Real-Time Adaptive Scaling: Every 30 seconds, the autoscaler evaluates current load vs. capacity headroom. If utilisation exceeds 60%, additional capacity spins up (warm nodes from a pre-provisioned pool).
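One evaluation cycle of such an autoscaler might look like the following sketch. The 60% target comes from the text; the per-node capacity and pool size are hypothetical.

```python
import math

TARGET_UTILISATION = 0.60  # threshold stated above


def nodes_to_add(active_calls, nodes, calls_per_node, warm_pool):
    """Nodes to pull from the warm pool to get back under the target."""
    utilisation = active_calls / (nodes * calls_per_node)
    if utilisation <= TARGET_UTILISATION:
        return 0
    required = math.ceil(active_calls / (TARGET_UTILISATION * calls_per_node))
    # Cannot add more than the warm pool currently holds.
    return min(required - nodes, warm_pool)


# Hypothetical fleet: 10 nodes, 1,000 calls each, 5 warm nodes standing by.
print(nodes_to_add(5000, 10, 1000, 5))
print(nodes_to_add(7000, 10, 1000, 5))
print(nodes_to_add(12000, 10, 1000, 5))
```

When the warm pool is the limiting factor, as in the last case, the shortfall is the signal that pre-scaling should have started earlier.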

Capacity Reservation

For known peak events:

Pre-warm GPU instances (cold start time for a GPU node: 3-5 minutes — too long for reactive scaling)
Scale banking integration connection pools
Increase state store replica count
Alert engineering team for monitoring

Load Shedding (Last Resort)

If unprecedented load exceeds all capacity (this has happened once in platform history, during a nationwide banking system outage that drove millions of calls from confused customers):

Priority-based shedding:

First shed: Non-critical background processing (analytics, model training)
Then: Reduce TTS quality (switch to faster, lower-quality synthesis)
Then: Limit concurrent new calls (queue excess calls with estimated wait time)
Never: Drop existing active calls
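The shedding order can be expressed as an ordered list of actions keyed on how far load exceeds capacity. The ratios and action names below are assumptions for illustration; note that no action in the list ever touches active calls.

```python
# (load / capacity ratio above which the action applies, action name)
SHED_STEPS = [
    (1.00, "pause_background_jobs"),
    (1.10, "use_fast_tts"),
    (1.20, "queue_new_calls"),
]


def shedding_actions(load_ratio):
    """Actions in force at the given load, in the order they were applied."""
    return [action for threshold, action in SHED_STEPS if load_ratio > threshold]


def admit_new_call(actions):
    """New calls are queued, not rejected, once the last step is reached."""
    return "queue_new_calls" not in actions


for ratio in (0.9, 1.05, 1.3):
    actions = shedding_actions(ratio)
    print(ratio, actions, "admit new calls:", admit_new_call(actions))
```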

Monitoring and Observability

Real-Time Metrics (Sampled Every 10 Seconds)

Concurrent active calls (total and per-client)
ASR latency (P50, P95, P99)
NLU accuracy (measured against online evaluation samples)
Banking API response times (per-system)
TTS latency (P50, P95, P99)
End-to-end turn latency (customer finishes speaking → bot starts responding)
Call completion rate
Escalation rate
Error rate by component
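For the latency metrics, P50, P95, and P99 can be computed with the nearest-rank method, shown here as a small sketch over made-up samples. Real monitoring systems usually use streaming approximations rather than sorting every window.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) without floats
    return ordered[rank - 1]


# Made-up latencies: 1 ms to 100 ms, one sample each.
latencies_ms = list(range(1, 101))
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
print(p50, p95, p99)
```

Tracking P99 alongside P50 is what surfaces the tail latency problem described earlier: the median can look healthy while one call in a hundred is slow.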

Alert Hierarchy

P0 (Immediate page, auto-escalation at 5 min):

Call drop rate > 0.1%
ASR error rate > 5% (indicates model or infrastructure issue)
Banking API failure rate > 2%
Any single-client SLA breach

P1 (Alert, 15-minute response):

Latency P95 exceeds SLA threshold
Capacity utilisation > 80%
Model accuracy drift detected
Escalation rate abnormally high

P2 (Daily review):

Cost anomalies
Capacity trend projections
Model performance degradation signals
Infrastructure maintenance needs
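The numeric P0 and P1 rules can be written as a simple classifier. The metric key names are hypothetical; the thresholds are the ones listed above, and the qualitative rules (drift, abnormal escalation) are left out.

```python
def classify_alert(metrics):
    """Return "P0", "P1", or None for one metrics snapshot."""
    if (
        metrics.get("call_drop_rate", 0.0) > 0.001
        or metrics.get("asr_error_rate", 0.0) > 0.05
        or metrics.get("banking_api_failure_rate", 0.0) > 0.02
        or metrics.get("client_sla_breach", False)
    ):
        return "P0"
    if (
        metrics.get("p95_latency_ms", 0.0) > metrics.get("p95_sla_ms", float("inf"))
        or metrics.get("capacity_utilisation", 0.0) > 0.80
    ):
        return "P1"
    return None


print(classify_alert({"call_drop_rate": 0.002}))
print(classify_alert({"capacity_utilisation": 0.85}))
print(classify_alert({"call_drop_rate": 0.0005, "capacity_utilisation": 0.5}))
```

Checking P0 conditions first means a snapshot that breaches both tiers is always paged at the higher severity.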

Incident Response

When something goes wrong at scale:

Automated Response (No Human Required):

Node failures → Automatic traffic redistribution (2-second detection, 5-second recovery)
Region issues → Cross-region traffic shift
Model degradation → Automatic rollback to last-known-good version
Dependent service slow → Circuit breaker activation + graceful degradation

Human Response (Required):

Complete CBS outage (customer-facing messaging decisions)
Security incident (potential data breach)
Multi-region failure (disaster recovery procedure)
Regulatory compliance incident (communication to bank + regulators)

Data Architecture for 2.5 Crore Monthly Calls

Data Generated Per Call

Data Type	Size	Retention
Raw audio (both sides)	5-8 MB (3-minute avg call)	7 years (RBI mandate)
ASR transcript	2-5 KB	7 years
NLU metadata (intents, entities, sentiment)	1-3 KB	3 years
Conversation state snapshots	5-10 KB	1 year
API call logs	3-8 KB	3 years
Analytics events	1-2 KB	5 years

Monthly Data Volume

Raw audio: ~150 TB/month
Structured metadata: ~500 GB/month
Total annual: ~1.8 PB audio + 6 TB structured
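These volumes follow from the per-call sizes in the table. A quick check, using decimal units (1 TB = 1,000,000 MB):

```python
CALLS_PER_MONTH = 25_000_000   # 2.5 crore
AUDIO_MB_PER_CALL = (5, 8)     # range from the table above


def monthly_audio_tb(mb_per_call):
    """Monthly raw audio in terabytes for a given per-call size."""
    return CALLS_PER_MONTH * mb_per_call / 1_000_000


low_tb = monthly_audio_tb(AUDIO_MB_PER_CALL[0])
high_tb = monthly_audio_tb(AUDIO_MB_PER_CALL[1])
annual_pb = 150 * 12 / 1000    # at the ~150 TB/month midpoint
print(f"{low_tb:.0f}-{high_tb:.0f} TB/month, about {annual_pb} PB/year")
```

The ~150 TB/month figure sits inside the 125-200 TB range implied by 5-8 MB per call.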

Storage Architecture

Hot Storage (0-30 days): Object storage with SSD-backed metadata for fast retrieval. Any recent call can be replayed within 2 seconds.

Warm Storage (30-365 days): Object storage with slower access tier. Retrieval within 30 seconds for compliance reviews.

Cold Storage (1-7 years): Archived to lowest-cost storage tier. Retrieval within 4 hours (sufficient for regulatory audit requests).

Metadata Always Hot: Structured metadata (transcripts, intents, entities) remains in fast-access databases regardless of audio tier — enabling instant analytics queries across the full history.
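The tiering rules reduce to a lookup on recording age. A minimal sketch, assuming age is measured in whole days and that the seven-year retention is counted as 7 × 365 days:

```python
RETENTION_DAYS = 7 * 365

# (maximum age in days, tier name, retrieval target from the text)
TIERS = [
    (30, "hot", "2 seconds"),
    (365, "warm", "30 seconds"),
    (RETENTION_DAYS, "cold", "4 hours"),
]


def audio_tier(age_days):
    """Return (tier, retrieval target) for a recording of the given age."""
    if age_days < 0:
        raise ValueError("age cannot be negative")
    for max_age, tier, retrieval in TIERS:
        if age_days <= max_age:
            return tier, retrieval
    return "expired", None


for age in (3, 200, 1500, 3000):
    print(age, audio_tier(age))
```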

Continuous Improvement at Scale

Model Improvement Cycle

The 2.5 crore monthly calls generate massive training data:

Automated Quality Sampling: 0.1% of calls randomly selected for human review (25,000 calls/month)
Error Detection: Calls where the customer repeated themselves, expressed frustration, or escalated are automatically flagged
Model Retraining: Monthly model updates incorporating new patterns, vocabulary, and corrections
A/B Deployment: New models tested on 1% of traffic for 48 hours before wider rollout
Accuracy Measurement: Continuous online evaluation against human-transcribed ground truth

Language Model Scaling

Each language model improves proportionally to its usage data:

Hindi model: Improves weekly (highest volume)
Tamil/Telugu models: Improve bi-weekly (high volume)
Smaller language models: Improve monthly (sufficient volume)

Cold Start Problem for New Languages: When adding a new language, initial accuracy is lower due to limited training data. The platform compensates with:

Transfer learning from related languages
Active learning (prioritising uncertain samples for human review)
A stricter confidence bar for self-service (calls are routed to a human agent more readily until accuracy stabilises)

Security at Scale

Voice Data Security

All audio encrypted at rest (AES-256) and in transit (TLS 1.3)
Encryption keys managed through HSM (Hardware Security Module)
Access to raw audio requires two-factor authentication + approval workflow
Audio anonymisation pipeline for training data (removes PII from transcripts)

Authentication and Authorisation

Every API call authenticated (mTLS between services)
Role-based access control (RBAC) for operational access
Audit logging of all administrative actions
Regular penetration testing (quarterly)
Bug bounty programme for security researchers

Banking Data Isolation

Each bank's data is logically isolated (separate encryption keys, separate access policies)
No cross-bank data leakage is possible even at the infrastructure level
Bank-specific models trained only on that bank's data (no data pooling across clients)

Lessons Learned: What Breaks at Scale

Lesson 1: Timeouts Cascade

If you set a 5-second timeout on banking API calls, and the API starts responding in 4.5 seconds (due to load), nothing times out, yet thousands of calls hold for 4.5 seconds simultaneously. Those holds consume resources (memory, connection slots), causing other components to slow, which eventually pushes responses past the timeout and triggers failures.

Solution: Adaptive timeouts that tighten under load. If the system is stressed, timeout faster and degrade gracefully rather than holding resources hoping for slow responses.
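One simple form of adaptive timeout scales linearly between a base value and a floor as utilisation rises. The 60% starting point and the 1-second floor below are assumptions for illustration.

```python
def adaptive_timeout_s(utilisation, base_s=5.0, floor_s=1.0, tighten_above=0.6):
    """Shrink the timeout from base_s to floor_s as utilisation goes to 100%."""
    if utilisation <= tighten_above:
        return base_s
    stress = min(1.0, (utilisation - tighten_above) / (1.0 - tighten_above))
    return base_s - stress * (base_s - floor_s)


for u in (0.5, 0.8, 1.0):
    print(f"utilisation {u:.0%}: timeout {adaptive_timeout_s(u):.1f}s")
```

With a tighter timeout under stress, a backend answering in 4.5 seconds fails fast and hands off to the degraded path instead of pinning memory and connection slots.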

Lesson 2: Garbage Collection Pauses Kill Voice Quality

A 200ms GC pause is imperceptible in a web application. In a voice call, it's a jarring silence. At scale, GC pauses hit randomly across the fleet — but with 80K concurrent calls, "randomly" means "constantly somewhere."

Solution: Low-latency runtime choices, GC tuning, off-heap memory for audio buffers, and real-time audio buffering that masks brief processing delays.

Lesson 3: Logging Can Be the Bottleneck

At 80K concurrent calls generating detailed logs, the logging infrastructure itself becomes a critical system. A logging slowdown can backpressure the application.

Solution: Asynchronous, sampled logging for hot paths. Full logging only for error paths. Separate logging infrastructure from production traffic paths.
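A sampled, non-blocking logger can be sketched with a bounded queue. The class name and parameters are hypothetical; a background worker (not shown) would drain the queue and ship records to the logging backend.

```python
import queue
import random


class SampledLogger:
    """Keeps every error, samples everything else, and never blocks the caller."""

    def __init__(self, sample_rate, max_queue=1000, rng=None):
        self.sample_rate = sample_rate
        self.queue = queue.Queue(maxsize=max_queue)
        self.dropped = 0
        self.rng = rng or random.Random()

    def log(self, level, message):
        """Return True if the record was queued, False if sampled out or dropped."""
        if level != "ERROR" and self.rng.random() >= self.sample_rate:
            return False
        try:
            self.queue.put_nowait((level, message))
            return True
        except queue.Full:
            # Shed the record rather than stall the audio path.
            self.dropped += 1
            return False


logger = SampledLogger(sample_rate=0.0, max_queue=1)
info_kept = logger.log("INFO", "turn completed")
first_error_kept = logger.log("ERROR", "banking API timeout")
second_error_kept = logger.log("ERROR", "banking API timeout")
print(info_kept, first_error_kept, second_error_kept, logger.dropped)
```

Counting dropped records gives the logging pipeline its own health signal without letting a slow sink push back on call processing.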

Lesson 4: Feature Flags at Scale

A seemingly innocent feature flag that adds 10ms of processing per turn is invisible at low scale. At 80K concurrent calls, every round of turns costs an extra 10ms × 80,000 = 800 seconds of cumulative compute across the fleet, enough to require multiple additional nodes.

Solution: Performance regression testing for every change, no matter how small. Feature flag impact assessment before enabling at scale.

Frequently Asked Questions

What happens if a call is in progress and a server crashes?

The customer experiences a brief pause of a few seconds. The telephony layer detects the missing heartbeats within 2 seconds and reconnects the audio stream to a new processing node, which loads the conversation state from the distributed store and continues. Most customers don't notice.

How do you handle 3-4x traffic spikes on salary days?

Predictive scaling pre-warms capacity 2 hours before expected peaks. We maintain a warm pool of GPU instances that can activate in under 60 seconds. Historical data predicts salary day spikes with 95%+ accuracy. Reactive autoscaling handles any unexpected additional load.

What's the maximum concurrent call capacity?

Architecture supports 200,000+ concurrent calls (2.5x the current peak of 80,000). This headroom ensures that even unprecedented events don't exceed platform capacity. Scaling beyond 200K is a configuration change, not an architectural one.

How do you ensure data security for sensitive banking conversations?

Multiple layers: AES-256 encryption at rest, TLS 1.3 in transit, HSM-managed keys, bank-specific data isolation, RBAC access controls, comprehensive audit logging, quarterly penetration tests, and SOC 2 Type II certified operations. No single person can access call data without multiple approval layers.

How quickly can new features or models be deployed?

Model updates deploy within 4 hours (including canary testing). Feature changes deploy within 2 hours. Critical bug fixes deploy within 30 minutes. All deployments are zero-downtime — no active call is ever interrupted.

What's the disaster recovery strategy?

Active-active multi-region architecture means there's no "disaster recovery" in the traditional sense — all regions are always active. If Mumbai goes completely offline, Hyderabad and Chennai absorb the traffic automatically. RPO (Recovery Point Objective): 0 seconds for conversation state. RTO (Recovery Time Objective): under 30 seconds for traffic redistribution.

Conclusion

Handling 2.5 crore voice AI calls monthly isn't achieved through a single breakthrough technology. It's the result of hundreds of engineering decisions — each individually modest but collectively enabling a system that processes 80,000+ simultaneous banking conversations with sub-second latency, 99.95% uptime, and continuous improvement.

For Indian banks evaluating voice AI platforms, this scale represents more than a technical capability. It's a trust signal. A platform processing 25 million monthly conversations has encountered and solved every edge case, handled every failure mode, and proven every component under real production stress. There's no substitute for this battle-tested maturity.

The architecture described here — stateless processing, independent scaling, multi-region active-active, graceful degradation, and zero-downtime deployment — isn't unique to YuVoice. These are well-known distributed systems principles. What's unique is applying them specifically to voice AI for Indian banking, with the domain-specific optimisations that make them work for multilingual, regulatory-compliant, real-time financial conversations.

How YuVoice Handles 2.5 Crore Calls a Month Without Downtime

The Scale Challenge: What 2.5 Crore Monthly Calls Actually Means

Breaking Down the Numbers

2.5 crore calls per month translates to:

833,000+ calls per day on average
35,000+ calls per hour sustained
580+ calls per minute around the clock
Peak hours: 80,000+ concurrent conversations (10 AM - 2 PM on salary days)

Each call involves:

Real-time audio streaming (bidirectional)
Speech-to-text conversion (within 200ms)
Natural language understanding (within 100ms)
Banking system API calls (within 300ms)
Response generation (within 100ms)
Text-to-speech synthesis (within 200ms)

Total latency budget per conversational turn: under 800ms — to feel natural to the customer.

Why This Is Hard

At this scale, problems that are invisible at small volumes become existential:

Tail Latency: If 99% of calls have acceptable latency but 1% don't, that's 8,330 customers per day experiencing poor service. At banking scale, 1% failure means thousands of complaints.

Data Gravity: 2.5 crore calls generate approximately 15 petabytes of raw audio data annually. Storage, retrieval, and processing of this volume requires purpose-built data infrastructure.

Architectural Principles

Principle 1: Stateless Processing with Distributed State

Why this matters: If a processing node fails mid-call, another node can seamlessly continue the conversation from the last state checkpoint. The customer never knows a failover occurred.

Principle 2: Microservices with Independent Scaling

Each component of the voice AI pipeline is an independent service that scales independently based on its specific load pattern:

Telephony Gateway: Scales based on concurrent call count
ASR Service: Scales based on audio minutes being processed
NLU Service: Scales based on text utterances per second
Dialog Engine: Scales based on active conversation count
Banking Integration Service: Scales based on API calls per second
TTS Service: Scales based on text characters being synthesised
Recording Service: Scales based on audio bytes per second

Principle 3: Multi-Region Active-Active

YuVoice operates across multiple availability zones within India, with active-active configuration:

Primary: Mumbai region
Secondary: Hyderabad region
DR: Chennai region

All three regions actively process traffic. A region failure doesn't trigger a "failover" (which implies downtime) — traffic is simply redistributed to the remaining active regions within seconds.

Principle 4: Circuit Breakers and Graceful Degradation

When dependent systems experience issues, the platform degrades gracefully rather than failing completely:

Banking System Slow:

If the CBS takes >2 seconds (vs normal 300ms), the voice bot acknowledges the delay: "I'm looking up your account details. One moment please."
If the CBS is completely unavailable, the bot offers alternatives: "I'm unable to access account details right now. I can take your request and call you back within 30 minutes, or connect you to an agent."

ASR Accuracy Drop:

If background noise causes recognition confidence to drop below threshold, the bot asks for confirmation rather than acting on uncertain input
If sustained accuracy issues are detected for a specific call, automatic noise cancellation is amplified or the call is routed to a human agent

TTS Service Degraded:

If the primary TTS engine has latency issues, automatic fallback to secondary TTS engine (potentially lower quality but functional)
Pre-generated audio for common responses (greetings, confirmations) serves as ultra-fast fallback

Principle 5: Zero-Downtime Deployments

Model updates, feature releases, and bug fixes deploy without interrupting a single active call:

Canary Deployment: New versions receive 1% of new calls (not existing calls). If metrics degrade, automatic rollback occurs before more customers are affected.

Blue-Green for Models: New language models deploy alongside existing models. Traffic is gradually shifted (1% → 10% → 50% → 100%) with automatic rollback triggers on accuracy metrics.

Active Call Protection: Calls in progress always complete on the version they started with. Version switches only happen for new calls.

Deep Dive: The Real-Time Processing Pipeline

Audio Ingestion

When a customer connects:

SIP INVITE received → Call setup (50ms)
Audio stream established → Bidirectional RTP flow
Audio buffered in 20ms frames → Streamed to ASR
Voice Activity Detection (VAD) → Distinguishes speech from silence/noise
Acoustic preprocessing → Noise cancellation, echo removal, gain normalisation

At peak, the audio ingestion layer handles:

80,000+ simultaneous bidirectional audio streams
3.2 million audio frames per second (80K calls × 2 directions × 20 frames/sec)
256 Mbps aggregate audio bandwidth

Speech Recognition at Scale

The ASR subsystem processes audio frames into text hypotheses:

Model Architecture: Custom-trained Conformer models for each Indian language, optimised for banking vocabulary and Indian telephony acoustic conditions.

Streaming Recognition: Text hypotheses are generated incrementally as the customer speaks — not waiting for them to finish. This enables:

Early intent detection (understanding the request before the customer finishes speaking)
Predictive backend queries (starting to fetch account info as soon as the account is mentioned, before the full request is complete)

Natural Language Understanding

Once the ASR produces a transcript:

Intent Classification: Multi-label classification across 500+ banking intents
Entity Extraction: Named entity recognition for accounts, amounts, dates, products
Coreference Resolution: "it" → the account mentioned earlier
Sentiment Scoring: Real-time emotion tracking
Context Integration: Combining current utterance with conversation history

Latency: NLU processing completes in 50-100ms (not on the critical path for response latency — it runs in parallel with end-of-speech detection).

Dialog Management and Action Execution

The Dialog Engine:

Updates conversation state with new information from NLU
Evaluates dialog policy — what should happen next?
Executes backend actions if needed (API calls to banking systems)
Generates response strategy — what to say and how

Parallel Execution: When multiple pieces of information are needed (e.g., account balance AND recent transactions), queries execute in parallel — not sequentially.

Response Generation and Speech Synthesis

Template Selection: Choose appropriate response template based on context
Dynamic Content Insertion: Fill in real-time data (balances, dates, amounts)
Language-Specific Formatting: Amounts in Indian format, dates in DD-MM-YYYY
TTS Synthesis: Convert text to natural-sounding speech in the customer's language
Audio Streaming: Stream synthesised audio back to the customer

TTS at Scale: Neural TTS models generate audio in real time. The TTS fleet handles:

80,000+ concurrent synthesis streams at peak
Average response length: 3-5 seconds of audio
Latency from text-ready to first audio byte: <200ms

Handling Peak Load: The Salary Day Challenge

The Pattern

Indian banking call volumes exhibit extreme periodicity:

1st of month: Salary credit queries, EMI deductions, rent transfers → 3-4x normal volume
Last week of month: Payment reminders, bill payments, balance management → 2-3x normal
Diwali/Festival periods: Card limit requests, travel-related queries → 2-3x normal
Monday mornings: Weekend transaction queries → 1.5-2x normal
Budget day / RBI announcements: Policy-related queries → 2-3x spike for 2-3 hours

Pre-Scaling Strategy

YuVoice uses predictive scaling based on:

Historical Patterns: ML models trained on 24 months of call volume data predict next-day and next-week volumes with 92%+ accuracy.

Calendar Awareness: Known events (salary days, festivals, RBI meetings) trigger pre-scaling 2 hours before expected spike.

Capacity Reservation

For known peak events:

Pre-warm GPU instances (cold start time for a GPU node: 3-5 minutes — too long for reactive scaling)
Scale banking integration connection pools
Increase state store replica count
Alert engineering team for monitoring

Load Shedding (Last Resort)

If unprecedented load exceeds all capacity (has happened once in platform history — during a nationwide banking system outage that drove millions of confusion calls):

Priority-based shedding:

First shed: Non-critical background processing (analytics, model training)
Then: Reduce TTS quality (switch to faster, lower-quality synthesis)
Then: Limit concurrent new calls (queue excess calls with estimated wait time)
Never: Drop existing active calls

Monitoring and Observability

Real-Time Metrics (Sampled Every 10 Seconds)

Concurrent active calls (total and per-client)
ASR latency (P50, P95, P99)
NLU accuracy (measured against online evaluation samples)
Banking API response times (per-system)
TTS latency (P50, P95, P99)
End-to-end turn latency (customer finishes speaking → bot starts responding)
Call completion rate
Escalation rate
Error rate by component

Alert Hierarchy

P0 (Immediate page, auto-escalation at 5 min):

Call drop rate > 0.1%
ASR error rate > 5% (indicates model or infrastructure issue)
Banking API failure rate > 2%
Any single-client SLA breach

P1 (Alert, 15-minute response):

Latency P95 exceeds SLA threshold
Capacity utilisation > 80%
Model accuracy drift detected
Escalation rate abnormally high

P2 (Daily review):

Cost anomalies
Capacity trend projections
Model performance degradation signals
Infrastructure maintenance needs

Incident Response

When something goes wrong at scale:

Automated Response (No Human Required):

Node failures → Automatic traffic redistribution (2-second detection, 5-second recovery)
Region issues → Cross-region traffic shift
Model degradation → Automatic rollback to last-known-good version
Dependent service slow → Circuit breaker activation + graceful degradation
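
The circuit breaker mentioned in the last item can be sketched as follows. This is a generic, single-threaded illustration of the pattern, not YuVoice's implementation; the failure threshold, reset interval, and the injectable `clock` parameter are assumptions chosen to keep the example small and testable.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            # Half-open: allow one trial call; a failure re-opens immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()  # graceful degradation
        self.failures = 0
        return result
```

A caller would wrap each banking API request, for example `breaker.call(fetch_balance, cached_or_apology)`, where both function names are hypothetical. While the circuit is open the slow dependency receives no traffic, which gives it room to recover and keeps call-handling threads free.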

Human Response (Required):

Complete CBS outage (customer-facing messaging decisions)
Security incident (potential data breach)
Multi-region failure (disaster recovery procedure)
Regulatory compliance incident (communication to bank + regulators)

Data Architecture for 2.5 Crore Monthly Calls

Data Generated Per Call

Data Type	Size	Retention
Raw audio (both sides)	5-8 MB (3-minute avg call)	7 years (RBI mandate)
ASR transcript	2-5 KB	7 years
NLU metadata (intents, entities, sentiment)	1-3 KB	3 years
Conversation state snapshots	5-10 KB	1 year
API call logs	3-8 KB	3 years
Analytics events	1-2 KB	5 years

Monthly Data Volume

Raw audio: ~150 TB/month
Structured metadata: ~500 GB/month
Annual total: ~1.8 PB of audio + ~6 TB of structured metadata
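
These totals follow from the per-call sizes in the table. The short calculation below reproduces them, assuming decimal units (1 TB = 1,000,000 MB) and summing the five non-audio rows as "structured metadata".

```python
CALLS_PER_MONTH = 25_000_000  # 2.5 crore

# (low, high) per-call sizes taken from the table above.
audio_mb_per_call = (5, 8)
metadata_kb_per_call = (2 + 1 + 5 + 3 + 1, 5 + 3 + 10 + 8 + 2)  # (12, 28) KB

# MB -> TB and KB -> GB are both a factor of 1,000,000 in decimal units.
audio_tb = tuple(mb * CALLS_PER_MONTH / 1_000_000 for mb in audio_mb_per_call)
metadata_gb = tuple(kb * CALLS_PER_MONTH / 1_000_000 for kb in metadata_kb_per_call)

print(audio_tb)     # (125.0, 200.0) TB/month, consistent with ~150 TB
print(metadata_gb)  # (300.0, 700.0) GB/month, consistent with ~500 GB

annual_audio_pb = 150 * 12 / 1000     # 1.8 PB
annual_metadata_tb = 500 * 12 / 1000  # 6.0 TB
print(annual_audio_pb, annual_metadata_tb)
```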

Storage Architecture

Hot Storage (0-30 days): Object storage with SSD-backed metadata for fast retrieval. Any recent call can be replayed within 2 seconds.

Warm Storage (30-365 days): Object storage with slower access tier. Retrieval within 30 seconds for compliance reviews.

Cold Storage (1-7 years): Archived to lowest-cost storage tier. Retrieval within 4 hours (sufficient for regulatory audit requests).

Metadata Always Hot: Structured metadata (transcripts, intents, entities) remains in fast-access databases regardless of audio tier — enabling instant analytics queries across the full history.
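
The tiering rules can be expressed as a lookup on recording age. In the sketch below, the tier boundaries and retrieval targets are the ones quoted above; the function name, the handling of boundary days, and the `past_retention` label for audio older than seven years are assumptions.

```python
# Retrieval targets quoted in the article, in seconds.
RETRIEVAL_TARGET_S = {"hot": 2, "warm": 30, "cold": 4 * 3600}


def audio_tier(age_days: int) -> str:
    """Storage tier for a call recording of the given age.

    Treating day 30 as hot and day 365 as warm is an assumption;
    the article does not say which side each boundary falls on.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 30:
        return "hot"
    if age_days <= 365:
        return "warm"
    if age_days <= 7 * 365:
        return "cold"
    return "past_retention"  # hypothetical label: beyond the 7-year retention period


for age in (3, 90, 1000):
    tier = audio_tier(age)
    print(age, tier, RETRIEVAL_TARGET_S[tier])
```

Only audio moves between tiers; per the last point above, transcripts and other structured metadata stay in fast-access databases regardless of the recording's age.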

Continuous Improvement at Scale

Model Improvement Cycle

The 2.5 crore monthly calls generate a large volume of training data:

Automated Quality Sampling: 0.1% of calls randomly selected for human review (25,000 calls/month)
Error Detection: Calls where the customer repeated themselves, expressed frustration, or escalated are automatically flagged
Model Retraining: Monthly model updates incorporating new patterns, vocabulary, and corrections
A/B Deployment: New models tested on 1% of traffic for 48 hours before wider rollout
Accuracy Measurement: Continuous online evaluation against human-transcribed ground truth
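
One common way to route a fixed share of traffic to a new model is deterministic hashing of a call identifier, sketched below. The article does not say how YuVoice assigns calls to the 1% cohort, so the hashing scheme, function names, and ID format here are all assumptions.

```python
import hashlib

BUCKETS = 10_000  # granularity of 0.01%


def bucket(call_id: str) -> int:
    """Stable bucket in [0, BUCKETS) derived from the call ID."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_cohort(call_id: str, percent: float) -> bool:
    """True if this call falls in the first `percent` of buckets."""
    return bucket(call_id) < percent * BUCKETS / 100


# Hypothetical call IDs: roughly 1% should land in the A/B cohort.
ids = [f"call-{n}" for n in range(100_000)]
share = sum(in_cohort(i, 1.0) for i in ids) / len(ids)
print(f"{share:.2%}")
```

The same function with `percent=0.1` would give the 0.1% human-review sample, although a real system would likely salt the hash differently per experiment so that the two cohorts do not overlap systematically.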

Language Model Scaling

Each language model improves at a pace that tracks the volume of usage data it receives:

Hindi model: Improves weekly (highest volume)
Tamil/Telugu models: Improve bi-weekly (high volume)
Smaller language models: Improve monthly (sufficient volume)

Cold Start Problem for New Languages: When adding a new language, initial accuracy is lower due to limited training data. The platform compensates with:

Transfer learning from related languages
Active learning (prioritising uncertain samples for human review)
Lower bar for escalation (calls are routed to a human agent more readily until accuracy stabilises)
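
Two of these compensations lend themselves to a short sketch: uncertainty-based selection for human review, and routing to a human agent more readily for a new language. The confidence values and both thresholds (0.6 and 0.8) are invented for illustration.

```python
def select_for_review(utterances, budget):
    """Active learning by uncertainty: pick the `budget` least-confident utterances."""
    return sorted(utterances, key=lambda u: u["confidence"])[:budget]


def should_escalate(confidence, language_is_new,
                    base_threshold=0.6, new_language_threshold=0.8):
    """Escalate when model confidence is below the threshold for that language.

    A new language uses a stricter (higher) confidence requirement, so more
    calls go to human agents until accuracy stabilises. Thresholds are hypothetical.
    """
    threshold = new_language_threshold if language_is_new else base_threshold
    return confidence < threshold


utterances = [
    {"id": "a", "confidence": 0.91},
    {"id": "b", "confidence": 0.42},
    {"id": "c", "confidence": 0.67},
]
print([u["id"] for u in select_for_review(utterances, budget=2)])  # ['b', 'c']
print(should_escalate(0.7, language_is_new=True))   # True
print(should_escalate(0.7, language_is_new=False))  # False
```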

Security at Scale

Voice Data Security

All audio encrypted at rest (AES-256) and in transit (TLS 1.3)
Encryption keys managed through HSM (Hardware Security Module)
Access to raw audio requires two-factor authentication + approval workflow
Audio anonymisation pipeline for training data (removes PII from transcripts)

Authentication and Authorisation

Every API call authenticated (mTLS between services)
Role-based access control (RBAC) for operational access
Audit logging of all administrative actions
Regular penetration testing (quarterly)
Bug bounty programme for security researchers

Banking Data Isolation

Each bank's data is logically isolated (separate encryption keys, separate access policies)
No cross-bank data leakage is possible even at the infrastructure level
Bank-specific models trained only on that bank's data (no data pooling across clients)

Lessons Learned: What Breaks at Scale

Lesson 1: Timeouts Cascade

Solution: Adaptive timeouts that tighten under load. If the system is stressed, time out faster and degrade gracefully rather than holding resources in the hope that slow responses eventually arrive.
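
A minimal sketch of a load-sensitive timeout is shown below. The linear shape and every number in it (2,000 ms base, 500 ms floor, tightening from 70% utilisation) are assumptions; the article states only that timeouts tighten as load rises.

```python
def adaptive_timeout_ms(utilisation, base_ms=2000, floor_ms=500, tighten_from=0.7):
    """Timeout for a downstream call, given current capacity utilisation (0.0-1.0).

    Below `tighten_from` the full timeout applies. Above it, the timeout
    shrinks linearly and reaches `floor_ms` at 100% utilisation.
    """
    if utilisation <= tighten_from:
        return base_ms
    if utilisation >= 1.0:
        return floor_ms
    fraction = (utilisation - tighten_from) / (1.0 - tighten_from)
    return round(base_ms - fraction * (base_ms - floor_ms))


for u in (0.50, 0.85, 1.00):
    print(u, adaptive_timeout_ms(u))  # 2000, 1250, 500
```

Shortening the wait under stress frees connections and threads sooner, so a slow dependency is less likely to exhaust the resources that healthy calls need.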

Lesson 2: Garbage Collection Pauses Kill Voice Quality

Solution: Low-latency runtime choices, GC tuning, off-heap memory for audio buffers, and real-time audio buffering that masks brief processing delays.

Lesson 3: Logging Can Be the Bottleneck

At 80K concurrent calls generating detailed logs, the logging infrastructure itself becomes a critical system. A logging slowdown can backpressure the application.

Solution: Asynchronous, sampled logging for hot paths. Full logging only for error paths. Separate logging infrastructure from production traffic paths.
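
The combination of asynchronous and sampled logging can be sketched with Python's standard `logging.handlers.QueueHandler` and `QueueListener`. The 1% sample rate, queue size, logger name, and helper names are assumptions, and the article does not state which logging stack YuVoice uses.

```python
import logging
import logging.handlers
import queue
import random


class SampleFilter(logging.Filter):
    """Keep every WARNING-or-above record, but only a fraction of lower-severity ones."""

    def __init__(self, rate, rng=random.random):
        super().__init__()
        self.rate = rate
        self.rng = rng

    def filter(self, record):
        return record.levelno >= logging.WARNING or self.rng() < self.rate


def build_logger(sink_handler, sample_rate=0.01, max_queue=10_000):
    """Logger whose hot path only enqueues; a background thread does the I/O."""
    log_queue = queue.Queue(maxsize=max_queue)
    # QueueHandler enqueues with put_nowait, so a full queue drops the
    # record (via handleError) instead of blocking the call path.
    queue_handler = logging.handlers.QueueHandler(log_queue)
    queue_handler.addFilter(SampleFilter(sample_rate))

    logger = logging.getLogger("call_path")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.addHandler(queue_handler)

    listener = logging.handlers.QueueListener(log_queue, sink_handler)
    listener.start()
    return logger, listener


logger, listener = build_logger(logging.StreamHandler(), sample_rate=0.01)
logger.info("turn completed")      # kept about 1% of the time
logger.error("ASR request failed") # always kept
listener.stop()                    # flushes queued records before returning
```

The bounded queue is the part that addresses backpressure: if the logging sink slows down, records are dropped rather than stalling the threads that handle live audio.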

Lesson 4: Feature Flags at Scale

Solution: Performance regression testing for every change, no matter how small. Feature flag impact assessment before enabling at scale.