[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-050d4528-ad90-48c3-b889-bed479178974":3,"$f8JZ5TB5KzgELcSuHmMvoC6yVgehT2pLeAAz9E_qv3tM":43},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":34},"050d4528-ad90-48c3-b889-bed479178974","voice-agents","Voice agents represent the frontier of AI interaction - humans","cat_life_career","mod_other","sickn33,other","---\nname: voice-agents\ndescription: Voice agents represent the frontier of AI interaction - humans\n  speaking naturally with AI systems.\nrisk: safe\nsource: vibeship-spawner-skills (Apache 2.0)\ndate_added: 2026-02-27\n---\n\n# Voice Agents\n\nVoice agents represent the frontier of AI interaction - humans speaking\nnaturally with AI systems. The challenge isn't just speech recognition\nand synthesis; it's achieving natural conversation flow with sub-800ms\nlatency while handling interruptions, background noise, and emotional\nnuance.\n\nThis skill covers two architectures: speech-to-speech (OpenAI Realtime API,\nlowest latency, most natural) and pipeline (STT→LLM→TTS, more control,\neasier to debug). Key insight: latency is the constraint. Humans expect\nresponses within about 500ms. Every millisecond matters.\n\n84% of organizations are increasing voice AI budgets in 2025. 
This is the\nyear voice agents go mainstream.\n\n## Principles\n\n- Latency is the constraint - target \u003C800ms end-to-end\n- Jitter (variance) matters as much as absolute latency\n- VAD quality determines conversation flow\n- Interruption handling makes or breaks the experience\n- Start with focused MVP, iterate based on real conversations\n- Combine best-in-class components (Deepgram STT + ElevenLabs TTS)\n\n## Capabilities\n\n- voice-agents\n- speech-to-speech\n- speech-to-text\n- text-to-speech\n- conversational-ai\n- voice-activity-detection\n- turn-taking\n- barge-in-detection\n- voice-interfaces\n\n## Scope\n\n- phone-system-integration → backend\n- audio-processing-dsp → audio-specialist\n- music-generation → audio-specialist\n- accessibility-compliance → accessibility-specialist\n\n## Tooling\n\n### Speech_to_speech\n\n- OpenAI Realtime API - When: Lowest latency, most natural conversation Note: gpt-4o-realtime-preview, native voice, sub-500ms\n- Pipecat - When: Open-source voice orchestration Note: Daily-backed, enterprise-grade, modular\n\n### Speech_to_text\n\n- OpenAI Whisper - When: Highest accuracy, multilingual Note: gpt-4o-transcribe for best results\n- Deepgram Nova-3 - When: Production workloads, 54% lower WER Note: 150-184ms TTFT, 90%+ accuracy on noisy audio\n- AssemblyAI - When: Real-time streaming, speaker diarization Note: Good accuracy-latency balance\n\n### Text_to_speech\n\n- ElevenLabs - When: Most natural voice, emotional control Note: Flash model 75ms latency, V3 for expression\n- OpenAI TTS - When: Integrated with OpenAI stack Note: gpt-4o-mini-tts, 13 voices, streaming\n- Deepgram Aura-2 - When: Cost-effective production TTS Note: 40% cheaper than ElevenLabs, 184ms TTFB\n\n### Frameworks\n\n- Pipecat - When: Open-source voice agent orchestration Note: Silero VAD, SmartTurn, interruption handling\n- Vapi - When: Managed voice agent platform Note: No infrastructure management\n- Retell AI - When: Low-latency voice agents Note: Best 
context preservation on interruption\n\n## Patterns\n\n### Speech-to-Speech Architecture\n\nDirect audio-to-audio processing for lowest latency\n\n**When to use**: Maximum naturalness, emotional preservation, real-time conversation\n\n# SPEECH-TO-SPEECH ARCHITECTURE:\n\n\"\"\"\n[User Audio] → [S2S Model] → [Agent Audio]\n\nAdvantages:\n- Lowest latency (sub-500ms)\n- Preserves emotion, emphasis, accents\n- Most natural conversation flow\n\nDisadvantages:\n- Less control over responses\n- Harder to debug\u002Faudit\n- Can't easily modify what's said\n\"\"\"\n\n## OpenAI Realtime API\n\"\"\"\nimport { RealtimeClient } from '@openai\u002Frealtime-api-beta';\n\nconst client = new RealtimeClient({\n  apiKey: process.env.OPENAI_API_KEY,\n});\n\n\u002F\u002F Configure for voice conversation\nclient.updateSession({\n  modalities: ['text', 'audio'],\n  voice: 'alloy',\n  input_audio_format: 'pcm16',\n  output_audio_format: 'pcm16',\n  instructions: `You are a helpful customer service agent.\n    Be concise and friendly. 
If you don't know something,\n    say so rather than making things up.`,\n  turn_detection: {\n    type: 'server_vad',  \u002F\u002F or 'semantic_vad'\n    threshold: 0.5,\n    prefix_padding_ms: 300,\n    silence_duration_ms: 500,\n  },\n});\n\n\u002F\u002F Handle audio streams\nclient.on('conversation.item.input_audio_transcription', (event) => {\n  console.log('User said:', event.transcript);\n});\n\nclient.on('response.audio.delta', (event) => {\n  \u002F\u002F Stream audio to speaker\n  audioPlayer.write(Buffer.from(event.delta, 'base64'));\n});\n\n\u002F\u002F Send user audio\nclient.appendInputAudio(audioBuffer);\n\"\"\"\n\n### Use Cases:\n- Real-time customer support\n- Voice assistants\n- Interactive voice response (IVR)\n- Live language translation\n\n### Pipeline Architecture\n\nSeparate STT → LLM → TTS for maximum control\n\n**When to use**: Need to know\u002Fcontrol exactly what's said, debugging, compliance\n\n# PIPELINE ARCHITECTURE:\n\n\"\"\"\n[Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]\n\nAdvantages:\n- Full control at each step\n- Can log\u002Faudit all text\n- Easier to debug\n- Mix best-in-class components\n\nDisadvantages:\n- Higher latency (700-1200ms typical)\n- Loses some emotion\u002Fnuance\n- More components to manage\n\"\"\"\n\n## Production Pipeline Example\n\"\"\"\nimport { Deepgram } from '@deepgram\u002Fsdk';\nimport { ElevenLabsClient } from 'elevenlabs';\nimport OpenAI from 'openai';\n\n\u002F\u002F Initialize clients\nconst deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY);\nconst elevenlabs = new ElevenLabsClient();\nconst openai = new OpenAI();\n\nasync function processVoiceInput(audioStream) {\n  \u002F\u002F 1. 
Speech-to-Text (Deepgram Nova-3)\n  const transcription = await deepgram.transcription.live({\n    model: 'nova-3',\n    punctuate: true,\n    endpointing: 300,  \u002F\u002F ms of silence before end\n  });\n\n  transcription.on('transcript', async (data) => {\n    if (data.is_final && data.speech_final) {\n      const userText = data.channel.alternatives[0].transcript;\n      console.log('User:', userText);\n\n      \u002F\u002F 2. LLM Processing\n      const completion = await openai.chat.completions.create({\n        model: 'gpt-4o-mini',\n        messages: [\n          { role: 'system', content: 'You are a concise voice assistant.' },\n          { role: 'user', content: userText }\n        ],\n        max_tokens: 150,  \u002F\u002F Keep responses short for voice\n      });\n\n      const agentText = completion.choices[0].message.content;\n      console.log('Agent:', agentText);\n\n      \u002F\u002F 3. Text-to-Speech (ElevenLabs)\n      \u002F\u002F Named ttsStream so it does not shadow the audioStream parameter\n      const ttsStream = await elevenlabs.textToSpeech.stream({\n        voice_id: 'voice_id_here',\n        text: agentText,\n        model_id: 'eleven_flash_v2_5',  \u002F\u002F Lowest latency\n      });\n\n      \u002F\u002F Stream to user\n      playAudioStream(ttsStream);\n    }\n  });\n\n  \u002F\u002F Pipe audio to transcription\n  audioStream.pipe(transcription);\n}\n\"\"\"\n\n### Optimization Tips:\n- Start TTS while the LLM is still generating (streaming)\n- Pre-compute the first response segment during user speech\n- Use Flash\u002Fturbo models for latency\n\n### Voice Activity Detection Pattern\n\nDetect when user starts\u002Fstops speaking\n\n**When to use**: All voice agents need VAD for turn-taking\n\n# VOICE ACTIVITY DETECTION (VAD):\n\n\"\"\"\nVAD Types:\n1. Energy-based: Simple, fast, noise-sensitive\n2. Model-based: Silero VAD, more accurate\n3. 
Semantic VAD: Understands meaning, best for conversation\n\"\"\"\n\n## Silero VAD (Popular Open Source)\n\"\"\"\nimport { SileroVAD } from '@pipecat-ai\u002Fsilero-vad';\n\nconst vad = new SileroVAD({\n  threshold: 0.5,           \u002F\u002F Speech probability threshold\n  min_speech_duration: 250, \u002F\u002F ms before speech confirmed\n  min_silence_duration: 500, \u002F\u002F ms of silence = end of turn\n});\n\nvad.on('speech_start', () => {\n  console.log('User started speaking');\n  \u002F\u002F Stop any playing TTS (barge-in)\n  audioPlayer.stop();\n});\n\nvad.on('speech_end', () => {\n  console.log('User finished speaking');\n  \u002F\u002F Trigger response generation\n  processTranscript();\n});\n\n\u002F\u002F Feed audio to VAD\naudioStream.on('data', (chunk) => {\n  vad.process(chunk);\n});\n\"\"\"\n\n## OpenAI Semantic VAD\n\"\"\"\n\u002F\u002F In Realtime API session config\nclient.updateSession({\n  turn_detection: {\n    type: 'semantic_vad',  \u002F\u002F Uses meaning, not just silence\n    \u002F\u002F Model waits longer after \"ummm...\"\n    \u002F\u002F Responds faster after \"Yes, that's correct.\"\n  },\n});\n\"\"\"\n\n## Barge-In Handling\n\"\"\"\n\u002F\u002F When user interrupts:\nfunction handleBargeIn() {\n  \u002F\u002F 1. Stop TTS immediately\n  audioPlayer.stop();\n\n  \u002F\u002F 2. Cancel pending LLM generation\n  llmController.abort();\n\n  \u002F\u002F 3. Reset state\n  conversationState.checkpoint();\n\n  \u002F\u002F 4. 
Listen to new input\n  startListening();\n}\n\n\u002F\u002F VAD triggers barge-in\nvad.on('speech_start', () => {\n  if (audioPlayer.isPlaying) {\n    handleBargeIn();\n  }\n});\n\"\"\"\n\n### Latency Optimization Pattern\n\nAchieving \u003C800ms end-to-end response time\n\n**When to use**: Production voice agents\n\n# LATENCY OPTIMIZATION:\n\n\"\"\"\nTarget Metrics:\n- End-to-end: \u003C800ms (ideal: \u003C500ms)\n- Time-to-First-Token (TTFT): \u003C300ms\n- Barge-in response: \u003C200ms\n- Jitter: \u003C100ms std dev\n\"\"\"\n\n## Pipeline Latency Breakdown\n\"\"\"\nTypical breakdown:\n- VAD processing: 50-100ms\n- STT first result: 150-200ms\n- LLM TTFT: 100-300ms\n- TTS time-to-first-audio (TTFA): 75-200ms\n- Audio buffering: 50-100ms\n\nTotal: 425-900ms\n\"\"\"\n\n## Optimization Strategies\n\n### 1. Streaming Everything\n\"\"\"\n\u002F\u002F Stream STT results as they come\nstt.on('partial_transcript', (text) => {\n  \u002F\u002F Start processing before final transcript\n  llmPreprocessor.prepare(text);\n});\n\n\u002F\u002F Stream LLM output to TTS\nconst llmStream = await openai.chat.completions.create({\n  stream: true,\n  \u002F\u002F ...\n});\n\nfor await (const chunk of llmStream) {\n  \u002F\u002F delta.content is undefined on role-only and final chunks\n  const text = chunk.choices[0]?.delta?.content;\n  if (text) tts.appendText(text);\n}\n\"\"\"\n\n### 2. Pre-computation\n\"\"\"\n\u002F\u002F While user is speaking, predict and prepare\nstt.on('partial_transcript', async (text) => {\n  \u002F\u002F Pre-fetch relevant context\n  const context = await retrieveContext(text);\n\n  \u002F\u002F Pre-compute likely first sentence\n  const firstSentence = await generateOpener(context);\n});\n\"\"\"\n\n### 3. Use Low-Latency Models\n\"\"\"\n\u002F\u002F STT: Deepgram Nova-3 (150ms TTFT)\n\u002F\u002F LLM: gpt-4o-mini (fastest GPT-4 class)\n\u002F\u002F TTS: ElevenLabs Flash (75ms) or Deepgram Aura-2 (184ms)\n\"\"\"\n\n### 4. 
Edge Deployment\n\"\"\"\n\u002F\u002F Run inference closer to user\n\u002F\u002F - Cloud regions near user\n\u002F\u002F - Edge computing for VAD\u002FSTT\n\u002F\u002F - WebSocket over HTTP for lower overhead\n\"\"\"\n\n### Conversation Design Pattern\n\nDesigning natural voice conversations\n\n**When to use**: Building voice UX\n\n# CONVERSATION DESIGN:\n\n## Voice-First Principles\n\"\"\"\nVoice is different from text:\n- No undo button - say it right the first time\n- Linear - user can't scroll back\n- Ephemeral - easy to miss information\n- Emotional - tone matters as much as words\n\"\"\"\n\n## Response Design\n\"\"\"\n# Keep responses short (10-20 seconds max)\n# Front-load the answer\n# Use signposting for lists\n\nBad: \"I found several options. The first is... second is...\"\nGood: \"I found 3 options. Want me to go through them?\"\n\n# Confirm understanding\nBad: \"I'll transfer $500 to John.\"\nGood: \"So that's $500 to John Smith. Should I proceed?\"\n\"\"\"\n\n## Prompting for Voice\n\"\"\"\nsystem_prompt = '''\nYou are a voice assistant. Follow these rules:\n\n1. Be concise - keep responses under 30 words\n2. Use natural speech - contractions, casual language\n3. Never use formatting (bullets, numbers in lists)\n4. Spell out numbers and abbreviations\n5. End with a question to keep conversation flowing\n6. If unclear, ask for clarification\n7. Never say \"I'm an AI\" unless asked\n\nGood: \"Got it. I'll set that reminder for three pm. Anything else?\"\nBad: \"I have set a reminder for 3:00 PM. Is there anything else I can assist you with today?\"\n'''\n\"\"\"\n\n## Error Recovery\n\"\"\"\n\u002F\u002F Handle recognition errors gracefully\nconst errorResponses = {\n  no_speech: \"I didn't catch that. Could you say it again?\",\n  unclear: \"Sorry, I'm not sure I understood. You said [repeat]. Is that right?\",\n  timeout: \"Still there? 
I'm here when you're ready.\",\n};\n\n\u002F\u002F Always offer human fallback for complex issues\nif (confidenceScore \u003C 0.6) {\n  response = \"I want to make sure I get this right. Would you like to speak with a human agent?\";\n}\n\"\"\"\n\n## Sharp Edges\n\n### Response Latency Exceeds 800ms\n\nSeverity: CRITICAL\n\nSituation: Building a voice agent pipeline\n\nSymptoms:\nConversations feel awkward. Users repeat themselves. \"Are you\nthere?\" questions. Users hang up or give up. Low satisfaction\nscores despite correct answers.\n\nWhy this breaks:\nIn human conversation, responses typically arrive within 500ms.\nAnything over 800ms feels like the agent is slow or confused.\nUsers lose confidence and patience. Every component adds latency:\nVAD (100ms) + STT (200ms) + LLM (300ms) + TTS (200ms) = 800ms.\n\nRecommended fix:\n\n# Measure and budget latency for each component:\n\n### Target latencies:\n- VAD processing: \u003C100ms\n- STT time-to-first-token: \u003C200ms\n- LLM time-to-first-token: \u003C300ms\n- TTS time-to-first-audio: \u003C150ms\n- Total end-to-end: \u003C800ms\n\n### Optimization strategies:\n\n1. Use low-latency models:\n   - STT: Deepgram Nova-3 (150ms) vs Whisper (500ms+)\n   - TTS: ElevenLabs Flash (75ms) vs standard (200ms+)\n   - LLM: gpt-4o-mini streaming\n\n2. Stream everything:\n   - Don't wait for full STT transcript\n   - Stream LLM output to TTS\n   - Start audio playback before TTS finishes\n\n3. Pre-compute:\n   - While user speaks, prepare context\n   - Generate opening phrase in parallel\n\n4. Edge deployment:\n   - Run VAD\u002FSTT at edge\n   - Use nearest cloud region\n\n### Measure continuously:\nLog timestamps at each stage, track P50\u002FP95 latency\n\n### Response Time Variance Disrupts Rhythm\n\nSeverity: HIGH\n\nSituation: Voice agent with inconsistent response times\n\nSymptoms:\nConversations feel unpredictable. 
User doesn't know when to speak.\nSometimes agent responds immediately, sometimes after long pause.\nUsers talk over agent. Agent talks over users.\n\nWhy this breaks:\nJitter (variance in response time) disrupts conversational rhythm\nmore than absolute latency. Consistent 800ms feels better than\nalternating 400ms and 1200ms. Users can't adapt to unpredictable\ntiming.\n\nRecommended fix:\n\n# Target jitter metrics:\n- Standard deviation: \u003C100ms\n- P95-P50 gap: \u003C200ms\n\n### Reduce jitter sources:\n\n1. Consistent model loading:\n   - Keep models warm\n   - Pre-load on connection start\n\n2. Buffer audio output:\n   - Small buffer (50-100ms) smooths playback\n   - Don't start playing until buffer filled\n\n3. Handle LLM variance:\n   - gpt-4o-mini more consistent than larger models\n   - Set max_tokens to limit long responses\n\n4. Monitor and alert:\n   - Track response time distribution\n   - Alert on jitter spikes\n\n### Implementation:\nconst MIN_RESPONSE_TIME = 400;  \u002F\u002F ms\n\nasync function respondWithConsistentTiming(text) {\n  const startTime = Date.now();\n  const audio = await generateSpeech(text);\n\n  const elapsed = Date.now() - startTime;\n  if (elapsed \u003C MIN_RESPONSE_TIME) {\n    await delay(MIN_RESPONSE_TIME - elapsed);\n  }\n\n  playAudio(audio);\n}\n\n### Using Silence Duration for Turn Detection\n\nSeverity: HIGH\n\nSituation: Detecting when user finishes speaking\n\nSymptoms:\nAgent interrupts user mid-thought. Or waits too long after user\nfinishes. \"Let me think...\" triggers premature response. Short\nanswers have awkward pause before response.\n\nWhy this breaks:\nSimple silence detection (e.g., \"end turn after 500ms silence\")\ndoesn't understand conversation. Humans pause mid-sentence.\n\"Yes.\" needs fast response, \"Well, let me think about that...\"\nneeds patience. 
A fixed timeout fits neither.\n\nRecommended fix:\n\n# Use semantic VAD:\n\n### OpenAI Semantic VAD:\nclient.updateSession({\n  turn_detection: {\n    type: 'semantic_vad',\n    \u002F\u002F Waits longer after \"umm...\"\n    \u002F\u002F Responds faster after \"Yes, that's correct.\"\n  },\n});\n\n### Pipecat SmartTurn:\nconst pipeline = new Pipeline({\n  vad: new SileroVAD(),\n  turnDetection: new SmartTurn(),\n});\n\n\u002F\u002F SmartTurn considers:\n\u002F\u002F - Speech content (complete sentence?)\n\u002F\u002F - Prosody (falling intonation?)\n\u002F\u002F - Context (question asked?)\n\n### Fallback: Adaptive silence threshold:\nfunction calculateSilenceThreshold(transcript) {\n  \u002F\u002F Allow trailing whitespace after the final punctuation mark\n  const endsWithComplete = transcript.match(\u002F[.!?]\\s*$\u002F);\n  \u002F\u002F Word boundaries keep \"umbrella\" or \"likely\" from counting as fillers\n  const hasFillers = transcript.match(\u002F\\b(um|uh|like|well)\\b\u002Fi);\n\n  if (endsWithComplete && !hasFillers) {\n    return 300;  \u002F\u002F Fast response\n  } else if (hasFillers) {\n    return 1500;  \u002F\u002F Wait for continuation\n  }\n  return 700;  \u002F\u002F Default\n}\n\n### Agent Doesn't Stop When User Interrupts\n\nSeverity: HIGH\n\nSituation: User tries to interrupt agent mid-sentence\n\nSymptoms:\nAgent talks over user. User has to wait for agent to finish.\nFrustrating experience. Users give up and abandon call.\n\"STOP! STOP!\" doesn't work.\n\nWhy this breaks:\nWithout barge-in handling, the TTS plays to completion regardless\nof user input. This violates basic conversational norms - in human\nconversation, we stop when interrupted.\n\nRecommended fix:\n\n# Implement barge-in detection:\n\n### Basic barge-in:\nvad.on('speech_start', () => {\n  if (ttsPlayer.isPlaying) {\n    \u002F\u002F 1. Stop audio immediately\n    ttsPlayer.stop();\n\n    \u002F\u002F 2. Cancel pending TTS generation\n    ttsController.abort();\n\n    \u002F\u002F 3. Checkpoint conversation state\n    conversationState.save();\n\n    \u002F\u002F 4. 
Listen to new input\n    startTranscription();\n  }\n});\n\n### Advanced: Distinguish interruption types:\nvad.on('speech_start', async () => {\n  if (!ttsPlayer.isPlaying) return;\n\n  \u002F\u002F Wait 200ms to get first words\n  await delay(200);\n  const firstWords = getTranscriptSoFar();\n\n  if (isBackchannel(firstWords)) {\n    \u002F\u002F \"uh-huh\", \"yeah\" - don't interrupt\n    return;\n  }\n\n  if (isClarification(firstWords)) {\n    \u002F\u002F \"What?\", \"Sorry?\" - repeat last sentence\n    repeatLastSentence();\n  } else {\n    \u002F\u002F Real interruption - stop and listen\n    handleFullInterruption();\n  }\n});\n\n### Response time target:\n- Barge-in response: \u003C200ms\n- User should feel heard immediately\n\n### Generating Text-Length Responses for Voice\n\nSeverity: MEDIUM\n\nSituation: Prompting LLM for voice agent responses\n\nSymptoms:\nAgent rambles. Users lose track of information. \"Can you repeat\nthat?\" requests. Users interrupt to ask for shorter version.\nLow comprehension of conveyed information.\n\nWhy this breaks:\nText can be scanned and re-read. Voice is linear and ephemeral.\nA 3-paragraph response that works in chat is overwhelming in voice.\nUsers can only hold ~7 items in working memory.\n\nRecommended fix:\n\n# Constrain response length in prompts:\n\nsystem_prompt = '''\nYou are a voice assistant. Keep responses UNDER 30 WORDS.\nFor complex information, break into chunks and confirm\nunderstanding between each.\n\nInstead of: \"Here are the three options. First, you could...\nSecond... Third...\"\n\nSay: \"I found 3 options. Want me to go through them?\"\n\nNever list more than 3 items without pausing for confirmation.\n'''\n\n### Enforce at generation:\nconst response = await openai.chat.completions.create({\n  max_tokens: 100,  \u002F\u002F Hard limit\n  \u002F\u002F ...\n});\n\n### Chunking pattern:\nif (information.length > 3) {\n  response = `I have ${information.length} items. 
Let's go through them one at a time. First: ${information[0]}. Ready for the next?`;\n}\n\n### Progressive disclosure:\n\"I found your account. Want the balance, recent transactions, or something else?\"\n\u002F\u002F Don't dump all info at once\n\n### Using Bullets\u002FNumbers\u002FMarkdown in Voice\n\nSeverity: MEDIUM\n\nSituation: Formatting LLM output for voice\n\nSymptoms:\n\"First bullet point: item one\" read aloud. Numbers read as \"one\ntwo three\" instead of \"one, two, three.\" Markdown artifacts in\nspeech. Robotic, unnatural delivery.\n\nWhy this breaks:\nTTS models read what they're given. Text formatting intended for\nvisual display sounds robotic when read aloud. Users can't \"see\"\nstructure in audio.\n\nRecommended fix:\n\n# Prompt for spoken format:\n\nsystem_prompt = '''\nFormat responses for SPOKEN delivery:\n- No bullet points, numbered lists, or markdown\n- Spell out numbers: \"twenty-three\" not \"23\"\n- Spell out abbreviations: \"United States\" not \"US\"\n- Use verbal signposting: \"There are three things. First...\"\n- Never use asterisks, dashes, or special characters\n'''\n\n### Post-processing:\nfunction prepareForSpeech(text) {\n  return text\n    \u002F\u002F Remove markdown\n    .replace(\u002F[*_#`]\u002Fg, '')\n    \u002F\u002F Convert numbers\n    .replace(\u002F\\d+\u002Fg, numToWords)\n    \u002F\u002F Expand abbreviations\n    .replace(\u002F\\betc\\b\u002Fgi, 'et cetera')\n    .replace(\u002F\\be\\.g\\.\u002Fgi, 'for example')\n    \u002F\u002F Add pauses\n    .replace(\u002F\\. \u002Fg, '... ')\n    .replace(\u002F, \u002Fg, '... ');\n}\n\n### SSML for precise control:\n\u003Cspeak>\n  The total is \u003Csay-as interpret-as=\"currency\">$49.99\u003C\u002Fsay-as>.\n  \u003Cbreak time=\"500ms\"\u002F>\n  Want to proceed?\n\u003C\u002Fspeak>\n\n### VAD\u002FSTT Fails in Noisy Environments\n\nSeverity: MEDIUM\n\nSituation: Users in cars, cafes, outdoors\n\nSymptoms:\n\"I didn't catch that\" frequently. 
Background noise triggers\nfalse starts. Fan\u002FAC causes continuous listening. Car engine\nnoise confuses STT.\n\nWhy this breaks:\nDefault VAD thresholds work for quiet environments. Real-world\nusage includes background noise that triggers false positives\nor masks speech, causing false negatives.\n\nRecommended fix:\n\n# Implement noise handling:\n\n### 1. Noise reduction in STT:\nconst transcription = await deepgram.transcription.live({\n  model: 'nova-3',\n  noise_reduction: true,\n  \u002F\u002F or\n  smart_format: true,\n});\n\n### 2. Adaptive VAD threshold:\n\u002F\u002F Measure ambient noise level\nconst ambientLevel = measureAmbientNoise(5000);  \u002F\u002F 5 sec sample\n\nvad.setThreshold(ambientLevel * 1.5);  \u002F\u002F Above ambient\n\n### 3. Confidence filtering:\nstt.on('transcript', (data) => {\n  if (data.confidence \u003C 0.7) {\n    \u002F\u002F Low confidence - probably noise\n    askForRepeat();\n    return;\n  }\n  processTranscript(data.transcript);\n});\n\n### 4. Echo cancellation:\n\u002F\u002F Prevent agent's voice from being transcribed\nconst echoCanceller = new EchoCanceller();\nechoCanceller.reference(ttsOutput);\nconst cleanedAudio = echoCanceller.process(userAudio);\n\n### STT Produces Incorrect or Hallucinated Text\n\nSeverity: MEDIUM\n\nSituation: Processing unclear or accented speech\n\nSymptoms:\nAgent responds to something user didn't say. Names consistently\nwrong. Technical terms misheard. \"I said X, not Y\" frustration.\n\nWhy this breaks:\nSTT models can hallucinate, especially on proper nouns, technical\nterms, or accented speech. These errors propagate through the\npipeline and produce nonsensical responses.\n\nRecommended fix:\n\n# Mitigate STT errors:\n\n### 1. Use keywords\u002Fbiasing:\nconst transcription = await deepgram.transcription.live({\n  keywords: ['Acme Corp', 'ProductName', 'John Smith'],\n  keyword_boost: 'high',\n});\n\n### 2. 
Confirmation for critical info:\nif (containsNameOrNumber(transcript)) {\n  response = `I heard \"${name}\". Is that correct?`;\n}\n\n### 3. Confidence-based fallback:\nif (confidence \u003C 0.8) {\n  response = `I think you said \"${transcript}\". Did I get that right?`;\n}\n\n### 4. Multiple hypothesis handling:\n\u002F\u002F Some STT APIs return n-best list\nconst alternatives = transcription.alternatives;\nif (alternatives[0].confidence - alternatives[1].confidence \u003C 0.1) {\n  \u002F\u002F Ambiguous - ask for clarification\n}\n\n### 5. Error correction patterns:\npromptPattern = `\n  User may correct previous mistakes. If they say \"no, I said X\"\n  or \"not Y, Z\", update your understanding accordingly.\n`;\n\n## Validation Checks\n\n### Missing Latency Measurement\n\nSeverity: ERROR\n\nVoice agents must track latency at each stage\n\nMessage: Voice pipeline without latency tracking. Add timestamps at each stage to measure performance.\n\n### Using Batch STT Instead of Streaming\n\nSeverity: WARNING\n\nStreaming STT reduces latency significantly\n\nMessage: Using batch transcription. Consider streaming for lower latency in voice agents.\n\n### TTS Without Streaming Output\n\nSeverity: WARNING\n\nStreaming TTS reduces time to first audio\n\nMessage: TTS without streaming. Stream audio to reduce time to first audio.\n\n### Hardcoded VAD Silence Threshold\n\nSeverity: WARNING\n\nFixed silence thresholds don't adapt to conversation\n\nMessage: Fixed silence threshold. Consider semantic VAD or adaptive thresholds for better turn-taking.\n\n### Missing Barge-In Handling\n\nSeverity: WARNING\n\nVoice agents should stop when user interrupts\n\nMessage: VAD without barge-in handling. Stop TTS when user starts speaking.\n\n### Voice Prompt Without Length Constraints\n\nSeverity: WARNING\n\nVoice prompts should constrain response length\n\nMessage: Voice prompt without length constraints. 
Add 'Keep responses under 30 words' to system prompt.\n\n### Markdown Formatting Sent to TTS\n\nSeverity: WARNING\n\nMarkdown will be read literally by TTS\n\nMessage: Check for markdown in TTS input. Strip formatting before sending to TTS.\n\n### STT Without Error Handling\n\nSeverity: WARNING\n\nSTT can fail or return low confidence\n\nMessage: STT without error handling. Check confidence scores and handle failures.\n\n### WebSocket Without Reconnection\n\nSeverity: WARNING\n\nRealtime APIs need reconnection handling\n\nMessage: Realtime connection without reconnection logic. Handle disconnects gracefully.\n\n### Missing Noise Handling\n\nSeverity: INFO\n\nReal-world audio includes background noise\n\nMessage: Consider adding noise handling for real-world audio quality.\n\n## Collaboration\n\n### Delegation Triggers\n\n- user needs phone\u002Ftelephony integration -> backend (Twilio, Vonage, SIP integration)\n- user needs LLM optimization -> llm-architect (Model selection, prompting, fine-tuning)\n- user needs tools for voice agent -> agent-tool-builder (Tool design for voice context)\n- user needs multi-agent voice system -> multi-agent-orchestration (Voice agents working together)\n- user needs accessibility compliance -> accessibility-specialist (Voice interface accessibility)\n\n## Related Skills\n\nWorks well with: `agent-tool-builder`, `multi-agent-orchestration`, `llm-architect`, `backend`\n\n## When to Use\n- User mentions or implies: voice agent\n- User mentions or implies: speech to text\n- User mentions or implies: text to speech\n- User mentions or implies: whisper\n- User mentions or implies: elevenlabs\n- User mentions or implies: deepgram\n- User mentions or implies: realtime api\n- User mentions or implies: voice assistant\n- User mentions or implies: voice ai\n- User mentions or implies: conversational ai\n- User mentions or implies: tts\n- User mentions or implies: stt\n- User mentions or implies: asr\n\n## Limitations\n- Use this skill only 
when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.\n","","imported","https:\u002F\u002Fgithub.com\u002Fsickn33\u002Fantigravity-awesome-skills","user_system_seed","SkillOPIC",true,202,1947,"2026-05-16 13:46:30",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"Other","other","mdi-page-next-outline","Other Skill types",5,"2026-05-16 12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":32,"skillCount":33,"createdAt":26},"Career Development","career","mdi-briefcase-outline","Interview preparation, resume optimization, career planning",4,575,[35],{"id":36,"skillId":4,"version":37,"fileName":38,"fileSize":39,"filePath":40,"fileHash":41,"manifest":42,"createdAt":19},"d1ec9a9f-8976-4ab6-9fd2-3caa7e44dd1f","1.0.0","voice-agents.zip",10191,"uploads\u002Fskills\u002F050d4528-ad90-48c3-b889-bed479178974\u002Fvoice-agents.zip","b8dcc0b1fd81f83b3a2701215d14eec5facebd5a7ed60080a4660590f6015cc6","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":26115}]",{"code":44,"message":45,"data":46},200,"success",{"items":47,"stats":48,"page":51},[],{"averageRating":49,"totalRatings":49,"ratingCounts":50},0,[49,49,49,49,49],{"limit":52,"offset":49,"hasMore":53,"nextOffset":52,"ratedOnly":16},15,false]