╔══════════════════════════════════════════════════════════════╗
║  EXPERIMENT: TTS Voiceover Generation                        ║
║  KEY METRIC: 6 scenes, ~1.7MB audio, ~3 min runtime          ║
║  workway.co                                                  ║
╚══════════════════════════════════════════════════════════════╝
Can AI-generated voiceover replace manual recording for product walkthrough videos?
Claim: ElevenLabs TTS with SSML break tags, optimized voice settings, and "Nicely Said" script writing can produce natural-sounding voiceovers for product walkthroughs without manual recording.
| Iteration | Change | Result |
|---|---|---|
| 1 | Baseline voice (River) | Too fast, odd inflections |
| 2 | Try different voices (Dakota, Mark, Hope) | Better tone, still unnatural pacing |
| 3 | Rewrite script with natural cadence cues | Significant improvement |
| 4 | Add SSML break tags + speed 0.9 | Natural, conversational delivery |
| 5 | Final voice (Jamahal) + stability 0.55 | Production ready |
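Iteration 4 can be illustrated with a small helper that inserts break tags into a script before it is sent for synthesis. This is a minimal sketch, not the project's actual code: the function name `addBreaks` is hypothetical, and mapping the shorter pause to sentence boundaries and the longer pause to paragraph boundaries is an assumption. The write-up only states that breaks between 0.3s and 0.7s were used.

```javascript
// Hypothetical helper. The 0.3s and 0.7s values come from the write-up;
// assigning them to sentence and paragraph boundaries is an assumption.
function addBreaks(script, { sentencePause = 0.3, paragraphPause = 0.7 } = {}) {
  const tag = (seconds) => `<break time="${seconds}s" />`;
  return script
    .split(/\n\s*\n/)                           // blank lines separate paragraphs
    .map((p) => p.trim().replace(/\s+/g, ' '))  // collapse internal whitespace
    .filter((p) => p.length > 0)
    // Insert a short pause after sentence-ending punctuation that is followed by more text.
    .map((p) => p.replace(/([.!?])\s+(?=\S)/g, `$1 ${tag(sentencePause)} `))
    // Insert a longer pause between paragraphs.
    .join(` ${tag(paragraphPause)} `);
}

console.log(addBreaks('One. Two.\n\nThree.'));
```

A punctuation-based split like this will also add pauses after abbreviations such as "e.g.", so hand-placed tags in the script may still give better results for short scenes.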
Avoid
Use
| Metric | Value |
|---|---|
| Total scenes | 6 |
| Total audio size | ~1.7 MB |
| Runtime | ~3 minutes |
| Generation time per iteration | ~13 seconds |
| Voices tested | 6 |
| Total iterations | 5 |
Voice: Jamahal (DTKMou8ccj1ZaWGBiotd)
Model: eleven_turbo_v2_5
Settings:
  stability: 0.55
  similarity_boost: 0.75
  style: 0.0
  speed: 0.9
SSML: <break time="0.3s" /> to <break time="0.7s" />
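To show how these settings might map onto an API call, here is a sketch of a request against the ElevenLabs text-to-speech endpoint. The contents of `generate-focus-voiceover.js` are not shown in this write-up, so the function names `buildTtsRequest` and `synthesize` are hypothetical. Placing `speed` inside `voice_settings` and using the `xi-api-key` header reflect my understanding of the public API and should be checked against the current ElevenLabs reference. The sketch assumes Node 18 or later for the global `fetch`.

```javascript
// Values below are copied from the configuration in this write-up.
const VOICE_ID = 'DTKMou8ccj1ZaWGBiotd'; // Jamahal
const MODEL_ID = 'eleven_turbo_v2_5';
const VOICE_SETTINGS = { stability: 0.55, similarity_boost: 0.75, style: 0.0, speed: 0.9 };

// Hypothetical helper: builds the URL and fetch options without touching the network.
function buildTtsRequest(text, apiKey) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`,
    options: {
      method: 'POST',
      headers: {
        'xi-api-key': apiKey, // assumed auth header name
        'Content-Type': 'application/json',
        Accept: 'audio/mpeg',
      },
      body: JSON.stringify({ text, model_id: MODEL_ID, voice_settings: VOICE_SETTINGS }),
    },
  };
}

// Hypothetical helper: sends the request and returns the MP3 bytes as a Buffer.
async function synthesize(text, apiKey = process.env.ELEVENLABS_API_KEY) {
  const { url, options } = buildTtsRequest(text, apiKey);
  const res = await fetch(url, options);
  if (!res.ok) {
    throw new Error(`TTS request failed: ${res.status} ${await res.text()}`);
  }
  return Buffer.from(await res.arrayBuffer());
}
```

Keeping the request builder separate from the network call makes the settings easy to inspect or log between iterations without spending API credits.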
The final voiceover for the Focus Workflow walkthrough. Six scenes, ~3 minutes total.
Voice: Jamahal · Model: eleven_turbo_v2_5 · Speed: 0.9x
workway-platform/
├── scripts/generate-focus-voiceover.js # Generation script
├── docs/FOCUS_WORKFLOW_SCREEN_CAPTURE_SCRIPT.md
└── docs/voiceover-audio/
    ├── 01-problem.mp3
    ├── 02-what-focus-does.mp3
    ├── 03-working.mp3
    ├── 04-completing.mp3
    ├── 05-notion.mp3
    ├── 06-setup.mp3
    ├── 07-close.mp3
    └── VOICEOVER_SCRIPT.md

# Set API key
export ELEVENLABS_API_KEY=sk_...

# Generate audio
cd workway-platform
node scripts/generate-focus-voiceover.js
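The numbered file layout above suggests a per-scene generation loop. The following is a sketch of how such a loop could look, not the contents of the real script: `outputPath`, `generateAll`, the `{ slug, text }` scene shape, and the injected `synthesize` function are all hypothetical names introduced for illustration.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Build a zero-padded file name such as "01-problem.mp3", matching the tree above.
function outputPath(outDir, index, slug) {
  const prefix = String(index).padStart(2, '0');
  return path.join(outDir, `${prefix}-${slug}.mp3`);
}

// `synthesize` is a hypothetical async function (text) => Buffer supplied by the caller.
// Scenes are processed one at a time so a failure stops at a known scene.
async function generateAll(scenes, outDir, synthesize) {
  await fs.mkdir(outDir, { recursive: true });
  const written = [];
  for (const [i, scene] of scenes.entries()) {
    const audio = await synthesize(scene.text);
    const file = outputPath(outDir, i + 1, scene.slug);
    await fs.writeFile(file, audio);
    written.push(file);
  }
  return written;
}
```

Injecting `synthesize` keeps the loop testable with a stub and lets a single scene be regenerated by passing a one-element `scenes` array.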
ElevenLabs TTS with SSML breaks and optimized script writing produces natural-sounding voiceover suitable for product walkthrough videos.
Evidence: 6 scenes, ~3 minutes of audio, natural conversational delivery achieved in 5 iterations.