╔══════════════════════════════════════════════════════════════╗
║  EXPERIMENT: TTS Voiceover Generation                        ║
║  KEY METRIC: 6 scenes, ~1.7MB audio, ~3 min runtime          ║
║  workway.co                                                  ║
╚══════════════════════════════════════════════════════════════╝
Validated · January 15, 2026

TTS Voiceover Generation

Can AI-generated voiceover replace manual recording for product walkthrough videos?

Hypothesis

Claim: ElevenLabs TTS with SSML break tags, optimized voice settings, and "Nicely Said" script writing can produce natural-sounding voiceovers for product walkthroughs without manual recording.

Success Criteria

  • Generate 7 audio scenes covering complete workflow walkthrough
  • Audio sounds conversational, not robotic or "announcer-like"
  • Script iteration cycle under 5 minutes per adjustment
  • Total generation cost under $5

Methodology

What Was Built

  • Screen capture script following "Nicely Said" framework (Fenton/Lee)
  • Node.js script to generate audio via ElevenLabs API
  • 6-scene voiceover for Focus Workflow walkthrough
  • SSML break tags for natural pacing

Iteration Cycles

Iteration  Change                                      Result
1          Baseline voice (River)                      Too fast, odd inflections
2          Try different voices (Dakota, Mark, Hope)   Better tone, still unnatural pacing
3          Rewrite script with natural cadence cues    Significant improvement
4          Add SSML break tags + speed 0.9             Natural, conversational delivery
5          Final voice (Jamahal) + stability 0.55      Production ready

Key Learnings: Script Writing for TTS

Avoid

  • Choppy periods: "Task. Duration. Date."
  • Numbered lists: "One: Two: Three:"
  • Em dashes for pauses
  • Brand names without context

Use

  • Flowing commas: "The task, the duration, the date"
  • Word ordinals: "First, Second, Third"
  • SSML breaks: <break time="0.5s" />
  • Spelled-out numbers: "twenty-three"
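
Some of the "Avoid" patterns above can be checked mechanically before a script is sent for generation. The following is a minimal sketch, not part of the original generation script: the function name lintScript, the rule ids, and the regular expressions are all illustrative assumptions, and the checks are rough heuristics. "Brand names without context" is left out because it needs human judgment.

```javascript
// Heuristic lint for TTS scripts, based on the "Avoid" list above.
// All names and patterns here are illustrative assumptions.
const RULES = [
  // Em dashes used as pauses; the notes above prefer commas or SSML breaks.
  { id: "em-dash", pattern: /\u2014/ },
  // Digits in spoken text; the notes above prefer spelled-out numbers.
  { id: "digits", pattern: /\d/ },
  // Numbered-list labels such as "One:"; the notes above prefer "First, Second".
  { id: "numbered-label", pattern: /\b(?:One|Two|Three|Four|Five):/i },
  // Three or more consecutive one-word sentences, e.g. "Task. Duration. Date."
  { id: "choppy-periods", pattern: /(?:^|[.!?]\s+)(?:[A-Za-z]+\.\s*){3,}/ },
];

function lintScript(text) {
  // Strip SSML tags first so the digits inside <break time="0.5s" />
  // are not reported as spoken numbers.
  const spoken = text.replace(/<[^>]+>/g, " ");
  // Return the ids of every rule that matches, in rule order.
  return RULES.filter((rule) => rule.pattern.test(spoken)).map((rule) => rule.id);
}

console.log(lintScript("Task. Duration. Date."));
console.log(lintScript('The task, <break time="0.5s" /> the duration, the date.'));
```

An empty result only means none of these heuristics fired; listening to the generated audio remains the real test.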

Results

Metric                          Value
Total scenes                    6
Total audio size                ~1.7 MB
Runtime                         ~3 minutes
Generation time per iteration   ~13 seconds
Voices tested                   6
Total iterations                5

Final Configuration

Voice: Jamahal (DTKMou8ccj1ZaWGBiotd)
Model: eleven_turbo_v2_5
Settings:
  stability: 0.55
  similarity_boost: 0.75
  style: 0.0
  speed: 0.9
  
SSML: <break time="0.3s" /> to <break time="0.7s" />
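
As a rough illustration of how this configuration could map onto an API call, here is a minimal sketch of a single-scene request against the ElevenLabs text-to-speech HTTP endpoint. It is not the contents of generate-focus-voiceover.js: the function names buildTtsRequest and synthesizeScene are hypothetical, and placing speed inside voice_settings is an assumption about how the original script passed it.

```javascript
// Sketch only. Values come from the final configuration above;
// function names and request layout are illustrative assumptions.
const VOICE_ID = "DTKMou8ccj1ZaWGBiotd"; // Jamahal
const MODEL_ID = "eleven_turbo_v2_5";
const VOICE_SETTINGS = {
  stability: 0.55,
  similarity_boost: 0.75,
  style: 0.0,
  speed: 0.9, // assumption: sent as part of voice_settings
};

// Build the URL and fetch options for one scene without sending anything.
function buildTtsRequest(text, apiKey) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`,
    options: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey,
        "Content-Type": "application/json",
        Accept: "audio/mpeg",
      },
      body: JSON.stringify({
        text, // may contain <break time="0.5s" /> tags
        model_id: MODEL_ID,
        voice_settings: VOICE_SETTINGS,
      }),
    },
  };
}

// Send the request and write the returned MP3 bytes to outPath.
// Uses the global fetch available in Node.js 18+.
async function synthesizeScene(text, outPath, apiKey = process.env.ELEVENLABS_API_KEY) {
  const { writeFile } = await import("node:fs/promises");
  const { url, options } = buildTtsRequest(text, apiKey);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`TTS request failed: ${response.status} ${await response.text()}`);
  }
  await writeFile(outPath, Buffer.from(await response.arrayBuffer()));
}
```

Keeping request construction separate from the network call makes it possible to inspect or log the payload when tuning settings between iterations.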

Listen

The final voiceover for the Focus Workflow walkthrough. Six scenes, ~3 minutes total.

What Focus Does                 ~45s
Working Without Interruption    ~18s
Completing the Block            ~25s
Automatic Logging to Notion     ~30s
The Setup                       ~35s
Close                           ~18s

Voice: Jamahal · Model: eleven_turbo_v2_5 · Speed: 0.9x

Honest Assessment

What This Proves

  • TTS can produce usable voiceover for informational/product content
  • Script writing matters more than voice selection
  • SSML breaks are essential for natural pacing
  • Iteration is fast enough (~13s) to experiment freely

What This Doesn't Prove

  • That TTS works for emotional or storytelling content
  • That this voice works for all demographics
  • That long-form content (30+ minutes) works

Where Intervention Was Needed

  • Script rewriting: Original script had unnatural phrasing for TTS
  • Voice selection: Required listening to multiple options
  • SSML tuning: Break timing required iteration
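
Because break timing needed iteration, it can help to keep pause lengths as parameters rather than hand-typed tags. The helper below is a small sketch under that assumption; the names breakTag and joinWithBreaks are hypothetical, and clamping to the 0.3 to 0.7 second range is a choice made for this example, borrowed from the final configuration.

```javascript
// Illustrative helpers for tuning SSML pauses; not from the original script.
const MIN_BREAK_S = 0.3; // lower bound taken from the final configuration
const MAX_BREAK_S = 0.7; // upper bound taken from the final configuration

// Build a break tag, clamping the duration to the tuned range
// and rounding it to one decimal place.
function breakTag(seconds) {
  const clamped = Math.min(MAX_BREAK_S, Math.max(MIN_BREAK_S, seconds));
  return `<break time="${clamped.toFixed(1)}s" />`;
}

// Join phrases with the same pause between each pair.
function joinWithBreaks(phrases, seconds = 0.5) {
  return phrases.join(` ${breakTag(seconds)} `);
}

console.log(joinWithBreaks(["Start the block.", "Then get to work."], 0.3));
```

Changing one number and regenerating fits the short iteration cycle reported in the results.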

Reproducibility

Prerequisites

  • ElevenLabs account (Creator tier for premium voices)
  • Node.js 18+
  • API key from ElevenLabs dashboard

Files

workway-platform/
├── scripts/generate-focus-voiceover.js   # Generation script
├── docs/FOCUS_WORKFLOW_SCREEN_CAPTURE_SCRIPT.md
└── docs/voiceover-audio/
    ├── 01-problem.mp3
    ├── 02-what-focus-does.mp3
    ├── 03-working.mp3
    ├── 04-completing.mp3
    ├── 05-notion.mp3
    ├── 06-setup.mp3
    ├── 07-close.mp3
    └── VOICEOVER_SCRIPT.md

Run Command

# Set API key
export ELEVENLABS_API_KEY=sk_...

# Generate audio
cd workway-platform
node scripts/generate-focus-voiceover.js

Outcome

Hypothesis Validated

ElevenLabs TTS with SSML breaks and optimized script writing produces natural-sounding voiceover suitable for product walkthrough videos.

Evidence: 6 scenes, ~3 minutes of audio, natural conversational delivery achieved in 5 iterations.

Next Steps

  • Record screen capture synced to audio
  • Test with actual users for clarity
  • Create template for future workflow walkthroughs