audio-gen

Generate audiobooks, podcasts, or educational audio content on demand.

Installation

npx clawhub@latest install audio-gen

Documentation

🎙️ Audio Content Generator

Generate high-quality audiobooks, podcasts, or educational audio content on demand using AI-written scripts and ElevenLabs text-to-speech.

Quick Start

Create an audiobook chapter:

User: "Create a 5-minute audiobook chapter about a dragon discovering friendship"

Generate a podcast:

User: "Make a 10-minute podcast about the history of coffee"

Produce educational content:

User: "Generate a 15-minute educational audio explaining how neural networks work"

Content Formats

Audiobook

Style: Narrative storytelling with emotional depth
  • Clear beginning, middle, and end
  • Descriptive language and vivid imagery
  • Dramatic pacing with thoughtful pauses
  • Emotional tone that matches the story
  • Use voice effects like [whispers], [excited], [serious] for impact
Example Structure:
[Opening hook - set the scene]
[long pause]

[Story development with character emotions]
[short pause] between sentences
[long pause] between paragraphs

[Climax with dramatic tension]
[long pause]

[Resolution and emotional closure]

Podcast

Style: Conversational and engaging
  • Warm, welcoming intro (15-30 seconds)
  • Main content with natural flow
  • Transitions between topics
  • Memorable outro with key takeaways
  • Conversational tone throughout
Example Structure:
**Intro:** "Welcome to [topic]. I'm excited to share..."
[short pause]

**Main Content:** "Let's start with... [topic 1]"
[long pause] between segments

**Outro:** "Thanks for listening! Remember..."

Educational Content

Style: Clear explanations for learning
  • Simple introductions to complex topics
  • Step-by-step breakdowns
  • Real-world examples and analogies
  • Recap of key concepts at the end
  • Enthusiastic delivery with [excited] for important points
Example Structure:
**Introduction:** What is [topic] and why it matters?

**Main Content:**
- Concept 1: Explanation + Example
- Concept 2: Explanation + Example
- Concept 3: Explanation + Example

**Summary:** Key takeaways and next steps

Length Guidelines

Word Count to Duration Conversion:

  • 5 minutes = ~375 words

  • 10 minutes = ~750 words

  • 15 minutes = ~1,125 words

  • 20 minutes = ~1,500 words

  • 30 minutes = ~2,250 words


Pacing: All estimates assume ~75 words per minute of finished audio

Practical Limits:

  • Minimum: 2 minutes (~150 words)

  • Maximum: 30 minutes (~2,250 words)

  • Sweet spot: 5-15 minutes for best engagement


Workflow Instructions

Step 1: Understand the Request

Parse the user's request for:

  • Content type (audiobook, podcast, educational, or inferred from topic)

  • Topic/theme (what should the content be about)

  • Target length (how many minutes)

  • Tone/style (dramatic, casual, educational, etc.)

  • Special requests (specific voice, emphasis on certain points)
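The fields above can be captured in a small structure. The sketch below is illustrative only: `AudioRequest`, `parse_request`, and the keyword table are hypothetical names that are not part of the skill, and an agent would normally infer these fields from context rather than from pattern matching.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword table (an assumption for this sketch, not skill config).
FORMAT_KEYWORDS = {
    "audiobook": "audiobook",
    "story": "audiobook",
    "podcast": "podcast",
    "educational": "educational",
    "lesson": "educational",
    "explain": "educational",
}


@dataclass
class AudioRequest:
    content_type: Optional[str]    # None means "infer from the topic"
    target_minutes: Optional[int]  # None means "ask the user"
    raw_text: str


def parse_request(text: str) -> AudioRequest:
    lowered = text.lower()
    # Matches "5-minute", "10 minutes", "15 min".
    match = re.search(r"(\d+)[-\s]?min(?:ute)?s?\b", lowered)
    minutes = int(match.group(1)) if match else None
    # Match keywords at a word start so "history" does not count as "story".
    content_type = next(
        (fmt for kw, fmt in FORMAT_KEYWORDS.items()
         if re.search(rf"\b{re.escape(kw)}", lowered)),
        None,
    )
    return AudioRequest(content_type, minutes, text)
```

Tone, voice, and other special requests are left in `raw_text` here, since they are hard to extract reliably with keywords.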
Step 2: Calculate Word Count

    target_words = target_minutes × 75

    Example: 10 minutes = 10 × 75 = 750 words
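As a minimal sketch, the same calculation can be written as a helper that also enforces the 2 to 30 minute limits from the Length Guidelines. The function and constant names are hypothetical.

```python
WORDS_PER_MINUTE = 75  # pacing figure used throughout this document
MIN_MINUTES = 2        # practical minimum from the Length Guidelines
MAX_MINUTES = 30       # practical maximum from the Length Guidelines


def target_word_count(target_minutes: float) -> int:
    """Return the word budget for a requested duration."""
    if not MIN_MINUTES <= target_minutes <= MAX_MINUTES:
        raise ValueError(
            f"target must be between {MIN_MINUTES} and {MAX_MINUTES} minutes"
        )
    return round(target_minutes * WORDS_PER_MINUTE)


def estimated_minutes(word_count: int) -> float:
    """Inverse conversion: estimate duration from a finished script."""
    return word_count / WORDS_PER_MINUTE
```

For example, `target_word_count(10)` gives 750, matching the table in the Length Guidelines.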

Step 3: Generate the Script

    Write the complete script following these rules:

    Content Guidelines:

    • Start strong with an engaging hook

    • Maintain natural, conversational flow

    • Use active voice and simple sentence structure

    • Include relevant examples and stories

    • End with a satisfying conclusion


    Formatting Rules:
    • Add [short pause] after sentences (use sparingly, not every sentence)

    • Add [long pause] between paragraphs or major sections

    • Use voice effects strategically: [whispers], [shouts], [excited], [serious], [sarcastic], [sings], [laughs]

    • Write numbers as words: "twenty-three" not "23"

    • Spell out acronyms first time: "AI, or artificial intelligence"

    • Avoid complex punctuation (em-dashes work, but semicolons don't read well)

    • Remove markdown formatting before TTS conversion
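The last rule, removing markdown before TTS conversion, could be approximated with a few regular expressions. This is a rough sketch under the assumption that scripts only contain headers, bullets, bold, and italics. `strip_markdown` is a hypothetical helper, and it deliberately leaves bracketed tags such as [short pause] untouched.

```python
import re


def strip_markdown(script: str) -> str:
    """Remove common markdown markers while keeping [bracketed] voice tags."""
    # Headers: "## Title" -> "Title"
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", script, flags=re.MULTILINE)
    # Bullet markers at the start of a line: "- item" -> "item"
    text = re.sub(r"^[ \t]*[-*][ \t]+", "", text, flags=re.MULTILINE)
    # Bold: **text** or __text__
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    # Italics: *text*, and _text_ when not inside a word
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    text = re.sub(r"(?<!\w)_(.+?)_(?!\w)", r"\1", text)
    return text.strip()
```

Links, tables, and code spans are not handled here, so a script that contains them would need extra rules.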


Step 4: Present the Script

    Show the script to the user and ask:

    Here's the [format] script I've created (approximately [length] minutes):
    
    [Display the script]
    
    Would you like me to:
    1. Generate the audio now
    2. Make changes to the script
    3. Adjust the length or tone

Step 5: Handle User Feedback

    If user requests changes:

    • Regenerate the script with adjustments

    • Maintain the target word count

    • Present the revised version


    If user approves:
    • Proceed to audio generation


Step 6: Generate Audio

    Format the script for TTS:

  • Remove any remaining markdown (headers, bold, italics)

  • Ensure voice effects are in proper [effect] format

  • Check that pauses are appropriately placed

  • Verify numbers and acronyms are spelled out
    Invoke the TTS script:

    IMPORTANT: The ELEVENLABS_API_KEY environment variable is already configured in the system. Simply invoke the TTS script directly.

    uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
      -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
      -m eleven_multilingual_v2 \
      "[formatted_script]"

    For long scripts, pass the text through a heredoc:

    uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
      -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
      -m eleven_multilingual_v2 \
      "$(cat <<'EOF'
    [formatted_script]
    EOF
    )"
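If the command is launched from Python rather than a shell, passing the arguments as a list avoids quoting problems in long scripts. The sketch below reuses the script path and the `-o` and `-m` flags exactly as they appear in the commands above. `build_tts_command` and `run_tts` are hypothetical helper names, and nothing here is confirmed about how `tts.py` behaves beyond those flags.

```python
import subprocess
from typing import List

# Path copied from the commands above.
TTS_SCRIPT = "/home/clawdbot/clawdbot/skills/sag/scripts/tts.py"


def build_tts_command(script_text: str, output_path: str,
                      model: str = "eleven_multilingual_v2") -> List[str]:
    """Build the argument list; no shell is involved, so no escaping is needed."""
    return ["uv", "run", TTS_SCRIPT, "-o", output_path, "-m", model, script_text]


def run_tts(script_text: str, output_path: str) -> subprocess.CompletedProcess:
    """Run the TTS script and capture stdout/stderr for error reporting."""
    cmd = build_tts_command(script_text, output_path)
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

A non-zero `returncode` on the result, together with its `stderr`, can feed the error message described under TTS Generation Fails.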

    Return the result:

    MEDIA:/tmp/audio-gen-[timestamp]-[topic-slug].mp3
    
    Your [format] is ready! [Brief description of content]. Duration: approximately [X] minutes.

Voice Effects (Bracketed Tags)

    Available voice modulation effects (use sparingly for impact):

    • [whispers] - Soft, intimate delivery
    • [shouts] - Loud, emphatic delivery
    • [excited] - Enthusiastic, energetic tone
    • [serious] - Grave, solemn tone
    • [sarcastic] - Ironic, mocking tone
    • [sings] - Musical, melodic delivery
    • [laughs] - Amused, jovial tone
    • [short pause] - Brief silence (~0.5s)
    • [long pause] - Extended silence (~1-2s)
    Best Practices:
    • Use effects for emotional moments, not every sentence
    • Pauses are your most powerful tool for pacing
    • Voice effects work best in audiobooks and dramatic content
    • Keep podcasts and educational content mostly natural
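Before sending a script to TTS, it can help to check that every bracketed tag is one of the effects listed above. A minimal sketch, where `KNOWN_TAGS` mirrors that list and `unknown_tags` is a hypothetical helper:

```python
import re
from typing import List

# Tags copied from the list above.
KNOWN_TAGS = {
    "whispers", "shouts", "excited", "serious", "sarcastic",
    "sings", "laughs", "short pause", "long pause",
}


def unknown_tags(script: str) -> List[str]:
    """Return bracketed tags that are not in the documented list."""
    found = re.findall(r"\[([^\[\]]+)\]", script)
    return sorted({tag for tag in found if tag.strip().lower() not in KNOWN_TAGS})
```

Leftover template placeholders such as [topic 1] would also be reported, which is usually what you want before generating audio.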

Error Handling

    Script Too Long

    If the generated script exceeds target by >20%:
    The script I generated is [X] words ([Y] minutes), which is longer than your target of [Z] minutes. Would you like me to:
    1. Condense it to fit the target length
    2. Split it into multiple parts
    3. Keep it as is

    Script Too Short

    If the generated script is under target by >20%:
    The script is [X] words ([Y] minutes), shorter than your target. Would you like me to:
    1. Expand it with more detail
    2. Add additional examples or stories
    3. Generate as is
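The two 20% checks can be expressed as one function. This is a sketch with hypothetical names. It assumes bracketed tags such as [long pause] should not count toward the word total, which the skill does not specify.

```python
import re

WORDS_PER_MINUTE = 75  # pacing figure used throughout this document
TOLERANCE = 0.20       # the ">20%" threshold from the two cases above


def count_spoken_words(script: str) -> int:
    """Count words, ignoring bracketed tags (an assumption of this sketch)."""
    without_tags = re.sub(r"\[[^\[\]]*\]", " ", script)
    return len(without_tags.split())


def check_length(script: str, target_minutes: float) -> str:
    """Return 'too_long', 'too_short', or 'ok' relative to the target."""
    actual = count_spoken_words(script)
    target = target_minutes * WORDS_PER_MINUTE
    if actual > target * (1 + TOLERANCE):
        return "too_long"
    if actual < target * (1 - TOLERANCE):
        return "too_short"
    return "ok"
```

For a 10-minute target (750 words), anything above 900 or below 600 spoken words would trigger the matching prompt.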

    TTS Generation Fails

    If the TTS script fails:
    I've created the script, but I'm unable to generate the audio right now. Here's your script:
    
    [Display script]
    
    Error: [specific error message]
    
    You can:
    1. Check that ELEVENLABS_API_KEY is configured
    2. Use the script with your own text-to-speech tool
    3. Try again in a moment
    4. Ask me to troubleshoot the audio generation

    Common TTS Issues:

    • API key not set: Verify that the ELEVENLABS_API_KEY environment variable is set

    • Rate limit: Wait a moment and try again

    • Text too long: Break into smaller chunks (max ~5000 characters)
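At ~2,250 words, a 30-minute script will exceed the ~5000 character figure above, so some form of chunking is needed. The sketch below splits on sentence boundaries and falls back to a hard split for a single oversized sentence. `chunk_script` is a hypothetical helper, and joining the resulting audio files is not covered.

```python
import re
from typing import List

MAX_CHARS = 5000  # approximate limit quoted above


def chunk_script(script: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split a script into chunks of at most max_chars, at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is split mid-sentence.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence ends keeps each chunk readable on its own, though a pause tag that follows a sentence may land at the start of the next chunk.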


    Invalid Request


    For unrealistic requests (e.g., "100-hour audiobook"):
    That length would require [X] words and take significant time to generate. I recommend:
    - Breaking it into multiple episodes/chapters
    - Targeting 5-30 minutes per audio file
    - Creating a series instead of one long file
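One way to turn an oversized request into a series is to divide it into the fewest equal parts that each fit the 30-minute limit. A minimal sketch with a hypothetical `plan_parts` helper:

```python
import math
from typing import List

MAX_MINUTES_PER_FILE = 30  # per-file limit stated in this document


def plan_parts(total_minutes: float,
               max_per_file: float = MAX_MINUTES_PER_FILE) -> List[float]:
    """Split a total duration into equal parts no longer than max_per_file."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    parts = math.ceil(total_minutes / max_per_file)
    return [total_minutes / parts] * parts
```

A 100-hour request (6,000 minutes) comes out as 200 parts of 30 minutes each, which makes the scale of the request easy to explain to the user.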

Tips for Best Results

    For Engaging Audiobooks

    • Focus on character emotions and sensory details
    • Use pauses to build dramatic tension
    • Vary sentence length for rhythm
    • Include internal monologue and reflection

    For Compelling Podcasts

    • Start with a question or surprising fact
    • Use conversational phrases: "You know what's interesting..."
    • Include relatable examples from everyday life
    • End with actionable takeaways

    For Effective Educational Content

    • Use the "explain like I'm five" approach
    • Build from simple to complex concepts
    • Repeat key terms and definitions
    • Provide multiple examples for clarity

Technical Notes

    TTS Implementation:

    • Uses Python script: /home/clawdbot/clawdbot/skills/sag/scripts/tts.py

    • No binary installation required (pure Python + requests)

    • Directly calls ElevenLabs API

    • Compatible with Linux and macOS


    File Storage:
    • Audio files are saved to /tmp/

    • Filename format: audio-gen-[timestamp]-[topic-slug].mp3

    • Files are automatically cleaned up after 24 hours
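An output path following the pattern used in the commands in this document (`/tmp/audio-gen-[timestamp]-[topic-slug].mp3`) could be built as follows. This is a sketch with hypothetical helper names. It assumes the timestamp is Unix seconds, as the example conversations suggest, and that the slug is lowercase words joined by hyphens.

```python
import re
import time
from typing import Optional


def topic_slug(topic: str, max_len: int = 40) -> str:
    """Lowercase the topic and collapse non-alphanumerics into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return slug[:max_len].rstrip("-")


def output_path(topic: str, timestamp: Optional[int] = None) -> str:
    """Build the MP3 path; timestamp defaults to the current Unix time."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"/tmp/audio-gen-{ts}-{topic_slug(topic)}.mp3"
```

The `max_len` cap is an arbitrary choice that keeps filenames short for long topics.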


    API Requirements:
    • Anthropic API for script generation (already configured)

    • ElevenLabs API for text-to-speech (configured via ELEVENLABS_API_KEY)

    • Both services must be configured and have available credits


    Supported Models:
    • eleven_multilingual_v2 - Best quality (default)

    • eleven_turbo_v2 - Faster generation

    • eleven_turbo_v2_5 - Fastest generation

    • eleven_multilingual_v1 - Legacy model


    Cost Estimate:
    • 10-minute audio (~750 words): approximately $1.43

    - Claude API: ~$0.075
    - ElevenLabs: ~$1.35
    • Longer content scales proportionally
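Proportional scaling can be sketched from the two per-10-minute figures above. The numbers are this document's estimates, not current pricing, and `estimate_cost` is a hypothetical helper.

```python
from typing import Dict

# Per-10-minute estimates quoted above; treat them as rough figures.
CLAUDE_COST_PER_10_MIN = 0.075
ELEVENLABS_COST_PER_10_MIN = 1.35


def estimate_cost(minutes: float) -> Dict[str, float]:
    """Scale the 10-minute estimates linearly to the requested duration."""
    factor = minutes / 10
    claude = CLAUDE_COST_PER_10_MIN * factor
    elevenlabs = ELEVENLABS_COST_PER_10_MIN * factor
    return {"claude": claude, "elevenlabs": elevenlabs, "total": claude + elevenlabs}
```

On these figures the text-to-speech step accounts for most of the cost, which is one reason to get script approval before generating audio.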


    Generation Time:
    • Script generation: 5-30 seconds (depending on length)

    • Audio generation: 5-15 seconds (ElevenLabs processing)

    • Total: Usually under 1 minute for 10-minute audio


Limitations

  • Maximum Length: 30 minutes (~2,250 words) per audio file
    - For longer content, create multiple parts/episodes

  • Single Voice: Currently supports one narrator voice
    - Cannot do multi-voice dialogue or character voices

  • No Background Music: Pure voice narration only
    - No background music, sound effects, or audio mixing

  • Real-time Generation: Each request generates fresh content
    - No pre-made templates or cached audio

  • Language: Primarily English
    - ElevenLabs supports other languages, but content generation is optimized for English

Example Conversations

    Example 1: Quick Audiobook

    User: Create a 5-minute audiobook chapter about a lonely robot learning to feel emotions
    
    Agent: I'll create a 5-minute sci-fi audiobook chapter. Let me write the story...
    
    [Generates ~400-word narrative with emotional arc]
    
    Here's the story I've created about Unit-7, a maintenance robot who discovers the meaning of friendship through an unexpected encounter. Would you like me to generate the audio or make changes first?
    
    User: Generate the audio
    
    Agent: [Invokes SAG with formatted script]
    
    MEDIA:/tmp/audio-gen-1234567890-lonely-robot.mp3
    
    Your audiobook chapter is ready! The story follows Unit-7's emotional awakening through a touching encounter with a young engineer. Duration: approximately 5 minutes.

    Example 2: Podcast with Revisions

    User: Make a 10-minute podcast about sustainable living tips
    
    Agent: [Generates podcast script with intro, 5 practical tips, outro]
    
    Here's a 10-minute podcast with 5 actionable sustainable living tips. The tone is encouraging and practical. Would you like me to:
    1. Generate audio now
    2. Make it more urgent/climate-focused
    3. Add more specific examples
    4. Adjust the length
    
    User: Add more specific examples for each tip
    
    Agent: [Regenerates with detailed examples]
    
    Better? I've added specific examples like calculating your carbon footprint, composting at home, and choosing sustainable brands. Ready for audio?
    
    User: Perfect!
    
    Agent: MEDIA:/tmp/audio-gen-1234567890-sustainable-living.mp3
    
    Your sustainable living podcast is ready with 5 practical tips, each with concrete examples you can start using today! Duration: 10 minutes.

    Example 3: Educational Content

    User: Create a 15-minute educational audio explaining quantum computing for beginners
    
    Agent: I'll create a beginner-friendly 15-minute explanation of quantum computing...
    
    [Generates ~1,100 words with analogies, examples, and clear structure]
    
    Script preview:
    - Introduction: What is quantum computing? (2 min)
    - Classical vs Quantum: The key differences (4 min)
    - Qubits and Superposition: How it works (5 min)
    - Real Applications: What it means for you (3 min)
    - Conclusion: The future of computing (1 min)
    
    Should I generate the audio?
    
    User: Yes please
    
    Agent: MEDIA:/tmp/audio-gen-1234567890-quantum-computing.mp3
    
    Your educational audio on quantum computing is ready! It uses everyday analogies to explain qubits, superposition, and real-world applications. Duration: approximately 15 minutes.

When to Use This Skill

    Use the audio-gen skill when users request:

    • "Create an audiobook about..."

    • "Make a podcast episode about..."

    • "Generate audio content explaining..."

    • "Record a story about..."

    • "Produce an audio guide for..."

    • "Create a narrated version of..."

    • "Make an audio lesson about..."


    Do NOT use for:
    • Transcribing existing audio (use whisper skills instead)

    • Converting user-written text to audio (use SAG skill directly)

    • Real-time voice interaction (use voice-call plugin)

    • Multi-speaker dialogues or interviews (current limitation)


Advanced Usage

    Voice Selection

    Users can request specific voices:
    User: Create a podcast with the Rachel voice
    
    Agent: [Checks voice configuration for "Rachel" voice ID and uses it in SAG command]

    Series/Episodes

    For multi-part content:
    User: Create a 3-episode series about space exploration
    
    Agent: I'll create Episode 1 first. Each episode will be 10 minutes. Let's start with "The Dawn of the Space Age"...
    
    [After completion]
    
    Episode 1 is ready! Would you like me to continue with Episode 2?

    Format Blending

    Mix formats for unique styles:
    User: Create an educational podcast that tells a story
    
    Agent: [Generates content that combines storytelling narrative with educational explanations]

Troubleshooting

    Issue: Audio sounds robotic or unnatural
    Solution: Add more pauses and voice effects. Use contractions and conversational language.

    Issue: Script doesn't match requested length
    Solution: Regenerate with explicit word count target. Check calculations (75 words/min).

    Issue: Content is too technical or too simple
    Solution: Ask user for target audience. Adjust complexity accordingly.

    Issue: SAG command fails
    Solution: Check ELEVENLABS_API_KEY is set. Verify SAG skill is installed and working.

    Issue: User wants to edit the script manually
    Solution: Provide the plain text script. User can modify it and paste back for audio generation.


    💡 Pro Tip: Always generate the script first and get user approval before creating audio. This saves time and API costs, and ensures the user gets exactly what they want.