
audio-reply

Generate audio replies using TTS.

Installation

npx clawhub@latest install audio-reply-skill

Documentation

Audio Reply Skill

Generate spoken audio responses using MLX Audio TTS (chatterbox-turbo model).

Trigger Phrases

  • "read it to me [URL]" - Fetch content from URL and read it aloud
  • "talk to me [topic/question]" - Generate a conversational response as audio
  • "speak", "say it", "voice reply" - Convert your response to audio

How to Use

Mode 1: Read URL Content

User: read it to me [URL]
  • Fetch the URL content using WebFetch
  • Extract readable text (strip HTML, focus on main content)
  • Generate audio using TTS
  • Play the audio and delete the file afterward

Mode 2: Conversational Audio Response

User: talk to me about the weather today
  • Generate a natural, conversational response
  • Keep it concise (TTS works best with shorter segments)
  • Convert to audio, play it, then delete the file

Implementation

TTS Command

    uv run mlx_audio.tts.generate \
      --model mlx-community/chatterbox-turbo-fp16 \
      --text "Your text here" \
      --play \
      --file_prefix /tmp/audio_reply

Key Parameters

  • --model mlx-community/chatterbox-turbo-fp16 - Fast, natural voice
  • --play - Auto-play the generated audio
  • --file_prefix - Save to temp location for cleanup
  • --exaggeration 0.3 - Optional: add expressiveness (0.0-1.0)
  • --speed 1.0 - Adjust speech rate if needed

Text Preparation Guidelines

For "read it to me" mode:

  • Fetch URL with WebFetch tool
  • Extract main content, strip navigation/ads/boilerplate
  • Summarize if very long (>500 words) - keep key points
  • Add natural pauses with periods and commas

For "talk to me" mode:

  • Write conversationally, as if speaking
  • Use contractions (I'm, you're, it's)
  • Add filler words sparingly for naturalness ([chuckle], um, anyway)
  • Keep responses under 200 words for best quality
  • Avoid technical jargon unless explaining it
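
The word limits above can be checked before any text reaches TTS. The following is a minimal sketch, not part of the skill itself: the helper name within_word_limit is hypothetical, and it counts whitespace-separated words with wc -w, which is only an approximation of how the model segments speech.

```shell
#!/bin/sh
# Sketch: check prepared text against the word limits in the guidelines.
# within_word_limit is an illustrative helper name, not part of the skill.

# within_word_limit TEXT LIMIT
# Succeeds (exit 0) when TEXT has at most LIMIT whitespace-separated words.
within_word_limit() {
  # wc -w may pad its output with spaces; the arithmetic expansion
  # normalizes it to a plain integer.
  count=$(( $(printf '%s' "$1" | wc -w) ))
  [ "$count" -le "$2" ]
}

# Example: the 200-word limit for "talk to me" replies.
reply="Hey! So I can help you with all kinds of things."
if within_word_limit "$reply" 200; then
  echo "ok to speak as-is"
else
  echo "shorten before sending to TTS"
fi
```

For "read it to me" mode the same helper can be called with 500 to decide whether the extracted article should be summarized first.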

Audio Generation & Cleanup (IMPORTANT)

Always delete the audio file after playing - it's already in the chat history.

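    # Build a unique prefix: timestamp plus shell PID, so two replies
    # started in the same second do not collide
    OUTPUT_PREFIX="/tmp/audio_reply_$(date +%s)_$$"

    # Generate and play; the output file name starts with this prefix
    uv run mlx_audio.tts.generate \
      --model mlx-community/chatterbox-turbo-fp16 \
      --text "Your response text" \
      --play \
      --file_prefix "$OUTPUT_PREFIX"

    # ALWAYS clean up after playing; the glob matches any suffix the tool appended
    rm -f "${OUTPUT_PREFIX}"*.wav 2>/dev/null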
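
If the TTS command exits with an error, a cleanup line placed after it may never run in scripts that stop on failure. One way to make the cleanup unconditional is an EXIT trap. This is a sketch under assumptions: speak is a hypothetical wrapper name, the TMPDIR fallback is an addition not found in the original command, and only the flags already shown in this document are used.

```shell
#!/bin/sh
# Sketch: remove generated audio whether or not generation succeeds.
# speak is a hypothetical wrapper; the flags are the ones documented above.

# speak TEXT
speak() (
  # The function body is a subshell (parentheses instead of braces),
  # so the trap below is scoped to a single call.
  prefix="${TMPDIR:-/tmp}/audio_reply_$(date +%s)_$$"

  # Runs when the subshell exits, after success or failure of the TTS command.
  trap 'rm -f "$prefix"*.wav' EXIT

  uv run mlx_audio.tts.generate \
    --model mlx-community/chatterbox-turbo-fp16 \
    --text "$1" \
    --play \
    --file_prefix "$prefix"
)
```

A call then looks like speak "Your response text". The exit status of the TTS command is passed through, so the caller can still decide to fall back to text.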

Error Handling

If TTS fails:

  • Check whether the model is downloaded (first run downloads ~500MB)
  • Ensure uv is installed and in PATH
  • Fall back to a text response with an apology
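
The checks above could be combined into one wrapper that tries audio first and prints the reply as text when that fails. This is a rough sketch: tts_available and reply_with_fallback are hypothetical helper names, the apology wording is a placeholder, and the sketch does not try to detect whether the model has already been downloaded.

```shell
#!/bin/sh
# Sketch: pre-flight check for uv plus a text fallback.
# tts_available and reply_with_fallback are illustrative names.

# Succeeds when the shell can resolve a command named uv.
tts_available() {
  command -v uv >/dev/null 2>&1
}

# reply_with_fallback TEXT
# Speaks TEXT; if uv is missing or TTS fails, prints an apology and TEXT.
reply_with_fallback() {
  prefix="/tmp/audio_reply_$(date +%s)_$$"
  status=0
  if tts_available; then
    uv run mlx_audio.tts.generate \
      --model mlx-community/chatterbox-turbo-fp16 \
      --text "$1" \
      --play \
      --file_prefix "$prefix" || status=$?
  else
    status=127  # conventional "command not found" status
  fi

  # Clean up any audio that was written, whatever the outcome.
  rm -f "$prefix"*.wav

  if [ "$status" -ne 0 ]; then
    printf 'Sorry, audio is unavailable right now. Here is the reply as text:\n%s\n' "$1"
  fi
}
```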

Example Workflows

Example 1: Read URL

    User: read it to me [URL]
    
    Assistant actions:
    1. WebFetch the URL
    2. Extract article content
    3. Generate TTS:
       uv run mlx_audio.tts.generate \
         --model mlx-community/chatterbox-turbo-fp16 \
         --text "Here's what I found... [article summary]" \
         --play --file_prefix /tmp/audio_reply_1706123456
    4. Delete: rm -f /tmp/audio_reply_1706123456*.wav
    5. Confirm: "Done reading the article to you."

Example 2: Talk to Me

    User: talk to me about what you can help with
    
    Assistant actions:
    1. Generate conversational response text
    2. Generate TTS:
       uv run mlx_audio.tts.generate \
         --model mlx-community/chatterbox-turbo-fp16 \
         --text "Hey! So I can help you with all kinds of things..." \
         --play --file_prefix /tmp/audio_reply_1706123789
    3. Delete: rm -f /tmp/audio_reply_1706123789*.wav
    4. (No text output needed - audio IS the response)

Notes

  • First run may take longer as the model downloads (~500MB)
  • Audio quality is best for English; other languages may vary
  • For long content, consider chunking into multiple audio segments
  • The --play flag uses system audio - ensure volume is up
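
The chunking note can be sketched as follows. Both helper names (chunk_text, speak_chunks) and the 100-word target are assumptions chosen for illustration. A chunk ends at the first sentence boundary (a word ending in ".", "!" or "?") once it has reached the target, so chunks can run somewhat longer than the target.

```shell
#!/bin/sh
# Sketch: split long text into sentence-aligned segments and speak each one.
# chunk_text and speak_chunks are hypothetical helpers, not part of the skill.

# chunk_text TARGET_WORDS
# Reads text on stdin and prints one chunk per line. A chunk is closed at the
# first word ending in . ! or ? once it holds at least TARGET_WORDS words.
chunk_text() {
  tr -s '[:space:]' '\n' | awk -v target="$1" '
    NF == 0 { next }                              # skip empty lines
    { line = (n ? line " " : "") $0; n++ }        # append the word
    n >= target && /[.!?]$/ { print line; line = ""; n = 0 }
    END { if (n) print line }                     # flush the remainder
  '
}

# speak_chunks TARGET_WORDS
# Reads text on stdin, then generates, plays, and deletes one segment at a time.
speak_chunks() {
  i=0
  chunk_text "$1" | while IFS= read -r segment; do
    i=$((i + 1))
    prefix="/tmp/audio_reply_$$_${i}"
    # stdin is redirected so the TTS command cannot consume the remaining chunks.
    uv run mlx_audio.tts.generate \
      --model mlx-community/chatterbox-turbo-fp16 \
      --text "$segment" \
      --play \
      --file_prefix "$prefix" </dev/null
    rm -f "$prefix"*.wav
  done
}
```

Usage might look like printf '%s' "$article_text" | speak_chunks 100, where article_text holds the extracted content.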