
podcast-generation

Generate AI-powered podcast-style audio narratives using Azure OpenAI's GPT Realtime Mini model via WebSocket.


Installation

npx clawhub@latest install podcast-generation

View the full skill documentation and source below.

Documentation

Podcast Generation with GPT Realtime Mini

Generate real audio narratives from text content using Azure OpenAI's Realtime API.

Quick Start

• Configure environment variables for the Realtime API
• Connect via WebSocket to the Azure OpenAI Realtime endpoint
• Send the text prompt; collect PCM audio chunks and the transcript
• Convert the PCM audio to WAV format
• Return base64-encoded audio to the frontend for playback

Environment Configuration

AZURE_OPENAI_AUDIO_API_KEY=your_realtime_api_key
AZURE_OPENAI_AUDIO_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_AUDIO_DEPLOYMENT=gpt-realtime-mini

Note: The endpoint should NOT include /openai/v1/ - just the base URL. The WebSocket path is appended in code.

Core Workflow

Backend Audio Generation

import base64
import os

from openai import AsyncOpenAI

async def generate_narration(prompt: str) -> bytes:
    # Read Realtime credentials from the environment (see Environment Configuration)
    endpoint = os.environ["AZURE_OPENAI_AUDIO_ENDPOINT"]
    api_key = os.environ["AZURE_OPENAI_AUDIO_API_KEY"]
    deployment = os.environ.get("AZURE_OPENAI_AUDIO_DEPLOYMENT", "gpt-realtime-mini")

    # Convert the HTTPS endpoint to a WebSocket URL
    ws_url = endpoint.replace("https://", "wss://") + "/openai/v1"

    client = AsyncOpenAI(
        websocket_base_url=ws_url,
        api_key=api_key
    )

    audio_chunks = []
    transcript_parts = []

    async with client.realtime.connect(model=deployment) as conn:
        # Configure for audio-only output
        await conn.session.update(session={
            "output_modalities": ["audio"],
            "instructions": "You are a narrator. Speak naturally."
        })

        # Send the text to narrate
        await conn.conversation.item.create(item={
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": prompt}]
        })

        await conn.response.create()

        # Collect streaming events until generation completes
        async for event in conn:
            if event.type == "response.output_audio.delta":
                audio_chunks.append(base64.b64decode(event.delta))
            elif event.type == "response.output_audio_transcript.delta":
                transcript_parts.append(event.delta)
            elif event.type == "response.done":
                break

    # Convert raw PCM to WAV (see scripts/pcm_to_wav.py)
    pcm_audio = b''.join(audio_chunks)
    return pcm_to_wav(pcm_audio, sample_rate=24000)
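
The referenced scripts/pcm_to_wav.py is not reproduced here; a minimal sketch of the conversion using Python's standard wave module, assuming the 16-bit mono 24 kHz PCM described under Audio Format, could look like this:

import io
import wave

def pcm_to_wav(pcm_audio: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples (2 bytes)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_audio)
    return buffer.getvalue()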

Frontend Audio Playback

// Convert base64 WAV to a playable blob
const base64ToBlob = (base64, mimeType) => {
  const bytes = atob(base64);
  const arr = new Uint8Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) arr[i] = bytes.charCodeAt(i);
  return new Blob([arr], { type: mimeType });
};

const audioBlob = base64ToBlob(response.audio_data, 'audio/wav');
const audioUrl = URL.createObjectURL(audioBlob);
new Audio(audioUrl).play();
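
The snippet above assumes the backend response carries the WAV bytes in a base64 audio_data field; that field name is illustrative rather than prescribed by the skill. On the Python side, the matching encoding step might be:

import base64

def to_response(wav_audio: bytes, transcript: str) -> dict:
    # Base64-encode the WAV so it can travel inside a JSON payload
    return {
        "audio_data": base64.b64encode(wav_audio).decode("ascii"),
        "transcript": transcript,
    }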

Voice Options

Voice     Character
alloy     Neutral
echo      Warm
fable     Expressive
onyx      Deep
nova      Friendly
shimmer   Clear
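
The documentation does not show where the voice is selected; with the Realtime API it is typically part of the session configuration. A minimal sketch, assuming the GA session shape with nested audio output settings (adjust to your SDK version):

# Assumption: GA Realtime session shape; verify against your openai SDK version
await conn.session.update(session={
    "output_modalities": ["audio"],
    "audio": {"output": {"voice": "alloy"}},
    "instructions": "You are a narrator. Speak naturally."
})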

Realtime API Events

• response.output_audio.delta - Base64-encoded audio chunk
• response.output_audio_transcript.delta - Transcript text delta
• response.done - Generation complete
• error - Handle with event.error.message
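
The streaming loop in Backend Audio Generation can be extended to surface the error event rather than waiting on a response that never completes; a sketch:

async for event in conn:
    if event.type == "response.output_audio.delta":
        audio_chunks.append(base64.b64decode(event.delta))
    elif event.type == "response.output_audio_transcript.delta":
        transcript_parts.append(event.delta)
    elif event.type == "error":
        # Fail fast with the API-provided message
        raise RuntimeError(event.error.message)
    elif event.type == "response.done":
        break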

Audio Format

• Input: Text prompt
• Output: PCM audio (24 kHz, 16-bit, mono)
• Storage: Base64-encoded WAV

References

• Full architecture: See references/architecture.md for the complete stack design
• Code examples: See references/code-examples.md for production patterns
• PCM conversion: Use scripts/pcm_to_wav.py for audio format conversion