AI & LLMsDocumentedScanned

gemini-stt

Transcribe audio files using Google's Gemini API or Vertex AI.

Share:

Installation

npx clawhub@latest install gemini-stt

View the full skill documentation and source below.

Documentation

Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. Default model is gemini-2.0-flash-lite for fastest transcription.

Authentication (choose one)

Option 1: Vertex AI with Application Default Credentials (Recommended)

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

The script will automatically detect and use ADC when available.

Option 2: Direct Gemini API Key

Set GEMINI_API_KEY in environment (e.g., ~/.env or ~/.clawdbot/.env)

Requirements

  • Python 3.10+ (no external dependencies)
  • Either GEMINI_API_KEY or gcloud CLI with ADC configured

Supported Formats

  • .ogg / .opus (Telegram voice messages)
  • .mp3
  • .wav
  • .m4a

Usage

# Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

# Force Vertex AI
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

# With a specific model
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

# Vertex AI with specific project and region
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

# With Clawdbot media
python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg

Options

OptionDescription
Path to the audio file (required)
--model, -mGemini model to use (default: gemini-2.0-flash-lite)
--vertex, -vForce use of Vertex AI with ADC
--project, -pGCP project ID (for Vertex, defaults to gcloud config)
--region, -rGCP region (for Vertex, default: us-central1)

Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

ModelNotes
gemini-2.0-flash-liteDefault. Fastest transcription speed.
gemini-2.0-flashFast and cost-effective.
gemini-2.5-flash-liteLightweight 2.5 model.
gemini-2.5-flashBalanced speed and quality.
gemini-2.5-proHigher quality, slower.
gemini-3-flash-previewLatest flash model.
gemini-3-pro-previewLatest pro model, best quality.
See [Gemini API Models]() for the latest list.

How It Works

  • Reads the audio file and base64 encodes it

  • Auto-detects authentication:

  • - If ADC is available (gcloud), uses Vertex AI endpoint
    - Otherwise, uses GEMINI_API_KEY with direct Gemini API
  • Sends to the selected Gemini model with transcription prompt

  • Returns the transcribed text
  • Example Integration

    For Clawdbot voice message handling:

    # Transcribe incoming voice message
    TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
    echo "User said: $TRANSCRIPT"

    Error Handling

    The script exits with code 1 and prints to stderr on:

    • No authentication available (neither ADC nor GEMINI_API_KEY)

    • File not found

    • API errors

    • Missing GCP project (when using Vertex)


    Notes

    • Uses Gemini 2.0 Flash Lite by default for fastest transcription
    • No external Python dependencies (uses stdlib only)
    • Automatically detects MIME type from file extension
    • Prefers Vertex AI with ADC when available (no API key management needed)