
senior-ml-engineer

ML engineering skill for productionizing models and building MLOps pipelines.


Installation

npx clawhub@latest install senior-ml-engineer

View the full skill documentation and source below.

Documentation

Senior ML Engineer

Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.


Table of Contents

• Model Deployment Workflow
• MLOps Pipeline Setup
• LLM Integration Workflow
• RAG System Implementation
• Model Monitoring
• Reference Documentation
• Tools
• Tech Stack


Model Deployment Workflow

Deploy a trained model to production with monitoring:

  • Export model to a standardized format (ONNX, TorchScript, SavedModel); a short export sketch follows the container template

  • Package model with dependencies in Docker container

  • Deploy to staging environment

  • Run integration tests against staging

  • Deploy canary (5% traffic) to production

  • Monitor latency and error rates for 1 hour

  • Promote to full production if metrics pass

  • Validation: p95 latency < 100ms, error rate < 0.1%
  • Container Template

    FROM python:3.11-slim

    WORKDIR /app

    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    COPY model/ /app/model/
    COPY src/ /app/src/

    # python:3.11-slim does not include curl, so probe the (assumed) /health endpoint with the stdlib
    HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1

    EXPOSE 8080
    CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]

    Serving Options

    | Option | Latency | Throughput | Use Case |
    | --- | --- | --- | --- |
    | FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
    | Triton Inference Server | Very Low | Very High | GPU inference, batching |
    | TensorFlow Serving | Low | High | TensorFlow models |
    | TorchServe | Low | High | PyTorch models |
    | Ray Serve | Medium | High | Complex pipelines, multi-model |
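
    Serving Example (FastAPI + Uvicorn)

    The container template above starts src.server:app with Uvicorn. A minimal sketch of what that app might look like for an ONNX model; the model path, input name, and feature layout are assumptions carried over from the export sketch.

    import numpy as np
    import onnxruntime as ort
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    session = ort.InferenceSession("model/model.onnx")  # loaded once at startup

    class PredictRequest(BaseModel):
        features: list[float]

    @app.get("/health")
    def health():
        # Endpoint targeted by the container HEALTHCHECK above.
        return {"status": "ok"}

    @app.post("/predict")
    def predict(req: PredictRequest):
        x = np.array([req.features], dtype=np.float32)
        outputs = session.run(None, {"input": x})
        return {"prediction": outputs[0].tolist()}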

    MLOps Pipeline Setup

    Establish automated training and deployment:

  • Configure feature store (Feast, Tecton) for training data

  • Set up experiment tracking (MLflow, Weights & Biases); an MLflow tracking sketch follows the feature store pattern

  • Create training pipeline with hyperparameter logging

  • Register model in model registry with version metadata

  • Configure staging deployment triggered by registry events

  • Set up A/B testing infrastructure for model comparison

  • Enable drift monitoring with alerting

  • Validation: New models automatically evaluated against baseline
  • Feature Store Pattern

    from datetime import timedelta

    from feast import Entity, Feature, FeatureView, FileSource, ValueType

    user = Entity(name="user_id", value_type=ValueType.INT64)

    user_features = FeatureView(
        name="user_features",
        entities=["user_id"],
        ttl=timedelta(days=1),
        features=[
            Feature(name="purchase_count_30d", dtype=ValueType.INT64),
            Feature(name="avg_order_value", dtype=ValueType.FLOAT),
        ],
        online=True,
        source=FileSource(path="data/user_features.parquet"),
    )
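
  • Experiment Tracking Pattern (MLflow)

    A minimal sketch of experiment tracking, hyperparameter logging, and registry-based registration with MLflow; the experiment name, model, dataset, and metric are stand-ins, and a tracking server with a model registry is assumed to be configured.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    mlflow.set_experiment("churn-model")  # assumed experiment name

    X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)  # stand-in data
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    with mlflow.start_run():
        params = {"n_estimators": 200, "max_depth": 8}  # illustrative hyperparameters
        mlflow.log_params(params)

        model = RandomForestClassifier(**params, random_state=0).fit(X_train, y_train)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        mlflow.log_metric("val_auc", auc)

        # Registering under a name lets registry events trigger the staging deployment.
        mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")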

    Retraining Triggers

    | Trigger | Detection | Action |
    | --- | --- | --- |
    | Scheduled | Cron (weekly/monthly) | Full retrain |
    | Performance drop | Accuracy < threshold | Immediate retrain |
    | Data drift | PSI > 0.2 | Evaluate, then retrain |
    | New data volume | X new samples | Incremental update |
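
    PSI Calculation

    The data drift row above keys off PSI. A minimal sketch of computing the population stability index between a reference and a current feature sample; the bin count and the small-value floor are conventional choices, not prescribed by the skill.

    import numpy as np

    def population_stability_index(reference, current, bins: int = 10) -> float:
        """PSI between two 1-D samples, using quantile bins derived from the reference data."""
        edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range

        ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
        cur_pct = np.histogram(current, bins=edges)[0] / len(current)

        # Floor the proportions so empty bins do not produce log(0) or division by zero.
        ref_pct = np.clip(ref_pct, 1e-6, None)
        cur_pct = np.clip(cur_pct, 1e-6, None)
        return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))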

    LLM Integration Workflow

    Integrate LLM APIs into production applications:

  • Create provider abstraction layer for vendor flexibility

  • Implement retry logic with exponential backoff

  • Configure fallback to a secondary provider (see the fallback sketch after the provider abstraction)

  • Set up token counting and context truncation

  • Add response caching for repeated queries

  • Implement cost tracking per request

  • Add structured output validation with Pydantic

  • Validation: Response parses correctly, cost within budget
  • Provider Abstraction

    from abc import ABC, abstractmethod
    from tenacity import retry, stop_after_attempt, wait_exponential
    
    class LLMProvider(ABC):
        @abstractmethod
        def complete(self, prompt: str, **kwargs) -> str:
            pass
    
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
    def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
        return provider.complete(prompt)
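
  • Fallback Pattern

    A sketch of the fallback step built on the abstraction above; the primary and secondary arguments are concrete LLMProvider implementations you would write against the vendor SDKs.

    def complete_with_fallback(primary: LLMProvider, secondary: LLMProvider, prompt: str) -> str:
        """Try the primary provider (with retries); fall back to the secondary on failure."""
        try:
            return call_llm_with_retry(primary, prompt)
        except Exception:
            # Retries exhausted or a non-retryable error: degrade to the secondary vendor.
            return call_llm_with_retry(secondary, prompt)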

    Cost Management

    | Provider | Input Cost (per 1K tokens) | Output Cost (per 1K tokens) |
    | --- | --- | --- |
    | GPT-4 | $0.03 | $0.06 |
    | GPT-3.5 | $0.0005 | $0.0015 |
    | Claude 3 Opus | $0.015 | $0.075 |
    | Claude 3 Haiku | $0.00025 | $0.00125 |
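
    A sketch of per-request cost tracking using the rates above; tiktoken covers the OpenAI models, the model IDs are illustrative, and the hard-coded prices will drift as vendors change pricing.

    import tiktoken

    # USD per 1K tokens, copied from the table above.
    PRICES = {
        "gpt-4": {"input": 0.03, "output": 0.06},
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
    }

    def estimate_cost(model: str, prompt: str, completion: str) -> float:
        """Count tokens with tiktoken and price them at the per-1K rates."""
        enc = tiktoken.encoding_for_model(model)
        input_tokens = len(enc.encode(prompt))
        output_tokens = len(enc.encode(completion))
        rates = PRICES[model]
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1000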

    RAG System Implementation

    Build retrieval-augmented generation pipeline:

  • Choose vector database (Pinecone, Qdrant, Weaviate)

  • Select embedding model based on quality/cost tradeoff

  • Implement document chunking strategy

  • Create ingestion pipeline with metadata extraction

  • Build retrieval with query embedding (a retrieval sketch follows the vector database table)

  • Add reranking for relevance improvement

  • Format context and send to LLM

  • Validation: Response references retrieved context, no hallucinations
  • Vector Database Selection

    | Database | Hosting | Scale | Latency | Best For |
    | --- | --- | --- | --- | --- |
    | Pinecone | Managed | High | Low | Production, managed |
    | Qdrant | Both | High | Very Low | Performance-critical |
    | Weaviate | Both | High | Low | Hybrid search |
    | Chroma | Self-hosted | Medium | Low | Prototyping |
    | pgvector | Self-hosted | Medium | Medium | Existing Postgres |
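
  • Retrieval Sketch

    A minimal sketch of the query-embedding and retrieval step, kept in memory with cosine similarity so it runs without any of the databases above; in production the vectors would live in one of those stores. The embedding model and sample documents are assumptions.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

    documents = [
        "Refunds are processed within 5 business days.",
        "Premium support is available 24/7 for enterprise plans.",
    ]
    doc_vectors = model.encode(documents, normalize_embeddings=True)

    def retrieve(query: str, k: int = 2) -> list[str]:
        """Embed the query and return the top-k documents by cosine similarity."""
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vectors @ q  # vectors are normalized, so dot product equals cosine similarity
        top = np.argsort(scores)[::-1][:k]
        return [documents[i] for i in top]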

    Chunking Strategies

    | Strategy | Chunk Size | Overlap | Best For |
    | --- | --- | --- | --- |
    | Fixed | 500-1000 tokens | 50-100 tokens | General text |
    | Sentence | 3-5 sentences | 1 sentence | Structured text |
    | Semantic | Variable | Based on meaning | Research papers |
    | Recursive | Hierarchical | Parent-child | Long documents |
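
    Fixed Chunking Sketch

    A sketch of the fixed strategy from the table, using tiktoken so chunk size and overlap are measured in tokens; the encoding name and default sizes are assumptions drawn from the table, not hard requirements.

    import tiktoken

    def chunk_fixed(text: str, chunk_tokens: int = 500, overlap: int = 50) -> list[str]:
        """Split text into fixed-size token windows with overlap between neighbouring chunks."""
        enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
        tokens = enc.encode(text)
        step = chunk_tokens - overlap
        chunks = []
        for start in range(0, len(tokens), step):
            window = tokens[start : start + chunk_tokens]
            chunks.append(enc.decode(window))
        return chunks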

    Model Monitoring

    Monitor production models for drift and degradation:

  • Set up latency tracking (p50, p95, p99)

  • Configure error rate alerting

  • Implement input data drift detection

  • Track prediction distribution shifts

  • Log ground truth when available

  • Compare model versions with A/B metrics

  • Set up automated retraining triggers

  • Validation: Alerts fire before user-visible degradation
  • Drift Detection

    from scipy.stats import ks_2samp

    def detect_drift(reference, current, threshold=0.05):
        """Two-sample Kolmogorov-Smirnov test between reference and current feature values."""
        statistic, p_value = ks_2samp(reference, current)
        return {
            # A small p-value rejects the hypothesis that both samples share a distribution.
            "drift_detected": p_value < threshold,
            "ks_statistic": statistic,
            "p_value": p_value,
        }

    Alert Thresholds

    | Metric | Warning | Critical |
    | --- | --- | --- |
    | p95 latency | > 100ms | > 200ms |
    | Error rate | > 0.1% | > 1% |
    | PSI (drift) | > 0.1 | > 0.2 |
    | Accuracy drop | > 2% | > 5% |
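
    Threshold Check Sketch

    A small sketch of turning the table above into alert levels; the metric names mirror the table, and routing the alert (PagerDuty, Slack, etc.) is left out.

    THRESHOLDS = {
        # metric: (warning, critical), values copied from the table above
        "p95_latency_ms": (100, 200),
        "error_rate": (0.001, 0.01),
        "psi": (0.1, 0.2),
        "accuracy_drop": (0.02, 0.05),
    }

    def alert_level(metric: str, value: float) -> str:
        """Map a metric observation to ok / warning / critical per the thresholds above."""
        warning, critical = THRESHOLDS[metric]
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
        return "ok"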

    Reference Documentation

    MLOps Production Patterns

    references/mlops_production_patterns.md contains:

    • Model deployment pipeline with Kubernetes manifests
    • Feature store architecture with Feast examples
    • Model monitoring with drift detection code
    • A/B testing infrastructure with traffic splitting
    • Automated retraining pipeline with MLflow

    LLM Integration Guide

    references/llm_integration_guide.md contains:

    • Provider abstraction layer pattern
    • Retry and fallback strategies with tenacity
    • Prompt engineering templates (few-shot, CoT)
    • Token optimization with tiktoken
    • Cost calculation and tracking

    RAG System Architecture

    references/rag_system_architecture.md contains:

    • RAG pipeline implementation with code
    • Vector database comparison and integration
    • Chunking strategies (fixed, semantic, recursive)
    • Embedding model selection guide
    • Hybrid search and reranking patterns

    Tools

    Model Deployment Pipeline

    python scripts/model_deployment_pipeline.py --model model.pkl --target staging

    Generates deployment artifacts: Dockerfile, Kubernetes manifests, health checks.

    RAG System Builder

    python scripts/rag_system_builder.py --config rag_config.yaml --analyze

    Scaffolds RAG pipeline with vector store integration and retrieval logic.

    ML Monitoring Suite

    python scripts/ml_monitoring_suite.py --config monitoring.yaml --deploy

    Sets up drift detection, alerting, and performance dashboards.


    Tech Stack

    | Category | Tools |
    | --- | --- |
    | ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
    | LLM Frameworks | LangChain, LlamaIndex, DSPy |
    | MLOps | MLflow, Weights & Biases, Kubeflow |
    | Data | Spark, Airflow, dbt, Kafka |
    | Deployment | Docker, Kubernetes, Triton |
    | Databases | PostgreSQL, BigQuery, Pinecone, Redis |