
senior-ml-engineer

ML engineering skill for productionizing models and building MLOps pipelines.


Installation

npx clawhub@latest install senior-ml-engineer

View the full skill documentation and source below.

Documentation

Senior ML Engineer

Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.


Table of Contents

• Model Deployment Workflow
• MLOps Pipeline Setup
• LLM Integration Workflow
• RAG System Implementation
• Model Monitoring
• Reference Documentation
• Tools
• Tech Stack


Model Deployment Workflow

Deploy a trained model to production with monitoring:

  • Export model to a standardized format (ONNX, TorchScript, SavedModel); a short export sketch follows the container template

  • Package model with dependencies in Docker container

  • Deploy to staging environment

  • Run integration tests against staging

  • Deploy canary (5% traffic) to production

  • Monitor latency and error rates for 1 hour

  • Promote to full production if metrics pass

  • Validation: p95 latency < 100ms, error rate < 0.1%
  • Container Template

    FROM python:3.11-slim

    WORKDIR /app

    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    COPY model/ /app/model/
    COPY src/ /app/src/

    # python:3.11-slim does not include curl, so probe the (assumed) /health endpoint with the stdlib
    HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1

    EXPOSE 8080
    CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]

    Serving Options

    | Option | Latency | Throughput | Use Case |
    | --- | --- | --- | --- |
    | FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
    | Triton Inference Server | Very Low | Very High | GPU inference, batching |
    | TensorFlow Serving | Low | High | TensorFlow models |
    | TorchServe | Low | High | PyTorch models |
    | Ray Serve | Medium | High | Complex pipelines, multi-model |
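
    Serving Example (FastAPI + Uvicorn)

    The container template above starts src.server:app with Uvicorn. A minimal sketch of what that app might look like for an ONNX model; the model path, input name, and feature layout are assumptions carried over from the export sketch.

    import numpy as np
    import onnxruntime as ort
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    session = ort.InferenceSession("model/model.onnx")  # loaded once at startup

    class PredictRequest(BaseModel):
        features: list[float]

    @app.get("/health")
    def health():
        # Endpoint targeted by the container HEALTHCHECK above.
        return {"status": "ok"}

    @app.post("/predict")
    def predict(req: PredictRequest):
        x = np.array([req.features], dtype=np.float32)
        outputs = session.run(None, {"input": x})
        return {"prediction": outputs[0].tolist()}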

    MLOps Pipeline Setup

    Establish automated training and deployment:

  • Configure feature store (Feast, Tecton) for training data

  • Set up experiment tracking (MLflow, Weights & Biases); an MLflow tracking sketch follows the feature store pattern

  • Create training pipeline with hyperparameter logging

  • Register model in model registry with version metadata

  • Configure staging deployment triggered by registry events

  • Set up A/B testing infrastructure for model comparison

  • Enable drift monitoring with alerting

  • Validation: New models automatically evaluated against baseline
  • Feature Store Pattern

    from datetime import timedelta

    from feast import Entity, Feature, FeatureView, FileSource, ValueType

    user = Entity(name="user_id", value_type=ValueType.INT64)

    user_features = FeatureView(
        name="user_features",
        entities=["user_id"],
        ttl=timedelta(days=1),
        features=[
            Feature(name="purchase_count_30d", dtype=ValueType.INT64),
            Feature(name="avg_order_value", dtype=ValueType.FLOAT),
        ],
        online=True,
        source=FileSource(path="data/user_features.parquet"),
    )
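
  • Experiment Tracking Pattern (MLflow)

    A minimal sketch of experiment tracking, hyperparameter logging, and registry-based registration with MLflow; the experiment name, model, dataset, and metric are stand-ins, and a tracking server with a model registry is assumed to be configured.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    mlflow.set_experiment("churn-model")  # assumed experiment name

    X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)  # stand-in data
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    with mlflow.start_run():
        params = {"n_estimators": 200, "max_depth": 8}  # illustrative hyperparameters
        mlflow.log_params(params)

        model = RandomForestClassifier(**params, random_state=0).fit(X_train, y_train)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        mlflow.log_metric("val_auc", auc)

        # Registering under a name lets registry events trigger the staging deployment.
        mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")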

    Retraining Triggers

    | Trigger | Detection | Action |
    | --- | --- | --- |
    | Scheduled | Cron (weekly/monthly) | Full retrain |
    | Performance drop | Accuracy < threshold | Immediate retrain |
    | Data drift | PSI > 0.2 | Evaluate, then retrain |
    | New data volume | X new samples | Incremental update |
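
    PSI Calculation

    The data drift row above keys off PSI. A minimal sketch of computing the population stability index between a reference and a current feature sample; the bin count and the small-value floor are conventional choices, not prescribed by the skill.

    import numpy as np

    def population_stability_index(reference, current, bins: int = 10) -> float:
        """PSI between two 1-D samples, using quantile bins derived from the reference data."""
        edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range

        ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
        cur_pct = np.histogram(current, bins=edges)[0] / len(current)

        # Floor the proportions so empty bins do not produce log(0) or division by zero.
        ref_pct = np.clip(ref_pct, 1e-6, None)
        cur_pct = np.clip(cur_pct, 1e-6, None)
        return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))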

    LLM Integration Workflow

    Integrate LLM APIs into production applications:

  • Create provider abstraction layer for vendor flexibility

  • Implement retry logic with exponential backoff

  • Configure fallback to a secondary provider (see the fallback sketch after the provider abstraction)

  • Set up token counting and context truncation

  • Add response caching for repeated queries

  • Implement cost tracking per request

  • Add structured output validation with Pydantic

  • Validation: Response parses correctly, cost within budget
  • Provider Abstraction

    from abc import ABC, abstractmethod
    from tenacity import retry, stop_after_attempt, wait_exponential
    
    class LLMProvider(ABC):
        @abstractmethod
        def complete(self, prompt: str, **kwargs) -> str:
            pass
    
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
    def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
        return provider.complete(prompt)
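
  • Fallback Pattern

    A sketch of the fallback step built on the abstraction above; the primary and secondary arguments are concrete LLMProvider implementations you would write against the vendor SDKs.

    def complete_with_fallback(primary: LLMProvider, secondary: LLMProvider, prompt: str) -> str:
        """Try the primary provider (with retries); fall back to the secondary on failure."""
        try:
            return call_llm_with_retry(primary, prompt)
        except Exception:
            # Retries exhausted or a non-retryable error: degrade to the secondary vendor.
            return call_llm_with_retry(secondary, prompt)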

    Cost Management

    | Provider | Input Cost (per 1K tokens) | Output Cost (per 1K tokens) |
    | --- | --- | --- |
    | GPT-4 | $0.03 | $0.06 |
    | GPT-3.5 | $0.0005 | $0.0015 |
    | Claude 3 Opus | $0.015 | $0.075 |
    | Claude 3 Haiku | $0.00025 | $0.00125 |
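
    A sketch of per-request cost tracking using the rates above; tiktoken covers the OpenAI models, the model IDs are illustrative, and the hard-coded prices will drift as vendors change pricing.

    import tiktoken

    # USD per 1K tokens, copied from the table above.
    PRICES = {
        "gpt-4": {"input": 0.03, "output": 0.06},
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
    }

    def estimate_cost(model: str, prompt: str, completion: str) -> float:
        """Count tokens with tiktoken and price them at the per-1K rates."""
        enc = tiktoken.encoding_for_model(model)
        input_tokens = len(enc.encode(prompt))
        output_tokens = len(enc.encode(completion))
        rates = PRICES[model]
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1000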

    RAG System Implementation

    Build retrieval-augmented generation pipeline:

  • Choose vector database (Pinecone, Qdrant, Weaviate)

  • Select embedding model based on quality/cost tradeoff

  • Implement document chunking strategy

  • Create ingestion pipeline with metadata extraction

  • Build retrieval with query embedding (a retrieval sketch follows the vector database table)

  • Add reranking for relevance improvement

  • Format context and send to LLM

  • Validation: Response references retrieved context, no hallucinations
  • Vector Database Selection

    | Database | Hosting | Scale | Latency | Best For |
    | --- | --- | --- | --- | --- |
    | Pinecone | Managed | High | Low | Production, managed |
    | Qdrant | Both | High | Very Low | Performance-critical |
    | Weaviate | Both | High | Low | Hybrid search |
    | Chroma | Self-hosted | Medium | Low | Prototyping |
    | pgvector | Self-hosted | Medium | Medium | Existing Postgres |
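
  • Retrieval Sketch

    A minimal sketch of the query-embedding and retrieval step, kept in memory with cosine similarity so it runs without any of the databases above; in production the vectors would live in one of those stores. The embedding model and sample documents are assumptions.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

    documents = [
        "Refunds are processed within 5 business days.",
        "Premium support is available 24/7 for enterprise plans.",
    ]
    doc_vectors = model.encode(documents, normalize_embeddings=True)

    def retrieve(query: str, k: int = 2) -> list[str]:
        """Embed the query and return the top-k documents by cosine similarity."""
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vectors @ q  # vectors are normalized, so dot product equals cosine similarity
        top = np.argsort(scores)[::-1][:k]
        return [documents[i] for i in top]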

    Chunking Strategies

    | Strategy | Chunk Size | Overlap | Best For |
    | --- | --- | --- | --- |
    | Fixed | 500-1000 tokens | 50-100 tokens | General text |
    | Sentence | 3-5 sentences | 1 sentence | Structured text |
    | Semantic | Variable | Based on meaning | Research papers |
    | Recursive | Hierarchical | Parent-child | Long documents |
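
    Fixed Chunking Sketch

    A sketch of the fixed strategy from the table, using tiktoken so chunk size and overlap are measured in tokens; the encoding name and default sizes are assumptions drawn from the table, not hard requirements.

    import tiktoken

    def chunk_fixed(text: str, chunk_tokens: int = 500, overlap: int = 50) -> list[str]:
        """Split text into fixed-size token windows with overlap between neighbouring chunks."""
        enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
        tokens = enc.encode(text)
        step = chunk_tokens - overlap
        chunks = []
        for start in range(0, len(tokens), step):
            window = tokens[start : start + chunk_tokens]
            chunks.append(enc.decode(window))
        return chunks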

    Model Monitoring

    Monitor production models for drift and degradation:

  • Set up latency tracking (p50, p95, p99)

  • Configure error rate alerting

  • Implement input data drift detection

  • Track prediction distribution shifts

  • Log ground truth when available

  • Compare model versions with A/B metrics

  • Set up automated retraining triggers

  • Validation: Alerts fire before user-visible degradation
  • Drift Detection

    from scipy.stats import ks_2samp

    def detect_drift(reference, current, threshold=0.05):
        """Two-sample Kolmogorov-Smirnov test between reference and current feature values."""
        statistic, p_value = ks_2samp(reference, current)
        return {
            # A small p-value rejects the hypothesis that both samples share a distribution.
            "drift_detected": p_value < threshold,
            "ks_statistic": statistic,
            "p_value": p_value,
        }

    Alert Thresholds

    | Metric | Warning | Critical |
    | --- | --- | --- |
    | p95 latency | > 100ms | > 200ms |
    | Error rate | > 0.1% | > 1% |
    | PSI (drift) | > 0.1 | > 0.2 |
    | Accuracy drop | > 2% | > 5% |
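
    Threshold Check Sketch

    A small sketch of turning the table above into alert levels; the metric names mirror the table, and routing the alert (PagerDuty, Slack, etc.) is left out.

    THRESHOLDS = {
        # metric: (warning, critical), values copied from the table above
        "p95_latency_ms": (100, 200),
        "error_rate": (0.001, 0.01),
        "psi": (0.1, 0.2),
        "accuracy_drop": (0.02, 0.05),
    }

    def alert_level(metric: str, value: float) -> str:
        """Map a metric observation to ok / warning / critical per the thresholds above."""
        warning, critical = THRESHOLDS[metric]
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
        return "ok"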

    Reference Documentation

    MLOps Production Patterns

    references/mlops_production_patterns.md contains:

    • Model deployment pipeline with Kubernetes manifests
    • Feature store architecture with Feast examples
    • Model monitoring with drift detection code
    • A/B testing infrastructure with traffic splitting
    • Automated retraining pipeline with MLflow

    LLM Integration Guide

    references/llm_integration_guide.md contains:

    • Provider abstraction layer pattern
    • Retry and fallback strategies with tenacity
    • Prompt engineering templates (few-shot, CoT)
    • Token optimization with tiktoken
    • Cost calculation and tracking

    RAG System Architecture

    references/rag_system_architecture.md contains:

    • RAG pipeline implementation with code
    • Vector database comparison and integration
    • Chunking strategies (fixed, semantic, recursive)
    • Embedding model selection guide
    • Hybrid search and reranking patterns

    Tools

    Model Deployment Pipeline

    python scripts/model_deployment_pipeline.py --model model.pkl --target staging

    Generates deployment artifacts: Dockerfile, Kubernetes manifests, health checks.

    RAG System Builder

    python scripts/rag_system_builder.py --config rag_config.yaml --analyze

    Scaffolds RAG pipeline with vector store integration and retrieval logic.

    ML Monitoring Suite

    python scripts/ml_monitoring_suite.py --config monitoring.yaml --deploy

    Sets up drift detection, alerting, and performance dashboards.


    Tech Stack

    | Category | Tools |
    | --- | --- |
    | ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
    | LLM Frameworks | LangChain, LlamaIndex, DSPy |
    | MLOps | MLflow, Weights & Biases, Kubeflow |
    | Data | Spark, Airflow, dbt, Kafka |
    | Deployment | Docker, Kubernetes, Triton |
    | Databases | PostgreSQL, BigQuery, Pinecone, Redis |