Azure Machine Learning SDK for Python: Setup, Usage & Best Practices
Azure Machine Learning SDK v2 for Python provides comprehensive ML lifecycle management through a unified client (MLClient) that orchestrates workspaces, training jobs, model registries, datasets, compute clusters, and pipelines. This skill enables Python-based ML workflows with infrastructure-as-code, versioned assets, and reproducible experiments without manual Azure Portal configuration.
What This Skill Does
This SDK abstracts Azure ML operations through a single client interface with property-based access to different resource types: ml_client.jobs for training runs, ml_client.models for model registry, ml_client.data for datasets, ml_client.compute for infrastructure, and ml_client.environments for containerized runtimes. Each operation supports CRUD patterns—create/update, get, list, delete—enabling programmatic control over the entire ML lifecycle from data preparation through model deployment.
The workflow centers on versioned assets and declarative job definitions. Register datasets with versions (my-dataset:1, my-dataset:2), train models referencing specific dataset versions, register trained models with lineage tracking, then deploy models from the registry. Jobs execute as command scripts (single-file training), pipelines (multi-step workflows), or sweeps (hyperparameter tuning). Compute auto-scales based on load, environments ensure reproducible dependencies, and all operations emit telemetry for debugging and cost tracking.
Unlike Azure ML SDK v1's fragmented APIs (Workspace, Experiment, Run classes), v2 unifies everything under MLClient with consistent patterns. This reduces boilerplate, simplifies authentication (one credential for all operations), and enables infrastructure-as-code workflows where workspace configuration lives in Python scripts or Jupyter notebooks version-controlled alongside training code.
Getting Started
Install the SDK:
pip install azure-ai-ml azure-identity
Configure environment variables:
export AZURE_SUBSCRIPTION_ID=<subscription-id>
export AZURE_RESOURCE_GROUP=<resource-group>
export AZURE_ML_WORKSPACE_NAME=<workspace-name>
Create an MLClient:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
import os
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
    resource_group_name=os.environ["AZURE_RESOURCE_GROUP"],
    workspace_name=os.environ["AZURE_ML_WORKSPACE_NAME"]
)
For local development, DefaultAzureCredential picks up your Azure CLI session (az login). In production it falls back automatically to managed identities and service principal environment variables.
Key Features
Unified Client Interface: Single client for all Azure ML operations. No separate classes for workspaces, experiments, runs—everything through MLClient properties.
Versioned Assets: Data, models, and environments use semantic versioning. Reference specific versions in jobs for reproducibility or use latest for development.
Declarative Jobs: Define training jobs as Python objects with the command() builder (no YAML required), submit with ml_client.jobs.create_or_update(), and monitor via streaming logs.
Auto-Scaling Compute: Create clusters with min/max instance counts. Compute scales to zero when idle (configurable delay), reducing costs automatically.
Pipeline Orchestration: Build multi-step workflows with the @dsl.pipeline decorator. Steps declare inputs/outputs, SDK handles data flow and parallelization.
Environment Management: Package dependencies in Docker images or conda files. Reuse environments across jobs for consistency and faster startup times.
Model Registry: Central repository for trained models with lineage (which job produced it), metrics, and deployment history.
Usage Examples
Register Dataset:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
dataset = Data(
    name="training-data",
    version="1",
    path="azureml://datastores/workspaceblobstore/paths/data/train.csv",
    type=AssetTypes.URI_FILE
)
ml_client.data.create_or_update(dataset)
Submit Training Job:
from azure.ai.ml import command, Input
job = command(
    code="./src",
    command="python train.py --data ${{inputs.data}} --lr ${{inputs.lr}}",
    inputs={
        "data": Input(type="uri_file", path="azureml:training-data:1"),
        "lr": 0.01
    },
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster"
)
submitted_job = ml_client.jobs.create_or_update(job)
print(f"Job: {submitted_job.studio_url}")
Register Model:
from azure.ai.ml.entities import Model
model = Model(
    name="my-classifier",
    version="1",
    path="./outputs/model/",
    type=AssetTypes.CUSTOM_MODEL
)
ml_client.models.create_or_update(model)
Create Compute Cluster:
from azure.ai.ml.entities import AmlCompute
cluster = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC6",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=300
)
ml_client.compute.begin_create_or_update(cluster).result()
Best Practices
Version Everything: Give every data, model, and environment asset an explicit version and pin those versions in production jobs (training-data:1, not @latest). Enables rollback and reproducibility.
Tag Resources: Add tags ({"project": "fraud-detection", "env": "prod"}) to all resources for cost tracking and organization.
Set Idle Scale-Down: Configure idle_time_before_scale_down to balance responsiveness (short delay) versus cost (long delay). Typical: 120-300 seconds.
Stream Job Logs: Use ml_client.jobs.stream(job_name) to monitor training in real-time. Catches errors early without waiting for completion.
Use Environments: Don't inline pip install in job commands. Create environments from a conda_file plus a base image, or from a Docker build context, for reproducibility.
Store Credentials Securely: Use Azure Key Vault or environment variables. Never hardcode subscription IDs or keys in code.
Clean Up Experiments: Delete failed jobs and unused compute to avoid clutter and reduce costs.
When to Use / When NOT to Use
Use this skill when:
- You're building Python-based ML pipelines
- You need versioned datasets and models
- You want infrastructure-as-code for ML workflows
- You're orchestrating multi-step training pipelines
- You need auto-scaling compute for training
- You're implementing MLOps practices in Azure
- You want unified API for all Azure ML operations
Avoid this skill when:
- You're on AWS or GCP (use SageMaker/Vertex AI SDKs)
- You need real-time inference only (deploy once to a managed online endpoint, then call it with a plain HTTP client)
- You're training on local hardware without cloud (use vanilla Python ML libraries)
- You prefer Azure ML Studio UI over programmatic control
- Your models fit in notebooks without productionization needs
Related Skills
- azure-ai-projects-py: Azure AI project orchestration
- agents-v2-py: Container-based AI agents
- azure-ai-openai-dotnet: Azure OpenAI integration
Source
Maintained by Microsoft. View on GitHub