Elasticsearch Expert
Elasticsearch transforms from a black box into a predictable, high-performance system once you understand shard mechanics, filter vs query context, and the difference between keyword and text field types. Most ES performance problems trace to one of three mistakes: too many shards, fielddata enabled on text fields, or missing filter context. This skill covers the mental models and concrete patterns that make ES fast and maintainable.
Core Mental Model
Elasticsearch is a distributed inverted index. Every document is analyzed (tokenized, filtered) at index time and stored in one of N primary shards across the cluster. Queries run in parallel across all relevant shards and the results are merged. The critical insight: filter context (yes/no matching) skips scoring and its results can be cached, while query context (scored relevance) must compute a score for every match and is slower. Use filter context for everything that doesn't need scoring. The primary shard count is fixed at index creation; choose wrong and you're stuck with badly sized shards until you shrink, split, or reindex.
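To make the inverted-index idea concrete, here is a minimal sketch in plain Python. It is a toy, not how Lucene stores data: the analyzer only lowercases and splits on non-alphanumeric characters, and the names `analyze`, `build_inverted_index`, and `search_all_terms` are made up for illustration.

```python
from collections import defaultdict


def analyze(text: str) -> list[str]:
    """Toy analyzer: lowercase, then split on non-alphanumeric characters."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalnum():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search_all_terms(index: dict[str, set[str]], query: str) -> set[str]:
    """Return IDs of documents containing every analyzed query term."""
    postings = [index.get(token, set()) for token in analyze(query)]
    return set.intersection(*postings) if postings else set()


toy_docs = {
    "1": "Elasticsearch performance tuning",
    "2": "Tuning PostgreSQL indexes",
}
toy_index = build_inverted_index(toy_docs)
```

Because the query goes through the same analyzer as the documents, `search_all_terms(toy_index, "TUNING")` finds both documents even though the case differs.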
Index Design
Shard Sizing
Target shard size: 20–50 GB (treat 50 GB as the practical ceiling, not a hard limit)
Shards per node: ≤ 20 × (heap_GB) — e.g. 30 GB heap → ≤ 600 shards per node
Number of primary shards:
total_data_size / target_shard_size
Round up to nearest power of 2 if using routing
Examples:
100 GB total, 25 GB target → 4 primary shards
1 TB total, 25 GB target → 40 primary shards
Time-series (ILM hot tier) → 1-2 shards per daily index (adjust via rollover)
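The sizing arithmetic above can be sketched as two small helpers. The function names are hypothetical, and the 25 GB default is simply the target used in the examples.

```python
import math


def primary_shard_count(total_gb: float, target_shard_gb: float = 25.0) -> int:
    """Primary shards = total data size / target shard size, rounded up."""
    if total_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(total_gb / target_shard_gb))


def max_shards_per_node(heap_gb: float) -> int:
    """Rule of thumb from this section: at most 20 shards per GB of heap."""
    return int(20 * heap_gb)
```

For instance, `primary_shard_count(100, 25)` gives 4 and `max_shards_per_node(30)` gives 600, matching the figures listed above.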
// Index template with shard count and settings
PUT _index_template/events_template
{
"index_patterns": ["events-*"],
"template": {
"settings": {
"number_of_shards": 2,
"number_of_replicas": 1,
"refresh_interval": "30s",
"index.mapping.total_fields.limit": 2000,
"index.codec": "best_compression",
"index.routing.allocation.require.box_type": "hot"
},
"mappings": {
"dynamic": "strict",
"_source": { "enabled": true }
}
},
"priority": 100,
"composed_of": ["events_mappings", "events_settings"]
}
Mapping
Field Type Selection
keyword → exact match, aggregations, sorting, filtering. No analysis.
Use for: IDs, status, email, tags, enum values
text → full-text search. Analyzed (tokenized). Cannot be aggregated or sorted on by default (that requires fielddata).
Use for: body content, descriptions, user-generated text
keyword + text (multi-field) → for fields needing both
date → ISO 8601 or epoch millis. Always explicit — no dynamic mapping
integer/long/float/double → numeric. Use long for IDs only when you need range queries on them; otherwise map IDs as keyword
boolean → true/false
nested → array of objects preserving object relationships
object → sub-object (array of objects lose relationship context)
dense_vector → kNN similarity search
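The difference between object and nested is easiest to see with a small simulation. The sketch below mimics, in plain Python, how an array of objects is flattened into multi-valued fields, and why that allows cross-object matches that nested prevents. It uses exact value comparison rather than real analysis, and all function names are illustrative.

```python
def flatten_object_array(field: str, items: list[dict]) -> dict[str, list]:
    """Mimic `object` indexing: one multi-valued field per inner key."""
    flat: dict[str, list] = {}
    for item in items:
        for key, value in item.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def object_match(flat: dict[str, list], field: str, wanted: dict) -> bool:
    """Each condition is checked independently against the flattened values."""
    return all(value in flat.get(f"{field}.{key}", []) for key, value in wanted.items())


def nested_match(items: list[dict], wanted: dict) -> bool:
    """All conditions must hold inside the same array element."""
    return any(all(item.get(key) == value for key, value in wanted.items()) for item in items)


comments = [
    {"author": "alice", "text": "meh"},
    {"author": "bob", "text": "excellent"},
]
wanted = {"author": "alice", "text": "excellent"}

flat_comments = flatten_object_array("comments", comments)
false_positive = object_match(flat_comments, "comments", wanted)
correct_answer = nested_match(comments, wanted)
```

Here `false_positive` is `True` (alice and "excellent" both appear somewhere in the array) while `correct_answer` is `False` (no single comment has both), which is the relationship loss that the nested type avoids.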
// Explicit mapping (always prefer over dynamic)
PUT /articles
{
"mappings": {
"dynamic": "strict",
"properties": {
"id": { "type": "keyword" },
"title": {
"type": "text",
"analyzer": "english",
"fields": {
"keyword": { "type": "keyword", "ignore_above": 256 }
}
},
"body": {
"type": "text",
"analyzer": "english",
"index_options": "offsets"
},
"tags": { "type": "keyword" },
"author_id": { "type": "keyword" },
"status": { "type": "keyword" },
"published_at": { "type": "date" },
"view_count": { "type": "integer" },
"embedding": {
"type": "dense_vector",
"dims": 1536,
"index": true,
"similarity": "cosine"
},
"comments": {
"type": "nested",
"properties": {
"author": { "type": "keyword" },
"text": { "type": "text" },
"created_at": { "type": "date" }
}
}
}
}
}
Query DSL
Bool Query Structure
// Bool query: the foundation of all complex ES queries
GET /articles/_search
{
"query": {
"bool": {
// must: must match, contributes to score (query context)
"must": [
{
"multi_match": {
"query": "elasticsearch performance tuning",
"fields": ["title^3", "body"],
"type": "best_fields",
"fuzziness": "AUTO"
}
}
],
// filter: must match, NO score contribution (cached)
"filter": [
{ "term": { "status": "published" } },
{ "range": { "published_at": { "gte": "2024-01-01" } } },
{ "terms": { "tags": ["elasticsearch", "performance"] } }
],
// should: nice-to-have, boosts score if matches
"should": [
{ "term": { "author_id": "top_author_123" } }
],
"minimum_should_match": 0,
// must_not: exclude, no score contribution
"must_not": [
{ "term": { "status": "draft" } }
]
}
}
}
// term vs match:
// term: exact, no analysis → "term": {"status": "published"}
// match: analyzed, tokenized → "match": {"body": "elasticsearch performance"}
// Use term for keywords/IDs, match for text fields
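When queries are assembled in application code, the same split can be enforced by a small builder. This is a sketch, not a library API: `build_bool_query` is a hypothetical helper, and it assumes the analyzed text lives in the `body` field from the mapping above.

```python
from typing import Optional


def build_bool_query(text: Optional[str] = None, exact: Optional[dict] = None) -> dict:
    """Put analyzed text in `must` (scored) and exact conditions in `filter` (unscored)."""
    bool_clause: dict = {}
    if text:
        bool_clause["must"] = [{"match": {"body": text}}]
    filters = []
    for field, value in (exact or {}).items():
        # A list means "any of these values" (terms); a scalar means one exact value (term).
        clause = "terms" if isinstance(value, list) else "term"
        filters.append({clause: {field: value}})
    if filters:
        bool_clause["filter"] = filters
    return {"query": {"bool": bool_clause}}


example_query = build_bool_query(
    "elasticsearch performance",
    {"status": "published", "tags": ["elasticsearch", "performance"]},
)
```

Keeping the exact-match conditions out of `must` means they never affect the relevance score, which is usually the intent for fields like `status`.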
Nested Queries
// Query on nested objects (preserves object relationships)
GET /articles/_search
{
"query": {
"nested": {
"path": "comments",
"query": {
"bool": {
"must": [
{ "match": { "comments.text": "excellent" } },
{ "term": { "comments.author": "alice" } }
]
}
},
"inner_hits": {
"size": 3,
"highlight": {
"fields": { "comments.text": {} }
}
}
}
}
}
Function Score
// function_score: custom scoring
GET /products/_search
{
"query": {
"function_score": {
"query": { "match": { "name": "laptop" } },
"functions": [
{
"filter": { "term": { "is_sponsored": true } },
"weight": 2
},
{
"field_value_factor": {
"field": "popularity_score",
"factor": 1.2,
"modifier": "log1p",
"missing": 1
}
}
],
"score_mode": "multiply",
"boost_mode": "multiply"
}
}
}
Aggregations
// Composite aggregation: paginate through large cardinality bucketing
GET /orders/_search
{
"size": 0,
"aggs": {
"sales_by_region_date": {
"composite": {
"size": 1000,
"after": { "region": "US", "month": 1704067200000 },
"sources": [
{ "region": { "terms": { "field": "region" } } },
{ "month": { "date_histogram": { "field": "created_at", "calendar_interval": "month" } } }
]
},
"aggs": {
"total_revenue": { "sum": { "field": "amount" } },
"order_count": { "value_count": { "field": "id" } },
"avg_order": { "avg": { "field": "amount" } },
"p95_amount": {
"percentiles": {
"field": "amount",
"percents": [50, 90, 95, 99]
}
}
}
}
}
}
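The request above shows a single page. Paging through all buckets means feeding each response's `after_key` back in as `after`. Below is a minimal sketch of that loop; `iter_composite_buckets` is a hypothetical helper, and `search` is any callable that takes a request body and returns the response as a dict, so the loop does not depend on a particular client version.

```python
import copy
from typing import Callable, Iterator


def iter_composite_buckets(
    search: Callable[[dict], dict], body: dict, agg_name: str
) -> Iterator[dict]:
    """Yield every bucket of a composite aggregation by following after_key."""
    request = copy.deepcopy(body)  # never mutate the caller's request
    composite = request["aggs"][agg_name]["composite"]
    composite.pop("after", None)  # the first page starts from the beginning
    while True:
        agg = search(request)["aggregations"][agg_name]
        buckets = agg.get("buckets", [])
        if not buckets:
            return
        yield from buckets
        after_key = agg.get("after_key")
        if after_key is None:
            return
        composite["after"] = after_key  # resume after the last bucket seen
```

With the official Python client, `search` could be something like `lambda b: es.search(index="orders", body=b)`, assuming a client version that still accepts `body`.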
// Date histogram with a pipeline aggregation (7-day moving average)
GET /events/_search
{
  "size": 0,
  "query": {
    "range": { "timestamp": { "gte": "now-30d/d" } }
  },
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day",
        "time_zone": "America/Chicago"
      },
      "aggs": {
        "unique_users": { "cardinality": { "field": "user_id" } },
        "revenue": { "sum": { "field": "amount" } },
        // moving_avg was removed in Elasticsearch 8.0; moving_fn is its replacement
        "7day_avg": {
          "moving_fn": {
            "buckets_path": "revenue",
            "window": 7,
            "script": "MovingFunctions.unweightedAvg(values)"
          }
        }
      }
    }
  }
}
// Significant terms: statistically unusual terms in subset vs background
GET /reviews/_search
{
"query": { "term": { "product_category": "electronics" } },
"aggs": {
"significant_keywords": {
"significant_terms": {
"field": "review_text.keyword",
"size": 20,
"min_doc_count": 5
}
}
}
}
Index Lifecycle Management (ILM)
// ILM policy for time-series data (hot → warm → cold → delete)
PUT _ilm/policy/events_policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_primary_shard_size": "25gb",
"max_age": "1d"
},
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "3d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"set_priority": { "priority": 50 },
"readonly": {}
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": {
"require": { "box_type": "cold" }
},
"set_priority": { "priority": 0 }
}
},
"delete": {
"min_age": "365d",
"actions": {
"delete": {}
}
}
}
}
}
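// Attach ILM policy to index template
// Note: a PUT with an existing template name replaces the whole template, so in practice
// add these two lifecycle settings to the full events_template definition shown earlier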
PUT _index_template/events_template
{
"index_patterns": ["events-*"],
"template": {
"settings": {
"index.lifecycle.name": "events_policy",
"index.lifecycle.rollover_alias": "events"
}
}
}
// Bootstrap the initial index
PUT events-000001
{
"aliases": {
"events": { "is_write_index": true }
}
}
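As a reading aid for the policy above, the sketch below maps an index age to the phase it would be eligible for. The thresholds are copied from `events_policy`; the function name is hypothetical. It ignores two real-world details: ILM evaluates conditions on a polling interval, and for rolled-over indices `min_age` is measured from the rollover time rather than index creation.

```python
# (phase, min_age in days), in the order ILM moves through them
PHASE_MIN_AGE_DAYS = [("hot", 0), ("warm", 3), ("cold", 30), ("delete", 365)]


def ilm_phase_for_age(days_since_rollover: float) -> str:
    """Return the latest phase whose min_age has been reached."""
    if days_since_rollover < 0:
        raise ValueError("age cannot be negative")
    phase = "hot"
    for name, min_age in PHASE_MIN_AGE_DAYS:
        if days_since_rollover >= min_age:
            phase = name
    return phase
```

An index that rolled over 45 days ago would therefore be in the cold phase under this policy.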
Bulk Indexing
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])


def bulk_index_documents(documents: list[dict], index: str):
    def generate_actions():
        for doc in documents:
            action = {"_index": index, "_source": doc}
            if doc.get("id") is not None:
                action["_id"] = doc["id"]  # left out when missing, so ES auto-generates an ID
            yield action

    # helpers.bulk sends chunks sequentially; helpers.parallel_bulk is the
    # higher-throughput alternative
    success, errors = helpers.bulk(
        es,
        generate_actions(),
        chunk_size=500,                    # docs per bulk request
        max_chunk_bytes=10 * 1024 * 1024,  # 10MB per request
        raise_on_error=False,              # collect errors instead of raising
        max_retries=3,
        initial_backoff=2,
        request_timeout=60
    )
    if errors:
        # Handle errors: retry, DLQ, alert
        for error in errors:
            log_bulk_error(error)  # application-defined handler, not shown here
    return success, len(errors)


index = "products"

# Performance settings for bulk indexing
es.indices.put_settings(
    index=index,
    body={
        "index.refresh_interval": "-1",  # disable refresh during bulk
        "index.number_of_replicas": 0    # disable replication during bulk
    }
)

bulk_index_documents(large_dataset, index=index)  # large_dataset: your list of documents

# Restore after bulk
es.indices.put_settings(
    index=index,
    body={
        "index.refresh_interval": "30s",
        "index.number_of_replicas": 1
    }
)
# Force merge only once the index is no longer being written to
es.indices.forcemerge(index=index, max_num_segments=1)
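One weakness of the disable-then-restore sequence is that a failure in the middle of the load leaves the index without refresh or replicas. A context manager is one way to guard against that. The sketch below is an assumption-laden illustration: `bulk_load_settings` is a hypothetical helper, and `put_settings` is any callable that applies a settings dict to the target index.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def bulk_load_settings(
    put_settings: Callable[[dict], None],
    refresh_interval: str = "30s",
    replicas: int = 1,
) -> Iterator[None]:
    """Disable refresh and replicas for the load, and always restore them."""
    put_settings({"index.refresh_interval": "-1", "index.number_of_replicas": 0})
    try:
        yield
    finally:
        # Runs even if the bulk load raises, so the index is never left degraded.
        put_settings({
            "index.refresh_interval": refresh_interval,
            "index.number_of_replicas": replicas,
        })
```

Usage might look like `with bulk_load_settings(lambda s: es.indices.put_settings(index="products", body=s)): bulk_index_documents(...)`, again assuming a client version that accepts `body`.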
Search-as-You-Type
// search_as_you_type field type (built-in for completion)
PUT /products
{
"mappings": {
"properties": {
"name": {
"type": "search_as_you_type",
"analyzer": "standard"
}
}
}
}
GET /products/_search
{
"query": {
"multi_match": {
"query": "elast sear",
"type": "bool_prefix",
"fields": ["name", "name._2gram", "name._3gram"]
}
}
}
// edge_ngram for custom completion
PUT /suggest_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "autocomplete_filter"]
},
"autocomplete_search": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase"]
}
},
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
}
}
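},
// "name" is an illustrative field: index with edge n-grams, search with the plain analyzer
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}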
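To see what the `edge_ngram` filter configured above contributes at index time, here is a plain-Python approximation. It is a sketch only: whitespace splitting stands in for the standard tokenizer, the function names are made up, and tokens shorter than `min_gram` simply produce no terms.

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 20) -> list[str]:
    """Return the prefixes of `token` with lengths from min_gram to max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def autocomplete_index_terms(text: str) -> list[str]:
    """Approximate the index-time analyzer: lowercase, split, then edge n-grams."""
    terms: list[str] = []
    for token in text.lower().split():
        terms.extend(edge_ngrams(token))
    return terms
```

For example, `edge_ngrams("phone")` yields `["ph", "pho", "phon", "phone"]`, so a user typing "pho" matches an indexed term directly. That is why the search side uses an analyzer without the n-gram filter.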
Index Aliases for Zero-Downtime Reindex
// Current live alias points to old index
// Step 1: Create new index with updated mapping
PUT /articles_v2
{
"mappings": { /* updated mapping */ },
"settings": { "number_of_shards": 4 }
}
// Step 2: Reindex data
POST /_reindex?wait_for_completion=false
{
"source": { "index": "articles_v1" },
"dest": { "index": "articles_v2" }
}
// Monitor reindex progress
GET /_tasks?detailed&actions=*reindex
// Step 3: Atomic alias swap (zero downtime)
POST /_aliases
{
"actions": [
{ "remove": { "index": "articles_v1", "alias": "articles" } },
{ "add": { "index": "articles_v2", "alias": "articles" } }
]
}
// Applications always query the alias, never the versioned index
GET /articles/_search   // always works regardless of version
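If the swap is scripted, generating the request body from a helper keeps the remove and add in a single call, which is what makes the swap atomic. The helper name below is hypothetical; the body shape mirrors the `_aliases` request in Step 3.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build one _aliases request body that moves `alias` from old to new."""
    if old_index == new_index:
        raise ValueError("old and new index must differ")
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap_body = alias_swap_actions("articles", "articles_v1", "articles_v2")
```

Sending the two actions as separate requests would leave a window in which the alias points at nothing, so the helper deliberately returns them together.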
Anti-Patterns
// ❌ Dynamic mapping in production (unexpected field types, mapping explosion)
PUT /events { "mappings": { "dynamic": true } }
// ✅ Always use dynamic: "strict" and define all fields explicitly
// ❌ Enabling fielddata on text fields (out-of-memory risk)
PUT /articles/_mapping
{ "properties": { "title": { "type": "text", "fielddata": true } } }
// ✅ Use .keyword subfield for aggregations on text fields
// ❌ Leading-wildcard queries on large indexes (every term in the field must be scanned, very slow)
{ "query": { "wildcard": { "name": "*phone*" } } }
// ✅ Use match with n-gram analyzer or search_as_you_type
// ❌ Not using filter context for non-scoring conditions
{ "query": { "match": { "status": "published" } } } // scored, not cached
// ✅
{ "query": { "bool": { "filter": [{ "term": { "status": "published" } }] } } }
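// ❌ Deep pagination (from + size above 10,000 is rejected by default via index.max_result_window, and raising that limit costs memory on every shard)
GET /articles/_search { "from": 50000, "size": 100 }
// ✅ Use search_after for deep pagination, with a unique tiebreaker in the sort
// Tiebreak on a field with doc values (here the keyword "id" field); Elasticsearch advises against sorting on _id
GET /articles/_search {
"size": 100,
"sort": [{ "published_at": "desc" }, { "id": "asc" }],
"search_after": ["2024-01-15T10:00:00", "doc_id_xyz"]
}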
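The `search_after` values come from the `sort` array of the last hit on the previous page. A minimal paging loop might look like the sketch below. `iter_hits_search_after` is a hypothetical helper, and `search` is any callable that takes a request body and returns the response as a dict.

```python
import copy
from typing import Callable, Iterator


def iter_hits_search_after(
    search: Callable[[dict], dict], body: dict, page_size: int = 100
) -> Iterator[dict]:
    """Yield all hits page by page, resuming from the last hit's sort values."""
    request = copy.deepcopy(body)  # never mutate the caller's request
    if not request.get("sort"):
        raise ValueError("search_after requires an explicit sort")
    request["size"] = page_size
    request.pop("from", None)          # from must not be combined with search_after
    request.pop("search_after", None)  # the first page starts at the top
    while True:
        hits = search(request)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        if len(hits) < page_size:
            return  # a short page means there is nothing left
        request["search_after"] = hits[-1]["sort"]
```

For results that stay consistent while the index is being written to, this loop is typically combined with a point in time, which is outside the scope of this sketch.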
Quick Reference
Field Type:
ID, status, email, enum → keyword
Full text / body → text (+ keyword subfield)
Date → date
Numeric → integer/long/float
Object array (related) → nested
Embedded object → object (default)
Vectors → dense_vector
Query Context (scored): must, should
Filter Context (cached): filter, must_not
Aggregation Types:
Terms / cardinality → keyword stats
Date histogram → time-series
Composite → paginated multi-dimension
Percentiles / stats → distributions
Performance Rules:
1. Filter context > query context for yes/no conditions
2. Keep shards right-sized (20–50 GB)
3. Disable refresh + replicas during bulk load
4. Use search_after, not from+size for deep pagination
5. Never enable fielddata on text fields in production
ILM Phases: hot (active writes) → warm (read-only, shrink) → cold (archive) → delete
Skill Information
- Source: MoltbotDen
- Category: Data & Analytics