elasticsearch-expert

Expert knowledge of Elasticsearch index design, mapping, query DSL, aggregations, performance tuning, bulk indexing, ILM, and cross-cluster search. Trigger phrases: when working with Elasticsearch, Elasticsearch index design, query DSL,

Elasticsearch Expert

Elasticsearch transforms from a black box into a predictable, high-performance system once you understand shard mechanics, filter vs query context, and the difference between keyword and text field types. Most ES performance problems trace to one of three mistakes: too many shards, fielddata enabled on text fields, or missing filter context. This skill covers the mental models and concrete patterns that make ES fast and maintainable.

Core Mental Model

Elasticsearch is a distributed inverted index. Text fields are analyzed (tokenized, filtered) at index time, and each document is stored in exactly one of the index's N primary shards. Queries run in parallel across all relevant shards and the results are merged. The critical insight: filter context (yes/no matching) skips scoring and frequently used filters are cached, while query context (scored relevance) computes a score for every match and is slower. Use filter context for everything that doesn't need scoring. The primary shard count is fixed at index creation; choose wrong and you're stuck with poorly sized shards until you split, shrink, or reindex into a new index.
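
To make the "shard count is fixed" point concrete, here is a minimal sketch of hash-based routing. It assumes the general scheme of "hash the routing key, take it modulo the number of primary shards"; `zlib.crc32` is only a stand-in hash for illustration and `shard_for` is a hypothetical helper, not an Elasticsearch API.

```python
import zlib


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Pick a shard as hash(routing_key) mod primary shard count.

    zlib.crc32 is a stand-in hash for illustration only; Elasticsearch
    uses its own hash function internally.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Count how many documents would map to a different shard if the index
# had 5 primary shards instead of 4.
moved = sum(1 for d in doc_ids if shard_for(d, 4) != shard_for(d, 5))
print(f"{moved} of {len(doc_ids)} documents would land on a different shard")
```

Because the shard number depends on the shard count, changing the count would strand existing documents on the "wrong" shard, which is why resizing means building a new index.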

Index Design

Shard Sizing

Target shard size: 20–50 GB (treat 50 GB as a practical ceiling, not a hard limit)
Shards per node:  ≤ 20 × (heap_GB) as a rule of thumb, e.g. 30 GB heap → ≤ 600 shards per node
Number of primary shards:
  total_data_size / target_shard_size
  Round up to nearest power of 2 if using routing

Examples:
  100 GB total, 25 GB target → 4 primary shards
  1 TB total, 25 GB target   → 40 primary shards
  Time-series (ILM hot tier) → 1-2 shards per daily index (adjust via rollover)
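
The sizing arithmetic above can be sketched as two small helpers. The function names and the 25 GB default are illustrative assumptions; the formulas are the ones listed in this section.

```python
import math


def primary_shard_count(total_gb: float, target_shard_gb: float = 25.0) -> int:
    """total_data_size / target_shard_size, rounded up, with a minimum of 1."""
    if total_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(total_gb / target_shard_gb))


def max_shards_per_node(heap_gb: float, shards_per_heap_gb: int = 20) -> int:
    """Rule-of-thumb ceiling: 20 shards per GB of JVM heap."""
    return int(heap_gb * shards_per_heap_gb)


print(primary_shard_count(100))    # 4
print(primary_shard_count(1000))   # 40
print(max_shards_per_node(30))     # 600
```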

// Index template with shard count and settings
PUT _index_template/events_template
{
  "index_patterns": ["events-*"],
  "template": {
    "settings": {
      "number_of_shards":   2,
      "number_of_replicas": 1,
      "refresh_interval":   "30s",
      "index.mapping.total_fields.limit": 2000,
      "index.codec": "best_compression",
      "index.routing.allocation.require.box_type": "hot"
    },
    "mappings": {
      "dynamic": "strict",
      "_source": { "enabled": true }
    }
  },
  "priority": 100,
  "composed_of": ["events_mappings", "events_settings"]
}

Mapping

Field Type Selection

keyword  → exact match, aggregations, sorting, filtering. No analysis.
           Use for: IDs, status, email, tags, enum values
text     → full-text search. Analyzed (tokenized). Cannot aggregate.
           Use for: body content, descriptions, user-generated text
keyword + text (multi-field) → for fields needing both
date     → ISO 8601 or epoch millis. Always explicit — no dynamic mapping
integer/long/float/double → numeric. Use for values you do math or range queries on; map IDs as keyword
boolean  → true/false
nested   → array of objects preserving object relationships
object   → sub-object (array of objects lose relationship context)
dense_vector → kNN similarity search
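
The difference between object and nested is easier to see with a small simulation. This is a sketch of the flattening idea only, in plain Python; `flatten_object_array` is a hypothetical helper and the sample comments are made up.

```python
from collections import defaultdict


def flatten_object_array(field: str, objects: list[dict]) -> dict[str, list]:
    """Mimic how an array of plain objects is flattened into per-field value lists."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{field}.{key}"].append(value)
    return dict(flat)


comments = [
    {"author": "alice", "text": "confusing"},
    {"author": "bob", "text": "excellent"},
]

flat = flatten_object_array("comments", comments)

# Object-style matching: each condition is checked against the flattened lists,
# so alice's author value can pair with bob's text.
object_match = "alice" in flat["comments.author"] and "excellent" in flat["comments.text"]

# Nested-style matching: both conditions must hold inside the same object.
nested_match = any(c["author"] == "alice" and c["text"] == "excellent" for c in comments)

print(object_match, nested_match)  # True False
```

The false positive in the object case is the "lost relationship context" mentioned above, and it is the reason to pay the extra cost of nested when cross-field conditions matter.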

// Explicit mapping (always prefer over dynamic)
PUT /articles
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id": { "type": "keyword" },
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "body": {
        "type": "text",
        "analyzer": "english",
        "index_options": "offsets"
      },
      "tags": { "type": "keyword" },
      "author_id": { "type": "keyword" },
      "status": { "type": "keyword" },
      "published_at": { "type": "date" },
      "view_count": { "type": "integer" },
      "embedding": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine"
      },
      "comments": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "text": { "type": "text" },
          "created_at": { "type": "date" }
        }
      }
    }
  }
}

Query DSL

Bool Query Structure

// Bool query: the foundation of all complex ES queries
GET /articles/_search
{
  "query": {
    "bool": {
      // must: must match, contributes to score (query context)
      "must": [
        {
          "multi_match": {
            "query": "elasticsearch performance tuning",
            "fields": ["title^3", "body"],
            "type": "best_fields",
            "fuzziness": "AUTO"
          }
        }
      ],
      // filter: must match, NO score contribution (cached)
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "published_at": { "gte": "2024-01-01" } } },
        { "terms": { "tags": ["elasticsearch", "performance"] } }
      ],
      // should: nice-to-have, boosts score if matches
      "should": [
        { "term": { "author_id": "top_author_123" } }
      ],
      "minimum_should_match": 0,
      // must_not: exclude, no score contribution
      "must_not": [
        { "term": { "status": "draft" } }
      ]
    }
  }
}

// term vs match:
// term: exact, no analysis → "term": {"status": "published"}
// match: analyzed, tokenized → "match": {"body": "elasticsearch performance"}
// Use term for keywords/IDs, match for text fields
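
A toy inverted index shows why term and match behave differently on analyzed text. This is a rough sketch: `analyze` is a crude stand-in for a real analyzer (lowercase, split on non-alphanumerics), and all function names are hypothetical.

```python
import re


def analyze(text: str) -> list[str]:
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each analyzed token to the set of document IDs containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_lookup(index: dict[str, set[str]], value: str) -> set[str]:
    """term-style lookup: the input is NOT analyzed, it must equal a stored token."""
    return index.get(value, set())


def match_lookup(index: dict[str, set[str]], text: str) -> set[str]:
    """match-style lookup: analyze the input, then OR the per-token results."""
    hits: set[str] = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


docs = {"1": "Elasticsearch Performance Tuning", "2": "Tuning PostgreSQL"}
index = build_inverted_index(docs)

print(term_lookup(index, "Elasticsearch"))   # set()  (stored token is lowercase)
print(match_lookup(index, "Elasticsearch"))  # {'1'}
```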

Nested Queries

// Query on nested objects (preserves object relationships)
GET /articles/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "bool": {
          "must": [
            { "match": { "comments.text": "excellent" } },
            { "term": { "comments.author": "alice" } }
          ]
        }
      },
      "inner_hits": {
        "size": 3,
        "highlight": {
          "fields": { "comments.text": {} }
        }
      }
    }
  }
}

Custom Scoring

// function_score: custom scoring
GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "name": "laptop" } },
      "functions": [
        {
          "filter": { "term": { "is_sponsored": true } },
          "weight": 2
        },
        {
          "field_value_factor": {
            "field": "popularity_score",
            "factor": 1.2,
            "modifier": "log1p",
            "missing": 1
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}
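
The arithmetic behind the function_score request above can be worked through by hand. The sketch below reflects my reading of the documented behavior (log1p as log10(1 + factor × value), non-matching filter functions skipped, both modes set to multiply); treat it as an approximation for reasoning about scores, not as the engine's implementation.

```python
import math
from typing import Optional


def field_value_factor(value: Optional[float], factor: float = 1.2,
                       missing: float = 1.0) -> float:
    """log1p modifier: log10(1 + factor * value); `missing` replaces an absent value."""
    v = missing if value is None else value
    return math.log10(1 + factor * v)


def function_score(query_score: float, is_sponsored: bool,
                   popularity_score: Optional[float]) -> float:
    """Combine the two functions with score_mode=multiply, then boost_mode=multiply."""
    function_scores = [field_value_factor(popularity_score)]
    if is_sponsored:
        # The weight function only applies when its filter matches.
        function_scores.append(2.0)
    combined = math.prod(function_scores)   # score_mode: multiply
    return query_score * combined           # boost_mode: multiply


# With popularity_score = 7.5, the factor term is log10(1 + 9) = 1,
# so sponsorship alone doubles the score.
print(function_score(3.0, False, 7.5))
print(function_score(3.0, True, 7.5))
```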

Aggregations

// Composite aggregation: paginate through large cardinality bucketing
GET /orders/_search
{
  "size": 0,
  "aggs": {
    "sales_by_region_date": {
      "composite": {
        "size": 1000,
        // "after" is the after_key from the previous response; omit it on the first request
        "after": { "region": "US", "month": 1704067200000 },
        "sources": [
          { "region": { "terms": { "field": "region" } } },
          { "month":  { "date_histogram": { "field": "created_at", "calendar_interval": "month" } } }
        ]
      },
      "aggs": {
        "total_revenue": { "sum":    { "field": "amount" } },
        "order_count":   { "value_count": { "field": "id" } },
        "avg_order":     { "avg":    { "field": "amount" } },
        "p95_amount": {
          "percentiles": {
            "field": "amount",
            "percents": [50, 90, 95, 99]
          }
        }
      }
    }
  }
}
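
Paging through a composite aggregation means feeding each response's `after_key` back into the next request. Here is a minimal sketch of that loop. The `search` callable is an assumption (any function that takes a request body and returns the response as a dict-like object), so the loop can be exercised without a cluster.

```python
import copy
from typing import Callable, Iterator


def iter_composite_buckets(search: Callable[[dict], dict], body: dict,
                           agg_name: str) -> Iterator[dict]:
    """Yield every bucket of a composite aggregation, following after_key."""
    request = copy.deepcopy(body)            # never mutate the caller's body
    composite = request["aggs"][agg_name]["composite"]
    composite.pop("after", None)             # the first page has no "after"
    while True:
        agg = search(request)["aggregations"][agg_name]
        buckets = agg.get("buckets", [])
        if not buckets:
            return
        yield from buckets
        after_key = agg.get("after_key")
        if after_key is None:
            return
        composite["after"] = after_key       # resume after the last bucket seen
```

With the Python client, `search` could be something like `lambda b: es.search(index="orders", body=b)`, assuming a client instance named `es`.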

// Date histogram with pipeline aggregations
GET /events/_search
{
  "size": 0,
  "query": {
    "range": { "timestamp": { "gte": "now-30d/d" } }
  },
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day",
        "time_zone": "America/Chicago"
      },
      "aggs": {
        "unique_users": { "cardinality": { "field": "user_id" } },
        "revenue":      { "sum": { "field": "amount" } },
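        "7day_avg": {
          // moving_avg was removed in Elasticsearch 8.0; moving_fn is its replacement
          "moving_fn": {
            "buckets_path": "revenue",
            "window": 7,
            "script": "MovingFunctions.unweightedAvg(values)"
          }
        }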
      }
    }
  }
}
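
For intuition about what the 7-day average pipeline produces, here is a plain-Python sketch of a simple (unweighted) trailing average. It assumes the window covers the buckets before the current one; whether the current bucket is included depends on the aggregation and its settings, so treat the windowing here as an assumption.

```python
from typing import Optional


def trailing_average(values: list[float], window: int = 7) -> list[Optional[float]]:
    """Unweighted average of up to `window` values preceding each position."""
    result: list[Optional[float]] = []
    for i in range(len(values)):
        previous = values[max(0, i - window):i]
        # No preceding buckets yet means there is nothing to average.
        result.append(sum(previous) / len(previous) if previous else None)
    return result


daily_revenue = [10.0, 20.0, 30.0, 40.0]
print(trailing_average(daily_revenue, window=2))  # [None, 10.0, 15.0, 25.0]
```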

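// Significant text: statistically unusual terms in subset vs background.
// significant_terms on a .keyword subfield would treat each whole review as one term,
// so analyze the text field with significant_text instead.
GET /reviews/_search
{
  "size": 0,
  "query": { "term": { "product_category": "electronics" } },
  "aggs": {
    "significant_keywords": {
      "significant_text": {
        "field": "review_text",
        "size": 20,
        "min_doc_count": 5
      }
    }
  }
}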

Index Lifecycle Management (ILM)

// ILM policy for time-series data (hot → warm → cold → delete)
PUT _ilm/policy/events_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "25gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 },
          "readonly": {}
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": { "box_type": "cold" }
          },
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
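
The `min_age` thresholds in the policy above define which phase an index should be in at a given age. The sketch below only models that threshold logic; `phase_for_age` is a hypothetical helper, and it assumes age is measured from rollover for rolled-over indices. Real transitions also wait for the ILM poll interval and for each phase's actions to complete.

```python
from datetime import timedelta

# (phase, min_age) pairs taken from the events_policy example, in order.
PHASE_MIN_AGE = [
    ("hot", timedelta(0)),
    ("warm", timedelta(days=3)),
    ("cold", timedelta(days=30)),
    ("delete", timedelta(days=365)),
]


def phase_for_age(age: timedelta) -> str:
    """Return the last phase whose min_age the index has reached."""
    current = PHASE_MIN_AGE[0][0]
    for name, min_age in PHASE_MIN_AGE:
        if age >= min_age:
            current = name
    return current


print(phase_for_age(timedelta(hours=5)))   # hot
print(phase_for_age(timedelta(days=10)))   # warm
print(phase_for_age(timedelta(days=400)))  # delete
```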

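// Attach ILM policy to index template
// Note: PUT replaces the whole template, so in practice merge these two settings
// into the full events_template definition from the Shard Sizing section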
PUT _index_template/events_template
{
  "index_patterns": ["events-*"],
  "template": {
    "settings": {
      "index.lifecycle.name":        "events_policy",
      "index.lifecycle.rollover_alias": "events"
    }
  }
}

// Bootstrap the initial index
PUT events-000001
{
  "aliases": {
    "events": { "is_write_index": true }
  }
}

Bulk Indexing

import logging

from elasticsearch import Elasticsearch, helpers

logger = logging.getLogger(__name__)
es = Elasticsearch(["http://localhost:9200"])

def bulk_index_documents(documents: list[dict], index: str) -> tuple[int, int]:
    def generate_actions():
        for doc in documents:
            action = {"_index": index, "_source": doc}
            if doc.get("id") is not None:
                action["_id"] = doc["id"]  # omit _id for auto-generated IDs
            yield action

    # helpers.bulk sends chunks sequentially; helpers.parallel_bulk is the
    # multi-threaded variant when you need higher throughput
    success, errors = helpers.bulk(
        es,
        generate_actions(),
        chunk_size=500,          # docs per bulk request
        max_chunk_bytes=10 * 1024 * 1024,  # 10MB per request
        raise_on_error=False,    # collect errors instead of raising
        max_retries=3,           # retries apply to rejected (429) items
        initial_backoff=2,
        request_timeout=60
    )

    # Handle errors: retry, DLQ, alert
    for error in errors:
        logger.error("Bulk item failed: %s", error)

    return success, len(errors)

index = "products"
large_dataset: list[dict] = []  # placeholder: load your documents here

# Performance settings for bulk indexing
es.indices.put_settings(
    index=index,
    body={
        "index.refresh_interval": "-1",       # disable refresh during bulk
        "index.number_of_replicas": 0          # disable replication during bulk
    }
)

try:
    bulk_index_documents(large_dataset, index=index)
finally:
    # Restore after bulk, even if indexing failed part-way
    es.indices.put_settings(
        index=index,
        body={
            "index.refresh_interval": "30s",
            "index.number_of_replicas": 1
        }
    )

# Only force-merge to one segment if the index will no longer be written to
es.indices.forcemerge(index=index, max_num_segments=1)
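
The `chunk_size` and `max_chunk_bytes` limits interact: a request is flushed when either one would be exceeded. The following is a simplified sketch of that batching idea, not the client library's actual implementation; `chunk_actions` is a hypothetical helper and the size estimate (JSON length plus a newline) is an approximation.

```python
import json
from typing import Iterable, Iterator


def chunk_actions(docs: Iterable[dict], chunk_size: int = 500,
                  max_chunk_bytes: int = 10 * 1024 * 1024) -> Iterator[list[dict]]:
    """Group docs into batches bounded by document count and serialized size."""
    chunk: list[dict] = []
    chunk_bytes = 0
    for doc in docs:
        # Approximate wire size: serialized JSON plus the trailing newline.
        doc_bytes = len(json.dumps(doc).encode("utf-8")) + 1
        too_many = len(chunk) >= chunk_size
        too_big = chunk_bytes + doc_bytes > max_chunk_bytes
        if chunk and (too_many or too_big):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(doc)   # an oversized single doc still goes out on its own
        chunk_bytes += doc_bytes
    if chunk:
        yield chunk


docs = [{"a": "x" * 10} for _ in range(5)]
print([len(c) for c in chunk_actions(docs, chunk_size=2)])  # [2, 2, 1]
```

Large documents hit the byte limit long before the count limit, which is why tuning only `chunk_size` can still produce oversized requests.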

Search-as-You-Type

// search_as_you_type field type (built-in for completion)
PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "search_as_you_type",
        "analyzer": "standard"
      }
    }
  }
}

GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "elast sear",
      "type": "bool_prefix",
      "fields": ["name", "name._2gram", "name._3gram"]
    }
  }
}

// edge_ngram for custom completion
PUT /suggest_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      },
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      }
    }
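  },
  "mappings": {
    "properties": {
      // "name" is an illustrative field: index with edge n-grams, search without them
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}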
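
To see what the `autocomplete_filter` above emits, here is a small sketch of edge n-gram generation with the same `min_gram` and `max_gram` values. Whitespace splitting stands in for the standard tokenizer, and both function names are hypothetical.

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 20) -> list[str]:
    """Prefixes of `token` with lengths from min_gram up to max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def autocomplete_tokens(text: str) -> list[str]:
    """Lowercase, split on whitespace, then expand each token into edge n-grams."""
    return [gram for tok in text.lower().split() for gram in edge_ngrams(tok)]


print(edge_ngrams("search"))  # ['se', 'sea', 'sear', 'searc', 'search']
print(edge_ngrams("a"))       # []
```

Every prefix becomes an indexed term, so a partially typed word such as "sear" is an exact term hit at search time. That is also why the search side uses a separate analyzer without the n-gram filter: the query text should match stored prefixes, not be expanded into its own.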

Index Aliases for Zero-Downtime Reindex

// Current live alias points to old index
// Step 1: Create new index with updated mapping
PUT /articles_v2
{
  "mappings": { /* updated mapping */ },
  "settings": { "number_of_shards": 4 }
}

// Step 2: Reindex data
POST /_reindex?wait_for_completion=false
{
  "source": { "index": "articles_v1" },
  "dest":   { "index": "articles_v2" }
}

// Monitor reindex progress
GET /_tasks?detailed&actions=*reindex

// Step 3: Atomic alias swap (zero downtime)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "articles_v1", "alias": "articles" } },
    { "add":    { "index": "articles_v2", "alias": "articles" } }
  ]
}

// Applications always query the alias, never the versioned index
GET /articles/_search   // always works regardless of version
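
When the swap is driven from application code, building the actions body in one place helps keep the remove and add in a single atomic request. A minimal sketch, with a hypothetical helper name:

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases request body that moves `alias` in one atomic call."""
    if old_index == new_index:
        raise ValueError("old and new index must differ")
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_actions("articles", "articles_v1", "articles_v2")
print(body)
```

The resulting dict matches the `POST /_aliases` body shown above and could be handed to the client's alias-update call.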

Anti-Patterns

// ❌ Dynamic mapping in production (unexpected field types, mapping explosion)
PUT /events { "mappings": { "dynamic": true } }
// ✅ Always use dynamic: "strict" and define all fields explicitly

// ❌ Enabling fielddata on text fields (out-of-memory risk)
PUT /articles/_mapping
{ "properties": { "title": { "type": "text", "fielddata": true } } }
// ✅ Use .keyword subfield for aggregations on text fields

// ❌ Leading-wildcard queries on large indexes (must scan every term, very slow)
{ "query": { "wildcard": { "name": "*phone*" } } }
// ✅ Use match with n-gram analyzer or search_as_you_type

// ❌ Not using filter context for non-scoring conditions
{ "query": { "match": { "status": "published" } } }  // scored, not cached
// ✅
{ "query": { "bool": { "filter": [{ "term": { "status": "published" } }] } } }

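// ❌ Deep pagination (from + size > 10,000 is rejected by default via index.max_result_window, and costly even below that)
GET /articles/_search { "from": 50000, "size": 100 }
// ✅ Use search_after with a deterministic sort (a unique keyword field as tiebreaker)
GET /articles/_search {
  "size": 100,
  "sort": [{ "published_at": "desc" }, { "id": "asc" }],
  // sort values of the last hit from the previous page
  "search_after": [1705312800000, "doc_id_xyz"]
}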
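
The search_after pattern above turns into a simple loop: take the `sort` values of the last hit and pass them as `search_after` on the next request. This sketch assumes a `search` callable that takes a request body and returns the response as a dict-like object, which keeps the loop testable without a cluster.

```python
import copy
from typing import Callable, Iterator


def iter_hits(search: Callable[[dict], dict], body: dict) -> Iterator[dict]:
    """Yield all hits page by page using search_after instead of from/size."""
    if "sort" not in body:
        raise ValueError("search_after requires an explicit, deterministic sort")
    request = copy.deepcopy(body)        # never mutate the caller's body
    request.pop("from", None)            # from and search_after do not mix
    request.pop("search_after", None)    # the first page starts at the top
    while True:
        hits = search(request)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume right after the last hit of this page.
        request["search_after"] = hits[-1]["sort"]
```

Each page costs roughly the same regardless of depth, which is the practical advantage over a growing `from` offset.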

Quick Reference

Field Type:
  ID, status, email, enum    → keyword
  Full text / body           → text (+ keyword subfield)
  Date                       → date
  Numeric                    → integer/long/float
  Object array (related)     → nested
  Embedded object            → object (default)
  Vectors                    → dense_vector

Query Context (scored):     must, should
Filter Context (cached):    filter, must_not

Aggregation Types:
  Terms / cardinality        → keyword field statistics
  Date histogram             → time-series
  Composite                  → paginated multi-dimension
  Percentiles / stats        → distributions

Performance Rules:
  1. Filter context > query context for yes/no conditions
  2. Keep shards in the 20–50 GB range
  3. Disable refresh + replicas during bulk load
  4. Use search_after, not from+size for deep pagination
  5. Never enable fielddata on text fields in production

ILM Phases: hot (active writes) → warm (read-only, shrink) → cold (archive) → delete

Skill Information

Source: MoltbotDen
Category: Data & Analytics