
mongodb-expert

Deep expertise in MongoDB document modeling, aggregation pipelines, indexing strategy, sharding, Atlas Search, change streams, and performance tuning. Trigger phrases: when working with MongoDB, aggregation pipeline design, MongoDB index strategy,

Installation

npx clawhub@latest install mongodb-expert

View the full skill documentation and source below.

Documentation

MongoDB Expert

MongoDB's document model is both its greatest strength and the source of most performance problems. The difference between a fast MongoDB application and a slow one is almost always document design and index strategy — not hardware or configuration. Understanding when to embed vs reference, how aggregation pipelines execute, and how to read explain() output unlocks order-of-magnitude improvements.

Core Mental Model

MongoDB optimizes for read patterns at write time — you design documents around the queries you will run, not around normalization. Every collection scan is a red flag; every query should hit an index. The aggregation pipeline is a composable transformation: stages execute in sequence, each passing its output as the next stage's input. Index field order in compound indexes follows the ESR rule: Equality fields first, Sort fields second, Range fields last.
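The ESR rule can be made concrete with a small hypothetical helper (not a driver API): given query fields classified by role, it emits the compound-index key document in ESR order. JavaScript objects preserve string-key insertion order, so the result could be passed straight to `createIndex()`.

```javascript
// Hypothetical helper illustrating the ESR rule: equality fields first,
// then sort fields, then range fields. Not part of any MongoDB driver.
function esrIndexKeys({ equality = [], sort = [], range = [] }) {
  const keys = {};
  // Insertion order of string keys is preserved, so the emitted object
  // encodes the compound-index field order.
  for (const field of [...equality, ...sort, ...range]) {
    keys[field] = 1;
  }
  return keys;
}

// status = "active" (E), sort by created_at (S), amount range (R)
const keys = esrIndexKeys({
  equality: ["status"],
  sort: ["created_at"],
  range: ["amount"],
});
// → { status: 1, created_at: 1, amount: 1 }
```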


Document Design: Embedding vs Referencing

The Core Questions

  • How is the data accessed? If you always load the parent with the child, embed. If you often load children without the parent, reference.
  • What is the cardinality? One-to-few (< ~100 items): embed. One-to-many: reference. Many-to-many: reference with arrays.
  • Does the child change independently? If so, embedding creates update complexity.
  • Is the document at risk of exceeding 16MB? Unbounded arrays must be referenced.
// ✅ EMBED: Blog post with comments (one-to-few, loaded together)
{
  _id: ObjectId("..."),
  title: "Understanding MVCC",
  body: "...",
  author_id: ObjectId("..."),
  comments: [            // embedded — always loaded with post
    { author: "Alice", text: "Great article", created_at: ISODate("...") },
    { author: "Bob",   text: "Very helpful",  created_at: ISODate("...") }
  ]
}

// ✅ REFERENCE: Orders with line items (many-to-many, large sets)
// orders collection
{
  _id: ObjectId("..."),
  customer_id: ObjectId("..."),   // reference
  line_item_ids: [ObjectId("..."), ObjectId("...")],  // reference array
  total: 149.99,
  status: "shipped"
}

// ✅ HYBRID: User with recent activity (bounded embed + overflow reference)
{
  _id: ObjectId("..."),
  username: "alice",
  recent_activity: [   // last 10 only — bounded
    { type: "login", ts: ISODate("...") }
  ]
  // overflow goes to separate activity_log collection
}
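The bounded embed above is typically maintained with a single atomic update using `$push` with `$each` and `$slice: -10`, which appends and keeps only the last N elements. A plain-JS sketch of those slice semantics, for illustration only (the real work happens server-side):

```javascript
// Plain-JS mirror of MongoDB's $push with { $each: items, $slice: -limit }:
// append the new items, then keep only the `limit` most recent entries.
function pushBounded(arr, items, limit) {
  const next = [...arr, ...items];
  return next.slice(Math.max(0, next.length - limit));
}

let recent = [{ type: "login" }, { type: "view" }];
recent = pushBounded(recent, [{ type: "purchase" }], 2);
// keeps only the 2 most recent entries
```

The corresponding server-side update is a single statement, e.g. `db.users.updateOne({ _id: userId }, { $push: { recent_activity: { $each: [event], $slice: -10 } } })`.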

Aggregation Pipeline

Stages execute in order. Order matters for performance — filter early with $match, always.

Core Stages

    // Full analytics pipeline example: top customers by revenue in last 30 days
    db.orders.aggregate([
      // Stage 1: Filter first — uses indexes, reduces docs for all downstream stages
      {
        $match: {
          created_at: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
          status: { $in: ["completed", "shipped"] }
        }
      },
    
      // Stage 2: Join with customers collection (like SQL JOIN)
      {
        $lookup: {
          from: "customers",
          localField: "customer_id",
          foreignField: "_id",
          as: "customer",
      // combining localField/foreignField with pipeline requires MongoDB 5.0+
          pipeline: [
            { $project: { name: 1, email: 1, tier: 1 } }
          ]
        }
      },
      { $unwind: "$customer" },   // flatten the array from $lookup
    
      // Stage 3: Group by customer
      {
        $group: {
          _id: "$customer_id",
          customer_name:  { $first: "$customer.name" },
          customer_email: { $first: "$customer.email" },
          total_revenue:  { $sum: "$total" },
          order_count:    { $sum: 1 },
          avg_order:      { $avg: "$total" },
          last_order:     { $max: "$created_at" }
        }
      },
    
      // Stage 4: Sort and limit
      { $sort: { total_revenue: -1 } },
      { $limit: 100 },
    
      // Stage 5: Shape the output
      {
        $project: {
          _id: 0,
          customer_id:    "$_id",
          name:           "$customer_name",
          email:          "$customer_email",
          revenue:        { $round: ["$total_revenue", 2] },
          orders:         "$order_count",
          avg_order_value: { $round: ["$avg_order", 2] },
          last_active:    "$last_order"
        }
      }
    ]);
    
    // $facet: multiple aggregations in one pass (for faceted search UI)
    db.products.aggregate([
      { $match: { category: "electronics", in_stock: true } },
      {
        $facet: {
          by_brand: [
            { $group: { _id: "$brand", count: { $sum: 1 } } },
            { $sort: { count: -1 } },
            { $limit: 10 }
          ],
          price_ranges: [
            {
              $bucket: {
                groupBy: "$price",
                boundaries: [0, 50, 100, 250, 500, 1000, Infinity],
                default: "Other",
                output: { count: { $sum: 1 }, avg_price: { $avg: "$price" } }
              }
            }
          ],
          total_count: [{ $count: "n" }]
        }
      }
    ]);

Index Strategy

ESR Rule for Compound Indexes

Equality → Sort → Range

    // Query: find active users in a tier, sorted by join date, for a date range
    db.users.find({
      status: "active",          // E: equality
      joined_at: {               // R: range
        $gte: ISODate("2024-01-01"),
        $lt:  ISODate("2025-01-01")
      }
    }).sort({ joined_at: 1 });  // S: sort
    
    // Correct index order: E, S, R
    db.users.createIndex(
      { status: 1, joined_at: 1 },  // equality, then sort/range on same field
      { name: "idx_status_joined" }
    );
    
    // Partial index: index only active users (smaller, faster)
    db.users.createIndex(
      { joined_at: 1 },
      {
        partialFilterExpression: { status: "active" },
        name: "idx_active_users_joined"
      }
    );
    
    // Text index for full-text search (use Atlas Search for production)
    db.articles.createIndex({ title: "text", body: "text" });
    db.articles.find({ $text: { $search: "mongodb performance" } },
                     { score: { $meta: "textScore" } })
               .sort({ score: { $meta: "textScore" } });
    
    // 2dsphere for geospatial
    db.locations.createIndex({ coordinates: "2dsphere" });
    db.locations.find({
      coordinates: {
        $near: {
          $geometry: { type: "Point", coordinates: [-73.9857, 40.7484] },
          $maxDistance: 5000  // meters
        }
      }
    });
    
    // Wildcard index for dynamic fields (JSONB equivalent)
    db.events.createIndex({ "metadata.$**": 1 });

Reading explain() Output

    // Always use executionStats for performance analysis
    const result = db.orders.explain("executionStats").find({
      customer_id: ObjectId("..."),
      status: "pending"
    });
    
    // Key fields to check:
    // executionStats.totalDocsExamined  → should be close to nReturned
    // executionStats.totalKeysExamined  → index scans
    // executionStats.executionTimeMillis → total query time
    // winningPlan.stage:
    //   COLLSCAN = no index (bad unless expected)
    //   IXSCAN   = index used (good)
    //   FETCH    = fetching docs from heap after index scan
    //   PROJECTION_SIMPLE = efficient projection
    // inputStage.indexName → which index was chosen
    
    // Ratio to check: docsExamined / nReturned
    // If > 10: selectivity problem — add a better index or compound index
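The docsExamined / nReturned check can be automated. A sketch of a hypothetical lint helper that inspects the executionStats document returned by explain() (the field names match the explain output; the threshold of 10 follows the rule of thumb above):

```javascript
// Hypothetical lint for explain("executionStats") output.
// Flags poor index selectivity using the docsExamined / nReturned ratio.
function lintExecutionStats(stats) {
  const issues = [];
  const { nReturned, totalDocsExamined } = stats;
  if (nReturned > 0 && totalDocsExamined / nReturned > 10) {
    issues.push("poor selectivity: docsExamined >> nReturned");
  }
  if (totalDocsExamined > 0 && nReturned === 0) {
    issues.push("examined docs but returned none: check predicates and indexes");
  }
  return issues;
}

lintExecutionStats({ nReturned: 5, totalDocsExamined: 5000 });
// → ["poor selectivity: docsExamined >> nReturned"]
```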

Sharding

Shard Key Selection

Bad shard key = hot shards, uneven distribution, poor query routing.
Good shard key properties:
  ✅ High cardinality (many distinct values)
  ✅ Even distribution (no hot values)
  ✅ Query isolation (most queries include the shard key)
  ✅ Monotonic growth is OK with hashed sharding

Bad shard key patterns:
  ❌ Low cardinality (status, boolean, small enum)
  ❌ Monotonically increasing without hashing (ObjectId, timestamps → hot last chunk)
  ❌ Rarely included in queries (causes scatter-gather across all shards)

    // Hash sharding: even distribution, but loses range query locality
    sh.shardCollection("mydb.users", { _id: "hashed" });
    
    // Range sharding: preserves locality for range queries
    sh.shardCollection("mydb.events", { tenant_id: 1, created_at: 1 });
    // Good for: WHERE tenant_id = X (query routed to one shard)
    
    // Compound shard key: tenant isolation + distribution
    sh.shardCollection("mydb.messages", { tenant_id: 1, _id: "hashed" });
    // tenant_id prefix routes tenant queries to subset of shards
    // hashed _id prevents hot spots within tenant

Change Streams

    // Watch a collection for changes (event-driven triggers)
    const changeStream = db.orders.watch([
      // Filter to specific operations
      {
        $match: {
          operationType: { $in: ["insert", "update"] },
          "fullDocument.status": "payment_complete"
        }
      }
    ], {
      fullDocument: "updateLookup"  // include the full doc after update
    });
    
    // Async iterator pattern (Node.js)
    for await (const change of changeStream) {
      const { operationType, fullDocument, updateDescription } = change;
    
      if (operationType === "insert") {
        await processNewOrder(fullDocument);
      } else if (operationType === "update") {
        const updatedFields = updateDescription.updatedFields;
        await handleOrderUpdate(fullDocument, updatedFields);
      }
    
      // Resume token for fault tolerance
      await saveResumeToken(change._id);  // persist this
    }
    
    // Resume after crash using saved token
    const savedToken = await loadResumeToken();
    const stream = db.orders.watch([], {
      resumeAfter: savedToken,
      fullDocument: "updateLookup"
    });

Transactions

    // Transactions in MongoDB (4.0+ replica sets, 4.2+ sharded)
    // Use sparingly — they have 3-5x performance overhead vs single-doc ops
    // Best for: cross-collection writes that must be atomic
    
    const session = client.startSession();
    
    try {
      await session.withTransaction(async () => {
        const orders = client.db("shop").collection("orders");
        const inventory = client.db("shop").collection("inventory");
    
        // Both writes succeed or both roll back
        await orders.insertOne(
          { customer_id: userId, items: cart, total: 99.99 },
          { session }
        );
    
        await inventory.updateMany(
          { _id: { $in: cart.map(i => i.product_id) } },
          { $inc: { stock: -1 } },
          { session }
        );
      }, {
        readConcern:  { level: "snapshot" },
        writeConcern: { w: "majority" }
      });
    } finally {
      await session.endSession();
    }
    
    // When NOT to use transactions:
    // - Single document updates (already atomic)
    // - Read-heavy operations
    // - High-throughput inserts (kills write throughput)
    // - When document redesign (embedding) can eliminate the need

Connection Pool Tuning

    // Node.js MongoDB driver connection options
    const client = new MongoClient(uri, {
      maxPoolSize:         50,    // default 100 — limit to avoid overwhelming MongoDB
      minPoolSize:         5,     // keep warm connections
      maxIdleTimeMS:       30000, // close idle connections after 30s
      serverSelectionTimeoutMS: 5000,  // fail fast on topology changes
      socketTimeoutMS:     45000,
      connectTimeoutMS:    10000,
      heartbeatFrequencyMS: 10000,
      retryWrites:         true,  // auto-retry transient write errors
      retryReads:          true,
      readPreference:      "primaryPreferred",  // or "secondaryPreferred" for reads
      writeConcern: { w: "majority", wtimeoutMS: 5000 }
    });

Atlas Search

    // Atlas Search: full-text with relevance scoring, superior to text indexes
    db.articles.aggregate([
      {
        $search: {
          index: "articles_search",
          compound: {
            must: [{
              text: {
                query: "machine learning transformers",
                path: ["title", "body"],
                fuzzy: { maxEdits: 1 }  // typo tolerance
              }
            }],
            should: [{
              text: {
                query: "machine learning transformers",
                path: "title",
                score: { boost: { value: 3 } }  // title matches rank higher
              }
            }],
            filter: [{
              range: {
                path: "published_at",
                gte: new Date("2024-01-01")
              }
            }]
          }
        }
      },
      { $limit: 20 },
      {
        $project: {
          title: 1,
          summary: 1,
          score: { $meta: "searchScore" }
        }
      }
    ]);

Anti-Patterns

    // ❌ Unbounded array growth (document exceeds 16MB, index grows unbounded)
    // Never do this for high-volume collections:
    { user_id: 1, events: [...thousands of events...] }
    // ✅ Separate collection with reference
    
    // ❌ Using _id as a shard key with range sharding (hot last chunk)
    sh.shardCollection("db.col", { _id: 1 });  // monotonically increasing → hot shard
    // ✅ Hash the _id
    sh.shardCollection("db.col", { _id: "hashed" });
    
    // ❌ $where clause (executes JavaScript, can't use index, security risk)
    db.users.find({ $where: "this.age > 30" });
    // ✅ Use standard query operators
    db.users.find({ age: { $gt: 30 } });
    
    // ❌ Large $in arrays (> ~1000 values) — query planning overhead
    db.products.find({ _id: { $in: [...10000 ids...] } });
    // ✅ Paginate or use aggregation $lookup
    
    // ❌ Missing $match at pipeline start
    db.orders.aggregate([
      { $group: { _id: "$customer_id", total: { $sum: "$amount" } } },
      { $match: { "_id": someId } }
    ]);
    // ✅ Filter before grouping
    db.orders.aggregate([
      { $match: { customer_id: someId } },  // index hit, far fewer docs to group
      { $group: { _id: "$customer_id", total: { $sum: "$amount" } } }
    ]);
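The "paginate" fix for oversized $in lists can be sketched as a plain-JS chunking helper. The batch size of 1000 mirrors the rule of thumb above; it is not a hard MongoDB limit:

```javascript
// Hypothetical helper: split a large id list into planner-friendly batches.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Usage sketch (assumes a connected `db` handle and an `allIds` array):
// for (const batch of chunk(allIds, 1000)) {
//   await db.collection("products").find({ _id: { $in: batch } }).toArray();
// }
```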

Quick Reference

Document Design Decision Tree:
  Always loaded together?      → embed
  One-to-few (< ~100)?         → embed
  One-to-many?                 → reference
  Many-to-many?                → reference arrays
  Unbounded growth possible?   → reference (never embed)
  Child changes independently? → reference

Compound Index (ESR Rule):
  1. Equality fields first    (status = "active")
  2. Sort fields second       (.sort({ created_at: 1 }))
  3. Range fields last        ($gte, $lte, $in)

explain() Red Flags:
  COLLSCAN on large collection → missing index
  docsExamined >> nReturned    → poor index selectivity
  FETCH examining many docs    → project only indexed fields or add a covering index
    
Shard Key Selection:
  High cardinality?             ✅ required
  Not monotonically increasing? ✅ or use hashed
  Included in most queries?     ✅ required
  Avoid: status, boolean, date alone

Change Streams:
  Always persist resume token      → fault-tolerant resume
  Use fullDocument: "updateLookup" → get full doc on update
  Filter at stream level           → reduce application load