Tuning

Links#

https://www.mongodb.com/docs/manual/administration/analyzing-mongodb-performance/
https://www.mongodb.com/docs/manual/applications/indexes/
https://www.mongodb.com/docs/manual/core/query-optimization/
https://www.mongodb.com/docs/manual/reference/explain-results/
https://www.mongodb.com/docs/manual/tutorial/manage-the-database-profiler/
https://www.mongodb.com/docs/manual/core/wiredtiger/
https://www.mongodb.com/docs/manual/troubleshooting/connection-storms/
https://www.mongodb.com/docs/manual/replication/

1. Tuning Principles#

调优顺序:
    1. 先确认业务症状
        latency high
        timeout
        CPU high
        disk IO high
        replication lag
        connection exhausted

    2. 再找证据
        slow query log
        profiler
        explain("executionStats")
        serverStatus
        currentOp
        mongostat / mongotop
        exporter metrics

    3. 再改最小范围
        index
        query shape
        schema
        connection pool
        write concern
        hardware / storage
        shard key

    4. 最后压测验证
        p95 / p99 latency
        docs examined / returned
        keys examined / returned
        CPU / disk / cache
        replication lag

do not tune by guess:
    不要看到慢就先加机器
    不要看到慢就盲目加 index
    不要看到 connection 多就只调 maxIncomingConnections
    不要把所有读都切到 secondary
    不要在没有 explain 的情况下改 query

2. Common Tuning Methods#

slow query#

db.setProfilingLevel(1, { slowms: 100 })

db.system.profile.find({
  millis: { $gte: 100 }
}).sort({ ts: -1 }).limit(20).pretty()

watch:
    ns
    op
    millis
    command.filter
    command.sort
    keysExamined
    docsExamined
    nreturned
    planSummary

explain#

db.orders.find({
  user_id: "u-1001",
  status: "PAID"
}).sort({ created_at: -1 }).limit(50).explain("executionStats")

good:
    winningPlan has IXSCAN
    totalDocsExamined close to nReturned
    totalKeysExamined reasonable
    executionTimeMillis within SLO

bad:
    COLLSCAN
    SORT stage with large input
    docsExamined much larger than nReturned
    many rejected plans

index tuning#

index rules:
    build index for query shape, not for every field
    compound index order usually follows ESR:
        Equality
        Sort
        Range
    remove unused indexes
    avoid many overlapping indexes
    partial index for sparse business states
    TTL index for expiry data

db.orders.createIndex(
  { user_id: 1, status: 1, created_at: -1 },
  { name: "idx_user_status_created_at" }
)

db.orders.createIndex(
  { created_at: 1 },
  {
    name: "idx_pending_created_at",
    partialFilterExpression: { status: "PENDING" }
  }
)

db.orders.aggregate([{ $indexStats: {} }])

query tuning#

query rules:
    filter early
    project only needed fields
    limit result size
    avoid large skip
    use cursor-based pagination
    avoid unbounded $in
    avoid regex without index strategy
    avoid sorting without supporting index

db.orders.find(
  {
    user_id: "u-1001",
    created_at: { $lt: ISODate("2026-05-29T10:00:00Z") }
  },
  {
    _id: 1,
    order_id: 1,
    status: 1,
    amount: 1,
    created_at: 1
  }
).sort({ created_at: -1, _id: -1 }).limit(50)

schema tuning#

schema rules:
    embed small data read together
    reference data that grows without bound
    avoid unbounded arrays
    avoid huge document close to 16 MB
    split hot fields and cold fields when update/read pattern differs
    keep frequently updated counters isolated if they cause document contention

aggregation tuning#

aggregation rules:
    $match early
    $project early to reduce payload
    use index before $sort
    avoid $lookup on huge unbounded sets
    avoid memory-heavy $group on online path
    move analytics to offline pipeline when needed

db.orders.aggregate([
  {
    $match: {
      tenant_id: "t-1001",
      created_at: { $gte: ISODate("2026-05-01T00:00:00Z") }
    }
  },
  {
    $project: {
      _id: 0,
      status: 1,
      amount: 1
    }
  },
  {
    $group: {
      _id: "$status",
      total: { $sum: "$amount" },
      count: { $sum: 1 }
    }
  }
])

connection pool#

connection tuning:
    use one MongoClient per process, reuse it
    do not create client per request
    configure maxPoolSize based on app concurrency
    configure minPoolSize for traffic spikes if connection storm appears
    monitor connections.current and rejected connections
    check app timeout before increasing database limits

example URI options:
    maxPoolSize=100
    minPoolSize=10
    maxIdleTimeMS=60000
    serverSelectionTimeoutMS=5000

WiredTiger cache#

WiredTiger:
    uses internal cache + filesystem cache
    default cache is usually automatically calculated
    container / mixed workload should set explicit cacheSizeGB

watch:
    cache dirty bytes
    bytes read into cache
    eviction activity
    disk read latency
    index size vs memory

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 4

write tuning#

write path:
    batch writes where possible
    avoid too many indexes on write-heavy collection
    use unordered bulkWrite when order is not required
    choose writeConcern per business criticality
    avoid multi-document transaction on hot path if schema can avoid it

db.orders.bulkWrite(
  [
    {
      updateOne: {
        filter: { order_id: "o-1001" },
        update: { $set: { status: "PAID" } }
      }
    },
    {
      updateOne: {
        filter: { order_id: "o-1002" },
        update: { $set: { status: "CANCELLED" } }
      }
    }
  ],
  { ordered: false }
)

replication tuning#

replication:
    monitor replication lag
    keep oplog window larger than maintenance window
    avoid slow secondary due to weak disk
    avoid heavy reads on secondary that delay replication
    use majority writeConcern for critical data

rs.printReplicationInfo()
rs.printSecondaryReplicationInfo()

sharding tuning#

sharding:
    choose shard key from real workload
    avoid low-cardinality shard key
    avoid monotonically increasing hot shard
    prefer targeted query over scatter-gather
    monitor chunk distribution and jumbo chunks

3. Classic Tuning Cases#

case 1: COLLSCAN#

problem:
    API p95 latency high
    explain shows COLLSCAN
    docsExamined much larger than nReturned

fix:
    create index matching filter and sort

db.orders.createIndex(
  { user_id: 1, created_at: -1 },
  { name: "idx_user_created_at" }
)

verify:
    winningPlan uses IXSCAN
    docsExamined close to nReturned

case 2: sort without index#

problem:
    query filter is indexed but sort is slow
    explain has SORT stage

fix:
    compound index includes equality fields first, then sort field

db.orders.createIndex(
  { tenant_id: 1, status: 1, created_at: -1 },
  { name: "idx_tenant_status_created_at" }
)

case 3: large skip pagination#

problem:
    page 1 fast, page 1000 slow
    query uses skip(50000)

fix:
    use cursor-based pagination

db.orders.find({
  user_id: "u-1001",
  created_at: { $lt: ISODate("2026-05-29T10:00:00Z") }
}).sort({ created_at: -1 }).limit(50)

case 4: too many indexes slow writes#

problem:
    insert/update latency high
    collection has many overlapping indexes
    index size grows faster than data size

fix:
    review $indexStats
    drop unused indexes
    merge overlapping compound indexes

db.orders.aggregate([{ $indexStats: {} }])

db.orders.dropIndex("idx_unused_field")

case 5: unbounded array#

problem:
    document grows continuously
    update latency increases
    document may approach 16 MB

fix:
    move growing child records to separate collection
    or use bucket pattern

bad:
    user document contains all login events

good:
    users
    user_login_events

case 6: connection storm#

problem:
    traffic spike causes connection spike
    connections.current near limit
    app creates MongoClient per request

fix:
    create one MongoClient per process
    tune maxPoolSize / minPoolSize
    set app timeout

bad:
    new MongoClient() inside every HTTP request

good:
    create MongoClient during application startup
    reuse client for all requests

case 7: hot working set larger than memory#

problem:
    CPU not high, but latency high
    disk read latency high
    WiredTiger reads into cache high

fix:
    reduce working set
    add memory
    archive cold data
    keep hot indexes in memory
    review cacheSizeGB in container

classic solution:
    split hot recent orders and cold history
    or add time-based query boundary

case 8: aggregation too heavy#

problem:
    online request runs large $group / $lookup
    memory and CPU spike

fix:
    add early $match
    project only needed fields
    precompute summary
    move report to offline job

classic solution:
    order_stats_daily collection
    update by async worker
    API reads precomputed summary

case 9: replication lag#

problem:
    secondary lag increases
    reads from secondary return stale data

fix:
    check slow secondary disk
    reduce heavy secondary reads
    check index build / migration
    increase oplog size if oplog window too short

rs.printSecondaryReplicationInfo()
rs.printReplicationInfo()

case 10: bad shard key#

problem:
    sharded cluster still has one hot shard
    many queries are scatter-gather

fix:
    choose shard key with high cardinality and query targeting
    avoid created_at only
    consider hashed or compound shard key

bad:
    shard key = { created_at: 1 }

better:
    shard key = { tenant_id: 1, order_id: "hashed" }
    exact choice depends on workload

case 11: regex query slow#

problem:
    search query uses unanchored regex
    index cannot help effectively

fix:
    use anchored prefix when possible
    normalize search field
    use text/search engine for real search

// better than /abc/ if prefix search is acceptable
db.users.find({ username: /^abc/ }).limit(20)

case 12: large document / blob#

problem:
    document read/write is slow
    only small metadata is needed on most requests

fix:
    store large file in object storage or GridFS
    keep metadata and pointer in MongoDB
    split hot metadata from cold payload

4. Tuning Checklist#

query:
    slow query enabled
    explain checked
    no COLLSCAN on hot path
    docsExamined / nReturned reasonable
    sort uses index
    projection used

index:
    compound indexes match query shapes
    unused indexes removed
    index size monitored
    partial / TTL indexes used where appropriate

schema:
    no unbounded arrays
    no huge documents
    hot and cold fields separated when needed
    transaction usage reviewed

runtime:
    connection pool reused and bounded
    WiredTiger cache reviewed
    disk latency monitored
    ulimit / file descriptors reviewed

replication:
    replication lag monitored
    oplog window enough
    secondary reads controlled

sharding:
    shard key tested
    targeted queries preferred
    chunk distribution monitored