The Shape‑First Tune‑Up: Cutting MongoDB Expenses by 79%


TL;DR

A SaaS company woke up to a silent auto‑scale from M20 → M60, adding 20 % to their cloud bill overnight. In a frantic 48‑hour sprint we:

  • flattened N + 1 waterfalls with $lookup,
  • tamed unbounded cursors with projection, limit() and TTL,
  • split 16 MB “jumbo” docs into lean metadata + GridFS blobs,
  • reordered a handful of sleepy indexes

And watched the bill fall from $15 284 to $3 210/mo (‑79 %) while p95 latency dropped from 1.9 s to 140 ms.

All on a plain replica set.


Step 1: The Day the Invoice Went Supernova

02:17 a.m. — The on‑call phone lit up like a pinball machine. Atlas had quietly hot‑swapped our trusty M20 for a maxed‑out M60. Slack filled with 🟥 BILL SHOCK alerts while Grafana’s red‑lined graphs painted a horror movie in real time.

“Finance says the new spend wipes out nine months of runway. We need a fix before stand‑up.”
— COO, 02:38

Half‑awake, the engineer cracked open the profiler. Three culprits leapt off the screen:

  • Query waterfall — every order API call triggered an extra fetch for its lines. 1 000 orders? 1 001 round‑trips.
  • Fire‑hose cursor — a click‑stream endpoint streamed 30 months of events on every page load.
  • Jumbo docs — 16 MB invoices (complete with PDFs) blew the cache to bits and back.

Atlas tried to help by throwing hardware at the fire: more RAM, more IOPS, and, of course, a bigger bill.

By breakfast, the war‑room rules were clear: cut 70 % of spend in 48 hours, zero downtime, no schema nukes. The play‑by‑play starts below.


Step 2: Three Shape Crimes & How to Fix Them

2.1 N + 1 Query Tsunami

Symptom: For each order the API fired a second query for its line items. 1 000 orders ⇒ 1 001 round‑trips.

// Old (painful)
const orders = await db.orders.find({ userId }).toArray();
for (const o of orders) {
  o.lines = await db.orderLines.find({ orderId: o._id }).toArray();
}

Hidden fees: 1 000 index walks, 1 000 TLS handshakes, 1 000 context switches.

Remedy (one aggregation):

// New (single pass)
db.orders.aggregate([
  { $match: { userId } },
  { $lookup: {
      from: 'orderLines',
      localField: '_id',
      foreignField: 'orderId',
      as: 'lines'
  } },
  { $project: { lines: 1, total: 1, ts: 1 } }
]);

Latency p95: 2 300 ms → 160 ms. Read ops: 101 → 1 (‑99 %).

2.2 Unbounded Query Fire‑Hose

Symptom: One endpoint streamed 30 months of click history in a single cursor.

// Before
const events = db.events.find({ userId }).toArray();

Fix: Cap the window and project only rendered fields.

const events = db.events.find(
  {
    userId,
    ts: { $gte: new Date(Date.now() - 30*24*3600*1000) }
  },
  { _id: 0, ts: 1, page: 1, ref: 1 }
).sort({ ts: -1 }).limit(1_000);

Then let Mongo prune for you:

// 90‑day TTL
db.events.createIndex({ ts: 1 }, { expireAfterSeconds: 90*24*3600 });

A fintech client clipped 72 % off their storage overnight using nothing but TTL.

2.3 Jumbo Document Money Pit

Anything above 256 KB already strains the WiredTiger cache; one collection stored multi‑MB invoices complete with PDFs and 1 200‑row histories.

Solution: split by access pattern—hot metadata in invoices, cold BLOBs in S3/GridFS.

graph TD
  Invoice[("invoices <2 kB")] -->|ref| Hist["history <1 kB * N"]
  Invoice -->|ref| Bin["pdf-store (S3/GridFS)"]

SSD spend dropped sharply; the cache hit ratio jumped 22 percentage points.
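
In code, the split might look like this minimal Node.js sketch. It assumes a hypothetical pdf_store GridFS bucket, an invoiceHistory collection for the row history, and an illustrative field list; adapt the names to your schema.

const { GridFSBucket } = require('mongodb');

// Store one invoice as lean metadata plus externalized blobs
async function storeInvoice(db, invoice, pdfBuffer) {
  // Cold blob: stream the PDF into GridFS instead of embedding it
  const bucket = new GridFSBucket(db, { bucketName: 'pdf_store' });
  const upload = bucket.openUploadStream(`${invoice._id}.pdf`);
  await new Promise((resolve, reject) => {
    upload.once('error', reject);
    upload.once('finish', resolve);
    upload.end(pdfBuffer);
  });

  // Hot metadata: a ~2 kB document that only references the blob
  await db.collection('invoices').insertOne({
    _id: invoice._id,
    customerId: invoice.customerId,
    total: invoice.total,
    ts: invoice.ts,
    pdfId: upload.id               // reference, not the multi-MB payload
  });

  // History rows become small documents in their own collection
  if (invoice.history && invoice.history.length) {
    await db.collection('invoiceHistory').insertMany(
      invoice.history.map(h => ({ invoiceId: invoice._id, ...h }))
    );
  }
}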


Step 3: Four Shape Sins Hiding in Plain Sight

Shape isn’t just about document size—it’s how queries, indexes and access patterns intertwine.

These four anti‑patterns lurk in most production clusters and silently drain cash.

3.1 Low‑Cardinality Leading Index Key

Symptom: The index starts with a field that has < 10 % distinct values, e.g. { type: 1, ts: -1 }. The planner must traverse huge swaths of the index before applying the selective part.

Cost: High B‑tree fan‑out, poor cache locality, extra disk seeks.

Fix: Move the selective key (userId, orgId, tenantId) first: { userId: 1, ts: -1 }. Rebuild online, then drop the old index.
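
A minimal mongosh sketch of that swap (collection and field names are illustrative):

// Build the selective-first index while traffic keeps flowing
db.events.createIndex({ userId: 1, ts: -1 });

// Once the planner is using it, retire the low-cardinality index
db.events.dropIndex({ type: 1, ts: -1 });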

3.2 Blind $regex Scan

Symptom: $regex: /foo/i on a non‑indexed field forces a full collection scan; CPU spikes, the cache churns.

Cost: Each pattern match walks every document and decodes BSON in the hot path.

Fix: Prefer anchored patterns (/^foo/) with a supporting index, or add a searchable slug field (lower(name)) and index that instead.
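
A hedged sketch of the slug approach, assuming a hypothetical products collection with a name field:

// One-off backfill of a lowercase slug (pipeline updates need MongoDB 4.2+)
db.products.updateMany({}, [{ $set: { nameSlug: { $toLower: '$name' } } }]);
db.products.createIndex({ nameSlug: 1 });

// An anchored prefix match can now walk the index instead of the collection
db.products.find({ nameSlug: /^foo/ });

The write path has to keep nameSlug in sync on inserts and updates for the index to stay useful.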

3.3 findOneAndUpdate as a Message Queue

Symptom: Workers poll with findOneAndUpdate({ status: 'new' }, { $set: { status: 'taken' } }).

Cost: Document‑level locks serialize writers; throughput collapses beyond a few thousand ops/s.

Fix: Use a purpose‑built queue (Redis Streams, Kafka, SQS) or MongoDB's native change streams to push events, keeping writes append‑only.
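
For the change‑stream route, a minimal Node.js sketch; the jobs collection and the enqueue() hand‑off are placeholders, not the article's code:

// React to newly inserted jobs instead of letting workers poll
const jobStream = db.collection('jobs').watch([
  { $match: { operationType: 'insert' } }
]);
jobStream.on('change', ev => enqueue(ev.fullDocument)); // hand off to a worker pool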

3.4 Offset Pagination Trap

Symptom: find().skip(N).limit(20) where N can reach six‑figure offsets.

Cost: Mongo still counts and discards all skipped docs—linear time. Latency balloons and billing counts each read.

Fix: Switch to range (keyset) cursors backed by a compound index on { ts: -1, _id: -1 }:

// page after the last item of the previous page
db.events.find({ ts: { $lt: lastTs } })
  .sort({ ts: -1, _id: -1 })
  .limit(20);
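
If several documents can share the same ts, a hedged variant adds _id as a tie‑breaker so pages never skip or repeat rows (lastTs and lastId come from the last item of the previous page):

db.events.find({
  $or: [
    { ts: { $lt: lastTs } },
    { ts: lastTs, _id: { $lt: lastId } }   // same timestamp, strictly older _id
  ]
})
  .sort({ ts: -1, _id: -1 })
  .limit(20);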

Master these four and you’ll reclaim RAM, lower read units, and postpone sharding by quarters.


Step 4: Cost Anatomy 101

| Metric | Before | Unit price | Cost (before) | After | Δ % |
|---|---|---|---|---|---|
| Reads (3 k/s) | 7.8 B | $0.09 / M | $702 | 2.3 B | −70 |
| Writes (150/s) | 380 M | $0.225 / M | $86 | 380 M | 0 |
| Transfer | 1.5 TB | $0.25 / GB | $375 | 300 GB | −80 |
| Storage | 2 TB | $0.24 / GB | $480 | 800 GB | −60 |
| Total | | | $1,643 | | −66 |
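
A quick sanity check of the reads row (unit prices are the table's own; the same arithmetic produces every "Cost" cell, and the four cells sum to the $1,643 before‑total):

// Reads row: monthly volume ÷ 1 M × unit price
const readsBefore = 7.8e9 / 1e6 * 0.09;   // ≈ $702/mo
const readsAfter  = 2.3e9 / 1e6 * 0.09;   // ≈ $207/mo, i.e. roughly −70 %
console.log(readsBefore, readsAfter);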

Step 5: 48‑Hour Rescue Timeline

| Hour | Action | Tool | Win |
|---|---|---|---|
| 0–2 | Enable profiler (slowms = 50) | mongo shell | Top 10 slow ops located |
| 2–6 | Replace N + 1 with $lookup | VS Code + tests | 90 % fewer reads |
| 6–10 | Add projections & limit() | API layer | RAM steady, API 4× faster |
| 10–16 | Split jumbo docs | Scripted ETL | Working set fits in RAM |
| 16–22 | Drop/re‑order weak indexes | Compass | Disk shrinks, cache hits ↑ |
| 22–30 | Create TTLs / Online Archive | Atlas UI | −60 % storage |
| 30–36 | Wire Grafana panels | Prometheus | Early warnings live |
| 36–48 | Load‑test with k6 | k6 + Atlas | p95 < 150 ms @ 2× load |
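
The hour 0–2 step in mongosh, as a sketch:

// Record only operations slower than 50 ms
db.setProfilingLevel(1, { slowms: 50 });

// After some traffic: the ten slowest operations captured so far
db.system.profile.find().sort({ millis: -1 }).limit(10);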

Step 6: Self‑Audit Checklist

  • Largest doc ÷ median > 10? → Refactor.
  • Any cursor > 1 000 docs? → Paginate.
  • TTL on every event collection? (Y/N)
  • Index cardinality < 10 %? → Drop or reorder.
  • Profiler “slow” ops > 1 %? → Optimize or cache.

Tape this to your monitor before Friday deploys.
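
Two of those checks as hedged mongosh spot‑checks (collection names are placeholders, and the average stands in for the median):

// Largest vs. average document size (MongoDB 4.4+ for $bsonSize)
db.invoices.aggregate([
  { $project: { size: { $bsonSize: '$$ROOT' } } },
  { $group: { _id: null, maxSize: { $max: '$size' }, avgSize: { $avg: '$size' } } }
]);

// Share of profiled operations slower than the 50 ms threshold
const slow = db.system.profile.countDocuments({ millis: { $gt: 50 } });
const total = db.system.profile.countDocuments({});
print(slow / total);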


Step 7: Why Shape > Indexes (Most Days)

Adding an index is like buying a faster forklift for the warehouse: it speeds up picking, but it does nothing if the aisles are cluttered with oversized boxes. In MongoDB terms the planner’s cost formula is roughly:

workUnits = ixScans + fetches + sorts + returnedDocs

Indexes trim ixScans, yet fetches and sorts can still dominate when documents are bloated, sparsely accessed, or poorly grouped.

A Tale of Two Queries

| | Skinny Doc (2 kB) | Jumbo Doc (16 MB) |
|---|---|---|
| ixScans | 1 000 | 1 000 |
| fetches | 1 000 × 2 kB = 2 MB | 1 000 × 16 MB = 16 GB |
| Net time | 80 ms | 48 s + eviction storms |

Same index, same query pattern — the only difference is shape.

The Rule of Thumb

Fix shape first, then index once.
– Every reshaped document shrinks every future fetch, cache line, and replication packet.

Three shape wins easily beat a dozen extra B‑trees.


Step 8: Live Metrics You Should Alert On (PromQL)

# Cache miss ratio (>10 % for 5 m triggers alert)
 (rate(wiredtiger_blockmanager_blocks_read[1m]) /
 (rate(wiredtiger_blockmanager_blocks_read[1m]) +
 rate(wiredtiger_blockmanager_blocks_read_from_cache[1m]))) > 0.10

# Docs scanned vs returned (>100 triggers alert)
 rate(mongodb_ssm_metrics_documents{state="scanned"}[1m]) /
 rate(mongodb_ssm_metrics_documents{state="returned"}[1m]) > 100

Step 9: Thin‑Slice Migration Script

Need to break a 1‑TB events collection into sub‑collections without downtime? Use double‑writes + backfill:

// 1) Forward writes
const cs = db.events.watch([], { fullDocument: 'updateLookup' });
cs.on('change', ev => {
  db[`${ev.fullDocument.type}s`].insertOne(ev.fullDocument);
});

// 2) Backfill history in _id order, routing each doc by its own type
let lastId = ObjectId("000000000000000000000000");
while (true) {
  const batch = db.events
    .find({ _id: { $gt: lastId } })
    .sort({ _id: 1 })
    .limit(10_000)
    .toArray();
  if (!batch.length) break;

  // Group by target collection so mixed-type batches land in the right place
  const byType = {};
  for (const doc of batch) {
    const coll = doc.type + 's';
    (byType[coll] = byType[coll] || []).push(doc);
  }
  for (const coll of Object.keys(byType)) db[coll].insertMany(byType[coll]);

  lastId = batch[batch.length - 1]._id;
}

Step 10: When Sharding Is Actually Required

Sharding is a last‑mile tactic, not a first‑line cure. It fractures data, multiplies failure modes, and complicates every migration. Exhaust vertical upgrades and shape‑based optimizations first. Reach for a shard key only when at least one of the thresholds below is sustained under real load and cannot be solved cheaper.

Hard Capacity Ceilings

| Symptom | Rule of Thumb | Why a Horizontal Split Helps |
|---|---|---|
| Working set sits above 80 % of physical RAM for 24 h+ | < 60 % is healthy; 60–80 % can be masked by a bigger box; > 80 % pages constantly | Splitting puts hot partitions on separate nodes, restoring the cache‑hit ratio |
| Primary write throughput > 15 000 ops/s after index tuning | Below 10 000 ops/s you can often survive by batching or bulk upserts | Isolating high‑velocity chunks reduces journal lag and lock contention |
| Multi‑region product needs < 70 ms p95 read latency | Speed of light sets an ~80 ms US↔EU floor | Zone sharding pins data near users without resorting to edge caches |

Soft Signals Sharding Is Approaching

  • Index builds exceed maintenance windows even with online indexing.
  • Compaction time eats into disaster‑recovery SLA.
  • A single tenant owns > 25 % of cluster volume.
  • Profiler shows > 500 ms lock spikes from long transactions.

Checklist Before You Cut

  • Reshape documents: if the largest doc is 20 × the median, refactor first.
  • Enable compression (zstd or snappy); it often buys 30 % storage headroom (see the sketch after this checklist).
  • Archive cold data via Online Archive or tiered S3 storage.
  • Rewrite hottest endpoints in Go/Rust if JSON parsing dominates CPU.
  • Run mongo‑perf; if workload fits a single replica set post‑fixes, abort the shard plan.
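
A hedged mongosh sketch of the compression item: zstd block compression on a newly created collection (existing data only benefits once it is rewritten or resynced):

db.createCollection('events_zstd', {
  storageEngine: { wiredTiger: { configString: 'block_compressor=zstd' } }
});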

Choosing a Shard Key

  • Use high‑cardinality fields (userId, tenantId, a hashed _id); purely monotonic keys such as raw timestamps or plain ObjectIds funnel every new insert into the last chunk unless hashed (see the sketch after this list).
  • Avoid low‑entropy keys (status , country ) that funnel writes to a few chunks.
  • Put the most common query predicate first to avoid scatter‑gather.
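
A minimal sketch, assuming a hypothetical app.events collection where tenantId is both high‑cardinality and the most common predicate:

sh.enableSharding('app');
// Tenant-first keeps per-tenant queries targeted; ts orders each tenant's data over time
sh.shardCollection('app.events', { tenantId: 1, ts: 1 });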

Sharding is surgery; once you cut, you live with the scar. Make sure the patient truly needs the operation.


Conclusion — Shaping Up Before the Bill Comes Due

When the M60 upgrade landed with a silent boom, it wasn’t the hardware’s fault; it was a wake-up call. This wasn’t about CPU, memory, or disk; it was about shape. Shape of the documents. Shape of the queries. Shape of the assumptions that quietly bloated over months of “just ship it” sprints.

Fixing it didn’t take a new database, a weekend migration, or an army of consultants. It took a team willing to look inward, to trade panic for profiling, and to reshape what they already had.

The results were undeniable: latency down by 92%, costs cut by nearly 80%, and a codebase now lean enough to breathe.

But here’s the real takeaway: technical debt on shape isn’t just a performance issue; it’s a financial one. And unlike indexes or caching tricks, shaping things right up front pays off every single time your query runs, every time your data replicates, every time you scale.

So before your next billing cycle spikes, ask yourself:

  • Does every endpoint need the full document?
  • Are we designing for reads, or just writing fast?
  • Are our indexes working, or just working hard?

Shape-first isn’t a technique — it’s a mindset. A habit. And the earlier you adopt it, the longer your system — and your runway — will last.


About the Author

Hayk Ghukasyan is Chief of Engineering at Hexact, where he helps build automation platforms like Hexomatic and Hexospark. He has over 20 years of experience in large-scale systems architecture, real-time databases, and optimization engineering.
