Large Data Volume Architecture Patterns in Salesforce (2026)

Published 12 June 2026 · 13 min read · Advanced

Large data volume (LDV) architecture is where Salesforce stops being forgiving. Designs that run fine at 100,000 records quietly degrade at 10 million, then fail at 50 million: slow queries, sharing recalculation storms, record-locking errors, and reports that time out. The failures are predictable and the patterns to avoid them are well established. This guide covers the architecture-level decisions for LDV orgs: skew, selectivity, skinny tables, mashup vs replication, and archiving. Current to Summer ’26.

The query-writing tactics that complement these patterns live in SOQL Best Practices for Large Data Volumes. This guide is the architecture layer above them, and it assumes you are already familiar with governor limits.

There is no single “LDV” number

LDV is not a row count; it is a set of symptoms. They appear when row count, related-record fan-out, and sharing complexity combine: an object with 3 million records and deep sharing can hurt more than one with 30 million flat, owner-less records. Treat any object heading past a few million rows, or any query touching tens of millions, as LDV territory and design accordingly. The cost drivers to watch are always the same three: how many rows, how they fan out under parents and owners, and how complex the sharing is.

Data skew — the silent killer

Skew is uneven distribution that concentrates records, and it produces the most baffling LDV incidents because the symptom (lock errors, slow saves) is far from the cause (distribution).

Ownership skew: one user owns a very large number of records of an object (Salesforce guidance puts the line at roughly 10,000). Every sharing recalculation touching that user becomes expensive, and operations on their records serialize. The classic mitigation is to own skewed records with a user placed outside the role hierarchy (so there are no hierarchy-based sharing recalculations to cascade), or to distribute ownership across many users.

Account / lookup skew: a single parent record carries a very large number of children, again with roughly 10,000 as the usual warning line. When those children are updated in parallel (a common bulk-load or integration pattern), they all contend for a lock on the shared parent, producing UNABLE_TO_LOCK_ROW. The mitigation is to distribute children across more parents where the data model allows, or to serialize updates to the skewed parent’s children.
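One way to serialize updates to a skewed parent's children is to group the load by parent before it is submitted in parallel batches. The following is a minimal Python sketch under that assumption; the helper name `batch_by_parent`, the `AccountId` field, and the batch size of 200 are illustrative choices, not part of any Salesforce API.

```python
from collections import defaultdict


def batch_by_parent(records, parent_field="AccountId", max_batch_size=200):
    """Group child records so each parent's children stay in a single batch.

    Batches built this way can run in parallel without two of them
    updating children of the same parent. A parent with more children
    than max_batch_size gets an oversized batch of its own rather than
    being split across batches.
    """
    by_parent = defaultdict(list)
    for record in records:
        by_parent[record[parent_field]].append(record)

    batches, current = [], []
    # Largest groups first, so heavily skewed parents are isolated early.
    for _, children in sorted(by_parent.items(), key=lambda kv: -len(kv[1])):
        if current and len(current) + len(children) > max_batch_size:
            batches.append(current)
            current = []
        current.extend(children)
    if current:
        batches.append(current)
    return batches
```

Whether an oversized single-parent batch is acceptable depends on the load tool. The point of the sketch is only that no parent is shared between two batches running at the same time.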

Skew is a design property, not a tuning knob — the time to prevent it is in the data model, not after the lock errors start.
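A pre-load check for skew can be as simple as counting records per parent or owner in the load file. The sketch below assumes the commonly cited guideline of about 10,000 records per parent or owner as the point where skew starts to matter; the function name, field names, and default threshold are illustrative.

```python
from collections import Counter


def find_skewed_keys(records, key_field, threshold=10_000):
    """Return {key: count} for every parent or owner above the threshold."""
    counts = Counter(record[key_field] for record in records)
    return {key: n for key, n in counts.items() if n > threshold}
```

Running it once with `key_field="OwnerId"` and once with each lookup field (for example `"AccountId"`) covers both kinds of skew before any data is loaded.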

Selectivity and indexing — why queries scan

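The Salesforce query optimizer uses an index only when the filter is selective. For a custom index, that means the filter matches roughly under 10% of the first million rows and 5% of the rows beyond that, capped at about 333,000 rows; standard indexes get a more generous 30% and 15%, capped at 1 million rows. Miss that, and the optimizer falls back to a full table scan, the single most common cause of slow LDV queries and QUERY_TIMEOUT.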
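The threshold arithmetic is easy to get wrong by hand, so here is a small Python sketch of it. It assumes the commonly documented rules (custom index: 10% of the first million rows plus 5% of the rest, capped at 333,333; standard index: 30% plus 15%, capped at 1,000,000); the function names are illustrative and the numbers should be checked against current Salesforce documentation.

```python
def selectivity_threshold(total_rows, index_type="custom"):
    """Approximate maximum rows a filter may match and still count as selective.

    Rules are assumptions taken from commonly documented guidance:
      custom:   10% of first 1M rows + 5% of the rest, capped at 333,333
      standard: 30% of first 1M rows + 15% of the rest, capped at 1,000,000
    """
    rules = {
        "custom": (10, 5, 333_333),
        "standard": (30, 15, 1_000_000),
    }
    pct_first, pct_rest, cap = rules[index_type]
    first = min(total_rows, 1_000_000)
    rest = max(total_rows - 1_000_000, 0)
    threshold = first * pct_first // 100 + rest * pct_rest // 100
    return min(threshold, cap)


def is_selective(matching_rows, total_rows, index_type="custom"):
    """True if the filter matches fewer rows than the threshold allows."""
    return matching_rows < selectivity_threshold(total_rows, index_type)
```

For a 2-million-row object with a custom index, the threshold works out to 100,000 + 50,000 = 150,000 matching rows. On a 10-million-row object the cap of 333,333 applies, so a value held by 60% of rows (6 million) is nowhere near selective.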

Two consequences that surprise people:

An indexed field is not automatically fast. Querying an indexed Status__c for a value held by 60% of rows is non-selective; the index is useless and the table scans. Selectivity is about the value, not just the field.
Standard indexes exist on Id, Name, OwnerId, lookup and master-detail fields, and audit fields such as CreatedDate and SystemModstamp. Fields marked External ID or Unique are indexed automatically; anything else needs a custom index requested through Salesforce Support. A high-cardinality field that you filter on often is the typical candidate for a custom index.

The diagnostic tool is the Query Plan in the Developer Console — it shows the cost and whether an index is used. Any LDV query investigation starts there, not with guesswork.
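The same plan information can also be fetched programmatically through the `explain` parameter of the REST API query resource. The Python sketch below only interprets a response that has already been fetched. The sample payload is invented for illustration, and the field names used (`plans`, `leadingOperationType`, `relativeCost`) reflect the explain response shape as understood here, so verify them against the API reference for your version.

```python
def summarize_plan(explain_response):
    """Pick the cheapest plan and report whether it avoids a full table scan."""
    plans = explain_response.get("plans", [])
    if not plans:
        return None
    best = min(plans, key=lambda plan: plan["relativeCost"])
    return {
        "operation": best["leadingOperationType"],
        "relative_cost": best["relativeCost"],
        # A relative cost above 1 means the filter exceeds the selectivity threshold.
        "selective": best["relativeCost"] < 1
        and best["leadingOperationType"] != "TableScan",
    }


# Invented sample payload, shaped like an explain response (not real output).
sample = {
    "plans": [
        {"leadingOperationType": "TableScan", "relativeCost": 2.8, "fields": []},
        {"leadingOperationType": "Index", "relativeCost": 0.3, "fields": ["Status__c"]},
    ]
}
```

Calling `summarize_plan(sample)` picks the index plan because it has the lowest relative cost. A result whose cheapest plan is still a table scan is the signal to revisit the filter or request an index.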

Skinny tables — managed read acceleration

A skinny table is a Salesforce-maintained copy of an object holding a subset of frequently-read fields, kept in sync automatically and free of the standard-to-custom-field join that normal queries pay. For read-heavy LDV objects, skinny tables can substantially speed up queries and reports.

The constraints matter: they are created by Salesforce Support, not self-service; they hold only certain field types and cannot include fields from other objects; and although the platform keeps the rows in sync, changing which fields a skinny table contains means going back to Support. They are a targeted tool for a proven read-performance problem on a specific object, not a general optimization. Reach for them after selectivity and indexing are addressed, not before.

Mashup vs replication — where external data lives

When LDV data originates in another system, the architecture choice is whether to bring it in:

Mashup leaves data in the source and surfaces it on demand — external objects (OData/Salesforce Connect) or callouts. No storage cost, no sync, always current, but limited native reporting/automation and dependent on the external system’s availability and latency.
Replication copies data into Salesforce for full platform capability — native reporting, SOQL, automation, sharing — at the cost of storage, a synchronization mechanism, and staleness between syncs.

The deciding question is functional: does this data need native Salesforce reporting, automation, or sharing? If yes, replicate. If it is reference data viewed occasionally, mashup avoids paying LDV storage and sync costs for data that does not need to be local. Many orgs over-replicate by default and inherit an LDV problem they could have left in the source system.

Archiving with Big Objects

Active objects bloated with years of historical records carry an LDV penalty on every query and recalculation. Big Objects are the archive tier: they scale to billions of rows but support only a limited query model (SOQL that filters on the fields of the object's defined index, in index order, with bulk reads going through Bulk API or Batch Apex now that Async SOQL is retired) and none of the standard transactional features.

The pattern: keep active, transactable records in the standard object; move aged, rarely-queried history (closed cases past retention, old activity, audit trails) into a Big Object. The standard object stays lean and fast; the archive remains queryable when compliance or investigation needs it. The fit is precise — Big Objects are wrong for anything still transacting, and right for high-volume data whose job is to exist and occasionally be read.
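The selection step of that pattern can be sketched in a few lines of Python. The record shape (a `closed_on` date per record), the function name, and the retention window are all hypothetical; a real archive job would read from the standard object and write to the Big Object through whatever integration path the org uses.

```python
from datetime import date, timedelta


def split_for_archive(records, retention_days, today=None):
    """Split records into (keep, archive) by closed date and retention window.

    Records that are still open (closed_on is None) always stay active.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=retention_days)
    keep, archive = [], []
    for record in records:
        closed_on = record.get("closed_on")
        if closed_on is not None and closed_on < cutoff:
            archive.append(record)
        else:
            keep.append(record)
    return keep, archive
```

Keeping the rule this explicit makes it reviewable: anything still open or recently closed stays in the standard object, and only records past the retention cutoff are candidates for the archive tier.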

The LDV design checklist

When architecting or auditing an LDV object, work through:

Skew — is ownership or any parent concentrated? Design distribution before load.
Selectivity — do the hot queries filter on selective, indexed fields? Verify with Query Plan.
Indexing — are high-cardinality, frequently-filtered fields indexed (custom index via Support where needed)?
Read acceleration — would a skinny table help a proven read bottleneck?
External data — should this be mashup rather than replicated and stored?
Archiving — is aged history inflating the active object when it belongs in a Big Object?
Sharing complexity — can the sharing model be simplified, or skewed records moved outside the hierarchy?

LDV performance is won at design time. By the time the lock errors and query timeouts arrive in production, the cheap fixes are gone and you are left with data migrations and Support cases — which is exactly why the checklist belongs in the design review, not the incident retro.

Frequently asked questions

What counts as large data volume in Salesforce?

There is no single threshold, but objects above roughly a few million records, or queries against tens of millions, are where standard approaches start to degrade and LDV-specific design becomes necessary. Performance problems scale with row count, related-record fan-out, and sharing complexity, not just raw size.

What is data skew in Salesforce?

Data skew is uneven distribution that concentrates records under one parent or owner. Ownership skew (one user owning huge numbers of records) and lookup/account skew (many children under one parent) both cause sharing recalculation and record-locking problems at scale.

What makes a SOQL query selective in Salesforce?

A query is selective when its filter uses an indexed field whose value matches below the selectivity threshold: roughly 10% of the first million rows and 5% beyond for a custom index, with higher limits for standard indexes. The query optimizer uses an index only when the filter is selective enough; otherwise it falls back to a full table scan.

What is a skinny table in Salesforce?

A skinny table is a Salesforce-managed copy of an object containing a subset of frequently used fields, kept in sync automatically and free of the standard/custom field join. Salesforce Support creates them for specific LDV read-performance cases; they are not self-service.

When should I use a Big Object for archiving?

Use a Big Object when you must retain very large volumes of historical data that is queried rarely and never needs standard transactional features. Big Objects scale to billions of rows but support only a limited query model, so they fit archive and audit data, not active records.

What is the difference between data mashup and replication?

Mashup leaves data in the external system and surfaces it on demand via external objects or callouts, avoiding storage and sync cost. Replication copies external data into Salesforce for full platform functionality at the cost of storage and synchronization. Choose based on whether the data needs native reporting and automation.