Database Design Patterns Every Developer Should Know
Early in my career, I designed a database that seemed elegant: perfectly normalized, no data duplication, beautiful foreign key relationships. Then the application launched, and queries that should have returned in milliseconds took 12 seconds because they joined 8 tables. I learned the hard way that database design isn't about theoretical purity — it's about understanding the trade-offs between competing priorities.
At BirJob, we manage a database that ingests thousands of job listings daily from 80+ sources, serves real-time search queries, and handles analytics workloads. Getting the schema right was critical — and "right" turned out to be a carefully considered mix of patterns, not adherence to any single approach.
This article covers the database design patterns that I believe every developer should understand: normalization and when to break it, denormalization strategies, sharding, partitioning, CQRS, read replicas, and the emerging patterns for modern applications. We'll focus on relational databases (PostgreSQL, MySQL) because they remain the backbone of most applications, with notes on when NoSQL alternatives make sense.
Normalization: The Foundation (and Its Limits)
Normalization is the process of organizing data to reduce redundancy and improve integrity. The normal forms (1NF through 5NF) provide increasingly strict rules for how data should be structured. Most practical applications aim for Third Normal Form (3NF), which eliminates transitive dependencies.
When Normalization Is Right
- Write-heavy applications: When data is updated frequently, normalization prevents update anomalies. If a customer's address is stored in one place (the customers table), updating it is a single operation. If it's duplicated across 50 order records, you have 50 updates.
- Data integrity is paramount: Financial systems, healthcare records, legal documents — anywhere incorrect data has serious consequences.
- Storage efficiency matters: Normalized schemas store less redundant data. At petabyte scale, this matters.
When Normalization Hurts
- Read-heavy applications: Normalized schemas require joins to reconstruct data. A query that retrieves a complete order (order details + customer info + product info + shipping address) might join 4-5 tables. At scale, this degrades performance.
- Analytics workloads: Analytical queries scan large datasets and aggregate across dimensions. Joins on billions of rows are prohibitively expensive. This is why data warehouses use denormalized star schemas.
- Microservices architectures: Cross-service joins are impossible when each service owns its own database. Denormalization (storing redundant data in each service's database) is often the only practical approach.
According to Use The Index, Luke (one of the best SQL performance resources), "Normalization is a tool, not a religion. The right level of normalization depends on your query patterns, not on theoretical correctness."
Denormalization Strategies
Denormalization is the deliberate introduction of data redundancy to improve read performance. It's not "bad design" — it's a conscious trade-off. The key is knowing where and how to denormalize.
Strategy 1: Computed Columns
Store the result of a computation alongside the source data. Example: an orders table that stores total_amount as a column, even though it could be computed from order_items. This eliminates the need for a GROUP BY and SUM on every order retrieval.
Trade-off: The computed value can become stale if order_items are modified without updating total_amount. Use triggers or application-level logic to maintain consistency.
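One way to keep such a column consistent is a trigger. Below is a minimal PostgreSQL sketch — the `orders`/`order_items` schema and all names are illustrative, not taken from a real BirJob schema, and it assumes `order_id` is never changed on UPDATE:

```sql
CREATE TABLE orders (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    total_amount numeric(12,2) NOT NULL DEFAULT 0
);

CREATE TABLE order_items (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id   bigint NOT NULL REFERENCES orders(id),
    unit_price numeric(12,2) NOT NULL,
    quantity   integer NOT NULL
);

-- Recompute the parent order's total whenever its items change.
CREATE FUNCTION refresh_order_total() RETURNS trigger AS $$
BEGIN
    UPDATE orders
    SET total_amount = COALESCE(
        (SELECT SUM(unit_price * quantity)
         FROM order_items
         WHERE order_id = COALESCE(NEW.order_id, OLD.order_id)), 0)
    WHERE id = COALESCE(NEW.order_id, OLD.order_id);
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_items_total
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION refresh_order_total();
```

The trigger adds write overhead to every item modification — exactly the read-for-write trade the strategy describes.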
Strategy 2: Materialized Views
Materialized views store the result of a complex query as a physical table. They're particularly useful for dashboard queries that aggregate across multiple tables. PostgreSQL's materialized view documentation describes the feature and its refresh strategies.
Trade-off: Materialized views need to be refreshed. Concurrent refresh (available since PostgreSQL 9.4) allows the view to remain queryable during refresh, but the data can be stale between refreshes.
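A sketch of the pattern, using hypothetical `orders` columns for a daily revenue dashboard:

```sql
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS order_count,
       sum(total_amount)             AS revenue
FROM orders
GROUP BY 1;

-- CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX daily_revenue_day_idx ON daily_revenue (day);

-- Refresh without blocking readers (PostgreSQL 9.4+).
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```

Schedule the refresh with cron or pg_cron at whatever staleness your dashboard tolerates.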
Strategy 3: JSON Columns
Store related data as a JSON blob within a row, avoiding the need for a separate table and join. PostgreSQL's jsonb type supports indexing, querying, and partial updates on JSON data.
Example: A products table with a specifications jsonb column that stores variable attributes (size, color, weight) instead of a separate product_attributes table with an EAV (Entity-Attribute-Value) pattern.
Trade-off: JSON columns bypass schema enforcement. A typo in a JSON key won't be caught by the database. Use CHECK constraints or application-level validation.
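A minimal jsonb sketch (table and key names are illustrative):

```sql
CREATE TABLE products (
    id             bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name           text NOT NULL,
    specifications jsonb NOT NULL DEFAULT '{}',
    -- Lightweight guard: specifications must be a JSON object.
    CHECK (jsonb_typeof(specifications) = 'object')
);

-- GIN index accelerates containment queries on the blob.
CREATE INDEX products_specs_idx ON products USING gin (specifications);

-- Containment query (@>) can use the GIN index.
SELECT id, name
FROM products
WHERE specifications @> '{"color": "red"}';

-- Partial update of a single key.
UPDATE products
SET specifications = jsonb_set(specifications, '{weight}', '"1.2kg"')
WHERE id = 1;
```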
Strategy 4: Prejoined Tables
Create a table that stores the result of a frequently-used join. Example: an order_summary table that combines data from orders, customers, and products into a single flat table optimized for reporting.
Trade-off: This requires maintaining two data models (the normalized source tables and the denormalized summary table). Use database triggers, Change Data Capture (CDC), or scheduled ETL jobs to keep them in sync.
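For the scheduled-ETL variant, the sync can be a single upsert statement. A sketch, assuming hypothetical `orders` and `customers` source tables; a real job would restrict the SELECT to rows changed since the last run:

```sql
CREATE TABLE order_summary (
    order_id      bigint PRIMARY KEY,
    customer_name text,
    total_amount  numeric(12,2),
    created_at    timestamptz
);

-- Batch upsert run periodically (cron, pg_cron, or an ETL tool).
INSERT INTO order_summary (order_id, customer_name, total_amount, created_at)
SELECT o.id, c.name, o.total_amount, o.created_at
FROM orders o
JOIN customers c ON c.id = o.customer_id
ON CONFLICT (order_id) DO UPDATE
SET customer_name = EXCLUDED.customer_name,
    total_amount  = EXCLUDED.total_amount;
```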
Sharding: Horizontal Scaling for Massive Data
Sharding is the practice of splitting a database across multiple servers, where each server (shard) holds a subset of the data. It's the primary strategy for scaling databases beyond the capacity of a single machine.
Sharding Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Hash-based | Hash the shard key (e.g., user_id) to determine which shard stores the data | Even distribution, simple routing | Cross-shard queries are expensive, adding shards requires resharding |
| Range-based | Split data by value ranges (e.g., users A-M on shard 1, N-Z on shard 2) | Range queries within a shard are efficient | Hotspots if data isn't uniformly distributed |
| Geographic | Data is stored in the shard closest to the user's region | Low latency for regional users, data locality compliance | Cross-region queries, complex setup |
| Tenant-based | Each tenant (customer) gets their own shard or group of shards | Natural isolation, per-tenant scaling | Uneven shard sizes if tenants vary in size |
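To make hash-based routing concrete: the application (or a routing tier) hashes the shard key and takes it modulo the shard count. The snippet below only illustrates the arithmetic using PostgreSQL's `hashtext()` — real routers compute this in the application layer, and the key format is made up:

```sql
-- Maps the key 'user:42' to one of 4 shards (0..3).
SELECT abs(hashtext('user:42')) % 4 AS shard_number;
```

This is also why the table lists resharding as a cost: changing the modulus remaps almost every key, which is what consistent-hashing schemes exist to mitigate.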
When to Shard
The honest answer: as late as possible. Sharding is operationally complex, makes cross-shard queries painful, and complicates transactions. According to Instagram Engineering, they ran on a single PostgreSQL instance until they had hundreds of millions of rows. GitHub Engineering has written about delaying sharding as long as possible through vertical scaling and query optimization.
Consider sharding only when:
- Your single database instance can't handle the write volume (even with connection pooling and write-ahead log optimization)
- Your dataset doesn't fit on a single server's storage (including indexes)
- You need geographic data locality for compliance (GDPR, data residency laws)
Before sharding, exhaust these alternatives:
- Read replicas: Offload read queries to replicas (see below)
- Connection pooling: PgBouncer or Pgpool-II reduce connection overhead
- Partitioning: Split large tables within a single database (see below)
- Caching: Redis or Memcached for frequently-accessed data
- Query optimization: Proper indexes, query rewrites, EXPLAIN ANALYZE
- Vertical scaling: Bigger machines are simpler than distributed systems
Partitioning: Sharding's Simpler Cousin
Table partitioning splits a large table into smaller physical pieces within the same database. Unlike sharding, there's no distributed system complexity — the database engine handles routing queries to the correct partition transparently.
PostgreSQL supports three partitioning strategies (documented in the official partitioning documentation):
Range Partitioning
Split by value range. Most commonly used for time-series data:
Example: An events table partitioned by month. Queries that filter by date automatically scan only the relevant partition(s), not the entire table.
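A sketch of that setup in PostgreSQL (declarative partitioning, available since version 10; the `events` columns are illustrative):

```sql
CREATE TABLE events (
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Partition pruning: only events_2025_01 is scanned.
SELECT count(*) FROM events
WHERE created_at >= '2025-01-10' AND created_at < '2025-01-20';
```

Dropping an old month is then `DROP TABLE events_2024_12;` — instant, with no VACUUM of a giant table.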
List Partitioning
Split by a discrete set of values. Useful for multi-tenant applications:
Example: An orders table partitioned by country_code. Each country's data is in its own partition.
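A minimal sketch (country codes chosen arbitrarily for illustration):

```sql
CREATE TABLE orders_by_country (
    order_id     bigint NOT NULL,
    country_code text NOT NULL,
    total_amount numeric(12,2)
) PARTITION BY LIST (country_code);

CREATE TABLE orders_az PARTITION OF orders_by_country
    FOR VALUES IN ('AZ');
CREATE TABLE orders_tr_ge PARTITION OF orders_by_country
    FOR VALUES IN ('TR', 'GE');
-- Catch-all for everything else (PostgreSQL 11+).
CREATE TABLE orders_other PARTITION OF orders_by_country DEFAULT;
```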
Hash Partitioning
Split by hash of a column value. Useful when there's no natural range or list:
Example: A sessions table hash-partitioned on user_id across 16 partitions for even distribution.
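In PostgreSQL, hash partitions are declared with a modulus and remainder. A sketch with illustrative columns:

```sql
CREATE TABLE sessions (
    user_id bigint NOT NULL,
    data    jsonb
) PARTITION BY HASH (user_id);

CREATE TABLE sessions_p0 PARTITION OF sessions
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
CREATE TABLE sessions_p1 PARTITION OF sessions
    FOR VALUES WITH (MODULUS 16, REMAINDER 1);
-- ...sessions_p2 through sessions_p15 follow the same pattern.
```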
When to Partition
- Large tables (100M+ rows): Sequential scans on 100M-row tables are slow. Partitioning enables partition pruning, where the database skips irrelevant partitions.
- Time-series data: Logs, events, metrics — data that grows continuously and is queried by time range. Old partitions can be dropped without vacuuming.
- Maintenance operations: VACUUM, REINDEX, and ANALYZE run faster on smaller partitions than on large monolithic tables.
CQRS: Separate Models for Reads and Writes
CQRS (Command Query Responsibility Segregation) uses different database schemas — or even different databases — for reads and writes. We covered the concept in our Event-Driven Architecture article; here we'll focus on the database design implications.
The Write Model
The write model is optimized for data integrity and consistency. It's typically a normalized relational schema with proper constraints, foreign keys, and transactions. The write model enforces business rules and is the source of truth.
The Read Model
The read model is optimized for query performance. It can be:
- A denormalized table in the same database (materialized view)
- A search index (Elasticsearch) for full-text search
- A cache layer (Redis) for frequently-accessed data
- A data warehouse (BigQuery, Redshift) for analytics
- A vector database (pgvector, Pinecone) for semantic search
Synchronization Patterns
| Pattern | Mechanism | Latency | Complexity |
|---|---|---|---|
| Dual Write | Application writes to both models | Synchronous | Low but risky (inconsistency on failure) |
| CDC (Change Data Capture) | Database log streaming (Debezium, AWS DMS) | Near real-time (~seconds) | Medium |
| Event Sourcing | Events published on write, consumed by read model | Near real-time (~seconds) | High |
| ETL/Batch | Scheduled jobs sync data periodically | Minutes to hours | Low |
My recommendation: Start with materialized views in the same database. Move to CDC (Debezium + Kafka) when you need near real-time synchronization across different databases. Use event sourcing only when you need the full event history, not just the current state.
Read Replicas: The Simplest Scaling Pattern
Read replicas are copies of your primary database that serve read queries. All writes go to the primary; the primary replicates changes to one or more replicas. This is the simplest and most effective way to scale read-heavy applications.
Every major cloud database supports read replicas natively: RDS Read Replicas, Cloud SQL Replicas, Azure PostgreSQL Replicas.
Implementation Considerations
Replication lag: Read replicas are eventually consistent. There's a delay (typically milliseconds to seconds) between a write on the primary and its appearance on the replica. For most queries, this is acceptable. For queries that must return freshly-written data (e.g., "show me my just-submitted order"), route those reads to the primary.
Connection routing: Your application needs to know which queries go to the primary and which go to replicas. Options include:
- Application-level routing: Mark read-only queries in your ORM or data access layer
- Proxy-based routing: Pgpool-II (PostgreSQL) or ProxySQL (MySQL) can route queries by type (SELECT → replica, everything else → primary). Note that PgBouncer is a connection pooler only — it does not inspect or route queries
- Framework support: Django, Rails, and most modern frameworks support read/write splitting natively
Number of replicas: Start with 1-2 replicas. Add more based on read load. Each replica adds to replication overhead on the primary, so more isn't always better.
Anti-Patterns to Avoid
1. The God Table
A single table with 100+ columns that stores everything about an entity. Common in legacy systems. Leads to wide rows (poor cache efficiency), nullable columns everywhere, and impossible-to-understand schemas.
Fix: Decompose into related tables. Use inheritance (PostgreSQL table inheritance) for entity subtypes.
2. Entity-Attribute-Value (EAV)
A three-column table (entity_id, attribute_name, attribute_value) used to store arbitrary key-value pairs. Seems flexible but is a query performance disaster — retrieving an entity requires pivoting multiple rows.
Fix: Use PostgreSQL's jsonb columns for semi-structured data. They provide similar flexibility with much better query performance and indexing.
3. Soft Deletes Everywhere
Adding an is_deleted boolean to every table instead of actually deleting rows. This means every query needs a WHERE is_deleted = false filter, which developers inevitably forget, leading to ghost data appearing in the application.
Fix: Use soft deletes only where audit requirements demand them. For everything else, move deleted records to an archive table and hard-delete from the active table.
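Two sketches of that fix, assuming hypothetical `users`, `orders`, and `orders_archive` tables. Where soft deletes are genuinely required, a partial index keeps the common "live rows only" queries fast; for archival, a data-modifying CTE moves and deletes in one statement:

```sql
-- Partial index: only live rows are indexed, so the index stays small.
CREATE INDEX users_active_email_idx
    ON users (email)
    WHERE is_deleted = false;

-- Archive-and-hard-delete atomically.
WITH moved AS (
    DELETE FROM orders
    WHERE created_at < now() - interval '2 years'
    RETURNING *
)
INSERT INTO orders_archive SELECT * FROM moved;
```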
4. UUID Primary Keys Without Thought
UUIDs as primary keys are popular for distributed systems, but random UUIDs (v4) cause B-tree index fragmentation and poor insert performance. Each insert goes to a random position in the index, preventing sequential write optimization.
Fix: Use UUIDv7 (time-ordered UUIDs), which maintains temporal ordering and B-tree locality. Or use BIGSERIAL for internal IDs and UUIDs for external/API-facing identifiers.
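The dual-key variant looks like this (table name illustrative; `gen_random_uuid()` generates random v4 UUIDs and is built in since PostgreSQL 13 — a native `uuidv7()` function only arrives in PostgreSQL 18, with extensions such as pg_uuidv7 providing it earlier):

```sql
CREATE TABLE accounts (
    -- Internal, insert-friendly sequential key used for joins and FKs.
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    -- External identifier exposed in URLs and API responses.
    public_id  uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,
    created_at timestamptz NOT NULL DEFAULT now()
);
```

Joins stay on the compact bigint key; the unguessable UUID is what leaves the system.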
Action Plan: Designing Your Database
Step 1: Understand Your Workload
- What is the read-to-write ratio? (80:20? 99:1?)
- What are the most common query patterns?
- How fast will the data grow? (rows/day, GB/month)
- What are the consistency requirements? (strong? eventual?)
- What are the latency requirements? (p95 under 100ms?)
Step 2: Start Normalized
Begin with a properly normalized schema (3NF). Add indexes for your known query patterns. Measure actual performance before denormalizing.
Step 3: Denormalize Strategically
When specific queries are too slow, denormalize those specific paths. Use materialized views for reporting, computed columns for frequently-calculated values, and JSON columns for variable attributes.
Step 4: Scale Reads Before Writes
Read replicas → caching → CQRS → partitioning → sharding. Follow this progression. Each step adds complexity; stop at the first step that solves your problem.
Step 5: Monitor Continuously
Use pg_stat_statements to identify slow queries. Monitor table sizes, index bloat, and replication lag. Set up alerts for query latency regressions.
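A typical "top offenders" query against pg_stat_statements (requires `shared_preload_libraries = 'pg_stat_statements'` and `CREATE EXTENSION pg_stat_statements`; the column names below are correct for PostgreSQL 13+ — older versions use `total_time`/`mean_time`):

```sql
SELECT round(total_exec_time::numeric, 1) AS total_ms,
       calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```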
Sources
- Use The Index, Luke — SQL Performance
- PostgreSQL Partitioning Documentation
- PostgreSQL Materialized Views
- AWS RDS Read Replicas
- Instagram Engineering Blog
- GitHub Engineering Blog
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
