Database Design Patterns Every Developer Should Know
Early in my career, I designed a database that seemed elegant: perfectly normalized, no data duplication, beautiful foreign key relationships. Then the application launched, and queries that should have returned in milliseconds took 12 seconds because they joined 8 tables. I learned the hard way that database design isn't about theoretical purity — it's about understanding the trade-offs between competing priorities.
At BirJob, we manage a database that ingests thousands of job listings daily from 80+ sources, serves real-time search queries, and handles analytics workloads. Getting the schema right was critical — and "right" turned out to be a carefully considered mix of patterns, not adherence to any single approach.
This article covers the database design patterns that I believe every developer should understand: normalization and when to break it, denormalization strategies, sharding, partitioning, CQRS, read replicas, and the emerging patterns for modern applications. We'll focus on relational databases (PostgreSQL, MySQL) because they remain the backbone of most applications, with notes on when NoSQL alternatives make sense.
Normalization: The Foundation (and Its Limits)
Normalization is the process of organizing data to reduce redundancy and improve integrity. The normal forms (1NF through 5NF) provide increasingly strict rules for how data should be structured. Most practical applications aim for Third Normal Form (3NF), which eliminates transitive dependencies.
When Normalization Is Right
- Write-heavy applications: When data is updated frequently, normalization prevents update anomalies. If a customer's address is stored in one place (the customers table), updating it is a single operation. If it's duplicated across 50 order records, you have 50 updates.
- Data integrity is paramount: Financial systems, healthcare records, legal documents — anywhere incorrect data has serious consequences.
- Storage efficiency matters: Normalized schemas store less redundant data. At petabyte scale, this matters.
When Normalization Hurts
- Read-heavy applications: Normalized schemas require joins to reconstruct data. A query that retrieves a complete order (order details + customer info + product info + shipping address) might join 4-5 tables. At scale, this degrades performance.
- Analytics workloads: Analytical queries scan large datasets and aggregate across dimensions. Joins on billions of rows are prohibitively expensive. This is why data warehouses use denormalized star schemas.
- Microservices architectures: Cross-service joins are impossible when each service owns its own database. Denormalization (storing redundant data in each service's database) is often the only practical approach.
According to Use The Index, Luke (one of the best SQL performance resources), "Normalization is a tool, not a religion. The right level of normalization depends on your query patterns, not on theoretical correctness."
Denormalization Strategies
Denormalization is the deliberate introduction of data redundancy to improve read performance. It's not "bad design" — it's a conscious trade-off. The key is knowing where and how to denormalize.
Strategy 1: Computed Columns
Store the result of a computation alongside the source data. Example: an orders table that stores total_amount as a column, even though it could be computed from order_items. This eliminates the need for a GROUP BY and SUM on every order retrieval.
Trade-off: The computed value can become stale if order_items are modified without updating total_amount. Use triggers or application-level logic to maintain consistency.
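One way to keep such a column consistent is a trigger. Below is a minimal PostgreSQL sketch — the `orders`/`order_items` schema and all names are illustrative, not taken from a real BirJob schema, and it assumes `order_id` is never changed on UPDATE:

```sql
CREATE TABLE orders (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    total_amount numeric(12,2) NOT NULL DEFAULT 0
);

CREATE TABLE order_items (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id   bigint NOT NULL REFERENCES orders(id),
    unit_price numeric(12,2) NOT NULL,
    quantity   integer NOT NULL
);

-- Recompute the parent order's total whenever its items change.
CREATE FUNCTION refresh_order_total() RETURNS trigger AS $$
BEGIN
    UPDATE orders
    SET total_amount = COALESCE(
        (SELECT SUM(unit_price * quantity)
         FROM order_items
         WHERE order_id = COALESCE(NEW.order_id, OLD.order_id)), 0)
    WHERE id = COALESCE(NEW.order_id, OLD.order_id);
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_items_total
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION refresh_order_total();
```

The trigger adds write overhead to every item modification — exactly the read-for-write trade the strategy describes.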
Strategy 2: Materialized Views
Materialized views store the result of a complex query as a physical table. They're particularly useful for dashboard queries that aggregate across multiple tables. PostgreSQL's materialized view documentation describes the feature and its refresh strategies.
Trade-off: Materialized views need to be refreshed. Concurrent refresh (available since PostgreSQL 9.4) allows the view to remain queryable during refresh, but the data can be stale between refreshes.
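A sketch of the pattern, using hypothetical `orders` columns for a daily revenue dashboard:

```sql
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS order_count,
       sum(total_amount)             AS revenue
FROM orders
GROUP BY 1;

-- CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX daily_revenue_day_idx ON daily_revenue (day);

-- Refresh without blocking readers (PostgreSQL 9.4+).
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```

Schedule the refresh with cron or pg_cron at whatever staleness your dashboard tolerates.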
Strategy 3: JSON Columns
Store related data as a JSON blob within a row, avoiding the need for a separate table and join. PostgreSQL's jsonb type supports indexing, querying, and partial updates on JSON data.
Example: A products table with a specifications jsonb column that stores variable attributes (size, color, weight) instead of a separate product_attributes table with an EAV (Entity-Attribute-Value) pattern.
Trade-off: JSON columns bypass schema enforcement. A typo in a JSON key won't be caught by the database. Use CHECK constraints or application-level validation.
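A minimal jsonb sketch (table and key names are illustrative):

```sql
CREATE TABLE products (
    id             bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name           text NOT NULL,
    specifications jsonb NOT NULL DEFAULT '{}',
    -- Lightweight guard: specifications must be a JSON object.
    CHECK (jsonb_typeof(specifications) = 'object')
);

-- GIN index accelerates containment queries on the blob.
CREATE INDEX products_specs_idx ON products USING gin (specifications);

-- Containment query (@>) can use the GIN index.
SELECT id, name
FROM products
WHERE specifications @> '{"color": "red"}';

-- Partial update of a single key.
UPDATE products
SET specifications = jsonb_set(specifications, '{weight}', '"1.2kg"')
WHERE id = 1;
```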
Strategy 4: Prejoined Tables
Create a table that stores the result of a frequently-used join. Example: an order_summary table that combines data from orders, customers, and products into a single flat table optimized for reporting.
Trade-off: This requires maintaining two data models (the normalized source tables and the denormalized summary table). Use database triggers, Change Data Capture (CDC), or scheduled ETL jobs to keep them in sync.
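For the scheduled-ETL variant, the sync can be a single upsert statement. A sketch, assuming hypothetical `orders` and `customers` source tables; a real job would restrict the SELECT to rows changed since the last run:

```sql
CREATE TABLE order_summary (
    order_id      bigint PRIMARY KEY,
    customer_name text,
    total_amount  numeric(12,2),
    created_at    timestamptz
);

-- Batch upsert run periodically (cron, pg_cron, or an ETL tool).
INSERT INTO order_summary (order_id, customer_name, total_amount, created_at)
SELECT o.id, c.name, o.total_amount, o.created_at
FROM orders o
JOIN customers c ON c.id = o.customer_id
ON CONFLICT (order_id) DO UPDATE
SET customer_name = EXCLUDED.customer_name,
    total_amount  = EXCLUDED.total_amount;
```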
Sharding: Horizontal Scaling for Massive Data
Sharding is the practice of splitting a database across multiple servers, where each server (shard) holds a subset of the data. It's the primary strategy for scaling databases beyond the capacity of a single machine.
Sharding Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Hash-based | Hash the shard key (e.g., user_id) to determine which shard stores the data | Even distribution, simple routing | Cross-shard queries are expensive, adding shards requires resharding |
| Range-based | Split data by value ranges (e.g., users A-M on shard 1, N-Z on shard 2) | Range queries within a shard are efficient | Hotspots if data isn't uniformly distributed |
| Geographic | Data is stored in the shard closest to the user's region | Low latency for regional users, data locality compliance | Cross-region queries, complex setup |
| Tenant-based | Each tenant (customer) gets their own shard or group of shards | Natural isolation, per-tenant scaling | Uneven shard sizes if tenants vary in size |
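To make hash-based routing concrete: the application (or a routing tier) hashes the shard key and takes it modulo the shard count. The snippet below only illustrates the arithmetic using PostgreSQL's `hashtext()` — real routers compute this in the application layer, and the key format is made up:

```sql
-- Maps the key 'user:42' to one of 4 shards (0..3).
SELECT abs(hashtext('user:42')) % 4 AS shard_number;
```

This is also why the table lists resharding as a cost: changing the modulus remaps almost every key, which is what consistent-hashing schemes exist to mitigate.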
When to Shard
The honest answer: as late as possible. Sharding is operationally complex, makes cross-shard queries painful, and complicates transactions. According to Instagram Engineering, they ran on a single PostgreSQL instance until they had hundreds of millions of rows. GitHub Engineering has written about delaying sharding as long as possible through vertical scaling and query optimization.
Consider sharding only when:
- Your single database instance can't handle the write volume (even with connection pooling and write-ahead log optimization)
- Your dataset doesn't fit on a single server's storage (including indexes)
- You need geographic data locality for compliance (GDPR, data residency laws)
Before sharding, exhaust these alternatives:
- Read replicas: Offload read queries to replicas (see below)
- Connection pooling: PgBouncer or Pgpool-II reduce connection overhead
- Partitioning: Split large tables within a single database (see below)
- Caching: Redis or Memcached for frequently-accessed data
- Query optimization: Proper indexes, query rewrites, EXPLAIN ANALYZE
- Vertical scaling: Bigger machines are simpler than distributed systems
Partitioning: Sharding's Simpler Cousin
Table partitioning splits a large table into smaller physical pieces within the same database. Unlike sharding, there's no distributed system complexity — the database engine handles routing queries to the correct partition transparently.
PostgreSQL supports three partitioning strategies (documented in the official partitioning documentation):
Range Partitioning
Split by value range. Most commonly used for time-series data:
Example: An events table partitioned by month. Queries that filter by date automatically scan only the relevant partition(s), not the entire table.
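A sketch of that setup in PostgreSQL (declarative partitioning, available since version 10; the `events` columns are illustrative):

```sql
CREATE TABLE events (
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Partition pruning: only events_2025_01 is scanned.
SELECT count(*) FROM events
WHERE created_at >= '2025-01-10' AND created_at < '2025-01-20';
```

Dropping an old month is then `DROP TABLE events_2024_12;` — instant, with no VACUUM of a giant table.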
List Partitioning
Split by a discrete set of values. Useful for multi-tenant applications:
Example: An orders table partitioned by country_code. Each country's data is in its own partition.
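A minimal sketch (country codes chosen arbitrarily for illustration):

```sql
CREATE TABLE orders_by_country (
    order_id     bigint NOT NULL,
    country_code text NOT NULL,
    total_amount numeric(12,2)
) PARTITION BY LIST (country_code);

CREATE TABLE orders_az PARTITION OF orders_by_country
    FOR VALUES IN ('AZ');
CREATE TABLE orders_tr_ge PARTITION OF orders_by_country
    FOR VALUES IN ('TR', 'GE');
-- Catch-all for everything else (PostgreSQL 11+).
CREATE TABLE orders_other PARTITION OF orders_by_country DEFAULT;
```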
Hash Partitioning
Split by hash of a column value. Useful when there's no natural range or list:
Example: A sessions table hash-partitioned on user_id across 16 partitions for even distribution.
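In PostgreSQL, hash partitions are declared with a modulus and remainder. A sketch with illustrative columns:

```sql
CREATE TABLE sessions (
    user_id bigint NOT NULL,
    data    jsonb
) PARTITION BY HASH (user_id);

CREATE TABLE sessions_p0 PARTITION OF sessions
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
CREATE TABLE sessions_p1 PARTITION OF sessions
    FOR VALUES WITH (MODULUS 16, REMAINDER 1);
-- ...sessions_p2 through sessions_p15 follow the same pattern.
```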
When to Partition
- Large tables (100M+ rows): Sequential scans on 100M-row tables are slow. Partitioning enables partition pruning, where the database skips irrelevant partitions.
- Time-series data: Logs, events, metrics — data that grows continuously and is queried by time range. Old partitions can be dropped without vacuuming.
- Maintenance operations: VACUUM, REINDEX, and ANALYZE run faster on smaller partitions than on large monolithic tables.
CQRS: Separate Models for Reads and Writes
CQRS (Command Query Responsibility Segregation) uses different database schemas — or even different databases — for reads and writes. We covered the concept in our Event-Driven Architecture article; here we'll focus on the database design implications.
The Write Model
The write model is optimized for data integrity and consistency. It's typically a normalized relational schema with proper constraints, foreign keys, and transactions. The write model enforces business rules and is the source of truth.
The Read Model
The read model is optimized for query performance. It can be:
- A denormalized table in the same database (materialized view)
- A search index (Elasticsearch) for full-text search
- A cache layer (Redis) for frequently-accessed data
- A data warehouse (BigQuery, Redshift) for analytics
- A vector database (pgvector, Pinecone) for semantic search
Synchronization Patterns
| Pattern | Mechanism | Latency | Complexity |
|---|---|---|---|
| Dual Write | Application writes to both models | Synchronous | Low but risky (inconsistency on failure) |
| CDC (Change Data Capture) | Database log streaming (Debezium, AWS DMS) | Near real-time (~seconds) | Medium |
| Event Sourcing | Events published on write, consumed by read model | Near real-time (~seconds) | High |
| ETL/Batch | Scheduled jobs sync data periodically | Minutes to hours | Low |
My recommendation: Start with materialized views in the same database. Move to CDC (Debezium + Kafka) when you need near real-time synchronization across different databases. Use event sourcing only when you need the full event history, not just the current state.
Read Replicas: The Simplest Scaling Pattern
Read replicas are copies of your primary database that serve read queries. All writes go to the primary; the primary replicates changes to one or more replicas. This is the simplest and most effective way to scale read-heavy applications.
Every major cloud database supports read replicas natively: RDS Read Replicas, Cloud SQL Replicas, Azure PostgreSQL Replicas.
Implementation Considerations
Replication lag: Read replicas are eventually consistent. There's a delay (typically milliseconds to seconds) between a write on the primary and its appearance on the replica. For most queries, this is acceptable. For queries that must return freshly-written data (e.g., "show me my just-submitted order"), route those reads to the primary.
Connection routing: Your application needs to know which queries go to the primary and which go to replicas. Options include:
- Application-level routing: Mark read-only queries in your ORM or data access layer
- Proxy-based routing: Pgpool-II (PostgreSQL) or ProxySQL (MySQL) can route queries by type (SELECT → replica, everything else → primary). Note that PgBouncer is a connection pooler only — it does not inspect or route queries
- Framework support: Django, Rails, and most modern frameworks support read/write splitting natively
Number of replicas: Start with 1-2 replicas. Add more based on read load. Each replica adds to replication overhead on the primary, so more isn't always better.
Anti-Patterns to Avoid
1. The God Table
A single table with 100+ columns that stores everything about an entity. Common in legacy systems. Leads to wide rows (poor cache efficiency), nullable columns everywhere, and impossible-to-understand schemas.
Fix: Decompose into related tables. Use inheritance (PostgreSQL table inheritance) for entity subtypes.
2. Entity-Attribute-Value (EAV)
A three-column table (entity_id, attribute_name, attribute_value) used to store arbitrary key-value pairs. Seems flexible but is a query performance disaster — retrieving an entity requires pivoting multiple rows.
Fix: Use PostgreSQL's jsonb columns for semi-structured data. They provide similar flexibility with much better query performance and indexing.
3. Soft Deletes Everywhere
Adding an is_deleted boolean to every table instead of actually deleting rows. This means every query needs a WHERE is_deleted = false filter, which developers inevitably forget, leading to ghost data appearing in the application.
Fix: Use soft deletes only where audit requirements demand them. For everything else, move deleted records to an archive table and hard-delete from the active table.
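Two sketches of that fix, assuming hypothetical `users`, `orders`, and `orders_archive` tables. Where soft deletes are genuinely required, a partial index keeps the common "live rows only" queries fast; for archival, a data-modifying CTE moves and deletes in one statement:

```sql
-- Partial index: only live rows are indexed, so the index stays small.
CREATE INDEX users_active_email_idx
    ON users (email)
    WHERE is_deleted = false;

-- Archive-and-hard-delete atomically.
WITH moved AS (
    DELETE FROM orders
    WHERE created_at < now() - interval '2 years'
    RETURNING *
)
INSERT INTO orders_archive SELECT * FROM moved;
```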
4. UUID Primary Keys Without Thought
UUIDs as primary keys are popular for distributed systems, but random UUIDs (v4) cause B-tree index fragmentation and poor insert performance. Each insert goes to a random position in the index, preventing sequential write optimization.
Fix: Use UUIDv7 (time-ordered UUIDs), which maintains temporal ordering and B-tree locality. Or use BIGSERIAL for internal IDs and UUIDs for external/API-facing identifiers.
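The dual-key variant looks like this (table name illustrative; `gen_random_uuid()` generates random v4 UUIDs and is built in since PostgreSQL 13 — a native `uuidv7()` function only arrives in PostgreSQL 18, with extensions such as pg_uuidv7 providing it earlier):

```sql
CREATE TABLE accounts (
    -- Internal, insert-friendly sequential key used for joins and FKs.
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    -- External identifier exposed in URLs and API responses.
    public_id  uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,
    created_at timestamptz NOT NULL DEFAULT now()
);
```

Joins stay on the compact bigint key; the unguessable UUID is what leaves the system.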
Action Plan: Designing Your Database
Step 1: Understand Your Workload
- What is the read-to-write ratio? (80:20? 99:1?)
- What are the most common query patterns?
- How fast will the data grow? (rows/day, GB/month)
- What are the consistency requirements? (strong? eventual?)
- What are the latency requirements? (p95 under 100ms?)
Step 2: Start Normalized
Begin with a properly normalized schema (3NF). Add indexes for your known query patterns. Measure actual performance before denormalizing.
Step 3: Denormalize Strategically
When specific queries are too slow, denormalize those specific paths. Use materialized views for reporting, computed columns for frequently-calculated values, and JSON columns for variable attributes.
Step 4: Scale Reads Before Writes
Read replicas → caching → CQRS → partitioning → sharding. Follow this progression. Each step adds complexity; stop at the first step that solves your problem.
Step 5: Monitor Continuously
Use pg_stat_statements to identify slow queries. Monitor table sizes, index bloat, and replication lag. Set up alerts for query latency regressions.
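A typical "top offenders" query against pg_stat_statements (requires `shared_preload_libraries = 'pg_stat_statements'` and `CREATE EXTENSION pg_stat_statements`; the column names below are correct for PostgreSQL 13+ — older versions use `total_time`/`mean_time`):

```sql
SELECT round(total_exec_time::numeric, 1) AS total_ms,
       calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```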
Sources
- Use The Index, Luke — SQL Performance
- PostgreSQL Partitioning Documentation
- PostgreSQL Materialized Views
- AWS RDS Read Replicas
- Instagram Engineering Blog
- GitHub Engineering Blog
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
