Scaling Postgres to 1 Million Rows: Indexing, Partitioning, Connection Pooling
Last year, BirJob's main vacancies table crossed 800,000 rows. Queries that used to return in 12ms started taking 1.4 seconds. Our job search page — the single most important page on the site — became sluggish. I spent three weeks deep in Postgres internals, and the lessons I learned apply to any application that's outgrowing its initial database setup.
This isn't a theoretical guide. Every recommendation here comes from real production incidents, benchmarks I ran on actual data, and mistakes I made along the way. Whether you're running a SaaS, an e-commerce platform, or a job board like BirJob, the principles are the same: understand your data access patterns, measure before you optimize, and respect Postgres's strengths instead of fighting them.
Let's start from the ground up.
Part 1: Understanding Why Postgres Slows Down
Before you can fix performance, you need to understand what's happening under the hood. Postgres uses a query planner — a sophisticated piece of software that decides how to execute your query. It evaluates multiple strategies (sequential scan, index scan, bitmap scan, etc.) and picks the one it estimates will be cheapest.
The problem is that as your table grows, the planner's choices change. A sequential scan on 10,000 rows is fast. A sequential scan on 1,000,000 rows is not. But sometimes Postgres still chooses a sequential scan on a million-row table because the index statistics are stale, or because you're selecting too many columns, or because the selectivity of your WHERE clause is too low.
The Three Bottlenecks
In my experience, database slowdowns at the million-row scale come from three places:
- I/O bottleneck: Postgres has to read too many pages from disk. Each 8KB page might contain dozens of rows, but if your query touches thousands of pages, you're waiting on disk I/O. According to PostgreSQL documentation on page layout, each page is exactly 8192 bytes, and understanding this granularity matters for optimization.
- CPU bottleneck: Sorting, hashing, and filtering take CPU time. Complex JOINs with multiple conditions can peg a CPU core at 100%.
- Connection bottleneck: Each Postgres connection uses roughly 5-10MB of RAM. If you have 200 concurrent connections (common for web applications), that's 1-2GB just for connection overhead, before any query processing happens. The PostgreSQL wiki on connection counts recommends keeping connections far below what most developers assume.
Let's tackle each one systematically.
Part 2: Indexing — The First Line of Defense
B-Tree Indexes: Your Default Workhorse
B-tree indexes are the default in Postgres, and for good reason. They support equality checks (=), range queries (<, >, BETWEEN), and sorting (ORDER BY). For most workloads, they're the right choice.
But here's where developers go wrong: they either create too few indexes (queries are slow) or too many (writes are slow, and the planner gets confused). The sweet spot requires understanding your actual query patterns.
Here's how I approach it:
-- Step 1: Find your slow queries
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
-- Step 2: Analyze a specific query
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM vacancies
WHERE company_id = 42
AND status = 'active'
AND created_at > NOW() - INTERVAL '30 days'
ORDER BY created_at DESC
LIMIT 20;
The EXPLAIN (ANALYZE, BUFFERS) output tells you exactly what's happening. Look for:
- Seq Scan on large tables — usually means a missing index
- Rows Removed by Filter — the higher this number, the more wasted work
- Buffers: shared read — high numbers mean lots of disk I/O
Composite Indexes: Order Matters
A composite index on (company_id, status, created_at) is fundamentally different from one on (status, company_id, created_at). Postgres can only use the index efficiently when the query filters on a leading prefix of its columns, so put equality-filtered columns first and the range or sort column last. Among the equality columns, lead with the higher-cardinality one: if you have 500 companies but only 3 statuses, company_id goes before status.
-- Good: high-cardinality column first
CREATE INDEX idx_vacancies_company_status_date
ON vacancies (company_id, status, created_at DESC);
-- This index supports:
-- WHERE company_id = 42
-- WHERE company_id = 42 AND status = 'active'
-- WHERE company_id = 42 AND status = 'active' AND created_at > '2026-01-01'
-- WHERE company_id = 42 ORDER BY created_at DESC
-- This index does NOT efficiently support:
-- WHERE status = 'active' (company_id not specified)
-- WHERE created_at > '2026-01-01' (neither company_id nor status specified)
Partial Indexes: Index Only What Matters
This is one of the most underused features in Postgres. If 80% of your queries filter on status = 'active', and only 15% of rows are active, a partial index is dramatically smaller and faster:
CREATE INDEX idx_vacancies_active
ON vacancies (company_id, created_at DESC)
WHERE status = 'active';
On BirJob, this reduced our main search index from 180MB to 28MB. Smaller index = fits in RAM = faster queries. According to benchmarks from Citus Data's guide on partial indexes, partial indexes can improve query performance by 10-50x depending on the filter selectivity.
GIN and GiST Indexes for Full-Text Search
If you're doing text search (and if you're building a job board, you are), GIN indexes on tsvector columns are essential:
-- Add a tsvector column
ALTER TABLE vacancies ADD COLUMN search_vector tsvector;
-- Populate it
UPDATE vacancies SET search_vector =
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(company, '')), 'B') ||
setweight(to_tsvector('english', coalesce(description, '')), 'C');
-- Create GIN index
CREATE INDEX idx_vacancies_search ON vacancies USING GIN(search_vector);
-- Query with ranking
SELECT title, company,
ts_rank(search_vector, plainto_tsquery('english', 'python developer')) AS rank
FROM vacancies
WHERE search_vector @@ plainto_tsquery('english', 'python developer')
ORDER BY rank DESC
LIMIT 20;
Index Maintenance: The Hidden Cost
Every index slows down INSERT, UPDATE, and DELETE operations. On a table with 10 indexes, an INSERT has to update all 10 index structures. I've seen tables where removing 4 unused indexes improved write throughput by 40%.
Find unused indexes:
SELECT schemaname, relname, indexrelname, idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND schemaname = 'public'
-- Caution: unique and primary-key indexes can show idx_scan = 0
-- yet still enforce constraints; don't drop those
ORDER BY pg_relation_size(indexrelid) DESC;
Part 3: Partitioning — When One Table Isn't Enough
Indexing solves the "find specific rows quickly" problem. Partitioning solves the "the table is so large that even index maintenance is slow" problem.
When to Partition
Don't partition prematurely. Partitioning adds complexity: queries must include the partition key, migrations are harder, and some operations (like unique constraints across partitions) become more complex. Based on guidance from the PostgreSQL partitioning documentation, partitioning typically becomes beneficial when tables exceed tens of millions of rows or when you have clear time-based data retention patterns.
Partition when:
- Your table has 10M+ rows and growing
- You have a natural partition key (date, region, tenant)
- You need to efficiently delete old data (drop partition vs DELETE)
- Query patterns consistently filter on the partition key
Range Partitioning by Date
The most common pattern for time-series data:
-- Create partitioned table
CREATE TABLE job_applications (
id BIGSERIAL,
user_id INTEGER NOT NULL,
vacancy_id INTEGER NOT NULL,
applied_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
status VARCHAR(20) DEFAULT 'pending',
cover_letter TEXT
) PARTITION BY RANGE (applied_at);
-- Create monthly partitions
CREATE TABLE job_applications_2026_01
PARTITION OF job_applications
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE job_applications_2026_02
PARTITION OF job_applications
FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE job_applications_2026_03
PARTITION OF job_applications
FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
-- Automate partition creation with pg_partman
CREATE EXTENSION IF NOT EXISTS pg_partman;
-- pg_partman 4.x signature; in 5.x, drop the 'native' argument
-- and pass the interval as '1 month'
SELECT partman.create_parent(
'public.job_applications',
'applied_at',
'native',
'monthly'
);
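If you script partition management yourself instead of using pg_partman, the monthly scheme above reduces to a simple mapping from a timestamp to a partition name. A sketch, assuming the job_applications_YYYY_MM naming convention used in this article:

```javascript
// Map a timestamp to the monthly partition it lands in, following
// the job_applications_YYYY_MM naming convention used above.
// Postgres does this routing itself on INSERT; a helper like this is
// only useful for tooling, e.g. pre-creating next month's partition.
function partitionFor(table, date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, '0');
  return `${table}_${year}_${month}`;
}

console.log(partitionFor('job_applications', new Date('2026-03-15T10:30:00Z')));
// job_applications_2026_03
```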
List Partitioning by Category
If your queries consistently filter by a categorical value:
CREATE TABLE vacancies (
id BIGSERIAL,
title TEXT NOT NULL,
company TEXT NOT NULL,
source VARCHAR(50) NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY LIST (source);
CREATE TABLE vacancies_linkedin PARTITION OF vacancies FOR VALUES IN ('linkedin');
CREATE TABLE vacancies_glassdoor PARTITION OF vacancies FOR VALUES IN ('glassdoor');
CREATE TABLE vacancies_direct PARTITION OF vacancies FOR VALUES IN ('direct', 'company_website');
CREATE TABLE vacancies_other PARTITION OF vacancies DEFAULT;
Hash Partitioning for Even Distribution
When you don't have a natural range or list but want to spread load:
CREATE TABLE user_sessions (
id BIGSERIAL,
user_id INTEGER NOT NULL,
session_data JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY HASH (user_id);
CREATE TABLE user_sessions_p0 PARTITION OF user_sessions FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE user_sessions_p1 PARTITION OF user_sessions FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE user_sessions_p2 PARTITION OF user_sessions FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE user_sessions_p3 PARTITION OF user_sessions FOR VALUES WITH (MODULUS 4, REMAINDER 3);
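To see why hash partitioning spreads load evenly, it helps to picture the routing: a row's partition is its hashed key modulo the partition count. Postgres uses its own internal hash functions, so the toy hash below is purely illustrative; only the modulus/remainder logic mirrors what Postgres does.

```javascript
// Illustrative hash-partition routing: partition = hash(key) % modulus.
// NOTE: Postgres uses its own internal hash functions; this toy
// integer mixer only demonstrates the remainder logic.
function toyHash(n) {
  let h = n >>> 0;
  h = Math.imul(h ^ (h >>> 16), 0x45d9f3b) >>> 0;
  h = (h ^ (h >>> 16)) >>> 0;
  return h;
}

function partitionIndex(userId, modulus = 4) {
  return toyHash(userId) % modulus; // remainder picks user_sessions_pN
}

// Every user_id deterministically lands in one of the 4 partitions,
// and sequential IDs scatter roughly evenly across p0..p3.
const counts = [0, 0, 0, 0];
for (let id = 1; id <= 10000; id++) counts[partitionIndex(id)]++;
console.log(counts);
```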
Partition Pruning: The Performance Payoff
When your query includes the partition key in the WHERE clause, Postgres skips irrelevant partitions entirely. A query on January's data only scans the January partition, ignoring the other 11 months of data.
-- This query only scans job_applications_2026_03
EXPLAIN (ANALYZE)
SELECT * FROM job_applications
WHERE applied_at >= '2026-03-01' AND applied_at < '2026-04-01'
-- (half-open range: BETWEEN ... '2026-03-31' would miss most of
-- March 31, since the timestamptz literal means midnight that day)
AND user_id = 12345;
-- Output shows:
-- -> Index Scan on job_applications_2026_03
-- (never touches other partitions)
Part 4: Connection Pooling — The Unsung Hero
Here's a statistic that surprises most developers: a Postgres connection that's sitting idle still consumes 5-10MB of RAM. If you're running a Next.js app with serverless functions, each function invocation might open its own connection. Under load, you can easily hit Postgres's max_connections limit (default: 100), causing connection refused errors.
This is where connection pooling becomes critical. The PgBouncer documentation demonstrates that a properly configured connection pooler can handle thousands of application connections with just 20-50 actual Postgres connections.
PgBouncer: The Industry Standard
PgBouncer is a lightweight connection pooler that sits between your application and Postgres. It maintains a pool of actual database connections and multiplexes application requests across them.
# pgbouncer.ini
[databases]
birjob = host=localhost port=5432 dbname=birjob_production
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
# Pool settings
pool_mode = transaction
default_pool_size = 25
max_client_conn = 1000
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
# Timeouts
server_idle_timeout = 600
client_idle_timeout = 0
query_timeout = 30
client_login_timeout = 15
Pool Modes: Session vs Transaction vs Statement
| Mode | How It Works | Best For | Limitations |
|---|---|---|---|
| Session | Client holds connection for entire session | Apps using session-level features (LISTEN/NOTIFY, prepared statements) | Limited multiplexing benefit |
| Transaction | Client holds connection only during a transaction | Most web applications | No session-level state between transactions |
| Statement | Client holds connection for a single statement | Simple autocommit workloads | No multi-statement transactions |
For web applications, transaction mode is almost always the right choice. It provides the best multiplexing while still supporting transactions.
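How big should the server-side pool be? A common starting heuristic from the PostgreSQL wiki's connection-count guidance ties it to hardware, not to how many clients you have. A sketch of that baseline (treat it as a number to benchmark from, not a hard rule):

```javascript
// Starting-point heuristic for sizing the actual Postgres connection
// pool behind PgBouncer, adapted from the PostgreSQL wiki:
//   pool_size ~= (cpu_cores * 2) + effective_spindle_count
// SSDs typically count as 1 "spindle". Benchmark from here.
function suggestedPoolSize(cpuCores, effectiveSpindles = 1) {
  return cpuCores * 2 + effectiveSpindles;
}

// An 8-core server with a single SSD:
console.log(suggestedPoolSize(8, 1)); // 17
```

Note how small this is compared to max_client_conn = 1000 in the config above: a few dozen real connections can serve a thousand application clients when transactions are short.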
Connection Pooling in Application Code
If you're using Prisma (as we do at BirJob), connection pooling is built in:
// schema.prisma
datasource db {
provider = "postgresql"
url = env("DATABASE_URL")
// Prisma's built-in pool
// connection_limit is set via query parameter
// ?connection_limit=10&pool_timeout=20
}
// For serverless environments, use Prisma Data Proxy or PgBouncer
// DATABASE_URL="postgresql://user:pass@pgbouncer:6432/birjob?pgbouncer=true"
For raw Node.js with the pg library:
const { Pool } = require('pg');
const pool = new Pool({
host: 'localhost',
port: 6432, // PgBouncer port
database: 'birjob_production',
user: 'app_user',
password: process.env.DB_PASSWORD,
max: 20, // Max connections in this pool
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000,
});
// Always release connections back to the pool
async function getActiveVacancies(companyId) {
const client = await pool.connect();
try {
const result = await client.query(
'SELECT * FROM vacancies WHERE company_id = $1 AND status = $2',
[companyId, 'active']
);
return result.rows;
} finally {
client.release(); // Critical: always release
}
}
Part 5: Query Optimization Patterns
Beyond indexing and infrastructure, the queries themselves matter enormously. Here are patterns I've found most impactful at scale.
Keyset Pagination vs OFFSET
The classic LIMIT 20 OFFSET 1000 pattern is a performance disaster on large tables. Postgres has to scan and discard 1000 rows before returning 20. At OFFSET 100000, it's scanning 100,000 rows just to throw them away. Research from Use The Index, Luke shows that keyset pagination maintains constant performance regardless of page depth.
-- Bad: OFFSET pagination (gets slower with each page)
SELECT * FROM vacancies
WHERE status = 'active'
ORDER BY created_at DESC, id DESC
LIMIT 20 OFFSET 10000;
-- Scans 10,020 rows, returns 20
-- Good: Keyset pagination (constant performance)
SELECT * FROM vacancies
WHERE status = 'active'
AND (created_at, id) < ('2026-03-15T10:30:00Z', 985432)
ORDER BY created_at DESC, id DESC
LIMIT 20;
-- Jumps directly to the right position via index
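In application code, the (created_at, id) position is usually serialized into an opaque cursor that the client hands back to fetch the next page. A minimal sketch, assuming a base64url-encoded JSON cursor (the naming here is my own, not a standard):

```javascript
// Opaque keyset-pagination cursor: serialize the last row's
// (created_at, id) position, return it to the client, and decode it
// on the next request to build WHERE (created_at, id) < ($1, $2).
function encodeCursor(createdAt, id) {
  return Buffer.from(JSON.stringify({ createdAt, id })).toString('base64url');
}

function decodeCursor(cursor) {
  return JSON.parse(Buffer.from(cursor, 'base64url').toString('utf8'));
}

const cursor = encodeCursor('2026-03-15T10:30:00Z', 985432);
const { createdAt, id } = decodeCursor(cursor);
console.log(createdAt, id); // 2026-03-15T10:30:00Z 985432
```

The decoded values feed the parameterized query as $1 and $2; the client never needs to know that the cursor is just a sort position.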
Materialized Views for Expensive Aggregations
-- Instead of computing this on every request:
SELECT company, COUNT(*) as vacancy_count,
AVG(salary_min) as avg_salary
FROM vacancies
WHERE status = 'active'
GROUP BY company;
-- Create a materialized view, refresh periodically:
CREATE MATERIALIZED VIEW company_stats AS
SELECT company, COUNT(*) as vacancy_count,
AVG(salary_min) as avg_salary,
MAX(created_at) as latest_vacancy
FROM vacancies
WHERE status = 'active'
GROUP BY company;
CREATE UNIQUE INDEX idx_company_stats_company ON company_stats(company);
-- Refresh every hour (non-blocking)
REFRESH MATERIALIZED VIEW CONCURRENTLY company_stats;
CTEs vs Subqueries: The Performance Trap
Prior to Postgres 12, CTEs (Common Table Expressions) were always materialized, meaning Postgres would execute the CTE fully and store the results before the outer query could use them. Since Postgres 12, the planner can inline CTEs, but only if they're referenced once, aren't recursive, and have no side effects (no data-modifying statements or volatile functions).
-- This CTE may or may not be inlined (Postgres 12+)
WITH active_vacancies AS (
SELECT * FROM vacancies WHERE status = 'active'
)
SELECT * FROM active_vacancies WHERE company_id = 42;
-- If you need to force materialization (e.g., to prevent repeated work):
WITH active_vacancies AS MATERIALIZED (
SELECT * FROM vacancies WHERE status = 'active'
)
SELECT * FROM active_vacancies WHERE company_id = 42;
-- If you need to force inlining:
WITH active_vacancies AS NOT MATERIALIZED (
SELECT * FROM vacancies WHERE status = 'active'
)
SELECT * FROM active_vacancies WHERE company_id = 42;
Part 6: Configuration Tuning
The default Postgres configuration is designed to run on a Raspberry Pi. For a production server, you need to tune several key parameters. The PGTune calculator is a great starting point, but understanding why each parameter matters is more valuable.
Memory Settings
# postgresql.conf — for a server with 16GB RAM
# Shared buffer pool (25% of RAM is a good start)
shared_buffers = 4GB
# Memory per-operation (sorts, hashes, etc.)
work_mem = 64MB
# Be careful: this is per-operation, not per-connection
# A complex query with 5 sort operations uses 5 × 64MB = 320MB
# Memory for maintenance operations (VACUUM, CREATE INDEX)
maintenance_work_mem = 1GB
# OS cache hint (tell Postgres how much RAM the OS has for caching)
effective_cache_size = 12GB
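The per-operation caveat on work_mem is worth making concrete. Here's a back-of-envelope worst-case estimate (my own rough formula, not an official one), which shows why cranking work_mem up blindly is dangerous:

```javascript
// Back-of-envelope worst case for sort/hash memory: every active
// connection runs its costliest query at once, each query using
// `opsPerQuery` work_mem-sized operations. Real usage is far lower,
// but this bounds the blast radius of a work_mem setting.
function worstCaseWorkMemMB(connections, workMemMB, opsPerQuery) {
  return connections * workMemMB * opsPerQuery;
}

// 100 connections, work_mem = 64MB, 5 sort/hash nodes per query:
console.log(worstCaseWorkMemMB(100, 64, 5)); // 32000 MB, double our 16GB of RAM
```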
WAL and Checkpoint Settings
# Write-Ahead Log settings
wal_buffers = 64MB
min_wal_size = 1GB
max_wal_size = 4GB
# Checkpoints (how often dirty pages are flushed to disk)
checkpoint_completion_target = 0.9
checkpoint_timeout = 15min
Autovacuum Tuning
Postgres's MVCC model means that deleted and updated rows aren't immediately reclaimed. VACUUM cleans them up. On high-write tables, the default autovacuum settings are too conservative:
# More aggressive autovacuum for high-write tables
autovacuum_vacuum_scale_factor = 0.02 # default: 0.2
autovacuum_analyze_scale_factor = 0.01 # default: 0.1
autovacuum_vacuum_cost_delay = 2ms # default: 2ms (already good)
autovacuum_max_workers = 4 # default: 3
# Per-table override for very hot tables
ALTER TABLE vacancies SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_analyze_scale_factor = 0.005
);
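Autovacuum fires when dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples (the base threshold defaults to 50). Working the numbers shows why the default scale factor is too lax at the million-row scale:

```javascript
// Dead-tuple count that triggers autovacuum on a table:
//   threshold + scale_factor * row_count
// (autovacuum_vacuum_threshold defaults to 50.)
function autovacuumTriggersAt(rowCount, scaleFactor, threshold = 50) {
  return threshold + scaleFactor * rowCount;
}

// Default scale factor (0.2) on a 1M-row table: over 200,000 dead
// rows pile up before autovacuum even starts.
console.log(autovacuumTriggersAt(1_000_000, 0.2)); // 200050

// Tuned to 0.02, it kicks in ten times sooner.
console.log(autovacuumTriggersAt(1_000_000, 0.02)); // 20050
```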
Part 7: Monitoring and Benchmarking
Essential Monitoring Queries
-- Table sizes including indexes and TOAST
SELECT
schemaname,
relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
-- Cache hit ratio (should be > 99%)
SELECT
sum(heap_blks_read) as heap_read,
sum(heap_blks_hit) as heap_hit,
sum(heap_blks_hit) / GREATEST(sum(heap_blks_hit) + sum(heap_blks_read), 1) as ratio
FROM pg_statio_user_tables;
-- Index usage ratio
SELECT
relname,
seq_scan,
idx_scan,
CASE WHEN seq_scan + idx_scan > 0
THEN round(100.0 * idx_scan / (seq_scan + idx_scan), 1)
ELSE 0
END AS idx_scan_pct
FROM pg_stat_user_tables
WHERE seq_scan + idx_scan > 100
ORDER BY seq_scan DESC;
Benchmarking with pgbench
# Initialize pgbench tables
pgbench -i -s 100 birjob_benchmark
# Run a benchmark: 10 clients, 4 threads, 60 seconds
pgbench -c 10 -j 4 -T 60 -P 5 birjob_benchmark
# Custom benchmark script
pgbench -c 20 -j 4 -T 120 -f custom_benchmark.sql birjob_benchmark
Part 8: My Opinionated Take — What Most Guides Get Wrong
After scaling three different Postgres databases past the million-row mark, here's what I believe most guides get wrong:
1. They focus on reads and ignore writes. Most optimization advice is about making SELECT queries faster. But if your write throughput tanks because you have 15 indexes on a hot table, your application is just as broken. Always measure the write side.
2. They recommend partitioning too early. Partitioning a 500K-row table is premature optimization. You're adding complexity for minimal gain. Good indexes on a single table will outperform bad indexes on a partitioned table every time.
3. They ignore connection management. I've seen applications fall over at 50 requests per second — not because the queries were slow, but because each request opened a new connection, and the connection overhead was the bottleneck. PgBouncer should be in your stack from day one.
4. They don't talk about data modeling. The best optimization is a query you don't have to run. Denormalize hot paths. Use materialized views. Cache at the application level. The fastest query is the one that never hits the database.
Part 9: Action Plan — What to Do Right Now
Here's a concrete action plan, ordered by impact and effort:
Week 1: Measure
- Enable pg_stat_statements if you haven't already
- Identify your top 10 slowest queries by mean execution time
- Check your cache hit ratio — if it's below 99%, you need more RAM or better indexes
- Count your indexes per table — if any table has more than 8, audit for unused ones
Week 2: Quick Wins
- Add missing indexes for your top 5 slow queries
- Convert OFFSET pagination to keyset pagination
- Set up PgBouncer in transaction mode
- Tune shared_buffers and work_mem
Week 3: Structural Improvements
- Create partial indexes for common filter patterns
- Add materialized views for expensive aggregations
- Set up monitoring dashboards (Grafana + pg_stat_statements)
- Evaluate whether any tables need partitioning
Ongoing
- Review query plans for any new feature before deploying
- Run weekly reports on slow queries and index usage
- Load test with realistic data volumes before launch
Conclusion
Scaling Postgres isn't about finding a single silver bullet. It's about systematically eliminating bottlenecks: first with proper indexing, then with query optimization, then with connection management, and finally — only when the data truly demands it — with partitioning.
The most important thing I've learned is to measure first. Don't guess where the bottleneck is. Use EXPLAIN ANALYZE, use pg_stat_statements, use real production data. The answer is always in the numbers.
Postgres is an extraordinary database. It can handle millions of rows on modest hardware — if you treat it with respect and understanding. I hope this guide helps you get there.
Sources
- PostgreSQL Documentation: Table Partitioning
- PostgreSQL Documentation: Database Page Layout
- PostgreSQL Wiki: Number of Database Connections
- PgBouncer: Lightweight Connection Pooler for PostgreSQL
- PGTune: PostgreSQL Configuration Calculator
- Use The Index, Luke: No Offset
- Citus Data: What is a Partial Index?
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
