Data Pipeline Architecture: Batch vs Stream vs Lambda vs Kappa
When I built BirJob's scraping pipeline, I faced a classic data engineering decision. We scrape 80+ job sites daily, producing roughly 15,000 job listings per run. Each listing needs deduplication, normalization, categorization, and enrichment before it's visible to users. The naive approach — scrape everything, process synchronously, write to the database — took 4 hours and crashed regularly when a single source timed out. After redesigning around a proper pipeline architecture, the same workload processes in 35 minutes with zero crashes. The architecture choice was the difference between a fragile script and a reliable system.
This guide covers the four major data pipeline architectures — Batch, Streaming, Lambda, and Kappa — with honest analysis of when each makes sense, what they cost to build and operate, and how to choose between them.
What Is a Data Pipeline?
A data pipeline is a series of data processing steps where the output of one step feeds into the next. At its simplest, it's Extract → Transform → Load (ETL). At production scale, it involves orchestration, error handling, monitoring, backfill capabilities, schema evolution, and exactly-once processing guarantees.
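At its simplest, that Extract → Transform → Load chain is just three composed functions. A minimal sketch, where every name (the `Listing` type, the in-memory `sink`) is illustrative rather than from any particular framework:

```python
from dataclasses import dataclass

@dataclass
class Listing:
    title: str
    company: str

def extract(raw_rows):
    """Extract: pull raw records from a source (here, a list of dicts),
    dropping rows with no title."""
    return [r for r in raw_rows if r.get("title")]

def transform(rows):
    """Transform: clean and normalize each record into a typed Listing."""
    return [Listing(title=r["title"].strip().title(),
                    company=r["company"].strip())
            for r in rows]

def load(listings, sink):
    """Load: write the transformed records to a destination."""
    sink.extend(listings)
    return len(listings)

sink = []
raw = [{"title": " senior engineer ", "company": "Acme"},
       {"title": "", "company": "X"}]  # second row is junk
loaded = load(transform(extract(raw)), sink)
# One row survives extraction; its title is normalized to "Senior Engineer"
```

Everything a production pipeline adds (orchestration, retries, monitoring, backfills) wraps around this same three-stage core.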
The 2024 Data Engineering Survey by Datanami found that 78% of data teams manage more than 50 pipelines, and the average pipeline failure rate is 12% per month. The architecture you choose determines how you handle those failures.
| Architecture | Processing Model | Latency | Complexity | Primary Use Case |
|---|---|---|---|---|
| Batch | Process data in scheduled intervals | Minutes to hours | Low | ETL, analytics, reporting |
| Streaming | Process each event as it arrives | Milliseconds to seconds | High | Real-time analytics, fraud detection |
| Lambda | Batch + Streaming in parallel | Both real-time and batch | Very high | When you need both speed and accuracy |
| Kappa | Streaming only (replaying from log) | Milliseconds to seconds | Medium-high | Event-sourced systems, unified processing |
Architecture 1: Batch Processing
Batch processing is the oldest and best-understood data pipeline pattern. Data accumulates over a period, then gets processed all at once.
```
┌──────────┐    ┌─────────────┐    ┌──────────────┐    ┌───────────┐
│ Sources  │───▶│   Extract   │───▶│  Transform   │───▶│   Load    │
│ (APIs,   │    │ (Scheduled) │    │  (Clean,     │    │ (Database,│
│  Files,  │    │             │    │   Enrich,    │    │  DWH)     │
│  DBs)    │    │             │    │   Aggregate) │    │           │
└──────────┘    └─────────────┘    └──────────────┘    └───────────┘
      │                                                       │
      └──────────── Orchestrator (Airflow, Cron) ─────────────┘
```
When Batch Is the Right Choice
- Your data sources produce data in batches (daily reports, CSV exports, API dumps)
- Consumers don't need real-time data (next-day reporting, weekly analytics)
- Processing is computationally expensive (ML model training, large aggregations)
- You need full-table transformations (deduplicate an entire dataset, recompute all scores)
Batch Pipeline with Apache Airflow
```python
# airflow_dags/job_scraping_pipeline.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# run_all_scrapers, deduplicate_by_apply_link, normalize_job_data,
# auto_categorize_jobs, upsert_to_postgres, run_data_quality_checks,
# and send_job_alerts are plain Python callables defined in the
# project's own task modules.

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['alerts@birjob.com'],
}

with DAG(
    'job_scraping_pipeline',
    default_args=default_args,
    description='Daily job scraping and processing pipeline',
    schedule_interval='0 6 * * *',  # 6 AM daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=1,
    tags=['scraping', 'production'],
) as dag:
    extract = PythonOperator(
        task_id='extract_jobs',
        python_callable=run_all_scrapers,
        op_kwargs={'concurrency': 2, 'timeout': 300},
    )
    deduplicate = PythonOperator(
        task_id='deduplicate',
        python_callable=deduplicate_by_apply_link,
    )
    # Normalize company names, job titles, locations
    normalize = PythonOperator(
        task_id='normalize',
        python_callable=normalize_job_data,
    )
    # ML-based job category assignment
    categorize = PythonOperator(
        task_id='categorize',
        python_callable=auto_categorize_jobs,
    )
    load = PythonOperator(
        task_id='load_to_database',
        python_callable=upsert_to_postgres,
    )
    # Check: min 500 jobs, max 20% duplicate rate, all sources reporting
    quality_check = PythonOperator(
        task_id='quality_check',
        python_callable=run_data_quality_checks,
    )
    notify = PythonOperator(
        task_id='notify_users',
        python_callable=send_job_alerts,
    )

    extract >> deduplicate >> normalize >> categorize >> load >> quality_check >> notify
```
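The DAG's task callables live elsewhere in the project. As one example, a minimal `deduplicate_by_apply_link` keyed on the apply link could look like this. This is a sketch under the assumption that apply links uniquely identify a posting; BirJob's real implementation may differ:

```python
def deduplicate_by_apply_link(jobs):
    """Keep the first job seen for each apply link (assumed unique
    per posting); later duplicates are dropped."""
    seen = set()
    unique = []
    for job in jobs:
        key = job["apply_link"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique

jobs = [
    {"title": "Dev", "apply_link": "https://a.example/1"},
    {"title": "Dev (repost)", "apply_link": "https://a.example/1"},
    {"title": "QA", "apply_link": "https://a.example/2"},
]
unique = deduplicate_by_apply_link(jobs)
# Two unique listings remain; the repost is dropped
```

In production this set would live in the database (a unique index on the link column), not in process memory.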
Cost profile: According to Astronomer's 2024 analysis, managed Airflow runs $150-500/month for small-medium workloads. Self-hosted Airflow on a t3.medium EC2 costs ~$30/month but requires maintenance expertise.
Architecture 2: Stream Processing
Stream processing handles each event individually as it arrives. Instead of "process yesterday's data this morning," it's "process each event within milliseconds of it occurring."
```
┌──────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────┐
│ Events   │───▶│   Message    │───▶│   Stream     │───▶│   Sink   │
│ (API     │    │   Broker     │    │  Processor   │    │  (DB,    │
│ webhooks,│    │   (Kafka,    │    │  (Flink,     │    │  Search, │
│ CDC)     │    │    SQS)      │    │  Consumer)   │    │  Cache)  │
└──────────┘    └──────────────┘    └──────────────┘    └──────────┘
                                           │
                                  Windowing, Joins,
                                  Aggregations in
                                  real-time
```
When Streaming Is the Right Choice
- Your business requires sub-second data freshness (fraud detection, real-time bidding)
- Events are naturally event-driven (user clicks, IoT sensor data, log events)
- You need continuous processing (monitoring, alerting, live dashboards)
- Data volume makes batch impractical (millions of events per second)
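The "Windowing, Joins, Aggregations" box in the diagram above hides most of streaming's real complexity, but the core tumbling-window idea fits in a few lines. Illustrative Python, not Flink; `window_ms` and the event tuples are invented for the example:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Count events per fixed-size (tumbling) window, keyed by the
    window's start timestamp in milliseconds."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_ms) * window_ms
        counts[window_start] += 1
    return dict(counts)

events = [(1000, "a"), (1500, "b"), (2100, "c"), (3999, "d")]
counts = tumbling_window_counts(events, window_ms=1000)
# → {1000: 2, 2000: 1, 3000: 1}
```

What real stream processors add on top is the hard part: out-of-order events, watermarks that decide when a window is "done", and state that survives restarts.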
Stream Processing with Node.js + Redis Streams
```typescript
// A lightweight stream processor using Redis Streams
import Redis from 'ioredis';

const redis = new Redis();

// normalizeJob, checkDuplicate, upsertJob, and the JobListing type
// are defined elsewhere in the application.

// Producer: push job events to stream
async function publishJobEvent(job: JobListing): Promise<void> {
  await redis.xadd(
    'stream:raw-jobs',
    '*', // Auto-generate ID
    'data', JSON.stringify(job),
    'source', job.source,
    'timestamp', Date.now().toString(),
  );
}

// Consumer: process stream events
async function processJobStream(): Promise<void> {
  // Create consumer group (idempotent)
  try {
    await redis.xgroup('CREATE', 'stream:raw-jobs', 'job-processors', '0', 'MKSTREAM');
  } catch (e) {
    // Group already exists — fine
  }

  const consumerId = `processor-${process.pid}`;

  while (true) {
    // Read new messages (blocking for up to 5 seconds)
    const messages = await redis.xreadgroup(
      'GROUP', 'job-processors', consumerId,
      'COUNT', 10,
      'BLOCK', 5000,
      'STREAMS', 'stream:raw-jobs', '>',
    );
    if (!messages || messages.length === 0) continue;

    for (const [stream, entries] of messages) {
      for (const [id, fields] of entries) {
        try {
          const job = JSON.parse(fields[1]); // fields = ['data', '{...}', 'source', ...]

          // Transform
          const normalized = normalizeJob(job);
          const isDuplicate = await checkDuplicate(normalized);
          if (!isDuplicate) {
            // Load to database
            await upsertJob(normalized);
            // Publish to downstream consumers
            await redis.xadd(
              'stream:processed-jobs',
              '*',
              'data', JSON.stringify(normalized),
            );
          }

          // Acknowledge message
          await redis.xack('stream:raw-jobs', 'job-processors', id);
        } catch (error) {
          console.error(`Failed to process message ${id}:`, error);
          // Don't acknowledge — message will be re-delivered
        }
      }
    }
  }
}
```
Performance note: Redis Streams handle 100,000+ messages per second on a single node. For BirJob's 15,000 jobs/day workload, this is massive overkill — but it means the system never becomes a bottleneck, even if we 100x our scraping volume.
Architecture 3: Lambda Architecture
The Lambda architecture, proposed by Nathan Marz, runs a batch path and a streaming path in parallel over the same data. The streaming "speed layer" provides low-latency, possibly approximate results; the batch layer periodically recomputes accurate, complete views; and a serving layer merges the two when answering queries.
```
                     ┌──────────────┐
               ┌────▶│ Speed Layer  │──────┐
               │     │ (Streaming)  │      │
               │     │ ~Real-time   │      ▼
┌──────────┐   │     └──────────────┘  ┌───────┐
│ All Data │───┤                       │ Merge │──▶ Users
│ Sources  │   │     ┌──────────────┐  │ Layer │
└──────────┘   │     │ Batch Layer  │  └───────┘
               └────▶│ (Scheduled)  │      ▲
                     │  Accurate    │──────┘
                     └──────────────┘
```
When Lambda Is the Right Choice
- You need both real-time AND historically accurate views
- Your streaming pipeline can't guarantee exactly-once processing
- Regulatory requirements demand batch-verified results
- Your team has expertise in both batch and stream processing
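The merge layer itself is easy to state in code. A toy sketch (not Marz's implementation), under the assumption that the speed layer is truncated after each batch run, so it only holds counts for events since that run:

```python
def merged_view(batch_view, speed_view):
    """Batch view: accurate counts up to the last batch run.
    Speed view: approximate counts for events since that run.
    The merge layer simply sums the two per key."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view = {"backend": 120, "frontend": 95}  # recomputed nightly
speed_view = {"backend": 3, "data": 1}         # since last batch run
merged = merged_view(batch_view, speed_view)
# → {"backend": 123, "frontend": 95, "data": 1}
```

The five-line merge is not where the cost lives; the cost is that `batch_view` and `speed_view` are produced by two separate systems implementing the same business logic, which is exactly the problem the next section describes.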
The Lambda Architecture Problem
The fundamental issue with Lambda is that you maintain two codebases that process the same data differently. Every business logic change must be made in two places. Every bug might be in one layer but not the other. This operational complexity is why Jay Kreps proposed the Kappa architecture as a simplification.
According to Jay Kreps' influential essay on O'Reilly Radar, the Lambda architecture's "complexity of maintaining two separate systems is significant. The operational burden of running and debugging two different systems is more than most teams anticipate."
Architecture 4: Kappa Architecture
The Kappa architecture eliminates the batch layer entirely. All data goes through the streaming path. For historical reprocessing, you simply replay the event log from the beginning.
```
┌──────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────┐
│ Events   │───▶│  Immutable   │───▶│   Stream     │───▶│ Serving  │
│          │    │  Event Log   │    │  Processor   │    │  Layer   │
│          │    │  (Kafka)     │    │  (v1, v2...) │    │  (DB)    │
└──────────┘    └──────────────┘    └──────────────┘    └──────────┘
                       │                   │
                       │  For reprocessing:│
                       │  1. Deploy v2     │
                       │  2. Replay from   │
                       │     beginning     │
                       └───────────────────┘
```
When Kappa Is the Right Choice
- Your data is naturally event-based (user actions, transactions, sensor data)
- You want a single codebase for both real-time and historical processing
- Your event log retains all historical data (Kafka with infinite retention)
- Processing logic changes frequently (just deploy a new version and replay)
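In a real Kappa deployment, "replay" means starting a fresh Kafka consumer group from the earliest offset. The mechanic is simple enough to show with an in-memory stand-in for the log; everything here (the `EventLog` class, the v1/v2 processors) is invented for illustration:

```python
class EventLog:
    """Stand-in for an immutable, replayable event log (Kafka in practice)."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def replay(self):
        # A new processor version always reads from the beginning ("offset 0")
        yield from self.events

def build_view(log, processor):
    """Deploying new logic under Kappa = replay the whole log through it
    to rebuild the serving table. No separate batch codebase needed."""
    view = {}
    for event in log.replay():
        key, value = processor(event)
        view[key] = view.get(key, 0) + value
    return view

log = EventLog()
for e in [{"source": "siteA", "category": "backend"},
          {"source": "siteB", "category": "backend"},
          {"source": "siteA", "category": "qa"}]:
    log.append(e)

v1 = lambda e: (e["source"], 1)    # v1 counted jobs per source
v2 = lambda e: (e["category"], 1)  # v2 counts jobs per category instead

view_v1 = build_view(log, v1)  # {'siteA': 2, 'siteB': 1}
view_v2 = build_view(log, v2)  # {'backend': 2, 'qa': 1}
```

The v2 view is built from the same log with no migration script and no second pipeline, which is the whole appeal of the pattern. The catch is that replay time grows with log size, so production Kappa setups lean on Kafka features like compacted topics to keep it bounded.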
Decision Framework: Architecture Selection
| Factor | Choose Batch | Choose Streaming | Choose Lambda | Choose Kappa |
|---|---|---|---|---|
| Latency requirement | Hours OK | Sub-second | Mixed | Sub-second |
| Team size | 1-3 engineers | 3-5 engineers | 5+ engineers | 3-5 engineers |
| Data volume | GBs to TBs/day | Any | TBs+/day | Any |
| Processing complexity | Complex OK | Must be fast | Any | Must be fast |
| Infrastructure cost | $$$ | $$$$ | $$$$$ | $$$$ |
| Correctness requirement | High | Eventual | High + fast | Eventual |
Opinionated: Start with Batch, Evolve to Streaming
1. Batch is not dead. Despite the streaming hype, 80% of data workloads are well-served by batch processing. If your users don't need data fresher than 15 minutes, batch is simpler, cheaper, and more reliable. Don't adopt streaming because it's trendy.
2. Lambda is almost always the wrong choice. The operational cost of maintaining two codebases for the same logic is enormous. I've seen teams spend 40% of their engineering time just keeping the batch and speed layers in sync. Unless you have 10+ data engineers and a genuine regulatory need for batch-verified results alongside real-time, avoid Lambda.
3. Kappa is elegant but requires infrastructure investment. You need a durable, high-throughput event log (Kafka) and the ability to replay it efficiently. For small teams, the infrastructure overhead of Kafka makes Kappa impractical.
4. The right evolution path: Batch → Micro-batch → Streaming. Start with Airflow + Postgres. When latency matters, move to 5-minute micro-batches. When sub-second latency is a business requirement (not a nice-to-have), invest in streaming.
5. Don't forget the boring stuff. Schema evolution, data quality monitoring, backfill capabilities, dead letter queues — these unglamorous features matter more than whether you're using Flink or Spark Streaming. A batch pipeline with great monitoring beats a streaming pipeline with no observability.
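The micro-batch step in that evolution path is just a watermark plus a scheduled tick. A sketch with invented names; in practice `fetch_since` would be a SQL query on a `created_at` column and the tick would come from cron or a 5-minute sleep loop:

```python
def run_micro_batch(fetch_since, process, watermark):
    """One micro-batch tick: fetch records newer than the watermark,
    process them, and return the advanced watermark."""
    records = fetch_since(watermark)
    if not records:
        return watermark  # nothing new; watermark unchanged
    process(records)
    return max(r["created_at"] for r in records)

data = [{"id": 1, "created_at": 100},
        {"id": 2, "created_at": 200},
        {"id": 3, "created_at": 300}]
fetch_since = lambda wm: [r for r in data if r["created_at"] > wm]

processed = []
wm = run_micro_batch(fetch_since, processed.extend, 0)   # processes all 3
wm = run_micro_batch(fetch_since, processed.extend, wm)  # no-op: nothing new
```

Because each tick is driven by the watermark, re-running a tick or running it early never double-processes a record, which is what makes this a safe stepping stone toward true streaming.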
Action Plan: Building Your First Pipeline
Week 1: Design
- Map your data sources, transformations, and consumers
- Classify each consumer's latency requirements
- Choose an architecture based on the decision framework above
- Design your schema (and plan for schema evolution from day one)
Week 2-3: Build
- Start with a batch pipeline using Airflow or a simple cron job
- Implement Extract, Transform, Load as separate, testable functions
- Add data quality checks after each stage
- Implement idempotency — every pipeline run should be safely re-runnable
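The idempotency point deserves code. The standard trick is an upsert keyed on a natural identifier, so re-running the same batch changes nothing after the first run. A sketch using SQLite's upsert syntax to stay self-contained (Postgres has the same `INSERT ... ON CONFLICT ... DO UPDATE` form); the table and columns are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        apply_link TEXT PRIMARY KEY,
        title      TEXT,
        updated_at INTEGER
    )
""")

def upsert_jobs(rows):
    """Insert new jobs; on a conflicting apply_link, update in place.
    Re-running the same batch is a no-op beyond the first run."""
    conn.executemany("""
        INSERT INTO jobs (apply_link, title, updated_at)
        VALUES (:apply_link, :title, :updated_at)
        ON CONFLICT(apply_link) DO UPDATE SET
            title = excluded.title,
            updated_at = excluded.updated_at
    """, rows)
    conn.commit()

batch = [{"apply_link": "https://a.example/1", "title": "Dev", "updated_at": 1}]
upsert_jobs(batch)
upsert_jobs(batch)  # safe to re-run: still exactly one row
count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]  # → 1
```

With every stage built this way, a failed run can simply be retried from the top, which is what Airflow's `retries` setting assumes.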
Week 4: Harden
- Add monitoring and alerting (pipeline failures, data quality metrics)
- Implement dead letter queues for failed records
- Add backfill capability (run the pipeline for a past date range)
- Document the pipeline architecture for your team
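Backfill follows almost for free once runs are idempotent and parameterized by date. A sketch; `run_for_date` stands in for whatever entry point your pipeline exposes (in Airflow this is what `catchup` or a manual `dag run` with a logical date gives you):

```python
from datetime import date, timedelta

def backfill(run_for_date, start, end):
    """Re-run the pipeline once per day over [start, end], oldest first.
    Only safe because each run is idempotent for its date."""
    d = start
    while d <= end:
        run_for_date(d)
        d += timedelta(days=1)

ran = []
backfill(ran.append, date(2024, 1, 1), date(2024, 1, 3))
# ran now holds the three dates, in order
```

The hard part of backfill is not this loop; it is making sure every transformation takes its date as a parameter instead of reading "now" from the clock.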
Sources
- Datanami — Data Engineering Survey 2024
- Jay Kreps — Questioning the Lambda Architecture (O'Reilly)
- Astronomer — Airflow Pricing Comparison 2024
- Apache Kafka Documentation
- Apache Flink Documentation
- Nathan Marz — Lambda Architecture
- Redis Streams Documentation
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
