Data Pipeline Architecture: Batch vs Stream vs Lambda vs Kappa
When I built BirJob's scraping pipeline, I faced a classic data engineering decision. We scrape 80+ job sites daily, producing roughly 15,000 job listings per run. Each listing needs deduplication, normalization, categorization, and enrichment before it's visible to users. The naive approach — scrape everything, process synchronously, write to the database — took 4 hours and crashed regularly when a single source timed out. After redesigning around a proper pipeline architecture, the same workload processes in 35 minutes with zero crashes. The architecture choice was the difference between a fragile script and a reliable system.
This guide covers the four major data pipeline architectures — Batch, Streaming, Lambda, and Kappa — with honest analysis of when each makes sense, what they cost to build and operate, and how to choose between them.
What Is a Data Pipeline?
A data pipeline is a series of data processing steps where the output of one step feeds into the next. At its simplest, it's Extract → Transform → Load (ETL). At production scale, it involves orchestration, error handling, monitoring, backfill capabilities, schema evolution, and exactly-once processing guarantees.
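At its simplest, that Extract → Transform → Load chain is just three composed functions. A minimal sketch, where every name (the `Listing` type, the in-memory `sink`) is illustrative rather than from any particular framework:

```python
from dataclasses import dataclass

@dataclass
class Listing:
    title: str
    company: str

def extract(raw_rows):
    """Extract: pull raw records from a source (here, a list of dicts),
    dropping rows with no title."""
    return [r for r in raw_rows if r.get("title")]

def transform(rows):
    """Transform: clean and normalize each record into a typed Listing."""
    return [Listing(title=r["title"].strip().title(),
                    company=r["company"].strip())
            for r in rows]

def load(listings, sink):
    """Load: write the transformed records to a destination."""
    sink.extend(listings)
    return len(listings)

sink = []
raw = [{"title": " senior engineer ", "company": "Acme"},
       {"title": "", "company": "X"}]  # second row is junk
loaded = load(transform(extract(raw)), sink)
# One row survives extraction; its title is normalized to "Senior Engineer"
```

Everything a production pipeline adds (orchestration, retries, monitoring, backfills) wraps around this same three-stage core.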
The 2024 Data Engineering Survey by Datanami found that 78% of data teams manage more than 50 pipelines, and the average pipeline failure rate is 12% per month. The architecture you choose determines how you handle those failures.
| Architecture | Processing Model | Latency | Complexity | Primary Use Case |
|---|---|---|---|---|
| Batch | Process data in scheduled intervals | Minutes to hours | Low | ETL, analytics, reporting |
| Streaming | Process each event as it arrives | Milliseconds to seconds | High | Real-time analytics, fraud detection |
| Lambda | Batch + Streaming in parallel | Both real-time and batch | Very high | When you need both speed and accuracy |
| Kappa | Streaming only (replaying from log) | Milliseconds to seconds | Medium-high | Event-sourced systems, unified processing |
Architecture 1: Batch Processing
Batch processing is the oldest and best-understood data pipeline pattern. Data accumulates over a period, then gets processed all at once.
```
┌──────────┐    ┌─────────────┐    ┌──────────────┐    ┌───────────┐
│ Sources  │───▶│   Extract   │───▶│  Transform   │───▶│   Load    │
│ (APIs,   │    │ (Scheduled) │    │  (Clean,     │    │ (Database,│
│  Files,  │    │             │    │   Enrich,    │    │  DWH)     │
│  DBs)    │    │             │    │   Aggregate) │    │           │
└──────────┘    └─────────────┘    └──────────────┘    └───────────┘
      │                                                       │
      └──────────── Orchestrator (Airflow, Cron) ─────────────┘
```
When Batch Is the Right Choice
- Your data sources produce data in batches (daily reports, CSV exports, API dumps)
- Consumers don't need real-time data (next-day reporting, weekly analytics)
- Processing is computationally expensive (ML model training, large aggregations)
- You need full-table transformations (deduplicate an entire dataset, recompute all scores)
Batch Pipeline with Apache Airflow
```python
# airflow_dags/job_scraping_pipeline.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# run_all_scrapers, deduplicate_by_apply_link, normalize_job_data,
# auto_categorize_jobs, upsert_to_postgres, run_data_quality_checks,
# and send_job_alerts are plain Python callables defined in the
# project's own task modules.

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['alerts@birjob.com'],
}

with DAG(
    'job_scraping_pipeline',
    default_args=default_args,
    description='Daily job scraping and processing pipeline',
    schedule_interval='0 6 * * *',  # 6 AM daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=1,
    tags=['scraping', 'production'],
) as dag:
    extract = PythonOperator(
        task_id='extract_jobs',
        python_callable=run_all_scrapers,
        op_kwargs={'concurrency': 2, 'timeout': 300},
    )
    deduplicate = PythonOperator(
        task_id='deduplicate',
        python_callable=deduplicate_by_apply_link,
    )
    # Normalize company names, job titles, locations
    normalize = PythonOperator(
        task_id='normalize',
        python_callable=normalize_job_data,
    )
    # ML-based job category assignment
    categorize = PythonOperator(
        task_id='categorize',
        python_callable=auto_categorize_jobs,
    )
    load = PythonOperator(
        task_id='load_to_database',
        python_callable=upsert_to_postgres,
    )
    # Check: min 500 jobs, max 20% duplicate rate, all sources reporting
    quality_check = PythonOperator(
        task_id='quality_check',
        python_callable=run_data_quality_checks,
    )
    notify = PythonOperator(
        task_id='notify_users',
        python_callable=send_job_alerts,
    )

    extract >> deduplicate >> normalize >> categorize >> load >> quality_check >> notify
```
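The DAG's task callables live elsewhere in the project. As one example, a minimal `deduplicate_by_apply_link` keyed on the apply link could look like this. This is a sketch under the assumption that apply links uniquely identify a posting; BirJob's real implementation may differ:

```python
def deduplicate_by_apply_link(jobs):
    """Keep the first job seen for each apply link (assumed unique
    per posting); later duplicates are dropped."""
    seen = set()
    unique = []
    for job in jobs:
        key = job["apply_link"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique

jobs = [
    {"title": "Dev", "apply_link": "https://a.example/1"},
    {"title": "Dev (repost)", "apply_link": "https://a.example/1"},
    {"title": "QA", "apply_link": "https://a.example/2"},
]
unique = deduplicate_by_apply_link(jobs)
# Two unique listings remain; the repost is dropped
```

In production this set would live in the database (a unique index on the link column), not in process memory.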
Cost profile: According to Astronomer's 2024 analysis, managed Airflow runs $150-500/month for small-medium workloads. Self-hosted Airflow on a t3.medium EC2 costs ~$30/month but requires maintenance expertise.
Architecture 2: Stream Processing
Stream processing handles each event individually as it arrives. Instead of "process yesterday's data this morning," it's "process each event within milliseconds of it occurring."
```
┌──────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────┐
│ Events   │───▶│   Message    │───▶│   Stream     │───▶│   Sink   │
│ (API     │    │   Broker     │    │  Processor   │    │  (DB,    │
│ webhooks,│    │   (Kafka,    │    │  (Flink,     │    │  Search, │
│ CDC)     │    │    SQS)      │    │  Consumer)   │    │  Cache)  │
└──────────┘    └──────────────┘    └──────────────┘    └──────────┘
                                           │
                                  Windowing, Joins,
                                  Aggregations in
                                  real-time
```
When Streaming Is the Right Choice
- Your business requires sub-second data freshness (fraud detection, real-time bidding)
- Events are naturally event-driven (user clicks, IoT sensor data, log events)
- You need continuous processing (monitoring, alerting, live dashboards)
- Data volume makes batch impractical (millions of events per second)
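The "Windowing, Joins, Aggregations" box in the diagram above hides most of streaming's real complexity, but the core tumbling-window idea fits in a few lines. Illustrative Python, not Flink; `window_ms` and the event tuples are invented for the example:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Count events per fixed-size (tumbling) window, keyed by the
    window's start timestamp in milliseconds."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_ms) * window_ms
        counts[window_start] += 1
    return dict(counts)

events = [(1000, "a"), (1500, "b"), (2100, "c"), (3999, "d")]
counts = tumbling_window_counts(events, window_ms=1000)
# → {1000: 2, 2000: 1, 3000: 1}
```

What real stream processors add on top is the hard part: out-of-order events, watermarks that decide when a window is "done", and state that survives restarts.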
Stream Processing with Node.js + Redis Streams
```typescript
// A lightweight stream processor using Redis Streams
import Redis from 'ioredis';

const redis = new Redis();

// normalizeJob, checkDuplicate, upsertJob, and the JobListing type
// are defined elsewhere in the application.

// Producer: push job events to stream
async function publishJobEvent(job: JobListing): Promise<void> {
  await redis.xadd(
    'stream:raw-jobs',
    '*', // Auto-generate ID
    'data', JSON.stringify(job),
    'source', job.source,
    'timestamp', Date.now().toString(),
  );
}

// Consumer: process stream events
async function processJobStream(): Promise<void> {
  // Create consumer group (idempotent)
  try {
    await redis.xgroup('CREATE', 'stream:raw-jobs', 'job-processors', '0', 'MKSTREAM');
  } catch (e) {
    // Group already exists — fine
  }

  const consumerId = `processor-${process.pid}`;

  while (true) {
    // Read new messages (blocking for up to 5 seconds)
    const messages = await redis.xreadgroup(
      'GROUP', 'job-processors', consumerId,
      'COUNT', 10,
      'BLOCK', 5000,
      'STREAMS', 'stream:raw-jobs', '>',
    );
    if (!messages || messages.length === 0) continue;

    for (const [stream, entries] of messages) {
      for (const [id, fields] of entries) {
        try {
          const job = JSON.parse(fields[1]); // fields = ['data', '{...}', 'source', ...]

          // Transform
          const normalized = normalizeJob(job);
          const isDuplicate = await checkDuplicate(normalized);
          if (!isDuplicate) {
            // Load to database
            await upsertJob(normalized);
            // Publish to downstream consumers
            await redis.xadd(
              'stream:processed-jobs',
              '*',
              'data', JSON.stringify(normalized),
            );
          }

          // Acknowledge message
          await redis.xack('stream:raw-jobs', 'job-processors', id);
        } catch (error) {
          console.error(`Failed to process message ${id}:`, error);
          // Don't acknowledge — message will be re-delivered
        }
      }
    }
  }
}
```
Performance note: Redis Streams handle 100,000+ messages per second on a single node. For BirJob's 15,000 jobs/day workload, this is massive overkill — but it means the system never becomes a bottleneck, even if we 100x our scraping volume.
Architecture 3: Lambda Architecture
The Lambda architecture, proposed by Nathan Marz, runs a batch path and a streaming path in parallel over the same data. The streaming "speed layer" provides low-latency, possibly approximate results; the batch layer periodically recomputes accurate, complete views; and a serving layer merges the two when answering queries.
```
                     ┌──────────────┐
               ┌────▶│ Speed Layer  │──────┐
               │     │ (Streaming)  │      │
               │     │ ~Real-time   │      ▼
┌──────────┐   │     └──────────────┘  ┌───────┐
│ All Data │───┤                       │ Merge │──▶ Users
│ Sources  │   │     ┌──────────────┐  │ Layer │
└──────────┘   │     │ Batch Layer  │  └───────┘
               └────▶│ (Scheduled)  │      ▲
                     │  Accurate    │──────┘
                     └──────────────┘
```
When Lambda Is the Right Choice
- You need both real-time AND historically accurate views
- Your streaming pipeline can't guarantee exactly-once processing
- Regulatory requirements demand batch-verified results
- Your team has expertise in both batch and stream processing
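The merge layer itself is easy to state in code. A toy sketch (not Marz's implementation), under the assumption that the speed layer is truncated after each batch run, so it only holds counts for events since that run:

```python
def merged_view(batch_view, speed_view):
    """Batch view: accurate counts up to the last batch run.
    Speed view: approximate counts for events since that run.
    The merge layer simply sums the two per key."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view = {"backend": 120, "frontend": 95}  # recomputed nightly
speed_view = {"backend": 3, "data": 1}         # since last batch run
merged = merged_view(batch_view, speed_view)
# → {"backend": 123, "frontend": 95, "data": 1}
```

The five-line merge is not where the cost lives; the cost is that `batch_view` and `speed_view` are produced by two separate systems implementing the same business logic, which is exactly the problem the next section describes.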
The Lambda Architecture Problem
The fundamental issue with Lambda is that you maintain two codebases that process the same data differently. Every business logic change must be made in two places. Every bug might be in one layer but not the other. This operational complexity is why Jay Kreps proposed the Kappa architecture as a simplification.
According to Jay Kreps' influential essay on O'Reilly Radar, the Lambda architecture's "complexity of maintaining two separate systems is significant. The operational burden of running and debugging two different systems is more than most teams anticipate."
Architecture 4: Kappa Architecture
The Kappa architecture eliminates the batch layer entirely. All data goes through the streaming path. For historical reprocessing, you simply replay the event log from the beginning.
```
┌──────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────┐
│ Events   │───▶│  Immutable   │───▶│   Stream     │───▶│ Serving  │
│          │    │  Event Log   │    │  Processor   │    │  Layer   │
│          │    │  (Kafka)     │    │  (v1, v2...) │    │  (DB)    │
└──────────┘    └──────────────┘    └──────────────┘    └──────────┘
                       │                   │
                       │  For reprocessing:│
                       │  1. Deploy v2     │
                       │  2. Replay from   │
                       │     beginning     │
                       └───────────────────┘
```
When Kappa Is the Right Choice
- Your data is naturally event-based (user actions, transactions, sensor data)
- You want a single codebase for both real-time and historical processing
- Your event log retains all historical data (Kafka with infinite retention)
- Processing logic changes frequently (just deploy a new version and replay)
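In a real Kappa deployment, "replay" means starting a fresh Kafka consumer group from the earliest offset. The mechanic is simple enough to show with an in-memory stand-in for the log; everything here (the `EventLog` class, the v1/v2 processors) is invented for illustration:

```python
class EventLog:
    """Stand-in for an immutable, replayable event log (Kafka in practice)."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def replay(self):
        # A new processor version always reads from the beginning ("offset 0")
        yield from self.events

def build_view(log, processor):
    """Deploying new logic under Kappa = replay the whole log through it
    to rebuild the serving table. No separate batch codebase needed."""
    view = {}
    for event in log.replay():
        key, value = processor(event)
        view[key] = view.get(key, 0) + value
    return view

log = EventLog()
for e in [{"source": "siteA", "category": "backend"},
          {"source": "siteB", "category": "backend"},
          {"source": "siteA", "category": "qa"}]:
    log.append(e)

v1 = lambda e: (e["source"], 1)    # v1 counted jobs per source
v2 = lambda e: (e["category"], 1)  # v2 counts jobs per category instead

view_v1 = build_view(log, v1)  # {'siteA': 2, 'siteB': 1}
view_v2 = build_view(log, v2)  # {'backend': 2, 'qa': 1}
```

The v2 view is built from the same log with no migration script and no second pipeline, which is the whole appeal of the pattern. The catch is that replay time grows with log size, so production Kappa setups lean on Kafka features like compacted topics to keep it bounded.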
Decision Framework: Architecture Selection
| Factor | Choose Batch | Choose Streaming | Choose Lambda | Choose Kappa |
|---|---|---|---|---|
| Latency requirement | Hours OK | Sub-second | Mixed | Sub-second |
| Team size | 1-3 engineers | 3-5 engineers | 5+ engineers | 3-5 engineers |
| Data volume | GBs to TBs/day | Any | TBs+/day | Any |
| Processing complexity | Complex OK | Must be fast | Any | Must be fast |
| Infrastructure cost | $$$ | $$$$ | $$$$$ | $$$$ |
| Correctness requirement | High | Eventual | High + fast | Eventual |
Opinionated: Start with Batch, Evolve to Streaming
1. Batch is not dead. Despite the streaming hype, 80% of data workloads are well-served by batch processing. If your users don't need data fresher than 15 minutes, batch is simpler, cheaper, and more reliable. Don't adopt streaming because it's trendy.
2. Lambda is almost always the wrong choice. The operational cost of maintaining two codebases for the same logic is enormous. I've seen teams spend 40% of their engineering time just keeping the batch and speed layers in sync. Unless you have 10+ data engineers and a genuine regulatory need for batch-verified results alongside real-time, avoid Lambda.
3. Kappa is elegant but requires infrastructure investment. You need a durable, high-throughput event log (Kafka) and the ability to replay it efficiently. For small teams, the infrastructure overhead of Kafka makes Kappa impractical.
4. The right evolution path: Batch → Micro-batch → Streaming. Start with Airflow + Postgres. When latency matters, move to 5-minute micro-batches. When sub-second latency is a business requirement (not a nice-to-have), invest in streaming.
5. Don't forget the boring stuff. Schema evolution, data quality monitoring, backfill capabilities, dead letter queues — these unglamorous features matter more than whether you're using Flink or Spark Streaming. A batch pipeline with great monitoring beats a streaming pipeline with no observability.
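The micro-batch step in that evolution path is just a watermark plus a scheduled tick. A sketch with invented names; in practice `fetch_since` would be a SQL query on a `created_at` column and the tick would come from cron or a 5-minute sleep loop:

```python
def run_micro_batch(fetch_since, process, watermark):
    """One micro-batch tick: fetch records newer than the watermark,
    process them, and return the advanced watermark."""
    records = fetch_since(watermark)
    if not records:
        return watermark  # nothing new; watermark unchanged
    process(records)
    return max(r["created_at"] for r in records)

data = [{"id": 1, "created_at": 100},
        {"id": 2, "created_at": 200},
        {"id": 3, "created_at": 300}]
fetch_since = lambda wm: [r for r in data if r["created_at"] > wm]

processed = []
wm = run_micro_batch(fetch_since, processed.extend, 0)   # processes all 3
wm = run_micro_batch(fetch_since, processed.extend, wm)  # no-op: nothing new
```

Because each tick is driven by the watermark, re-running a tick or running it early never double-processes a record, which is what makes this a safe stepping stone toward true streaming.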
Action Plan: Building Your First Pipeline
Week 1: Design
- Map your data sources, transformations, and consumers
- Classify each consumer's latency requirements
- Choose an architecture based on the decision framework above
- Design your schema (and plan for schema evolution from day one)
Week 2-3: Build
- Start with a batch pipeline using Airflow or a simple cron job
- Implement Extract, Transform, Load as separate, testable functions
- Add data quality checks after each stage
- Implement idempotency — every pipeline run should be safely re-runnable
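The idempotency point deserves code. The standard trick is an upsert keyed on a natural identifier, so re-running the same batch changes nothing after the first run. A sketch using SQLite's upsert syntax to stay self-contained (Postgres has the same `INSERT ... ON CONFLICT ... DO UPDATE` form); the table and columns are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        apply_link TEXT PRIMARY KEY,
        title      TEXT,
        updated_at INTEGER
    )
""")

def upsert_jobs(rows):
    """Insert new jobs; on a conflicting apply_link, update in place.
    Re-running the same batch is a no-op beyond the first run."""
    conn.executemany("""
        INSERT INTO jobs (apply_link, title, updated_at)
        VALUES (:apply_link, :title, :updated_at)
        ON CONFLICT(apply_link) DO UPDATE SET
            title = excluded.title,
            updated_at = excluded.updated_at
    """, rows)
    conn.commit()

batch = [{"apply_link": "https://a.example/1", "title": "Dev", "updated_at": 1}]
upsert_jobs(batch)
upsert_jobs(batch)  # safe to re-run: still exactly one row
count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]  # → 1
```

With every stage built this way, a failed run can simply be retried from the top, which is what Airflow's `retries` setting assumes.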
Week 4: Harden
- Add monitoring and alerting (pipeline failures, data quality metrics)
- Implement dead letter queues for failed records
- Add backfill capability (run the pipeline for a past date range)
- Document the pipeline architecture for your team
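Backfill follows almost for free once runs are idempotent and parameterized by date. A sketch; `run_for_date` stands in for whatever entry point your pipeline exposes (in Airflow this is what `catchup` or a manual `dag run` with a logical date gives you):

```python
from datetime import date, timedelta

def backfill(run_for_date, start, end):
    """Re-run the pipeline once per day over [start, end], oldest first.
    Only safe because each run is idempotent for its date."""
    d = start
    while d <= end:
        run_for_date(d)
        d += timedelta(days=1)

ran = []
backfill(ran.append, date(2024, 1, 1), date(2024, 1, 3))
# ran now holds the three dates, in order
```

The hard part of backfill is not this loop; it is making sure every transformation takes its date as a parameter instead of reading "now" from the clock.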
Sources
- Datanami — Data Engineering Survey 2024
- Jay Kreps — Questioning the Lambda Architecture (O'Reilly)
- Astronomer — Airflow Pricing Comparison 2024
- Apache Kafka Documentation
- Apache Flink Documentation
- Nathan Marz — Lambda Architecture
- Redis Streams Documentation
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
