Event-Driven Architecture: When, Why, and How
Two years ago, I inherited a system where a single "create order" API call triggered 14 synchronous HTTP requests to downstream services. Payment processing, inventory checks, email notifications, analytics events, loyalty points calculation — all in one request chain. The p99 latency was 12 seconds. When any one of those 14 services was slow or down, the entire order flow failed.
We rewrote it to be event-driven in three months. The order API now does exactly three things: validate the request, persist the order, and publish an "OrderCreated" event. Everything else happens asynchronously. The p99 latency dropped to 180ms. When the email service goes down, orders still process. When the analytics pipeline is slow, nobody notices.
That experience made me a believer in event-driven architecture — but also taught me that it's not a silver bullet. Event-driven systems introduce their own category of complexity: eventual consistency, message ordering, duplicate handling, and debugging distributed workflows. This article covers when event-driven architecture is the right choice, the major message broker options, and the patterns that make it work in production.
What Is Event-Driven Architecture?
Event-driven architecture (EDA) is a design pattern where systems communicate by producing and consuming events — records of things that happened. Instead of service A directly calling service B (request-response), service A publishes an event ("OrderCreated"), and service B subscribes to that event and reacts to it independently.
The key distinction is temporal decoupling. The producer doesn't wait for the consumer. It doesn't even know who the consumers are. This decoupling enables:
- Independent scaling: Consumers can be scaled independently based on their own processing capacity
- Fault isolation: A failing consumer doesn't affect the producer or other consumers
- Extensibility: New consumers can be added without modifying the producer
- Temporal flexibility: Events can be processed immediately, batched, or replayed
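The decoupling described above can be sketched as a toy in-process event bus. This is an illustration of the producer/consumer relationship, not a real broker; the `EventBus` class and handler names are hypothetical.

```python
from collections import defaultdict

# Toy in-process event bus: the producer publishes and moves on; it
# neither knows nor waits for individual consumers.
class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            try:
                handler(payload)
            except Exception:
                pass  # a failing consumer doesn't affect the others

bus = EventBus()
received = []
bus.subscribe("OrderCreated", lambda e: received.append(("email", e["order_id"])))
bus.subscribe("OrderCreated", lambda e: received.append(("analytics", e["order_id"])))
bus.publish("OrderCreated", {"order_id": 42})
```

In a real system the handlers would run in separate processes behind a broker, which is what makes the decoupling temporal as well as logical.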
According to the Gartner Technology Trends report, over 60% of new digital business solutions will incorporate event-driven patterns by 2026, up from approximately 30% in 2022. The shift is driven by the need for real-time responsiveness and system resilience.
The Big Three: Kafka vs RabbitMQ vs SQS
Choosing a message broker is one of the most consequential infrastructure decisions in an event-driven system. The three dominant options serve fundamentally different use cases.
Apache Kafka
Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant, publish-subscribe messaging. It was originally developed at LinkedIn and is now maintained by the Apache Software Foundation and commercially supported by Confluent.
Key characteristics:
- Events are stored in an append-only log, partitioned across brokers
- Consumers track their own position (offset) in the log — events aren't deleted after consumption
- Supports replay: consumers can re-read historical events
- Scales to millions of events per second with horizontal partitioning
- Strong ordering guarantees within a partition
When to use Kafka: High-volume event streaming (clickstreams, IoT sensor data, financial transactions), event sourcing, real-time analytics pipelines, and any scenario where event replay is required.
When NOT to use Kafka: Simple task queues, low-volume applications (under 1,000 events/second), or when you need complex routing logic (Kafka's routing is partition-based, which is limited compared to RabbitMQ).
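Kafka's log model — append-only partitions, consumer-tracked offsets, replay — can be illustrated with a few lines of plain Python. This is a conceptual sketch, not Kafka's API; the `Log` and `Consumer` classes are invented for illustration.

```python
# Sketch of Kafka's model: an append-only, partitioned log where each
# consumer tracks its own offset and can rewind to replay history.
class Log:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, event):
        # Partition by key so related events keep their relative order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(event)

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offsets = [0] * len(log.partitions)  # position per partition

    def poll(self):
        out = []
        for p, events in enumerate(self.log.partitions):
            out.extend(events[self.offsets[p]:])
            self.offsets[p] = len(events)  # events stay in the log
        return out

    def seek_to_beginning(self):
        self.offsets = [0] * len(self.log.partitions)  # enables replay

log = Log()
for i in range(5):
    log.append(key=f"order-{i}", event=f"OrderCreated:{i}")

c = Consumer(log)
first = c.poll()        # reads all 5 events
c.seek_to_beginning()
replayed = c.poll()     # same 5 events again: nothing was deleted
```

The contrast with a queue is the whole point: `poll` advances an offset rather than deleting anything, which is why replay is free in a log-based system.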
RabbitMQ
RabbitMQ is a traditional message broker implementing the AMQP protocol. It's designed for reliable message delivery with sophisticated routing capabilities.
Key characteristics:
- Messages are stored in queues, not logs — they're deleted after acknowledgment
- Rich routing via exchanges (direct, topic, fanout, headers)
- Push-based delivery — the broker pushes messages to consumers
- Supports priority queues, dead letter queues, and delayed messages
- Lower operational complexity than Kafka for small-to-medium deployments
When to use RabbitMQ: Task queues (background job processing), complex routing requirements, request-reply patterns, and applications requiring message priority or TTL (time-to-live).
When NOT to use RabbitMQ: High-throughput streaming (RabbitMQ tops out around 50,000 messages/second per queue), event replay requirements, or log-based processing.
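RabbitMQ's topic exchanges route by pattern-matching dotted routing keys, where `*` matches exactly one word and `#` matches zero or more. Below is a simplified Python sketch of that matching logic (it only handles `#` as a trailing wildcard, and the binding names are invented); real RabbitMQ does this inside the broker.

```python
# Simplified RabbitMQ-style topic matching: '*' = exactly one word,
# '#' = zero or more words (handled here only as a trailing wildcard).
def matches(pattern: str, routing_key: str) -> bool:
    p_words = pattern.split(".")
    k_words = routing_key.split(".")
    for i, pw in enumerate(p_words):
        if pw == "#":
            return True           # trailing '#' swallows the rest
        if i >= len(k_words):
            return False
        if pw != "*" and pw != k_words[i]:
            return False
    return len(p_words) == len(k_words)

bindings = {
    "order.*": "billing_queue",    # any single word after "order."
    "order.#": "audit_queue",      # everything under "order"
    "order.created": "email_queue",
}

def route(routing_key):
    return sorted(q for pat, q in bindings.items() if matches(pat, routing_key))

route("order.created")     # matches all three bindings
route("order.item.added")  # matches only the '#' binding
```

This kind of routing is what Kafka's partition-based model cannot express directly, and it is the main reason to prefer RabbitMQ when consumers need different slices of the event stream.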
Amazon SQS
Amazon SQS is a fully managed message queuing service. It's the simplest option — no infrastructure to manage, no clusters to configure.
Key characteristics:
- Fully managed — zero operational overhead
- Two queue types: Standard (at-least-once, best-effort ordering) and FIFO (exactly-once, strict ordering)
- Integrates natively with AWS Lambda, SNS, and other AWS services
- Standard queues scale automatically to virtually unlimited throughput; FIFO queues have per-second throughput limits
- Pay per request (roughly $0.40 per million requests for standard queues)
When to use SQS: AWS-native applications, simple task queues, Lambda-driven architectures, and when you want zero operational overhead. Pair it with SNS (Simple Notification Service) for pub-sub patterns.
When NOT to use SQS: Multi-cloud deployments, event replay requirements, complex routing logic, or when you need consumer groups (SQS doesn't have them natively).
Comparison Table
| Feature | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Model | Log-based streaming | Message queue (AMQP) | Managed message queue |
| Throughput | Millions/sec | ~50K/sec per queue | Virtually unlimited (standard queues) |
| Message Retention | Configurable (days/weeks/forever) | Until consumed | Up to 14 days |
| Replay | Yes (offset reset) | No | No |
| Ordering | Per partition | Per queue | FIFO queues only |
| Routing | Topics + partitions | Exchanges (rich routing) | Simple (queue-based) |
| Delivery Guarantee | At-least-once (exactly-once with transactions) | At-least-once (with ack) | At-least-once (standard) / Exactly-once (FIFO) |
| Ops Complexity | High (ZooKeeper/KRaft, brokers, partitions) | Medium (clustering, mirroring) | Zero (fully managed) |
| Cost (small scale) | $500+/mo (3-node cluster) or Confluent Cloud | $100+/mo (single node) | $5-50/mo (pay per use) |
| Best For | Event streaming, replay, analytics | Task queues, complex routing | AWS apps, serverless, simplicity |
CQRS: Separating Reads from Writes
CQRS (Command Query Responsibility Segregation) is a pattern that naturally complements event-driven architecture. The idea is simple: use separate models for reading and writing data. Commands (writes) go to one model, queries (reads) go to another.
In practice, this often means:
- Write side: A normalized relational database (PostgreSQL) handles commands. It ensures data integrity, enforces constraints, and emits events for every state change.
- Read side: A denormalized read store (Elasticsearch, Redis, a materialized view) is optimized for query performance. It's updated asynchronously by consuming events from the write side.
According to Microsoft's architecture documentation, CQRS is appropriate when "the number of reads vastly exceeds the number of writes" and when "read and write workloads have different scaling requirements."
A Practical Example
Consider an e-commerce product catalog. Writes are infrequent (products are added or updated a few times per day) but reads are constant (thousands of search queries per minute). With CQRS:
- A product manager updates a product's price via the admin panel (command)
- The command handler validates the change and persists it to PostgreSQL
- A "ProductPriceUpdated" event is published to Kafka
- An Elasticsearch consumer reads the event and updates the search index
- A Redis consumer reads the event and invalidates the cached product page
- An analytics consumer reads the event and logs the price change for reporting
Each consumer operates independently. The search index might be updated in 200ms, the cache in 50ms, and the analytics log in 2 seconds. The user who changed the price doesn't wait for any of them.
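The fan-out in steps 3-6 above can be sketched with in-memory stand-ins for each read store. Everything here is illustrative: the dictionaries stand in for Elasticsearch, Redis, and the analytics pipeline, and the handler names are invented.

```python
# Sketch of the CQRS fan-out: one ProductPriceUpdated event, three
# independent consumers, each maintaining its own read model.
search_index = {}                               # stand-in for Elasticsearch
page_cache = {"product:17": "<cached html>"}    # stand-in for Redis
audit_log = []                                  # stand-in for analytics

def on_price_updated_search(event):
    search_index[event["product_id"]] = event["new_price"]

def on_price_updated_cache(event):
    page_cache.pop(f"product:{event['product_id']}", None)  # invalidate

def on_price_updated_analytics(event):
    audit_log.append((event["product_id"], event["old_price"], event["new_price"]))

event = {"type": "ProductPriceUpdated", "product_id": 17,
         "old_price": 999, "new_price": 899}

# In production each consumer runs on its own schedule; here we just
# invoke them in sequence to show they share nothing but the event.
for consumer in (on_price_updated_search, on_price_updated_cache,
                 on_price_updated_analytics):
    consumer(event)
```

Note that the consumers never touch each other's stores: adding a fourth consumer (say, a price-history service) requires no change to the producer or the existing three.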
The Catch: Eventual Consistency
CQRS introduces eventual consistency between the write and read models. For a brief period after a write, the read model may return stale data. This is acceptable for most applications (do you need to see a price change reflected in search results within 100ms?) but unacceptable for others (financial balances, inventory counts).
Mitigation strategies:
- Read-your-own-writes: After a write, redirect the user to read from the write model (or a synchronized read replica) for a short period
- Optimistic UI updates: Update the UI immediately based on the command, then reconcile with the read model when it catches up
- Causal consistency: Include a version number or timestamp in events to detect and handle stale reads
Event Sourcing: The Nuclear Option
Event sourcing takes event-driven architecture to its logical extreme: instead of storing the current state of an entity, you store the complete sequence of events that led to the current state. The current state is derived by replaying events.
Think of it like a bank account. Instead of storing "balance: $1,000," you store every transaction: "deposited $500," "withdrew $200," "deposited $700." The balance is calculated by replaying the transaction history.
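The bank-account analogy translates directly to code: state is never stored, only derived by folding over the event history. A minimal sketch (event shapes and function names are illustrative):

```python
# Event sourcing in miniature: current state is a fold over the
# immutable event history, never stored directly.
def apply(balance, event):
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrew":
        return balance - amount
    raise ValueError(f"unknown event: {kind}")

history = [("deposited", 500), ("withdrew", 200), ("deposited", 700)]

def balance_after(events):
    balance = 0
    for e in events:
        balance = apply(balance, e)
    return balance

balance_after(history)      # the current balance: 1000
balance_after(history[:2])  # the balance "as of event 2", for free
</n>```

The second call is the temporal-query superpower mentioned below: because the history is the source of truth, past states are just shorter replays. In practice you would also snapshot periodically so you don't replay thousands of events on every read.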
When Event Sourcing Makes Sense
- Audit requirements: Financial systems, healthcare records, legal documents — any domain where you need a complete, immutable history of every change
- Complex business logic: When the rules for state transitions are complex and need to be tested/replayed independently
- Temporal queries: "What was the state of this entity at 3pm yesterday?" — trivial with event sourcing, nearly impossible with traditional CRUD
When Event Sourcing Is Overkill
- Simple CRUD applications: If your entities have straightforward create/update/delete operations, event sourcing adds complexity without clear benefit
- High-write, low-read scenarios: Replaying thousands of events to derive current state is expensive if you do it frequently
- Teams without event-driven experience: Event sourcing is the hardest pattern to get right. Start with simple pub-sub before attempting it.
Martin Fowler's seminal article on event sourcing remains the best introduction to the pattern and its trade-offs.
Common Pitfalls and How to Avoid Them
1. The Event Schema Evolution Problem
Events are contracts between producers and consumers. When you change an event's schema (adding a field, renaming a field, changing a type), you risk breaking every consumer. Unlike API versioning, event schema changes affect consumers that may have been running for months without updates.
Solution: Use a schema registry (Confluent Schema Registry for Kafka, or a custom registry). Enforce backward compatibility. Use Avro or Protobuf (which have built-in schema evolution) instead of JSON.
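One concrete backward-compatibility tactic, sketched below without a registry: the consumer supplies defaults for fields added after it was written, so v1 events and v2 events parse identically. The field names and the `V1_DEFAULTS` mapping are hypothetical; Avro formalizes exactly this idea with per-field defaults.

```python
# Sketch of backward-compatible schema evolution: new optional fields
# get defaults at the consumer, and fields are never renamed or removed
# in place (deprecate over multiple releases instead).
V1_DEFAULTS = {"currency": "USD"}   # field added in schema v2

def parse_order_created(raw: dict) -> dict:
    return {**V1_DEFAULTS, **raw}

old_event = {"order_id": 1, "amount": 100}                   # from a v1 producer
new_event = {"order_id": 2, "amount": 50, "currency": "EUR"} # from a v2 producer

parse_order_created(old_event)["currency"]  # falls back to the default
parse_order_created(new_event)["currency"]  # uses the producer's value
```

A schema registry automates the other half of this contract: it rejects a producer's new schema at publish time if it would break existing consumers.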
2. The Distributed Monolith
If every service needs events from every other service, you haven't built a distributed system — you've built a distributed monolith with a message broker in the middle. The coupling is still there; it's just hidden behind events.
Solution: Define clear domain boundaries. Each domain publishes its own events and only consumes events from adjacent domains. Use domain events (business-level) rather than CRUD events (technical-level).
3. Message Ordering Assumptions
In distributed systems, events can arrive out of order, be duplicated, or be delayed. Designing consumers that assume strict ordering will lead to data corruption.
Solution: Design idempotent consumers. Use event IDs to detect and skip duplicates. Use timestamps or version numbers to handle out-of-order delivery. In Kafka, use partition keys to ensure ordering for related events.
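Both defenses — dedup by event ID and stale-update rejection by version — fit in one small consumer. A sketch with invented names (in production `seen_ids` would be persisted alongside the projection, not held in memory):

```python
# An idempotent, order-tolerant consumer: duplicates are skipped via
# event IDs, and out-of-order older updates are dropped via versions.
class PriceProjection:
    def __init__(self):
        self.prices = {}        # product_id -> (version, price)
        self.seen_ids = set()   # processed event IDs (persist in production)

    def handle(self, event):
        if event["event_id"] in self.seen_ids:
            return "duplicate-skipped"
        self.seen_ids.add(event["event_id"])
        pid, version = event["product_id"], event["version"]
        current = self.prices.get(pid)
        if current and current[0] >= version:
            return "stale-skipped"   # an older update arrived late
        self.prices[pid] = (version, event["price"])
        return "applied"

p = PriceProjection()
r1 = p.handle({"event_id": "a", "product_id": 1, "version": 2, "price": 899})
r2 = p.handle({"event_id": "a", "product_id": 1, "version": 2, "price": 899})  # dup
r3 = p.handle({"event_id": "b", "product_id": 1, "version": 1, "price": 999})  # late
```

With this structure the consumer produces the same final state no matter how many times, or in what order, the broker delivers the events — which is exactly the property at-least-once delivery demands.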
4. The Debugging Nightmare
When a request-response system fails, you get an error and a stack trace. When an event-driven system fails, you get silence — an event was published but nothing happened. Tracing a request across multiple asynchronous consumers is significantly harder than following a synchronous call chain.
Solution: Implement distributed tracing (OpenTelemetry) with correlation IDs that flow through events. Build a dead letter queue dashboard. Monitor consumer lag to detect processing delays. Log every event consumption, not just failures.
Action Plan: Adopting Event-Driven Architecture
Phase 1: Identify the Right Starting Point (Weeks 1-2)
- Map your synchronous dependencies. Draw a diagram of every HTTP call between your services. Identify the longest call chains and the most fragile dependencies.
- Pick one integration to decouple. Choose a non-critical, high-volume integration — email notifications, analytics events, or audit logging. Don't start with your payment flow.
- Choose your broker. For most teams starting out: SQS if you're on AWS and want simplicity, RabbitMQ if you need routing flexibility, Kafka if you need replay and high throughput.
Phase 2: Implement the First Event Flow (Weeks 3-6)
- Define the event schema. Use a structured format (JSON Schema, Avro, or Protobuf). Include: event type, event ID, timestamp, producer ID, and payload.
- Build the producer. Modify the existing service to publish an event after its primary operation. Use the transactional outbox pattern so the event is published if and only if the database write commits (this yields at-least-once delivery, so consumers must still be idempotent).
- Build the consumer. Create a new consumer that subscribes to the event and performs the previously synchronous action. Implement idempotency, retry logic, and dead letter queue handling.
- Monitor everything. Track producer publish rate, consumer processing rate, consumer lag, error rate, and dead letter queue depth.
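The transactional outbox mentioned in the producer step can be sketched with SQLite: the business row and its event row are written in one atomic transaction, so an event exists exactly when the write committed. The table and column names are illustrative, and the relay process that polls the outbox and publishes to the broker is omitted.

```python
import sqlite3

# Transactional outbox sketch: the order and its event are written in
# ONE transaction; a separate relay (not shown) polls unpublished rows
# and forwards them to the broker, then marks them published.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def create_order(order_id, total):
    with db:  # one atomic transaction for both writes
        db.execute("INSERT INTO orders (id, total) VALUES (?, ?)",
                   (order_id, total))
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                   ("OrderCreated", f'{{"order_id": {order_id}}}'))

create_order(1, 99.5)
pending = db.execute("SELECT event_type FROM outbox WHERE published = 0").fetchall()
```

If the service crashed between a direct database write and a direct broker publish, the two would silently diverge; the outbox closes that gap by making the database the single source of truth for "what happened."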
Phase 3: Expand and Refine (Weeks 7-12)
- Add more event flows. Based on the success of the first flow, identify additional integrations to decouple.
- Implement CQRS for read-heavy query patterns. Start with a single query that would benefit from a dedicated read model.
- Build tooling. Create an event catalog (a registry of all events, their schemas, and their producers/consumers). Build a replay tool for testing and recovery.
- Establish governance. Define naming conventions for events, schema compatibility rules, and consumer SLAs.
Sources
- Apache Kafka Documentation
- RabbitMQ Documentation
- Amazon SQS
- Confluent Platform
- Microsoft CQRS Pattern
- Martin Fowler — Event Sourcing
- Confluent Schema Registry
- OpenTelemetry
- Gartner Technology Trends
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
