Event-Driven Architecture: When, Why, and How
Two years ago, I inherited a system where a single "create order" API call triggered 14 synchronous HTTP requests to downstream services. Payment processing, inventory checks, email notifications, analytics events, loyalty points calculation — all in one request chain. The p99 latency was 12 seconds. When any one of those 14 services was slow or down, the entire order flow failed.
We rewrote it to be event-driven in three months. The order API now does exactly three things: validate the request, persist the order, and publish an "OrderCreated" event. Everything else happens asynchronously. The p99 latency dropped to 180ms. When the email service goes down, orders still process. When the analytics pipeline is slow, nobody notices.
That experience made me a believer in event-driven architecture — but also taught me that it's not a silver bullet. Event-driven systems introduce their own category of complexity: eventual consistency, message ordering, duplicate handling, and debugging distributed workflows. This article covers when event-driven architecture is the right choice, the major message broker options, and the patterns that make it work in production.
What Is Event-Driven Architecture?
Event-driven architecture (EDA) is a design pattern where systems communicate by producing and consuming events — records of things that happened. Instead of service A directly calling service B (request-response), service A publishes an event ("OrderCreated"), and service B subscribes to that event and reacts to it independently.
The key distinction is temporal decoupling. The producer doesn't wait for the consumer. It doesn't even know who the consumers are. This decoupling enables:
- Independent scaling: Consumers can be scaled independently based on their own processing capacity
- Fault isolation: A failing consumer doesn't affect the producer or other consumers
- Extensibility: New consumers can be added without modifying the producer
- Temporal flexibility: Events can be processed immediately, batched, or replayed
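The decoupling described above can be sketched as a toy in-process event bus. This is an illustration of the producer/consumer relationship, not a real broker; the `EventBus` class and handler names are hypothetical.

```python
from collections import defaultdict

# Toy in-process event bus: the producer publishes and moves on; it
# neither knows nor waits for individual consumers.
class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            try:
                handler(payload)
            except Exception:
                pass  # a failing consumer doesn't affect the others

bus = EventBus()
received = []
bus.subscribe("OrderCreated", lambda e: received.append(("email", e["order_id"])))
bus.subscribe("OrderCreated", lambda e: received.append(("analytics", e["order_id"])))
bus.publish("OrderCreated", {"order_id": 42})
```

In a real system the handlers would run in separate processes behind a broker, which is what makes the decoupling temporal as well as logical.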
According to the Gartner Technology Trends report, over 60% of new digital business solutions will incorporate event-driven patterns by 2026, up from approximately 30% in 2022. The shift is driven by the need for real-time responsiveness and system resilience.
The Big Three: Kafka vs RabbitMQ vs SQS
Choosing a message broker is one of the most consequential infrastructure decisions in an event-driven system. The three dominant options serve fundamentally different use cases.
Apache Kafka
Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant, publish-subscribe messaging. It was originally developed at LinkedIn and is now maintained by the Apache Software Foundation and commercially supported by Confluent.
Key characteristics:
- Events are stored in an append-only log, partitioned across brokers
- Consumers track their own position (offset) in the log — events aren't deleted after consumption
- Supports replay: consumers can re-read historical events
- Scales to millions of events per second with horizontal partitioning
- Strong ordering guarantees within a partition
When to use Kafka: High-volume event streaming (clickstreams, IoT sensor data, financial transactions), event sourcing, real-time analytics pipelines, and any scenario where event replay is required.
When NOT to use Kafka: Simple task queues, low-volume applications (under 1,000 events/second), or when you need complex routing logic (Kafka's routing is partition-based, which is limited compared to RabbitMQ).
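Kafka's log model — append-only partitions, consumer-tracked offsets, replay — can be illustrated with a few lines of plain Python. This is a conceptual sketch, not Kafka's API; the `Log` and `Consumer` classes are invented for illustration.

```python
# Sketch of Kafka's model: an append-only, partitioned log where each
# consumer tracks its own offset and can rewind to replay history.
class Log:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, event):
        # Partition by key so related events keep their relative order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(event)

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offsets = [0] * len(log.partitions)  # position per partition

    def poll(self):
        out = []
        for p, events in enumerate(self.log.partitions):
            out.extend(events[self.offsets[p]:])
            self.offsets[p] = len(events)  # events stay in the log
        return out

    def seek_to_beginning(self):
        self.offsets = [0] * len(self.log.partitions)  # enables replay

log = Log()
for i in range(5):
    log.append(key=f"order-{i}", event=f"OrderCreated:{i}")

c = Consumer(log)
first = c.poll()        # reads all 5 events
c.seek_to_beginning()
replayed = c.poll()     # same 5 events again: nothing was deleted
```

The contrast with a queue is the whole point: `poll` advances an offset rather than deleting anything, which is why replay is free in a log-based system.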
RabbitMQ
RabbitMQ is a traditional message broker implementing the AMQP protocol. It's designed for reliable message delivery with sophisticated routing capabilities.
Key characteristics:
- Messages are stored in queues, not logs — they're deleted after acknowledgment
- Rich routing via exchanges (direct, topic, fanout, headers)
- Push-based delivery — the broker pushes messages to consumers
- Supports priority queues, dead letter queues, and delayed messages
- Lower operational complexity than Kafka for small-to-medium deployments
When to use RabbitMQ: Task queues (background job processing), complex routing requirements, request-reply patterns, and applications requiring message priority or TTL (time-to-live).
When NOT to use RabbitMQ: High-throughput streaming (RabbitMQ tops out around 50,000 messages/second per queue), event replay requirements, or log-based processing.
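RabbitMQ's topic exchanges route by pattern-matching dotted routing keys, where `*` matches exactly one word and `#` matches zero or more. Below is a simplified Python sketch of that matching logic (it only handles `#` as a trailing wildcard, and the binding names are invented); real RabbitMQ does this inside the broker.

```python
# Simplified RabbitMQ-style topic matching: '*' = exactly one word,
# '#' = zero or more words (handled here only as a trailing wildcard).
def matches(pattern: str, routing_key: str) -> bool:
    p_words = pattern.split(".")
    k_words = routing_key.split(".")
    for i, pw in enumerate(p_words):
        if pw == "#":
            return True           # trailing '#' swallows the rest
        if i >= len(k_words):
            return False
        if pw != "*" and pw != k_words[i]:
            return False
    return len(p_words) == len(k_words)

bindings = {
    "order.*": "billing_queue",    # any single word after "order."
    "order.#": "audit_queue",      # everything under "order"
    "order.created": "email_queue",
}

def route(routing_key):
    return sorted(q for pat, q in bindings.items() if matches(pat, routing_key))

route("order.created")     # matches all three bindings
route("order.item.added")  # matches only the '#' binding
```

This kind of routing is what Kafka's partition-based model cannot express directly, and it is the main reason to prefer RabbitMQ when consumers need different slices of the event stream.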
Amazon SQS
Amazon SQS is a fully managed message queuing service. It's the simplest option — no infrastructure to manage, no clusters to configure.
Key characteristics:
- Fully managed — zero operational overhead
- Two queue types: Standard (at-least-once, best-effort ordering) and FIFO (exactly-once, strict ordering)
- Integrates natively with AWS Lambda, SNS, and other AWS services
- Standard queues scale automatically to virtually unlimited throughput; FIFO queues have per-second throughput limits
- Pay per request (roughly $0.40 per million requests for standard queues)
When to use SQS: AWS-native applications, simple task queues, Lambda-driven architectures, and when you want zero operational overhead. Pair it with SNS (Simple Notification Service) for pub-sub patterns.
When NOT to use SQS: Multi-cloud deployments, event replay requirements, complex routing logic, or when you need consumer groups (SQS doesn't have them natively).
Comparison Table
| Feature | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Model | Log-based streaming | Message queue (AMQP) | Managed message queue |
| Throughput | Millions/sec | ~50K/sec per queue | Virtually unlimited (standard queues) |
| Message Retention | Configurable (days/weeks/forever) | Until consumed | Up to 14 days |
| Replay | Yes (offset reset) | No | No |
| Ordering | Per partition | Per queue | FIFO queues only |
| Routing | Topics + partitions | Exchanges (rich routing) | Simple (queue-based) |
| Delivery Guarantee | At-least-once (exactly-once with transactions) | At-least-once (with ack) | At-least-once (standard) / Exactly-once (FIFO) |
| Ops Complexity | High (ZooKeeper/KRaft, brokers, partitions) | Medium (clustering, mirroring) | Zero (fully managed) |
| Cost (small scale) | $500+/mo (3-node cluster) or Confluent Cloud | $100+/mo (single node) | $5-50/mo (pay per use) |
| Best For | Event streaming, replay, analytics | Task queues, complex routing | AWS apps, serverless, simplicity |
CQRS: Separating Reads from Writes
CQRS (Command Query Responsibility Segregation) is a pattern that naturally complements event-driven architecture. The idea is simple: use separate models for reading and writing data. Commands (writes) go to one model, queries (reads) go to another.
In practice, this often means:
- Write side: A normalized relational database (PostgreSQL) handles commands. It ensures data integrity, enforces constraints, and emits events for every state change.
- Read side: A denormalized read store (Elasticsearch, Redis, a materialized view) is optimized for query performance. It's updated asynchronously by consuming events from the write side.
According to Microsoft's architecture documentation, CQRS is appropriate when "the number of reads vastly exceeds the number of writes" and when "read and write workloads have different scaling requirements."
A Practical Example
Consider an e-commerce product catalog. Writes are infrequent (products are added or updated a few times per day) but reads are constant (thousands of search queries per minute). With CQRS:
- A product manager updates a product's price via the admin panel (command)
- The command handler validates the change and persists it to PostgreSQL
- A "ProductPriceUpdated" event is published to Kafka
- An Elasticsearch consumer reads the event and updates the search index
- A Redis consumer reads the event and invalidates the cached product page
- An analytics consumer reads the event and logs the price change for reporting
Each consumer operates independently. The search index might be updated in 200ms, the cache in 50ms, and the analytics log in 2 seconds. The user who changed the price doesn't wait for any of them.
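The fan-out in steps 3-6 above can be sketched with in-memory stand-ins for each read store. Everything here is illustrative: the dictionaries stand in for Elasticsearch, Redis, and the analytics pipeline, and the handler names are invented.

```python
# Sketch of the CQRS fan-out: one ProductPriceUpdated event, three
# independent consumers, each maintaining its own read model.
search_index = {}                               # stand-in for Elasticsearch
page_cache = {"product:17": "<cached html>"}    # stand-in for Redis
audit_log = []                                  # stand-in for analytics

def on_price_updated_search(event):
    search_index[event["product_id"]] = event["new_price"]

def on_price_updated_cache(event):
    page_cache.pop(f"product:{event['product_id']}", None)  # invalidate

def on_price_updated_analytics(event):
    audit_log.append((event["product_id"], event["old_price"], event["new_price"]))

event = {"type": "ProductPriceUpdated", "product_id": 17,
         "old_price": 999, "new_price": 899}

# In production each consumer runs on its own schedule; here we just
# invoke them in sequence to show they share nothing but the event.
for consumer in (on_price_updated_search, on_price_updated_cache,
                 on_price_updated_analytics):
    consumer(event)
```

Note that the consumers never touch each other's stores: adding a fourth consumer (say, a price-history service) requires no change to the producer or the existing three.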
The Catch: Eventual Consistency
CQRS introduces eventual consistency between the write and read models. For a brief period after a write, the read model may return stale data. This is acceptable for most applications (do you need to see a price change reflected in search results within 100ms?) but unacceptable for others (financial balances, inventory counts).
Mitigation strategies:
- Read-your-own-writes: After a write, redirect the user to read from the write model (or a synchronized read replica) for a short period
- Optimistic UI updates: Update the UI immediately based on the command, then reconcile with the read model when it catches up
- Causal consistency: Include a version number or timestamp in events to detect and handle stale reads
Event Sourcing: The Nuclear Option
Event sourcing takes event-driven architecture to its logical extreme: instead of storing the current state of an entity, you store the complete sequence of events that led to the current state. The current state is derived by replaying events.
Think of it like a bank account. Instead of storing "balance: $1,000," you store every transaction: "deposited $500," "withdrew $200," "deposited $700." The balance is calculated by replaying the transaction history.
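The bank-account analogy translates directly to code: state is never stored, only derived by folding over the event history. A minimal sketch (event shapes and function names are illustrative):

```python
# Event sourcing in miniature: current state is a fold over the
# immutable event history, never stored directly.
def apply(balance, event):
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrew":
        return balance - amount
    raise ValueError(f"unknown event: {kind}")

history = [("deposited", 500), ("withdrew", 200), ("deposited", 700)]

def balance_after(events):
    balance = 0
    for e in events:
        balance = apply(balance, e)
    return balance

balance_after(history)      # the current balance: 1000
balance_after(history[:2])  # the balance "as of event 2", for free
</n>```

The second call is the temporal-query superpower mentioned below: because the history is the source of truth, past states are just shorter replays. In practice you would also snapshot periodically so you don't replay thousands of events on every read.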
When Event Sourcing Makes Sense
- Audit requirements: Financial systems, healthcare records, legal documents — any domain where you need a complete, immutable history of every change
- Complex business logic: When the rules for state transitions are complex and need to be tested/replayed independently
- Temporal queries: "What was the state of this entity at 3pm yesterday?" — trivial with event sourcing, nearly impossible with traditional CRUD
When Event Sourcing Is Overkill
- Simple CRUD applications: If your entities have straightforward create/update/delete operations, event sourcing adds complexity without clear benefit
- High-write, low-read scenarios: Replaying thousands of events to derive current state is expensive if you do it frequently
- Teams without event-driven experience: Event sourcing is the hardest pattern to get right. Start with simple pub-sub before attempting it.
Martin Fowler's seminal article on event sourcing remains the best introduction to the pattern and its trade-offs.
Common Pitfalls and How to Avoid Them
1. The Event Schema Evolution Problem
Events are contracts between producers and consumers. When you change an event's schema (adding a field, renaming a field, changing a type), you risk breaking every consumer. Unlike API versioning, event schema changes affect consumers that may have been running for months without updates.
Solution: Use a schema registry (Confluent Schema Registry for Kafka, or a custom registry). Enforce backward compatibility. Use Avro or Protobuf (which have built-in schema evolution) instead of JSON.
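One concrete backward-compatibility tactic, sketched below without a registry: the consumer supplies defaults for fields added after it was written, so v1 events and v2 events parse identically. The field names and the `V1_DEFAULTS` mapping are hypothetical; Avro formalizes exactly this idea with per-field defaults.

```python
# Sketch of backward-compatible schema evolution: new optional fields
# get defaults at the consumer, and fields are never renamed or removed
# in place (deprecate over multiple releases instead).
V1_DEFAULTS = {"currency": "USD"}   # field added in schema v2

def parse_order_created(raw: dict) -> dict:
    return {**V1_DEFAULTS, **raw}

old_event = {"order_id": 1, "amount": 100}                   # from a v1 producer
new_event = {"order_id": 2, "amount": 50, "currency": "EUR"} # from a v2 producer

parse_order_created(old_event)["currency"]  # falls back to the default
parse_order_created(new_event)["currency"]  # uses the producer's value
```

A schema registry automates the other half of this contract: it rejects a producer's new schema at publish time if it would break existing consumers.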
2. The Distributed Monolith
If every service needs events from every other service, you haven't built a distributed system — you've built a distributed monolith with a message broker in the middle. The coupling is still there; it's just hidden behind events.
Solution: Define clear domain boundaries. Each domain publishes its own events and only consumes events from adjacent domains. Use domain events (business-level) rather than CRUD events (technical-level).
3. Message Ordering Assumptions
In distributed systems, events can arrive out of order, be duplicated, or be delayed. Designing consumers that assume strict ordering will lead to data corruption.
Solution: Design idempotent consumers. Use event IDs to detect and skip duplicates. Use timestamps or version numbers to handle out-of-order delivery. In Kafka, use partition keys to ensure ordering for related events.
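Both defenses — dedup by event ID and stale-update rejection by version — fit in one small consumer. A sketch with invented names (in production `seen_ids` would be persisted alongside the projection, not held in memory):

```python
# An idempotent, order-tolerant consumer: duplicates are skipped via
# event IDs, and out-of-order older updates are dropped via versions.
class PriceProjection:
    def __init__(self):
        self.prices = {}        # product_id -> (version, price)
        self.seen_ids = set()   # processed event IDs (persist in production)

    def handle(self, event):
        if event["event_id"] in self.seen_ids:
            return "duplicate-skipped"
        self.seen_ids.add(event["event_id"])
        pid, version = event["product_id"], event["version"]
        current = self.prices.get(pid)
        if current and current[0] >= version:
            return "stale-skipped"   # an older update arrived late
        self.prices[pid] = (version, event["price"])
        return "applied"

p = PriceProjection()
r1 = p.handle({"event_id": "a", "product_id": 1, "version": 2, "price": 899})
r2 = p.handle({"event_id": "a", "product_id": 1, "version": 2, "price": 899})  # dup
r3 = p.handle({"event_id": "b", "product_id": 1, "version": 1, "price": 999})  # late
```

With this structure the consumer produces the same final state no matter how many times, or in what order, the broker delivers the events — which is exactly the property at-least-once delivery demands.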
4. The Debugging Nightmare
When a request-response system fails, you get an error and a stack trace. When an event-driven system fails, you get silence — an event was published but nothing happened. Tracing a request across multiple asynchronous consumers is significantly harder than following a synchronous call chain.
Solution: Implement distributed tracing (OpenTelemetry) with correlation IDs that flow through events. Build a dead letter queue dashboard. Monitor consumer lag to detect processing delays. Log every event consumption, not just failures.
Action Plan: Adopting Event-Driven Architecture
Phase 1: Identify the Right Starting Point (Weeks 1-2)
- Map your synchronous dependencies. Draw a diagram of every HTTP call between your services. Identify the longest call chains and the most fragile dependencies.
- Pick one integration to decouple. Choose a non-critical, high-volume integration — email notifications, analytics events, or audit logging. Don't start with your payment flow.
- Choose your broker. For most teams starting out: SQS if you're on AWS and want simplicity, RabbitMQ if you need routing flexibility, Kafka if you need replay and high throughput.
Phase 2: Implement the First Event Flow (Weeks 3-6)
- Define the event schema. Use a structured format (JSON Schema, Avro, or Protobuf). Include: event type, event ID, timestamp, producer ID, and payload.
- Build the producer. Modify the existing service to publish an event after its primary operation. Use the transactional outbox pattern so the event is published if and only if the database write commits (this yields at-least-once delivery, so consumers must still be idempotent).
- Build the consumer. Create a new consumer that subscribes to the event and performs the previously synchronous action. Implement idempotency, retry logic, and dead letter queue handling.
- Monitor everything. Track producer publish rate, consumer processing rate, consumer lag, error rate, and dead letter queue depth.
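The transactional outbox mentioned in the producer step can be sketched with SQLite: the business row and its event row are written in one atomic transaction, so an event exists exactly when the write committed. The table and column names are illustrative, and the relay process that polls the outbox and publishes to the broker is omitted.

```python
import sqlite3

# Transactional outbox sketch: the order and its event are written in
# ONE transaction; a separate relay (not shown) polls unpublished rows
# and forwards them to the broker, then marks them published.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def create_order(order_id, total):
    with db:  # one atomic transaction for both writes
        db.execute("INSERT INTO orders (id, total) VALUES (?, ?)",
                   (order_id, total))
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                   ("OrderCreated", f'{{"order_id": {order_id}}}'))

create_order(1, 99.5)
pending = db.execute("SELECT event_type FROM outbox WHERE published = 0").fetchall()
```

If the service crashed between a direct database write and a direct broker publish, the two would silently diverge; the outbox closes that gap by making the database the single source of truth for "what happened."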
Phase 3: Expand and Refine (Weeks 7-12)
- Add more event flows. Based on the success of the first flow, identify additional integrations to decouple.
- Implement CQRS for read-heavy query patterns. Start with a single query that would benefit from a dedicated read model.
- Build tooling. Create an event catalog (a registry of all events, their schemas, and their producers/consumers). Build a replay tool for testing and recovery.
- Establish governance. Define naming conventions for events, schema compatibility rules, and consumer SLAs.
Sources
- Apache Kafka Documentation
- RabbitMQ Documentation
- Amazon SQS
- Confluent Platform
- Microsoft CQRS Pattern
- Martin Fowler — Event Sourcing
- Confluent Schema Registry
- OpenTelemetry
- Gartner Technology Trends
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
