A backend engineer's journey of learning and growth.
by kan01234
Scale the QR Code Payment System using the Saga Pattern, which is ideal when:
Use the Saga Pattern to orchestrate a distributed payment flow that's:
   +------------------+
   |   API Gateway    |
   +--------+---------+
            |
            v
+--------------------------+
|     Payment Service      |  ← HTTP layer + idempotency check
|     (IdempotencyKey)     |
+-----------+--------------+
            |
            v  emits event: PaymentRequested
+--------------------------+
|   Transaction Service    |  ← owns saga state & orchestrates steps
| (Saga + Recovery logic)  |
+-----------+--------------+
            |
            |  emits commands (via Event Queue):
            |
            ├── DeductWalletCommand ─────────────▶ Wallet Service
            ├── WriteLedgerCommand ──────────────▶ Ledger Service
            ├── SendNotificationCommand ─────────▶ Notification Service
            ↑                                      ↑
            |  emits events                        |  emits events
  WalletDebited / WalletFailed           NotificationSent / Failed
            │                                      │
            └────────── receives LedgerWritten event
                        from Ledger Service
Component | Responsibility |
---|---|
API Gateway | Accepts requests, forwards to Payment Service |
Payment Service | Handles `requestId`, writes `IdempotencyKey`, emits `PaymentRequested` |
Transaction Service | Orchestrates the saga: tracks state (`PENDING → SUCCESS/FAILED`), emits commands, compensates if needed |
Wallet Service | Listens to `DeductWalletCommand` / `RefundWalletCommand`, updates balance, emits result events |
Ledger Service | Listens to `WriteLedgerCommand`, writes double-entry ledger records, emits confirmation |
Notification Service | Listens to `SendNotificationCommand`, sends payment result to users |
Event Queue | Asynchronous, reliable event delivery (Kafka, SQS, Pub/Sub, etc.) |
Topic Name | Publisher | Subscribers |
---|---|---|
`PaymentRequested` | PaymentService | TransactionService |
`DeductWalletCommand` | TransactionService | WalletService |
`WalletDebited` | WalletService | TransactionService |
`WalletFailed` | WalletService | TransactionService |
`WriteLedgerCommand` | TransactionService | LedgerService |
`LedgerWritten` | LedgerService | TransactionService |
`SendNotificationCommand` | TransactionService | NotificationService |
`NotificationSent` | NotificationService | TransactionService (optional) |
Step 1: Initiate Payment (HTTP)
Step 2: Start Saga
Step 3: Deduct Wallet
Step 4: Write Ledger
Emits WriteLedgerCommand
Step 5: Finalize Transaction
Step 6: Notify Users
Step 7: Respond to Client
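The steps above can be sketched as a small orchestrator that reacts to each event by emitting the next command. This is a minimal in-memory illustration under stated assumptions, not the actual TransactionService: the `Orchestrator` class, the `outbox` list standing in for the event queue, and the `txn_id` argument are hypothetical; the event and command names are taken from the topic table.

```python
from dataclasses import dataclass, field


@dataclass
class Orchestrator:
    """Hypothetical in-memory stand-in for the TransactionService."""
    outbox: list = field(default_factory=list)  # stands in for the event queue
    status: dict = field(default_factory=dict)  # txn_id -> saga status

    def emit(self, topic: str, txn_id: str) -> None:
        self.outbox.append((topic, txn_id))

    def handle(self, event: str, txn_id: str) -> None:
        if event == "PaymentRequested":      # Step 2: start the saga
            self.status[txn_id] = "PENDING"
            self.emit("DeductWalletCommand", txn_id)      # Step 3
        elif event == "WalletDebited":       # Step 4: write the ledger
            self.emit("WriteLedgerCommand", txn_id)
        elif event == "LedgerWritten":       # Step 5: finalize
            self.status[txn_id] = "SUCCESS"
            self.emit("SendNotificationCommand", txn_id)  # Step 6
        elif event == "WalletFailed":        # nothing was debited, just fail
            self.status[txn_id] = "FAILED"


saga = Orchestrator()
for event in ["PaymentRequested", "WalletDebited", "LedgerWritten"]:
    saga.handle(event, "txn-1")
print(saga.status["txn-1"], [topic for topic, _ in saga.outbox])
```

Each handler does one small thing and then returns, so the saga makes progress only when the next event arrives. That is what lets the flow survive slow or temporarily unavailable services.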
Table / Service | State / Record |
---|---|
`idempotency_keys` | `requestId` → `SUCCESS` + cached response |
`transactions` | `txnId` → `SUCCESS`, with payer/payee/amount metadata |
`wallets` | updated balances |
`ledger_entries` | debit & credit rows |
`notifications` | message sent/logged |
Aspect | Sync API | Async API |
---|---|---|
Client experience | Waits for result (success/failure) | Gets accepted immediately, polls/gets push |
PaymentService role | Needs to wait for events or poll status | Just emits event and returns immediately |
Response status | `200 OK`, `400 Bad Request`, etc. | `202 Accepted` with `requestId`/`txnId` |
Coupling | Slightly more coupled (waits on result) | Fully decoupled |
Latency | Higher (depends on saga duration) | Low latency |
Use cases | POS, e-commerce checkout (UX critical) | Top-ups, batch jobs, QR payments |
Retries/Timeouts | Must handle waiting/timeout inside service | External retry logic possible |
“The user scans QR and expects to see ‘Payment Success’ right away on screen.”
“We show a spinner and notify user when payment finishes (via push or polling).”
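The difference between the two API styles can be sketched in a single handler. This is an illustration under assumptions rather than the PaymentService's real code: `respond`, the `result_queue` carrying the saga outcome, and the fallback to `202` when the wait times out are all hypothetical choices.

```python
import queue


def respond(result_queue: "queue.Queue[str]", txn_id: str, wait_seconds: float) -> tuple:
    """Return (http_status, body); wait_seconds=0 gives the async style."""
    if wait_seconds > 0:
        try:
            # Sync style: block until the saga reports an outcome.
            outcome = result_queue.get(timeout=wait_seconds)
            return (200, {"txnId": txn_id, "status": outcome})
        except queue.Empty:
            pass  # saga still running: fall back to the async-style answer
    # Async style: accept immediately; the client polls or receives a push.
    return (202, {"txnId": txn_id, "status": "PENDING"})


finished = queue.Queue()
finished.put("SUCCESS")
print(respond(finished, "txn-1", wait_seconds=0.1))       # sync, result ready
print(respond(queue.Queue(), "txn-2", wait_seconds=0))    # async, returns at once
```

Bounding the wait keeps a slow saga from tying up the HTTP request, at the cost of the client having to handle both a final and a pending answer.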
Retry Layer | Purpose | Example |
---|---|---|
Producer Retry | If publish to Kafka/SQS fails | Retry publishing DeductWalletCommand |
Consumer Retry | If handler fails to process the message | Retry consuming WalletDebited |
Business Retry | If downstream dependency fails (e.g., DB down) | Retry debit wallet DB update |
Orchestrator Retry | Retry full saga step after timeout/failure | Re-publish command from TransactionService |
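Any of the retry layers above can be built on the same small helper. The sketch below assumes exponential backoff, which the table does not prescribe; the `retry` function and its `max_attempts`, `base_delay`, and `sleep` parameters are hypothetical names chosen for illustration.

```python
import time


def retry(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: the caller decides what happens next
            # Wait longer after each failure: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


attempts = {"count": 0}


def publish_deduct_wallet_command():
    """Simulated producer that fails twice before the broker accepts it."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("broker unavailable")
    return "published"


print(retry(publish_deduct_wallet_command, base_delay=0.01))
```

Passing `sleep` as a parameter is only a convenience for testing. In production code the retried operation must be idempotent, since an attempt that appeared to fail may still have been applied.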
A DLQ (dead-letter queue) stores messages that still fail after all retries are exhausted.
Component | When to DLQ |
---|---|
WalletService | DeductWalletCommand fails 5 times |
LedgerService | WriteLedgerCommand consistently fails (e.g., invalid txn) |
NotificationService | Email/push provider unreachable or malformed msg |
TransactionService | Cannot progress saga due to missing events |
+----------------------+
| Event Queue (Kafka)  |
+----------+-----------+
           |
     +-----v----------------------+
     | WalletService              |
     | - Retry handler            |
     | - OnFail → DLQ Producer    |
     +-----+----------------------+
           |
           v
+--------------------+
| Wallet.DLQ Topic   | <-- Store JSONs of failed messages
+--------------------+
(repeat for ledger, notify, etc.)
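The retry handler and DLQ producer in the diagram could look roughly like the sketch below. It is an assumption-laden illustration: `consume` is a hypothetical function, a plain list stands in for the `Wallet.DLQ` topic, and the JSON fields `message`, `error`, and `attempts` are invented for this example. The limit of 5 attempts follows the WalletService row in the table above.

```python
import json


def consume(message: dict, handler, dlq: list, max_attempts: int = 5) -> bool:
    """Try the handler up to max_attempts times, then park the message in the DLQ."""
    last_error = ""
    for _ in range(max_attempts):
        try:
            handler(message)
            return True  # processed successfully, nothing goes to the DLQ
        except Exception as exc:
            last_error = repr(exc)
    # All attempts failed: store the message as JSON for later inspection.
    dlq.append(json.dumps(
        {"message": message, "error": last_error, "attempts": max_attempts}
    ))
    return False


def broken_handler(message: dict) -> None:
    raise ValueError("malformed DeductWalletCommand")


wallet_dlq = []  # stands in for the Wallet.DLQ topic
consume({"txnId": "txn-1", "amount": 500}, broken_handler, wallet_dlq)
print(wallet_dlq[0])
```

Recording the error and attempt count alongside the original payload gives whoever drains the DLQ enough context to decide between replaying and discarding the message.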
Because all retries must be safe to repeat, each service must:
What to Monitor | Tool/Action |
---|---|
DLQ size growing | Alert (PagerDuty, Slack) |
Message processing latency | Metrics dashboard (e.g., Prometheus) |
Retry count per message | Metrics + DLQ tagging |
Saga stuck in `PENDING` too long | Auto-recovery or ops investigation |
Mechanism | Why It Matters in Fintech |
---|---|
Retries | Handle transient failures gracefully |
DLQ | Prevent infinite retry loops, ensure recoverability |
Idempotency | Ensures retries won't corrupt state |
Monitoring | Enables ops teams to intervene fast |
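To make the idempotency row concrete, here is a minimal sketch of a wallet handler that can safely receive the same command twice. The `Wallet` class, its in-memory `processed` map, and the integer amounts are assumptions for illustration; a real service would keep the deduplication record and the balance update in the same database transaction.

```python
class Wallet:
    """Hypothetical wallet that deduplicates commands by transaction ID."""

    def __init__(self, balance: int):
        self.balance = balance
        self.processed = {}  # txn_id -> result event already emitted

    def deduct(self, txn_id: str, amount: int) -> str:
        if txn_id in self.processed:
            # Duplicate delivery or retry: return the earlier result unchanged.
            return self.processed[txn_id]
        if amount > self.balance:
            result = "WalletFailed"
        else:
            self.balance -= amount
            result = "WalletDebited"
        self.processed[txn_id] = result
        return result


wallet = Wallet(balance=1000)
print(wallet.deduct("txn-1", 300), wallet.balance)
print(wallet.deduct("txn-1", 300), wallet.balance)  # retried command, same result
```

The second call reports the same event as the first and leaves the balance alone, which is the property that makes retries at every layer safe.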
In a saga-based system, if one step in the payment process fails irrecoverably, the system cannot roll back using ACID transactions. Instead, it uses compensation actions to reverse the effects of the previous steps.
Failed Step | Compensation Action |
---|---|
Ledger writing fails | Emit RefundWalletCommand to return funds |
Notification fails | Retry only; no compensation needed |
Wallet deduction fails | Mark the transaction as FAILED |
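The table translates into a small decision function. This is a sketch under assumptions: `on_failure` and the `txn` dictionary are hypothetical, and `NotificationFailed` is an assumed name for the notification failure event, which the architecture diagram only abbreviates as "Failed".

```python
def on_failure(event: str, txn: dict) -> list:
    """Update the transaction status and return the commands to emit."""
    if event == "LedgerFailed":
        # The wallet was already debited, so the funds must be returned.
        txn["status"] = "FAILED"
        return ["RefundWalletCommand"]
    if event == "WalletFailed":
        # Nothing was debited, so there is nothing to undo.
        txn["status"] = "FAILED"
        return []
    if event == "NotificationFailed":
        # The payment itself succeeded: retry only, no compensation.
        return ["SendNotificationCommand"]
    raise ValueError(f"not a failure event: {event}")


txn = {"txn_id": "txn-1", "status": "WALLET_OK"}
print(on_failure("LedgerFailed", txn), txn["status"])
```

Compensation depends on how far the saga got: only steps that actually changed state need a reversing command.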
All compensation commands are:
+------------------+
|     PENDING      |
+--------+---------+
         |
   WalletDebited
         |
         v
+------------------+
|    WALLET_OK     |
+--------+---------+
         |
   LedgerWritten
         |
         v
+------------------+
|     SUCCESS      |
+------------------+

(from PENDING or WALLET_OK)
         |
WalletFailed / LedgerFailed
         |
         v
+------------------+
|      FAILED      |
+------------------+
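The state diagram can be encoded as a transition table so that out-of-order or duplicate events are rejected instead of corrupting saga state. The sketch assumes that `WalletFailed` is only valid from `PENDING` and `LedgerFailed` only from `WALLET_OK`, which follows from the compensation table; `TRANSITIONS` and `next_state` are hypothetical names.

```python
# (current state, event) -> next state; anything not listed is illegal.
TRANSITIONS = {
    ("PENDING", "WalletDebited"): "WALLET_OK",
    ("PENDING", "WalletFailed"): "FAILED",
    ("WALLET_OK", "LedgerWritten"): "SUCCESS",
    ("WALLET_OK", "LedgerFailed"): "FAILED",
}


def next_state(state: str, event: str) -> str:
    """Return the state reached from `state` on `event`, or raise if illegal."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} in state {state}") from None


state = "PENDING"
for event in ["WalletDebited", "LedgerWritten"]:
    state = next_state(state, event)
print(state)
```

`SUCCESS` and `FAILED` have no outgoing entries, so they are terminal: a late or replayed event for a finished transaction raises rather than moving it to another state.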
Even with retries and DLQs, services or brokers may crash. To recover from these cases, we ensure:
The TransactionService maintains a durable record of:
This allows resumption after failure or restart.
On service restart:
Recovery logic is part of the TransactionService and is often run as a periodic background job or as part of service startup.
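A recovery pass of this kind might scan for sagas that have not moved for too long and re-publish the command for their current step. The sketch below is illustrative only: `find_stuck_sagas`, the `RESUME_COMMAND` mapping, the record fields `txn_id`, `status`, and `updated_at`, and the five-minute timeout are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Command to re-publish for each non-terminal state (assumed mapping).
RESUME_COMMAND = {
    "PENDING": "DeductWalletCommand",
    "WALLET_OK": "WriteLedgerCommand",
}


def find_stuck_sagas(transactions, now, timeout=timedelta(minutes=5)):
    """Return (txn_id, command) pairs for sagas idle longer than `timeout`."""
    to_publish = []
    for txn in transactions:
        command = RESUME_COMMAND.get(txn["status"])  # None for terminal states
        if command and now - txn["updated_at"] > timeout:
            to_publish.append((txn["txn_id"], command))
    return to_publish


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"txn_id": "txn-1", "status": "PENDING", "updated_at": now - timedelta(minutes=10)},
    {"txn_id": "txn-2", "status": "SUCCESS", "updated_at": now - timedelta(hours=1)},
]
print(find_stuck_sagas(records, now))
```

Re-publishing is only safe because the consuming services handle duplicate commands idempotently; without that guarantee a recovery pass could debit a wallet twice.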