kan01234 - Software Engineer Notes


A backend engineer's journey of learning and growth.


22 June 2025

System Design: Scale the QR Code Payment System using the Saga Pattern

by kan01234

Scale the QR Code Payment System using the Saga Pattern, which is ideal when:

🧱 Objective:

Use the Saga Pattern to orchestrate a distributed payment flow that’s:

🎯 When to Use Sagas

🧩 High-Level Components

+------------------+
|   API Gateway    |
+--------+---------+
         |
         v
+--------------------------+
|   Payment Service        | ← HTTP layer + idempotency check
|   (IdempotencyKey)       |
+-----------+--------------+
            |
            v   emits event: PaymentRequested
+--------------------------+
| Transaction Service      | ← owns saga state & orchestrates steps
| (Saga + Recovery logic)  |
+-----------+--------------+
            |
            | emits commands (via Event Queue):
            |
            ├── DeductWalletCommand ─────────────▶ Wallet Service
            ├── WriteLedgerCommand ──────────────▶ Ledger Service
            └── SendNotificationCommand ─────────▶ Notification Service
                      ↑                                ↑
                      | emits events                   | emits events
              WalletDebited / WalletFailed    NotificationSent / Failed
                      ↑                                ↑
                      └────────── receives LedgerWritten event
                                   from Ledger Service
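
The chain of commands in the diagram can be sketched as a tiny orchestrator. This is only an in-memory illustration, not the actual service: the `Orchestrator` class, the `outbox` list standing in for the Event Queue, and the dictionary message shape are all assumptions made for the example. Only the event and command names come from the diagram.

```python
from dataclasses import dataclass, field

# Which command the orchestrator emits after each event on the happy path.
# Event and command names come from the diagram; the mapping itself is an
# assumption about how the steps are chained.
NEXT_COMMAND = {
    "PaymentRequested": "DeductWalletCommand",
    "WalletDebited": "WriteLedgerCommand",
    "LedgerWritten": "SendNotificationCommand",
}


@dataclass
class Orchestrator:
    """Hypothetical stand-in for the Transaction Service."""
    outbox: list = field(default_factory=list)  # stands in for the Event Queue

    def handle(self, event: dict) -> None:
        command = NEXT_COMMAND.get(event["type"])
        if command is None:
            return  # e.g. NotificationSent: nothing further to emit
        self.outbox.append({"type": command, "txn_id": event["txn_id"]})


orchestrator = Orchestrator()
for event_type in ("PaymentRequested", "WalletDebited", "LedgerWritten"):
    orchestrator.handle({"type": event_type, "txn_id": "txn-1"})

print([message["type"] for message in orchestrator.outbox])
```

Keeping the mapping in one table means the order of saga steps is visible in a single place, which is the main appeal of orchestration over choreography.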

📦 Components and Responsibilities

| Component | Responsibility |
|---|---|
| API Gateway | Accepts requests, forwards to Payment Service |
| Payment Service | Handles requestId, writes IdempotencyKey, emits PaymentRequested |
| Transaction Service | Orchestrates the Saga: tracks state (PENDING → SUCCESS/FAILED), emits commands, compensates if needed |
| Wallet Service | Listens to DeductWalletCommand / RefundWalletCommand, updates balance, emits result events |
| Ledger Service | Listens to WriteLedgerCommand, writes double-entry ledger records, emits confirmation |
| Notification Service | Listens to SendNotificationCommand, sends payment result to users |
| Event Queue | Asynchronous, reliable event delivery (Kafka, SQS, Pub/Sub, etc.) |
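
To make the Payment Service row concrete, here is a rough sketch of the idempotency check at the HTTP layer. The `PaymentService` class, its method name, and the two in-memory containers are hypothetical; in practice the keys would live in a durable table and the event would go to the real queue.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Hypothetical HTTP-layer handler; dicts stand in for durable tables."""
    idempotency_keys: dict = field(default_factory=dict)  # requestId -> response
    published: list = field(default_factory=list)         # stands in for the queue

    def create_payment(self, request_id: str, payer: str, payee: str, amount: int) -> dict:
        cached = self.idempotency_keys.get(request_id)
        if cached is not None:
            return cached  # duplicate request: replay the stored response
        response = {"requestId": request_id, "status": "PENDING"}
        self.idempotency_keys[request_id] = response
        self.published.append({
            "type": "PaymentRequested", "requestId": request_id,
            "payer": payer, "payee": payee, "amount": amount,
        })
        return response


service = PaymentService()
first = service.create_payment("req-1", "alice", "bob", 500)
second = service.create_payment("req-1", "alice", "bob", 500)  # client retry
print(len(service.published))  # 1: the event is emitted only once
```

A real implementation would also need a unique constraint on the key so that two concurrent duplicates cannot both pass the check.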

🧠 Event Topics (example with Kafka-style)

| Topic Name | Publisher | Subscribers |
|---|---|---|
| PaymentRequested | PaymentService | TransactionService |
| DeductWalletCommand | TransactionService | WalletService |
| WalletDebited | WalletService | TransactionService |
| WalletFailed | WalletService | TransactionService |
| WriteLedgerCommand | TransactionService | LedgerService |
| LedgerWritten | LedgerService | TransactionService |
| SendNotificationCommand | TransactionService | NotificationService |
| NotificationSent | NotificationService | TransactionService (optional) |

✅ Benefits of This Design

✅ Happy Path Flow

Step 1: Initiate Payment (HTTP)

Step 2: Start Saga

Step 3: Deduct Wallet

Step 4: Write Ledger

Step 5: Finalize Transaction

Step 6: Notify Users

Step 7: Respond to Client

🔍 Final States After Success

| Table / Service | State / Record |
|---|---|
| idempotency_keys | requestId → SUCCESS + cached response |
| transactions | txnId → SUCCESS, with payer/payee/amount metadata |
| wallets | updated balances |
| ledger_entries | debit & credit rows |
| notifications | message sent/logged |

🔄 Compare: Sync vs. Async Design

| Aspect | Sync API | Async API |
|---|---|---|
| Client experience | Waits for result (success/failure) | Gets accepted immediately, polls/gets push |
| PaymentService role | Needs to wait for events or poll status | Just emits event and returns immediately |
| Response status | 200 OK, 400 Bad Request, etc. | 202 Accepted with requestId/txnId |
| Coupling | Slightly more coupled (waits on result) | Fully decoupled |
| Latency | Higher (depends on saga duration) | Low latency |
| Use cases | POS, e-commerce checkout (UX critical) | Top-ups, batch jobs, QR payments |
| Retries/Timeouts | Must handle waiting/timeout inside service | External retry logic possible |

Sync Examples

“The user scans QR and expects to see ‘Payment Success’ right away on screen.”

Async Examples

“We show a spinner and notify user when payment finishes (via push or polling).”
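
The difference between the two response styles can be sketched in a few lines. Everything here is illustrative: `SagaResult`, `respond_async`, and `respond_sync` are hypothetical names, a `threading.Event` stands in for "waiting on saga events", and falling back to 202 when the wait times out is an assumption of this sketch rather than something the design prescribes.

```python
import threading


class SagaResult:
    """Hypothetical handle the Payment Service can wait on."""

    def __init__(self) -> None:
        self._done = threading.Event()
        self.status = "PENDING"

    def complete(self, status: str) -> None:
        self.status = status
        self._done.set()

    def wait(self, timeout: float) -> bool:
        return self._done.wait(timeout)


def respond_async(txn_id: str) -> tuple:
    # Async API: acknowledge immediately; the client polls or receives a push.
    return 202, {"txnId": txn_id, "status": "PENDING"}


def respond_sync(txn_id: str, result: SagaResult, timeout: float) -> tuple:
    # Sync API: block until the saga finishes or the timeout expires.
    if not result.wait(timeout):
        # Assumption: degrade to an async-style answer instead of failing.
        return 202, {"txnId": txn_id, "status": "PENDING"}
    # 400 on failure follows the table above; the exact code is a design choice.
    code = 200 if result.status == "SUCCESS" else 400
    return code, {"txnId": txn_id, "status": result.status}


result = SagaResult()
# Simulate the saga completing shortly after the request arrives.
threading.Timer(0.05, result.complete, args=("SUCCESS",)).start()
print(respond_async("txn-1"))
print(respond_sync("txn-1", result, timeout=1.0))
```

The sync variant holds a request thread for the whole saga duration, which is the "higher latency, more coupled" cost listed in the comparison table.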

Final Guiding Principle

🔁 Retry Strategy Overview

| Retry Layer | Purpose | Example |
|---|---|---|
| Producer Retry | If publish to Kafka/SQS fails | Retry publishing DeductWalletCommand |
| Consumer Retry | If handler fails to process the message | Retry consuming WalletDebited |
| Business Retry | If downstream dependency fails (e.g., DB down) | Retry debit wallet DB update |
| Orchestrator Retry | Retry full saga step after timeout/failure | Re-publish command from TransactionService |

🧠 Common Retry Tactics

  1. Immediate Retry
    • Useful for transient errors (e.g., race conditions, network blips)
    • Example: Retry DB write 3 times with 100ms delay
  2. Exponential Backoff
  3. Circuit Breaker
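
The first two tactics can be combined in one small helper. This is a sketch under assumptions, not a prescribed implementation: `retry` and `backoff_delays` are hypothetical names, the base delay and cap are arbitrary, and the random "full jitter" is an addition that is commonly paired with exponential backoff. The circuit breaker is not covered here.

```python
import random
import time


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0) -> list:
    """Upper bound of the delay before each retry: base * 2**n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def retry(operation, attempts: int = 3, base: float = 0.1, cap: float = 5.0,
          sleep=time.sleep):
    """Call `operation`, retrying up to `attempts` times with jittered backoff."""
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted: let the caller decide (e.g. DLQ)
            # Full jitter: sleep a random fraction of the computed delay.
            sleep(random.uniform(0, delays[attempt]))


calls = {"count": 0}


def flaky_db_write():
    # Fails twice with a transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "written"


print(retry(flaky_db_write, attempts=3, sleep=lambda seconds: None))
print(backoff_delays(5))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, and the jitter keeps many failing consumers from retrying in lockstep.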

💀 Dead Letter Queues (DLQ)

A DLQ stores messages that fail permanently after all retries are exhausted.

| Component | When to DLQ |
|---|---|
| WalletService | DeductWalletCommand fails 5 times |
| LedgerService | WriteLedgerCommand consistently fails (e.g., invalid txn) |
| NotificationService | Email/push provider unreachable or malformed msg |
| TransactionService | Cannot progress saga due to missing events |

What Happens in DLQ?

📦 Sample DLQ Architecture

+----------------------+
| Event Queue (Kafka)  |
+----------+-----------+
           |
     +-----v----------------------+
     | WalletService              |
     |  - Retry handler           |
     |  - OnFail → DLQ Producer   |
     +-----+----------------------+
           |
           v
     +--------------------+
     | Wallet.DLQ Topic   | <-- Store JSONs of failed messages
     +--------------------+

     (repeat for ledger, notify, etc.)
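
The "retry handler, then DLQ producer" box in the diagram might look roughly like this. The `consume` function, the list standing in for the `Wallet.DLQ` topic, and the JSON envelope fields are assumptions for illustration; delays between attempts are left out to keep the sketch short.

```python
import json


def consume(message: dict, handler, dlq: list, max_attempts: int = 5) -> bool:
    """Run `handler`; after `max_attempts` failures, park the message in the DLQ."""
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as error:  # a real handler would separate retryable errors
            last_error = error
    # Store the original payload plus failure metadata as JSON for later replay.
    dlq.append(json.dumps({
        "original": message,
        "error": str(last_error),
        "attempts": max_attempts,
    }))
    return False


def always_failing_handler(message: dict) -> None:
    raise ValueError("wallet not found")


wallet_dlq = []  # stands in for the Wallet.DLQ topic
ok = consume({"type": "DeductWalletCommand", "txn_id": "txn-1"},
             always_failing_handler, wallet_dlq)
print(ok, json.loads(wallet_dlq[0])["attempts"])  # False 5
```

Keeping the original message intact inside the envelope is what makes a later manual or automated replay possible.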

🔁 Idempotency + Retry

Because all retries must be safe to repeat, each service must handle duplicate messages idempotently.
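
One common way to make a handler safe to repeat is to remember the result per transaction and replay it on redelivery. The sketch below assumes a hypothetical `WalletService` class with an in-memory `processed` map; a real service would keep that map in a durable table and update it in the same database transaction as the balance.

```python
class WalletService:
    """Hypothetical wallet handler that tolerates duplicate deliveries."""

    def __init__(self, balances: dict) -> None:
        self.balances = balances
        self.processed = {}  # txn_id -> result event; a durable table in practice

    def handle_deduct(self, command: dict) -> dict:
        txn_id = command["txn_id"]
        if txn_id in self.processed:
            return self.processed[txn_id]  # duplicate: replay the earlier result
        payer, amount = command["payer"], command["amount"]
        if self.balances.get(payer, 0) < amount:
            event = {"type": "WalletFailed", "txn_id": txn_id}
        else:
            self.balances[payer] -= amount
            event = {"type": "WalletDebited", "txn_id": txn_id}
        self.processed[txn_id] = event
        return event


wallet = WalletService({"alice": 1000})
command = {"txn_id": "txn-1", "payer": "alice", "amount": 300}
wallet.handle_deduct(command)
wallet.handle_deduct(command)  # redelivered by the queue
print(wallet.balances["alice"])  # 700: the debit was applied once
```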

🔭 Monitoring & Alerting

| What to Monitor | Tool/Action |
|---|---|
| DLQ size growing | Alert (PagerDuty, Slack) |
| Message processing latency | Metrics dashboard (e.g., Prometheus) |
| Retry count per message | Metrics + DLQ tagging |
| Saga stuck in PENDING too long | Auto-recovery or ops investigation |

Summary

| Mechanism | Why It Matters in Fintech |
|---|---|
| Retries | Handle transient failures gracefully |
| DLQ | Prevent infinite retry loops, ensure recoverability |
| Idempotency | Ensures retry won’t corrupt state |
| Monitoring | Enables ops teams to intervene fast |

🔄 Compensation Logic

In a saga-based system, if one step in the payment process fails irrecoverably, the system cannot roll back using ACID transactions. Instead, it uses compensation actions to reverse the effects of the previous steps.

Examples of Compensation:

| Failed Step | Compensation Action |
|---|---|
| Ledger writing fails | Emit RefundWalletCommand to return funds |
| Notification fails | Retry only; no compensation needed |
| Wallet deduction fails | Mark the transaction as FAILED |
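
The table above can be expressed as a lookup from failure event to compensation commands. This is a sketch under assumptions: `LedgerFailed` is taken from the state machine in this post, while `NotificationFailed`, the `on_failure` function, and the dictionary message shape are hypothetical names chosen for the example.

```python
# Failure event -> compensation commands, following the table above.
COMPENSATIONS = {
    "WalletFailed": [],                        # nothing was debited yet
    "LedgerFailed": ["RefundWalletCommand"],   # return the deducted funds
    "NotificationFailed": [],                  # retried, never compensated
}

# Failures that end the saga; a failed notification does not.
TERMINAL_FAILURES = {"WalletFailed", "LedgerFailed"}


def on_failure(txn: dict, event_type: str) -> list:
    """Return the compensation commands to emit and update the saga status."""
    commands = [{"type": name, "txn_id": txn["txn_id"], "amount": txn["amount"]}
                for name in COMPENSATIONS[event_type]]
    if event_type in TERMINAL_FAILURES:
        txn["status"] = "FAILED"
    return commands


txn = {"txn_id": "txn-1", "amount": 500, "status": "WALLET_OK"}
print(on_failure(txn, "LedgerFailed"), txn["status"])
```

Because the refund is just another command on the queue, it goes through the same retry and DLQ handling as the forward steps.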

All compensation commands are:

🔄 Transaction State Machine

   +------------------+
   |     PENDING      |
   +--------+---------+
            |
     WalletDebited
            |
            v
   +------------------+
   |   WALLET_OK      |
   +--------+---------+
            |
     LedgerWritten
            |
            v
   +------------------+
   |   SUCCESS        |
   +------------------+

            |
     WalletFailed / LedgerFailed
            v
   +------------------+
   |     FAILED       |
   +------------------+
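
The diagram translates naturally into a transition table. The states and events below are the ones shown above; which state each failure event is accepted from, and treating any unexpected combination as a no-op, are assumptions of this sketch.

```python
# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    ("PENDING", "WalletDebited"): "WALLET_OK",
    ("PENDING", "WalletFailed"): "FAILED",
    ("WALLET_OK", "LedgerWritten"): "SUCCESS",
    ("WALLET_OK", "LedgerFailed"): "FAILED",
}


def next_state(state: str, event: str) -> str:
    """Return the new state, or the current one for duplicate or late events."""
    return TRANSITIONS.get((state, event), state)


state = "PENDING"
for event in ("WalletDebited", "WalletDebited", "LedgerWritten"):
    state = next_state(state, event)
print(state)  # SUCCESS: the duplicate WalletDebited is ignored
```

Ignoring unknown transitions instead of raising keeps the orchestrator tolerant of redelivered events, at the cost of hiding genuinely unexpected ones, so logging them would be sensible in practice.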

🧯 Recovery from Partial Failures

Even with retries and DLQs, services or brokers may crash. To recover from these cases, we ensure:

🛠 Durable State Tracking

This allows resumption after failure or restart.

♻️ Automatic Saga Resumption

On service restart:

✅ Idempotency Ensures Safe Retries

Recovery logic is part of the TransactionService and is often run as a periodic background job or as part of service startup.
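
A periodic recovery job of this kind could be sketched as follows. The `recover_stuck_sagas` function, the `RESUME_COMMAND` mapping, the `updated_at` field, and the in-memory transaction list are all assumptions; a real job would query the transactions table and publish to the queue. Re-publishing is only safe because the downstream handlers are idempotent.

```python
# Command to (re-)emit for a saga that is stuck in a given state.
RESUME_COMMAND = {
    "PENDING": "DeductWalletCommand",
    "WALLET_OK": "WriteLedgerCommand",
}


def recover_stuck_sagas(transactions: list, now: float, timeout: float, publish) -> int:
    """Re-publish the pending step for every saga idle longer than `timeout`."""
    resumed = 0
    for txn in transactions:
        command = RESUME_COMMAND.get(txn["status"])
        if command is None:
            continue  # SUCCESS / FAILED: nothing to resume
        if now - txn["updated_at"] < timeout:
            continue  # still within its normal processing window
        publish({"type": command, "txn_id": txn["txn_id"]})
        txn["updated_at"] = now  # avoid re-publishing on every scan
        resumed += 1
    return resumed


transactions = [
    {"txn_id": "txn-1", "status": "WALLET_OK", "updated_at": 0.0},
    {"txn_id": "txn-2", "status": "SUCCESS", "updated_at": 0.0},
    {"txn_id": "txn-3", "status": "PENDING", "updated_at": 95.0},
]
republished = []
print(recover_stuck_sagas(transactions, now=100.0, timeout=30.0,
                          publish=republished.append))
print(republished)
```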

tags: system-design