by kan01234
A Linux-powered edge device, operating in a remote setting, runs a program that generates periodic debug or error log messages. These logs are currently channeled through the system’s default syslog mechanism.
The main feature allows an operator to remotely access log messages from the edge device through a secure cloud server.
To conserve network bandwidth and cloud storage, the device should only upload logs when explicitly requested by the operator, and only the specific logs that the operator needs.
Let’s make some estimations of the log message:
Timestamp: 23 characters (yyyy-MM-dd HH:mm:ss.SSS)
Log Level: 5 characters
PID: Assuming a typical PID length of 5 digits
Thread Name: 15 characters
Logger Name: Let's assume an average of 20 characters
Log Message: This is variable, but let's assume 100 bytes on average
Exception Stack Trace: This is highly variable, but let's assume an average of 500 bytes when present (this can be adjusted based on your application's exception handling)
Other characters and formatting: Let's add a buffer of 20 bytes for colons, spaces, and other formatting elements
Calculating the Average Log Message size:
Logs per day: 24 * 60 * 60 * 1000 = 86,400,000 (worst case of one log per millisecond)
Log message length with exception: 688 bytes
Log message length without exception: 188 bytes
Probability of exception: 10%
log message size per day = 86,400,000 * 10% * 688 bytes + 86,400,000 * 90% * 188 bytes ≈ 20.6 GB
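The estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the field sizes listed above and weighting logs without an exception at 90%; the variable names are illustrative.

```python
# Field sizes in bytes, taken from the estimates listed above.
FIELD_BYTES = {
    "timestamp": 23,
    "log_level": 5,
    "pid": 5,
    "thread_name": 15,
    "logger_name": 20,
    "message": 100,
    "formatting": 20,
}
STACK_TRACE_BYTES = 500
EXCEPTION_PERCENT = 10
LOGS_PER_DAY = 24 * 60 * 60 * 1000  # worst case: one log per millisecond

base = sum(FIELD_BYTES.values())           # log without an exception
with_exception = base + STACK_TRACE_BYTES  # log with a stack trace

# Integer arithmetic keeps the result exact.
error_logs = LOGS_PER_DAY * EXCEPTION_PERCENT // 100
normal_logs = LOGS_PER_DAY - error_logs
bytes_per_day = error_logs * with_exception + normal_logs * base

print(f"{base} B without exception, {with_exception} B with exception")
print(f"{bytes_per_day / 1e9:.1f} GB per day")
```

Changing `EXCEPTION_PERCENT` or the stack trace size shows how sensitive the daily volume is to the application's error rate.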
Ordered Log Delivery
A key assumption for the system design is that the edge device is capable of delivering log messages to the MQTT broker in the order they were generated. This implies that the device’s logging mechanism or any intermediary components ensure that logs are published to the broker in their correct chronological sequence, even if there are network delays or temporary disconnections.
This assumption simplifies the overall architecture and eliminates the need for complex message reordering or buffering on the server-side. It allows us to leverage the inherent ordering guarantees of certain message queues (e.g., RabbitMQ with ordering enabled, Amazon SQS FIFO queues) to ensure that logs are processed and stored in the correct order on the cloud server.
Assume that every log retrieval request sent to the edge device can be successfully processed and completed by the device. This means that the device has sufficient resources (CPU, memory, storage) and network connectivity to retrieve, filter (if applicable), and upload all the relevant log data within a reasonable timeframe.
├── 2024-09-10/
│ ├── 00/
│ │ ├── 2024-09-10_00-00.json.gz
│ │ ├── 2024-09-10_00-00_1.json.gz
│ │ ├── 2024-09-10_00-00_error.json.gz
│ │ ├── 2024-09-10_00-01.json.gz
│ │ ├── ...
│ │ └── 2024-09-10_00-59.json.gz
│ ├── 01/
│ │ ├── 2024-09-10_01-00.json
│ │ ├── 2024-09-10_01-00_error.json
│ │ ├── ...
│ └── ...
└── ...
Log Type | Uncompressed Size (approx.) | Calculation | Compressed Size (gzip, approx.) |
---|---|---|---|
Non-Error | 10.9 MB | 60,000 logs/min * 188 bytes/log / 1024^2 | 5.5 MB (50% compression) |
Error | 4.13 MB | 60,000 logs/min * 10% * 688 bytes/log / 10^6 | 2.1 MB (50% compression) |
Rotation Rules:
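Scenario: Retrieving logs between 10:01:00 and 10:05:00
Identify the daily directory (2024-09-10) and the hourly directory (10), then select the files from 2024-09-10_10-01.json.gz to 2024-09-10_10-05.json.gz, taking only those with the _error suffix (if filtering for errors) or all files (if no filtering).
Publish Log Chunks to MQTT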
Handle Interruptions
Completion and Cleanup
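The file selection in the scenario above can be sketched as a filter over the file names found in an hourly directory. The regular expression is an assumption based on the names shown in the directory tree (an optional numeric part such as `_1`, followed by an optional `_error` suffix); `select_files` is a hypothetical helper, not part of any library.

```python
import re
from datetime import datetime

# Assumed naming scheme: <date>_<HH-MM>[_<part>][_error].json.gz
NAME_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}_\d{2}-\d{2})"
    r"(?:_(?P<part>\d+))?(?P<error>_error)?\.json\.gz$"
)


def select_files(names, start, end, errors_only=False):
    """Return the file names whose minute overlaps [start, end]."""
    # A file named for minute M holds logs from M:00 to M:59, so the
    # start of the range is rounded down to its minute boundary.
    start_minute = start.replace(second=0, microsecond=0)
    selected = []
    for name in names:
        match = NAME_RE.match(name)
        if match is None:
            continue  # not a rotated log file
        minute = datetime.strptime(match["minute"], "%Y-%m-%d_%H-%M")
        if not (start_minute <= minute <= end):
            continue
        if errors_only and match["error"] is None:
            continue
        selected.append(name)
    return sorted(selected)


names = [
    "2024-09-10_10-00.json.gz",
    "2024-09-10_10-01.json.gz",
    "2024-09-10_10-01_1.json.gz",
    "2024-09-10_10-03_error.json.gz",
    "2024-09-10_10-05.json.gz",
    "2024-09-10_10-06.json.gz",
]
start = datetime(2024, 9, 10, 10, 1, 0)
end = datetime(2024, 9, 10, 10, 5, 0)
print(select_files(names, start, end))
print(select_files(names, start, end, errors_only=True))
```

Working from file names alone means the device never has to open a file to decide whether it is relevant, which is the main appeal of the minute-based layout.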
Memory Usage in Typical case:
The largest files would likely be the non-error log files, which are estimated to be around 10.9 MB (uncompressed) or 5.5 MB (compressed). In this case, the maximum memory usage would be around 5.5 MB.
Retrieving entire log files based on minute boundaries can lead to transferring unnecessary data. Even if a specific time range within a minute is requested, the whole file is retrieved and processed, wasting bandwidth and server resources.
Finer-Grained Log Rotation
Concept: Decrease log rotation interval (e.g., to 30 seconds or 15 seconds) for smaller files and less extraneous data transfer.
Benefits:
Considerations:
In-File Filtering with Structured Logs
Concept: Filter directly within log files on the device by parsing and extracting entries matching the requested criteria.
Benefits:
Considerations:
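In-file filtering could look roughly like the sketch below. It assumes each rotated file is gzip-compressed JSON Lines (one JSON object per line) with `timestamp` and `level` fields; those field names and the `filter_entries` helper are assumptions made for illustration, since the article does not fix a log schema.

```python
import gzip
import json
from datetime import datetime


def filter_entries(path, start, end, levels=None):
    """Yield entries from a gzipped JSON Lines file that fall in [start, end].

    The file is streamed line by line, so memory use stays close to the
    size of a single entry rather than the size of the whole file.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            # Assumed timestamp format: "yyyy-MM-dd HH:mm:ss.SSS"
            ts = datetime.fromisoformat(entry["timestamp"])
            if not (start <= ts <= end):
                continue
            if levels is not None and entry["level"] not in levels:
                continue
            yield entry
```

The cost is extra CPU on the device for decompression and parsing, which has to be weighed against the bandwidth saved.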
Key takeaway
The system uses MQTT for communication, ensures trust with TLS, and handles intermittent connectivity through retained messages, QoS, and persistent storage. It also employs a file-system-based approach with checkpoints to manage log retrieval and interruptions efficiently.
MQTT is chosen for its lightweight nature, efficiency in handling intermittent connectivity, and publish-subscribe model suitable for edge device scenarios.
Mutual TLS authentication is recommended to establish trust between the device and the server, ensuring secure and authenticated communication.
Both the device and the server use X.509 certificates to verify each other’s identity during connection setup, preventing unauthorized access and man-in-the-middle attacks.
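On the device side, a TLS configuration along these lines can be built with Python's standard `ssl` module. This is a minimal sketch: the function name and its parameters are hypothetical, and the certificate paths would come from however the device is provisioned.

```python
import ssl


def build_mutual_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context for connecting to the broker.

    ca_file:   CA bundle used to verify the broker's certificate
               (None falls back to the system trust store).
    cert_file: the device's own X.509 certificate (enables mutual TLS).
    key_file:  the private key matching cert_file.
    """
    # Purpose.SERVER_AUTH gives a client context that verifies the server:
    # certificate validation and hostname checking are on by default.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file is not None:
        # Present the device certificate so the broker can authenticate it.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

An MQTT client library that accepts an `ssl.SSLContext` can then use it directly; paho-mqtt, for example, exposes `tls_set_context` for this purpose.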
Comparison of Authentication Approaches
Feature | OAuth 2.0/OpenID Connect | Username/Password |
---|---|---|
Security | High | Low to Moderate |
Flexibility | High (various flows/grants) | Low |
Scalability | High (centralized identity) | High (with clustering) |
Complexity | High (external auth server) | Low |
Network Reliance | High (for token acquisition) | Low |
Suitability | Large-scale, high-security | Small-scale, simple |
Chosen Approach: Username/Password
For this system, we’ll utilize username/password authentication due to the intermittent network connectivity on the edge device. This approach minimizes reliance on network stability during the authentication process. We’ll ensure robust security by combining it with TLS encryption and enforcing strong password policies on the device.
Important Note: If the system’s security requirements change or the scale of deployment increases significantly, it’s advisable to re-evaluate the authentication mechanism and potentially transition to a more robust approach like OAuth 2.0 or mutual TLS.
While a private tunnel adds inherent security, TLS can be layered on top to provide defense-in-depth and granular access control.
Even within a private network, TLS ensures data confidentiality and integrity.
MQTT’s Publish-Subscribe Model: The server publishes log retrieval requests to specific topics, and the device subscribes to those topics. Combined with retained messages or a persistent session, this allows the device to receive requests even if it was offline when the request was initially published.
Retained Messages: The server can publish requests as retained messages, ensuring the device receives the latest request upon reconnection.
Delivery Guarantees: Using MQTT QoS 1 (at-least-once) or QoS 2 (exactly-once) delivery ensures reliability even with intermittent connectivity.
Persistent Storage: The device can store pending requests in persistent storage to handle scenarios where it’s offline or loses power during processing.
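One way to make that persistence robust against power loss is to write each request to a temporary file and rename it into place, so a half-written request file is never visible. The sketch below assumes requests are JSON documents stored as `request_<request_id>.json`; `save_request` and `load_pending_requests` are hypothetical helper names.

```python
import json
import os
import tempfile


def save_request(request_dir, request_id, payload):
    """Atomically persist a request as request_<request_id>.json."""
    os.makedirs(request_dir, exist_ok=True)
    final_path = os.path.join(request_dir, f"request_{request_id}.json")
    fd, tmp_path = tempfile.mkstemp(dir=request_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


def load_pending_requests(request_dir):
    """Return all persisted requests, e.g. after a reboot."""
    pending = []
    for name in sorted(os.listdir(request_dir)):
        if name.startswith("request_") and name.endswith(".json"):
            with open(os.path.join(request_dir, name), encoding="utf-8") as fh:
                pending.append(json.load(fh))
    return pending
```

On startup the device would call `load_pending_requests` and resume whatever it finds before subscribing for new requests.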
/foo/bar/request/
├── request-1
├── request-2
├── request-1/
│ ├── 2024-09-10_05-00.json.gz
│ ├── 2024-09-10_06-00.json.gz
│ ├── 2024-09-10_07-00.json.gz
│ ├── 2024-09-12T10:05-00.trail
...
1. Receive and Acknowledge Request
Persist the request on the device (e.g., /foo/bar/request/request_<request_id>.json).
2. Identify and Copy Relevant Log Files
Copy the log files (.json.gz) within the range and matching any log level filter.
3. Process Each Copied File Sequentially
Track progress for each file in a checkpoint file (.trail).
4. Handle Interruptions
5. Completion and Cleanup
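The interruption handling in these steps can be sketched with a small checkpoint file. The article does not specify what a .trail file contains, so this example assumes it holds the index of the next chunk to publish; `publish` stands in for whatever function actually sends a chunk over MQTT.

```python
import os


def read_checkpoint(trail_path):
    """Return the index of the next chunk to send (0 if no checkpoint exists)."""
    try:
        with open(trail_path, "r", encoding="utf-8") as fh:
            return int(fh.read().strip() or 0)
    except FileNotFoundError:
        return 0


def write_checkpoint(trail_path, next_chunk):
    """Atomically record how far the upload has progressed."""
    tmp_path = trail_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(str(next_chunk))
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, trail_path)


def upload_with_resume(chunks, trail_path, publish):
    """Publish chunks in order, resuming from the checkpoint after a failure."""
    for index in range(read_checkpoint(trail_path), len(chunks)):
        publish(index, chunks[index])          # may raise on network failure
        write_checkpoint(trail_path, index + 1)
    if os.path.exists(trail_path):
        os.remove(trail_path)                  # completion and cleanup
```

If the device loses power after a publish but before the checkpoint is written, that chunk is sent again on resume, so the server side has to tolerate duplicates (at-least-once semantics).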
Based on the requested time range, the device identifies the relevant daily and hourly directories in the file system (e.g., /var/log/iot_device/2024-09-10/10/). It selects the log files (*.json.gz) within those directories whose timestamps fall within the requested range. If filtering by log level is required, it further selects only the files with the _error suffix.
Let's say we have the following metadata for the request:
Log Request Metadata: This metadata is associated with each log retrieval request initiated by the operator. It includes information such as:
Given the log request metadata’s characteristics and the primary access pattern of querying by request_id, suitable database options include:
1. Key-Value Store (KVS) (efficient request_id retrieval)
2. Relational Database (SQL)
3. Document Database
Recommendation:
For lookups by request_id and temporary data retention: KVS with TTL is a strong option due to its simplicity, efficiency, and automatic expiration.
MQTT to Message Queue:
The IoT device publishes log messages to the MQTT broker. The broker then forwards these messages to a message queue, ensuring reliable delivery and decoupling the device from the processing stage.
Processor and Storage:
The processor consumes messages from the queue, potentially processing them further. It then reassembles the log chunks into complete log files, maintaining their chronological order. These reassembled log files are then stored in either:
S3: For cost-effective and scalable storage, especially for simpler log retrieval scenarios.
Text Search Engine: For advanced search, filtering, and analytics capabilities.
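The chunking and reassembly described above can be sketched as follows. The `is_last_chunk` flag is the one the article uses; the other field names (`request_id`, `file_name`, `seq`, `payload`) and the `Reassembler` class are assumptions for illustration. The sketch relies on the ordered-delivery assumption stated earlier, so it simply appends chunks as they arrive.

```python
def split_into_chunks(request_id, file_name, data, chunk_size):
    """Split a file's bytes into ordered chunk messages (device side)."""
    total = max(1, -(-len(data) // chunk_size))  # ceiling division, at least 1
    for seq in range(total):
        yield {
            "request_id": request_id,
            "file_name": file_name,
            "seq": seq,
            "is_last_chunk": seq == total - 1,
            "payload": data[seq * chunk_size:(seq + 1) * chunk_size],
        }


class Reassembler:
    """Rebuild files from chunks that arrive in order (processor side)."""

    def __init__(self):
        self._parts = {}

    def add(self, chunk):
        """Store a chunk; return the full file bytes once the last one arrives."""
        key = (chunk["request_id"], chunk["file_name"])
        self._parts.setdefault(key, []).append(chunk["payload"])
        if chunk["is_last_chunk"]:
            return b"".join(self._parts.pop(key))
        return None


data = bytes(range(10))
reassembler = Reassembler()
results = [
    reassembler.add(chunk)
    for chunk in split_into_chunks("req-1", "2024-09-10_05-00.json.gz", data, 4)
]
print(results[-1] == data)
```

In a real deployment the payload would need a wire encoding (for example base64 inside a JSON message), and the processor would write the completed file to storage instead of returning it.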
s3://log-bucket/
├── requests/
│ ├── <request_UUID_1>/
│ │ ├── 2024-09-10_05-00.json.gz
│ │ └── 2024-09-10_06-00.json.gz
│ └── ...
└── ...
In this structure:
Each file within a request directory represents a processed log file, either containing non-error logs or error logs (with the _error suffix).
The files can be either:
The choice between compressed or uncompressed storage depends on the trade-offs between storage costs, retrieval speed, and the capabilities of the tools used to access and analyze the logs.
Instead of (or in addition to) storing log chunks directly in S3, the processor can index them into a text search engine like Elasticsearch.
Parsing and Indexing: The processor parses each log chunk, extracts relevant fields (timestamp, log level, device ID, message, etc.), and indexes this structured data into Elasticsearch.
Querying: When the operator requests logs, the server translates the query criteria into an Elasticsearch query. Elasticsearch performs the search and filtering, returning the matching log entries.
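Translating an operator request into a search query might look like the sketch below. It builds a plain dictionary in Elasticsearch's Query DSL (a `bool` query with `term` and `range` filters); the field names `device_id`, `timestamp`, and `level` are assumptions about how the processor indexed the logs, and `build_log_query` is a hypothetical helper.

```python
import json
from datetime import datetime


def build_log_query(device_id, start, end, level=None, size=100):
    """Build an Elasticsearch query body for logs in [start, end]."""
    filters = [
        {"term": {"device_id": device_id}},
        {"range": {"timestamp": {"gte": start.isoformat(), "lte": end.isoformat()}}},
    ]
    if level is not None:
        filters.append({"term": {"level": level}})
    return {
        # Filter context skips relevance scoring, which suits log retrieval.
        "query": {"bool": {"filter": filters}},
        "sort": [{"timestamp": {"order": "asc"}}],
        "size": size,
    }


query = build_log_query(
    "edge-01",
    datetime(2024, 9, 10, 10, 1, 0),
    datetime(2024, 9, 10, 10, 5, 0),
    level="ERROR",
)
print(json.dumps(query, indent=2))
```

The resulting body can be sent to the index's `_search` endpoint by whichever client the server already uses.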
The device sets the is_last_chunk flag to signal end of file upload.
The processor treats a file as complete once a chunk with is_last_chunk set to true is received.
The operator can access the processed logs through the following methods:
S3 commands: Use the aws s3 cp or aws s3 sync commands to download specific files or entire directories based on the request ID and file naming conventions.
The choice of access method depends on the operator’s preferences and the specific use case. The API offers programmatic access, S3 commands provide direct control, and a web interface offers a more user-friendly experience.
tags: mqtt,system-design