kan01234 - Software Engineer Notes


A backend engineer's journey of learning and growth.


15 September 2024

Remote Log Retrieval for Linux-Based Edge Devices: A System Design Approach

by kan01234


Requirements

A Linux-powered edge device, operating in a remote setting, runs a program that generates periodic debug or error log messages. These logs are currently channeled through the system’s default syslog mechanism.

The main feature is to allow an operator to remotely access log messages from the edge device through a secure cloud server.

To conserve network bandwidth and cloud storage, the device should only upload logs when explicitly requested by the operator, and only the specific logs that the operator needs.

Assumptions

Estimated Log Message Length

Let’s make some estimations of the log message:

Timestamp: 23 characters (yyyy-MM-dd HH:mm:ss.SSS)
Log Level: 5 characters
PID: Assuming a typical PID length of 5 digits
Thread Name: 15 characters
Logger Name: Let's assume an average of 20 characters
Log Message: This is variable, but let's stick with our previous assumption of 100 bytes on average
Exception Stack Trace: This is highly variable, but let's assume an average of 500 bytes when present (this can be adjusted based on your application's exception handling)
Other characters and formatting: Let's add a buffer of 20 bytes for colons, spaces, and other formatting elements

Estimated Log Message / Day

Calculating the Average Log Message size:

Logs per day: 24 * 60 * 60 * 1000 = 86,400,000 (up to one log per millisecond)
Log message length with exception: 688 bytes
Log message length without exception: 188 bytes
Probability of exception: 10%

daily log volume = 86,400,000 * 10% * 688 bytes + 86,400,000 * 90% * 188 bytes ≈ 20.6 GB
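The arithmetic above can be checked with a short sketch, using the byte sizes and 10% exception rate from the assumptions (figures from this post; the helper name is illustrative):

```python
LOGS_PER_DAY = 24 * 60 * 60 * 1000   # one log per millisecond
PLAIN_BYTES = 188                     # entry without a stack trace
ERROR_BYTES = 688                     # entry with a ~500-byte stack trace

def daily_log_bytes() -> int:
    """Total bytes logged per day, assuming 10% of entries carry an exception."""
    error_logs = LOGS_PER_DAY // 10          # 10% with a stack trace
    plain_logs = LOGS_PER_DAY - error_logs   # remaining 90%
    return plain_logs * PLAIN_BYTES + error_logs * ERROR_BYTES

print(f"{daily_log_bytes() / 1e9:.1f} GB")  # → 20.6 GB
```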

The edge device can deliver events in order.

Ordered Log Delivery

A key assumption for the system design is that the edge device is capable of delivering log messages to the MQTT broker in the order they were generated. This implies that the device’s logging mechanism or any intermediary components ensure that logs are published to the broker in their correct chronological sequence, even if there are network delays or temporary disconnections.

This assumption simplifies the overall architecture and eliminates the need for complex message reordering or buffering on the server-side. It allows us to leverage the inherent ordering guarantees of certain message queues (e.g., RabbitMQ with ordering enabled, Amazon SQS FIFO queues) to ensure that logs are processed and stored in the correct order on the cloud server.

Request Completion

Assume that every log retrieval request sent to the edge device can be successfully processed and completed by the device. This means that the device has sufficient resources (CPU, memory, storage) and network connectivity to retrieve, filter (if applicable), and upload all the relevant log data within a reasonable timeframe.

System Design

(System architecture diagram)

Local Log Storage and Management on the Edge Device

File Structure

/var/log/iot_device/
├── 2024-09-10/
│   ├── 00/
│   │   ├── 2024-09-10_00-00.json.gz
│   │   ├── 2024-09-10_00-00_1.json.gz
│   │   ├── 2024-09-10_00-00_error.json.gz
│   │   ├── 2024-09-10_00-01.json.gz
│   │   ├── ...
│   │   └── 2024-09-10_00-59.json.gz
│   ├── 01/
│   │   ├── 2024-09-10_01-00.json
│   │   ├── 2024-09-10_01-00_error.json
│   │   ├── ...
│   └── ...
└── ...

Estimated File Sizes

| Log Type  | Uncompressed Size (approx.) | Calculation                          | Compressed Size (gzip, approx.) |
|-----------|-----------------------------|--------------------------------------|---------------------------------|
| Non-Error | 10.9 MB                     | 60,000 logs/min * 188 bytes/log      | 5.5 MB (50% compression)        |
| Error     | 4.13 MB                     | 60,000 logs/min * 10% * 688 bytes/log | 2.1 MB (50% compression)       |

Log Rotation

Rotation Rules:

Log Retrieval and Filtering Process

Scenario: Retrieving logs between 10:01:00 and 10:05:00

  1. Receive Log Request
    • The device receives a request specifying the desired time range (e.g., 10:01:00 - 10:05:00) and any filtering criteria (e.g., log level). We will discuss how this request is received later.
  2. Identify Relevant Files
    • Based on the requested time range, the device identifies:
      • The relevant daily directory (e.g., 2024-09-10)
      • The corresponding hourly subdirectory (10)
      • The log files within that directory whose timestamps fall within the range (2024-09-10_10-01.json.gz to 2024-09-10_10-05.json.gz)
    • If filtering by log level is required, it further selects only the files with the _error suffix (if filtering for errors) or all files (if no filtering).
  3. Publish Log Chunks to MQTT

    • The device publishes each log chunk to the appropriate MQTT topic, ensuring in-order delivery.
      • Each chunk’s payload includes metadata like request_id, timestamp, chunk_sequence, and is_last_chunk.
  4. Handle Interruptions

    • If the upload is interrupted, the progress checkpoint files indicate where to resume upon reconnection/restart.
  5. Completion and Cleanup

    • Once all files are uploaded:
      • Delete the request JSON file.
      • Delete the temporary directory and its contents.
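Step 2 above (mapping a requested time range onto the per-minute file layout) can be sketched as a small pure function. This is only an illustration of the naming scheme described earlier; the function name and `errors_only` flag are my own:

```python
from datetime import datetime, timedelta

def files_for_range(start: datetime, end: datetime, errors_only: bool = False):
    """List the per-minute log files covering [start, end], following the
    <day>/<hour>/<day>_<hour>-<minute>[_error].json.gz layout shown above."""
    paths = []
    t = start.replace(second=0, microsecond=0)
    while t <= end:
        name = t.strftime("%Y-%m-%d_%H-%M")
        suffix = "_error" if errors_only else ""
        paths.append(f"{t:%Y-%m-%d}/{t:%H}/{name}{suffix}.json.gz")
        t += timedelta(minutes=1)
    return paths

# Scenario from the post: logs between 10:01:00 and 10:05:00 on 2024-09-10
print(files_for_range(datetime(2024, 9, 10, 10, 1), datetime(2024, 9, 10, 10, 5)))
# five files, 2024-09-10_10-01 through 2024-09-10_10-05
```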

Memory Usage in the Typical Case:

The largest files would likely be the non-error log files, which are estimated to be around 10.9 MB (uncompressed) or 5.5 MB (compressed). In this case, the maximum memory usage would be around 5.5 MB.
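Since each file is published as a series of chunks carrying request_id, timestamp, chunk_sequence, and is_last_chunk metadata, the chunking in step 3 can be sketched as follows. The 256 KiB default chunk size is an illustrative choice, not a figure from this design:

```python
def make_chunks(request_id: str, timestamp: str, data: bytes,
                chunk_size: int = 256 * 1024):
    """Split one compressed log file into MQTT payload dicts carrying the
    request_id / timestamp / chunk_sequence / is_last_chunk metadata."""
    total = max(1, (len(data) + chunk_size - 1) // chunk_size)
    return [
        {
            "request_id": request_id,
            "timestamp": timestamp,
            "chunk_sequence": seq,
            "is_last_chunk": seq == total - 1,
            "payload": data[seq * chunk_size:(seq + 1) * chunk_size],
        }
        for seq in range(total)
    ]
```

Publishing each dict in sequence number order (with an appropriate QoS level) preserves the in-order delivery assumption from earlier.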

Drawback: Potential Transfer of Unnecessary Log Data

Retrieving entire log files based on minute boundaries can lead to transferring unnecessary data. Even if a specific time range within a minute is requested, the whole file is retrieved and processed, wasting bandwidth and server resources.

Potential Improvements

Finer-Grained Log Rotation

Concept: Decrease the log rotation interval (e.g., to 30 or 15 seconds) to produce smaller files and transfer less extraneous data.

Benefits:

Considerations:

In-File Filtering with Structured Logs

Concept: Filter directly within log files on the device by parsing and extracting entries matching the requested criteria.

Benefits:

Considerations:
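Since the logs are structured (JSON lines, gzip-compressed), in-file filtering can be sketched as a streaming pass over one file. The field names `ts` and `level` are assumptions about the log schema, not from the original post:

```python
import gzip
import io
import json

def filter_entries(gz_bytes: bytes, level: str, start_ms: int, end_ms: int):
    """Stream a gzipped JSON-lines log file and keep only entries that
    match the requested level and fall inside [start_ms, end_ms]."""
    out = []
    with gzip.open(io.BytesIO(gz_bytes), mode="rt") as fh:
        for line in fh:
            entry = json.loads(line)
            if entry["level"] == level and start_ms <= entry["ts"] <= end_ms:
                out.append(entry)
    return out

# Tiny in-memory example
sample = "\n".join(json.dumps({"ts": t, "level": lvl, "msg": "m"})
                   for t, lvl in [(1, "INFO"), (2, "ERROR"), (3, "INFO")])
gz = gzip.compress(sample.encode())
print(filter_entries(gz, "ERROR", 0, 10))  # only the ts=2 ERROR entry
```

Because the file is read line by line, memory stays bounded by one entry rather than the whole file, which matters on a constrained edge device.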

Communication between the Edge Device and the Cloud Server

Key takeaway

The system uses MQTT for communication, ensures trust with TLS, and handles intermittent connectivity through retained messages, QoS, and persistent storage. It also employs a file-system-based approach with checkpoints to manage log retrieval and interruptions efficiently.

Protocol

MQTT is chosen for its lightweight nature, efficiency in handling intermittent connectivity, and publish-subscribe model suitable for edge device scenarios.

Establishing Trust with TLS:

Mutual TLS authentication is recommended to establish trust between the device and the server, ensuring secure and authenticated communication.

Both the device and the server use X.509 certificates to verify each other’s identity during connection setup, preventing unauthorized access and man-in-the-middle attacks.

Authentication Methods

  1. OAuth 2.0/OpenID Connect
  2. Username/Password

Comparison of Authentication Approaches

| Feature         | OAuth 2.0/OpenID Connect      | Username/Password       |
|-----------------|-------------------------------|-------------------------|
| Security        | High                          | Low to Moderate         |
| Flexibility     | High (various flows/grants)   | Low                     |
| Scalability     | High (centralized identity)   | High (with clustering)  |
| Complexity      | High (external auth server)   | Low                     |
| Network Reliance| High (for token acquisition)  | Low                     |
| Suitability     | Large-scale, high-security    | Small-scale, simple     |

Chosen Approach: Username/Password

For this system, we’ll utilize username/password authentication due to the intermittent network connectivity on the edge device. This approach minimizes reliance on network stability during the authentication process. We’ll ensure robust security by combining it with TLS encryption and enforcing strong password policies on the device.

Important Note: If the system’s security requirements change or the scale of deployment increases significantly, it’s advisable to re-evaluate the authentication mechanism and potentially transition to a more robust approach like OAuth 2.0 or mutual TLS.

Private Tunnel (Optional):

While a private tunnel adds inherent security, TLS can be layered on top to provide defense-in-depth and granular access control.

Even within a private network, TLS ensures data confidentiality and integrity.

Handling Intermittent Connectivity and Device Downtime:

Handling Interruption on Process

/foo/bar/request/
├── request-1
├── request-2
├── request-1/
│   ├── 2024-09-10_05-00.json.gz
│   ├── 2024-09-10_06-00.json.gz
│   ├── 2024-09-10_07-00.json.gz
│   └── 2024-09-12T10:05-00.trail
└── ...

1. Receive and Acknowledge Request

2. Identify and Copy Relevant Log Files

3. Process Each Copied File Sequentially

4. Handle Interruptions

5. Completion and Cleanup

Based on the requested time range, the device identifies the relevant daily and hourly directories in the file system (e.g., /var/log/iot_device/2024-09-10/10/). It selects the log files (*.json.gz) within those directories whose timestamps fall within the requested range. If filtering by log level is required, it further selects only the files with the _error suffix.
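The resume-after-interruption logic driven by the .trail checkpoint can be sketched as follows. I am assuming here that the .trail file simply records the last fully uploaded filename; the exact checkpoint format is not specified in the post:

```python
def remaining_files(copied_files, last_completed):
    """Given the files copied for a request and the last file recorded in
    the .trail checkpoint, return the files still to upload after a
    restart. If no checkpoint exists, everything remains."""
    ordered = sorted(copied_files)  # filenames sort chronologically
    if last_completed is None:
        return ordered
    return ordered[ordered.index(last_completed) + 1:]
```

Because the filenames embed their timestamps, a lexicographic sort restores chronological order, so the device can resume exactly where the checkpoint left off.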

Sequence

sequenceDiagram
  participant e1 as Edge Device 1
  participant e2 as Edge Device 2
  participant b as Message Broker (Cloud Server)
  participant a as Log Service (API / UI)
  participant o as Operator
  e1 -->> b: subscribe topic /client/building-a/log-request/e1
  e2 -->> b: subscribe topic /client/building-a/log-request/e2
  e2 -->> e2: power off
  o -->> a: request log, from 10:05 to 10:08, device 1
  a -->> b: publish topic /client/building-a/log-request/e1, 10:05 - 10:08
  a -->> o: OK
  o -->> a: request log, from 11:05 to 11:08, device 2
  a -->> b: publish topic /client/building-a/log-request/e2, 11:05 - 11:08
  a -->> o: OK
  b -->> e1: publish request log, from 10:05 to 10:08, device 1
  e1 -->> e1: write request to local storage
  e1 -->> b: ack
  e1 -->> b: publish log by chunk, topic /client/building-a/log/e1, chunk, chunk_seq, is_last_chunk
  e2 -->> e2: power on
  b -->> e2: publish topic /client/building-a/e2, 11:05 - 11:08, (then same as the log request flow of device 1)

Log Storage and Management on the Cloud Server

Log Request Data

Let's say we have the following metadata for each request:

Log Request Metadata: This metadata is associated with each log retrieval request initiated by the operator. It includes information such as:

Potential Database

Given the log request metadata’s characteristics and the primary access pattern of querying by request_id, suitable database options include:

1. Key-Value Store (KVS)

2. Relational Database (SQL)

3. Document Database

Recommendation:

Summary of Log Processing and Storage Architecture

MQTT to Message Queue:

The IoT device publishes log messages to the MQTT broker. The broker then forwards these messages to a message queue, ensuring reliable delivery and decoupling the device from the processing stage.

Processor and Storage:

The processor consumes messages from the queue, potentially processing them further. It then reassembles the log chunks into complete log files, maintaining their chronological order. These reassembled log files are then stored in either:

S3: For cost-effective and scalable storage, especially for simpler log retrieval scenarios.
Text Search Engine: For advanced search, filtering, and analytics capabilities.
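The processor's reassembly step can be sketched as a small state machine keyed by request_id. Given the in-order delivery assumption, a request is complete once every chunk_sequence up to the is_last_chunk marker has been seen (the class name and internals are illustrative):

```python
from collections import defaultdict

class ChunkAssembler:
    """Reassemble per-request log chunks consumed from the message queue,
    using the request_id / chunk_sequence / is_last_chunk metadata."""

    def __init__(self):
        self._chunks = defaultdict(dict)  # request_id -> {seq: payload}
        self._last = {}                   # request_id -> final sequence number

    def add(self, request_id, chunk_sequence, payload, is_last_chunk):
        """Record one chunk; return the reassembled bytes when complete,
        otherwise None."""
        self._chunks[request_id][chunk_sequence] = payload
        if is_last_chunk:
            self._last[request_id] = chunk_sequence
        return self._assemble_if_complete(request_id)

    def _assemble_if_complete(self, request_id):
        if request_id not in self._last:
            return None  # last chunk not yet seen
        seqs = self._chunks[request_id]
        if len(seqs) != self._last[request_id] + 1:
            return None  # some sequence numbers still missing
        return b"".join(seqs[i] for i in range(len(seqs)))
```

Tracking which sequence numbers have arrived also gives the processor a natural way to detect request completion before writing the file to S3 or the search engine.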

S3 Folder Structure

s3://log-bucket/
├── requests/
│   ├── <request_UUID_1>/
│   │   ├── 2024-09-10_05-00.json.gz
│   │   └── 2024-09-10_06-00.json.gz 
│   └── ...
└── ...

In this structure:

The choice between compressed or uncompressed storage depends on the trade-offs between storage costs, retrieval speed, and the capabilities of the tools used to access and analyze the logs.

Applying a Text Search Engine (e.g., Elasticsearch) (Optional)

Instead of (or in addition to) storing log chunks directly in S3, the processor can index them into a text search engine like Elasticsearch.

  1. Parsing and Indexing: The processor parses each log chunk, extracts relevant fields (timestamp, log level, device ID, message, etc.), and indexes this structured data into Elasticsearch.

  2. Querying: When the operator requests logs, the server translates the query criteria into an Elasticsearch query. Elasticsearch performs the search and filtering, returning the matching log entries.

Potential Processing and Filtering Tasks within the Processor (Optional)

Determining Request Completion

Accessing the Logs

The operator can access the processed logs through the following methods:

The choice of access method depends on the operator’s preferences and the specific use case. The API offers programmatic access, S3 commands provide direct control, and a web interface offers a more user-friendly experience.

tags: mqtt,system-design