trace-looad: fix review findings

* Add documentation * remove broken statistics * set defaults for unit tests differently.
COVESA · Aug 15, 2024 · 6aaae63 · 6aaae63
1 parent 398ac45
commit 6aaae63
Show file tree

Hide file tree

Showing 9 changed files with 98 additions and 223 deletions.
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -394,11 +394,19 @@ if(WITH_DLT_TRACE_LOAD_CTRL)
 
     # Configure limits for daemon
     if(NOT TRACE_LOAD_DAEMON_SOFT_LIMIT)
-        set(TRACE_LOAD_DAEMON_SOFT_LIMIT 0)
+        if (WITH_DLT_UNIT_TESTS)
+            set(TRACE_LOAD_DAEMON_SOFT_LIMIT 83333)
+        else()
+            set(TRACE_LOAD_DAEMON_SOFT_LIMIT 0)
+        endif()
     endif()
 
     if(NOT TRACE_LOAD_DAEMON_HARD_LIMIT)
-        set(TRACE_LOAD_DAEMON_HARD_LIMIT 0)
+        if (WITH_DLT_UNIT_TESTS)
+            set(TRACE_LOAD_DAEMON_HARD_LIMIT 100000)
+        else()
+            set(TRACE_LOAD_DAEMON_HARD_LIMIT 0)
+        endif()
     endif()
 
     if (TRACE_LOAD_DAEMON_SOFT_LIMIT GREATER TRACE_LOAD_DAEMON_HARD_LIMIT)

diff --git a/doc/dlt_trace_limit.md b/doc/dlt_trace_limit.md
@@ -0,0 +1,66 @@
+- [Trace load limits](#trace-load-limits)
+  - [Overview](#overview)
+  - [Technical concept](#technical-concept)
+
+# Trace load limits
+
+## Overview
+
+In a shared system, many developers log as much as they want into DLT. It can overload the logging system, resulting in poor performance and dropped logs. This feature introduces trace load limits to restrict applications to a specified log volume measured in bytes/s. 
+
+Trace load limits are configured via a space-separated configuration file.
+The format of the file follows:
+
+> APPID [CTXID] SOFT_LIMIT HARD_LIMIT
+
+The entry with the most matching criteria will be selected for each log, meaning the app and context or the app ID must match. For non-configured applications, default limits are applied.
+It allows configuring trace load for single contexts, which can be used, for example, to limit different applications in the QNX slog to a given budget without affecting others or to provide a file transfer unlimited bandwidth.
+You should always specify a budget for the application ID without the contexts to ensure that new contexts and internal logs like DLTL can be logged.
+
+Applications start with a default limit defined via CMake variables TRACE_LOAD_USER_HARD_LIMIT, TRACE_LOAD_USER_SOFT_LIMIT. 
+Once the connection to the daemon is established, each configuration entry matching the app ID will be sent to the client via application message. If no configuration is found, TRACE_LOAD_DAEMON_HARD_LIMIT and TRACE_LOAD_USER_SOFT_LIMIT will be used instead.
+The two-stage configuration process ensures that the daemon default can be set to 0, forcing developers to define a limit for their application while ensuring that applications can log before they receive the log levels.
+
+Measuring the trace load is done in the daemon and the client. 
+Dropping messages on the client side is the primary mechanism and prevents sending logs to the daemon only to be dropped there, 
+reducing the IPC's load. Measuring again on the daemon side ensures that rouge clients are still limited to the defined trace load.
+
+Exceeding the soft limit will produce the following logs:
+
+```
+ECU1- DEMO DLTL log warn V 1 [Trace load exceeded trace soft limit on apid: DEMO. (soft limit: 2000 bytes/sec, current: 2414 bytes/sec)] 
+ECU1- DEMO DLTL log warn V 1 [Trace load exceeded trace soft limit on apid: DEMO, ctid TEST.(soft limit: 150 bytes/sec, current: 191 bytes/sec)]
+```
+
+Exceeding the hard limit will produce the same message but with the appended text '42 messages discarded.' And (obviously) messages will be dropped. Dropped messages are lost and cannot be recovered, which forces developers to get their logging volume under control.
+
+As debug and trace load are usually disabled for production and thus do not impact the performance of actual systems, these logs are not accounted for in trace load limits. In practice, this was very useful for improving the developer experience while maintaining good performance, as developers know that debugging and trace logs are only available during development and that essential information has to be logged at a higher level.
+
+To simplify creating a trace limit baseline, the script 'utils/calculate-load.py' is provided, which suggests limits based on actual log volume.
+
+## Technical concept
+
+The bandwidth is calculated based on the payload of each message of every application or context. Each limit gets assigned several slots (default 60s) with a width of 1s, called a window. The sum of all slots represents the current bandwidth. 
+
+The `dlt_check_trace_load` function is the main entry point for handling trace load. It receives the payload size and the trace load configuration (among other parameters, but these are the most important) and returns whether the message should be kept. A return value of `true` means the message can be logged, while `false` means it should be dropped. This method is called for each message received by the daemon and for each message sent by the client. Doing this on the client side ensures that the daemon is not overloaded by clients sending too much data. Checking it on the daemon side ensures that rouge clients are still limited to the defined trace load.
+
+Within the `dlt_check_trace_load` function, the first step is to check if the current slot is still valid by checking its time frame. It is done in the `dlt_switch_slot_if_needed` function, which also does sanity checks and cleans the window if necessary.
+
+Cleaning does the following things:
+
+1. Check if a timestamp overflow occurred, which will happen after ~119 hours. In this case, the time calculation will be reset.
+2. Calculate the number of elapsed slots since receiving the last message.
+3. The window is reset if the calculated amount of elapsed slots is larger than the window size (`total_bytes_of_window`).
+4. Otherwise, remove data from slots that are not in the current window anymore.
+5. Output errors when the limits are exceeded.
+
+After the preparation, the actual recording of the trace load is calculated. The calculation is done in the following way:
+
+1. Add the payload size to the current slot.
+2. Increment the total size of the window.
+3. Save the current slot to the window so that we can use this in the next iteration.
+4. Calculate the average trace load of the window. It is an average in byte/s over the window size. The average value is used to ensure that short spikes do not immediately lead to dropped messages.
+
+The trace load configuration contains two flags, `is_over_soft_limit` and `is_over_hard_limit`. When the limits are exceeded, they will be set.
+If a new message exceeds the hard limits, it will be dropped, and its size will be removed from the window as the message is dropped.
+It will also increment the `hard_limit_over_counter`, which counts the number of dropped messages. `hard_limit_over_counter` is a part of the message logged in case of exceeding the hard limit.
diff --git a/include/dlt/dlt_common.h b/include/dlt/dlt_common.h
@@ -853,7 +853,7 @@ extern "C"
 #define DLT_TRACE_LOAD_WINDOW_RESOLUTION  (10000)
 
 /* Special Context ID for output soft_limit/hard_limit over warning message (DLT LIMITS) */
-#define DLT_INTERNAL_CONTEXT_ID           ("DLTL")
+#define DLT_TRACE_LOAD_CONTEXT_ID ("DLTL")
 
 /* Frequency in which warning messages are logged in seconds when an application is over the soft limit
  * Unit of this value is Number of slot of window.
@@ -919,6 +919,21 @@ extern pthread_rwlock_t trace_load_rw_lock;
 /* Precomputation  */
 static const uint64_t TIMESTAMP_BASED_WINDOW_SIZE = DLT_TRACE_LOAD_WINDOW_SIZE * DLT_TRACE_LOAD_WINDOW_RESOLUTION;
 typedef DltReturnValue (DltLogInternal)(DltLogLevelType loglevel, const char *text, void* params);
+
+/**
+ * Check if the trace load is within the limits.
+ * This adds the current message size to the trace load and checks if it is within the limits.
+ * It's the main entry point for trace load control.
+ * The function also handles output of warning messages when the trace load is over the limits.
+ * @param tl_settings The trace load settings for the message source.
+ * @param log_level Which log level should be used to output the trace load exceeded messages.
+ * @param timestamp The timestamp of the message, used to calculate the current slot.
+ * @param size Size of the payload.
+ * @param internal_dlt_log Function to output the trace load exceeded messages.
+ * @param internal_dlt_log_params Additional parameters for the internal_dlt_log function.
+ * @return True if the trace load is within the limits, false otherwise.
+ *         False means that the message should be discarded.
+ */
 bool dlt_check_trace_load(
     DltTraceLoadSettings* tl_settings,
     int32_t log_level,