Exploring DataSunrise’s Performance Under High Traffic Conditions
We are often asked how DataSunrise behaves when traffic exceeds what it can process. Clients want to know whether DataSunrise will drop traffic, introduce delays, or handle the overload in some other way. This post explains what happens in each case.
Operational Capacity and Performance Metrics
To understand how DataSunrise behaves under pressure, it helps to know its throughput limits on different server configurations. The figures below are the maximum operations per second we measured on different Amazon EC2 instance types, that is, the load at which proxying and auditing push CPU usage to 100%:
- m5.8xlarge: 24,500 operations/sec
- m5.4xlarge: 18,700 operations/sec
- m5.2xlarge: 15,350 operations/sec
- m5.xlarge: 7,800 operations/sec
- m5.large: 3,900 operations/sec
These results are based on our tests using RDS Postgres on an m5.2xlarge instance equipped with 12,000 IOPS storage.
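As a rough sizing aid, the figures above can be turned into a small lookup. The sketch below is illustrative only: the function name `smallest_instance` and the 30% default headroom are our assumptions, not DataSunrise recommendations, and the numbers apply only to a setup comparable to the test environment described above.

```python
# Peak throughput figures quoted above (operations/sec at 100% CPU).
PEAK_OPS = {
    "m5.large": 3_900,
    "m5.xlarge": 7_800,
    "m5.2xlarge": 15_350,
    "m5.4xlarge": 18_700,
    "m5.8xlarge": 24_500,
}


def smallest_instance(expected_ops, headroom=0.3):
    """Return the smallest listed instance whose measured peak covers
    expected_ops plus a safety margin, or None if none does.

    headroom is a hypothetical safety margin (0.3 = 30%), not a vendor value.
    """
    required = expected_ops * (1 + headroom)
    # Walk the instances from lowest to highest measured peak.
    for name, peak in sorted(PEAK_OPS.items(), key=lambda item: item[1]):
        if peak >= required:
            return name
    return None


if __name__ == "__main__":
    for load in (5_000, 10_000, 20_000):
        print(load, "ops/sec ->", smallest_instance(load))
```

A result of `None` means the expected load plus margin is above every measured peak, so a single instance of the listed types would be saturated.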
Identifying and Managing Bottlenecks
Audit System Bottlenecks
If the Audit Storage cannot keep up with a spike in traffic, DataSunrise buffers events in an internal queue within its Audit Journal system. The queue holds up to several thousand events, depending on system settings (refer to the AuditHighWaterMark parameter). If a traffic spike exceeds the queue’s capacity, events may be rejected. This default behavior can be changed so that DataSunrise instead waits until there is space in the queue before logging new events (see the AuditPutThreadQueueWait parameter). While it waits, application traffic may be paused, usually for milliseconds to seconds, depending on the auditing system’s performance.
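The trade-off between rejecting events and waiting for space can be illustrated with a toy bounded queue. This is a minimal sketch of the general technique, not DataSunrise’s implementation: the class `AuditQueueModel` and its arguments are hypothetical, with `high_water_mark` standing in for the role of AuditHighWaterMark and `wait_for_space` for the role of AuditPutThreadQueueWait.

```python
import queue


class AuditQueueModel:
    """Toy model of a bounded audit queue (illustrative, hypothetical names)."""

    def __init__(self, high_water_mark, wait_for_space=False):
        self._q = queue.Queue(maxsize=high_water_mark)
        self.wait_for_space = wait_for_space
        self.rejected = 0

    def put(self, event, timeout=None):
        """Enqueue an audit event; return True if it was accepted."""
        if self.wait_for_space:
            # The caller (the traffic path) blocks until space frees up.
            # With a timeout, queue.Full is raised if no space appears in time.
            self._q.put(event, block=True, timeout=timeout)
            return True
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            # Reject mode: traffic is not delayed, but the event is lost.
            self.rejected += 1
            return False

    def drain(self, n):
        """Simulate the audit storage writing up to n queued events."""
        written = 0
        while written < n:
            try:
                self._q.get_nowait()
            except queue.Empty:
                break
            written += 1
        return written


if __name__ == "__main__":
    q = AuditQueueModel(high_water_mark=3)
    print([q.put(i) for i in range(5)], "rejected:", q.rejected)
```

In reject mode the producer never stalls but events are lost once the queue is full; in wait mode nothing is lost, but the producer is held up for as long as the consumer lags.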
To optimize your audit system, consider the following:
- Improve the audit database’s performance by choosing servers with more CPU and memory.
- Check your network setup, because latency significantly affects performance; ideally, the DataSunrise host and the Audit Storage should be on the same subnet.
- Review your rules and audit events, and limit them to those that are critical for your compliance policies.
DataSunrise Parsing System Bottlenecks
If your auditing system is keeping up, CPU usage becomes the next potential bottleneck. How this affects your application depends on the handling mode, passive or active:
- Passive Mode. Here, traffic is handled asynchronously in a separate pool of threads before being resent to the server. Traffic is temporarily stored in an internal queue, which can buffer spikes and potentially enhance application performance (refer to the MessageHandlersGlobalQueueHighWaterMark and MessageHandlersLocalQueueHighWaterMark parameters). If this buffer fills, DataSunrise will stop parsing new traffic on that connection, and you’ll receive an alert in the Event Monitor. This situation won’t degrade application performance, but some events might be missed in the audit.
- Active Mode. In this mode, traffic cannot be handled asynchronously since DataSunrise must make real-time decisions about operations. No queues are used, and performance directly correlates with CPU capability. During traffic spikes, DataSunrise processes as much as possible, which may increase latency in your application queries.
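The difference between the two modes during a spike can be approximated with a simple per-second model. This is a sketch under stated assumptions, not a description of DataSunrise internals: the function names, the one-second tick, and the single `buffer_size` value (standing in for the queue high-water-mark settings) are all hypothetical. In the active-mode function, the backlog represents work that has not yet been processed, which the application experiences as added latency.

```python
def simulate_passive(arrivals, capacity, buffer_size):
    """Passive mode model: unparsed messages wait in a bounded buffer.

    arrivals: operations arriving in each one-second tick.
    capacity: operations the parser can handle per tick.
    Returns the number of operations that skipped parsing (missed in the
    audit) because the buffer was full. Application latency is unaffected.
    """
    backlog = 0
    missed = 0
    for arriving in arrivals:
        backlog += arriving
        backlog -= min(backlog, capacity)
        if backlog > buffer_size:
            missed += backlog - buffer_size
            backlog = buffer_size
    return missed


def simulate_active(arrivals, capacity):
    """Active mode model: nothing is skipped, so unprocessed work piles up.

    Returns the worst-case extra delay, in ticks, needed to clear the
    backlog at full capacity.
    """
    backlog = 0
    worst_delay = 0.0
    for arriving in arrivals:
        backlog += arriving
        backlog -= min(backlog, capacity)
        worst_delay = max(worst_delay, backlog / capacity)
    return worst_delay


if __name__ == "__main__":
    spike = [5_000, 20_000, 20_000, 5_000]  # hypothetical load, ops per second
    capacity = 15_350                       # m5.2xlarge figure quoted earlier
    print("missed in passive mode:", simulate_passive(spike, capacity, 4_000))
    print("worst delay in active mode (s):", simulate_active(spike, capacity))
```

The model makes the trade-off explicit: the same spike costs audit completeness in passive mode and query latency in active mode.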
Understanding these mechanisms and settings can help you optimize DataSunrise’s configuration for better handling of high traffic volumes and prevent potential performance bottlenecks.
For more on choosing the right database for audit storage and improving its performance, see our detailed guide, “How to Choose the Database for Audit Storage: A Performance Analysis.”