Fix alerts from the System Diagnostics page.
Possible causes of large incident data include:
Mapping unnecessary fields during incident ingestion
Indexing incident fields that contain a large amount of data (more than 100 KB)
Inefficient tasks storing additional data to incident fields during playbook execution
Solutions:
Modify mapping to reduce the number of incident fields.
Navigate to the incident fields page (Settings → Objects Setup → Incidents → Incident Fields), select a custom incident field, and turn off Make data available for search under the Attributes tab to reduce database storage. This option is not available for system fields.
For playbooks that are storing unnecessary data to incident fields, turn on Quiet Mode for selected tasks or the entire playbook.
Large investigation context data (larger than 1 MB) can slow down playbook execution. A possible cause for large investigation context data is playbooks storing a large amount of data to context.
Solution:
Check for large context data using one or more of the following methods (a scripted size check is sketched after this list):
Open the incident and view the context data.
Run !PrintContext from the CLI.
Run !Print value=${incident} from the CLI.
Manually delete an incident, or delete the incident's context using the !DeleteContext command.
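If you prefer a scripted check, the following is a minimal sketch of an XSOAR automation that reports the serialized size of each top-level context key, so oversized entries stand out. It assumes the standard automation environment in which the server injects the demisto object; the sorting and output format are illustrative, not part of the product.

```python
import json
import demistomock as demisto  # the XSOAR server injects `demisto` at runtime

# Measure the serialized size of each top-level context key so the largest
# contributors to oversized context (more than 1 MB total) are easy to spot.
ctx = demisto.context() or {}
sizes = {key: len(json.dumps(value)) for key, value in ctx.items()}
for key, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    demisto.results("{}: {} bytes".format(key, size))
```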
Memory issues might be caused by enrichment data larger than 1 MB, which can result from:
An integration returning a large amount of data for a single indicator.
Many indicators being enriched.
Solutions:
Modify playbook tasks to limit indicator extraction and enrichment (see the sketch after this list).
Modify integration settings to limit the data returned by indicator enrichment.
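As one illustration of limiting enrichment volume, the sketch below (plain Python, standard library only) deduplicates and caps a candidate indicator list before it is handed to an enrichment task. The values and the cap of 100 are illustrative, not product limits.

```python
# Deduplicate and cap indicators before enrichment so fewer lookups run
# per incident. All values, including the cap of 100, are illustrative.
indicators = ["8.8.8.8", "8.8.8.8", "1.1.1.1", "example.com"]
unique = list(dict.fromkeys(indicators))  # dedupe while preserving order
to_enrich = unique[:100]                  # cap the enrichment volume
print(to_enrich)
```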
If data in the Cortex XSOAR platform is not updating in real time, this may be due to WebSocket disconnects. Common causes of WebSocket disconnects include:
A network proxy terminating the WebSocket connection.
A slow connection between the server and client.
Solutions:
If the network proxy is the cause, adjust network proxy settings to allow WebSocket connections.
Check the latency and throughput between the server and client; a quick connect-time probe is sketched after this list.
Verify the System Diagnostics page does not show alerts for large incidents, context, tasks, etc.
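To get a rough read on the server/client link, a connect-time probe such as the following sketch can help. The host and port are placeholders for your deployment; this measures TCP connect latency only, not WebSocket behavior through the proxy.

```python
import socket
import time

# Time a TCP connection to the XSOAR server's HTTPS port as a rough
# latency check. Replace HOST and PORT with your deployment's values.
HOST, PORT = "xsoar.example.com", 443
start = time.monotonic()
with socket.create_connection((HOST, PORT), timeout=5):
    elapsed_ms = (time.monotonic() - start) * 1000
print("TCP connect to {}:{} took {:.1f} ms".format(HOST, PORT, elapsed_ms))
```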
Running too many containers on a single machine concurrently, or failing to limit container resources, can lead to issues with the Docker service.
Note
The number of containers that can run at the same time, without affecting system performance, depends on a variety of factors, including CPU, memory, storage IOPs, and specific integrations.
Solutions:
Restart the Docker service.
If more than 150 containers are running at the same time, move some of the Docker workloads to other machines, either by adding engines or by adding app servers. See the Docker Hardening Guide. A quick container count check is sketched below.
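A minimal sketch, assuming the docker CLI is on the PATH and the current user is permitted to query the Docker daemon:

```python
import subprocess

# Count running containers via the Docker CLI; `docker ps -q` prints one
# container ID per line for each running container.
result = subprocess.run(["docker", "ps", "-q"],
                        capture_output=True, text=True, check=True)
running = [line for line in result.stdout.splitlines() if line]
print("{} containers running".format(len(running)))
if len(running) > 150:
    print("Consider moving Docker workloads to engines or additional app servers.")
```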
Playbook tasks larger than 150 KB can lead to playbook performance issues. Large playbook tasks result from storing a large amount of data to task inputs and outputs.
Solutions:
Run playbook tasks in Quiet Mode. In Quiet Mode, inputs and outputs are not displayed in the Work Plan view, but are still used during playbook execution.
Check playbooks for inefficient tasks.
Work Plans larger than 3 MB can lead to slow playbook execution, memory spikes, and overall slow UI. Large Work Plans can be caused by:
Playbooks storing a large amount of data to task inputs and outputs.
Many indicators being enriched.
Enrichment integrations returning a large amount of data per indicator.
Playbooks looping over too many arguments during a run.
Solutions:
Each of these resolution steps is optional, and they need not be completed in a specific order.
Mark specific tasks, or entire playbooks, to run in Quiet Mode. In Quiet Mode, inputs and outputs are not displayed, but outputs are still written to context.
Check for inefficient tasks and unnecessary loops in playbooks.
Verify that you are only extracting and enriching the indicators required for investigations.
Purge large Work Plan data directly from the System Diagnostics page.
Latency above 10 ms between components in an Elasticsearch deployment can lead to slow UI performance.
Solutions:
Determine which components exceed the maximum latency and take steps to reduce it.
If components are in different regions or networks, consider moving them to the same network to reduce latency.
Exceeding the maximum percentage of disk space usage allowed by Elasticsearch can lead to multiple issues:
When attempting to save new data (incidents, etc.) to Elasticsearch, new indexes cannot be created if disk usage exceeds the low watermark level set by Elasticsearch. An unable to allocate shards error is displayed and the data is not saved.
If disk usage exceeds the high watermark level set by Elasticsearch, you may not be able to log in to Cortex XSOAR and all data becomes read-only.
Solutions:
Increase disk space on existing data nodes.
Add additional data nodes.
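To see how close each data node is to the watermark levels, you can query Elasticsearch's _cat/allocation API, as in this sketch. The URL is a placeholder; add authentication and TLS handling as your cluster requires.

```python
import urllib.request

# Print per-node disk usage and shard allocation from Elasticsearch.
# `_cat/allocation?v` is a standard Elasticsearch cat API endpoint.
URL = "http://elasticsearch.example.com:9200/_cat/allocation?v"
with urllib.request.urlopen(URL, timeout=10) as response:
    print(response.read().decode("utf-8"))
```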
Storage with insufficient space or insufficient specifications can lead to slow system performance.
Solutions:
If there is insufficient disk space, archive data by month to condense the data.
Verify that storage meets the minimum requirement of SSD with 3,000 dedicated IOPS (applies to BoltDB deployments).
Exceeding the hosted service limit for the number of incidents ingested per day can lead to slow system performance. See Service Limits.
Solution:
Set up pre-processing rules to eliminate duplicate incidents that do not require investigation. See Create Pre-Process Rules for Incidents.
Exceeding the hosted service limit for the total number of stored indicators can lead to slow system performance. See Service Limits.
Solutions:
Delete indicators that are no longer relevant.
Contact your Cortex XSOAR account manager to request a higher storage limit.
Exceeding the hosted service limit for partition data can lead to slow system performance. If your data usage is near or exceeds the hosted service limit, contact customer support to archive your data and free up disk space. See Service Limits.
Search performance is optimized for data stored over the previous three months. Performing high-velocity searches across longer time ranges can cause slow system performance and/or CPU and memory spikes.
Solutions:
Check whether dashboards and widgets use the All times time range, and limit the time range where possible.
Confirm that manual searches are not being run with the All times time range.
On the Incidents page, select the Hide the Panel option to hide widgets and reduce the number of searches.
Check for playbook tasks that execute a query without a specified time range argument, or with a time range that is too broad (an example of a bounded query is sketched after this list).
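For example, a search task can pass an explicit, bounded window instead of defaulting to All times. The sketch below calls the built-in getIncidents command from an automation; the argument names used here (query, fromdate, todate) and the date format are assumptions to verify against your version's command reference.

```python
import demistomock as demisto  # the XSOAR server injects `demisto` at runtime

# Constrain an incident search to a bounded three-month window instead of
# "All times". Argument names and date format are assumptions to verify.
results = demisto.executeCommand("getIncidents", {
    "query": "type:Phishing",
    "fromdate": "2024-01-01T00:00:00Z",
    "todate": "2024-03-31T23:59:59Z",
})
demisto.results(results)
```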
System performance can be impacted by a large number of exclusion list entries. By default, if the number of exclusion list entries exceeds 1000, a yellow alert is triggered on the System Diagnostics page. If the list exceeds 5000 entries, a red alert is triggered.
Solutions:
Check whether you can add a regex instead of specific indicator values. Using a regex reduces the number of individual entries and improves performance (see the sketch after this list).
Verify that exclusion list entries are up to date and remove any that are no longer relevant.
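As an illustration of how one pattern can stand in for many entries, the following sketch (plain Python, illustrative values only) shows a single regex matching a whole family of scanner hostnames that would otherwise each need their own exclusion entry:

```python
import re

# One pattern covers every numbered scanner host, replacing one exclusion
# list entry per hostname. All values here are illustrative.
pattern = re.compile(r"^scanner-\d+\.example\.com$")
for host in ["scanner-1.example.com", "scanner-42.example.com", "mail.example.com"]:
    print(host, "->", bool(pattern.match(host)))
```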
Large audit log files can lead to memory or CPU spikes and impact system performance.
Solution:
Playbook tasks larger than 250 KB have Quiet Mode enabled by default, to limit the impact on system performance. In Quiet Mode, playbook tasks are not indexed, so you cannot search on the results of specific tasks. Entries are not written to the War Room, and inputs and outputs are not presented for Work Plan tasks. All of the information is still available in the context data, and errors and warnings are written to the War Room.
If you want to prevent tasks larger than 250 KB from automatically being put into Quiet Mode, set the task.auto.quiet.mode.enabled server configuration to false.
If you want to change the threshold for playbook tasks, increase the value (in bytes) of the task.size.limit.bytes server configuration. You must set a value larger than 250000 (that is, larger than 250 KB).
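For example, to disable automatic Quiet Mode, or to raise the task size threshold to roughly 500 KB, you would add key/value pairs along these lines under Settings → About → Troubleshooting → Add Server Configuration (the 500000 value is illustrative; any threshold must exceed 250000):

```
task.auto.quiet.mode.enabled: false
task.size.limit.bytes: 500000
```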