Custom data patterns: Guardrails and syntax guide

Cortex XSIAM 3.x Documentation

Product
Cortex XSIAM
Creation date
2025-07-15
Last date published
2026-06-16
Category
Administrator Guide
Abstract

Details and examples about the capabilities and limitations with regular expressions in Cortex Cloud Data Classification.

This guide explains the capabilities and limitations of creating custom classifiers. Understanding these guardrails will help you create effective and functional data patterns.

Guardrails

This section details the configuration rules, the logic behind these requirements, and how they affect pattern detection.

One regex per detector
  • Rule: Each custom detector supports a single regular expression.

  • Why: This simplifies the configuration and ensures that each detector has a clear, singular purpose.

  • Impact: If you need to match multiple different patterns (for example, different formats of an ID), either create a separate custom detector for each pattern or combine the patterns into a single regex using the alternation operator | (pipe), provided that the result does not violate the complexity limits. Example: format1|format2
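
As a minimal sketch of the alternation approach, the following Python snippet combines two ID formats into one pattern. Python's re module stands in for the product's engine here, and both formats (two letters plus five digits, and EMP- plus six digits) are hypothetical examples, not formats defined by Cortex Cloud Data Classification.

```python
import re

# Two hypothetical ID formats combined into a single pattern with alternation.
# The non-capturing group keeps the alternatives together.
COMBINED = re.compile(r"(?:[A-Z]{2}\d{5}|EMP-\d{6})")

def find_ids(text):
    """Return every substring that matches either format."""
    return COMBINED.findall(text)

sample = "badge AB12345 issued; employee id EMP-004217 updated"
print(find_ids(sample))
```

If the combined pattern grows to many alternatives, separate detectors may be the safer choice, because heavy branching can trigger a validation error.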

Context words are mandatory
  • Rule: You must provide at least one context word for your classifier.

  • Why: The Data Classification engine uses context words as a first-pass filter. It runs your potentially expensive regular expression (regex) only if one of the context words is found within the proximity you defined. This ensures high performance across large datasets.

  • Impact: If you don't provide context words, or if the context words don't appear near your target data, the regex does not execute, and no match is found.
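
The two-pass idea can be sketched in Python as follows. This is an illustration of the concept only, not the product's implementation: the scan function, its parameters, the case-insensitive context-word search, and the character-based proximity measure are all assumptions made for the example.

```python
import re

def scan(text, pattern, context_words, proximity=50):
    """Return regex matches that have a context word within `proximity` characters.

    Hypothetical helper: the name, parameters, and proximity measure are
    assumptions for illustration, not the product's actual behavior.
    """
    # First pass: cheap literal search for the context words.
    hits = [(m.start(), m.end())
            for word in context_words
            for m in re.finditer(re.escape(word), text, re.IGNORECASE)]
    if not hits:
        return []  # no context word found, so the regex never runs

    # Second pass: run the regex and keep matches near a context word.
    results = []
    for m in re.finditer(pattern, text):
        if any(m.start() - end <= proximity and begin - m.end() <= proximity
               for begin, end in hits):
            results.append(m.group())
    return results

print(scan("Employee ID: AB12345 was reissued.", r"[A-Z]{2}\d{5}", ["employee id"]))
print(scan("Order ref AB12345 shipped.", r"[A-Z]{2}\d{5}", ["employee id"]))
```

The first call finds the ID because "Employee ID" appears next to it. The second call returns nothing, even though the regex alone would match, because no context word is present.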

Avoid start-of-string and end-of-string anchors (^ and $)
  • Rule: Do not use the start-of-string (^) or end-of-string ($) anchors.

  • Why: In many regex tools, ^ and $ match the start or end of each line. However, in Cortex Cloud Data Classification, they match the start or end of the entire text being scanned. Because your target data (such as an ID or key) is usually embedded in the middle of a file or sentence, using these anchors causes the match to fail.

  • Impact:

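    • ^[A-Z]{2}\d{5} fails to find "AB12345" unless the ID is at the very start of the scanned text.

    • [A-Z]{2}\d{5}$ fails to find "AB12345" unless the ID is at the very end of the scanned text.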
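
The effect can be reproduced with Python's re module, which (without the MULTILINE flag) also treats ^ and $ as whole-text anchors. Python is used here only as a stand-in for the product's engine, and the sample text is made up for the example.

```python
import re

text = "Customer record: AB12345 was updated."

# Anchored patterns only match at the very start or very end of the text.
anchored_start = re.search(r"^[A-Z]{2}\d{5}", text)
anchored_end = re.search(r"[A-Z]{2}\d{5}$", text)

# The unanchored pattern matches anywhere in the text.
unanchored = re.search(r"[A-Z]{2}\d{5}", text)

print(anchored_start)      # None
print(anchored_end)        # None
print(unanchored.group())  # AB12345
```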

Avoid lookaround and backreference constructs
  • Rule: Cortex Cloud Data Classification does not support the following:

    • lookahead ((?=...))

    • lookbehind ((?<=...))

    • backreference (\1)

  • Why: Lookaround and backreference constructs can lead to exponential execution time. Cortex Cloud Data Classification uses the Rust regex engine, which guarantees linear-time execution, O(n), to prevent ReDoS (Regular Expression Denial of Service) attacks and to ensure predictable performance. To require that specific text appears near a match, which is the usual purpose of lookaround, use context words instead.

  • Impact: You must rewrite patterns to avoid these constructs. For example, instead of using lookbehind to ensure that a prefix exists, include the prefix in the match and use a capturing group for the data you want to extract.
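
A minimal sketch of this rewrite, using Python's re module as a stand-in for the product's engine. The INV- prefix and the sample text are hypothetical, and how the product reports the captured group is not described in this guide.

```python
import re

text = "Invoice INV-48213 and ticket TCK-99120"

# Unsupported style, shown for comparison only: lookbehind for the prefix.
#   (?<=INV-)\d{5}

# Supported style: match the prefix and capture only the digits.
PREFIXED = re.compile(r"INV-(\d{5})")

m = PREFIXED.search(text)
print(m.group(0))  # INV-48213 (whole match, prefix included)
print(m.group(1))  # 48213 (captured data only)
```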

Regex complexity limits
  • Rule: Cortex Cloud Data Classification enforces limits on regex complexity to prevent performance issues.

    • Unbounded repetitions: If possible, avoid unbounded repetitions such as .* or .+, or at least ensure that they are not nested.

    • Nesting depth: Deeply nested patterns, such as ((((a)b)c)d), are limited.

    • Branching: Too many alternations (such as a|b|c|...) or nested alternations can trigger validation errors.

  • Why: Complex patterns with excessive nesting or branching can lead to "combinatorial explosion," where the number of possible matches grows exponentially, causing the scanner to hang or crash.

  • Impact: If your regex is too complex, Cortex Cloud Data Classification rejects it with a validation error. To resolve the error, simplify your pattern by reducing nesting or breaking it into smaller components.
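
One way to simplify a pattern is to replace an unbounded repetition with explicit, bounded pieces. The sketch below uses Python's re module as a stand-in for the product's engine. This guide does not state numeric limits, so the bounds ({0,2} and {8,40}) and the api_key format are assumptions chosen for illustration.

```python
import re

text = 'api_key = "k3y-ABCD1234" and label = "prod"'

# Unbounded: .* is greedy and runs to the last quote in the text.
unbounded = re.compile(r'api_key.*"')

# Bounded: short, explicit repetitions and a limited character set.
bounded = re.compile(r'api_key\s{0,2}=\s{0,2}"([\w-]{8,40})"')

print(unbounded.search(text).group())  # over-matches through "prod"
print(bounded.search(text).group(1))   # k3y-ABCD1234
```

The bounded version is also more precise: it stops at the closing quote of the key instead of consuming the rest of the text.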

Syntax: Supported vs. unsupported

This section lists the specific regular expression characters and groupings that are allowed or restricted for use in custom patterns.

  • Supported syntax

    • Character classes: [a-z], [0-9], \d, \w, \s

    • Groupings:

      • Capturing: (...)

      • Noncapturing: (?:...)

    • Alternation: Pipe | (OR operator). Example: cat|dog

    • Case insensitivity: (?i) flag. Example: (?i)pattern matches "Pattern", "PATTERN", and so on.

  • Unsupported syntax

    • lookahead: (?=...), (?!...)

    • lookbehind: (?<=...), (?<!...)

    • backreference: \1, \2
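
Before you paste a pattern into the product, a quick local pre-check can flag the unsupported constructs listed above. The helper below is a hypothetical, heuristic sketch in Python, not a product feature. It can report false positives (for example, when the characters appear inside a character class), and the product's own validation remains the authority.

```python
import re

# Constructs listed as unsupported in this guide.
UNSUPPORTED = {
    "lookahead": re.compile(r"\(\?[=!]"),
    "lookbehind": re.compile(r"\(\?<[=!]"),
    # A backslash followed by a digit 1-9, not itself escaped by a backslash.
    "backreference": re.compile(r"(?<!\\)(?:\\\\)*\\[1-9]"),
}

def unsupported_constructs(pattern):
    """Return the names of unsupported constructs that appear in `pattern`."""
    return [name for name, rx in UNSUPPORTED.items() if rx.search(pattern)]

print(unsupported_constructs(r"(?<=ID: )\d{5}"))        # ['lookbehind']
print(unsupported_constructs(r"(\w)\1"))                # ['backreference']
print(unsupported_constructs(r"(?i)(?:cat|dog)\d{2}"))  # []
```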

Examples: Do's and don'ts
Example 1: Matching an ID (anchors)

Goal: Match an ID that starts with 2 letters followed by 5 digits (such as "XY12345").

  • Do: [a-zA-Z]{2}\d{5}

    Reason: This allows the pattern to match anywhere in the text.

  • Don't: ^[a-zA-Z]{2}\d{5} or [a-zA-Z]{2}\d{5}$

    Reason: The ^ and $ anchors force the match to be at the start or end of the entire file. As a result, the pattern misses IDs inside a sentence or JSON object.

Example 2: Case insensitivity

Goal: Match the word "Confidential" regardless of case.

  • Do: (?i)confidential

    Reason: The (?i) flag enables case-insensitive matching for the pattern.

  • Don't: [C|c][O|o][N|n]...

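    Reason: This is inefficient and hard to read. In addition, | inside a character class is a literal pipe, so [C|c] also matches the "|" character.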
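
Both points can be checked with a short Python sketch. Python's re module is a stand-in for the product's engine, and the sample strings are made up for the example.

```python
import re

# The (?i) flag makes the whole pattern case-insensitive.
flagged = re.compile(r"(?i)confidential")
print(bool(flagged.search("CONFIDENTIAL - internal only")))  # True
print(bool(flagged.search("Marked Confidential")))           # True

# Inside a character class, | is a literal character, not alternation.
print(bool(re.fullmatch(r"[C|c]", "|")))  # True
print(bool(re.fullmatch(r"[Cc]", "|")))   # False
```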

Example 3: Testing your pattern

Goal: Verify that your pattern works in the Test Data Pattern box.

  • Do: If a test fails, clear the Test Data Pattern box completely and retype or paste the test string.

    Reason: This ensures that the test environment resets to a stateless condition before processing the new input.

  • Don't: Edit the existing text in the test box and expect immediate results if previous tests failed.