Details and examples of the capabilities and limitations of regular expressions in Cortex Cloud Data Classification.
This guide explains the capabilities and limitations of creating custom classifiers. Understanding these guardrails will help you create effective and functional data patterns.
Guardrails
This section details the configuration rules, the logic behind these requirements, and how they affect pattern detection.
Rule: Each custom detector supports a single regular expression.
Why: This simplifies the configuration and ensures that each detector has a clear, singular purpose.
Impact: If you need to match multiple different patterns (for example, different formats of an ID), you need to create separate custom detectors for each pattern or combine them into a single regex using the alternation operator | (pipe), provided that this does not violate complexity limits. Example:
format1|format2
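As a minimal sketch of the alternation approach, the following Python snippet combines two hypothetical ID formats (two letters plus five digits, or the literal prefix ID- plus six digits) into one pattern. Python's re module is used only to illustrate the shape of the pattern; it is not the engine that Cortex Cloud Data Classification uses, and the formats and function name are assumptions.

```python
import re

# Two hypothetical ID formats combined with the alternation operator.
# The non-capturing group keeps both alternatives together as one unit.
ID_PATTERN = re.compile(r"(?:[A-Z]{2}\d{5}|ID-\d{6})")

def find_ids(text):
    """Return every substring that matches either format."""
    return ID_PATTERN.findall(text)

sample = "Badge AB12345 was replaced by ID-004217 last week."
print(find_ids(sample))  # ['AB12345', 'ID-004217']
```

If the combined pattern grows long or triggers a validation error, separate detectors per format are likely the simpler choice.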
Rule: You must provide at least one context word for your classifier.
Why: The Data Classification engine uses context words as a first-pass filter. It runs your potentially expensive regular expression (regex) only if one of the context words is found within the proximity you defined. This ensures high performance across large datasets.
Impact: If you don't provide context words, or if the context words don't appear near your target data, the regex does not execute, and no match is found.
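The two-stage behavior described above can be sketched roughly as follows. This is an illustration only: the function name, the character-based proximity window, and the case-insensitive context lookup are assumptions, not the product's documented implementation.

```python
import re

def find_with_context(text, pattern, context_words, proximity=50):
    """Return regex matches that have a context word within `proximity` characters."""
    lowered = text.lower()

    # First pass: cheap substring search for every context word occurrence.
    hits = []
    for word in context_words:
        needle = word.lower()
        start = lowered.find(needle)
        while start != -1:
            hits.append((start, start + len(needle)))
            start = lowered.find(needle, start + 1)
    if not hits:
        return []  # no context word found, so the regex is never run

    # Second pass: run the regex and keep only matches near a context word.
    results = []
    for m in re.finditer(pattern, text):
        near = any(
            m.start() - proximity <= end and begin <= m.end() + proximity
            for begin, end in hits
        )
        if near:
            results.append(m.group())
    return results

print(find_with_context("employee id: AB12345", r"[A-Z]{2}\d{5}", ["employee id"]))
print(find_with_context("random AB12345", r"[A-Z]{2}\d{5}", ["employee id"]))
```

The first call returns the ID because the context phrase is adjacent to it; the second returns nothing because no context word is present, which mirrors the "no match is found" outcome described above.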
Rule: Do not use the start-of-string and end-of-string anchors (^ or $).
Why: In many regex tools, ^ and $ match the start or end of each line (for example, when multiline mode is enabled). However, in Cortex Cloud Data Classification, they match the start or end of the entire text being scanned. Since your target data (such as an ID or key) is usually embedded in the middle of a file or sentence, using these anchors causes the match to fail.
Impact:
^[A-Z]{2}\d{5} fails to find "AB12345" unless it appears at the very start of the scanned text.
[A-Z]{2}\d{5}$ fails to find "AB12345" unless it appears at the very end of the scanned text.
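The effect of whole-text anchors can be reproduced in Python, whose re module (without the MULTILINE flag) also treats ^ as the start of the whole string and $ as the end of the string or the position just before a final trailing newline. The sample text is made up, and Python is only a stand-in for the product's engine.

```python
import re

text = "User AB12345 logged in"

unanchored = re.search(r"[A-Z]{2}\d{5}", text)
start_anchored = re.search(r"^[A-Z]{2}\d{5}", text)
end_anchored = re.search(r"[A-Z]{2}\d{5}$", text)

print(unanchored.group())  # AB12345
print(start_anchored)      # None: the ID is not at the start of the text
print(end_anchored)        # None: the ID is not at the end of the text
```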
Rule: Cortex Cloud Data Classification does not support the following:
lookahead ((?=...))
lookbehind ((?<=...))
backreference (\1)
Why: Lookaround and backreference constructs can lead to exponential execution time. Cortex Cloud Data Classification covers lookaround-style checks through context words instead, and it uses the Rust regex engine, which guarantees linear-time execution, O(n), to prevent ReDoS (Regular Expression Denial of Service) attacks and to ensure predictable performance.
Impact: You must rewrite patterns to avoid these constructs. For example, instead of using lookbehind to ensure that a prefix exists, include the prefix in the match and use a capturing group for the data you want to extract.
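A minimal sketch of that rewrite, using a hypothetical token= prefix and an eight-character value. The pattern and function name are illustrative assumptions, and Python's re module stands in for the product's engine.

```python
import re

# Unsupported style, shown only for comparison: (?<=token=)[A-Za-z0-9]{8}
# Supported style: match the prefix as well, and capture only the value.
TOKEN_PATTERN = re.compile(r"token=([A-Za-z0-9]{8})")

def extract_tokens(text):
    """Return the captured value from each prefixed match."""
    return [m.group(1) for m in TOKEN_PATTERN.finditer(text)]

print(extract_tokens("url?token=a1B2c3D4&x=1"))  # ['a1B2c3D4']
```

The full match now includes the prefix, so anything that consumes the whole match rather than the capture group sees token= as well.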
Rule: Cortex Cloud Data Classification enforces limits on regex complexity to prevent performance issues.
Unbounded repetitions: If possible, avoid unbounded repetitions such as .* or .+, or ensure that they are not nested.
Nesting depth: Deeply nested patterns, such as ((((a)b)c)d), are limited.
Branching: Too many alternations (such as a|b|c|...) or nested alternations can trigger validation errors.
Why: Complex patterns with excessive nesting or branching can lead to "combinatorial explosion," where the number of possible matches grows exponentially, causing the scanner to hang or crash.
Impact: If your regex is too complex, Cortex Cloud Data Classification rejects it with a validation error. To resolve the error, simplify your pattern by reducing nesting or breaking it into smaller components.
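This guide does not state the exact limits, so the following Python helper is only a rough, hypothetical way to gauge a pattern before you submit it. It counts group nesting depth and alternation operators while skipping escaped characters and character classes; it is not the validator that the product runs.

```python
def complexity_metrics(pattern):
    """Return (maximum group nesting depth, number of | alternation operators)."""
    depth = max_depth = alternations = 0
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_class:
            # Simplification: a literal ] placed first in a class is not handled.
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
        elif ch == "|":
            alternations += 1
        i += 1
    return max_depth, alternations

print(complexity_metrics("((((a)b)c)d)"))  # (4, 0)
print(complexity_metrics("a|b|c"))         # (0, 2)
```

Comparing these numbers before and after a rewrite gives a quick indication of whether a simplification actually reduced nesting or branching.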
Syntax: Supported vs. unsupported
This section lists the specific regular expression characters and groupings that are allowed or restricted for use in custom patterns.
Supported syntax
Character classes: [a-z], [0-9], \d, \w, \s
Groupings:
  Capturing: (...)
  Noncapturing: (?:...)
Alternation: Pipe | (OR operator). Example: cat|dog
Case insensitivity: (?i) flag. Example: (?i)pattern matches "Pattern", "PATTERN", and so on.
Unsupported syntax
lookahead: (?=...), (?!...)
lookbehind: (?<=...), (?<!...)
backreference: \1, \2
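Before pasting a pattern into a custom detector, a small pre-check along these lines can flag the unsupported constructs listed above. The function is a hypothetical helper written for this illustration, not part of Cortex Cloud, and it is deliberately simplified.

```python
def unsupported_constructs(pattern):
    """Return the names of unsupported constructs found in `pattern`, in order."""
    found = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            # A backslash followed by 1-9 is a backreference such as \1.
            if i + 1 < len(pattern) and pattern[i + 1] in "123456789":
                found.append("backreference")
            i += 2  # skip the escaped character either way
            continue
        # Simplification: character classes such as [(?=] are not special-cased.
        if pattern.startswith(("(?=", "(?!"), i):
            found.append("lookahead")
        elif pattern.startswith(("(?<=", "(?<!"), i):
            found.append("lookbehind")
        i += 1
    return found

print(unsupported_constructs(r"(?<=id=)\d+"))         # ['lookbehind']
print(unsupported_constructs(r"(\w+) \1"))            # ['backreference']
print(unsupported_constructs(r"(?i)(?:ab|cd)\d{2}"))  # []
```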
Examples: Do's and don'ts
Goal: Match an ID that starts with 2 letters followed by 5 digits (such as "XY12345").
Do: [a-zA-Z]{2}\d{5}
Reason: This allows the pattern to match anywhere in the text.
Don't: ^[a-zA-Z]{2}\d{5} or [a-zA-Z]{2}\d{5}$
Reason: The ^ and $ anchors force the match to be at the start or end of the entire file. As a result, the pattern misses IDs inside a sentence or JSON object.
Goal: Match the word "Confidential" regardless of case.
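Do: (?i)confidential
Reason: The (?i) flag enables case-insensitive matching for the pattern.
Don't: [C|c][O|o][N|n]...
Reason: This is inefficient and hard to read. In addition, | inside a character class is a literal pipe rather than an OR operator, so [C|c] also matches the | character.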
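The difference between the two styles can be checked quickly in Python. The bracketed pattern below covers only the first three letters, purely for illustration, and Python's re module is a stand-in for the product's engine.

```python
import re

good = re.compile(r"(?i)confidential")
bad = re.compile(r"[C|c][O|o][N|n]")  # first three letters only, for illustration

print(bool(good.search("Marked CONFIDENTIAL by legal")))  # True
print(bool(good.search("strictly Confidential")))         # True
# Inside brackets the pipe is a literal character, so this also matches:
print(bool(bad.fullmatch("|||")))                         # True
```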
Goal: Verify that your pattern works in the Test Data Pattern box.
Do: If a test fails, clear the Test Data Pattern box completely and retype or paste the test string.
Reason: This ensures that the test environment resets to a stateless condition before processing the new input.
Don't: Edit the existing text in the test box and expect immediate results if previous tests failed.