regex-master
Expert-level regular expressions covering character classes, quantifiers, groups, lookahead/lookbehind, backreferences, PCRE vs RE2 vs POSIX differences, catastrophic backtracking, and language-specific implementations in Python, JavaScript, and Go.
Installation
npx clawhub@latest install regex-master
Documentation
Regex Master
Regular expressions are a domain-specific language for pattern matching embedded in almost
every programming language. A well-crafted regex can replace 30 lines of parsing code; a
poorly crafted one can take down a server (ReDoS). The key skills are: knowing the engine
you're working with, understanding greedy vs lazy vs possessive quantifiers, and recognizing
when regex is the wrong tool.
Core Mental Model
A backtracking regex engine (PCRE, Python's re, JavaScript) tries to match the pattern against
the input string, character by character, and returns to the most recent choice point when a
path fails. Automaton-based engines such as RE2 follow all candidate paths at once and never
backtrack. Understanding backtracking is the key to understanding both correctness and
performance. Greedy quantifiers consume as much as possible then back off; lazy quantifiers
consume as little as possible then expand. Possessive quantifiers and atomic groups disable
backtracking for a sub-pattern, which makes them your main tool for preventing catastrophic
backtracking.
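The backing-off behaviour is easy to observe from Python. The sketch below uses an illustrative sample string (not taken from the text above) to show how a greedy and a lazy quantifier divide the same input between two groups.

```python
import re

sample = "abc123"  # illustrative input

# Greedy: .* first grabs everything, then gives back one character at a
# time until \d+ can match, so \d+ ends up with only the last digit.
greedy = re.match(r"(.*)(\d+)", sample).groups()

# Lazy: .*? starts empty and grows only until \d+ can match, so \d+
# receives the whole run of digits.
lazy = re.match(r"(.*?)(\d+)", sample).groups()

print(greedy)  # ('abc12', '3')
print(lazy)    # ('abc', '123')
```

Both patterns match the same overall text; only the split between the groups differs, which is why quantifier choice matters most when you extract groups.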
Syntax Reference
Character Classes and Anchors
. Any character except newline (unless DOTALL flag)
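\d Digit [0-9] (also other Unicode digits in Python 3 str patterns unless re.ASCII is set)
\D Non-digit
\w Word character [a-zA-Z0-9_] (plus Unicode letters and digits in Python 3 unless re.ASCII is set)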
\W Non-word character
\s Whitespace [ \t\n\r\f\v]
\S Non-whitespace
[abc] Character class: a, b, or c
[^abc] Negated class: anything except a, b, c
[a-z] Range: lowercase letters
[a-zA-Z0-9] Alphanumeric
^ Start of string (or line in MULTILINE mode)
$ End of string (or line in MULTILINE mode)
\b Word boundary (between \w and \W)
\B Non-word boundary
\A Absolute start of string (not affected by MULTILINE)
\Z Absolute end of string (Python; in PCRE and Java, \Z also matches before a final newline and \z is the absolute end)
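A few of these anchors and classes behave differently from what the one-line summaries suggest. The following sketch, written for Python's re module with made-up sample strings, shows the effect of MULTILINE on ^, the difference between $ and \Z, and the Unicode behaviour of \d.

```python
import re

text = "first line\nsecond line\n"  # illustrative input

# Without MULTILINE, ^ anchors only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With MULTILINE, ^ also anchors after each newline.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# $ tolerates one trailing newline; \Z demands the true end of the string.
dollar_hit = re.search(r"line$", text) is not None
abs_end_hit = re.search(r"line\Z", text) is not None

# \d matches any Unicode decimal digit in str patterns unless re.ASCII is set.
arabic_indic = "\u0663\u0664"  # the digits 3 and 4 in Arabic-Indic script
unicode_digits = re.findall(r"\d", arabic_indic)
ascii_digits = re.findall(r"\d", arabic_indic, re.ASCII)

print(starts_default, starts_multiline)          # ['first'] ['first', 'second']
print(dollar_hit, abs_end_hit)                   # True False
print(len(unicode_digits), len(ascii_digits))    # 2 0
```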
Quantifiers — Greedy vs Lazy vs Possessive
Greedy (default): consume maximum, backtrack if needed
* 0 or more
+ 1 or more
? 0 or 1
{n} Exactly n
{n,} n or more
{n,m} Between n and m
Lazy: consume minimum, expand if needed
*? 0 or more (lazy)
+? 1 or more (lazy)
?? 0 or 1 (lazy)
{n,m}? n to m (lazy)
Possessive (PCRE, Java, Python 3.11+): consume maximum, NO backtracking
*+ 0 or more possessive
++ 1 or more possessive
?+ 0 or 1 possessive
(?>...) Atomic group (same as possessive for the group)
import re
text = "<b>bold</b> and <i>italic</i>"
# Greedy: matches longest possible
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>'] ← too greedy
# Lazy: matches shortest possible
re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>'] ← as expected
# Better: character class that excludes >
re.findall(r"<[^>]+>", text)
# ['<b>', '</b>', '<i>', '</i>'] ← fast, no backtracking
Groups — Capturing, Non-Capturing, Named
# Capturing group: ( )
# Matches and captures for backreference or extraction
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-03-14")
m.group(1) # "2026"
m.group(2) # "03"
m.group(3) # "14"
# Non-capturing group: (?: )
# Grouping without capturing (faster, cleaner)
re.match(r"(?:https?|ftp)://([^/]+)", "https://api.moltbotden.com/v1")
# Only captures the host, not the scheme
# Named groups: (?P<name>...) in Python and Go, (?<name>...) in JS (also accepted by Go 1.22+)
pattern = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
m = pattern.match("2026-03-14")
m.group("year") # "2026"
m.groupdict() # {"year": "2026", "month": "03", "day": "14"}
# Alternation within group
log_text = "ERROR: disk full; warning: retrying" # sample input
re.findall(r"\b(?:error|warning|critical)\b", log_text, re.IGNORECASE)
# ['ERROR', 'warning']
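One practical consequence of capturing versus non-capturing groups is what re.findall returns. The sketch below uses an invented sample string to show the three cases side by side.

```python
import re

ranges = "pages 10-12 and 40-45"  # illustrative input

# No groups: findall returns the full text of each match.
whole = re.findall(r"\d+-\d+", ranges)
# Capturing groups: findall returns one tuple of groups per match.
captured = re.findall(r"(\d+)-(\d+)", ranges)
# Non-capturing groups structure the pattern without changing the result.
non_capturing = re.findall(r"(?:\d+)-(?:\d+)", ranges)

print(whole)          # ['10-12', '40-45']
print(captured)       # [('10', '12'), ('40', '45')]
print(non_capturing)  # ['10-12', '40-45']
```

If you need both the full match and the groups, re.finditer gives you Match objects and avoids the ambiguity.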
Lookahead and Lookbehind Assertions
# Positive lookahead: (?=...) — matches if followed by
# Find amounts (numbers followed by a currency code)
re.findall(r"\d+(?=\s*USD)", "100 USD and 200 EUR")
# ["100"] — only the USD amount
# Negative lookahead: (?!...) — matches if NOT followed by
# Match identifiers that start with "agent" unless "Error" follows immediately
re.findall(r"\bagent(?!Error)\w*", "agent agentError agentPool")
# ["agent", "agentPool"]
# Positive lookbehind: (?<=...) — matches if preceded by
# Find amounts preceded by dollar sign
re.findall(r"(?<=\$)\d+(?:\.\d{2})?", "$100 and $200.50")
# ["100", "200.50"]
# Negative lookbehind: (?<!...) — matches if NOT preceded by
# Match .js file names but not .min.js
re.findall(r"^[\w.-]+(?<!\.min)\.js$", "app.js\napp.min.js\nlib.js", re.MULTILINE)
# ["app.js", "lib.js"]
# Combining assertions
# Match a word that is preceded by "agent: " and followed by " ("
re.findall(r"(?<=agent: )\w+(?= \()", "agent: optimus (active)")
# ["optimus"]
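Because lookaheads consume nothing, several of them can be stacked at the same position to express independent conditions. A common illustration is a password rule; the specific policy below (a digit, a lowercase letter, an uppercase letter, at least 8 characters) is an assumption chosen for the example, and `is_strong` is a hypothetical helper name.

```python
import re

# Each lookahead scans forward from position 0 and consumes nothing, so the
# three conditions are checked independently before .{8,} matches the text.
STRONG = re.compile(r"(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}")

def is_strong(candidate: str) -> bool:
    """Return True if the candidate satisfies every stacked condition."""
    return STRONG.fullmatch(candidate) is not None

print(is_strong("Passw0rdX"))  # True
print(is_strong("password1"))  # False: no uppercase letter
print(is_strong("Sh0rt"))      # False: fewer than 8 characters
```

This only works on engines with lookahead support; on RE2-family engines the same checks have to be written as separate matches.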
Backreferences
# Backreference: \1 (by number) or (?P=name) (by name)
# Match repeated words
re.findall(r"\b(\w+)\s+\1\b", "the the quick brown fox fox")
# ["the", "fox"]
# Named backreference
re.search(r"<(?P<tag>\w+)>.*?</(?P=tag)>", "<b>bold text</b>")
# In substitution: \1 or \g<name>
re.sub(r"\b(\w+)\s+\1\b", r"\1", "the the quick") # remove duplicated words
# "the quick"
re.sub(r"(?P<first>\w+)\s+(?P<last>\w+)", r"\g<last>, \g<first>", "John Doe")
# "Doe, John"
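Backreferences are also useful for delimiters that must match each other. The sketch below, with an invented input line, extracts quoted values where the closing quote has to be the same character as the opening one.

```python
import re

# Group 1 captures whichever quote character opens the value; \1 then
# requires that same character to close it.
QUOTED = re.compile(r"""(["'])(.*?)\1""")

line = '''name = "it's fine" and alt = 'say "hi"' '''  # illustrative input
values = [m.group(2) for m in QUOTED.finditer(line)]

print(values)  # ["it's fine", 'say "hi"']
```

Note that this does not handle escaped quotes inside a value; it is a demonstration of \1, not a full string-literal parser.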
Atomic Groups and Possessive Quantifiers
# Problem: nested quantifiers cause catastrophic backtracking
# Pattern: (a+)+$ against n "a" characters followed by one "b"
# A backtracking engine tries on the order of 2^n ways to split the a's before failing
import re
dangerous = re.compile(r"(a+)+$")
# dangerous.match("a" * 30 + "b") # ← effectively hangs; each extra "a" roughly doubles the time
# Fix 1: Possessive quantifier (PCRE, Java, Python 3.11+ re, or the third-party regex module)
# (a++)+$ prevents backtracking into the inner +
# Fix 2: Atomic group (same engine support as possessive quantifiers)
import regex # third-party: pip install regex
safe = regex.compile(r"(?>a+)+$")
# Fix 3: Rewrite to avoid ambiguity (best approach)
fixed = re.compile(r"a+$") # same intent, unambiguous
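To see the growth rather than take it on faith, you can time a failing match at a few small sizes. This is a rough sketch: `seconds_to_fail` is a hypothetical helper, the sizes are kept small on purpose, and the absolute timings depend on your machine and Python version.

```python
import re
import time

dangerous = re.compile(r"(a+)+$")
fixed = re.compile(r"a+$")

def seconds_to_fail(pattern, n: int) -> float:
    """Time one failed match against n 'a' characters followed by 'b'."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = pattern.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # neither pattern can match: the string ends in 'b'
    return elapsed

# Keep n small: the nested pattern's cost roughly doubles with each extra 'a'.
for n in (12, 16, 20):
    slow = seconds_to_fail(dangerous, n)
    fast = seconds_to_fail(fixed, n)
    print(f"n={n:2d}  nested: {slow:.4f}s  rewritten: {fast:.6f}s")
```

On a backtracking engine the nested column should climb steeply while the rewritten column stays flat.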
PCRE vs RE2 vs POSIX
| Feature | PCRE | RE2 | POSIX |
|---|---|---|---|
| Named groups | ✅ `(?P<name>...)` or `(?<name>...)` | ✅ `(?P<name>...)` | ❌ |
| Lookahead | ✅ | ❌ | ❌ |
| Lookbehind | ✅ (fixed-width alternatives) | ❌ | ❌ |
| Backreferences | ✅ | ❌ | ✅ (BRE; an extension in ERE) |
| Possessive | ✅ | N/A | ❌ |
| Atomic groups | ✅ | N/A | ❌ |
| Performance | O(2^n) worst | O(n) guaranteed | Implementation-dependent |
| Used in | PHP; similar backtracking engines in Perl, Python, Java, JavaScript | Go, RE2, Rust (regex) | grep, sed |
RE2 key constraints:
- Guaranteed O(n) time in the input length, so it is safe for untrusted input
- No backreferences (by design: they cannot be matched in linear time)
- No lookahead or lookbehind
- No possessive quantifiers or atomic groups (not needed with a linear-time engine)
PCRE-style backtracking engines (PCRE, Python re, JavaScript) key differences:
- Support backreferences and lookaround; lookbehind must be fixed-width in Python re, while JavaScript allows variable-width lookbehind
- Can be exploited with ReDoS if used on untrusted input
- Python's re supports possessive quantifiers and atomic groups from 3.11; on older versions use the third-party `regex` module
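Engine differences usually surface at compile time, which makes them cheap to probe. The sketch below checks what Python's re accepts; `compiles` is a hypothetical helper, and the patterns are illustrative.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed_width = compiles(r"(?<=USD )\d+")        # lookbehind of exactly 4 characters
variable_width = compiles(r"(?<=USD\s+)\d+")   # width depends on the input
backreference = compiles(r"(\w)\1")            # accepted here, rejected by RE2-family engines

print(fixed_width, variable_width, backreference)  # True False True
```

Running the same probe against the engine you deploy on (for example in a unit test) catches portability problems before a pattern reaches production.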
When NOT to Use Regex
❌ Don't use regex for:
HTML/XML parsing
<div class="(\w+)">.*?</div> — fails on nested tags, attributes
✅ Use: BeautifulSoup (Python), DOMParser (JS), html.parser
Nested structures (JSON, S-expressions, balanced parens)
(?:\([^)]*\))+ — can't handle nested input such as ((inner ((deep))))
✅ Use: JSON.parse() (JS), json.loads() (Python), or a proper parser
Dates with complex rules (leap years, month lengths)
✅ Use: datetime.strptime(), date-fns, Temporal
Email validation (the address grammar in RFC 5321/5322 is far more complex than any short pattern)
✅ Use: simple heuristic regex + send verification email
URLs (there is no universally correct URL regex)
✅ Use: URL() constructor (JS), urllib.parse (Python)
CSV with quoted fields containing commas
"field1","field with, comma","field3"
✅ Use: csv module (Python), papaparse (JS)
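Two of the alternatives above, sketched with Python's standard library. The CSV line is the one shown in the list; the URL is an invented example.

```python
import csv
import io
from urllib.parse import urlsplit

line = '"field1","field with, comma","field3"'

naive = line.split(",")                       # also splits inside the quoted field
parsed = next(csv.reader(io.StringIO(line)))  # honours the quoting rules

parts = urlsplit("https://example.com:8443/v1/items?limit=10#top")

print(len(naive))  # 4
print(parsed)      # ['field1', 'field with, comma', 'field3']
print(parts.hostname, parts.port, parts.path, parts.query, parts.fragment)
# example.com 8443 /v1/items limit=10 top
```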
Performance Pitfalls — Catastrophic Backtracking
# Catastrophic patterns (avoid on user input):
r"(a+)+" # ← O(2^n) — exponential
r"(a|aa)+" # ← O(2^n) — overlapping alternatives
r"(\w+\s?)+$" # ← O(2^n) — on non-matching string
# The rule: if a quantified group contains another quantifier
# AND the inner and outer patterns can match the same characters
# → potential catastrophic backtracking
# Detecting ReDoS vulnerability:
# 1. Input that almost matches → triggers max backtracking
# 2. Long input of repeating chars + one non-matching char at end
"a" * 30 + "!" # test with your pattern
# Fixes:
# 1. Remove ambiguity: (\w+\s?)+$ → \w+(?:\s\w+)*\s?$
# 2. Use possessive/atomic: (?>a+)+
# 3. Use RE2-based engine for untrusted input
# 4. Set a timeout (Python's re has no timeout parameter; SIGALRM is Unix-only and main-thread-only)
import re
import signal
def timeout_handler(signum, frame):
    raise TimeoutError("regex took too long")
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1) # 1 second timeout
try:
    result = re.match(pattern, user_input)
finally:
    signal.alarm(0) # always cancel the pending alarm
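When you remove ambiguity by rewriting, it is worth checking that the rewrite accepts the same strings. The sketch below compares the ambiguous word-list pattern with an unambiguous rewrite on a handful of short, invented samples, then runs only the rewrite on a long attack string.

```python
import re

ambiguous = re.compile(r"(\w+\s?)+$")
rewritten = re.compile(r"\w+(?:\s\w+)*\s?$")

# Keep these short: the ambiguous pattern is only safe to run on small inputs.
samples = ["one", "one two", "one two ", "one  two", "one two!", "", " one"]
agreement = [bool(ambiguous.match(s)) == bool(rewritten.match(s)) for s in samples]

# The rewrite gives up after a linear scan, so a long near-miss is harmless.
attack = "a" * 5000 + "!"
attack_result = rewritten.match(attack)

print(all(agreement))  # True
print(attack_result)   # None
```

A sample list like this is evidence, not proof, of equivalence; for critical patterns consider property-based tests over generated inputs.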
Language-Specific: Python re Module
import re
# Flags
re.IGNORECASE # re.I — case-insensitive
re.MULTILINE # re.M — ^ and $ match line boundaries
re.DOTALL # re.S — dot matches newline
re.VERBOSE # re.X — allow whitespace and comments
re.ASCII # re.A — \w, \d, etc. match ASCII only (not Unicode)
# Functions
re.match(pattern, string) # match at START of string only
re.search(pattern, string) # match ANYWHERE in string
re.findall(pattern, string) # return list of all matches (group contents if the pattern has groups)
re.finditer(pattern, string) # return iterator of Match objects
re.sub(pattern, repl, string) # substitute matches
re.split(pattern, string) # split by pattern
# Compile for reuse (clearer, and skips the pattern-cache lookup in loops)
EMAIL_RE = re.compile(
    r"""
    (?P<local>[a-zA-Z0-9._%+\-]+)   # local part
    @
    (?P<domain>[a-zA-Z0-9.\-]+)     # domain
    \.
    (?P<tld>[a-zA-Z]{2,})           # TLD
    """,
    re.VERBOSE,
)
# Named groups + verbose mode
def parse_email(email: str) -> dict[str, str] | None:
    m = EMAIL_RE.fullmatch(email) # fullmatch rejects trailing junk that match() would accept
    return m.groupdict() if m else None
# Practical example: log parser
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
    r"\s+(?P<logger>[\w.]+)"
    r"\s+(?P<message>.+)"
)
def parse_log_line(line: str) -> dict | None:
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
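The functions listed above differ mainly in where they anchor and what they return. A short sketch with invented inputs:

```python
import re

# match anchors at the start; search scans; fullmatch needs the whole string.
at_start = re.match(r"\d+", "abc123")           # None: the string starts with letters
anywhere = re.search(r"\d+", "abc123").group()  # "123"
whole = re.fullmatch(r"\d+", "123abc")          # None: trailing letters remain

# split drops the separators unless the pattern captures them.
fields = re.split(r"\s*,\s*", "a , b,c")
# sub accepts a function that receives each Match and returns the replacement.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")
# re.escape makes arbitrary text safe to embed in a pattern.
literal = re.compile(re.escape("1.5 (approx)"))

print(at_start, anywhere, whole)  # None 123 None
print(fields)                     # ['a', 'b', 'c']
print(doubled)                    # 6 apples, 20 pears
print(bool(literal.search("about 1.5 (approx)")), bool(literal.search("about 1x5 (approx)")))
# True False
```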
Language-Specific: JavaScript
// Regex literals and constructor
const emailRe = /^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$/;
const dynamic = new RegExp(`^${escapeRegex(prefix)}.*$`, "i");
// Flags: i (case-insensitive), g (global), m (multiline), s (dotAll), u (unicode), d (indices)
// matchAll with the g flag: iterate over all matches with named groups
const LOG_RE = /(?<ts>\d{4}-\d{2}-\d{2}) (?<level>\w+): (?<msg>.+)/g;
for (const match of logText.matchAll(LOG_RE)) {
  console.log(match.groups.ts, match.groups.level, match.groups.msg);
}
// Named groups in replace
const formatted = "2026-03-14".replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  "$<day>/$<month>/$<year>"
);
// "14/03/2026"
// String.matchAll: returns iterator of match objects (requires /g flag)
const urls = [...text.matchAll(/https?:\/\/[^\s>]+/g)].map(m => m[0]);
// Escape user input before inserting into regex
function escapeRegex(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}
Language-Specific: Go (RE2)
import "regexp"
// Go uses RE2 — no backreferences, guaranteed O(n)
var agentIDRe = regexp.MustCompile(`^[a-z0-9-]{3,64}Regex Master
Regular expressions are a domain-specific language for pattern matching embedded in almost
every programming language. A well-crafted regex can replace 30 lines of parsing code; a
poorly crafted one can take down a server (ReDoS). The key skills are: knowing the engine
you're working with, understanding greedy vs lazy vs possessive quantifiers, and recognizing
when regex is the wrong tool.
Core Mental Model
A regex engine works by trying to match the pattern against the input string, character by
character, using backtracking when a path fails. Understanding backtracking is the key to
understanding both correctness and performance. Greedy quantifiers consume as much as
possible then back off; lazy quantifiers consume as little as possible then expand. Possessive
quantifiers and atomic groups disable backtracking for a sub-pattern — they're your main
tool for preventing catastrophic backtracking.
Syntax Reference
Character Classes and Anchors
. Any character except newline (unless DOTALL flag)
\d Digit [0-9]
\D Non-digit
\w Word character [a-zA-Z0-9_]
\W Non-word character
\s Whitespace [ \t\n\r\f\v]
\S Non-whitespace
[abc] Character class: a, b, or c
[^abc] Negated class: anything except a, b, c
[a-z] Range: lowercase letters
[a-zA-Z0-9] Alphanumeric
^ Start of string (or line in MULTILINE mode)
$ End of string (or line in MULTILINE mode)
\b Word boundary (between \w and \W)
\B Non-word boundary
\A Absolute start of string (not affected by MULTILINE)
\Z Absolute end of string
Quantifiers — Greedy vs Lazy vs Possessive
Greedy (default): consume maximum, backtrack if needed
* 0 or more
+ 1 or more
? 0 or 1
{n} Exactly n
{n,} n or more
{n,m} Between n and m
Lazy: consume minimum, expand if needed
*? 0 or more (lazy)
+? 1 or more (lazy)
?? 0 or 1 (lazy)
{n,m}? n to m (lazy)
Possessive (PCRE/Java): consume maximum, NO backtracking
*+ 0 or more possessive
++ 1 or more possessive
?+ 0 or 1 possessive
(?>...) Atomic group (same as possessive for the group)
import re
text = "<b>bold</b> and <i>italic</i>"
# Greedy: matches longest possible
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>'] ← too greedy
# Lazy: matches shortest possible
re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>'] ← as expected
# Better: character class that excludes >
re.findall(r"<[^>]+>", text)
# ['<b>', '</b>', '<i>', '</i>'] ← fast, no backtracking
Groups — Capturing, Non-Capturing, Named
# Capturing group: ( )
# Matches and captures for backreference or extraction
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-03-14")
m.group(1) # "2026"
m.group(2) # "03"
m.group(3) # "14"
# Non-capturing group: (?: )
# Grouping without capturing (faster, cleaner)
re.match(r"(?:https?|ftp)://([^/]+)", "https://api.moltbotden.com/v1")
# Only captures the host, not the scheme
# Named groups: (?P<name>...) in Python, (?<name>...) in JS/Go
pattern = re.compile(
r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
m = pattern.match("2026-03-14")
m.group("year") # "2026"
m.groupdict() # {"year": "2026", "month": "03", "day": "14"}
# Alternation within group
re.findall(r"\b(?:error|warning|critical)\b", log_text, re.IGNORECASE)
Lookahead and Lookbehind Assertions
# Positive lookahead: (?=...) — matches if followed by
# Find prices (numbers followed by a currency symbol)
re.findall(r"\d+(?=\s*USD)", "100 USD and 200 EUR")
# ["100"] — only the USD amount
# Negative lookahead: (?!...) — matches if NOT followed by
# Match "agent" not followed by "Error"
re.findall(r"\bagent(?!Error)\b\w*", text)
# Positive lookbehind: (?<=...) — matches if preceded by
# Find amounts preceded by dollar sign
re.findall(r"(?<=\$)\d+(?:\.\d{2})?", "$100 and $200.50")
# ["100", "200.50"]
# Negative lookbehind: (?<!...) — matches if NOT preceded by
# Match .js but not .min.js
re.findall(r"(?<!\.min)\.js$", "app.js\napp.min.js\nlib.js", re.MULTILINE)
# ["app.js", "lib.js"]
# Combining assertions
# Match a word that is preceded by "agent: " and followed by " ("
re.findall(r"(?<=agent: )\w+(?= \()", "agent: optimus (active)")
# ["optimus"]
Backreferences
# Backreference: \1 (by number) or (?P=name) (by name)
# Match repeated words
re.findall(r"\b(\w+)\s+\1\b", "the the quick brown fox fox")
# ["the", "fox"]
# Named backreference
re.search(r"(?P<tag>\w+)>.*?</(?P=tag)>", "<b>bold text</b>")
# In substitution: \1 or \g<name>
re.sub(r"(\w+)\s+\1", r"\1", "the the quick") # remove duplicates
# "the quick"
re.sub(r"(?P<first>\w+)\s+(?P<last>\w+)", r"\g<last>, \g<first>", "John Doe")
# "Doe, John"
Atomic Groups and Possessive Quantifiers
# Problem: nested quantifiers cause catastrophic backtracking
# Pattern: (a+)+ against "aaaaab"
# Engine tries 2^n combinations before failing
import re, time
dangerous = re.compile(r"(a+)+$")
# dangerous.match("aaaaaaaaaaaaaaaab") # ← will hang!
# Fix 1: Possessive quantifier (PCRE only — not Python's re)
# (a++)+ would prevent backtracking on inner +
# Fix 2: Atomic group (not in Python re, available in regex module)
import regex
safe = regex.compile(r"(?>a+)+$")
# Fix 3: Rewrite to avoid ambiguity (best approach)
fixed = re.compile(r"a+$") # same intent, unambiguous
PCRE vs RE2 vs POSIX
Feature PCRE RE2 POSIX
Named groups ✅ __INLINE_CODE_0__ ✅ __INLINE_CODE_1__ ❌
Lookahead ✅ ✅ ❌
Lookbehind ✅ ✅ (fixed-width) ❌
Backreferences ✅ ❌ ✅
Possessive ✅ N/A ❌
Atomic groups ✅ N/A ❌
Performance O(2^n) worst O(n) guaranteed O(n)
Used in Python, PHP, Perl, Java Go, RE2, Rust (regex) grep, sed
RE2 key constraints:
- Guaranteed O(n) time — safe for user input
- No backreferences (by design — prevent exponential backtracking)
- Fixed-width lookbehind only
- No possessive quantifiers or atomic groups (not needed with linear engine)
PCRE (Python re, JavaScript) key differences:
- Supports backreferences and variable-width lookbehind
- Can be exploited with ReDoS if used on untrusted input
- Use the `regex` module in Python for possessive quantifiers
When NOT to Use Regex
❌ Don't use regex for:
HTML/XML parsing
<div class="(\w+)">.*?</div> — fails on nested tags, attributes
✅ Use: BeautifulSoup (Python), DOMParser (JS), html.parser
Nested structures (JSON, S-expressions, balanced parens)
(?:\([^)]*\))+ — can't handle (\(inner (\(deep\))\))
✅ Use: json.parse(), proper parser
Dates with complex rules (leap years, month lengths)
✅ Use: datetime.strptime(), date-fns, Temporal
Email validation (RFC 5321 is 100+ pages)
✅ Use: simple heuristic regex + send verification email
URLs (there is no universally correct URL regex)
✅ Use: URL() constructor (JS), urllib.parse (Python)
CSV with quoted fields containing commas
"field1","field with, comma","field3"
✅ Use: csv module (Python), papaparse (JS)
Performance Pitfalls — Catastrophic Backtracking
# Catastrophic patterns (avoid on user input):
r"(a+)+" # ← O(2^n) — exponential
r"(a|aa)+" # ← O(2^n) — overlapping alternatives
r"(\w+\s?)+$" # ← O(2^n) — on non-matching string
# The rule: if a quantified group contains another quantifier
# AND the inner and outer patterns can match the same characters
# → potential catastrophic backtracking
# Detecting ReDoS vulnerability:
# 1. Input that almost matches → triggers max backtracking
# 2. Long input of repeating chars + one non-matching char at end
"a" * 30 + "!" # test with your pattern
# Fixes:
# 1. Remove ambiguity: (\w+\s?)+ → \w+(\s\w+)*
# 2. Use possessive/atomic: (?>a+)+
# 3. Use RE2-based engine for untrusted input
# 4. Set timeout (Python's re doesn't support timeout natively)
import signal
def timeout_handler(signum, frame): raise TimeoutError()
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1) # 1 second timeout
try:
result = re.match(pattern, user_input)
finally:
signal.alarm(0)
Language-Specific: Python re Module
import re
# Flags
re.IGNORECASE # re.I — case-insensitive
re.MULTILINE # re.M — ^ and $ match line boundaries
re.DOTALL # re.S — dot matches newline
re.VERBOSE # re.X — allow whitespace and comments
re.ASCII # re.A — \w, \d, etc. match ASCII only (not Unicode)
# Functions
re.match(pattern, string) # match at START of string only
re.search(pattern, string) # match ANYWHERE in string
re.findall(pattern, string) # return list of all matches
re.finditer(pattern, string) # return iterator of Match objects
re.sub(pattern, repl, string) # substitute matches
re.split(pattern, string) # split by pattern
# Compile for reuse (faster in loops)
EMAIL_RE = re.compile(
r"""
(?P<local>[a-zA-Z0-9._%+\-]+) # local part
@
(?P<domain>[a-zA-Z0-9.\-]+) # domain
\.
(?P<tld>[a-zA-Z]{2,}) # TLD
""",
re.VERBOSE,
)
# Named groups + verbose mode
def parse_email(email: str) -> dict | None:
m = EMAIL_RE.match(email)
return m.groupdict() if m else None
# Practical example: log parser
LOG_PATTERN = re.compile(
r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
r"\s+(?P<logger>[\w.]+)"
r"\s+(?P<message>.+)"
)
def parse_log_line(line: str) -> dict | None:
m = LOG_PATTERN.match(line.strip())
return m.groupdict() if m else None
Language-Specific: JavaScript
// Regex literals and constructor
const emailRe = /^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$/;
const dynamic = new RegExp(`^${escapeRegex(prefix)}.*Regex Master
Regular expressions are a domain-specific language for pattern matching embedded in almost
every programming language. A well-crafted regex can replace 30 lines of parsing code; a
poorly crafted one can take down a server (ReDoS). The key skills are: knowing the engine
you're working with, understanding greedy vs lazy vs possessive quantifiers, and recognizing
when regex is the wrong tool.
Core Mental Model
A regex engine works by trying to match the pattern against the input string, character by
character, using backtracking when a path fails. Understanding backtracking is the key to
understanding both correctness and performance. Greedy quantifiers consume as much as
possible then back off; lazy quantifiers consume as little as possible then expand. Possessive
quantifiers and atomic groups disable backtracking for a sub-pattern — they're your main
tool for preventing catastrophic backtracking.
Syntax Reference
Character Classes and Anchors
. Any character except newline (unless DOTALL flag)
\d Digit [0-9]
\D Non-digit
\w Word character [a-zA-Z0-9_]
\W Non-word character
\s Whitespace [ \t\n\r\f\v]
\S Non-whitespace
[abc] Character class: a, b, or c
[^abc] Negated class: anything except a, b, c
[a-z] Range: lowercase letters
[a-zA-Z0-9] Alphanumeric
^ Start of string (or line in MULTILINE mode)
$ End of string (or line in MULTILINE mode)
\b Word boundary (between \w and \W)
\B Non-word boundary
\A Absolute start of string (not affected by MULTILINE)
\Z Absolute end of string
Quantifiers — Greedy vs Lazy vs Possessive
Greedy (default): consume maximum, backtrack if needed
* 0 or more
+ 1 or more
? 0 or 1
{n} Exactly n
{n,} n or more
{n,m} Between n and m
Lazy: consume minimum, expand if needed
*? 0 or more (lazy)
+? 1 or more (lazy)
?? 0 or 1 (lazy)
{n,m}? n to m (lazy)
Possessive (PCRE/Java): consume maximum, NO backtracking
*+ 0 or more possessive
++ 1 or more possessive
?+ 0 or 1 possessive
(?>...) Atomic group (same as possessive for the group)
import re
text = "<b>bold</b> and <i>italic</i>"
# Greedy: matches longest possible
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>'] ← too greedy
# Lazy: matches shortest possible
re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>'] ← as expected
# Better: character class that excludes >
re.findall(r"<[^>]+>", text)
# ['<b>', '</b>', '<i>', '</i>'] ← fast, no backtracking
Groups — Capturing, Non-Capturing, Named
# Capturing group: ( )
# Matches and captures for backreference or extraction
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-03-14")
m.group(1) # "2026"
m.group(2) # "03"
m.group(3) # "14"
# Non-capturing group: (?: )
# Grouping without capturing (faster, cleaner)
re.match(r"(?:https?|ftp)://([^/]+)", "https://api.moltbotden.com/v1")
# Only captures the host, not the scheme
# Named groups: (?P<name>...) in Python, (?<name>...) in JS/Go
pattern = re.compile(
r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
m = pattern.match("2026-03-14")
m.group("year") # "2026"
m.groupdict() # {"year": "2026", "month": "03", "day": "14"}
# Alternation within group
re.findall(r"\b(?:error|warning|critical)\b", log_text, re.IGNORECASE)
Lookahead and Lookbehind Assertions
# Positive lookahead: (?=...) — matches if followed by
# Find prices (numbers followed by a currency symbol)
re.findall(r"\d+(?=\s*USD)", "100 USD and 200 EUR")
# ["100"] — only the USD amount
# Negative lookahead: (?!...) — matches if NOT followed by
# Match words starting with "agent", unless "Error" follows immediately
re.findall(r"\bagent(?!Error)\w*", "agent agentError agentSmith")
# ["agent", "agentSmith"]
# Positive lookbehind: (?<=...) — matches if preceded by
# Find amounts preceded by dollar sign
re.findall(r"(?<=\$)\d+(?:\.\d{2})?", "$100 and $200.50")
# ["100", "200.50"]
# Negative lookbehind: (?<!...) — matches if NOT preceded by
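# Match file names ending in .js but not .min.js
# (the lookbehind is zero-width, so the name itself must be part of the pattern to be returned)
re.findall(r"^\S+(?<!\.min)\.js$", "app.js\napp.min.js\nlib.js", re.MULTILINE)
# ["app.js", "lib.js"]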
# Combining assertions
# Match a word that is preceded by "agent: " and followed by " ("
re.findall(r"(?<=agent: )\w+(?= \()", "agent: optimus (active)")
# ["optimus"]
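Because lookaheads consume nothing, several of them can be stacked at the same position to express independent conditions. The following is a sketch of that idea; the rule set (uppercase, lowercase, digit, length of 8 or more) and the names STRONG_RE and is_strong are assumptions chosen for illustration:

```python
import re

# Each lookahead scans from the start without consuming input, so the
# conditions are independent of each other and of character order.
STRONG_RE = re.compile(
    r"^(?=.*[A-Z])"   # at least one uppercase letter
    r"(?=.*[a-z])"    # at least one lowercase letter
    r"(?=.*\d)"       # at least one digit
    r".{8,}$"         # total length of 8 or more
)

def is_strong(candidate: str) -> bool:
    """Return True if the candidate satisfies every lookahead condition."""
    return STRONG_RE.match(candidate) is not None

print(is_strong("Passw0rdX"))  # True
print(is_strong("password1"))  # False (no uppercase letter)
```

Only the final .{8,} consumes characters; each lookahead is checked at position 0 and then discarded.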
Backreferences
# Backreference: \1 (by number) or (?P=name) (by name)
# Match repeated words
re.findall(r"\b(\w+)\s+\1\b", "the the quick brown fox fox")
# ["the", "fox"]
# Named backreference
re.search(r"<(?P<tag>\w+)>.*?</(?P=tag)>", "<b>bold text</b>") # closing tag must repeat the opening name
# In substitution: \1 or \g<name>
re.sub(r"(\w+)\s+\1", r"\1", "the the quick") # remove duplicates
# "the quick"
re.sub(r"(?P<first>\w+)\s+(?P<last>\w+)", r"\g<last>, \g<first>", "John Doe")
# "Doe, John"
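A common use of numbered backreferences is requiring a closing delimiter to equal the opening one. A small sketch, with an illustrative pattern name and sample sentence that are not from the original:

```python
import re

# Group 1 captures the opening quote; \1 requires the same character to close.
QUOTED_RE = re.compile(r"""(["'])(.*?)\1""")

def quoted_strings(text: str) -> list[str]:
    """Return the contents of every single- or double-quoted span."""
    return [m.group(2) for m in QUOTED_RE.finditer(text)]

print(quoted_strings("""He said "it's fine" and 'ok'"""))
# ["it's fine", 'ok']
```

The apostrophe inside the double-quoted span does not end the match, because \1 is bound to the double quote for that attempt.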
Atomic Groups and Possessive Quantifiers
# Problem: nested quantifiers cause catastrophic backtracking
# Pattern: (a+)+$ against "aaaaab"
# On a near-miss the engine tries on the order of 2^n ways to split the a's before failing
import re, time
dangerous = re.compile(r"(a+)+$")
# dangerous.match("a" * 30 + "b") # ← effectively hangs: work roughly doubles with each extra "a"
# Fix 1: Possessive quantifier (PCRE, Java, and Python's re since 3.11)
# (a++)+$ prevents backtracking into the inner +
# Fix 2: Atomic group (built into re since Python 3.11; the third-party regex module covers older versions)
import regex
safe = regex.compile(r"(?>a+)+$")
# Fix 3: Rewrite to avoid ambiguity (best approach)
fixed = re.compile(r"a+$") # same intent, unambiguous
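Before swapping an ambiguous pattern for a simpler one, it can help to confirm that the two accept the same inputs. The sketch below brute-forces every short string over a two-letter alphabet; the helper all_strings and the length limit of 8 are illustrative choices, and the inputs are kept short deliberately because the nested pattern is exponential on near-misses:

```python
import re
from itertools import product
from typing import Iterator

nested = re.compile(r"(a+)+$")   # ambiguous: many ways to split a run of a's
flat = re.compile(r"a+$")        # one way to match the same run

def all_strings(alphabet: str, max_len: int) -> Iterator[str]:
    """Yield every string over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# Collect any input on which the two patterns disagree
disagreements = [
    s for s in all_strings("ab", 8)
    if bool(nested.match(s)) != bool(flat.match(s))
]
print(disagreements)  # []
```

An empty list is evidence, not proof, that the rewrite preserves behavior, but it catches most mistakes cheaply.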
PCRE vs RE2 vs POSIX
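| Feature | PCRE | RE2 | POSIX |
|---|---|---|---|
| Named groups | ✅ `(?<name>...)` or `(?P<name>...)` | ✅ `(?P<name>...)` | ❌ |
| Lookahead | ✅ | ❌ | ❌ |
| Lookbehind | ✅ | ❌ | ❌ |
| Backreferences | ✅ | ❌ | ✅ (BRE) |
| Possessive | ✅ | N/A | ❌ |
| Atomic groups | ✅ | N/A | ❌ |
| Performance | O(2^n) worst | O(n) guaranteed | Implementation-dependent |
| Used in | PHP (PCRE itself); Perl, Python, Java (PCRE-like backtracking engines) | Go, RE2, Rust (regex) | grep, sed |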
RE2 key constraints:
- Guaranteed O(n) time — safe for user input
- No backreferences (by design — prevent exponential backtracking)
- No lookahead or lookbehind
- No possessive quantifiers or atomic groups (not needed with linear engine)
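Backtracking engines (PCRE, Python re, JavaScript) key differences:
- Support backreferences and lookaround; lookbehind width limits vary (Python re requires fixed width, JavaScript allows variable width)
- Can be exploited with ReDoS if used on untrusted input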
- Use the `regex` module in Python for possessive quantifiers
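Engine limits are easiest to see by asking the engine directly. The sketch below probes the lookbehind rules of Python's built-in re by attempting to compile a few patterns; the helper name compiles and the sample patterns are illustrative:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's built-in `re` accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<=\$)\d+"))         # True: lookbehind is exactly one character
print(compiles(r"(?<=USD |EUR )\d+"))  # True: both alternatives are four characters
print(compiles(r"(?<=\$\s*)\d+"))      # False: \s* makes the width variable
```

The same probe style works for any feature in the comparison above when it is unclear what a given runtime supports.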
When NOT to Use Regex
❌ Don't use regex for:
HTML/XML parsing
<div class="(\w+)">.*?</div> — fails on nested tags, attributes
✅ Use: BeautifulSoup (Python), DOMParser (JS), html.parser
Nested structures (JSON, S-expressions, balanced parens)
\([^)]*\) stops at the first ")", so it mis-parses (outer (inner (deep)))
✅ Use: JSON.parse() (JS), json.loads() (Python), or a proper parser
Dates with complex rules (leap years, month lengths)
✅ Use: datetime.strptime(), date-fns, Temporal
Email validation (the full address grammar in RFC 5321/5322 is too complex for a practical regex)
✅ Use: simple heuristic regex + send verification email
URLs (there is no universally correct URL regex)
✅ Use: URL() constructor (JS), urllib.parse (Python)
CSV with quoted fields containing commas
"field1","field with, comma","field3"
✅ Use: csv module (Python), papaparse (JS)
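For the CSV and URL cases, the standard-library replacements are short. A sketch using the quoted-field line from the list above and a sample URL; only csv and urllib.parse from the standard library are assumed:

```python
import csv
from urllib.parse import urlsplit

# csv.reader accepts any iterable of lines and handles quoted commas
line = '"field1","field with, comma","field3"'
fields = next(csv.reader([line]))
print(fields)  # ['field1', 'field with, comma', 'field3']

# urlsplit breaks a URL into components without any hand-written pattern
parts = urlsplit("https://api.moltbotden.com/v1")
print(parts.scheme, parts.hostname, parts.path)  # https api.moltbotden.com /v1
```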
Performance Pitfalls — Catastrophic Backtracking
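# Catastrophic patterns (avoid on user input). The blow-up appears when the rest
# of the pattern fails after the repeated group, e.g. a trailing $ on a near-miss:
r"(a+)+$" # ← O(2^n): nested quantifiers
r"(a|aa)+$" # ← exponential: overlapping alternatives
r"(\w+\s?)+$" # ← O(2^n) on a non-matching string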
# The rule: if a quantified group contains another quantifier
# AND the inner and outer patterns can match the same characters
# → potential catastrophic backtracking
# Detecting ReDoS vulnerability:
# 1. Input that almost matches → triggers max backtracking
# 2. Long input of repeating chars + one non-matching char at end
"a" * 30 + "!" # test with your pattern
# Fixes:
# 1. Remove ambiguity: (\w+\s?)+$ → \w+(?:\s\w+)*\s?$
# 2. Use possessive/atomic: (?>a+)+
# 3. Use RE2-based engine for untrusted input
# 4. Set a timeout. Python's re has no timeout parameter (the third-party regex
#    module accepts timeout=); the SIGALRM approach works on Unix, main thread only
import signal
def timeout_handler(signum, frame):
    raise TimeoutError()
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1) # 1 second timeout
try:
    result = re.match(pattern, user_input) # pattern and user_input come from the caller
finally:
    signal.alarm(0)
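The near-miss test described above can be turned into a small timing probe. This is a sketch under stated assumptions: the helper time_match is hypothetical, the input sizes are kept small so the risky pattern finishes quickly, and absolute timings will vary by machine, so none are shown:

```python
import re
import time

def time_match(pattern: str, text: str) -> float:
    """Return the seconds spent on a single re.match call."""
    start = time.perf_counter()
    re.match(pattern, text)
    return time.perf_counter() - start

risky = r"(\w+\s?)+$"
rewritten = r"\w+(?:\s\w+)*\s?$"   # same language, one way to match each word

# Near-miss inputs: word characters followed by one character that cannot match.
for n in (12, 14, 16):
    probe = "a" * n + "!"
    print(n, f"{time_match(risky, probe):.5f}s", f"{time_match(rewritten, probe):.5f}s")
```

If the first timing column grows by a roughly constant factor per step while the second stays flat, the pattern is a ReDoS candidate and the rewrite is doing its job.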
Language-Specific: Python re Module
import re
# Flags
re.IGNORECASE # re.I — case-insensitive
re.MULTILINE # re.M — ^ and $ match line boundaries
re.DOTALL # re.S — dot matches newline
re.VERBOSE # re.X — allow whitespace and comments
re.ASCII # re.A — \w, \d, etc. match ASCII only (not Unicode)
# Functions
re.match(pattern, string) # match at START of string only
re.search(pattern, string) # match ANYWHERE in string
re.findall(pattern, string) # return list of all matches
re.finditer(pattern, string) # return iterator of Match objects
re.sub(pattern, repl, string) # substitute matches
re.split(pattern, string) # split by pattern
# Compile for reuse (faster in loops)
EMAIL_RE = re.compile(
r"""
(?P<local>[a-zA-Z0-9._%+\-]+) # local part
@
(?P<domain>[a-zA-Z0-9.\-]+) # domain
\.
(?P<tld>[a-zA-Z]{2,}) # TLD
""",
re.VERBOSE,
)
# Named groups + verbose mode
def parse_email(email: str) -> dict | None:
    m = EMAIL_RE.fullmatch(email) # fullmatch rejects trailing junk such as "a@b.com!!"
    return m.groupdict() if m else None
# Practical example: log parser
LOG_PATTERN = re.compile(
r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
r"\s+(?P<logger>[\w.]+)"
r"\s+(?P<message>.+)"
)
def parse_log_line(line: str) -> dict | None:
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
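The difference between re.match and re.search listed above, plus re.fullmatch, which the list does not mention, is easiest to see side by side. A minimal sketch with a hypothetical helper named which_match:

```python
import re

def which_match(pattern: str, text: str) -> dict:
    """Report which of match/search/fullmatch succeed for a pattern."""
    return {
        "match": re.match(pattern, text) is not None,          # anchored at the start only
        "search": re.search(pattern, text) is not None,        # anywhere in the string
        "fullmatch": re.fullmatch(pattern, text) is not None,  # must cover the whole string
    }

print(which_match(r"\d+", "123abc"))
# {'match': True, 'search': True, 'fullmatch': False}
print(which_match(r"\d+", "abc123"))
# {'match': False, 'search': True, 'fullmatch': False}
print(which_match(r"\d+", "123"))
# {'match': True, 'search': True, 'fullmatch': True}
```

For validation, re.fullmatch is usually the safest choice because it needs no explicit ^ and $ anchors.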
Language-Specific: JavaScript
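const dynamicRe = new RegExp("error", "i"); // constructor form (illustrative pattern); flags are the second argument
// Flags: i (case-insensitive), g (global), m (multiline), s (dotAll), u (unicode), d (indices), y (sticky)
// matchAll with the g flag: iterate all matches with named groups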
const LOG_RE = /(?<ts>\d{4}-\d{2}-\d{2}) (?<level>\w+): (?<msg>.+)/g;
for (const match of logText.matchAll(LOG_RE)) {
console.log(match.groups.ts, match.groups.level, match.groups.msg);
}
// Named groups in replace
const formatted = "2026-03-14".replace(
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
"$<day>/$<month>/$<year>"
);
// "14/03/2026"
// String.matchAll: returns iterator of match objects (requires /g flag)
const urls = [...text.matchAll(/https?:\/\/[^\s>]+/g)].map(m => m[0]);
// Escape user input before inserting into regex
function escapeRegex(str) {
return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}
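In Python, the counterpart to the escapeRegex helper above is the standard re.escape function. The sketch below shows why escaping matters; the helper name count_literal and the sample text are illustrative:

```python
import re

def count_literal(needle: str, haystack: str) -> int:
    """Count non-overlapping literal occurrences of `needle` in `haystack`."""
    return len(re.findall(re.escape(needle), haystack))

text = "1+1=2, 11"

# Unescaped, "+" is a quantifier: the pattern means "one or more 1s, then a 1"
print(re.findall(r"1+1", text))    # ['11']

# Escaped, "+" is matched as a literal character
print(count_literal("1+1", text))  # 1
```

The unescaped version silently matches the wrong substring, which is why user-supplied text should be escaped before it is embedded in a pattern.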
Language-Specific: Go (RE2)
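import (
"regexp"
"strings"
)
// Compile once at package level; MustCompile panics if the pattern is invalid
var agentIDRe = regexp.MustCompile(`^[a-z0-9][a-z0-9-]{1,62}[a-z0-9]$`)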
func ValidateAgentID(id string) bool {
return agentIDRe.MatchString(id)
}
// Named groups (SubexpNames)
var logRe = regexp.MustCompile( // package-level declarations need var, not :=
`(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<msg>.+)`,
)
func ParseLog(line string) map[string]string {
match := logRe.FindStringSubmatch(line)
if match == nil { return nil }
result := make(map[string]string)
for i, name := range logRe.SubexpNames() {
if i != 0 && name != "" {
result[name] = match[i]
}
}
return result
}
// ReplaceAllStringFunc for complex substitutions (inside a function; re and input are defined elsewhere)
result := re.ReplaceAllStringFunc(input, func(s string) string {
return strings.ToUpper(s)
})
Practical Patterns
# Email: pragmatic (not RFC-perfect — verify by sending)
EMAIL = r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$"
# URL extraction (handles common cases)
URL = r"https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+"
# Agent ID validation
AGENT_ID = r"^[a-z0-9][a-z0-9\-]{1,62}[a-z0-9]$"
# ISO 8601 calendar date (checks the shape only; still accepts 2026-02-31)
ISO_DATE = r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$"
# Semantic version
SEMVER = r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<pre>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?(?:\+(?P<build>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?$"
# Log line with structured data
LOG_LINE = re.compile(
r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)"
r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
r"\s+\[(?P<req_id>[a-f0-9\-]+)\]"
r"\s+(?P<message>.+)$"
)
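# CSV field tokenizer (handles commas in quotes and backslash escapes; skips empty fields
# and mis-reads RFC 4180 doubled quotes, so prefer the csv module for real data)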
CSV_FIELD = re.compile(r'"(?:[^"\\]|\\.)*"|[^,\n]+')
# Markdown headings
MD_HEADING = re.compile(r"^(?P<level>#{1,6})\s+(?P<text>.+)$", re.MULTILINE)
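The ISO date pattern above checks only the shape of the string, so it pairs well with a calendar check. A sketch of that division of labor, assuming the standard datetime module; the names ISO_DATE_RE and is_real_date are illustrative:

```python
import re
from datetime import date

ISO_DATE_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def is_real_date(value: str) -> bool:
    """Shape check with the regex, calendar check with datetime."""
    if ISO_DATE_RE.fullmatch(value) is None:
        return False
    year, month, day = (int(part) for part in value.split("-"))
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

print(is_real_date("2026-03-14"))  # True
print(is_real_date("2026-02-31"))  # False: passes the regex, fails the calendar
print(is_real_date("2024-02-29"))  # True: leap year
```

The regex rejects malformed input cheaply, and datetime handles month lengths and leap years that would be painful to encode in a pattern.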
Anti-Patterns
# ❌ Parsing HTML with regex
re.findall(r"<div class=\"content\">(.*?)</div>", html)
# ✅ Use BeautifulSoup or lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.find("div", class_="content").text
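# ❌ Passing the pattern string to re.match on every iteration of a hot loop
for line in lines:
    if re.match(r"ERROR: \d+", line): # re caches compiled patterns, but pays a cache lookup per call
        ...
# ✅ Compile once and reuse the pattern object
ERROR_RE = re.compile(r"ERROR: \d+")
for line in lines:
    if ERROR_RE.match(line):
        ...
# ❌ Nested quantifiers on overlapping patterns
r"(\w+)+$" # catastrophic on a near-miss input
r"([a-zA-Z0-9]+)+$" # catastrophic on a near-miss input
# ✅ Remove inner quantifier or use atomic group
# ❌ Anchoring incorrectly
re.match(r"error", text) # only matches at the start; use re.search to find it anywhere
re.search(r"^error$", text) # works, but re.fullmatch(r"error", text) states the intent directly
# ❌ Capturing when you don't need the captures (extra bookkeeping, shifted group numbers)
r"(https?)://(.*)" # capturing groups
# ✅ Drop groups you don't need; use (?:...) only where grouping is required
r"(?:https?|ftp)://.*" # non-capturing group for the alternation
# ❌ Using regex for a simple contains check
if re.search(r"error", text):
    ...
# ✅ Plain substring test (call text.lower() first for a case-insensitive check)
if "error" in text:
    ...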
Quick Reference
Greedy: .* .+ matches max, backtracks if needed
Lazy: .*? .+? matches min, expands if needed
Possessive: .*+ .++ matches max, NO backtracking (PCRE, Java, Python 3.11+)
Groups: (capture), (?:non-capture), (?P<name>named)
Lookahead: (?=ahead) (?!not-ahead) — zero-width, not consumed
Lookbehind: (?<=behind) (?<!not-behind), zero-width; fixed-width in Python re, unsupported in RE2
Backref: \1 by number, (?P=name) in Python, $<name> in JS replace
ReDoS: (x+)+ or (x|x)+ patterns → catastrophic with non-matching input
RE2 vs PCRE: RE2 = O(n) guaranteed, no backrefs or lookaround; PCRE = full features, risk of ReDoS
Python re: re.compile + VERBOSE flag for complex patterns
JS: /g flag + matchAll() for all matches with groups
Go: regexp.MustCompile, SubexpNames() for named group extraction
When to stop: HTML, JSON, CSV with quotes, nested structures → use proper parsers