regex-master

Expert-level regular expressions covering character classes, quantifiers, groups, lookahead/lookbehind, backreferences, PCRE vs RE2 vs POSIX differences, catastrophic backtracking, and language-specific implementations in Python, JavaScript, and Go.

Installation

npx clawhub@latest install regex-master

Documentation

Regex Master

Regular expressions are a domain-specific language for pattern matching embedded in almost
every programming language. A well-crafted regex can replace 30 lines of parsing code; a
poorly crafted one can take down a server (ReDoS). The key skills are: knowing the engine
you're working with, understanding greedy vs lazy vs possessive quantifiers, and recognizing
when regex is the wrong tool.

Core Mental Model

A backtracking regex engine (the kind used by Python, JavaScript, Java, and PCRE) tries to
match the pattern against the input string, character by character, and backtracks when a
path fails. Understanding backtracking is the key to understanding both correctness and
performance. Greedy quantifiers consume as much as possible then back off; lazy quantifiers
consume as little as possible then expand. Possessive quantifiers and atomic groups disable
backtracking for a sub-pattern, which makes them your main tool for preventing catastrophic
backtracking. Automaton-based engines such as RE2 do not backtrack at all and give up some
features in exchange for guaranteed linear time.

Syntax Reference

Character Classes and Anchors

.       Any character except newline (unless DOTALL flag)
\d      Digit [0-9]
\D      Non-digit
\w      Word character [a-zA-Z0-9_]
\W      Non-word character
\s      Whitespace [ \t\n\r\f\v]
\S      Non-whitespace
[abc]   Character class: a, b, or c
[^abc]  Negated class: anything except a, b, c
[a-z]   Range: lowercase letters
[a-zA-Z0-9] Alphanumeric

^       Start of string (or line in MULTILINE mode)
$       End of string (or line in MULTILINE mode)
\b      Word boundary (between \w and \W)
\B      Non-word boundary
\A      Absolute start of string (not affected by MULTILINE)
\Z      Absolute end of string in Python (in PCRE and Java, \Z also allows a final newline; \z is the strict form there)
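
The anchor rules above are easiest to see side by side. The following is a minimal sketch for Python's `re` module; the sample strings are made up for illustration.

```python
import re

text = "first line\nsecond line\n"

# Without MULTILINE, ^ only matches at the very start of the string.
start_default = re.findall(r"^\w+", text)

# With MULTILINE, ^ also matches after each newline.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores MULTILINE and still matches only at the absolute start.
start_absolute = re.findall(r"\A\w+", text, re.MULTILINE)

# $ (no MULTILINE) matches at the end, or just before a trailing newline.
end_dollar = re.findall(r"line$", text)

# \Z in Python matches only at the absolute end, so the trailing "\n" blocks it.
end_absolute = re.findall(r"line\Z", text)

# \b requires a transition between \w and \W (or a string edge).
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(start_default, start_multiline, start_absolute)
print(end_dollar, end_absolute, whole_words)
```

The `$` versus `\Z` difference matters most when validating input that may carry a trailing newline.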

Quantifiers — Greedy vs Lazy vs Possessive

Greedy (default): consume maximum, backtrack if needed
*       0 or more
+       1 or more
?       0 or 1
{n}     Exactly n
{n,}    n or more
{n,m}   Between n and m

Lazy: consume minimum, expand if needed
*?      0 or more (lazy)
+?      1 or more (lazy)
??      0 or 1 (lazy)
{n,m}?  n to m (lazy)

Possessive (PCRE, Java, Python 3.11+ re): consume maximum, NO backtracking
*+      0 or more possessive
++      1 or more possessive
?+      0 or 1 possessive
(?>...) Atomic group (same as possessive for the group)

import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: matches longest possible
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>']  ← too greedy

# Lazy: matches shortest possible
re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← as expected

# Better: character class that excludes >
re.findall(r"<[^>]+>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← fast, no backtracking

Groups — Capturing, Non-Capturing, Named

# Capturing group: ( )
# Matches and captures for backreference or extraction
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-03-14")
m.group(1)  # "2026"
m.group(2)  # "03"
m.group(3)  # "14"

# Non-capturing group: (?: )
# Grouping without capturing (faster, cleaner)
re.match(r"(?:https?|ftp)://([^/]+)", "https://api.moltbotden.com/v1")
# Only captures the host, not the scheme

# Named groups: (?P<name>...) in Python and Go, (?<name>...) in JS (recent Go releases accept both forms)
pattern = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
m = pattern.match("2026-03-14")
m.group("year")   # "2026"
m.groupdict()     # {"year": "2026", "month": "03", "day": "14"}

# Alternation within group
re.findall(r"\b(?:error|warning|critical)\b", log_text, re.IGNORECASE)
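
One behaviour worth checking when adding groups: `re.findall` returns the captured groups rather than the whole match as soon as the pattern contains capturing groups. A small sketch, with a sample string invented for illustration:

```python
import re

sample = "start=2026-03-14 end=2026-04-01"

# No capturing groups: findall returns the whole matches.
whole = re.findall(r"\d{4}-\d{2}-\d{2}", sample)

# Several capturing groups: findall returns one tuple of groups per match.
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", sample)

# Non-capturing groups keep the whole-match behaviour.
grouped = re.findall(r"(?:\d{4})-(?:\d{2})-(?:\d{2})", sample)

# finditer yields Match objects, so the whole match and the groups are both available.
matches = [(m.group(0), m.group(1)) for m in re.finditer(r"(\d{4})-\d{2}-\d{2}", sample)]

print(whole, parts, grouped, matches)
```

Switching a group to `(?: )` is usually the simplest fix when `findall` starts returning fragments instead of full matches.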

Lookahead and Lookbehind Assertions

# Positive lookahead: (?=...) — matches if followed by
# Find prices (numbers followed by a currency symbol)
re.findall(r"\d+(?=\s*USD)", "100 USD and 200 EUR")
# ["100"]  — only the USD amount

# Negative lookahead: (?!...) — matches if NOT followed by
# Match identifiers starting with "agent" unless "Error" follows immediately
re.findall(r"\bagent(?!Error)\w*", "agentError agentReady agent")
# ["agentReady", "agent"]

# Positive lookbehind: (?<=...) — matches if preceded by
# Find amounts preceded by dollar sign
re.findall(r"(?<=\$)\d+(?:\.\d{2})?", "$100 and $200.50")
# ["100", "200.50"]

# Negative lookbehind: (?<!...) — matches if NOT preceded by
# Match .js but not .min.js
# The lookbehind is zero-width, so match the whole line to get the file name back
re.findall(r"^.*(?<!\.min)\.js$", "app.js\napp.min.js\nlib.js", re.MULTILINE)
# ["app.js", "lib.js"]

# Combining assertions
# Match a word that is preceded by "agent: " and followed by " ("
re.findall(r"(?<=agent: )\w+(?= \()", "agent: optimus (active)")
# ["optimus"]
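
Several lookaheads can be stacked at one position to express independent requirements. The sketch below assumes a hypothetical password policy (the rules and the `is_acceptable` helper are illustrative, not part of the original text):

```python
import re

# Each lookahead checks one requirement from the start without consuming input.
PASSWORD_RE = re.compile(
    r"""
    \A
    (?=.*[a-z])      # at least one lowercase letter
    (?=.*[A-Z])      # at least one uppercase letter
    (?=.*\d)         # at least one digit
    (?!.*\s)         # no whitespace anywhere
    .{8,}            # at least 8 characters overall
    \Z
    """,
    re.VERBOSE,
)


def is_acceptable(candidate: str) -> bool:
    """Return True when the candidate satisfies every stacked assertion."""
    return PASSWORD_RE.match(candidate) is not None


for candidate in ["Passw0rdX", "password1", "Passw0rd X", "Pw0rd"]:
    print(candidate, is_acceptable(candidate))
```

Because assertions are zero-width, their order does not change the result; only the final `.{8,}` consumes characters.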

Backreferences

# Backreference: \1 (by number) or (?P=name) (by name)
# Match repeated words
re.findall(r"\b(\w+)\s+\1\b", "the the quick brown fox fox")
# ["the", "fox"]

# Named backreference
re.search(r"<(?P<tag>\w+)>.*?</(?P=tag)>", "<b>bold text</b>")

# In substitution: \1 or \g<name>
re.sub(r"(\w+)\s+\1", r"\1", "the the quick")  # remove duplicates
# "the quick"

re.sub(r"(?P<first>\w+)\s+(?P<last>\w+)", r"\g<last>, \g<first>", "John Doe")
# "Doe, John"
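
A common practical use of a backreference is requiring a closing delimiter to equal the opening one. This is a minimal sketch with an invented sample string; it does not handle escaped quotes:

```python
import re

# (["']) captures the opening quote; \1 requires the same character to close it.
QUOTED_RE = re.compile(r"""(["'])(.*?)\1""")

sample = '''say "it's fine" or 'a "nested" quote' here'''

# group(2) is the text between the matching pair of quotes.
contents = [m.group(2) for m in QUOTED_RE.finditer(sample)]
print(contents)
```

The apostrophe inside the double-quoted part does not end the match, because `\1` only accepts the quote character that opened it.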

Atomic Groups and Possessive Quantifiers

# Problem: nested quantifiers cause catastrophic backtracking
# Pattern: (a+)+$ against "aaaa...ab" (n a's, then a non-matching character)
# A backtracking engine tries on the order of 2^n ways to split the a's before failing
import re

dangerous = re.compile(r"(a+)+$")
# dangerous.match("a" * 40 + "b")  # ← effectively hangs; each extra "a" roughly doubles the time

# Fix 1: Possessive quantifier (PCRE, Java, Python 3.11+ re, regex module)
# (a++)+$ prevents backtracking into the inner +

# Fix 2: Atomic group (PCRE, Java, Python 3.11+ re, regex module)
import regex  # third-party package: pip install regex
safe = regex.compile(r"(?>a+)+$")

# Fix 3: Rewrite to avoid ambiguity (best approach)
fixed = re.compile(r"a+$")  # same intent, unambiguous
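
To see the growth rate without hanging an interpreter, the nested pattern can be timed on deliberately short inputs. This is a rough sketch: the input sizes are arbitrary, and absolute timings depend on the machine and Python version.

```python
import re
import time

dangerous = re.compile(r"(a+)+$")   # nested quantifiers, ambiguous
fixed = re.compile(r"a+$")          # same intent, unambiguous


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


timings = []
for n in (12, 14, 16, 18):          # kept small on purpose
    probe = "a" * n + "b"           # almost matches, then fails at the end
    slow_result, slow_time = time_match(dangerous, probe)
    fast_result, fast_time = time_match(fixed, probe)
    timings.append((n, slow_result, fast_result, slow_time, fast_time))
    print(f"n={n}: nested {slow_time:.4f}s, rewritten {fast_time:.6f}s")
```

Both patterns reject every probe; only the time taken to reach that answer differs, and for the nested pattern it should grow roughly fourfold for every two extra characters.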

PCRE vs RE2 vs POSIX

Feature	PCRE-style (backtracking)	RE2	POSIX
Named groups	✅ `(?P<name>...)`	✅ `(?P<name>...)`	❌
Lookahead	✅	❌	❌
Lookbehind	✅ (fixed-width in Python's re)	❌	❌
Backreferences	✅	❌	✅ (BRE)
Possessive	✅	N/A	❌
Atomic groups	✅	N/A	❌
Performance	O(2^n) worst	O(n) guaranteed	Implementation-dependent
Used in	Python, PHP, Perl, Java	Go, RE2, Rust (regex)	grep, sed

RE2 key constraints:
- Guaranteed O(n) time in the input length, so it is safe for user input
- No backreferences (by design, to rule out exponential backtracking)
- No lookahead or lookbehind
- No possessive quantifiers or atomic groups (not needed with linear engine)

PCRE-style backtracking engines (PCRE, Python re, JavaScript) key differences:
- Support backreferences and lookaround; lookbehind must be fixed-width in Python's re, while JavaScript allows variable-width lookbehind
- Can be exploited with ReDoS if used on untrusted input
- Possessive quantifiers and atomic groups are built into Python's re from 3.11; on older versions use the `regex` module
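
Engine differences tend to surface as compile errors. The sketch below shows Python's `re` rejecting a variable-width lookbehind, and a capture-group rewrite that needs no lookaround at all (so the same idea carries over to RE2-style engines). The sample amounts are invented.

```python
import re

# Python's re rejects lookbehinds whose width is not fixed.
try:
    re.compile(r"(?<=\$\s*)\d+")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# Portable alternative: match the prefix normally and capture only the part you need.
amounts = re.findall(r"\$\s*(\d+)", "$100 and $ 200")

print(lookbehind_error)
print(amounts)
```

Capturing the wanted part instead of asserting the context is often the easiest way to keep a pattern portable across engines.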

When NOT to Use Regex

❌ Don't use regex for:

HTML/XML parsing
  <div class="(\w+)">.*?</div>  — fails on nested tags, attributes
  ✅ Use: BeautifulSoup (Python), DOMParser (JS), html.parser

Nested structures (JSON, S-expressions, balanced parens)
  (?:\([^)]*\))+  fails on nested input such as (outer (inner (deep)))
  ✅ Use: JSON.parse() (JS), json.loads() (Python), or a proper parser

Dates with complex rules (leap years, month lengths)
  ✅ Use: datetime.strptime(), date-fns, Temporal

Email validation (the full address grammar in RFC 5321/5322 is too complex for a practical regex)
  ✅ Use: simple heuristic regex + send verification email

URLs (there is no universally correct URL regex)
  ✅ Use: URL() constructor (JS), urllib.parse (Python)

CSV with quoted fields containing commas
  "field1","field with, comma","field3"
  ✅ Use: csv module (Python), papaparse (JS)
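
For the Python alternatives named above, the standard library already covers the common cases. A short sketch, where the URL query string and the `is_real_date` helper are illustrative additions:

```python
import csv
import io
from datetime import datetime
from urllib.parse import urlsplit

# CSV: the csv module understands quoted fields that contain commas.
row = next(csv.reader(io.StringIO('"field1","field with, comma","field3"\n')))

# URLs: urlsplit separates the components without a hand-written pattern.
parts = urlsplit("https://api.moltbotden.com/v1?limit=10")
host, path, query = parts.hostname, parts.path, parts.query


def is_real_date(value: str) -> bool:
    """Validate a YYYY-MM-DD date, including month lengths and leap years."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


print(row)
print(host, path, query)
print(is_real_date("2024-02-29"), is_real_date("2026-02-29"))
```

A regex such as `\d{4}-\d{2}-\d{2}` is still useful for locating date-shaped text; the parser then decides whether it is a real date.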

Performance Pitfalls — Catastrophic Backtracking

# Catastrophic patterns (avoid on user input):
r"(a+)+"          # ← O(2^n) — exponential
r"(a|aa)+"        # ← O(2^n) — overlapping alternatives
r"(\w+\s?)+$"     # ← O(2^n) — on non-matching string

# The rule: if a quantified group contains another quantifier
# AND the inner and outer patterns can match the same characters
# → potential catastrophic backtracking

# Detecting ReDoS vulnerability:
# 1. Input that almost matches → triggers max backtracking
# 2. Long input of repeating chars + one non-matching char at end
"a" * 30 + "!"  # test with your pattern

# Fixes:
# 1. Remove ambiguity: (\w+\s?)+ → \w+(\s\w+)*
# 2. Use possessive/atomic: (?>a+)+
# 3. Use RE2-based engine for untrusted input
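# 4. Set timeout (Python's re doesn't support timeout natively)
#    signal.alarm is Unix-only and must be used from the main thread
import re
import signal

def timeout_handler(signum, frame):
    raise TimeoutError("regex match timed out")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1)  # 1 second timeout
try:
    result = re.match(pattern, user_input)  # pattern and user_input come from the caller
finally:
    signal.alarm(0)  # always cancel the pending alarm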
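
Fix 1 can be checked by comparing the ambiguous pattern with its rewrite on ordinary inputs, then running only the rewrite on an adversarial probe. A minimal sketch; the sample strings and the probe length are arbitrary choices:

```python
import re
import time

ambiguous = re.compile(r"(\w+\s?)+$")          # nested quantifiers over the same characters
rewritten = re.compile(r"\w+(?:\s\w+)*\s?$")   # each character can be matched in only one way

# On short, ordinary inputs the two patterns should agree.
samples = ["one", "one two three", "trailing space ", "bad!", "two  spaces"]
agreement = [(bool(ambiguous.match(s)), bool(rewritten.match(s))) for s in samples]

# Adversarial probe: long run of word characters plus one non-matching character.
# Only the rewritten pattern is run here; the ambiguous one could take far too long.
probe = "a" * 5000 + "!"
start = time.perf_counter()
probe_result = rewritten.match(probe)
elapsed = time.perf_counter() - start

print(agreement)
print(probe_result, f"{elapsed:.4f}s")
```

The rewrite still backtracks on the probe, but only once per character, so the cost stays linear in the input length.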

Language-Specific: Python re Module

import re

# Flags
re.IGNORECASE  # re.I — case-insensitive
re.MULTILINE   # re.M — ^ and $ match line boundaries
re.DOTALL      # re.S — dot matches newline
re.VERBOSE     # re.X — allow whitespace and comments
re.ASCII       # re.A — \w, \d, etc. match ASCII only (not Unicode)

# Functions
re.match(pattern, string)      # match at START of string only
re.search(pattern, string)     # match ANYWHERE in string
re.findall(pattern, string)    # return list of all matches
re.finditer(pattern, string)   # return iterator of Match objects
re.sub(pattern, repl, string)  # substitute matches
re.split(pattern, string)      # split by pattern

# Compile for reuse (faster in loops)
EMAIL_RE = re.compile(
    r"""
    (?P<local>[a-zA-Z0-9._%+\-]+)  # local part
    @
    (?P<domain>[a-zA-Z0-9.\-]+)     # domain
    \.
    (?P<tld>[a-zA-Z]{2,})           # TLD
    """,
    re.VERBOSE,
)

# Named groups + verbose mode
# fullmatch rejects trailing junk such as "a@b.com!!", which match() would accept
def parse_email(email: str) -> dict | None:
    m = EMAIL_RE.fullmatch(email)
    return m.groupdict() if m else None

# Practical example: log parser
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
    r"\s+(?P<logger>[\w.]+)"
    r"\s+(?P<message>.+)"
)

def parse_log_line(line: str) -> dict | None:
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
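
Three of the functions listed above have less obvious behaviour: `re.sub` accepts a callable as the replacement, `re.finditer` exposes match positions, and `re.split` keeps separators when the pattern has a capturing group. A short sketch with invented sample data:

```python
import re

line = "card 4111-1111-1111-1234 charged; backup 5500-0000-0000-0004 unused"
CARD_RE = re.compile(r"\b(?:\d{4}-){3}(\d{4})\b")

# sub with a callable: build each replacement from the Match object.
masked = CARD_RE.sub(lambda m: "****-****-****-" + m.group(1), line)

# finditer: span() gives (start, end) offsets into the original string.
spans = [m.span() for m in CARD_RE.finditer(line)]
first_card = line[spans[0][0]:spans[0][1]]

# split with a capturing group: the separators are kept in the result.
tokens = re.split(r"\s*([+\-*/])\s*", "3 + 4*5 - 6")

print(masked)
print(first_card)
print(tokens)
```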

Language-Specific: JavaScript

// Regex literals and constructor
const emailRe = /^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$/;
const dynamic = new RegExp(`^${escapeRegex(prefix)}.*$`, "i");

// Flags: i (case-insensitive), g (global), m (multiline), s (dotAll), u (unicode), d (indices)

// matchAll with the global flag: iterate all matches with named groups
const LOG_RE = /(?<ts>\d{4}-\d{2}-\d{2}) (?<level>\w+): (?<msg>.+)/g;
for (const match of logText.matchAll(LOG_RE)) {
  console.log(match.groups.ts, match.groups.level, match.groups.msg);
}

// Named groups in replace
const formatted = "2026-03-14".replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  "$<day>/$<month>/$<year>"
);
// "14/03/2026"

// String.matchAll: returns iterator of match objects (requires /g flag)
const urls = [...text.matchAll(/https?:\/\/[^\s>]+/g)].map(m => m[0]);

// Escape user input before inserting into regex
function escapeRegex(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

Language-Specific: Go (RE2)

import "regexp"

// Go uses RE2 — no backreferences, guaranteed O(n)
var agentIDRe = regexp.MustCompile(`^[a-z0-9-]{3,64}Regex Master
Regular expressions are a domain-specific language for pattern matching embedded in almost
every programming language. A well-crafted regex can replace 30 lines of parsing code; a
poorly crafted one can take down a server (ReDoS). The key skills are: knowing the engine
you're working with, understanding greedy vs lazy vs possessive quantifiers, and recognizing
when regex is the wrong tool.
Core Mental Model
A regex engine works by trying to match the pattern against the input string, character by
character, using backtracking when a path fails. Understanding backtracking is the key to
understanding both correctness and performance. Greedy quantifiers consume as much as
possible then back off; lazy quantifiers consume as little as possible then expand. Possessive
quantifiers and atomic groups disable backtracking for a sub-pattern — they're your main
tool for preventing catastrophic backtracking.
Syntax Reference
Character Classes and Anchors
.       Any character except newline (unless DOTALL flag)
\d      Digit [0-9]
\D      Non-digit
\w      Word character [a-zA-Z0-9_]
\W      Non-word character
\s      Whitespace [ \t\n\r\f\v]
\S      Non-whitespace
[abc]   Character class: a, b, or c
[^abc]  Negated class: anything except a, b, c
[a-z]   Range: lowercase letters
[a-zA-Z0-9] Alphanumeric

^       Start of string (or line in MULTILINE mode)
$       End of string (or line in MULTILINE mode)
\b      Word boundary (between \w and \W)
\B      Non-word boundary
\A      Absolute start of string (not affected by MULTILINE)
\Z      Absolute end of string
Quantifiers — Greedy vs Lazy vs Possessive
Greedy (default): consume maximum, backtrack if needed
*       0 or more
+       1 or more
?       0 or 1
{n}     Exactly n
{n,}    n or more
{n,m}   Between n and m

Lazy: consume minimum, expand if needed
*?      0 or more (lazy)
+?      1 or more (lazy)
??      0 or 1 (lazy)
{n,m}?  n to m (lazy)

Possessive (PCRE/Java): consume maximum, NO backtracking
*+      0 or more possessive
++      1 or more possessive
?+      0 or 1 possessive
(?>...) Atomic group (same as possessive for the group)
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: matches longest possible
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>']  ← too greedy

# Lazy: matches shortest possible
re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← as expected

# Better: character class that excludes >
re.findall(r"<[^>]+>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← fast, no backtracking
Groups — Capturing, Non-Capturing, Named
# Capturing group: ( )
# Matches and captures for backreference or extraction
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-03-14")
m.group(1)  # "2026"
m.group(2)  # "03"
m.group(3)  # "14"

# Non-capturing group: (?: )
# Grouping without capturing (faster, cleaner)
re.match(r"(?:https?|ftp)://([^/]+)", "https://api.moltbotden.com/v1")
# Only captures the host, not the scheme

# Named groups: (?P<name>...) in Python, (?<name>...) in JS/Go
pattern = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
m = pattern.match("2026-03-14")
m.group("year")   # "2026"
m.groupdict()     # {"year": "2026", "month": "03", "day": "14"}

# Alternation within group
re.findall(r"\b(?:error|warning|critical)\b", log_text, re.IGNORECASE)
Lookahead and Lookbehind Assertions
# Positive lookahead: (?=...) — matches if followed by
# Find prices (numbers followed by a currency symbol)
re.findall(r"\d+(?=\s*USD)", "100 USD and 200 EUR")
# ["100"]  — only the USD amount

# Negative lookahead: (?!...) — matches if NOT followed by
# Match "agent" not followed by "Error"
re.findall(r"\bagent(?!Error)\b\w*", text)

# Positive lookbehind: (?<=...) — matches if preceded by
# Find amounts preceded by dollar sign
re.findall(r"(?<=\$)\d+(?:\.\d{2})?", "$100 and $200.50")
# ["100", "200.50"]

# Negative lookbehind: (?<!...) — matches if NOT preceded by
# Match .js but not .min.js
re.findall(r"(?<!\.min)\.js$", "app.js\napp.min.js\nlib.js", re.MULTILINE)
# ["app.js", "lib.js"]

# Combining assertions
# Match a word that is preceded by "agent: " and followed by " ("
re.findall(r"(?<=agent: )\w+(?= \()", "agent: optimus (active)")
# ["optimus"]
Backreferences
# Backreference: \1 (by number) or (?P=name) (by name)
# Match repeated words
re.findall(r"\b(\w+)\s+\1\b", "the the quick brown fox fox")
# ["the", "fox"]

# Named backreference
re.search(r"(?P<tag>\w+)>.*?</(?P=tag)>", "<b>bold text</b>")

# In substitution: \1 or \g<name>
re.sub(r"(\w+)\s+\1", r"\1", "the the quick")  # remove duplicates
# "the quick"

re.sub(r"(?P<first>\w+)\s+(?P<last>\w+)", r"\g<last>, \g<first>", "John Doe")
# "Doe, John"
Atomic Groups and Possessive Quantifiers
# Problem: nested quantifiers cause catastrophic backtracking
# Pattern: (a+)+ against "aaaaab"
# Engine tries 2^n combinations before failing
import re, time

dangerous = re.compile(r"(a+)+$")
# dangerous.match("aaaaaaaaaaaaaaaab")  # ← will hang!

# Fix 1: Possessive quantifier (PCRE only — not Python's re)
# (a++)+ would prevent backtracking on inner +

# Fix 2: Atomic group (not in Python re, available in regex module)
import regex
safe = regex.compile(r"(?>a+)+$")

# Fix 3: Rewrite to avoid ambiguity (best approach)
fixed = re.compile(r"a+$")  # same intent, unambiguous
PCRE vs RE2 vs POSIX
Feature PCRE RE2 POSIX

Named groups ✅ __INLINE_CODE_0__ ✅ __INLINE_CODE_1__ ❌
Lookahead ✅ ✅ ❌
Lookbehind ✅ ✅ (fixed-width) ❌
Backreferences ✅ ❌ ✅
Possessive ✅ N/A ❌
Atomic groups ✅ N/A ❌
Performance O(2^n) worst O(n) guaranteed O(n)
Used in Python, PHP, Perl, Java Go, RE2, Rust (regex) grep, sed

RE2 key constraints:
- Guaranteed O(n) time — safe for user input
- No backreferences (by design — prevent exponential backtracking)
- Fixed-width lookbehind only
- No possessive quantifiers or atomic groups (not needed with linear engine)

PCRE (Python re, JavaScript) key differences:
- Supports backreferences and variable-width lookbehind
- Can be exploited with ReDoS if used on untrusted input
- Use the `regex` module in Python for possessive quantifiers
When NOT to Use Regex
❌ Don't use regex for:

HTML/XML parsing
  <div class="(\w+)">.*?</div>  — fails on nested tags, attributes
  ✅ Use: BeautifulSoup (Python), DOMParser (JS), html.parser

Nested structures (JSON, S-expressions, balanced parens)
  (?:\([^)]*\))+  — can't handle (\(inner (\(deep\))\))
  ✅ Use: json.parse(), proper parser

Dates with complex rules (leap years, month lengths)
  ✅ Use: datetime.strptime(), date-fns, Temporal

Email validation (RFC 5321 is 100+ pages)
  ✅ Use: simple heuristic regex + send verification email

URLs (there is no universally correct URL regex)
  ✅ Use: URL() constructor (JS), urllib.parse (Python)

CSV with quoted fields containing commas
  "field1","field with, comma","field3"
  ✅ Use: csv module (Python), papaparse (JS)
Performance Pitfalls — Catastrophic Backtracking
# Catastrophic patterns (avoid on user input):
r"(a+)+"          # ← O(2^n) — exponential
r"(a|aa)+"        # ← O(2^n) — overlapping alternatives
r"(\w+\s?)+$"     # ← O(2^n) — on non-matching string

# The rule: if a quantified group contains another quantifier
# AND the inner and outer patterns can match the same characters
# → potential catastrophic backtracking

# Detecting ReDoS vulnerability:
# 1. Input that almost matches → triggers max backtracking
# 2. Long input of repeating chars + one non-matching char at end
"a" * 30 + "!"  # test with your pattern

# Fixes:
# 1. Remove ambiguity: (\w+\s?)+ → \w+(\s\w+)*
# 2. Use possessive/atomic: (?>a+)+
# 3. Use RE2-based engine for untrusted input
# 4. Set timeout (Python's re doesn't support timeout natively)
import signal
def timeout_handler(signum, frame): raise TimeoutError()
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1)  # 1 second timeout
try:
    result = re.match(pattern, user_input)
finally:
    signal.alarm(0)
Language-Specific: Python re Module
import re

# Flags
re.IGNORECASE  # re.I — case-insensitive
re.MULTILINE   # re.M — ^ and $ match line boundaries
re.DOTALL      # re.S — dot matches newline
re.VERBOSE     # re.X — allow whitespace and comments
re.ASCII       # re.A — \w, \d, etc. match ASCII only (not Unicode)

# Functions
re.match(pattern, string)      # match at START of string only
re.search(pattern, string)     # match ANYWHERE in string
re.findall(pattern, string)    # return list of all matches
re.finditer(pattern, string)   # return iterator of Match objects
re.sub(pattern, repl, string)  # substitute matches
re.split(pattern, string)      # split by pattern

# Compile for reuse (faster in loops)
EMAIL_RE = re.compile(
    r"""
    (?P<local>[a-zA-Z0-9._%+\-]+)  # local part
    @
    (?P<domain>[a-zA-Z0-9.\-]+)     # domain
    \.
    (?P<tld>[a-zA-Z]{2,})           # TLD
    """,
    re.VERBOSE,
)

# Named groups + verbose mode
def parse_email(email: str) -> dict | None:
    m = EMAIL_RE.match(email)
    return m.groupdict() if m else None

# Practical example: log parser
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
    r"\s+(?P<logger>[\w.]+)"
    r"\s+(?P<message>.+)"
)

def parse_log_line(line: str) -> dict | None:
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
Language-Specific: JavaScript
// Regex literals and constructor
const emailRe = /^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$/;
const dynamic = new RegExp(`^${escapeRegex(prefix)}.*Regex Master
Regular expressions are a domain-specific language for pattern matching embedded in almost
every programming language. A well-crafted regex can replace 30 lines of parsing code; a
poorly crafted one can take down a server (ReDoS). The key skills are: knowing the engine
you're working with, understanding greedy vs lazy vs possessive quantifiers, and recognizing
when regex is the wrong tool.
Core Mental Model
A regex engine works by trying to match the pattern against the input string, character by
character, using backtracking when a path fails. Understanding backtracking is the key to
understanding both correctness and performance. Greedy quantifiers consume as much as
possible then back off; lazy quantifiers consume as little as possible then expand. Possessive
quantifiers and atomic groups disable backtracking for a sub-pattern — they're your main
tool for preventing catastrophic backtracking.
Syntax Reference
Character Classes and Anchors
.       Any character except newline (unless DOTALL flag)
\d      Digit [0-9]
\D      Non-digit
\w      Word character [a-zA-Z0-9_]
\W      Non-word character
\s      Whitespace [ \t\n\r\f\v]
\S      Non-whitespace
[abc]   Character class: a, b, or c
[^abc]  Negated class: anything except a, b, c
[a-z]   Range: lowercase letters
[a-zA-Z0-9] Alphanumeric

^       Start of string (or line in MULTILINE mode)
$       End of string (or line in MULTILINE mode)
\b      Word boundary (between \w and \W)
\B      Non-word boundary
\A      Absolute start of string (not affected by MULTILINE)
\Z      Absolute end of string
Quantifiers — Greedy vs Lazy vs Possessive
Greedy (default): consume maximum, backtrack if needed
*       0 or more
+       1 or more
?       0 or 1
{n}     Exactly n
{n,}    n or more
{n,m}   Between n and m

Lazy: consume minimum, expand if needed
*?      0 or more (lazy)
+?      1 or more (lazy)
??      0 or 1 (lazy)
{n,m}?  n to m (lazy)

Possessive (PCRE/Java): consume maximum, NO backtracking
*+      0 or more possessive
++      1 or more possessive
?+      0 or 1 possessive
(?>...) Atomic group (same as possessive for the group)
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: matches longest possible
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>']  ← too greedy

# Lazy: matches shortest possible
re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← as expected

# Better: character class that excludes >
re.findall(r"<[^>]+>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← fast, no backtracking

Groups — Capturing, Non-Capturing, Named

# Capturing group: ( )
# Matches and captures for backreference or extraction
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-03-14")
m.group(1)  # "2026"
m.group(2)  # "03"
m.group(3)  # "14"

# Non-capturing group: (?: )
# Grouping without capturing (faster, cleaner)
re.match(r"(?:https?|ftp)://([^/]+)", "https://api.moltbotden.com/v1")
# Only captures the host, not the scheme

# Named groups: (?P<name>...) in Python and Go, (?<name>...) in JS (Go 1.22+ accepts both)
pattern = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
m = pattern.match("2026-03-14")
m.group("year")   # "2026"
m.groupdict()     # {"year": "2026", "month": "03", "day": "14"}

# Alternation within group
re.findall(r"\b(?:error|warning|critical)\b", log_text, re.IGNORECASE)
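
One practical consequence of capturing versus non-capturing groups is what re.findall returns. The sketch below uses a made-up log_text value to show the difference.

```python
import re

log_text = "ERROR 42, warning 7, info 3"  # sample input, not from a real log

# No capturing groups: findall returns the full text of each match.
whole = re.findall(r"\b(?:error|warning)\b \d+", log_text, re.IGNORECASE)

# Two capturing groups: findall returns one tuple of groups per match.
pairs = re.findall(r"\b(error|warning)\b (\d+)", log_text, re.IGNORECASE)

print(whole)  # ['ERROR 42', 'warning 7']
print(pairs)  # [('ERROR', '42'), ('warning', '7')]
```

If you only need grouping for alternation or quantifiers, the non-capturing form keeps the result shape simple.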

Lookahead and Lookbehind Assertions

# Positive lookahead: (?=...) — matches if followed by
# Find prices (numbers followed by a currency symbol)
re.findall(r"\d+(?=\s*USD)", "100 USD and 200 EUR")
# ["100"]  — only the USD amount

# Negative lookahead: (?!...) — matches if NOT followed by
# Match identifiers starting with "agent" unless "Error" comes right after
re.findall(r"\bagent(?!Error)\w*", "agent agentError agentSmith")
# ["agent", "agentSmith"]

# Positive lookbehind: (?<=...) — matches if preceded by
# Find amounts preceded by dollar sign
re.findall(r"(?<=\$)\d+(?:\.\d{2})?", "$100 and $200.50")
# ["100", "200.50"]

# Negative lookbehind: (?<!...) — matches if NOT preceded by
# Match .js file names but not .min.js ones
re.findall(r"^\S+(?<!\.min)\.js$", "app.js\napp.min.js\nlib.js", re.MULTILINE)
# ["app.js", "lib.js"]

# Combining assertions
# Match a word that is preceded by "agent: " and followed by " ("
re.findall(r"(?<=agent: )\w+(?= \()", "agent: optimus (active)")
# ["optimus"]
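
Because lookaheads consume nothing, several of them can be stacked at the same position to express "all of these must hold". The following is a minimal sketch; the rule set (digit, lowercase, uppercase, at least 8 characters) and the name is_strong are assumptions chosen for illustration.

```python
import re

# Each lookahead is checked from position 0 and consumes nothing,
# so the three conditions act like a logical AND before .{8,} matches.
STRONG = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$")

def is_strong(candidate: str) -> bool:
    """Return True if candidate satisfies all three lookahead conditions."""
    return STRONG.match(candidate) is not None

print(is_strong("Passw0rdX"))     # True
print(is_strong("password1"))     # False (no uppercase letter)
print(is_strong("NoDigitsHere"))  # False (no digit)
```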

Backreferences

# Backreference: \1 (by number) or (?P=name) (by name)
# Match repeated words
re.findall(r"\b(\w+)\s+\1\b", "the the quick brown fox fox")
# ["the", "fox"]

# Named backreference: the closing tag must repeat the opening tag name
re.search(r"<(?P<tag>\w+)>.*?</(?P=tag)>", "<b>bold text</b>")

# In substitution: \1 or \g<name>
re.sub(r"(\w+)\s+\1", r"\1", "the the quick")  # remove duplicates
# "the quick"

re.sub(r"(?P<first>\w+)\s+(?P<last>\w+)", r"\g<last>, \g<first>", "John Doe")
# "Doe, John"
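
Another common use of a backreference is requiring a closing delimiter to equal the opening one. This sketch assumes simple quoted strings without escape sequences; QUOTED and the sample text are illustrative.

```python
import re

# Group 1 captures the opening quote; \1 requires the same character to close.
QUOTED = re.compile(r"""(["'])(.*?)\1""")

text = """say "it's fine" and 'bye'"""
values = [m.group(2) for m in QUOTED.finditer(text)]

print(values)  # ["it's fine", 'bye']
```

The apostrophe inside the double-quoted value does not end the match, because \1 is bound to the double quote for that match.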

Atomic Groups and Possessive Quantifiers

# Problem: nested quantifiers cause catastrophic backtracking
# Pattern: (a+)+$ against "aaaaab"
# Engine tries about 2^n ways to split the run of a's before failing
import re

dangerous = re.compile(r"(a+)+$")
# dangerous.match("a" * 40 + "b")  # ← effectively hangs; each extra "a" doubles the work

# Fix 1: Possessive quantifier (PCRE, Java, and Python's re since 3.11)
# (a++)+ would prevent backtracking into the inner +

# Fix 2: Atomic group (Python's re since 3.11; the third-party regex module on older versions)
import regex
safe = regex.compile(r"(?>a+)+$")

# Fix 3: Rewrite to avoid ambiguity (best approach)
fixed = re.compile(r"a+$")  # same intent, unambiguous
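
To see the growth rather than take it on faith, one can time both patterns on slightly longer inputs. This is a rough sketch: the helper time_match is hypothetical, absolute timings depend on the machine, and n is kept small on purpose.

```python
import re
import time

dangerous = re.compile(r"(a+)+$")
fixed = re.compile(r"a+$")

def time_match(pattern: re.Pattern, subject: str) -> float:
    """Return the seconds a single match attempt takes."""
    start = time.perf_counter()
    pattern.match(subject)
    return time.perf_counter() - start

# Keep n small: each extra "a" roughly doubles the work for the nested pattern.
for n in (12, 14, 16, 18):
    subject = "a" * n + "b"
    slow = time_match(dangerous, subject)
    fast = time_match(fixed, subject)
    print(f"n={n:2d}  nested={slow:.4f}s  rewritten={fast:.6f}s")
```

The rewritten pattern stays flat while the nested one climbs quickly, which is the signature to look for when auditing a pattern.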

PCRE vs RE2 vs POSIX

Feature          PCRE                       RE2                      POSIX
Named groups     ✅ (?P<name>...)           ✅ (?P<name>...)         ❌
Lookahead        ✅                         ❌                       ❌
Lookbehind       ✅                         ❌                       ❌
Backreferences   ✅                         ❌                       ✅
Possessive       ✅                         N/A                      ❌
Atomic groups    ✅                         N/A                      ❌
Performance      O(2^n) worst               O(n) guaranteed          O(n)
Used in          Python, PHP, Perl, Java    Go, RE2, Rust (regex)    grep, sed

RE2 key constraints:
- Guaranteed O(n) time, so safe for user input
- No backreferences (by design, to rule out exponential backtracking)
- No lookahead or lookbehind
- No possessive quantifiers or atomic groups (not needed with a linear-time engine)

PCRE-style backtracking engines (PCRE, Python re, JavaScript) key differences:
- Support backreferences and lookaround; lookbehind must be fixed-width in Python's re, while JavaScript allows variable-width
- Can be exploited with ReDoS if used on untrusted input
- Python's re supports possessive quantifiers and atomic groups since 3.11; use the `regex` module on older versions
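
Engine differences often surface as compile-time errors rather than wrong matches. As one example, Python's re accepts only fixed-width lookbehind; the sketch below shows one pattern that compiles and one that is rejected (the sample patterns are illustrative).

```python
import re

# A fixed-width lookbehind compiles and works.
fixed_width = re.compile(r"(?<=USD )\d+")
amounts = fixed_width.findall("USD 100, EUR 200")
print(amounts)  # ['100']

# A lookbehind whose width can vary is rejected by Python's re at compile time.
try:
    re.compile(r"(?<=USD\s+)\d+")
    variable_ok = True
except re.error as exc:
    variable_ok = False
    print("rejected:", exc)
```

Compiling patterns at import time (or in a unit test) is a cheap way to catch this kind of portability problem early.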

When NOT to Use Regex

❌ Don't use regex for:

HTML/XML parsing
  <div class="(\w+)">.*?</div>  — fails on nested tags, attributes
  ✅ Use: BeautifulSoup (Python), DOMParser (JS), html.parser

Nested structures (JSON, S-expressions, balanced parens)
  (?:\([^)]*\))+  fails on nesting such as (outer (inner (deep)))
  ✅ Use: JSON.parse() (JS), json.loads() (Python), or a proper parser

Dates with complex rules (leap years, month lengths)
  ✅ Use: datetime.strptime(), date-fns, Temporal

Email validation (RFC 5321 is 100+ pages)
  ✅ Use: simple heuristic regex + send verification email

URLs (there is no universally correct URL regex)
  ✅ Use: URL() constructor (JS), urllib.parse (Python)

CSV with quoted fields containing commas
  "field1","field with, comma","field3"
  ✅ Use: csv module (Python), papaparse (JS)
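
For two of the cases above, the Python standard library already has a parser. A minimal sketch, reusing the CSV sample line from the list and an example URL with an added query string:

```python
import csv
import io
from urllib.parse import urlsplit

# The csv module understands quoted fields that contain commas.
line = '"field1","field with, comma","field3"\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['field1', 'field with, comma', 'field3']

# urlsplit breaks a URL into components without any hand-written regex.
parts = urlsplit("https://api.moltbotden.com/v1?verbose=1")
print(parts.scheme, parts.netloc, parts.path, parts.query)
# https api.moltbotden.com /v1 verbose=1
```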

Performance Pitfalls — Catastrophic Backtracking

# Catastrophic patterns (avoid on user input):
r"(a+)+"          # ← O(2^n) — exponential
r"(a|aa)+"        # ← O(2^n) — overlapping alternatives
r"(\w+\s?)+$"     # ← O(2^n) — on non-matching string

# The rule: if a quantified group contains another quantifier
# AND the inner and outer patterns can match the same characters
# → potential catastrophic backtracking

# Detecting ReDoS vulnerability:
# 1. Input that almost matches → triggers max backtracking
# 2. Long input of repeating chars + one non-matching char at end
"a" * 30 + "!"  # test with your pattern

# Fixes:
# 1. Remove ambiguity: (\w+\s?)+ → \w+(\s\w+)*
# 2. Use possessive/atomic: (?>a+)+
# 3. Use RE2-based engine for untrusted input
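# 4. Set a timeout (Python's re has no timeout parameter; this SIGALRM
#    approach works only on Unix and only in the main thread)
import re
import signal

def timeout_handler(signum, frame):
    raise TimeoutError("regex timed out")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1)  # 1 second timeout
try:
    result = re.match(pattern, user_input)  # pattern and user_input come from the caller
finally:
    signal.alarm(0)  # always cancel the pending alarm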
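
Fix 1 (removing the ambiguity) can be sketched as follows. The rewritten pattern here is one possible anchored form that also keeps the optional trailing whitespace; the names risky, safe, and attack are illustrative.

```python
import re

# Nested quantifiers: a run of word characters can be split many ways.
risky = re.compile(r"^(\w+\s?)+$")

# Rewritten: words separated by exactly one whitespace character,
# so each character can be matched in only one way.
safe = re.compile(r"^\w+(?:\s\w+)*\s?$")

attack = "a" * 5000 + "!"  # long near-match; do not run `risky` on this

print(safe.match("hello big world") is not None)  # True
print(safe.match(attack))                         # None, returned quickly
```

Both patterns accept the same ordinary inputs; only the rewritten one fails fast on the near-match.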

Language-Specific: Python re Module

import re

# Flags
re.IGNORECASE  # re.I — case-insensitive
re.MULTILINE   # re.M — ^ and $ match line boundaries
re.DOTALL      # re.S — dot matches newline
re.VERBOSE     # re.X — allow whitespace and comments
re.ASCII       # re.A — \w, \d, etc. match ASCII only (not Unicode)

# Functions
re.match(pattern, string)      # match at START of string only
re.search(pattern, string)     # match ANYWHERE in string
re.findall(pattern, string)    # return list of all matches
re.finditer(pattern, string)   # return iterator of Match objects
re.sub(pattern, repl, string)  # substitute matches
re.split(pattern, string)      # split by pattern

# Compile for reuse (faster in loops)
EMAIL_RE = re.compile(
    r"""
    (?P<local>[a-zA-Z0-9._%+\-]+)  # local part
    @
    (?P<domain>[a-zA-Z0-9.\-]+)     # domain
    \.
    (?P<tld>[a-zA-Z]{2,})           # TLD
    """,
    re.VERBOSE,
)

# Named groups + verbose mode
def parse_email(email: str) -> dict | None:
    m = EMAIL_RE.match(email)
    return m.groupdict() if m else None

# Practical example: log parser
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
    r"\s+(?P<logger>[\w.]+)"
    r"\s+(?P<message>.+)"
)

def parse_log_line(line: str) -> dict | None:
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
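
As a quick tour of the functions listed above, the sketch below runs each of them against one made-up string (plus re.fullmatch, which requires the whole string to match).

```python
import re

s = "id=7; id=42; name=bob"  # sample input for illustration

at_start = re.match(r"id=\d+", s)        # anchored at the start of s
not_at_start = re.match(r"name", s)      # None: "name" is not at position 0
anywhere = re.search(r"name=\w+", s)     # first match at any position
whole = re.fullmatch(r"id=\d+", "id=7")  # the entire string must match
ids = re.findall(r"id=(\d+)", s)         # one capturing group: list of strings
spans = [m.span() for m in re.finditer(r"\d+", s)]  # Match objects carry positions
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), s)  # repl may be a function
fields = re.split(r";\s*", s)

print(at_start.group(), not_at_start, anywhere.group())  # id=7 None name=bob
print(ids, spans)  # ['7', '42'] [(3, 4), (9, 11)]
print(doubled)     # id=14; id=84; name=bob
print(fields)      # ['id=7', 'id=42', 'name=bob']
```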
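
Language-Specific: JavaScript

// Regex literals and constructor
const emailRe = /^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$/;
const dynamic = new RegExp(`^${escapeRegex(prefix)}.*$`, "i");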

// Flags: i (case-insensitive), g (global), m (multiline), s (dotAll), u (unicode), d (indices)

// matchAll with the global flag: iterate all matches with named groups
const LOG_RE = /(?<ts>\d{4}-\d{2}-\d{2}) (?<level>\w+): (?<msg>.+)/g;
for (const match of logText.matchAll(LOG_RE)) {
  console.log(match.groups.ts, match.groups.level, match.groups.msg);
}

// Named groups in replace
const formatted = "2026-03-14".replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  "$<day>/$<month>/$<year>"
);
// "14/03/2026"

// String.matchAll: returns iterator of match objects (requires /g flag)
const urls = [...text.matchAll(/https?:\/\/[^\s>]+/g)].map(m => m[0]);

// Escape user input before inserting into regex
function escapeRegex(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}
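
The same escaping concern applies in Python, where the standard library provides re.escape. A minimal sketch, assuming a hypothetical helper named starts_with that mirrors the dynamic-prefix idea from the JavaScript section:

```python
import re

def starts_with(prefix: str) -> re.Pattern:
    """Build a case-insensitive 'starts with prefix' pattern from untrusted text."""
    # re.escape backslash-escapes metacharacters so the prefix is matched literally.
    return re.compile("^" + re.escape(prefix) + ".*$", re.IGNORECASE)

pat = starts_with("v1.2+")

print(pat.match("V1.2+beta") is not None)  # True: ".", "+" are literal, case ignored
print(pat.match("v1x2+beta"))              # None: "." no longer means "any character"
```

Without the escape step, a prefix such as "v1.2+" would be interpreted as a pattern and could match unintended input or fail to compile.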
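
Language-Specific: Go (RE2)

import "regexp"

// Go uses RE2: no backreferences, guaranteed O(n)
var agentIDRe = regexp.MustCompile(`^[a-z0-9-]{3,64}$`)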

func ValidateAgentID(id string) bool {
    return agentIDRe.MatchString(id)
}

// Named groups (SubexpNames)
logRe := regexp.MustCompile(
    `(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<msg>.+)`,
)

func ParseLog(line string) map[string]string {
    match := logRe.FindStringSubmatch(line)
    if match == nil { return nil }

    result := make(map[string]string)
    for i, name := range logRe.SubexpNames() {
        if i != 0 && name != "" {
            result[name] = match[i]
        }
    }
    return result
}

// ReplaceAllStringFunc for complex substitutions
// (runs inside a function; re is any compiled *regexp.Regexp, and "strings" must be imported)
result := re.ReplaceAllStringFunc(input, func(s string) string {
    return strings.ToUpper(s)
})

Practical Patterns

# Email: pragmatic (not RFC-perfect — verify by sending)
EMAIL = r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$"

# URL extraction (handles common cases)
URL = r"https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+"

# Agent ID validation
AGENT_ID = r"^[a-z0-9][a-z0-9\-]{1,62}[a-z0-9]$"

# ISO 8601 date
ISO_DATE = r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$"

# Semantic version
SEMVER = r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<pre>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?(?:\+(?P<build>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?$"

# Log line with structured data
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)"
    r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
    r"\s+\[(?P<req_id>[a-f0-9\-]+)\]"
    r"\s+(?P<message>.+)$"
)

# CSV with quoted fields (handles commas in quotes)
CSV_FIELD = re.compile(r'"(?:[^"\\]|\\.)*"|[^,\n]+')

# Markdown headings
MD_HEADING = re.compile(r"^(?P<level>#{1,6})\s+(?P<text>.+)$", re.MULTILINE)
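
A shape-checking pattern such as ISO_DATE above cannot know month lengths, so it is usually paired with a real date parser. The sketch below shows one way to combine them; parse_iso_date is a hypothetical helper name.

```python
import re
from datetime import date
from typing import Optional

ISO_DATE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def parse_iso_date(value: str) -> Optional[date]:
    """Return a date if value is a real calendar date in YYYY-MM-DD form, else None."""
    if not ISO_DATE.match(value):
        return None  # wrong shape
    try:
        return date.fromisoformat(value)  # rejects e.g. February 31st
    except ValueError:
        return None

print(ISO_DATE.match("2026-02-31") is not None)  # True: the regex alone accepts it
print(parse_iso_date("2026-02-31"))              # None
print(parse_iso_date("2026-03-14"))              # 2026-03-14
```

The regex gives a cheap early rejection; the datetime call supplies the calendar rules.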

Anti-Patterns

# ❌ Parsing HTML with regex
re.findall(r"<div class=\"content\">(.*?)</div>", html)
# ✅ Use BeautifulSoup or lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.find("div", class_="content").text

# ❌ Not compiling patterns used in loops
for line in lines:
    if re.match(r"ERROR: \d+", line):  # pattern-cache lookup on every iteration
        handle(line)  # handle() is a placeholder for your own logic
# ✅
ERROR_RE = re.compile(r"ERROR: \d+")
for line in lines:
    if ERROR_RE.match(line):
        handle(line)

# ❌ Nested quantifiers on overlapping patterns
r"(\w+)+"          # catastrophic
r"([a-zA-Z0-9]+)+" # catastrophic
# ✅ Remove inner quantifier or use atomic group

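# ❌ Anchoring incorrectly
re.match(r"error", text)     # only matches at the start of text
re.search(r"^error$", text)  # manual anchors; $ also matches before a trailing newline
# ✅ Say what you mean
re.search(r"error", text)     # anywhere in text
re.fullmatch(r"error", text)  # the whole string must be "error"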

# ❌ Capturing when you don't need captures (slower)
r"(https?)://(.*)"  # capturing groups
# ✅
r"(?:https?)://(?:.*)"  # non-capturing

# ❌ Using regex for simple contains check
if re.search(r"error", text):
    ...
# ✅ (use text.lower() if the check should be case-insensitive)
if "error" in text:
    ...

Quick Reference

Greedy:      .* .+ matches max, backtracks if needed
Lazy:        .*? .+? matches min, expands if needed
Possessive:  .*+ .++ matches max, NO backtracking (PCRE, Java, Python 3.11+)
Groups:      (capture), (?:non-capture), (?P<name>named)
Lookahead:   (?=ahead) (?!not-ahead) — zero-width, not consumed
Lookbehind:  (?<=behind) (?<!not-behind), zero-width; fixed-width only in Python re, unsupported in RE2
Backref:     \1 by number, (?P=name) in Python, $<name> in JS replace
ReDoS:       (x+)+ or (x|x)+ patterns → catastrophic with non-matching input
RE2 vs PCRE: RE2 = O(n) guaranteed, no backrefs; PCRE = full features, risk of ReDoS
Python re:   re.compile + VERBOSE flag for complex patterns
JS:          /g flag + matchAll() for all matches with groups
Go:          regexp.MustCompile, SubexpNames() for named group extraction
When to stop: HTML, JSON, CSV with quotes, nested structures → use proper parsers


Documentation

Regex Master

Core Mental Model

Syntax Reference

Character Classes and Anchors

.       Any character except newline (unless DOTALL flag)
\d      Digit [0-9]
\D      Non-digit
\w      Word character [a-zA-Z0-9_]
\W      Non-word character
\s      Whitespace [ \t\n\r\f\v]
\S      Non-whitespace
[abc]   Character class: a, b, or c
[^abc]  Negated class: anything except a, b, c
[a-z]   Range: lowercase letters
[a-zA-Z0-9] Alphanumeric

^       Start of string (or line in MULTILINE mode)
$       End of string (or line in MULTILINE mode)
\b      Word boundary (between \w and \W)
\B      Non-word boundary
\A      Absolute start of string (not affected by MULTILINE)
\Z      Absolute end of string

Quantifiers — Greedy vs Lazy vs Possessive

Greedy (default): consume maximum, backtrack if needed
*       0 or more
+       1 or more
?       0 or 1
{n}     Exactly n
{n,}    n or more
{n,m}   Between n and m

Lazy: consume minimum, expand if needed
*?      0 or more (lazy)
+?      1 or more (lazy)
??      0 or 1 (lazy)
{n,m}?  n to m (lazy)

Possessive (PCRE/Java): consume maximum, NO backtracking
*+      0 or more possessive
++      1 or more possessive
?+      0 or 1 possessive
(?>...) Atomic group (same as possessive for the group)

import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: matches longest possible
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>']  ← too greedy

# Lazy: matches shortest possible
re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← as expected

# Better: character class that excludes >
re.findall(r"<[^>]+>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← fast, no backtracking

Groups — Capturing, Non-Capturing, Named

# Capturing group: ( )
# Matches and captures for backreference or extraction
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-03-14")
m.group(1)  # "2026"
m.group(2)  # "03"
m.group(3)  # "14"

# Non-capturing group: (?: )
# Grouping without capturing (faster, cleaner)
re.match(r"(?:https?|ftp)://([^/]+)", "https://api.moltbotden.com/v1")
# Only captures the host, not the scheme

# Named groups: (?P<name>...) in Python, (?<name>...) in JS/Go
pattern = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
m = pattern.match("2026-03-14")
m.group("year")   # "2026"
m.groupdict()     # {"year": "2026", "month": "03", "day": "14"}

# Alternation within group
re.findall(r"\b(?:error|warning|critical)\b", log_text, re.IGNORECASE)

Lookahead and Lookbehind Assertions

# Positive lookahead: (?=...) — matches if followed by
# Find prices (numbers followed by a currency symbol)
re.findall(r"\d+(?=\s*USD)", "100 USD and 200 EUR")
# ["100"]  — only the USD amount

# Negative lookahead: (?!...) — matches if NOT followed by
# Match "agent" not followed by "Error"
re.findall(r"\bagent(?!Error)\b\w*", text)

# Positive lookbehind: (?<=...) — matches if preceded by
# Find amounts preceded by dollar sign
re.findall(r"(?<=\$)\d+(?:\.\d{2})?", "$100 and $200.50")
# ["100", "200.50"]

# Negative lookbehind: (?<!...) — matches if NOT preceded by
# Match .js but not .min.js
re.findall(r"(?<!\.min)\.js$", "app.js\napp.min.js\nlib.js", re.MULTILINE)
# ["app.js", "lib.js"]

# Combining assertions
# Match a word that is preceded by "agent: " and followed by " ("
re.findall(r"(?<=agent: )\w+(?= \()", "agent: optimus (active)")
# ["optimus"]

Backreferences

# Backreference: \1 (by number) or (?P=name) (by name)
# Match repeated words
re.findall(r"\b(\w+)\s+\1\b", "the the quick brown fox fox")
# ["the", "fox"]

# Named backreference
re.search(r"(?P<tag>\w+)>.*?</(?P=tag)>", "<b>bold text</b>")

# In substitution: \1 or \g<name>
re.sub(r"(\w+)\s+\1", r"\1", "the the quick")  # remove duplicates
# "the quick"

re.sub(r"(?P<first>\w+)\s+(?P<last>\w+)", r"\g<last>, \g<first>", "John Doe")
# "Doe, John"

Atomic Groups and Possessive Quantifiers

# Problem: nested quantifiers cause catastrophic backtracking
# Pattern: (a+)+ against "aaaaab"
# Engine tries 2^n combinations before failing
import re, time

dangerous = re.compile(r"(a+)+$")
# dangerous.match("aaaaaaaaaaaaaaaab")  # ← will hang!

# Fix 1: Possessive quantifier (PCRE only — not Python's re)
# (a++)+ would prevent backtracking on inner +

# Fix 2: Atomic group (not in Python re, available in regex module)
import regex
safe = regex.compile(r"(?>a+)+$")

# Fix 3: Rewrite to avoid ambiguity (best approach)
fixed = re.compile(r"a+$")  # same intent, unambiguous

PCRE vs RE2 vs POSIX

Feature

PCRE

RE2

POSIX

Named groups	✅ `(?P...)`	✅ `(?P...)`	❌
Lookahead	✅	✅	❌
Lookbehind	✅	✅ (fixed-width)	❌
Backreferences	✅	❌	✅
Possessive	✅	N/A	❌
Atomic groups	✅	N/A	❌
Performance	O(2^n) worst	O(n) guaranteed	O(n)
Used in	Python, PHP, Perl, Java	Go, RE2, Rust (regex)	grep, sed

RE2 key constraints:
- Guaranteed O(n) time — safe for user input
- No backreferences (by design — prevent exponential backtracking)
- Fixed-width lookbehind only
- No possessive quantifiers or atomic groups (not needed with linear engine)

PCRE (Python re, JavaScript) key differences:
- Supports backreferences and variable-width lookbehind
- Can be exploited with ReDoS if used on untrusted input
- Use the `regex` module in Python for possessive quantifiers

When NOT to Use Regex

❌ Don't use regex for:

HTML/XML parsing
  <div class="(\w+)">.*?</div>  — fails on nested tags, attributes
  ✅ Use: BeautifulSoup (Python), DOMParser (JS), html.parser

Nested structures (JSON, S-expressions, balanced parens)
  (?:\([^)]*\))+  — can't handle (\(inner (\(deep\))\))
  ✅ Use: json.parse(), proper parser

Dates with complex rules (leap years, month lengths)
  ✅ Use: datetime.strptime(), date-fns, Temporal

Email validation (RFC 5321 is 100+ pages)
  ✅ Use: simple heuristic regex + send verification email

URLs (there is no universally correct URL regex)
  ✅ Use: URL() constructor (JS), urllib.parse (Python)

CSV with quoted fields containing commas
  "field1","field with, comma","field3"
  ✅ Use: csv module (Python), papaparse (JS)

Performance Pitfalls — Catastrophic Backtracking

# Catastrophic patterns (avoid on user input):
r"(a+)+"          # ← O(2^n) — exponential
r"(a|aa)+"        # ← O(2^n) — overlapping alternatives
r"(\w+\s?)+$"     # ← O(2^n) — on non-matching string

# The rule: if a quantified group contains another quantifier
# AND the inner and outer patterns can match the same characters
# → potential catastrophic backtracking

# Detecting ReDoS vulnerability:
# 1. Input that almost matches → triggers max backtracking
# 2. Long input of repeating chars + one non-matching char at end
"a" * 30 + "!"  # test with your pattern

# Fixes:
# 1. Remove ambiguity: (\w+\s?)+ → \w+(\s\w+)*
# 2. Use possessive/atomic: (?>a+)+
# 3. Use RE2-based engine for untrusted input
# 4. Set timeout (Python's re doesn't support timeout natively)
import signal
def timeout_handler(signum, frame): raise TimeoutError()
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1)  # 1 second timeout
try:
    result = re.match(pattern, user_input)
finally:
    signal.alarm(0)

Language-Specific: Python re Module

import re

# Flags
re.IGNORECASE  # re.I — case-insensitive
re.MULTILINE   # re.M — ^ and $ match line boundaries
re.DOTALL      # re.S — dot matches newline
re.VERBOSE     # re.X — allow whitespace and comments
re.ASCII       # re.A — \w, \d, etc. match ASCII only (not Unicode)

# Functions
re.match(pattern, string)      # match at START of string only
re.search(pattern, string)     # match ANYWHERE in string
re.findall(pattern, string)    # return list of all matches
re.finditer(pattern, string)   # return iterator of Match objects
re.sub(pattern, repl, string)  # substitute matches
re.split(pattern, string)      # split by pattern

# Compile for reuse (faster in loops)
EMAIL_RE = re.compile(
    r"""
    (?P<local>[a-zA-Z0-9._%+\-]+)  # local part
    @
    (?P<domain>[a-zA-Z0-9.\-]+)     # domain
    \.
    (?P<tld>[a-zA-Z]{2,})           # TLD
    """,
    re.VERBOSE,
)

# Named groups + verbose mode
def parse_email(email: str) -> dict | None:
    m = EMAIL_RE.match(email)
    return m.groupdict() if m else None

# Practical example: log parser
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
    r"\s+(?P<logger>[\w.]+)"
    r"\s+(?P<message>.+)"
)

def parse_log_line(line: str) -> dict | None:
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None

Language-Specific: JavaScript

// Regex literals and constructor
const emailRe = /^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$/;
const dynamic = new RegExp(`^${escapeRegex(prefix)}.*Regex Master
Regular expressions are a domain-specific language for pattern matching embedded in almost
every programming language. A well-crafted regex can replace 30 lines of parsing code; a
poorly crafted one can take down a server (ReDoS). The key skills are: knowing the engine
you're working with, understanding greedy vs lazy vs possessive quantifiers, and recognizing
when regex is the wrong tool.
Core Mental Model
A regex engine works by trying to match the pattern against the input string, character by
character, using backtracking when a path fails. Understanding backtracking is the key to
understanding both correctness and performance. Greedy quantifiers consume as much as
possible then back off; lazy quantifiers consume as little as possible then expand. Possessive
quantifiers and atomic groups disable backtracking for a sub-pattern — they're your main
tool for preventing catastrophic backtracking.
Syntax Reference
Character Classes and Anchors
.       Any character except newline (unless DOTALL flag)
\d      Digit [0-9]
\D      Non-digit
\w      Word character [a-zA-Z0-9_]
\W      Non-word character
\s      Whitespace [ \t\n\r\f\v]
\S      Non-whitespace
[abc]   Character class: a, b, or c
[^abc]  Negated class: anything except a, b, c
[a-z]   Range: lowercase letters
[a-zA-Z0-9] Alphanumeric

^       Start of string (or line in MULTILINE mode)
$       End of string (or line in MULTILINE mode)
\b      Word boundary (between \w and \W)
\B      Non-word boundary
\A      Absolute start of string (not affected by MULTILINE)
\Z      Absolute end of string
Quantifiers — Greedy vs Lazy vs Possessive
Greedy (default): consume maximum, backtrack if needed
*       0 or more
+       1 or more
?       0 or 1
{n}     Exactly n
{n,}    n or more
{n,m}   Between n and m

Lazy: consume minimum, expand if needed
*?      0 or more (lazy)
+?      1 or more (lazy)
??      0 or 1 (lazy)
{n,m}?  n to m (lazy)

Possessive (PCRE/Java): consume maximum, NO backtracking
*+      0 or more possessive
++      1 or more possessive
?+      0 or 1 possessive
(?>...) Atomic group (same as possessive for the group)
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: matches longest possible
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>']  ← too greedy

# Lazy: matches shortest possible
re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← as expected

# Better: character class that excludes >
re.findall(r"<[^>]+>", text)
# ['<b>', '</b>', '<i>', '</i>']  ← fast, no backtracking
Groups — Capturing, Non-Capturing, Named
# Capturing group: ( )
# Matches and captures for backreference or extraction
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-03-14")
m.group(1)  # "2026"
m.group(2)  # "03"
m.group(3)  # "14"

# Non-capturing group: (?: )
# Grouping without capturing (faster, cleaner)
re.match(r"(?:https?|ftp)://([^/]+)", "https://api.moltbotden.com/v1")
# Only captures the host, not the scheme

# Named groups: (?P<name>...) in Python, (?<name>...) in JS/Go
pattern = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
m = pattern.match("2026-03-14")
m.group("year")   # "2026"
m.groupdict()     # {"year": "2026", "month": "03", "day": "14"}

# Alternation within group
re.findall(r"\b(?:error|warning|critical)\b", log_text, re.IGNORECASE)
Lookahead and Lookbehind Assertions
# Positive lookahead: (?=...) — matches if followed by
# Find prices (numbers followed by a currency symbol)
re.findall(r"\d+(?=\s*USD)", "100 USD and 200 EUR")
# ["100"]  — only the USD amount

# Negative lookahead: (?!...) — matches if NOT followed by
# Match "agent" not followed by "Error"
re.findall(r"\bagent(?!Error)\b\w*", text)

# Positive lookbehind: (?<=...) — matches if preceded by
# Find amounts preceded by dollar sign
re.findall(r"(?<=\$)\d+(?:\.\d{2})?", "$100 and $200.50")
# ["100", "200.50"]

# Negative lookbehind: (?<!...) — matches if NOT preceded by
# Match .js but not .min.js
re.findall(r"(?<!\.min)\.js$", "app.js\napp.min.js\nlib.js", re.MULTILINE)
# ["app.js", "lib.js"]

# Combining assertions
# Match a word that is preceded by "agent: " and followed by " ("
re.findall(r"(?<=agent: )\w+(?= \()", "agent: optimus (active)")
# ["optimus"]
Backreferences
# Backreference: \1 (by number) or (?P=name) (by name)
# Match repeated words
re.findall(r"\b(\w+)\s+\1\b", "the the quick brown fox fox")
# ["the", "fox"]

# Named backreference
re.search(r"(?P<tag>\w+)>.*?</(?P=tag)>", "<b>bold text</b>")

# In substitution: \1 or \g<name>
re.sub(r"(\w+)\s+\1", r"\1", "the the quick")  # remove duplicates
# "the quick"

re.sub(r"(?P<first>\w+)\s+(?P<last>\w+)", r"\g<last>, \g<first>", "John Doe")
# "Doe, John"
Atomic Groups and Possessive Quantifiers
# Problem: nested quantifiers cause catastrophic backtracking
# Pattern: (a+)+ against "aaaaab"
# Engine tries 2^n combinations before failing
import re, time

dangerous = re.compile(r"(a+)+$")
# dangerous.match("aaaaaaaaaaaaaaaab")  # ← will hang!

# Fix 1: Possessive quantifier (PCRE only — not Python's re)
# (a++)+ would prevent backtracking on inner +

# Fix 2: Atomic group (not in Python re, available in regex module)
import regex
safe = regex.compile(r"(?>a+)+$")

# Fix 3: Rewrite to avoid ambiguity (best approach)
fixed = re.compile(r"a+$")  # same intent, unambiguous
PCRE vs RE2 vs POSIX
Feature PCRE RE2 POSIX

Named groups ✅ __INLINE_CODE_0__ ✅ __INLINE_CODE_1__ ❌
Lookahead ✅ ✅ ❌
Lookbehind ✅ ✅ (fixed-width) ❌
Backreferences ✅ ❌ ✅
Possessive ✅ N/A ❌
Atomic groups ✅ N/A ❌
Performance O(2^n) worst O(n) guaranteed O(n)
Used in Python, PHP, Perl, Java Go, RE2, Rust (regex) grep, sed

RE2 key constraints:
- Guaranteed O(n) time — safe for user input
- No backreferences (by design — prevent exponential backtracking)
- Fixed-width lookbehind only
- No possessive quantifiers or atomic groups (not needed with linear engine)

PCRE (Python re, JavaScript) key differences:
- Supports backreferences and variable-width lookbehind
- Can be exploited with ReDoS if used on untrusted input
- Use the `regex` module in Python for possessive quantifiers
When NOT to Use Regex
❌ Don't use regex for:

HTML/XML parsing
  <div class="(\w+)">.*?</div>  — fails on nested tags, attributes
  ✅ Use: BeautifulSoup (Python), DOMParser (JS), html.parser

Nested structures (JSON, S-expressions, balanced parens)
  (?:\([^)]*\))+  — can't handle (\(inner (\(deep\))\))
  ✅ Use: json.parse(), proper parser

Dates with complex rules (leap years, month lengths)
  ✅ Use: datetime.strptime(), date-fns, Temporal

Email validation (RFC 5321 is 100+ pages)
  ✅ Use: simple heuristic regex + send verification email

URLs (there is no universally correct URL regex)
  ✅ Use: URL() constructor (JS), urllib.parse (Python)

CSV with quoted fields containing commas
  "field1","field with, comma","field3"
  ✅ Use: csv module (Python), papaparse (JS)
Performance Pitfalls — Catastrophic Backtracking
# Catastrophic patterns (avoid on user input):
r"(a+)+"          # ← O(2^n) — exponential
r"(a|aa)+"        # ← O(2^n) — overlapping alternatives
r"(\w+\s?)+$"     # ← O(2^n) — on non-matching string

# The rule: if a quantified group contains another quantifier
# AND the inner and outer patterns can match the same characters
# → potential catastrophic backtracking

# Detecting ReDoS vulnerability:
# 1. Input that almost matches → triggers max backtracking
# 2. Long input of repeating chars + one non-matching char at end
"a" * 30 + "!"  # test with your pattern

# Fixes:
# 1. Remove ambiguity: (\w+\s?)+ → \w+(\s\w+)*
# 2. Use possessive/atomic: (?>a+)+
# 3. Use RE2-based engine for untrusted input
# 4. Set timeout (Python's re doesn't support timeout natively)
import signal
def timeout_handler(signum, frame): raise TimeoutError()
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1)  # 1 second timeout
try:
    result = re.match(pattern, user_input)
finally:
    signal.alarm(0)
Language-Specific: Python re Module
import re

# Flags
re.IGNORECASE  # re.I — case-insensitive
re.MULTILINE   # re.M — ^ and $ match line boundaries
re.DOTALL      # re.S — dot matches newline
re.VERBOSE     # re.X — allow whitespace and comments
re.ASCII       # re.A — \w, \d, etc. match ASCII only (not Unicode)

# Functions
re.match(pattern, string)      # match at START of string only
re.search(pattern, string)     # match ANYWHERE in string
re.findall(pattern, string)    # return list of all matches
re.finditer(pattern, string)   # return iterator of Match objects
re.sub(pattern, repl, string)  # substitute matches
re.split(pattern, string)      # split by pattern

# Compile for reuse (faster in loops)
EMAIL_RE = re.compile(
    r"""
    (?P<local>[a-zA-Z0-9._%+\-]+)  # local part
    @
    (?P<domain>[a-zA-Z0-9.\-]+)     # domain
    \.
    (?P<tld>[a-zA-Z]{2,})           # TLD
    """,
    re.VERBOSE,
)

# Named groups + verbose mode
def parse_email(email: str) -> dict | None:
    m = EMAIL_RE.match(email)
    return m.groupdict() if m else None

# Practical example: log parser
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
    r"\s+(?P<logger>[\w.]+)"
    r"\s+(?P<message>.+)"
)

def parse_log_line(line: str) -> dict | None:
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
Language-Specific: JavaScript
, "i");

// Flags: i (case-insensitive), g (global), m (multiline), s (dotAll), u (unicode), d (indices)

// exec with global flag — iterate all matches with named groups
const LOG_RE = /(?<ts>\d{4}-\d{2}-\d{2}) (?<level>\w+): (?<msg>.+)/g;
for (const match of logText.matchAll(LOG_RE)) {
  console.log(match.groups.ts, match.groups.level, match.groups.msg);
}

// Named groups in replace
const formatted = "2026-03-14".replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  "__CODE_BLOCK_11__lt;day>/__CODE_BLOCK_11__lt;month>/__CODE_BLOCK_11__lt;year>"
);
// "14/03/2026"

// String.matchAll: returns iterator of match objects (requires /g flag)
const urls = [...text.matchAll(/https?:\/\/[^\s>]+/g)].map(m => m[0]);

// Escape user input before inserting into regex
function escapeRegex(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\__CODE_BLOCK_11__amp;");
}

Language-Specific: Go (RE2)

import "regexp"

// Go uses RE2 — no backreferences, guaranteed O(n)
var agentIDRe = regexp.MustCompile(`^[a-z0-9-]{3,64}$`)

func ValidateAgentID(id string) bool {
    return agentIDRe.MatchString(id)
}

// Named groups (SubexpNames)
var logRe = regexp.MustCompile(
    `(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<msg>.+)`,
)

func ParseLog(line string) map[string]string {
    match := logRe.FindStringSubmatch(line)
    if match == nil { return nil }

    result := make(map[string]string)
    for i, name := range logRe.SubexpNames() {
        if i != 0 && name != "" {
            result[name] = match[i]
        }
    }
    return result
}

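// ReplaceAllStringFunc for complex substitutions
// (wordRe and UpperWords are illustrative names)
var wordRe = regexp.MustCompile(`\w+`)

func UpperWords(input string) string {
    return wordRe.ReplaceAllStringFunc(input, func(s string) string {
        return strings.ToUpper(s)
    })
}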

Practical Patterns

# Email: pragmatic (not RFC-perfect — verify by sending)
EMAIL = r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$"

# URL extraction (handles common cases)
URL = r"https?://(?:[a-zA-Z0-9\-._~:/?#\[\]@!$&'()*+,;=%]|(?:%[0-9a-fA-F]{2}))+"

# Agent ID validation
AGENT_ID = r"^[a-z0-9][a-z0-9\-]{1,62}[a-z0-9]$"

# ISO 8601 date
ISO_DATE = r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$"

# Semantic version
SEMVER = r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<pre>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?(?:\+(?P<build>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?$"

# Log line with structured data
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)"
    r"\s+(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)"
    r"\s+\[(?P<req_id>[a-f0-9\-]+)\]"
    r"\s+(?P<message>.+)$"
)

# CSV field tokenizer (commas inside quotes, backslash escapes only; prefer the csv module for real files)
CSV_FIELD = re.compile(r'"(?:[^"\\]|\\.)*"|[^,\n]+')

# Markdown headings
MD_HEADING = re.compile(r"^(?P<level>#{1,6})\s+(?P<text>.+)$", re.MULTILINE)
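
A short sketch of how a few of these patterns might be exercised. The patterns are copied from the list above; the helper names (`is_iso_date`, `is_agent_id`, `headings`) and the sample inputs are assumptions made for illustration.

```python
import re

# Patterns copied from the list above.
ISO_DATE = r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$"
AGENT_ID = r"^[a-z0-9][a-z0-9\-]{1,62}[a-z0-9]$"
MD_HEADING = re.compile(r"^(?P<level>#{1,6})\s+(?P<text>.+)$", re.MULTILINE)


def is_iso_date(value: str) -> bool:
    """Shape check only: digit ranges, not calendar validity."""
    return re.fullmatch(ISO_DATE, value) is not None


def is_agent_id(value: str) -> bool:
    """3 to 64 characters, lowercase alphanumerics with inner hyphens."""
    return re.fullmatch(AGENT_ID, value) is not None


def headings(markdown: str) -> list[tuple[int, str]]:
    """Return (depth, title) for each Markdown heading line."""
    return [(len(m["level"]), m["text"]) for m in MD_HEADING.finditer(markdown)]


print(is_iso_date("2026-03-14"))   # True
print(is_iso_date("2026-13-01"))   # False: month 13 is rejected
print(is_iso_date("2026-02-31"))   # True: the pattern cannot know February's length
print(is_agent_id("optimus-01"))   # True
print(is_agent_id("-optimus"))     # False: must start with a letter or digit
print(headings("# Title\n\n## Section one\nbody text\n"))
# [(1, 'Title'), (2, 'Section one')]
```

The `2026-02-31` result is the reason date rules beyond the basic shape are better left to a date library.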

Anti-Patterns

# ❌ Parsing HTML with regex
re.findall(r"<div class=\"content\">(.*?)</div>", html)
# ✅ Use BeautifulSoup or lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.find("div", class_="content").text

# ❌ Not compiling patterns used in loops
for line in lines:
    if re.match(r"ERROR: \d+", line):  # pattern cache lookup on every iteration
        print(line)
# ✅ Compile once, reuse the compiled pattern
ERROR_RE = re.compile(r"ERROR: \d+")
for line in lines:
    if ERROR_RE.match(line):
        print(line)

# ❌ Nested quantifiers on overlapping patterns
r"(\w+)+"          # catastrophic
r"([a-zA-Z0-9]+)+" # catastrophic
# ✅ Remove inner quantifier or use atomic group

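# ❌ Anchoring incorrectly
re.match(r"error", text)     # only matches at the start of text
re.search(r"^error$", text)  # manual anchors; without re.MULTILINE this is effectively a whole-string test
# ✅ Pick the function that states the intent
re.search(r"error", text)     # "error" anywhere in text
re.fullmatch(r"error", text)  # text is exactly "error"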

# ❌ Capturing when you don't need captures (slower)
r"(https?)://(.*)"  # capturing groups
# ✅
r"(?:https?)://(?:.*)"  # non-capturing

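# ❌ Using regex for simple contains check
if re.search(r"error", text):
    print("found")
# ✅ Plain substring test (add .lower() only if the check should be case-insensitive)
if "error" in text:
    print("found")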
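
The anchoring item above is easy to get wrong, so here is a minimal sketch of how `re.match`, `re.search`, and `re.fullmatch` differ on the same input. The sample string is an assumption chosen for illustration.

```python
import re

text = "fatal error: disk full"

at_start = re.match(r"error", text)      # tries position 0 only
anywhere = re.search(r"error", text)     # scans forward for the first match
whole = re.fullmatch(r"error", text)     # must consume the entire string
exact = re.fullmatch(r"error", "error")  # same pattern, input is exactly "error"

print(at_start)          # None
print(anywhere.span())   # (6, 11)
print(whole)             # None
print(exact.group(0))    # error
```

Choosing the function by intent usually removes the need for hand-written `^` and `$` anchors in simple checks.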

Quick Reference

Greedy:      .* .+ matches max, backtracks if needed
Lazy:        .*? .+? matches min, expands if needed
Possessive:  .*+ .++ matches max, NO backtracking (PCRE)
Groups:      (capture), (?:non-capture), (?P<name>named)
Lookahead:   (?=ahead) (?!not-ahead) — zero-width, not consumed
Lookbehind:  (?<=behind) (?<!not-behind), zero-width; fixed-width only in Python re, unsupported in RE2
Backref:     \1 by number, (?P=name) in Python, $<name> in JS replace
ReDoS:       (x+)+ or (x|x)+ patterns → catastrophic with non-matching input
RE2 vs PCRE: RE2 = O(n) guaranteed, no backrefs or lookaround; PCRE = full features, risk of ReDoS
Python re:   re.compile + VERBOSE flag for complex patterns
JS:          /g flag + matchAll() for all matches with groups
Go:          regexp.MustCompile, SubexpNames() for named group extraction
When to stop: HTML, JSON, CSV with quotes, nested structures → use proper parsers
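
Engine limits like the lookbehind rule in the summary can be probed directly instead of memorized. The sketch below checks whether Python's `re` accepts a pattern; the helper name `compiles` and the sample patterns are assumptions for illustration.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(r"(?<=\$)\d+"))      # True: lookbehind of fixed width 1
print(compiles(r"(?<=ab|cd)\d+"))   # True: both alternatives have width 2
print(compiles(r"(?<=\$\s*)\d+"))   # False: look-behind requires fixed-width pattern
```

The same kind of probe is a cheap way to confirm whether a feature exists before porting a pattern between engines.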
