Introduction
fuzzy-regex is a high-performance fuzzy regular expression engine written in Rust. It combines traditional regex constructs with approximate string matching using Damerau-Levenshtein automata and the Bitap algorithm.
Key Features
- Fuzzy Matching: Match strings with configurable edit distance tolerance (insertions, deletions, substitutions, transpositions)
- Full Regex Support: Character classes, quantifiers, groups, alternation, anchors, lookahead/lookbehind
- Per-Segment Fuzziness: Control fuzziness for individual parts of a pattern
- Capture Groups: Named and numbered capture groups with fuzzy matching
- Similarity Scoring: Get match quality scores (0.0 - 1.0)
- Streaming API: Process large files and network streams incrementally
- High Performance: Bitap algorithm for patterns ≤64 chars, SIMD optimizations
- Unicode Support: Full Unicode support including case-insensitive matching
Why Fuzzy Regex?
Traditional regex requires exact matches. Fuzzy regex allows for “close enough” matches:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match "hello" with up to 1 edit
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
println!("{}", re.is_match("hello")); // Exact match
println!("{}", re.is_match("hallo")); // 1 substitution
println!("{}", re.is_match("helo")); // 1 deletion
println!("{}", re.is_match("helllo")); // 1 insertion
println!("{}", re.is_match("hlelo")); // 1 transposition
}
Performance
fuzzy-regex is designed for high performance:
- Bitap Algorithm: O(n×k) time complexity for patterns ≤64 characters
- SIMD Optimizations: ~2-10x speedup on supported platforms (AVX2, NEON)
- Hardened Mode: Guaranteed O(n) worst-case for pathological patterns
- Streaming: Process gigabytes of data without loading into memory
Typical throughput: 1.4-2.0 Gbps for streaming fuzzy matching on modern hardware.
Pathological Pattern Protection
Some regex patterns can cause O(n²) behavior in naive implementations. fuzzy-regex includes hardened mode that guarantees O(n) performance even for patterns like .*a|b:
#![allow(unused)]
fn main() {
let re = FuzzyRegex::new(".*a|b").unwrap();
// Guaranteed O(n) even on pathological patterns
let matches = re.find_all_hardened(text);
}
Ecosystem
- MSRV: Rust 1.88+
- Features: SIMD (default), mimalloc (optional)
- Dependencies: Zero runtime dependencies
Installation
Add fuzzy-regex to your Cargo.toml:
[dependencies]
fuzzy-regex = "0.1"
Feature Flags
simd(default): Enable SIMD optimizations for faster matching on x86_64 and ARMmimalloc: Use mimalloc allocator for better performance in memory-intensive workloads
[dependencies]
fuzzy-regex = { version = "0.1", default-features = false }
# Or with mimalloc
fuzzy-regex = { version = "0.1", features = ["simd", "mimalloc"] }
MSRV
The minimum supported Rust version is 1.88.
Quick Verification
use fuzzy_regex::FuzzyRegex;
fn main() {
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
println!("{}", re.is_match("hallo")); // true
}
Quick Start
Basic Fuzzy Matching
fn main() {
use fuzzy_regex::FuzzyRegex;
// Allow up to 2 edits
let re = FuzzyRegex::new("(?:hello){e<=2}").unwrap();
assert!(re.is_match("hello")); // Exact match
assert!(re.is_match("helo")); // 1 deletion
assert!(re.is_match("helllo")); // 1 insertion
assert!(re.is_match("hallo")); // 1 substitution
assert!(re.is_match("hlelo")); // 1 transposition
}
Using the Builder
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:teh){e<=1}")
.similarity(0.7) // Minimum similarity score
.case_insensitive(true) // Case-insensitive matching
.build()
.unwrap();
let m = re.find("I saw teh cat").unwrap();
assert_eq!(m.as_str(), "teh");
assert!(m.similarity() >= 0.7);
}
Finding Matches
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new(r"(?:\w+){e<=1}").unwrap();
// Find first match
let m = re.find("This is a tset").unwrap();
assert_eq!(m.as_str(), "tset");
// Find all matches
let matches: Vec<_> = re.find_iter("test tset tsat").collect();
assert_eq!(matches.len(), 3);
}
Streaming Large Data
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:needle){e<=1}").unwrap();
let mut stream = re.stream();
// Process in chunks
for m in stream.feed(b"some hay and niddle here") {
println!("Found: {} at {}", m.as_str(), m.start());
}
// Check position
assert_eq!(stream.position(), 24);
}
Fuzzy Matching Basics
Fuzzy matching allows matching strings that are “close” to the pattern, not just exact matches. The closeness is measured using edit distance.
Edit Distance
Edit distance is the minimum number of character-level operations needed to transform one string into another:
- Insertion: Add a character (cost: 1)
- Deletion: Remove a character (cost: 1)
- Substitution: Replace one character with another (cost: 1)
- Transposition: Swap two adjacent characters (cost: 1)
Example: “hello” → “hallo” (1 substitution)
Fuzziness Markers
Apply fuzziness to a pattern segment using {...}:
| Syntax | Description |
|---|---|
(?:text){e<=N} | Allow up to N total edits |
(?:text){i<=N} | Allow up to N insertions |
(?:text){d<=N} | Allow up to N deletions |
(?:text){s<=N} | Allow up to N substitutions |
(?:text){t<=N} | Allow up to N transpositions |
fn main() {
use fuzzy_regex::FuzzyRegex;
// Allow up to 2 total edits
let re1 = FuzzyRegex::new("(?:hello){e<=2}").unwrap();
// Allow specific edit types
let re2 = FuzzyRegex::new("(?:hello){i<=1,d<=1}").unwrap();
// Allow substitutions only
let re3 = FuzzyRegex::new("(?:hello){s<=2}").unwrap();
println!("re1: {:?}", re1.is_match("hello"));
println!("re2: {:?}", re2.is_match("helo"));
println!("re3: {:?}", re3.is_match("hallo"));
}
Shorthand Syntax
Use ~N as shorthand for {e<=N}:
fn main() {
use fuzzy_regex::FuzzyRegex;
// These are equivalent:
let re1 = FuzzyRegex::new("(?:hello)~2").unwrap();
let re2 = FuzzyRegex::new("(?:hello){e<=2}").unwrap();
// Exact match with ~0
let re3 = FuzzyRegex::new("(?:hello)~0").unwrap();
println!("re1 == re2: {}", re1.is_match("hello") == re2.is_match("hello"));
println!("re3 exact: {}", re3.is_match("hello"));
}
Unlimited Errors
Omit the number to allow unlimited edits:
fn main() {
// Allow unlimited substitutions
let re = FuzzyRegex::new("(?:hello){s}").unwrap();
// Allow any number of errors
let re2 = FuzzyRegex::new("(?:hello){e}").unwrap();
println!("{}", re.is_match("hallo"));
println!("{}", re2.is_match("xyz"));
}
Edit Distance
The edit distance (Damerau-Levenshtein distance) measures how many operations it takes to transform the pattern into the matched text.
Edit Types
Insertion
Adding a character to the text:
Pattern: "cat"
Text: "cart"
Edit: Insert 'r' after 'c'
Cost: 1
Deletion
Removing a character from the text:
Pattern: "cat"
Text: "ct"
Edit: Delete 'a'
Cost: 1
Substitution
Replacing one character with another:
Pattern: "cat"
Text: "bat"
Edit: Replace 'c' with 'b'
Cost: 1
Transposition
Swapping two adjacent characters:
Pattern: "cat"
Text: "act"
Edit: Swap 'c' and 'a'
Cost: 1
Accessing Edit Information
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:hello){e<=2}").unwrap();
let m = re.find("hallo").unwrap();
println!("Similarity: {:.2}", m.similarity());
println!("Edits: {}", m.len()); // Match length
// Edit counts available in the engine
// Use engine::EditCounts for detailed info
}
Similarity Score
The similarity score ranges from 0.0 to 1.0, where 1.0 is an exact match:
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello){e<=2}")
.similarity(0.8) // Minimum similarity threshold
.build()
.unwrap();
// Exact match: similarity = 1.0
// 1 edit: similarity around 0.8-0.9
// 2 edits: similarity around 0.6-0.8
println!("{}", re.is_match("hello"));
println!("{}", re.is_match("hallo"));
}
Fuzziness Markers
Detailed syntax for controlling fuzzy matching.
Basic Markers
| Marker | Description | Example |
|---|---|---|
{e<=N} | Total edits ≤ N | {e<=2} |
{i<=N} | Insertions ≤ N | {i<=1} |
{d<=N} | Deletions ≤ N | {d<=1} |
{s<=N} | Substitutions ≤ N | {s<=1} |
{t<=N} | Transpositions ≤ N | {t<=1} |
Combining Markers
fn main() {
use fuzzy_regex::FuzzyRegex;
// Allow 1 insertion AND 1 deletion
let re1 = FuzzyRegex::new("(?:hello){i<=1,d<=1}").unwrap();
// Allow up to 2 substitutions OR up to 1 deletion (combined constraint)
let re2 = FuzzyRegex::new("(?:hello){s<=2,d<=1}").unwrap();
// Each constraint is independent
// The match must satisfy ALL specified constraints
println!("re1 matches 'helo': {}", re1.is_match("helo"));
println!("re2 matches 'hallo': {}", re2.is_match("hallo"));
}
Character Class Restrictions
Restrict which characters can be used for edits:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Only allow substitutions with lowercase letters
let re1 = FuzzyRegex::new(r"(?:hello){s<=1:[a-z]}").unwrap();
assert!(re1.is_match("hallo")); // 'a' is in [a-z]
assert!(!re1.is_match("h3llo")); // '3' is not in [a-z]
// Only allow insertions of digits
let re2 = FuzzyRegex::new(r"(?:hello){i<=1:[0-9]}").unwrap();
// Only allow substitutions with whitespace
let re3 = FuzzyRegex::new(r"(?:hello){s<=1:\s}").unwrap();
}
Min/Max Error Ranges
fn main() {
use fuzzy_regex::FuzzyRegex;
// Require at least 1 edit, allow up to 2
let re = FuzzyRegex::new("(?:hello){e>=1,e<=2}").unwrap();
// Minimum errors with shorthand
let re2 = FuzzyRegex::new("(?:hello){1e<=2}").unwrap();
println!("re matches 'hello': {}", re.is_match("hello"));
println!("re matches 'hallo': {}", re.is_match("hallo"));
}
Editing Classes
Apply fuzziness to specific character classes:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Apply different limits to different parts
let re = FuzzyRegex::new("(?:hello){e<=1} world{e<=1}").unwrap();
println!("{}", re.is_match("helo worled"));
}
Cost-Based Matching
Assign different costs to edit operations for fine-grained control.
Simple Cost Constraint
Use {c<=N} to limit total cost (all operations cost 1):
fn main() {
use fuzzy_regex::FuzzyRegex;
// Total cost ≤ 2 (any combination of edits, each costing 1)
let re = FuzzyRegex::new("(?:hello){c<=2}").unwrap();
assert!(re.is_match("hello")); // 0 edits, cost=0
assert!(re.is_match("hallo")); // 1 sub, cost=1
assert!(re.is_match("helo")); // 1 del, cost=1
assert!(re.is_match("hhello")); // 1 ins, cost=1
assert!(re.is_match("hallo")); // 1 sub, cost=1
}
Weighted Costs
Assign different costs to different edit types:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Insertions cost 2, others cost 1, total ≤ 3
let re = FuzzyRegex::new("(?:ab){2i+1d+1s+1t<=3}").unwrap();
assert!(re.is_match("abc")); // 1 insertion (cost=2)
assert!(re.is_match("a")); // 1 deletion (cost=1)
assert!(re.is_match("ba")); // 1 transposition (cost=1)
assert!(!re.is_match("aabc")); // 2 insertions (cost=4) - exceeds limit
}
Cost Syntax
| Syntax | Description |
|---|---|
{c<=N} | Total cost ≤ N |
{c<N} | Total cost < N |
{Ni+Md+St+Tt<=N} | Custom costs for each type |
{0i+...} | Free insertions |
{Ni...} | Insertions cost N |
fn main() {
use fuzzy_regex::FuzzyRegex;
// All operations cost 1: {1i+1d+1s+1t<=N} = {c<=N}
let re1 = FuzzyRegex::new("(?:test){c<=2}").unwrap();
// Insertions are expensive (typing errors less common)
let re2 = FuzzyRegex::new("(?:test){3i+1d+1s<=3}").unwrap();
// Insertions are free (OCR errors)
let re3 = FuzzyRegex::new("(?:test){0i+1d+1s<=2}").unwrap();
println!("re1: {}", re1.is_match("test"));
println!("re2: {}", re2.is_match("test"));
println!("re3: {}", re3.is_match("tset"));
}
Use Cases
- Typing errors: Substitutions/deletions more common than insertions
- OCR errors: Insertions more common (extra characters)
- Genetic sequences: Different scoring matrices
Pattern Syntax
fuzzy-regex supports standard regex syntax plus fuzzy matching.
Quick Reference
| Syntax | Description |
|---|---|
a | Literal character |
. | Any character except newline |
\d | Digit [0-9] |
\D | Non-digit |
\w | Word character [a-zA-Z0-9_] |
\W | Non-word character |
\s | Whitespace |
\S | Non-whitespace |
[abc] | Character class |
[^abc] | Negated class |
[a-z] | Range |
^ | Start of string |
$ | End of string |
\b | Word boundary |
\B | Non-word boundary |
Escaping
Special characters that need escaping: .^$*+?{}[]\\|()
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match literal dot
let re = FuzzyRegex::new(r"file\.txt").unwrap();
// Match literal backslash
let re2 = FuzzyRegex::new(r"path\\to\\file").unwrap();
println!("dot match: {}", re.is_match("file.txt"));
println!("backslash match: {}", re2.is_match(r"path\to\file"));
}
Character Classes
Define sets of characters to match.
Basic Classes
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match any vowel
let re1 = FuzzyRegex::new("[aeiou]").unwrap();
// Match any consonant
let re2 = FuzzyRegex::new("[^aeiou]").unwrap();
// Match any lowercase letter
let re3 = FuzzyRegex::new("[a-z]").unwrap();
// Match any alphanumeric
let re4 = FuzzyRegex::new("[a-zA-Z0-9]").unwrap();
println!("vowel: {}", re1.is_match("a"));
println!("consonant: {}", re2.is_match("b"));
}
Predefined Classes
By default, \w, \d, \s match ASCII characters only:
| Escape | ASCII Only | Unicode (with (?u) |
|---|---|---|
\d | [0-9] | Unicode digits |
\D | [^0-9] | Non-digit |
\w | [a-zA-Z0-9_] | Unicode word chars |
\W | [^a-zA-Z0-9_] | Non-word |
\s | [ \t\n\r\f] | Unicode whitespace |
\S | [^ \t\n\r\f] | Non-whitespace |
Unicode Mode
Enable Unicode character classes with (?u) flag:
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
// Via inline flag
let re1 = FuzzyRegex::new(r"(?u)\w+").unwrap();
assert!(re1.is_match("привет")); // Cyrillic matched
// Via builder
let re2 = FuzzyRegexBuilder::new(r"\w+")
.unicode(true)
.build()
.unwrap();
}
Unicode Property Escapes
fn main() {
use fuzzy_regex::FuzzyRegex;
// Unicode letters
let re1 = FuzzyRegex::new(r"\p{L}").unwrap();
// Unicode digits
let re2 = FuzzyRegex::new(r"\p{N}").unwrap();
// Emoji
let re3 = FuzzyRegex::new(r"\p{Emoji}").unwrap();
println!("letter: {}", re1.is_match("a"));
println!("digit: {}", re2.is_match("١")); // Arabic-Indic digit
}
Nested Classes
fn main() {
use fuzzy_regex::FuzzyRegex;
// POSIX-like classes (inside character class)
let re = FuzzyRegex::new("[[:alpha:]]").unwrap();
println!("alpha: {}", re.is_match("a"));
}
Quantifiers
Control how many times to match a pattern.
Basic Quantifiers
| Syntax | Description | Example |
|---|---|---|
* | Zero or more | ab* matches “a”, “ab”, “abb” |
+ | One or more | ab+ matches “ab”, “abb” |
? | Zero or one | ab? matches “a”, “ab” |
{n} | Exactly n | a{3} matches “aaa” |
{n,} | At least n | a{2,} matches “aa”, “aaa” |
{n,m} | Between n and m | a{2,4} matches “aa”, “aaa”, “aaaa” |
Lazy Quantifiers
Add ? to make quantifiers lazy (match as few as possible):
fn main() {
use fuzzy_regex::FuzzyRegex;
// Greedy: matches as much as possible
let re1 = FuzzyRegex::new(r"<.+>").unwrap();
// Lazy: matches as little as possible
let re2 = FuzzyRegex::new(r"<.+?>").unwrap();
let text = "<tag>more</tag>";
assert!(re1.find(text).unwrap().as_str() == "<tag>more</tag>");
assert!(re2.find(text).unwrap().as_str() == "<tag>");
}
Possessive Quantifiers
Prevent backtracking (useful for performance):
fn main() {
use fuzzy_regex::FuzzyRegex;
// a*+ doesn't backtrack
let re = FuzzyRegex::new("a*+b").unwrap();
println!("{}", re.is_match("ab"));
}
Quantifiers with Fuzzy Matching
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match "ab" with up to 1 error, repeated 1-3 times
let re = FuzzyRegex::new("(?:ab){e<=1}{1,3}").unwrap();
println!("{}", re.is_match("ab"));
}
Groups and Alternation
Group patterns and match alternatives.
Capture Groups
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new(r"(?<user>\w+)@(?<domain>\w+\.\w+)").unwrap();
let caps = re.captures("john@example.com").unwrap();
assert_eq!(caps.name("user").unwrap().as_str(), "john");
assert_eq!(caps.name("domain").unwrap().as_str(), "example.com");
}
Numbered Groups
fn main() {
let re = FuzzyRegex::new(r"(\w+)@(\w+)").unwrap();
let caps = re.captures("test@example").unwrap();
assert_eq!(caps.get(1).unwrap().as_str(), "test");
assert_eq!(caps.get(2).unwrap().as_str(), "example");
}
Non-Capturing Groups
fn main() {
use fuzzy_regex::FuzzyRegex;
// (?:...) doesn't create a capture group
let re = FuzzyRegex::new("(?:http|https)://").unwrap();
println!("{}", re.is_match("http://example.com"));
}
Alternation
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match either option
let re = FuzzyRegex::new("(foo|bar)").unwrap();
assert!(re.is_match("foo"));
assert!(re.is_match("bar"));
assert!(!re.is_match("baz"));
}
Nested Groups
fn main() {
let re = FuzzyRegex::new(r"((ab)(cd))").unwrap();
let caps = re.captures("abcd").unwrap();
assert_eq!(caps.get(0).unwrap().as_str(), "abcd"); // Full match
assert_eq!(caps.get(1).unwrap().as_str(), "abcd"); // Outer group
assert_eq!(caps.get(2).unwrap().as_str(), "ab"); // First inner
assert_eq!(caps.get(3).unwrap().as_str(), "cd"); // Second inner
}
Fuzzy Groups
Apply fuzziness to groups:
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("((?:hello){e<=1})").unwrap();
println!("{}", re.is_match("helo"));
}
Anchors and Boundaries
Match positions in the string.
String Anchors
| Syntax | Description |
|---|---|
^ | Start of string |
$ | End of string |
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match at start
let re1 = FuzzyRegex::new("^hello").unwrap();
assert!(re1.is_match("hello world"));
assert!(!re1.is_match("say hello"));
// Match at end
let re2 = FuzzyRegex::new("hello$").unwrap();
assert!(re2.is_match("say hello"));
assert!(!re2.is_match("hello world"));
// Match entire string
let re3 = FuzzyRegex::new("^hello$").unwrap();
assert!(re3.is_match("hello"));
assert!(!re3.is_match("hello world"));
}
Word Boundaries
| Syntax | Description |
|---|---|
\b | Word boundary |
\B | Non-word boundary |
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match whole word "cat"
let re1 = FuzzyRegex::new(r"\bcat\b").unwrap();
assert!(re1.is_match("cat"));
assert!(re1.is_match("the cat sat"));
assert!(!re1.is_match("category"));
// Match "cat" not at word boundary
let re2 = FuzzyRegex::new(r"\Bcat\B").unwrap();
assert!(re2.is_match("category"));
assert!(!re2.is_match("cat"));
}
Multi-line Mode
With (?m), ^ and $ match line boundaries:
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?m)^hello$").unwrap();
let text = "hello\nhello\nhello";
let matches: Vec<_> = re.find_iter(text).collect();
assert_eq!(matches.len(), 3);
}
Lookahead and Lookbehind
Assert what comes before or after the match without including it.
Lookahead
Positive Lookahead (?=...)
Match only if followed by:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match "hello" only if followed by " world"
let re = FuzzyRegex::new("hello(?= world)").unwrap();
assert!(re.is_match("hello world")); // Matches "hello"
assert!(!re.is_match("hello there")); // No match
}
Negative Lookahead (?!...)
Match only if NOT followed by:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match "hello" only if NOT followed by " world"
let re = FuzzyRegex::new("hello(?! world)").unwrap();
assert!(re.is_match("hello there")); // Matches "hello"
assert!(!re.is_match("hello world")); // No match
}
Lookbehind
Positive Lookbehind (?<=...)
Match only if preceded by:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match "world" only if preceded by "hello "
let re = FuzzyRegex::new("(?<=hello )world").unwrap();
assert!(re.is_match("hello world"));
assert!(!re.is_match("bye world"));
// Get match position
let m = re.find("say hello world here").unwrap();
assert_eq!(m.start(), 9); // After "hello "
}
Negative Lookbehind (?<!...)
Match only if NOT preceded by:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match "world" only if NOT preceded by "hello "
let re = FuzzyRegex::new("(?<!hello )world").unwrap();
assert!(re.is_match("bye world"));
assert!(!re.is_match("hello world"));
}
Fuzzy Lookbehind
Lookbehind can include fuzzy matching:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match "world" preceded by "hello" with up to 1 error
let re = FuzzyRegex::new("(?<=(?:hello){e<=1})world").unwrap();
assert!(re.is_match("hello world")); // Exact
assert!(re.is_match("hallo world")); // 1 substitution
assert!(!re.is_match("goodbye world")); // Too different
}
Fuzzy Lookbehind Examples
fn main() {
use fuzzy_regex::FuzzyRegex;
// Allow insertions in lookbehind
let re = FuzzyRegex::new("(?<=(?:hello){i<=2})world").unwrap();
assert!(re.is_match("helllo world")); // 2 insertions (ll)
assert!(re.is_match("hellllo world")); // 3 insertions - no match
// Allow deletions in lookbehind
let re = FuzzyRegex::new("(?<=(?:hello){d<=1})world").unwrap();
assert!(re.is_match("helo world")); // 1 deletion (one 'l')
assert!(!re.is_match("heo world")); // 2 deletions - no match
// Combined: substitutions and insertions
let re = FuzzyRegex::new("(?<=(?:hello){s<=1,i<=1})world").unwrap();
assert!(re.is_match("hallo world")); // 1 substitution
assert!(re.is_match("helloh world")); // 1 insertion
assert!(re.is_match("hallo world")); // matches, prefers lower edits
}
Case Insensitive with Fuzzy
Combine case-insensitive matching with fuzzy for robust matching:
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
// Case-insensitive + fuzzy - case differences don't count as edits
let re = FuzzyRegexBuilder::new("(?i)(?:hello){e<=1}")
.build()
.unwrap();
assert!(re.is_match("hello")); // 0 edits
assert!(re.is_match("HELLO")); // 0 edits (case difference is free!)
assert!(re.is_match("hallo")); // 1 edit (actual substitution)
// Builder's case_insensitive flag works the same way
let re2 = FuzzyRegexBuilder::new("(?:hello){e<=1}")
.case_insensitive(true)
.build()
.unwrap();
assert!(re2.is_match("HELLO")); // 0 edits
assert!(re2.is_match("HeLLo")); // 0 edits (all case differences)
// Get edit count to verify
let m = re.find("HELLO").unwrap();
assert_eq!(m.total_edits(), 0); // No edits charged for case!
}
Fuzzy Lookahead
Fuzzy matching also works with lookahead:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Match "hello" followed by "world" with up to 1 error
let re = FuzzyRegex::new("hello(?= {e<=1}world)").unwrap();
assert!(re.is_match("hello world")); // Exact
assert!(!re.is_match("hello world!")); // ! is not space, lookahead fails
}
Variable-Length Lookbehind
fn main() {
use fuzzy_regex::FuzzyRegex;
// Variable-length lookbehind (alternation)
let re = FuzzyRegex::new("(?<=(?:hello|hi) )world").unwrap();
println!("{}", re.is_match("hi world"));
}
Atomic Groups and Possessive Quantifiers
Advanced features for performance and pattern control.
Atomic Groups (?>...)
Atomic groups prevent backtracking once matched:
fn main() {
use fuzzy_regex::FuzzyRegex;
// Without atomic group: can match "abc" then backtrack
let re1 = FuzzyRegex::new("(?:ab|abc)").unwrap();
// Atomic: once "abc" matches, no backtracking
let re2 = FuzzyRegex::new("(?>ab|abc)").unwrap();
println!("re1: {}", re1.is_match("abc"));
println!("re2: {}", re2.is_match("abc"));
}
Possessive Quantifiers
Like atomic groups for quantifiers:
| Syntax | Description |
|---|---|
*+ | Possessive zero-or-more |
++ | Possessive one-or-more |
?+ | Possessive zero-or-one |
fn main() {
use fuzzy_regex::FuzzyRegex;
// Normal: backtracks to find match
let re1 = FuzzyRegex::new("a*b").unwrap();
// Possessive: doesn't backtrack
let re2 = FuzzyRegex::new("a*+b").unwrap();
println!("re1: {}", re1.is_match("ab"));
println!("re2: {}", re2.is_match("ab"));
}
When to Use
Atomic groups and possessive quantifiers are useful when:
- Performance: Prevent exponential backtracking
- Determinism: Control match behavior explicitly
- Greedy matching: When you want maximum match without backtracking
fn main() {
use fuzzy_regex::FuzzyRegex;
// Pattern that could cause backtracking issues:
// Without possessive: "aaaaaaaa" + "a*" + "b" can be slow
let re = FuzzyRegex::new("(?:a++)b").unwrap();
println!("{}", re.is_match("ab"));
}
Match Reset \K
Reset the start of the match:
fn main() {
use fuzzy_regex::FuzzyRegex;
// \K resets match start - keeps "prefix" out of match
let re = FuzzyRegex::new(r"prefix\Kworld").unwrap();
let m = re.find("prefixworld").unwrap();
assert_eq!(m.as_str(), "world");
assert_eq!(m.start(), 6); // Starts at "world", not "prefix"
}
API Reference
Overview of the main API types and methods.
Main Types
| Type | Description |
|---|---|
FuzzyRegex | Compiled regex pattern |
FuzzyRegexBuilder | Builder for customizing regex |
Match | A single match result |
Captures | Match with capture groups |
StreamingMatcher | Stateful matcher for streaming |
StreamingMatch | Match from streaming search |
Quick Method Reference
#![allow(unused)]
fn main() {
use fuzzy_regex::FuzzyRegex;
// Construction
let re = FuzzyRegex::new("pattern").unwrap();
// Searching
re.is_match(text) // bool - does it match?
re.find(text) // Option<Match>
re.find_iter(text) // Iterator<Match>
re.find_rev(text) // Option<Match> - rightmost
re.captures(text) // Option<Captures>
// Replacing
re.replace(text, "x") // String
re.replace_all(text, "x") // String
re.split(text) // Split
// Streaming
re.stream() // StreamingMatcher
re.find_bytes(bytes) // Option<StreamingMatch>
}
FuzzyRegex
The main compiled regex type.
Construction
fn main() {
use fuzzy_regex::FuzzyRegex;
// Simple construction
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
// With error handling
let re = match FuzzyRegex::new("(?:hello){e<=1}") {
Ok(r) => r,
Err(e) => panic!("Invalid pattern: {}", e),
};
println!("Created regex: {}", re.is_match("hello"));
}
Search Methods
is_match
fn main() {
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
println!("{}", re.is_match("hello")); // true
println!("{}", re.is_match("hallo")); // true
println!("{}", re.is_match("world")); // false
}
find
fn main() {
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
if let Some(m) = re.find("say hello world") {
println!("Found '{}' at {}-{}", m.as_str(), m.start(), m.end());
println!("Similarity: {:.2}", m.similarity());
}
}
find_iter
fn main() {
let re = FuzzyRegex::new(r"(?:\w+){e<=1}").unwrap();
for m in re.find_iter("test tset tsat") {
println!("Match: '{}'", m.as_str());
}
}
find_rev
Find rightmost match:
fn main() {
let re = FuzzyRegex::new("(?:foo)").unwrap();
let m = re.find_rev("foo foo foo").unwrap();
assert_eq!(m.start(), 8);
}
Replace Methods
fn main() {
let re = FuzzyRegex::new("world").unwrap();
// Replace first match
let result = re.replace("hello world", "rust");
assert_eq!(result, "hello rust");
// Replace all matches
let result = re.replace_all("hello world world", "rust");
assert_eq!(result, "hello rust rust");
}
Other Methods
fn main() {
let re = FuzzyRegex::new(r"(?<word>\w+)").unwrap();
// Capture groups
let caps = re.captures("hello").unwrap();
let word = caps.name("word").unwrap();
// Split
let parts: Vec<_> = re.split("a1b2c").collect();
// Check pattern info
println!("Capture count: {}", re.capture_count());
}
FuzzyRegexBuilder
Builder for customizing regex construction.
Basic Usage
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello)")
.build()
.unwrap();
println!("Created");
}
Builder Options
Similarity Threshold
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello){e<=2}")
.similarity(0.8) // Minimum similarity 0.0-1.0
.build()
.unwrap();
println!("Created");
}
Case Insensitivity
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello)")
.case_insensitive(true)
.build()
.unwrap();
assert!(re.is_match("HELLO"));
assert!(re.is_match("Hello"));
}
Multi-line Mode
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("^hello$")
.multi_line(true) // ^ and $ match line boundaries
.build()
.unwrap();
println!("Created");
}
Dot-all Mode
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("a.b")
.dot_all(true) // . matches newlines
.build()
.unwrap();
println!("Created");
}
Partial Matching
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello){e<=1}")
.partial(true) // Matches at end of text are partial
.build()
.unwrap();
let m = re.find("hello").unwrap();
assert!(m.partial()); // Match reaches end of text
}
Timeout
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
use std::time::Duration;
let re = FuzzyRegexBuilder::new("(?:hello){e<=5}")
.timeout(Duration::from_millis(100))
.build()
.unwrap();
println!("Created");
}
Chaining Options
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello){e<=1}")
.case_insensitive(true)
.similarity(0.7)
.partial(true)
.multi_line(true)
.build()
.unwrap();
println!("Created");
}
Match Results
Methods available on match results.
Basic Methods
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
let m = re.find("say hello world").unwrap();
// Get matched string
println!("{}", m.as_str()); // &str: "hello"
println!("{}", m.start()); // usize: 4 (byte position)
println!("{}", m.end()); // usize: 9
println!("{}", m.len()); // usize: 5
println!("{}", m.is_empty()); // bool: false
}
Similarity
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
let m = re.find("say hello world").unwrap();
// Similarity score 0.0 - 1.0
println!("{}", m.similarity()); // f32: 1.0 = exact, lower = more errors
}
Partial Match
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
let m = re.find("say hello").unwrap();
// Check if match reaches end of input (for partial matching)
println!("{}", m.partial()); // bool: true if match at end
}
Match as Different Types
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
let m = re.find("hello world").unwrap();
// Get raw bytes
let bytes = m.as_bytes();
// Get range
let range = m.range(); // Range<usize>
println!("Range: {:?}", range);
}
Iterating Matches
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new(r"\w+").unwrap();
for m in re.find_iter("hello world") {
println!("{}", m.as_str());
}
}
Match Iteration with Positions
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:foo){e<=1}").unwrap();
for m in re.find_iter("foo fxo fxoo") {
println!("'{}' at [{}, {})", m.as_str(), m.start(), m.end());
}
}
Streaming API
Process large data incrementally without loading into memory.
Creating a Stream
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:needle){e<=1}").unwrap();
let mut stream = re.stream();
println!("Stream created");
}
Feeding Data
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:needle){e<=1}").unwrap();
let mut stream = re.stream();
// Feed data in chunks
for m in stream.feed(b"some hay and niddle here") {
println!("Found '{}' at {}", m.as_str(), m.start());
}
}
Position Tracking
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:needle){e<=1}").unwrap();
let mut stream = re.stream();
stream.feed(b"hello");
assert_eq!(stream.position(), 5);
stream.feed(b" world");
assert_eq!(stream.position(), 11);
}
Cross-Boundary Matches
Matches can span chunk boundaries:
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:needle){e<=1}").unwrap();
let mut stream = re.stream();
// First chunk - no match yet
let _ = stream.feed(b"xxx ");
// Second chunk - match spanning boundary
let matches: Vec<_> = stream.feed(b"hello").collect();
println!("Matches: {}", matches.len());
}
Resetting
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:needle){e<=1}").unwrap();
let mut stream = re.stream();
stream.feed(b"hello world");
stream.reset();
assert_eq!(stream.position(), 0);
}
Finish
Check for remaining matches at end of stream:
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:needle){e<=1}").unwrap();
let mut stream = re.stream();
stream.feed(b"test hel");
// Check remaining buffer
let final_match = stream.finish();
}
Reader Integration
Process files or other readers:
fn main() {
use fuzzy_regex::FuzzyRegex;
use std::io::BufReader;
use std::fs::File;
let re = FuzzyRegex::new("(?:needle){e<=1}").unwrap();
let file = File::open("Cargo.toml").unwrap();
for m in re.stream().search_reader(BufReader::new(file)) {
println!("Match at byte {}", m.start());
}
}
Byte-Level API
Search byte slices directly:
fn main() {
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
if let Some(m) = re.find_bytes(b"hello world") {
println!("Found at {}-{}", m.start(), m.end());
}
// Iterate over bytes
let matches: Vec<_> = re.find_iter_bytes(b"hello hullo").collect();
println!("Matches: {}", matches.len());
}
Advanced Features
Additional capabilities beyond basic fuzzy matching.
In This Section
This section covers:
- Capture Groups - Extract matched portions
- Partial Matching - Match incomplete input
- Word Lists - Named reference patterns
- Custom Handlers - Invoke custom logic during matching
- Compatibility Layer - Migrate from other libraries
Capture Groups
Extract specific portions of matches.
Named Capture Groups
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new(r"(?<user>\w+)@(?<domain>\w+\.\w+)").unwrap();
let caps = re.captures("john@example.com").unwrap();
assert_eq!(caps.name("user").unwrap().as_str(), "john");
assert_eq!(caps.name("domain").unwrap().as_str(), "example.com");
}
Numbered Capture Groups
fn main() {
let re = FuzzyRegex::new(r"(\w+)@(\w+)").unwrap();
let caps = re.captures("test@example").unwrap();
assert_eq!(caps.get(1).unwrap().as_str(), "test");
assert_eq!(caps.get(2).unwrap().as_str(), "example");
assert_eq!(caps.get(0).unwrap().as_str(), "test@example"); // Full match
}
Nested Groups
fn main() {
let re = FuzzyRegex::new(r"((?P<outer>\w+)-(?P<inner>\w+))").unwrap();
let caps = re.captures("hello-world").unwrap();
assert_eq!(caps.get(0).unwrap().as_str(), "hello-world");
assert_eq!(caps.get(1).unwrap().as_str(), "hello-world");
assert_eq!(caps.name("outer").unwrap().as_str(), "hello");
assert_eq!(caps.name("inner").unwrap().as_str(), "world");
}
Capture Groups with Fuzzy Matching
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new(r"(?<word>\w+){e<=1}").unwrap();
let caps = re.captures("tset").unwrap();
assert_eq!(caps.name("word").unwrap().as_str(), "tset");
}
Fuzzy Edits Information
When using fuzzy matching, you can get detailed edit information:
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new(r"(?:hello){e<=2}").unwrap();
let m = re.find("helllo").unwrap();
// Get total edits
println!("Total edits: {}", m.total_edits()); // 1 (1 insertion)
// Get detailed counts: (insertions, deletions, substitutions)
let (ins, del, sub) = m.fuzzy_counts();
println!("Insertions: {}, Deletions: {}, Substitutions: {}", ins, del, sub);
// Get positions of changes
let (ins_pos, del_pos, sub_pos) = m.fuzzy_changes();
println!("Insertion positions: {:?}", ins_pos);
}
Handler Overrides in Captures
When using handlers with MatchOverride, the overrides are tracked in captures:
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
let re = FuzzyRegexBuilder::new(r"(prefix(?call:handler)suffix)")
.handler("handler", |text, pos| {
if text[pos..].starts_with("XYZ") {
HandlerResult::MatchOverride(3, "xyz".to_string())
} else {
HandlerResult::NoMatch
}
})
.build()
.unwrap();
let caps = re.captures("prefixXYZsuffix").unwrap();
// The captured text shows the override
assert_eq!(caps.get(1).unwrap().as_str(), "prefixxyzsuffix");
// Handler overrides track (start_byte, end_byte, override_text)
let overrides = caps.handler_overrides();
println!("Handler overrides: {:?}", overrides);
// Output: [(6, 9, "xyz")]
}
Multiple Capture Groups with Fuzzy
fn main() {
use fuzzy_regex::FuzzyRegex;
// Two fuzzy groups with different limits
let re = FuzzyRegex::new(r"(?<a>\w+){e<=1} (?<b>\w+){e<=1}").unwrap();
let caps = re.captures("helo wrold").unwrap();
assert_eq!(caps.name("a").unwrap().as_str(), "helo"); // 1 substitution
assert_eq!(caps.name("b").unwrap().as_str(), "wrold"); // 1 substitution
// Each group tracks its own edits
// Note: Fuzzy matching applies to the overall pattern
}
Iterating Captures
fn main() {
let re = FuzzyRegex::new(r"(\w+)").unwrap();
let caps = re.captures("hello world").unwrap();
for cap in caps.iter() {
if let Some(m) = cap {
println!("Group: '{}'", m.as_str());
}
}
}
Accessing All Groups
fn main() {
let re = FuzzyRegex::new(r"(\w+)@(\w+)").unwrap();
let caps = re.captures("a@b").unwrap();
assert_eq!(caps.len(), 3); // 0 (full) + 2 groups
// Names of capture groups
for name in caps.iter_names() {
if let Some(n) = name {
println!("Group name: {}", n);
}
}
}
Partial Matching
Match incomplete or streaming input.
Enabling Partial Matching
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello){e<=1}")
.partial(true)
.build()
.unwrap();
println!("Created");
}
How It Works
When enabled, matches that reach the end of the input are marked as “partial”:
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello){e<=1}")
.partial(true)
.build()
.unwrap();
// Match at end of text - partial
let m1 = re.find("hello").unwrap();
assert!(m1.partial());
// Match not at end - not partial
let m2 = re.find("say hello").unwrap();
assert!(!m2.partial());
// Fuzzy match reaching end - also partial
let m3 = re.find("hallo").unwrap();
assert!(m3.partial());
}
Use Cases
Streaming Input
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:command){e<=1}")
.partial(true)
.build()
.unwrap();
let mut stream = re.stream();
// Feed incomplete data
for m in stream.feed(b"cmd") {
if m.partial() {
println!("Partial match: might be complete when more data arrives");
}
}
}
Progressive Matching
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello){e<=1}")
.partial(true)
.build()
.unwrap();
// Check if more input needed
let m = re.find("incomplete");
if m.map(|m| m.partial()).unwrap_or(false) {
// Get more input
}
}
Without Partial
By default (partial(false)), partial matches behave the same as full matches:
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
let re = FuzzyRegexBuilder::new("(?:hello){e<=1}")
.partial(false) // default
.build()
.unwrap();
let m = re.find("hello").unwrap();
assert!(!m.partial()); // Always false
}
Word Lists
Define named lists of patterns for reuse.
Basic Usage
fn main() {
use fuzzy_regex::FuzzyRegex;
let mut re = FuzzyRegex::new(r"\L<keywords>").unwrap();
// Set the word list
re.set_word_list(&["hello", "world", "test"]).unwrap();
assert!(re.is_match("hello"));
assert!(re.is_match("world"));
assert!(!re.is_match("foo"));
}
With Fuzzy Matching
fn main() {
use fuzzy_regex::FuzzyRegex;
let mut re = FuzzyRegex::new(r"\L<keywords>{e<=1}").unwrap();
re.set_word_list(&["hello", "world"]).unwrap();
assert!(re.is_match("hallo")); // "hello" with 1 error
assert!(re.is_match("wrold")); // "world" with 1 error
}
Multiple Word Lists
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new(r"\L<names>|\L<places>").unwrap();
// Note: Multiple lists require builder pattern
println!("Created");
}
Use Cases
- Keyword matching: Match against a list of keywords
- Name matching: Match against database of names
- Dictionary lookup: Match words from a dictionary
Custom Handlers
Custom handlers allow you to invoke arbitrary Rust code during regex matching. This is useful for patterns that are difficult or impossible to express with regular regex alone.
Syntax
Use (?call:handler_name) in your pattern to invoke a handler:
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
let re = FuzzyRegexBuilder::new("(?call:myhandler)")
.handler("myhandler", |text, pos| {
// Your logic here
})
.build()
.unwrap();
}
Handler Signature
Handlers have the signature:
#![allow(unused)]
fn main() {
Fn(&str, usize) -> HandlerResult
}
text: The entire input string being matchedpos: Current position in the text (byte index)HandlerResult::MatchOverride(n, text): Handler matched, consumenbytes, optionally override captured textHandlerResult::NoMatch: Handler didn’t match at this position
Important: The byte count must account for UTF-8 encoding. For multi-byte characters (like Cyrillic, emoji), use byte length, not character count.
Examples
Simple Matching
Match strings that start with a specific prefix:
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
let re = FuzzyRegexBuilder::new(r"(?call:prefix)")
.handler("prefix", |text, pos| {
if text[pos..].starts_with("foo") {
HandlerResult::MatchOverride(3, "FOO".to_string())
} else {
HandlerResult::NoMatch
}
})
.build()
.unwrap();
let m = re.find("foobar").unwrap();
assert_eq!(m.as_str(), "foo");
}
Capture Override
Handlers can override the captured text while still matching the original:
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
let re = FuzzyRegexBuilder::new(r"(prefix(?call:handler)suffix)")
.handler("handler", |text, pos| {
if text[pos..].starts_with("XYZ") {
// Match 3 bytes but capture as lowercase
HandlerResult::MatchOverride(3, "xyz".to_string())
} else {
HandlerResult::NoMatch
}
})
.build()
.unwrap();
let caps = re.captures("prefixXYZsuffix").unwrap();
// Captured text is "xyz" (the override), not "XYZ"
assert_eq!(caps.get(1).unwrap().as_str(), "prefixxyzsuffix");
}
Matching Escaped Characters
Match strings with escaped quotes:
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
let re = FuzzyRegexBuilder::new(r#""((?call:unescape)|.)*""#)
.handler("unescape", |text, pos| {
if pos + 1 < text.len() && text.as_bytes()[pos] == b'\\' {
HandlerResult::MatchOverride(2, String::new()) // consume \ and next char
} else {
HandlerResult::NoMatch
}
})
.build()
.unwrap();
let m = re.find(r#""hello \"world\"""#).unwrap();
assert_eq!(m.as_str(), "\"hello \\\"world\\\"\"");
let caps = re.captures(r#""hello \"world\"""#).unwrap();
// The handler_overrides contain the override information
println!("{:?}", caps.handler_overrides());
}
Escape Sequences
Match common escape sequences like \n, \t, \\:
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
let re = FuzzyRegexBuilder::new(r"(?call:escape)+")
.handler("escape", |text, pos| {
if pos + 1 < text.len() && text.as_bytes()[pos] == b'\\' {
let next = text.as_bytes()[pos + 1];
if matches!(next, b'n' | b't' | b'r' | b'\\' | b'"' | b'\'') {
return HandlerResult::MatchOverride(2, String::new());
}
}
HandlerResult::NoMatch
})
.build()
.unwrap();
re.find(r"hello\nworld"); // matches "\n"
re.find(r"a\tb"); // matches "\t"
}
Context-Sensitive Matching
Match patterns that depend on context:
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
// Match "id:123" only when preceded by "user:"
let re = FuzzyRegexBuilder::new(r"user:(?call:id)")
.handler("id", |text, pos| {
if text[pos..].starts_with("id:") {
let rest = &text[pos + 3..];
let mut count = 0;
for (_, c) in rest.char_indices() {
if c.is_ascii_digit() {
count += 1;
} else {
break;
}
}
if count > 0 {
return HandlerResult::MatchOverride(3 + count, String::new());
}
}
HandlerResult::NoMatch
})
.build()
.unwrap();
re.find("user:id:123"); // matches "user:id:123"
re.find("user:name:foo"); // no match (not "id:")
re.find("id:999"); // no match (no "user:" prefix)
}
Unicode Text Transformation
Handle non-ASCII text (note: byte count matters for UTF-8):
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
let re = FuzzyRegexBuilder::new(r#""(?call:translate)""#)
.handler("translate", |text, pos| {
let remaining = &text[pos..];
// "привет" is 6 Cyrillic chars = 12 bytes in UTF-8
if remaining.starts_with("привет") {
HandlerResult::MatchOverride(12, "HELLO".to_string())
} else {
HandlerResult::NoMatch
}
})
.build()
.unwrap();
let caps = re.captures("\"привет\"").unwrap();
// The captured text shows the override
assert_eq!(caps.get(0).unwrap().as_str(), "\"HELLO\"");
// Handler overrides track (start_byte, end_byte, override_text)
assert_eq!(caps.handler_overrides(), &[(1, 13, "HELLO")]);
}
Fuzzy Matching with Handlers
Combine handlers with fuzzy matching for powerful patterns:
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
// Fuzzy lookbehind + handler for complex validation
let re = FuzzyRegexBuilder::new(r"(?<=(?:hello){e<=2}) (?call:translate)")
.handler("translate", |text, pos| {
let remaining = &text[pos..];
// "привет" = 12 bytes
if remaining.starts_with("привет") {
HandlerResult::MatchOverride(12, "HELLO".to_string())
} else {
HandlerResult::NoMatch
}
})
.build()
.unwrap();
// Matches with fuzzy lookbehind
let m = re.find("hello привет").unwrap();
assert_eq!(m.as_str(), " привет");
// Also matches with fuzzy lookbehind (1 extra 'l')
let m = re.find("helllo привет").unwrap();
assert_eq!(m.as_str(), " привет");
}
Multiple Handlers
Use multiple handlers in the same pattern:
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
let re = FuzzyRegexBuilder::new(r"(?call:foo).*(?call:bar)")
.handler("foo", |text, pos| {
if text[pos..].starts_with("START") {
HandlerResult::MatchOverride(5, "start".to_string())
} else {
HandlerResult::NoMatch
}
})
.handler("bar", |text, pos| {
if text[pos..].starts_with("END") {
HandlerResult::MatchOverride(3, "end".to_string())
} else {
HandlerResult::NoMatch
}
})
.build()
.unwrap();
let caps = re.captures("START middle END").unwrap();
assert_eq!(caps.get(0).unwrap().as_str(), "START middle END");
}
Validation Handlers
Use handlers to implement complex validation rules:
#![allow(unused)]
fn main() {
use fuzzy_regex::{FuzzyRegexBuilder, HandlerResult};
// Match email-like patterns but validate the domain
let re = FuzzyRegexBuilder::new(r"\w+@(?call:domain)")
.handler("domain", |text, pos| {
// Find the end of the domain (until whitespace or end)
let remaining = &text[pos..];
let mut end = 0;
for (i, c) in remaining.char_indices() {
if c.is_whitespace() {
break;
}
end = i + c.len_utf8();
}
if end > 0 {
let domain = &remaining[..end];
// Only allow .com, .org, .net domains
if domain.ends_with(".com") || domain.ends_with(".org") || domain.ends_with(".net") {
return HandlerResult::MatchOverride(end, String::new());
}
}
HandlerResult::NoMatch
})
.build()
.unwrap();
re.find("user@example.com"); // matches
re.find("user@test.org"); // matches
re.find("user@fake.io"); // no match (invalid TLD)
}
Performance Notes
- Handlers are called during NFA simulation, which may be slower than optimized paths
- For best performance, keep handler logic simple
- Consider using handlers only when necessary; standard regex is faster for expressible patterns
- When using
MatchOverride, the override text is stored separately and applied during capture construction
Limitations
- Handlers cannot perform lookahead/lookbehind themselves
- Handler matches are exact (not fuzzy) - they either match or don’t
- Captures inside handlers are limited
- Position is in bytes, not characters - account for UTF-8 encoding
Compatibility Layer
Migrate from other fuzzy matching libraries.
fuzzy-aho-corasick
This is a drop-in replacement for fuzzy-aho-corasick:
#![allow(unused)]
fn main() {
use fuzzy_regex::compat::fac::FuzzyAhoCorasickBuilder;
use fuzzy_regex::types::FuzzyLimits;
let searcher = FuzzyAhoCorasickBuilder::new()
.fuzzy(FuzzyLimits::new().edits(1))
.build(["hello", "world"])
.unwrap();
for m in searcher.find_iter("helo wrld") {
println!("Pattern {} matched at {}", m.pattern_index(), m.start());
}
}
Architecture Overview
How fuzzy-regex works internally.
High-Level Architecture
┌─────────────────────────────────────────────┐
│ FuzzyRegex │
│ (Public API Layer) │
│ │
│ ┌────────────────────────────────────────┐ │
│ │ Fast Path Optimizations │ │
│ │ • Pure greedy .* → instant match │ │
│ │ • .*SUFFIX → reverse search │ │
│ │ • Word boundaries → direct search │ │
│ └────────────────────────────────────────┘ │
└──────────────────┬──────────────────────────┘
│
┌────────┴──────────┐
▼ ▼
┌───────────────┐ ┌─────────────────┐
│ DFA │ │ NFA Engine │
│ (Fast Path) │ │ (Full Engine) │
│ + Capturing │ │ │
│ Groups │ │ │
└───────┬───────┘ └────────┬────────┘
│ │
└────────┬───────────┘
▼
┌─────────────────┐
│ Fuzzy Bridge │
└────────┬────────┘
│
┌───────────┴───────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Bitap │ │ D-L │
│ Matcher │ │ NFA │
└──────────┘ └──────────┘
Components
1. Parser
- Tokenizes regex patterns
- Builds Abstract Syntax Tree (AST)
- Handles fuzzy syntax extensions
2. Compiler
- Converts AST to Intermediate Representation (IR)
- Optimizes patterns
- Handles fuzzy matching compilation
3. Fast Path Optimizations
Before going to the main engines, FuzzyRegex checks for special patterns:
- Pure greedy dot-star:
.*,^.*$,.*$return instantly - Greedy prefix with suffix:
.*testuses reverse search (O(n) vs O(n²)) - Word boundaries: Direct literal search with boundary checks
- DFA compatibility: Patterns with capturing groups now use DFA
4. Matching Engines
DFA (Deterministic Finite Automaton)
- Fast path for exact/non-fuzzy patterns
- No backtracking
- Supports capturing groups
- Limited feature support
NFA (Non-deterministic Finite Automaton)
- Full regex feature support
- Backtracking engine
- Supports fuzzy matching
Bitap Algorithm
- Optimized for short patterns (≤64 chars)
- O(n×k) time complexity
- SIMD-accelerated
Damerau-Levenshtein NFA
- Fuzzy matching via automata
- Supports all edit types
- Used for longer patterns
5. Fuzzy Bridge
- Connects NFA to fuzzy matchers
- Extracts literals for pre-filtering
- Coordinates multiple matchers
Bitap Algorithm
Fast fuzzy matching algorithm for short patterns.
Overview
The Bitap algorithm (also known as the Shift-Or algorithm) is used for fuzzy matching of short patterns (≤64 characters). It achieves O(n×k) time complexity where n is text length and k is maximum edits.
How It Works
Bitap uses bitmasks to track which positions can match:
- Initialize: Create a bitmask for each character in the pattern
- Scan: For each text character, update the bitmask
- Match: When bitmask indicates a match, extract the match
Bitap State
#![allow(unused)]
fn main() {
// Simplified bitap state representation
struct BitapState {
// Bits representing current match positions
// For pattern "hello":
// Position: 5 4 3 2 1 0
// Bits: 0 0 0 0 0 1 = "h"
// 0 0 0 0 1 0 = "e"
// ... continue ...
}
}
Key Properties
Pattern Length Limit
- Maximum 64 characters (uses 64-bit integers)
- Longer patterns use Damerau-Levenshtein NFA
Edit Distance
- Supports insertions, deletions, substitutions
- Transpositions handled specially
SIMD Acceleration
Modern implementations use SIMD for parallelism:
#![allow(unused)]
fn main() {
// SIMD processes multiple characters at once
// 256-bit AVX2: process 32 bytes per iteration
// 512-bit AVX512: process 64 bytes per iteration
}
Performance
| Pattern Length | Text Size | Throughput |
|---|---|---|
| 14 chars | 200KB | ~1.5 Gbps |
| 14 chars | 1MB | ~1.4 Gbps |
| 14 chars, k=1 | 200KB | ~2.0 Gbps |
| 14 chars, k=3 | 200KB | ~700 Mbps |
When Bitap is Used
- Pattern length ≤ 64 characters
- Simple fuzzy matching (no complex regex features)
- Exact matching benefits from optimizations
Implementation Details
State Machine
#![allow(unused)]
fn main() {
// Each iteration:
// 1. Shift state right by 1
// 2. OR with pattern mask for current char
// 3. Check for match (zero in certain bits)
}
Handling Errors
Bitap extends to handle errors by maintaining multiple state masks (one per allowed error):
#![allow(unused)]
fn main() {
// For k errors, maintain k+1 states
// state[0] = exact match
// state[1] = 1 error
// state[k] = k errors
}
Damerau-Levenshtein NFA
Fuzzy matching via finite automata for longer patterns.
Overview
For patterns longer than 64 characters or when complex regex features are needed, fuzzy-regex uses a Damerau-Levenshtein NFA. This is an extension of the classic Damerau-Levenshtein automaton.
Damerau-Levenshtein Automaton
The Damerau-Levenshtein automaton for a pattern P and max distance k accepts all strings within k edits of P.
State Representation
#![allow(unused)]
fn main() {
// Each state represents a position in the pattern
// plus the number of errors used so far
struct NfaState {
pattern_position: usize, // Position in pattern
errors_used: usize, // Errors consumed
}
}
Transitions
For each input character, the NFA can transition to:
- Match: Next character in pattern (0 errors)
- Insertion: Stay at current position (1 error)
- Deletion: Skip pattern character (1 error)
- Substitution: Move to next with 1 error
- Transposition: Swap next two characters (special case)
Building the NFA
#![allow(unused)]
fn main() {
// Pattern: "abc" with k=1
// States: (0,0) → (1,0) → (2,0) → (3,0) [exact path]
// (0,0) → (1,1) → ... [with errors]
}
Construction Algorithm
- Create initial state at position 0
- For each pattern character, create states for:
- Exact match (no error)
- Insertion (adds error)
- Deletion (adds error)
- Substitution (adds error)
- Connect states with transitions
- Mark accepting states
Performance Characteristics
| Aspect | Damerau-Levenshtein NFA |
|---|---|
| Time Complexity | O(n × k × m) |
| Space Complexity | O(k × m) |
| Pattern Length | Unlimited |
| Features | Full regex |
Where:
- n = text length
- k = max edits
- m = pattern length
Comparison with Bitap
| Feature | Bitap | Damerau-Levenshtein NFA |
|---|---|---|
| Max Pattern Length | 64 | Unlimited |
| Time Complexity | O(n×k) | O(n×k×m) |
| SIMD Support | Yes | Limited |
| Regex Features | Limited | Full |
| Memory Usage | Low | Higher |
When Used
- Pattern length > 64 characters
- Complex regex features (lookahead, backreferences)
- Alternation with fuzzy segments
Fuzzy Bridge
Connecting regex engine with fuzzy matching components.
Purpose
The Fuzzy Bridge coordinates between the NFA-based regex engine and specialized fuzzy matchers (Bitap, Damerau-Levenshtein NFA).
Responsibilities
- Literal Extraction: Extract exact-match literals from patterns for pre-filtering
- Search Coordination: Order and combine multiple matchers
- Result Merging: Combine and rank results from different matchers
Architecture
Regex Pattern: hello (?:world){e<=1}
│
▼
┌────────────────┐
│ Fuzzy Bridge │
└───────┬────────┘
│
┌───────────┴───────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ DFA │ │ Fuzzy │
│ (exact) │ │ Matcher │
│ "hello" │ │ (Bitap/NFA) │
└──────────────┘ │ "world"{e<=1}│
└──────────────┘
Literal Extraction
#![allow(unused)]
fn main() {
// Pattern: "prefix (?:target){e<=1} suffix"
// Extracted literals: ["prefix", "suffix"]
}
Extraction Strategy
- Identify fuzzy segments in pattern
- Extract surrounding exact literals
- Use literals for:
- Pre-filtering text positions
- Quick rejection of non-matching areas
Search Flow
Simple Fuzzy Pattern
For (?:hello){e<=1}:
- Use Bitap directly (pattern ≤64 chars)
- Or build Damerau-Levenshtein NFA (longer patterns)
Mixed Pattern
For exact (?:fuzzy){e<=1}:
- Use DFA to find “exact” quickly
- For each “exact” match, check surrounding for fuzzy match
Complex Pattern
For (?:a){e<=1}(?:b){e<=1}:
- Find all fuzzy matches for “a”
- Find all fuzzy matches for “b”
- Check if they can be combined
Optimization Techniques
Prefiltering
Use extracted literals to quickly skip non-matching regions:
#![allow(unused)]
fn main() {
// Text: "blah blah prefix target suffix blah"
// Literal "prefix" found at position X
// Only check fuzzy "target" around position X
}
Early Termination
Stop searching when match found (for find/first):
#![allow(unused)]
fn main() {
// First match found, don't continue
}
Caching
Cache compiled matchers for reuse:
#![allow(unused)]
fn main() {
// Bitap matcher cached for pattern
}
DFA Optimization
Fast path for exact and non-fuzzy patterns.
Deterministic Finite Automaton
The DFA provides the fastest matching path for patterns that don’t require fuzzy matching.
When DFA is Used
- Exact match patterns (no fuzziness)
- Simple character classes
- Quantifiers without fuzzy segments
DFA Properties
| Property | Value |
|---|---|
| Time Complexity | O(n) |
| Space Complexity | O(states) |
| Backtracking | None |
| Memory | Deterministic |
Implementation
State Machine
#![allow(unused)]
fn main() {
// DFA state contains:
// - Transitions for each possible input
// - Whether state is accepting
// - Associated match information
}
Simulation
#![allow(unused)]
fn main() {
// For each input character:
// 1. Look up next state from current state
// 2. If no transition, stop (no match)
// 3. If accepting state, record match position
// 4. Continue until end of input
}
Fast Path Integration
Pattern: hello{e<=1}
│
▼
┌─────────────────┐
│ Has Fuzziness? │
└────────┬────────┘
│
┌──────┴──────┐
▼ ▼
Yes No
│ │
▼ ▼
NFA/ DFA
Bitap
Limitations
The DFA cannot handle:
- Fuzzy matching
- Lookahead/lookbehind
- Backreferences
- Capture groups (for position info)
For these, the NFA engine is used.
Optimization Techniques
State Minimization
Reduce number of states:
#![allow(unused)]
fn main() {
// Before: 100 states
// After minimization: 50 states
}
Lazy Evaluation
Build DFA states on-demand:
#![allow(unused)]
fn main() {
// Only build states that are actually traversed
}
Caching
Cache DFA for reuse:
#![allow(unused)]
fn main() {
// Reuse compiled DFA across searches
}
All-Matches Algorithms
The DFA supports multiple algorithms for finding all matches.
Three-Pass (Default)
The standard approach: find a match, advance, repeat.
#![allow(unused)]
fn main() {
let matches = dfa.find_all(text);
// Time: O(n²) worst case for pathological patterns
}
Two-Pass Algorithm
Uses a reverse prefilter to find candidates efficiently:
- Pass 1: Scan right-to-left, finding candidate positions
- Pass 2: Verify matches at each candidate
#![allow(unused)]
fn main() {
let matches = dfa.find_all_two_pass(text);
// Prefilter reduces candidates in Pass 1
// Still O(n²) worst case in Pass 2
}
Hardened Mode (O(n))
Tracks all active DFA states simultaneously:
- Maintain set of (state, start_position) pairs
- Process each character once, updating all states
- When no continuation possible, emit match
- Continue from next position
#![allow(unused)]
fn main() {
let matches = dfa.find_all_hardened(text);
// Guaranteed O(n) - each character processed once
}
Prefilter Types
| Type | Description |
|---|---|
| SingleByte | memrchr for one byte |
| TwoBytes | memrchr for two bytes |
| ThreeBytes | memrchr for three bytes |
| Teddy | SIMD-accelerated multi-byte |
When to Use
| Scenario | Recommended |
|---|---|
| Few matches, well-behaved pattern | Default |
| Prefilter effective | Two-pass |
| Pathological patterns | Hardened |
Streaming Implementation
Processing data incrementally without loading everything into memory.
Streaming Architecture
Input Data
│
▼
┌──────────────────┐
│ StreamingMatcher │
│ ┌──────────────┐ │
│ │ Input Buffer │ │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Matcher │ │
│ └──────────────┘ │
└────────┬─────────┘
│
▼
Match Results
Key Components
Input Buffer
Stores unprocessed data between feed calls:
#![allow(unused)]
fn main() {
struct StreamingMatcher {
buffer: Vec<u8>, // Unprocessed bytes
position: usize, // Current position in stream
matcher: Matcher, // Underlying matcher
}
}
Position Tracking
Tracks absolute position in the stream:
#![allow(unused)]
fn main() {
let mut stream = re.stream();
stream.feed(b"hello"); // position = 5
stream.feed(b" world"); // position = 11
assert_eq!(stream.position(), 11);
}
Cross-Boundary Matching
Matches can span chunk boundaries:
#![allow(unused)]
fn main() {
let mut stream = re.stream();
// Chunk 1: no complete match
stream.feed(b"xxx ");
// Chunk 2: completes the match
let matches = stream.feed(b"hello").collect();
// Match found spanning from chunk 1
}
Buffer Management
#![allow(unused)]
fn main() {
// Keep some overlap between chunks
// for cross-boundary matches
let mut stream = re.stream();
// Strategy: Keep last (pattern_len + max_errors) bytes
// This ensures cross-boundary matches are found
}
Reader Integration
Process any Read source:
#![allow(unused)]
fn main() {
use std::io::BufReader;
use std::fs::File;
let file = File::open("large_file.txt")?;
let reader = BufReader::new(file);
for m in re.stream().with_chunk_size(8192).search_reader(reader) {
println!("Match at {}", m.start());
}
}
Byte-Level Streaming
Feed
#![allow(unused)]
fn main() {
stream.feed(b"data chunk");
}
Finish
#![allow(unused)]
fn main() {
// After all data fed, check for remaining matches
stream.finish();
}
Reset
#![allow(unused)]
fn main() {
// Start over with fresh state
stream.reset();
}
Performance Considerations Size
Chunk
| Chunk Size | Use Case |
|---|---|
| Small (1KB) | Low latency, interactive |
| Medium (8KB) | Balanced |
| Large (64KB+) | High throughput |
Memory Usage
Streaming uses O(max_pattern_length + buffer) memory regardless of input size.
Implementation Details
State Management
#![allow(unused)]
fn main() {
// Between feed calls:
// 1. Save current state
// 2. Process new chunk
// 3. Keep partial state for next call
struct State {
nfa_state: Option<NfaState>,
bitap_state: u64,
// etc.
}
}
Partial Results
Report matches as they’re found:
#![allow(unused)]
fn main() {
// Yield matches immediately as they're found
// Don't wait for complete input
}
Algorithm Selection
How fuzzy-regex chooses the matching algorithm.
Decision Tree
Pattern Analysis
│
▼
┌──────────────────┐
│ Is exact match? │
└────────┬─────────┘
│
┌────┴────┐
▼ ▼
Yes No
│ │
▼ ▼
DFA ┌────────────────┐
│ Pure greedy .*? │
└────────┬────────┘
│
┌────┴────┐
▼ ▼
Yes No
│ │
▼ ▼
Instant ┌────────────────┐
Match │ Pattern ≤64? │
└────────┬────────┘
│
┌────┴────┐
▼ ▼
Yes No
│ │
▼ ▼
Bitap Damerau-Levenshtein
NFA
Algorithm Comparison
| Algorithm | Use Case | Complexity | Features |
|---|---|---|---|
| DFA | Exact patterns, capturing groups | O(n) | Limited |
| Instant Match | Pure greedy .* | O(1) | Limited |
| Bitap | Short fuzzy (≤64 chars) | O(n×k) | Most |
| Damerau-Levenshtein NFA | Long fuzzy patterns | O(n×k×m) | Full |
Automatic Selection
The library automatically selects based on:
- Pattern length: Bitap for ≤64 chars
- Fuzzy complexity: Cost-based vs simple
- Regex features: DFA can’t do lookahead
- Streaming: Different code path
- Greedy patterns: Instant match for
.*,^.*$,.*$ - Greedy suffix patterns:
.*SUFFIXuses reverse search
Special Optimizations
Pure Greedy Dot-Star
Patterns like .*, ^.*$, .*$ always match. The engine detects these and returns instantly without scanning:
#![allow(unused)]
fn main() {
let re = FuzzyRegex::new(".*").unwrap();
// Returns match immediately - no text scanning
}
Greedy Prefix with Suffix
Patterns like .*test or .*test~2 use reverse search to avoid O(n²) behavior:
#![allow(unused)]
fn main() {
let re = FuzzyRegex::new(".*test").unwrap();
// Finds "test" from the right, then .* matches everything before it
// O(n) instead of O(n²)
}
DFA with Capturing Groups
DFA now works with capturing groups like (?m)^(.*)test$:
#![allow(unused)]
fn main() {
let re = FuzzyRegex::new("(?m)^(.*)test$").unwrap();
// Uses DFA - much faster than NFA
}
All-Matches Algorithms
When finding all matches in text (find_iter, find_all), fuzzy-regex supports multiple algorithms optimized for different patterns.
The Problem
Some patterns produce many overlapping matches, causing naive “find, advance, repeat” to be O(n²):
#![allow(unused)]
fn main() {
// Pattern: .*a|b on text: "bbbbbbbb"
// Each 'b' matches individually → O(n) matches × O(n) scan = O(n²)
}
Available Algorithms
| Algorithm | Complexity | Best For |
|---|---|---|
find_all | O(n²) worst case | Simple patterns, few matches |
find_all_two_pass | O(n²) worst case | When prefilter is effective |
find_all_hardened | O(n) guaranteed | Pathological patterns |
Two-Pass Algorithm
Uses a reverse prefilter to find candidate positions, then verifies matches:
- Pass 1: Scan right-to-left using prefilter (memchr, Teddy, etc.)
- Pass 2: For each candidate, find the match
#![allow(unused)]
fn main() {
let mut dfa = Dfa::new(".*a|b").unwrap();
let matches = dfa.find_all_two_pass("bbbbbb");
}
Hardened Mode (O(n))
Tracks all active DFA states simultaneously as we scan left-to-right:
#![allow(unused)]
fn main() {
let mut dfa = Dfa::new(".*a|b").unwrap();
let matches = dfa.find_all_hardened("bbbbbb");
}
How it works:
- Start with one state at current position
- Process all characters, tracking active states
- When no state can continue, emit the leftmost match
- Move to next position, repeat
Benchmark: Pathological Pattern
Pattern .*a|b on text of all ’b’s (worst case):
| Text Size | find_all | two_pass | hardened | Speedup |
|---|---|---|---|---|
| 1,000 bytes | 1.08s | 1.10s | 69ms | 15x |
| 5,000 bytes | 5.47s | 5.43s | 69ms | 79x |
| 10,000 bytes | 10.76s | 10.85s | 69ms | 156x |
When to Use Each
#![allow(unused)]
fn main() {
// Default: smart selection based on pattern
let matches = dfa.find_all(text);
// Explicit two-pass: good when prefilter is effective
let matches = dfa.find_all_two_pass(text);
// Explicit hardened: critical for pathological patterns
let matches = dfa.find_all_hardened(text);
}
Prefilter Types
The two-pass algorithm uses prefilters optimized for different patterns:
| Prefilter | Use Case | Method |
|---|---|---|
| SingleByte | Single literal | memrchr |
| TwoBytes | Two literals | memrchr for both |
| ThreeBytes | Three literals | memrchr for all |
| Teddy | Many alternatives | SIMD range matching |
Manual Override
Not currently exposed, but internal selection can be inspected:
#![allow(unused)]
fn main() {
// Check which engine was used
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
if re.supports_streaming() {
// Using Bitap
}
}
SIMD Optimizations
Using vector instructions for faster matching.
SIMD Overview
SIMD (Single Instruction, Multiple Data) allows processing multiple bytes simultaneously.
| SIMD Width | Bytes/Iteration | Architecture |
|---|---|---|
| 128-bit | 16 bytes | SSE4.1, NEON |
| 256-bit | 32 bytes | AVX2 |
| 512-bit | 64 bytes | AVX512 |
SIMD Components
Character Range Search (RevSearchRanges)
Vectorized reverse search for character ranges like [a-z]:
- x86_64: Uses SSE/AVX2 for 16/32-byte iteration
- ARM (NEON): Uses 16-byte vectors with
vmaxvq_u8 - Uses
neon_movemaskto extract bit positions
#![allow(unused)]
fn main() {
// Searches right-to-left for characters in range
let positions = rev_search_ranges.find_last(haystack);
}
Teddy Search
SIMD-accelerated multi-pattern searching:
- Processes 16 bytes per iteration (32 on AVX2)
- Checks 4 character ranges simultaneously
- Uses bitwise operations to combine matches
- Efficient for patterns with multiple byte alternatives
#![allow(unused)]
fn main() {
// Searches for any of multiple bytes
let position = teddy.find_first(haystack);
}
Bitap SIMD
The Bitap algorithm is highly parallelizable with SIMD:
#![allow(unused)]
fn main() {
// Without SIMD: 1 byte per iteration
// With AVX2: 32 bytes per iteration
// ~32x speedup potential
}
Implementation
#![allow(unused)]
fn main() {
// AVX2 implementation
unsafe {
// Load 32 bytes
let data = _mm256_loadu_si256(ptr.as_ptr());
// Apply bitap operations to all 32 simultaneously
// ...
}
}
Performance Impact
| Configuration | Throughput |
|---|---|
| Scalar (no SIMD) | ~50 MB/s |
| SIMD (AVX2) | ~180 MB/s |
| Speedup | ~3.6x |
Enabling SIMD
SIMD is enabled by default. To disable:
[dependencies]
fuzzy-regex = { version = "0.1", default-features = false }
Platform Support
x86_64
- SSE4.1: Minimum for SIMD path
- AVX2: Default, 32-byte vectors
- AVX512: Optional, 64-byte vectors
ARM (Apple Silicon)
- NEON: 16-byte vectors
- Auto-detected: Runtime detection for M1/M2/M3
Detection
Runtime detection automatically uses best available:
#![allow(unused)]
fn main() {
// At compile time: #[target_feature(enable = "avx2")]
// At runtime: CPU feature detection
}
Requirements
SIMD requires:
- CPU support (SSE4.1+ or NEON)
- Compiler with SIMD intrinsics
- Pattern length suitable for Bitap
Benchmark: Character Class Search
Pattern [a-z]+ on 100KB text:
| Implementation | Time |
|---|---|
| Scalar | 45.2 μs |
| NEON | 6.8 μs |
| AVX2 | 4.1 μs |
| Speedup | 6-10x |
Performance Tips
Optimize fuzzy-regex for your use case.
Pattern Design
1. Use Specific Edit Limits
fn main() {
// Good: Specific limit
let _ = fuzzy_regex::FuzzyRegex::new("(?:hello){e<=1}").unwrap();
// Less efficient: Higher limit
let _ = fuzzy_regex::FuzzyRegex::new("(?:hello){e<=5}").unwrap();
println!("Done");
}
Lower edit limits = faster matching.
2. Prefer Shorter Patterns
fn main() {
// Bitap (fast): ≤64 chars
let _ = fuzzy_regex::FuzzyRegex::new("(?:short){e<=1}").unwrap();
// NFA (slower): >64 chars
let _ = fuzzy_regex::FuzzyRegex::new("(?:very_long_pattern_that_exceeds_sixty_four_characters){e<=1}").unwrap();
println!("Done");
}
3. Extract Exact Parts
fn main() {
// Good: Exact prefix and suffix help prefilter
let _ = fuzzy_regex::FuzzyRegex::new("exact_prefix (?:fuzzy){e<=1} exact_suffix").unwrap();
// Slower: Entirely fuzzy
let _ = fuzzy_regex::FuzzyRegex::new("(?:entirely_fuzzy){e<=1}").unwrap();
println!("Done");
}
4. Use Greedy Suffix Patterns
fn main() {
// Good: .*SUFFIX is optimized with reverse search
let _ = fuzzy_regex::FuzzyRegex::new(".*test").unwrap();
let _ = fuzzy_regex::FuzzyRegex::new(".*test~1").unwrap();
// Also works with anchors
let _ = fuzzy_regex::FuzzyRegex::new("^.*test$").unwrap();
println!("Done");
}
Patterns like .*test automatically use reverse search to find the suffix first, then match everything before it. This is O(n) instead of O(n²).
Builder Options
1. Set Similarity Threshold
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
// Skip low-quality matches early
let _ = FuzzyRegexBuilder::new("(?:hello){e<=2}")
.similarity(0.8)
.build();
println!("Done");
}
2. Use Case Insensitive at Builder
fn main() {
use fuzzy_regex::FuzzyRegexBuilder;
// More efficient than inline (?i)
let _ = FuzzyRegexBuilder::new("(?:hello)")
.case_insensitive(true)
.build();
println!("Done");
}
API Usage
1. Use Streaming for Large Data
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
// Good: Process in chunks
let mut stream = re.stream();
let data = b"hello world";
for chunk in data.chunks(8) {
// Process chunk
}
// Bad: Load all into memory
let large_text = "hello world";
let _matches: Vec<_> = re.find_iter(&large_text).collect();
}
2. Use find() for First Match
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
let text = "hello world";
// Good: Stop after first match
if let Some(m) = re.find(text) {
println!("Found: {}", m.as_str());
}
// Unnecessary: Find all when only first needed
let _all: Vec<_> = re.find_iter(text).collect();
}
3. Check supports_streaming()
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();
if re.supports_streaming() {
// Use streaming API for best performance
let mut stream = re.stream();
println!("Streaming supported");
}
}
Build Configuration
1. Release Mode
cargo build --release
2. LTO
[profile.release]
lto = true
codegen-units = 1
3. SIMD
Enabled by default. Ensure target CPU supports it.
Common Pitfalls
| Issue | Solution |
|---|---|
| Slow with high edits | Lower edit limit |
| High memory usage | Use streaming |
| Slow on long text | Use exact prefix |
| Slow compilation | Enable LTO |
Pathological Patterns
Some regex patterns can cause O(n²) behavior in naive implementations:
#![allow(unused)]
fn main() {
// Pattern: .*a|b on text of all 'b's
// Each 'b' matches individually
// Naive: O(n) matches × O(n) scan = O(n²)
}
When It Happens
- Alternation with wildcards:
.*a|b,(a|b)+ - Overlapping matches: Many ways to match the same text
- Backtracking patterns: Complex alternation
Solution: Hardened Mode
Use find_all_hardened() for O(n) guaranteed performance:
#![allow(unused)]
fn main() {
use fuzzy_regex::FuzzyRegex;
let re = FuzzyRegex::new(".*a|b").unwrap();
let text = "bbbbbbbbbbbbbbbb";
// Hardened mode: O(n) guaranteed
let matches = re.find_all_hardened(text);
}
Performance Comparison
| Text Size | Standard | Hardened |
|---|---|---|
| 1,000 bytes | 1.08s | 69ms |
| 10,000 bytes | 10.76s | 69ms |
The hardened mode maintains constant time regardless of text size.
Trade-offs
Hardened mode may be slightly slower for well-behaved patterns (where O(n²) doesn’t occur), but it’s the safest choice when:
- Pattern behavior is unknown
- Text comes from untrusted sources
- Worst-case performance is critical
Pattern Syntax Reference
Complete reference for all pattern syntax.
Special Characters
| Character | Description |
|---|---|
. | Any character except newline |
^ | Start of string |
$ | End of string |
| ` | ` |
\ | Escape |
(, ) | Groups |
[, ] | Character class |
{, } | Quantifier or fuzziness |
Character Classes
| Syntax | Description |
|---|---|
[abc] | a, b, or c |
[^abc] | Not a, b, or c |
[a-z] | a to z |
[A-Za-z] | ASCII letters |
[0-9] | Digits |
\d | Digit |
\D | Non-digit |
\w | Word character |
\W | Non-word |
\s | Whitespace |
\S | Non-whitespace |
Quantifiers
| Syntax | Description |
|---|---|
* | 0 or more |
+ | 1 or more |
? | 0 or 1 |
{n} | Exactly n |
{n,} | n or more |
{n,m} | n to m |
*? | Lazy * |
+? | Lazy + |
?? | Lazy ? |
Fuzziness Markers
| Syntax | Description |
|---|---|
{e<=N} | Max N edits |
{i<=N} | Max N insertions |
{d<=N} | Max N deletions |
{s<=N} | Max N substitutions |
{t<=N} | Max N transpositions |
{c<=N} | Max total cost N |
{Ni+Md...<=N} | Weighted costs |
{e<=N:[class]} | Restricted edits |
~N | Shorthand for {e<=N} |
Groups
| Syntax | Description |
|---|---|
(...) | Capture group |
(?:...) | Non-capture |
(?<name>...) | Named capture |
(?=...) | Lookahead |
(?!...) | Negative lookahead |
(?<=...) | Lookbehind |
(?<!...) | Negative lookbehind |
(?>...) | Atomic group |
Flags
| Syntax | Description |
|---|---|
(?i) | Case insensitive |
(?m) | Multi-line |
(?s) | Dot-all |
(?x) | Verbose |
(?U) | Ungreedy |
(?b) | Best match |
(?e) | Enhance match |
(?p) | POSIX mode |
Error Messages
Understanding and handling errors.
Pattern Errors
Invalid Syntax
Error: regex parse error
Cause: Malformed regex pattern.
Fix: Check for unclosed groups, invalid quantifiers, etc.
Unsupported Feature
Error: Feature not supported: backreferences
Cause: Pattern uses feature not implemented.
Fix: Simplify pattern or restructure.
Empty Pattern
Error: pattern cannot be empty
Cause: Empty regex not allowed.
Fix: Use non-empty pattern.
Compilation Errors
Fuzzy Limit Out of Range
Error: edit limit too high
Cause: Edit limit exceeds maximum.
Fix: Reduce limit.
Invalid Character Class
Error: invalid character class
Cause: Malformed character class.
Fix: Check class syntax.
Runtime Errors
Timeout
Error: timeout after ...
Cause: Matching took too long.
Fix: Simplify pattern, reduce edit limit.
No Match
Not an error - use find() returning Option.
#![allow(unused)]
fn main() {
if let Some(m) = re.find(text) {
// Match found
} else {
// No match
}
}
Migration Guide
Migrating from other libraries to fuzzy-regex.
From regex (standard)
fuzzy-regex extends the standard regex crate with fuzzy matching:
fn main() {
// Standard regex
let re = regex::new("hello").unwrap();
// fuzzy-regex
let re = fuzzy_regex::FuzzyRegex::new("hello").unwrap();
// Add fuzziness
let re = fuzzy_regex::FuzzyRegex::new("(?:hello){e<=1}").unwrap();
println!("Created");
}
From fuzzy-aho-corasick
See Compatibility Layer for a drop-in replacement.
From mrab-regex
The fuzzy syntax is compatible:
fn main() {
// mrab-regex
let _re = regex::new(r"(?i)(?:hello){e<=1}");
// fuzzy-regex (same syntax)
let re = fuzzy_regex::FuzzyRegex::new(r"(?i)(?:hello){e<=1}").unwrap();
println!("Created");
}