Bitap Algorithm
Fast fuzzy matching algorithm for short patterns.
Overview
The Bitap algorithm (also known as the Shift-Or algorithm) is used for fuzzy matching of short patterns (≤64 characters). It achieves O(n×k) time complexity where n is text length and k is maximum edits.
How It Works
Bitap uses bitmasks to track which positions can match:
- Initialize: Create a bitmask for each character in the pattern
- Scan: For each text character, update the bitmask
- Match: When bitmask indicates a match, extract the match
Bitap State
#![allow(unused)]
fn main() {
// Simplified bitap state representation
struct BitapState {
// Bits representing current match positions
// For pattern "hello":
// Position: 5 4 3 2 1 0
// Bits: 0 0 0 0 0 1 = "h"
// 0 0 0 0 1 0 = "e"
// ... continue ...
}
}
Key Properties
Pattern Length Limit
- Maximum 64 characters (uses 64-bit integers)
- Longer patterns use Damerau-Levenshtein NFA
Edit Distance
- Supports insertions, deletions, substitutions
- Transpositions handled specially
SIMD Acceleration
Modern implementations use SIMD for parallelism:
#![allow(unused)]
fn main() {
// SIMD processes multiple characters at once
// 256-bit AVX2: process 32 bytes per iteration
// 512-bit AVX512: process 64 bytes per iteration
}
Performance
| Pattern Length | Text Size | Throughput |
|---|---|---|
| 14 chars | 200KB | ~1.5 Gbps |
| 14 chars | 1MB | ~1.4 Gbps |
| 14 chars, k=1 | 200KB | ~2.0 Gbps |
| 14 chars, k=3 | 200KB | ~700 Mbps |
When Bitap is Used
- Pattern length ≤ 64 characters
- Simple fuzzy matching (no complex regex features)
- Exact matching benefits from optimizations
Implementation Details
State Machine
#![allow(unused)]
fn main() {
// Each iteration:
// 1. Shift state right by 1
// 2. OR with pattern mask for current char
// 3. Check for match (zero in certain bits)
}
Handling Errors
Bitap extends to handle errors by maintaining multiple state masks (one per allowed error):
#![allow(unused)]
fn main() {
// For k errors, maintain k+1 states
// state[0] = exact match
// state[1] = 1 error
// state[k] = k errors
}