reba v0.3.1
Reba
Reba is a Regular Expression Behavioral Analyser for JavaScript and TypeScript. Currently it is under development but some components are ready to use.
Requirements:
A JavaScript environment with support for regex lookbehind assertions and String.matchAll, e.g.:
- Node.js v12.0.0+
- Chrome v73+
Usage
Load as ES Module and initialize:
import reba from 'reba';
const Reba = reba();
Features:
Meta scores (Experimental):
With Reba you can get a measure of match confidence based on the number and the types of meta characters in a pattern: \d, \w, . and so on. This is useful for triggering further action on matching patterns with moderate-to-high variability.
let pattern = /1234/;
Reba.score(pattern); // Literal, returns 1
pattern = /12\d{2}/;
Reba.score(pattern); // Two meta characters, '\d', returns 0.55
pattern = /12.{2}/;
Reba.score(pattern); // Two meta chars, '.', returns 0.5008
Combined with a basic threshold and some logic:
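const threshold = 0.8;
const score = Reba.score(pattern);
const matches = input.match(pattern); // `input` is the string under test
if (matches && score < threshold) {
  // Do review, more data processing, etc.
}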
For more information, see the Reba Documentation
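The three example scores above are consistent with a simple averaging rule: each literal counts as 1, each meta character counts as its odds (1/10 for \d, 1/625 for .), and the score is the mean over all character positions. The sketch below reproduces those numbers under that assumption. The function name `sketchScore`, the odds table and the small tokenizer are illustrative guesses inferred from the examples, not Reba's actual implementation.

```javascript
// Assumed odds per meta character, inferred from the example scores above.
const ASSUMED_ODDS = { '\\d': 1 / 10, '.': 1 / 625 };

// Scores a pattern made only of literals, \d and ., each optionally followed by {n}.
function sketchScore(regex) {
  const tokenRe = /(\\d|\.|[^\\])(?:\{(\d+)\})?/g;
  let total = 0;
  let positions = 0;
  for (const [, token, count] of regex.source.matchAll(tokenRe)) {
    const repeat = count ? Number(count) : 1;
    const odds = token in ASSUMED_ODDS ? ASSUMED_ODDS[token] : 1; // literals count as 1
    total += odds * repeat;
    positions += repeat;
  }
  return positions === 0 ? 1 : total / positions;
}

console.log(sketchScore(/1234/));    // 1
console.log(sketchScore(/12\d{2}/)); // approximately 0.55
console.log(sketchScore(/12.{2}/));  // approximately 0.5008
```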
Customize Metascore Weights and Odds
Apply weighting when you'd like some meta characters to have less effect on a score. The following example configures Reba to almost negate the effects of whitespace (\s) characters in the metascore:
const Reba = reba({ ws: 'w-0.001' }); // The (w)eight for whitespace (ws) chars is very small,
// meaning they will be treated almost equal to literal chars
pattern = /ab\sd/; // '\s' matches space, tab, newline and so on
Reba.score(pattern) // Returns a score very close to 1
In addition, you can set the odds for wildcard (.) characters to change how they affect a metascore. For example, the default odds for the wildcard in Reba are based on several Latin Unicode subsets but can be changed to reflect different character sets (think ASCII, UTF-8 or more specific language configs).
const Reba = reba({ wildcard: 'f-256' }); // (f)actor is 256 (extended ASCII charset), odds become 1/256
pattern = /Hello./; // e.g. Hello!, Hello?, Helloo, ..
Reba.score(pattern) // Returns a slightly higher score than with the default wildcard odds
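To make the option strings more concrete, here is a minimal sketch of how values such as 'w-0.001' and 'f-256' could be parsed and applied. The factor rule (odds of 1/f) follows the comment in the example above. The weight rule, which pulls the odds towards 1 as the weight approaches 0, is an assumption chosen only to match the "almost equal to literal chars" description. `parseOption` and `effectiveOdds` are hypothetical names, not part of Reba.

```javascript
// Parse an option string such as 'w-0.001' or 'f-256' into a kind and a number.
function parseOption(option) {
  const match = /^([wf])-(\d+(?:\.\d+)?)$/.exec(option);
  if (!match) throw new Error(`Unrecognised option: ${option}`);
  return { kind: match[1], value: Number(match[2]) };
}

// Assumption: a factor f gives odds of 1/f; a weight w scales how far the
// odds are allowed to fall below 1 (the value of a literal character).
function effectiveOdds(defaultOdds, option) {
  const { kind, value } = parseOption(option);
  if (kind === 'f') return 1 / value;
  return 1 - value * (1 - defaultOdds);
}

console.log(effectiveOdds(1 / 625, 'f-256')); // 0.00390625, i.e. 1/256
console.log(effectiveOdds(0.2, 'w-0.001'));   // very close to 1
```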
Match Machines (In Development)
These wrap around a native JavaScript RegExp object, apply metascores automatically and will enable advanced matching strategies such as pattern recomposition and decomposition.
For the moment though, matching with a machine simply returns matches and a metascore.
const machine = Reba.machine.from(/pattern/);
const result = machine.match('string');
console.log(result.matches, result.score);
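As a rough illustration of the wrapping idea, the sketch below pairs a native RegExp with a precomputed score and returns both from a single call. `SketchMachine` and its `scoreFn` parameter are hypothetical stand-ins and do not reflect how `Reba.machine` is built internally.

```javascript
class SketchMachine {
  constructor(regex, scoreFn) {
    // String.matchAll requires the global flag, so add it if it is missing.
    const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g';
    this.regex = new RegExp(regex.source, flags);
    // The score depends only on the pattern, so compute it once.
    this.score = scoreFn(regex);
  }

  match(input) {
    const matches = [...input.matchAll(this.regex)].map((m) => m[0]);
    return { matches, score: this.score };
  }
}

// Usage with a placeholder scoring function that always returns 0.5.
const sketch = new SketchMachine(/\d+/, () => 0.5);
console.log(sketch.match('a1b22')); // { matches: [ '1', '22' ], score: 0.5 }
```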
The following additional features are planned:
- Dynamic odds for wildcard quantifiers *, + and {n,} based on string samples,
- Can be constructed from regex notation objects (see below),
- Tools for pattern scalability and maintainability
Regex Notation Objects (Planned)
Create templates and separate concerns between pattern structure and values (high level/low level).
Define patterns on the spot:
const pattern = new RegexNotationObject()
.settings({
template: '(greetings) (?=environments)',
flags: 'i'
})
.values({
greetings: [ 'Hello', 'Good Morning', 'Howdy', 'Hi' ],
environments: [ 'World', 'Planet', 'Universe', 'North Dakota' ]
});
Or from existing data:
const regexData = {
settings: {
template: '(greetings) (?=environments)',
flags: 'i'
},
values: {
greetings: [ 'Hello', 'Good Morning', 'Hi', 'Howdy' ],
environments: [ 'World', 'Planet', 'Universe', 'North Dakota' ]
}
}
const pattern = new RegexNotationObject(regexData);
// Shorthand: const pattern = new Reno(regexData);
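Since this feature is only planned, the expansion rule is not specified. One plausible reading is that each name in the template is replaced by an alternation of its values. The sketch below works under that assumption. `compileNotation` and `escapeRegex` are hypothetical helpers, not part of Reba.

```javascript
// Escape characters that have a special meaning in a regular expression.
const escapeRegex = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Replace every value name found in the template with an alternation of its values.
function compileNotation({ settings, values }) {
  const names = Object.keys(values).map(escapeRegex).join('|');
  const placeholder = new RegExp(`\\b(?:${names})\\b`, 'g');
  const source = settings.template.replace(placeholder, (name) =>
    values[name].map(escapeRegex).join('|')
  );
  return new RegExp(source, settings.flags);
}

const compiled = compileNotation({
  settings: { template: '(greetings) (?=environments)', flags: 'i' },
  values: {
    greetings: ['Hello', 'Good Morning', 'Hi', 'Howdy'],
    environments: ['World', 'Planet', 'Universe', 'North Dakota']
  }
});

console.log(compiled.source);
// (Hello|Good Morning|Hi|Howdy) (?=World|Planet|Universe|North Dakota)
console.log(compiled.test('hello World')); // true
console.log(compiled.test('Howdy Mars'));  // false
```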
Incremental Debugging (Experimental)
If a pattern does not match due to behavioral errors instead of structural ones, figuring out what is wrong can be a challenge. Reba has a debugger that increments a pattern step-by-step until it mismatches on a sample string. It returns the subpattern and the culprit character:
Reba.debug(/Hello (world)!/, 'Hello World!');
Mismatch at increment 6 (zero based),
Desired: Hello World,
Pattern: /Hello (w)/,
Culprit: w,
Culprit_index: 7
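The incrementing idea can be sketched as follows: grow the pattern one character at a time, close any open groups so each candidate compiles, and stop at the first candidate that fails on the sample. This is a simplified illustration that only handles literal characters and plain groups (no escapes, classes or quantifiers). `sketchDebug` is a hypothetical name and not Reba's debugger.

```javascript
function sketchDebug(regex, sample) {
  const source = regex.source;
  let open = 0;       // number of currently unclosed groups
  let increment = -1; // counts non-structural characters, zero based

  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '(') { open++; continue; }
    if (ch === ')') { open--; continue; }
    increment++;

    // Close any open groups so the partial pattern is valid.
    const candidate = new RegExp(source.slice(0, i + 1) + ')'.repeat(open), regex.flags);
    if (!candidate.test(sample)) {
      return { increment, pattern: candidate, culprit: ch, culpritIndex: i };
    }
  }
  return null; // every increment matched
}

console.log(sketchDebug(/Hello (world)!/, 'Hello World!'));
// { increment: 6, pattern: /Hello (w)/, culprit: 'w', culpritIndex: 7 }
```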
Pattern Decomposition (Planned)
Attempt to match a pattern and, if the result is null, replace literal characters with meta characters for a given number of rounds, e.g. changing numbers to \d, letters to \w, and so on. This enables 'safety-first' matching and catches one-off misses.
pattern.blur(rounds); // 1 round means every literal character will be replaced with a meta (\d or \w or \s or .) once
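A single round of this idea might look like the sketch below, which assumes the pattern consists only of literal characters and picks the narrowest meta character for each one. What further rounds would do is not described here, so the sketch stops at one. `blurChar`, `blurOnce` and `matchWithBlur` are hypothetical names, not the planned API.

```javascript
// Replace one literal character with the narrowest meta character that covers it.
function blurChar(ch) {
  if (/\d/.test(ch)) return '\\d';
  if (/\w/.test(ch)) return '\\w';
  if (/\s/.test(ch)) return '\\s';
  return '.';
}

// One round: every literal character is replaced once.
function blurOnce(regex) {
  const source = [...regex.source].map(blurChar).join('');
  return new RegExp(source, regex.flags);
}

// Try the exact pattern first and fall back to the blurred one on a miss.
function matchWithBlur(regex, input) {
  return input.match(regex) || input.match(blurOnce(regex));
}

const exact = /Order 1234/;
console.log(blurOnce(exact).source);                 // \w\w\w\w\w\s\d\d\d\d
console.log(matchWithBlur(exact, 'Order 1235')[0]);  // Order 1235
```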