
Are you among the many developers who get the shivers whenever they encounter regular expressions? Does your blood pressure rise? Do your eyes and ears start twitching involuntarily?

Well, if you can answer all of these questions with a clear-cut “Yes, sir” and your psychotherapy is getting too expensive, please listen up.
Regex has become a standard for pattern matching in text. Why? Because it was there when the whole IT business started, with the fundamental ideas reaching back into the 1950s!
Back then you had Rock ‘n Roll, cars with a mileage comparable to modern space ships and telephone operator ladies with lovely voices. It wasn’t all bad and back then regular expressions made sense: in theory!
But nowadays we have alternatives, for example a text processor named

Scripal (https://github.com/scripal-git/scripal).

There are so many concepts and tools from the IT pioneer days, and SQL is among them. But in most areas we’ve seen a radical shift towards new ideas and tools. Today it’s intuitive, easy-to-use NoSQL with MongoDB, Neo4j and Redis. The world isn’t just tables, it’s tree structures, networks and complex links.
Now we have JSON, XML and YAML, not only CSV. Finally Unicode and UTF-8/16 have become the standard, rest in peace ASCII. Today we write our software in Python or C# and it just runs on any machine with the necessary support. No more C forever, compiling on a specific machine (Write-Once, Run-Only-On-My-Machine-In-The-Basement-Mom-Don’t-Touch-It) and memory faults.

The world has changed and our everyday tools have become so much more sophisticated and comfortable. Except when it comes to string parsing and matching: Hello 1950s, Jerry Lee Lewis’ “Great Balls of Fire” playing on the AM radio.
I’ve seen very few ideas about alternatives, and we all know how annoying regex can be.

Here’s the problem:

Unreadable without using LSD in the workplace:

Developers and admins google the control characters over and over again, because they just don’t stay in your brain, unlike the telephone number of the pizza delivery service. We had assembler, and just about everyone hates it:

function Sum(var X, Y: Integer): Integer; stdcall;
begin
  asm
    MOV                EAX,X
    MOV                EAX,[EAX]
    MOV                EDX,Y
    ADD                EAX,[EDX]
    MOV                @Result,EAX
  end;
end;

and now regex:

(?:(?<=^|\s)(?=\S)|(?<=\S|^)(?=\s))this (?:(?<=\S)(?=\s|$)|(?<=\s)(?=\S|$))

What’s worse, dude? Why do people accept this prehistoric syntax? We’re talking about a dinosaur here. After a while it’s unreadable even for yourself, and certainly for your colleagues.
Do you need lifetime enemies? Ask your colleague to give you a regex to match numbers in the range [-3467, 948764].

I’ve seen people tinkering with expressions for hours, even though they could have written the same logic with Python’s string functions in a few minutes. But people just use regex, because it’s the standard and because everybody uses it. “But they all drink Coke too”… well, Coke has competitors like Pepsi. Even assembler allows comments to describe your ideas. Many regex engines don’t, or only in a special extended mode, and it’s not part of any common standard. Oh right, standard: how many regex flavors are out there? You google a solution, surely it’s the Perl flavor, and your Node.js just won’t work with the example.

Slow, as in slow like a turtle with 3 legs:

I’ve seen regex doing fine for, say, 2K text chunks. As soon as they grow bigger, the execution times may explode or a stack overflow might occur. Depending on the pattern, the time can grow exponentially with the input size: with backtracking the worst case is O(2^n). We’ve had some projects which ran quite well with small test data, but in production the size grew and regex had to be replaced overnight with standard string functions and manual parsing, adrenaline levels skyrocketing.

There are so many examples:

Stack Overflow had a 34-minute outage in 2016, caused by
^[\s\u200c]+|[\s\u200c]+$

and roughly 20,000 consecutive whitespace characters on a comment line.
Many backtracking expressions are time bombs.
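
To get a feel for why that pattern blew up, here is a minimal sketch in Python. It does not run a real regex engine: naive_trailing_ws_steps is a made-up helper that models how a backtracking matcher handles [\s\u200c]+$ on such a line. At every start position it walks the whitespace run, finds no end of string, gives up and starts again one character later. In this particular case the work grows quadratically rather than exponentially, which was already enough.

```python
def is_ws(ch: str) -> bool:
    """Whitespace as in the pattern: \\s plus the zero-width non-joiner."""
    return ch.isspace() or ch == "\u200c"


def naive_trailing_ws_steps(text: str) -> int:
    """Count the characters inspected while looking for a whitespace run
    that reaches the end of the text (a simplified model, not a regex engine)."""
    steps = 0
    n = len(text)
    for start in range(n):
        i = start
        while i < n:
            steps += 1          # one character inspected
            if not is_ws(text[i]):
                break           # run ended before the end of the text
            i += 1
        if i == n and i > start:
            return steps        # the run reached the end: match found
    return steps


for k in (100, 200, 400):
    # k whitespace characters followed by one non-whitespace character
    print(k, naive_trailing_ws_steps(" " * k + "x"))

# The linear alternative with a plain string function
# (str.rstrip() does not strip "\u200c" by default, so it is named explicitly):
trimmed = ("some comment" + " " * 400).rstrip(" \t\r\n\u200c")
```

Doubling the run of whitespace roughly quadruples the step count. For 20,000 characters this model lands at about 200 million steps, while the rstrip call looks at each trailing character once.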

Primitive. You woman, I man, you make food in kitchen:

Regex doesn’t understand numbers and interprets them as single characters! Well, your CPU knows numbers, so why such a lack of functionality? Regex works on single characters and forgets the rest of the world. But we developers think in numbers, words, phrases… sometimes it’s just very hard to formulate a regex for even simple string portions. Many engines still offer no true Unicode support and developers are astonished by complaints from Chinese users. It’s gruesome, and it shows just when these engines were developed: often decades ago. And why waste execution time on string parsing in a slow but highly productive language like Python, if the regex machine runs in Python itself? That’s why the world still has C or C++ and hardcore masochists willing to use these languages.

Hard to analyze or debug: why is this ‘&$%Rgrrr not working? Why me? Why now? Why here?

We coders use debuggers every day to step through the execution of our code. What’s happening in the regex machine, and why is the pattern failing at the end of the text? It’s not transparent at all, so you keep on rewriting the pattern, try again, rewrite, try... that’s the way you worked 50 years ago in your well-paid job at IBM.

After so much frustration with execution speed, and with regex patterns full of side effects and pitfalls (“but it worked yesterday with the other text”): is there a 21st century tool out there for pattern matching, for finding text within text or replacing it? Anyone?

I’ve looked at Cucumber Expressions, agp-exp and other solutions.
A very different approach is Scripal. It’s new, just pushed to our beloved GitHub, still smoking. Finding a single character as the start of a match is about the fastest method you can find. It’s written in C++ for interoperability with other languages and pure speed. Scripal compiles the source pattern into fast byte code. Its syntax is clear-cut and more like the high-level languages we use every day. Scripal doesn’t use complex, intrinsic backtracking: you must specify what to look for, and after the initial match, the preceding text portion may be tested. It’s just a different approach to parsing.

Examples:

Find all telephone numbers in NANP format:

match find( '(' repeat[3]( digit ) ') ' repeat[3]( digit ) '-' repeat[4]( digit ))
end
loop

Find the first integer number in the range 1 to 200:

match find ( [1,200] )

Do we have an IPv4?

match (pure[0,255] '.' pure[0,255] '.' pure[0,255] '.' pure[0,255] )

Scripal’s patterns are so much simpler than regex in most cases.
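
For comparison, here is roughly what the last two tasks look like in Python when the regex only finds runs of digits and ordinary code does the number check. This is a sketch, not a translation of Scripal’s semantics: first_int_in_range and is_ipv4 are names invented here, and the exact meaning of Scripal’s pure is not modeled.

```python
import re
from typing import Optional


def first_int_in_range(text: str, lo: int, hi: int) -> Optional[int]:
    """Return the first run of ASCII digits whose value lies in [lo, hi]."""
    for m in re.finditer(r"[0-9]+", text):
        value = int(m.group())
        if lo <= value <= hi:
            return value
    return None


def is_ipv4(text: str) -> bool:
    """True if text is four dot-separated decimal numbers, each in 0..255."""
    parts = text.split(".")
    return len(parts) == 4 and all(
        p.isascii() and p.isdigit() and int(p) <= 255 for p in parts
    )


print(first_int_in_range("room 4711, floor 12", 1, 200))
print(is_ipv4("192.168.0.1"), is_ipv4("256.1.1.1"))
```

The range check is the part a character-based regex cannot express directly. There it has to be spelled out as alternations of digit patterns, which is exactly where the unreadable monsters come from.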

Debug information:

Set the debug switch and watch how the engine steps through the code.

 set operator ‘match’ at(0)
 start group
 set operator ‘null’ at(1)
 operator done: ‘null’ at(1)
 close operator..

Comments:

Finally the developer can comment his or her source for the rest of the world. Even the first and most primitive programming languages offered comments.

match ( 'hdu-' ) // match 'hdu-' as start of identification string

Similarity:

Scripal can find similar words and phrases, not using patterns:

scripal -f 0.5 "test" "Find the word in testing"

result: “test” at position 17 (in word ‘testing’) with a similarity rank of 0.57142857. Similarity values are in the range 0 (no similarity) to 1 (equal)

Scripal can compare two strings and tell us how similar they are:

scripal -g 0.5 "Compare the text with any other" "Compare the text with something else"

will return 0.857 as a similarity value for the two text portions.
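
To illustrate the idea of a similarity score between 0 and 1, here is a small sketch using difflib.SequenceMatcher from Python’s standard library. It is a different measure from whatever Scripal computes, so the numbers will not match the ones above. best_word_match is a helper invented for this example.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple


def similarity(a: str, b: str) -> float:
    """difflib's ratio: 0.0 means nothing in common, 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()


def best_word_match(needle: str, text: str,
                    threshold: float) -> Optional[Tuple[str, int, float]]:
    """Return (word, position, score) for the most similar word in text,
    or None if no word reaches the threshold."""
    best = None
    for m in re.finditer(r"\w+", text):
        score = similarity(needle, m.group())
        if score >= threshold and (best is None or score > best[2]):
            best = (m.group(), m.start(), score)
    return best


print(best_word_match("test", "Find the word in testing", 0.5))
print(similarity("Compare the text with any other",
                 "Compare the text with something else"))
```

With this measure the first call also picks the word “testing” at position 17, only with a different score.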

Unicode and internationalization:

Scripal has full Unicode support and accepts the most common encodings like UTF-8, UTF-16, UTF-32, Windows code pages and others. It uses UTF-8 internally, so any language supporting UTF-8 will get the maximum effect with Scripal’s search algorithms.

Character classes and numbers:

Scripal has a concept of character classes: you can match any letter as in “match(letter)”, or the begin/end of a word or sentence, digits and more. Scripal knows Arabic or Chinese digits, no problem. It’s easy to match a number in the range “[3,7]”, even as a hex or binary number.

Variables and templates:

Why repeat patterns, often seen in regex? Why not have templates for standard tasks?

< roadMarker = { any( ~'avenue' ~'ave.' ~'road' ~'street' ~'boulevard' ~'drive' ~'lane' ) } >
match find( int[1,10000] blank repeat[1,3]( !( < roadMarker > ) word ) blank < roadMarker > ) 
ifMatch { 
  matchEnd ( ',' blank int[1,@] repeat[1,3]( blank word ) at any(',' eol eot ))
}

Here we define roadMarker, a template that finds any type of road, case insensitive (~), and use it to get an address.

You may store results and use them later on, much like variables.

name #code# // store match result
match find ( #code# ) // find next piece like last result

Versatility:

Scripal comes as a C++ library and libs or modules for Node.js, Python, Java, C, C# for Windows and basically any Linux distro. It’s also a console binary / exe which can be used instantly, a handy tool for administrators and DevOps. It can be used to search for files or phrases in files, making it a nice alternative to grep, sed and awk.

Speed:

Scripal’s execution time is linear, O(n), with respect to the input size, so no bitter surprises on larger strings. You may estimate the required processing time for a given maximum data size. It might not be faster than regex in every case, but its processing time is roughly computable.

Verdict (guilty, your Honor):

Scripal is not a perfect solution; probably there will never be one for string matching and parsing. There will always be a boundary: if a task is too complicated, you’ll use string functions in your programming language to compare or find patterns. For parsing HTML, XML or JSON there are myriads of libraries and tools out there. Using Scripal or regex for this task is not optimal, even though Scripal has some examples for finding data in JSON arrays. But to match patterns quickly in raw text, especially in large strings and files, Scripal is a new idea, a true alternative to a 70-year-old technology.

Give it a try, it hurts far less than regex.

If you’re interested, have a look at https://github.com/scripal-git/scripal

Regex Is a 70-Year-Old Dinosaur—Here’s the Modern Alternative