Brief

In this article, I would like to share an overview of a hard-level task solution published in “Leet code” related to regular expression matching. This task implementation shows the most essential parts of regular expression matching.

Task statement

Statements of the task are:

input is lowercase string of english alphabet

string size limited to 40 symbols

any english alphabet symbol in regex have to have match at tested string

symbol ‘.’ expects that it can match any symbol

combination of english alphabet symbols with ‘*’ right after expects that any number of base symbols should occur ( even 0 symbols ) will match

combination ‘.*’ expect any number of any symbols can match

Samples:

isMatch(/*regex*/ “a”, /*string*/ “a”) is true

isMatch(/*regex*/ “a”, /*string*/ “b”) is false

isMatch(/*regex*/ “a”, /*string*/ “”) is false

isMatch(/*regex*/ “a*”, /*string*/ “”) is true

isMatch(/*regex*/ “a*”, /*string*/ “a”) is true

isMatch(/*regex*/ “a*”, /*string*/ “b”) is false

isMatch(/*regex*/ “.”, /*string*/ “”) is false

isMatch(/*regex*/ “.”, /*string*/ “a”) is true

isMatch(/*regex*/ “.”, /*string*/ “b”) is true

isMatch(/*regex*/ “.*”, /*string*/ “badasddsazxc”) is true

Next, we will look at the approaches to the solution: what are the most significant parts of the task to consider?

Precise task overview:

The number of string-to-regular expression matching combinations we can do is huge. Attempting to combine them all and test the match at each of the incoming strings makes the task hard at first glance.

Simplest regular expressions matching

The only way to find the solution is to limit the number of parameters we have. Let’s consider this task with the most primitive subset of regular expressions. Just the simplest regular expression classes will be considered first.

Now we have ensured that step one of the task solution can be implemented.

Two parts complex regular expressions

The next step is to find a solution for the combination of any 2 possible regular expressions together to find the matching of the string. We can consider all of the combinations of simplest patterns at complex expression, but instead, let’s combine them smartly:

This way we can expect that only 4 combinations of string intervals matching of simplest regular expression exist:

In the last scenario string matches to regex. In all other scenarios, we can expect that dividing the string into intervals for placeholders can be changed. It means that we can move symbols from one interval to another and vice versa to reach the matching (if possible).


It is possible to make this set of moves of placeholder borders canonical and correct to find the match if it exists.

Initialization of solution

Let’s consider that all of the placeholders have some capacity to match the intervals. For “one symbol regular expression” the placeholder will have a capacity equal to 1. For ”any count symbols placeholders” capacity is unlimited.


It means that it is fair to divide the string into intervals by capacity in the initial stage of the matching algorithm. Let’s put all of the symbols of the tested string to placeholders according to placeholders capacity from left to right.

As we can see in the picture, the first interval has unlimited capacity. That is why it contains all of the symbols on the initialization of the algorithm.

This placement is fair:

So to make the algorithm solid at this state we need to:

At this point we have step 2 completed: for just 2 intervals we can implement the solution of the task.

Three parts complex regular expression

Now let’s look at dividing the string into 3 placeholders. It is possible to apply the algorithm for 2 placeholders for 3 intervals with small modifications. Next, it is possible to increase the number of placeholders to any count.

As we can see on Picture 5 the first move is from the first placeholder to the next one — left to right. It is the most atomic action we can do, we have no other actions to be made. But why do we select the first placeholder as the source of the move?

We can define this criterion as “can move”.

In the next step, we need to define a placeholder to be selected as a source of move. If most right placeholders have a symbol and this symbol does not match the placeholder — then the addition of any symbol will not fix the matching of this placeholder. This symbol can not belong to this pattern. It should be moved.

In the second step “source of move placeholder” will be selected by strategy defined above. But if we try to use this strategy for the third step — we will find that no more steps are possible. All placeholders point to the string parts that match to simplest regular expression or to the string without symbols inside.


That is why we need to define additional criteria to detect the “source of move pattern” at the next step that the red arrow in Picture5 shows. It is the special criteria for placeholders that have a match.

Combined placeholder

Let’s consider the first and second placeholders as a combined placeholder. This way we will find that the only option to have a match will be the transition symbol from this combined placeholder to the next placeholder. This is obvious because we have a match at the combined placeholder and do not have it at the next placeholder. It is the same way to act for two placeholders we considered.

But why do we not care to get second and third placeholders as combined placeholders? Without internal movement in this complex placeholder, we can not reach its matching state by moving symbols from outside. That is why we should have equal state placeholders as complex placeholders to correctly use actions from using two placeholders. This means that new criteria should consider this complex placeholder at the beginning of the placeholder collection.

Algorithm definition

Now, we have all the essential parts together and we can define the algorithm internals:

Implementation details

We will talk about the implementation of the task in modern C++. There are several points to discuss about the implementation:

Solution runtime complexity

Let's dive into the runtime complexity of our solution. When considering runtime complexity, we begin by examining each step involved in moving the border of placeholders from one symbol to another. This process involves passing through the symbols one by one. The maximum number of symbols that the count border can pass through is the string length itself. Additionally, the last border cannot be moved at all. Consequently, the worst-case complexity, based on moves, can be defined as O(s²).

However, it's important to note that this complexity is not the final one. At each step, we also need to detect the most right unmatched and matched symbols using strategies one and two. The worst-case complexity of this detection will be linear, denoted as O(r).

Therefore, the total complexity of our solution can be expressed as O(s² * r).

Improving solution runtime complexity

But we can make runtime complexity more effective and reach O(s² * log(r)).

For every step the next move source placeholder needs to be found. Worst case we need to traverse over all of the placeholders to find it. But we can make it better. We can create an ordered binary tree of:

This is not all about reaching the runtime optimization goal: We need to build these ordered binary trees and be able rebuild them fast.

At every move on the algorithm comes possible to change several parameters:

We have all of these states at the ordered sets mentioned later. That is why we need to process maximum:

One more area that can increase the runtime is matching of the simplest regular expression in placeholder and string intervals.

Just imagine if we will recount the simplest regular expression matching symbol by symbol at each algorithm step. The complexity of each step of this solution will be linear O(s). This is not an option — we can make it O(1) by using unmatched symbols count and detecting which symbol comes in/out to the string interval that placeholder points.

Memory footprint

Now let’s talk about memory footprint. In the best-case scenario, it should be O(s+r). This approach is feasible when we cannot store all intervals and simple regular expressions themselves but only reference certain parts of incoming data.

In C++, the most suitable method for this is utilizing std::string_view classes to hold references to regular expressions and string intervals instead of storing values as string members at placeholders. Even when optimizing for runtime, the memory footprint complexity for both algorithm strategies remains O(s+r). This is due to the fact that the space complexity of an ordered binary tree is O(r).

Conclusion:

Algorithms described in the article have: