Finding repetitions
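Given a string $s$ of length $n$.
A repetition is two occurrences of a string in a row.
In other words, a repetition can be described by a pair of indices $i < j$ such that the substring $s[i \dots j]$ consists of two identical strings written one after the other.
The challenge is to find all repetitions in a given string $s$.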
The algorithm described here was published in 1982 by Main and Lorentz.
Example
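Consider the repetitions in the following example string:
$$acababaee$$
The string contains the following three repetitions:
- $s[2 \dots 5] = abab$
- $s[3 \dots 6] = baba$
- $s[7 \dots 8] = ee$
Another example:
$$abaaba$$
Here there are only two repetitions:
- $s[0 \dots 5] = abaaba$
- $s[2 \dots 3] = aa$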
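For small inputs the definition can be checked directly. The following is a brute-force sketch, not the algorithm described in this article; the helper name `brute_force_repetitions` is an assumption made for illustration. It compares the two halves of every even-length substring and therefore takes cubic time in the worst case.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: lists every repetition s[i..j] (0-indexed, inclusive)
// by comparing the two halves of each even-length substring directly.
std::vector<std::pair<int, int>> brute_force_repetitions(const std::string& s) {
    int n = s.size();
    std::vector<std::pair<int, int>> result;
    for (int i = 0; i < n; i++) {
        for (int len = 1; i + 2 * len <= n; len++) {
            // compare s[i..i+len-1] with s[i+len..i+2*len-1]
            if (s.compare(i, len, s, i + len, len) == 0)
                result.emplace_back(i, i + 2 * len - 1);
        }
    }
    return result;
}
```

Called on the string `acababaee`, this sketch returns the pairs (2, 5), (3, 6) and (7, 8); on `abaaba` it returns (0, 5) and (2, 3). It is only useful as a reference when testing a faster implementation.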
Number of repetitions
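In general there can be up to $O(n^2)$ repetitions in a string of length $n$. An obvious example is a string consisting of $n$ copies of the same letter: in this case every substring of even length is a repetition.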
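To make the quadratic growth concrete, the count can be worked out exactly for a string that consists of $n$ copies of a single letter, where every even-length substring is a repetition. A short derivation, summing the number of start positions over the half-length $l$:

```latex
\sum_{l=1}^{\lfloor n/2 \rfloor} (n - 2l + 1)
  = \left\lfloor \frac{n}{2} \right\rfloor \cdot \left\lceil \frac{n}{2} \right\rceil
  = \left\lfloor \frac{n^2}{4} \right\rfloor
```

For example, $n = 5$ gives $4$ repetitions of length two and $2$ of length four, i.e. $6 = \lfloor 25 / 4 \rfloor$ in total.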
On the other hand, this fact does not prevent computing the number of repetitions in $O(n \log n)$ time, because the algorithm can report the repetitions in compressed form, in groups of several repetitions at once.
There is even a concept that describes groups of periodic substrings with tuples of size four. It has been proven that the number of such groups is at most linear with respect to the string length.
Also, here are some more interesting results related to the number of repetitions:
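- The number of primitive repetitions (those whose halves are not repetitions) is at most $O(n \log n)$.
- If we encode repetitions with tuples of numbers (called Crochemore triples) $(i,~ p,~ r)$, where $i$ is the starting position, $p$ is the length of the repeating substring, and $r$ is the number of repetitions, then all repetitions can be described with $O(n \log n)$ such triples.
- Fibonacci strings, defined as
  $$t_0 = a, \quad t_1 = b, \quad t_i = t_{i-1} + t_{i-2},$$
  are "strongly" periodic. The number of repetitions in the Fibonacci string $t_i$ of length $f_i$, even when compressed with Crochemore triples, is $O(f_i \log f_i)$.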
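Fibonacci strings are convenient test inputs because they contain many repetitions. A minimal sketch for generating them follows; the helper name `fibonacci_string` is hypothetical, and the base strings `a` and `b` with the concatenation order $t_i = t_{i-1} + t_{i-2}$ are assumptions of this example.

```cpp
#include <string>

// Hypothetical helper: builds the i-th Fibonacci string, assuming
// t_0 = "a", t_1 = "b" and t_i = t_{i-1} + t_{i-2}.
std::string fibonacci_string(int i) {
    std::string prev = "a";  // t_0
    std::string cur = "b";   // t_1
    if (i == 0)
        return prev;
    for (int k = 2; k <= i; k++) {
        std::string next = cur + prev;  // t_k = t_{k-1} + t_{k-2}
        prev = cur;
        cur = next;
    }
    return cur;
}
```

Under these assumptions the first strings are `a`, `b`, `ba`, `bab`, `babba`, and their lengths follow the Fibonacci numbers.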
Main-Lorentz algorithm
The idea behind the Main-Lorentz algorithm is divide-and-conquer.
It splits the initial string into halves and finds the repetitions that lie completely in each half by two recursive calls. Then comes the difficult part: the algorithm finds all repetitions that start in the first half and end in the second half (which we will call crossing repetitions). This is the essential part of the Main-Lorentz algorithm, and we will discuss it in detail here.
The complexity of divide-and-conquer algorithms is well researched.
The master theorem says that we will end up with an $O(n \log n)$ algorithm if we can compute the crossing repetitions in $O(n)$ time.
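Written as a recurrence, with $T(n)$ denoting the running time on a string of length $n$ and assuming that the crossing repetitions of one split are handled in linear time, the bound reads:

```latex
T(n) = 2\,T\!\left(\frac{n}{2}\right) + O(n)
\quad\Longrightarrow\quad
T(n) = O(n \log n)
```

This is the same recurrence as for merge sort: there are $O(\log n)$ levels of recursion, and the work on each level sums to $O(n)$.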
Search for crossing repetitions
So we want to find all repetitions that start in the first half of the string, let's call it $u$, and end in the second half, let's call it $v$:
$$s = u + v$$
Their lengths are approximately equal to the length of $s$ divided by two.
Consider an arbitrary repetition and look at the middle character (more precisely the first character of the second half of the repetition).
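I.e. if the repetition is a substring $s[i \dots j]$, then the middle character is at position $(i + j + 1) / 2$.
We call a repetition left or right depending on which string this character is located in: the string $u$ or the string $v$. In other words, a repetition is called left if the majority of it lies in $u$; otherwise we call it right.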
We will now discuss how to find all left repetitions. Finding all right repetitions can be done in the same way.
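Let us denote the length of the left repetition by $2l$ (i.e. each half of the repetition has length $l$). Consider the first character of the repetition that falls into the string $v$ (it is at position $|u|$ in the string $s$). It coincides with the character $l$ positions before it; let's denote this position by $cntr$.
We will fix this position $cntr$ and look for all repetitions at this position.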
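For example:
$$c ~ \underset{cntr}{a} ~ c ~ | ~ a ~ d ~ a$$
The vertical line divides the two halves.
Here we fixed the position $cntr = 1$, and at this position we find the repetition $caca$.
It is clear that if we fix the position $cntr$, we simultaneously fix the length of the possible repetitions: $l = |u| - cntr$. Once we know how to find these repetitions, we iterate over all possible values of $cntr$ from $0$ to $|u| - 1$ and find all left crossing repetitions of length $l = |u|,~ |u| - 1,~ \dots,~ 1$.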
Criterion for left crossing repetitions
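Now, how can we find all such repetitions for a fixed $cntr$? Keep in mind that there can still be multiple such repetitions.
Let's again look at a visualization, this time for the repetition $abcabc$:
$$\overbrace{a}^{l_1} ~ \overbrace{\underset{cntr}{b} ~ c}^{l_2} ~ \overbrace{a}^{l_1} ~ | ~ \overbrace{b ~ c}^{l_2}$$
Here we denoted the lengths of the two pieces of the repetition with $l_1$ and $l_2$: $l_1$ is the length of the part of the repetition up to position $cntr - 1$, and $l_2$ is the length of the part from $cntr$ to the end of the first half of the repetition. The total length of the repetition is $2l = l_1 + l_2 + l_1 + l_2$.
Let us derive necessary and sufficient conditions for such a repetition at position $cntr$ of length $2l = 2(l_1 + l_2) = 2(|u| - cntr)$:
- Let $k_1$ be the largest number such that the $k_1$ characters before the position $cntr$ coincide with the last $k_1$ characters of the string $u$: $u[cntr - k_1 \dots cntr - 1] = u[|u| - k_1 \dots |u| - 1]$.
- Let $k_2$ be the largest number such that the $k_2$ characters starting at position $cntr$ coincide with the first $k_2$ characters of the string $v$: $u[cntr \dots cntr + k_2 - 1] = v[0 \dots k_2 - 1]$.
- Then we have a repetition exactly for any pair $(l_1,~ l_2)$ with $l_1 + l_2 = l$, $l_1 \le k_1$ and $l_2 \le k_2$.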
To summarize:
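- We fix a specific position $cntr$.
- All repetitions that we find now have length $2l = 2(|u| - cntr)$. There might be multiple such repetitions; they differ in the lengths $l_1$ and $l_2 = l - l_1$.
- We find $k_1$ and $k_2$ as described above.
- Then all suitable repetitions are the ones for which the lengths of the pieces $l_1$ and $l_2$ satisfy the conditions $l_1 + l_2 = l = |u| - cntr$, $l_1 \le k_1$ and $l_2 \le k_2$ (the implementation below additionally requires $l_1 \ge 1$ and $l_2 \ge 1$, so that the repetition is indeed a left crossing one).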
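The criterion can be tried out directly before any precomputation is introduced. The sketch below computes $k_1$ and $k_2$ by plain character comparison for one fixed center; the helper name `left_repetitions_at` is hypothetical, and the returned pairs are start and end indices in $s = u + v$.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: all left crossing repetitions of s = u + v for a
// fixed cntr (0 <= cntr < |u|), as inclusive (start, end) indices in s.
std::vector<std::pair<int, int>> left_repetitions_at(const std::string& u,
                                                     const std::string& v,
                                                     int cntr) {
    int nu = u.size(), nv = v.size();
    int l = nu - cntr;  // half-length, fixed by the choice of cntr

    // k1: characters before cntr that match the end of u
    int k1 = 0;
    while (k1 < cntr && u[cntr - 1 - k1] == u[nu - 1 - k1])
        k1++;

    // k2: characters from cntr on that match the beginning of v
    int k2 = 0;
    while (cntr + k2 < nu && k2 < nv && u[cntr + k2] == v[k2])
        k2++;

    std::vector<std::pair<int, int>> result;
    // l1 >= 1 keeps the middle character in u, l1 <= l - 1 keeps l2 >= 1
    for (int l1 = std::max(1, l - k2); l1 <= std::min(l - 1, k1); l1++) {
        int start = cntr - l1;
        result.emplace_back(start, start + 2 * l - 1);
    }
    return result;
}
```

With `u = "abca"`, `v = "bc"` and `cntr = 1` this yields the single pair (0, 5), i.e. the whole string `abcabc`. Computing $k_1$ and $k_2$ this way costs linear time per center, which is what the precomputation described next avoids.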
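Therefore the only remaining part is how to compute the values $k_1$ and $k_2$ quickly for every position $cntr$. Luckily, after a linear-time precomputation we can look them up in $O(1)$ using the Z-function:
- We can find the value $k_1$ for each position by calculating the Z-function for the string $\overline{u}$ (i.e. the reversed string $u$). Then the value $k_1$ for a particular $cntr$ is the Z-function value at index $|u| - cntr$.
- To precompute all values $k_2$, we calculate the Z-function for the string $v + \# + u$ (i.e. the string $v$ concatenated with the separator character $\#$ and the string $u$). The value $k_2$ for a particular $cntr$ is the Z-function value at index $|v| + 1 + cntr$.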
So this is enough to find all left crossing repetitions.
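The two lookups can be sketched as follows. The helper name `left_k_values` is hypothetical, and the sketch assumes that the separator `#` occurs in neither $u$ nor $v$; the index arithmetic matches the one used in the implementation further down.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Standard Z-function: z[i] = longest common prefix of s and s[i..].
std::vector<int> z_function(const std::string& s) {
    int n = s.size();
    std::vector<int> z(n);
    for (int i = 1, l = 0, r = 0; i < n; i++) {
        if (i <= r)
            z[i] = std::min(r - i + 1, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]])
            z[i]++;
        if (i + z[i] - 1 > r) {
            l = i;
            r = i + z[i] - 1;
        }
    }
    return z;
}

// Hypothetical helper: returns (k1, k2) for every cntr in [0, |u|).
std::vector<std::pair<int, int>> left_k_values(const std::string& u,
                                               const std::string& v) {
    int nu = u.size(), nv = v.size();
    std::string ru(u.rbegin(), u.rend());
    std::vector<int> z1 = z_function(ru);           // for k1
    std::vector<int> z2 = z_function(v + '#' + u);  // for k2
    std::vector<std::pair<int, int>> k(nu);
    for (int cntr = 0; cntr < nu; cntr++) {
        // u[cntr - 1] is at index nu - cntr in ru; nothing precedes cntr = 0
        int k1 = cntr > 0 ? z1[nu - cntr] : 0;
        // u[cntr] is at index nv + 1 + cntr in v + '#' + u
        int k2 = z2[nv + 1 + cntr];
        k[cntr] = {k1, k2};
    }
    return k;
}
```

For `u = "abca"` and `v = "bc"` the entry for `cntr = 1` is (1, 2), which agrees with the visualization of `abcabc`: one character before the center and two characters from the center on.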
Right crossing repetitions
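For computing the right crossing repetitions we act similarly:
we define the center $cntr$ as the character corresponding to the last character in the string $u$.
Then the length $k_1$ is defined as the largest number of characters up to and including the position $cntr$ that coincide with the last characters of the string $u$. And the length $k_2$ is defined as the largest number of characters starting at $cntr + 1$ that coincide with the first characters of the string $v$.
Thus we can find the values $k_1$ and $k_2$ by computing the Z-function for the strings $\overline{u} + \# + \overline{v}$ and $v$.
After that we can find the repetitions by looking at all positions $cntr$, using the same criterion as for left crossing repetitions.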
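As with the left case, the right case can be checked with plain comparisons for one fixed center. In this sketch the helper name `right_repetitions_at` is hypothetical, and `cntr` is an index in $s = u + v$ with $|u| \le cntr < |s|$, so the half-length is $l = cntr - |u| + 1$.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: all right crossing repetitions of s = u + v for a
// fixed cntr (|u| <= cntr < |s|), as inclusive (start, end) indices in s.
std::vector<std::pair<int, int>> right_repetitions_at(const std::string& u,
                                                      const std::string& v,
                                                      int cntr) {
    std::string s = u + v;
    int nu = u.size(), n = s.size();
    int l = cntr - nu + 1;  // half-length, fixed by the choice of cntr

    // k1: characters ending at cntr (inclusive) that match the end of u
    int k1 = 0;
    while (k1 < nu && s[cntr - k1] == u[nu - 1 - k1])
        k1++;

    // k2: characters from cntr + 1 on that match the beginning of v
    int k2 = 0;
    while (cntr + 1 + k2 < n && s[cntr + 1 + k2] == v[k2])
        k2++;

    std::vector<std::pair<int, int>> result;
    // l1 >= 1 makes the repetition start in u; l1 == l (l2 == 0) is allowed
    for (int l1 = std::max(1, l - k2); l1 <= std::min(l, k1); l1++) {
        int start = nu - l1;
        result.emplace_back(start, start + 2 * l - 1);
    }
    return result;
}
```

For `u = "aba"` and `v = "aba"`, the center `cntr = 3` gives the pair (2, 3) and `cntr = 5` gives (0, 5), while `cntr = 4` gives nothing.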
Implementation
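The implementation of the Main-Lorentz algorithm finds all repetitions in the form of peculiar tuples of size four: $(cntr,~ l,~ k_1,~ k_2)$ in $O(n \log n)$ time. If you only want to find the number of repetitions in a string, or only the longest repetition, this information is enough and the runtime is still $O(n \log n)$.
Notice that if you want to expand these tuples to get the start and end position of each repetition, then the runtime will be $O(n^2)$ (remember that there can be $O(n^2)$ repetitions). This implementation does so, and stores all found repetitions in a vector of pairs of start and end indices.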
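#include <algorithm>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Z-function: z[i] is the length of the longest common prefix of s and s[i..].
vector<int> z_function(string const& s) {
    int n = s.size();
    vector<int> z(n);
    // [l, r] is the rightmost segment found so far that matches a prefix of s
    for (int i = 1, l = 0, r = 0; i < n; i++) {
        if (i <= r)
            z[i] = min(r - i + 1, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]])
            z[i]++;
        if (i + z[i] - 1 > r) {
            l = i;
            r = i + z[i] - 1;
        }
    }
    return z;
}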

// Bounds-checked access: an index outside the array counts as a match of length 0.
int get_z(vector<int> const& z, int i) {
    if (0 <= i && i < (int)z.size())
        return z[i];
    else
        return 0;
}
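
// All repetitions found, as (start, end) index pairs, both inclusive.
vector<pair<int, int>> repetitions;

// Expands one tuple (cntr, l, k1, k2) into explicit repetitions.
// l1 is the length of the first piece of each half, l - l1 the second.
void convert_to_repetitions(int shift, bool left, int cntr, int l, int k1, int k2) {
    for (int l1 = max(1, l - k2); l1 <= min(l, k1); l1++) {
        // a left repetition with l1 == l would lie entirely in u
        if (left && l1 == l) break;
        int pos = shift + (left ? cntr - l1 : cntr - l - l1 + 1);
        repetitions.emplace_back(pos, pos + 2*l - 1);
    }
}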

// Finds all repetitions of s; shift is the offset of s in the original string.
void find_repetitions(string s, int shift = 0) {
    int n = s.size();
    if (n <= 1)  // also stops the recursion on an empty string
        return;

    int nu = n / 2;
    int nv = n - nu;
    string u = s.substr(0, nu);
    string v = s.substr(nu);
    string ru(u.rbegin(), u.rend());
    string rv(v.rbegin(), v.rend());

    // repetitions that lie completely inside one half
    find_repetitions(u, shift);
    find_repetitions(v, shift + nu);

    // z1, z2 give k1, k2 for left repetitions; z3, z4 for right repetitions
    vector<int> z1 = z_function(ru);
    vector<int> z2 = z_function(v + '#' + u);
    vector<int> z3 = z_function(ru + '#' + rv);
    vector<int> z4 = z_function(v);

    for (int cntr = 0; cntr < n; cntr++) {
        int l, k1, k2;
        if (cntr < nu) {
            // left crossing repetitions: center in u
            l = nu - cntr;
            k1 = get_z(z1, nu - cntr);
            k2 = get_z(z2, nv + 1 + cntr);
        } else {
            // right crossing repetitions: center in v
            l = cntr - nu + 1;
            k1 = get_z(z3, nu + 1 + nv - 1 - (cntr - nu));
            k2 = get_z(z4, (cntr - nu) + 1);
        }
        if (k1 + k2 >= l)
            convert_to_repetitions(shift, cntr < nu, cntr, l, k1, k2);
    }
}
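
If only the number of repetitions is needed, the expansion in `convert_to_repetitions` can be replaced by a constant-time count per tuple. The following is a minimal sketch of that idea; the helper name `count_in_tuple` is hypothetical, and its bounds are meant to mirror the loop in `convert_to_repetitions`.

```cpp
#include <algorithm>

// Hypothetical helper: number of repetitions encoded by one tuple.
// Valid values of l1 range from max(1, l - k2) to min(l, k1); for left
// repetitions l1 == l is excluded, so the upper bound is capped at l - 1.
long long count_in_tuple(bool left, int l, int k1, int k2) {
    int lo = std::max(1, l - k2);
    int hi = std::min(left ? l - 1 : l, k1);
    return std::max(0, hi - lo + 1);
}
```

Summing this value over all tuples instead of calling `convert_to_repetitions` would keep the total running time at $O(n \log n)$, since no individual repetition is materialized.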