Skip to content

Rabin-Karp algorithm in C++ to find recurring substrings in a string and use them to construct an edit list

Notifications You must be signed in to change notification settings

michaeljclark/match

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

match

Rabin–Karp algorithm in C++ to find recurring substrings in a string and use them to construct an edit list.

Overview

This is a single header C++ implementation of the Rabin–Karp algorithm to find recurring substrings in a string. This code finds self-matching substrings using a rolling hash known as a Rabin–Karp signature, which is a hash sum product, composed and recorded in a hashtable incrementally using multiple string length combinations offset from an increasing mark. These signatures are then looked up in the hashtable at each step to find matching substrings.

Rabin–Karp algorithm illustration

The algorithm records offsets of substring occurances in a hash table ('head') indexed by the Rabin–Karp signature and uses the hash table information to compose a chain of occurances in a second table ('prev'), indexed by offset, which leads from the current match to the last match. Each position in the previous match table contains an offset to the last match for the same signature recorded at a prior offset.

The search algorithm, after updating the hash table and previous match table, checks for a hit, and if there is a hit, it follows the chains of past offsets leading through all prior occurances of a particular Rabin–Karp signature, all the way back to the earliest match. The algorithm then verifies these approximate matches, and if valid, creates copy instructions for the best match, or, alternatively emits literal instructions for new text. The matches are approximate because there is a collision probability, however, there are multiple offsets that can be used to match a prior occurance, which aids with the probabalistic nature of the algorithm.

The output is a list of instructions referencing new data or copies of previous data.

  • Vector<Match>
    • Type type (* Literal or Copy *)
    • Size offset
    • Size length

Example

The following is an example invocation of the included test program:

Test command:

$ ./build/match -v -t TGGGCGTGCGCTTGAAAAGAGCCTAAGAAGAGGGGGCGTCTGGAAGGAACCGCAACGCCAAGGGAGGGTG

Expected output:

OriginalText: TGGGCGTGCGCTTGAAAAGAGCCTAAGAAGAGGGGGCGTCTGGAAGGAACCGCAACGCCAAGGGAGGGTG
[  0] : Literal [   0,  7 )   # "TGGGCGT"
[  1] :    Copy [   4,  3 )   # "GCG"
[  2] : Literal [   0, 14 )   # "CTTGAAAAGAGCCT"
[  3] :    Copy [   8,  4 )   # "AAGA"
[  4] :    Copy [  11,  4 )   # "AGAG"
[  5] :    Copy [  31,  3 )   # "GGG"
[  6] :    Copy [  32,  4 )   # "GCGT"
[  7] : Literal [   0,  1 )   # "C"
[  8] :    Copy [  40,  3 )   # "TGG"
[  9] :    Copy [  27,  3 )   # "AAG"
[ 10] :    Copy [  33,  3 )   # "GAA"
[ 11] : Literal [   0,  6 )   # "CCGCAA"
[ 12] :    Copy [   5,  3 )   # "CGC"
[ 13] :    Copy [   6,  3 )   # "CAA"
[ 14] :    Copy [  60,  3 )   # "GGG"
[ 15] : Literal [   0,  1 )   # "A"
[ 16] :    Copy [  64,  3 )   # "GGG"
[ 17] : Literal [   0,  2 )   # "TG"
DataSize/Literals/Copies: 70/31/39
OuterIterations/InnerIterations: 1018/750

About

Rabin-Karp algorithm in C++ to find recurring substrings in a string and use them to construct an edit list

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published