Computing Damerau-Levenshtein Distance

At some point in your programming career, you'll likely find yourself wanting to quantify how "close" two strings are to one another. You will quickly encounter "Levenshtein distance" which defines the distance between two strings A & B as the minimum number of insertions, deletions & single-character changes required to transform A to B (so the Levenshtein distance between "cat" & "act" is 2). For myself, my most frequent typographic error is transposition: this is included in Damerau-Levenshtein distance, which adds transpositions to the set of edit operations under consideration (so the Damerau-Levenshtein distance between "cat" & "act" is just 1).

Damerau-Levenshtein distance initially found application in spell-checkers & other sorts of fuzzy lookups, but more recently seems to have also found use in biology for measuring the distance between strings used to encode various entities (protein sequences, e.g.). Surprisingly, despite the basic computation of Damerau-Levenshtein distance being well-understood (it's laid out on Wikipedia), there doesn't seem to be a reference C/C++ implementation (it is a feature request for the boost string library). A Github search turned up several repositories, but the ones I examined lacked documentation or test suites.

I spent some time digging into the literature & coding up the principal algorithms for computing Damerau-Levenshtein distance (TL;DR; use that of Berghel & Roach). I wrote-up a little survey along with the code & posted them to Github, but you can also download tarballs here. Finally, if you're interested in fuzzy string matching in general, beyond this particular method, this is a nice little survey I found.

06/16/20 08:30

Have a comment on this post? Start a discussion in my public inbox by sending an email to ~sp1ff/public-inbox@lists.src.ht(mailing list etiquette), or see existing discussions