Edit distance

In computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

Different definitions of an edit distance use different sets of string operations. The Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the Levenshtein distance is usually what is meant by "edit distance".[1]

Formal definition and properties

Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is that defined by Levenshtein in 1966:[2]

Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε→x, using ε to denote the empty string.
Deletion of a single symbol changes uxv to uv (x→ε).
Substitution of a single symbol x for a symbol yx changes uxv to uyv (xy).

In Levenshtein's original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform a to b. A more general definition associates non-negative weight functions wins(x), wdel(x) and wsub(x, y) with the operations.[2]

Additional primitive operations have been suggested. A common mistake when typing text is transposition of two adjacent characters, formally characterized by an operation that changes uxyv into uyxv.[3][4] For the task of correcting OCR output, merge and split operations have been used which replace a single character into a pair of them or vice versa.[4]

Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost.[1]:37 Similarly, by only allowing substitutions (again at unit cost), Hamming distance is obtained; this must be restricted to equal-length strings.[1] Jaro–Winkler distance can be obtained from an edit distance where only transpositions are allowed.

Example

The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:

  1. kitten → sitten (substitution of "s" for "k")
  2. sitten → sittin (substitution of "i" for "e")
  3. sittin → sitting (insertion of "g" at the end).

LCS distance (insertions and deletions only) gives a different distance and minimal edit script:

  1. delete k at 0
  2. insert s at 0
  3. delete e at 4
  4. insert i at 4
  5. insert g at 6

for a total cost/distance of 5 operations.

Properties

Edit distance with non-negative cost satisfies the axioms of a metric, giving rise to a metric space of strings, when the following conditions are met:[1]:37

With these properties, the metric axioms are satisfied as follows:

d(a, b) = 0 if and only if a=b, since each string can be trivially transformed to itself using exactly zero operations.
d(a, b) > 0 when ab, since this would require at least one operation at non-zero cost.
d(a, b) = d(b, a) by equality of the cost of each operation and its inverse.
Triangle inequality: d(a, c) ≤ d(a, b) + d(b, c).[5]

Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature.[1]

Other useful properties of unit-cost edit distances include:

Regardless of cost/weights, the following property holds of all edit distances:

Computation

The first algorithm for computing minimum edit distance between a pair of strings was published by Damerau in 1964.[6]

Common algorithm

Using Levenshtein's original operations, the edit distance between and is given by , defined by the recurrence[2]

This algorithm can be generalized to handle transpositions by adding another term in the recursive clause's minimization.[3]

The straightforward, recursive way of evaluating this recurrence takes exponential time. Therefore, it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fischer,[7] although it has a history of multiple invention.[2][3] After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm starting at .

This algorithm has a time complexity of Θ(mn). When the full dynamic programming table is constructed, its space complexity is also Θ(mn); this can be improved to Θ(min(m,n)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. However, this optimization makes it impossible to read off the minimal series of edit operations.[3] A linear-space solution to this problem is offered by Hirschberg's algorithm.[8]:634

Improved algorithms

Improving on the Wagner–Fisher algorithm described above, Ukkonen describes several variants,[9] one of which takes two strings and a maximum edit distance s, and returns min(s, d). It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time O(s×min(m,n)), where m and n are the lengths of the strings. Space complexity is O(s²) or O(s), depending on whether the edit sequence needs to be read off.[3]

Applications

Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected.

Various algorithms exist that solve problems beside the computation of distance between a pair of strings, to solve related types of problems.

Language edit distance

A generalization of the edit distance between strings is the language edit distance between a string and a language, usually a formal language. Instead of considering the edit distance between one string and another, the language edit distance is the minimum edit distance that can be attained between a fixed string and any string taken from a set of strings. More formally, for any language L and string x over an alphabet Σ, the language edit distance d(L, x) is given by[11], where is the string edit distance. When the language L is context free, there is a cubic time dynamic programming algorithm proposed by Aho and Peterson in 1972 which computes the language edit distance. [12] For less expressive families of grammars, such as the regular grammars, faster algorithms exist for computing the edit distance.[13]

Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem. [11] [14]

See also

References

  1. 1 2 3 4 5 6 7 Navarro, Gonzalo (1 March 2001). "A guided tour to approximate string matching" (PDF). ACM Computing Surveys. 33 (1): 31–88. doi:10.1145/375360.375365. Retrieved 19 March 2015.
  2. 1 2 3 4 Daniel Jurafsky; James H. Martin. Speech and Language Processing. Pearson Education International. pp. 107–111.
  3. 1 2 3 4 5 Esko Ukkonen (1983). On approximate string matching. Foundations of Computation Theory. Springer. pp. 487–495.
  4. 1 2 3 4 Schulz, Klaus U.; Mihov, Stoyan (2002). "Fast string correction with Levenshtein automata". International Journal of Document Analysis and Recognition. 5 (1): 67–85. CiteSeerX 10.1.1.16.652Freely accessible. doi:10.1007/s10032-002-0082-8.
  5. Lei Chen; Raymond Ng (2004). On the marriage of Lₚ-norms and edit distance. Proc. 30th Int'l Conf. on Very Large Databases (VLDB). 30.
  6. Kukich, Karen (1992). "Techniques for Automatically Correcting Words in Text" (PDF). ACM Computing Surveys. 24 (4): 377–439. doi:10.1145/146370.146380.
  7. R. Wagner; M. Fischer (1974). "The string-to-string correction problem". J. ACM. 21: 168–178. doi:10.1145/321796.321811.
  8. Skiena, Steven (2010). The Algorithm Design Manual (2nd ed.). Springer Science+Business Media. ISBN 1-849-96720-2.
  9. "Algorithms for approximate string matching" (PDF). Information and Control. 64 (1–3): 100–118. 1985. doi:10.1016/S0019-9958(85)80046-2.
  10. Esko Ukkonen (1985). "Finding approximate patterns in strings". J. Algorithms. 6: 132–137. doi:10.1016/0196-6774(85)90023-9.
  11. 1 2 Bringmann, Karl and Grandoni, Fabrizio and Saha, Barna and Williams, Virginia Vassilevska (2016). Truly Sub-cubic Algorithms for Language Edit Distance and RNA Folding via Fast Bounded-Difference Min-Plus Product. FOCS
  12. Aho, A.; Peterson, T. (1972-12-01). "A Minimum Distance Error-Correcting Parser for Context-Free Languages". SIAM Journal on Computing. 1 (4): 305–312. doi:10.1137/0201022. ISSN 0097-5397.
  13. Robert A. Wagner. 1974. Order-n correction for regular languages. Commun. ACM 17, 5 (May 1974), 265-268. DOI=http://dx.doi.org/10.1145/360980.360995
  14. Saha, B. (2014-10-01). "The Dyck Language Edit Distance Problem in Near-Linear Time". 2014 IEEE 55th Annual Symposium on Foundations of Computer Science: 611–620. doi:10.1109/FOCS.2014.71.
This article is issued from Wikipedia - version of the 12/3/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.