Phonetic algorithm
Appearance
A phonetic algorithm is an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for English and are not useful for indexing words in other languages.[1] Because English spelling varies significantly depending on multiple factors, such as the word's origin and usage over time and borrowings from other languages, phonetic algorithms necessarily take into account numerous rules and exceptions.[2]
Algorithms
[edit]Among the best-known phonetic algorithms are:
- Soundex, which was developed to encode surnames for use in censuses. Soundex codes are four-character strings composed of a single letter followed by three numbers.
- Daitch–Mokotoff Soundex, which is a refinement of Soundex designed to better match surnames of Slavic and Germanic origin. Daitch–Mokotoff Soundex codes are strings composed of six numeric digits.
- Cologne phonetics: This is similar to Soundex, but more suitable for German words.
- Metaphone and Double Metaphone which are suitable for use with most English words, not just names. Metaphone algorithms are the basis for many popular spell checkers.
- New York State Identification and Intelligence System (NYSIIS), which maps similar phonemes to the same letter. The result is a string that can be pronounced by the reader without decoding.
- Match Rating Approach developed by Western Airlines in 1977 - this algorithm has an encoding and range comparison technique.
- Caverphone, created to assist in data matching between late 19th century and early 20th century electoral rolls, optimized for accents present in parts of New Zealand.
Common uses
[edit]- Spell checkers can often contain phonetic algorithms. The Metaphone algorithm, for example, can take an incorrectly spelled word and create a code. The code is then looked up in directory for words with the same or similar Metaphone. Words that have the same or similar Metaphone become possible alternative spellings.
- Search functionality will often use phonetic algorithms to find results that don't match exactly the term(s) used in the search. Searching for names can be difficult as there are often multiple alternative spellings for names. An example is the name Claire. It has two alternatives, Clare/Clair, which are both pronounced the same. Searching for one spelling wouldn't show results for the two others. Using Soundex all three variations produce the same Soundex code, C460. By searching names based on the Soundex code all three variations will be returned.
- Data deduplication efforts use phonetic algorithms to easily bucket records into groups of similar sounding names for further evaluation.
- Speech to text modules use phonetic encoding to find the set of dictionary words that are pronounced similarly to the phonemes output by the processed audio signal.
See also
[edit]References
[edit]- ^ Li, Nan; Hitchcock, Peter; Blustein, James; Bliemel, Michael (2011). H. Raghav Rao; Raj Sharman; T. S. Raghu (eds.). Exploring the grand challenges for next generation E-Business : 8th Workshop on E-Business, WEB 2009, Phoenix, AZ, USA, December 15, 2009, Revised selected papers. Berlin: Springer. p. 232. ISBN 9783642174483. Retrieved 31 December 2020.
- ^ Cohen, Eli B. (2009). Growing Information: Part 2. Santa Rosa, Calif.: Informing Science. p. 498. ISBN 978-1-932886-17-7.
- This article incorporates public domain material from Paul E. Black. "phonetic coding". Dictionary of Algorithms and Data Structures. NIST.
External links
[edit]- Algorithm for converting words to phonemes and back.
- StringMetric project a Scala library of phonetic algorithms.
- clj-fuzzy project a Clojure library of phonetic algorithms.
- SoundexBR library of phonetic algorithm implemented in R.
- Talisman a JavaScript library collecting various phonetic algorithms that one can try online.