Let’s dip our toes in to computational English by exploring the classic algorithm, Soundex. Soundex was created in 1918 by Robert Russel and Margaret Odel King, but most people know it from work by Donald Knuth.
Soundex indexes words based on their ’sound’. It is one way to checking spelling based on similar sounds, so we should get similar scores for ‘been’ and ‘bean’ and, of course, ‘their’, ‘there’, and ‘they’re’.
Is it useful?
Probably not. There are newer, better phonetic algorithms like Metaphone and modern spell checkers are interested in keyboard distances between words. So it’s just for funsies. And that seems like a good enough reason for me!
Understanding the Algorithm
There is an interesting part of the wikipedia article where the editors re-order the steps to simplify some of the duplication detection. Great! Let’s use THAT version, and not necessarily the National Archives and Records Administration’s General Information Leaflet 55, “Using the Census Soundex”.
- Save the first letter, remove all occurrences of ‘h’ and ‘w’ except the first.
- Replace all consonants (including the first letter) with digits
- bfpv 1
- cgjkqsxz 2
- dt 3
- l 4
- mn 5
- r 6
- Replace all adjacent same digits with one digit
- Remove all vowels (including y), except the first letter
- If the first symbol is a digit, replace with the letter
- Pad with zeros or trim so it only has a letter and three digits
Some Intuition and Test Cases
Grate and great should be the same. Same and sim. In fact most homophones, like homafoen, should be the same.
The wikipedia article gives us some example values.
- Robert ‘R163’
- Rupert ‘R150’
- Ashcroft ‘A261’
- Tymczak ’T522’
- Pfister ‘P236’
We can include a couple of hand picked cases, like ‘the’ and ‘a’ which are all zeros.
- the ’T000’
- a ‘A000’
The code for these examples is clear and readable.
module SoundexTests exposing (..)
import Expect
import Fuzz exposing (int, list, string, tuple)
import Soundex exposing (..)
import String
import Test exposing (..)
aboutSoundex : Test
aboutSoundex =
describe "Soundex.soundex"
[ describe "Pure Examples"
[ test "Robert = R163" <|
\() -> Expect.equal "R163" <| soundex "Robert"
, test "Rupert = R163" <|
\() -> Expect.equal "R163" <| soundex "Rupert"
, test "Rubin = R150" <|
\() -> Expect.equal "R150" <| soundex "Rubin"
, test "Ashcroft = Ashcraft" <|
\() -> Expect.equal (soundex "Ashcroft") (soundex "Ashcraft")
, test "Ashcroft = A261" <|
\() -> Expect.equal "A261" <| soundex "Ashcroft"
, test "Tymczak = T522" <|
\() -> Expect.equal "T522" <| soundex "Tymczak"
, test "Pfister = P236" <|
\() -> Expect.equal "P236" <| soundex "Pfister"
, test "a = A000" <|
\() -> Expect.equal "A000" <| soundex "a"
, test "the = T000" <|
\() -> Expect.equal "T000" <| soundex "the"
]
, describe "Starbucks Names?"
[ test "Brian & brain" <|
\() -> Expect.equal (soundex "Brian") (soundex "brain")
]
]
But the Implementation!?
The first function pretty much describes the algorithm using the steps I listed above. I use types to help guide what comes next and focus on each piece as I can. I made the initial implementations of each step as dumb as possible, mostly passing from one stage to the next. As I filled out each step, I could see what was happening to my examples. I could see the error cases coalesce toward correctness. The output words show missing letters. Then numeric substitutions. And duplicates dropped. Then vowels go missing. Until it finally worked!
module Soundex exposing (Soundex, soundex)
import Regex
type alias Soundex =
String
soundex : String -> Soundex
soundex word =
( String.toUpper <| String.left 1 word, String.toLower word )
|> removeHandW
|> consonantsToDigits
|> removeAdjacentDigits
|> removeVowelsAfterFirst
|> setFirstCharacter
|> paddOrTrimToThreeDigits
removeHandW : ( String, String ) -> ( String, String )
removeHandW =
Tuple.mapSecond (Regex.replace Regex.All (Regex.regex "[hw]") (\_ -> ""))
consonantToDigit : Char -> Char
consonantToDigit a =
if List.member a [ 'b', 'f', 'p', 'v' ] then
'1'
else if List.member a [ 'c', 'g', 'j', 'k', 'q', 's', 'x', 'z' ] then
'2'
else if List.member a [ 'd', 't' ] then
'3'
else if 'l' == a then
'4'
else if List.member a [ 'm', 'n' ] then
'5'
else if 'r' == a then
'6'
else
a
consonantsToDigits : ( String, String ) -> ( String, String )
consonantsToDigits =
Tuple.mapSecond (\body -> body |> String.toList |> List.map consonantToDigit |> String.fromList)
removeAdjacentDigits : ( String, String ) -> ( String, String )
removeAdjacentDigits =
let
rdcr : Char -> ( Char, List Char ) -> ( Char, List Char )
rdcr current ( prev, acc ) =
if prev /= current then
( current, current :: acc )
else
( prev, acc )
in
Tuple.mapSecond
(\body ->
body
|> String.toList
|> List.foldr rdcr ( ' ', [] )
|> Tuple.second
|> String.fromList
)
removeVowelsAfterFirst : ( String, String ) -> ( String, String )
removeVowelsAfterFirst =
Tuple.mapSecond
(Regex.replace Regex.All
(Regex.regex "[aeiouy]")
(\match ->
if match.index == 0 then
match.match
else
""
)
)
setFirstCharacter : ( String, String ) -> String
setFirstCharacter ( lead, a ) =
lead ++ String.dropLeft 1 a
paddOrTrimToThreeDigits : String -> String
paddOrTrimToThreeDigits a =
a
++ "000"
|> String.left 4
Reductio ad absurdum
Soundex is for matching things that ‘sound like’ other things. Like when my five year old mishears a Taylor Swift lyric to hilarious results. Or when the well-intentioned hard worker on the other side of the counter has to listen over the bustle of other customers at busy shop while standing next to HVAC and loud brewing equipment to hear you speak in your best indoor voice and has to make a best guess as to what your name might be.
I’ve written a simple web toy that uses an open-sourced word list to attempt to find an English word that sounds like your name. Just like a barista might mishear or simplify your name.
- Generate a Soundex hash of all the words
- Generate the Soundex of the name
- Look up an array of matched words and pick a random one
- If misses, return the name because you have an unmistakable unique name.
For fun I added a list of US census first names to search for likely misses. The majority of which (6 out of 10) are spelling variants of ‘Yolanda’
Give my Starbucks Name Generator a try, but you need to keep a few things in mind. It takes a while (10 – 20 seconds) to load all the words in to a Soundex lookup before you see anything. I could do this preloading in a web worker, but that would take me longer because the Elm doesn’t yet have a robust backgrounding story and the easy path would be to rewrite the algorithm in JavaScript, and that would defeat the point of using Elm. That said, after the loading, it’s pretty fast! Also, feel free to check out the elm-soundex code on GitHub.
#blog/draft