Browsing the archives for the functional tag.

Soundex

work safe

Let’s dip our toes in to computational English by exploring the classic algorithm, Soundex. Soundex was created in 1918 by Robert Russel and Margaret Odel King, but most people know it from work by Donald Knuth.

Soundex indexes words based on their ’sound’. It is one way to checking spelling based on similar sounds, so we should get similar scores for ‘been’ and ‘bean’ and, of course, ‘their’, ‘there’, and ‘they’re’.

Is it useful?

Probably not. There are newer, better phonetic algorithms like Metaphone and modern spell checkers are interested in keyboard distances between words. So it’s just for funsies. And that seems like a good enough reason for me!

Understanding the Algorithm

There is an interesting part of the wikipedia article where the editors re-order the steps to simplify some of the duplication detection. Great! Let’s use THAT version, and not necessarily the National Archives and Records Administration’s General Information Leaflet 55, “Using the Census Soundex”.

  1. Save the first letter, remove all occurrences of ‘h’ and ‘w’ except the first.
  2. Replace all consonants (including the first letter) with digits
  • bfpv 1
  • cgjkqsxz 2
  • dt 3
  • l 4
  • mn 5
  • r 6
  1. Replace all adjacent same digits with one digit
  2. Remove all vowels (including y), except the first letter
  3. If the first symbol is a digit, replace with the letter
  4. Pad with zeros or trim so it only has a letter and three digits

Some Intuition and Test Cases

Grate and great should be the same. Same and sim. In fact most homophones, like homafoen, should be the same.

The wikipedia article gives us some example values.

  • Robert ‘R163’
  • Rupert ‘R150’
  • Ashcroft ‘A261’
  • Tymczak ’T522’
  • Pfister ‘P236’

We can include a couple of hand picked cases, like ‘the’ and ‘a’ which are all zeros.

  • the ’T000’
  • a ‘A000’

The code for these examples is clear and readable.


module SoundexTests exposing (..)

import Expect
import Fuzz exposing (int, list, string, tuple)
import Soundex exposing (..)
import String
import Test exposing (..)


aboutSoundex : Test
aboutSoundex =
    describe "Soundex.soundex"
        [ describe "Pure Examples"
            [ test "Robert = R163" <|
                \() -> Expect.equal "R163" <| soundex "Robert"
            , test "Rupert = R163" <|
                \() -> Expect.equal "R163" <| soundex "Rupert"
            , test "Rubin = R150" <|
                \() -> Expect.equal "R150" <| soundex "Rubin"
            , test "Ashcroft = Ashcraft" <|
                \() -> Expect.equal (soundex "Ashcroft") (soundex "Ashcraft")
            , test "Ashcroft = A261" <|
                \() -> Expect.equal "A261" <| soundex "Ashcroft"
            , test "Tymczak = T522" <|
                \() -> Expect.equal "T522" <| soundex "Tymczak"
            , test "Pfister = P236" <|
                \() -> Expect.equal "P236" <| soundex "Pfister"
            , test "a = A000" <|
                \() -> Expect.equal "A000" <| soundex "a"
            , test "the = T000" <|
                \() -> Expect.equal "T000" <| soundex "the"
            ]
        , describe "Starbucks Names?"
            [ test "Brian & brain" <|
                \() -> Expect.equal (soundex "Brian") (soundex "brain")
            ]
        ]

But the Implementation!?

The first function pretty much describes the algorithm using the steps I listed above. I use types to help guide what comes next and focus on each piece as I can. I made the initial implementations of each step as dumb as possible, mostly passing from one stage to the next. As I filled out each step, I could see what was happening to my examples. I could see the error cases coalesce toward correctness. The output words show missing letters. Then numeric substitutions. And duplicates dropped. Then vowels go missing. Until it finally worked!

module Soundex exposing (Soundex, soundex)

import Regex


type alias Soundex =
    String


soundex : String -> Soundex
soundex word =
    ( String.toUpper <| String.left 1 word, String.toLower word )
        |> removeHandW
        |> consonantsToDigits
        |> removeAdjacentDigits
        |> removeVowelsAfterFirst
        |> setFirstCharacter
        |> paddOrTrimToThreeDigits


removeHandW : ( String, String ) -> ( String, String )
removeHandW =
    Tuple.mapSecond (Regex.replace Regex.All (Regex.regex "[hw]") (\_ -> ""))


consonantToDigit : Char -> Char
consonantToDigit a =
    if List.member a [ 'b', 'f', 'p', 'v' ] then
        '1'
    else if List.member a [ 'c', 'g', 'j', 'k', 'q', 's', 'x', 'z' ] then
        '2'
    else if List.member a [ 'd', 't' ] then
        '3'
    else if 'l' == a then
        '4'
    else if List.member a [ 'm', 'n' ] then
        '5'
    else if 'r' == a then
        '6'
    else
        a


consonantsToDigits : ( String, String ) -> ( String, String )
consonantsToDigits =
    Tuple.mapSecond (\body -> body |> String.toList |> List.map consonantToDigit |> String.fromList)


removeAdjacentDigits : ( String, String ) -> ( String, String )
removeAdjacentDigits =
    let
        rdcr : Char -> ( Char, List Char ) -> ( Char, List Char )
        rdcr current ( prev, acc ) =
            if prev /= current then
                ( current, current :: acc )
            else
                ( prev, acc )
    in
    Tuple.mapSecond
        (\body ->
            body
                |> String.toList
                |> List.foldr rdcr ( ' ', [] )
                |> Tuple.second
                |> String.fromList
        )


removeVowelsAfterFirst : ( String, String ) -> ( String, String )
removeVowelsAfterFirst =
    Tuple.mapSecond
        (Regex.replace Regex.All
            (Regex.regex "[aeiouy]")
            (\match ->
                if match.index == 0 then
                    match.match
                else
                    ""
            )
        )


setFirstCharacter : ( String, String ) -> String
setFirstCharacter ( lead, a ) =
    lead ++ String.dropLeft 1 a


paddOrTrimToThreeDigits : String -> String
paddOrTrimToThreeDigits a =
    a
        ++ "000"
        |> String.left 4

Reductio ad absurdum

Soundex is for matching things that ‘sound like’ other things. Like when my five year old mishears a Taylor Swift lyric to hilarious results. Or when the well-intentioned hard worker on the other side of the counter has to listen over the bustle of other customers at busy shop while standing next to HVAC and loud brewing equipment to hear you speak in your best indoor voice and has to make a best guess as to what your name might be.

Tobias has a 'Starbucks Name' of Tubes

My “Starbucks Name” Generator in action.

I’ve written a simple web toy that uses an open-sourced word list to attempt to find an English word that sounds like your name. Just like a barista might mishear or simplify your name.

  1. Generate a Soundex hash of all the words
  2. Generate the Soundex of the name
  3. Look up an array of matched words and pick a random one
  4. If misses, return the name because you have an unmistakable unique name.

For fun I added a list of US census first names to search for likely misses. The majority of which (6 out of 10) are spelling variants of ‘Yolanda

Give my Starbucks Name Generator a try, but you need to keep a few things in mind. It takes a while (10 – 20 seconds) to load all the words in to a Soundex lookup before you see anything. I could do this preloading in a web worker, but that would take me longer because the Elm doesn’t yet have a robust backgrounding story and the easy path would be to rewrite the algorithm in JavaScript, and that would defeat the point of using Elm. That said, after the loading, it’s pretty fast! Also, feel free to check out the elm-soundex code on GitHub.

#blog/draft

No Comments

Unsoliceted Code Review – Haskell Chat Server Edition

work safe

Our most recent Sep book club has been The Passionate Programmer. One of the things mentioned was taking some third party code and reviewing it.

This is not a new idea. Many people has said the best way to understand the code you write is to understand the good and bad of code written by other people. In addition, reading code in a language you are learning exposes you to functions you might not understand as well as idioms you don’t understand. We had an idea of meeting occasionally over lunch and tackling some code. Here’s kinda what I’m thinking that would look like. Please comment and let me know if there’s something else I’m missing.

With that in mind, I found an article on the Haskell Wiki where they implement a chat server in Haskell.

I’m only posting the code review for the final version. Technically, the final version and my changes to make it compile. There are more about that in the comments. I added my initials, BJB, to the comments I made and if I had to look up an API, I provided a link to where I learned about it.

This chat server has a couple of interesting features. It uses lightweight threading to handle multiple incoming and outgoing IO. It uses channels to communicate across these threads. The forking calls are also used to create separate “loops of control” for each IO channel being handled.

Below are the comments I’ve added to the program.

Continue Reading »

No Comments

Just passing through

work safe

Continuation Passing Style is a functional pattern where all work that can’t be passed into a tail recursive function call is passed in as an additional argument via a closure or anonymous function, delaying execution.

What got me confused for years on this was the explanation, almost like a mantra, “A continuation is like a callback.” I don’t use callbacks much and I believe that phrase is actually expressing a different pattern, simple first-class functions. So what do I think a continuation is? It is a promise to get to it later, after a tail call. I do agree with the literature that it turns the function inside out. And Don Syme’s reasoning that you’re trading heap to avoid spending stack.

I know you hate fibonacci, but it’s simple to explain. Here it is in all it’s glory.

You note, however, that since the results of two recursive calls need to be stored, we can’t dump the stackframe and tail optimize the call, right? We could pass around two accumulators, essentially mimicking the imperative for-loop solution to the problem

This works because the Fibonacci sequence is just that, a sequence. It can also be represented by a tree, and that’s what’s causing it to be a problem for recursion in the first place. We are still left with, “How do I process a tree without risking a blown stack?” That’s what Continuation Passing hopes to solve by using a continuation (read: anonymous function) that wraps our core functionality in a heap object so we don’t call down one leve. I’m not making this clear, so I’ll show this in pictures.

Continutation Passing Style, step one.

Step One: We're abstracting the problem to it's simplest form

Step Two: We're making a promise to do something later

Step Two: We're making a promise to do something later

Step Three: We're expanding the first abstraction

Step Three: We're expanding the first abstraction

Step Four: We're expanding the second abstraction

Step Four: We're expanding the second abstraction

Step Five: Fill in the base-cases for both our promise, and the function.

And now for the punchline. Why do I care? This goes back to the spreadsheet challenge I talked about earlier. I was evaluating my abstract tree and noticed a lot of code that looked like: | Sum(a,b) -> eval a + eval b and I knew it wasn’t going to scale past my toy problem. I had a few tests, so I branched it with Git and tried replacing it with a tail recursive version. I know a spreadsheet doesn’t need that, but it really didn’t add any complexity once I figured out what it was actually doing. Just thought I’d share.

3 Comments