Kata: Anagram

A few weeks ago, with the Software Crafters Lyon, we tackled the Anagram kata.

The idea is rather simple: given a word and a word list, find the pairs of words in the list whose concatenation has the same letter frequencies as the input word (the same number of occurrences of each character).

For example, with the 16k-word list, "watermelon" gives: "lemon" + "water", "melon" + "water", and "metal" + "owner".

The simplest way to check whether two strings are anagrams is to sort them and compare for equality; we will use this approach to normalize words.
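For instance, with a hypothetical isAnagram helper (not part of the kata code, just to illustrate the sort-and-compare check):

```haskell
import Data.List (sort)

-- Two strings are anagrams exactly when their sorted forms are equal.
isAnagram :: String -> String -> Bool
isAnagram a b = sort a == sort b

-- isAnagram "listen" "silent" evaluates to True
-- isAnagram "listen" "little" evaluates to False
```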

The first tactic to solve this kata is a cartesian product: list all the possible pairs (eliminating duplicates) and keep only the ones whose concatenation, once normalized, is equal to the normalized form of the lookup word.

It looks like the following snippet:

anagrams0 :: [String] -> String -> [(String, String)]
anagrams0 ws x = [(w0, w1) | w0 <- ws, w1 <- ws, w0 < w1, xn == normalize (w0 <> w1)]
  where
    normalize = sort
    xn = normalize x

We can also create a benchmark, as follows:

main :: IO ()
main = do
  wordsList <- lines <$> readFile "test/AnagramsWords.txt"
  C.defaultMain
    [ C.bgroup
        "Anagrams"
        [ C.bench "Naive" $
            C.nf (anagrams0 wordsList) "watermelon"
        ]
    ]

Which gives the following results:

benchmarking Anagrams/Naive
time                 1.151 s    (1.092 s .. 1.184 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.147 s    (1.136 s .. 1.156 s)
std dev              11.47 ms   (5.565 ms .. 15.80 ms)
variance introduced by outliers: 19% (moderately inflated)

A bit more than one second for a 16k word list; the English dictionary has more than 700k words and the French dictionary more than 400k.

We can iterate from here.

The computationally intensive part of the algorithm is going through the list of words over and over. We can cache that work in a dedicated structure which pre-computes every pair and stores it in a map, so that the actual algorithm becomes a simple map lookup.

It looks like the following snippet:

newtype NormalizedWord
  = NormalizedWord {unNormalizedWord :: String}
  deriving newtype (Eq, Ord, Show, NFData)

instance Semigroup NormalizedWord where
  NormalizedWord x <> NormalizedWord y = mkNormalizedWord $ x <> y

mkNormalizedWord :: String -> NormalizedWord
mkNormalizedWord = NormalizedWord . sort

type AssociatedWords = Map.Map NormalizedWord [(String, String)]

mkAssociatedWords :: [String] -> AssociatedWords
mkAssociatedWords ws =
  Map.map sort $
    Map.fromListWith (<>) [(mkNormalizedWord (w0 <> w1), [(w0, w1)]) | w0 <- ws, w1 <- ws, w0 < w1]

anagrams1 :: AssociatedWords -> String -> [(String, String)]
anagrams1 assocs x = Map.findWithDefault [] (mkNormalizedWord x) assocs

We can extend our benchmark suite, as follows:

C.bgroup
  "Cached associations"
  [ C.bench "Running" $
      let assocs = mkAssociatedWords wordsList
       in C.nf (anagrams1 assocs) "watermelon",
    C.bench "Building cache" $
      C.nf mkAssociatedWords wordsList
  ]

Which gives the following results:

benchmarking Anagrams/Cached associations/Running
time                 1.816 μs   (1.800 μs .. 1.832 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.794 μs   (1.784 μs .. 1.806 μs)
std dev              39.58 ns   (29.77 ns .. 54.09 ns)
variance introduced by outliers: 26% (moderately inflated)

benchmarking Anagrams/Cached associations/Building cache
time                 10.61 s    (9.494 s .. 11.40 s)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 10.51 s    (10.29 s .. 10.70 s)
std dev              237.4 ms   (97.98 ms .. 297.6 ms)
variance introduced by outliers: 19% (moderately inflated)

2 µs for the "run" part is no big surprise, but more than 10 seconds to build the cache is really long: given the previous benchmark, we would need about ten lookups before the cache pays for itself.

We need to reduce the cost of cache creation. To do so, we can store only the grouping of single words by normalized form, which is O(w) instead of O(w²), w being the number of words.
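The idea can be sketched as follows (index is a hypothetical name; mkIndexedWords later in the article does the same thing with the NormalizedWord newtype):

```haskell
import Data.List (sort)
import qualified Data.Map.Strict as Map

-- Each word is inserted once, keyed by its sorted (normalized) form:
-- w map insertions instead of w² pair insertions.
index :: [String] -> Map.Map String [String]
index ws = Map.fromListWith (<>) [(sort w, [w]) | w <- ws]

-- index ["melon", "lemon", "water"] groups "lemon" and "melon"
-- under "elmno", and "water" under "aertw".
```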

The next step is to rework the "run" part: we have to somehow generate pairs of words to test. The first thing is to split the input into many sub-words, which can be done with subsequences (which also preserves the order of the letters), as follows:

subsequences "hat"
-- ["","h","a","ha","t","ht","at","hat"]

Note that we only keep the sub-words with at least two letters, and not the full word, to limit the number of candidates to check. The complexity is O(2^l), l being the length of the input word; that is a big complexity class, but l is small enough in practice.

Then, we compute the "complement word", which is the list of the input word's letters that are missing from the sub-word; this can be done with (\\) (list difference).

It looks like the following snippet:

subWords :: NormalizedWord -> [NormalizedWord]
subWords (NormalizedWord ws) =
  [ NormalizedWord ws'
    | let ol = length ws,
      ws' <- subsequences ws,
      let nl = length ws',
      nl >= 2 && nl < ol
  ]

complementWord :: NormalizedWord -> NormalizedWord -> NormalizedWord
complementWord (NormalizedWord fs) (NormalizedWord ss) = NormalizedWord (fs \\ ss)

type IndexedWords = Map.Map NormalizedWord [String]

mkIndexedWords :: [String] -> IndexedWords
mkIndexedWords ws =
  Map.fromListWith (<>) [(mkNormalizedWord w, [w]) | w <- ws]

anagrams2 :: IndexedWords -> String -> [(String, String)]
anagrams2 ws x =
  Set.toList $
    Set.fromList $
      [ (w0, w1)
        | let nx = mkNormalizedWord x,
          n0 <- subWords nx,
          let n1 = complementWord nx n0,
          w0 <- Map.findWithDefault [] n0 ws,
          w1 <- Map.findWithDefault [] n1 ws,
          w0 < w1
      ]
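To make the sub-word/complement step concrete, here is a standalone sketch on plain Strings (subWords' and complement' are hypothetical names mirroring the definitions above):

```haskell
import Data.List (sort, subsequences, (\\))

-- Sub-words of a normalized word: at least two letters,
-- and strictly shorter than the input.
subWords' :: String -> [String]
subWords' w = [s | s <- subsequences w, let n = length s, n >= 2, n < length w]

-- Letters of the input that the sub-word does not use.
complement' :: String -> String -> String
complement' = (\\)

-- subWords' (sort "hat") evaluates to ["ah","at","ht"]
-- complement' "aht" "ah" evaluates to "t"
```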

We can also create a benchmark, as follows:

C.bgroup
  "Cached indexed"
  [ C.bench "Running" $
      let ws = mkIndexedWords wordsList
       in C.nf (anagrams2 ws) "watermelon",
    C.bench "Building cache" $
      C.nf mkIndexedWords wordsList
  ]

Which gives the following results:

benchmarking Anagrams/Cached indexed/Running
time                 1.138 ms   (1.131 ms .. 1.145 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.136 ms   (1.132 ms .. 1.143 ms)
std dev              18.83 μs   (11.17 μs .. 27.88 μs)

benchmarking Anagrams/Cached indexed/Building cache
time                 2.000 ms   (1.985 ms .. 2.022 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 2.014 ms   (1.999 ms .. 2.035 ms)
std dev              57.28 μs   (34.83 μs .. 86.47 μs)
variance introduced by outliers: 15% (moderately inflated)

1 ms to run and 2 ms to build the cache: not bad.

My favorite data structure is the prefix tree (trie): a tree in which each path from the root spells out a word.

This data structure, when used properly, mirrors the hierarchical structure of a problem, which often leads to huge performance gains.

It can be defined as follows:

data NormalizedTrieWords = Node
  { values :: [String],
    children :: Map.Map Char NormalizedTrieWords
  }
  deriving stock (Generic)
  deriving anyclass (NFData)

mkNormalizedTrieWords :: [String] -> NormalizedTrieWords
mkNormalizedTrieWords ws =
  buildNode [((mkNormalizedWord w).unNormalizedWord, w) | w <- ws]
  where
    buildNode nws =
      let leaves :: [String]
          sub :: [(Char, [(String, String)])]
          (leaves, sub) =
            partitionEithers $ (\(s, w) -> maybe (Left w) (\(k, rs) -> Right (k, [(rs, w)])) s) . first uncons <$> nws
       in Node leaves $ buildNode <$> Map.fromListWith (<>) sub

The next step is to rework our algorithm: for each letter, we recursively check the matching child against the remaining letters.

To do so, we accumulate the letters we have "skipped"; this is the equivalent of computing the sub-word and its complement at the same time, which can be done by zipping inits and tails, as follows:

let ws = "hat"
zip (inits ws) (tails ws)
-- [("","hat"),("h","at"),("ha","t"),("hat","")]

Combining everything gives the following snippet:

anagrams3 :: NormalizedTrieWords -> String -> [(String, String)]
anagrams3 ws x = Set.toList $ Set.fromList $ go ws [] (mkNormalizedWord x).unNormalizedWord
  where
    go ts acc =
      \case
        [] -> []
        ps ->
          let complements = trieLookup ws $ reverse acc <> ps
              currentResults :: [(String, String)]
              currentResults =
                [(w, c) | w <- ts.values, c <- complements, w < c]
           in currentResults
                <> [ rs
                     | (ds, (e : es)) <- zip (inits ps) (tails ps),
                       child <- maybeToList $ Map.lookup e ts.children,
                       rs <- go child (reverse ds <> acc) es
                   ]
    trieLookup ts =
      \case
        [] -> ts.values
        (p : ps) -> maybe [] (\ss -> trieLookup ss ps) (Map.lookup p ts.children)

Note that we look up the complement in the prefix tree itself, which is a bit less efficient than a Map lookup.

Which gives the following benchmark:

benchmarking Anagrams/Cached Trie/Cached Trie
time                 583.4 μs   (578.4 μs .. 587.7 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 576.1 μs   (574.1 μs .. 578.6 μs)
std dev              7.605 μs   (5.861 μs .. 9.433 μs)

benchmarking Anagrams/Cached Trie/Building cache
time                 13.09 ms   (12.69 ms .. 13.88 ms)
                     0.983 R²   (0.953 R² .. 1.000 R²)
mean                 12.90 ms   (12.74 ms .. 13.48 ms)
std dev              722.5 μs   (164.7 μs .. 1.436 ms)
variance introduced by outliers: 25% (moderately inflated)

Run time has been cut in half, but the build time exploded (from 2 ms to 13 ms).

Let's use the indexed words for the lookup, as follows:

anagrams4 :: NormalizedTrieWords -> IndexedWords -> String -> [(String, String)]
anagrams4 ws is x = Set.toList $ Set.fromList $ go ws [] (mkNormalizedWord x).unNormalizedWord
  where
    go ts acc =
      \case
        [] -> []
        ps ->
          let complements = Map.findWithDefault [] (NormalizedWord $ reverse acc <> ps) is
              currentResults :: [(String, String)]
              currentResults =
                [(w, c) | w <- ts.values, c <- complements, w < c]
           in currentResults
                <> [ rs
                     | (ds, (e : es)) <- zip (inits ps) (tails ps),
                       child <- maybeToList $ Map.lookup e ts.children,
                       rs <- go child (reverse ds <> acc) es
                   ]

Which gives the following benchmark:

benchmarking Anagrams/Cached Trie + indexed
time                 538.7 μs   (534.4 μs .. 546.8 μs)
                     0.997 R²   (0.992 R² .. 1.000 R²)
mean                 543.4 μs   (537.4 μs .. 559.6 μs)
std dev              28.08 μs   (12.04 μs .. 51.11 μs)
variance introduced by outliers: 45% (moderately inflated)

In summary, we get the following performances (the numbered columns give the total time, cache building included, for that many lookups):

| Approach | Build | Run only | 1 | 15 | 20 | 100 | 1,000 | 20,000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive | 0 | 1.151 s | 1.2 s | 17.3 s | 23 s | 115 s | 19 m 11 s | 6 h 24 m |
| Cached associations | 10.61 s | 1.816 µs | 10.61 s | 10.61 s | 10.61 s | 10.61 s | 10.61 s | 10.65 s |
| Cached indexed | 2 ms | 1.138 ms | 3.138 ms | 19.07 ms | 24.76 ms | 115.8 ms | 1.14 s | 22.8 s |
| Cached Trie | 13.09 ms | 583.4 µs | 13.67 ms | 21.84 ms | 24.76 ms | 71.43 ms | 596.49 ms | 11.68 s |
| Cached Trie + indexed | 15.09 ms | 538.7 µs | 15.63 ms | 23.17 ms | 25.86 ms | 68.96 ms | 553.79 ms | 10.79 s |

Picking an implementation will depend on the workload, but also on the maintenance costs.