Kata: Anagram
A few weeks ago, with the Software Crafters Lyon, we tackled the Anagram kata.
The idea is rather simple: given a word and a word list, find the pairs of words from the list whose concatenation has the same letter frequencies as the input word (the same number of occurrences of each character).
For example, with the 16k-word list, "watermelon" gives: "lemon" + "water", "melon" + "water", "metal" + "owner".
The simplest way to check whether two strings are anagrams is to sort them and compare for equality; we will use this approach to normalize words.
The first tactic to solve this kata is a cartesian product: list all the possible pairs (eliminating duplicates), and keep only the pairs whose concatenation, once normalized, equals the normalized form of the lookup word.
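As a quick, self-contained illustration of that sorting trick (the `isAnagram` name is introduced here for the example, it is not part of the kata code):

```haskell
import Data.List (sort)

-- Two words are anagrams exactly when their sorted (normalized) forms are equal.
isAnagram :: String -> String -> Bool
isAnagram a b = sort a == sort b

main :: IO ()
main = print (isAnagram "listen" "silent")
-- True
```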
It looks like the following snippet:
anagrams0 ws x = [(w0, w1) | w0 <- ws, w1 <- ws, w0 < w1, xn == normalize (w0 <> w1)]
  where
    normalize = sort
    xn = normalize x
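A quick sanity check on a three-word list (the function is restated so the snippet runs standalone):

```haskell
import Data.List (sort)

-- anagrams0 from the article, restated for a standalone example.
anagrams0 :: [String] -> String -> [(String, String)]
anagrams0 ws x = [(w0, w1) | w0 <- ws, w1 <- ws, w0 < w1, xn == normalize (w0 <> w1)]
  where
    normalize = sort
    xn = normalize x

main :: IO ()
main = print (anagrams0 ["lemon", "water", "melon"] "watermelon")
-- [("lemon","water"),("melon","water")]
```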
We can also create a benchmark, as follows:
main = do
  wordsList <- lines <$> readFile "test/AnagramsWords.txt"
  C.defaultMain
    [ C.bgroup
        "Anagrams"
        [ C.bench "Naive" $
            C.nf (anagrams0 wordsList) "watermelon"
        ]
    ]
Which gives the following results:
benchmarking Anagrams/Naive
time 1.151 s (1.092 s .. 1.184 s)
1.000 R² (0.999 R² .. 1.000 R²)
mean 1.147 s (1.136 s .. 1.156 s)
std dev 11.47 ms (5.565 ms .. 15.80 ms)
variance introduced by outliers: 19% (moderately inflated)
A bit more than one second for a 16k-word list; the English dictionary has more than 700k words and the French one more than 400k.
We can iterate from here.
The computationally intensive part of the algorithm is going through the list of words over and over. We can cache the work in a dedicated structure that pre-computes each pair and stores it in a map, so the actual algorithm becomes a simple map lookup.
It looks like the following snippet:
newtype NormalizedWord
  = NormalizedWord {unNormalizedWord :: String}
  deriving newtype (Eq, Ord, Show, NFData)

instance Semigroup NormalizedWord where
  NormalizedWord x <> NormalizedWord y = mkNormalizedWord $ x <> y

mkNormalizedWord = NormalizedWord . sort

type AssociatedWords = Map.Map NormalizedWord [(String, String)]

mkAssociatedWords ws =
  Map.map sort $
    Map.fromListWith (<>) [(mkNormalizedWord (w0 <> w1), [(w0, w1)]) | w0 <- ws, w1 <- ws, w0 < w1]

anagrams1 assocs x = Map.findWithDefault [] (mkNormalizedWord x) assocs
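To see the two phases in action, here is a simplified, self-contained sketch of the same idea: plain sorted Strings stand in for the NormalizedWord newtype, and mkAssociatedWords' is a name used only in this example.

```haskell
import Data.List (sort)
import qualified Data.Map.Strict as Map

-- Pre-compute every pair, keyed by the normalized form of the concatenation.
mkAssociatedWords' :: [String] -> Map.Map String [(String, String)]
mkAssociatedWords' ws =
  Map.map sort $
    Map.fromListWith (<>) [(sort (w0 <> w1), [(w0, w1)]) | w0 <- ws, w1 <- ws, w0 < w1]

main :: IO ()
main =
  -- The "run" is then a single map lookup on the normalized input word.
  print (Map.findWithDefault [] (sort "watermelon") (mkAssociatedWords' ["lemon", "water", "melon"]))
-- [("lemon","water"),("melon","water")]
```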
We can improve our benchmark suite, as follows:
C.bgroup
  "Cached associations"
  [ C.bench "Running" $
      let assocs = mkAssociatedWords wordsList
       in C.nf (anagrams1 assocs) "watermelon",
    C.bench "Building cache" $
      C.nf mkAssociatedWords wordsList
  ]
Which gives the following results:
benchmarking Anagrams/Cached associations/Running
time 1.816 μs (1.800 μs .. 1.832 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 1.794 μs (1.784 μs .. 1.806 μs)
std dev 39.58 ns (29.77 ns .. 54.09 ns)
variance introduced by outliers: 26% (moderately inflated)
benchmarking Anagrams/Cached associations/Building cache
time 10.61 s (9.494 s .. 11.40 s)
0.999 R² (0.996 R² .. 1.000 R²)
mean 10.51 s (10.29 s .. 10.70 s)
std dev 237.4 ms (97.98 ms .. 297.6 ms)
variance introduced by outliers: 19% (moderately inflated)
2 µs for the "run" part is no big surprise, but more than 10 seconds to build the cache is really long; given the previous benchmark (1.151 s per naive run, and 10.61 s / 1.151 s ≈ 9.2), we would need about ten lookups just to amortize the cost of building the cache.
We need to reduce the cost of cache creation. To do so, we can store only the grouping of single words by normalized form, which has a complexity of O(w) instead of O(w²), w being the number of words.
The next step is to rework the "run" part: we have to somehow generate candidate pairs of words. The first thing is to split the input into sub-words, which can be done with subsequences, which also preserves the letter order, as follows:
subsequences "hat"
-- ["","h","a","ha","t","ht","at","hat"]
Note that we only keep the words with at least two letters, and not the full word, to limit the number of candidates. The complexity is O(2^l), l being the length of the input word; that is a big complexity, but l is small enough.
Then, we should compute the "complement word": the letters of the input word that are not used by the sub-word, which can be done with (\\) (list difference).
It looks like the following snippet:
subWords (NormalizedWord ws) =
  [ NormalizedWord ws'
  | let ol = length ws,
    ws' <- subsequences ws,
    let nl = length ws',
    nl >= 2 && nl < ol
  ]

complementWord (NormalizedWord fs) (NormalizedWord ss) = NormalizedWord (fs \\ ss)

type IndexedWords = Map.Map NormalizedWord [String]

mkIndexedWords ws =
  Map.fromListWith (<>) [(mkNormalizedWord w, [w]) | w <- ws]

anagrams2 ws x =
  Set.toList $
    Set.fromList $
      [ (w0, w1)
      | let nx = mkNormalizedWord x,
        n0 <- subWords nx,
        let n1 = complementWord nx n0,
        w0 <- Map.findWithDefault [] n0 ws,
        w1 <- Map.findWithDefault [] n1 ws,
        w0 < w1
      ]
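To make the complement concrete: removing the letters of "melon" from the normalized form of "watermelon" leaves exactly the letters of "water".

```haskell
import Data.List (sort, (\\))

-- (\\) removes one occurrence per matching letter, so on sorted (normalized)
-- forms it yields the letters the sub-word does not consume.
main :: IO ()
main = do
  print (sort "watermelon" \\ sort "melon")  -- "aertw"
  print (sort "water")                       -- "aertw" as well
```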
We can also create a benchmark, as follows:
C.bgroup
  "Cached indexed"
  [ C.bench "Running" $
      let ws = mkIndexedWords wordsList
       in C.nf (anagrams2 ws) "watermelon",
    C.bench "Building cache" $
      C.nf mkIndexedWords wordsList
  ]
Which gives the following results:
benchmarking Anagrams/Cached indexed/Running
time 1.138 ms (1.131 ms .. 1.145 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.136 ms (1.132 ms .. 1.143 ms)
std dev 18.83 μs (11.17 μs .. 27.88 μs)
benchmarking Anagrams/Cached indexed/Building cache
time 2.000 ms (1.985 ms .. 2.022 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 2.014 ms (1.999 ms .. 2.035 ms)
std dev 57.28 μs (34.83 μs .. 86.47 μs)
variance introduced by outliers: 15% (moderately inflated)
1 ms to run, 2 ms to build the cache, not bad.
My favorite data structure is the prefix tree (trie): a tree in which each path from the root represents a word.
When used properly, this data structure mirrors the hierarchical structure of a problem, which often leads to huge performance gains.
It can be defined as follows:
data NormalizedTrieWords = Node
  { values :: [String],
    children :: Map.Map Char NormalizedTrieWords
  }
  deriving stock (Generic)
  deriving anyclass (NFData)

mkNormalizedTrieWords ws =
  buildNode [((mkNormalizedWord w).unNormalizedWord, w) | w <- ws]
  where
    buildNode nws =
      let leaves :: [String]
          (leaves, sub) =
            partitionEithers $ (\(s, w) -> maybe (Left w) (\(k, rs) -> Right (k, [(rs, w)])) s) . first uncons <$> nws
       in Node leaves $ buildNode <$> Map.fromListWith (<>) sub
The next step is to rework our algorithm: for each letter, we recursively check the matching child with the remaining letters.
To do so, we have to accumulate the letters we have "skipped"; this is the equivalent of computing the sub-word and the complement at the same time, which can be done by zipping inits and tails, as follows:
let ws = "hat"
zip (inits ws) (tails ws)
-- [("","hat"),("h","at"),("ha","t"),("hat","")]
Combining everything gives the following snippet:
anagrams3 ws x = Set.toList $ Set.fromList $ go ws [] (mkNormalizedWord x).unNormalizedWord
  where
    go ts acc =
      \case
        [] -> []
        ps ->
          let complements = trieLookup ws $ reverse acc <> ps
              currentResults =
                [(w, c) | w <- ts.values, c <- complements, w < c]
           in currentResults
                <> [ rs
                   | (ds, (e : es)) <- zip (inits ps) (tails ps),
                     child <- maybeToList $ Map.lookup e ts.children,
                     rs <- go child (reverse ds <> acc) es
                   ]
    trieLookup ts =
      \case
        [] -> ts.values
        (p : ps) -> maybe [] (\ss -> trieLookup ss ps) (Map.lookup p ts.children)
Note that we look up the complement in the prefix tree itself, which is a bit less efficient than a Map lookup.
Which gives the following benchmark:
benchmarking Anagrams/Cached Trie/Cached Trie
time 583.4 μs (578.4 μs .. 587.7 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 576.1 μs (574.1 μs .. 578.6 μs)
std dev 7.605 μs (5.861 μs .. 9.433 μs)
benchmarking Anagrams/Cached Trie/Building cache
time 13.09 ms (12.69 ms .. 13.88 ms)
0.983 R² (0.953 R² .. 1.000 R²)
mean 12.90 ms (12.74 ms .. 13.48 ms)
std dev 722.5 μs (164.7 μs .. 1.436 ms)
variance introduced by outliers: 25% (moderately inflated)
Run time has been cut in half, but the build time jumped from 2 ms to 13 ms.
Let's use the indexed words for the lookup, as follows:
anagrams4 ws is x = Set.toList $ Set.fromList $ go ws [] (mkNormalizedWord x).unNormalizedWord
  where
    go ts acc =
      \case
        [] -> []
        ps ->
          let complements = Map.findWithDefault [] (NormalizedWord $ reverse acc <> ps) is
              currentResults =
                [(w, c) | w <- ts.values, c <- complements, w < c]
           in currentResults
                <> [ rs
                   | (ds, (e : es)) <- zip (inits ps) (tails ps),
                     child <- maybeToList $ Map.lookup e ts.children,
                     rs <- go child (reverse ds <> acc) es
                   ]
Which gives the following benchmark:
benchmarking Anagrams/Cached Trie + indexed
time 538.7 μs (534.4 μs .. 546.8 μs)
0.997 R² (0.992 R² .. 1.000 R²)
mean 543.4 μs (537.4 μs .. 559.6 μs)
std dev 28.08 μs (12.04 μs .. 51.11 μs)
variance introduced by outliers: 45% (moderately inflated)
In summary, we get the following performances (total time, cache building included, for a given number of lookups):
| Approach | Build | Run only | 1 run | 15 runs | 20 runs | 100 runs | 1,000 runs | 20,000 runs |
|---|---|---|---|---|---|---|---|---|
| Naive | 0 | 1.151 s | 1.2 s | 17.3 s | 23 s | 115 s | 19 m 11 s | 6 h 24 m |
| Cached associations | 10.61 s | 1.816 µs | 10.61 s | 10.61 s | 10.61 s | 10.61 s | 10.61 s | 10.65 s |
| Cached indexed | 2 ms | 1.138 ms | 3.138 ms | 19.07 ms | 24.76 ms | 115.8 ms | 1.14 s | 22.8 s |
| Cached Trie | 13.09 ms | 583.4 µs | 13.67 ms | 21.84 ms | 24.76 ms | 71.43 ms | 596.49 ms | 11.68 s |
| Cached Trie + indexed | 15.09 ms | 538.7 µs | 15.63 ms | 23.17 ms | 25.86 ms | 68.96 ms | 553.79 ms | 10.79 s |
Picking an implementation will depend on the workload, but also on the maintenance costs.