A software engineer website

Types for building and types for running

Gautier DI FOLCO March 06, 2024 [Software engineering] #haskell #design #engineering #type-driven design

Since I have introduced Type-Driven Development, I have applied it to business logic, to tests, to interfacing, now I want to distinguish types by usage stages.

A long time ago, I worked on a linter which was working on XML documents.

An XML document is a tree of named nodes:

newtype Tree = TreeNode [(String, Tree)]
  deriving stock (Eq, Show)

I had a Java program with were defining roughly 100 rules associated to node paths:

data LintSpec = LintSpec
  { path :: String,
    lintSpec :: NodeLint
  }
  deriving stock (Eq, Show)

runLint :: NodeLint -> Tree -> [LintResult]

The ruleset was collected in the codebase and applied one by one to hundreds of files:

lintNaive :: [LintSpec] -> Tree -> [LintResult]
lintNaive specs tree =
  [ result
    | spec <- specs,
      targetNode <- findNodes (splitPath spec.path) tree,
      result <- runLint spec.lintSpec targetNode
  ]
  where
    findNodes :: [String] -> Tree -> [Tree]
    findNodes ps' tree' =
      case (ps', tree') of
        ([], _) -> [tree']
        (p : ps, TreeNode ts) -> concatMap (findNodes ps . snd) $ filter ((== p) . fst) ts
    splitPath =
      unfoldr $
        \remaining ->
          case break (== '/') remaining of
            ("", _) -> Nothing
            (p, ps) -> Just (p, ps)

The result was really, really bad, it was taking 45 minutes.

The issue was caused by a lack of awareness of the search space, so, each lint path was taken independently, and the tree was visited entirely each time which gives a time complexity of O(number-of-rules * number-of-matching-node).

My hint was that most of the rules where operating on the same nodes, so we could traverse nodes only once.

I have picker prefix tree since we can view top-down exploration as a tree.

The idea was to, first define the built data types:

data LintRule = LintRule {subRules :: LintRules, lints :: [NodeLint]}
  deriving stock (Eq, Show)

newtype LintRules = LR {lintRules :: Map.Map String LintRule}
  deriving newtype (Eq, Show, Semigroup, Monoid)

Then to have a trivial way to build them from a list of lints:

buildRules :: [LintSpec] -> LintRules
buildRules = foldl' addNode mempty . map splitByPath
  where
    splitByPath spec = (splitPath spec.path, spec.lintSpec)
    splitPath =
      unfoldr $
        \remaining ->
          case break (== '/') remaining of
            ("", _) -> Nothing
            (p, ps) -> Just (p, ps)
    addNode :: LintRules -> ([String], NodeLint) -> LintRules
    addNode (LR rules) (path, lint) =
      LR $
        case path of
          [p] ->
            Map.alter
              (Just . maybe (LintRule mempty [lint]) (\rule -> rule {lints = lint : rule.lints}))
              p
              rules
          (p : ps) ->
            let addOn sn = addNode sn (ps, lint)
             in Map.alter
                  (Just . maybe (LintRule (addOn (LR mempty)) mempty) (\rule -> rule {subRules = addOn rule.subRules}))
                  p
                  rules

And finally to run it tree by tree:

lintFromRules :: LintRules -> Tree -> [LintResult]
lintFromRules rules tree@(TreeNode nodes) = concatMap runSubRules nodes
  where
    runSubRules :: (String, Tree) -> [LintResult]
    runSubRules (nodeName, node) =
      let subRule = Map.lookup nodeName rules.lintRules
       in concatMap (flip runLint tree) (maybe mempty (.lints) subRule)
            ++ lintFromRules (maybe mempty (.subRules) subRule) node

This code is interesting as it mostly consist of zipping the XML Tree and the prefix tree, ensuring only an only one traversal.

Moreover, building LintRules is only done once, which reduce the work to the minimum.

Which lead to a running time of 30 seconds, on one thread.

Sometime decoupling is just having two functions, some other time we need two types, task oriented.

Back to top