In this post we’ll explore how to do this with the help of Docker and Kubernetes. After the code has been changed and tested locally, the first thing to do is to create a Docker image and push it; once that image can be pulled by the production Kubernetes cluster, we’ll typically need to swap the Docker image of one of the pods in the service we need to hijack and hit a particular endpoint that will run the code.

Using `kubectl` we can isolate a `Pod` by updating its label

```
# Add a label to make it obvious that this pod is hijacked and remove the ones
# that match the selector in the corresponding replica set / deployment
kubectl label pods <pod-name> hijacked=true <label-to-rm>-
```

and update its Docker image

`kubectl set image pod/<pod-name> <container-name>=<docker-image> --record`

and wait for Kubernetes to update the `Pod`, which usually happens in a matter of seconds.

One short digression: it’s not uncommon to have multiple copies of the same code running at the same time, something Kubernetes handles with `ReplicaSet`s. Their job is to try their best to make sure that the specified number of `Pod`s are running at any given time. How do `ReplicaSet`s identify which `Pod`s they need to manage? They use labels, and this is how we picked the label that we needed to remove: we’re essentially telling the `ReplicaSet` to forget about that particular `Pod`. Does this mean that we might have caused a performance degradation in production? The answer is no, because the `ReplicaSet` will notice that a `Pod` suddenly “disappeared” and will create a new one.

Consumers in the outside world are agnostic to the fact that there are multiple replicas running at the same time: they send requests to a single endpoint, and a load balancer then picks which replica should handle each request. Kubernetes uses the `Service` abstraction to model components like load balancers. How does a `Service` identify the replicas it needs to send requests to? The answer is once again labels - typically the same labels used by the `ReplicaSet`. That means that by removing that label, we effectively prevented the `Pod` from handling requests coming from the outside world. This is good because we don’t necessarily want it to handle requests (remember, it’s running an ad-hoc modification of the backend code), but now you might ask: how can we send a request to the hijacked `Pod` if it cannot be reached?

Kubernetes to the rescue! Using port forwarding we can forward a port on our `localhost` to a port that the hijacked `Pod` exposes

`kubectl port-forward pod/<pod-name> <local-port>:<remote-port>`

and then hit that endpoint on our local machine to send a request to the `Pod`. For example we can execute `curl http://localhost:<local-port>/<fix-db>` to hit an endpoint that will run the code that fixes the problem.

Which remote port should you pick? The easiest way is to look at how the `Pod` is configured and find which port(s) it exposes

```
# Look for Containers.<container-name>.Ports
kubectl describe pod/<pod-name>
```

Once we have checked that the fix has been successfully applied, we won’t need the hijacked `Pod` anymore and we can simply remove it

`kubectl delete pod -l hijacked=true`

Hijacking a production environment for good was first suggested to me by my former colleague Daniel Gorin, and it has turned out to be extremely handy for fixing one-off bugs or reconciling state in the database. That said, it should be used sparingly: it is tempting to use it to bypass the regular deployment process, which is potentially a sign that the deployment process is too slow or unreliable.

`AND`, which is the `(&&)` function in Haskell. In an effectful context, we can lift `(&&)` and use it without having to rewrite it, using combinators like `liftA2` or `liftM2`. Let’s take some simple example usages:
```
import Data.Maybe (fromMaybe)
ƛ: liftA2 (&&) (pure False) (const True <$> fromMaybe undefined Nothing)
*** Exception: Prelude.undefined
CallStack (from HasCallStack):
error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
undefined, called at <interactive>:10:53 in interactive:Ghci3
```

What’s going on? Let’s apply the reduction rules to understand why the exception was thrown

```
-- definition of liftA2
liftA2 :: Applicative f => (a -> b -> c) -> f a -> f b -> f c
liftA2 f x y = fmap f x <*> y
-- evaluation
1. liftA2 (&&) (pure False) (const True <$> fromMaybe undefined Nothing)
2. fmap (&&) (pure False) <*> (const True <$> fromMaybe undefined Nothing)
3. pure (False &&) <*> (const True <$> fromMaybe undefined Nothing)
4. pure (False &&) <*> (const True <$> undefined)
*** Exception: Prelude.undefined
```

Note step 4: `fromMaybe undefined Nothing` reduces to `undefined`, and `<*>` has to run the effect on its right-hand side, which forces it.

The catch here is that effects are always evaluated. The lifted `(&&)` will then be lazy in the evaluation of the **values** but not in the evaluation of the **effects**. This might be counterintuitive or surprising at first glance: `(&&)` appears to “lose some of its laziness” when it’s lifted. If effects need to be lazily evaluated, we need to write our own “lazy lifted” version of `(&&)`:

```
(.&&) :: Monad m => m Bool -> m Bool -> m Bool
(.&&) f g = do
  p <- f
  if p
    then do { p' <- g ; pure (p && p') }
    else pure p
ƛ: (.&&) (pure False) (const True <$> fromMaybe undefined Nothing)
False
```

Lifting a pure function adds an “effectful dimension” to it: when it comes to how the lifted function evaluates *values*, think about the semantics of the pure function; when it comes to how it evaluates *effects*, think about the semantics of the underlying type class.
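A quick way to see both halves of that statement is the `Maybe` applicative: the lifted `(&&)` never forces the second *value*, but it cannot skip the second *effect*:

```haskell
import Control.Applicative (liftA2)

main :: IO ()
main = do
  -- values stay lazy: `undefined` is never forced because `False && _ = False`
  print (liftA2 (&&) (Just False) (Just undefined))
  -- effects are not skipped: the `Nothing` effect still wins
  print (liftA2 (&&) (Just False) (Nothing :: Maybe Bool))
```

Running this prints `Just False` followed by `Nothing`, whereas `(.&&) (Just False) Nothing` would instead return `Just False`.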

`CHECK` constraints are a powerful way to control data stored in a SQL database. Combined with choosing the appropriate type for each column, we can enforce that the data written adheres to our business rules.
Taking PostgreSQL as our SQL database of choice, the docs show some basic examples, i.e.

```
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0),
    discounted_price numeric CHECK (discounted_price > 0),
    CHECK (price > discounted_price)
);
```

Here the schema imposes three different but related constraints, namely that the price and discounted price of a product must be positive if present, and that the discounted price must be less than the normal price. The docs stop here, but recently I found myself wanting to express something like: if the price is *less than an amount n*, there cannot be a discounted price. Essentially I wanted to express a *conditional* constraint on a column that depends on the value of another one.

There are no examples of `CHECK` constraints using conditional expressions in the docs, and I wasn’t totally sure whether the resulting expression would actually be valid SQL; it turns out it is, and with hindsight it makes total sense. Knowing that, we can express that business rule in a straightforward way:

```
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0),
    discounted_price numeric CHECK (discounted_price > 0),
    CHECK (price > discounted_price),
    CONSTRAINT products_bargain_no_discount
        CHECK (CASE WHEN price < 1 THEN discounted_price IS NULL END)
);
```

Now if we try to insert a bargain product with a discounted price, the `INSERT` statement will be rejected:

```
postgres=# INSERT INTO products VALUES (1,'test',0.99,0.55);
ERROR: new row for relation "products" violates check constraint "products_bargain_no_discount"
DETAIL: Failing row contains (1, test, 0.99, 0.55).
```

The following table gives a pretty good idea of how pervasive graphs are and why anyone should care to answer the question in the first place.

Problems involving graphs are also not unusual during job interviews, and this is actually where my curiosity about functional graph algorithms really took off: I was eager to learn how to approach those kinds of problems functionally. When I started searching I honestly didn’t expect to have such a hard time finding material, and I do not even mean *good* material but any material at all! Maybe I didn’t look hard enough - if that’s the case please let me know! - but basically the only book on the subject of functional data structures out there is Purely Functional Data Structures by Chris Okasaki, released in 1998 (and it’s pretty advanced material), and the only book I am aware of that focuses on functional algorithms is Pearls of Functional Algorithm Design by Richard Bird. Graphs and graph algorithms are no exception: there is a massive amount of literature available for imperative languages, but it takes some DuckDuckGo-fu to find literature on the topic for purely functional languages, and more often than not it comes in the form of academic papers. After a decent amount of digging, my understanding is that lots of purely functional algorithms do exist but they are not as efficient as their imperative counterparts; this might be one of the reasons why they are basically swept under the carpet and not used in practice. So let’s try to answer a slightly different question first: how *can* I implement a graph algorithm in a functional programming language?

One option could be “translating” graph algorithms from the imperative world to the functional world, but that turns out to be unsurprisingly unpleasant: one of the main reasons is that imperative graph algorithms rely heavily on state and side effects (sometimes for efficiency reasons). Let’s take Haskell as our functional programming language of choice and try to translate the depth-first search (DFS) algorithm as presented in The Algorithm Design Manual by Steven S. Skiena:

```
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Foldable (foldlM)
import qualified Data.Vector.Mutable as MV
import Control.Monad.State.Strict (StateT, evalStateT, gets, modify', get, lift)
import Control.Monad.ST (ST, runST)
import Control.Applicative (liftA2)
import Control.Monad.Primitive (PrimMonad, PrimState)

-- `Directed` and the helpers `vertices`, `verticesCount`, `adjacent` and
-- `buildTree` are elided for brevity
data Graph weight label = Graph [(label, [EdgeNode weight label])] Directed Int deriving Eq
type EdgeNode weight label = (label, weight)
type VertexState s = MV.MVector s VState

data DFSState s label weight =
  DFSState { dfsVertex             :: label
           , dfsConnectedComponent :: ConnectedComponent weight label
           , dfsVertexState        :: VertexState s
           }

-- undiscovered, discovered or processed
data VState = U | D | P deriving (Show, Eq, Ord)

type ConnectedComponent weight label = Tree weight label
data Tree weight label = Nil | Node !label [(weight, Tree weight label)] deriving (Show, Eq)

-- Let's assume for simplicity that vertices and weights are integers
dfs :: Graph Int Int -> [ConnectedComponent Int Int]
dfs g =
  runST $ do
    vstates <- MV.replicate (verticesCount g) U
    loop (vertices g) vstates
  where
    loop :: forall s. [Int]
         -> MV.MVector s VState
         -> ST s [ConnectedComponent Int Int]
    loop vs vstates = do
      mv <- findNextUndiscoveredVertex vstates
      maybe (return []) processVertex mv
      where
        processVertex v =
          liftA2 (:) (evalStateT dfs' (DFSState v (Node v []) vstates))
                     (loop vs vstates)

        dfs' :: StateT (DFSState s Int Int) (ST s) (ConnectedComponent Int Int)
        dfs' = do
          DFSState v tree vstates' <- get
          MV.write vstates' v D
          tree' <- foldlM (\tree' edge@(v', _) -> do
                             vstate <- MV.read vstates' v'
                             lift $ processEdgeNode v tree' vstate edge)
                          tree
                          (adjacent v g)
          MV.write vstates v P
          modify' (\s -> s { dfsConnectedComponent = tree' })
          gets dfsConnectedComponent

        processEdgeNode :: Int -> Tree Int Int -> VState -> EdgeNode Int Int -> ST s (Tree Int Int)
        processEdgeNode v tree U edgeNode@(v', _) =
          evalStateT dfs' (DFSState v' (buildTree v edgeNode tree) vstates)
        processEdgeNode _ tree _ _ = return tree

findNextUndiscoveredVertex :: forall (m :: * -> *). PrimMonad m
                           => MV.MVector (PrimState m) VState
                           -> m (Maybe Int)
findNextUndiscoveredVertex vstates =
  go 0 (MV.length vstates)
  where
    go idx size
      | idx == size = return Nothing
      | otherwise = do
          vstate <- MV.read vstates idx
          case vstate of
            U -> return (Just idx)
            _ -> go (idx + 1) size
```

This code is possibly better than an imperative-style implementation in some aspects - for example state and side effects are now explicit and pattern matching makes the code a bit clearer in some places - but one might argue that monadic code makes the algorithm even harder to follow.

There **must** be a better way of doing this! Some online research on the subject led me to this page in the Haskell wiki that has a few links to research papers that tackle graphs and graph algorithms using a functional programming language. Two of them caught my attention and I’d like to illustrate the solutions proposed in those papers.

The first paper is titled “Structuring Depth First Search Algorithms in Haskell”, written by David King and John Launchbury in 1995. The main goals of the paper are:

- implementing depth-first search and related algorithms using a functional style without any performance penalty - this means traversing the graph in linear time
- achieving greater code modularity
- being able to formally prove the critical properties of the considered algorithms

I would like to highlight this last aspect: it’s probably the first time I’ve read material on graph algorithms that takes it into consideration, and it can be really useful, for example in property testing. The paper approaches graph traversal as a combinatorial problem and employs a technique common in that kind of problem: generate and prune. Before illustrating the gist of the technique, let’s define some types and auxiliary functions:

```
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE BangPatterns #-}

import Data.Array (Array, (!), accumArray, bounds, indices)
import Data.Ix (Ix)
import qualified Data.Array.ST as MA
import Control.Monad.ST

type Table = Array Vertex
type Graph = Table [EdgeNode]

-- Let's assume for simplicity that vertices and weights are integers
type Vertex = Int
type Weight = Int
-- an edge node pairs a destination vertex with a weight (definition assumed,
-- it is not spelled out in the original)
type EdgeNode = (Vertex, Weight)
type Bounds = (Vertex, Vertex)

buildG :: Bounds -> [(Vertex, EdgeNode)] -> Graph
buildG bnds = accumArray (flip (:)) [] bnds

mkEmpty :: (Ix i, MA.MArray (MA.STUArray s) Bool m)
        => (i, i) -- min & max bound
        -> m (MA.STUArray s i Bool)
mkEmpty bnds = MA.newArray bnds False

contains :: (Ix i, MA.MArray (MA.STUArray s) Bool m)
         => MA.STUArray s i Bool -> i -> m Bool
contains = MA.readArray

include :: (Ix i, MA.MArray a Bool m) => a i Bool -> i -> m ()
include arr v = MA.writeArray arr v True
```

Now let’s consider this very simple graph:

and let’s look at the generate and prune technique at a high level: the “generate” step describes how to create all possible trees from a given vertex. The following picture illustrates it for the sample graph above; notice that the generated tree is infinite. The nodes that are greyed out have not been generated yet and will be only if necessary:

The “prune” step discards the sub-trees that violate the invariants of DFS, namely those whose roots have already been discovered. Back to our example: when the algorithm reaches `b`, it will discard the tree with root `a`, because `a` has already been discovered, and traverse the tree whose root is labelled `c` instead:

The same thing happens after `c` is traversed, leaving the final DFS spanning tree:

The approach guarantees the efficiency of the algorithm because the evaluation strategy of languages with non-strict semantics (call-by-need or lazy evaluation) ensures that an expression is evaluated only once and on demand; also, the discarded trees will never be used - that is, traversed - so they will never be created in the first place. Let’s have a look now at the implementation:

```
import Data.Tree (Tree(..), Forest)

dfs :: Graph -> [Vertex] -> Forest Vertex
dfs g = prune (bounds g) . map (generate g)
  where
    -- create all possible trees for each vertex...
    generate :: Graph -> Vertex -> Tree Vertex
    generate g v = Node v (map (generate g . fst) (g ! v))

    -- ...and discard the ones that are unused
    prune :: Bounds -> Forest Vertex -> Forest Vertex
    prune bnds ts = runST $ do
      s <- mkEmpty bnds :: forall s. ST s (MA.STUArray s Vertex Bool)
      chop ts s
```
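To make the laziness argument concrete, here is a tiny self-contained sketch (made-up names, separate from the code above): `gen` builds the infinite tree of all walks out of a vertex of a cyclic graph, yet consuming a finite prefix of it terminates, because only the demanded nodes are ever constructed.

```haskell
import Data.Array (Array, listArray, (!))

-- a two-vertex cyclic graph: 1 -> 2 and 2 -> 1
g :: Array Int [Int]
g = listArray (1, 2) [[2], [1]]

data T = N Int [T]

-- the "generate" step: the resulting tree is infinite for a cyclic graph
gen :: Int -> T
gen v = N v (map gen (g ! v))

-- walk down the left-most branch for n steps
walk :: Int -> T -> [Int]
walk 0 _ = []
walk _ (N v []) = [v]
walk n (N v (t:_)) = v : walk (n - 1) t

main :: IO ()
main = print (walk 4 (gen 1)) -- [1,2,1,2]
```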

Notice that the type signature for `mkEmpty bnds` is mandatory; more info can be found here. The `chop` function discards the trees that have already been discovered:

```
chop :: (MA.MArray (MA.STUArray s) Bool m)
     => Forest Vertex -> MA.STUArray s Vertex Bool -> m (Forest Vertex)
chop [] _arr = return []
chop (Node v ts:ns) arr = do
  visited <- contains arr v
  if visited
    -- prune ts
    then chop ns arr
    else do
      -- label vertex
      include arr v
      -- traverse left-to-right
      ts' <- chop ts arr
      -- traverse top-to-bottom
      ns' <- chop ns arr
      return $ Node v ts' : ns'
```

Two qualities of this solution that can be highlighted are:

- for performance reasons it uses a mutable array to keep track of the state of each vertex. The paper points out that this is not strictly necessary: if a logarithmic increase in the time complexity of the algorithm is acceptable, a `Set` data structure can be used to avoid the need for monadic code.
- the algorithm does use a functional style, but the data structure chosen to represent a graph is an adjacency list, which is usually the preferred way of representing graphs in imperative programming languages. Why this is important will become apparent in the next paragraph.

Section “5. Implementing depth-first search” states that

The choice of pruning patterns determines whether the forest ends up being depth-first (traverse in a left-most, top-most fashion) or breadth-first (top-most, left-most)

but without providing any code for it, and I honestly could not wrap my head around how to write a breadth-first traversal with the algorithm proposed in the paper. If anybody has some pointers, again please let me know!

The second paper is Martin Erwig’s “Inductive Graphs and Functional Graph Algorithms” and it was published in 2001. The main goals of the paper are:

- describing an inductive definition of graphs and graph algorithms as recursive functions
- providing efficient implementations of graph algorithms that can be used in real-world scenarios
- providing clear algorithms that can be used to teach graph algorithms

At the very beginning of the paper Martin Erwig asks the following question:

How should I implement a graph algorithm in a functional programming language?

which was exactly the one that started my exploration of the topic. The paper acknowledges many of the functional graph algorithms already developed, but considers them all not completely satisfactory, either because they introduce constructs that are not available in today’s programming languages or because they entail some imperative-style strategy - e.g. keeping track of visited nodes by labelling them - that contaminates the clarity of the algorithm and makes it harder to reason about and to prove correct. The solution the paper proposes is to think about graphs in a new way.

An observation in the paper particularly caught my attention: list and tree algorithms are much simpler and more modular than graph algorithms and do not require additional bookkeeping: why is that? The answer is two-fold: their definitions and the definitions of functions on them are *inductive*, and besides that, *pattern matching* helps a great deal when it comes to clarity and succinctness. Now let’s have a look at the definition of graphs: they are usually defined as a pair `G = (V, E)` where `V` is the set of vertices and `E` is the set of edges, an edge being a pair of vertices in `V`. Imperative algorithms on graphs discover edges and vertices incrementally and usually need to keep track of the visited vertices, either using a separate data structure or by storing more data in the graph itself. In this sense the usual definition of graphs is monolithic, and this is the reason why algorithms that use this API are doomed if what they strive for is clarity and modularity. Would it be possible to define graphs inductively? If so, how? A valid definition for an inductively defined graph data structure might look like the following:

```
infixr 5 :&:

data Graph weight label
  = Empty
  | (Context weight label) :&: (Graph weight label)
  deriving Show

type Context weight label =
  ( Adj weight -- inbound edges
  , Vertex
  , label
  , Adj weight -- outbound edges
  )

-- adjacent weighted edges
type Adj weight = [(weight, Vertex)]
type Vertex = Int
```

The definition should look familiar if you’ve already seen one for trees or lists: a graph is either empty or it is a context plus another graph. A `Context` contains information about a given vertex, namely its value, its label (if any) and its adjacent edges, classified as inbound or outbound. So far so good: now, taking the following graph as an example:

how can we build an inductive graph from a list of vertices and edges? One possible way of building the inductive graph would be the following:

```
ƛ: read "mkG [('a', 1), ('b', 2), ('c', 3)] [(1, 2, 5), (2, 1, 3), (2, 3, 1), (3, 1, 4)]"
([(4,3),(3,2)],1,'a',[(5,2)]) :&: (([],2,'b',[(1,3)]) :&: (([],3,'c',[]) :&: Empty))
```

But that’s not the only valid representation of an inductive graph, another valid inductive graph is the following:

```
ƛ: read "mkG [('c', 3), ('b', 2), ('a', 1)] [(1, 2, 5), (2, 1, 3), (2, 3, 1), (3, 1, 4)]"
([(1,2)],3,'c',[(4,1)]) :&: (([(5,1)],2,'b',[(3,1)]) :&: (([],1,'a',[]) :&: Empty))
```

At this point we can start defining some of the properties of inductive graphs:

- given a list of vertices and a list of edges, multiple inductive graphs can be built depending on the order of insertion of the vertices
- equality is not defined by their “shapes” but rather by the set of vertices and edges they represent
- the adjacent inbound and outbound edges in a `Context` are lists of vertices that have *already been discovered*
- inductive graphs are fully persistent data structures

Pattern matching was identified as one of the ingredients that make list and tree algorithms clean and succinct. The paper refers to an extension of pattern matching for graphs named *“active graph patterns”* whose main goal, as far as I understood, is to make the notation more compact, augmenting classic pattern matching by allowing a function to be called before the match is applied. It is very similar to view patterns but, as far as I know, it is not currently available in Haskell; the following code is made up and **will not** type-check, but hopefully it will provide a good intuition:

```
deg :: Vertex -> Graph weight label -> Int
deg v ((ins, _, _, outs) (:&: <!> v) g) = length ins + length outs
```

The expression `(:&: <!> v)` can be interpreted as: *“find the Context for the vertex v in the graph g if it exists and try to match the given pattern”*. Active graph patterns are not essential when implementing inductive graphs and it is possible to do pattern matching without them; all that is needed is a `match` function. An extremely naive implementation might look like:

```
match :: Vertex -> Graph weight label -> Maybe (Context weight label, Graph weight label)
match qv = matchHelp ([], [])
  where
    matchHelp _ Empty = Nothing
    matchHelp (lvs, wes) ((ins, v, l, outs) :&: g)
      | qv == v =
          -- rebuild the graph inserting `v` last
          let (:&:) !ctx !g' = mkG g ((l, v):lvs) es'
          -- return `v`'s context and the new inductive graph
          in Just (ctx, g')
      | otherwise = matchHelp ((l, v):lvs, es') g
      where
        -- build a list of edges to rebuild the graph
        es' =
          map (\(w, fromv) -> (fromv, v, w)) ins
          ++ map (\(w, tov) -> (v, tov, w)) outs
          ++ wes
```

Now that we have defined graphs inductively, it’s time to show how that can be leveraged to write clear, recursive graph algorithms. Let’s have a look at some fundamental graph algorithms: depth-first search (DFS), breadth-first search (BFS), Dijkstra’s shortest path and Prim’s algorithm to find the minimum spanning tree (MST).

Using a depth-first search strategy to visit a graph essentially means: traverse each vertex **once** and visit **successors before siblings**. Here’s what the algorithm looks like:

```
dfs :: [Vertex] -> Graph weight label -> [Vertex]
dfs _ Empty = []
dfs [] _ = []
dfs (v:vs) g = case v `match` g of
  Nothing -> dfs vs g
  Just ((_,vtx,_,outs), g') -> vtx : dfs (destvs outs ++ vs) g'

-- extracts destination vertices from a list of outbound edges
destvs :: Adj weight -> [Vertex]
destvs = map snd
```

`dfs` is a recursive function that takes a list of input vertices and a graph, and returns the list of vertices in DFS traversal order. If the graph or the input vertices are empty it returns the empty list, otherwise it `match`es the current vertex `v` against the graph. If `v` is a vertex in the graph, `match` returns its context and a new graph without it; `v` is prepended to the result list and the recursion continues with the new graph, using as input the destination vertices of all outbound edges of `v` prepended to the remaining input vertices. If `v` is not a vertex in the graph, it is simply ignored. The key observations about the algorithm are:

- destination vertices are appended *to the left end* of the input vertices: this is what makes the algorithm traverse the input graph depth-first. This is exactly what the second invariant of DFS dictates: visit successors before siblings.
- the `match` function returns a new graph *without* the query vertex: this is what the first invariant of DFS dictates: visit each vertex exactly once. Since the new graph doesn’t contain the query vertex, there is no need to keep track of the visited vertices, therefore no bookkeeping is necessary.

Let’s have a look at a very simple example using one of the sample graphs above:

```
ƛ: let g = read "mkG [('c', 3), ('b', 2), ('a', 1)] [(1, 2, 5), (2, 1, 3), (2, 3, 1), (3, 1, 4)]"
([(1,2)],3,'c',[(4,1)]) :&: (([(5,1)],2,'b',[(3,1)]) :&: (([],1,'a',[]) :&: Empty))
ƛ: dfs (vertices g) g
[1, 2, 3]
```

One of the applications of DFS is finding the spanning forest (set of trees) of a graph. The algorithm builds the spanning forest by traversing the graph in such a way that only when the DFS traversal of a connected component is complete will it proceed with the next one. Let’s define some types first:

```
data Tree a = Nil | Node !a (Forest a) deriving Show
type Forest a = [Tree a]

dff :: [Vertex] -> Graph weight label -> Forest Vertex
dff vs g = fst (dff' vs g)
```

The `dff` function calls an auxiliary function `dff'` that does the heavy lifting; let’s have a look at it:

```
dff' :: [Vertex] -> Graph weight label -> (Forest Vertex, Graph weight label)
dff' [] g = ([], g)
dff' (v:vs) g = case v `match` g of
  Nothing -> dff' vs g
  Just (ctx, g') ->
    -- traverse `v`'s successors first, then the remaining vertices using
    -- the graph returned by that traversal
    let (ts, g'')     = dff' (destvs ctx) g'
        (forest, g1)  = dff' vs g''
    in (Node v ts : forest, g1)
    -- or more succinctly: second (dff' vs) (dff' (destvs ctx) g'), where
    -- `second` applies the function `dff' vs` to the second element of
    -- the pair returned by `dff' (destvs ctx) g'`

-- extracts destination vertices from the outbound edges of a context
destvs :: Context weight label -> [Vertex]
```

The `dff'` function is another recursive function: if `match`ing the vertex `v` against the graph `g` succeeds, `dff'` calls itself on `v`’s successors with the new graph to build `v`’s subtree; the recursion then continues on the remaining vertices `vs` with the most recent version of the graph. Again let’s have a look at a very simple example, built on top of the previous one:

```
ƛ: let g = read "mkG [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)] [(1, 2, 5), (2, 1, 3), (2, 3, 1), (3, 1, 4), (4, 5, 7)]" :: Graph Int Char
([(4,3),(3,2)],1,'a',[(5,2)]) :&: (([],2,'b',[(1,3)]) :&: (([],3,'c',[]) :&: (([],4,'d',[(7,5)]) :&: (([],5,'e',[]) :&: Empty))))
ƛ: dff (vertices g) g
[Node 1 [Node 2 [Node 3 []]], Node 4 [Node 5 []]]
```

Using a breadth-first search strategy to visit a graph essentially means: traverse each vertex **once** and visit **siblings before successors**. Here’s what the algorithm looks like:

```
bfs :: [Vertex] -> Graph weight label -> [Vertex]
bfs vs g
  | isEmpty g || null vs = []
  | otherwise = case v `match` g of
      Nothing -> bfs vs' g
      Just (ctx, g') -> v : bfs (vs' ++ destvs ctx) g'
  where
    (v, vs') = (head vs, tail vs)

-- extracts destination vertices from the outbound edges of a context
destvs :: Context weight label -> [Vertex]
```

The key facts to notice about the algorithm are:

- siblings are appended *at the right end* of the input vertices: this is what makes the algorithm traverse the input graph breadth-first. This is exactly what the second invariant of BFS dictates: visit siblings before successors.
- the `match` function returns a new graph *without* the current vertex: this is what the first invariant of BFS dictates: traverse each vertex exactly once. Since the new graph doesn’t contain the current vertex, there is no need to keep track of the visited vertices.
- the algorithm is mostly the same as `dfs`; the only thing that changes is where the newly discovered vertices are appended: in BFS they’re appended at the end of the list, in DFS at the front of it. To fully appreciate this it might be useful to think of these algorithms in terms of the data structures they use: a LIFO (stack) in the case of DFS and a FIFO (queue) in the case of BFS.

One of the applications of BFS is finding the shortest path in an unweighted graph. For convenience the paper chooses a different representation for the spanning forest: a list of labelled paths. Let’s have a look at the implementation of the shortest path algorithm:

```
type Path = [Vertex]
-- root tree: a list of paths back to the root
type RTree = [Path]

shortestPath :: Vertex -> Vertex -> Graph weight label -> Path
shortestPath src dst = reverse . pathTo ((==dst) . head) . bft src

pathTo :: (a -> Bool) -> [a] -> a
pathTo p = head . filter p
```

The `shortestPath` function takes a source vertex and a destination vertex, filters the path to the destination and reverses it (why this is necessary will become clear in a moment). Notice that since Haskell has non-strict semantics, `shortestPath` stops as soon as the path to the target destination vertex is found. Now let’s have a look at the implementation of the `bft` function:

```
bft :: Vertex -> Graph weight label -> RTree
bft v = bf [[v]]

bf :: [Path] -> Graph weight label -> RTree
bf paths g
  | null paths || isEmpty g = []
  | otherwise = case v `match` g of
      Nothing -> bf paths' g
      Just ((_,_,_,outs), g') -> path : bf (paths' ++ map (:path) (destvs outs)) g'
  where
    -- gets the current vertex from the first path in the list and the remaining
    -- paths; `paths` is never empty here thanks to the guard above
    (path@(v:_), paths') = let (pss, pss') = splitAt 1 paths in (head pss, pss')
    -- more succinctly: (path@(v:_), paths') = first head (splitAt 1 paths)

-- extracts destination vertices from a list of outbound edges
destvs :: Adj weight -> [Vertex]
```

Instead of explaining what the function does step-by-step, let’s have a look at an example on a simple graph as it might be easier to understand:

```
ƛ: let g = read "mkG [('a', 1), ('b', 2), ('c', 3)] [(1, 2, ()), (2, 1, ()), (2, 3, ()), (3, 1, ())]" :: Graph () Char
([((),3),((),2)],1,'a',[((),2)]) :&: (([],2,'b',[((),3)]) :&: (([],3,'c',[]) :&: Empty))
ƛ: bf [[1]] g
[[1],[2,1],[3,2,1]]
```

Notice that an unweighted graph is a graph whose weights are `()` and that the resulting spanning tree contains the shortest path from the source vertex to every other vertex *in reverse order*. The `bf` function builds complete paths from the source to each destination vertex but doesn’t waste any memory, because list prefixes are shared.

Finding the shortest path between two vertices means finding the cheapest path between them (where “cheap” depends on the weight or cost of the edges). Dijkstra’s algorithm to find the shortest path in a weighted graph essentially always chooses the next cheapest edge, taking into account the distance traversed so far. First let’s define two new auxiliary types:

```
-- Labelled vertex
type LVertex label = (label, Vertex)
-- Only needed to be able to define `Eq` and `Ord` instances
newtype LPath label = LPath { getLPath :: [LVertex label] }
-- Labelled R-Tree (or Root Tree)
type LRTree label = [LPath label]
instance Eq label => Eq (LPath label) where ...
instance Ord label => Ord (LPath label) where ...
type Weight = Int
```
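The elided instances compare paths by the cost stored at their head - the total distance accumulated so far - so that a min-heap of `LPath`s always yields the cheapest path first. A plausible sketch of what they could look like (self-contained, with the types restated):

```haskell
type Vertex = Int
type LVertex label = (label, Vertex)

newtype LPath label = LPath { getLPath :: [LVertex label] }

-- Compare only the label of the first vertex, i.e. the accumulated cost
instance Eq label => Eq (LPath label) where
  LPath ((l, _):_) == LPath ((l', _):_) = l == l'
  LPath []         == LPath []          = True
  _                == _                 = False

instance Ord label => Ord (LPath label) where
  compare (LPath ((l, _):_)) (LPath ((l', _):_)) = compare l l'
  compare (LPath [])         (LPath [])          = EQ
  compare (LPath [])         _                   = LT
  compare _                  (LPath [])          = GT
```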

The algorithm uses a min-heap and some auxiliary functions to keep track of the cheapest path:

```
import qualified Data.Heap as Heap

getPath :: Vertex -> LRTree label -> Path
getPath v = reverse . map snd . getLPath . pathTo ((==v) . lv2v)
  where
    lv2v = snd . head . getLPath

expand :: Int -> LPath Weight -> Context Weight label -> [LPath Weight]
expand d (LPath p) (_, _, _, outs) = map (\(w, v) -> LPath ((w + d, v):p)) outs

mergeAll :: [LVertex Weight] -> Heap.Heap (LPath Weight) -> Context Weight label -> Heap.Heap (LPath Weight)
mergeAll p@((dist, _):_) h ctx = foldr Heap.insert h (expand dist (LPath p) ctx)
```

The `expand`

function builds new `LPath`

s whose label is the sum of the distance walked so far - let’s assume weights are positive integers for simplicity - and the weight of the outbound edge. The `mergeAll`

function takes these paths and inserts them in the heap. The `getPath`

function just extracts the path to the given destination vertex from the list of paths. Now let’s have a look at the core of the algorithm:

```
dijkstra :: Heap.Heap (LPath Weight) -> Graph Weight label -> LRTree Weight
dijkstra h g
  | isEmpty g = []
  | otherwise = case Heap.viewMin h of
      Nothing -> []
      Just (lpath, h') -> dijkstra' (lpath, h')
  where
    dijkstra' (LPath p@((_, v):_), h') =
      case v `match` g of
        Nothing -> dijkstra h' g
        Just (ctx, g') -> LPath p : dijkstra (mergeAll p h' ctx) g'

shortestPathTree :: Vertex -> Graph Weight label -> LRTree Weight
shortestPathTree src = dijkstra (Heap.singleton $ LPath [(0, src)])

shortestPath :: Vertex -> Vertex -> Graph Weight label -> Path
shortestPath src dst g = getPath dst (shortestPathTree src g)
```

The `shortestPath` function kicks off the algorithm by providing the source vertex to the `shortestPathTree` function, which in turn calls `dijkstra` with a singleton min-heap that contains a path to the source vertex with weight zero - this is how expensive it is to walk from the source vertex to itself. The `dijkstra` function recursively pops the cheapest path from the min-heap and, if the current vertex `v` is contained in the graph - that is, the vertex hasn’t already been visited - appends it to the resulting `LRTree` and calls itself recursively with a new min-heap that contains up-to-date costs and a new graph that doesn’t contain `v`. The recursion stops when the graph is empty - that is, all vertices have been visited - or the min-heap is empty - that is, all edges have been traversed. This is definitely a bit more complex than the other algorithms but it’s quite elegant and modular. Let’s have a look at an example on the following graph:

```
ƛ: let g = read "mkG [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5), ('f', 6), ('g', 7)] [(1,2,12),(1,3,7),(1,4,5),(2,3,4),(2,7,7),(3,4,9),(3,5,4),(3,7,3),(4,5,7),(5,6,5),(5,7,2),(6,7,2)]" :: Graph Int Char
-- weights should be wrapped in a `Sum` constructor to form a monoid for addition on Ints but let's forget about that for the sake of simplicity
-- `undir` simply transforms a directed graph to an undirected one
ƛ: shortestPathTree 1 (undir g)
[LPath {getLPath = [(0,1)]},LPath {getLPath = [(5,4),(0,1)]},LPath {getLPath = [(7,3),(0,1)]},LPath {getLPath = [(10,7),(7,3),(0,1)]},LPath {getLPath = [(11,5),(7,3),(0,1)]},LPath {getLPath = [(11,2),(7,3),(0,1)]},LPath {getLPath = [(12,6),(10,7),(7,3),(0,1)]}]
ƛ: shortestPath 1 6 (undir g)
[1,3,7,6]
```
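If you want to run something without the inductive `Graph` type or the `heap` package, here’s a stand-alone sketch of the same algorithm over an adjacency map, with a list kept sorted by accumulated cost standing in for the min-heap (all names are mine):

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set
import Data.List (insertBy)
import Data.Ord (comparing)

type V = Int
type W = Int
type AdjMap = Map.Map V [(W, V)]

-- Build an undirected adjacency map from (src, dst, weight) triples
fromEdges :: [(V, V, W)] -> AdjMap
fromEdges es = Map.fromListWith (++)
  ([ (a, [(w, b)]) | (a, b, w) <- es ] ++ [ (b, [(w, a)]) | (a, b, w) <- es ])

-- Pop the cheapest labelled path, skip already-visited vertices, push the
-- extended paths back; exactly the structure of `dijkstra` above.
dijkstraSketch :: AdjMap -> V -> [[(W, V)]]
dijkstraSketch g src = go [[(0, src)]] (Map.keysSet g)
  where
    go [] _ = []
    go (p@((d, v):_) : heap) unvisited
      | v `Set.notMember` unvisited = go heap unvisited
      | otherwise = p : go heap' (Set.delete v unvisited)
      where
        heap' = foldr (insertBy (comparing (fst . head))) heap
                      [ (d + w, v') : p | (w, v') <- Map.findWithDefault [] v g ]
    go (_ : heap) unvisited = go heap unvisited

shortestPathSketch :: AdjMap -> V -> V -> [V]
shortestPathSketch g src dst =
  reverse [ v | (_, v) <- head [ p | p@((_, v):_) <- dijkstraSketch g src, v == dst ] ]
```

On the example graph above this sketch reproduces the `[1,3,7,6]` result of `shortestPath 1 6 (undir g)`.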

Prim’s algorithm to find the minimum spanning tree (MST) always traverses the cheapest edge among the discovered edges - like Dijkstra’s it’s a greedy algorithm. The two algorithms are notoriously very similar and this becomes evident using recursive functions. We’ll re-use the same types defined for the shortest path algorithm but define different auxiliary functions:

```
mergeAll :: [LVertex Weight] -> Heap.Heap (LPath Weight) -> Context Weight label -> Heap.Heap (LPath Weight)
mergeAll lvs h ctx = foldr Heap.insert h (addEdges (LPath lvs) ctx)
addEdges :: LPath Weight -> Context Weight label -> [LPath Weight]
addEdges (LPath p) (_, _, _, outs) = map (LPath . (:p)) outs
```

The `addEdges`

function is very similar to the `expand`

function but it doesn’t take into account the distance walked so far, only the weight of the edges. The core of the algorithm shouldn’t be anything new:

```
prim :: Heap.Heap (LPath Weight) -> Graph Weight label -> LRTree Weight
prim h g
  | isEmpty g = []
  | otherwise = case Heap.viewMin h of
      Nothing -> []
      Just (lpath, h') -> prim' (lpath, h')
  where
    prim' (LPath p@((_, v):_), h') =
      case v `match` g of
        Nothing -> prim h' g
        Just (ctx, g') -> LPath p : prim (mergeAll p h' ctx) g'

mst :: Vertex -> Graph Weight label -> LRTree Weight
mst src = prim (Heap.singleton (LPath [(0, src)]))
```

Now that the MST can be built, let’s find the path between two vertices:

```
mstPath :: Vertex -> Vertex -> LRTree weight -> Path
mstPath src dst t = joinPaths (getPath src t) (getPath dst t)

joinPaths :: Path -> Path -> Path
joinPaths p2src p2dst = joinAt (head p2src) (tail p2src) (tail p2dst)

joinAt :: Vertex -> Path -> Path -> Path
joinAt _src (v:vs) (v':vs')
  | v == v' = joinAt v vs vs'
joinAt src ps ps' = reverse ps ++ (src:ps')
```

Let’s again have a look at an example on the following graph:

```
ƛ: let g = read "mkG [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5), ('f', 6), ('g', 7)] [(1,2,12),(1,3,7),(1,4,5),(2,3,4),(2,7,7),(3,4,9),(3,5,4),(3,7,3),(4,5,7),(5,6,5),(5,7,2),(6,7,2)]" :: Graph Int Char
-- weights should be wrapped in a `Sum` constructor to form a monoid for addition on Ints but let's forget about that for the sake of simplicity
ƛ: let mstTrees = mst 1 (undir g)
[LPath {getLPath = [(0,1)]},LPath {getLPath = [(5,4),(0,1)]},LPath {getLPath = [(7,5),(5,4),(0,1)]},LPath {getLPath = [(2,7),(7,5),(5,4),(0,1)]},LPath {getLPath = [(2,6),(2,7),(7,5),(5,4),(0,1)]},LPath {getLPath = [(3,3),(2,7),(7,5),(5,4),(0,1)]},LPath {getLPath = [(4,2),(3,3),(2,7),(7,5),(5,4),(0,1)]}]
ƛ: mstPath 3 5 mstTrees
[3,7,5]
```
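As with Dijkstra, the same greedy skeleton can be sketched stand-alone over an adjacency map; the only structural difference is that the priority of an edge is its own weight rather than the accumulated distance (all names are mine):

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set
import Data.List (insertBy)
import Data.Ord (comparing)

type V = Int
type W = Int
type AdjMap = Map.Map V [(W, V)]

-- Build an undirected adjacency map from (src, dst, weight) triples
fromEdges :: [(V, V, W)] -> AdjMap
fromEdges es = Map.fromListWith (++)
  ([ (a, [(w, b)]) | (a, b, w) <- es ] ++ [ (b, [(w, a)]) | (a, b, w) <- es ])

-- Prim's MST as (parent, child, weight) edges; the frontier is a list
-- kept sorted by edge weight alone.
primSketch :: AdjMap -> V -> [(V, V, W)]
primSketch g src = go (frontier src) (Set.singleton src)
  where
    frontier v = [ (w, v, v') | (w, v') <- Map.findWithDefault [] v g ]
    weight (w, _, _) = w
    go [] _ = []
    go ((w, from, to) : rest) inTree
      | to `Set.member` inTree = go rest inTree
      | otherwise =
          (from, to, w) : go (foldr (insertBy (comparing weight)) rest (frontier to))
                             (Set.insert to inTree)
```

On the example graph the resulting tree has six edges with a total weight of 23, matching the `mst 1 (undir g)` output above.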

I mentioned that inductive graphs and related algorithms are meant to be as efficient as their non-inductive counterparts. The implementations shown so far are not - and they were not meant to be in the first place - but hopefully they provided a good intuition about inductive graphs. An efficient implementation would rely internally on more efficient data structures, and a key aspect to achieve asymptotically optimal running times for the algorithms shown above is that active patterns must execute in constant time. The fgl library is a real-world implementation based on Martin Erwig’s paper, and if you’re curious to know how it is possible to implement inductive graphs efficiently I encourage you to look at the source code; digging into the internals of the library is a whole different topic, possibly for a future blog post.

One of the trade-offs to achieve clear and elegant graph algorithms seems to be shifting complexity from the algorithm itself to the supporting data structures: for example, implementing an inductive graph is more complex than implementing an adjacency list, and using a min-heap in the shortest path or MST algorithms eliminates the need for bookkeeping when deciding which edge should be traversed next.

This exploration into graphs and related algorithms in functional programming started with a simple question that was surprisingly hard to answer: *“How should I implement a graph algorithm in a functional programming language?”* The plethora of resources about graphs in the imperative world is not matched in the functional world, where adequate solutions to the problem have surfaced only in the last 20 years. Starting with an unsatisfactory monadic implementation, we had a look at a better solution that leverages a mix of functional and imperative constructs and finally described an implementation based on inductive graphs that manages to be elegant, clear and efficient - with some caveats - by leveraging inductive data structures and functions.

- This blog post is an expanded version of a presentation I gave at the Berlin Haskell User Group in November 2017 where I gathered lots of valuable feedback that made it into this post. A special thanks goes to Matthias, Ben & Adrian.
- Fixed `bfs` implementation and rewordings based on Erik’s feedback

I faced the problem a while ago while working on a pet project: the Docker image was almost 2GB(!) but the only thing the application was doing was validating a YAML file. I didn’t find a good solution until a few days ago, when the feram.io folks pointed me to this blog post (thanks guys!). Multi-stage builds?! I didn’t even know that was possible! So I got back to my pet project to see how that would work out in Haskell (the application in the blog post is written in Go). The final solution I ended up implementing after some painful and time-consuming trial and error was a bit more involved than what is described in that post but it was worth it: the final size of the Docker image dropped **from 2GB to 17.1MB - 5MB compressed!** I first used plain multi-stage builds but that had an issue: since all the Haskell dependencies have to be compiled, the first part of the multi-stage build was taking a long time to complete while the second part was taking only a few seconds. For this reason I ended up splitting the two, basically going back to the builder pattern the blog post mentions: I first built a base image with all the needed Haskell dependencies compiled and then used a multi-stage build to create the executable image. The `Dockerfile`

for the base image is not that interesting:

```
# Dockerfile.builder
# docker build -t futtetennista/serverless-validator-builder --file Dockerfile.builder .
FROM haskell:8.0
# Install dependencies needed to compile Haskell libraries
RUN apt-get update && apt-get install --yes \
xz-utils \
make
RUN stack --resolver lts-9.14 install base \
protolude \
text \
aeson \
yaml \
unordered-containers \
case-insensitive \
regex-compat
```

It just installs some Linux dependencies and builds the Haskell dependencies. The one for the executable image is a bit more exciting:

```
# Dockerfile
# docker build -t futtetennista/serverless-validator .
FROM futtetennista/serverless-validator-builder as builder
WORKDIR "/home/serverless-validator/"
# copy the contents of the current directory in the working directory
COPY . .
RUN stack --resolver lts-9.14 install && \
strip /root/.local/bin/serverless-validator
FROM fpco/haskell-scratch:integer-gmp
COPY --from=builder /root/.local/bin/serverless-validator /bin/
ENTRYPOINT ["/bin/serverless-validator"]
```

First it compiles and links the executable in the base container, removes some unwanted data - as `man strip` illustrates:

> strip removes or modifies the symbol table attached to the output of the assembler and link editor. This is useful to save space after a program has been debugged and to limit dynamically bound symbols.

and finally copies the executable from the base container to the executable container. The fpco/haskell-scratch Docker image was created by a personal Haskell hero of mine, Michael Snoyman, and introduced in this blog post a while back. It’s a minimal Linux image (~2MB) that can be used as a base image to run Haskell applications; the image hasn’t been updated in two years but it still works flawlessly (there is another Docker image tagged snoyberg/haskell-scratch but I guess it has been deprecated).

Thanks once again to Michael and the FP Complete folks for solving **so** many practical problems Haskellers face in their day-to-day coding!

This technique is applicable whenever your application needs to build upon an existing framework or library: for example this very website! Building it on CircleCI without any caching and compiling all needed dependencies took almost 14 minutes, with caching that went down to 1:32 minutes and with a base image with pre-compiled dependencies to 1:18 minutes.

`docker-compose`

since a user-defined network is created by default (at least in recent versions).
The postgres official image on Docker Hub contains an example for the former case: when creating the container running the `psql`

command using the default `bridge`

network interface, we need to supply the `--link`

option with the name (and optionally alias) of the postgres instance and the `--host`

option with its **alias** (if no alias is supplied the name is also the alias):

`$ docker run -it --rm --link=postgres:postgres-instance postgres psql --host=postgres-instance --username=postgres`

There is no example for the latter case though, so let’s see how it can be achieved. First let’s create a user-defined network and give it an arbitrary name - e.g. `my_bridge`

- by simply typing: `docker network create my_bridge`

. When using user-defined networks it is essential to know the **IP** of the container running the postgres instance, which can be retrieved like this:

```
$ docker run -itd --name=postgres-instance postgres # Make sure there's an instance running
$ docker inspect --format='{{ .NetworkSettings.Networks.my_bridge.IPAddress }}' postgres-instance
172.19.0.2
```

Now let’s create another container that connects to the postgres instance and runs the `psql`

command, and use the configuration options to set the user-defined network and to provide the IP of the postgres instance the container has to connect to:

`$ docker run -it --rm --network=my_bridge postgres psql --host=172.19.0.2 --username=postgres`

Alternatively, if we leverage the `--add-host`

option, the command is going to look very similar to the one we used when connecting containers using links:

`$ docker run -it --rm --network=my_bridge --add-host=postgres-instance:172.19.0.2 postgres psql --host=postgres-instance --username=postgres`

The `--add-host`

option simply adds an entry to the `/etc/hosts`

file of the container.

Adding `sitemap.xml`

and `robots.txt`

to a website built with Hakyll is not explicitly documented, but it ended up being quite easy with the help of some DuckDuckGo-fu. A quick search turns up this post. With a few amendments the solution proposed in the post works like a charm; here’s the revised version I use for my website:
```
create ["sitemap.xml"] $ do
  route idRoute
  compile $ do
    posts <- recentFirst =<< loadAll "posts/**"
    pages <- loadAll "pages/*"
    let crawlPages = sitemapPages pages ++ posts
        sitemapCtx =
          mconcat [ listField "entries" defaultContext (return crawlPages)
                  , defaultContext
                  ]
    makeItem ""
      >>= loadAndApplyTemplate "templates/sitemap.xml" sitemapCtx
      >>= relativizeUrls

match (fromList ["robots.txt", "CNAME"]) $ do
  route idRoute
  compile $ getResourceBody >>= relativizeUrls
```

Notice how the `robots.txt`

and `CNAME`

files - the latter is needed by my domain name registrar - are simply copied since there’s no need to apply any processing to them.

Configuration files in CircleCI 2.0 are quite different from 1.0 so I couldn’t reuse much of the code in this post, but I found the docs quite good and the configuration options quite intuitive. CircleCI 2.0 adds great support for Docker, so I ended up creating a custom Docker image for my website after trying the official Haskell image and stumbling upon two main issues:

- `ssh` isn’t installed by default: this is a problem when checking out or pushing to a remote git repository (checkout actually works somehow thanks to some tricks CircleCI does, but it logs a warning in its console)
- `make` isn’t installed by default: my website uses hakyll-sass, which has a C++ dependency - `libsass` - that needs to be built

After that the project was building, but it was compiling all dependencies from scratch. Building the site from scratch takes quite a bit - ~20 minutes on my local machine and ~12 minutes on CircleCI - so it’s critical to use CircleCI’s caching to speed things up. The `save_cache` and `restore_cache` job-level keys are the ones to configure in order to speed up the build; this is how the caching section looks for my project:

```
- restore_cache:
    key: v1-stack-work-{{ checksum "futtetennismo.cabal" }}
...
- save_cache:
    paths:
      - ~/futtetennismo/.stack-work
      - /root/.stack/
    key: v1-stack-work-{{ checksum "futtetennismo.cabal" }}
```

It’s very simple: just let CircleCI know what it should cache and how the cache should be named so that it can be retrieved at a later time. The improvements on build time are dramatic: from ~20 minutes for a build with no cache, to ~2 minutes in the worst case (when a new cache archive needs to be created and uploaded), to ~30 seconds in the average case!

Here’s something to keep in mind when caching in CircleCI 2.0:

> The cache for a specific key is immutable and cannot be changed once written.

In early experiments the `save_cache`

job-level key in my `config.yml`

looked like this

```
- save_cache:
    paths:
      - ~/futtetennismo/.stack-work
    key: stack-work-{{ checksum "futtetennismo.cabal" }}

but every new build was again taking a long time because the cache was mostly useless. Then I added `/root/.stack` to the `paths` but still nothing: the cache was just a few KB. At that point I noticed this tip in the docs:

> Tip: Given the immutability of caches, it might be helpful to start all your cache keys with a version prefix v1-… . That way you will be able to regenerate all your caches just by incrementing the version in this prefix.

That meant that `save_cache`

never overwrites an existing cache! I ended up building a new cache following that tip.

The missing piece of the puzzle is setting up ssh keys in CircleCI to be able to check out the project and push new versions of the website. I ended up creating a read/write deployment key for pushing to Github and a checkout key for fetching from Github. The latter is just a matter of a few clicks; the former involves some manual work, but the docs by the Github folks are easy to follow.

**UPDATE**: if a read/write deployment key is created and added to a project in CircleCI, there’s actually no need to also have a checkout key.

One feature that CircleCI provides (and which proved extremely handy when debugging some issues while setting up ssh keys) is the ability to connect to a running container via ssh. Bravo to the CircleCI folks!

The workflow I ended up adopting is a mix of these two tutorials. The `master` branch - which is the one deployed by Github Pages in case of a user page - contains all the static assets of the website like images, html, css etc. Another branch - that I arbitrarily named `source` - contains all the source code, and each time a new commit is pushed to that branch CircleCI will build a new version of my website and eventually push it to Github (if the build succeeds). For convenience, I also configured my build in such a way that CircleCI builds *only* that branch. Here’s a snippet of the `deploy` job-level key in the `config.yml`:

```
version: 2
jobs:
  build:
    working_directory: ~/futtetennismo
    branches:
      only:
        - source
    ...
    - deploy:
        name: Deploy master to Github Pages
        command: |
          git config --global user.email robots@circleci.com
          git config --global user.name CircleCI
          stack exec site rebuild
          git checkout master
          git pull --rebase
          # Overwrite existing files with new files
          cp -a _site/. .
          # Commit
          git add --all
          git commit -m "[`date '+%F %T %Z'`] New release"
          # Push
          git push origin master:master

As a last little bonus, I wrote a simple `pre-push`

hook to open my browser and follow the deployment:

```
#!/usr/bin/env sh
if [ "$(git rev-parse --abbrev-ref HEAD)" = 'source' ]
then
open "https://circleci.com/gh/futtetennista/futtetennista.github.com"
fi
```

I described a workflow to deploy websites built with Hakyll in a fully automated fashion using Github Pages and CircleCI 2.0, and I pointed out some of the gotchas I learned in the process. The code is open source and can be found on Github.

`Job`

that extracts all the well-formed urls, a `Task` for each url that needs to be checked, and puts it in a job queue that worker threads poll to get new urls to check. The program waits until all urls are checked and prints out some statistics about them. Let’s start by having a look at the types:
`> data Task = Done | Check URL`

`> type URL = Lazy.ByteString`

```
> data JobState =
> JobState { linksSeen :: Set.Set URL
> , linksFound :: !Int
> , linkQueue :: TChan Task
> }
```

```
> newtype Job a =
> Job { runJob :: StateT JobState IO a }
> deriving (Functor, Applicative, Monad, MonadState JobState, MonadIO)
```

The `checkURLs` function glues together a few things: extracting the urls from the input file, filtering out duplicates, enqueueing the tasks in the job queue and updating the statistics

```
> checkURLs :: FilePath -> Job ()
> checkURLs f = do
>   src <- liftIO $ Lazy.readFile f
>   let urls = extractLinks src
>   uniqueUrls <- filterM seenURI urls
>   mapM_ insertURI uniqueUrls
>   enqueueTasks uniqueUrls
>   updateStats (length urls)
```

```
> updateStats :: Int -> Job ()
> updateStats numUrls =
> modify $ \s -> s { linksFound = linksFound s + numUrls }
> seenURI :: URL -> Job Bool
> seenURI url =
> (not . Set.member url) <$> gets linksSeen
> insertURI :: URL -> Job ()
> insertURI url =
> modify $ \s -> s { linksSeen = Set.insert url (linksSeen s) }
```

```
> enqueueTasks :: [URL] -> Job ()
> enqueueTasks urls = do
> q <- gets linkQueue
> liftIO . atomically $ mapM_ (writeTChan q . Check) urls
```

```
> extractLinks :: Lazy.ByteString -> [URL]
> extractLinks =
> Lazy.lines -- filtering of invalid urls omitted
```

For the version using conduits we’ll aim to remove everything that relies on lazy IO, using strict `ByteString`

s and conduits. There is just one change needed in the types, namely re-defining the `URL`

type alias

`> data Task' = Done' | Check' URL'`

`> type URL' = Strict.ByteString`

```
> data JobState' =
> JobState' { linksSeen' :: Set.Set URL'
> , linksFound' :: !Int
> , linkQueue' :: TChan Task'
> }
```

```
> newtype Job' a =
> Job' { runJob' :: StateT JobState' IO a }
> deriving (Functor, Applicative, Monad, MonadState JobState', MonadIO)
```

The `checkURLs'` function is - as you might expect - quite different given how the conduit library is designed. In conduit “everything is driven by the downstream” so I found it useful to ask myself this question: what output does the function need to produce? In this case `checkURLs'` needs to do essentially two things: 1) create and enqueue `Task`s to be picked up by worker threads and 2) update some statistics in `JobState'`. The first shift in thinking I found necessary is to think only in terms of pipelines and leave out `let` bindings. This poses a challenge though: the extracted urls are needed for both 1) and 2), but once they go through 1) the urls are transformed into a job and that’s not what 2) expects as an input. I found three possible solutions to the problem: changing the signatures of the helper functions so that the input urls are always returned wrapped in a monad (this reminded me of the “fluent” style used for example by builders in languages like Java) to allow the stream to “keep flowing downstream”, using zipped conduits, or a mix of the two. The `ZipConduit` is a handy type that makes it possible to split the stream into two identical streams that can be consumed by two different downstream conduits: this way both 1) and 2) can get the input data they expect. I’m not entirely sure which is more idiomatic or elegant or - more importantly - clear though.

First let’s start with the helper functions (I’ll just write type signatures for their variations in the following snippets); their implementation is the same but the type signatures of most of them are slightly different - more on this below

```
> extractLinks' :: Strict.ByteString -> [URL']
> extractLinks' =
> Strict.lines -- filtering of invalid urls omitted
```

```
> updateStats' :: MonadState JobState' m => Int -> m ()
> updateStats' numUrls =
> modify $ \s -> s { linksFound' = linksFound' s + numUrls }
```

```
> seenURI' :: MonadState JobState' m => URL' -> m Bool
> seenURI' url = do
> (not . Set.member url) <$> gets linksSeen'
```

```
> insertURI' :: MonadState JobState' m => URL' -> m ()
> insertURI' url = do
> modify $ \s -> s { linksSeen' = Set.insert url (linksSeen' s) }
```

```
> enqueueTasks' :: (MonadState JobState' m, MonadIO m) => [URL'] -> m ()
> enqueueTasks' urls = do
> queue <- gets linkQueue'
> liftIO . atomically $ mapM_ (writeTChan queue . Check') urls
```

The first solution - no `ZipConduit`s involved - looks like this

```
> checkURLs' :: FilePath -> Job' ()
> checkURLs' fp =
>   Job' $
>     runConduitRes $ sourceFileBS fp
>       .| mapC extractLinks'
>       .| filterMCE seenURI'
>       .| mapMCE insertURI'
>       .| mapMC enqueueTasks'
>       .| mapM_C (updateStats' . length)
>
> updateStats' :: MonadState JobState' m => Int -> m ()
> insertURI' :: MonadState JobState' m => URL' -> m URL'
> enqueueTasks' :: (MonadState JobState' m, MonadIO m) => [URL'] -> m [URL']
```

The second solution uses two `ZipConduit`

s

```
> checkURLs' :: FilePath -> Job' ()
> checkURLs' fp =
>   Job' $
>     runConduitRes $ sourceFileBS fp
>       .| mapC extractLinks'
>       .| setupJob
>   where
>     setupJob :: Consumer [URL'] (ResourceT (StateT JobState' IO)) ()
>     setupJob =
>       getZipConduit $
>         ZipConduit insertAndEnqueue
>         *> ZipConduit (mapM_C (updateStats' . length))
```

```
> insertAndEnqueue :: Consumer [URL'] (ResourceT (StateT JobState' IO)) ()
> insertAndEnqueue =
>   filterMCE seenURI' .| (getZipConduit $
>     ZipConduit (mapM_CE insertURI')
>     <* ZipConduit (mapM_C enqueueTasks'))
```

Finally the third solution uses one `ZipConduit` and modifies `insertURI'` to return a `URL'` so that the stream can “keep flowing down”

```
> checkURLs' :: FilePath -> Job' ()
> checkURLs' fp =
>   Job' $
>     runConduitRes $ sourceFileBS fp
>       .| mapC extractLinks'
>       .| setupJob
>   where
>     setupJob :: Consumer [URL'] (ResourceT (StateT JobState' IO)) ()
>     setupJob =
>       getZipConduit $
>         ZipConduit (filterMCE seenURI'
>                      .| mapMCE insertURI'
>                      .| mapM_C enqueueTasks')
>         *> ZipConduit (mapM_C (updateStats' . length))
>
> insertURI' :: MonadState JobState' m => URL' -> m URL'
```

The type signatures of most of these helper functions are slightly different - namely more general: why is this needed? If the type signature of `updateStats'` was `updateStats' :: Int -> Job' ()` the compiler would complain with the following error: `Couldn't match type ‘Job’ with ‘ResourceT (StateT JobState IO)’`. It took me a bit to fix this and make the compiler happy, and again I’m not entirely sure it’s the best way of solving the issue, but it works. My first try was - following compiler errors - to make `Job'` an instance of `MonadThrow` and `MonadBase`, but I stopped before implementing an instance for `MonadBaseControl` since it couldn’t be derived automatically and I was under the impression that it was too much of a hassle given that `Job'` is just a `newtype` wrapper for `StateT`, which is already an instance of `MonadBaseControl`. If I could take the `StateT` transformer and then just wrap it in a `Job'` constructor then that would do the job…and that’s made possible by modifying the type signatures of those functions. Actually, had I just let type inference do its job it would have inferred the types correctly, but I’m used to writing type signatures first and then writing an implementation, and that bit me this time.
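The trick in isolation: a function written against the `MonadState` class rather than a concrete monad runs both in the bare `StateT` stack that the conduit machinery needs and inside the newtype wrapper. A tiny self-contained illustration (hypothetical names, using a counter instead of `JobState'`):

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.State
import Control.Monad.IO.Class (MonadIO)

-- A newtype wrapper around StateT, in the style of `Job'`
newtype Counter a = Counter { runCounter :: StateT Int IO a }
  deriving (Functor, Applicative, Monad, MonadState Int, MonadIO)

-- Written against `MonadState Int m`, so it is not tied to `Counter`...
bump :: MonadState Int m => m ()
bump = modify (+ 1)

-- ...and can be used directly in `StateT Int IO`...
inStateT :: StateT Int IO ()
inStateT = bump >> bump

-- ...or wrapped back into the newtype afterwards, which is what the more
-- general signatures of `updateStats'` and friends make possible.
inCounter :: Counter ()
inCounter = Counter inStateT >> bump
```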

To check that all this works as expected, let’s try it out in GHCI

```
ƛ queue <- newTChanIO :: IO (TChan Task')
queue :: TChan Task'
ƛ let job = checkURLs' "urls.txt" -- urls.txt contains a list of urls
job :: Job' ()
ƛ st <- execStateT (runJob' job) (JobState' Set.empty 0 queue)
st :: JobState'
ƛ :m +Control.Exception
ƛ assert (linksFound' st > 0) ("Found " ++ show (linksFound' st) ++ " links")
"Found 2 links"
ƛ assert (Set.size (linksSeen' st) > 0) ("Seen " ++ show (Set.size (linksSeen' st)) ++ " links")
"Seen 3 links"
ƛ emptyQueue <- atomically $ isEmptyTChan queue
emptyQueue :: Bool
ƛ assert (not emptyQueue) "Queue not empty"
"Queue not empty"
```

In this post I showed how to refactor a piece of code using lazy IO into one using the conduit library, in order to write a little program that reads data from files efficiently in terms of memory usage; I illustrated some of the challenges I faced while doing that and explained some of the possible solutions I found.

`foldl`

to sum a list of ints, with the result that instead of returning a result in constant space, it ends up taking an outrageous amount of memory before producing an answer because thunks pile up (this behaviour is known as a space leak). Most of the time I personally find it tricky to add strictness to a piece of Haskell code, so I’d like to share my latest experience doing that.
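The classic fix for that example is the strict left fold `foldl'` from `Data.List`, which forces the accumulator at every step instead of building a chain of thunks:

```haskell
import Data.List (foldl')

-- `foldl (+) 0 xs` builds nested thunks ((0 + 1) + 2) + ... and only
-- evaluates them at the very end; `foldl'` evaluates the accumulator at
-- each step, so it runs in constant space.
sumLazy, sumStrict :: [Int] -> Int
sumLazy   = foldl  (+) 0
sumStrict = foldl' (+) 0
```

Both compute the same result; the difference only shows up in memory behaviour on large inputs.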
We’ll be using the Bloom filter implemented in chapter 26 of Real World Haskell as an example; the version contained in the book creates the filter lazily, and our goal will be to create a strict version of that particular piece of code. In a nutshell, a Bloom filter is a probabilistic data structure that consists of several hash functions and a bit array, and whose API allows only insertion and membership querying. The latter might return false positives, with an expected error rate decided when the filter is instantiated. Here’s a function that builds a Bloom filter lazily:

```
-- file: BloomFilter/BloomFilter.hs
import BloomFilter.Immutable as B (IBloom, fromList)
import BloomFilter.Hash (Hashable, doubleHash)
import Data.List (genericLength)
import Data.Either (either)
mkFromList :: Hashable a => Double -> [a] -> Either String (B.IBloom a)
mkFromList errRate xs =
  either Left (Right . mkBFilt) $ suggestSizing (genericLength xs) errRate
  where
    mkBFilt (bits, numHashes) =
      B.fromList (doubleHash numHashes) bits xs
```

The function `suggestSizing`

provides the optimal size of the underlying array and the number of hashes to generate given the length of the input list and the desired rate of false positives, but it’s not important for the topic of this article. Let’s try this code out in GHCI:

```
ƛ :set +s -- to print timing/memory stats after each evaluation
ƛ :load BloomFilter.BloomFilter
ƛ let ebf = mkFromList 0.01 ([1..10^6] :: [Int])
ebf :: Either String (B.IBloom Int)
(0.01 secs, 4658656 bytes)
```

The fact that `ebf`

has not been fully evaluated should be clear since the evaluation took almost no time, but let’s ask GHCI for help:

```
ƛ :print ebf -- prints a value without forcing its computation
ebf = (_t2::Either String (B.IBloom Int))
```

GHCI is telling us that `ebf` is a thunk `_t2` of type `Either String (B.IBloom Int)`. If we’re still not convinced that `ebf` is not evaluated, we can ask it if an element is contained in the Bloom filter:

```
ƛ either (const False) (1 `B.elem`) ebf
True
it :: Bool
(19.44 secs, 13818404512 bytes)
ƛ either (const False) (11 `B.elem`) ebf
True
it :: Bool
(0.01 secs, 3118248 bytes)
```

From the timing/memory information it should be pretty clear now that the evaluation was forced when we explicitly asked for a membership test. That is expected given Haskell’s non-strict semantics. If we ask GHCI to give us information about `ebf` we can see that it now gives us a different answer:

```
ƛ :print ebf
ebf = Right
(B.IB
(_t3::Int -> [Word32])
(Data.Array.Base.UArray
(GHC.Word.W32# 0) (GHC.Word.W32# 9592954) 9592955
(_t4::ghc-prim-0.5.0.0:GHC.Prim.ByteArray#)))
```

Let’s not focus on the types - again, not important - GHCi is showing us the value of `ebf` after evaluation. We’d like to force this evaluation *before* the first time the Bloom filter is used, namely when it is created.

In GHCi we **always** need to explicitly specify type annotations for bindings that need to be forced, otherwise the interpreter will infer the most general type and won’t force the evaluation of the term. This is related to “the dreaded monomorphism restriction”.

There are various ways of forcing evaluation in Haskell. The main ones are `seq`, `deepseq` and `rnf` (the last two can be found in the `Control.DeepSeq` module and require the argument to be an instance of the `NFData` type class), plus the handy `BangPatterns` extension, which is syntactic sugar for `seq`. As a first try, let’s force the evaluation of `ebf` using bang patterns and see what happens:

```
ƛ :set -XBangPatterns
ƛ let !ebf' = mkFromList 0.01 ([1..10^6] :: [Int])
ebf' :: Either String (B.IBloom Int)
(0.34 secs, 197720920 bytes)
ƛ :print ebf'
ebf' = Right (_t5::B.IBloom Int)
```

That did something: it evaluated `ebf'` a bit, so we now know that the construction of the Bloom filter succeeded. But did we manage to instantiate it? Carefully reading the output of GHCi suggests we’re not quite there yet, but let’s double check:

```
ƛ either (const False) (11 `B.elem`) ebf'
True
it :: Bool
(19.02 secs, 13624548640 bytes)
```

The membership test still took 19 seconds, as we expected. So what’s happening here? This is probably a good point to introduce some terminology that will help us understand what’s going on and how to move forward.

A reducible expression (or redex) is an expression that can be evaluated until a value is obtained: e.g. `let x = 1 + 5` is a redex, since it can be evaluated to obtain `let x = 6`. Let’s double check it in GHCi:

```
ƛ let x = 1 + 5 :: Int
x :: Int
ƛ :print x
x = (_t6::Int)
ƛ let !x = 1 + 5 :: Int
x :: Int
ƛ :print x
x = 6
```

At this point `x` cannot be further evaluated and is said to be in normal (or canonical) form. Now what about an expression like `Right (1 + 5)`? It’s clearly not in normal form, so can we just force evaluation by adding a bang pattern? Let’s see if that works:

```
ƛ let !x = Right (1 + 5) :: Either a Int
x :: Either a Int
ƛ :print x
x = Right (_t8::Int)
```

What’s happening here?! It turns out that an expression in Haskell can be in another form, called weak head normal form (WHNF), when it is not a redex itself and further evaluation of its sub-expressions cannot make it one. `Right (1 + 5)` isn’t a redex (`Right` is a constructor of the `Either` type) and it cannot be made one even if the sub-expression `1 + 5` is evaluated. Does that mean we have to unwrap the sub-expression in order to evaluate it? Not necessarily. We have a few options, namely forcing the evaluation of the sub-expression before we wrap it:

```
ƛ let !x = let !y = 1 + 5 :: Int in Right y
x :: Either a Int
ƛ :print x
x = Right 6
```

or leveraging some of the functions in the `Control.DeepSeq` module:

```
ƛ let !x = let x = Right (1 + 5) :: Either a Int in x `deepseq` x
x :: Either a Int
ƛ :print x
x = Right 6
ƛ let !x = Right (1 + 5) :: Either a Int
x :: Either a Int
ƛ rnf x
()
it :: ()
ƛ :print x
x = Right 6
```

`deepseq` is like `seq` on steroids: it reduces an expression and all its sub-expressions to normal form (`rnf`, which stands for “reduce to normal form”, does exactly the same). Again, keep in mind that in order to use these two functions the argument must be an instance of `NFData` (Normal Form Data).
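A small self-contained example of the WHNF/NF difference (the `Pair` type is made up for illustration; deriving `NFData` via `Generic` relies on the `deepseq` package that ships with GHC):

```haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}
import Control.DeepSeq (NFData, deepseq)
import GHC.Generics (Generic)

-- Deriving Generic + NFData gives us deepseq/rnf on our own types for free.
data Pair = Pair Int Int deriving (Generic, NFData)

main :: IO ()
main = do
  let p = Pair 42 undefined
  -- `seq` only reaches WHNF (the Pair constructor), so the undefined
  -- field is never touched and this line prints fine:
  p `seq` putStrLn "seq: stopped at WHNF"
  -- `deepseq` would recurse into every field and crash on undefined:
  -- p `deepseq` putStrLn "never printed"
```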

A more in-depth explanation, a bunch of very informative links and more examples can be found in Stephen Diehl’s What I Wish I Knew When Learning Haskell.

Now that we’re aware of all this, let’s create a strict version of our `mkFromList` function and call it `mkFromList'` (following the convention of functions like `foldr'`). The first function we need to change is `(Right . mkBFilt)`: this is equivalent to `\x -> Right (mkBFilt x)` (by eta-expansion) and to `\pair -> let bfilt = mkBFilt pair in Right bfilt` if we massage the lambda a bit. Here `bfilt` needs to be evaluated, so the easiest thing to do is again to add a bang pattern: `\pair -> let !bfilt = mkBFilt pair in Right bfilt`. A quick note about point-free style: adding strictness is a bummer in that respect. Let’s have a look at the following code:

```
-- file: BloomFilter/BloomFilter.hs
mkFromList' errRate xs =
    either Left (Right . mkBFilt') $ suggestSizing (genericLength xs) errRate
  where
    mkBFilt' (bits, numHashes) =
        let !bfilt = B.fromList (doubleHash numHashes) bits xs in bfilt
```

By eta-expanding `(Right . mkBFilt')` we obtain `\pair -> Right (mkBFilt' pair)`, where `mkBFilt' pair` is wrapped in `Right` as an unevaluated thunk, so the bang pattern inside `mkBFilt'` never fires until the filter is used. Are we done yet? Almost. Let’s have a look at the type of `ebf'` again: `Either String (B.IBloom Int)`. What’s `IBloom` (the ‘I’ stands for “immutable”)? Here’s how it’s defined:

```
-- file: BloomFilter/Internals.hs
data IBloom a =
    IB { hash  :: a -> [Word32]
       , array :: UArray Word32 Bool
       }
```

This closely reflects the definition of a Bloom filter: we have a function that returns a list of hashes for a given value, and an array of bits. Keeping in mind that a constructor is also a function, we may notice that there is still something we need to force: the `array` field. To do this, let’s write a strict version of `mkBFilt`, this time using `seq` for a change:

```
-- file: BloomFilter/BloomFilter.hs
mkBFilt' (bits, numHashes) =
    let bfilt = B.fromList (doubleHash numHashes) bits xs
    in  array bfilt `seq` bfilt
```
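As an aside, `seq` on a field and a bang pattern in a pattern match achieve the same thing; here is a self-contained sketch with a made-up `Box` type (the third option shown, a strictness annotation in the data declaration itself, forces the field on every construction):

```haskell
{-# LANGUAGE BangPatterns #-}

data Box = Box { payload :: Int }

-- Force the field with `seq`...
forceSeq :: Box -> Box
forceSeq b = payload b `seq` b

-- ...or, equivalently, with a bang pattern when matching on the constructor.
forceBang :: Box -> Box
forceBang b@(Box !_) = b

-- Or make the field strict in the declaration itself.
data SBox = SBox !Int

main :: IO ()
main = do
  print (payload (forceSeq (Box (1 + 5))))   -- 6
  print (payload (forceBang (Box (1 + 5))))  -- 6
```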

Equivalently, we could have pattern-matched on `bfilt` and used a bang pattern on its `array` field. The final version of our `mkFromList'` function looks something like this:

```
-- file: BloomFilter/BloomFilter.hs
mkFromList' :: Hashable a => Double -> [a] -> Either String (B.IBloom a)
mkFromList' errRate xs =
    either Left rightBFilt' $ suggestSizing (genericLength xs) errRate
  where
    rightBFilt' x = let !bfilt = mkBFilt' x in Right bfilt
    mkBFilt' (bits, numHashes) =
        let bfilt = B.fromList (doubleHash numHashes) bits xs
        in  array bfilt `seq` bfilt
```

Let’s test it in GHCI:

```
ƛ let !ebf'' = mkFromList' 0.01 ([1..10^6] :: [Int])
ebf'' :: Either String (B.IBloom Int)
(19.29 secs, 13819004104 bytes)
ƛ :print ebf''
ebf'' = Right
(B.IB
(_t1::Int -> [Word32])
(Data.Array.Base.UArray
(GHC.Word.W32# 0) (GHC.Word.W32# 9592954) 9592955
(_t2::ghc-prim-0.5.0.0:GHC.Prim.ByteArray#)))
```

And YES! We finally managed to fully evaluate our Bloom filter before its first use in our code.
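The overall pattern we ended up with can be reproduced on a self-contained stand-in (all names made up; `BF` plays the role of `IBloom`, with a function field and an unboxed array field):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Array.Unboxed (UArray, listArray, (!))

-- Stand-in for IBloom: a hash function plus an unboxed bit array.
data BF = BF { hashFn :: Int -> Int, arr :: UArray Int Bool }

mkBF :: Int -> BF
mkBF n = BF (`mod` n) (listArray (0, n - 1) (replicate n False))

-- Force the array field before wrapping in Right, as in mkFromList'.
-- For an unboxed array, WHNF means the array is fully built.
mkStrict :: Int -> Either String BF
mkStrict n
  | n <= 0    = Left "size must be positive"
  | otherwise = let bf = mkBF n
                    !_ = arr bf  -- equivalent to: array bf `seq` ...
                in Right bf

main :: IO ()
main = case mkStrict 8 of
  Left err -> putStrLn err
  Right bf -> print (arr bf ! hashFn bf 11)  -- 11 `mod` 8 = 3; prints False
```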

- There are multiple ways to introduce strictness in Haskell code: `seq`, the `BangPatterns` extension or the functions in the `Control.DeepSeq` module
- Using GHCi, and leveraging the `:print` command and the `+s` flag, can help us understand how our code is evaluated while developing
- Keep in mind the difference between NF and WHNF: if forcing an expression seems to have no effect, some sub-expression is probably still a thunk behind a value in WHNF
- Carefully analyse our code to identify where strictness needs to be added