When programming in Haskell, I rely a lot on the compiler. It catches
many errors caused by inattention, which occur especially often
when refactoring large portions of code. In some sense the famous “it compiles,
therefore it works” mantra is true (assuming that you don’t have logical
errors). The problem with this approach, for me, was for a long time that
I ran the compiler manually and then read its output. The output was usually
quite long, and I had to start fixing errors from the top, otherwise
I might end up fixing the wrong error. It was a tedious routine, even with the help of
bash scripts and file watchers, but thanks to ghcid, I am free from it now.
ghcid can be interpreted as “GHCi daemon” or “GHC with a bit of IDE”. It is a
rather small program which starts GHCi, loads your modules there, evaluates
some Haskell expression in the session, watches your source code, and fires :reload and
reevaluates the expression once something changes. It also captures warnings and
errors produced by the compiler and presents them so that you can see the
topmost ones right away, without scrolling the whole output to the top. Last but
not least, it plays really well with stack, which is the de facto
standard tool for building and testing Haskell projects nowadays.
How to use it
First, install it:
$ stack install ghcid
Now you can run it with stack exec. At this point I usually make a bash script
and tune ghcid the way I need it for my application.
The first thing we need to set up is the expression that will be evaluated in the GHCi
session. You can do it with the -T flag. If you want to run something different
from GHCi, you can pass the command to the -c flag. By default ghcid detects
whether you’re using stack or plain cabal and runs either stack repl
or cabal repl respectively.
Another thing I usually turn on for my projects is the -Wall -Werror pair of
compiler flags. It catches errors such as redundant imports, orphan instances and unused
variables, and helps you keep the codebase clean. However, when working in REPL
mode, you’d usually concentrate on the big picture and leave those warnings to
be fixed later, e.g. before committing the code. When the -W option is passed, ghcid
doesn’t take those errors into account and evaluates the given expression anyway.
By default ghcid watches only those source files that were loaded by GHCi.
Sometimes that is not enough. For example, right now I’m working on an application
which has Hamlet templates in the templates directory, and I’d like to reload the
app each time one of the templates changes. Luckily, it’s easy to make ghcid do
so with the --restart=<PATH> or --reload=<PATH> options. The first one
rebuilds the project and restarts the whole GHCi process, while the second one
just fires the :reload command to the existing GHCi process whenever any of the files
and directories in <PATH> changes.
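Putting the flags above together, an invocation from such a bash script might look like the following (a sketch for a hypothetical project: the DevelMain.update expression and the templates path are assumptions of mine, not something from a real codebase):

$ ghcid -c "stack repl" -T "DevelMain.update" -W --reload=templates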
Issues and ways to improve
One thing that would be nice to have is error highlighting. Sure, errors are
printed in a bold font now, but having line numbers and error messages in
different colors would help even more. Right now it is possible to achieve this by
running ghcid inside a terminal emulator in Neovim or Emacs.
Currently, highlighting is the only improvement that comes to my mind. Overall, ghcid
is a great development tool, and even minor bugs (usually caused by GHCi) can’t
really outweigh its advantages. Definitely a must-have.
This article will cover several sorting algorithms: ones working in quadratic time
(bubble, insertion and selection sort), in loglinear time (merge sort, quicksort and
heap sort), and in linear time, but only for special data types (counting and
radix sort). Source code of the examples used in this article is available on
Quadratic time algorithms
There is not much to say about quadratic time algorithms. Usually they are not
used on their own; some loglinear algorithm is used instead. For example,
insertion sort is sometimes used inside merge sort or quicksort to sort small
subarrays. As for bubble sort, it is usually considerably slower than other
quadratic time sorting algorithms and is not used in practice.
Idea: repeatedly iterate over the array, comparing items pairwise and
exchanging them if they are out of order. It is easy to notice that on each
pass the biggest remaining item ends up at the end of the array, thus at
most n iterations are needed to put all items in their places. This algorithm is
stable.
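As a sketch, the idea above can be written in Java like this (the class and method names are mine, not the article’s):

```java
// Hypothetical sketch of bubble sort on an int array.
class BubbleSort {
    static void sort(int[] a) {
        // After pass i, the largest remaining item has bubbled up to index n-1-i.
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j + 1 < a.length - i; j++) {
                if (a[j] > a[j + 1]) {          // out of order: exchange the pair
                    int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                }
            }
        }
    }
}
```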
Idea: iterate over the array keeping its left side sorted. When the next item is
encountered, push it to the left until the item before it is less than or equal
to it. There are two loops, each of at most n iterations, thus quadratic time.
This algorithm is stable.
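A minimal Java sketch of this idea (identifiers are mine):

```java
// Hypothetical sketch of insertion sort.
class InsertionSort {
    static void sort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            // Shift the larger items of the sorted left side one step right.
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;  // drop the item into its place
        }
    }
}
```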
Idea: iterate over all positions in the array. For each position i, search for
the minimal item in the range [i, n] and put it at position i. For each
iteration, searching for the minimal item takes O(n) time, giving
quadratic time overall. This algorithm can be implemented as stable;
however, the implementation below is not.
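A Java sketch of the idea (names are mine); note that the long-distance swap at the end of each iteration is exactly what breaks stability:

```java
// Hypothetical sketch of selection sort.
class SelectionSort {
    static void sort(int[] a) {
        for (int i = 0; i < a.length; i++) {
            int min = i;
            // Find the minimal item in [i, n) ...
            for (int j = i + 1; j < a.length; j++)
                if (a[j] < a[min]) min = j;
            // ... and put it at position i (this swap can jump over equal items).
            int t = a[i]; a[i] = a[min]; a[min] = t;
        }
    }
}
```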
Loglinear time algorithms
Loglinear algorithms are the ones that are widely used in practice. It is proven
that it is not possible to sort arbitrary data using comparisons faster than
O(nlog(n)) time, so these algorithms are asymptotically the fastest ones. Which
algorithm to use in a program, however, depends on its goals and constraints.
Idea: (top-down merge sort) split the array in halves, sort each half
recursively and then merge them. By the master theorem, T(n) = 2T(n/2) + O(n)
= O(nlog(n)). This algorithm is stable.
Bottom-up merge sort treats an n-element array as n singleton arrays. It merges
them pairwise, producing n/2 two-element arrays, then merges them again and again
until it merges two n/2-element arrays into one. It can be implemented without
recursion.
The implementation below uses O(n) space for a buffer array which is used
for merging two subarrays. It is the standard recursive top-down merge sort.
Merge sort is an easily parallelizable algorithm. It is also more efficient than
quicksort on linked lists, where it can be implemented using only constant extra space.
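The top-down variant with an O(n) buffer described above might be sketched in Java like this (identifiers are mine):

```java
// Hypothetical sketch of top-down merge sort with a shared scratch buffer.
class MergeSort {
    static void sort(int[] a) {
        sort(a, new int[a.length], 0, a.length);
    }

    // Sorts the half-open range [lo, hi) using buf as scratch space.
    private static void sort(int[] a, int[] buf, int lo, int hi) {
        if (hi - lo < 2) return;
        int mid = (lo + hi) / 2;
        sort(a, buf, lo, mid);
        sort(a, buf, mid, hi);
        // Merge the two sorted halves into buf, then copy back.
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi)
            buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++]; // <= keeps the sort stable
        while (i < mid) buf[k++] = a[i++];
        while (j < hi)  buf[k++] = a[j++];
        System.arraycopy(buf, lo, a, lo, hi - lo);
    }
}
```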
Idea: split the array in two parts by comparing each element to a pivot, then
recursively sort these parts. Because quicksort’s behaviour depends on the contents of
the array, it is possible to construct an array that forces this algorithm to
work in quadratic time. Though this problem is easily mitigated by choosing a random
pivot, to be precise it is necessary to say that this algorithm has
O(nlog(n)) time complexity in the average case and O(n^2) time complexity
in the worst case. Space complexity is O(log(n)) in the average case and O(n)
in the worst case, since this algorithm is recursive.
Another delicate part of quicksort is how to partition the array in constant space.
To do that, create a hiStart pointer which stores the index of the
beginning of the second part of the array. Then iterate over the array comparing
items to the pivot: if the current item is less than the pivot, swap it with the item at
hiStart and increment hiStart.
To speed up quicksort on arrays with repeated items, one could implement
splitting into three parts: less than, equal to and greater than the pivot. It
requires one more pointer, loEnd, which stores the index of the end of the
“less than” part.
The implementation below chooses a random pivot and uses two-way partitioning. It is
not stable; it is possible to make quicksort stable, but that
requires additional O(n) space. This implementation uses O(1) additional space
(not counting the recursion stack).
Quicksort has better constant factors than merge sort and heap sort. When
implemented well, it can outperform its competitors drastically. It is possible
to parallelize quicksort, but it is a bit harder to do than in the case of merge
sort.
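A sketch of the partitioning scheme described above, with a random pivot (identifiers are mine, following the hiStart naming from the text):

```java
import java.util.Random;

// Hypothetical sketch of quicksort with random pivot and two-way partitioning.
class QuickSort {
    private static final Random RND = new Random();

    static void sort(int[] a) { sort(a, 0, a.length - 1); }

    private static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        // Choose a random pivot and move it to the end of the range.
        swap(a, lo + RND.nextInt(hi - lo + 1), hi);
        int pivot = a[hi];
        // hiStart marks where the ">= pivot" part of the range begins.
        int hiStart = lo;
        for (int i = lo; i < hi; i++)
            if (a[i] < pivot) swap(a, i, hiStart++);
        swap(a, hiStart, hi);  // put the pivot between the two parts
        sort(a, lo, hiStart - 1);
        sort(a, hiStart + 1, hi);
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```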
Idea: build a heap on the input array, then iteratively take the minimal element out
of it until no items are left. Since building a heap has O(n) time
complexity and retrieving the minimal element from it requires O(log(n)) time,
the overall time complexity of this sorting algorithm is O(nlog(n)). This
algorithm is not stable.
This idea can be applied to any data structure which supports building in
O(nlog(n)) or faster and retrieving the minimum/maximum in O(log(n)). A heap
is preferred because it can be built on the input array itself,
thus spending only O(1) extra space.
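An in-place heap sort along these lines might be sketched as follows (a max heap is used so the array ends up sorted ascending; identifiers are mine):

```java
// Hypothetical sketch of in-place heap sort using a max heap.
class HeapSort {
    static void sort(int[] a) {
        int n = a.length;
        // Build a max heap in place: O(n).
        for (int i = n / 2 - 1; i >= 0; i--) bubbleDown(a, i, n);
        // Repeatedly move the maximum to the end and shrink the heap.
        for (int end = n - 1; end > 0; end--) {
            int t = a[0]; a[0] = a[end]; a[end] = t;
            bubbleDown(a, 0, end);
        }
    }

    // Restore the heap property for node i in the heap occupying a[0..n).
    private static void bubbleDown(int[] a, int i, int n) {
        while (true) {
            int l = 2 * i + 1, r = 2 * i + 2, largest = i;
            if (l < n && a[l] > a[largest]) largest = l;
            if (r < n && a[r] > a[largest]) largest = r;
            if (largest == i) return;
            int t = a[i]; a[i] = a[largest]; a[largest] = t;
            i = largest;
        }
    }
}
```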
Linear time algorithms
Linear time sorting algorithms do not compare elements, or they would work in
O(nlog(n)). Instead, they either impose some constraints on the input or use
some knowledge about the items of the array. They are usually used inside more
complex algorithms to cut some time here and there, e.g. when building suffix
arrays or sorting data on parallel clusters.
Assumption: all items of the input array belong to some finite set of items
or can be uniquely associated with a key from this finite set. Denote the
size of this set as k.
Idea: iterate over the input array counting how many times each distinct item
occurs, then use this information to put items in their places. To store the
occurrences we need an array of integers of length k and an output array of length
n, thus we need O(n+k) space. Both of these arrays are traversed during the
execution of the algorithm, so its time complexity is O(n+k) too.
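A minimal counting sort for integers known to lie in [0, k) could look like this (this simple variant rebuilds the values from their counts; a stable prefix-sum variant is used instead when sorting key-value records):

```java
// Hypothetical sketch of counting sort for values in [0, k).
class CountingSort {
    // O(n + k) time and space.
    static int[] sort(int[] a, int k) {
        int[] count = new int[k];
        for (int x : a) count[x]++;            // count occurrences of each value
        int[] out = new int[a.length];
        int pos = 0;
        for (int v = 0; v < k; v++)            // emit each value count[v] times
            for (int c = 0; c < count[v]; c++)
                out[pos++] = v;
        return out;
    }
}
```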
Assumption: each item of the input array can be represented in positional
notation, i.e. as a collection of characters from a finite alphabet of size k,
where the weight of each character depends on its position within
the collection. Examples of positional notation are binary and decimal numbers.
Idea: sort the input items by a single character at a fixed position, for all
possible positions: e.g. sort by the 1st character, then by the 2nd, and continue until
all m positions have been used. Sorting at a given position
can be performed using counting sort, since the alphabet is finite; therefore
it takes O(m(n+k)) time and O(mn) space to sort the whole array.
Depending on the starting position, radix sort can be LSD (least significant
digit first) or MSD (most significant digit first). The first one sorts in
representation order (correct for decimals), the second one sorts
in lexicographic order. The implementation below is LSD radix sort.
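As a sketch, LSD radix sort for non-negative integers can be built from a stable counting pass per decimal digit (identifiers are mine):

```java
// Hypothetical sketch of LSD radix sort for non-negative ints, base 10.
class RadixSort {
    static void sort(int[] a) {
        int max = 0;
        for (int x : a) max = Math.max(max, x);
        // One stable counting pass per decimal digit, least significant first.
        for (int exp = 1; max / exp > 0; exp *= 10) {
            int[] count = new int[10];
            for (int x : a) count[(x / exp) % 10]++;
            // Prefix sums turn counts into ending positions; walking the
            // input backwards keeps the pass stable.
            for (int d = 1; d < 10; d++) count[d] += count[d - 1];
            int[] out = new int[a.length];
            for (int i = a.length - 1; i >= 0; i--)
                out[--count[(a[i] / exp) % 10]] = a[i];
            System.arraycopy(out, 0, a, 0, a.length);
        }
    }
}
```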
Algorithms comparison chart
Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms
This article will cover another bunch of must-know data structures: heaps,
priority queues, associative arrays and hash tables. Source code of the examples
used in this article is available on
A heap is a tree-like data structure satisfying the following property: the
key of a parent node is ordered with respect to the keys of its child nodes. According
to the order, heaps are divided into max heaps (the parent’s key is greater than
the child’s) and min heaps (the parent’s key is less than the child’s).
According to the structure of the underlying tree, heaps can be binary, d-ary,
binomial, Fibonacci and many others. The following interface is used to
describe heaps:
Heaps are usually used to implement priority queues, which in turn are used in some
graph algorithms (Prim’s, Dijkstra’s). The heapsort sorting algorithm is another
example of heap usage.
As far as I know, there is no dedicated heap data structure in the Java
and C++ standard libraries. However, there is the std::make_heap function in the
C++ standard library, which rearranges the elements of a given container so that
they form a heap. As for Java, there is the PriorityQueue class, which uses a
heap under the hood.
Binary min heap implementation
Idea: this implementation does not have an explicit tree structure. Instead, an
array is used to store the nodes. The index of a given node is used to identify its
parent and children (the getParent, getLeftChild and getRightChild methods). The
image below gives an example of how to store a binary heap of 7 items in an array.
The most important part of the heap implementation is the methods for maintaining
the heap property. In the case of a binary heap there are two such methods: bubbleUp and
bubbleDown. bubbleUp compares a given element with its parent and, if they
violate the heap property, exchanges them and moves up the tree. bubbleDown
does the same thing, only for the children of the given element, and moves down
the tree. The first method is used to implement insert: we put the new item at the
end of the array and run bubbleUp on it in order to put it in its place. The
second method is used to implement popMinimum: we exchange the top item with
the last one and run bubbleDown on the top item in order to put it in its place.
The last thing to discuss is how to heapify a given array efficiently. The naive
approach is to create an empty heap and perform n insertions, which gives us
O(nlog(n)) time and O(n) space. However, it is possible to do it in O(n) time
and O(1) space using the bubbleDown method: just iterate over the array in reverse
order and run bubbleDown on each item.
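To make this concrete, here is a compact min-heap sketch showing bubbleDown-based heapification and popMinimum (my own simplified version, not the article’s implementation: it stores ints, heapifies the given array in place, and omits insert and bounds checks):

```java
// Hypothetical sketch of a binary min heap built on an existing array.
class MinHeap {
    private final int[] items;
    private int size;

    // Heapify the array in O(n) by running bubbleDown in reverse order.
    MinHeap(int[] a) {
        items = a;
        size = a.length;
        for (int i = size - 1; i >= 0; i--) bubbleDown(i);
    }

    boolean isEmpty() { return size == 0; }
    int getMinimum() { return items[0]; }

    int popMinimum() {
        int min = items[0];
        items[0] = items[--size];  // move the last item to the top...
        bubbleDown(0);             // ...and restore the heap property
        return min;
    }

    private void bubbleDown(int i) {
        while (true) {
            int l = 2 * i + 1, r = 2 * i + 2, smallest = i;
            if (l < size && items[l] < items[smallest]) smallest = l;
            if (r < size && items[r] < items[smallest]) smallest = r;
            if (smallest == i) return;
            int t = items[i]; items[i] = items[smallest]; items[smallest] = t;
            i = smallest;
        }
    }
}
```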
isEmpty and getMinimum have O(1) time complexity. Because both bubbleUp
and bubbleDown take O(log(n)) time, the insert and popMinimum methods have
the same time complexity. As for the Heap(int) ctor: though it is just n runs
of bubbleDown, which should give us O(nlog(n)) time, it actually works
faster than that, because most of the heapification takes place in the lower levels.
A complete proof of its time complexity can be found on
A priority queue is an abstract data structure similar to a stack or queue, but the
order in which items are retrieved from it depends on the items’ priorities. Usually
the item with the highest priority is served first.
Priority queues are applicable in many popular algorithms: Dijkstra’s, Prim’s, Huffman
encoding, etc. There are implementations of priority queues in C++ and Java:
the std::priority_queue and PriorityQueue classes respectively.
Implementation of priority queue using binary min heap
No additional comments, as the implementation is quite straightforward.
An associative array, or map, is an abstract data structure. It stores key-value
pairs in such a way that each possible key appears at most once. To describe an
associative array the following interface is used:
Associative arrays are applicable to a great variety of computing tasks: removing
duplicates, counting items, etc. There are even special databases called
“key-value storages” which store key-value pairs either in memory or on disk
(Redis, Memcached, etc.).
A hash table is one of the data structures implementing an associative array. It uses
an array to store key-value pairs and a special hash function to compute the index of
a particular pair from its key. Usually it performs better than a binary search tree
as a map implementation (both put and get work in O(1) amortized time,
compared to O(log(n)) in a BST), but this is highly dependent on choosing the
right hash function, load factor and collision resolution method.
Hash tables are used to implement maps, sets, database indexes and caches. They
are available in almost any modern programming language. Some languages provide
them as a syntactic construct (Python, Ruby); others have specific classes in their
standard libraries (std::unordered_map in C++ and HashMap in Java).
Implementation of hash table
Idea: a hash table is always implemented as an underlying array containing
entries. Depending on the collision resolution method, entries can be either
key-value pairs or chains of them. There are several methods of collision
resolution; the most popular ones are open addressing and chaining.
The open addressing method implements the following strategy: when a collision
occurs during insertion, try to find a different place in the array for the
particular key by sequentially probing other places until a free one is
found. When searching for a value, pairs are traversed in exactly the same
way. There are several ways of choosing which places to look at when searching
for a free one; the easiest method, called “linear probing”, just looks at the
place following the current one.
The chaining method solves the collision problem differently. It stores
buckets of pairs in the underlying array; when a collision occurs, the new pair is
dropped into the associated bucket. When searching for a value, the bucket is
searched. Usually buckets are implemented as linked lists of pairs. The
implementation below uses chaining.
Another important part of a hash table implementation is choosing the right hash
function. The main requirement on a hash function is uniform distribution of its
values. The uniformity is crucial; otherwise the number of collisions rises,
causing bad performance of the hash table. For a particular implementation
this condition can be loosened to uniformity at specific sizes. The implementation
below relies on the hashCode function of Java’s Object for simplicity and
performs no additional hashing. It may affect performance, as programmers writing
classes are not obliged to guarantee uniformity of hashCode.
As the number of items in a hash table grows, the chance of collision rises too;
that’s why it is necessary to occasionally resize the underlying array and re-insert
all key-value pairs. To decide when to perform this action, the
load factor indicator is used. It is calculated as the number of key-value
pairs inserted divided by the size of the underlying array. In the
implementation below, a load factor of 0.75 indicates that resizing is needed; the
number was chosen manually, though in the real world you may want to perform some
experiments to choose it more precisely.
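To make the discussion concrete, here is a minimal chaining-based map sketch with the 0.75 load factor mentioned above (my own simplified version, not the article’s code: the names, the initial capacity of 8, and the doubling resize are assumptions; deletion and iteration are omitted):

```java
import java.util.LinkedList;

// Hypothetical sketch of a hash table with chaining.
class ChainedMap<K, V> {
    private static class Entry<K, V> {
        final K key; V value;
        Entry(K k, V v) { key = k; value = v; }
    }

    private LinkedList<Entry<K, V>>[] buckets = newBuckets(8);
    private int size;

    @SuppressWarnings("unchecked")
    private static <K, V> LinkedList<Entry<K, V>>[] newBuckets(int n) {
        LinkedList<Entry<K, V>>[] b = new LinkedList[n];
        for (int i = 0; i < n; i++) b[i] = new LinkedList<>();
        return b;
    }

    // Map hashCode to a bucket index; the mask strips the sign bit.
    private int index(Object key, int capacity) {
        return (key.hashCode() & 0x7fffffff) % capacity;
    }

    V get(K key) {
        for (Entry<K, V> e : buckets[index(key, buckets.length)])
            if (e.key.equals(key)) return e.value;
        return null;
    }

    void put(K key, V value) {
        LinkedList<Entry<K, V>> bucket = buckets[index(key, buckets.length)];
        for (Entry<K, V> e : bucket)
            if (e.key.equals(key)) { e.value = value; return; }
        bucket.add(new Entry<>(key, value));
        if (++size > 0.75 * buckets.length) resize();
    }

    // Double the bucket array and re-insert every pair.
    private void resize() {
        LinkedList<Entry<K, V>>[] old = buckets;
        buckets = newBuckets(old.length * 2);
        for (LinkedList<Entry<K, V>> bucket : old)
            for (Entry<K, V> e : bucket)
                buckets[index(e.key, buckets.length)].add(e);
    }
}
```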
Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms
This article is intended to cover some basic data structures: arrays, linked
lists, stacks and queues; their applications and implementations in Java. Source
code of the implementations used in this article is available on
Abstract and concrete data structures
An abstract data structure is a mathematical model, an interface declaring
desired behaviour. A concrete data structure is an implementation of such an
interface. For example, a collection or container is an abstract data structure.
It says only that some items of a particular type are stored inside it, and
declares methods for accessing and updating them. Array and list, on the
other hand, are concrete data structures. They describe the way items are
stored (in a contiguous memory chunk versus as a chain of pointers) and implement
the methods declared in the collection’s interface.
A collection (container) is an abstract data structure. It
holds several items of some specified type T and offers methods for accessing
and updating them. The following interface will be used throughout this article
to describe containers:
An array is a container that holds a number of items in a contiguous area of
memory. In its simplest form, the static array, the capacity of the container is
set at the moment of its creation and may not be changed. Dynamic arrays,
however, can grow automatically when the current capacity is exceeded.
Arrays are used both on their own and as an underlying data structure for more
sophisticated structures, e.g. heaps, associative arrays and adjacency matrices.
Arrays are supported out of the box in the overwhelming majority of modern
programming languages. Usually static arrays are present as a syntactic feature.
Here is an example of creating a static array of 10 ints in Java:
Dynamic arrays are often shipped as utility classes of the standard library
(std::vector in C++ and ArrayList in Java).
Dynamic array implementation
Idea: let’s start with a small static array. If there is no free space left
to insert a new item, allocate a new static array with extended capacity and copy
all items from the old array into the new one. It is proven that allocating a new
array twice as large as the old one gives O(1) amortized complexity for
inserting a new item at the end of the array.
Here is a simple implementation of dynamic array in Java:
An advanced implementation can shrink the array when the number of items drops below
some point, say, half of the capacity. However, it should be implemented with
caution, because too eager behaviour can cause a drastic change in complexity.
Imagine an array which shrinks itself in half as soon as the number of items
drops below half of the capacity; when there are n/2 items and you remove one,
then insert one, and repeat these actions m times, the overall complexity will
be O(mn). It’s better to shrink the array at a lower threshold, when it is only
a third or a quarter full, for example.
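A doubling dynamic array along the lines described above might be sketched like this (int-only and without shrinking; identifiers are mine):

```java
// Hypothetical sketch of a dynamic array of ints with capacity doubling.
class DynamicArray {
    private int[] items = new int[4];
    private int size;

    int size() { return size; }
    int get(int i) { return items[i]; }

    void add(int x) {
        if (size == items.length) {
            // Doubling the capacity gives O(1) amortized appends.
            int[] bigger = new int[items.length * 2];
            System.arraycopy(items, 0, bigger, 0, size);
            items = bigger;
        }
        items[size++] = x;
    }
}
```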
A linked list is also a container. It stores its items in nodes. Each node
consists of at least two fields: a value to store and a pointer to another node.
Nodes are chained one to another and thus can be located in arbitrary places
in memory (contrary to arrays). Linked lists can be divided into several groups
according to the number of pointers in their nodes: singly-linked,
doubly-linked and multiply-linked.
Linked lists can be used either on their own or as an underlying data structure
for such ADTs as adjacency lists, stacks and queues, and for collision resolution in
hash tables.
Linked lists are present in many programming languages, for example std::list
in C++ and LinkedList in Java. Usually they are implemented as doubly-linked
lists.
Doubly-linked list implementation
Idea: the key idea of this implementation is to use head and tail sentinel
nodes. They are constant for a particular linked list object; they point to each
other when the list is empty and to the first/last node otherwise. The head
sentinel’s prev is always null, just as the tail sentinel’s next. The main
purpose of such sentinel nodes is to save us the trouble of playing around with
null checks: this way we can be sure the list always has head and tail nodes,
though they’re merely decorative.
Here is a simple implementation of doubly-linked list in Java:
Here is a complexity comparison of the dynamic array and the doubly-linked list
for all operations in the Collection interface. As you can see, the linked list
wins on prepending, while the dynamic array wins on random access.
Stacks and queues
Stacks and queues are abstract data structures that support limited access to
their elements: they allow only appending and retrieval. They are distinguished
by the retrieval order they support: a stack is a LIFO container
(“last in, first out”) and a queue is a FIFO container (“first in, first out”).
The interfaces of stack and queue are presented below:
Both stacks and queues are widely applicable in modern software engineering.
Examples of stack usage are:
Function calls are stored on a stack while a program is executed
Many parsing and expression evaluation algorithms use stacks to hold values or
currently processed lexemes
The JVM is a stack-based virtual machine: it does not use registers; instead it
keeps its operands on a stack
Examples of queue usage are:
Concurrent applications: for example, concurrent access to some resource means
that there is a queue of requests and they are processed in order
Buffering and data transfer
Usually stacks and queues are implemented as adapters over other containers,
for example linked lists or dynamic arrays. In Java, Queue is an interface
implemented by the ArrayDeque, LinkedList and other classes. Stack is
a concrete class, but it is obsolete; it is recommended to use the Deque
interface and its implementations instead, as it has the methods of both stacks
and queues.
Explanations are skipped, as the implementation is quite straightforward.
The implementation of a stack using a doubly-linked list would be exactly the same.
The complexity of operations is O(1) amortized for
the array and O(1) for the list.
Queue implementation using dynamic array
Idea: this is the standard two-pointer implementation of a queue.
Because a queue retrieves elements from its beginning, it
is inefficient to shift the whole array back each time dequeue is called.
Instead, let’s create a pointer to the location of the real beginning of the queue
and shift it forward on dequeue. On enqueue, however, let’s insert the new item
not just at the end of the array but, if possible, in the free space between the
array’s beginning and the queue’s beginning. To achieve this, let’s create
another pointer storing the location of the real end of the queue. When these two
pointers meet, let’s relocate the whole array so that the queue’s beginning and end
coincide with the underlying array’s.
Complexity of this solution is O(1) amortized for both operations.
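The two-pointer scheme above can be sketched as follows (my own version: the free space in front of the queue is reused by wrapping the tail pointer around, and the whole array is relocated into a doubled one when the two pointers meet):

```java
// Hypothetical sketch of an array-backed queue with head/tail pointers.
class ArrayQueue {
    private int[] items = new int[8];
    private int head, tail, size;

    void enqueue(int x) {
        if (size == items.length) grow();  // the two pointers have met
        items[tail] = x;
        tail = (tail + 1) % items.length;  // wrap into free space at the front
        size++;
    }

    int dequeue() {
        int x = items[head];
        head = (head + 1) % items.length;  // shift the real beginning forward
        size--;
        return x;
    }

    boolean isEmpty() { return size == 0; }

    // Relocate into a fresh, larger array so the queue starts at index 0 again.
    private void grow() {
        int[] bigger = new int[items.length * 2];
        for (int i = 0; i < size; i++)
            bigger[i] = items[(head + i) % items.length];
        items = bigger;
        head = 0;
        tail = size;
    }
}
```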
Queue implementation using doubly-linked list
This implementation is much easier, because a doubly-linked list can
remove items from both the beginning and the end in O(1).
Complexity of this solution is O(1) for both operations.
Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms
The last term has passed, and I’ve got a lot of time to spend on writing my master’s
thesis and doing other stuff. I’ve decided that improving my FP and Haskell
skills is the thing I’d like to do, and the best way to do this is to write a
couple of small but rather serious projects. We had some interesting ones
at university but, to be honest, my architecture decisions were sometimes not so
good; by the end of a term the projects had become surrounded with all kinds of
workarounds, and it was too late to rewrite them from scratch. Now I have no
hard deadlines or requirements, so why not write these projects properly?
The projects I’m talking about are a database management system (like SQLite) and
a virtual machine with JIT compilation and some simple
optimizations (like HotSpot). The former I wrote more than a year ago; the
latter was written a couple of months ago. I’ll start with the first one because
it was written in Haskell, so it will be easier to fix my previous mistakes.
Another reason is that I’m starting to forget things about DBMSs, and it would be
great to freshen them up.
I will store my code on GitHub and
keep a link to the state of the project at the time of writing each post on top of
it. The current specifications and requirements are
short: it will be a simple database supporting three fixed-size types (ints,
doubles and varchars), B-tree indexes, cross joins and a ‘vacuum’ operation. I
will cover the internal architecture more in later posts; now I’m going to talk
about project management with Cabal.
Cabal is the recommended build tool for Haskell. Its name stands for “Common
Architecture for Building Applications and Libraries”. It is split in two parts:
the cabal-install package, which contains the cabal executable, and the Cabal
library, which contains some advanced things for complex builds. Be sure to have
the former installed in order to proceed with the instructions below.
Creating a new Cabal project is simple: create a new directory, cd into it, run
$ cabal init
and answer some questions about your project. A couple of files will be
generated: <your-project>.cabal, a declarative definition of your project’s
build, and Setup.hs, a build script. Here are the possible contents of the
generated file:
-- Initial my-project.cabal generated by cabal init. For further
-- documentation, see http://haskell.org/cabal/users-guide/
synopsis: My precious project
author: Nikolay Obedin
build-depends: base >=4.7 && <4.8
-- ... other sections
The syntax is pretty straightforward: indentation is used to distinguish entries and
their contents, and “--” comments things out. Global properties are at the top of
the file, and they are pretty simple too: the name of your project, version, short
description, license, category on Hackage, etc. One interesting field is
build-type, which tells Cabal what type of build is going to be used:
Simple is for builds configured via the .cabal file, Custom is for more
complex builds using Setup.hs. I’ll stick with the Simple build type for now.
A set of sections is placed after the global properties. Each section belongs
to one of four types: library, executable, test suite or benchmark. I’m going to
take a look at the first three of them.
A library is a collection of Haskell modules that can be reused by other
libraries and applications. A project may contain at most one library, and its name
matches the name of the project. Each module of a library should be listed either in
the other-modules entry or the exposed-modules entry, depending on its visibility
to the end user: the latter are visible and the former aren’t. If you chose a
library in cabal init, then all your existing modules have been automatically
added to exposed-modules.
The rest of the settings in this example are common to all types of sections.
hs-source-dirs is a comma-separated list of directories where Cabal will search
for source files. default-extensions is a list of language extensions enabled
by default for all source files.
build-tools is a list of programs used to process specific files before
compiling them. Note that, for some reason, these executables ARE NOT installed
automatically; you should do it manually. In this example, alex is used to
process files with the .x extension and happy processes files with the .y
extension, generating a lexer and a parser respectively.
build-depends is a list of packages your project depends on. Each package may
be constrained by a version range, but be careful with it, as inadequate
constraints may leave Cabal unable to install your dependencies. Usually I use
Stackage to constrain the dependencies for me while
leaving them unconstrained in build-depends. However, this approach is
useful only if you’re developing an internal library or application; if you’re
going to publish it on Hackage, then you should set sane constraints on your
dependencies, but I guess by that time you will know these things better than
me.
Executable and Test-Suite sections
An executable is a standalone program. Its name is not required to be the same
as the package’s, and a project may have many executables. The only thing it
requires is an entry point, which is a main :: IO () function. The source file
containing this function should be specified in the main-is entry.
A project may also contain many test suite sections, and each of them
should use one of the supported testing interfaces. The interface used by a
particular test suite is defined in its type field. Cabal supports two testing
interfaces out of the box: exitcode-stdio and detailed. I prefer the first one
because it is simpler: it just compiles and runs your test application, checking
its exit code; if it’s non-zero, the test has failed. The only required field for
exitcode-stdio is main-is, which means exactly the same thing as in the
executable section: it is the source file of your test program.
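As a sketch, a minimal test suite section using the exitcode-stdio interface might look like the following (the suite name, the test directory, and the hspec dependency are assumptions of mine for an imaginary project; note that the actual value of the type field carries a version suffix):

test-suite spec
  type:             exitcode-stdio-1.0
  main-is:          Spec.hs
  hs-source-dirs:   test
  build-depends:    base, hspec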
Installing dependencies, building and running
Now that the project is configured, it is time to build it. But first you need to
install the dependencies stated in the build-depends fields. I strongly recommend
using Cabal sandboxes to avoid possible conflicts. To create a new sandbox, run:
$ cabal sandbox init
and then install dependencies:
$ cabal install --dependencies-only
If you’re using build-tools, install them manually:
$ cabal install alex happy <any-other-build-tool>
Now you can build your project:
$ cabal configure && cabal build
Then you can test it and run:
$ cabal test && cabal run
If you want to try something in ghci you can start it with all dependencies of
your project available by running:
$ cabal repl
These are the commands you’re going to use rather frequently; the rest of
available commands and flags can be found by running cabal --help.
In the end
Haskell is a great language, and Cabal is a sane build tool if it’s used properly.
Use sandboxes and careful version constraints, and do not install many packages in
the global scope; remember that Cabal is not a package manager, and you’ll be
fine.