Vertical partitioning

April 4, 2009

For the past few weeks, I’ve been working on a Haskell program that gives recommendations to developers in order to ease the process of vertical partitioning.

This program explores the source code of our web applications and parses the PHP in order to list the different queries made against a given table.
This parsing step produces a list of Query objects, each Query representing the list of fields selected during a call to the ORM. This approach has the obvious drawback of giving every query the same weight; it would be better to weight each query by the number of times it is actually run, but I do not have that information, so static analysis is all I can do for now.
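In essence, the representation can be as simple as this (the names Field and Query are illustrative, not necessarily the ones in the program; I model a query as the set of fields it selects, since their order does not matter for the analysis):

import qualified Data.Set as Set

type Field = String

-- One Query per ORM call: the set of fields it selects from the table.
type Query = Set.Set Field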

Lazy loading for easier coding

During the parsing process, the module description files are loaded and fields are checked against the object definitions. Thanks to Haskell’s lazy evaluation, I could write my code as if every single module were loaded… but that’s not what actually happens: Haskell only executes a small part of the code and only loads the modules I am actually going to use. The other modules are really just unevaluated calls to functions capable of reading a module definition; as long as they are not forced, nothing happens.
Usually, only about 1% of all modules are opened and parsed.
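I won’t reproduce the loading code, but the idea can be sketched like this, with a made-up ModuleDef type (deferring the file reads with unsafeInterleaveIO is one way to get the behaviour described above):

import qualified Data.Map as Map
import System.IO.Unsafe (unsafeInterleaveIO)

-- Made-up representation of a module description: the fields it declares.
newtype ModuleDef = ModuleDef [String]

-- Build a map whose values are read and parsed only the first time they are
-- demanded; modules that are never consulted remain unevaluated thunks.
loadModules :: [FilePath] -> IO (Map.Map FilePath ModuleDef)
loadModules paths = Map.fromList <$> mapM load paths
  where
    load path = do
      def <- unsafeInterleaveIO (parseModule path)
      return (path, def)
    parseModule path = ModuleDef . lines <$> readFile path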

Correlations between fields

For all possible pairs of fields (n×(n-1)/2 pairs), a distance is computed using the Jaccard distance (one minus the Jaccard index): this creates a complete graph with n vertices. The distance indicates how often two fields are used together: two fields that are always selected together have a distance of zero, and two fields that are never selected together have a distance of one.
It seems like a lot of processing, but even for large tables the number of edges stays acceptable: a 100-column table creates a graph with 100 vertices (if they’re all used in the code) and 4950 edges.
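A sketch of this computation (repeating the Query type so the snippet stands alone; the function names are illustrative):

import Data.List (nub, tails)
import qualified Data.Set as Set

type Field = String
type Query = Set.Set Field

-- Jaccard distance between two fields: 1 minus the fraction of the queries
-- using either field that use both of them.
distance :: [Query] -> Field -> Field -> Double
distance queries a b
  | eitherCount == 0 = 1
  | otherwise        = 1 - fromIntegral bothCount / fromIntegral eitherCount
  where
    bothCount   = length [q | q <- queries, a `Set.member` q, b `Set.member` q]
    eitherCount = length [q | q <- queries, a `Set.member` q || b `Set.member` q]

-- The n*(n-1)/2 weighted edges of the complete graph over all the fields
-- seen in the code.
edges :: [Query] -> [(Field, Field, Double)]
edges queries =
  [ (a, b, distance queries a b) | (a:rest) <- tails fields, b <- rest ]
  where
    fields = nub (concatMap Set.toList queries)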

A filter then removes all the edges with distances above a certain threshold, and displays the graph using graphviz.
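That step is straightforward; here is a sketch of it, producing DOT output that graphviz can render (the colours and cluster grouping of the real tool are left out):

type Field = String

-- Keep only the edges at or below the distance threshold and emit a graph
-- in graphviz's DOT syntax.
toDot :: Double -> [(Field, Field, Double)] -> String
toDot threshold es = unlines $
     "graph fields {"
   : [ "  \"" ++ a ++ "\" -- \"" ++ b ++ "\";"
     | (a, b, d) <- es, d <= threshold ]
  ++ ["}"]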

Since we’ve kept only the strongest links, we see clusters emerge. These clusters are groups of fields which are often used together in the code, but don’t tend to be used with anything else. If a group of fields that are often selected together has no strong link to any other field, it is a candidate for extraction into a separate table. An engineer would not apply this blindly, but would use it as advice.

Example

Consider this code, written using our PHP framework:

// get the 5 oldest of this area.
$this->mPerson
    ->select('id', 'name')
    ->getGroup(0, 5);

$this->mPerson
    ->select('father', 'mother', 'birth')
    ->getGroup();

$this->mPerson
    ->select('country', 'city', 'postcode')
    ->getGroup();

This graph is produced:
person-1.png

Indeed, these fields are always selected together, and are never mixed with anything else.
Now, if we add these two calls:

$this->mPerson
    ->select('id', 'name', 'age')
    ->getGroup();

$this->mPerson
    ->select('id', 'name', 'sex')
    ->getGroup();

The graph then changes as follows:
person-2.png

Adding these lines adds the age and sex fields and links them to id and name. There is a strong correlation between the presence of id and the presence of name, and this is shown by the green color of their common edge (the other links are weaker). The cluster’s background color shows that the internal consistency of this group is quite low.

Tweaking

This tool does not give ready-made database schemas, but only shows the dependencies between columns. The engineer using this program changes the distance threshold to get different dependencies: the higher the threshold, the weaker the clusters. Raising it can lead to important discoveries: for instance, it can help detect an important link between a strong core of fields and a seldom-used but important field that would get cut by a low threshold. In that case, cutting out this core without realising which hidden dependencies might be affected would be a serious mistake.

Performance

As I mentioned earlier, laziness made it easy to write readable code while getting some performance for free. I’ve also used the Parsec library to parse the ORM calls; its elegance and speed make it a remarkable tool.
The bottleneck is often graphviz’s “dot” program, which can take a while to draw a few dozen nodes with hundreds of edges. My own program was not profiled or optimized for speed, as it needs only a few seconds to analyse our 200,000 lines of code.
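I won’t reproduce the real parser, but recognising a single select call with Parsec is roughly this simple (a simplified sketch that ignores escaping inside field names and the rest of the framework’s syntax):

import Text.Parsec
import Text.Parsec.String (Parser)

-- Parse one ->select('a', 'b', ...) call into its list of field names.
selectCall :: Parser [String]
selectCall = do
  _ <- string "->select("
  fields <- quoted `sepBy` (char ',' >> spaces)
  _ <- char ')'
  return fields
  where
    quoted = between (char '\'') (char '\'') (many1 (noneOf "'"))

For instance, parse selectCall "" "->select('id', 'name')" evaluates to Right ["id","name"].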

Real-world overview

This is the output of the program with default options on our largest table, which has 85 columns.

large-default.png

I apologize for the lack of field names, but I don’t think it would be appropriate for me to publish this kind of information.
In the default run, most of the fields have been cut out entirely, for lack of strong links attached to them. The run still shows an interesting pattern, with two clusters organized in exactly the same way. This is a good example of organic growth, where field modifiers have been added and are used like the fields they depend on; think of a Character table with skills a, b, c, d and bonuses applied to these skills by creating aBonus, bBonus… all added after the skills, in the same table. The two clusters would then be {a, b, c, d} and {aBonus, bBonus, cBonus, dBonus}.

An option brings back the fields that have been cut out, by showing each field’s strongest link even when that link would normally be filtered out:
large-1.png

The color progression shows that the last groups have a very low score, which means that there are a lot of hidden links attached to these nodes. Still, keeping the strongest link for the “saved” ones helps maintain some meaning: even the last group is made of fields which actually go together!
But there is a downside to this technique: a few rare fields are attached to strong clusters via a weak link (their strongest). This can sometimes lower the quality of otherwise good clusters, which is the case here in the second and third clusters. In order to view lonely fields while still keeping the clusters uncontaminated, a command-line option enables the creation of a “lonely pool”, where fields with no link left are displayed without any structure.
Displaying the rest of the fields while tweaking the cluster threshold is a useful reminder that moving columns out can be a big deal, even if the table seems small when most of its columns are hidden.

It is the role of the engineer doing the partitioning to examine different thresholds, as there is currently no way to select a “best” value. Is there even one?
This is what the table looks like with different threshold values:

Cut at 0.1:
large-0.1

Cut at 0.2:
large-0.2

Cut at 0.4:
large-0.4

Last points

In my opinion, the data should not come from the source code, but from the database log. Some queries are run hundreds of times more often than others, and such a naïve source code analysis misses this very important point.
Several improvements are planned, such as the ability to ignore specific fields, or to evaluate the complexity of an extraction by counting the number of calls that would need to be changed.

Haskell was the perfect tool here. The mathematical constructions used to compute the correlations are expressive and readable. The purity of the language surely avoided many common mistakes.

Strange, strange loops!

March 24, 2009

I’ve just finished reading Gödel, Escher, Bach: An Eternal Golden Braid by Douglas Hofstadter.
The works of Gödel, Church, Turing, Cantor and others are presented in this beautiful book, which is extraordinarily witty and amusing with its dialogues in the style of Lewis Carroll’s What the Tortoise Said to Achilles and Escher’s prints as illustrations.
This book is focused on self-reference and the emergence of self-awareness in complex systems. Self-references often have interesting implications, be it Gödel’s theorem or Turing’s proof that no program can solve the Halting Problem.

This article shows some examples of self-reference, using the LISP and Haskell programming languages.

(CONS)

In LISP languages, the cons cell is an essential data structure, composed of two parts. It is created with the CONS function, which takes two arguments. To create a list, one merely links cons cells together by having a pointer to data in the first part of a cell (the CAR), and a pointer to the next cons cell in the second part (the CDR).

Let’s create a cons cell, its first part pointing to a number and the second part set to NIL. We’ll call it “self”:

[1]> (setf self (cons 1 NIL))
(1)
[2]> (car self)
1
[3]> (cdr self)
NIL

We can now make (CDR self) point to… self:

[4]> (car (setf (cdr self) self))
1
[5]> (car (cdr (cdr (cdr (cdr (cdr self))))))
1

Line [4] reads as: point the second part of self to self, and return the first part of this whole construction.
Line [5] reads as: give me the first part of the result of taking the second part of self five times.
Self is the list made of the number “1”, followed by itself! Obviously, we can’t print it anymore:

[6]> self
(1 1 1 1 1 1 1 1..... (endless list)

This recursive construction is a problem in the construction of simple reference-counting garbage collectors:

One day a student came to Moon and said: “I understand how to make a better garbage collector. We must keep a reference count of the pointers to each cons.” Moon patiently told the student the following story: “One day a student came to Moon and said: ‘I understand how to make a better garbage collector…

The Chicken & Egg problem

Which came first: the chicken, or the egg?
This profound philosophical question can be approached with laziness and self-reference: The chicken came from an egg, which could come from the same chicken.

In Haskell:

module GEB where

data Chicken = Chicken { comingFrom :: Egg }     deriving Show
data Egg     = Egg     { laidBy     :: Chicken } deriving Show

myChicken = Chicken myEgg
myEgg = Egg myChicken

This code creates two types, Chicken and Egg. To create a Chicken instance, one has to call “Chicken” with an Egg as the parameter. To create an Egg instance, one has to call “Egg” and give as a parameter a Chicken object, which will be the parent.

Two mutually recursive values are then declared: myChicken is built from myEgg, and myEgg is built from myChicken.

This code compiles and runs fine:

*GEB> :type myChicken 
myChicken :: Chicken
*GEB> :type myEgg 
myEgg :: Egg

Try doing that with C++ references!
With Haskell’s lazy evaluation, the variables won’t actually be computed until there is an explicit need to use their value. Sadly, it means that printing these objects creates an endless loop of self-creation:

*GEB> myChicken 
Chicken {comingFrom = Egg {laidBy = Chicken {
    comingFrom = Egg {laidBy = Chicken {
    comingFrom = ... this goes on and on forever.
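Finite observations still terminate, though: laziness only evaluates what is demanded, so following the loop a fixed number of steps is perfectly fine. Extending the GEB module above:

-- Two steps around the loop: the egg my chicken came from was laid by…
-- my chicken again. Evaluating this to weak head normal form (e.g. with seq)
-- terminates; only show-ing it loops forever.
loopedBack :: Chicken
loopedBack = laidBy (comingFrom myChicken)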

What are the consequences of such loops?

Well, things often break. In human languages, this leads to paradoxes such as “this sentence is false”, in Mathematics to Gödel’s incompleteness theorems, in programming languages to infinite loops or memory leaks…

But most often self-reference and infinite loops are a source of wonder and beauty! They are at the center of the extraordinary paintings of M.C. Escher and René Magritte, at the core of computation in lambda calculus with Y = λf·(λx·f (x x)) (λx·f (x x)), and even in quantum mechanics where they bring fascinating infinities.
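In Haskell the same knot-tying is a one-liner; a small illustration (a fix function also ships in Data.Function):

-- The fixed-point combinator: fix f is a value defined in terms of itself.
fix :: (a -> a) -> a
fix f = let x = f x in x

-- Factorial without explicit recursion: the function receives its own fixed
-- point through the rec parameter.
factorial :: Integer -> Integer
factorial = fix (\rec n -> if n <= 1 then 1 else n * rec (n - 1))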

And in our human brains… at the root of consciousness?