The last time Hackerfall tried to access this page, it returned a not found error. A cached version of the page is below, or click here to continue anyway

Declaring Reality: An Introduction to Datalog, part I

Welcome to the wonderful world of [Datalog]( programming. Where nothing is imperative, and everything is declarative. Where you can randomly shuffle your statements (clauses) and everything will keep working. It's a different place than you're used to. You will laugh, you will cry. But ultimately, you'll find it to be a magical place. We at LogicBlox are merely your guides through this fascinating world -- we ourselves are explorers too, we try to run ahead and pave the paths, but be prepared, it may be bumpy at times. Even if ultimately you decide the world of Datalog is not for you, you will have enjoyed the ride and it will have changed your life forever. Not to be hyperbolic or anything.

Let's get started.

The first thing you should know about Datalog is that it's all about the truth. Datalog is not a place for liars. If you're into lying, go program <<insert language you don't like here>> or something (_zing!_). If you lie in Datalog, the compiler and runtime will hunt you down and kill you. [_Literally._]( Seriously kids: don't lie, Datalog won't let you get away with it.

Second, stop thinking sequentially, start thinking declaratively. In Datalog, all you have to do is declare reality: what are your world's facts and rules, rules that relate one fact, or a combination of facts to new facts. The world you create is dynamic, new facts emerge and others may no longer be true and disappear -- but your world will always adhere to your rules, consistently.

Let's start with a classic example.

Let's say we want to model the reality of ancestry in families. Datalog, at least the LogicBlox variant of it -- which from now on we will define as the one and only Datalog -- is a typed language. For simplicity, let's model people as strings, we'll get slightly less insane than to equate people to strings once we have worked through some of the basics.

Let's start with the parenthood relationship. In Datalog we call these relationships _predicates_. So, let's define the `parent` predicate:

parent(a, b) -> string(a), string(b).

What this Datalog _clause_ says is: "If there is a parent relationship between `a` and `b` (where `a` and `b` are placeholders), this must mean that `a` is a string, and `b` is too." These types of _constrains_ do not have to be about types at all, they can be anything. If in your world "Adam" has no parents (wooh, controversy!):

parent(a, _) -> a != "Adam".

Which is a Datalog way of saying: "If some person `a` has some parent -- don't really care who -- that must mean `a` is not 'Adam'". Or more succinctly:

!parent("Adam", _).

Or: "There is no parent relationship between Adam and some other person."

Alright, so how do we define facts in this reality? There's two ways, the first I like to think of as eternal truths:

parent("Pete", "John").

This means that John is Pete's parent and this will always be the case. It's invariant. If we don't want our world to work this way, and we'd be able to remove facts (like inserting and removing rows from a database table) at some point, we insert them with a delta clause:

+parent("Pete", "John").

As you may imagine, the `+` means that I'm adding a fact to my reality dynamically, which allows me to later on remove it, if for some reason John is no longer Pete's parent. By the way, to do that, you use:

-parent("Pete", "John").

**Deriving new facts.** Alright, so now we have a way to store who is who's parent. I feel your excitement growing. Let's extend our reality a bit by teaching it what an ancestor is. First, we'l define the type signature for this `ancestor` predicate:

ancestor(a, b) -> string(a), string(b).

Lovely. What we could now do is assert ancestry facts:

+ancestor("Pete", "John").

That would work, but we can be smarter. Of course, if `b` is the parent of `a`, that also means that `b` is an ancestor of `a`, so let's just define this general case:

ancestor(a, b)

Note that the arrow is now pointing in the opposite direction compared to the constraints we defined earlier. This notation is what we call a _rule_. A rule is used to derive new facts from existing ones.

However, there's more to say about ancestry. Let's say we have these facts:

+parent("Pete", "John"). +parent("Pete", "Mary"). +parent("John", "Hank"). +parent("John", "Anna").

Now, not only is John an ancestor of Pete, Hank is too, because Hank is an ancestor of John and John is an ancestor of Pete. So, let's define that too:

ancestor(a, c)

"If there exists a `b` for which it is true that `b` is an ancestor of `a` and `c` is an ancestor of `b`, then it must be true that `c` is an ancestor of `a`."

Still with me?

Now we can query our reality for stuff. In LogicBlox Datalog we do this by defining an anonymous query rule:

_(a, b)

Which says: "For any given `a` and `b`, where `b` is `a`'s parent, return `a` and `b`."

If you run this, it will return:

Pete, John Pete, Mary John, Hank John, Anna

_Slightly_ more interesting is:


which returns:

Hank Anna

If you're a SQL person, you'll find that this is very similar to doing a `SELECT b FROM parent WHERE a = "John"` query, just more concise.

And now for the money shot, which is much harder to do with SQL: "Tell me, oh reality, who are Pete's ancestors?"


and low and behold:

John Mary Hank Anna

It's amazing.

**Doubling down on the typing.** If you're a guy or gal who likes his or her types, you are most probably pretty disturbed about the fact we have been representing people as strings thus far.

Not to worry, _entities_ will save the day!

Let's define what a person is:

person(p), person_name(p:name) -> string(name).

This is a slightly weird syntax that tells us there's such a thing as a person, and that persons have names which identify them. In Datalog terminology `person_name`, in this case, is referred to as a _refmode_, which you may think of as a primary key. So, now we can define people!

+person(p), +person_name(p, "Pete").

Here we say: "There's a person `p` where `p`'s name is 'Pete'." That's slightly verbose, and since `person_name` is a person's refmode, we can shorten this to simply:


So, let's redefine the types of our `parent` and `ancestor` predicates with our better representation of person:

parent(a, b) -> person(a), person(b). ancestor(a, b) -> person(a), person(b).

That's better! The nice thing is that the Datalog compiler is clever enough to understand what we mean when we say this:

+parent("Steve", "John").

And will automatically ensure (=l ook up, or create if it does not yet exist) a person named Steve, a person named John as well as establishing that John is Steve's parent. Three birds with one stone.

Now, I know what you're thinking: "This all sounds fantastic, Zef -- and if I ever want to find out if [it's OK]( [to kiss]( [my second cousin](, I'm sure I'll write a Datalog program to figure it out. But can I build _real_ applications this way?"

I hear ya. And I have good news: it's very possible. Stay tuned for part II, where we will start building an application that mops the floor with [OmniFocus](

Continue reading on