Subtitle: A step towards cleaning up the mess we're in.
How do we keep track of documents that change?
This is a sub-problem of the more general problem “how do we keep track of things”, which in its turn involves solving the problem “how do we name things”, and this is a sub-problem of the problem of cleaning up the mess we're in.
I'll start with the problem of naming things.
Naming things is one of the toughest problems in computer science and philosophy. As soon as we name something there is an implied context - take away the context, or use the name in a different context and we are lost.
My son is called Thomas - in the context of our family “Thomas” means “my son” and uniquely identifies him. In the context of my workplace Thomas means one of 646 different people (and yes, I ran a search to see). Google comes up with 1.58 billion Thomases - so the name Thomas is both very useful (in the context of my family) and useless (in the wider context).
How could we give a precise name to Thomas? That's easy (in theory) - we scan Thomas's genome and compute the SHA1 checksum of the genome (and we'd need some fancy error-correction algorithm, since I guess two scans of the genome would not produce bit-identical results).
Inside computers things are stored in files, and this raises two very tricky problems: what should we call the file, and where should we put it?
The easiest way to name a file is by its content. All we do is compute the [SHA1](http://en.wikipedia.org/wiki/SHA-1) checksum of the content of the file - and bingo we have a unique filename. Then we can stick all the files in the same directory. This solves both our problems.
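The scheme can be sketched in a few lines of Python; the in-memory dictionary here stands in for a directory on disk:

```python
import hashlib

def content_name(data: bytes) -> str:
    """Name a blob by the SHA1 checksum of its content."""
    return hashlib.sha1(data).hexdigest()

# A trivial in-memory "directory": every file lives under its own hash.
store = {}

def put(data: bytes) -> str:
    name = content_name(data)
    store[name] = data          # identical content collapses to one entry
    return name

name = put(b"Hello, web of hashes")
assert store[name] == b"Hello, web of hashes"

# Storing the same bytes twice adds nothing new:
put(b"Hello, web of hashes")
assert len(store) == 1
```

A nice side effect: duplicate files deduplicate themselves, because identical content always hashes to the same name.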
We've also created a new and totally different problem - “How do I find the filename of a file that contains some data I'm interested in”.
The latter problem (finding the file) is solved by indexing the contents of the file. The file can contain hashtags - names with an agreed meaning - and we can index and search for these.
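A minimal sketch of such an index in Python, using made-up file names and contents for illustration:

```python
import re

# Hypothetical store: file name -> file contents containing hashtags.
files = {
    "file-1": "notes on #erlang and #naming",
    "file-2": "a rant about #naming things",
}

def build_index(files):
    """Map each hashtag to the set of files whose content mentions it."""
    index = {}
    for name, text in files.items():
        for tag in re.findall(r"#(\w+)", text):
            index.setdefault(tag, set()).add(name)
    return index

index = build_index(files)
assert index["naming"] == {"file-1", "file-2"}
assert index["erlang"] == {"file-1"}
```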
Once we have the SHA1 name of a file we can safely request this file from any server and don't need to bother with security.
For example, if I request a file with SHA1 cf23df2207d99a74fbe169e3eaa035e623b65d94 from a server then I can check that the data I got back was correct by computing its SHA1 checksum.
This mechanism is immune to a man-in-the-middle attack, so regular unencrypted socket transport can be used. If a man in the middle changes the content of the file then the SHA1 checksums will not tally, and the client requesting the data will know that the data is corrupt and cannot be trusted.
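The client-side check might look like this Python sketch, where `fetch` is a stand-in for whatever transport - trusted or not - actually delivers the bytes:

```python
import hashlib

def verified_fetch(sha1_name: str, fetch) -> bytes:
    """Fetch data by its SHA1 name and refuse anything that doesn't match.

    `fetch` takes the name and returns candidate bytes from any server.
    """
    data = fetch(sha1_name)
    if hashlib.sha1(data).hexdigest() != sha1_name:
        raise ValueError("checksum mismatch: data is corrupt or tampered with")
    return data

good = b"immutable content"
name = hashlib.sha1(good).hexdigest()

# An honest server returns bytes that hash to the requested name:
assert verified_fetch(name, lambda _n: good) == good

# A man in the middle who alters the bytes is detected, not trusted:
try:
    verified_fetch(name, lambda _n: b"tampered content")
except ValueError:
    pass
```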
Given that we have the SHA1 checksum of some data, how can we find the address of a server that hosts this file? Well, this problem was solved a long time ago: just put the data into a DHT (Distributed Hash Table) such as Kademlia or [Chord](http://en.wikipedia.org/wiki/Chord_%28peer-to-peer%29) and let it work its magic. BitTorrent clients and servers have routinely used this technique for the last ten years or more.
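A real DHT spreads its table over many nodes, but the interface can be sketched with an ordinary dictionary; the peer addresses below are made up for illustration:

```python
import hashlib

# One hash table standing in for a DHT such as Kademlia. The real thing
# scatters these (key -> hosts) entries across many nodes, but the
# interface is the same: announce that you hold some content, then
# look up who holds it.
dht = {}

def announce(sha1_name: str, host: str) -> None:
    dht.setdefault(sha1_name, set()).add(host)

def lookup(sha1_name: str) -> set:
    return dht.get(sha1_name, set())

name = hashlib.sha1(b"some immutable file").hexdigest()
announce(name, "peer-a.example:4001")
announce(name, "peer-b.example:4001")
assert lookup(name) == {"peer-a.example:4001", "peer-b.example:4001"}
```

Having found any one of those peers, the checksum trick above means we can download from it without trusting it.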
To enable a network for serving immutable content I'd like to see the emergence of what I have called “the web of hashes”, so we'd see web addresses like `sha1://cf23df2207d99a74fbe169e3eaa035e623b65d94` instead of addresses that name a particular machine and path.
SHA1 checksums are fine for content that is immutable (doesn't change) - but what about a file whose content changes with time?
We could embed a permanent unique identifier - a UUID - in the file itself, so Erlang files might start with a header comment carrying the file's UUID.
Once a file has a UUID it can be renamed, edited, cloned, etc. As long as you don't remove the UUID, we can track the location with a DHT. After a while the contents of files with the same UUID will diverge, and at that time we can think of changing the UUIDs.
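A sketch in Python of stamping such a header into a file and reading it back (the `%% uuid:` comment format here is my own invention for illustration, not an established convention):

```python
import re
import uuid

def file_uuid(text: str):
    """Extract the UUID from a file's header comment, or None."""
    m = re.search(r'uuid: ([0-9a-f-]{36})', text)
    return m.group(1) if m else None

def stamp(text: str) -> str:
    """Prepend a UUID header comment if the file doesn't already carry one."""
    if file_uuid(text) is None:
        return '%% uuid: {}\n{}'.format(uuid.uuid4(), text)
    return text

original = stamp("-module(tracker).\n")
renamed_and_edited = original + "%% a new comment\n"

# Editing (or renaming) the file leaves its identity intact:
assert file_uuid(original) == file_uuid(renamed_and_edited)

# Stamping is idempotent - a file never gets a second UUID:
assert stamp(original) == original
```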
I think this would be a rather good mechanism. Imagine the following:
* The Web of UUIDs
Now we create the web of UUIDs. As with the web of hashes, I'd like to see a web of UUIDs, so we could request data for a resource with an identifier of the form `uuid://<some-uuid>`.
This time we might get many different replies, since there might be multiple copies of the file. What should a request like the above return? Possibly a list of SHA1s -- I'm not sure here.
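One possible shape for such a reply, sketched in Python under that "list of SHA1s" assumption: a registry mapping each UUID to the SHA1 names of every version of the file seen so far.

```python
import hashlib
import uuid

# A toy registry standing in for the web of UUIDs: each UUID maps to
# the set of SHA1 names of the versions published under it.
registry = {}

def publish(file_uuid: str, content: bytes) -> str:
    """Record a new immutable version of the file identified by file_uuid."""
    sha1 = hashlib.sha1(content).hexdigest()
    registry.setdefault(file_uuid, set()).add(sha1)
    return sha1

def resolve(file_uuid: str) -> set:
    """Answer a uuid:// request with the SHA1 names of all known versions."""
    return registry.get(file_uuid, set())

doc = str(uuid.uuid4())
v1 = publish(doc, b"first draft")
v2 = publish(doc, b"second draft")

assert resolve(doc) == {v1, v2}   # two versions, two immutable names
```

Each SHA1 in the reply can then be fetched and verified through the web of hashes, so the mutable and immutable layers compose.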
Today we have a web of names - things like `http://www.example.com/some/path` - which name machines and the paths of files on those machines.
But we don't have either a web of hashes or a web of UUIDs.
I think we need all three.
You might ask where I'm going with this. If you've watched my lecture “The Mess We're In” you'll get the answer. The Web and computer software are in a total mess. They have evolved faster than our ability to understand what we are doing.
Once there was not enough software, then it was about right, and now there's too much.
It's easy to understand how the total amount of software in existence increases - this is a law of nature, entropy always increases. The amount of software increases because files get copied, edited, cloned and modified.
We need mechanisms to reverse this process. By adding UUIDs to files we can track down all copies of and modifications to a file, and possibly reduce the number of files by throwing away bad modifications that make no sense and keeping the best of all the modifications.
This is part of my reversing entropy plan - that hopefully will clear up the mess we're in.
Since I published this article my attention has been drawn (thanks to Twitter and some private communications) to a number of projects with goals that are broadly similar to or overlapping with what I'm suggesting:
CCNx is a project that wants content to be addressed by name rather than by machine address, using secured content.
Named Data Networking wants to secure content.
IPFS wants to make the entire web into a secure distributed file system.
It seems like many people have similar ideas - which is great. Hopefully this will lead to better and more secure systems.