I recently came across the "Files are hard" article, and it made me wonder how reliable cstore_fdw's design and implementation are.
cstore_fdw is a columnar store for PostgreSQL that I designed and developed in my previous job at Citus Data.
I am writing this post so that my decisions for cstore_fdw's design get reviewed by more people, and so that I can get some feedback and improve the design.
The structure of cstore_fdw is similar to Hive's ORC, with some differences. One of the biggest is that cstore_fdw stores metadata in a separate file, while ORC stores it at the end of the data file.
The reason for this difference? It made avoiding data corruption much easier, as we will see later.
Each cstore_fdw table consists of two files: a data file and a metadata file.
The data file holds the rows, which cstore_fdw divides into stripes. Each stripe holds 150K rows by default. Within each stripe, data is stored in a column-oriented manner.
The metadata file is a small file that contains some version information plus an outline of the data file, including the location and size of each stripe.
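To make the layout concrete, here is a minimal Python sketch of the idea, not cstore_fdw's actual C code. The names StripeMetadata, split_into_stripes, and build_outline are hypothetical, and the outline fields (offset, length, row count) are my assumption of what "location and size of each stripe" could look like.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

STRIPE_ROW_COUNT = 150_000  # default stripe size mentioned in the post


@dataclass
class StripeMetadata:
    """Hypothetical outline entry: where one stripe lives in the data file."""
    file_offset: int
    data_length: int
    row_count: int


def rows_to_columns(rows: Sequence[Tuple]) -> List[list]:
    """Turn a sequence of row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def split_into_stripes(rows: Sequence[Tuple],
                       stripe_row_count: int = STRIPE_ROW_COUNT) -> List[List[list]]:
    """Group rows into stripes; inside each stripe, values are kept per column."""
    stripes = []
    for start in range(0, len(rows), stripe_row_count):
        stripes.append(rows_to_columns(rows[start:start + stripe_row_count]))
    return stripes


def build_outline(stripe_lengths: Sequence[int],
                  stripe_row_counts: Sequence[int]) -> List[StripeMetadata]:
    """Compute each stripe's offset, assuming stripes are laid out back to back."""
    outline = []
    offset = 0
    for length, row_count in zip(stripe_lengths, stripe_row_counts):
        outline.append(StripeMetadata(offset, length, row_count))
        offset += length
    return outline
```

The point of the outline is that a reader only ever looks at the stripes the metadata file lists; bytes in the data file that no outline entry covers are simply invisible.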
Currently the only modification operation that cstore_fdw supports is batch insertion, which can be triggered using:
COPY cstore_table FROM '/path/to/file.csv' WITH CSV;
INSERT INTO cstore_table SELECT * FROM some_other_table WHERE some_condition;
Our plan was to get batch inserts right first, since this alone was very useful to our customers, who used it to store archive data, and then to add support for single-row inserts, deletes, and updates.
What happens when we do a batch load?
If the system crashes during the first four steps, the table is still in a consistent state and we won't have data corruption.
This is because:
What about the 5th step? The nice thing here is that the specification of the rename system call requires it to be atomic. So this step is either done in full (in which case we'll see the new data) or isn't done at all (in which case we'll see the old data, but we won't have data corruption).
And that is the reason for using a separate file for metadata. I needed an atomic operation to rely on, and rename is atomic. Rewriting all of the data into a new file and renaming the data file wasn't an option, since that is very expensive. So instead, we rewrite the small metadata file and do the rename for that.
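The rename-based metadata update described above can be sketched in Python as follows. This is an illustration of the general write-to-temp-then-rename pattern, not cstore_fdw's code: the function name replace_metadata_atomically, the ".tmp" suffix, and the JSON encoding are all assumptions made for the example.

```python
import json
import os


def replace_metadata_atomically(metadata_path: str, metadata: dict) -> None:
    """Write new metadata to a temp file, then rename it over the old one."""
    temp_path = metadata_path + ".tmp"  # hypothetical temp-file name

    with open(temp_path, "w") as temp_file:
        json.dump(metadata, temp_file)
        # Push the bytes out of Python's buffer and then out of the OS cache,
        # so the new file is complete on disk before it becomes visible.
        temp_file.flush()
        os.fsync(temp_file.fileno())

    # os.replace maps to rename() on POSIX: readers see either the old
    # metadata file or the new one, never a half-written mix.
    os.replace(temp_path, metadata_path)
```

A crash before os.replace leaves the old metadata file untouched, plus a stray temp file that can be ignored or cleaned up later.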
Of course, there were other designs that avoided file corruption, but I found this one much simpler than the alternatives.
Reading the discussion on Hacker News for "Files are hard", I see that I've missed a step, which is doing a sync on the parent directory to make the rename durable. That is something we should fix, but I'm happy that our design seems to avoid most of the issues discussed in "Files are hard".
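For reference, the missing step looks roughly like this in Python on a POSIX system. It is a sketch of the general technique rather than a patch for cstore_fdw; fsync_directory and durable_rename are hypothetical helper names, and opening a directory with os.open this way is not expected to work on Windows.

```python
import os


def fsync_directory(directory_path: str) -> None:
    """Flush a directory's entries to disk (POSIX only)."""
    directory_fd = os.open(directory_path, os.O_RDONLY)
    try:
        os.fsync(directory_fd)
    finally:
        os.close(directory_fd)


def durable_rename(source_path: str, target_path: str) -> None:
    """Rename atomically, then sync the parent directory so the rename survives a crash."""
    os.replace(source_path, target_path)
    # Assumes source and target live in the same directory, as they would
    # when a temp metadata file is renamed over the real one.
    fsync_directory(os.path.dirname(os.path.abspath(target_path)))
```

The rename is atomic on its own, but the directory sync is what makes it durable: without it, a crash shortly after the rename could leave the directory still pointing at the old file.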