Today from the “What am I doing with my life?”-department: I finally set out to find a definite answer to something I’ve wondered about ever since hearing about the deduplication feature of ZFS: does it work on the intros of TV shows? TL;DR: Nope!
I never even expected it to work. Plus, you’re always advised against using deduplication anyway. The infamous “1GB RAM per 1TB storage in the pool” rule, which is often incorrectly applied to ZFS in general, stems from it. So even if I had found out that it worked, I probably couldn’t have benefited from that. But still, not knowing for sure always bugged me.
As I’m currently building a new NAS and will switch from ext4 to ZFS for my home storage once it is fully operational, it was time to simply do some tests and be done with the matter once and for all.

Establishing the test setup: One season of Dexter in 1080p from the iTunes Store weighs roughly 26GB, exactly 28171821903 bytes in the case of my test data. The episodes of said season run for 54:27 on average, while the brilliant intro of the hit-turned-shit show lasts for a whopping 1:45, i.e. 3.213957759% of each episode. That means we could hope to save around 860MB per season in an ideal scenario.
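The arithmetic behind this estimate can be checked with a short script. This is only a sketch: it assumes the intro takes up the same share of the bytes as it does of the runtime (i.e. a constant bitrate), which real encodes don't guarantee. The helper name `to_seconds` is made up for this example.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'MM:SS' runtime string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


SEASON_BYTES = 28_171_821_903          # size of the test data, from the text
EPISODE_SECONDS = to_seconds("54:27")  # average episode runtime
INTRO_SECONDS = to_seconds("01:45")    # intro runtime

# Assumption: byte share == runtime share (constant bitrate).
intro_share = INTRO_SECONDS / EPISODE_SECONDS
ideal_savings_bytes = SEASON_BYTES * intro_share

print(f"intro share of each episode: {intro_share:.4%}")
print(f"best-case savings per season: {ideal_savings_bytes / 2**20:.0f} MiB")
```

The same two lines work for any other show once you plug in its runtimes.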
First I created four different ZFS pools:
    $ truncate -s 32G /var/lib/zfs_img/zfs_blank.img
    $ truncate -s 32G /var/lib/zfs_img/zfs_dedup.img
    $ truncate -s 32G /var/lib/zfs_img/zfs_compr.img
    $ truncate -s 32G /var/lib/zfs_img/zfs_both.img

    $ zpool create zfs_blank /var/lib/zfs_img/zfs_blank.img

    $ zpool create zfs_dedup /var/lib/zfs_img/zfs_dedup.img
    $ zfs set dedup=on zfs_dedup

    $ zpool create zfs_compr /var/lib/zfs_img/zfs_compr.img
    $ zfs set compression=on zfs_compr

    $ zpool create zfs_both /var/lib/zfs_img/zfs_both.img
    $ zfs set compression=on zfs_both
    $ zfs set dedup=on zfs_both
After creating the pools, there is the exact same amount of free space on each of them:
    $ df /zfs_*
    Filesystem     1K-blocks     Used Available Use% Mounted on
    zfs_blank       32771968        0  32771968   0% /zfs_blank
    zfs_dedup       32771968        0  32771968   0% /zfs_dedup
    zfs_compr       32771968        0  32771968   0% /zfs_compr
    zfs_both        32771968        0  32771968   0% /zfs_both
After copying the files into each pool, let’s see what we got:
    $ df /zfs_*
    Filesystem     1K-blocks     Used Available Use% Mounted on
    zfs_blank       32771712 27531008   5240704  85% /zfs_blank
    zfs_dedup       32709632 27532672   5176960  85% /zfs_dedup
    zfs_compr       32771712 27527424   5244288  84% /zfs_compr
    zfs_both        32708480 27529088   5179392  85% /zfs_both
Well, this is odd. Suddenly there is a different number of (total, not just free) 1K-blocks in each of the filesystems. I have no idea why that is happening; please let me know if you can explain it. (I did stumble upon these df/ZFS troubles while researching, but either this has been fixed in the meantime or it was never an issue with the ZoL implementation, as the script there gave me the same numbers as df/du.) To make certain this doesn’t influence the results of the test, I also tried it with a set of highly compressible files and a set of highly dedupable files. In doing so I encountered the same 1K-blocks issue but still got exactly the results I would expect.
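For anyone who wants to repeat that sanity check, here is one way such control data could be generated. This is a sketch, not the exact files I used: zero-filled data stands in for “highly compressible”, several copies of one random payload stand in for “highly dedupable”, and the function names and sizes are made up for the example. It writes to a temporary directory; point it at a pool’s mountpoint to test for real.

```python
import os
import tempfile
import zlib
from pathlib import Path


def make_compressible(path: Path, size: int) -> None:
    """Write `size` zero bytes: about as compressible as data gets."""
    path.write_bytes(bytes(size))


def make_dedupable(directory: Path, copies: int, size: int) -> list:
    """Write the same random (incompressible) payload `copies` times."""
    payload = os.urandom(size)
    paths = []
    for i in range(copies):
        target = directory / f"copy_{i}.bin"
        target.write_bytes(payload)
        paths.append(target)
    return paths


def zlib_ratio(path: Path) -> float:
    """Compressed size divided by original size (lower = more compressible)."""
    data = path.read_bytes()
    return len(zlib.compress(data)) / len(data)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    zeros = root / "zeros.bin"
    make_compressible(zeros, 1024 * 1024)
    copies = make_dedupable(root, copies=3, size=1024 * 1024)

    zeros_ratio = zlib_ratio(zeros)
    copy_ratio = zlib_ratio(copies[0])
    identical = len({p.read_bytes() for p in copies}) == 1

    print(f"zeros.bin compresses to {zeros_ratio:.2%} of its size")
    print(f"copy_0.bin compresses to {copy_ratio:.2%} of its size")
    print(f"all copies byte-identical: {identical}")
```

The zero-filled file should only shrink on the pools with compression, and the copies should only shrink on the pools with dedup.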
So let’s compare how much space is used in each scenario:
    $ du /zfs_*
    27531052	/zfs_blank/
    27532697	/zfs_dedup/
    27527385	/zfs_compr/
    27529012	/zfs_both/
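If you’d rather not do the subtraction by hand, the differences between these figures can be tallied with a few lines. The numbers are copied from the du output above; `parse_du` is just a helper name for this sketch.

```python
DU_OUTPUT = """\
27531052 /zfs_blank/
27532697 /zfs_dedup/
27527385 /zfs_compr/
27529012 /zfs_both/
"""


def parse_du(text: str) -> dict:
    """Map pool name -> used 1K-blocks from `du` style output."""
    usage = {}
    for line in text.splitlines():
        blocks, path = line.split(maxsplit=1)
        usage[path.strip("/")] = int(blocks)
    return usage


usage = parse_du(DU_OUTPUT)
baseline = usage["zfs_blank"]

# Positive delta = more space used than the plain pool.
for name, blocks in usage.items():
    print(f"{name:10} {blocks:>10} {blocks - baseline:+6d} 1K-blocks vs. zfs_blank")

spread = max(usage.values()) - min(usage.values())
print(f"spread between best and worst: {spread} 1K-blocks")
```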
With deduplication turned on, the files actually use up more space than when it is turned off. Even though these are H.264-encoded videos, turning compression on saves a little space. Adding deduplication on top of compression increases the required space, just as it did without compression. Between the largest (dedup on) and smallest (compression on) amount of space the files could use, there is a difference of 5312 1K-blocks, roughly 5MB. The gains from compression compared to using no compression are 3667 1K-blocks, roughly 3.5MB. You would have to store more than 750 such seasons before the savings would add up to just a single episode’s file size. Here’s a visualization of you turning on dedup for your pool:
Just as I always expected, deduplication does not work on TV show intros, despite them being “just the same”. Due to the nature of modern video encoding, the underlying data is rarely the same: in an episode with lots of explosions, a large share of the bitrate will be dedicated to those scenes and less of it will be left for the intro, so the resulting data will differ from another episode. I’m guessing the gains from compression come from compressible metadata of the container format (and possibly subtitles), but that’s just a wild guess. As others have written before: compression never hurts you, dedup almost certainly does.
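To get a feeling for how strict the requirement is: ZFS deduplicates whole blocks by checksum, so two files only share storage where a block is byte-for-byte identical. The toy sketch below assumes the default 128K recordsize and uses random bytes as stand-in data (not real video); the function names are made up. It shows that even a perfectly identical “intro” only matches when it happens to sit on the same block boundaries in both files, and real encodes don’t even get as far as being identical.

```python
import hashlib
import os

BLOCK = 128 * 1024  # assumed recordsize (the ZFS default)


def block_hashes(data: bytes, block_size: int = BLOCK) -> list:
    """SHA-256 of each fixed-size block, in file order."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def shared_blocks(a: bytes, b: bytes, block_size: int = BLOCK) -> int:
    """Count blocks of `b` whose hash also occurs somewhere in `a`."""
    seen = set(block_hashes(a, block_size))
    return sum(h in seen for h in block_hashes(b, block_size))


intro = os.urandom(4 * BLOCK)  # stand-in for a byte-identical intro

# Same intro, different "cold opens" before it, different episode after it.
episode_a = os.urandom(2 * BLOCK) + intro + os.urandom(BLOCK)
episode_b = os.urandom(3 * BLOCK) + intro + os.urandom(BLOCK)
# Like episode_b, but the intro starts 1000 bytes off a block boundary.
episode_c = os.urandom(3 * BLOCK + 1000) + intro + os.urandom(BLOCK)

print(shared_blocks(episode_a, episode_b))  # 4: the aligned intro blocks match
print(shared_blocks(episode_a, episode_c))  # 0: misaligned, nothing matches
```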
Even on a show with an incredibly long intro like Dexter, you’ll gain nothing from ZFS’ deduplication feature. On the bright side: usually you won’t be “wasting” more than 3% of a file on the intro. An episode of The Simpsons (average length 22:49) only uses 1.826150475% for it. You can calculate that percentage for Lost on your own, I guess.
Now you know.