Building a high performance SSD SAN - Part 1

Over the coming month I will be architecting, building and testing a modular, high performance SSD-only storage solution.

I'll be documenting my progress and findings along the way and open-sourcing all the information as a public guide.

With recent price drops and durability improvements in solid-state storage, there's never been a better time to ditch those old magnets.

Modular server manufacturers such as SuperMicro have invested heavily in R&D thanks to the ever-growing requirements of the cloud vendors that utilise their hardware.

The State Of Enterprise Storage

Companies often settle for off-the-shelf, big-name storage products based on several, often misguided, assumptions:

At the end of the day we don't trust vendors to design our servers - why would we trust them to design our storage?

A great quote on Wikipedia under enterprise storage:

You might think that the hardware inside a SAN is vastly superior to what can be found in your average server, but that is not the case. EMC (the market leader) and others have disclosed more than once that the goal has always been to use as much standard, commercial, off-the-shelf hardware as we can. So your SAN array is probably nothing more than a typical Xeon server built by Quanta with a shiny bezel. A decent professional 1 TB drive costs a few hundred dollars. Place that same drive inside a SAN appliance and suddenly the price per terabyte is multiplied by at least three, sometimes even 10! When it comes to pricing and vendor lock-in you can say that storage systems are still stuck in the mainframe era despite the use of cheap off-the-shelf hardware.

It's the same old story: if you've got lots of money and you don't care how you spend it, or about passing those savings on to your customers - sure, buy the ticket, take the ride - get a unit that comes with a flash logo, a 500 page brochure, licensing requirements and a greasy sales pitch.

Our Needs

Storage performance always seems to be our bottleneck at Infoxchange; we run several high-performance, high-concurrency applications with large databases and complex reporting.

We've grown (very) fast and, in the process, spent too much on off-the-shelf storage solutions. We have a requirement to self-host most of our products securely, under our own control and on our own hardware, and we need to be flexible enough to meet current and emerging security requirements.

I have been working on various proofs-of-concept, which have led to our decision to proceed with our own modular storage system tailored to our requirements.

Requirements

Software

Operating System: Debian - Debian is our OS of choice; it has newer packages than RedHat variants and is incredibly stable.
RAID: MDADM - For SSDs, hardware RAID cards can often be their undoing; they simply can't keep up and quickly become the bottleneck in the system. MDADM is mature and very flexible.
Node-to-Node Replication: DRBD (a minimal resource sketch follows below).
NIC Bonding: LACP.
IP Failover: Pacemaker - We'll probably also use a standard VM somewhere on our storage network for quorum.
Monitoring: Nagios.
Storage Presentation: Open-iSCSI.
Kernel: Latest stable (currently 3.18.7) - Debian Backports currently has kernel 3.16; however, we do daily CI builds of the latest stable kernel source for certain servers, and this may be a good use case for them due to the SCSI bus bypass for NVMe introduced in 3.18+.
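
For the node-to-node replication layer, a DRBD resource definition for the first tier might look something like the sketch below. The resource name, hostnames, addresses and backing device paths are placeholders for illustration, not our final configuration.

```
# /etc/drbd.d/tier1.res - sketch only; hostnames, IPs and device paths
# are placeholders, not our final configuration.
resource tier1 {
  protocol C;                 # synchronous replication between the two nodes
  device    /dev/drbd0;       # replicated block device presented via iSCSI
  disk      /dev/md0;         # local MDADM array backing this resource
  meta-disk internal;

  on san-node-01 {
    address 10.0.0.1:7789;
  }
  on san-node-02 {
    address 10.0.0.2:7789;
  }
}
```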

We're going to start with a two-node cluster. We want to keep rack usage to a minimum, so I'm going to go with a high-density 1RU build.
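
To give an idea of how the Pacemaker-managed IP failover could be expressed, here is a minimal crm sketch for a floating iSCSI target address. The resource name, address and monitor interval are placeholders, and the quorum property would only apply while the cluster really is just the two nodes.

```bash
# Sketch only - resource name and IP address are placeholders.
crm configure primitive p_vip_iscsi ocf:heartbeat:IPaddr2 \
    params ip=10.0.0.10 cidr_netmask=24 \
    op monitor interval=10s

# Until the quorum VM joins the cluster, a plain two-node Pacemaker cluster
# is commonly told to keep running on loss of quorum:
crm configure property no-quorum-policy=ignore
```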

The servers themselves don't need to be particularly powerful, which will help us keep the costs down. Easily the most expensive components are the 1.2TB PCIe SSDs, but the performance and durability of these units can't be overlooked. We're also going to have a second performance tier constructed of high-end SATA SSDs in RAID-10 (a sketch of how the two arrays might be created follows below). Of course, if you wanted to reduce the price further, the PCIe SSDs could be left out or purchased at a later date.
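
As a rough sketch of how the two data tiers might be assembled with MDADM - assuming the PCIe drives enumerate as /dev/nvme0n1 and /dev/nvme1n1 and the eight SATA SSDs as /dev/sda through /dev/sdh (assumed names, not confirmed until the hardware lands):

```bash
# Sketch only - device names are assumptions based on typical enumeration.

# Tier 1: RAID-1 across the two Intel P3600 PCIe SSDs
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/nvme0n1 /dev/nvme1n1

# Tier 2: RAID-10 across the eight SanDisk Extreme Pro SATA SSDs
mdadm --create /dev/md1 --level=10 --raid-devices=8 \
      /dev/sd[a-h]

# Persist the array definitions so they assemble at boot
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u
```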

Hardware

Base Server: SuperMicro SuperServer 1028R-WTNRT - 2x 10GbE, NVMe support, dual PSU, dual SATA DOM support, 3x PCIe, 10x SAS/SATA HDD bays.
CPU: 2x Intel Xeon E5-2609 v3 - We shouldn't need a very high clock speed for our SAN, but it's worth getting the newer v3 processor range for the sake of future-proofing.
RAM: 32GB DDR4 2133MHz - Again, we don't need that much RAM; it will be used for disk caching, but 32GB should be more than enough and can be easily upgraded at a later date.
PCIe SSD: 2x 1.2TB Intel SSD DC P3600 Series (with NVMe) - This is where the real money goes. The Intel DC P3600 and P3700 series really are top of the range; the critical thing to note is that they support NVMe, which will greatly increase performance. They're backed by a 5 year warranty and will be configured in RAID-1 for redundancy.
SATA SSD: 8x SanDisk Extreme Pro SSD 480GB - The SanDisk Extreme Pro line is arguably the most reliable and highest-performing SATA SSD on the market, backed by a 10 year warranty. These will be configured in RAID-10 for redundancy and performance.
OS SSD: 2x 16GB MLC DOM - We don't need much space for the OS, just enough to keep vital logs and package updates. These will be configured in RAID-1 for redundancy.

AHCI vs NVMe

NVMe is a relatively new technology which I'm very interested in making use of for these storage units.

From Wikipedia:

NVM Express has been designed from the ground up, capitalizing on the low latency and parallelism of PCI Express SSDs, and mirroring the parallelism of contemporary CPUs, platforms and applications. By allowing parallelism levels offered by SSDs to be fully utilized by the host's hardware and software, NVM Express brings various performance improvements.

|  | AHCI | NVMe |
|---|---|---|
| Maximum queue depth | 1 command queue; 32 commands per queue | 65536 queues; 65536 commands per queue |
| Uncacheable register accesses (2000 cycles each) | 6 per non-queued command; 9 per queued command | 2 per command |
| MSI-X and interrupt steering | Single interrupt; no steering | 2048 MSI-X interrupts |
| Parallelism and multiple threads | Requires synchronization lock to issue a command | No locking |
| Efficiency for 4 KB commands | Command parameters require two serialized host DRAM fetches | Gets command parameters in one 64 Byte fetch |
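
Once the P3600s are in the machines, that queue-level parallelism is easy to inspect from the host. The commands below are a sketch that assumes the nvme-cli utility is installed and the first controller appears as /dev/nvme0:

```bash
# Sketch only - assumes nvme-cli is installed and the controller is /dev/nvme0.

# Identify the controller (model, firmware, capabilities)
nvme id-ctrl /dev/nvme0

# Feature 0x07 (Number of Queues) shows how many submission/completion
# queue pairs the host has negotiated with the controller
nvme get-feature /dev/nvme0 --feature-id=7
```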

NVMe and the Linux Kernel

Intel published an NVM Express driver for Linux; it was merged into the Linux kernel mainline on 19 March 2012, with the release of version 3.3 of the Linux kernel.

A scalable block layer for high-performance SSD storage, developed primarily by Fusion-io engineers, was merged into the Linux kernel mainline in kernel version 3.13, released on 19 January 2014. This leverages the performance offered by SSDs and NVM Express by allowing much higher I/O submission rates. With this new design of the Linux kernel block layer, internal queues are split into two levels (per-CPU and hardware-submission queues), thus removing bottlenecks and allowing much higher levels of I/O parallelisation.

Note the following: as of version 3.18 of the Linux kernel, released on 7 December 2014, the VirtIO block driver and the SCSI layer (which is used by Serial ATA drivers) have been modified to actually use this new interface; other drivers will be ported in the following releases.
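
As a quick sanity check on one of our CI-built kernels, the multi-queue path can be spotted from sysfs. The device names below are assumptions:

```bash
# Sketch only - device names are assumptions.

# Is scsi-mq enabled? (off by default on 3.18)
cat /sys/module/scsi_mod/parameters/use_blk_mq

# A device using blk-mq exposes its hardware queues under an mq/ directory
ls /sys/block/sda/mq 2>/dev/null && echo "sda is using blk-mq"

# scsi-mq can be switched on at boot via the kernel command line:
#   scsi_mod.use_blk_mq=Y
```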

Debian - our operating system of choice - currently has kernel 3.16 available (using the official backports mirrors); however, we do generate CI builds of the latest stable kernel for specific platforms - if you're interested in how we're doing that, I have some information here.

That's where I'm up to for now; the hardware will hopefully arrive in two weeks and I'll begin the setup and testing.

Coming soon

Stay tuned!

[6/2/2015 - Sam McLeod]

