Coho is frequently accused of being a software defined storage system. I don't particularly love this term, because I think it is hugely overloaded. I fear that most people who hear "software defined" think it means "installed as software." By that description, Ceph and Swift are software defined, but your FAS8000 or VMAX 40K aren't. This is a horrible interpretation of the term, not least because both the NetApp and EMC products probably have several hundred times the number of lines of code that either of those open source projects do. The interesting bits of storage are all software; "software defined" shouldn't be a description of form factor or delivery mechanism.
So what is software defined? As I've mentioned before, "software defined" is a term that's come from other domains, most successfully networking. There are a few really strong ideas associated with the term, but to me the most important property of software definition is that of control.
In networking, this point is crystal clear. For a long time, networking was based on distributed algorithms: BGP, OSPF, STP, and so on. Every network device would run its own copy of a control program and, almost magically, these things would collaborate to make the connected system actually work. That networks composed of multiple vendors' implementations of complicated protocols like BGP work at all is completely gobsmacking. That they work at the scale of the Internet is nothing short of miraculous. And let's be clear: "works at all" has kind of been the bar that we held large-scale IP networks to for a long time. But did they work well?
The short answer is no. There are a whole bunch of things that distributed control basically sucks at. In a networking context, this includes properties like utilization, stabilization time, and security. While the networks that we built before SDN were able to (miraculously!) scale, they were frequently very bad at making efficient use of physical resources (often well under 50% utilization, even in the face of significant load), they took a long time to recover from failures and reconfigurations, and let's just not get started on security. Distributed control is effectively decision by committee: all your routers get together and have frequent meetings in small groups. They tell each other what their neighbors said, and they come up with passable solutions that work for most of the participants, most of the time.
Enter software defined networking. In the SDN model there is a central component, quite subtly called a "controller," that acts as a single brain making decisions for all the devices in the network at once. The controller continuously evaluates the state of the entire system, and pushes down instructions to each individual component to make sure that they are all working together as effectively as possible. As a result, SDN-based techniques are able to achieve much higher utilization of resources, to respond more quickly to failure, and to manage things like access control from an end-to-end perspective. If it's doing its job right, an SDN controller is like a benevolent dictator or a very effective CEO, which are sort of the same thing.
One final, but really, really important point about SDN and the controller-centric architecture: it is especially good at responding to dynamism. Distributed control in networks meant that whenever something changed (nodes failed, path weights were adjusted, traffic patterns shifted), the information about that change had to propagate. As a result, networks at large were very slow to respond to the unexpected; DNS updates are a familiar example for most people here. SDN, with a central point of control, is specifically good at making dynamic decisions. This is exactly why it can be used for things like load-responsive traffic management, where connections are actively steered through a network using very high-frequency reconfigurations to maximize the use of physical links.
So if there is going to be such a thing as software defined storage, let's please think about it in terms of control. From this perspective, what we are building at Coho is absolutely a categorical example of SDS for one simple reason: there is a central brain. But in a storage context, what does it do?
To answer this question, it's helpful to start by thinking about how old storage works. In using the pejorative term "old" I obviously mean all storage that came before ours. No need to be subtle here.
Old storage designs were based on spinning disks, and spinning disks were spectacularly slow. As a result, when you built a storage system, you put your data on a spinning disk and hoped that you would never move it again. You hoped this because every operation that moved that data would steal performance from applications that might actually be accessing it. At the end of the day, and with the exception of per-disk optimizations like defragmentation, the only time you ever moved data was when something failed. And when this happened, you had a very precise plan (RAID, for example) for how to build an exact copy of the thing that failed.
This approach to managing storage is exactly distributed control. Here's why: you set up concrete abstractions, things like RAID groups, volumes, or aggregates. You make a decision at that exact point in time. You decide how much performance to give a LUN or file system. You decide how it should respond to failures. You commit to these decisions, and slowly carve up your storage into a bunch of islands that are individually difficult to manage, and that take no advantage at all of the global set of resources or a global view of the environment. This is crazy.
What would a control-centric architecture mean for a storage system? This (perhaps unsurprisingly at this point) turns out to be exactly the approach that Coho has taken to building our storage system. Here's a graphical representation of what our storage controller looks like:
The SDSC is a central decision-making task that runs within the Coho cluster. It continuously evaluates the state of the system, and makes decisions regarding two specific points of control: data placement and connectivity. At any point in time, the SDSC can respond to change by either moving client connections or by migrating data. These two knobs turn out to be remarkably powerful tools in making the system perform well.
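As a toy illustration of how a controller might use the connectivity knob, consider a sketch like the following. All of the names here (`plan_rebalance`, the node labels) are hypothetical, invented for illustration, and are not Coho's actual APIs; the point is only that a single brain with a global view can pair overloaded components with underloaded ones.

```python
# Toy sketch of a controller's rebalance decision: given a global view
# of per-node load, pair each overloaded node with the least-loaded one.
# Names and thresholds are illustrative assumptions, not Coho's APIs.

def plan_rebalance(node_load, threshold=0.8):
    """node_load: dict mapping node name -> load fraction (0.0-1.0).
    Return a list of (action, from_node, to_node) tuples that steer
    client connections away from nodes above the threshold."""
    actions = []
    # Visit the most heavily loaded nodes first.
    for node, load in sorted(node_load.items(), key=lambda kv: -kv[1]):
        if load > threshold:
            target = min(node_load, key=node_load.get)  # least-loaded node
            # Re-pointing a client connection is the cheaper knob;
            # migrating data is the heavier fallback for persistent skew.
            actions.append(("move_connection", node, target))
    return actions

print(plan_rebalance({"ma1": 0.95, "ma2": 0.40, "ma3": 0.30}))
```

A real controller would of course weigh migration cost and replication constraints before acting, but the shape of the decision, observe globally and then act on a specific component, is the essence of the control-centric model.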
Let's consider two aspects of how data placement is managed in Coho's architecture. First, Coho's microarrays are directly responsible for implementing automatic tiering of the data that is stored on them. Tiering happens in response to workload characteristics, but a simple characterization of what happens is that as the PCIe flash device fills up, the coldest data is written out to the lower tier. This is illustrated in the diagram below.
All new data writes go to NVMe flash. Effectively, this top tier of flash has the ability to act as an enormous write buffer, with the potential to absorb burst writes that are literally terabytes in size. Data in this top tier is stored sparsely at a variable block size.
As data in the top layer of flash ages and that layer fills, Coho's operating environment (called Coast) actively migrates cold data to the lower tiers within the microarray. The policy for this demotion is device-specific: on our hybrid (HDD-backed) DataStore nodes, data is consolidated into linear 512K regions and written out as large chunks. On repeated access, or when analysis tells us that access is predictive of future re-access, disk-based data is promoted, or copied back into flash, so that additional reads to the chunk are served faster.
On our DS2000f (all-flash) systems, the second tier of storage is also flash. As a result, data is demoted in smaller (typically 8K) chunks, because we don't need to optimize around seek costs. Additionally, data is almost never promoted from flash to flash, because direct read performance from the lower tier of flash is adequate. This hybrid, but all-flash, approach allows the NVMe tier to efficiently absorb writes (and rewrites!), and protects the durability of the lower, read-oriented tier.
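A heavily simplified model of this demotion policy might look like the sketch below. The data structures, the high-water mark, and the function name are assumptions for illustration only, not Coho's implementation; what it captures is the described behavior, that demotion kicks in as the top tier fills, evicts coldest-first, and writes out in device-specific chunk sizes (512K to disk, 8K to SSD).

```python
# Illustrative model of coldest-first demotion from the NVMe tier.
# Structures and thresholds are assumptions, not Coho's implementation.

CHUNK = {"hybrid": 512 * 1024, "all_flash": 8 * 1024}  # demotion unit in bytes

def demote_cold(tier1, capacity, model, high_water=0.9):
    """tier1: dict mapping block id -> (last_access_time, size_bytes).
    Demote the coldest blocks to the second tier until NVMe usage
    falls below the high-water mark; return what was demoted."""
    demoted = []
    used = sum(size for _, size in tier1.values())
    # Coldest first: smallest last_access_time.
    for block in sorted(tier1, key=lambda b: tier1[b][0]):
        if used <= high_water * capacity:
            break
        _, size = tier1.pop(block)
        used -= size
        # Each demotion is written out in the model-specific chunk size.
        demoted.append((block, CHUNK[model]))
    return demoted
```

On a hybrid node the large 512K chunks amortize seek costs on the spinning tier; on an all-flash node the small 8K chunks avoid write amplification without a seek penalty, which is exactly the trade-off described above.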
The local decision for tiering can be thought of as similar to the fast path on a network switch. We elected not to have the controller be intimately involved in the placement of data within a given microarray, but we still provide APIs that allow the controller to intervene from outside to tune promotions, demotions, and resource limits for individual objects.
SDSC Example: Tiering and Data Placement in a Mixed-Mode Cluster
Now things start to get exciting: Coho's controller is responsible for managing how objects, or stripes of objects, are placed across microarrays within the system. This is a very important aspect of our design, because Coho supports mixed-mode, or non-homogeneous, clusters of storage hardware. We must be careful to place data in a way that gets maximum load balance and utilization out of our storage hardware.
With this in mind, the job of the controller is to take a bunch of workloads, each of which may be highly different, and a bunch of microarrays, also potentially different, and come up with a plan to map data and connections to hardware.
Let's consider the performance characteristics of two different DataStream models, the DS1000h (hybrid) and the DS2000f (all flash):
As described above, both of these systems mix NVMe flash with a secondary tier of storage, either spinning disks (DS1000h) or SSDs (DS2000f). The diagram above compares ballpark latencies for storage on the two systems. Note that the NVMe flash tier is almost identical in performance on both systems; the differences there are due to the fact that we use two different vendors' cards across those systems today.
However, as we move into the second tier which is intended for cold data, there is a clear distinction between the two models: Disk latencies are over a hundred times slower than NVMe (and even worse than that under load), while SSDs are less than a factor of 10 slower, and still offer access times that are under a millisecond.
There's an important observation to be made about these two systems: as long as the bulk of hot data fits in NVMe flash, performance will be almost identical. Put another way, as long as automatic tiering is doing its job, accesses to slow spinning disk will be minimized, and the vast majority of requests will be served from NVMe flash.
However, as the total amount of hot data, often referred to as the working set, grows, the disk-backed system will be forced to serve more requests from disk and performance will suffer. This is the point at which the system should be scaled out, increasing the available NVMe flash and keeping hot data in fast memory.
The DS2000f imposes a much lower penalty as data shifts from primary flash to secondary flash. In fact, the secondary flash tier of SSDs is what most of today's all-flash systems use for primary storage. It is completely capable of serving active requests for data, and as a result the DS2000f is able to serve significantly more active workloads than the hybrid system.
Understanding the workloads:
The problem with workloads is that they aren't all created equal. One user runs a simple web server with a few static files that barely touches the disk at all. A second runs a 2TB database with pathologically random I/O. An enduring challenge for storage administrators has been responding to the system-wide problems that are caused by a typically very small set of highly demanding workloads.
If we understand workloads in detail, and specifically if we understand the working set characteristics of those workloads, we can form an estimate of the amount of fast memory that they will require to run successfully. This is very important in a scale out storage system, because we want to be as uniform as possible in our use of available resources, avoiding hot spots that overload some of the systems.
Coho's storage system achieves this by carefully and efficiently monitoring each client's request stream and maintaining a precise model of each working set over time. Using a technique called counter stacks, we continuously collect this data and store it in a format that allows the controller to examine both current and historical working set data on a per-workload basis.
A useful way of thinking about working set data is to consider a hit ratio curve (HRC). The HRC is a plot of the expected hit rate in flash of a given workload, as a function of the amount of flash that is available to it. Coho's in-line workload analysis efficiently calculates HRCs for live workloads on the system. Here are two example HRCs:
These two workloads may run in identically sized VMs, and use identically sized VMDKs. However, the web server has a workload that fits comfortably in about 1GB of flash. Even if its disk is 100GB, the web server will receive great IO performance and function entirely in flash with a relatively small allocation.
The media server, on the other hand, has a highly random workload over a large working set. As a result, it receives almost no benefit from the ability to place hot data in flash until it has about 80GB assigned to it. At that point, it receives tremendous benefit from additional flash, up to about 100GB.
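For intuition about where an HRC comes from, here is the classic, and much less space-efficient, way to derive one from an access trace: compute the LRU reuse distance of every access with Mattson's stack algorithm, then count the fraction of accesses whose reuse distance fits within each candidate cache size. Counter stacks approximate this same curve in far less space; the quadratic-time sketch below is purely for illustration.

```python
# Classic LRU stack-distance computation (Mattson's algorithm), shown
# here only to illustrate what an HRC measures. Production systems use
# compact approximations; this naive version is O(n * unique_blocks).

def hit_ratio_curve(trace):
    """trace: sequence of block ids in access order.
    Return {cache_size_in_blocks: expected_hit_ratio} for an LRU cache."""
    stack = []       # LRU stack, most recently used at the end
    distances = []   # reuse distance per access; None = cold (first) miss
    for block in trace:
        if block in stack:
            # Distance from the top of the stack = how big an LRU cache
            # would need to be for this access to hit.
            distances.append(len(stack) - stack.index(block))
            stack.remove(block)
        else:
            distances.append(None)
        stack.append(block)
    n = len(trace)
    return {size: sum(1 for d in distances if d is not None and d <= size) / n
            for size in range(1, len(stack) + 1)}

print(hit_ratio_curve(["a", "b", "a", "b"]))
```

A flat-then-steep curve like the media server's falls straight out of this computation: random access over a large region produces reuse distances clustered near the working set size, so the hit ratio stays near zero until the cache is big enough to cover it.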
Storage systems have historically not been able to efficiently calculate HRCs for individual workloads, nor have they had the relevant resources to make useful decisions with them. Coho's software-defined storage controller is specifically designed to use data like this to make placement decisions.
To close the loop on this example, let's consider how Coho's customers typically build out their environments. They frequently start with some number of hybrid nodes to achieve a baseline of capacity. Let's say that they scale out to three of our DS1000h systems. With this configuration, they will have 8.4TB of usable PCIe flash. As new workloads are placed on the system, the controller will monitor load, capacity, and unexpected events, and ensure that performance is balanced and that data is replicated and available.
Now let's assume that this customer is seeing significant growth in demand from a small number of specific VMs. This sort of information is surfaced directly in Coho's UI, for instance in the working set view of the workload shown below.
Interacting with the analytics presentation on the Coho cluster, the customer decides to add additional performance to the environment by extending it with a DS2000f all-flash node. The expansion node is ordered, racked, and wired in to the rest of the system. This is all the customer needs to do.
At this point, the SDSC identifies the additional available space and characterizes its performance. In the microarray performance graph above, we now have six hybrid microarrays (two per chassis) and two all-flash ones.
The controller will then solve what is known as a constraint satisfaction problem: It will look at the HRCs for all active workloads, and the available performance of all nodes in the system. It will further consider additional internal constraints, such as maintaining replication properties, balancing load, and minimizing data movement within the cluster. It will then come up with a plan for moving the system to a new configuration that takes advantage of the new performance node. The performance node is capable of handling significantly more hot data, and so will tend to receive the largest workloads.
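The real solver juggles replication, load balance, and data-movement cost simultaneously, but the core intuition of that last step can be sketched as a greedy assignment. Everything here, the function name, the node and VM labels, and the numbers, is hypothetical; the sketch just shows why the new all-flash node naturally attracts the largest working sets.

```python
# Greedy placement sketch (not Coho's actual solver): assign the
# workloads with the largest working sets first, each to the node
# currently holding the most free flash.

def place_workloads(working_sets, node_flash):
    """working_sets: {vm: hot_data_GB} taken from each workload's HRC.
    node_flash: {node: free_flash_GB}. Return {vm: node}."""
    free = dict(node_flash)  # don't mutate the caller's view
    placement = {}
    # Biggest working sets are hardest to place, so handle them first.
    for vm, hot in sorted(working_sets.items(), key=lambda kv: -kv[1]):
        node = max(free, key=free.get)  # node with the most free flash
        placement[vm] = node
        free[node] -= hot
    return placement
```

With, say, an 80GB media server, a 40GB database, and a 1GB web server placed across a 100GB hybrid node and a 150GB all-flash node, the greedy pass sends the media server to the all-flash node, which is the "largest workloads land on the performance node" behavior described above.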
All of this happens completely transparently and dynamically, while the system is live.
Mixed-mode clusters are only one example of how a central controller is capable of achieving responsive storage and maintaining high levels of utilization. The SDSC responds to all sorts of other environmental events in the cluster. More importantly, and just like a controller in a software defined network, it is completely programmable. Instead of offering interfaces at the level of packets and flows, the SDSC allows policy to be specified on a per-object basis. These APIs can be used for things like ensuring that vmdks being accessed by a specific ESX host remain co-located for efficiency, or indicating that very large block sizes should be used for data that will be used for Hadoop-based analytics.
It's been an enormous amount of work to build the infrastructure required to apply a software-defined, controller-based model to a scalable storage system. It's included high-performance data collection, efficient analysis, distributed system coordination, and careful integration with data layout and motion. It's really, really rewarding to start to see the power of these pieces coming together.