Google recently unveiled one of their crown jewels of system infrastructure: Borg, their cluster scheduler. This prompted me to re-read the Mesos and Omega papers, which deal with the same topic. I thought it'd be interested to do a compare and contrast of these systems. Mesos gets credit for the groundbreaking idea of two-level scheduling, Omega improved upon this with an analogy from databases, and Borg can sort of be seen as the culmination of all these ideas.
Cluster schedulers have existed long before big data. There's a rich literature on scheduling on 1000s of cores in the HPC world, but their problem domain is simpler than what is addressed by datacenter schedulers, meaning Mesos/Borg and their ilk. Let's compare and contrast on a few dimensions.
Supercomputers separate storage and compute and connect them with an approximately full-bisection bandwidth network that goes at close to memory speeds (GB/s). This means your tasks can get placed anywhere on the cluster without worrying much about locality, since all compute nodes can access data equally quickly. There are a few hyper-optimized applications that optimize for the network topology, but these are very rare.
Data center schedulers do care about locality, and in fact this is the whole point of GFS and MapReduce co-design. Back in the 2000s, network bandwidth was comparatively much more expensive than disk bandwidth. So, there was a huge economic savings by scheduling your computation tasks on the same node that held the data. This is a major scheduling constraint; whereas before you could put the task anywhere, now it needs to go on one of the three data replicas.
Supercomputers are typically composed of homogeneous nodes, i.e. they all have the same hardware specs. This is because supercomputers are typically purchased in one shot: a lab gets $x million dollars for a new one, and they spend it all upfront. Some HPC applications are optimized for the specific CPU models in a supercomputer. New technology like GPUs or co-processors are rolled out as a new cluster.
In the big data realm, clusters are primarily storage constrained, so operators are continually adding new racks with updated specs to expand cluster capacity. This means it's typical for nodes to have different CPUs, memory capacities, number of disks, etc. Also toss in special additions like SSDs, GPUs, shingled drives. A single datacenter might need to support a broad range of applications, and all of this again imposes additional scheduling constraints.
When running an application on a supercomputer, you specify how many nodes you want, the queue you want to submit your job to, and how long the job will run for. Queues place different restrictions on how many resources you can request and how long your job can run for. Queues also have a priority or reservation based system to determine ordering. Since the job durations are all known, this is a pretty easy box packing problem. If the queues are long (typically true) and there's a good mix of small jobs to backfill the space leftover from big jobs (also typical), you can achieve extremely high levels of utilization. I like to visualize this in 2D, with time as X and resource usage as Y.
As per the previous, datacenter scheduling is a more general problem. The "shape" of resource requests can be quite varied, and there are more dimensions. Jobs also do not have a set duration, so it's hard to pre-plan queues. Thus we have more sophisticated scheduling algorithms, and the performance of the scheduler thus becomes important.
Utilization as a general rule is going to be worse (unless you're Google; more on that later), but one benefit over HPC workloads is that MapReduce and similar can be incrementally scheduled instead of gang scheduled. HPC, we wait until all N nodes that you requested are available, then run all your tasks at once. MR can instead run its tasks in multiple waves, meaning it can still effectively use bits of leftover resources. A single MR job can also ebb and flow based on cluster demand, which avoids the need for preemption or resource reservations, and also helps with fairness between multiple users.
Mesos predates YARN, and was designed with the problems of the original MapReduce in mind. Back then, Hadoop clusters could run only a single application: MapReduce. This made it difficult to run applications that didn't conform to a map phase followed by a reduce phase. The biggest example here is Spark. Previously, you'd have to install a whole new set of workers and masters for Spark, which would sit alongside your MapReduce workers and masters. Hardly ideal from a utilization perspective, since they were typically statically partitioned.
Mesos addresses this problem by providing a generalized scheduler for all cluster applications. MapReduce and Spark became simply different applications using the same underlying resource sharing framework. The simplest approach would be to write a centralized scheduler, but that has a number of drawbacks:
Instead, Mesos introduces the idea of two-level scheduling. Mesos delegates the per-application scheduling work to the applications themselves, while Mesos still remains responsible for resource distribution between applications and enforcing overall fairness. This means Mesos can be pretty thin, 10K lines of code.
Two-level scheduling happens through a novel API called resource offers, where Mesos periodically offers some resources to the application schedulers. This sounds backwards at first (the request goes from the master to the application?), but it's actually not that strange. In MR1, the TaskTracker workers are the source of truth as to what's running on a node. When a TT heartbeats in saying that a task has completed, the JobTracker then chooses something else to run on that TaskTracker. Scheduling decisions are triggered by what's essentially a resource offer from the worker. In Mesos, the resource offer comes from the Mesos master instead of the slave, since Mesos is managing the cluster. Not that different.
Resource offers act as time-bounded leases for some resources. Mesos offers resources to an application based on policies like priority or fair share. The app then computes how it uses them, and tells Mesos what resources from the offer it wants. This gives the app lots of flexibility, since it can choose to run a portion of tasks now, wait for a bigger allocation later (gang scheduling), or size its tasks differently to fit what's available. Since offers are time-bounded, it also incentivizes applications to schedule quickly.
Some concerns and how they were addressed:
Unaddressed / unknown resolution:
Omega is sort of a successor to Mesos, and in fact shares an author. Since the paper uses simulated results for its evaluation, I suspect it never went into production at Google, and the ideas were rolled into the next generation of Borg. Rewriting the API is probably too invasive of a change, even for Google.
Omega takes the resource offers one degree further. In Mesos, resource offers are pessimistic or exclusive. If a resource has been offered to an app, the same resource won't be offered to another app until the offer times out. In Omega, resource offers are optimistic. Every application is offered all the available resources on the cluster, and conflicts are resolved at commit time. Omega's resource manager is essentially just a relational database of all the per-node state with different types of optimistic concurrency control to resolve conflicts. The upside of this is vastly increased scheduler performance (full parallelism) and better utilization.
The downside of all this is that applications are in a free-for-all where they are allowed to gobble up resources as fast as they want, and even preempt other users. This is okay for Google because they use a priority-based system, and can go yell at their internal users. Their workload broadly falls into just two priority bands: high-priority service jobs (HBase, webservers, long-lived services) and low-priority batch jobs (MapReduce and similar). Applications are allowed to preempt lower-priority jobs, and are also trusted to stay within their cooperatively enforced limits on # of submitted jobs, amount of allocated resources, etc. I think Yahoo has said differently about being able to go yell at users (certainly not scalable), but it works somehow at Google.
Most of the paper talks about how this optimistic allocation scheme works with conflicts, which is always the question. There are a few high-level notes:
This is a production experience paper. It's the same workload as Omega since it's also Google, so many of the metapoints are the same.
The issues listed here are pretty much fixed in Kubernetes, their public, open-source container scheduler.
It seems like YARN will need to draw from Mesos and Omega to scale up to the 10K node scale. YARN is still a centralized scheduler, which is the strawman for comparison in Mesos and Omega. Borg specifically mentions the need to shard to scale.
Isolation is very important to achieve high utilization without compromising SLOs. This can surface at the application layer, where apps themselves need to be design to be latency-tolerant. Think tail-at-scale request replication in BigTable. Ultimately it comes down to hardware spend vs. software spend. Running at lower utilization sidesteps this problem. Or, you can tackle it head-on through OS isolation mechanisms, resource estimation, and tuning your workload and schedulers. At Google-scale, there's enough hardware that it makes sense to hire a bunch of kernel developers. Fortunately they've done the work for us :)
I wonder also if the Google workload assumptions apply more generally. Priority bands, reservations, and preemption work well for Google, but our customers almost all use the fair share scheduler. Yahoo uses the capacity scheduler. Twitter uses the fair scheduler. I haven't heard of any demand or usage of a priority + reservation scheduler.
Finally, very few of our customers run big shared clusters as envisioned at Google. We have customers with thousands of nodes, but this is split up into pods of hundreds of nodes. It's also still common to have separate clusters for separate users or applications. Clusters are also typically homogeneous in terms of hardware. I think this will begin to change though, and soon.