
Solr Compute Cloud – An Elastic Solr Infrastructure

This is a post by Nitin Sharma and Li Ding, engineers on the Search and Data Infrastructure team at BloomReach.

Scaling a multi-tenant search platform for high availability while maintaining low latency is a hard problem to solve. It's especially hard when the platform runs a heterogeneous workload over hundreds of millions of documents and hundreds of collections in SolrCloud.

Typically, search platforms have a shared cluster setup, which does not scale out of the box for heterogeneous use cases. Its shortcomings include uneven workload distribution, suboptimal cache tuning, unpredictable commit frequency, and misbehaving clients that leak connections.

The key to solving these problems is isolation: we isolate the reads and writes of each collection. At BloomReach, we have implemented an elastic Solr infrastructure that dynamically grows and shrinks. It provides the right amount of isolation among pipelines while improving resource utilization. The SC2 API and HAFT services (both built in-house) give us the ability to do this isolation and scale the platform elastically, while guaranteeing high availability, low latency and low operational cost.

This blog describes our solution in greater detail and how we scaled our infrastructure to be truly elastic and cost-optimized. We plan to open-source the HAFT service in the future for anyone who is interested in building their own highly available Solr search platform.

Problem Statement

Below is a diagram that describes our workload.

In this scheme, our production Solr cluster is the center of everything: indexing jobs write to it, analysis and MapReduce pipelines read from it, and customer search queries are served directly from it.

With this kind of workload, we face several key challenges with each client we serve. The graph below illustrates some of the challenges with this architecture:

For indexing jobs:

  1. Commits at the same time as heavy reads: Indexing jobs run at the same time as pipelines and customer queries, which hurts both pipeline performance and the latency of customer queries.
  2. Frequent commits: Frequent commits and non-batched updates cause performance issues in Solr (see the batching sketch after this list).
  3. Leaked indexing: Indexing jobs might fail, resulting in leaked clients that accumulate over time.
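
To illustrate the batching point above, here is a minimal sketch using Solr's standard JSON update handler; it is generic Solr practice, not BloomReach's code, and the host, collection, and field names are placeholders. Sending batches with commitWithin lets Solr coalesce commits instead of opening a new searcher on every update.

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/collection_a/update"  # placeholder host/collection

def index_batched(docs, batch_size=500):
    """Send documents in batches, deferring commits via commitWithin
    so Solr can coalesce them instead of committing per request."""
    for i in range(0, len(docs), batch_size):
        resp = requests.post(
            SOLR_UPDATE,
            json=docs[i:i + batch_size],
            # commitWithin (ms): Solr commits once within this window
            # rather than on every update request.
            params={"commitWithin": "10000"},
        )
        resp.raise_for_status()

index_batched([{"id": str(n), "title_t": f"doc {n}"} for n in range(2000)])
```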

For the pipeline jobs:

  1. Cache tuning: It is impossible to tune Solr's caches well, because the query pattern varies between pipeline jobs working on the same collections.
  2. OOM and high CPU usage: The workload is unevenly distributed among Solr hosts in the cluster; some nodes hit out-of-memory errors while others run at high CPU usage.
  3. Bad pipeline: One bad client or query can degrade the performance of the entire cluster or make a node unresponsive.
  4. Heavy load pipeline: One heavy pipeline can affect other, smaller pipelines.
  5. Concurrent pipelines: The more concurrent pipeline jobs we ran, the more failures we saw.

BloomReach Search Architecture

Left unchecked, these problems would eventually affect the availability and latency SLAs we have with our customers. The key to solving these problems is isolation. Imagine if every pipeline and indexing job had its own Solr cluster containing just the collections it needs, with every cluster optimized for that job in terms of system, application and cache requirements. The production Solr cluster wouldn't see any impact from those workloads. At BloomReach, we designed and implemented a system we call Solr Compute Cloud (SC2) to isolate all of this workload and scale Solr.

The architecture overview of SC2 is shown in the diagram below:

We have an elastic layer of clusters that serves as the primary source of data for large indexing and analysis MapReduce pipelines. This prevents any pipeline from accessing production clusters directly; only search queries from customers are allowed to reach them. The technologies behind the elastic layer are the SC2 API and the Solr HAFT (High Availability and Fault Tolerance) service, both built in-house.

SC2 API features include dynamically provisioning Solr clusters with the required collections and configuration, and terminating them when the jobs that requested them finish.

The Solr HAFT service provides several key features to support our SC2 infrastructure; most importantly, it replicates index data between production and dynamically provisioned clusters, in both directions.
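
The post does not describe HAFT's internals, but copying an index between clusters can be done with Solr's stock replication handler. The sketch below shows one way such a copy could work, with hypothetical host names: the target core is asked to pull the index from a source core.

```python
import requests

def fetch_index(target_core_url: str, source_core_url: str) -> None:
    """Ask the target core to pull the index from the source core using
    Solr's stock replication handler (one way HAFT-style copying could work).
    fetchindex operates per core, so a multi-shard collection needs one
    call per shard/core pair."""
    resp = requests.get(
        f"{target_core_url}/replication",
        params={
            "command": "fetchindex",
            # masterUrl points the target at the source's replication handler.
            "masterUrl": f"{source_core_url}/replication",
        },
    )
    resp.raise_for_status()

# Hypothetical hosts: replicate collection_a from production into an SC2 node.
fetch_index(
    "http://sc2-node1:8983/solr/collection_a",
    "http://prod-node1:8983/solr/collection_a",
)
```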

Read Workflow

Here are the detailed steps of how read pipeline jobs work in SC2 (a minimal client sketch follows the list):

  1. The read pipeline requests a collection with the desired number of replicas from the SC2 API.
  2. The SC2 API dynamically provisions an SC2 cluster with the needed setup (and streams Solr data).
  3. SC2 calls the HAFT service to request data replication.
  4. The HAFT service replicates data from production to the provisioned cluster.
  5. The pipeline uses this cluster to run its job.
  6. After the pipeline job finishes, it calls the SC2 API to terminate the cluster.
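
Here is a minimal sketch of what a read pipeline client could look like. The SC2 API is in-house and its endpoints are not documented in this post, so the `/clusters` resource, the payload and response fields, and the hostnames below are all hypothetical; only the Solr select call uses Solr's standard API.

```python
import time
import requests

SC2_API = "http://sc2-api.internal:8080"  # hypothetical SC2 API endpoint

# Steps 1-2: request a dynamically provisioned cluster holding collection_a.
cluster = requests.post(
    f"{SC2_API}/clusters",
    json={"collection": "collection_a", "replicas": 2},  # hypothetical payload
).json()

# Steps 3-4: SC2 asks the HAFT service to replicate the collection from
# production; the client just polls until the cluster reports ready.
while requests.get(f"{SC2_API}/clusters/{cluster['id']}").json()["status"] != "ready":
    time.sleep(5)

# Step 5: run the pipeline's queries against the isolated cluster, not production.
solr = cluster["solr_url"]  # hypothetical field, e.g. http://sc2-node1:8983/solr
docs = requests.get(
    f"{solr}/collection_a/select", params={"q": "*:*", "wt": "json", "rows": 100}
).json()["response"]["docs"]

# Step 6: terminate the cluster so its resources are reclaimed.
requests.delete(f"{SC2_API}/clusters/{cluster['id']}")
```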

Indexing Workflow

Below are the detailed steps describing how an indexing job works in SC2 (a client sketch follows these steps).

  1. The indexing job uses the SC2 API to create an SC2 cluster holding collection A with two replicas.
  2. The SC2 API dynamically provisions the SC2 cluster with the needed setup (and streams Solr data).
  3. The indexer uses this cluster to index the data.
  4. The indexer calls the HAFT service to replicate the index from the SC2 cluster to production.
  5. The HAFT service reads data from the dynamic cluster and replicates it to production Solr.

After the job finishes, it calls the SC2 API to terminate the SC2 cluster.
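
To make the direction of data flow concrete, here is a minimal sketch of the indexing workflow under the same assumptions as the read sketch above: the SC2 API endpoint, the `/clusters` and `/replications` resources, and the payload fields are hypothetical, while the update call uses Solr's standard JSON update handler.

```python
import requests

SC2_API = "http://sc2-api.internal:8080"  # hypothetical SC2 API endpoint

# Steps 1-2: provision an isolated cluster for collection A with two replicas.
cluster = requests.post(
    f"{SC2_API}/clusters",
    json={"collection": "collection_a", "replicas": 2},  # hypothetical payload
).json()
solr = cluster["solr_url"]  # hypothetical field

# Step 3: index into the SC2 cluster; production never sees these commits.
docs = [{"id": str(n), "title_t": f"doc {n}"} for n in range(1000)]
requests.post(
    f"{solr}/collection_a/update",
    json=docs,
    params={"commit": "true"},
).raise_for_status()

# Steps 4-5: ask HAFT to replicate the fresh index back to production.
requests.post(
    f"{SC2_API}/replications",  # hypothetical HAFT-backed resource
    json={"source": cluster["id"], "target": "production", "collection": "collection_a"},
).raise_for_status()

# Finally, tear down the SC2 cluster.
requests.delete(f"{SC2_API}/clusters/{cluster['id']}")
```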

Lucene/Solr Revolution 2014 Talk in Washington, D.C.

We spoke in detail about the elastic infrastructure at BloomReach at last year's Lucene/Solr Revolution conference. Links to the video of the talk and the slides are below.

  1. Talk: https://www.youtube.com/watch?v=1sxBiXsW6BQ
  2. Slides: http://www.slideshare.net/nitinssn/solr-compute-cloud-an-elastic-solrcloud-infrastructure

Conclusion

Scaling a search platform with a heterogeneous workload over hundreds of millions of documents and a massive number of collections in SolrCloud is nontrivial. A kitchen-sink shared cluster approach does not scale well and has a lot of shortcomings, such as uneven workload distribution, suboptimal cache tuning, unpredictable commit frequency and misbehaving clients leaking connections.

The key to solving these problems is isolation. Not only do we isolate read and write jobs as a whole, we also isolate the writes and reads of each collection. The in-house SC2 API and HAFT services give us the ability to do this isolation and scale the platform in an elastic manner.

The SC2 infrastructure gives us high availability and low latency at low cost by isolating heterogeneous workloads from production clusters. We plan to open-source the HAFT service in the future for anyone who is interested in building their own highly available Solr search platform.
