Spark2Elasticsearch


A few months ago, I was asked to help several teams write large volumes of data from Apache Spark into their Elasticsearch clusters more efficiently.

First, I found that many of the teams were using ES-Hadoop, the official connector released by Elasticsearch. Reading its source code on GitHub, I soon learned that it uses the REST API. Building on the REST API is a sensible design: it provides wide compatibility across both the 1.x and 2.x release lines, as well as composability with various Hadoop-related tools.

Unfortunately, the REST API was not efficient enough for these workloads, even after various tuning and optimizations on both the client and the server.

As a result, I developed and published a Scala library called Spark2Elasticsearch to write large data volumes through Spark to Elasticsearch 2.0 and 2.1 clusters. The GitHub repository is available at https://github.com/jparkie/Spark2Elasticsearch. Artifacts are cross-published to the Sonatype Central Repository as spark2elasticsearch_2.10 and spark2elasticsearch_2.11.

The library uses the Java API, which communicates over the TCP transport layer, to serialize DataFrames into UPSERTs for Elasticsearch. This approach yielded an order-of-magnitude increase in performance. I hope this library helps others who are facing similar performance issues.
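To make the UPSERT idea concrete, here is a minimal, dependency-free sketch in plain Scala. It is not the Spark2Elasticsearch API: the names `Upsert`, `UpsertSketch`, `toUpsert`, and `applyAll` are hypothetical, a row is modeled as a simple `Map`, and an in-memory `Map` stands in for the Elasticsearch index. It only illustrates the semantics of an upsert: insert the document when the id is new, otherwise merge the new fields into the stored document.

```scala
// Hypothetical model for illustration only; not the library's actual types.
final case class Upsert(id: String, doc: Map[String, Any])

object UpsertSketch {
  // Turn one row (column name -> value) into an upsert keyed by `idField`.
  // The id column is removed from the document body.
  def toUpsert(idField: String)(row: Map[String, Any]): Upsert =
    Upsert(row(idField).toString, row - idField)

  // Apply upserts to an in-memory stand-in for an index:
  // insert when the id is new, otherwise merge new fields over the stored ones.
  def applyAll(
      index: Map[String, Map[String, Any]],
      upserts: Seq[Upsert]): Map[String, Map[String, Any]] =
    upserts.foldLeft(index) { (acc, u) =>
      val existing = acc.getOrElse(u.id, Map.empty[String, Any])
      acc.updated(u.id, existing ++ u.doc)
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Map[String, Any]("id" -> 1, "name" -> "a"),
      Map[String, Any]("id" -> 1, "age" -> 30), // same id: merged, not duplicated
      Map[String, Any]("id" -> 2, "name" -> "b"))
    val index = applyAll(Map.empty, rows.map(row => toUpsert("id")(row)))
    index.foreach(println)
  }
}
```

In the Elasticsearch 2.x Java API, the comparable behavior comes from an `UpdateRequest` with `docAsUpsert(true)`, typically batched into a bulk request. How Spark2Elasticsearch builds and batches those requests internally is not shown here.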
