First, I found that many of the teams were using ES-Hadoop, the official connector released by Elastic. Reading its source code on GitHub, I soon learned that it communicates with Elasticsearch over the REST API. The REST API is a sensible choice, as it offers broad compatibility across 2.X major-minor versions and composes well with various Hadoop-related tools.
Unfortunately, even with various tuning and optimizations on both the client and the server, the REST API was not efficient enough for our write volumes.
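For context, a REST client writes documents by posting newline-delimited JSON to Elasticsearch's `_bulk` endpoint: each document costs an action line plus a source line, all serialized to text. The sketch below only illustrates that framing; it is not ES-Hadoop's actual code, and the index and field names are made up.

```scala
// Sketch of the newline-delimited JSON body a REST client sends to the
// /_bulk endpoint. Every document is framed as an action metadata line
// followed by its JSON source; the whole payload travels as text over HTTP.
object BulkBodySketch {
  // Frame (id, source-JSON) pairs as a bulk-index body for one index/type.
  def bulkBody(index: String, esType: String, docs: Seq[(String, String)]): String =
    docs.map { case (id, source) =>
      s"""{"index":{"_index":"$index","_type":"$esType","_id":"$id"}}""" + "\n" + source
    }.mkString("", "\n", "\n")

  def main(args: Array[String]): Unit = {
    // Two lines per document: action metadata, then the document itself.
    print(bulkBody("events", "event", Seq(("1", """{"value":42}"""))))
  }
}
```

This per-document text framing (plus HTTP overhead) is part of why the REST path struggled at large volumes.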
As a result, I developed and published Spark2Elasticsearch, a Scala library for writing large data volumes from Spark to Elasticsearch 2.0 and Elasticsearch 2.1 clusters. The GitHub repository is available at https://github.com/jparkie/Spark2Elasticsearch, and artifacts are cross-published to the Sonatype Central Repository as spark2elasticsearch_2.10 and spark2elasticsearch_2.11.
Utilizing the TCP transport layer through Elasticsearch's Java API, the library serializes DataFrames into bulk upsert requests. This method provided an order-of-magnitude increase in write performance. Accordingly, I hope this library may help others who are facing similar performance issues.