How to Use Bulk Upserts in Elasticsearch

Anastasios Antoniadis

Learn how to efficiently use bulk upserts in Elasticsearch to update and insert documents simultaneously, optimizing performance and ensuring data accuracy in your Elasticsearch cluster.

Elasticsearch, renowned for its powerful full-text search capabilities and scalability, also offers robust data ingestion, updating, and management features. One of the essential operations in data management is the ability to efficiently update or insert data — a process commonly known as “upsert.” When dealing with large volumes of data, performing these operations individually can be time-consuming and resource-intensive.

Fortunately, Elasticsearch’s Bulk API supports bulk upsert operations, enabling the processing of multiple upsert requests in a single API call. This article explores the concept of bulk upserts in Elasticsearch, providing insights into their use cases and advantages and a guide on how to perform them effectively.

Understanding Upserts in Elasticsearch

An “upsert” updates a document if it exists and inserts it as a new document if it does not. This keeps the data in Elasticsearch accurate and current without the client having to check whether each document exists before deciding to insert or update it.
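The update-or-insert behavior can be pictured with a small in-memory analogy. The sketch below is not Elasticsearch code: the class name, the `upsert` helper, and the map standing in for an index are all hypothetical, and it merges top-level fields only.

```java
import java.util.HashMap;
import java.util.Map;

final class UpsertSemanticsSketch {

    // Hypothetical stand-in for an index: document ID -> field map.
    // If the ID is unknown the partial document is inserted; otherwise its
    // top-level fields are merged into the existing document.
    static void upsert(Map<String, Map<String, Object>> index,
                       String id,
                       Map<String, Object> partialDoc) {
        index.merge(id, new HashMap<>(partialDoc), (existing, incoming) -> {
            existing.putAll(incoming);
            return existing;
        });
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> index = new HashMap<>();
        upsert(index, "1", Map.of("user", "john_doe"));   // ID 1 is new: inserted
        upsert(index, "1", Map.of("message", "hello"));   // ID 1 exists: fields merged
        System.out.println(index.get("1").keySet());
    }
}
```

After both calls the map holds a single document with both fields, which is the outcome an upsert is meant to guarantee regardless of which call arrives first.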

The Bulk API

The Bulk API in Elasticsearch performs multiple index, create, update, or delete operations in a single API request. This reduces the overhead of making many separate requests and significantly improves performance when processing large datasets. A single request can carry thousands of operations, which makes the Bulk API well suited to batch-processing tasks.

Bulk Upserts: Use Cases and Advantages

Bulk upserts are particularly useful when data is continuously ingested or updated from various sources, such as logs, user activities, or IoT device data. Using bulk upserts, you can ensure that your Elasticsearch indices remain up-to-date and reflective of the latest data without the need for cumbersome and inefficient individual update checks.

Advantages include:

  • Efficiency: Reduces network overhead and increases throughput by batching multiple operations into a single request.
  • Simplicity: Simplifies client-side logic by handling the insert or update decision on the server side.
  • Performance: Significantly faster data ingestion and updating, especially for large datasets.

Performing Bulk Upserts in Elasticsearch

To perform a bulk upsert, you’ll use a combination of the Bulk API and the update action, specifying the doc_as_upsert parameter. Here’s a step-by-step guide:

Step 1: Prepare Your Bulk Request

A bulk request payload is newline-delimited JSON: each operation consists of an action metadata line followed by a line holding the document to be indexed or updated, and the last line must also end with a newline. For upserts, the action metadata uses the update action and includes the document ID.

Each pair specifies an update action for a document by its ID. With doc_as_upsert set to true, the contents of doc are merged into the document if it exists and indexed as a new document if it does not.

{ "update": { "_id": "1", "_index": "your_index" } }
{ "doc": { "field": "value" }, "doc_as_upsert": true }
{ "update": { "_id": "2", "_index": "your_index" } }
{ "doc": { "field": "another value" }, "doc_as_upsert": true }
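A payload like the one above can also be generated rather than written by hand. The following is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and it assumes that IDs and the index name need no JSON escaping and that each document is already valid single-line JSON.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BulkPayloadSketch {

    // Builds a newline-delimited bulk body: one action line and one
    // document line per ID, each terminated by '\n' (including the last).
    static String buildUpsertPayload(String index, Map<String, String> docsById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : docsById.entrySet()) {
            body.append("{\"update\":{\"_id\":\"").append(entry.getKey())
                .append("\",\"_index\":\"").append(index).append("\"}}\n");
            body.append("{\"doc\":").append(entry.getValue())
                .append(",\"doc_as_upsert\":true}\n");
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the documents in insertion order.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "{\"field\":\"value\"}");
        docs.put("2", "{\"field\":\"another value\"}");
        System.out.print(buildUpsertPayload("your_index", docs));
    }
}
```

For anything beyond a quick script, a JSON library would be a safer way to produce each line, since it handles escaping of IDs and field values.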

Step 2: Execute the Bulk Request

Using curl, you can execute the bulk upsert operation as follows:

curl -X POST "http://localhost:9200/_bulk" -H "Content-Type: application/json" --data-binary "@your_bulk_data.json"

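Replace your_bulk_data.json with the path to a file containing your prepared bulk request payload, keeping the leading @ so that curl reads the request body from that file. The --data-binary option sends the file unmodified, which preserves the newlines the Bulk API requires.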
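The same request can be sent from Java without any Elasticsearch dependency. The sketch below uses the JDK's built-in java.net.http client (Java 11 or later); the class and method names are hypothetical, and the host, port, and file name are simply the ones from the curl example, assumed to point at a cluster without authentication.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

final class BulkHttpSketch {

    // Builds a POST to <baseUrl>/_bulk; the body must already be
    // newline-delimited JSON ending with a newline.
    static HttpRequest bulkRequest(String baseUrl, String ndjsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_bulk"))
                .header("Content-Type", "application/x-ndjson")
                .POST(HttpRequest.BodyPublishers.ofString(ndjsonBody))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed file name, matching the curl example.
        String body = Files.readString(Path.of("your_bulk_data.json"));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(bulkRequest("http://localhost:9200", body),
                      HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

The Bulk API can answer with HTTP 200 even when individual operations failed, so the printed response body still needs to be inspected.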

Best Practices for Bulk Upserts

  • Batch Size: Find an optimal batch size for your bulk requests. If it is too large, you may strain your Elasticsearch cluster; if it is too small, you might not fully benefit from the efficiency of bulk processing.
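  • Error Handling: Always check the response from the Bulk API for errors and handle them appropriately. Partial failures can occur, where some operations succeed while others fail; the top-level errors field of the response signals this, and each entry in items reports the status of its own operation.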
  • Indexing Strategy: When performing upserts, consider your document IDs and indexing strategy to avoid unintended overwrites or duplications.
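To experiment with batch sizes, the documents first have to be split into batches. The following is a minimal sketch of one way to do that with the Java standard library; the class and method names are hypothetical, and the batch size of 2 is chosen only to keep the output short.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchingSketch {

    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("1", "2", "3", "4", "5");
        for (List<String> batch : partition(ids, 2)) {
            // Each batch would become the body of one bulk request.
            System.out.println(batch);
        }
    }
}
```

Counting documents is only a rough proxy for request size; when documents vary widely in size, batching by accumulated payload bytes may give more predictable results.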

Elasticsearch Bulk Upsert Java Example

The following example performs the same bulk upsert from Java using the Elasticsearch High-Level REST Client. Start by adding the client dependency to your Maven pom.xml:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.17.18</version>
</dependency>

Adjust the version according to your Elasticsearch cluster version.

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.RestClient;
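// XContentType lives in org.elasticsearch.xcontent from client 7.16 onward;
// on older clients, import org.elasticsearch.common.xcontent.XContentType instead.
import org.elasticsearch.xcontent.XContentType;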

import java.io.IOException;

public class BulkUpsertExample {

    public static void main(String[] args) {
        // Initialize the client
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Create a BulkRequest
            BulkRequest request = new BulkRequest();
            
            // Prepare JSON documents to upsert
            String doc1 = "{\"user\":\"john_doe\",\"message\":\"trying out Elasticsearch\"}";
            String doc2 = "{\"user\":\"jane_doe\",\"message\":\"second message\"}";

            // Add upsert requests to the bulk request
            request.add(new UpdateRequest("your_index", "1")
                    .doc(doc1, XContentType.JSON)
                    .docAsUpsert(true));
            request.add(new UpdateRequest("your_index", "2")
                    .doc(doc2, XContentType.JSON)
                    .docAsUpsert(true));

            // Execute the bulk request
            BulkResponse bulkResponse = client.bulk(request, RequestOptions.DEFAULT);

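            // Process the response; individual items can fail even if the request itself succeeds
            if (!bulkResponse.hasFailures()) {
                System.out.println("Bulk upsert successful.");
            } else {
                System.out.println("Bulk upsert encountered errors: "
                        + bulkResponse.buildFailureMessage());
            }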

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

How It Works:

  • RestHighLevelClient Initialization: This example starts by creating an instance of RestHighLevelClient, the main entry point for the Elasticsearch High-Level REST Client.
  • Creating BulkRequest: A BulkRequest object is created to hold multiple update requests.
  • Preparing Documents: JSON strings representing the documents to be upserted are defined.
  • Adding Update Requests: For each document, an UpdateRequest is created and added to the BulkRequest. The docAsUpsert(true) method specifies that these should be treated as upserts — meaning they will be inserted if they don’t exist or updated if they do.
  • Executing the Bulk Operation: The bulk method on the client is used to execute the bulk operation, and the response is checked for failures to determine if the operation was successful.

Important Considerations:

  • Error Handling: In production code, ensure you properly handle potential exceptions and inspect the bulk response for specific errors.
  • Performance: Bulk operations are efficient, but the optimal size of a bulk request depends on your documents and Elasticsearch cluster setup. Monitor performance and adjust the batch size accordingly.
  • Elasticsearch Version Compatibility: Ensure the client library version is compatible with your Elasticsearch cluster version.

This example provides a foundation for performing bulk upserts in Elasticsearch using Java. Depending on your specific requirements, you may need to adjust index names, document structures, and error handling to fit your application’s needs.

Conclusion

Bulk upserts are an effective way to manage data ingestion and updates in Elasticsearch. They combine the benefits of bulk operations with the flexibility of upsert logic. By using the Bulk API and adhering to best practices, developers can significantly improve the performance and reliability of Elasticsearch-based applications. This ensures that data remains current and accurate with minimal overhead. Whether dealing with streaming data sources, batch processing, or periodic updates, mastering bulk upserts is a valuable skill in optimizing Elasticsearch operations.
