How to Use the Elasticsearch Scroll API in Java: A Comprehensive Guide

Anastasios Antoniadis

When an index holds a large dataset, retrieving all of it in a single query is inefficient and often infeasible: by default, a regular search can page through at most 10,000 hits (the index.max_result_window setting). The Scroll API addresses this by returning a large result set in smaller, manageable batches. This is particularly useful for data analysis, migration, or reindexing tasks where you need to process or transfer all the data in an index. This article explains how to use the Scroll API with the Elasticsearch Java High Level REST Client, complete with a full example.

Prerequisites

Before diving into the code, make sure you have the following in place:

  • A running Elasticsearch cluster that your application can reach
  • The Elasticsearch Java High Level REST Client added to your project dependencies
  • An index populated with the data you intend to scroll through

Setting Up the Java High Level REST Client

First, you need to set up the Elasticsearch Java High Level REST Client. Add the dependency to your pom.xml if you’re using Maven:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.10.1</version>
</dependency>

Match the client version to the version of your Elasticsearch cluster.

Next, initialize the client:

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ElasticsearchScrollExample {
    private static final String HOST = "localhost";
    private static final int PORT = 9200;
    private static final String SCHEME = "http";

    public static RestHighLevelClient createClient() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost(HOST, PORT, SCHEME)));
    }
}

Implementing Scroll with the High Level REST Client

The following example demonstrates how to use the Scroll API to iterate through all documents in an index named “my_index”.

1. Initiate the Scroll

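First, initiate a scroll context by sending a search request that specifies the scroll keep-alive interval. The listing below is the complete program; steps 2 and 3 then look at its loop and cleanup parts separately: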

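import org.apache.http.HttpHost;
import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.ClearScrollResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
// TimeValue lives in org.elasticsearch.common.unit in 7.10.x; later releases moved it to org.elasticsearch.core
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.Scroll;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.io.IOException;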

public class ElasticsearchScrollExample {

    public static void main(String[] args) throws IOException {
        try (RestHighLevelClient client = createClient()) {
            final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(1L));
            SearchRequest searchRequest = new SearchRequest("my_index");
            searchRequest.scroll(scroll);
            SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
            searchSourceBuilder.query(QueryBuilders.matchAllQuery());
            searchSourceBuilder.size(100); // Adjust the size per batch as needed
            searchRequest.source(searchSourceBuilder);

            SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
            String scrollId = searchResponse.getScrollId();
            SearchHit[] searchHits = searchResponse.getHits().getHits();

            // Process the first batch of search hits here

            while (searchHits != null && searchHits.length > 0) {
                SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
                scrollRequest.scroll(scroll);
                searchResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
                scrollId = searchResponse.getScrollId();
                searchHits = searchResponse.getHits().getHits();

                // Process the current batch of search hits here
            }

            if (scrollId != null) {
                ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
                clearScrollRequest.addScrollId(scrollId);
                ClearScrollResponse clearScrollResponse = client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
                boolean succeeded = clearScrollResponse.isSucceeded();
                // Log or handle the success of the scroll clearing operation
            }
        }
    }

    // createClient() and the connection constants from the previous section go here
}

2. Scroll Through the Batches

Using the scroll ID obtained from the initial search, you can fetch subsequent batches until no more documents are returned:

while (searchHits != null && searchHits.length > 0) {
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(scroll);
    searchResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = searchResponse.getScrollId();
    searchHits = searchResponse.getHits().getHits();

    // Process the current batch of search hits here
}
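
The termination logic of this loop can be checked without a cluster. The following is a minimal, stdlib-only sketch: Page, pageAt, and drain are hypothetical names that stand in for SearchResponse and client.scroll(...), and the fake scroll ID simply encodes an offset, which is not how real scroll IDs work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Stdlib-only sketch of the scroll loop; every type here is a hypothetical stand-in. */
final class ScrollLoopSketch {

    /** One batch of hits plus the scroll ID to send with the next request. */
    static final class Page {
        final String scrollId;
        final List<String> hits;

        Page(String scrollId, List<String> hits) {
            this.scrollId = scrollId;
            this.hits = hits;
        }
    }

    /** Fake backend: returns up to {@code size} documents starting at {@code offset}. */
    static Page pageAt(List<String> docs, int offset, int size) {
        int end = Math.min(offset + size, docs.size());
        return new Page(String.valueOf(end), new ArrayList<>(docs.subList(offset, end)));
    }

    /**
     * Handles the first page, then keeps fetching with the most recent scroll ID
     * until an empty page arrives. Returns the last scroll ID so the caller can clear it.
     */
    static String drain(Page first, Function<String, Page> fetchNext, Consumer<List<String>> handler) {
        Page page = first;
        while (page.hits != null && !page.hits.isEmpty()) {
            handler.accept(page.hits);
            page = fetchNext.apply(page.scrollId);
        }
        return page.scrollId;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c", "d", "e");
        List<String> seen = new ArrayList<>();
        String lastId = drain(pageAt(docs, 0, 2), id -> pageAt(docs, Integer.parseInt(id), 2), seen::addAll);
        System.out.println(seen + " last scroll ID: " + lastId);
    }
}
```

Passing the fetch step in as a function keeps the loop identical whether it is backed by a real client or by an in-memory list in a unit test.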

3. Clear the Scroll Context

After processing all batches, it’s important to clear the scroll context to free up resources on the server:

import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.ClearScrollResponse;

if (scrollId != null) {
    ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
    clearScrollRequest.addScrollId(scrollId);
    ClearScrollResponse clearScrollResponse = client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
    boolean succeeded = clearScrollResponse.isSucceeded();
    // Log or handle the success of the scroll clearing operation
}
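
The snippet above runs only if the loop finishes normally; if processing throws, the context stays open until its keep-alive expires. One possible way to guarantee cleanup is to tie it to try-with-resources. The sketch below is stdlib-only: ScrollSession is a hypothetical helper, and its clearer callback is assumed to wrap the client.clearScroll(...) call shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper that clears the latest scroll ID when the try block exits. */
final class ScrollSession implements AutoCloseable {
    private final Consumer<String> clearer; // assumed to wrap client.clearScroll(...)
    private String scrollId;

    ScrollSession(String initialScrollId, Consumer<String> clearer) {
        this.scrollId = initialScrollId;
        this.clearer = clearer;
    }

    /** Records the scroll ID returned by the most recent response. */
    void update(String newScrollId) {
        this.scrollId = newScrollId;
    }

    String scrollId() {
        return scrollId;
    }

    /** Clears the context at most once, even if close() is called repeatedly. */
    @Override
    public void close() {
        if (scrollId != null) {
            clearer.accept(scrollId);
            scrollId = null;
        }
    }

    public static void main(String[] args) {
        List<String> cleared = new ArrayList<>();
        try (ScrollSession session = new ScrollSession("id-1", cleared::add)) {
            session.update("id-2");
            throw new IllegalStateException("simulated processing failure");
        } catch (IllegalStateException e) {
            // The session was closed before this handler ran
        }
        System.out.println("cleared: " + cleared);
    }
}
```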

Best Practices and Considerations

  • Scroll Duration: The scroll duration (TimeValue.timeValueMinutes(1L)) defines how long the scroll context is kept alive between batches. Every scroll request renews it, so it only needs to cover the time it takes to process one batch, not the whole dataset.
  • Batch Size: The size of each batch (searchSourceBuilder.size(100)) can significantly affect performance. Larger batches reduce the number of requests but require more memory per request.
  • Resource Management: Always clear the scroll context after use. An open context holds resources on the cluster until its keep-alive expires.
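
As a rough illustration of how batch size and scroll duration relate, the sketch below estimates a keep-alive from the batch size and a measured per-document processing time. ScrollTuning, keepAliveFor, and the safety factor are hypothetical and not part of any Elasticsearch API; treat the result as a starting point to validate against your own measurements.

```java
import java.time.Duration;

/** Hypothetical helper for estimating a scroll keep-alive; not an Elasticsearch API. */
final class ScrollTuning {

    /**
     * Estimates the keep-alive as batchSize x perDocument x safetyFactor.
     * The keep-alive only has to cover one batch, because each scroll request renews it.
     */
    static Duration keepAliveFor(int batchSize, Duration perDocument, double safetyFactor) {
        if (batchSize <= 0 || perDocument.isNegative() || safetyFactor < 1.0) {
            throw new IllegalArgumentException("batchSize > 0, perDocument >= 0, safetyFactor >= 1 required");
        }
        long nanos = Math.round(perDocument.toNanos() * (double) batchSize * safetyFactor);
        return Duration.ofNanos(nanos);
    }

    public static void main(String[] args) {
        // 100 documents per batch, 50 ms of processing each, doubled as headroom
        Duration keepAlive = keepAliveFor(100, Duration.ofMillis(50), 2.0);
        System.out.println("Suggested keep-alive: " + keepAlive.getSeconds() + "s");
    }
}
```

The resulting value could then be passed to the Scroll constructor, for example via TimeValue.timeValueSeconds(keepAlive.getSeconds()).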

Conclusion

The Scroll API in Elasticsearch, when used with the Java High Level REST Client, provides a robust solution for processing large datasets efficiently. By following the steps outlined in this guide, you can implement scrolling in your Java applications, allowing for comprehensive data processing, analysis, or migration tasks. Remember to adjust the scroll duration and batch size based on your specific use case and infrastructure capabilities to optimize performance and resource usage.
