How to Use the Elasticsearch Scroll API in Java: A Comprehensive Guide

When dealing with extensive datasets in Elasticsearch, retrieving large volumes of data in a single query can be inefficient or even unfeasible. The Scroll API addresses this challenge by allowing you to retrieve a large number of documents in smaller, manageable batches. This is particularly useful for data analysis, migration, or reindexing tasks where you need to process or transfer all the data in an index. This article provides a detailed guide on implementing the Elasticsearch Scroll API using the Elasticsearch Java High Level REST Client, complete with a full example.

Prerequisites

Before diving into the code, ensure you have the following prerequisites in place:

Elasticsearch cluster running and accessible
Elasticsearch Java High Level REST Client added to your project dependencies
An index filled with the data you intend to scroll through

Setting Up the Java High Level REST Client

First, you need to set up the Elasticsearch Java High Level REST Client. Add the dependency to your pom.xml if you’re using Maven:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.10.1</version>
</dependency>

Adjust the version according to the version of your Elasticsearch cluster.

Next, initialize the client:

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ElasticsearchScrollExample {
    private static final String HOST = "localhost";
    private static final int PORT = 9200;
    private static final String SCHEME = "http";

    public static RestHighLevelClient createClient() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost(HOST, PORT, SCHEME)));
    }
}

Implementing Scroll with the High Level REST Client

The following example demonstrates how to use the Scroll API to iterate through all documents in an index named “my_index”.

1. Initiate the Scroll

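The listing below is the complete program. It opens a scroll context with an initial search request that specifies the scroll keep-alive interval, pages through the remaining results, and finally clears the context. Steps 2 and 3 then look at the loop and the cleanup in isolation: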

import java.io.IOException;

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.ClearScrollResponse;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue; // package for 7.10; newer releases use org.elasticsearch.core.TimeValue
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.Scroll;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class ElasticsearchScrollExample {

    public static void main(String[] args) throws IOException {
        try (RestHighLevelClient client = createClient()) {
            final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(1L));
            SearchRequest searchRequest = new SearchRequest("my_index");
            searchRequest.scroll(scroll);
            SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
            searchSourceBuilder.query(QueryBuilders.matchAllQuery());
            searchSourceBuilder.size(100); // Adjust the size per batch as needed
            searchRequest.source(searchSourceBuilder);

            SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
            String scrollId = searchResponse.getScrollId();
            SearchHit[] searchHits = searchResponse.getHits().getHits();

            // Process the first batch of search hits here

            while (searchHits != null && searchHits.length > 0) {
                SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
                scrollRequest.scroll(scroll);
                searchResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
                scrollId = searchResponse.getScrollId();
                searchHits = searchResponse.getHits().getHits();

                // Process the current batch of search hits here
            }

            if (scrollId != null) {
                ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
                clearScrollRequest.addScrollId(scrollId);
                ClearScrollResponse clearScrollResponse = client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
                boolean succeeded = clearScrollResponse.isSucceeded();
                // Log or handle the success of the scroll clearing operation
            }
        }
    }

    // createClient() and the HOST, PORT, and SCHEME constants from the previous section go here
}

2. Scroll Through the Batches

Using the scroll ID obtained from the initial search, you can fetch subsequent batches until no more documents are returned:

while (searchHits != null && searchHits.length > 0) {
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(scroll);
    searchResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = searchResponse.getScrollId();
    searchHits = searchResponse.getHits().getHits();

    // Process the current batch of search hits here
}
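
The loop above handles the first batch before the loop and every later batch inside it, so the processing code appears in two places. A minimal, library-free sketch of an alternative shape is shown below: each batch is processed at the top of the loop and the next one is fetched at the bottom, which leaves a single processing site. The BatchSource interface, the drain method, and the inMemory helper are hypothetical names introduced for illustration only. In a real program, first() would wrap client.search(...) and next() would wrap client.scroll(...).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the Elasticsearch calls; not part of any client library.
interface BatchSource<T> {
    List<T> first(); // plays the role of client.search(...)
    List<T> next();  // plays the role of client.scroll(...)
}

class ScrollLoopSketch {
    // Processes every batch, including the first, and returns the number of items seen.
    static <T> int drain(BatchSource<T> source, Consumer<T> handler) {
        int seen = 0;
        List<T> batch = source.first();
        while (batch != null && !batch.isEmpty()) {
            for (T item : batch) {
                handler.accept(item);
                seen++;
            }
            batch = source.next(); // an empty batch ends the loop
        }
        return seen;
    }

    // In-memory source that serves `total` integers in batches of at most `size`.
    static BatchSource<Integer> inMemory(int total, int size) {
        return new BatchSource<Integer>() {
            private int cursor = 0;

            private List<Integer> page() {
                List<Integer> out = new ArrayList<>();
                while (cursor < total && out.size() < size) {
                    out.add(cursor++);
                }
                return out;
            }

            @Override public List<Integer> first() { return page(); }
            @Override public List<Integer> next() { return page(); }
        };
    }

    public static void main(String[] args) {
        List<Integer> collected = new ArrayList<>();
        int seen = drain(inMemory(250, 100), collected::add);
        System.out.println(seen + " items in total"); // prints "250 items in total"
    }
}
```

With 250 items and a batch size of 100, the source returns batches of 100, 100, 50, and finally 0 items, mirroring how a scroll ends with an empty response.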

3. Clear the Scroll Context

After processing all batches, it’s important to clear the scroll context to free up resources on the server:

import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.ClearScrollResponse;

if (scrollId != null) {
    ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
    clearScrollRequest.addScrollId(scrollId);
    ClearScrollResponse clearScrollResponse = client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
    boolean succeeded = clearScrollResponse.isSucceeded();
    // Log or handle the success of the scroll clearing operation
}
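
In the snippet above, the clear step only runs if batch processing finishes without an exception. One possible way to make the cleanup unconditional is a try/finally block, sketched below without any Elasticsearch dependency. FakeScrollContext and processThenClear are hypothetical names used purely to demonstrate the control flow. In a real program the finally block would call client.clearScroll(...) as shown above.

```java
class ScrollCleanupSketch {
    // Hypothetical stand-in for a server-side scroll context.
    static class FakeScrollContext {
        boolean cleared = false;

        void clear() {
            cleared = true;
        }
    }

    // Runs the batch-processing work and clears the context even if the work throws.
    static void processThenClear(FakeScrollContext context, Runnable work) {
        try {
            work.run();
        } finally {
            context.clear(); // stands in for client.clearScroll(...)
        }
    }

    public static void main(String[] args) {
        FakeScrollContext context = new FakeScrollContext();
        try {
            processThenClear(context, () -> {
                throw new IllegalStateException("batch failed");
            });
        } catch (IllegalStateException e) {
            // prints "work failed, cleared = true"
            System.out.println("work failed, cleared = " + context.cleared);
        }
    }
}
```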

Best Practices and Considerations

Scroll Duration: The scroll duration (TimeValue.timeValueMinutes(1L)) defines how long the scroll context is kept alive between batches. Every scroll request renews it, so it only needs to cover the time it takes to process one batch, not the whole dataset.
Batch Size: The size of each batch (searchSourceBuilder.size(100)) can significantly affect performance. Larger batches reduce the number of requests but require more memory per response, both on the cluster and in your application.
Resource Management: Always clear the scroll context once you are done with it. An open context holds resources on the cluster until its keep-alive expires.
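
To make the trade-offs above concrete, here is a small worked sketch in plain Java. The requestCount method counts the round trips made by the loop in this guide: one per non-empty batch plus the final empty response that ends the loop (the clear-scroll call is not counted). The keepAlive method applies an assumed rule of thumb, not an official recommendation: take the slowest batch-processing time you have observed and multiply it by a safety factor. Both method names and the example numbers are hypothetical.

```java
import java.time.Duration;

class ScrollTuningSketch {
    // Round trips: one per non-empty batch, plus the final empty response that ends the loop.
    static long requestCount(long totalDocs, int batchSize) {
        long nonEmptyBatches = (totalDocs + batchSize - 1) / batchSize; // ceiling division
        return nonEmptyBatches + 1;
    }

    // Assumed heuristic: slowest observed batch time multiplied by a safety factor.
    static Duration keepAlive(Duration slowestBatch, int safetyFactor) {
        return slowestBatch.multipliedBy(safetyFactor);
    }

    public static void main(String[] args) {
        System.out.println(requestCount(1_000_000L, 100));        // prints 10001
        System.out.println(requestCount(1_000_000L, 1_000));      // prints 1001
        System.out.println(keepAlive(Duration.ofSeconds(20), 3)); // prints PT1M
    }
}
```

Going from 100 to 1,000 documents per batch cuts the number of requests for one million documents by roughly a factor of ten, at the cost of responses that are ten times larger.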

Conclusion

The Scroll API in Elasticsearch, when used with the Java High Level REST Client, provides a robust solution for processing large datasets efficiently. By following the steps outlined in this guide, you can implement scrolling in your Java applications, allowing for comprehensive data processing, analysis, or migration tasks. Remember to adjust the scroll duration and batch size based on your specific use case and infrastructure capabilities to optimize performance and resource usage.

Author

Anastasios Antoniadis

Anastasios Antoniadis is the founder and editor-in-chief of BORDERPOLAR... He is a software engineer, blogger, and avid gamer covering tech, gaming, and coding guides for over 4 years. He is a 2014 graduate of the Department of Informatics and Telecommunications of the University of Athens, an M.Sc. holder in Computer Science, and a Ph.D. student in Program Analysis.