Elasticsearch flakiness in tests

Posted on

Elasticsearch is an awesome tool for building fast and powerful search experiences. However, integration testing with Elasticsearch can be painful. Elasticsearch exposes an HTTP REST API for setting up, modifying, and searching indices. This API is eventually consistent: when the HTTP call that creates an index returns, the index is not necessarily ready. This eventual consistency becomes painful in tests, where creating an index, indexing documents, searching, and removing the index all happen in rapid succession.

When this happens, tests start failing at random times for seemingly inexplicable reasons. The errors look like this:

Elasticsearch::Transport::Transport::Errors::ServiceUnavailable at /users

[503] {"error":"SearchPhaseExecutionException[Failed to execute phase [query], all shards failed]","status":503}

A poor solution to this problem is sleeping after asking Elasticsearch to create your indices:

create_list(:user, 10)
sleep 1 # Might work, might not. Depends on Java GC and other factors

This post by DevMynd suggests wrapping search calls in retry logic, which works, but has the unfortunate side effect of modifying the application code itself.
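The retry approach can be sketched as a small helper that re-runs a block when the search layer raises a transient error. This is only an illustration, not code from the DevMynd post; the method name, error list, and backoff are all made up here, and in a real application you would rescue Elasticsearch::Transport::Transport::Errors::ServiceUnavailable rather than StandardError.

```ruby
# Hedged sketch of the retry approach (illustrative names and defaults).
# Re-runs the given block up to `retries` times when one of `errors` is raised.
def with_search_retry(retries: 3, errors: [StandardError], backoff: 0.2)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *errors
    raise if attempts >= retries # give up after the last attempt
    sleep backoff
    retry
  end
end
```

The drawback the post describes is visible here: every search call site in the application has to be wrapped in this helper.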

The solution I’ve found to work is calling the refresh_index! method after creating indices in tests. This triggers the refresh action on the index, as described here.

From the docs:

The refresh API allows to explicitly refresh one or more index, making all operations performed since the last refresh available for search. The (near) real-time capabilities depend on the index engine used. For example, the internal one requires refresh to be called, but by default a refresh is scheduled periodically.
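In the Ruby client, the same refresh can be triggered through indices.refresh. Below is a hedged sketch of the index-then-refresh pattern as a helper; the client is assumed to expose the elasticsearch-ruby interface, and the index name and document used later are purely illustrative.

```ruby
# Hedged sketch: index a document, then force a refresh so the very next
# search can see it. `client` is assumed to be an Elasticsearch::Client
# (elasticsearch-ruby gem) or anything with the same interface.
def index_and_refresh(client, index:, id:, body:)
  client.index index: index, id: id, body: body
  client.indices.refresh index: index # equivalent of POST /<index>/_refresh
end
```

This is exactly what refresh_index! does for you in elasticsearch-model: it makes the write visible to search immediately instead of waiting for the periodic scheduled refresh.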

Putting it all together, a test for a users endpoint might use the following before and after hooks:

RSpec.describe '/users' do
  before(:each) do
    User.__elasticsearch__.create_index! force: true, index: User.index_name
    create_list(:user, 10)
    User.__elasticsearch__.refresh_index! # make the new documents searchable
  end

  after(:each) do
    User.__elasticsearch__.client.indices.delete index: User.index_name
  end

  # Tests...
end

In rare cases it seems not even this solves the problem, though so far it has for us. As suggested in this issue on GitHub, watching the cluster health and waiting until it becomes green is another option.
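That option can be sketched as a small helper around the cluster health API, which supports blocking server-side until a given status is reached. The method name and timeout below are assumptions, not code from the linked issue; `client` is assumed to expose the elasticsearch-ruby interface.

```ruby
# Hedged sketch: wait for the cluster to report green health before running
# searches. The cluster health API blocks server-side until the requested
# status is reached or the timeout expires.
def wait_for_green(client, timeout: '10s')
  client.cluster.health wait_for_status: 'green', timeout: timeout
end
```

You would call this in a before hook after setting up the index, in place of (or in addition to) refresh_index!.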

Flakiness like this is a great annoyance when you encounter it, and I hope this post helps you avoid it.