
feat(search): Use minimum_should_match to filter hits #1333

Merged (1 commit into master from minimum_should_match) Jul 17, 2019

Conversation

orangejulius (Member)

This change utilizes the [minimum_should_match](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-minimum-should-match.html) Elasticsearch query parameter to reduce the number of hits for search queries. For now, autocomplete is unchanged; we can take a look at that later.

Previously, only one token in an input had to match, regardless of the input size. This allows for queries to potentially match a huge number of documents, especially as the number of tokens grows.

Now, most `ngrams` queries have the `minimum_should_match` parameter set to `1<-1 3<-25%`.
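
As a concrete illustration, the parameter is set directly on a query clause. This is a hypothetical sketch, not a query taken from this PR; the field name `name.ngrams` and the query text are placeholders:

```json
{
  "match": {
    "name.ngrams": {
      "query": "30 west 26th street",
      "minimum_should_match": "1<-1 3<-25%"
    }
  }
}
```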

Breaking this down, it means that the number of optional tokens follows this pattern:

| token count | optional tokens |
| --- | --- |
| 1   | 0 (obviously) |
| 2   | 1   |
| 3   | 1   |
| 4   | 1   |
| 5   | 1   |
| 6+  | at least 75% of tokens must match |
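
To make the conditional syntax concrete, here is a small Python sketch that reproduces the table above. It is an assumption about the semantics, not code from this PR or from Elasticsearch itself: each `N<V` condition applies when the token count exceeds `N`, the condition with the largest applicable `N` wins (conditions are assumed to be listed in ascending order), a leading `-` expresses the value as a count or percentage of *optional* clauses, and fractional percentages round down. Only the negative (optional-count) form used here is handled.

```python
def optional_tokens(token_count, spec="1<-1 3<-25%"):
    """Return how many tokens may go unmatched for a given token count,
    under a simplified reading of Elasticsearch's minimum_should_match
    conditional syntax (negative/optional form only)."""
    optional = 0  # if no condition applies, every token is required
    for cond in spec.split():
        threshold, value = cond.split("<")
        if token_count > int(threshold):
            if value.endswith("%"):
                # percentage of the total clauses, rounded down
                optional = abs(int(value.rstrip("%"))) * token_count // 100
            else:
                optional = abs(int(value))
    return optional
```

Under this reading, six or more tokens allow 25% of them (rounded down) to go unmatched, i.e. at least 75% must match, consistent with the last row of the table.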

This should help ensure quality results are returned where possible, even if long inputs contain a few extraneous bits of information, while reducing the chance of extremely expensive queries.

@orangejulius orangejulius merged commit 6c5db8d into master Jul 17, 2019
@orangejulius orangejulius deleted the minimum_should_match branch July 17, 2019 13:19
@orangejulius (Member, Author)

We saw some very nice reductions in request latency and number of Elasticsearch hits due to this change. It looks like it will really help cut down on slow queries.

The effect of rolling this out was pretty dramatic:
(screenshot: Screenshot_2019-07-16_11-12-42)

orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Aug 4, 2019