The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms. Documents that match only the less important terms are probably of very little interest. Really, we want documents that match as many of the more important terms as possible.
The match query accepts a cutoff_frequency parameter, which allows it to divide the terms in the query string into a low-frequency group and a high-frequency group. The low-frequency group (more-important terms) forms the bulk of the query, while the high-frequency group (less-important terms) is used only for scoring, not for matching. By treating these two groups differently, we can gain a real speed boost on previously slow queries.
One of the benefits of cutoff_frequency is that you get domain-specific stopwords for free. For instance, a website about movies may use the words movie, color, black, and white so often that they could be considered almost meaningless. With the stop token filter, these domain-specific terms would have to be added to the stopwords list manually. However, because cutoff_frequency looks at the actual frequency of terms in the index, these words would be classified as high frequency automatically.
Take this query as an example:
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01 (1)
    }
  }
}
(1) Any term that occurs in more than 1% of documents is considered to be high frequency. The cutoff_frequency can be specified as a fraction (0.01) or as an absolute number (5).
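The classification rule can be sketched in a few lines of Python. This is only an illustration, not the library's implementation: the function names and the document frequencies are made up, and the sketch assumes that a cutoff below 1 is read as a fraction of all documents while a cutoff of 1 or more is read as an absolute document count.

```python
def is_high_frequency(doc_freq, total_docs, cutoff):
    """Return True if a term counts as high frequency under the cutoff.

    Assumption: cutoff < 1 is a fraction of total_docs,
    cutoff >= 1 is an absolute number of documents.
    """
    threshold = cutoff * total_docs if cutoff < 1 else cutoff
    return doc_freq > threshold


def split_terms(terms, doc_freqs, total_docs, cutoff):
    """Divide query terms into (low_frequency, high_frequency) lists."""
    low, high = [], []
    for term in terms:
        freq = doc_freqs.get(term, 0)
        if is_high_frequency(freq, total_docs, cutoff):
            high.append(term)
        else:
            low.append(term)
    return low, high


# Hypothetical document frequencies for an index of 1,000 documents.
doc_freqs = {"quick": 8, "and": 900, "the": 950, "dead": 5}
low, high = split_terms(["quick", "and", "the", "dead"], doc_freqs, 1000, 0.01)
print("low:", low)
print("high:", high)
```

With these invented counts, only the terms found in more than 10 of the 1,000 documents land in the high-frequency group.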
This query uses the cutoff_frequency to first divide the query terms into a low-frequency group (quick, dead) and a high-frequency group (and, the). Then, the query is rewritten to produce the following bool query:
{
  "bool": {
    "must": { (1)
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ]
      }
    },
    "should": { (2)
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
(1) At least one low-frequency/high-importance term must match.
(2) High-frequency/low-importance terms are entirely optional.
The must clause means that at least one of the low-frequency terms, quick or dead, must be present for a document to be considered a match. All other documents are excluded. The should clause then looks for the high-frequency terms "and" and "the", but only in the documents collected by the must clause. The sole job of the should clause is to score a document like "Quick and the dead" higher than "The quick but dead". This approach greatly reduces the number of documents that need to be examined and scored.
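The shape of this rewrite can be sketched as a small Python helper that assembles the same bool structure from two term groups. The function name rewrite_match and its arguments are hypothetical; the real rewrite happens inside the search engine, and this only mirrors the resulting query body.

```python
import json


def rewrite_match(field, low_terms, high_terms):
    """Build a bool query shaped like the rewrite described in the text.

    Low-frequency terms go into a required group (at least one must match);
    high-frequency terms go into an optional group used only for scoring.
    """
    def any_of(terms):
        # A nested bool whose should clauses are plain term queries.
        return {"bool": {"should": [{"term": {field: t}} for t in terms]}}

    return {
        "bool": {
            "must": any_of(low_terms),
            "should": any_of(high_terms),
        }
    }


query = rewrite_match("text", ["quick", "dead"], ["and", "the"])
print(json.dumps(query, indent=2))
```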
Tip
|
Setting the operator parameter to |
The minimum_should_match parameter can be combined with cutoff_frequency, but it applies only to the low-frequency terms. This query:
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": "75%"
    }
  }
}
would be rewritten as follows:
{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ],
        "minimum_should_match": 1 (1)
      }
    },
    "should": { (2)
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
(1) Because there are only two terms, the original 75% is rounded down to 1; that is, one out of two low-frequency terms must match.
(2) The high-frequency terms are still optional and used only for scoring.
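The rounding can be checked with a little arithmetic. The helper below is a hypothetical sketch that handles only positive percentage strings such as "75%"; it assumes the result is rounded down, as the callout describes, and uses integer arithmetic to avoid floating-point surprises.

```python
def required_low_freq_matches(percentage, term_count):
    """Number of low-frequency terms that must match, rounded down.

    Only positive percentages like "75%" are handled in this sketch.
    """
    percent = int(percentage.rstrip("%"))
    return (percent * term_count) // 100


two_terms = required_low_freq_matches("75%", 2)   # 75% of 2 is 1.5
four_terms = required_low_freq_matches("75%", 4)  # 75% of 4 is exactly 3
print(two_terms, four_terms)
```

With two low-frequency terms the requirement drops to one; with four terms the same percentage would require three.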
An or query for high-frequency terms only, such as "To be, or not to be", is the worst case for performance. It is pointless to score all the documents that contain only one of these terms in order to return just the top 10 matches. We are really interested only in documents in which the terms all occur together, so in the case where there are no low-frequency terms, the query is rewritten to make all high-frequency terms required:
{
  "bool": {
    "must": [
      { "term": { "text": "to" }},
      { "term": { "text": "be" }},
      { "term": { "text": "or" }},
      { "term": { "text": "not" }},
      { "term": { "text": "to" }},
      { "term": { "text": "be" }}
    ]
  }
}
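A minimal sketch of this fallback, assuming the terms have already been classified and no low-frequency terms were found. The function name is hypothetical, and repeated terms are kept exactly as they appear in the query string, matching the listing above.

```python
def rewrite_all_high_frequency(field, terms):
    """With no low-frequency terms, require every high-frequency term."""
    return {"bool": {"must": [{"term": {field: t}} for t in terms]}}


terms = ["to", "be", "or", "not", "to", "be"]
all_required = rewrite_all_high_frequency("text", terms)
print(all_required)
```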
While the high/low frequency functionality in the match query is useful, sometimes you want more control over how the high- and low-frequency groups should be handled. The match query exposes a subset of the functionality available in the common terms query.
For instance, we could make all low-frequency terms required, and score only documents that have 75% of all high-frequency terms with a query like this:
{
  "common": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "low_freq_operator": "and",
      "minimum_should_match": {
        "high_freq": "75%"
      }
    }
  }
}
See the {ref}/query-dsl-common-terms-query.html[common terms query] reference page for more options.