Skip to content

Latest commit

 

History

History
158 lines (119 loc) · 5.79 KB

40_Field_centric.asciidoc

File metadata and controls

158 lines (119 loc) · 5.79 KB

Field-Centric Queries

All three of the preceding problems stem from most_fields being field-centric rather than term-centric: it looks for the most matching fields, when really what we’re interested is the most matching terms.

Note
The best_fields type is also field-centric and suffers from similar problems.

First we’ll look at why these problems exist, and then how we can combat them.

Problem 1: Matching the Same Word in Multiple Fields

Think about how the most_fields query is executed: Elasticsearch generates a separate match query for each field and then wraps these match queries in an outer bool query.

We can see this by passing our query through the validate-query API:

GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":   "Poland Street W1V",
      "type":    "most_fields",
      "fields":  [ "street", "city", "country", "postcode" ]
    }
  }
}

which yields this explanation:

(street:poland   street:street   street:w1v)
(city:poland     city:street     city:w1v)
(country:poland  country:street  country:w1v)
(postcode:poland postcode:street postcode:w1v)

You can see that a document matching just the word poland in two fields could score higher than a document matching poland and street in one field.

Problem 2: Trimming the Long Tail

In [match-precision], we talked about using the and operator or the minimum_should_match parameter to trim the long tail of almost irrelevant results. Perhaps we could try this:

{
    "query": {
        "multi_match": {
            "query":       "Poland Street W1V",
            "type":        "most_fields",
            "operator":    "and", (1)
            "fields":      [ "street", "city", "country", "postcode" ]
        }
    }
}
  1. All terms must be present.

However, with best_fields or most_fields, these parameters are passed down to the generated match queries. The explanation for this query shows the following:

(+street:poland   +street:street   +street:w1v)
(+city:poland     +city:street     +city:w1v)
(+country:poland  +country:street  +country:w1v)
(+postcode:poland +postcode:street +postcode:w1v)

In other words, using the and operator means that all words must exist in the same field, which is clearly wrong! It is unlikely that any documents would match this query.

Problem 3: Term Frequencies

In [relevance-intro], we explained that the default similarity algorithm used to calculate the relevance score for each term is TF/IDF:

Term frequency

The more often a term appears in a field in a single document, the more relevant the document.

Inverse document frequency

The more often a term appears in a field in all documents in the index, the less relevant is that term.

When searching against multiple fields, TF/IDF can introduce some surprising results.

Consider our example of searching for `Peter Smith'' using the `first_name and last_name fields. Peter is a common first name and Smith is a common last name—​both will have low IDFs. But what if we have another person in the index whose name is Smith Williams? Smith as a first name is very uncommon and so will have a high IDF!

A simple query like the following may well return Smith Williams above Peter Smith in spite of the fact that the second person is a better match than the first.

{
    "query": {
        "multi_match": {
            "query":       "Peter Smith",
            "type":        "most_fields",
            "fields":      [ "*_name" ]
        }
    }
}

The high IDF of smith in the first name field can overwhelm the two low IDFs of peter as a first name and smith as a last name.

Solution

These problems only exist because we are dealing with multiple fields. If we were to combine all of these fields into a single field, the problems would vanish. We could achieve this by adding a full_name field to our person document:

{
    "first_name":  "Peter",
    "last_name":   "Smith",
    "full_name":   "Peter Smith"
}

When querying just the full_name field:

  • Documents with more matching words would trump documents with the same word repeated.

  • The minimum_should_match and operator parameters would function as expected.

  • The inverse document frequencies for first and last names would be combined so it wouldn’t matter whether Smith were a first or last name anymore.

While this would work, we don’t like having to store redundant data. Instead, Elasticsearch offers us two solutions—​one at index time and one at search time—​which we discuss next.