Some aggregations, such as the terms bucket, operate on string fields. And string fields may be either analyzed or not_analyzed, which raises the question: how does analysis affect aggregations?
The answer is "a lot," for two reasons: analysis affects the tokens used in the aggregation, and doc values do not work with analyzed strings.
Let’s tackle the first problem: how the generation of analyzed tokens affects aggregations. First, let’s index some documents representing various states in the US:
POST /agg_analysis/data/_bulk
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New Jersey" }
{ "index": {}}
{ "state" : "New Mexico" }
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New York" }
We want to build a list of unique states in our dataset, complete with counts. Simple enough; let's use a terms bucket:
GET /agg_analysis/data/_search
{
    "size" : 0,
    "aggs" : {
        "states" : {
            "terms" : {
                "field" : "state"
            }
        }
    }
}
This gives us these results:
{
...
    "aggregations": {
        "states": {
            "buckets": [
                {
                    "key": "new",
                    "doc_count": 5
                },
                {
                    "key": "york",
                    "doc_count": 3
                },
                {
                    "key": "jersey",
                    "doc_count": 1
                },
                {
                    "key": "mexico",
                    "doc_count": 1
                }
            ]
        }
    }
}
Oh dear, that’s not at all what we want! Instead of counting states, the aggregation is counting individual words. The underlying reason is simple: aggregations are built from the inverted index, and the inverted index is post-analysis.
When we added those documents to Elasticsearch, the string "New York" was analyzed/tokenized into ["new", "york"]. These individual tokens were then used to populate aggregation counts, and ultimately we see counts for new instead of New York.
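You can see this for yourself with the analyze API. Here is a minimal sketch using the query-string form of the API (this form matches older Elasticsearch versions; newer versions expect a JSON body with analyzer and text parameters). The field parameter asks Elasticsearch to analyze the text with whatever analyzer is configured for state:

GET /agg_analysis/_analyze?field=state
New York

The response should contain two separate tokens, new and york, which is exactly what the terms aggregation ended up counting.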
This is obviously not the behavior that we wanted, but luckily it is easily corrected.
We need to define a multifield for state and set it to not_analyzed. This will prevent New York from being analyzed, which means it will stay a single token in the aggregation. Let's try the whole process over, but this time specify a raw multifield:
DELETE /agg_analysis/
PUT /agg_analysis
{
    "mappings": {
        "data": {
            "properties": {
                "state" : {
                    "type": "string",
                    "fields": {
                        "raw" : {
                            "type": "string",
                            "index": "not_analyzed" (1)
                        }
                    }
                }
            }
        }
    }
}
POST /agg_analysis/data/_bulk
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New Jersey" }
{ "index": {}}
{ "state" : "New Mexico" }
{ "index": {}}
{ "state" : "New York" }
{ "index": {}}
{ "state" : "New York" }
GET /agg_analysis/data/_search
{
    "size" : 0,
    "aggs" : {
        "states" : {
            "terms" : {
                "field" : "state.raw" (2)
            }
        }
    }
}
1. This time we explicitly map the state field and include a not_analyzed sub-field.
2. The aggregation is run on state.raw instead of state.
Now when we run our aggregation, we get results that make sense:
{
...
    "aggregations": {
        "states": {
            "buckets": [
                {
                    "key": "New York",
                    "doc_count": 3
                },
                {
                    "key": "New Jersey",
                    "doc_count": 1
                },
                {
                    "key": "New Mexico",
                    "doc_count": 1
                }
            ]
        }
    }
}
In practice, this kind of problem is easy to spot. Your aggregations will simply return strange buckets, and you’ll remember the analysis issue. It is a generalization, but there are not many instances where you want to use an analyzed field in an aggregation. When in doubt, add a multifield so you have the option for both.
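If you find yourself in this situation with an existing index, you can usually add the multifield after the fact, since adding a new sub-field is a non-conflicting mapping change. A sketch of that update (note that only documents indexed after the change will have the new sub-field populated; existing documents must be reindexed to pick it up):

PUT /agg_analysis/_mapping/data
{
    "properties": {
        "state" : {
            "type": "string",
            "fields": {
                "raw" : {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}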
While the first problem relates to how data is aggregated and displayed to your user, the second problem is largely technical and behind the scenes.
Doc values do not support analyzed string fields because they are not very efficient at representing multi-valued strings. Doc values are most efficient when each document has one or several tokens, but not thousands as in the case of large, analyzed strings (imagine a PDF body, which may be several megabytes and have thousands of unique tokens).
For that reason, doc values are not generated for analyzed strings. Yet these fields can still be used in aggregations. How is that possible?

The answer is a data structure known as fielddata. Unlike doc values, fielddata is built and managed 100% in memory, living inside the JVM heap. That means it is inherently less scalable and has a lot of edge cases to watch out for. The rest of this chapter addresses the challenges of fielddata in the context of analyzed strings.
Note: Historically, fielddata was the default for all fields, but Elasticsearch has been migrating toward doc values to reduce the chance of OOM. Analyzed strings are the last holdout where fielddata is still used. The goal is to eventually build a serialized data structure similar to doc values that can handle highly dimensional analyzed strings, obsoleting fielddata once and for all.
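Because fielddata lives on the heap, it pays to keep an eye on how much memory it is consuming. One way to check is the fielddata section of the indices stats API; a minimal sketch (the fields parameter simply restricts the output to the state field):

GET /agg_analysis/_stats/fielddata?fields=state

The response reports fielddata memory usage per index, so you can see the real heap cost of aggregating on an analyzed field.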
There is another reason to avoid aggregating analyzed fields: high-cardinality fields consume a large amount of memory when loaded into fielddata. The analysis process often (although not always) generates a large number of tokens, many of which are unique. This increases the overall cardinality of the field and contributes to more memory pressure.
Some types of analysis are extremely unfriendly with regard to memory. Consider an n-gram analysis process. The term New York might be n-grammed into the following tokens:
- ne
- ew
- "w " (note the trailing space)
- " y" (note the leading space)
- yo
- or
- rk
You can imagine how the n-gramming process creates a huge number of unique tokens, especially when analyzing paragraphs of text. When these are loaded into memory, you can easily exhaust your heap space.
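To make the explosion concrete, here is a sketch of an analyzer that produces exactly the bigrams listed above (the index name ngram_test and the analyzer and filter names are made up for this example):

PUT /ngram_test
{
    "settings": {
        "analysis": {
            "filter": {
                "bigram_filter": {
                    "type":     "ngram",
                    "min_gram": 2,
                    "max_gram": 2
                }
            },
            "analyzer": {
                "bigrams": {
                    "tokenizer": "keyword",
                    "filter":    [ "lowercase", "bigram_filter" ]
                }
            }
        }
    }
}

GET /ngram_test/_analyze?analyzer=bigrams
New York

The keyword tokenizer keeps "new york" as a single token, and the ngram filter then slices it into the seven bigrams shown earlier: eight characters become seven tokens. Apply the same process to a multi-kilobyte document and you get thousands of mostly unique tokens.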
So, before aggregating string fields, assess the situation:

- Is it a not_analyzed field? If yes, the field will use doc values and be memory-friendly.
- Otherwise, this is an analyzed field. It will use fielddata and live in memory. Does this field have a very large cardinality caused by ngrams, shingles, and the like? If yes, it may be very memory-unfriendly.
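A quick way to answer the first question is to inspect the mapping itself; a sketch using the index and type from this chapter:

GET /agg_analysis/_mapping/data

Any string field whose mapping does not say "index": "not_analyzed" is analyzed by default, and therefore falls into the second, fielddata-backed category.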