Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregations that return field values do not respect output format #5254

Closed
esatterwhite opened this issue Jul 24, 2024 · 8 comments · Fixed by quickwit-oss/tantivy#2468 or #5293
Closed
Assignees
Labels
bug Something isn't working

Comments

@esatterwhite
Copy link
Collaborator

esatterwhite commented Jul 24, 2024

Describe the bug
The output format is not respected on aggregations that return field values, such as bucket aggregations, For example, u64 fields that are serialized to json will return a number in scientific notation.

Steps to reproduce (if applicable)

Field Definition
    , {
        name: '_lid'
      , type: 'u64'
      , coerce: true
      , output_format: 'string'
      , fast: true
      }
Query
{
  "size": 0,
  "aggs": {
    "dupes": {
      "terms": { "field": "_lid" }
    }
  }
}'

Will return

  "aggregations": {
    "dupes": {
      "buckets": [
        {
          "doc_count": 1,
          "key": 1.7679365525681357e18
        },
      ]
    }
  }

Expected behavior
If a field has a defined output format in the index field mapping, it should be used to format the field value before it is serialized

Build Info

{
  "build": {
    "build_date": "2024-07-19T08:56:40Z",
    "build_profile": "release",
    "build_target": "x86_64-unknown-linux-gnu",
    "cargo_pkg_version": "0.8.0",
    "commit_date": "2024-07-19T08:44:01Z",
    "commit_hash": "5638734dda122a87e269758905bf416bee42623c",
    "commit_short_hash": "5638734",
    "commit_tags": [
      "qw-airmail-20240719"
    ],
    "version": "0.8.0-nightly"
  },
  "runtime": {
    "num_cpus": 2,
    "num_threads_blocking": 1,
    "num_threads_non_blocking": 1
  }
}
@esatterwhite esatterwhite added the bug Something isn't working label Jul 24, 2024
@fulmicoton
Copy link
Contributor

fulmicoton commented Jul 25, 2024

Related to #2458

@fulmicoton
Copy link
Contributor

@PSeitz have you checked what elasticsearch does there?
Do they return the key_as_string by default? Is the number shown in scientific notation or can it serialize u64 numbers?

@PSeitz
Copy link
Contributor

PSeitz commented Jul 25, 2024

ES does not return key_as_string here, they return just the number. They also don't have an output_format equivalent. I think the way it's handled there is to store the value as text, or use runtime script field.

POST /aggs1/_doc/1
{ "value": 4.56e-9 }
GET /aggs1/_search
{
  "size": 0,
  "aggs": {"number_terms": { "terms": { "field": "value" }}
  }
}

...
"buckets": [
    {
      "key": 4.559999933206882e-9,
      "doc_count": 1
    }
]

It would make sense to generally return key_as_string to avoid the float precision issue when the data is send to the browser (quickwit-oss/tantivy#2458)

@esatterwhite
Copy link
Collaborator Author

esatterwhite commented Jul 25, 2024

In elasticsearch, we have this number mapped as a long and is isn't returning this value in scientific notation

{
  took: 34549,
  timed_out: false,
  _shards: { total: 3, successful: 3, skipped: 0, failed: 0 },
  _clusters: { total: 1, successful: 1, skipped: 0 },
  hits: {
    total: { value: 10000, relation: 'gte' },
    max_score: null,
    hits: []
  },
  aggregations: {
    dupes: {
      doc_count_error_upper_bound: 3,
      sum_other_doc_count: 53864635,
      buckets: [
        { key: 1769070189829214200, doc_count: 1 },
        { key: 1769070189829214200, doc_count: 1 },
        { key: 1769070189837602800, doc_count: 1 },
        { key: 1769070189837602800, doc_count: 1 },
        { key: 1769070189837602800, doc_count: 1 },
        { key: 1769070189837602800, doc_count: 1 },
        { key: 1769070189837602800, doc_count: 1 },
        { key: 1769070189837602800, doc_count: 1 },
        { key: 1769070189837602800, doc_count: 1 },
        { key: 1769070189837602800, doc_count: 1 }
      ]
    }
  }
}

@esatterwhite
Copy link
Collaborator Author

esatterwhite commented Jul 25, 2024

Although, It does appear that the number is being truncated somewhere.
So the behavior of the response does seem a little defendant on the field mapping.
In which case, I would mostly expect the output_format to wrap the value as a string.

@fulmicoton
Copy link
Contributor

@PSeitz I still don't understand the status of this ticket. Can you update? You also haven't said if ES was serializing stuff as u64 in decimal form. Eric is suggesting this is the case.

@PSeitz
Copy link
Contributor

PSeitz commented Jul 31, 2024

ES returns long values not in scientific notation, but truncates them (also shown by Erics example).

PUT /long_aggs
{
  "mappings": {
    "properties": {
      "number_field": {
        "type": "long"
      }
    }
  }
}
# Add another document to aggs
POST /long_aggs/_doc/1
{
  "value": 1769070189829214202
}
GET /long_aggs/_search
{
  "size": 0,
  "aggs": {
    "number_terms": {"terms": {"field": "value"}}
  }
}
...
  "aggregations": {
    "number_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 1769070189829214200,
          "doc_count": 1
        }
      ]
    }
  }

Currently we store numeric keys in the term aggregation as f64. If we extend the Key enum with Key::I64 and Key::U64 we can keep the full precision. serde doesn't seem to serialize integers in scientific notation.
I'll prepare a PR for that.

Having multiple numerical types adds some complexity though, e.g. merging aggregation results with mixed types.

@PSeitz
Copy link
Contributor

PSeitz commented Aug 6, 2024

@esatterwhite u64 numbers should now be correctly serialized with full precision

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants