Documents lost after ingestion #5465

Open
djklim87 opened this issue Sep 30, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@djklim87

Describe the bug
After ingesting 1.8T of data, we lost 9 documents.
We log every response from Quickwit during ingestion, and the logs show:

{"num_docs_for_processing":7000} #247545 times
{"num_docs_for_processing":1000} #2 times
{"num_docs_for_processing":71} #1 time

So we expected 1732817071 docs in total (247545 × 7000 + 2 × 1000 + 71) but got 1732817062:
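The expected total follows directly from the logged responses. A minimal Python check of that arithmetic, using only the counts quoted above (the variable names are illustrative):

```python
# Batch size -> number of logged responses with that size, as quoted above.
logged_responses = {7000: 247545, 1000: 2, 71: 1}

# Total number of documents Quickwit acknowledged for processing.
expected_docs = sum(size * count for size, count in logged_responses.items())

# Document count reported by the search request in this issue.
indexed_docs = 1732817062

missing_docs = expected_docs - indexed_docs
print(expected_docs, missing_docs)  # 1732817071 9
```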

# curl -H "Content-type: application/json" -X POST \
> http://localhost:7280/api/v1/taxi/search/ \
> -d '{"query":"*","max_hits":0,"aggs":{"count(*)":{"value_count":{"field":"id"}}}}'
{ 
  "num_hits": 1732817062,
  "hits": [],
  "elapsed_time_micros": 1679406,
  "errors": [],
  "aggregations": {
    "count(*)": {
      "value": 1732817062.0
    }
  }
}
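The comparison against this response can be scripted. The sketch below parses a trimmed copy of the response above and checks that `num_hits` and the `value_count` aggregation agree before computing the gap. The helper name `count_gap` is hypothetical and not part of Quickwit.

```python
import json


def count_gap(response_text: str, expected: int) -> int:
    """Return how many documents are missing relative to `expected`."""
    body = json.loads(response_text)
    num_hits = body["num_hits"]
    agg_value = int(body["aggregations"]["count(*)"]["value"])
    # If these differ, some indexed documents lack the `id` field.
    if num_hits != agg_value:
        raise ValueError(f"num_hits={num_hits} but value_count={agg_value}")
    return expected - num_hits


# Trimmed copy of the search response shown above.
response_text = """
{
  "num_hits": 1732817062,
  "hits": [],
  "errors": [],
  "aggregations": {"count(*)": {"value": 1732817062.0}}
}
"""

gap = count_gap(response_text, expected=1732817071)
print(gap)  # 9
```

Because both counters agree here, the 9 missing documents appear to be absent from the index entirely rather than indexed without an `id`.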

Steps to reproduce (if applicable)

This is a large amount of data, so we can't easily provide a dump.

You can reproduce the issue with the database comparison tool:

  1. Clone the comparison tool and check out the feat/quickwit branch
     git clone git@github.com:db-benchmarks/db-benchmarks.git
     cd db-benchmarks
     git checkout feat/quickwit
  2. Copy .env.example to .env
  3. Update cpuset in .env to match the CPUs that your machine has
  4. Open the test folder
     cd tests/taxi
  5. Add exit 1 to prevent the other engines from initializing (this doesn't affect the issue and saves space)
  6. Run ./init

Ingestion will take 3-4 days, after which you will see the problem.

Expected behavior
The search should report a document count of 1732817071.

Configuration:
Please provide:

  1. Output of quickwit --version: 0.8.1
  2. The index_config.yaml
djklim87 added the bug label Sep 30, 2024
@PSeitz
Contributor

PSeitz commented Oct 1, 2024

If documents don't match the schema, they won't be indexed, which may cause the mismatch

@djklim87
Author

djklim87 commented Oct 1, 2024

> If documents don't match the schema, they won't be indexed, which may cause the mismatch

Shouldn't it respond with some error? We don't see any error responses.

@PSeitz
Contributor

PSeitz commented Oct 2, 2024

No, I think it currently only logs errors.
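Since schema mismatches are reportedly only logged on the server, one possible workaround is to check each batch on the client before sending it. The sketch below is only an illustration under stated assumptions: the index_config.yaml was not posted in this thread, so `EXPECTED_FIELDS` and its `pickup_datetime` entry are hypothetical (only `id` appears in the query above), `find_suspect_docs` is not a Quickwit API, and Quickwit's real doc mapping rules may be stricter or more lenient than a plain type check.

```python
from typing import Any

# Hypothetical field -> type mapping; replace with the real doc mapping.
EXPECTED_FIELDS = {"id": int, "pickup_datetime": str}


def find_suspect_docs(
    docs: list[dict[str, Any]], expected_fields: dict[str, type]
) -> list[tuple[int, str, Any]]:
    """Return (position, field, value) for the first bad field of each doc."""
    suspects = []
    for position, doc in enumerate(docs):
        for field, expected_type in expected_fields.items():
            value = doc.get(field)  # None when the field is missing
            if not isinstance(value, expected_type):
                suspects.append((position, field, value))
                break  # one report per document is enough
    return suspects


batch = [
    {"id": 1, "pickup_datetime": "2015-01-01 00:00:00"},
    {"id": "2", "pickup_datetime": "2015-01-01 00:05:00"},  # id has wrong type
    {"pickup_datetime": "2015-01-01 00:10:00"},  # id is missing
]

suspects = find_suspect_docs(batch, EXPECTED_FIELDS)
print(suspects)  # [(1, 'id', '2'), (2, 'id', None)]
```

Logging the suspects next to each ingest response would make it possible to tell whether the 9 missing documents correspond to documents that fail such a check.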
