Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Explorer: Fix failures with DuckDB CSV reader on files with columns having difficult-to-infer data types #5764

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

wesm
Copy link
Contributor

@wesm wesm commented Dec 16, 2024

Addresses #5746. In very large CSV files, it can be possible for DuckDB to infer an integer type for a column and then subsequently have an error when attempting to convert the rest of the data file to integers. One such data file is found at https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip.

This PR changes the CSV importing to fall back on sample_size=-1 (which uses the entire file to do type inference, rather than a sample of rows) in these exceptional cases. This means it takes longer to load the file, but this is better than completely failing.

I made a couple of other incidental changes:

  • Always use CREATE TABLE when importing CSV files, which gives better performance at the exchange of memory use (we can wait for people to complain about memory use problems before working more on this -- using a temporary local DuckDB database file instead of an in-memory one is one way around this potentially). I made sure that file live updates weren't broken by these changes.
  • Always use CREATE VIEW with Parquet files, since single-threaded DuckDB is plenty snappy without converting the Parquet file to its own internal data format.

QA Notes

Loading this 1GB TSV file into the data explorer takes 10s of seconds because duckdb-wasm is single-threaded, so just wait! It will eventually load.

e2e: @:data-explorer

Copy link

github-actions bot commented Dec 16, 2024

E2E Tests 🚀  ?
This PR will run tests tagged with: @critical @data-explorer

@wesm wesm force-pushed the bug/de-duckdb-csv-import-robustness branch from a411bab to 6ae47db Compare December 17, 2024 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant