Data Explorer: Fix failures with DuckDB CSV reader on files with columns having difficult-to-infer data types #5764

wesm · 2024-12-16T22:09:31Z

Addresses #5746. In very large CSV files, it can be possible for DuckDB to infer an integer type for a column and then subsequently have an error when attempting to convert the rest of the data file to integers. One such data file is found at https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip.

This PR changes the CSV importing to fall back on sample_size=-1 (which uses the entire file to do type inference, rather than a sample of rows) in these exceptional cases. This means it takes longer to load the file, but this is better than completely failing.

I made a couple of other incidental changes:

Always use CREATE TABLE when importing CSV files, which gives better performance at the exchange of memory use (we can wait for people to complain about memory use problems before working more on this -- using a temporary local DuckDB database file instead of an in-memory one is one way around this potentially). I made sure that file live updates weren't broken by these changes.
Always use CREATE VIEW with Parquet files, since single-threaded DuckDB is plenty snappy without converting the Parquet file to its own internal data format.

QA Notes

Loading this 1GB TSV file into the data explorer takes 10s of seconds because duckdb-wasm is single-threaded, so just wait! It will eventually load.

e2e: @:data-explorer

github-actions · 2024-12-16T22:09:50Z

E2E Tests 🚀 ^?
This PR will run tests tagged with: @critical @data-explorer

…bust

wesm added 2 commits December 17, 2024 14:22

Make loading CSV/TSV files with more difficult type inference more ro…

f01c825

…bust

remove superfluous semicolon

6ae47db

wesm force-pushed the bug/de-duckdb-csv-import-robustness branch from a411bab to 6ae47db Compare December 17, 2024 20:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Explorer: Fix failures with DuckDB CSV reader on files with columns having difficult-to-infer data types #5764

Data Explorer: Fix failures with DuckDB CSV reader on files with columns having difficult-to-infer data types #5764

wesm commented Dec 16, 2024 •

edited

Loading

github-actions bot commented Dec 16, 2024 •

edited

Loading

Data Explorer: Fix failures with DuckDB CSV reader on files with columns having difficult-to-infer data types #5764

Are you sure you want to change the base?

Data Explorer: Fix failures with DuckDB CSV reader on files with columns having difficult-to-infer data types #5764

Conversation

wesm commented Dec 16, 2024 • edited Loading

QA Notes

github-actions bot commented Dec 16, 2024 • edited Loading

wesm commented Dec 16, 2024 •

edited

Loading

github-actions bot commented Dec 16, 2024 •

edited

Loading