Data Explorer: Fix failures with DuckDB CSV reader on files with columns having difficult-to-infer data types #5764
Addresses #5746. With very large CSV files, DuckDB can infer an integer type for a column from its sample of rows and then hit an error later when attempting to convert the rest of the file to integers. One such data file is found at https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip.
This PR changes the CSV importing to fall back on `sample_size=-1` (which uses the entire file for type inference, rather than a sample of rows) in these exceptional cases. This makes the file take longer to load, but that is better than failing completely.

I made a couple of other incidental changes:
- I switched to `CREATE TABLE` when importing CSV files, which gives better performance at the cost of memory use (we can wait for people to complain about memory problems before working more on this; one potential way around it is using a temporary local DuckDB database file instead of an in-memory one). I made sure that file live updates weren't broken by these changes.
- I kept `CREATE VIEW` with Parquet files, since single-threaded DuckDB is plenty snappy without converting the Parquet file to its own internal data format.

QA Notes
Loading this 1GB TSV file into the data explorer takes tens of seconds because duckdb-wasm is single-threaded, so just wait! It will eventually load.
e2e: @:data-explorer