Cannot import large csv tables #2995

Closed
spapas opened this issue Jul 5, 2023 · 5 comments
Labels: needs: user feedback · type: bug · user reported · work: backend · work: frontend

@spapas
Contributor

spapas commented Jul 5, 2023

Description

I am trying to import a large CSV file (>20 MB): a table with ~200k rows and ~30 columns per row. After the data is uploaded, it says "Please wait while we prepare a preview for you" and then displays the table preview with only the column names (without data). If I then visit my tables, I see a new table with the name of the CSV and a comment "Needs Import confirmation". When I click it, I get the same problem with the preview.

I did some debugging on the network requests, and it seems that the import page (URL at /db/mathesar_data/20/import/83/) tries to fetch the URL /api/db/v0/tables/83/type_suggestions/?columns_might_have_defaults=false via AJAX. This call takes too long (more than 30 seconds) and is killed by gunicorn, so I get a "Failed to load preview" error.
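For anyone hitting the same limit: 30 seconds is gunicorn's default worker `timeout`. As a temporary workaround (it does nothing about the slow request itself), the limit can be raised in a gunicorn config file. A minimal sketch, assuming the deployment lets you supply your own `gunicorn.conf.py`; the file location and the value 300 are illustrative assumptions, not something recommended in this thread:

```python
# gunicorn.conf.py (hypothetical location; depends on how Mathesar is deployed)
# Workers that stay silent for more than `timeout` seconds are killed and
# restarted. gunicorn's default is 30, which matches the cutoff seen above.
timeout = 300  # arbitrary example value, in seconds
```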

See the images below for more info.

Expected behavior

To be able to import the data. My understanding of this issue is that Mathesar tries to be smart by altering the column types, and this takes far too long when the table contains a lot of data.

To Reproduce

Unfortunately, I cannot provide the data that triggers the error because it is internal. However, I believe you would get similar behavior with any large file. Please note that I tried importing the same columns with only a couple of rows and it worked fine.

Environment

  • OS: CentOS 7.9
  • Browser: Chrome
  • Browser Version: Latest
  • Other info: Using the Mathesar development version

Additional context

It waits in this state for a long time (until the request is killed):

image

The AJAX request that gets killed because of the 30-second gunicorn limit:

image

The error I get:

image

@spapas spapas added status: triage type: bug Something isn't working labels Jul 5, 2023
@rajatvijay rajatvijay added this to the Backlog milestone Jul 5, 2023
@rajatvijay rajatvijay added work: backend Related to Python, Django, and simple SQL user reported Reported by a Mathesar user labels Jul 11, 2023
@kgodey
Contributor

kgodey commented Jul 11, 2023

We've discussed making inference (i.e. guessing column types) optional during the import process. I believe that should fix this.

@dmos62
Contributor

dmos62 commented Jul 12, 2023

This is the meta ticket for improving column type inference: #2346

And this is a sub-ticket for making type inference optional, which is one of the possible solutions to timeouts such as the one reported here: #2358

@dmos62 dmos62 modified the milestones: Backlog, v0.1.3 Jul 14, 2023
@dmos62 dmos62 added the work: frontend Related to frontend code in the mathesar_ui directory label Aug 16, 2023
@seancolsen seancolsen modified the milestones: v0.1.3, v0.1.4 Aug 17, 2023
@seancolsen
Contributor

@spapas

In #3050 I made some changes which I think might help with this problem you are having.

This change will be available in Mathesar 0.1.4 which should be released in the next few weeks.

Before my change, importing large amounts of data required significant computational time to determine the best Postgres type for each column. (We refer to this process as "column type inference".) I'm fairly certain that the error you observed was due to timeouts during the inference process. After my change, column type inference is optional. You can disable it during import here:

image

With column type inference disabled, Mathesar will use the "Text" type for all columns by default. But you can still manually configure column types during import as shown here:

image

This "optional column type inference" approach is somewhat of a stopgap measure, intended to offer a quick fix to this problem so that people like you can import large CSV data sets. In the future we'd like to make more improvements to the inference process as well, and we've opened #2346 to track them.
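To illustrate why inference gets expensive as the data grows, and why skipping it is cheap, here is a minimal Python sketch of the general idea. It is not Mathesar's actual implementation: the function names, the list of candidate types, and the `infer` flag are all hypothetical, chosen only for illustration.

```python
import csv
import io
from datetime import date

# Candidate types, tried from most to least specific.
# (Hypothetical list and ordering, for illustration only.)
CANDIDATES = [
    ("integer", int),
    ("numeric", float),
    ("date", date.fromisoformat),
]


def _all_parse(values, parser):
    """Return True if every non-empty value is accepted by `parser`."""
    for value in values:
        if value == "":
            continue  # treat empty cells as NULLs that fit any type
        try:
            parser(value)
        except ValueError:
            return False
    return True


def infer_column_types(csv_text, infer=True):
    """Map each column name to a guessed type name.

    With infer=False no data is scanned and every column stays "text".
    With infer=True each column may be scanned once per candidate type,
    so the work grows with rows x columns x candidates.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    types = {}
    for index, name in enumerate(header):
        types[name] = "text"  # default when nothing more specific fits
        if not infer:
            continue
        column = [row[index] for row in data]
        for type_name, parser in CANDIDATES:
            if _all_parse(column, parser):
                types[name] = type_name
                break
    return types


sample = "id,price,created,note\n1,9.5,2023-07-05,hello\n2,10,2023-07-06,\n"
inferred = infer_column_types(sample)
as_text = infer_column_types(sample, infer=False)
```

With `infer=False` the function returns immediately with every column as "text", which mirrors the behavior described above when inference is disabled during import.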

I'd love to hear back from you once you've had a chance to try your import again while manually disabling column type inference. Does this fix your problem? Do you have other thoughts or feedback about how we can improve this import functionality? We'd really appreciate your feedback!

I'm going to leave this ticket open while we wait to hear back from you, but I'm moving it out of our 0.1.4 milestone because we don't plan to make any additional changes to the import process for 0.1.4.

@seancolsen seancolsen modified the milestones: v0.1.4, High priority Oct 9, 2023
@seancolsen seancolsen added needs: user feedback We are waiting for a user to answer questions or provide feedback on our fix and removed status: started labels Oct 9, 2023
@spapas
Contributor Author

spapas commented Oct 10, 2023

Hello @seancolsen, thank you for taking the time to look at this problem!

Unfortunately, I'm on vacation right now and don't have access to the original very large CSV that had the bad behavior. I tried it with another CSV and it worked great; hopefully the original CSV will also work fine when importing all columns as text.

I suggest you close this issue. If I run into any more problems related to this, I'll open a new one.

Kind regards,
Serafeim

@seancolsen
Contributor

Thanks @spapas! Closing now.
