This repository has been archived by the owner on Nov 2, 2018. It is now read-only.

Presto-odbc with large tables (>300 K rows) #49

Open
nnadtoka opened this issue Sep 30, 2014 · 5 comments

Comments

@nnadtoka

Hi @nezihyigitbasi,

We are seeing a bit of an issue when loading large tables (over 300K-500K rows) into Tableau with the Presto ODBC driver. It becomes very slow and seems to work only with an extract of the data.

Have you seen any issues like that before? What kinds of tables have you tested it on (row counts, number of columns, data types)?

Thanks Nezih.

Best,
Nataliya.

@nezihyigitbasi
Contributor

Hi Nataliya,
We also had similar issues with big tables at Netflix: the driver currently loads the whole result set into memory before returning it to the application/Tableau, which is why it is slow. Please see https://github.com/prestodb/presto-odbc/blob/master/driver/driver.d#L448 for the for loop that exhausts all rows in the returned result set. As you say, the driver is currently more suitable for data extracts.
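
For context, here is a minimal sketch (in Python, not the driver's D code) of what that exhaustion looks like against Presto's HTTP protocol; the host, port, and user are placeholder assumptions:

```python
import requests  # assumption: requests is available

def run_query_eagerly(query, host="http://localhost:8080", user="odbc"):
    # POST the statement; Presto replies with JSON that may contain a page
    # of rows ("data") and a link to the next page ("nextUri").
    resp = requests.post(host + "/v1/statement", data=query,
                         headers={"X-Presto-User": user}).json()
    rows = []
    while True:
        rows.extend(resp.get("data", []))     # every page is kept in memory
        next_uri = resp.get("nextUri")
        if next_uri is None:                  # no more pages: query finished
            return rows
        resp = requests.get(next_uri).json()  # eagerly fetch the next page
```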

Nezih

@yitzchakl

Hi.

In driver.d, how could I change the SQLExecuteImpl method to return the result set to the application in pages (or with similar functionality), rather than only at the end of batch processing?
Is there a way to avoid exhausting all of the result set's rows into memory?

@markisaa
Contributor

@yitzchakl - I just did some checking. This may be possible (though a definitive answer will require yet more investigation).

Notes to a future implementer:

  1. You must not move the call to ExecuteImpl out of SQLPrepare. In particular, it is necessary to check the syntax/validity of the query at the prepare step, as this cannot be done later on. (Presto itself also has no SQLPrepare equivalent or purely syntax-checking query capability.)
  2. The results from Presto are already naturally "paged". We use the HTTP interface, which returns some results along with a link to the "next" batch of results. Presently the code reads all of them up-front, but we only need to read the first page right away to satisfy the error-checking requirement. It may be possible to meet that need while deferring more of the work until it is called for (see the sketch after this list).
  3. This is presently of questionable benefit: because garbage collection is turned off, the memory would not be freed even if we defer reading the input, so it may not buy us anything at all. This is especially true given that Tableau, in my experience, likes to read all of the data up-front anyway. Correctly re-enabling the garbage collector is a highly non-trivial task.
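
Regarding point 2, a rough sketch of what deferring the page reads could look like against Presto's paged REST responses; this is illustrative Python, not the driver's actual API, and the connection details are assumptions:

```python
import requests  # assumption: requests is available

def run_query_lazily(query, host="http://localhost:8080", user="odbc"):
    # Submit the statement; the first response is where a syntax error would
    # show up, which is the part SQLPrepare actually needs right away.
    resp = requests.post(host + "/v1/statement", data=query,
                         headers={"X-Presto-User": user}).json()
    while True:
        yield resp.get("data", [])            # hand back one page at a time
        next_uri = resp.get("nextUri")
        if next_uri is None:                  # no more pages: query finished
            return
        resp = requests.get(next_uri).json()  # fetched only when asked for

# Usage: the fetch calls would drain this generator page by page instead of
# materializing every row before SQLExecute returns.
for page in run_query_lazily("SELECT * FROM some_table"):
    for row in page:
        pass  # convert and hand the row to the application here
```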

@yitzchakl

Ok, thanks @markisaa.

My main goal is to improve the performance of the execute step; right now my testing results are not so good (I'm working with QlikView, not Tableau):
3 minutes for 300,000 rows (14 columns table)
4 minutes for 540,000 rows (14 columns table)
9 minutes (!) for 560,000 rows (31 columns table)

Of course I'm running without the tracing log :-)
Which is the most expensive operation to optimize? The for loop?

Thanks in advance.

@markisaa
Contributor

Sorry to hear that. Performance was definitely not high among the original design considerations; we wanted something that worked, and my internship only lasted so long. The only way to know what's slow would be to benchmark. It would not surprise me if the [mis]use of memory played a role here (though I'm sure there's plenty of code that can simply be optimized anyway).

I'm not in a position to work on this at the moment (swamped with my last term of university), but should that change I will see if I can help on this front. I'll also do some reading on how to properly re-enable the GC.
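
As a starting point for that benchmarking, a hedged sketch that times the execute and fetch phases separately through the DSN (pyodbc, with placeholder DSN and table names); it should show whether the up-front page draining in SQLExecute dominates:

```python
import time
import pyodbc  # assumption: pyodbc is installed and a "Presto" DSN is configured

conn = pyodbc.connect("DSN=Presto", autocommit=True)
cur = conn.cursor()

t0 = time.time()
cur.execute("SELECT * FROM some_table")  # the driver drains every page here
t1 = time.time()
rows = cur.fetchall()                    # should be comparatively cheap today
t2 = time.time()

print(f"execute: {t1 - t0:.1f}s  fetch: {t2 - t1:.1f}s  rows: {len(rows)}")
```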
