-
There shouldn't be such a big performance gap in that case. I will look into the problem.
-
Hi @leezu, I have checked the problem. I think the behaviour for your case is expected.
The way vineyard stores data structures has its benefits as well: you can take a reference to a particular record batch in a table and, after deleting the table, keep using that record batch while releasing the memory of the others. Two tables can also share a subset of their record batches. In your case, combining chunks before putting the table into vineyard should give much better performance.
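For what it's worth, a minimal sketch of that workaround (`combine_chunks` is pyarrow's API for consolidating a chunked table; the vineyard socket path is an assumption and depends on how vineyardd was started):

```python
import pyarrow as pa
import vineyard

# Build a toy table out of many small record batches to mimic a
# heavily chunked table.
batches = [pa.RecordBatch.from_pydict({'x': list(range(i, i + 10))})
           for i in range(0, 1000, 10)]
table = pa.Table.from_batches(batches)
print(table.column('x').num_chunks)     # 100 chunks

# combine_chunks() concatenates the chunks so each column becomes a
# single contiguous chunk, avoiding per-chunk work when vineyard
# stores the table.
combined = table.combine_chunks()
print(combined.column('x').num_chunks)  # 1 chunk

# Socket path is an assumption; use the one your vineyardd listens on.
client = vineyard.connect('/var/run/vineyard.sock')
object_id = client.put(combined)
```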
-
BTW, may I know the context of the task in which you are using vineyard to store the data? We are eager to hear from the community and end-users about scenarios where vineyard could work, and we welcome any suggestions to make vineyard better. Many thanks!
-
Thank you @sighingnow for investigating the issue. Combining chunks before storing in vineyard is an acceptable workaround for me. I'll try it out.
The context is training models on multiple GPUs, with one process per GPU, using vineyard to store the dataset in shared memory.
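In case it helps, a rough sketch of what that setup looks like (the socket path, the dataset file, and the `publish_dataset`/`worker` functions are hypothetical; `put`/`get` are vineyard's documented client calls):

```python
import pandas as pd
import vineyard

SOCKET = '/var/run/vineyard.sock'  # assumption: default vineyardd socket

def publish_dataset():
    # Run once: put the dataset into vineyard's shared memory and hand
    # the resulting object id to the per-GPU training processes.
    client = vineyard.connect(SOCKET)
    df = pd.read_parquet('dataset.parquet')  # hypothetical dataset file
    return client.put(df)

def worker(rank, object_id):
    # Each per-GPU process connects to the same vineyard instance and
    # resolves the shared object instead of keeping its own copy.
    client = vineyard.connect(SOCKET)
    df = client.get(object_id)
    return df  # feed `df` into the training loop for GPU `rank`
```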
-
Thank you @leezu. Currently, pandas compatibility is sufficient for common cases, but it still needs further improvement. Feel free to post here if you meet other problems :)
-
Would it make sense to use
-
I have a pyarrow table whose columns are composed of 21036 chunks (21032413 rows). Storing the table (or the equivalent pandas dataframe) in vineyard does not succeed in a reasonable amount of time (I waited a few minutes); it hangs in `record_batch_builder` (and in the equivalent function for pandas, respectively).
That's unlike the pyarrow plasma implementation, which is implemented in C++ and just takes a few seconds:
https://github.com/apache/arrow/blob/995abdc02fed412bbd947fe41a0765036dbbe820/cpp/src/arrow/python/serialize.cc#L588-L599
Do you intend to match performance with pyarrow for large tables? Or is this out of scope for the project?
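For reference, a scaled-down sketch of how I'm measuring this (chunk and row counts are reduced here; the socket path is an assumption):

```python
import time
import pyarrow as pa
import vineyard

# Many tiny record batches per column, mimicking the 21036-chunk table
# at a reduced scale.
batches = [pa.RecordBatch.from_pydict({'x': list(range(100))})
           for _ in range(2000)]
table = pa.Table.from_batches(batches)

client = vineyard.connect('/var/run/vineyard.sock')  # socket path is an assumption

start = time.time()
object_id = client.put(table)  # time grows with the number of chunks
print('put took %.1fs' % (time.time() - start))
```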