[BUG] Delta chronically behind Databricks #1775
Comments
IANAL, and as best I can read the license, Databricks is free to use this software to provide its product, but it seems against at least the spirit of the license to call the product by the same name and to claim it is open source (or open source-like).
I would be interested to know what your findings are. It is hard to tell from the comments.
Hi @th0ma5w - I'd like to understand your context better, as this issue does not provide the necessary background. If you would like, feel free to ping me at denny[dot]lee[at]databricks.com, and I'm glad to have a conversation at your convenience. And if you're up for it, we can summarize our conversation in this thread for full transparency.
If someone were to provide me a backup of files produced using Delta Lake-related product offerings, is there any code at https://github.com/delta-io that would be 100% round-trip feature complete for both reading and writing these files?
So I kept googling, and I read a lot about the open-sourcing of everything about a year ago. Around the same time, there was this blog post: https://www.dremio.com/blog/table-format-partitioning-comparison-apache-iceberg-apache-hudi-and-delta-lake/
Hi @th0ma5w - let me try to answer some of your questions:
Per your comment:
While Table Features certainly simplify our ability to flag and document this process, a key missing piece is the test infrastructure to validate that all of these different APIs actually perform as they state (via documentation, code, or otherwise). Given this project's pace and the many different users, contributors, and organizations using Delta, this is a real challenge. To address it, we are fortunate that we started the Delta Acceptance Testing (DAT) project earlier this year. We have had various meetings and contributions, initially with the Python, Rust, Spark, and Trino contributors (note this is open to everyone), to build up test infrastructure so that all APIs can both document and have associated test cases (i.e., pass/fail) for each of the different features. It is still early for this project (hence why it's part of delta-incubator). Still, it allows us to test and validate that the framework will address the needs of the community as a whole (e.g., different sets of validation queries between Trino SQL, DataFusion, Polars, Spark SQL, etc.).
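To make the idea concrete, here is a minimal sketch of the kind of cross-implementation reader check that DAT enables. The paths, the `check_reader` helper, the `id` column, and the assumption that expected results ship as Parquet are all illustrative rather than DAT's actual layout; it assumes the deltalake and pyarrow packages are installed.

```python
# Hypothetical acceptance-test sketch: read a reference Delta table with the
# implementation under test and compare it to pre-agreed expected content.
import pyarrow.parquet as pq
from deltalake import DeltaTable

def check_reader(table_path: str, expected_parquet: str) -> bool:
    # Read the reference table with the connector being validated
    # (here, the delta-rs Python bindings).
    actual = DeltaTable(table_path).to_pyarrow_table()
    # Load the expected content agreed upon across implementations.
    expected = pq.read_table(expected_parquet)
    # Sort both sides so the comparison is order-independent.
    return actual.sort_by("id").equals(expected.sort_by("id"))
```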
Per your comment:
No need to apologize; that's the whole point of these forums: discussion! The blog post you mentioned, Diving Into Delta Lake: Unpacking the Transaction Log (which you can also watch here), discusses how the Delta transaction protocol itself works. While it uses Apache Spark™ as its example (the blog was written in 2019, shortly after our initial open-sourcing of Delta Lake), the protocol itself is API agnostic. For example, here's a great webcast with @houqp, who created Delta Rust - D3L2: The Genesis of Delta Rust with QP Hou - discussing how he built Delta Rust from the protocol documentation over a few weekends. As for the implied difference between the OSS and Databricks offerings, we remedied this as part of Delta Lake 2.0, which was announced at Data + AI Summit 2022. Some information to help provide context:
As for your callout for better documentation on how they differ, you are right, and I have called myself out on this in the Delta Users Slack in this thread. HTH!
@th0ma5w - thanks for opening this issue. delta-io/delta is a reference implementation of the Delta Lake transaction log protocol. There are multiple implementations of the Delta Lake transaction log protocol in various stages of development, including delta-io/delta-rs and dask-contrib/dask-deltatable.
There are multiple open source implementations that are interoperable with Delta tables created by Databricks. This is one of the main benefits of the Lakehouse architecture.
Yes, and readers/writers that implement the Delta Lake transaction log protocol are interoperable. Your question refers to the delta-io GitHub organization (not this particular repo), so I'll give a more generic answer. A Delta table can only be read by a library that supports the table's protocol version. A Delta table written with deletion vectors enabled can only be read by a library that supports deletion vectors. At the time of writing, delta-io/delta can read Delta tables with deletion vectors enabled and delta-io/delta-rs cannot. I created a diagram the other day that shows how you can create a Delta table with PySpark (delta-io/delta), append to it with pandas (delta-io/delta-rs), and read it with Polars; it might help highlight the interoperable nature of Delta tables:
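Since the diagram itself isn't reproduced here, a rough code sketch of that same flow (the table path and sample data are illustrative; assumes pyspark with the Delta Lake jars available, plus the deltalake and polars packages):

```python
import pandas as pd
import polars as pl
from deltalake import write_deltalake
from pyspark.sql import SparkSession

# 1. Create a Delta table with PySpark (delta-io/delta); the Delta Lake
#    jars must be on the classpath (e.g., via spark.jars.packages).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"]) \
    .write.format("delta").save("/tmp/interop_table")

# 2. Append to the same table with pandas via delta-rs (delta-io/delta-rs).
write_deltalake("/tmp/interop_table",
                pd.DataFrame({"id": [3], "letter": ["c"]}),
                mode="append")

# 3. Read the table with Polars (which uses delta-rs under the hood).
print(pl.read_delta("/tmp/interop_table"))
```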
There is full interoperability between Delta tables written in compliance with the Delta Lake transaction log protocol and Delta Lake readers that support the Delta Lake transaction log protocol. This is the beauty of the Lakehouse architecture. A user can spin up a machine that ingests data from Kafka into a Delta table using delta-io/kafka-delta-ingest, and that Delta table can be read by any technology that follows the protocol.
I think you're referring to the "Tool Write Capability" diagram at the top of the blog post. I looked at it for 10 minutes and don't understand what it is trying to communicate. I think it is implying that there is no open source Flink writer for Delta Lake? Here is the open source Delta Lake Flink connector with write support. I don't want to go through that blog post in detail, but it seems to make several incorrect/misleading claims and omits important parts of the conversation, like Z-Ordering. Thanks again for opening this issue. Delta Lake is a Lakehouse storage system, as described in this paper, and any implementation that abides by the spec will, in turn, allow for an interoperable Lakehouse architecture. Let me know if you have any additional questions.
I feel like this is describing perhaps several layers of potential compatibility with various specific features... Is there a feature matrix available? Is it known or documented somewhere how tables produced by the Databricks platform differ, or are we saying that nothing but the open source versions is in use at Databricks today?
You're right - per this thread - it's on my priority list to provide.
@th0ma5w - Yea, each individual project should clearly indicate the protocol version/table features that are supported. I was just going to open an issue to request this in delta-io/delta-rs and realized that @roeap beat me to it in this PR: https://github.com/delta-io/delta-rs/pull/1440/files
Hi @th0ma5w - from your perspective, would documenting this address your concerns? No worries, I'm not trying to close this issue; I just wanted to determine the correct action items. Much appreciated!
I think this would go very far if I were tasked with doing this, but it might be hard to unpack depending on how it is documented. I guess I'm not ecstatic that the answer to "can you take a Databricks-hosted, Delta-dependent process and use open source tools to operate it" is "probably not if it is very old, maybe more likely if it is newer, but let's see if you can solve our feature matrix puzzle." ... A list of the things that only work on the Databricks platform would be very helpful. And also, kindly, is there a way I can contact Databricks legal or something so that the Databricks organization stops calling this stuff open source?
@dennyglee I've been actively opening issues/PRs over the past year, and I have some suggestions that might help improve the perception of Delta OSS. If you are interested, please send me a message on Slack.
I did get the ping on this today. I see there is no movement, and Delta is still advertised as an open source offering as part of the Databricks service offerings, which continues to not seem true to me given the details of this issue. I do not have any information that has not been made public to refute it, so this all seems factual here... I guess I see it as:
Is there any disagreement with these items? I saw a thumbs-up from maintainers on an earlier post, but I wanted to re-summarize it all for clarity.
@th0ma5w - thanks for the ping.
Delta is a Lakehouse storage system. Implementations should follow the Delta Lake transaction log protocol. There are many implementations like the Microsoft Fabric Lakehouse implementation.
This repo contains delta-spark, delta-kernel, and Delta Flink. Other repos contain the Rust implementation, the C# implementation, and the Dask connector. You should be able to fetch the supported protocol versions/table features from each connector (see the sketch below); if not, feel free to file an issue on the respective repo.
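For example, with the delta-rs Python bindings (the table path is illustrative), a minimal sketch of checking what a given table requires:

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/interop_table")

# Minimum reader/writer protocol versions (and, on newer protocols,
# table features) that the table requires.
print(dt.protocol())

# Table metadata: name, partition columns, configuration, etc.
print(dt.metadata())
```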
The word "customer" is ambiguous. There is Amazon EMR Delta Lake, Microsoft Fabric Delta Lake and Delta Lake BigQuery. So yes, if someone makes a Delta table with delta-rs and enables a table feature that's not yet supported by another implementation than a user may face issues.
Yea, this is possible, and we're trying to solve the issue with Delta Kernel. delta-rs recently started depending on delta-kernel-rs. A Delta Lake extension for Apache Druid is being built with Delta Kernel Java; see this PR. We just chatted with the open source community about updating delta-dotnet to depend on delta-kernel-rs.
I agree that the compatibility story needs to get better, and the devs have been working a lot on Delta Kernel Rust and Delta Kernel Java to solve this in a sustainable manner. All Lakehouse storage systems suffer from this issue. It's a really hard engineering problem. So we're trying and investing a ton of resources, but it's really hard. Kernel implementations need to be two-way, zero-dependency, and engine agnostic. Delta Kernel Rust needs a clean C FFI. There has been a lot of forward progress since your initial message. Look at the 15,000+ lines of Rust code in delta-kernel-rs and all the Delta Kernel Java code in this repo. Fully building out Delta Kernel and getting it integrated into the entire connector ecosystem is a lot of work. Delta is for all engines and all runtime environments. It seems like you're focusing on one engine and one runtime, but we're trying to solve a much broader problem.
Is there a document that states that Databricks is now using a 100% open source-compatible implementation? At the time of this ticket, and as best as I can google now, they say that their Delta offering is open source. My experience at the time of opening this ticket was that no public repo of code I could find, written in any language, was compatible with the files I was trying to read. Not being a direct customer of Databricks, I could only open a ticket here.
If Delta is open source, why couldn't the Databricks implementation simply be published? I guess my current status is that I cannot recommend Delta for anyone who has ever been a customer of Databricks, because there is simply no way to prove compatibility, or, for that matter, any clear incompatibility.
So I guess the best that can be said, then, is that work is underway to make the product open source? But I'm not sure why you replied; there is no new information, I guess?
Bug
Describe the problem
When trying to interact with Databricks-produced Delta objects, there is no open source-compatible version, and so this project is a de facto advertisement and vendor lock-in for the Databricks platform.
Steps to reproduce
Observed results
Expected results
Further details
This is all thoroughly and publicly documented in this project's docs and issues.
Environment information
On vs. off Databricks platform
Willingness to contribute
I would be willing to test.