Spark 3.4 support #705
Comments
Regarding Delta: As I understand it, they dropped support for using Delta without DSV2, which means Spark users will need to migrate to DSV2 Delta, but for us, it's no problem. All DeltaDSV2Spec tests are still passing.
Integration tests not working:
Unit tests not working:
Sorry, no :( We are completely buried with other work. Upgrading to Spark 3.4 is not a priority for our organisation at the moment. If somebody from the community is up for implementing it, we would be happy to accept pull requests and release an upgrade.
Some work is already done in the linked draft PR.
Unfortunately we can't give any ETA on this due to reprioritisation and team capacity. As Adam said, some work has started in the draft PR #739. It needs to be tested; it might require adding a separate bundle module for 3.4 and potentially other fixes. If you could help with it, that would be amazing. Any questions, please ask.
There were some questions about what needs to be done to support the new Spark version. It comes down to two things:
You can use #459 as inspiration for this.
Just a tip: we use https://github.com/wajda/jars-pommefizer to generate a
Hey there @wajda / @cerveada - not looking to make any promises here until I understand the full amount of work that might need to be done after going through the above comments. If we can get a full build running successfully based off of #739 via the command below, what else is left to do?

mvn clean test -Pspark3.4

Edit: To add some clarity here, does all the tests passing in this profile address @cerveada's concern?
I see
In addition to the above, would it be possible to point us in the right direction for the "BasicIntegrationTests: "saveAsTable and read.table" should "produce equal URIs"" test failures? I seem to have resolved the Kafka one already and am trying to get a start on what seems to be the larger issue. Thanks in advance!
By all tests, I meant all unit tests and also all tests in

We use TeamCity for CI; we can modify it ourselves when this is ready.
The test must validate that when you write to a data source and then read from it, the URI will be identical. To simulate this I do something like this:
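(The code snippet from the original comment did not survive here; the following is only a rough sketch of that write-then-read shape, with made-up table names. The real test obtains the URIs through Spline's lineage captor rather than asserting on them directly in the job.)

```scala
import org.apache.spark.sql.SparkSession

object UriEqualityScenario extends App {
  val spark = SparkSession.builder()
    .appName("uri-equality-scenario")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Original data "A": some source the pipeline starts from.
  val sourceA = Seq((1, "a"), (2, "b")).toDF("id", "name")

  // Write 1: create the "artificial" table "B" from A.
  // Spline emits a write event here; its destination URI is the URI of B.
  sourceA.write.mode("overwrite").saveAsTable("tableB")

  // Write 2: read B back and write it somewhere else.
  // Spline emits another event; the test asserts that the read URI in this
  // event equals the write URI captured for B above.
  spark.read.table("tableB").write.mode("overwrite").saveAsTable("tableB_copy")

  spark.stop()
}
```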
I think the issue there is that Spark will now give you the URI of the original data (A), not the artificial table (B) created from it. So the test must somehow be improved or modified to verify the same thing as before. Hope it makes sense; I don't remember the actual issue, but from what I wrote here before I think this is it.
@cerveada Thanks, this is helpful. WRT the

I'm still learning my way around the codebase, but I gathered I could find differences by running the test in 3.3 and then in 3.4, while printing out the logical plan in

Interestingly enough, they look pretty close to the same, with a few new fields added in 3.4.

Here comes the fun part - my print in

In 3.3 this makes sense - the test runs and creates two lineage events, since there were two writes. There are two

In 3.4 it gets weird - the same two events from above are there, but now two additional logical plans are created as well! Each write has an additional

The test is then failing because the lineage captor for the second write is actually getting the second event for the first write. If I ignore the second event (by calling another captor) it actually passes! I don't know if this is the right thing to do, given that Spline will be firing extra events. I ran into this same issue while fixing another test - it appears Spark is doing this for CTAS into both a regular Spark table and a Hive table.

Some great news:
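(A purely illustrative model of that off-by-one pairing, using a plain queue instead of Spline's actual captor; the event kinds and write IDs below are made up.)

```scala
import scala.collection.mutable

case class LineageEvent(writeId: Int, kind: String)

// What two writes appear to produce on Spark 3.4 (each write now yields an
// extra plan/event), versus one event per write on 3.3.
val emitted = mutable.Queue(
  LineageEvent(1, "command"), LineageEvent(1, "extra-plan"),
  LineageEvent(2, "command"), LineageEvent(2, "extra-plan"))

// Naive pairing: "the second captured event belongs to the second write".
val firstCaptured  = emitted.dequeue() // write 1 - as expected
val secondCaptured = emitted.dequeue() // still write 1! (the duplicate)
assert(secondCaptured.writeId == 1)    // so the captor for write 2 actually sees write 1's event
```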
Some ok news:
Going to look into the POM piece now - there is a PR #793 addressing what I've done thus far. Would appreciate a look to see if we are fine with this approach.
* spark agent #705 Spark 3.4 support
* Spark 3.4 regression & compatibility fixes
* Remove debugging log
* Add 3.4 bundle pom
* Add back in write to another topic and read from multiple
* Update integration-tests/src/test/scala/za/co/absa/spline/harvester/LineageHarvesterSpec.scala
* disable SparkUI in examples

Co-authored-by: Adam Cervenka <adam.cervenka@absa.africa>
Co-authored-by: Ryan Whitcomb <ryankwhitcomb@gmail.com>
There seem to be binary-incompatible changes in the APIs of Delta and Spark SQL that the Spline core, compiled against Spark 2.4, cannot work with. E.g. RDDPlugin
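(A generic sketch, not Spline's actual code, of a common way to cope with such binary incompatibilities: look the class up reflectively at runtime instead of linking against it, so a plugin can disable itself when the API it targets is missing. The Delta class name below is only an example.)

```scala
import scala.util.Try

object SparkCompat {
  // Returns the class only if it exists in the Spark/Delta version on the classpath.
  def maybeClass(name: String): Option[Class[_]] =
    Try(Class.forName(name)).toOption

  // Example: a plugin could skip registration when the command class it hooks
  // into is no longer present (or has changed shape) in the running version.
  val deltaWriteCommand: Option[Class[_]] =
    maybeClass("org.apache.spark.sql.delta.commands.WriteIntoDelta")
}
```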
Todo:
* agent-core
* Spark specific (#604)