
Support multiple Spark versions #3637

Closed

jackye1995 opened this issue Apr 2, 2025 · 6 comments · Fixed by lancedb/lance-spark#1

Comments

@jackye1995 (Collaborator)

Spark 4.0 is in preview and will be out soon. So far, many projects have taken a multi-version stance to keep supporting Spark 3.x releases because of its backwards-incompatible changes; Iceberg, Gravitino, and Hudi are examples.

Currently the Lance Spark connector only supports a single version, and it feels like we should move to a similar multi-version model to prepare for 4.0.

However, that might also mean we need to switch build systems, because at the moment I don't see a good way to do a selective versioned build through Maven; only Gradle supports this, as in the Iceberg and Gravitino examples. For Maven-based projects like Hudi, it seems the user either has to manually build a specific version, or all supported Spark versions get built, which is not really ideal. But I am less familiar with Maven than Gradle, so let me know if there is a good way to support that natively in Maven.

This came up while I was doing #3636; if we agree on the versioning strategy, I can make the corresponding changes during the port process.

Thoughts? @yanghua @SaintBacchus

@yanghua (Collaborator) commented Apr 3, 2025

+1 to support multiple versions

> selective versioned build

What's the motivation for this idea? To speed up the build?

@jackye1995 (Collaborator, Author)

Yes, speeding up the build is one benefit. And when we did this in Iceberg, the main benefit was to let all developers know which version is the target for new features, because in the IDE only one version's module would resolve. That avoids a few issues we encountered when we first started doing multi-version support, where people frequently added new features to older versions by mistake, and it was hard to track which version had which features patched. A selective versioned build does not 100% solve these issues, but it definitely improved the situation a lot.

@yanghua (Collaborator) commented Apr 3, 2025

To speed up the build, suppose we have this project structure layout:

parent-project
├── spark-base
├── spark-3.2
├── spark-3.5
└── spark-4.0
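
In Maven terms, this layout maps to an aggregator (parent) pom that lists each directory as a sub-module. A minimal sketch, assuming illustrative coordinates (com.lancedb:lance-spark-parent) and a default Spark version that the versioned modules could override:

<!-- Minimal aggregator pom.xml sketch; coordinates and versions are assumptions, not the actual project setup -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.lancedb</groupId>
  <artifactId>lance-spark-parent</artifactId>
  <version>0.1.0-SNAPSHOT</version>
  <packaging>pom</packaging>

  <properties>
    <!-- default Spark version used by spark-base; spark-${version} modules can override it -->
    <spark.version>3.5.1</spark.version>
  </properties>

  <!-- each entry is a sub-directory from the layout above -->
  <modules>
    <module>spark-base</module>
    <module>spark-3.2</module>
    <module>spark-3.5</module>
    <module>spark-4.0</module>
  </modules>
</project>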

We can use Maven's --projects (-pl) option to choose a subset of modules, e.g.:

mvn clean install -pl spark-base,spark-4.0

If you also want to build the modules that the selected ones depend on, add --also-make (-am):

mvn clean install -pl spark-base,spark-4.0 -am

Also, if you want to build the parent module at the same time, run this:

mvn clean install -pl spark-base,spark-4.0 -am -N

Does this feature match your requirements?

> the main benefit was to let all developers know which version is the target for new features, because in the IDE only one version's module would resolve

What does this mean? Suppose we use Maven to manage the above project layout.

  • one Spark version per directory, each mapping to one Maven sub-module;
  • git to track the changes

Can it solve your problem?

@jackye1995 (Collaborator, Author) commented Apr 3, 2025

I see. In your example, which version does the spark-base module compile against? For example, if the base is compiled against 3.5, it seems like 3.2 and 4.0 will need to depend on the same artifact? In that case, how can we know that when the base is used with 3.2, things are always binary compatible end to end?

I guess if we need to check that, we would do something like the following to rebuild spark-base against 3.2?

mvn clean install -Dspark.version=3.2.1 -pl spark-base,spark-3.2 -am

This is probably how CI should be configured to test each version.
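
One way to make that version matrix explicit on the Maven side would be a per-version profile that only pins spark.version, so each CI job activates a profile and selects the matching modules with -pl. This is just a hypothetical sketch; profile ids and version numbers are illustrative, not the project's actual configuration:

<!-- Hypothetical profiles in the parent pom.xml; e.g. a CI job could run:
     mvn clean install -Pspark-3.2 -pl spark-base,spark-3.2 -am -->
<profiles>
  <profile>
    <id>spark-3.2</id>
    <properties>
      <spark.version>3.2.4</spark.version>
    </properties>
  </profile>
  <profile>
    <id>spark-3.5</id>
    <properties>
      <spark.version>3.5.1</spark.version>
    </properties>
  </profile>
  <profile>
    <id>spark-4.0</id>
    <properties>
      <spark.version>4.0.0-preview2</spark.version>
    </properties>
  </profile>
</profiles>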

@yanghua (Collaborator) commented Apr 3, 2025

> In your example, which version does the spark-base module compile against?

Good question.

I think the Spark version in spark-base could be a stable version widely used in the industry (of course, this sounds a little subjective). And each spark-${version} module can override the Spark version via a Maven property, just like in your solution.
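
To illustrate that override, a minimal sketch of what a spark-3.2 module pom could look like (artifact coordinates, the Scala suffix, and version numbers are assumptions, not the project's actual configuration):

<!-- Hypothetical spark-3.2/pom.xml: overrides the parent's default spark.version and reuses spark-base -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.lancedb</groupId>
    <artifactId>lance-spark-parent</artifactId>
    <version>0.1.0-SNAPSHOT</version>
  </parent>
  <artifactId>spark-3.2</artifactId>

  <properties>
    <!-- override the default Spark version defined in the parent -->
    <spark.version>3.2.4</spark.version>
  </properties>

  <dependencies>
    <!-- common abstractions shared by all spark-${version} modules -->
    <dependency>
      <groupId>com.lancedb</groupId>
      <artifactId>spark-base</artifactId>
      <version>${project.version}</version>
    </dependency>
    <!-- Spark itself is provided by the runtime cluster -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>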

The spark-base module's goal is to extract and define the basic common abstractions for the Lance project; these abstractions can be reused in each spark-${version} module.

There is a similar reference in flink-connector-elasticsearch: in its base module, it chooses ES 7.10.x.

@jackye1995 (Collaborator, Author)

Okay, sounds good. I think I get the general idea here; I will reorganize the codebase, publish it to lance-spark, and ping you for review!

jackye1995 added a commit to lancedb/lance-spark that referenced this issue Apr 4, 2025
Port Spark connector from
https://github.com/lancedb/lance/tree/main/java/spark to the new
repository.

Refactor the codebase structure and fix a few methods in LanceArrowUtils
to support both Spark 3.4 and 3.5.

Closes: lancedb/lance#3636
Closes: lancedb/lance#3637