Output request: Delta table format #28

Open
Ayoub-28 opened this issue Mar 11, 2025 · 1 comment

Comments

@Ayoub-28

Hi guys,

Nice package! We have been experimenting with converting XML data and XSD schemas to PostgreSQL table schemas and inserting the data. However, we would really like to use the Delta table format (https://www.chaosgenius.io/blog/delta-table/) instead of PostgreSQL.

We are also curious if there are alternative packages that convert XSD & XML to delta table format.

Are there any plans to add this to the current library, or can you advise us on alternative libraries?

Thank you!

@martinv13
Collaborator

Hi,

Thanks for the feedback.

I am not familiar with this format, but I understand that some form of SQL can be used to query it? Currently this package is quite tightly integrated with SQLAlchemy, so if it is possible to connect through SQLAlchemy in some way, I suppose it could be achieved, but I doubt that it would be a common pattern.

Also, I understand that this kind of cloud database is designed with very large, denormalized flat tables in mind, as opposed to the rather complex relational models that xml2db can produce? Would that be a good fit, considering the joins involved in analysing the data?

Here is a list of improvement ideas that could help toward the goal of supporting more "modern" databases. However, I don't have much time to allocate to this project, and on our side we are sticking with a relational database for now, so we don't have a strong interest in this.

  1. Decouple the database interaction into a DB "Adapter" class that could handle the specifics of each backend and could even be provided by the user.
  2. Implement bulk load methods to insert the data into temporary tables (we do this on our side by patching the package with another load method, as the default SQLAlchemy insert is much slower and actually impractical for large data). The issue is that these methods would be backend-dependent.
  3. Implement the ability to denormalize the data further. Specifically, in the case of 1-n relationships, we could "elevate" the children, repeating the parent field values for each child row. This would allow flattening the schemas to a greater extent.
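
To make ideas 1 and 3 a bit more concrete, here is a rough sketch in plain Python. It is only an illustration and not xml2db code: `BackendAdapter`, `InMemoryAdapter`, `elevate_children`, and the sample `order` data are all hypothetical names made up for this example, and a real adapter would wrap the target backend's own bulk writer rather than a dict.

```python
from typing import Any, Iterable, Protocol

Row = dict[str, Any]


class BackendAdapter(Protocol):
    """Hypothetical interface: one method per backend-specific operation."""

    def bulk_load(self, table: str, rows: Iterable[Row]) -> int:
        """Load rows into `table` and return the number of rows written."""
        ...


class InMemoryAdapter:
    """Toy backend that stores rows in a dict, standing in for a real target."""

    def __init__(self) -> None:
        self.tables: dict[str, list[Row]] = {}

    def bulk_load(self, table: str, rows: Iterable[Row]) -> int:
        batch = list(rows)
        self.tables.setdefault(table, []).extend(batch)
        return len(batch)


def elevate_children(parent: Row, children_key: str, prefix: str) -> list[Row]:
    """Flatten a 1-n relationship: one row per child, parent fields repeated."""
    parent_fields = {k: v for k, v in parent.items() if k != children_key}
    children = parent.get(children_key) or []
    if not children:
        # Keep the parent even when it has no children (like a LEFT JOIN).
        return [parent_fields]
    return [
        {**parent_fields, **{f"{prefix}{k}": v for k, v in child.items()}}
        for child in children
    ]


# Made-up sample data: one parent record with two child rows.
order = {
    "order_id": 1,
    "customer": "ACME",
    "lines": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 5}],
}
flat_rows = elevate_children(order, children_key="lines", prefix="line_")
adapter: BackendAdapter = InMemoryAdapter()
loaded = adapter.bulk_load("orders_flat", flat_rows)
```

The point of the split is that the flattening step knows nothing about the backend, while everything backend-dependent (including a fast bulk load, as in idea 2) sits behind the adapter and could be swapped or supplied by the user.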

I cannot help much with a recommendation for another library, as the reason we developed this in the first place was the lack of alternatives (back then, at least).

Happy to discuss!
