Serialize to and deserialize from Apache Arrow format #1

Open
ghost opened this issue May 24, 2018 · 12 comments

Comments

@ghost

ghost commented May 24, 2018

I am using Arrow, which uses FlatBuffers internally, and they are very fast.

I would be interested in extending QFrame to work with FlatBuffers.

There is also a schemaless variant of FlatBuffers called FlexBuffers, which does not enforce a schema. I expect this is what you want to use for QFrame.

@tobgu
Owner

tobgu commented May 24, 2018

Cool, I'd be very happy to take contributions in this area! I'll be happy to discuss this further with you, answer any questions about the current implementation and/or review PRs.

@ghost
Author

ghost commented May 27, 2018

Thanks!
Well, there is a great FlatBuffers library called gotables.
This is worth considering, and Arrow is much later, I feel.

Check it out, have a play, and think about how it relates to QFrame.

https://github.com/urban-wombat

I plan to work up more stuff with gotables in the urban-wombat repos.

I'm just totally out of time right now.
The reasons are speed, speed, speed.
Also, FlatBuffers can serve as both a fast database and a fast network transport, the two core things every architecture needs. Using the same format for the db and for network serialisation means far less code and even higher speed.

Anyway, I am very curious how it can fit with QFrame, as immutability is really important.

@tobgu
Owner

tobgu commented Jun 18, 2018

I took some time to check out gotables, FlatBuffers and how they relate to Arrow. As you mention, Arrow uses FlatBuffers for the metadata, which seems nice.
I don't really understand what you mean when you say that Arrow comes much later. Even if you use FlatBuffers for the actual data serialization, wouldn't you have to come up with the schema/format of the data you want to store? Do you mean that a custom data format (based on gotables, for example) should be used initially?

Wouldn't it make sense to adopt the Arrow schema from the start and use that as the "native" serialization schema for QFrame? While browsing the Arrow data layout docs, it seemed to me that a lot of the data could be used with zero copying when "deserializing", given the current internal data formats in QFrame columns. Where that is currently not the case, adjustments to the internal format may be possible to allow it.
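
To make the layout comparison concrete, here is a minimal sketch of the kind of buffers the Arrow columnar format describes: a nullable fixed-width column as a values buffer plus a validity bitmap (least significant bit first), and a string column as an offsets buffer plus one contiguous data buffer. The type and function names (`Int32Column`, `StringColumn` and their constructors) are hypothetical, written only for illustration. They are not QFrame's internal types and not the Arrow Go API.

```go
package main

import "fmt"

// Int32Column is a hypothetical nullable integer column laid out in the
// Arrow style: one contiguous values buffer and one validity bitmap.
type Int32Column struct {
	Values   []int32 // one fixed-width slot per row, including null rows
	Validity []byte  // bit i (LSB first) is set when row i is non-null
}

// NewInt32Column builds the two buffers from values and a parallel null mask.
func NewInt32Column(values []int32, isNull []bool) Int32Column {
	validity := make([]byte, (len(values)+7)/8)
	for i := range values {
		if !isNull[i] {
			validity[i/8] |= 1 << uint(i%8)
		}
	}
	return Int32Column{Values: values, Validity: validity}
}

// IsValid reports whether row i holds a value (as opposed to null).
func (c Int32Column) IsValid(i int) bool {
	return c.Validity[i/8]&(1<<uint(i%8)) != 0
}

// StringColumn is a hypothetical variable-width column: row i occupies
// Data[Offsets[i]:Offsets[i+1]], so Offsets has one more entry than rows.
type StringColumn struct {
	Offsets []int32
	Data    []byte
}

// NewStringColumn concatenates all strings into a single data buffer.
func NewStringColumn(values []string) StringColumn {
	offsets := make([]int32, 1, len(values)+1) // first offset is always 0
	var data []byte
	for _, v := range values {
		data = append(data, v...)
		offsets = append(offsets, int32(len(data)))
	}
	return StringColumn{Offsets: offsets, Data: data}
}

// Value returns row i. Converting to string copies the bytes; a zero-copy
// reader would hand out the sub-slice of Data instead.
func (c StringColumn) Value(i int) string {
	return string(c.Data[c.Offsets[i]:c.Offsets[i+1]])
}

func main() {
	ints := NewInt32Column([]int32{10, 0, 30}, []bool{false, true, false})
	fmt.Println(ints.IsValid(0), ints.IsValid(1), ints.IsValid(2))

	strs := NewStringColumn([]string{"a", "", "xyz"})
	fmt.Println(strs.Offsets, strs.Value(2))
}
```

If a column already keeps its data in buffers shaped like these, "deserializing" can amount to pointing slices at the received bytes rather than rebuilding values row by row.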

@ghost
Author

ghost commented Jun 19, 2018

Agree that the Arrow schema makes sense. I feel out of my depth on Arrow here.
I have not dug into it enough to even comment.

Also, the InfluxDB startup donated the Go code, by the way. It's up in the air whether it will be maintained. It has not been touched in ages.

@tobgu
Owner

tobgu commented Jun 19, 2018

Yes, I also noticed the work on Arrow from Influx when it was first released and was very excited. I've also noticed that not much has happened since then. I hope they will pick it up again!

@ghost
Author

ghost commented Jun 20, 2018 via email

@tobgu
Owner

tobgu commented Jun 20, 2018

I think I'll start experimenting with the Arrow format for fast serialization and deserialization of QFrames, without waiting for the official repo, to see how far the current internal representation is from the Arrow format. I'm already in need of an efficient binary format for that, so why not choose Arrow?
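
As a rough illustration of what a binary round trip for a single column could look like, here is a small sketch using only the Go standard library. It writes a row count followed by the raw little-endian values. This is an assumed toy format, not the Arrow IPC format, and `encodeFloatColumn` and `decodeFloatColumn` are hypothetical helpers rather than QFrame functions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeFloatColumn writes a uint64 row count followed by the values as
// raw little-endian float64s (8 bytes per row).
func encodeFloatColumn(values []float64) ([]byte, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, uint64(len(values))); err != nil {
		return nil, err
	}
	if err := binary.Write(&buf, binary.LittleEndian, values); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeFloatColumn reverses encodeFloatColumn and rejects inputs whose
// declared row count does not fit in the remaining bytes.
func decodeFloatColumn(data []byte) ([]float64, error) {
	r := bytes.NewReader(data)
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return nil, err
	}
	if n > uint64(r.Len())/8 {
		return nil, fmt.Errorf("column claims %d rows but only %d bytes remain", n, r.Len())
	}
	values := make([]float64, n)
	if err := binary.Read(r, binary.LittleEndian, values); err != nil {
		return nil, err
	}
	return values, nil
}

func main() {
	encoded, err := encodeFloatColumn([]float64{1.5, -2, 3.25})
	if err != nil {
		panic(err)
	}
	decoded, err := decodeFloatColumn(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(encoded), decoded)
}
```

A real Arrow-based format would add schema metadata, validity bitmaps and buffer alignment on top of this, but the core idea of storing each column as one contiguous buffer is the same.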

If that repo starts moving again, it may make sense to align the internal representation with Arrow entirely, since that would give access to some AVX2-optimized aggregations, etc. that they seem to be developing.

I'll change the title of this ticket a bit to narrow the focus to serialization and deserialization for now though.

@tobgu tobgu changed the title from "Arrow" to "Serialize to and deserialize from Apache Arrow format" on Jun 20, 2018

@ghost
Author

ghost commented Jul 20, 2018

Sorry about the one-month delay. Using the Arrow format sounds like a good approach.
Has Influx or anyone else touched the Go implementation at all, though?

https://github.com/apache/arrow/tree/master/go/arrow

Nope... hmm.

It seems that sbinet is the maintainer of the Go Arrow code?
https://github.com/apache/arrow/commits?author=sbinet

Might want to chat with him. He works at CERN, I think?

@sbinet
Contributor

sbinet commented Aug 8, 2018

I've started to work on providing support for List arrays:

Feel free to have a look at that and comment/improve :)

(PS: I work for IN2P3/CNRS, kind of the French equivalent of NSF/DOE, and I do work for some experiments based at CERN, but I am not a CERN employee per se.)

@sbinet
Contributor

sbinet commented Aug 9, 2018

and now the PR for Struct arrays:

@tobgu
Owner

tobgu commented Aug 9, 2018

Cool @sbinet, great to see the Arrow initiative for Go moving again!

@ghost
Author

ghost commented Sep 19, 2018

Wow, guys, this is great.
QFrame with Arrow removes a mountain of hoops to jump through.

Many thanks, I will play around with this.
If anyone has a project using these bits together, please add the link.
