
Performance issues #13

Open
antl3x opened this issue Jun 9, 2022 · 3 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments


antl3x commented Jun 9, 2022

Hey @djcunningham0. First, congrats on the amazing repo/project!

I'm opening this issue to discuss some improvements to the process_data algorithm.

I ran a dataset of ~70k rows (matches) and it takes >140 minutes to finish.

Maybe we can make some changes to speed things up? Maybe use Numba?

Insights:

https://python.plainenglish.io/a-solution-to-boost-python-speed-1000x-times-c9e7d5be2f40

https://towardsdatascience.com/how-to-make-your-pandas-operation-100x-faster-81ebcd09265c
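
Before deciding on a fix, it might help to profile where process_data actually spends its time. Here's a minimal sketch using the standard-library cProfile; the CSV file name and slice size are placeholders, and it assumes the Tracker.process_data interface shown in the demo notebook:

```python
import cProfile
import pstats

import pandas as pd
from multielo import Tracker  # assumes the Tracker API from the demo notebook

# Hypothetical match history: one row per match, loaded from a placeholder file.
df = pd.read_csv("matches.csv")

tracker = Tracker()
profiler = cProfile.Profile()
profiler.enable()
tracker.process_data(df.head(5_000))  # profile a slice rather than the full dataset
profiler.disable()

# Show the 15 functions with the most cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```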

djcunningham0 added the enhancement and help wanted labels on Jul 13, 2022
djcunningham0 (Owner) commented

This is a fair point. I didn't write the code with huge datasets like that in mind, so I'm sure there are some performance gains to be had. Maybe Numba is part of the solution; I hadn't heard of it, but it sounds interesting.

The best way to improve performance would be to find a way to parallelize the computations, but unfortunately I don't see an obvious way of doing that. Each calculation is potentially affected by the one that came before it, so I'm not sure how you'd identify which computations could be run in parallel.
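
To make that dependency concrete, here is a stripped-down two-player sketch (not the library's actual code, and the match data is made up): every match reads the ratings produced by earlier matches involving the same players, so match t+1 can't be computed until match t has finished.

```python
def elo_update(r_a, r_b, score_a, k=32, d=400):
    """Standard two-player Elo update; returns the new ratings."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / d))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Made-up match history, already in chronological order (score_a = 1 means A won).
matches = [("alice", "bob", 1), ("bob", "carol", 0), ("alice", "carol", 1)]

ratings = {}  # player -> current rating
for player_a, player_b, score_a in matches:
    r_a = ratings.get(player_a, 1000.0)
    r_b = ratings.get(player_b, 1000.0)
    # These inputs depend on every earlier match involving the same players,
    # which is why the iterations can't simply be split across workers.
    ratings[player_a], ratings[player_b] = elo_update(r_a, r_b, score_a)

print(ratings)
```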

Anyway, if I get some time to work on this I'll check out Numba and maybe test out some code changes. If you do any experimenting, feel free to document it here or open a PR.

djcunningham0 (Owner) commented

By the way, this doesn't exactly address the performance issues but there is an option for batch processing that can help with large datasets in some cases. Basically, process data as it comes in and save the results, then read in the saved results and process only the new data the next time it comes in. This way you don't have to process all data from the beginning each time. See the "saving and loading ratings / batch processing" section of the demo notebook for details.
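
For anyone who wants that pattern in code, here is a rough sketch. The file names and column layout are made up, and the exact way to re-initialize a Tracker from saved ratings is covered in the demo notebook, so the `data=` argument below is an assumption rather than a confirmed signature:

```python
import pandas as pd
from multielo import Tracker

# --- first run: process everything seen so far and save the resulting ratings ---
history = pd.read_csv("matches_through_june.csv")   # hypothetical file
tracker = Tracker()
tracker.process_data(history)
tracker.get_current_ratings().to_csv("ratings.csv", index=False)

# --- later run: load the saved ratings and process only the new matches ---
saved = pd.read_csv("ratings.csv")
tracker = Tracker(data=saved)   # assumption: see the demo notebook for the exact argument
new_matches = pd.read_csv("matches_july.csv")        # only rows not processed before
tracker.process_data(new_matches)
tracker.get_current_ratings().to_csv("ratings.csv", index=False)
```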

Just wanted to make this note here in case it's helpful to anyone who comes across this thread.

saleemakhtar2 commented

Numba and strict use of NumPy can definitely improve performance. I have a MultiElo setup that satisfies a lot of the same requirements (zero-sum, etc.) and can run >300k matches in under 10 seconds, so I'm surprised to hear this version takes >140 minutes!
Using Polars instead of pandas can also be a big speed improvement. When I used pandas it would take ~20 minutes; swapping to Polars was ~10 minutes, NumPy only was ~15 seconds, and swapping to Numba was ~10 seconds. I can help with any of this if required. I'd open-source my version, but it's very entwined with the rest of my project.
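
For illustration, the kind of rewrite described above usually ends up looking something like the sketch below: a plain NumPy loop compiled with Numba's @njit so the per-match Python overhead disappears. This is a generic two-player sketch with made-up, integer-encoded data, not code from my project or from this library:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def run_elo(player_a, player_b, score_a, n_players, k=32.0, d=400.0, initial=1000.0):
    """Sequential Elo over integer-encoded matches; the loop is compiled by Numba."""
    ratings = np.full(n_players, initial)
    for i in range(player_a.shape[0]):
        a, b = player_a[i], player_b[i]
        expected_a = 1.0 / (1.0 + 10.0 ** ((ratings[b] - ratings[a]) / d))
        delta = k * (score_a[i] - expected_a)
        ratings[a] += delta
        ratings[b] -= delta
    return ratings

# Made-up data: 300k matches between 1,000 players encoded as integer ids.
rng = np.random.default_rng(0)
a = rng.integers(0, 1000, 300_000)
b = rng.integers(0, 1000, 300_000)
s = rng.integers(0, 2, 300_000).astype(np.float64)  # 1.0 = player A won
final_ratings = run_elo(a, b, s, n_players=1000)
```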
