docs: simplify datathons guide and remove outdated resources
The lengthy checklist and resources sections were removed in favor of a concise overview focused on key benefits. Original content was moved to the datathon-kit repo.
1 parent e3fe992 · commit 3be37ee
Showing 1 changed file with 3 additions and 73 deletions.
@@ -1,75 +1,5 @@
# Datathons

## Checklist

1. Learn more about the problem. Search for similar Kaggle competitions. Check the task in [Papers with Code](https://paperswithcode.com/). Check the [Machine Learning subreddit](https://www.reddit.com/r/MachineLearning) for similar problems.
2. Do a basic data exploration. Try to understand the problem and get a sense of what might be important.
3. Get a baseline model working. You can also rely on tools like [TableVectorizer](https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html#tablevectorizer).
4. Design an evaluation method as close as possible to the final evaluation. Plot local evaluation metrics against the public ones (correlation) to validate how well your validation strategy works.
5. Try different approaches for preprocessing (encodings, Deep Feature Synthesis, lags, aggregations, imputers, target/count encoding, ...). If you're working as a group, split preprocessing and feature generation between files.
6. Plot learning curves ([sklearn](https://scikit-learn.org/stable/modules/learning_curve.html) or [external tools](https://github.com/reiinakano/scikit-plot)) to avoid overfitting.
7. Plot the real and predicted target distributions to see how well your model captures the underlying distribution. Apply any postprocessing that might fix small things.
8. Tune hyperparameters once you've settled on a specific approach ([hyperopt](https://github.com/hyperopt/hyperopt), [optuna](https://optuna.readthedocs.io/)).
9. Plot and visualize the predictions (target vs. predicted errors, histograms, random predictions, ...) to make sure they behave as expected. Explain the predictions with [SHAP](https://github.com/slundberg/shap).
10. Think about what postprocessing heuristics (clipping, mapping to another distribution, ...) could improve or correct predictions.
11. [Stack](https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html) classifiers ([example](https://www.kaggle.com/couyang/featuretools-sklearn-pipeline#ML-Pipeline)).
12. Try AutoML models.
    - Tabular: [AutoGluon](https://auto.gluon.ai/), [AutoSklearn](https://github.com/automl/auto-sklearn), Google AI Platform, [PyCaret](https://github.com/pycaret/pycaret), [Fast.ai](https://docs.fast.ai/).
    - Time Series: [AtsPy](https://github.com/firmai/atspy), [DeepAR](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-deeparplus.html), [Nixtla's NBEATS](https://nixtlaverse.nixtla.io/neuralforecast/models.nbeats.html), [AutoTS](https://github.com/winedarksea/AutoTS).

## Preprocessing Resources

- [Feature Engineering Library](https://feature-engine.trainindata.com/).
- [Feature Engineering Ideas](https://github.com/aikho/awesome-feature-engineering).
- [Deep Feature Synthesis](https://featuretools.alteryx.com/en/stable/getting_started/afe.html). [Simple tutorial](https://www.kaggle.com/willkoehrsen/automated-feature-engineering-basics).
- [Modern Feature Engineering Ideas](https://www.kaggle.com/c/playground-series-s4e12/discussion/554328) ([code](https://www.kaggle.com/code/cdeotte/first-place-single-model-cv-1-016-lb-1-016)).
- [Target Encoding](https://www.kaggle.com/competitions/playground-series-s4e12/discussion/554328) (with cross-validation to avoid leakage). [Data leakage is a common problem in Target Encoding](https://www.geeksforgeeks.org/target-encoding-using-nested-cv-in-sklearn-pipeline/#the-challenge-of-data-leakage-nested-crossvalidation-cv)!
- Forward Feature Selection.
- [Hillclimbing](https://www.kaggle.com/competitions/playground-series-s3e14/discussion/410639).
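The leakage-safe target encoding mentioned above can be sketched by hand: each row is encoded with category means computed only on the *other* cross-validation folds, so a row's own target never leaks into its feature. The `oof_target_encode` helper and the toy column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat: pd.Series, y: pd.Series, n_splits: int = 5) -> pd.Series:
    """Out-of-fold target encoding: each row gets its category's mean target,
    computed on the other folds only, to avoid leakage."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    global_mean = y.mean()  # fallback for categories unseen in a fold
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(cat):
        # Category -> mean target, computed on the training folds only.
        fold_means = y.iloc[train_idx].groupby(cat.iloc[train_idx].to_numpy()).mean()
        encoded.iloc[val_idx] = (
            cat.iloc[val_idx].map(fold_means).fillna(global_mean).to_numpy()
        )
    return encoded

# Hypothetical toy data: encode "city" against a binary target.
df = pd.DataFrame({"city": ["a", "a", "b", "b", "a", "c"] * 5,
                   "target": [1, 0, 1, 1, 0, 1] * 5})
df["city_te"] = oof_target_encode(df["city"], df["target"])
```

At inference time you would instead map test rows through means computed on the full training set; the out-of-fold scheme is only needed for the training rows themselves.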

## Exploratory Data Analysis Resources

- [HiPlot](https://facebookresearch.github.io/hiplot/)

### Scikit-Learn Compatible Transformers

- [LEGO](https://github.com/koaning/scikit-lego)
- [Skrub](https://github.com/skrub-data/skrub)
- [Skoot](https://github.com/tgsmith61591/skoot)
- [Sktools](https://github.com/david26694/sktools)
- [Scikit-Learn Related Projects](https://scikit-learn.org/stable/related_projects.html).
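These libraries all implement the same fit/transform contract, which means your own preprocessing can slot into a `Pipeline` alongside theirs. A minimal sketch of that contract, assuming a made-up `ClipOutliers` transformer (not from any of the libraries above):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Clip each column to quantile bounds learned during fit."""

    def __init__(self, low: float = 0.01, high: float = 0.99):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Trailing underscores mark attributes learned from data (sklearn convention).
        self.low_, self.high_ = np.quantile(X, [self.low, self.high], axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)

# The custom transformer composes with built-in ones like any other step.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
pipe = make_pipeline(ClipOutliers(), StandardScaler())
Xt = pipe.fit_transform(X)
```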

### Other Compatible Tools

- [Contributions repository](https://github.com/scikit-learn-contrib)
- [Awesome Scikit-Learn](https://github.com/fkromer/awesome-scikit-learn)

### Polars

- [Modern Polars](https://kevinheavey.github.io/modern-polars/)
- [Polars: The Definitive Guide](https://github.com/jeroenjanssens/python-polars-the-definitive-guide)

## Time Series Resources

- [Quick Tutorials](https://www.kaggle.com/c/jane-street-market-prediction/discussion/198951)
- [Tsfresh](https://tsfresh.readthedocs.io/en/latest/)
- [Fold](https://github.com/dream-faster/fold)
- [Neural Prophet](https://neuralprophet.com/) or [TimesFM](https://github.com/google-research/timesfm)
- [Darts](https://github.com/unit8co/darts)
- [Functime](https://docs.functime.ai/)
- [Pytimetk](https://github.com/business-science/pytimetk)
- [Sktime](https://github.com/alan-turing-institute/sktime) / [Aeon](https://github.com/aeon-toolkit/aeon)
- [Awesome Collection](https://github.com/MaxBenChrist/awesome_time_series_in_python)
- [Video with great ideas](https://www.youtube.com/watch?v=9QtL7m3YS9I)
- [Tutorial Kaggle Notebook](https://www.kaggle.com/code/tumpanjawat/s3e19-course-eda-fe-lightgbm)
- Think about adding external datasets like [related Google Trends searches](https://trends.google.com/trends/), PyPI package downloads, [Statista](https://www.statista.com/), weather, ...
- [TabPFN for time series](https://github.com/liam-sbhoo/tabpfn-time-series)
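Much of the tooling above ultimately builds features like the ones below. A minimal pandas sketch of lag and rolling-window features, with the rolling window shifted by one step so a row never sees its own (or a future) target value; the `add_lag_features` helper and column names are illustrative, not from any specific library:

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame, target: str,
                     lags=(1, 7), windows=(7,)) -> pd.DataFrame:
    """Add lag and rolling-mean columns for `target`."""
    out = df.copy()
    for lag in lags:
        out[f"{target}_lag_{lag}"] = out[target].shift(lag)
    for window in windows:
        # shift(1) before rolling so the current row's value is excluded.
        out[f"{target}_rollmean_{window}"] = out[target].shift(1).rolling(window).mean()
    return out

# Hypothetical toy series: 30 days of a counter-like target.
sales = pd.DataFrame({"y": range(30)})
features = add_lag_features(sales, "y")
```

The same shift-before-aggregate rule applies to any grouped version (per store, per product, ...), usually via `groupby(...)[target].shift(lag)`.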

## Datathon Platforms

- [Kaggle](https://www.kaggle.com/competitions)
- [MLContest](https://mlcontests.com/). They also share a "State of Competitive Machine Learning" report every year ([2023](https://mlcontests.com/state-of-competitive-machine-learning-2023)) and summaries of the state of the art for ["Tabular Data"](https://mlcontests.com/tabular-data/).
- [Humyn](https://app.humyn.ai/)
- [DrivenData](https://www.drivendata.org/competitions/)
- [Xeek](https://xeek.ai/challenges)
- [Cryptopond](https://cryptopond.xyz/)
- Datathons are a great way to learn new skills and get experience in a short amount of time.
- Doing datathons helps you keep up with the latest trends and technologies.
- Have a [Datathon Kit](https://github.com/davidgasquez/datathon-kit) ready to go to get started quickly.