feat: πŸ“ enhance data engineering and datathon guides with best practices
Added detailed notes on immutability benefits, improved feature engineering tips, and expanded ML resources including Polars references. Enhanced datathon exploration suggestions with target encoding caveats.
davidgasquez committed Jan 27, 2025
1 parent 0ebdf09 commit e3fe992
Showing 4 changed files with 14 additions and 7 deletions.
1 change: 1 addition & 0 deletions Data/Data Engineering.md
@@ -51,6 +51,7 @@ graph LR;
- Objects will be more thread-safe inside a program.
- Easier to reason about the flow of a program.
- Easier to debug and troubleshoot problems.
- [We need immutability to coordinate at a distance and we can afford immutability, as storage gets cheaper](https://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf).
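The coordination benefit above can be illustrated with a minimal sketch (the `Reading` record and values are hypothetical): frozen records can be shared freely, and "updates" derive new objects instead of mutating state in place.

```python
from dataclasses import dataclass, replace

# A frozen dataclass: instances cannot be mutated after creation, so they
# can be shared across threads or processes without defensive copies.
@dataclass(frozen=True)
class Reading:
    sensor: str
    value: float

r1 = Reading("temp-1", 21.5)
r2 = replace(r1, value=22.0)  # derive a new record; r1 is untouched

assert r1.value == 21.5 and r2.value == 22.0
```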

## Great Blog Posts

15 changes: 10 additions & 5 deletions Datathons.md
@@ -2,16 +2,16 @@

## Checklist

1. Learn more about the problem. Search for similar Kaggle competitions. Check the task in [Papers with Code](https://paperswithcode.com/).
1. Learn more about the problem. Search for similar Kaggle competitions. Check the task in [Papers with Code](https://paperswithcode.com/). Check the [Machine Learning subreddit](https://www.reddit.com/r/MachineLearning) for similar problems.
2. Do a basic data exploration. Try to understand the problem and gather a sense of what can be important.
3. Get baseline model working.
3. Get a baseline model working. You can also lean on tools like [TableVectorizer](https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html#tablevectorizer).
4. Design an evaluation method as close as possible to the final evaluation. Plot local evaluation metrics against the public ones (correlation) to validate how well your validation strategy works.
5. Try different approaches for preprocessing (encodings, Deep Feature Synthesis, lags, aggregations, imputers, ...). If you're working as a group, split preprocessing feature generation between files.
5. Try different approaches for preprocessing (encodings, Deep Feature Synthesis, lags, aggregations, imputers, target/count encoding, ...). If you're working as a group, split preprocessing and feature generation across files.
6. Plot learning curves ([sklearn](https://scikit-learn.org/stable/modules/learning_curve.html) or [external tools](https://github.com/reiinakano/scikit-plot)) to avoid overfitting.
7. Plot the real and predicted target distributions to see how well your model captures the underlying distribution. Apply any postprocessing that might fix small things.
8. Tune hyper-parameters once you've settled on a specific approach ([hyperopt](https://github.com/hyperopt/hyperopt), [optuna](https://optuna.readthedocs.io/)).
9. Plot and visualize the predictions (target vs predicted errors, histograms, random predictions, ...) to make sure they behave as expected. Explain the predictions with [SHAP](https://github.com/slundberg/shap).
10. Think about what postprocessing heuristics can be done to improve or correct predictions.
10. Think about what postprocessing heuristics (clipping, mapping to another distribution, ...) can be done to improve or correct predictions.
11. [Stack](https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html) classifiers ([example](https://www.kaggle.com/couyang/featuretools-sklearn-pipeline#ML-Pipeline)).
12. Try AutoML models.
- Tabular: [AutoGluon](https://auto.gluon.ai/), [AutoSklearn](https://github.com/automl/auto-sklearn), Google AI Platform, [PyCaret](https://github.com/pycaret/pycaret), [Fast.ai](https://docs.fast.ai/).
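Step 6 of the checklist can be sketched with scikit-learn's `learning_curve` (linked above); the dataset and model here are illustrative assumptions, not part of any particular competition:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Train/validation scores at increasing training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=3, train_sizes=np.linspace(0.2, 1.0, 4), scoring="accuracy",
)

# A persistent gap between the curves signals overfitting; curves that
# converge at a low score signal underfitting.
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
```

Plotting `train_sizes` against both mean scores (with `matplotlib` or scikit-plot, as the note suggests) makes the gap easy to eyeball.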
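Step 11 (stacking) can be sketched with scikit-learn's `StackingClassifier` (linked above); the base estimators and dataset are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,  # out-of-fold predictions feed the meta-learner, limiting leakage
)

score = cross_val_score(stack, X, y, cv=3).mean()
```

Diverse base models (tree ensembles plus linear models) usually stack better than several near-identical ones.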
@@ -23,7 +23,7 @@
- [Feature Engineering Ideas](https://github.com/aikho/awesome-feature-engineering).
- [Deep Feature Synthesis](https://featuretools.alteryx.com/en/stable/getting_started/afe.html). [Simple tutorial](https://www.kaggle.com/willkoehrsen/automated-feature-engineering-basics).
- [Modern Feature Engineering Ideas](https://www.kaggle.com/c/playground-series-s4e12/discussion/554328) ([code](https://www.kaggle.com/code/cdeotte/first-place-single-model-cv-1-016-lb-1-016)).
- [Target Encoding](https://www.kaggle.com/competitions/playground-series-s4e12/discussion/554328) (with cross-validation to avoid leakage).
- [Target Encoding](https://www.kaggle.com/competitions/playground-series-s4e12/discussion/554328) (with cross-validation to avoid leakage). [Data leakage is a common problem in Target Encoding](https://www.geeksforgeeks.org/target-encoding-using-nested-cv-in-sklearn-pipeline/#the-challenge-of-data-leakage-nested-crossvalidation-cv)!
- Forward Feature Selection.
- [Hillclimbing](https://www.kaggle.com/competitions/playground-series-s3e14/discussion/410639).

@@ -44,6 +44,11 @@
- [Contributions repository](https://github.com/scikit-learn-contrib)
- [Awesome Scikit-Learn](https://github.com/fkromer/awesome-scikit-learn)

### Polars

- [Modern Polars](https://kevinheavey.github.io/modern-polars/)
- [Polars The Definitive Guide](https://github.com/jeroenjanssens/python-polars-the-definitive-guide)

## Time Series Resources

- [Quick Tutorials](https://www.kaggle.com/c/jane-street-market-prediction/discussion/198951)
4 changes: 2 additions & 2 deletions Open Data.md
@@ -81,7 +81,7 @@ That forces you to keep up on the quality and freshness.
- Adapters are created by the community so data becomes connected.
- Having better data will help create better and more accessible AI models ([people are working on this](https://github.com/togethercomputer/OpenDataHub)).
- Integrate with the modern data stack to avoid reinventing the wheel and increase surface of the required skill sets.
- Decentralized the computation (where data lives) and then cache inmutable and static copies of the results (or aggregations) in CDNs (IPFS, R2, Torrent). Most end user queries require only reading a small amount of data!
- Decentralized the computation (where data lives) and then cache immutable and static copies of the results (or aggregations) in CDNs (IPFS, R2, Torrent). Most end user queries require only reading a small amount of data!
- [Other Principles from the Indie Web](https://indieweb.org/principles) like have fun!

## Modules
@@ -123,7 +123,7 @@ Package managers have been hailed among the most important innovations Linux brought
- Tabular data could be partitioned to make it easier for future retrieval.
- **Immutability**. Never remove historical data. Data should be append only.
- Many public data sources issue restatements or revisions. The protocol should be able to handle this.
- [Higher resolution is more valuable than lower resolution](https://www.linkedin.com/pulse/re-framing-open-data-john-weigelt/). Publish inmutable data and then compute the lower resolution data from it.
- [Higher resolution is more valuable than lower resolution](https://www.linkedin.com/pulse/re-framing-open-data-john-weigelt/). Publish immutable data and then compute the lower resolution data from it.
- Similar to how `git` deals with it. You _could_ force the deletion of something in case that's needed, but that's not the default behavior.
- **Flexible**. Allow arbitrary backends. Both centralized ([S3](https://twitter.com/quiltdata/status/1569447878212591618), GCS, ...) and decentralized (IPFS, Hypercore, Torrent, ...) layers.
- As agnostic as possible, supporting many types of data; tables, geospatial, images, ...
1 change: 1 addition & 0 deletions Programming.md
@@ -59,6 +59,7 @@ A programmer should know [lots](http://programmer.97things.oreilly.com/wiki/inde
- **Treat all the data as an [append only event log](https://www.youtube.com/watch?v=ZQ-MdKj3BjU)**.
- Use a central log where consumers can subscribe to the relevant events.
- Having a central place ([the log](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)) for continuous events makes it easy to create a stream of data to process and establishes a source of truth.
- A [log improves coordination in distributed systems](https://restate.dev/blog/every-system-is-a-log-avoiding-coordination-in-distributed-applications/).
- **There is no silver bullet**.
- Accept that many programming decisions are opinions.
- Make the trade-offs explicit when making judgments and decisions. With almost every decision you make, you're either deliberately or accidentally trading off one thing for another thing.
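The append-only log idea above can be sketched in a few lines (the `EventLog` class and event shapes are hypothetical, in-memory stand-ins for a real log like Kafka): events are only ever appended, consumers subscribe to new events, and replaying the full history rebuilds any derived state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List

@dataclass
class EventLog:
    """Minimal in-memory append-only event log with subscribers."""
    _events: List[Dict] = field(default_factory=list)
    _subscribers: List[Callable[[Dict], None]] = field(default_factory=list)

    def subscribe(self, handler: Callable[[Dict], None]) -> None:
        self._subscribers.append(handler)

    def append(self, event: Dict) -> None:
        # Events are only ever appended, never updated or deleted.
        self._events.append(event)
        for handler in self._subscribers:
            handler(event)

    def replay(self) -> Iterator[Dict]:
        # The full history is the source of truth: any consumer can
        # rebuild its state by replaying from the beginning.
        return iter(self._events)

log = EventLog()
seen = []
log.subscribe(seen.append)
log.append({"type": "user_created", "id": 1})
log.append({"type": "user_renamed", "id": 1, "name": "Ada"})

assert len(list(log.replay())) == 2 and seen[0]["type"] == "user_created"
```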
