High cardinality case tutorial #2

Merged
merged 4 commits into from
Sep 3, 2024


tsuruda-yoshito
Collaborator

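Created a tutorial for high-cardinality data

  • Created example/regression/high_cardinality_case_regression.ipynb
    • Covers what high cardinality is, the fact that CB (Cyclic Boosting) is strong on high-cardinality data and why, examples of high-cardinality cases, and so on
    • Runs regression on three datasets with different degrees of high cardinality and compares the results with LightGBM (LightGBM must be installed)
    • CB beats LightGBM on MAE on all datasets, and the gap widens as the degree of high cardinality increases
    • In the notebook, the data is loaded from the remote repository (following regression.ipynb)
      ⇒ The URL points to main of the Blue-Yonder-OSS/cyclic-boosting repository, so loading fails with an error until this PR is merged
    • Many interaction terms are configured for CB (please check)
  • Added three datasets with different degrees of high cardinality under tests/high_cardinality_data/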
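
For reference, the MAE comparison mentioned above comes down to the calculation sketched below. This is only an illustration in plain Python: the function name, the variable names, and the toy numbers are all hypothetical and are not results or code from the notebook.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy values (NOT taken from the notebook or its datasets).
y_true = [2.0, 0.0, 5.0, 1.0]
pred_model_a = [1.0, 0.0, 4.0, 1.0]  # stand-in for one model's predictions
pred_model_b = [3.0, 2.0, 2.0, 1.0]  # stand-in for the other model's predictions

mae_a = mean_absolute_error(y_true, pred_model_a)
mae_b = mean_absolute_error(y_true, pred_model_b)

# A lower MAE means the predictions are closer to the observed values on average.
better = "model_a" if mae_a < mae_b else "model_b"
print(f"MAE a={mae_a}, MAE b={mae_b}, better={better}")
```

Running the same calculation on each of the three datasets and comparing the gap between the two models is what "the gap widens with the degree of high cardinality" refers to.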

Collaborator

@setoguchi-naoki setoguchi-naoki left a comment

I left a comment.

"\n",
"The left histogram depicts the frequency distribution of sales values. The x-axis represents different sales values. The y-axis indicates how frequently each sales value occurs, with a logarithmic scale. The histogram shows that the majority of sales values are low, with a steep decrease in frequency as sales values increase. This type of distribution is common in sales data, where a few high sales values are less frequent compared to many low sales values.\n",
"\n",
"The right histogram displays the distribution of the number of records per product ID (P_ID). The x-axis shows the number of records, and the y-axis indicates the number of products having that specific number of records, also with a logarithmic scale. The histogram suggests that most product IDs have a lower number of records, and as the number of records increases, the frequency of such products decreases."
Collaborator

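The meaning of the first sentence felt a little unclear to me.
I think it would be better to state explicitly that what is being visualized is a count of unique values. At first glance, the word "distribution" reads as the frequency of values falling into a given bin.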

The right histogram displays the number of unique product IDs (P_ID) that have a specific number of records. The x-axis shows the number of records, and the y-axis indicates the number of products having that specific number of records, also with a logarithmic scale. This means that the y-axis counts sum to the total number of unique products. The histogram suggests that most product IDs have a lower number of records, and as the number of records increases, the frequency of such products decreases.
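
To make the distinction concrete, the two-step counting behind the right histogram can be sketched as follows. This is a minimal sketch on hypothetical toy data using only the standard library. The list `p_ids` and the variable names are assumptions for illustration; the notebook itself may compute this differently.

```python
from collections import Counter

# Hypothetical toy data: one entry per record, holding that record's product ID (P_ID).
p_ids = ["A", "A", "A", "B", "B", "C", "D", "D", "E"]

# Step 1: number of records for each unique product ID.
records_per_product = Counter(p_ids)

# Step 2: number of unique products that have exactly k records.
# These are the bar heights of the histogram (x = k records, y = number of products).
products_per_record_count = Counter(records_per_product.values())

# The bar heights sum to the number of unique products, not to the number of records.
total_products = sum(products_per_record_count.values())

for k in sorted(products_per_record_count):
    print(f"{k} record(s): {products_per_record_count[k]} product(s)")
print(f"unique products: {total_products}, records: {len(p_ids)}")
```

In this toy case the y-axis values add up to 5 unique products even though there are 9 records, which is the point the wording should make clear.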

@setoguchi-naoki setoguchi-naoki merged commit d1f32ed into main Sep 3, 2024
6 checks passed

2 participants