From 1447cd3aec4c7aacd28c703c4b601dcd018b56d7 Mon Sep 17 00:00:00 2001 From: Jonas Mueller <1390638+jwmueller@users.noreply.github.com> Date: Wed, 1 Jan 2025 20:36:21 -0800 Subject: [PATCH] polish fraud detection example --- 1021_fintech_documentation/Requirement.txt | 4 - README.md | 1 + .../fintech_creditcard_fraud.ipynb | 630 +++++++----------- 3 files changed, 239 insertions(+), 396 deletions(-) delete mode 100644 1021_fintech_documentation/Requirement.txt rename 1021_fintech_documentation/Final.ipynb => fraud_detection/fintech_creditcard_fraud.ipynb (88%) diff --git a/1021_fintech_documentation/Requirement.txt b/1021_fintech_documentation/Requirement.txt deleted file mode 100644 index 2758a9a..0000000 --- a/1021_fintech_documentation/Requirement.txt +++ /dev/null @@ -1,4 +0,0 @@ -numpy==1.22.0 -pandas==1.3.3 -scikit-learn==1.0.2 -scikit-image==0.18.3 \ No newline at end of file diff --git a/README.md b/README.md index e66ec76..43f1ae7 100644 --- a/README.md +++ b/README.md @@ -9,6 +9,7 @@ To quickly learn how to run cleanlab on your own data, first check out the [quic | Example | Description | | ---------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [datalab](datalab_image_classification/README.md) | Use Datalab to detect various types of data issues in (a subset of) the Caltech-256 image classification dataset. | +| [fraud_detection](fraud_detection/fintech_creditcard_fraud.ipynb) | Apply Datalab to detect various issues in a fraud detection (tabular) dataset. | | [llm_evals_w_crowdlab](llm_evals_w_crowdlab/llm_evals_w_crowdlab.ipynb) | Reliable LLM Evaluation with multiple human/AI reviewers of varying competency (via CROWDLAB and LLM-as-judge GPT token probabilities). | | [fine_tune_LLM](fine_tune_LLM/LLM_with_noisy_labels_cleanlab.ipynb) | Fine-tuning OpenAI language models with noisily labeled text data | | [entity_recognition](entity_recognition/) | Train Transformer model for Named Entity Recognition and produce out-of-sample `pred_probs` for **cleanlab.token_classification**. | diff --git a/1021_fintech_documentation/Final.ipynb b/fraud_detection/fintech_creditcard_fraud.ipynb similarity index 88% rename from 1021_fintech_documentation/Final.ipynb rename to fraud_detection/fintech_creditcard_fraud.ipynb index dd6e17c..9479b34 100644 --- a/1021_fintech_documentation/Final.ipynb +++ b/fraud_detection/fintech_creditcard_fraud.ipynb @@ -7,44 +7,16 @@ "id": "b2eebf0d-31ff-4ce0-b2b7-4d82ee61150b" }, "source": [ - "# Detecting Data Quality Issues in Credit Card Fraud Detection Using Cleanlab\n", + "# Detecting Data Quality Issues in Credit Card Fraud Detection Dataset\n", "\n", - "In this 5-minute quickstart tutorial, we will use **Cleanlab's Datalab** to detect various issues in a tabular dataset commonly encountered in financial applications. This tutorial focuses on the **Credit Card Fraud Detection dataset**, which contains thousands of transaction records labeled as fraudulent or non-fraudulent. The dataset includes features such as transaction amount and anonymized variables for privacy.\n", + "This example uses **Datalab** to auto-detect various issues in a tabular dataset commonly encountered in financial applications. 
Specifically, we use the [Credit Card Fraud Detection dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud); download it first and make sure the data file is named **credit_card_fraud_dataset.csv**. This dataset contains thousands of transaction records labeled as fraudulent or non-fraudulent, along with features about each transaction such as the transaction amount, plus other variables anonymized for privacy.\n",
     "\n",
     "### Cleanlab Helps Uncover:\n",
     "- **Label errors**: Mislabeled transactions, such as fraudulent cases incorrectly marked as non-fraudulent.\n",
     "- **Outliers**: Transactions with abnormal patterns that deviate significantly from the rest of the dataset.\n",
     "- **Near-duplicates**: Repeated transactions or entries that may distort results or impact model performance.\n",
     "\n",
-    "Using Cleanlab, we automatically identify examples that are likely mislabeled or problematic, improving the overall data quality for better fraud detection performance. You can adapt this tutorial to detect and correct issues in your own financial tabular datasets.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "27fcddca-534f-4851-8c80-688e7cb7ff79",
-   "metadata": {
-    "id": "27fcddca-534f-4851-8c80-688e7cb7ff79"
-   },
-   "source": [
-    "## Quickstart\n",
-    "\n",
-    "Already have (out-of-sample) `pred_probs` from a model trained on your original data labels?\n",
-    "Have a `knn_graph` computed between dataset examples (reflecting similarity in their feature values)?\n",
-    "Run the code below to find issues in your dataset.\n"
-   ]
-  },
-  {
-   "cell_type": "raw",
-   "id": "3dc09d4e-499e-4aed-942e-57f1df4deca7",
-   "metadata": {
-    "id": "3dc09d4e-499e-4aed-942e-57f1df4deca7"
-   },
-   "source": [
-    "from cleanlab import Datalab\n",
-    "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
-    "lab.find_issues(pred_probs=your_pred_probs, knn_graph=knn_graph)\n",
-    "\n",
-    "lab.get_issues()"
+    "Using Cleanlab, we automatically identify examples that are likely mislabeled or otherwise problematic, improving the overall data quality for better fraud detection performance. You can adapt this tutorial to detect and correct issues in your own financial tabular datasets."
   ]
  },
  {
@@ -54,20 +26,13 @@
   "id": "791717d7-d140-4d85-b516-9a6e6f28c7c0"
   },
   "source": [
-    "# 1. Install Required Dependencies\n",
-    "\n",
-    "To get started, install the required packages for this tutorial using pip:\n",
-    "\n",
-    "```bash\n",
-    "!pip install \"cleanlab[datalab]\" scikit-learn pandas numpy\n"
+    "## 1. 
Install and Import Dependencies" ] }, { "cell_type": "code", - "source": [ - "# Install required libraries with correct versions\n", - "!pip install \"cleanlab[datalab]\" \"numpy\" \"pandas==1.3.3\" \"scikit-learn==1.0.2\" \"scikit-image==0.18.3\"\n" - ], + "execution_count": null, + "id": "f7TpME1a6-Db", "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -75,64 +40,9 @@ "id": "f7TpME1a6-Db", "outputId": "9f5642dc-ae12-464b-e65f-77d0f57b4ce0" }, - "id": "f7TpME1a6-Db", - "execution_count": 3, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.22.0)\n", - "Requirement already satisfied: pandas==1.3.3 in /usr/local/lib/python3.10/dist-packages (1.3.3)\n", - "Requirement already satisfied: scikit-learn==1.0.2 in /usr/local/lib/python3.10/dist-packages (1.0.2)\n", - "Requirement already satisfied: scikit-image==0.18.3 in /usr/local/lib/python3.10/dist-packages (0.18.3)\n", - "Requirement already satisfied: cleanlab[datalab] in /usr/local/lib/python3.10/dist-packages (2.5.0)\n", - "Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.10/dist-packages (from pandas==1.3.3) (2.8.2)\n", - "Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.10/dist-packages (from pandas==1.3.3) (2024.2)\n", - "Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.0.2) (1.11.4)\n", - "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.0.2) (1.4.2)\n", - "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.0.2) (3.5.0)\n", - "Requirement already satisfied: matplotlib!=3.0.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (3.8.0)\n", - "Requirement already satisfied: networkx>=2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (3.4.2)\n", - "Requirement already satisfied: pillow!=7.1.0,!=7.1.1,>=4.3.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (11.0.0)\n", - "Requirement already satisfied: imageio>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (2.36.1)\n", - "Requirement already satisfied: tifffile>=2019.7.26 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (2024.9.20)\n", - "Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (1.4.1)\n", - "Requirement already satisfied: tqdm>=4.53.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (4.66.6)\n", - "Requirement already satisfied: termcolor>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.5.0)\n", - "Requirement already satisfied: datasets>=2.7.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (3.2.0)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.16.1)\n", - "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (17.0.0)\n", - "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.3.8)\n", - "Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.10/dist-packages (from 
datasets>=2.7.0->cleanlab[datalab]) (2.32.3)\n", - "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.5.0)\n", - "Requirement already satisfied: multiprocess<0.70.17 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.70.16)\n", - "Requirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets>=2.7.0->cleanlab[datalab]) (2024.9.0)\n", - "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.11.10)\n", - "Requirement already satisfied: huggingface-hub>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.26.5)\n", - "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (24.2)\n", - "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (6.0.2)\n", - "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (1.2.1)\n", - "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (0.12.1)\n", - "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (4.55.3)\n", - "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (1.4.7)\n", - "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (3.2.0)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7.3->pandas==1.3.3) (1.17.0)\n", - "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (2.4.4)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.3.1)\n", - "Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (4.0.3)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (24.2.0)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.5.0)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (6.1.0)\n", - "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (0.2.1)\n", - "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.18.3)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.0->datasets>=2.7.0->cleanlab[datalab]) (4.12.2)\n", - "Requirement already 
satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets>=2.7.0->cleanlab[datalab]) (3.4.0)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets>=2.7.0->cleanlab[datalab]) (3.10)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets>=2.7.0->cleanlab[datalab]) (2.2.3)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets>=2.7.0->cleanlab[datalab]) (2024.8.30)\n" - ] - } + "outputs": [], + "source": [ + "!pip install \"cleanlab[all]\"" ] }, { @@ -147,19 +57,16 @@ "import random\n", "import numpy as np\n", "import pandas as pd\n", - "\n", "from sklearn.model_selection import cross_val_predict\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neighbors import NearestNeighbors\n", - "\n", - "\n", "from cleanlab import Datalab\n", "\n", - "# Set random seed for reproducibility\n", + "# Optional: set seed for reproducibility\n", "SEED = 42\n", "np.random.seed(SEED)\n", - "random.seed(SEED)\n" + "random.seed(SEED)" ] }, { @@ -169,11 +76,7 @@ "id": "3ef577c5-b3f1-4d5b-9b26-9324651b12fd" }, "source": [ - "# 2. Load and Process the Data\n", - "\n", - "We will now load the Credit Card Fraud Detection dataset, which contains features like transaction amounts and anonymized variables, along with labels indicating whether the transaction is fraudulent (`1`) or non-fraudulent (`0`).\n", - "\n", - "First, we load the dataset and display the first few rows to get an overview of the data structure.\n" + "## 2. 
Load and Process the Data" ] }, { @@ -190,23 +93,12 @@ }, "outputs": [ { - "output_type": "execute_result", "data": { - "text/plain": [ - " TransactionID TransactionDate Amount MerchantID \\\n", - "0 1 2024-04-03 14:15:35.462794 4189.27 688 \n", - "1 2 2024-03-19 13:20:35.462824 2659.71 109 \n", - "2 3 2024-01-08 10:08:35.462834 784.00 394 \n", - "3 4 2024-04-13 23:50:35.462850 3514.40 944 \n", - "4 5 2024-07-12 18:51:35.462858 369.07 475 \n", - "\n", - " TransactionType Location IsFraud \n", - "0 refund San Antonio 0 \n", - "1 refund Dallas 0 \n", - "2 purchase New York 0 \n", - "3 purchase Philadelphia 0 \n", - "4 purchase Phoenix 0 " - ], + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"fraud_data\",\n \"rows\": 100000,\n \"fields\": [\n {\n \"column\": \"TransactionID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 28867,\n \"min\": 1,\n \"max\": 100000,\n \"num_unique_values\": 100000,\n \"samples\": [\n 75722,\n 80185,\n 19865\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionDate\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 100000,\n \"samples\": [\n \"2024-08-18 01:11:35.918051\",\n \"2024-06-09 07:44:35.939541\",\n \"2024-06-10 08:55:35.558368\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1442.4159985963513,\n \"min\": 1.05,\n \"max\": 4999.77,\n \"num_unique_values\": 90621,\n \"samples\": [\n 3273.37,\n 4040.01,\n 4120.55\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MerchantID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 288,\n \"min\": 1,\n \"max\": 1000,\n \"num_unique_values\": 1000,\n \"samples\": [\n 702,\n 152,\n 346\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"purchase\",\n \"refund\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Houston\",\n \"Dallas\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"IsFraud\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "fraud_data" + }, "text/html": [ "\n", "
\n", @@ -501,14 +393,25 @@ "
\n", " \n" ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "variable_name": "fraud_data", - "summary": "{\n \"name\": \"fraud_data\",\n \"rows\": 100000,\n \"fields\": [\n {\n \"column\": \"TransactionID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 28867,\n \"min\": 1,\n \"max\": 100000,\n \"num_unique_values\": 100000,\n \"samples\": [\n 75722,\n 80185,\n 19865\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionDate\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 100000,\n \"samples\": [\n \"2024-08-18 01:11:35.918051\",\n \"2024-06-09 07:44:35.939541\",\n \"2024-06-10 08:55:35.558368\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1442.4159985963513,\n \"min\": 1.05,\n \"max\": 4999.77,\n \"num_unique_values\": 90621,\n \"samples\": [\n 3273.37,\n 4040.01,\n 4120.55\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MerchantID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 288,\n \"min\": 1,\n \"max\": 1000,\n \"num_unique_values\": 1000,\n \"samples\": [\n 702,\n 152,\n 346\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"purchase\",\n \"refund\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Houston\",\n \"Dallas\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"IsFraud\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } + "text/plain": [ + " TransactionID TransactionDate Amount MerchantID \\\n", + "0 1 2024-04-03 14:15:35.462794 4189.27 688 \n", + "1 2 2024-03-19 13:20:35.462824 2659.71 109 \n", + "2 3 2024-01-08 10:08:35.462834 784.00 394 \n", + "3 4 2024-04-13 23:50:35.462850 3514.40 944 \n", + "4 5 2024-07-12 18:51:35.462858 369.07 475 \n", + "\n", + " TransactionType Location IsFraud \n", + "0 refund San Antonio 0 \n", + "1 refund Dallas 0 \n", + "2 purchase New York 0 \n", + "3 purchase Philadelphia 0 \n", + "4 purchase Phoenix 0 " + ] }, + "execution_count": 3, "metadata": {}, - "execution_count": 3 + "output_type": "execute_result" } ], "source": [ @@ -558,8 +461,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ " Amount TransactionType_refund Location_Dallas Location_Houston \\\n", "0 1.173161 True False False \n", @@ -612,22 +515,11 @@ "id": "d394b6cd-34fd-4ec0-8f6f-99d13dd481ab" }, "source": [ - "### 3. Select a Classification Model and Compute Out-of-Sample Predicted Probabilities\n", + "## 3. Train a Classification Model and Compute Out-of-Sample Predicted Probabilities\n", "\n", "To detect potential label errors in the **Credit Card Fraud Detection dataset**, Cleanlab requires **probabilistic predictions** for every data point. However, predictions generated on the same data used for training can be **overfitted** and unreliable. 
For accurate results, Cleanlab works best with **out-of-sample** predicted class probabilities—i.e., predictions for data points excluded from the model during training.\n", "\n", - "---\n", - "\n", - "### Why Use Out-of-Sample Predictions?\n", - "\n", - "Out-of-sample predictions ensure that the model hasn't seen the data points during training. This approach:\n", - "- **Prevents overfitting**: Predictions are not biased by the training process.\n", - "- **Improves reliability**: Probabilities are closer to real-world performance.\n", - "- **Supports Cleanlab's analysis**: Enables Cleanlab to accurately identify mislabeled data and other issues.\n", - "\n", - "---\n", - "\n", - "### How We Generate Out-of-Sample Predictions\n", + "#### How We Generate Out-of-Sample Predictions\n", "\n", "We use **K-fold cross-validation**, which:\n", "1. Splits the dataset into `K` folds.\n", @@ -636,30 +528,9 @@ "\n", "This ensures every data point has **out-of-sample predicted probabilities**.\n", "\n", - "---\n", - "\n", - "### Model: Logistic Regression\n", - "\n", - "For this tutorial, we use **Logistic Regression**, a simple and interpretable model commonly used in fraud detection tasks. It predicts the probability of each class (`0` for non-fraud, `1` for fraud) based on the input features.\n", + "#### Model: Logistic Regression\n", "\n", - "---\n", - "\n", - "### Predicted Probabilities\n", - "\n", - "The output of cross-validation is an array of **predicted probabilities** (`pred_probs`):\n", - "- **Rows** correspond to individual transactions.\n", - "- **Columns** represent the probabilities of each class (`0` and `1`).\n", - "\n", - "For example:\n", - "| Transaction ID | Probability (Non-Fraud) | Probability (Fraud) |\n", - "|----------------|--------------------------|----------------------|\n", - "| 1 | 0.92 | 0.08 |\n", - "| 2 | 0.65 | 0.35 |\n", - "| ... | ... | ... |\n", - "\n", - "These probabilities are a critical input for Cleanlab to identify potential label issues in the dataset.\n", - "\n", - "Next, we will use these probabilities to construct a **K-Nearest Neighbors (KNN) graph** for analyzing data quality.\n" + "For this example, we use **Logistic Regression**, a simple and interpretable model commonly used in fraud detection tasks. It predicts the probability of each class (`0` for non-fraud, `1` for fraud) based on the input features of an example in the dataset. The same approach will work with *any* Machine Learning model." ] }, { @@ -671,8 +542,7 @@ }, "outputs": [], "source": [ - "# Define the classification model\n", - "clf = LogisticRegression(max_iter=1000, random_state=SEED)\n" + "clf = LogisticRegression(max_iter=1000, random_state=SEED)" ] }, { @@ -688,15 +558,14 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "Shape of predicted probabilities: (100000, 2)\n" ] } ], "source": [ - "# Perform K-fold cross-validation to compute out-of-sample predicted probabilities\n", "num_crossval_folds = 5\n", "pred_probs = cross_val_predict(\n", " clf,\n", @@ -706,7 +575,6 @@ " method=\"predict_proba\" # Get predicted probabilities\n", ")\n", "\n", - "# Display the shape of the predicted probabilities array\n", "print(\"Shape of predicted probabilities:\", pred_probs.shape)" ] }, @@ -717,17 +585,17 @@ "id": "2681aad2-d84d-4713-bed8-aa1204223fd5" }, "source": [ - "# 4. Construct K Nearest Neighbors Graph\n", + "## 4. 
Construct a K Nearest Neighbors Graph (Optional)\n",
    "\n",
    "The **KNN graph** represents the similarity between examples in the dataset.\n",
    "Here, we'll define similarity using the **Euclidean distance** between our normalized feature values.\n",
    "\n",
    "Note that this step is *optional*. If you pass numerical `features` to Datalab but no KNN graph, Datalab will internally construct its own KNN graph.\n",
    "You can provide your own KNN graph (as done here) to exert greater control over this process; constructing the graph yourself is also useful whenever your data aren't in a numerical format, or when you have a massive dataset (use [approximate KNN](https://docs.cleanlab.ai/stable/tutorials/datalab/workflows.html#Accelerate-Issue-Checks-with-Pre-computed-kNN-Graphs) in that case).\n",
    "\n",
    "Here we use scikit-learn's `NearestNeighbors` class to construct this graph:\n",
    "1. Compute pairwise distances between all examples.\n",
    "2. Represent the graph as a sparse matrix, with nonzero entries indicating the distance to nearest neighbors."
   ]
  },
  {
   "cell_type": "code",
@@ -756,19 +624,9 @@
   "id": "b27cf2de-e276-438a-8f0a-9a5de3d1757a"
   },
   "source": [
-    "# 5. Use Cleanlab to Find Dataset Issues\n",
-    "\n",
-    "With the given labels, predicted probabilities, and the KNN graph, Cleanlab can help us identify various issues in the **Credit Card Fraud Detection dataset**, such as:\n",
-    "\n",
-    "- **Label Issues**: Transactions where the assigned label (fraud or non-fraud) is likely incorrect.\n",
-    "- **Outliers**: Transactions with anomalous patterns that differ significantly from the rest.\n",
-    "- **Near-Duplicates**: Transactions that are highly similar or repeated.\n",
-    "- **Class Imbalance**: Uneven representation of classes in the dataset.\n",
-    "\n",
-    "We use Cleanlab's **Datalab** class to audit the dataset for these issues. The process involves:\n",
-    "1. Wrapping the dataset (preprocessed features and labels) into a dictionary format.\n",
-    "2. Creating a `Datalab` object to analyze the dataset.\n",
-    "3. Detecting and reporting various types of data quality issues."
+    "## 5. Use Datalab to Find Dataset Issues\n",
+    "\n",
+    "With the given labels, predicted probabilities, and the (optional) KNN graph, Datalab automatically identifies various issues in the dataset.\n",
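+    "\n",
+    "For your own dataset, the general pattern looks like the following sketch (here `your_dataset` and `your_pred_probs` are placeholders for your own data and its out-of-sample predicted probabilities; the `knn_graph` argument is optional):\n",
+    "\n",
+    "```python\n",
+    "from cleanlab import Datalab\n",
+    "\n",
+    "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
+    "lab.find_issues(pred_probs=your_pred_probs, knn_graph=knn_graph)\n",
+    "\n",
+    "lab.report()      # summary of all detected issue types\n",
+    "lab.get_issues()  # per-example issue flags and quality scores\n",
+    "```"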
] }, { @@ -784,8 +642,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "Finding label issues ...\n", "Finding outlier issues ...\n", @@ -807,8 +665,7 @@ "lab = Datalab(data, label_name=\"y\")\n", "\n", "# Use Cleanlab to find issues in the dataset\n", - "lab.find_issues(pred_probs=pred_probs, knn_graph=knn_graph)\n", - "\n" + "lab.find_issues(pred_probs=pred_probs, knn_graph=knn_graph) # could provide features here instead of knn_graph" ] }, { @@ -824,8 +681,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "Dataset Information: num_examples: 100000, num_classes: 2\n", "\n", @@ -930,22 +787,19 @@ }, { "cell_type": "markdown", - "source": [ - "## Label Issues\n", - "The report indicates that Cleanlab identified several label issues in the dataset. These are data entries where the given labels may not match the actual label, as estimated by Cleanlab. Each issue includes a numeric label score that quantifies how likely the label is correct (lower scores indicate higher likelihood of being mislabeled)." - ], + "id": "qBcATrTFCWqJ", "metadata": { "id": "qBcATrTFCWqJ" }, - "id": "qBcATrTFCWqJ" + "source": [ + "## Label Issues\n", + "The report indicates that Cleanlab identified several label issues in the dataset. These are data entries where the given labels may not match the actual label, as estimated by Cleanlab. Each issue includes a numeric label score that quantifies how likely the label is correct (lower scores indicate higher likelihood of being mislabeled)." + ] }, { "cell_type": "code", - "source": [ - "# Retrieve label issues\n", - "label_issues = lab.get_issues(\"label\")\n", - "print(label_issues.head())\n" - ], + "execution_count": 11, + "id": "pee_lWpiCiIV", "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -953,12 +807,10 @@ "id": "pee_lWpiCiIV", "outputId": "d5bcf570-0051-4b92-df49-20c3479b88b1" }, - "id": "pee_lWpiCiIV", - "execution_count": 11, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ " is_label_issue label_score given_label predicted_label\n", "0 False 0.990469 0 0\n", @@ -968,15 +820,18 @@ "4 False 0.991149 0 0\n" ] } + ], + "source": [ + "# Retrieve label issues\n", + "label_issues = lab.get_issues(\"label\")\n", + "\n", + "print(label_issues.head())" ] }, { "cell_type": "code", - "source": [ - "# Filter rows with label issues\n", - "label_issues_filtered = label_issues[label_issues['is_label_issue'] == True]\n", - "print(label_issues_filtered.head())\n" - ], + "execution_count": 12, + "id": "8pGqVz8RDeoF", "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -984,12 +839,10 @@ "id": "8pGqVz8RDeoF", "outputId": "322a6eb8-4b2f-4597-9b8f-614c97887a45" }, - "id": "8pGqVz8RDeoF", - "execution_count": 12, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ " is_label_issue label_score given_label predicted_label\n", "190 True 0.007187 1 0\n", @@ -999,20 +852,17 @@ "506 True 0.009220 1 0\n" ] } + ], + "source": [ + "# Filter rows with label issues\n", + "label_issues_filtered = label_issues[label_issues['is_label_issue'] == True]\n", + "print(label_issues_filtered.head())" ] }, { "cell_type": "code", - "source": [ - "# Sort the label issues by label_score (lower scores indicate higher likelihood of being mislabeled)\n", - "sorted_issues = label_issues.sort_values(\"label_score\").index\n", - "\n", - "# View the most likely label errors\n", - 
"X_raw.iloc[sorted_issues].assign(\n", - " given_label=y.iloc[sorted_issues],\n", - " predicted_label=label_issues[\"predicted_label\"].iloc[sorted_issues]\n", - ").head()\n" - ], + "execution_count": 13, + "id": "m1KP2zEWDfaE", "metadata": { "colab": { "base_uri": "https://localhost:8080/", @@ -1021,20 +871,13 @@ "id": "m1KP2zEWDfaE", "outputId": "6fc9c1b0-30a0-4c3f-f015-44e42202c166" }, - "id": "m1KP2zEWDfaE", - "execution_count": 13, "outputs": [ { - "output_type": "execute_result", "data": { - "text/plain": [ - " Amount TransactionType Location given_label predicted_label\n", - "6901 346.13 purchase San Jose 1 0\n", - "7933 25.91 refund San Jose 1 0\n", - "13204 963.84 purchase San Jose 1 0\n", - "16276 1093.22 purchase San Jose 1 0\n", - "7546 598.78 refund San Jose 1 0" - ], + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \")\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 438.6116871789898,\n \"min\": 25.91,\n \"max\": 1093.22,\n \"num_unique_values\": 5,\n \"samples\": [\n 25.91,\n 598.78,\n 963.84\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"refund\",\n \"purchase\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"San Jose\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"given_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 1,\n \"num_unique_values\": 1,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"predicted_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, "text/html": [ "\n", "
\n", @@ -1317,18 +1160,37 @@ "
\n", " \n" ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \")\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 438.6116871789898,\n \"min\": 25.91,\n \"max\": 1093.22,\n \"num_unique_values\": 5,\n \"samples\": [\n 25.91,\n 598.78,\n 963.84\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"refund\",\n \"purchase\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"San Jose\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"given_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 1,\n \"num_unique_values\": 1,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"predicted_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } + "text/plain": [ + " Amount TransactionType Location given_label predicted_label\n", + "6901 346.13 purchase San Jose 1 0\n", + "7933 25.91 refund San Jose 1 0\n", + "13204 963.84 purchase San Jose 1 0\n", + "16276 1093.22 purchase San Jose 1 0\n", + "7546 598.78 refund San Jose 1 0" + ] }, + "execution_count": 13, "metadata": {}, - "execution_count": 13 + "output_type": "execute_result" } + ], + "source": [ + "# Sort the label issues by label_score (lower scores indicate higher likelihood of being mislabeled)\n", + "sorted_issues = label_issues.sort_values(\"label_score\").index\n", + "\n", + "# View the most likely label errors\n", + "X_raw.iloc[sorted_issues].assign(\n", + " given_label=y.iloc[sorted_issues],\n", + " predicted_label=label_issues[\"predicted_label\"].iloc[sorted_issues]\n", + ").head()\n" ] }, { "cell_type": "markdown", + "id": "-ApyX5r6FTmI", + "metadata": { + "id": "-ApyX5r6FTmI" + }, "source": [ "### Example Review of Label Issues\n", "\n", @@ -1342,7 +1204,7 @@ "| 1093.22 | purchase | San Jose | 1 | 0 |\n", "| 598.78 | refund | San Jose | 1 | 0 |\n", "\n", - "These examples have been labeled incorrectly and should be carefully re-examined:\n", + "These examples may have been labeled incorrectly and should be carefully re-examined:\n", "- **Entry 1**: A purchase of 346.13 labeled as fraudulent (`1`) is predicted to be non-fraudulent (`0`).\n", "- **Entry 2**: A refund of 25.91 is similarly labeled as fraudulent but predicted as non-fraudulent.\n", "- **Entry 4**: A purchase of $1093.22 also seems misclassified as fraudulent.\n", @@ -1350,33 +1212,25 @@ "The predicted labels suggest a potential mislabeling pattern for transactions in `San Jose`. Transactions with relatively lower amounts or refunds might have been mislabeled as fraudulent. 
This should be reviewed with additional domain knowledge or transaction metadata for confirmation.\n", "\n", "Such insights are crucial for improving the dataset's quality and ensuring the model learns from accurate labels.\n" - ], - "metadata": { - "id": "-ApyX5r6FTmI" - }, - "id": "-ApyX5r6FTmI" + ] }, { "cell_type": "markdown", + "id": "_zzPdWl0GFOY", + "metadata": { + "id": "_zzPdWl0GFOY" + }, "source": [ "\n", "### Outlier Issues\n", "\n", "According to the report, our dataset contains some outliers. We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via the `get_issues` method. We sort the resulting DataFrame by Cleanlab’s outlier quality score to see the most severe outliers in our dataset." - ], - "metadata": { - "id": "_zzPdWl0GFOY" - }, - "id": "_zzPdWl0GFOY" + ] }, { "cell_type": "code", - "source": [ - "outlier_results = lab.get_issues(\"outlier\")\n", - "sorted_outliers = outlier_results.sort_values(\"outlier_score\").index\n", - "\n", - "X_raw.iloc[sorted_outliers].head()" - ], + "execution_count": 14, + "id": "D7VClp15GIXC", "metadata": { "colab": { "base_uri": "https://localhost:8080/", @@ -1385,20 +1239,13 @@ "id": "D7VClp15GIXC", "outputId": "ec6c23c3-1802-42ba-d2aa-69537251da5d" }, - "id": "D7VClp15GIXC", - "execution_count": 14, "outputs": [ { - "output_type": "execute_result", "data": { - "text/plain": [ - " Amount TransactionType Location\n", - "43484 4999.73 purchase Chicago\n", - "4659 2114.37 refund Philadelphia\n", - "67602 3255.47 purchase San Jose\n", - "91994 1147.93 refund Chicago\n", - "52696 4005.05 purchase San Antonio" - ], + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"X_raw\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1519.3915375570575,\n \"min\": 1147.93,\n \"max\": 4999.73,\n \"num_unique_values\": 5,\n \"samples\": [\n 2114.37,\n 4005.05,\n 3255.47\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"refund\",\n \"purchase\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"Philadelphia\",\n \"San Antonio\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, "text/html": [ "\n", "
\n", @@ -1669,18 +1516,33 @@ "
\n", " \n" ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"X_raw\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1519.3915375570575,\n \"min\": 1147.93,\n \"max\": 4999.73,\n \"num_unique_values\": 5,\n \"samples\": [\n 2114.37,\n 4005.05,\n 3255.47\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"refund\",\n \"purchase\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"Philadelphia\",\n \"San Antonio\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } + "text/plain": [ + " Amount TransactionType Location\n", + "43484 4999.73 purchase Chicago\n", + "4659 2114.37 refund Philadelphia\n", + "67602 3255.47 purchase San Jose\n", + "91994 1147.93 refund Chicago\n", + "52696 4005.05 purchase San Antonio" + ] }, + "execution_count": 14, "metadata": {}, - "execution_count": 14 + "output_type": "execute_result" } + ], + "source": [ + "outlier_results = lab.get_issues(\"outlier\")\n", + "sorted_outliers = outlier_results.sort_values(\"outlier_score\").index\n", + "\n", + "X_raw.iloc[sorted_outliers].head()" ] }, { "cell_type": "markdown", + "id": "qQYl8X5RG9F6", + "metadata": { + "id": "qQYl8X5RG9F6" + }, "source": [ "\n", "\n", @@ -1701,32 +1563,24 @@ "\n", "These steps will ensure that the dataset is representative and does not include suspicious entries that could affect the performance of fraud detection models.\n", " " - ], - "metadata": { - "id": "qQYl8X5RG9F6" - }, - "id": "qQYl8X5RG9F6" + ] }, { "cell_type": "markdown", - "source": [ - "### Near-Duplicate Issues\n", - "\n", - "According to the report, our dataset contains some sets of nearly duplicated examples. We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by Cleanlab’s near-duplicate quality score to see the examples in our dataset that are most nearly duplicated.\n", - "\n", - "\n" - ], + "id": "STlYZFJRRDtO", "metadata": { "id": "STlYZFJRRDtO" }, - "id": "STlYZFJRRDtO" + "source": [ + "### Near-Duplicate Issues\n", + "\n", + "According to the report, our dataset contains some sets of nearly duplicated examples. We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by Cleanlab’s near-duplicate quality score to see the examples in our dataset that are most nearly duplicated (the score is equal to 0 if examples are exactly duplicated)." 
+ ] }, { "cell_type": "code", - "source": [ - "duplicate_results = lab.get_issues(\"near_duplicate\")\n", - "duplicate_results.sort_values(\"near_duplicate_score\").head()" - ], + "execution_count": 15, + "id": "VHcPnNYbQZ-n", "metadata": { "colab": { "base_uri": "https://localhost:8080/", @@ -1735,27 +1589,13 @@ "id": "VHcPnNYbQZ-n", "outputId": "7dc6f1fe-ac78-4c77-96e5-176c7f3a6a16" }, - "id": "VHcPnNYbQZ-n", - "execution_count": 15, "outputs": [ { - "output_type": "execute_result", "data": { - "text/plain": [ - " is_near_duplicate_issue near_duplicate_score near_duplicate_sets \\\n", - "62583 True 0.0 [55080] \n", - "30333 True 0.0 [13617] \n", - "12827 True 0.0 [15703] \n", - "66741 True 0.0 [82920] \n", - "45125 True 0.0 [95476] \n", - "\n", - " distance_to_nearest_neighbor \n", - "62583 0.0 \n", - "30333 0.0 \n", - "12827 0.0 \n", - "66741 0.0 \n", - "45125 0.0 " - ], + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"duplicate_results\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"is_near_duplicate_issue\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 0.0,\n \"max\": 0.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_sets\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"distance_to_nearest_neighbor\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 0.0,\n \"max\": 0.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, "text/html": [ "\n", "
\n", @@ -2032,38 +1872,46 @@ "
\n", " \n" ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"duplicate_results\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"is_near_duplicate_issue\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 0.0,\n \"max\": 0.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_sets\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"distance_to_nearest_neighbor\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 0.0,\n \"max\": 0.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } + "text/plain": [ + " is_near_duplicate_issue near_duplicate_score near_duplicate_sets \\\n", + "62583 True 0.0 [55080] \n", + "30333 True 0.0 [13617] \n", + "12827 True 0.0 [15703] \n", + "66741 True 0.0 [82920] \n", + "45125 True 0.0 [95476] \n", + "\n", + " distance_to_nearest_neighbor \n", + "62583 0.0 \n", + "30333 0.0 \n", + "12827 0.0 \n", + "66741 0.0 \n", + "45125 0.0 " + ] }, + "execution_count": 15, "metadata": {}, - "execution_count": 15 + "output_type": "execute_result" } + ], + "source": [ + "duplicate_results = lab.get_issues(\"near_duplicate\")\n", + "duplicate_results.sort_values(\"near_duplicate_score\").head()" ] }, { "cell_type": "markdown", - "source": [ - "The results above show which examples Cleanlab considers nearly duplicated (rows where is_near_duplicate_issue == True). Here, we see some examples that Cleanlab has flagged as being nearly duplicated. Let’s view these examples to see how similar they are." - ], + "id": "0FyG5cJtRNGb", "metadata": { "id": "0FyG5cJtRNGb" }, - "id": "0FyG5cJtRNGb" + "source": [ + "The results above show which examples Cleanlab considers nearly duplicated (rows where is_near_duplicate_issue == True). Here, we see some examples that Cleanlab has flagged as being nearly duplicated. Let’s view these examples to see how similar they are." 
+ ] }, { "cell_type": "code", - "source": [ - "# Identify the row with the lowest near_duplicate_score\n", - "lowest_scoring_duplicate = duplicate_results[\"near_duplicate_score\"].idxmin()\n", - "\n", - "# Extract the indices of the lowest scoring duplicate and its near duplicate sets\n", - "indices_to_display = [lowest_scoring_duplicate] + duplicate_results.loc[lowest_scoring_duplicate, \"near_duplicate_sets\"].tolist()\n", - "\n", - "# Display the relevant rows from the original dataset\n", - "X_raw.iloc[indices_to_display]\n" - ], + "execution_count": 18, + "id": "IqgcWEVIROAP", "metadata": { "colab": { "base_uri": "https://localhost:8080/", @@ -2072,18 +1920,13 @@ "id": "IqgcWEVIROAP", "outputId": "eb36a8cd-a66e-4f3d-eb68-c7aac6ef27b5" }, - "id": "IqgcWEVIROAP", - "execution_count": 18, "outputs": [ { - "output_type": "execute_result", "data": { - "text/plain": [ - " Amount TransactionType Location\n", - "73 3374.61 refund New York\n", - "19427 3374.61 refund New York\n", - "30450 3374.63 refund New York" - ], + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"X_raw\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.011547005383782014,\n \"min\": 3374.61,\n \"max\": 3374.63,\n \"num_unique_values\": 2,\n \"samples\": [\n 3374.63,\n 3374.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"refund\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"New York\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, "text/html": [ "\n", "
\n", @@ -2342,40 +2185,45 @@ "
\n", " \n" ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"X_raw\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.011547005383782014,\n \"min\": 3374.61,\n \"max\": 3374.63,\n \"num_unique_values\": 2,\n \"samples\": [\n 3374.63,\n 3374.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"refund\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"New York\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } + "text/plain": [ + " Amount TransactionType Location\n", + "73 3374.61 refund New York\n", + "19427 3374.61 refund New York\n", + "30450 3374.63 refund New York" + ] }, + "execution_count": 18, "metadata": {}, - "execution_count": 18 + "output_type": "execute_result" } + ], + "source": [ + "# Identify the row with the lowest near_duplicate_score\n", + "lowest_scoring_duplicate = duplicate_results[\"near_duplicate_score\"].idxmin()\n", + "\n", + "# Extract the indices of the lowest scoring duplicate and its near duplicate sets\n", + "indices_to_display = [lowest_scoring_duplicate] + duplicate_results.loc[lowest_scoring_duplicate, \"near_duplicate_sets\"].tolist()\n", + "\n", + "# Display the relevant rows from the original dataset\n", + "X_raw.iloc[indices_to_display]\n" ] }, { "cell_type": "markdown", + "id": "6nhecZHHSuv9", + "metadata": { + "id": "6nhecZHHSuv9" + }, "source": [ "These examples are exact duplicates! 
Perhaps the same information was accidentally recorded multiple times in this data.\n", "\n", "Similarly, let’s take a look at another example and the identified near-duplicate sets:" - ], - "metadata": { - "id": "6nhecZHHSuv9" - }, - "id": "6nhecZHHSuv9" + ] }, { "cell_type": "code", - "source": [ - "# Identify the next row not in the previous near duplicate set\n", - "second_lowest_scoring_duplicate = duplicate_results[\"near_duplicate_score\"].drop(indices_to_display).idxmin()\n", - "\n", - "# Extract the indices of the second lowest scoring duplicate and its near duplicate sets\n", - "next_indices_to_display = [second_lowest_scoring_duplicate] + duplicate_results.loc[second_lowest_scoring_duplicate, \"near_duplicate_sets\"].tolist()\n", - "\n", - "# Display the relevant rows from the original dataset\n", - "X_raw.iloc[next_indices_to_display]" - ], + "execution_count": 19, + "id": "94gQWzVkRW53", "metadata": { "colab": { "base_uri": "https://localhost:8080/", @@ -2384,17 +2232,13 @@ "id": "94gQWzVkRW53", "outputId": "106f3513-d065-4483-dc76-e6c28e614b39" }, - "id": "94gQWzVkRW53", - "execution_count": 19, "outputs": [ { - "output_type": "execute_result", "data": { - "text/plain": [ - " Amount TransactionType Location\n", - "167 1796.39 refund New York\n", - "53564 1796.39 refund New York" - ], + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"X_raw\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 1796.39,\n \"max\": 1796.39,\n \"num_unique_values\": 1,\n \"samples\": [\n 1796.39\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"refund\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"New York\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe" + }, "text/html": [ "\n", "
\n", @@ -2647,40 +2491,45 @@ "
\n", " \n" ], - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "dataframe", - "summary": "{\n \"name\": \"X_raw\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 1796.39,\n \"max\": 1796.39,\n \"num_unique_values\": 1,\n \"samples\": [\n 1796.39\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"refund\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"New York\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" - } + "text/plain": [ + " Amount TransactionType Location\n", + "167 1796.39 refund New York\n", + "53564 1796.39 refund New York" + ] }, + "execution_count": 19, "metadata": {}, - "execution_count": 19 + "output_type": "execute_result" } + ], + "source": [ + "# Identify the next row not in the previous near duplicate set\n", + "second_lowest_scoring_duplicate = duplicate_results[\"near_duplicate_score\"].drop(indices_to_display).idxmin()\n", + "\n", + "# Extract the indices of the second lowest scoring duplicate and its near duplicate sets\n", + "next_indices_to_display = [second_lowest_scoring_duplicate] + duplicate_results.loc[second_lowest_scoring_duplicate, \"near_duplicate_sets\"].tolist()\n", + "\n", + "# Display the relevant rows from the original dataset\n", + "X_raw.iloc[next_indices_to_display]" ] }, { "cell_type": "markdown", - "source": [ - "We identified another set of exact duplicates in our dataset! Including near/exact duplicates in a dataset may have unintended effects on models; be wary about splitting them across training/test sets. Learn more about handling near duplicates detected in a dataset from the FAQ.\n", - "\n", - "This tutorial highlights a straightforward approach to detect potentially incorrect information in any tabular dataset. Just use Cleanlab with any ML model – the better the model, the more accurate the data errors detected by Cleanlab will be!" - ], + "id": "6vexriCMTCAG", "metadata": { "id": "6vexriCMTCAG" }, - "id": "6vexriCMTCAG" - }, - { - "cell_type": "code", - "source": [], - "metadata": { - "id": "I56gc8gFTC4l" - }, - "id": "I56gc8gFTC4l", - "execution_count": null, - "outputs": [] + "source": [ + "We identified another set of exact duplicates in our dataset! Including near/exact duplicates in a dataset may have unintended effects on models; be wary about splitting them across training/test sets. Learn more about handling near duplicates detected in a dataset from the [FAQ](https://docs.cleanlab.ai/stable/tutorials/faq.html#How-to-handle-near-duplicate-data-identified-by-Datalab?).\n", + "\n", + "This tutorial highlights a straightforward approach to detect potentially incorrect information in any tabular dataset. Just use Cleanlab with any ML model – the better the model, the more accurate the data issues detected by Cleanlab will be!" + ] } ], "metadata": { + "colab": { + "provenance": [] + }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", @@ -2697,11 +2546,8 @@ "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" - }, - "colab": { - "provenance": [] } }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +}