Skip to content

Latest commit

 

History

History
63 lines (39 loc) · 2.98 KB

README.md

File metadata and controls

63 lines (39 loc) · 2.98 KB

CAIM is a supervised discretization method [1] and Python-CAIM is a Python implementation of CAIM. This is a work in progress, results should be closely inspected. The goal is to provide both a CLI to discretize data for later use as well as a class for programmatic usage. Pull requests welcome.

There is a MATLAB implementation by Guangdi Li and a Java implementation (Research->Data Mining Tool) by the author. The latter being an implementation of the currently unpublished CAIM+ version of the algorithm.

Current Python-CAIM is working on UCI's Musk1 dataset as well as other toy datasets. Results are validated against the Java implementation (see above).

On performance, the Java implementation has notably lower latency (higher performance). This may be due to Java being fundamentally faster than Python, design tricks/shortcuts, or a combination of both. Currently difficult to determine source of improved performance since source code does not appear to be included in the CAIM JAR file. The MatLab version is comparable and often faster for very small datasets. However, Python-CAIM can parallelize discretization, and can thus scale better for datasets with many features.

CLI Options

usage: caim.py [-h] [-t TARGET_FIELD] [-o OUTPUT_PATH] [-H] [-q] input_file

CAIM Algorithm Command Line Tool and Library

positional arguments:
  input_file            CSV input data file

optional arguments:
  -h, --help            show this help message and exit
  -t TARGET_FIELD, --target-field TARGET_FIELD
                        Target field as an integer (0-indexed) or string
                        corresponding to column name. Negative indices (e.g.
                        -1) are allowed.
  -o OUTPUT_PATH, --output-path OUTPUT_PATH
                        File path to write discretized form of data in CSV
                        format
  -H, --header          Use first row as column/field names
  -q, --quiet           Minimal information is printed to STDOUT

Example Usages

Discretize IRIS data

python3 ./caim.py datasets/iris.data -t -1 -H

Discretize IRIS data and save discrete results to iris_caim_data.csv

python3 ./caim.py datasets/iris.data -t -1 -H -o iris_caim.csv

Discretize musk1

python3 ./caim.py datasets/musk_clean1.csv -t -1

Interval Output

Intervals are printed in the form:

[ 0.13  0.34  0.39  0.66]

Which should be interpretted as:

[0.13, 0.34](0.34, 0.39](0.39, 0.66]

The output dataset will use the right-end of each interval as the discretized value.

TODO

  • Fix Unit Tests
  • Continue to re-implement in Pandas/NumPy for speed (avoid loops)
  • Add more test data and corresponding unittests
  • Clean-up API and document

[1] Kurgan, L. and Cios, K.J., 2004. CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2):145-153