Skip to content

Latest commit

 

History

History
45 lines (36 loc) · 1.97 KB

Readme.md

File metadata and controls

45 lines (36 loc) · 1.97 KB

Approximate hierarchical clustering

Implementation of Integer Linear Program (ILP) to recover exact hierarchical clustering according to the cost function proposed by Dasgupta (STOC '16) as well as O(log n) LP rounding based approximation algorithm.

Manuscript (NIPS 2016)

Non standard dependencies

Usage

To run the ILP (warning will be quite slow on large datasets) type

python ip.py -d [data] -l [label_index] -k [kernel] -f [function] -o [out_file] -n [num_workers] 
-p [prune] -x [num_samples]
  • [data]: the path to the data file (comma delimited)
  • [label_index]: the index of the column that contains the labels (default 0)
  • [kernel]: one of {linear, gaussian}
  • [function]: one of {linear, quadratic, cubic, exponential, logarithm}
  • [num_workers]: the number of workers for parallelization (default 1)
  • [prune]: number of flat clusters to prune to
  • [num_samples]: on large sets, sample this many points uniformly at random

To run the approximation algorithm (faster) type

python lp.py -d [data] -l [label_index] -k [kernel] -f [function] -o [out_file] -n [num_workers] 
-p [prune] -x [num_samples] -t [triangle] -e [epsilon]
  • [data]: the path to the data file (comma delimited)
  • [label_index]: the index of the column that contains the labels (default 0)
  • [kernel]: one of {linear, gaussian}
  • [function]: one of {linear, quadratic, cubic, exponential, logarithm}
  • [num_workers]: the number of workers for parallelization (default 1)
  • [prune]: number of flat clusters to prune to
  • [num_samples]: on large sets, sample this many points uniformly at random
  • [triangle]: if true, then use lazy triangle inequality (faster)
  • [epsilon]: algorithm parameter (0.5 is usually a safe choice)