Add JOSS paper and improvements (cblearn#78)
* Create references.bib
* Create paper.md
* Create build-paper-pdf.yaml
* add supplementaries
* extend unit tests
* extend documentation

Co-authored-by: Mojtaba Barzegari <40744245+mbarzegary@users.noreply.github.com>
dekuenstle and mbarzegary authored Jun 6, 2024
1 parent 79713a9 commit d5177e7
Showing 68 changed files with 2,437 additions and 298 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/build-paper-pdf.yaml
@@ -0,0 +1,37 @@
name: Build paper

on:
push:
paths:
- paper/**

jobs:
paper:
runs-on: ubuntu-latest
name: Build Paper PDF
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build draft PDF
uses: openjournals/openjournals-draft-action@master
with:
journal: joss
# This should be the path to the paper within your repo.
paper-path: paper/paper.md
- name: Build supplementary PDF
uses: docker://pandoc/latex:2.9
with:
args: >- # allows you to break string into multiple lines
--standalone
--output=paper/supplementary.pdf
--bibliography=paper/references.bib
--resource-path=paper/
paper/supplementary.md
- name: Upload
uses: actions/upload-artifact@v4
with:
name: paper-pdf
# This is the output path where Pandoc will write the compiled
# PDF. Note, this should be the same directory as the input
# paper.md
path: paper/*.pdf
10 changes: 5 additions & 5 deletions .github/workflows/release.yml
@@ -1,8 +1,8 @@
# Based on:
# Based on:
# https://packaging.python.org/en/latest/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/
name: Publish Python 🐍 distributions 📦 to PyPI and TestPyPI
name: Release

on:
on:
push:
branches:
- main
@@ -14,9 +14,9 @@ jobs:
name: Build and publish Python 🐍 distributions 📦 to PyPI and TestPyPI
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set up Python 3.10
uses: actions/setup-python@v3
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Install pypa/build
18 changes: 11 additions & 7 deletions .github/workflows/test.yml
@@ -1,7 +1,7 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Lint and test code, test documentation build
name: Test

on:
push:
@@ -35,7 +35,7 @@ jobs:
# see .flake8 config file for selected/ignored rules.
# warnings can be found in the action logs

docs:
docs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
@@ -55,21 +55,26 @@ jobs:
test:
strategy:
matrix:
python-version:
python-version:
- "3.9"
- "3.10"
- "3.11"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Setup Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: pip
cache-dependency-path: setup.cfg
- name: Setup R
uses: r-lib/actions/setup-r@v2
- name: Install R dependencies
run: |
# 2024-05-14: loe is not available from CRAN, so we have to fall back to the archive.
wget https://cran.r-project.org/src/contrib/Archive/loe/loe_1.1.tar.gz
R CMD INSTALL ./loe_1.1.tar.gz
- name: Install package with dependencies
run: |
python3 -m pip install --upgrade pip
@@ -79,10 +84,9 @@ jobs:
run: |
pytest cblearn --cov=cblearn --cov-report=xml --remote-data
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v1
uses: codecov/codecov-action@v4
with:
token: ${{ secrets.CODECOV_TOKEN }}
file: ./coverage.xml
flags: unittests
env_vars: OS,PYTHON

60 changes: 15 additions & 45 deletions README.md
@@ -1,19 +1,15 @@
# cblearn
<h1 align="center">
<img src="https://raw.githubusercontent.com/cblearn/cblearn/main/docs/logo-light.svg" width="300">
</h1><br>

## Comparison-based Machine Learning in Python
[![PyPI version](https://img.shields.io/pypi/v/cblearn.svg)](https://pypi.python.org/pypi/cblearn)
[![Documentation](https://readthedocs.org/projects/cblearn/badge/?version=stable)](https://cblearn.readthedocs.io/en/stable/?badge=stable)
[![Test status](https://github.com/cblearn/cblearn/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/cblearn/cblearn/actions/workflows/test.yml)
[![Test Coverage](https://codecov.io/gh/cblearn/cblearn/branch/master/graph/badge.svg?token=P9JRT6OK6O)](https://codecov.io/gh/cblearn/cblearn)

Comparison-based Learning algorithms are the Machine Learning algorithms to use when training data contains similarity comparisons ("A and B are more similar than C and D") instead of data points.

Triplet comparisons from human observers help model the perceived similarity of objects.
These human triplets are collected in studies, asking questions like
"Which of the following bands is most similar to Queen?" or
"Which color appears most similar to the reference?".
Comparison-based learning methods are machine learning algorithms using similarity comparisons ("A and B are more similar than C and D") instead of featurized data.

This library provides an easy-to-use interface for comparison-based learning algorithms.
It works hand-in-hand with scikit-learn:

```python
from sklearn.datasets import load_iris
@@ -36,51 +32,25 @@ embedding = estimator.fit_transform(triplets)
print(f"The embedding has shape {embedding.shape}.")
```

Please try the [Examples](https://cblearn.readthedocs.io/en/stable/generated_examples/index.html).

## Getting Started

Install cblearn as described [here](https://cblearn.readthedocs.io/en/stable/install.html) and try the [examples](https://cblearn.readthedocs.io/en/stable/generated_examples/index.html).

Find a theoretical introduction to comparison-based learning, the datatypes,
algorithms, and datasets in the [User Guide](https://cblearn.readthedocs.io/en/stable/user_guide/index.html).

## Features

### Datasets

*cblearn* provides utility methods to simplify the loading and conversion
of your comparison datasets. In addition, some functions download and load multiple real-world comparisons.
* [Installation & Quickstart](https://cblearn.readthedocs.io/en/stable/getting_started.html)
* [Examples](https://cblearn.readthedocs.io/en/stable/generated_examples/index.html).
* [User Guide](https://cblearn.readthedocs.io/en/stable/user_guide/index.html).

| Dataset | Query | #Object | #Response | #Triplet |
| --- | --- | ---:| ---:| ---:|
| Vogue Cover | Odd-out Triplet | 60 | 1,107 | 2,214 |
| Nature Scene | Odd-out Triplet | 120 | 3,355 | 6,710 |
| Car | Most-Central Triplet | 60 | 7,097 | 14,194 |
| Material | Standard Triplet | 100 | 104,692 |104,692 |
| Food | Standard Triplet | 100 | 190,376 |190,376 |
| Musician | Standard Triplet | 413 | 224,792 |224,792 |
| Things Image Testset | Odd-out Triplet | 1,854 | 146,012 | 292,024 |
| ImageNet Images v0.1 | Rank 2 from 8 | 1,000 | 25,273 | 328,549 |
| ImageNet Images v0.2 | Rank 2 from 8 | 50,000 | 384,277 | 5M |


### Embedding Algorithms

| Algorithm | Default | Pytorch (GPU) | Reference Wrapper |
| --------------------------- | :---: | :-----------: | :---------------: |
| Crowd Kernel Learning (CKL) | X | X | |
| FORTE | | X | |
| GNMDS | X | X | |
| Maximum-Likelihood Difference Scaling (MLDS) | X | | [MLDS (R)](https://cran.r-project.org/web/packages/MLDS/index.html)|
| Soft Ordinal Embedding (SOE) | X | X | [loe (R)](https://cran.r-project.org/web/packages/loe/index.html) |
| Stochastic Triplet Embedding (STE/t-STE) | X | X | |

## Contribute

We are happy about your bug reports, questions, or suggestions as GitHub Issues and code or documentation contributions as GitHub Pull Requests.
Please see our [Contributor Guide](https://cblearn.readthedocs.io/en/stable/contributor_guide/index.html).

## Related packages

There are more Python packages for comparison-based learning:

- [metric-learn](http://contrib.scikit-learn.org/metric-learn) is a collection of algorithms for metric learning. The *weakly supervised* algorithms learn from triplets and quadruplets.
- [salmon](https://docs.stsievert.com/salmon/) is a package to collect triplets efficiently in crowd-sourced experiments. Therefore it implements ordinal embedding algorithms and sampling strategies to actively query the most informative comparisons.

## Authors and Acknowledgement
*cblearn* was initiated by current and former members of the [Theory of Machine Learning group](http://www.tml.cs.uni-tuebingen.de/index.php) of Prof. Dr. Ulrike von Luxburg at the University of Tübingen.
The leading developer is [David-Elias Künstle](http://www.tml.cs.uni-tuebingen.de/team/kuenstle/index.php).
8 changes: 5 additions & 3 deletions cblearn/datasets/_food_similarity.py
@@ -27,7 +27,7 @@ def fetch_food_similarity(data_home: Optional[os.PathLike] = None, download_if_m
.. warning::
This function downloads the file without verifying the ssl signature to circumvent an outdated certificate of the dataset hosts.
However, after downloading the function verifies the file checksum before loading the file to minimize the risk of man-in-the-middle attacks.
=================== =====================
Triplets 190376
Objects 100
@@ -83,12 +83,14 @@ def fetch_food_similarity(data_home: Optional[os.PathLike] = None, download_if_m
archive_path = _base._fetch_remote(ARCHIVE, dirname=data_home)
finally:
ssl._create_default_https_context = ssl_default

with zipfile.ZipFile(archive_path) as zf:
with zf.open('food100-dataset/all-triplets.csv', 'r') as f:
triplets = np.loadtxt(f, dtype=str, delimiter=';')
triplets = np.char.strip(triplets) # trim whitespace

image_names = np.asarray([name[len('food100-dataset/'):] for name in zf.namelist()
image_names = np.asarray([name[len('food100-dataset/'):]
for name in zf.namelist()
if name.startswith('food100-dataset/images/')
and name.endswith('.jpg')])

17 changes: 15 additions & 2 deletions cblearn/datasets/_musician_similarity.py
@@ -20,15 +20,21 @@

def fetch_musician_similarity(data_home: Optional[os.PathLike] = None, download_if_missing: bool = True,
shuffle: bool = True, random_state: Optional[np.random.RandomState] = None,
return_triplets: bool = False) -> Union[Bunch, np.ndarray]:
return_triplets: bool = False,
valid_triplets: bool = True) -> Union[Bunch, np.ndarray]:
""" Load the MusicSeer musician similarity dataset (triplets).
=================== =====================
Triplets 131.970
Triplets 118.263
Objects (Artists) 448
Dimensionality unknown
=================== =====================
.. warning::
This dataset contains triplets whose musicians are not all distinct,
i.e., for some triplets (i, j, k), i==j, j==k, or i==k is possible.
By default, this function filters out these triplets; set `valid_triplets=False` to keep them.
See :ref:`musician_similarity_dataset` for a detailed description.
Args:
@@ -42,6 +48,8 @@ def fetch_musician_similarity(data_home: Optional[os.PathLike] = None, download_
Initialization for shuffle random generator
return_triplets : boolean, default=False.
If True, returns numpy array instead of a Bunch object.
valid_triplets : boolean, default=True.
If True, only valid triplets are returned, i.e., triplets (i, j, k) whose entries are pairwise distinct.
Returns:
dataset : :class:`~sklearn.utils.Bunch`
@@ -102,6 +110,11 @@ def fetch_musician_similarity(data_home: Optional[os.PathLike] = None, download_

triplet_filter = musicians_data['other'] != '' # remove bi-tuples.
triplet_ids = np.c_[musicians_data['target'], musicians_data['chosen'], musicians_data['other']]
if valid_triplets:
triplet_filter = (triplet_filter
& (triplet_ids[:, 0] != triplet_ids[:, 1])
& (triplet_ids[:, 1] != triplet_ids[:, 2])
& (triplet_ids[:, 0] != triplet_ids[:, 2]))
triplet_ids = triplet_ids[triplet_filter].astype(int)

all_ids, triplets = np.unique(triplet_ids, return_inverse=True)
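
The filtering and re-indexing in this hunk can be tried out on a toy array. The following is a minimal sketch, not the library's code: the helper name `filter_valid_triplets` and the sample ids are made up for illustration, and only NumPy calls are used.

```python
import numpy as np


def filter_valid_triplets(triplet_ids: np.ndarray) -> np.ndarray:
    """Keep only rows whose three entries are pairwise distinct.

    Hypothetical helper that mirrors the boolean mask built in the diff.
    """
    triplet_ids = np.asarray(triplet_ids)
    keep = ((triplet_ids[:, 0] != triplet_ids[:, 1])
            & (triplet_ids[:, 1] != triplet_ids[:, 2])
            & (triplet_ids[:, 0] != triplet_ids[:, 2]))
    return triplet_ids[keep]


# Made-up artist ids; rows 1 and 2 repeat an id and should be dropped.
raw = np.array([[10, 20, 30],
                [10, 10, 30],
                [30, 20, 30],
                [40, 20, 10]])
valid = filter_valid_triplets(raw)

# Map the arbitrary ids to consecutive indices 0..n-1.
# The reshape keeps this robust across NumPy versions that return
# the inverse either flattened or in the input's shape.
all_ids, inverse = np.unique(valid, return_inverse=True)
triplets = inverse.reshape(valid.shape)
```

All three pairwise comparisons are needed because a chained check such as `i != j != k` would still accept rows with `i == k`.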
19 changes: 19 additions & 0 deletions cblearn/datasets/_triplet_response.py
@@ -9,6 +9,17 @@
from cblearn.datasets._datatypes import NoiseTarget, Distance


def _count_unique_items(query):
""" Count unique items per row in a 2D array.
Efficient approach even for large number of rows
and integer items:
https://stackoverflow.com/a/48473125
"""
sorted_query = np.sort(query, axis=1)
return (sorted_query[:, 1:] != sorted_query[:, :-1]).sum(axis=1) + 1


def noisy_triplet_response(triplets: utils.Query, embedding: np.ndarray, result_format: Optional[str] = None,
noise: Union[None, str, Callable] = None, noise_options: Dict = {},
noise_target: Union[str, NoiseTarget] = 'differences',
@@ -63,6 +74,14 @@ def noisy_triplet_response(triplets: utils.Query, embedding: np.ndarray, result_
result_format = utils.check_format(result_format, triplets, None)
triplets: np.ndarray = utils.check_query(triplets, result_format=utils.QueryFormat.LIST)
embedding = check_array(embedding)
if triplets.shape[1] != 3:
raise ValueError("Triplets require 3 columns.")
if (triplets < 0).any() or (triplets >= embedding.shape[0]).any():
raise ValueError("Triplet indices must be within the range of the embedding.")
non_unique_rows = _count_unique_items(triplets) != 3
if (non_unique_rows).any():
raise ValueError(f"Triplets must contain unique indices, got {triplets[non_unique_rows]}.")

if isinstance(noise, str):
random_state = check_random_state(random_state)
noise_fun: Callable = getattr(random_state, noise)
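
To see how the row-wise unique count and the three new checks behave, here is a self-contained sketch under stated assumptions. The names `count_unique_per_row` and `check_triplets` are illustrative stand-ins, not cblearn API, and the number of objects is passed directly instead of being read from an embedding.

```python
import numpy as np


def count_unique_per_row(query: np.ndarray) -> np.ndarray:
    """Count distinct items per row: sort each row, then count value changes."""
    sorted_query = np.sort(query, axis=1)
    return (sorted_query[:, 1:] != sorted_query[:, :-1]).sum(axis=1) + 1


def check_triplets(triplets, n_objects: int) -> np.ndarray:
    """Hypothetical validator following the same three checks as the diff."""
    triplets = np.asarray(triplets)
    if triplets.ndim != 2 or triplets.shape[1] != 3:
        raise ValueError("Triplets require 3 columns.")
    if (triplets < 0).any() or (triplets >= n_objects).any():
        raise ValueError("Triplet indices must be within the range of the embedding.")
    non_unique_rows = count_unique_per_row(triplets) != 3
    if non_unique_rows.any():
        raise ValueError(f"Triplets must contain unique indices, got {triplets[non_unique_rows]}.")
    return triplets


queries = np.array([[0, 1, 2],
                    [3, 3, 1],
                    [4, 4, 4]])
counts = count_unique_per_row(queries)  # one count per row

checked = check_triplets([[0, 1, 2], [4, 2, 0]], n_objects=5)
```

Sorting each row first means equal items end up adjacent, so a single vectorized comparison of neighbours replaces a Python loop over rows.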
3 changes: 2 additions & 1 deletion cblearn/datasets/descr/car_similarity.rst
@@ -10,7 +10,8 @@ The people chose one car of three, such that the following statement is true:
All images were found on Wikimedia Commons and are assigned to one of four classes:
ORDINARY CARS, SPORTS CARS, OFF-ROAD/SPORT UTILITY VEHICLES, and OUTLIERS.

The corresponding car images are available with the _`full dataset`.
The corresponding car images are available here in the `full dataset`_.

.. _full dataset: http://www.tml.cs.uni-tuebingen.de/team/luxburg/code_and_data/index.php

**Data Set Characteristics:**
1 change: 1 addition & 0 deletions cblearn/datasets/tests/test_food_similarity.py
@@ -11,6 +11,7 @@ def test_fetch_food(tmp_path):

assert bunch.data.shape == (190376, 3)
assert bunch.image_names.shape == (100, )
assert (bunch.data[:, 1] != bunch.data[:, 2]).all(), "Something went wrong during parsing"
assert bunch.image_names[bunch.data[0, 0]] == 'images/214649bfd7ea489b8daf588e6fed45aa.jpg'

triplets = fetch_food_similarity(data_home=data_home, shuffle=False, return_triplets=True)
8 changes: 4 additions & 4 deletions cblearn/datasets/tests/test_musician_similarity.py
@@ -9,10 +9,10 @@ def test_fetch_musician_similarity(tmp_path):
data_home = tmp_path / 'cblearn_datasets'
bunch = fetch_musician_similarity(data_home=data_home, shuffle=False)

assert bunch.data.shape == (131_970, 3)
assert bunch.judgement_id.shape == (131_970, )
assert bunch.user.shape == (131_970, )
assert bunch.survey_or_game.shape == (131_970, )
assert bunch.data.shape == (118_263, 3)
assert bunch.judgement_id.shape == (118_263, )
assert bunch.user.shape == (118_263, )
assert bunch.survey_or_game.shape == (118_263, )
assert bunch.artist_name.shape == (448, )
assert bunch.artist_id.shape == (448, )
assert bunch.artist_name[bunch.data][0, 0] == 'queen'
33 changes: 33 additions & 0 deletions cblearn/datasets/tests/test_triplet_response.py
@@ -0,0 +1,33 @@
import pytest
import numpy as np

from cblearn.datasets import triplet_response


def test_triplet_response_validates_input():
n = 5 # n objects
t = 10 # n triplets
d = 2 # n dimensions
valid_queries = [
np.random.choice(n, size=3, replace=False)
for _ in range(t)
]
invalid_queries_1 = [
np.random.choice(n, size=5, replace=False)
for _ in range(t)
]
invalid_queries_2 = [
np.random.choice(n + 1, size=3, replace=False)
for _ in range(t)
]
invalid_queries_3 = np.random.uniform(low=-1, high=1, size=(t, 3))
embedding = np.random.normal(size=(n, d))

responses = triplet_response(valid_queries, embedding)
assert responses.shape == (t, 3)
with pytest.raises(ValueError):
triplet_response(invalid_queries_1, embedding)
with pytest.raises(ValueError):
triplet_response(invalid_queries_2, embedding)
with pytest.raises(ValueError):
triplet_response(invalid_queries_3, embedding)
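
One observation on the test inputs above: `invalid_queries_2` samples from `n + 1` indices, so with a small probability (about 0.5^10 for these sizes) none of the ten triplets contains the out-of-range index and no error would be raised. A possible deterministic variant is sketched below. It is an illustration only: the seed, the use of `np.random.default_rng`, and the variable names are assumptions rather than part of the commit.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator for reproducible draws
n = 5   # number of objects
t = 10  # number of triplets

# Valid queries: three distinct indices per row, all in 0..n-1.
valid_queries = np.stack([rng.choice(n, size=3, replace=False)
                          for _ in range(t)])

# Out-of-range queries: force one index to n, which is guaranteed invalid.
out_of_range_queries = valid_queries.copy()
out_of_range_queries[0, 0] = n

# Non-unique queries: repeat an index within the first row.
non_unique_queries = valid_queries.copy()
non_unique_queries[0, 1] = non_unique_queries[0, 0]
```

Each invalid array then violates exactly one condition, which makes it easier to tell which check raised the `ValueError`.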