EvalRS-CIKM-2022 : Evaluation Loop

EvalRSRunner (defined in EvalRSRunner.py) is a class that encapsulates the evaluation approach for EvalRS (Bootstrapped Nested Cross-Validation). Since this challenge is a code competition on a public dataset, we could not rely on unseen test data to produce a final leaderboard.

Our approach is illustrated in the following diagram: subsets of the original dataset are randomly designated as train and test set; EvalRSRunner will automatically feed them to the model object you provide, and the predictions will be scored with RecList. Sampling is per-user, and testing follows the "leave-one-out" principle: for each user, we pick one track and hold it out in the test set as the target item for the predictions.

Loop explanation

This procedure will be repeated n times, and your average scores will be uploaded at the end (the library will take care of it) and determine your position on the leaderboard. The n folds are not guaranteed to have disjoint sets of users, as they are generated randomly in a stateless fashion: each round in the evaluation loop is independent from the others. Due to the stochastic nature of the loop, scores will vary slightly between runs, but the organizing committee will still be able to statistically evaluate your code submission in the light of your scores. As stated in the rules, since this is a code competition, reproducibility is essential: please, take your time to make sure that you understand the submission scripts and that your final project is easily reproducible from scratch.
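
For intuition only, here is a minimal sketch of the per-user leave-one-out idea, assuming an interaction dataframe with a user_id column and a unique index; this is not the code used by EvalRSRunner, which performs the split for you.

import pandas as pd

# Purely illustrative: hold out one random listening event per user as the
# test target, and keep all remaining events as training data.
def leave_one_out_split(events: pd.DataFrame, seed: int = 42):
    test = events.groupby('user_id', group_keys=False).apply(
        lambda user_events: user_events.sample(n=1, random_state=seed)
    )
    train = events.drop(test.index)
    return train, test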

As demonstrated by the notebook and submission.py, you do not have to worry about any of the above implementation details: folds are generated for you and predictions are automatically compared to ground truths by the provided code and RecList.

Note: when you instantiate the ChallengeDataset, num_folds determines how many times the evaluation procedure is run. Default is 4, and that is the value you should use for all your leaderboard submissions.
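
For reference, this is roughly how the dataset abstraction is instantiated; the import path below is an assumption, so check submission.py and the notebooks for the exact one.

# The import path is an assumption: adjust it to wherever ChallengeDataset is
# defined in your copy of the repository (see submission.py).
from evaluation.EvalRSRunner import ChallengeDataset

# num_folds controls how many evaluation rounds are run: keep the default of 4
# for all leaderboard submissions. The dataset is then passed to EvalRSRunner
# together with your model (see submission.py for the exact call).
dataset = ChallengeDataset(num_folds=4)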

How to use the provided abstractions

Main abstractions

We provide two basic abstractions for the Data Challenge:

  • an EvalRSRunner class to run the evaluation and submit the scores to the board. You should not change this class in any way;
  • an EvalRSRecList class, implementing a RecList class from the open-source RecList package. You should not change this class in any way, but you are encouraged to i) understand it, and ii) extend it with your own tests for your paper (see the notebooks folder for a working example). See below for an explanation of the provided tests.

We also provide out-of-the-box utility functions and a template script as an entry point for your final submission (submission.py). As long as your training and prediction logic respects the provided model API, you should be able to easily adapt existing models to the Data Challenge.

During the leaderboard phase, you can submit your scores to the leaderboard by running the code however you prefer. However, at submission time, your repository needs to comply with the rules in the general README. Remember: if we are not able to reproduce your results and statistically verify your scores, you won't be eligible for the prize.

Build your own evaluation loop

To make a new submission to the leaderboard you are only required to build one new object: a model class with train and predict methods, such as the one contained in the submission folder.

First, you should inherit RecModel and implement the train method: train should contain the model training code, including any necessary hyper-parameter optimization; if you wish to pass additional parameters to the training function, you can always add them in your init and retrieve them later. You are free to use any modelling technique you want (collaborative filtering, two-tower models, etc.) as long as your code complies with the Data Challenge rules (no test leaking, hyper-parameter search and compute time within the budget, etc.).

Second, when the training is done, you should wrap your predictions in a method predict: train should store the trained model inside the class, and predict will use that model to provide predictions. The predict method accepts as input a dataframe containing all the user IDs the model is asked to make predictions for.

For each user_id, we expect k predictions (where k=100 for the competition): you can play around with different values of k for debugging purposes, but only k=100 will be accepted for the leaderboard. The expected prediction output is a dataframe with user_id as index and k columns, each representing the ranked recommendations (the 0th column being the highest rank). In addition, the predictions are expected to be in the same order as the user_id in the input dataframe. An example of the desired dataframe format for n user_ids and k predictions per user is shown in the table below. Note that if your model provides fewer than k predictions for a given user_id, the empty columns should be filled with -1.

user_id      0            ...   k-1
user_id_1    track_id_1   ...   -1
user_id_2    track_id_4   ...   track_id_5
...          ...          ...   -1
user_id_n    track_id_18  ...   track_id_9

Please note that the number of rows in the dataframe returned by predict should match the number of user IDs in the input, and that the user IDs fed to predict by the evaluation loop are unique.
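
To make the expected format concrete, here is a small hypothetical helper (the function name and inputs are ours, not part of the provided code) that pads per-user rankings with -1 and builds the dataframe described above.

import pandas as pd

# Illustrative only: turn a {user_id: [ranked track ids]} mapping into the
# expected prediction dataframe, padding with -1 when fewer than k items are available.
def to_prediction_df(ranked_items_per_user: dict, user_ids: list, k: int = 100) -> pd.DataFrame:
    rows = []
    for user_id in user_ids:  # keep the same order as the input user IDs
        items = list(ranked_items_per_user.get(user_id, []))[:k]
        items = items + [-1] * (k - len(items))  # pad missing slots with -1
        rows.append([user_id] + items)
    return pd.DataFrame(rows, columns=['user_id', *[str(i) for i in range(k)]]).set_index('user_id')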

Here is an example of how to implement the class and wrap the trained model behind the expected interface:

import numpy as np
import pandas as pd

# RecModel is the base class provided with the challenge starter code
class MyModel(RecModel):
    
    def __init__(self, items: pd.DataFrame, top_k: int=100, **kwargs):
        super(MyModel, self).__init__()
        self.items = items
        self.top_k = top_k
        # kwargs may contain additional arguments in case, for example, you
        # have data augmentation strategies
        print("Received additional arguments: {}".format(kwargs))
        return

    def train(self, train_df: pd.DataFrame):
        """
        Implement here your training logic. Since our example method is a simple random model,
        we actually don't use any training data to build the model, but you should ;-)

        At the end of training, make sure the class contains a trained model you can use in the predict method.
        """
        print(train_df.head(1))
        print("Training completed!")
        return 

    def predict(self, user_ids: pd.DataFrame) -> pd.DataFrame:
        """
        
        This function takes as input all the users that we want to predict the top-k items for, and 
        returns all the predicted songs.

        While this example is just a random generator, the same logic in your implementation 
        would allow for batch predictions of all the target data points.
        
        """
        k = self.top_k
        num_users = len(user_ids)
        pred = self.items.sample(n=k*num_users, replace=True).index.values
        pred = pred.reshape(num_users, k)
        pred = np.concatenate((user_ids[['user_id']].values, pred), axis=1)
        return pd.DataFrame(pred, columns=['user_id', *[str(i) for i in range(k)]]).set_index('user_id')
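
For clarity, here is a small self-contained usage example of the class above, with made-up toy dataframes (in the actual evaluation loop, the folds and the user IDs are fed to your model by EvalRSRunner):

items = pd.DataFrame(index=['track_1', 'track_2', 'track_3'])
my_model = MyModel(items=items, top_k=2)
my_model.train(train_df=pd.DataFrame({'user_id': ['u1', 'u2'], 'track_id': ['track_1', 'track_3']}))
predictions = my_model.predict(user_ids=pd.DataFrame({'user_id': ['u1', 'u2']}))
print(predictions.shape)  # (2, 2): one row per input user, top_k ranked columns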

As your final code submission, you are also required to contribute a new test: make sure to include an extended RecList in your code.
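
As a starting point, here is a hedged skeleton of what such an extension might look like; the decorator name, its arguments, and the import path are assumptions based on the beta RecList package, so mirror the existing tests in EvalRSRecList for the exact API.

# Skeleton only: names below are assumptions, check EvalRSRecList and the
# reclist beta package for the exact decorator and attributes to use.
from reclist.abstractions import rec_test  # assumed location of the decorator

class MyEvalRSRecList(EvalRSRecList):  # extend the provided class, do not modify it

    @rec_test(test_type='MY_CUSTOM_TEST')
    def my_custom_test(self):
        # Compute a score from the predictions and ground truths exposed by the
        # parent class (see how the existing tests access them) and return it.
        return 0.0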

Please see the notebooks folder for a walk-through on the evaluation engine, and check the instructions in the main README and the template script in submission.py to understand how to make the final code submission.

Last.FM RecList

We prepared a set of quantitative, slice-based, and behavioral tests for the Last.FM use case, inspired by our previous work on the topic and the existing literature on fairness, evaluation and biases.

We detail here the individual tests implemented in the EvalRSRecList class, and provide some context on why they are chosen and how they are operationalized: of course, feel free to check the code for implementation details. Once the evaluation script obtains the test score for each of the individual tests below, a macro-score is automatically calculated for the leaderboard: check the logic for the aggregation below.

Please note that when you install the dependencies in requirements.txt you will automatically get the appropriate beta version of RecList, which is needed to properly run this code.

Individual tests

In this Data Challenge, you will be asked to train a user-item recommendation model: given historical data on users’ music consumption, your model should recommend the top k songs to a set of test users - generally speaking, given a user U, if the held-out song for U is contained in the top k suggestions, the model is successful in its predictions.

We now explain in detail the tests that we included in EvalRSRecList to provide a rounded evaluation of recommender systems. We divide our list into the three subgroups as per our paper, and explain the motivations behind each one.

Information Retrieval metrics

  • Mean Reciprocal Rank (MRR): MRR provides a good measure of where the first relevant element is ranked in the output list. Besides being considered a standard rank-aware evaluation metric, we chose MRR as it is particularly simple to compute and to interpret.

  • Hit Ratio (HR): Recall at k (k=100) is the proportion of relevant items found in the top-k recommendations. Together with MRR, it is also a standard evaluation metric for Information Retrieval (a sketch of both metrics follows this list).
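
Both metrics can be sketched as follows (illustrative, not the exact RecList implementation), assuming we already know, for each test user, the 1-based rank of the held-out track in the model's top-k list (0 if the track is missing):

import numpy as np

def mrr(ranks: np.ndarray) -> float:
    # mean of the reciprocal ranks; users whose held-out track is not in the top k contribute 0
    reciprocal = np.where(ranks > 0, 1.0 / np.maximum(ranks, 1), 0.0)
    return float(reciprocal.mean())

def hit_ratio(ranks: np.ndarray) -> float:
    # proportion of users whose held-out track appears anywhere in the top k
    return float((ranks > 0).mean())

print(mrr(np.array([1, 0, 4])), hit_ratio(np.array([1, 0, 4])))  # ~0.417, ~0.667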

Information Retrieval metrics on a per-group or slice basis

We are interested in testing models through a number of behavioral tests addressing known issues for recommender systems, for instance: fairness (e.g. a model should have equal outcomes for different groups), robustness (e.g. a model should also produce good outcomes for long-tail items, such as items with less history or belonging to less represented categories), and industry-specific use cases (e.g. in the case of music, your model should not consistently penalize niche or simply less known artists).

For an overview of fairness in ranking, see Yang and Stoyanovich 2016, Castillo 2019, and Zehlike et al. 2021; for a discussion of robustness in collaborative recommendations, see O’Mahony 2018. For a discussion of specific behavioral testing, with a focus on e-commerce, see Tagliabue et al. 2022.

Slice-based tests all rely on the False Positive Rate (FPR), defined as the ratio between false positives and the sum of false positives and true negatives.

For tests where the partition of the test set is binary (see Gender balance below), the final score is the difference between the FPR obtained on the relevant slice and the FPR obtained on the original test set (i.e. the general population). For tests where the partition of the test set is n-ary (see Artist popularity below), the final test score is computed from the difference between the FPR obtained on each slice and the FPR obtained on the original test set (i.e. ONE-VS-MANY).
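
To illustrate the aggregation logic (a sketch, not the exact RecList code), assume a hypothetical per-user dataframe with an fpr column holding each user's contribution to the False Positive Rate and a slice column with the group the user belongs to:

import pandas as pd

def binary_slice_score(per_user: pd.DataFrame, relevant_slice: str) -> float:
    # difference between the FPR on the relevant slice and the FPR on the full test set
    global_fpr = per_user['fpr'].mean()
    slice_fpr = per_user.loc[per_user['slice'] == relevant_slice, 'fpr'].mean()
    return float(slice_fpr - global_fpr)

def nary_slice_score(per_user: pd.DataFrame) -> float:
    # ONE-VS-MANY: difference between each slice's FPR and the global FPR, then averaged
    global_fpr = per_user['fpr'].mean()
    per_slice_fpr = per_user.groupby('slice')['fpr'].mean()
    return float((per_slice_fpr - global_fpr).mean())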

The slice-based tests considered for the final scores are:

  • Gender balance. This test is meant to address fairness with respect to gender, a known problem for recommender systems Saxena and Jein 2020. Given that the dataset only provides binary gender, in this test you will be asked to minimize the difference between the FPR obtained on users who reported Female as their gender and the FPR obtained on the original test set: the smaller the difference, the fairer the model with respect to potential gender biases.

  • Artist popularity. This test is meant to address a known problem in music recommendation: recommender systems often penalize niche or simply less known artists, and users who are less interested in very popular content Kowald et al. 2020, Celma and Cano 2008. This is particularly important since several music streaming services (e.g. Spotify, Tidal) also act as marketplaces where artists promote their music. In this case, since splitting the test set into two would draw an arbitrary line between popular vs. unpopular artists, failing to capture the actual properties of the distribution, we divided the test set through logarithmic bucketing (i.e. logarithmic bins in base 10). This test is therefore an example of an n-ary partition of the test set. Consequently, you will be asked to compute the difference between the FPR obtained on each slice and the FPR obtained on the original test set (i.e. ONE-VS-MANY); the final number representing your model’s score on the test will be the mean of all the values obtained.

  • User country. Music consumption is subject to many country-dependent factors, such as language differences, local sub-genres and styles, local licensing and distribution laws, cultural influences of local traditional music, etc. Since, as some have argued, digitization has led to more diverse cultural markets Bello and Garcia 2021, these factors have deep implications for how people listen to music and for how artists, labels and streaming platforms go to market. In this test, we sliced the test set by selecting the top-10 countries based on the number of users.

  • Song popularity. This test is meant to make sure that your model performs adequately both on most-listened tracks and on songs with fewer listening events. In this category you will find both less popular songs and newer songs and therefore the test is designed to address both robustness to long tail items and cold-start scenarios. Also in this case, to avoid setting an arbitrary threshold for popularity, we used logarithmic bucketing in base 10 to divide the test set into bins.

  • User history. Users with a long history vs. users with a short history: the test can be viewed as a robustness/cold-start test. User history is operationalized in terms of play counts (i.e. the sum of play counts per artist for each user). Also in this case, to avoid setting an arbitrary threshold, we used logarithmic bucketing in base 10 to divide the test set into bins (a sketch of this bucketing follows this list).
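
Here is an illustrative base-10 bucketing (not the exact RecList code): items or users are binned by the order of magnitude of their play counts, so the long tail is not split by an arbitrary threshold.

import numpy as np
import pandas as pd

def log10_bucket(play_counts: pd.Series) -> pd.Series:
    # bucket 0: 1-9 plays, bucket 1: 10-99 plays, bucket 2: 100-999 plays, and so on
    return np.floor(np.log10(play_counts.clip(lower=1))).astype(int)

print(log10_bucket(pd.Series([3, 42, 980, 15000])).tolist())  # [0, 1, 2, 4]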

Behavioral and qualitative tests

  • Be less wrong. It is important that recommender systems maintain a reasonable standard of relevance even when the predictions are not accurate. For instance, let’s say that the ground truth for a recommendation is the rap song ‘Humble’ by Kendrick Lamar and that our recommendation system does not get it right. Our recommender might recommend another rap song from the same year, such as ‘The story of O.J.’ by Jay-Z, or it might recommend a famous pop song from the top chart of that year, such as ‘Shape of You’ by Ed Sheeran. There is still a substantial difference between these two outcomes, as the first one is closer to the ground truth than the second. Since this has a great impact on the overall user experience, it is desirable to test and measure model performance in scenarios like the one just described. In this test, you will be asked to use the latent space of tracks and to report the average distance between the embeddings of the items chosen by your model and those of the ground truth items. Distance is measured in terms of cosine similarity.

  • Latent diversity: Diversity is closely tied with the maximization of marginal relevance. In a nutshell, diversity is about retrieving items that are not too similar to each other, and it is a way to acknowledge uncertainty about user intent and to address user utility in terms of discovery (see Drosou et al. 2017). Diversity is often considered a partial proxy for fairness, and it is an important measure of the performance of recommender systems in real-world scenarios Kunaver and Požrl 2017. In this test we address diversity through the latent space of tracks provided along with the dataset, by testing the model’s density, where density is defined as the sum of the differences between each point in the prediction space and the mean of the prediction space. Additionally, in order to also account for the “correctness” of the prediction vectors, we compute a bias, defined as the distance between the ground truth vector and the mean of the prediction vectors, and weight it to penalize high bias. The final score is computed as 0.3 * diversity - 0.7 * bias, where 0.3 and 0.7 are weights added to balance diversity and connectedness (a sketch of both latent-space scores follows this list).
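
Under the definitions above (our own reading, not the exact RecList code), the two latent-space scores could be sketched as follows, where pred_vectors holds the embeddings of the k recommended tracks for one user and truth_vector the embedding of the held-out track:

import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def be_less_wrong(pred_vectors: np.ndarray, truth_vector: np.ndarray) -> float:
    # average cosine distance between the recommended tracks and the ground truth track
    return float(np.mean([cosine_distance(v, truth_vector) for v in pred_vectors]))

def latent_diversity_score(pred_vectors: np.ndarray, truth_vector: np.ndarray) -> float:
    centroid = pred_vectors.mean(axis=0)
    diversity = float(np.linalg.norm(pred_vectors - centroid, axis=1).sum())  # spread of the predictions
    bias = float(np.linalg.norm(truth_vector - centroid))                     # distance of the truth from the predictions' mean
    return 0.3 * diversity - 0.7 * bias                                       # weights from the test definition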

Please note that the RecList used by the evaluation script may (and actually should, since your final code submission requires at least one custom test) contain additional tests on top of the ones that contribute to the leaderboard score. You can, in fact, extend the RecList with as many tests as you want to write your paper, debug your system, or uncover new data insights: remember, EvalRS is about testing as much as scoring! However, only the tests listed above are included in the leaderboard calculation.

Aggregating the scores into the leaderboard macro-score

As explained in the rules, we adopt a two-phase timeline for the leaderboard scoring:

  • in the first phase, scores of individual tests are simply averaged to get the macro-score for the leaderboard (your submission e-mail will still contain detailed information about each test; a toy sketch of this plain averaging follows this list);
  • at the start of the second phase (Sept. 1st), you will be required to pull an updated evaluation script from this repository: the new script will contain a novel aggregation function, which will be openly shared on Slack. This function will consider the scores for individual tests across all teams in the first phase, and act as an equalizer between tests - i.e. if a test is easy for everybody, its importance will be downplayed in the macro-score. Only scores obtained in the second phase are considered for the final prizes.
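
For the first phase, the aggregation is a plain average of the individual test scores; the test names and values below are made up, purely for illustration.

import numpy as np

test_scores = {'HIT_RATE': 0.015, 'MRR': 0.004, 'BEING_LESS_WRONG': 0.42}  # made-up example values
macro_score = float(np.mean(list(test_scores.values())))  # simple average = phase-1 leaderboard score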

Practically, you won't have to change anything in your model code between the two phases, as everything is handled magically for you by the provided abstractions. As in phase 1, we encourage you to submit often in phase 2 as well.

The final aggregation function is available in the repository, and will be detailed in an upcoming paper as well.

Challenge two phases