Our paper aims to reproduce the results of the STAMP paper (GitHub) and to use the model to explore how different data preparation methods affect its performance. For comparison, we also implement the NARM model, so that our STAMP results can be measured against a baseline method.
This README aims to make reproducing our results as easy as possible. Follow the steps below.
As per the grading scheme, the files we have modified in some way are shown in the file tree below:
project
│ README.md
│ paths.yaml
│
└───datas
│ │
│ │
│ └───cikm16
│ │ process_cikm_users.py
│ │ process_cikm.py
│
└───narm
│ │
│ │
│ └───datasets
│ │ │ preprocess_user.py
│ │ │ preprocess.py
│ │
│ │ main.py
│
│
└───stamp
│
└───basic_layer
│ │ NN_adam.py (tensorflow versioning)
│
└───data_prepare
│ │ data_loader.py
│ │ data_reader.py
│
└───model
│ │ STAMP_cikm.py
│ │ STAMP_rsc.py
│
└───output
│
└───util
│ │ kfolds.py
│ │ Pooler.py (tensorflow versioning)
│ │ RepeatRatio.py
│ │ save_results.py
│ │ SoftmaxMask.py (tensorflow versioning)
│ │ TensorGather.py (tensorflow versioning)
│
│
│ cmain.py
We use two different environments in this project: one for STAMP and its related files, and another for NARM and its related files.
On a Linux operating system, tf_env.yml can be used to set up the environment, but to ensure a smooth experience we recommend manually installing the following packages instead.

STAMP environment:

python==3.9
tensorflow==2.12
dill
numpy
pyyaml

NARM environment:

python==3.8
torch==1.5.0
torchvision==0.6.0
tqdm
pip
numpy
pillow
openssl
cudatoolkit
xz
zlib
For everything to work, update the paths in paths.yaml. The yoochoose path can be added after downloading the required file in step 2.
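For illustration only, a filled-in paths.yaml might look something like the following. The key names and locations below are invented for this example; use the keys that already exist in your paths.yaml rather than these:

```yaml
# Illustrative only: the real key names are those already present in paths.yaml.
diginetica_raw: /home/user/project/datas/cikm16/raw
yoochoose_raw: /home/user/project/datas/rsc15/raw
```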
Our paper leverages two datasets: Diginetica and Yoochoose. After downloading them, put them in the folder datas/ and process them with process_rsc.py and process_cikm.py respectively.
To download YOOCHOOSE, follow the instructions given here up to step 2 (step 3 is not needed, since the data is already provided in the datas folder as .data files): https://github.com/RUCAIBox/RecSysDatasets/blob/master/conversion_tools/usage/YOOCHOOSE.md From the downloaded files, keep yoochoose-clicks.dat.
To get the required data files, run the following commands:
python -u datas/rsc15/process_rsc.py
python -u narm/datasets/preprocess.py \
--dataset yoochoose
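These preprocessing scripts follow the filtering scheme common to session-based recommendation pipelines: very short sessions and rarely occurring items are dropped. As a rough, illustrative sketch (the thresholds and function name here are conventions of the field, not read from the scripts):

```python
from collections import Counter

def filter_sessions(sessions, min_session_len=2, min_item_support=5):
    """Illustrative session filtering: drop items that occur fewer than
    min_item_support times, then drop sessions shorter than min_session_len."""
    item_counts = Counter(item for session in sessions for item in session)
    filtered = []
    for session in sessions:
        kept = [i for i in session if item_counts[i] >= min_item_support]
        if len(kept) >= min_session_len:
            filtered.append(kept)
    return filtered

sessions = [[1, 2, 1], [3], [1, 2], [4, 1]]
print(filter_sessions(sessions, min_item_support=3))  # → [[1, 1]]
```

Only item 1 occurs often enough to survive, and only the first session remains long enough afterwards.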
Check the Data field here to download everything: https://competitions.codalab.org/competitions/11161#learn_the_details-data2
The file we need is train-item-views.csv
Once the file is downloaded and placed under datas/cikm16/raw, run the following two snippets. The first divides the data by session and the second by user. We also save the user IDs of the training and testing sets so that NARM uses the same train/test split.
python -u datas/cikm16/process_cikm.py
python -u datas/cikm16/process_cikm_users.py
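The user-based split performed above can be sketched roughly as follows; the function and variable names are illustrative, not taken from process_cikm_users.py:

```python
import random

def split_by_user(user_sessions, test_frac=0.2, seed=42):
    """Illustrative user-level split: assign whole users to train or test,
    so no user's sessions appear on both sides, and return the test-user IDs
    so the same split can be reused for NARM."""
    users = sorted(user_sessions)
    rng = random.Random(seed)
    rng.shuffle(users)
    n_test = int(len(users) * test_frac)
    test_users = set(users[:n_test])
    train = {u: s for u, s in user_sessions.items() if u not in test_users}
    test = {u: s for u, s in user_sessions.items() if u in test_users}
    return train, test, test_users

data = {u: [[u, u + 1]] for u in range(10)}
train, test, test_users = split_by_user(data)
print(len(train), len(test))  # → 8 2
```

Splitting at the user level (rather than the session level) keeps all of a user's behavior on one side of the split.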
Files will appear under narm/datasets
python -u narm/datasets/preprocess.py --dataset diginetica
python -u narm/datasets/preprocess_user.py
python -u cmain.py \
-m stamp_cikm \
-d cikm16 \
--epoch 30 \
--reload
python -u narm/main.py \
--train_path 'path to train_session.txt in narm/datasets/diginetica' \
--test_path 'path to test_session.txt in narm/datasets/diginetica' \
--diginetica \
--checkpoint 'checkpoint name'
For the 64 version:
python -u cmain.py \
-m stamp_rsc \
-d rsc15_64 \
--epoch 30
For the 4 version:
python -u cmain.py \
-m stamp_rsc \
-d rsc15_4 \
--epoch 30
For the 64 version:
python -u narm/main.py \
--train_path 'Path to narm/datasets/yoochoose1_64/train.txt' \
--test_path 'Path to narm/datasets/yoochoose1_64/test.txt' \
--checkpoint 'checkpoint name'
For the 4 version:
python -u narm/main.py \
--train_path 'Path to narm/datasets/yoochoose1_4/train.txt' \
--test_path 'Path to narm/datasets/yoochoose1_4/test.txt' \
--checkpoint 'checkpoint name'
We were only able to implement the user-based split on the Diginetica dataset, since Yoochoose does not include user IDs.
python -u cmain.py \
-m stamp_cikm \
-d cikm16 \
--epoch 30 \
--reload \
--user_split
python -u narm/main.py \
--train_path 'path to train_user.txt in narm/datasets/diginetica' \
--test_path 'path to test_user.txt in narm/datasets/diginetica' \
--checkpoint 'checkpoint name' \
--diginetica \
--user_split
K-fold cross-validation has been implemented for the STAMP model on both datasets. The arguments below use an @K value of 10 (--cutoff 10) and 5 folds (--kfolds 5).
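The fold construction in util/kfolds.py can be sketched as follows; this is an illustrative index split, not the script's actual code:

```python
def kfold_indices(n, k=5):
    """Partition range(n) into k contiguous, near-equal folds and yield
    (train_indices, validation_indices) pairs, one per fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

for train, val in kfold_indices(10, k=5):
    print(val)  # each fold serves as validation exactly once
```

Every sample appears in exactly one validation fold, and each model is trained on the remaining k-1 folds.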
python cmain.py \
-m stamp_cikm \
-d cikm16 \
--epoch 30 \
--reload \
--kfolds 5 \
--cutoff 10
Equivalently to the Diginetica parameters, with either rsc15_4 or rsc15_64 as the -d argument and stamp_rsc as the -m argument.
python cmain.py \
-m stamp_rsc \
-d rsc15_4 \
--epoch 30 \
--reload \
--kfolds 5 \
--cutoff 10
For the STAMP model, testing is done together with training; just check the validation results of the final epoch.
To run NARM for testing, the following commands can be used:
python main.py \
--test \
--topk 20 \
--checkpoint 'checkpoint to the model' \
--train_path 'training data path' \
--test_path 'testing data path'
The dataset is chosen by selecting the corresponding train and test files. To test on Diginetica, however, add the --diginetica flag to the run command, and to test the user-based split, also add the --user_split flag.
NB! The --user_split flag can only be used together with the --diginetica flag, since we only ran experiments with that dataset.
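The --topk argument controls the @K cutoff of the evaluation. The standard Recall@K and MRR@K metrics used in session-based recommendation can be sketched as follows (an illustrative implementation, not NARM's exact code):

```python
def recall_mrr_at_k(ranked_lists, targets, k=20):
    """Compute Recall@K and MRR@K. For each session, ranked_lists holds the
    recommended item IDs in rank order and targets the true next item."""
    hits, reciprocal_ranks = 0, 0.0
    for ranked, target in zip(ranked_lists, targets):
        topk = ranked[:k]
        if target in topk:
            hits += 1
            reciprocal_ranks += 1.0 / (topk.index(target) + 1)
    n = len(targets)
    return hits / n, reciprocal_ranks / n

recs = [[5, 3, 9], [7, 2, 4], [1, 8, 6]]
truth = [3, 4, 0]
print(recall_mrr_at_k(recs, truth, k=3))
```

Recall@K counts how often the true next item appears in the top K recommendations, while MRR@K additionally rewards placing it near the top of the list.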