This repository contains the code and instructions needed to reproduce the experiments of the paper Efficient Iterative Programs with Distributed Data Collections. Experiments conducted with the DIQL system are found in this repository, and those conducted with the Emma system are available in this repository.
This repository is written in Scala and targets a Spark 2.4 cluster. It must be compiled and run in a Java 8 environment with Scala 2.11.12.
- Java 8 (tested with Eclipse Temurin 8)
- Spark 2.4.x
- Maven 3.x
- sbt 1.3.x
- Scala 2.11.12 (installed by sbt)
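A quick sanity check of the build environment can look like the following (a minimal sketch; only the command names matter, and the expected versions are the ones listed above):

```bash
java -version           # expect 1.8.x (Java 8)
mvn -version            # expect Maven 3.x
sbt sbtVersion          # expect sbt 1.3.x
spark-submit --version  # expect Spark 2.4.x
```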
All the datasets used for the experiments reported in the paper can be downloaded here: datasets.zip
This archive contains the files of the graphs used in the experiments.
You can download it with the following command:
wget https://cloud.univ-grenoble-alpes.fr/s/4QPG6FMobrAqwnn/download/datasets.zip
unzip datasets.zip
Files to be used to run mu-monoid programs are in the folder `mumonoiddata`.
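As a rough orientation, the file-name patterns used in the run instructions below should be present after extraction. This listing is only a sketch: it assumes the archive unpacks a `mumonoiddata` folder in the current directory, and the comments summarize the patterns described per program later in this document.

```bash
ls mumonoiddata/
# rnd_<n>_<p>.txt        edge lists for the TC programs
# rnd_<n>_<p>_W.txt      weighted edge lists for the SP programs
# start_rnd_<n>_<p>.txt  start nodes for the *Filter programs
# cities_<n>_<p>, start_cities_<n>_<p>.txt   path-planning inputs
# users_<n>, start_movies_<n>.txt            movie-recommendation inputs
# flights_<n>_<p>.txt                        flight inputs
```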
- Compile the assembly JAR of this project. This produces a ready-to-use JAR file in the `target/scala-2.11` folder, with a name matching `*-assembly-*.jar`. Move this file up to the project root and rename it `mumonoid-programs-assembly.jar` (we use this name in the next commands):

sbt assembly
mv target/scala-2.11/mumonoid-programs-assembly-0.1.0-SNAPSHOT.jar mumonoid-programs-assembly.jar
- Copy the assembly JAR to a place accessible by `spark-submit`
- IntelliJ works well: load the `build.sbt` file to create a new project
- If you use Eclipse:
  - Call `sbt eclipse` to generate the project files
  - Import the project into Eclipse
  - Change the Scala Compiler properties of the Eclipse project to use Scala 2.11
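Note that the `eclipse` task is provided by the sbteclipse plugin. The repository may already declare it; if `sbt eclipse` is not recognized, the plugin can be added as follows (the plugin version shown is only an example, not taken from this repository):

```bash
# Append the sbteclipse plugin declaration to the sbt build (only if it is missing)
echo 'addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.2.4")' >> project/plugins.sbt
```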
Below are the commands to run the different programs. These commands require the following arguments:
- `$DATA_FILE`: Absolute path to the input dataset
- `$MASTER_URL`: Spark master URL (see the Spark documentation). Omit when running locally.
- `$NB_PARTITIONS`: Number of partitions in Spark. We set it to the number of available cores in the cluster.
- `--cluster`: Option to indicate that the program is run on a Spark cluster. Omit when running locally.
- `$PROGRAM`: The test class name (for instance TC, SP, ...). The available names are listed for each program below.
- `$PROGRAM`: One of TC, TCNoPdist, SP, SPNoPA, SPNoPdist.
- `$DATA_FILE`: Absolute path to the file containing the graph edges. Files used in the experiments are of the form `rnd_n_p.txt` for TC programs and `rnd_n_p_W.txt` for SP programs.
spark-submit \
--class fr.inria.tyrex.mumonoidPrograms.$PROGRAM \
--driver-memory 40g \
--conf spark.driver.maxResultSize=0 \
mumonoid-programs-assembly.jar \
$DATA_FILE \
--cluster \
--master $MASTER_URL \
--partitions $NB_PARTITIONS
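As stated above, `--cluster` and `--master` are omitted for a local run. A hypothetical local invocation of the TC program could therefore look like the following; the data file path, partition count, and driver memory are placeholders to adapt, not values from the paper:

```bash
spark-submit \
  --class fr.inria.tyrex.mumonoidPrograms.TC \
  --driver-memory 40g \
  --conf spark.driver.maxResultSize=0 \
  mumonoid-programs-assembly.jar \
  /path/to/mumonoiddata/rnd_1000_0.01.txt \
  --partitions 8
```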
- `$PROGRAM`: One of TCFilter, TCFilterNoPdist, SPFilter, SPFilterNoPA, SPFilterNoPdist.
- `$DATA_FILE`: Absolute path to the file containing the graph edges. Files used in the experiments are of the form `rnd_n_p.txt` for TC programs and `rnd_n_p_W.txt` for SP programs.
- `$START_NODES_FILE`: Absolute path to the file containing the start nodes. Files used in the experiments are of the form `start_rnd_n_p.txt`.
spark-submit \
--class fr.inria.tyrex.mumonoidPrograms.$PROGRAM \
--driver-memory 40g \
--conf spark.driver.maxResultSize=0 \
mumonoid-programs-assembly.jar \
$DATA_FILE \
$START_NODES_FILE \
--cluster \
--master $MASTER_URL \
--partitions $NB_PARTITIONS
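For example, a hypothetical cluster run of SPFilter with both input files could look like this; the file paths, master URL, and partition count are placeholders only:

```bash
spark-submit \
  --class fr.inria.tyrex.mumonoidPrograms.SPFilter \
  --driver-memory 40g \
  --conf spark.driver.maxResultSize=0 \
  mumonoid-programs-assembly.jar \
  /path/to/mumonoiddata/rnd_1000_0.01_W.txt \
  /path/to/mumonoiddata/start_rnd_1000_0.01.txt \
  --cluster \
  --master spark://master-host:7077 \
  --partitions 64
```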
- `$PROGRAM`: One of PathPlanning, PathPlanningNoPA, PathplanningNoPdist.
- `$DATA_FILE`: Absolute path to the file containing routes between cities. Files used in the experiments are of the form `cities_n_p`.
- `$START_ROUTES_FILE`: Absolute path to the file containing the starting routes. Files used in the experiments are of the form `start_cities_n_p.txt`.
spark-submit \
--class fr.inria.tyrex.mumonoidPrograms.$PROGRAM \
--driver-memory 40g \
--conf spark.driver.maxResultSize=0 \
mumonoid-programs-assembly.jar \
$DATA_FILE \
$START_ROUTES_FILE \
--cluster \
--master $MASTER_URL \
--partitions $NB_PARTITIONS
- `$PROGRAM`: One of MovieRecommendations, MovieRecommendationsNoPdist.
- `$DATA_FILE`: Absolute path to the file containing users. Files used in the experiments are of the form `users_n`.
- `$START_MOVIES_FILE`: Absolute path to the file containing the start movies. Files used in the experiments are of the form `start_movies_n.txt`.
spark-submit \
--class fr.inria.tyrex.mumonoidPrograms.$PROGRAM \
--driver-memory 40g \
--conf spark.driver.maxResultSize=0 \
mumonoid-programs-assembly.jar \
$DATA_FILE \
$START_MOVIES_FILE \
--cluster \
--master $MASTER_URL \
--partitions $NB_PARTITIONS
- `$PROGRAM`: One of Flights, FlightsNoPdist.
- `$DATA_FILE`: Absolute path to the file containing flights. Files used in the experiments are of the form `flights_n_p.txt`.
spark-submit \
--class fr.inria.tyrex.mumonoidPrograms.$PROGRAM \
--driver-memory 40g \
--conf spark.driver.maxResultSize=0 \
mumonoid-programs-assembly.jar \
$DATA_FILE \
--cluster \
--master $MASTER_URL \
--partitions $NB_PARTITIONS