OPEX is an optimal experimental design framework written in R that helps biologists select the most informative experiments to conduct next, given the experiments conducted so far. This repo demonstrates the application of OPEX to collecting gene expression data of E. coli under the stress of various antibiotics and biocides.
The structure of the code is shown below. The entry point to this project is run.sh, which runs main.R. The folder src stores the implementation of the functions and classes used in main.R. There are seven R scripts in src. The script generate_setting.R generates the settings for running a simulation. The script Simulator.R defines a class named Simulate, which is the workhorse of the simulation. The other scripts are helper modules of the Simulate class. For details of each script, see the document header of each file.
├── main.R
├── run_OPEX_on_your_dataset.R
├── run.sh
└── src
    ├── add_noise.R
    ├── generate_setting.R
    ├── max_dist.R
    ├── prepare_data.R
    ├── screen_index_helper.R
    ├── Simulator.R
    └── update_train_pool.R
The input data is a table in which each row is one culture condition: the first 14 columns define the culture condition and the other 1123 columns represent the gene expression profile measured under that condition. (Genes that did not have sufficient sequencing depth were excluded.)
A culture condition is defined by a binary vector, representing the presence (with 1) or absence (with 0) of 10 biocides and 4 antibiotics: Chlorexidine, Phenol, H2O2, Isopropanol, Bezalkonium_chloride, Ethanol, Glutaraldehyde, Percetic_acid, Sodium_hypochlorite, Povidone_iodine, Kanamycin, Rifampicin, Norfloxacin, Ampicillin.
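As an illustration of this encoding, the following R sketch builds the 14-element vector for a single culture condition. It assumes that 1 marks a compound as present and 0 as absent; the helper `encode_condition` is hypothetical and is not part of the OPEX source.

```r
# Compound names in the order listed above (10 biocides, then 4 antibiotics).
compounds <- c("Chlorexidine", "Phenol", "H2O2", "Isopropanol",
               "Bezalkonium_chloride", "Ethanol", "Glutaraldehyde",
               "Percetic_acid", "Sodium_hypochlorite", "Povidone_iodine",
               "Kanamycin", "Rifampicin", "Norfloxacin", "Ampicillin")

# Hypothetical helper: return a named 0/1 vector for one culture condition.
# Assumption: 1 = compound present, 0 = compound absent.
encode_condition <- function(present) {
  unknown <- setdiff(present, compounds)
  if (length(unknown) > 0) {
    stop("Unknown compound(s): ", paste(unknown, collapse = ", "))
  }
  setNames(as.integer(compounds %in% present), compounds)
}

# A condition containing one biocide and one antibiotic.
condition <- encode_condition(c("Ethanol", "Kanamycin"))
print(condition)
```

Each row of the input table would start with one such vector, followed by the expression values.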
- Step 1: Generate a file that includes the settings for running OPEX. The setting file is named after the sampling method; expert sampling is used in the following example.
cd ./R/src
Rscript generate_setting.R setting
After running the above commands, a file named setting.csv is generated in ./output. The setting file specifies the values of the hyper-parameters: random_seed, exploration frequency, adaptive, start size, add, dataset id, noise, iter_num, and sampling method. For the meaning of these hyper-parameters, see the comments in the generate_setting.R file.
- Step 2: Run the simulation using one of the settings in the file generated in Step 1. The first setting is used in the following example.
cd ./R
bash run.sh setting.csv 1
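The last argument of the command above is the index of the setting to use. How main.R parses its arguments is not shown here, so the following R sketch is only one plausible way to pick a setting by row index. The helper `read_setting`, the toy file, and its column values are hypothetical.

```r
# Toy setting table; the column values are illustrative only.
settings <- data.frame(random_seed = c(1, 2, 3), noise = c(0, 0.1, 0.2))
setting_path <- file.path(tempdir(), "toy_setting.csv")
write.csv(settings, setting_path, row.names = FALSE)

# Hypothetical helper: return the row at 1-based position `index`.
read_setting <- function(path, index) {
  tbl <- read.csv(path)
  index <- suppressWarnings(as.integer(index))
  if (is.na(index) || index < 1 || index > nrow(tbl)) {
    stop("index must be an integer between 1 and ", nrow(tbl))
  }
  tbl[index, , drop = FALSE]
}

# In a script invoked as `Rscript script.R <setting_file> <index>`, both
# values would arrive as strings via commandArgs(trailingOnly = TRUE).
chosen <- read_setting(setting_path, "2")
print(chosen)
```

Passing the index as a string mirrors what a script receives from the command line, which is why the helper converts and validates it before indexing.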
To run all the settings, we used a high-performance computing cluster. The script for submitting all the simulations is as follows:
#!/bin/bash
#SBATCH -p low
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem-per-cpu 1000
#SBATCH -t 1:00:00
#SBATCH -o output/slurm.%N.%j.out
#SBATCH -e output/slurm.%N.%j.err
#SBATCH --array=1-1800
Rscript main.R setting.csv $SLURM_ARRAY_TASK_ID
Upon completion, a folder named setting will be created in ./output. The results generated by this simulation run are stored in the folder expert_sample.
The result is a CSV file named after the hyper-parameter values in the setting. It contains the order in which each culture condition was selected by expert sampling.
To run OPEX on your own biological problem, two tabular datasets are needed. One is a dataset for training a model. The other is a pool of candidate experiments to run. Both datasets are matrices in which each row denotes one data point. In the training dataset, the last column is the output and the other columns are inputs. The pool dataset has one fewer column than the training set because the output column is missing.
The command to run OPEX is as follows:
Rscript run_OPEX_on_your_dataset.R <training_path> <pool_path> <batch_size>
training_path and pool_path are strings giving the paths of the two CSV files.
batch_size is an integer.
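To make the expected layout concrete, the following R sketch writes a toy training set and a toy pool with matching input columns. The column names, sizes, and file names are hypothetical, and whether run_OPEX_on_your_dataset.R expects a header row is an assumption that should be checked against the script.

```r
set.seed(1)
n_inputs <- 3  # hypothetical number of input columns

# Training set: binary inputs followed by the output in the last column.
train <- as.data.frame(matrix(rbinom(5 * n_inputs, 1, 0.5), nrow = 5))
names(train) <- paste0("x", seq_len(n_inputs))
train$y <- rnorm(5)

# Pool of candidate experiments: same inputs, no output column.
pool <- as.data.frame(matrix(rbinom(8 * n_inputs, 1, 0.5), nrow = 8))
names(pool) <- paste0("x", seq_len(n_inputs))

# The pool must have exactly one column fewer than the training set.
stopifnot(ncol(pool) == ncol(train) - 1)

train_path <- file.path(tempdir(), "toy_train.csv")
pool_path <- file.path(tempdir(), "toy_pool.csv")
write.csv(train, train_path, row.names = FALSE)
write.csv(pool, pool_path, row.names = FALSE)
```

The two resulting paths could then be passed as training_path and pool_path, together with a batch size, to the command above.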
If you have any questions about this project, please contact us at tagkopouloslab@ucdavis.edu
See the LICENSE file for license rights and limitations (Apache 2.0).
This work was supported by an NSF award (#1743101).