Skip to content

naimurashid/PurIST

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 

Repository files navigation

output
html_document pdf_document
default
default

PurIST

Here we provide an example dataset and walkthrough to performing predictions in R. You can jump to the end of the document to copy the whole code block needed to run this example.

Disclaimer: PurIST is patent pending and access/use is for not-for-profit research only.

Walkthrough

First load the purist classifier object.

# load object
load("fitteds_public_2019-02-12.Rdata")

# Extract classifier 
classifier = classifs[[1]]

Here is the list of classifier genes, where each row pertains to a gene pair.

# get list of gene pairs
TSPlist = classifier$TSPs[classifier$fit$beta[-1]!=0,]

Printing this list will give you the following

GPR87	REG4
KRT6A	ANXA10
BCAR3	GATA6
PTGES	CLDN18
ITGA3	LGALS4
C16orf74	DDC
S100A2	SLC40A1
KRT5	CLRN3

Now we get our example data matrix. This data matrix should the following

  1. Rows as the genes, each row will be labeled with the gene name.
  2. Columns as samples, with each column labeled with the sample name.

This data matrix should contain all of the genes in the list above.

In general, between-sample normalization is not necessary for our method, as we utilize a rank based approach to generate the predictors for our prediction model.

IMPORTANT NOTE FOR MICROARRAYS

For Microarray expression data, we urge some caution in use given that relative probe expression between genes is not always proportional to the relative biological expression given probe effects. For those reasons we recommend RNA-seq or nanostring for classifier use, however we have seen good performance in prior microarray datasets. In the past we have simply averaged across probes belonging to the same gene prior to use in the classifier.

DO NOT row normalize (standardize within gene to mean 0 variance 1). This will introduce artifacts into the prediction.

IMPORTANT NOTE FOR RNA-seq

If using RNA-seq data, using TPM (transcripts per million) measurements is necessary to compare relative expression per gene. If not available, FPKM (fragments per kilobase per million) data is sufficient. Both these measurements are available from common expression quantitation pipelines such as Salmon.

DO NOT use raw counts or expected read counts.

DO NOT row normalize expression. For example, DO NOT standardize within gene expression to mean 0 variance 1 across samples. This will introduce artifacts into the prediction.

IMPORTANT NOTE FOR NANOSTRING

For nanostring data, we have found that utilizing the usual raw transcript counts provided from common software is sufficient for use. D

DO NOT row normalize expression. For example, DO NOT standardize within gene expression to mean 0 variance 1 across samples. This will introduce artifacts into the prediction.

Load data and apply classifier

Now lets load the example data matrix.

# load the data, objected named 'dat'
load("example.Rdata")

Now we can applied the classifier to the data. We will source the helper functions to facilitate this process first. Then we apply the predictive model.

# source functions
source("functions.R")

# apply classifier
predictions = apply_classifier(data = dat, classifier = classifier)

Interpreting output

The predictions object above contains a n by 3 data frame, where the 1st column is the predicted probability of each sample of belonging to the "basal-like" subtype. The second column is the subtype call based on a predicted probability cutoff of 0.5. Greater than 0.5 indicates the basal-like subtype, and less than 0.5 indicated the classical subtype. The third column is a graded subtype call similar to the PurIST manuscript, indicating the confidence of the call (Strong, Likely, Lean). We have preliminary evidence that the strength of the call may actually reflect the underlying mixture of basal and classical cells in the tumor itself.

The output generated from our example data is given below ('predictions' object).

          Pred_prob_basal    Subtype    Subtype_graded
Sample_1      0.936765062 basal-like strong basal-like
Sample_2      0.991222760 basal-like strong basal-like
Sample_3      0.002768882  classical  strong classical
Sample_4      0.991222760 basal-like strong basal-like
Sample_5      0.092112770  classical  strong classical
Sample_6      0.019727002  classical  strong classical
Sample_7      0.001095705  classical  strong classical
Sample_8      0.205301579  classical  likely classical
Sample_9      0.205301579  classical  likely classical
Sample_10     0.991222760 basal-like strong basal-like

Summary

Putting it all together, once can use the following set of commands in R to generate the above

# load data objects and source functions
load("fitteds_public_2019-02-12.Rdata")
load("example.Rdata")
source("functions.R")

# Extract classifier 
classifier = classifs[[1]]


# apply classifier 
predictions = apply_classifier(data = dat, classifier = classifier)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages