Preparation of expression matrix #87

Open
jma1991 opened this issue Dec 14, 2020 · 4 comments
@jma1991

jma1991 commented Dec 14, 2020

I am a bit confused about the "preparation of expression matrices" section in the STAR methods of the optimal transport paper. You define three different expression matrices: the UMI matrix, the log-normalized expression matrix, and the truncated expression matrix. Which of these should be used as the input expression matrix? Additionally, if there are multiple batches of cells, should the expression matrix be batch-corrected beforehand, either by regression or some other method (e.g. fastMNN)? I couldn't find any mention of batch correction in the STAR methods, apart from:

"The expression matrix was downsampled to 15,000 UMIs per cell. Cells with less than 2000 UMIs per cell in total and all genes that were expressed in less than 50 cells were discarded, leaving 251,203 cells and G = 19,089 genes for further analysis. The elements of expression matrix were normalized by dividing UMI count by the total UMI counts per cell and multiplied by 10,000 i.e., expression level is reported as transcripts per 10,000 counts."
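(For concreteness, the quoted pipeline maps onto standard scanpy preprocessing calls. A minimal sketch, assuming `adata` is an AnnData object holding the raw UMI count matrix; the scanpy functions are an illustrative stand-in, not the authors' actual code:)

```python
import scanpy as sc

# Assumes `adata` is an AnnData of raw UMI counts (cells x genes),
# e.g. loaded with sc.read_10x_mtx(...).
sc.pp.downsample_counts(adata, counts_per_cell=15000)  # cap each cell at 15,000 UMIs
sc.pp.filter_cells(adata, min_counts=2000)             # discard cells with < 2,000 total UMIs
sc.pp.filter_genes(adata, min_cells=50)                # discard genes expressed in < 50 cells
sc.pp.normalize_total(adata, target_sum=1e4)           # transcripts per 10,000 counts
```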

I'm not sure why the data was downsampled and then also normalized by UMI count; doesn't the first correction make the second redundant? Also, is this downsampled/normalized matrix different from the three matrices defined above? If so, should I be using it as the input expression matrix instead? Finally, you use library size as a scaling factor to correct for differences in sequencing depth, but is there a requirement to normalize for compositional biases as well (e.g. using pool-based size factors)?

Thanks,
James

@geoffschieb
Collaborator

geoffschieb commented Dec 14, 2020 via email

@jma1991
Author

jma1991 commented Dec 14, 2020

Hi Geoff,

Thanks for the super fast reply!

> Hi James, We used the log-normalized matrix for the OT computations. The truncated expression matrix was used in the regulatory regressions.

Okay, thanks for clarifying.

> We didn't do batch correction. This allowed us to use the distance between batches as a baseline for performance in our geodesic interpolation computations. We tested the batch effect for each time-point and identified one or two time-points with large batch effects and removed those corrupted samples.

Sorry, does that mean batch correction should or should not be used? I would prefer to correct the effect and not throw samples away.

> As for your last question, we downsample to a maximum of 15,000 UMIs. We still need to normalize by UMI count because most of the cells have fewer than 15,000 UMIs. This expression matrix is similar to the log-normalized expression matrix (the only remaining difference is the log(1 + x) transform).
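(In other words, applying a log(1 + x) transform to the transcripts-per-10,000 matrix yields the log-normalized matrix used for the OT computations. Continuing the illustrative scanpy sketch above:)

```python
import scanpy as sc

# Starting from the TP10K-normalized `adata` above, the log-normalized
# expression matrix is obtained with a log(1 + x) transform.
sc.pp.log1p(adata)
```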

Oh yes that makes sense, silly me.

> Something like SCTransform also works well, as we describe in our newer work: https://www.biorxiv.org/content/10.1101/2020.11.12.380675v1
>
> Best, Geoff

Thanks, I will try SCTransform.
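(SCTransform itself is an R/Seurat function; on the Python side, scanpy ships a comparable Pearson-residual normalization in its experimental API, sketched below. Treating it as interchangeable with SCTransform is an assumption, not something stated in this thread:)

```python
import scanpy as sc

# Pearson-residual normalization, scanpy's analogue of SCTransform's
# regularized negative-binomial model (scanpy >= 1.9, experimental API).
# Applied to raw UMI counts, not the TP10K matrix.
sc.experimental.pp.normalize_pearson_residuals(adata)
```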

Additionally, what do you think about using size factors that account for compositional biases? I edited my original post, so you may not have seen this question.

@geoffschieb
Collaborator

geoffschieb commented Dec 14, 2020 via email

@jma1991
Author

jma1991 commented Dec 14, 2020

> Out of over 100 runs of 10x we threw out one, which was very different from all our other samples. This was essentially a failed reaction.

What about cases where you are pooling cells from multiple donors, or even multiple technologies? For example, imagine a developmental series in which you use multiple mouse embryos at each time point. In a conventional scRNA-seq analysis you might correct the expression matrix to remove any donor-specific variation.

> I'm not sure what you mean about size factors; can you ask again?

You divide the counts matrix by the sum of UMIs for each cell, which corrects for unequal sequencing depth across cells. However, this does not correct for compositional differences caused by unbalanced differential expression between samples, as explained here: http://bioconductor.org/books/release/OSCA/normalization.html#normalization-by-deconvolution
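(The pool-based deconvolution described in that chapter is implemented in R as scran::computeSumFactors. As a minimal numpy illustration of the general idea of composition-aware size factors, here is the related median-of-ratios estimator; it is not the deconvolution method itself, and it breaks down on sparse scRNA-seq counts, which is precisely why pooling across cells is used:)

```python
import numpy as np

def median_of_ratios_size_factors(counts: np.ndarray) -> np.ndarray:
    """Composition-aware size factors (DESeq-style median-of-ratios).

    `counts` is a dense (cells x genes) array of raw counts. Only genes
    expressed in every cell enter the reference, which is why sparse
    scRNA-seq data needs pool-based deconvolution instead.
    """
    expressed = (counts > 0).all(axis=0)                      # genes with no zeros
    ref = np.exp(np.log(counts[:, expressed]).mean(axis=0))   # geometric-mean profile
    factors = np.median(counts[:, expressed] / ref, axis=1)   # median ratio per cell
    return factors / factors.mean()                           # centre to mean 1

# Library-size factors, by contrast, ignore composition entirely:
# lib_factors = counts.sum(axis=1) / counts.sum(axis=1).mean()
```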
