Preparation of expression matrix #87
Hi James,
We used the log-normalized matrix for the OT computations. The truncated
expression matrix was used in the regulatory regressions.
We didn't do batch correction. This allowed us to use the distance between
batches as a baseline for performance in our geodesic interpolation
computations. We tested the batch effect for each time point, identified
one or two time points with large batch effects, and removed those corrupted
samples.
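(For concreteness, a minimal sketch of that kind of batch-distance baseline, assuming two batches collected at the same time point and already reduced to a common PCA space; the variable names, the POT library, and the exact Wasserstein distance are illustrative choices, not the paper's actual code.)

```python
# Sketch: optimal-transport distance between two batches from the same
# time point, usable as a baseline for geodesic interpolation error.
# `X_batch1` and `X_batch2` are hypothetical (cells x PCs) arrays.
import numpy as np
import ot  # POT: Python Optimal Transport

def batch_distance(X_batch1, X_batch2):
    # Uniform weight on each cell within its batch
    a = np.full(X_batch1.shape[0], 1.0 / X_batch1.shape[0])
    b = np.full(X_batch2.shape[0], 1.0 / X_batch2.shape[0])
    # Pairwise squared Euclidean costs between cells of the two batches
    M = ot.dist(X_batch1, X_batch2, metric="sqeuclidean")
    # Exact OT cost (squared 2-Wasserstein distance under this cost)
    return ot.emd2(a, b, M)
```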
As for your last question, we downsample to a maximum of 15,000 UMIs. We
still need to normalize by UMI count because most of the cells have fewer
than 15,000 UMIs. This expression matrix is similar to the log-normalized
expression matrix (the only remaining difference is the log(1 + x) transform).
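(A minimal sketch of that preprocessing in scanpy, assuming `adata` is an AnnData object holding raw UMI counts; the thresholds are the ones quoted from the STAR Methods, and the scanpy calls are just one way to implement them.)

```python
# Sketch of the described preprocessing on a raw UMI count matrix.
import scanpy as sc

sc.pp.downsample_counts(adata, counts_per_cell=15000)  # cap each cell at 15,000 UMIs
sc.pp.filter_cells(adata, min_counts=2000)             # drop cells with < 2,000 total UMIs
sc.pp.filter_genes(adata, min_cells=50)                # drop genes expressed in < 50 cells
sc.pp.normalize_total(adata, target_sum=1e4)           # transcripts per 10,000 counts
sc.pp.log1p(adata)                                     # log-normalized matrix: log(1 + x)
```

Cells already below the 15,000-UMI cap are untouched by the downsampling step, which is why the per-cell normalization afterwards is still needed.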
Something like SCTransform also works well, as we describe in our newer
work:
https://www.biorxiv.org/content/10.1101/2020.11.12.380675v1
Best,
Geoff
Hi Geoff, thanks for the super fast reply!
Okay, thanks for clarifying.
Sorry, does that mean batch correction should or should not be used? I would prefer to correct the effect rather than throw samples away.
Oh yes, that makes sense, silly me.
Thanks, I will try SCTransform. Additionally, what do you think about using size factors that account for compositional biases? I edited my original post, so you may not have seen my question.
Out of over 100 runs of 10x we threw out one, which was very different from
all our other samples. This was essentially a failed reaction.
I'm not sure what you mean about size factors; can you ask again?
What about in cases where you are pooling cells from multiple donors or even technologies? For example, imagine a developmental series where you use multiple mouse embryos at each time point. In a conventional scRNA-seq analysis you might correct the expression matrix to remove any donor-specific variation.
You divide the counts matrix by the sum of UMIs for each cell, which corrects for unequal sequencing depth across cells. However, this does not correct for compositional differences caused by unbalanced differential expression between samples, as explained here: http://bioconductor.org/books/release/OSCA/normalization.html#normalization-by-deconvolution
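(For what it's worth, library-size scaling is just one choice of per-cell size factor; below is a minimal sketch of swapping in composition-aware factors, assuming `counts` is a cells x genes UMI array and `size_factors` is a per-cell vector computed elsewhere, e.g. exported from scran's computeSumFactors. Both names are placeholders.)

```python
# Sketch: normalize by arbitrary per-cell size factors instead of raw
# library size. `counts` (cells x genes, numpy array) and `size_factors`
# (length = number of cells) are hypothetical inputs.
import numpy as np

def normalize_by_size_factors(counts, size_factors):
    sf = np.asarray(size_factors, dtype=float)
    sf = sf / sf.mean()                 # rescale so factors average to 1
    normalized = counts / sf[:, None]   # divide each cell's counts by its factor
    return np.log1p(normalized)         # mirror the log-normalized matrix

# Plain library-size normalization ("per 10,000 counts") is the special case
# size_factors = counts.sum(axis=1) / 1e4
```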
I am a bit confused about the "preparation of expression matrices" section in the STAR Methods of the optimal transport paper. You define three different expression matrices: the UMI matrix, the log-normalized expression matrix, and the truncated expression matrix. Which of these should be used as the input expression matrix? Additionally, if there are multiple batches of cells, should the expression matrix be batch-corrected beforehand, either by regression or some other method (e.g. fastMNN)? I couldn't find any mention of batch correction in the STAR Methods, apart from:
"The expression matrix was downsampled to 15,000 UMIs per cell. Cells with less than 2000 UMIs per cell in total and all genes that were expressed in less than 50 cells were discarded, leaving 251,203 cells and G = 19,089 genes for further analysis. The elements of expression matrix were normalized by dividing UMI count by the total UMI counts per cell and multiplied by 10,000 i.e., expression level is reported as transcripts per 10,000 counts."
I'm not sure why the data was downsampled and then normalized by UMI count. Doesn't the first correction make the second redundant? Also, is this downsampled/normalized matrix different from the three defined above? If so, should I be using it as the input expression matrix instead? Finally, you use library size as a scaling factor to correct for differences in sequencing depth, but is there a requirement to normalize for compositional biases as well (e.g. using pool-based size factors)?
Thanks,
James