From 143978abb071f910bafd5c1b73fa185669ded058 Mon Sep 17 00:00:00 2001 From: chuyaowangMAC Date: Mon, 2 Sep 2024 09:53:46 +0200 Subject: [PATCH] Notes update 2024-09-02 --- .../MLE vs. Bayesian Parameter Estimation.md | 2 +- Biology/Concepts/Binding Site Maps.md | 46 ++ Biology/Concepts/Cell Cycle.md | 19 + Biology/Concepts/Co-Accessible Peaks.md | 61 ++ Biology/Concepts/FAIRE-seq.md | 105 +++ Biology/Concepts/MYC Signaling.md | 61 ++ Biology/Concepts/Promoter.md | 2 +- .../Concepts/Regulatory Candidate Genes.md | 8 +- Biology/Concepts/mRNA.md | 20 +- Biology/Concepts/scATAC-seq.md | 39 +- Biology/Concepts/scRNA-seq.md | 300 ++++++- .../0. Introduction to Bioinformatics VU.md | 1 + .../1. The introduction of introduction.md | 25 + .../2. GRN Programming Environment.md | 248 ++++++ ...tegies in an Era Defined by Data and AI.md | 1 + ...ference and in silico gene perturbation.md | 278 ++++++- ...ysis of signed gene regulatory networks.md | 6 +- ...l data using the Information Bottleneck.md | 15 +- ... enhancers and gene regulatory networks.md | 1 + ...associated with prostate cancer relapse.md | 95 +++ LCMS/Lipid_slides.md | 768 ++++++++++++++++++ .../Concepts/Bagging Ridge Regression.md | 40 + .../Concepts/Bayesian Ridge Regression.md | 140 ++++ .../Concepts/Cluster Label Transfer.md | 125 +++ Machine Learning/Concepts/Confusion Matrix.md | 34 + .../Concepts/Gamma Distribution.md | 0 Machine Learning/Concepts/Gamma Function.md | 0 .../Concepts/Gaussian Smoothing.md | 63 ++ .../Concepts/K-nearest Neighbor Imputation.md | 58 ++ Machine Learning/Concepts/Kronecker Delta.md | 0 .../Concepts/L1 Regularization.md | 9 + .../Concepts/L2 Regularization.md | 266 ++++++ .../Concepts/L3 Regularization.md | 0 .../Concepts/Mean Squared Error.md | 69 ++ .../Concepts/Multicollinearity.md | 137 ++++ .../Concepts/Network Adjacency Matrix.md | 106 +++ .../Concepts/Numerical Overflow.md | 56 ++ .../Concepts/Pearson Correlation.md | 63 ++ .../Concepts/Residual Sum of Squares.md | 14 + Machine Learning/Concepts/Softmax.md | 166 ++++ .../External Linux on SSD.md | 153 ++++ Programming Environment/Ubuntu 2204.md | 232 ++++++ Programming Environment/bash.md | 12 +- Programming Environment/conda.md | 60 +- Programming Environment/jupyter.md | 77 +- Python/Python Magic Methods.md | 3 +- Python/celloracle.md | 8 + Python/ipykernel.md | 6 + Python/pip.md | 5 + Python/scanpy.md | 3 + R/R.md | 60 +- R/Ubuntu R.md | 11 + R/cicero.md | 3 + R/irkernel.md | 14 + R/monocle3.md | 3 + R/openssl.md | 19 + note project.md | 23 + 57 files changed, 4111 insertions(+), 28 deletions(-) create mode 100644 Biology/Concepts/Binding Site Maps.md create mode 100644 Biology/Concepts/Cell Cycle.md create mode 100644 Biology/Concepts/Co-Accessible Peaks.md create mode 100644 Biology/Concepts/FAIRE-seq.md create mode 100644 Biology/Concepts/MYC Signaling.md create mode 100644 Biology/Introduction to Bioinformatics VU/0. Introduction to Bioinformatics VU.md create mode 100644 Biology/Introduction to Bioinformatics VU/1. The introduction of introduction.md create mode 100644 Biology/Multiomics Competition/2. 
GRN Programming Environment.md rename {Machine Learning => Biology/Seminars}/Biopharma Tech Strategies in an Era Defined by Data and AI.md (98%) create mode 100644 Biology/Seminars/Single-cell ATAC and RNA sequencing reveal pre-existing and persistent cells associated with prostate cancer relapse.md create mode 100644 LCMS/Lipid_slides.md create mode 100644 Machine Learning/Concepts/Bagging Ridge Regression.md create mode 100644 Machine Learning/Concepts/Bayesian Ridge Regression.md create mode 100644 Machine Learning/Concepts/Cluster Label Transfer.md create mode 100644 Machine Learning/Concepts/Confusion Matrix.md create mode 100644 Machine Learning/Concepts/Gamma Distribution.md create mode 100644 Machine Learning/Concepts/Gamma Function.md create mode 100644 Machine Learning/Concepts/Gaussian Smoothing.md create mode 100644 Machine Learning/Concepts/K-nearest Neighbor Imputation.md create mode 100644 Machine Learning/Concepts/Kronecker Delta.md create mode 100644 Machine Learning/Concepts/L1 Regularization.md create mode 100644 Machine Learning/Concepts/L2 Regularization.md create mode 100644 Machine Learning/Concepts/L3 Regularization.md create mode 100644 Machine Learning/Concepts/Mean Squared Error.md create mode 100644 Machine Learning/Concepts/Multicollinearity.md create mode 100644 Machine Learning/Concepts/Network Adjacency Matrix.md create mode 100644 Machine Learning/Concepts/Numerical Overflow.md create mode 100644 Machine Learning/Concepts/Pearson Correlation.md create mode 100644 Machine Learning/Concepts/Residual Sum of Squares.md create mode 100644 Machine Learning/Concepts/Softmax.md create mode 100644 Programming Environment/External Linux on SSD.md create mode 100644 Programming Environment/Ubuntu 2204.md create mode 100644 Python/celloracle.md create mode 100644 Python/ipykernel.md create mode 100644 Python/pip.md create mode 100644 Python/scanpy.md create mode 100644 R/Ubuntu R.md create mode 100644 R/cicero.md create mode 100644 R/irkernel.md create mode 100644 R/monocle3.md create mode 100644 R/openssl.md create mode 100644 note project.md diff --git a/Biology/Computational Biology MIT/Appendix/MLE vs. Bayesian Parameter Estimation.md b/Biology/Computational Biology MIT/Appendix/MLE vs. Bayesian Parameter Estimation.md index 43efd9f..9886af6 100644 --- a/Biology/Computational Biology MIT/Appendix/MLE vs. Bayesian Parameter Estimation.md +++ b/Biology/Computational Biology MIT/Appendix/MLE vs. Bayesian Parameter Estimation.md @@ -78,7 +78,7 @@ $$ ### The prior -The prior can be hard to define as a probability distribution. It is also argued that since different prior beliefs will lead to different posteriors, the Bayesian approach is not scientific enough. The counter argument says that many subject choices are already made in conducting an experiment. +The prior can be hard to define as a probability distribution. It is also argued that since different prior beliefs will lead to different posteriors, the Bayesian approach is not scientific enough. The counter argument says that many subjective choices are already made in conducting an experiment. 
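To make the prior-sensitivity point above concrete, here is a small worked example added by the editor (the coin-flip counts and the three Beta priors are invented for illustration, not taken from the note): for a Bernoulli parameter with a Beta(a, b) prior and k successes in n trials, the posterior is Beta(a+k, b+n−k), so different priors give different posterior means, but their influence fades as n grows — which is the usual reply to the "not scientific enough" objection.

``` python
# Hypothetical example: MLE vs. Bayesian posterior mean for a coin's heads probability.
k, n = 7, 10          # made-up data: 7 heads in 10 flips
mle = k / n           # maximum likelihood estimate

for a, b in [(1, 1), (5, 5), (2, 8)]:      # three different prior beliefs
    post_mean = (a + k) / (a + b + n)      # mean of the Beta(a+k, b+n-k) posterior
    print(f"Beta({a},{b}) prior -> posterior mean {post_mean:.3f} (MLE {mle:.3f})")

# With 100x more data the choice of prior barely matters any more.
k, n = 700, 1000
for a, b in [(1, 1), (5, 5), (2, 8)]:
    print(f"Beta({a},{b}) prior, n={n} -> posterior mean {(a + k) / (a + b + n):.3f}")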
### The constant of integration diff --git a/Biology/Concepts/Binding Site Maps.md b/Biology/Concepts/Binding Site Maps.md new file mode 100644 index 0000000..f2646d8 --- /dev/null +++ b/Biology/Concepts/Binding Site Maps.md @@ -0,0 +1,46 @@ +# Binding Site Maps + +In the context of the [study](Single-cell%20ATAC%20and%20RNA%20sequencing%20reveal%20pre-existing%20and%20persistent%20cells%20associated%20with%20prostate%20cancer%20relapse.md) on enzalutamide (ENZ) resistance in prostate cancer, **binding site maps** refer to the genomic locations where specific transcription factors, such as MYC and AR (androgen receptor), bind to DNA. These maps are crucial for understanding how transcription factors regulate gene expression and contribute to cellular processes and disease states. + +## Key Points About Binding Site Maps + +1. **Purpose:** + - Binding site maps identify the specific DNA sequences where transcription factors bind, providing insights into the regulatory networks controlled by these factors. + - They help elucidate how changes in transcription factor binding contribute to cellular responses, such as drug resistance. + +2. **Techniques Used:** + - **ChIP-seq (Chromatin Immunoprecipitation Sequencing):** A common technique for mapping transcription factor binding sites. It involves crosslinking proteins to DNA, immunoprecipitating the DNA-protein complexes with specific antibodies, and sequencing the bound DNA fragments. + - **[FAIRE-seq](FAIRE-seq.md) (Formaldehyde-Assisted Isolation of Regulatory Elements Sequencing)**: Although primarily used to identify open chromatin regions, it can also provide information about regions where transcription factors are likely to bind, especially in combination with other data. + +## MYC and AR Binding Site Maps in the Study + +1. **MYC Binding Site Maps:** + - **Increased MYC Activity:** The study found that ENZ-resistant cells exhibit increased chromatin accessibility at MYC-binding sites, suggesting enhanced MYC activity. + - **Regulatory Regions:** MYC binding site maps help identify regulatory regions where MYC exerts its effects, promoting the transcription of genes involved in cell proliferation, metabolism, and survival. + +2. **AR Binding Site Maps:** + - **Altered AR Activity:** In the context of ENZ resistance, the study noted changes in chromatin accessibility at AR-binding sites, indicating altered AR activity. + - **Regulation of Target Genes:** AR binding site maps are essential for understanding how AR regulates target genes and how its activity changes under different conditions, such as drug treatment or resistance. + +## Findings from the Study + +1. **Chromatin Reprogramming:** + - The study highlights that ENZ resistance is associated with extensive chromatin reprogramming, leading to changes in the accessibility of binding sites for transcription factors like MYC and AR. + - **Open Chromatin Regions:** Increased open chromatin regions at MYC-binding sites suggest that MYC plays a significant role in the transcriptional reprogramming observed in ENZ-resistant cells. + +2. **Transcriptional Reprogramming:** + - The altered binding patterns of MYC and AR are linked to changes in the expression of their target genes, driving the resistance phenotype. + - **Compensatory Mechanisms:** The study suggests that in the absence of functional AR signaling due to ENZ treatment, prostate cancer cells may rely more on MYC signaling to sustain growth and survival. + +## Implications for Prostate Cancer Research + +1. 
**Understanding Resistance Mechanisms:** + - Binding site maps provide critical insights into the molecular mechanisms underlying drug resistance, helping to identify potential therapeutic targets. + - **Targeting MYC:** Given the increased MYC activity in ENZ-resistant cells, targeting MYC or its downstream pathways could be a potential strategy to overcome resistance. + +2. **Developing Combination Therapies:** + - Combining AR-targeted therapies like ENZ with inhibitors of MYC signaling might prevent or delay the development of resistance, offering more effective treatment options for prostate cancer patients. + +### Summary + +**Binding site maps** for transcription factors like MYC and AR are essential tools in understanding the regulatory networks that drive cellular responses and disease states. In the study on ENZ resistance, these maps revealed significant chromatin reprogramming and transcriptional changes associated with increased MYC activity and altered AR activity. These findings underscore the importance of MYC in compensating for inhibited AR signaling and highlight potential avenues for therapeutic intervention in resistant prostate cancer. \ No newline at end of file diff --git a/Biology/Concepts/Cell Cycle.md b/Biology/Concepts/Cell Cycle.md new file mode 100644 index 0000000..0243941 --- /dev/null +++ b/Biology/Concepts/Cell Cycle.md @@ -0,0 +1,19 @@ +# Cell Cycle + +## Phases + +The cell cycle consists of four main phases: + +1. **G1 Phase (Gap 1 phase)**: This is the first growth phase where the cell grows and carries out its normal functions. It prepares for DNA replication. + +2. **S Phase (Synthesis phase)**: During this phase, DNA synthesis or replication occurs. The cell duplicates its genetic material to prepare for division. + +3. **G2 Phase (Gap 2 phase)**: In this phase, the cell continues to grow and prepares for **mitosis** (cell division). It also synthesizes proteins and organelles needed for division. + +4. **M Phase (Mitotic phase)**: This is the phase where actual cell division occurs. It includes two main processes: + - **Mitosis**: The division of the cell nucleus into two identical nuclei, each with a full set of chromosomes. + - **Cytokinesis**: The division of the cytoplasm and other organelles to form two separate daughter cells. + +These phases ensure that the cell grows, replicates its DNA accurately, and divides properly into two daughter cells. + +## Cell Cycle Hetero \ No newline at end of file diff --git a/Biology/Concepts/Co-Accessible Peaks.md b/Biology/Concepts/Co-Accessible Peaks.md new file mode 100644 index 0000000..9ec4953 --- /dev/null +++ b/Biology/Concepts/Co-Accessible Peaks.md @@ -0,0 +1,61 @@ +# Co-accessible Peaks + +**Co-accessible peaks** in single-cell ATAC-seq ([scATAC-seq](scATAC-seq.md)) refer to regions of [Open Chromatin Regions](Open%20Chromatin%20Regions.md) that show coordinated accessibility across single cells. These peaks often correspond to [regulatory](Transcription%20Regulation.md) elements such as [Promoter](Promoter.md)s, [Enhancer](Enhancer.md)s, or other cis-regulatory elements that are functionally linked. The concept of co-accessibility suggests that the accessibility of one region of the genome is related to the accessibility of another, indicating a potential _regulatory interaction_ between these regions. + +## Importance of Co-accessible Peaks + +1. 
**Regulatory Interactions**: Co-accessible peaks can identify potential regulatory interactions between different genomic regions, such as enhancer-promoter interactions. +2. **Chromatin Architecture**: They provide insights into the three-dimensional organization of the genome and how chromatin looping brings distant regulatory elements into proximity with their target genes. +3. **Cell-type Specificity**: Identifying co-accessible peaks can help delineate cell-type-specific [Regulatory Network](Gene%20Regulatory%20Network.md)s and understand how gene regulation varies between different cell types or states. + +### Identification of Co-accessible Peaks + +Identifying co-accessible peaks involves several computational steps: + +1. **Peak Calling**: + - Initially, peaks of accessible chromatin are identified for each single cell or aggregated across cells using peak calling algorithms (e.g., MACS2). + - These peaks represent regions of the genome where the chromatin is accessible to the Tn5 transposase used in ATAC-seq. + +2. **Accessibility Matrix Construction**: + - An _accessibility matrix_ is constructed where rows represent individual peaks, columns represent individual cells, and the entries indicate the accessibility status of each peak in each cell. + - The matrix can be binary (accessible or not) or quantitative (degree of accessibility). + +3. **Correlation Analysis**: + - Correlation or other statistical measures are used to assess the co-accessibility between pairs of peaks across the single cells. + - Highly correlated peaks are considered co-accessible, suggesting a functional relationship. + +4. **Graph-based Methods**: + - Peaks can be represented as nodes in a graph, with edges connecting co-accessible peaks. + - Graph-based clustering algorithms can identify modules or clusters of co-accessible peaks, representing putative regulatory elements working together. + +5. **Latent Variable Models**: + - Methods such as latent semantic indexing (LSI) or topic modeling can be used to identify patterns of co-accessibility, capturing the underlying structure of chromatin accessibility data. + +6. **Integration with Other Data**: + - Co-accessibility analysis can be integrated with other genomic data, such as gene expression ([scRNA-seq](scRNA-seq.md)), to correlate co-accessible peaks with gene regulatory activity. + - Chromatin conformation data (e.g., Hi-C) can be used to validate and refine co-accessible peak predictions. + +## Tools for Identifying Co-accessible Peaks + +Several computational tools and frameworks have been developed for identifying co-accessible peaks in scATAC-seq data: + +1. **Cicero**: + - Part of the Monocle3 package, Cicero identifies co-accessible peaks by constructing co-accessibility maps using single-cell chromatin accessibility data. + - It uses a machine-learning approach to infer regulatory interactions and visualize chromatin structure. + +2. **ArchR**: + - ArchR is a comprehensive package for analyzing single-cell chromatin accessibility data. + - It includes functionality for identifying co-accessible peaks and integrating with other single-cell omics data. + +3. **Signac**: + - Part of the Seurat ecosystem, Signac is designed for the analysis of single-cell chromatin data. + - It provides tools for peak calling, dimensionality reduction, and identification of co-accessible peaks. 
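As a concrete illustration of the correlation-based step described above, here is a minimal, hypothetical Python sketch: given a binary peaks-by-cells accessibility matrix, it scores every peak pair by Pearson correlation across cells and keeps pairs above an arbitrary threshold. This is only a toy version of the idea — it is not the algorithm used by Cicero, ArchR, or Signac (Cicero, for instance, uses a distance-penalized regularized graphical model rather than raw pairwise correlation), and the matrix, threshold, and variable names are invented for the example.

``` python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary accessibility matrix: rows = peaks, columns = cells.
n_peaks, n_cells = 5, 200
acc = rng.integers(0, 2, size=(n_peaks, n_cells)).astype(float)
acc[1] = acc[0]  # force peak 1 to be co-accessible with peak 0 for illustration

# Pearson correlation between all peak pairs across cells.
corr = np.corrcoef(acc)

# Call a peak pair "co-accessible" if its correlation exceeds an arbitrary cutoff.
threshold = 0.5
pairs = [(i, j, corr[i, j])
         for i in range(n_peaks) for j in range(i + 1, n_peaks)
         if corr[i, j] > threshold]
print(pairs)  # expected: only the (0, 1) pair passes the cutoff
```

In practice one would also restrict the comparison to peak pairs within a genomic distance window and account for sequencing depth and other technical covariates before interpreting a pair as a putative regulatory interaction.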
+ +## Biological Implications + +- **Gene Regulation**: Co-accessible peaks help in mapping the regulatory landscape of the genome, identifying potential enhancers and promoters that work together to control gene expression. +- **Cell Identity**: They reveal cell-type-specific regulatory networks, contributing to our understanding of how different cell types establish and maintain their identity. +- **Development and Differentiation**: Co-accessibility analysis provides insights into dynamic changes in chromatin structure during development and differentiation. +- **Disease Mechanisms**: Understanding co-accessible peaks can uncover regulatory disruptions in diseases such as cancer, where chromatin accessibility patterns are often altered. + +Overall, identifying co-accessible peaks in scATAC-seq data is crucial for understanding the complex regulatory interactions and chromatin architecture that underlie gene regulation and cellular function. \ No newline at end of file diff --git a/Biology/Concepts/FAIRE-seq.md b/Biology/Concepts/FAIRE-seq.md new file mode 100644 index 0000000..6377d77 --- /dev/null +++ b/Biology/Concepts/FAIRE-seq.md @@ -0,0 +1,105 @@ +# FAIRE-seq: Formaldehyde-Assisted Isolation of Regulatory Elements Sequencing + +**FAIRE-seq** is a genomic technique used to identify [Open Chromatin Regions](Open%20Chromatin%20Regions.md), which are often associated with active regulatory elements such as [Promoter](Promoter.md)s, [Enhancer](Enhancer.md)s, and other [Transcription Factor](Transcription%20Factor.md) binding sites. The method leverages the fact that open chromatin regions are more accessible and less tightly bound to nucleosomes compared to closed chromatin regions. + +## Steps + +1. **Crosslinking:** + - Cells are treated with formaldehyde, which crosslinks proteins to DNA, thereby preserving protein-DNA interactions in both open and closed chromatin regions. + +2. **Chromatin Fragmentation:** + - The crosslinked chromatin is then sheared into small fragments using sonication or other mechanical methods. + +3. **Phenol-Chloroform Extraction:** + - The fragmented chromatin is subjected to phenol-chloroform extraction, a method used to separate proteins from DNA. Open chromatin regions, which are less protein-bound, preferentially partition into the aqueous phase, whereas the protein-bound (closed) chromatin remains in the organic phase. + +4. **DNA Purification:** + - The DNA from the aqueous phase is purified. This DNA represents the regions of open chromatin. + +5. **Sequencing:** + - The purified DNA is then subjected to high-throughput sequencing. The sequencing reads are aligned to a reference genome to identify the genomic locations of open chromatin regions. + +## Applications of FAIRE-seq + +1. **Identification of Regulatory Elements:** + - FAIRE-seq is used to map active [regulatory elements](Transcription%20Regulation.md) such as promoters, enhancers, and insulators across the genome. + +2. **Comparison of Chromatin States:** + - By comparing FAIRE-seq profiles under different conditions or in different cell types, researchers can identify changes in chromatin accessibility associated with various biological processes or diseases. + +3. **Gene Regulation Studies:** + - FAIRE-seq helps in understanding the regulatory architecture of the genome and how changes in chromatin accessibility affect gene expression. + +4. 
**Cancer Research:** + - In cancer research, FAIRE-seq can be used to identify regulatory elements that are aberrantly activated or repressed in cancer cells, providing insights into mechanisms of oncogenesis and potential therapeutic targets. + +## Advantages of FAIRE-seq + +1. **High Resolution:** + - FAIRE-seq provides high-resolution maps of open chromatin regions, enabling precise identification of regulatory elements. + +2. **No Antibodies Required:** + - Unlike ChIP-seq (Chromatin Immunoprecipitation sequencing), which requires specific antibodies for the target proteins, FAIRE-seq does not rely on antibodies, making it more straightforward and less biased. + +3. **Applicability to Various Conditions:** + - FAIRE-seq can be applied to any cell type or condition, allowing for broad utility in different biological contexts. + +## Comparison with Other Techniques + +1. **DNase-seq:** + - DNase-seq uses DNase I to digest accessible chromatin regions and identifies them through sequencing. It also maps open chromatin but can have different biases compared to FAIRE-seq. + +2. **ATAC-seq:** + - [ATAC-seq](scATAC-seq.md) (Assay for Transposase-Accessible Chromatin using sequencing) uses a transposase enzyme to insert sequencing adapters into open chromatin regions. It is faster and requires fewer cells compared to FAIRE-seq and DNase-seq. + +## vs. ATAC-seq + +**FAIRE-seq** and **ATAC-seq** are both techniques used to map open chromatin regions in the genome. However, they differ in their methodologies, sensitivities, and specific applications. + +### FAIRE-seq + +**Methodology:** +1. **Crosslinking:** Cells are treated with formaldehyde to crosslink proteins to DNA, preserving protein-DNA interactions. +2. **Chromatin Fragmentation:** The chromatin is sheared into small fragments using sonication. +3. **Phenol-Chloroform Extraction:** The fragmented chromatin is subjected to phenol-chloroform extraction to separate open chromatin regions into the aqueous phase. +4. **DNA Purification:** The DNA from the aqueous phase is purified and sequenced. + +**Advantages:** +1. **No Specific Antibodies Required:** Unlike ChIP-seq, FAIRE-seq does not require specific antibodies, reducing bias and simplifying the procedure. +2. **Broad Applicability:** Can be applied to various cell types and conditions without needing specific reagents for each context. + +**Disadvantages:** +1. **Lower Sensitivity:** FAIRE-seq may have lower sensitivity compared to ATAC-seq, especially for regions with moderate accessibility. +2. **Formaldehyde Crosslinking:** The requirement for formaldehyde crosslinking can introduce variability and requires careful handling. + +### ATAC-seq + +**Methodology:** +1. **Transposase Tagmentation:** A transposase enzyme inserts sequencing adapters into regions of open chromatin. +2. **Fragmentation:** The transposase simultaneously fragments the DNA and adds adapters, targeting accessible regions. +3. **Sequencing:** The tagmented DNA is purified and sequenced. + +**Advantages:** +1. **High Sensitivity:** ATAC-seq is _highly sensitive_ and can detect regions with varying degrees of chromatin accessibility, including those that are less accessible. +2. **Low Input Requirement:** Requires fewer cells and less starting material compared to FAIRE-seq. +3. **Rapid Protocol:** The entire procedure is faster, often taking less than a day from cell preparation to library preparation. + +**Disadvantages:** +1. 
**Transposase Bias:** The transposase enzyme may introduce some sequence bias, affecting the uniformity of chromatin accessibility detection. Meaning the transposase may prefer some open regions than others. +2. **Cost:** ATAC-seq can be more expensive due to the specialized reagents and equipment required. + +### Applications and Suitability + +**FAIRE-seq:** +- **Broad Mapping of Open Chromatin:** Suitable for general mapping of open chromatin regions, particularly in studies where antibody specificity is a concern. +- **Historical Use:** Has been a standard method for many years and is well-established in the literature. +- **Comparative Studies:** Can be used alongside other techniques to validate findings or to study chromatin accessibility under different conditions. + +**ATAC-seq:** +- **High-Resolution and Sensitivity:** Ideal for high-resolution mapping of chromatin accessibility, including fine-scale variations. +- **Single-Cell Applications:** ATAC-seq can be adapted for single-cell applications (scATAC-seq), enabling the study of chromatin accessibility at the single-cell level. +- **Quick Turnaround:** Useful in time-sensitive studies due to its rapid protocol. + +### Summary + +Both FAIRE-seq and ATAC-seq are valuable for mapping open chromatin regions, but they have distinct advantages and limitations. FAIRE-seq is a robust, broadly applicable method with no need for specific antibodies, while ATAC-seq offers higher sensitivity, requires less input material, and can be adapted for single-cell analysis. The choice between the two techniques depends on the specific requirements of the study, including the desired sensitivity, resolution, and throughput. \ No newline at end of file diff --git a/Biology/Concepts/MYC Signaling.md b/Biology/Concepts/MYC Signaling.md new file mode 100644 index 0000000..ec48a6a --- /dev/null +++ b/Biology/Concepts/MYC Signaling.md @@ -0,0 +1,61 @@ +# MYC Signaling + +**MYC** is a family of regulator genes and proto-oncogenes that encode [Transcription Factor](Transcription%20Factor.md)s. These transcription factors play a crucial role in cell cycle progression, apoptosis, and cellular transformation. The most well-known members of the MYC family are **c-MYC, N-MYC,** and **L-MYC**. MYC proteins function primarily by regulating the expression of a large number of **target genes**. + +## Functions of MYC + +1. **Cell Proliferation:** + - MYC promotes cell growth and division by upregulating genes involved in DNA replication, cell cycle progression, and ribosome biogenesis. + +2. **Metabolism:** + - MYC regulates metabolic pathways to support rapid cell growth, including glucose metabolism, amino acid metabolism, and lipid synthesis. + +3. **Apoptosis:** + - MYC can induce apoptosis under certain conditions, particularly when overexpressed or in the absence of sufficient survival signals. + +4. **Differentiation:** + - MYC plays a role in maintaining the balance between self-renewal and differentiation in stem cells and during tissue regeneration. + +5. **Oncogenesis:** + - Overexpression or dysregulation of MYC is implicated in the development of many cancers. MYC drives oncogenesis by promoting proliferation and inhibiting differentiation and apoptosis. + +## Mechanism of MYC Signaling + +1. **Transcription Factor:** + - MYC proteins bind to specific DNA sequences known as **E-boxes** (enhancer boxes) in the [Promoter](Promoter.md)s of target genes. +2. 
**Regulation of Gene Expression:** + - MYC recruits co-factors such as histone acetyltransferases (HATs) to modify chromatin structure and enhance the transcription of target genes. +3. **Interaction with Other Proteins:** + - MYC forms heterodimers with other proteins, like MAX, to stabilize its binding to DNA and regulate gene expression. + +## MYC Target Genes + +MYC target genes encompass a wide range of functions, reflecting its role in diverse cellular processes. Here are some key categories and examples of MYC target genes: + +1. **Cell Cycle Regulation:** + - **CCND1 (Cyclin D1):** Promotes G1 to S phase transition. + - **CDK4:** Cyclin-dependent kinase involved in cell cycle progression. + +2. **Metabolism:** + - **LDHA (Lactate Dehydrogenase A):** Involved in glycolysis. + - **GLS (Glutaminase):** Important for glutamine metabolism. + +3. **Ribosome Biogenesis and Protein Synthesis:** + - **NPM1 (Nucleophosmin):** Involved in ribosome biogenesis. + - **EIF4E (Eukaryotic Translation Initiation Factor 4E):** Facilitates translation initiation. + +4. **DNA Replication and Repair:** + - **RRM2 (Ribonucleotide Reductase Regulatory Subunit M2):** Involved in DNA synthesis. + - **MCM4 (Minichromosome Maintenance Complex Component 4):** Essential for DNA replication. + +5. **Apoptosis and Survival:** + - **BCL2 (B-cell Lymphoma 2):** Inhibits apoptosis. + - **TP53 (Tumor Protein P53):** Regulates the cell cycle and induces apoptosis. + +6. **Transcription Factors:** + - **TFAP4 (Transcription Factor AP-4):** Regulates various cellular processes. + - **E2F1:** Controls the expression of genes required for cell cycle progression. + +## Summary + +**MYC signaling** is crucial for regulating cell growth, metabolism, and proliferation. MYC acts as a transcription factor that controls the expression of a broad spectrum of target genes involved in key cellular functions. Dysregulation of MYC signaling, often through overexpression or mutations, is a common feature in many cancers, making it a critical focus of cancer research and therapeutic targeting. \ No newline at end of file diff --git a/Biology/Concepts/Promoter.md b/Biology/Concepts/Promoter.md index e2550e7..9196f40 100644 --- a/Biology/Concepts/Promoter.md +++ b/Biology/Concepts/Promoter.md @@ -2,7 +2,7 @@ ## Definition -A promoter is a region of DNA that controls the initiation of transcription of a gene. It is located upstream of the gene, on the same strand of DNA. Promoters are typically about 100-1000 base pairs long, and they contain specific sequences that bind to [transcription factors](Transcription%20Factor.md). Transcription factors are proteins that [regulate](Transcription%20Regulation.md) gene expression. When transcription factors bind to the promoter, they help to recruit RNA polymerase, the enzyme that transcribes DNA into RNA. +A promoter is a region of DNA that controls the initiation of transcription of a gene. It is located upstream of the gene, on the same strand of DNA. Promoters are typically about 100-1000 base pairs long, and they contain specific sequences that **bind** to [transcription factors](Transcription%20Factor.md). Transcription factors are proteins that [regulate](Transcription%20Regulation.md) gene expression. When transcription factors bind to the promoter, they help to recruit RNA polymerase, the enzyme that transcribes DNA into RNA. The promoter is a critical part of the gene regulatory network. It determines when and how a gene is expressed. 
Promoters can be regulated by a variety of factors, including the cell type, the developmental stage, and the environment. diff --git a/Biology/Concepts/Regulatory Candidate Genes.md b/Biology/Concepts/Regulatory Candidate Genes.md index b77dfe0..27a18b0 100644 --- a/Biology/Concepts/Regulatory Candidate Genes.md +++ b/Biology/Concepts/Regulatory Candidate Genes.md @@ -2,7 +2,7 @@ **Regulatory candidate genes** are genes that are proposed or hypothesized to have a role in regulating biological processes, pathways, or networks. These genes are often identified through various experimental, computational, and bioinformatic approaches and are candidates for further functional validation. Regulatory candidate genes can influence gene expression, signal transduction, metabolic pathways, and other critical cellular functions. -### Identification of Regulatory Candidate Genes +## Identification of Regulatory Candidate Genes 1. **Genomic Studies**: - **Genome-Wide Association Studies (GWAS)**: GWAS identify genetic variants associated with specific traits or diseases. Candidate genes near these variants are considered potential regulators of the observed phenotype. @@ -14,7 +14,7 @@ 3. **Epigenomic and Chromatin Studies**: - **ChIP-seq**: Chromatin Immunoprecipitation followed by sequencing identifies binding sites of transcription factors and other DNA-binding proteins. Genes near these sites are candidates for regulation by the identified proteins. - - **[ATAC-seq](scATAC-seq.md) and DNase-seq**: These methods identify [Open Chromatin Regions](Open%20Chromatin%20Regions.md), suggesting potential regulatory regions. Genes near these regions are candidate regulatory genes. + - **[ATAC-seq](scATAC-seq.md) and DNase-seq**: These methods identify [Open Chromatin Regions](Open%20Chromatin%20Regions.md), suggesting potential regulatory regions. _Genes near these regions_ are candidate regulatory genes. 4. **Functional Genomics**: - **CRISPR/Cas9 Screening**: Genome-wide CRISPR screens can identify genes whose disruption affects specific phenotypes, highlighting their regulatory roles. @@ -24,7 +24,7 @@ - **Motif Analysis**: Identifying DNA motifs (specific sequences) that are binding sites for known transcription factors can suggest candidate regulatory genes. - **Network Analysis**: Computational models and [network Analysis](GRN%20Analysis.md) can identify key nodes or hubs in regulatory networks that are candidate regulatory genes. -### Functional Validation of Regulatory Candidate Genes +## Functional Validation of Regulatory Candidate Genes Once candidate genes are identified, they undergo functional validation to confirm their regulatory roles. This process includes: @@ -43,7 +43,7 @@ Once candidate genes are identified, they undergo functional validation to confi 5. **Functional Assays**: - Performing assays to measure cellular functions, such as cell proliferation, differentiation, apoptosis, or metabolic activity, in the presence or absence of the candidate gene. -### Examples and Applications +## Examples and Applications - **Disease Research**: Identifying regulatory candidate genes involved in diseases (e.g., cancer, diabetes, cardiovascular diseases) can provide insights into disease mechanisms and potential therapeutic targets. - **Developmental Biology**: Studying candidate genes that regulate developmental processes can reveal key regulators of cell differentiation, tissue formation, and organogenesis. 
diff --git a/Biology/Concepts/mRNA.md b/Biology/Concepts/mRNA.md index 12e7cdb..cc7229c 100644 --- a/Biology/Concepts/mRNA.md +++ b/Biology/Concepts/mRNA.md @@ -6,23 +6,33 @@ The sense strand is the top strand. It has the same sequence as the mRNA except T is exchanged for U in mRNA. The antisense strand is the complement to the sense strand and the template for transcription. -Regulators that bind to the antisense strand can stop transcription from happening, acting as a gene regulatory mechanism. - +Regulators that bind to the antisense strand can stop transcription from happening, acting as a gene [regulatory mechanism](Transcription%20Regulation.md). + +### mRNA abundance + +mRNA abundance is determined by 2 factors: **rate of transcription** and **rate of degradation**. +- The rate of transcription determines the rate at which transcripts are produced. +- The rate of degradation determines the rate at which transcripts are broken down. +- These rates are regulated by various factors. + - High expression levels are typically a result of strong promoter activity, efficient transcriptional machinery, and other regulatory factors that enhance transcription. + - Regulatory mechanisms, including microRNAs and RNA-binding proteins, can enhance or reduce mRNA stability and translation efficiency. + - Processes such as RNA splicing, export from the nucleus, and degradation all influence the steady-state levels of mRNA. +- **Gene expression** in [scRNA-seq](scRNA-seq.md) is determined by the number of mRNA transcripts of a gene at the time of sampling. Genes with a high transcription rate, high mRNA stability, or both, will have a high number of mRNAs. ## Introns and exons The introns do not code for proteins, only the exons do. The introns must be removed in mRNA processing. -Alternative splicing is a mechanism that combines different exons in the same gene to produce different proteins. +**Alternative splicing** is a mechanism that combines different exons in the same gene to produce different proteins. The nucleotide sequences GU and AG are splicing signals, but having those signals does not indicate mandatory splicing. Splicing only takes place at the intersection between introns and exons. ## Small nuclear RNA -Small nuclear RNA (snRNA) is essential for mRNA splicing. +Small nuclear RNA (snRNA) is essential for _mRNA splicing_. ## Small interfering RNA -Small interfering RNA (siRNA) mediates degradation of distinct mRNAs and closing of gene loci leading to decreased gene expression. +Small interfering RNA (siRNA) mediates _degradation_ of distinct mRNAs and closing of gene loci leading to decreased gene expression. ## [Micro RNA](microRNA.md) diff --git a/Biology/Concepts/scATAC-seq.md b/Biology/Concepts/scATAC-seq.md index 58cde1c..873fbde 100644 --- a/Biology/Concepts/scATAC-seq.md +++ b/Biology/Concepts/scATAC-seq.md @@ -1,3 +1,38 @@ -## Co-accessible Peaks +# scATAC-seq -Co-accessible peaks mean that when one peak is open, the other is also open. Co-accessible peaks are usually consistent within a cell type, but does not imply regulatory relationships. \ No newline at end of file +Single cell ATAC sequencing + +**Single-cell ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing)** is a technique used to identify open chromatin regions at the single-cell level. This method allows researchers to study the chromatin accessibility landscape in individual cells, providing insights into cellular heterogeneity and gene regulation. + +## How Single Cell ATAC-seq Works + +1. 
**Cell Preparation**: Individual cells are isolated using methods such as microfluidics or droplet-based systems. Each cell is processed separately to maintain single-cell resolution. + +2. **Transposition Reaction**: The isolated cells are treated with **Tn5 transposase**, an enzyme that simultaneously cuts and tags accessible DNA regions with sequencing adapters. The Tn5 transposase preferentially inserts adapters into open chromatin regions where the DNA is not tightly bound by nucleosomes. + +3. **Library Preparation**: The tagged DNA fragments are then amplified by PCR to create a sequencing library. This library contains fragments of DNA from the open chromatin regions of each individual cell, flanked by sequencing adapters. + +4. **Sequencing**: The sequencing library is subjected to high-throughput sequencing. The resulting reads represent the locations of open chromatin regions within each single cell. + +5. **Data Analysis**: + + - **Alignment**: Sequencing reads are aligned to a reference genome. + - **Peak Calling**: Regions with a high density of reads are identified as **peaks**, representing open chromatin regions. + - **Cell-Specific Analysis**: Data from individual cells are analyzed to determine cell-specific chromatin accessibility patterns. This can involve clustering cells based on their **accessibility profiles** to identify different cell types or states. + - Identify [Co-Accessible Peaks](Co-Accessible%20Peaks.md) + - **Integration with Other Data**: The ATAC-seq data can be integrated with other single-cell data, such as [scRNA-seq](scRNA-seq.md), to correlate chromatin accessibility with gene expression. + +## Applications of Single Cell ATAC-seq + +- **Characterizing Cellular Heterogeneity**: Identifying different cell types and states based on their unique chromatin accessibility profiles. +- **Developmental Biology**: Studying the changes in chromatin accessibility during development and differentiation. +- **Cancer Research**: Investigating the chromatin landscape in cancer cells to understand mechanisms of oncogenesis and identify potential therapeutic targets. +- **Gene Regulation**: Mapping regulatory elements like enhancers and promoters, and understanding how they contribute to gene expression regulation in different cell types and conditions. + +By providing high-resolution insights into the chromatin landscape at the single-cell level, single-cell ATAC-seq is a powerful tool for understanding the complexities of gene regulation and cellular diversity. + +## Differentially Accessible Regions + +A **genomic region** where the accessibility of chromatin differs between two or more conditions, cell types, or treatments. Chromatin accessibility refers to how [open](Open%20Chromatin%20Regions.md) or closed a region of the genome is, which in turn affects the binding of transcription factors and the transcriptional machinery to DNA. + +DARs can be regions that are differentially open regions or differentially closed regions. \ No newline at end of file diff --git a/Biology/Concepts/scRNA-seq.md b/Biology/Concepts/scRNA-seq.md index 9d3eb9a..f6676e8 100644 --- a/Biology/Concepts/scRNA-seq.md +++ b/Biology/Concepts/scRNA-seq.md @@ -1 +1,299 @@ -# scRNA-seq \ No newline at end of file +# scRNA-seq + +## Steps + +### 1. Sample Preparation + +#### A. Cell Isolation +- **Tissue Dissociation**: If starting from a tissue sample, the tissue must be dissociated into a single-cell suspension. 
This can be done using enzymatic digestion (e.g., collagenase, trypsin) and mechanical dissociation (e.g., pipetting, using a tissue dissociator). +- **Cell Sorting**: Cells are often sorted to ensure a pure population of viable cells. Fluorescence-activated cell sorting (FACS) or magnetic-activated cell sorting (MACS) can be used. + +#### B. Viability Assessment +- **Cell Viability Staining**: Use dyes such as Trypan Blue or propidium iodide to assess and ensure that only live cells are processed further. + +### 2. Cell Capture and Barcoding + +#### A. Droplet-based Methods (e.g., 10x Genomics Chromium) +- **Encapsulation**: Individual cells are encapsulated into droplets along with barcoding beads. Each bead is coated with oligonucleotides containing a **unique cell barcode**, a **unique molecular identifier (UMI)**, and a **poly(T) sequence** to capture polyadenylated mRNA. +- **Reverse Transcription**: Inside each droplet, reverse transcription occurs, where the poly(T) sequence binds to the poly(A) tail of mRNA, and the barcode and UMI are incorporated into the **cDNA**. + +#### B. Well-based Methods (e.g., Smart-seq2) +- **Cell Sorting into Wells**: Single cells are sorted into individual wells of a microplate. +- **Lysis and Reverse Transcription**: Cells are lysed, and mRNA is reverse transcribed into cDNA using oligo(dT) primers with a unique barcode and UMI sequence. + +#### C. Combinatorial Indexing (e.g., SPLiT-seq) +- **Split-Pool Barcoding**: Cells are split into multiple wells, each well receiving a unique barcode. Cells are then pooled and split into wells again, receiving a second barcode. This process is repeated several times to generate unique combinations of barcodes for each cell. + +### 3. cDNA Amplification + +- **PCR Amplification**: The barcoded cDNA is amplified by _PCR_. This step increases the quantity of cDNA for downstream applications. + +### 4. Library Preparation + +- **Fragmentation and Size Selection**: Amplified cDNA is _fragmented_ to appropriate sizes (typically 200-500 bp) and size-selected to ensure uniformity. +- **Adapter Ligation**: _Sequencing adapters_ are ligated to the ends of the cDNA fragments. These adapters are necessary for attaching the cDNA to the sequencing flow cell. +- **Library Amplification**: Another round of PCR may be performed to amplify the library, incorporating _sample-specific barcodes_ if multiplexing. + +### 5. Sequencing + +- **High-throughput Sequencing**: The prepared **cDNA library** is sequenced using high-throughput sequencing platforms such as Illumina's NovaSeq or HiSeq. During sequencing, both the **cell barcodes, UMIs, and RNA sequences** are read. + +### 6. Data Processing and Analysis + +#### A. Base Calling and Demultiplexing +- **Base Calling**: Converts raw sequencing data into nucleotide sequences. +- **Demultiplexing**: Assigns each read to its originating cell based on the cell-specific barcodes. + +#### B. Read Alignment +- **Quality Control**: Filters out low-quality reads and adapters. +- **Alignment**: Maps the reads to a reference genome or transcriptome to determine the origin of each RNA fragment. + +#### C. UMI Deduplication +- **UMI Counting**: Counts the number of unique UMIs associated with each gene to estimate the number of unique RNA molecules, removing PCR duplicates. + +> [!note] Deduplication +> During PCR, each original RNA/cDNA molecule is amplified to many copies, so does the UMI associated with each of them. Deduplication makes sure each original molecule is only counted once. 
A highly expressed genes will be in many [mRNA](mRNA.md) transcripts, or high UMI count. + +#### D. Normalization and Batch Correction +- **Normalization**: Adjusts UMI counts to account for differences in sequencing depth and other technical variations. +- **Batch Effect Correction**: Removes technical variability introduced by different experimental batches. + +#### E. Downstream Analysis +- **Gene Expression Analysis**: Identifies and quantifies gene expression levels for each cell. +- **Clustering and Dimensionality Reduction**: Groups cells based on similar gene expression profiles and reduces data dimensions using methods like PCA, t-SNE, or UMAP. +- **Differential Expression**: Identifies genes that are differentially expressed between cell clusters or conditions. +- **Cell Type Identification**: Assigns cell types based on known marker genes or through comparison with reference datasets. + +### Summary of scRNA-seq Workflow + +1. **Sample Preparation**: Isolate single cells from the sample. +2. **Cell Capture and Barcoding**: Encapsulate cells and add unique barcodes and UMIs. +3. **cDNA Amplification**: Amplify barcoded cDNA. +4. **Library Preparation**: Fragment, size-select, and ligate adapters to cDNA. +5. **Sequencing**: Sequence the cDNA library. +6. **Data Processing and Analysis**: Process raw data, align reads, deduplicate UMIs, normalize, correct batch effects, and perform downstream analyses. + +By following these steps, researchers can obtain high-resolution gene expression profiles at the single-cell level, enabling the study of cellular heterogeneity and gene expression dynamics in complex biological systems. + +## Read depth + +Read depth, also known as sequencing depth or coverage, refers to the number of times a specific nucleotide in the genome or transcriptome is sequenced during an RNA sequencing (RNA-seq) experiment. It is a crucial metric in sequencing studies and has significant implications for the accuracy and reliability of the data. + +### Read Depth in RNA-Seq + +1. **Definition**: + + - Read depth is the average number of sequencing reads that align to a particular region of the genome or transcriptome. It is typically expressed as "X-fold coverage" (e.g., 10x, 50x), where "X" indicates the average number of reads covering each nucleotide. +2. **Importance**: + + - **Accuracy**: Higher read depth increases the accuracy of detecting RNA molecules, reducing the likelihood of missing low-abundance transcripts. + - **Quantification**: Adequate read depth ensures more reliable quantification of gene expression levels, particularly for genes expressed at low levels. + - **Variant Detection**: In genomic studies, higher read depth improves the ability to detect single nucleotide polymorphisms (SNPs) and other genetic variants. + - **Error Reduction**: Higher coverage helps to average out sequencing errors and biases, leading to more robust data. +3. **Determining Optimal Read Depth**: + + - The optimal read depth depends on the goals of the study, the complexity of the transcriptome, and the expected expression levels of genes. + - For standard RNA-seq studies, typical read depths range from 10-30 million reads per sample. For more comprehensive studies, such as those aiming to detect low-abundance transcripts or rare variants, deeper sequencing might be required. +4. **Balancing Cost and Depth**: + + - Higher read depth generally leads to more accurate data but also increases sequencing costs. 
Researchers need to balance the desired depth with available budget and the specific requirements of their study. + +### Practical Example + +In a single-cell RNA sequencing (scRNA-seq) experiment, achieving sufficient read depth is particularly important due to the low amount of starting material and the inherent variability between individual cells. In such cases, a high read depth ensures that even low-abundance transcripts are detected and quantified accurately, providing a more comprehensive view of the cellular transcriptome. + +## UMIs + +Unique Molecular Identifiers (UMIs) are designed to be unique for each RNA molecule through the incorporation of random nucleotide sequences during the preparation of RNA-seq libraries. Here’s how UMIs are made to ensure their uniqueness: + +1. **Random Sequence Generation**: + - UMIs are typically short sequences, often 6-12 nucleotides long, that are composed of random nucleotides (A, T, C, G). + - The randomness of the sequence ensures a high diversity of UMIs. For example, a 10-nucleotide UMI can theoretically generate $4^{10} = 1,048,576$ unique combinations. + +2. **Incorporation into Primers**: + - During the RNA library preparation, the UMI sequences are incorporated into the primers used for reverse transcription or during the initial amplification step. + - These primers attach to each RNA molecule and include a random UMI sequence at a specific location. + +3. **Attachment to RNA Molecules**: + - Each RNA molecule is tagged with a unique UMI as it is reverse-transcribed into cDNA. This tagging is done before any amplification steps, ensuring that each original RNA molecule has a distinct UMI. + +4. **Sequencing and Data Analysis**: + - During sequencing, both the cDNA and its associated UMI are sequenced together. + - In data analysis, reads are grouped by their UMI sequences, allowing for the identification and counting of unique RNA molecules. + +### Ensuring Uniqueness and Reducing Collisions + +1. **Sufficient UMI Length**: + - The length of the UMI is chosen to provide a high number of unique combinations, significantly reducing the probability of two different RNA molecules receiving the same UMI (a collision). + +2. **Randomness Verification**: + - Quality control steps can verify the randomness and distribution of UMIs. Non-random distribution might indicate issues with the synthesis of UMIs or biases in the library preparation process. + +3. **Error Correction**: + - During data analysis, error correction algorithms can help to distinguish true UMIs from those altered by sequencing errors, further ensuring the accuracy of UMI-based quantification. + +In summary, UMIs are made unique for each RNA molecule by incorporating random nucleotide sequences into primers used during library preparation. The high diversity of these sequences ensures that each RNA molecule is tagged with a distinct identifier, allowing for accurate counting and reducing biases introduced by PCR amplification. +## Raw read counts + +Raw read counts in the context of RNA sequencing (RNA-seq) refer to the initial number of reads (short sequences of nucleotides) that are generated by sequencing technology and mapped back to a reference genome or transcriptome. Here's an overview: + +1. **RNA Sequencing Process**: + - RNA molecules are extracted from cells or tissues. + - These RNA molecules are converted into complementary DNA (cDNA) through reverse transcription. + - The cDNA is then sequenced, generating millions of short reads. + +2. 
**Reads**: + - Each read represents a fragment of the original RNA molecule. + - These reads are typically 50-150 base pairs in length, depending on the sequencing technology used. + +3. **Mapping Reads**: + - The generated reads are aligned or mapped to a reference genome or transcriptome. + - The process of mapping assigns each read to a specific gene or genomic location. + +4. **Counting Reads**: + - Raw read counts are the number of reads that map to each gene or transcript. + - These counts provide a measure of the abundance of each RNA molecule in the sample. + +5. **Purpose**: + - Raw read counts serve as the starting point for downstream analysis in RNA-seq studies. + - They are used to quantify gene expression levels, identify differentially expressed genes, and perform other types of genomic analyses. + +However, raw read counts can be influenced by various _technical biases_, such as PCR amplification artifacts and differences in sequencing depth. To account for these factors, normalization techniques and the use of UMIs (Unique Molecular Identifiers) are often employed to obtain more accurate estimates of gene expression levels. + +In summary, raw read counts are the initial counts of sequencing reads that map to genes or transcripts, providing a fundamental measure of RNA abundance before further processing and normalization. + +### PCR Artifacts + +PCR (Polymerase Chain Reaction) amplification artifacts can significantly affect raw read counts in RNA sequencing (RNA-seq) by introducing biases and distortions in the data. Here are the main ways PCR artifacts impact raw read counts: + +1. **Over-Representation of Certain Sequences**: + - During PCR amplification, some RNA fragments can be preferentially amplified over others due to sequence-specific biases or the efficiency of the primers used. This leads to an over-representation of these sequences in the final read counts. + +2. **PCR Duplicates**: + - Each original RNA molecule should ideally be represented by a unique read in the sequencing data. However, PCR can generate multiple copies (duplicates) of the same RNA molecule. These duplicates are indistinguishable from genuinely unique reads in raw read count data, inflating the apparent abundance of certain RNA molecules. + +3. **Random Amplification Errors**: + - PCR amplification can introduce random errors, such as nucleotide misincorporations, which can create reads that do not accurately reflect the original RNA sequence. These errors can complicate the mapping process and lead to incorrect read assignments. + +4. **Amplification Efficiency Variability**: + - The efficiency of PCR amplification can vary across different RNA molecules, leading to variable amplification rates. Some sequences might amplify more efficiently than others, causing discrepancies in the final read counts that do not accurately reflect the original RNA molecule abundance. + +### Mitigating PCR Artifacts + +1. **Unique Molecular Identifiers (UMIs)**: + - As previously mentioned, UMIs can be used to tag each original RNA molecule before PCR amplification. By counting the number of unique UMIs instead of raw reads, researchers can more accurately estimate the true abundance of each RNA molecule and reduce the impact of PCR duplicates. + +2. **Normalization Techniques**: + - Applying normalization methods to adjust for sequencing depth and other technical variations helps to reduce the impact of PCR amplification biases on the final data. + +3. 
**Duplicate Removal**: + - Bioinformatics tools can identify and remove PCR duplicates by looking for reads that start and end at the same position in the genome, which are likely to be PCR artifacts. + +In summary, PCR amplification artifacts can lead to over-representation, duplication, and errors in raw read counts, thus skewing the data. Employing techniques like UMIs, normalization, and duplicate removal can help mitigate these issues and produce more accurate measures of RNA abundance. + +## Read depth calculation + +Read depth, also known as sequencing depth or coverage, is calculated from sequencing results by determining the average number of times each nucleotide in the genome or transcriptome is sequenced. Here's how read depth is typically calculated: + +1. **Obtain Sequencing Data**: + - After sequencing, you have a collection of sequencing reads in a **FASTQ** file. These reads need to be processed and aligned to a reference genome or transcriptome. + +2. **Quality Control and Trimming**: + - Perform quality control (QC) on the raw reads using tools like FastQC to check for issues such as low-quality bases and adapter contamination. + - Trim low-quality bases and adapter sequences using tools like Trimmomatic or Cutadapt. + +3. **Align Reads to Reference**: + - Use an alignment tool (e.g., STAR, HISAT2 for RNA-seq; BWA, Bowtie2 for DNA-seq) to map sequencing reads to the reference genome or transcriptome. + - The output of this alignment process is typically a **SAM or BAM** file, which contains information about where each read aligns on the reference. + +4. **Count Aligned Reads**: + - Count the number of reads that align to each position in the reference genome or transcriptome. This can be done using tools like `samtools depth`, `bedtools coverage`, or specialized RNA-seq quantification tools like HTSeq-count or featureCounts. + + Example command using `samtools depth`: + ```bash + samtools depth aligned_reads.bam > depth.txt + ``` + The `depth.txt` file will contain the depth of coverage for each position in the genome. + +5. **Calculate Average Read Depth**: + - To calculate the average read depth, sum the depths of all positions and divide by the total number of positions covered. + + Example calculation: + ```bash + total_depth = sum(depths at all positions) + number_of_positions = total number of positions covered + average_read_depth = total_depth / number_of_positions + ``` + +### Detailed Example + +1. **Align Reads**: + - Suppose you have 50 million reads from an RNA-seq experiment, and you align them to a reference genome using an aligner like HISAT2. + - The alignment process results in a BAM file (`aligned_reads.bam`). + +2. **Count Aligned Reads**: + - Use `samtools depth` to count the reads at each position. + ```bash + samtools depth aligned_reads.bam > depth.txt + ``` + +3. **Extract Depth Information**: + - The `depth.txt` file will look something like this: + ``` + chr1 1 30 + chr1 2 32 + chr1 3 35 + ... + ``` + Each line represents a genomic position and the number of reads covering that position. + +4. **Calculate Average Read Depth**: + - Sum the coverage values and divide by the number of positions covered. 
+ + ``` python + depths = [] + with open('depth.txt') as file: + for line in file: + parts = line.split() + depth = int(parts[2]) + depths.append(depth) + + total_depth = sum(depths) + number_of_positions = len(depths) + average_read_depth = total_depth / number_of_positions + + print(f"Average Read Depth: {average_read_depth}") + ``` + +### Read Depth in RNA-Seq + +In RNA-seq, read depth is often summarized at the level of genes or exons rather than individual nucleotides. Here's how you calculate it: + +1. **Feature Quantification**: + - Tools like HTSeq-count or featureCounts assign reads to genomic features (e.g., genes, exons) and provide counts for each feature. + +2. **Summarize Read Depth**: + - Calculate read depth per feature by dividing the total number of reads mapped to that feature by the length of the feature (e.g., length of the gene or exon). + +### Example of RNA-Seq Feature Quantification + +1. **Using HTSeq-count**: + - Align reads to the reference genome. + - Use HTSeq-count to quantify reads per gene. + + ```bash + htseq-count -f bam -s no aligned_reads.bam genes.gtf > gene_counts.txt + ``` + +2. **Calculate Depth per Gene**: + - The output `gene_counts.txt` contains read counts for each gene. + - Divide the read count by the gene length to get the depth for each gene. + +### Summary + +- **Align Reads**: Map sequencing reads to the reference genome. +- **Count Aligned Reads**: Count the number of reads at each position using tools like `samtools depth`. +- **Calculate Average Read Depth**: Sum the depths at all positions and divide by the number of positions covered. +- **Feature Quantification (RNA-Seq)**: Use tools like HTSeq-count to quantify reads per gene or feature and calculate depth per feature. + +By following these steps, you can accurately calculate the read depth from sequencing results, providing insights into the coverage and quality of your sequencing data. diff --git a/Biology/Introduction to Bioinformatics VU/0. Introduction to Bioinformatics VU.md b/Biology/Introduction to Bioinformatics VU/0. Introduction to Bioinformatics VU.md new file mode 100644 index 0000000..d4e5613 --- /dev/null +++ b/Biology/Introduction to Bioinformatics VU/0. Introduction to Bioinformatics VU.md @@ -0,0 +1 @@ +# Introduction to Bioinformatics VU diff --git a/Biology/Introduction to Bioinformatics VU/1. The introduction of introduction.md b/Biology/Introduction to Bioinformatics VU/1. The introduction of introduction.md new file mode 100644 index 0000000..2f8f6df --- /dev/null +++ b/Biology/Introduction to Bioinformatics VU/1. The introduction of introduction.md @@ -0,0 +1,25 @@ + +[0. Introduction to Bioinformatics VU](0.%20Introduction%20to%20Bioinformatics%20VU.md) + +# Introduction + +## Sequencing data analyses flow + +1. Images: read color of the dots to identify A, T, G, or C +2. Reads +3. Aligned reads +4. Called variants, SNPs, SNVs + +## What can we measure + +- levels of molecules/variants in cells +- also called omics data +- molecular profiling +- huge number of measurements per sample + +## What you should know + +- ML +- Statistics +- Sequence analysis methods +- Domain knowledge in biomedicine and molecular biology \ No newline at end of file diff --git a/Biology/Multiomics Competition/2. GRN Programming Environment.md b/Biology/Multiomics Competition/2. GRN Programming Environment.md new file mode 100644 index 0000000..a3549a7 --- /dev/null +++ b/Biology/Multiomics Competition/2. 
GRN Programming Environment.md @@ -0,0 +1,248 @@ +# Packages installed + +## CellOracle + +[CellOracle](Dissecting%20cell%20identity%20via%20network%20inference%20and%20in%20silico%20gene%20perturbation.md) +CellOracle via [conda](conda.md) + +``` shell +conda create -n grn_env python=3.10 pip pysocks numpy Cython +conda activate grn_env +``` + +``` sh +pip install celloracle +``` + +### Check installation + +``` python +import celloracle as co +co.check_python_requirements() +``` + +## Gene2role + +[Gene2role](Gene2role%20a%20role-based%20gene%20embedding%20method%20for%20comparative%20analysis%20of%20signed%20gene%20regulatory%20networks.md) +Install from [Github](git.md#Github) + +## BioIB + +[BioIB](Identifying%20maximally%20informative%20signal-aware%20representations%20of%20single-cell%20data%20using%20the%20Information%20Bottleneck.md) + +``` sh +pip install bioib +``` + + +## run shell commands in py + +https://stackoverflow.com/questions/89228/how-do-i-execute-a-program-or-call-a-system-command + +## Troubleshooting + +### Can't install velocyto on MacOS + +Error: `clang: error: unsupported option '-fopenmp'` + +On apple's `llvm`, `-fopenmp` is not supported. One should use [homebrew](homebrew.md)'s `llvm`. + +The following commands will allow linking `openmp`: + +``` bash +brew install llvm libomp +export CC=/opt/homebrew/opt/llvm/bin/clang +``` + +Installed lots of packages... +Follow the commands after installation to set the correct environment variables + +[Source](https://stackoverflow.com/questions/60005176/how-to-deal-with-clang-error-unsupported-option-fopenmp-on-travis) + +### Packages fail to compile on stomics + +- Install old version before G++ 11.2 (2021) +- igraph yes +- interp yes +- units yes +- s2 +- dqrng + +Install old interp + +``` sh +Rscript -e "url='https://cran.r-project.org/src/contrib/Archive/interp/interp_1.0-33.tar.gz';install.packages(url, repos=NULL, type='source')" +``` + +Install old igraph + +``` sh +Rscript -e "url='https://cran.r-project.org/src/contrib/Archive/igraph/igraph_1.2.6.tar.gz';install.packages(url, repos=NULL, type='source')" +``` + +Install old units + +``` sh +Rscript -e "devtools:install_version('units','0.6-7')" +``` + +Install old s2 + +``` sh +Rscript -e "devtools::install_version('s2','1.0.1')" +``` + +Install old Rcpp + +``` sh +devtools::install_version('Rcpp','1.0.5') +``` + +**Scrap all this use conda version** +### terra missing dependency + +- For monocle3 +- No libgdal + - Dependency for terra + - Installed via conda + +**Install from conda** +### XML + +- for cicero + +``` sh +Rscript -e "withCallingHandlers(options('BioC_mirror'='http://mirrors.ustc.edu.cn/bioc/','repos' = c(CRAN='https://mirrors.tuna.tsinghua.edu.cn/CRAN/')), install.packages('XML'), warning = function(w) stop(w))" +``` + +- no libxml + - It's actually installed, find via `xmllint --version` + - Try `export XML_CONFIG=/usr/local/bin/xml2-config` + - Try `export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"` + - [Source](https://stackoverflow.com/questions/47540782/cannot-install-xml-package-in-r-rstudio) +- install libxml via conda +``` sh +conda install anaconda::libxml2 +``` + +- another option is this package +``` r +devtools::install_github("r-lib/xml2") +``` +- or this +``` sh +conda install conda-forge::r-xml +``` + +``` error +configure: error: "libxml not found" +ERROR: configuration failed for package ‘XML’ +* removing ‘/opt/conda/lib/R/library/XML’ + +The downloaded source packages are in + ‘/tmp/RtmpC9WDdS/downloaded_packages’ +Updating HTML 
index of packages in '.Library' +Making 'packages.html' ... done +Warning message: +In install.packages("XML") : + installation of package ‘XML’ had non-zero exit status +``` + +**Problem: r installed via conda cannot find libxml, even if it is already installed. Since I do not have sudo, and conda takes forever to solve environment, nothing I can do about it.** +### Slow downloads + +Set the correct mirror permanently + +``` sh +rm ~/.Rprofile +echo "options(repos = c(CRAN = 'https://mirrors.bfsu.edu.cn/CRAN/'))" >> ~/.Rprofile +echo "options(BioC_mirror='https://mirrors.westlake.edu.cn/bioconductor')" >> ~/.Rprofile +cat ~/.Rprofile +Rscript -e "options('repos')" +Rscript -e "source('~/.Rprofile')" +Rscript -e "BiocManager::install(c("Gviz", "GenomicRanges", "rtracklayer"))" +``` + +You can source the .Rprofile within R +``` r +source("path/to/.Rprofile") +``` + +``` sh +Rscript -e 'withCallingHandlers(options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/",repos = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/")), BiocManager::install(c("BSgenome") ), warning = function(w) stop(w))' +``` +## stomics image + +``` +## Mirror image on stomics + +System version: +``` sh +NAME="CentOS Linux" +VERSION="7 (Core)" +ID="centos" +ID_LIKE="rhel fedora" +VERSION_ID="7" +PRETTY_NAME="CentOS Linux 7 (Core)" +ANSI_COLOR="0;31" +CPE_NAME="cpe:/o:centos:centos:7" +HOME_URL="https://www.centos.org/" +BUG_REPORT_URL="https://bugs.centos.org/" + +CENTOS_MANTISBT_PROJECT="CentOS-7" +CENTOS_MANTISBT_PROJECT_VERSION="7" +REDHAT_SUPPORT_PRODUCT="centos" +REDHAT_SUPPORT_PRODUCT_VERSION="7" +``` + +### monocle3-cicero + +``` sh +conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ +conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ +conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge +conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/ +conda config --show channels +conda create -y -n monocle3 +conda clean --all -y +conda init bash +echo 'source ~/.bashrc' >> /etc/profile +echo 'conda activate monocle3' >> ~/.bashrc +echo "Rscript -e 'IRkernel::installspec()'" >> ~/.bashrc +conda install -y -n monocle3 bioconda::r-monocle3 bioconda::bioconductor-gviz bioconda::bioconductor-rtracklayer conda-forge::r-terra conda-forge::r-irkernel bioconda::bioconductor-cicero +``` + +Install nb_conda_kernels to base environment to use the r kernel in monocle3. 
For details read [github](https://github.com/anaconda/nb_conda_kernels#installation) +``` sh +conda install -y -n base conda-forge::nb_conda_kernels conda-forge::nbconvert +``` + +Set up nbconvert to discover all kernels +``` sh +echo "{" >> /home/stereonote/.jupyter/jupyter_config.json +echo " \"CondaKernelSpecManager\": {" >> /home/stereonote/.jupyter/jupyter_config.json +echo " \"kernelspec_path\": \"--user\"" >> /home/stereonote/.jupyter/jupyter_config.json +echo " }" >> /home/stereonote/.jupyter/jupyter_config.json +echo "}" >> /home/stereonote/.jupyter/jupyter_config.json +``` + +``` r +# Bioconductor设置为中科大的镜像 +options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/") +# CRAN设置成清华大学的镜像 +options(repos = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/")) +``` + +### CellOracle + + + +### bioib + + + +### gene2role + + +jupyter --config-dir +/home/stereonote/.jupyter \ No newline at end of file diff --git a/Machine Learning/Biopharma Tech Strategies in an Era Defined by Data and AI.md b/Biology/Seminars/Biopharma Tech Strategies in an Era Defined by Data and AI.md similarity index 98% rename from Machine Learning/Biopharma Tech Strategies in an Era Defined by Data and AI.md rename to Biology/Seminars/Biopharma Tech Strategies in an Era Defined by Data and AI.md index 9c74e6d..3851707 100644 --- a/Machine Learning/Biopharma Tech Strategies in an Era Defined by Data and AI.md +++ b/Biology/Seminars/Biopharma Tech Strategies in an Era Defined by Data and AI.md @@ -1,6 +1,7 @@ # Biopharma Tech Strategies in an Era Defined by Data and AI > GEN Webinar +> #seminar ## Eli Lilly diff --git a/Biology/Seminars/Dissecting cell identity via network inference and in silico gene perturbation.md b/Biology/Seminars/Dissecting cell identity via network inference and in silico gene perturbation.md index 1b5ef3b..f9d4180 100644 --- a/Biology/Seminars/Dissecting cell identity via network inference and in silico gene perturbation.md +++ b/Biology/Seminars/Dissecting cell identity via network inference and in silico gene perturbation.md @@ -3,15 +3,16 @@ #paper [Github](https://github.com/morris-lab/CellOracle) Nature [Paper](https://www.nature.com/articles/s41586-022-05688-9#Sec9), 2023 and widely used already +[Tutorial](https://morris-lab.github.io/CellOracle.documentation/) +[Notebooks used for the paper, contain code](https://github.com/morris-lab/CellOracle/tree/master/docs/notebooks) Washington University School of Medicine in St Louis, St Louis -[Gene Regulatory Network](Gene%20Regulatory%20Network.md) inference with SCENIC+ GRN perturbation: change [Transcription Factor](Transcription%20Factor.md) level and **predict** cell identity shift ## Abstract -Developed CellOracle for GRN inference and perturbation analysis. +Developed CellOracle for [Gene Regulatory Network](Gene%20Regulatory%20Network.md) inference and perturbation analysis. Evaluated on mouse and human hematopoiesis, and zebrafish embryogenesis data. The algorithm correctly predicts the outcome of TF knockout and identified additional regulators. @@ -23,7 +24,7 @@ CellOracle consists of several steps: 1. **Base GRN** construction using [scATAC-seq](scATAC-seq.md) data or [Promoter](Promoter.md) databases. 2. [scRNA-seq](scRNA-seq.md) data preprocessing. 3. Context-dependent **GRN inference** using scRNA-seq data. -4. Network analysis. [GRN Analysis](GRN%20Analysis.md)? +4. Network analysis. Basic [GRN topology Analysis](GRN%20Analysis.md) 5. 
Simulation of **cell identity** following [Transcription Factor](Transcription%20Factor.md) perturbation. 6. Calculation of the _pseudotime gradient vector field_ and the _inner-product score_ to generate **perturbation scores**. @@ -31,9 +32,278 @@ CellOracle consists of several steps: The base GRN contains _unweighted_, _directional edges_ between a TF and its target gene. -CellOracle uses the _regulatory region_'s genomic DNA sequence and TF-binding motifs for this task. CellOracle identifies [Regulatory Candidate Genes](Regulatory%20Candidate%20Genes.md) by scanning for TF-binding motifs within the regulatory DNA sequences ([Promoter](Promoter.md) and [Enhancer](Enhancer.md)) of [Open Chromatin Regions](Open%20Chromatin%20Regions.md). +There are two steps, explained below. + +CellOracle uses the _regulatory region_'s genomic DNA sequence and TF-binding motifs to construct the base GRN. CellOracle identifies [Regulatory Candidate Genes](Regulatory%20Candidate%20Genes.md) by scanning for TF-binding motifs within the regulatory DNA sequences ([Promoter](Promoter.md) and [Enhancer](Enhancer.md)) of [Open Chromatin Regions](Open%20Chromatin%20Regions.md). i.e. look for TF-binding motifs within regulatory DNA sequences within open chromatin regions. The TF-binding motifs are associated with [Regulatory Candidate Genes](Regulatory%20Candidate%20Genes.md). This steps narrows down the scope of [Regulatory Candidate Genes](Regulatory%20Candidate%20Genes.md) and helps define the directionality of regulatory edges in the GRN. +However, not all TF binding motifs are actively regulating gene expression. [Transcription Regulation](Transcription%20Regulation.md) contain many context-dependent factors. For example, both [Insulator](Insulator.md)s and [Silencer](Silencer.md)s can prevent [Enhancer](Enhancer.md)s and [Promoter](Promoter.md)s from activating transcription. In the next step, [scRNA-seq](scRNA-seq.md) data are used to refine the network. + +#### Identification of promoter and enhancer regions + +Species and tissue-specific scATAC-seq data, which identify [Open Chromatin Regions](Open%20Chromatin%20Regions.md), are used to find promoter and enhancer regions. + +In the absence of sample/tissue-specific data, using broader scATAC-seq data yields a more general network. This more general network can be fitted to a specific sample using scRNA-seq data during the model fitting process. The final product will consist of context-dependent (cell-type or state-specific) GRN configurations. + +**Proximal regulatory DNA elements** are located by locating **TSSs** within the accessible ATAC-seq peaks. This annotation is performed using [HOMER]([http://homer.ucsd.edu/homer/](http://homer.ucsd.edu/homer/)). + +TSS: transcription starting sites, within [Promoter](Promoter.md)s. + +**Distal regulatory DNA elements** are obtained using Cicero, a computational tool that identifies [Cis-Regulatory Elements](Transcription%20Regulation.md#Cis-Regulatory%20Elements) on the basis of [Co-Accessibility](Co-Accessible%20Peaks.md), as derived from ATAC-seq peak information. When two ATAC peaks are co-accessible, there is likely some regulatory relationship between them. + +Using Cicero, pairs of peaks within 500 kb of each other are identified and calculated a co-accessibility score. Pairs of peaks overlapping a TSS and with a **high co-accessibility score** (>0.8) are identified as distal regulatory elements. + +The output is a *bed file* in which all _cis_-regulatory peaks are paired with the target gene name. 
This bed file is used in the next step. + +CellOracle can also use other input data types to define _cis_-regulatory elements. For example, a database of promoter and enhancer DNA sequences or bulk ATAC-seq data can serve as an alternative if available as a .bed file. + +A mouse haematopoiesis base GRN is built with a mouse scATAC-seq atlas consisting of around 100k cells across 13 tissues. This is built into the CellOracle library to support GRN inference when there is none sample-specific scATAC-seq datasets. They also generated general promoter base GRNs for several key organisms commonly used to study _development_, including *10 species and 23 reference genomes*. + +#### Motif scan of promoter and enhancer DNA sequences + +CellOracle uses [gimmemotifs]([https://gimmemotifs.readthedocs.io/en/master/](https://gimmemotifs.readthedocs.io/en/master/)) to look for TF binding motifs in each sequence identified above. + +A database for TF binding motifs is required. In this paper they used the gimmemotifs motif v.5 data. CellOracle also provides motif datasets for ten species generated from the [CisBP v.2 database](http://cisbp.ccbr.utoronto.ca/). + +CellOracle exports a binary data table to indicate potential connections between a *TF* and a *gene* . It also reports the TF-binding DNA region. + +### scRNA-seq data preprocessing + +scRNA-seq data need be preprocessed in `Anndata` format. The python library `scanpy` should be used. `Seurat` can also be used. Then the data must be converted to `Anndata` using the CellOracle function `seuratToAnndata`. + +The following steps are performed: +1. filter out zero-count genes by UMI count using `scanpy.pp.filter_genes(min_counts=1)` +2. Normalize by total UMI count per cell `sc.pp.normalize_per_cell(key_n_counts=‘n_counts_all’)` +3. Find **highly variable genes** by `scanpy.pp.filter_genes_dispersion(n_top_genes=2000~3000)` +4. Gene expression is log transformed, scaled, and subjected to dimensional reduction and clustering +5. Non log-transformed gene expression matrix (**GEM**) is also retained for downstream GRN calculation and simulation + +### Context-dependent GRN inference using scRNA-seq data + +A **regularized linear model** is used to predict target gene expression from regulatory gene expression. The regulatory genes and their corresponding target genes are identified in the previous base GRN construction step. The relationship is shown below: $${x}_{j}=\,\mathop{\sum }\limits_{i=0}^{n}{b}_{i,j}{x}_{i}+\,{c}_{j},$$ +where $x_j$ is single target gene expression and $x_i$ is the gene expression value of the [Regulatory Candidate Gene](Regulatory%20Candidate%20Genes.md) that regulates gene $x_j$. $b_{i,j}$ is the coefficient value of the linear model (but $b_{i,j}$ = 0 if $i$ = $j$), and $c$ is the intercept for this model. Here, we use the list of potential regulatory genes for each target gene generated in the [previous base GRN construction step](#Motif%20scan%20of%20promoter%20and%20enhancer%20DNA%20sequences). + +$${x}_{i}\in \{{x}_{0},\,{x}_{1},\,\ldots {x}_{n}\}={\rm{Regulatory}}\,{\rm{candidate\; TFs\; of\; gene}}\,{x}_{j}$$ + +The GEM is first clustered. Then regression is calculated for each cluster. This is called [Cluster-wise Linear Regression](Cluster-wise%20Linear%20Regression.md). This is done because regulatory gene-target gene relationship is non-linear at the whole cell population level, but within each cluster, the cells are more similar to each other, and linear regression has better performance. 
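To make the cluster-wise fit concrete, here is a minimal sketch (not CellOracle's actual API): it assumes the retained GEM as a pandas DataFrame of cells × genes, one cluster label per cell, and a `base_grn` dict mapping each target gene to its candidate regulatory TFs from the base GRN. Scikit-learn's plain `Ridge` stands in for the regularized models discussed in the next paragraph, and the function name and parameters are illustrative.

``` python
import pandas as pd
from sklearn.linear_model import Ridge  # stand-in for the Bayesian/Bagging Ridge models used in the paper

def fit_cluster_grns(gem: pd.DataFrame, clusters: pd.Series, base_grn: dict, alpha: float = 1.0):
    """Fit one regularized linear model per (cluster, target gene).

    gem      : cells x genes expression matrix (the GEM retained in preprocessing)
    clusters : cluster label per cell, indexed like gem
    base_grn : {target_gene: [candidate regulator genes]} from the base GRN
    returns  : {cluster: {target_gene: Series of coefficients b_ij indexed by regulator}}
    """
    coefs = {}
    for cluster in clusters.unique():
        sub = gem.loc[clusters == cluster]                  # fit within one cluster only
        coefs[cluster] = {}
        for target, regulators in base_grn.items():
            regs = [g for g in regulators if g in sub.columns and g != target]
            if target not in sub.columns or not regs:
                continue
            model = Ridge(alpha=alpha)                      # L2-regularized linear model
            model.fit(sub[regs].values, sub[target].values)
            coefs[cluster][target] = pd.Series(model.coef_, index=regs)
    return coefs
```

Unlike this point-estimate sketch, the Bayesian or Bagging Ridge models used by CellOracle return a posterior distribution over each coefficient $b_{i,j}$ rather than a single value.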
+ +[L2 Regularization](L2%20Regularization.md) is applied for the linear model to reduce overfitting. Regularization reduces large coefficients due to overfitting and identifies informative variables. The [Bayesian Ridge Regression](Bayesian%20Ridge%20Regression.md) model or [Bagging Ridge Regression](Bagging%20Ridge%20Regression.md) model are used. Think of them like an advanced linear model that is more robust against overfitting. They also provide the coefficients as a distribution such that the reproducibility of the inferred gene-gene connection can be analyzed. Remember that the coefficient $b$ is estimated. Hence its reliability should be analyzed. Read more about parameter estimation at [MLE vs. Bayesian Parameter Estimation](MLE%20vs.%20Bayesian%20Parameter%20Estimation.md) + +In both models, the output is posterior distribution of coefficient value $b$: + + +$${x}_{j}\sim {\rm{N}}{\rm{o}}{\rm{r}}{\rm{m}}{\rm{a}}{\rm{l}}\,\,(\mathop{\sum }\limits_{i=1}^{n}{b}_{i,j}{x}_{i}+{c}_{j},{\epsilon })$$ + +$$b\sim {\rm{N}}{\rm{o}}{\rm{r}}{\rm{m}}{\rm{a}}{\rm{l}}\,(\,{\mu }_{b},{{\sigma }}_{b})$$ +where $\mu_b$ is the centre of the distribution of $b$, and $\sigma_b$ is the standard deviation of $b$. The normal distribution is defined using mean and standard deviation. + +Between the two, Bayesian ridge uses less computational resource. Bagging ridge gives better results. + +Using the posterior distribution, one sample t-test (null hypothesis: $b=0$) can be performed for $b$ to calculate its p value. This helps identify robust connections while minimizing connections derived from random noise. + +The Bayesian ridge uses the following priors: + +$b\sim {\rm{N}}{\rm{o}}{\rm{r}}{\rm{m}}{\rm{a}}{\rm{l}}\,(0,{{\sigma }}_{b})$ + +${{\sigma }}_{b}^{-1}\sim {\rm{G}}{\rm{a}}{\rm{m}}{\rm{m}}{\rm{a}}\,({10}^{-6},{10}^{-6})$ + +$b$ is assumed to have a normal distribution centered at 0 with standard deviation $\sigma_b$, while $\sigma_b$ is assumed to have an inverse-[Gamma Distribution](Gamma%20Distribution.md) defined by its two parameters. ${\sigma }_{b}$ is selected to represent **non-informative prior distributions**. This allows the scRNA data to dominate the inference process. Read more at [Non-informative Prior](Bayesian%20Ridge%20Regression.md#Non-informative%20Prior) + +Bayesian ridge uses data in the fitting process to estimate the optimal regularization strength. In the Bagging Ridge model, custom regularization strength can be manually set. + +### Simulation of cell identity following perturbation of regulatory genes + +After obtaining the GRN, CellOracle simulates how cell identity shifts following perturbation of regulatory genes. The simulated gene expression values are converted into **2D vectors** representing the direction of cell-state transition, a adapting the visualization method previously used by [RNA velocity](https://www.nature.com/articles/s41586-022-05688-9#ref-CR52 "la Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).") + +The process consists of four steps: +1. data preprocessing +2. signal propagation within the GRN +3. estimation of transition probabilities +4. analysis of simulated transition in cell identity + +#### Data preprocessing + +Velocyto, a Python package for RNA-velocity analysis ([https://velocyto.org](https://velocyto.org/)) is adapted for the simulation of cell identity. 
The scRNA-seq data is processed by first filtering the genes then imputing dropouts using [K-nearest Neighbor Imputation](K-nearest%20Neighbor%20Imputation.md), according to Velocyto's requirements. + +#### Within-network signal propagation + +##### Gradient of the linear model + +The linear relationships between target genes and regulatory genes computed in the [previous step](#Context-dependent%20GRN%20inference%20using%20scRNA-seq%20data) is used to predict target gene ($x_j$) expression change following regulatory gene ($x_i$) expression change. A partial derivative $\frac{\partial {x}_{j}}{\partial {x}_{i}}$ is calculated as the _rate of change_. Since the model is linear, the derivative $\frac{\partial {x}_{j}}{\partial {x}_{i}}$ is a constant value and already calculated as $b_{i,j}$ in the previous step if the gene $j$ is directly regulated by gene $i$: + +$\frac{\partial {x}_{j}}{\partial {x}_{i}}={b}_{i,j}.$ + +_That's why they used a linear model - easier derivative._ + +And we calculate the shift of target gene ${\Delta x}_{j}$ in response to the shift of regulatory gene ${\Delta x}_{i}$: + +${\Delta x}_{j}=\frac{\partial {x}_{j}}{\partial {x}_{i}}{\Delta x}_{i}={b}_{i,j}{\Delta x}_{i}$ + +Indirect connections are accounted for by multiplying the edges to form a composite function of the linear models, which is differentiable accordingly. We apply the _chain rule_ to calculate the partial derivative of the target genes, even between indirectly connected nodes. + +$\frac{\partial {x}_{j}}{\partial {x}_{i}}=\mathop{\prod }\limits_{k=0}^{n}\frac{\partial {x}_{k+1}}{\partial {x}_{k}}=\mathop{\prod }\limits_{k=0}^{n}{b}_{k,k+1}$, + +where + +$\begin{array}{c}{x}_{k}\in \{{x}_{0},\,{x}_{1},\,\ldots {x}_{n}\}\,=\,\text{Gene expression of ordered network}\\ \,\text{nodes on the shortest path from gene}\,i\,\text{to gene}\,j.\end{array}$ + +For example, when we consider the network edge from gene 0 to 1 to 2, the small shift of gene 2 in response to gene 0 can be calculated using the intermediate connection with gene 1. + +$\frac{\partial {x}_{2}}{\partial {x}_{0}}=\frac{\partial {x}_{1}}{\partial {x}_{0}}\times \frac{\partial {x}_{2}}{\partial {x}_{1}}={b}_{0,1}\times {b}_{1,2}$ + +${\Delta x}_{2}=\frac{\partial {x}_{2}}{\partial {x}_{0}}{\Delta x}_{0}={b}_{0,1}{b}_{1,2}{\Delta x}_{0}$ + +This way, we focus on the _gradient_ of gene expression equations rather than the _absolute expression values_ so that we do not model the error or the intercept of the model, which potentially includes unobservable factors within the scRNA-seq data. + +The calculation above is implemented as vector and matrix multiplication. First, the linear regression model can be shown as follows. + +${X}^{{\prime} }=X\cdot B+C$, + +where the $X\in {{\mathbb{R}}}^{1\times N}$ is a gene expression vector containing **$N$ genes**, $C\in {{\mathbb{R}}}^{1\times N}$ is the intercept vector, $B\in {{\mathbb{R}}}^{N\times N}$ is the [Network Adjacency Matrix](Network%20Adjacency%20Matrix.md), and each element $b_{i,j}$ is the coefficient value of the linear model from regulatory gene $i$ to target gene $j$. + +##### Perturbation propagation + +First, we set the perturbation input vector $\Delta {X}_{{\rm{input}}}\in {{\mathbb{R}}}^{1\times N}$, a sparse vector consisting of zeros except for the perturbed gene $i$. For the perturbed gene, we set the shift of the TF to be simulated. The CellOracle function will produce an error if the user enters a gene shift corresponding to an out-of-distribution value. 
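A minimal numpy sketch of this knockout input for a single cell, anticipating the propagation rule formalized just below (the perturbed gene's shift is held fixed between iterations, and simulated expression is kept non-negative). Here `B` is the cluster-specific coefficient matrix from the fitted GRN; the function name and defaults are illustrative rather than CellOracle's API.

``` python
import numpy as np

def simulate_ko_shift(x: np.ndarray, B: np.ndarray, gene_idx: int, n_iter: int = 3) -> np.ndarray:
    """Propagate a single-gene knockout through the GRN for one cell.

    x        : (N,) current expression vector of the cell
    B        : (N, N) adjacency matrix, B[i, j] = coefficient from regulator i to target j
    gene_idx : index of the perturbed TF (knockout: expression forced to 0)
    returns  : (N,) simulated shift Delta X after n_iter propagation cycles
    """
    delta_input = np.zeros_like(x, dtype=float)
    delta_input[gene_idx] = 0.0 - x[gene_idx]        # KO shift: zero minus current expression

    delta = delta_input.copy()
    for _ in range(n_iter):                          # three cycles by default, as in the text
        delta = delta @ B                            # propagate one step down the network
        delta[gene_idx] = delta_input[gene_idx]      # keep the perturbed gene fixed at the input shift
        delta = np.maximum(delta, -x)                # clip so simulated expression x + delta stays >= 0
    return delta
```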
+ +> [!note] +> The calculation is done for each cell. Hence the input vector has 1 row and N columns matching N genes in the cell. + +Next, we calculate the shift of the first target gene. The change then propagates down the GRN: + +$\Delta {X}_{{\rm{simulated}},n=1}=\Delta {X}_{{\rm{input}}}\cdot B$. + +However, we **fix** the perturbed gene $i$ value, and the ${\Delta x}_{i}$ retains the same value as the input state. Thus, the following calculation will correspond to both the first and the second downstream gene shift calculations. + +$\Delta {X}_{{\rm{simulated}},n=2}=\Delta {X}_{{\rm{simulated}},n=1}\cdot B$. + +Likewise, the recurrent calculation is performed to propagate the shift from gene to gene in the GRN. Repeating this calculation for $n$ iterations, we can estimate the effects on the first to the $n_{th}$ indirect target gene: + +$\Delta {X}_{{\rm{simulated}},n}=\Delta {X}_{{\rm{simulated}},n-1}\cdot B$. + +In each of the calculation above, the respective coefficients between the input and output genes in $B$ are used for calculation. CellOracle performs **three** iterative cycles in the default setting, sufficient to predict the directionality of changes in cell identity. Higher numbers can lead to errors. + +Of note, CellOracle performs the calculations cluster-wise after splitting the whole GEM into **gene expression submatrices** on the basis of the assumption that each cluster has a unique GRN configuration. Remember that the linear relationships were also computed after clustering the GEM into clusters. Each cluster has a different linear relationship. + +Also, gene expression values are checked between each iterative calculation to confirm whether the simulated shift corresponds to a _biologically plausible range_. If the expression value for a gene is negative, this value is adjusted to zero. + +#### Estimation of transition probabilities + +Using the simulated gene expression shift vector $\Delta {X}_{{\rm{simulated}}}\in {{\mathbb{R}}}^{1\times N}$ after TF perturbation, CellOracle calculates the probabilities of cell identity transition. Then it projects the directionality of cell identity transition onto the dimensional reduction embedding, visualized as arrows pointing in the direction of cell identity shift. + +> [!info] +> For this task, CellOracle uses a similar approach to [Velocyto](https://github.com/velocyto-team/velocyto.py). Velocyto visualizes future cell identity on the basis of the RNA-splicing information and calculated vectors from RNA synthesis and degradation differential equations. CellOracle uses the simulated gene expression vector $\Delta {X}_{{\rm{simulated}}}$ instead of RNA-velocity vectors. + +First, CellOracle estimates the cell transition probability matrix $P\in {{\mathbb{R}}}^{M\times M}$ ($M$ is number of cells): $p_{ij}$, the element in the matrix $P$, is defined as the probability that cell $i$ will adopt a similar cell identity to cell $j$ after perturbation. To calculate $p_{ij}$, CellOracle calculates the Pearson’s correlation coefficient between $d_i$ and $r_{ij}$: + +${p}_{ij}=\frac{\exp \left(corr\left({r}_{ij}{,d}_{i}\right)/T\right)}{\sum _{j\in G}\exp \left(corr\left({r}_{ij}{,d}_{i}\right)/T\right)}$, + +where $d_i$ is the simulated gene expression shift vector $\Delta {X}_{{\rm{simulated}}}\in {{\mathbb{R}}}^{1\times N}$ for cell $i$, and ${r}_{ij}\in {{\mathbb{R}}}^{1\times N}$ is a subtraction of the gene expression vector $X\in {{\mathbb{R}}}^{1\times N}$ between cell $i$ and cell $j$ in the original GEM. 
The value is normalized by the [Softmax](Softmax.md) function (default temperature parameter $T$ is 0.05). The calculation of $p_ij$ uses neighboring cells of cell $i$. The KNN method selects local neighbors in the dimensional reduction embedding space (_k_ = 200 as default). + +> [!note] +> Pearson correlation measures how similar the simulated gene expression shift vector is with the actual gene expression difference. The Softmax function first exponentiate the correlation value (call it $r$ for now), making the difference larger then divide $r$ by the total $r$ for all the cells. This way a probability to shift to cell $i$ is calculated. The total number of neighboring cells is 200. $G$ refers to these cells. + +> [!note] KNN +> Just like [K-nearest Neighbor Imputation](K-nearest%20Neighbor%20Imputation.md), the KNN method to select neighbors is based on a chosen distance metric. The closest 200 neighbors are selected to calculate the transition probabilities of cell $i$ to them. + +#### Calculation of simulated cell-state transition vector + +With the transition probability matrix $P$, the direction of cell identity shift can be plotted on the dimension reduced plot. + +> [!note] Dimension Reduction +> They recommend force directed graph as the dimension reduction method. An alternative when there are too many overlapping branches is UMAP. + +CellOracle calculates the local **weighted average** of vector ${V}_{i,j}\in {{\mathbb{R}}}^{1\times 2},{V}_{i,j}$ denotes the 2D vector obtained by subtracting the 2D coordinates in the dimensional reduction embedding between cell $i$ and cell $j$ $({\rm{cell}}\;j\in G)$. + +${V}_{i,{\rm{s}}{\rm{i}}{\rm{m}}{\rm{u}}{\rm{l}}{\rm{a}}{\rm{t}}{\rm{e}}{\rm{d}}}=\,\sum _{j\in G}{p}_{ij}{V}_{i,j}$ + +${V}_{i,{\rm{simulated}}}$ is the simulated 2D cell identity shift vector. + +> [!note] Why is it Weighted Average +> Imagine two vectors going from the origin. If you subtract one from the other, you get the vector that connects the tips of both. This difference is then scaled by the transition probability and summed. +> +> Remember that in the Softmax function, the probability is obtained by dividing by the **total** exponentiated Pearson correlation. The probabilities are the weights that makes ${V}_{i,{\rm{simulated}}}$ a **weighted average**. + +#### Calculation of vector field + +It is hard to visualize the vectors for each cell in a reasonably large graph. A grid, ${V}_{{\rm{vector}}{\rm{field}}}=\,{{\mathbb{R}}}^{2\times L\times L}$, is defined for the plot. $L$ is grid number, default $L$ is 40. Then for each grid point, ${v}_{{\rm{grid}}}\in {{\mathbb{R}}}^{2}$, an element in the ${V}_{{\rm{vector}}{\rm{field}}}$, the nearby vectors are averaged using [Gaussian Kernel Smoothing](Gaussian%20Smoothing.md). + +> [!note] Gaussian Smoothing +> Gaussian smoothing does a weighted average for all the vectors in the window using weight defined by a Gaussian distribution. + + +${v}_{{\rm{grid}}}={\sum }_{i\in H}{K}_{\sigma }(g,\,{V}_{i,{\rm{simulated}}}){V}_{i,{\rm{simulated}}}$, + +where the $g\in {{\mathbb{R}}}^{2}$ denotes _grid point coordinates_, $H$ is the neighbor cells of $g$, and ${K}_{\sigma }$ is the Gaussian kernel weight. The higher the distance between the vector and the grid point, the smaller the weight: + +${K}_{\sigma }({v}_{0},{v}_{1})=\exp \left(\frac{-{\parallel {v}_{0}-{v}_{1}\parallel }^{2}}{2{\sigma }^{2}}\right)$. + +This way each grid has a vector that is the weighted average of nearby vectors. 
The whole trend can be visualized much more easily. + +> [!note] Symbols +> $N$: the number of genes +> $M$: the number of cells +> $G$: group of neighboring cells of cell $i$ +> $B$: the network adjacency matrix +> $g$: grid points + +### Calculation of pseudotime gradient vector field and inner-product score to generate a perturbation score + +[Pseudotime](6.%20时空组学拟时序分析.md) is an analysis technique of single cell/spatial RNA sequencing data. It infers the developmental trajectory from the gene expression in the cells that come from a common origin. For example, how neural cells develop into each other can be studied using pseudotime analysis. + +To aid the interpretation of CellOracle's simulation of identity shift, pseudotime analysis is done for the data. The result is then converted into a grid of vectors like the cell identity shift result. We quantify the similarity between the differentiation vector fields and KO simulation vector fields by calculating their inner-product value, which we term the **perturbation score** (PS) + +#### Differentiation pseudotime calculation + +Differentiation pseudotime is calculated using DPT, a diffusion-map-based pseudotime calculation algorithm, using the scanpy.tl.dpt function. Other methods such as _Monocle_ and _URD_ also work with CellOracle. + +#### Differentiation vector calculation based on pseudotime data + +The pseudotime data are transferred to the _n_ by _n_ 2D grid points (_n_ = 40 as default). Two methods are implemented: KNN regression and polynomial regression (add notes later). + +> [!info] Choice of Method +> Polynomial regression should be used when the developmental branch is relatively simple bifurcation. KNN regression should be used when the branching is more complex. + +The gradient of pseudotime data on the 2D grid points are calculated using the numpy.gradient function, producing the 2D vector map representing the _direction of differentiation_. + +#### Inner-product value calculation between differentiation and KO simulation vector field + +The inner product is calculated between the two matrices to get PS (-1 to 1). A positive PS (green) means that perturbation __promotes__ differentiation, while a negative PS means perturbation __represses__ differentiation. + +> [!note] KO Simulation +> In a knockout (KO) simulation, if the PS is positive, it means removing the TF promotes differentiation. Hence the TF itself inhibits differentiation. If the PS is negative, it means the TF promotes differentiation. + +#### PS calculation with randomized GRN model to calculate PS cut-off value + +CellOracle produces randomized GRN models to calculate a cutoff value for negative PS in TF KO simulations. + +1. A random GRN is generated (randomized $B$) +2. A randomized simulation vector ($\Delta X_\text{input}$ from [Perturbation propagation](#Perturbation%20propagation)) is used to calculate PS. +3. The PS from random GRN forms the null distribution. Then the cut-off value is chosen from the 99% percentile of the null distribution. + +> [!note] FPR +> The 99% percentile is chosen to have a low [False Positive Rate (FPR)](Confusion%20Matrix.md#False%20Positive%20Rate%20(FPR)). At 99% percentile, the FPR is 0.01. FPR refers to the proportion of negative cases (in this case negative PS due to random chance) that are labeled as positive (in this case negative PS is caused by the TF promoting differentiation). Only the negative scores lower than the threshold are labeled as positives. 
+> +> Why not set the threshold to the minimum of randomized PS? Because the minimum value can be very low due to random chance. It will reduce FPR to minimum, but can reduce [True Positive Rate (TPR) / Sensitivity / Recall](Confusion%20Matrix.md#True%20Positive%20Rate%20(TPR)%20/%20Sensitivity%20/%20Recall). Furthermore, the randomization can vary the minimum greatly, which is not good for reproducibility, while the 99% percentile is more stable for a distribution. + +#### Focus on one development branch + +CellOracle enables analysis of one development branch. See their official tutorial. + +### Network Analysis + +CellOracle can do standard [GRN Analysis](GRN%20Analysis.md) like [Network Topology Metrics](Network%20Topology%20Metrics.md) and module analysis. Before the analysis, weak connections need to be filtered out. The criteria can be defined by the user. + +CellOracle uses [igraph]([https://igraph.org](https://igraph.org/)) for this step. + +### Choice of dimensionality reduction method + +For the force-directed graph calculation, we recommend using Scanpy’s sc.pl.draw_graph [function](https://www.nature.com/articles/s41586-022-05688-9#ref-CR59 "Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).") or [SPRING](https://www.nature.com/articles/s41586-022-05688-9#ref-CR60 "Weinreb, C., Wolock, S. & Klein, A. M. SPRING: a kinetic interface for visualizing high dimensional single-cell expression data. Bioinformatics 34, 1246–1248 (2018)."). Both internally use force atlas 2 (ref. [61](https://www.nature.com/articles/s41586-022-05688-9#ref-CR61 "Jacomy, M., Venturini, T., Heymann, S. & Bastian, M. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PLoS ONE 9, e98679 (2014).")). Compared to UMAP, force-directed graphs can capture more fine-branching structures but can be unstable if the data have many branches that can overlap. To avoid branch overlap, PAGA cell trajectory information can be used to initiate the force-directed graph calculation: [https://scanpy.readthedocs.io/en/stable/tutorials.html#](https://scanpy.readthedocs.io/en/stable/tutorials.html#)[https://github.com/theislab/paga](https://github.com/theislab/paga). + +We recommend using force-directed graphs as a first choice because they generally produce a high-resolution lineage structure. However, we recommend UMAP as a reliable alternative if overlapping branches are observed. In our CellOracle tutorial, we show the detailed guide and code for the dimensionality reduction implementation, including data preprocessing: [https://morris-lab.github.io/CellOracle.documentation](https://morris-lab.github.io/CellOracle.documentation). + +## Data + +All data, including sequencing reads and single-cell expression matrices, are available from the GEO under accession codes [GSE72859](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72859) (ref. [16](https://www.nature.com/articles/s41586-022-05688-9#ref-CR16 "Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).")), [GSE112824](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE112824) (ref. [32](https://www.nature.com/articles/s41586-022-05688-9#ref-CR32 "Farrell, J. A. et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. 
Science 360, eaar3131 (2018).")) and [GSE145298](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145298) for the zebrafish profiling from this study; and from ArrayExpress under accession codes [E-MTAB-7325](http://www.ebi.ac.uk/microarray-as/aer/result?queryFor=Experiment&eAccession=E-MTAB-7325) (_Tal1_−/− chimeras) and [E-MTAB-7324](http://www.ebi.ac.uk/microarray-as/aer/result?queryFor=Experiment&eAccession=E-MTAB-7324) (wild-type chimeras). Simulations can be explored at [https://celloracle.org](https://celloracle.org/). \ No newline at end of file diff --git a/Biology/Seminars/Gene2role a role-based gene embedding method for comparative analysis of signed gene regulatory networks.md b/Biology/Seminars/Gene2role a role-based gene embedding method for comparative analysis of signed gene regulatory networks.md index 7ba2e25..77551f2 100644 --- a/Biology/Seminars/Gene2role a role-based gene embedding method for comparative analysis of signed gene regulatory networks.md +++ b/Biology/Seminars/Gene2role a role-based gene embedding method for comparative analysis of signed gene regulatory networks.md @@ -217,4 +217,8 @@ Also network topology is an important piece of information from network. ### Why not use proximity based embeddings to compare? -As [Simple network](#Simple%20network) showed, proximity based methods cannot accurately capture structural information of genes. \ No newline at end of file +As [Simple network](#Simple%20network) showed, proximity based methods cannot accurately capture structural information of genes. + + + + diff --git a/Biology/Seminars/Identifying maximally informative signal-aware representations of single-cell data using the Information Bottleneck.md b/Biology/Seminars/Identifying maximally informative signal-aware representations of single-cell data using the Information Bottleneck.md index 87e668b..5a8f7c9 100644 --- a/Biology/Seminars/Identifying maximally informative signal-aware representations of single-cell data using the Information Bottleneck.md +++ b/Biology/Seminars/Identifying maximally informative signal-aware representations of single-cell data using the Information Bottleneck.md @@ -6,8 +6,8 @@ #paper difference between doing ANOVA or PLS-DA? -identify cell subpopulations, metagene grouping, hierarchy -outperforms baseline clustering method +- identify cell subpopulations, metagene grouping, hierarchy +- outperforms baseline clustering method ## Abstract @@ -33,6 +33,13 @@ When aiming to uncover factors related to a specific biological signal, it is a The bioIB algorithm outputs a compressed representation of scRNA data with **metagenes**, which are clustered probabilistic mapping of genes. The probabilistic construction preserves gene-level biological _interpretability_, allowing characterization of each metagene. +> [!info] Clustered Probabilistic Mapping of Genes +> The genes are mapped into clusters called metagenes. The mapping is not absolute but probabilistic. +> +> For example, gene $g_i$ might have a 70% probability of belonging to metagene $M_1$, a 20% probability of belonging to metagene $M_2$​, and a 10% probability of belonging to metagene $M_3$. +> +> Using a probabilistic approach captures the inherent complexity in biological data. Genes may be involved in multiple pathways and functions. + Compared with dimensionality reduction techniques such as PCA, bioIB takes into account the biological signal. Compared with deep learning clustering methods, BioIB is more interpretable. 
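As a toy illustration of this probabilistic mapping (not bioIB's actual output format), the snippet below builds a made-up gene-to-metagene probability matrix and reads off one possible metagene-level view of the data by weighting genes with their membership probabilities; the real compressed representation is produced by the Information Bottleneck optimization described below.

``` python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_metagenes = 5, 4, 2

X = rng.poisson(5, size=(n_cells, n_genes)).astype(float)   # toy count matrix, cells x genes

# Hypothetical probabilistic gene -> metagene mapping; each row sums to 1
# (e.g. gene 0 belongs to metagene 0 with probability 0.7 and to metagene 1 with 0.3).
P = np.array([[0.7, 0.3],
              [0.2, 0.8],
              [0.9, 0.1],
              [0.5, 0.5]])

X_meta = X @ P   # cells x metagenes: genes contribute to each metagene according to their probabilities
print(X_meta.shape)   # (5, 2)
```

Because the mapping is soft, a single gene can contribute to several metagenes, which is how gene-level interpretability is preserved.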
BioIB also outputs a hierarchy of metagenes, reflecting the inherent data structure relative to the signal of interest, elucidating their significance in distinguishing between biological labels, and illustrating their interrelations with both one another and the underlying cellular populations. @@ -47,7 +54,7 @@ $R$ means real numbers Input to bioIB: a count matrix $X\in{R^{N\times{G}}}$, where N is the number of cells and G the number of genes; a vector of cell labels $S\in{R^{N\times1}}$. -BioIB outputs a compressed representation $\hat{X}\in{R^{N\times{M}}}$ of $N$ cells and $M$ metagenes, which optimizes tradeoff between compression and information about the signal of interest, denoted by $Y\in{R^K}$. $Y$ is all possible cell states, like {disease, healthy}. $S$ must be a member of $Y$, or $\forall{i}$, $S_i\in{Y}$. Metagenes signify major patterns of gene expression variation underlying the labeled signal. +BioIB outputs a compressed representation $\hat{X}\in{R^{N\times{M}}}$ of $N$ cells and $M$ metagenes, which optimizes tradeoff between compression and information about the signal of interest, denoted by $Y\in{R^K}$. $Y$ is all possible cell states, like {disease, healthy}. Members of $S$ must be a member of $Y$, or $\forall{i}$, $S_i\in{Y}$. Metagenes signify major patterns of gene expression variation underlying the labeled signal. $\hat{X}$ is obtained by minimizing the **mutual information** with the original data $X$, representing complexity, and maximizing the mutual information with the label $Y$, representing accuracy. $$\hat{X} = argmin_\hat{X}(I(X,\hat{X})-\beta{I(\hat{X},Y)})$$ The parameter $\beta$ determines the the level of compression. When it is $\infty$, it means only maximize accuracy, or no compression, $X=\hat{X}$. When it is 0, it means only minimize complexity, or cluster all genes into 1 cluster. Hence, $\beta$ determines the number of clusters. The hierarchy of metagenes is obtained by gradually decreasing $\beta$ in a reverse-annealing process. @@ -62,7 +69,7 @@ bioIB can also capture the relationships between related cell types, defined as Heterogeneity can be attributed to multiple factors, like organ of origin and developmental stage. Using bioIB, different compression of data can be achieved by selecting different signals of interest. -Chose to look at macrophages. +Chose to look at macrophages from different organs and different developmental stages. Looking at the hierarchy, found a metagene specific to late gestational stage. GO analysis found they are enriched for immune processes. 
diff --git a/Biology/Seminars/ScenicPlus single cell multiomic inference of enhancers and gene regulatory networks.md b/Biology/Seminars/ScenicPlus single cell multiomic inference of enhancers and gene regulatory networks.md index 4b63c9a..a0c037f 100644 --- a/Biology/Seminars/ScenicPlus single cell multiomic inference of enhancers and gene regulatory networks.md +++ b/Biology/Seminars/ScenicPlus single cell multiomic inference of enhancers and gene regulatory networks.md @@ -2,3 +2,4 @@ [Paper](https://www.nature.com/articles/s41592-023-01938-4) +GRN inference \ No newline at end of file diff --git a/Biology/Seminars/Single-cell ATAC and RNA sequencing reveal pre-existing and persistent cells associated with prostate cancer relapse.md b/Biology/Seminars/Single-cell ATAC and RNA sequencing reveal pre-existing and persistent cells associated with prostate cancer relapse.md new file mode 100644 index 0000000..5942da2 --- /dev/null +++ b/Biology/Seminars/Single-cell ATAC and RNA sequencing reveal pre-existing and persistent cells associated with prostate cancer relapse.md @@ -0,0 +1,95 @@ +# scRNA and scATAC for studying prostate cancer drug resistance + +[Paper](https://www.nature.com/articles/s41467-021-25624-1) +[scRNA-seq](scRNA-seq.md) +[scATAC-seq](scATAC-seq.md) +Uses cell line data, not tissue sample +## Introduction + +- **Androgen receptor (AR)**: a type of nuclear receptor that is activated by binding to androgens, which are male hormones such as testosterone and dihydrotestosterone (DHT). Androgens promote prostate cancer cell growth. +- **ENZ**: Enzalutamide is an anti-androgen drug that inhibits AR signaling through several mechanisms such as competitive binding and downstream interference of the signaling pathway. Prolonged use of ENZ make the cancer cells adapt to low androgen environment. + - CRPC (Castration-Resistant Prostate Cancer): cancer cells evolve to be more easily activated, express more copies of AR, or does not bind ENZ but still binds androgens. + - NEPC (Neuroendocrine Prostate Cancer): cancer cells evolve to be no longer depend on androgens to grow. + - NEPC cells lose their reliance on androgen signaling and gain neuroendocrine features, characterized by the expression of neuroendocrine markers such as chromogranin A, synaptophysin, and neuron-specific enolase (NSE). + - They become reliant on neuroendocrine for promoting growth. +- **LNcaP cells**: a prostate cancer cell line that is known for its sensitivity to androgens. It produces prostate-specific antigens that can be used as markers for prostate cancer. +- **Androgen starvation**: reducing the amount of androgen available to cancer cells. Initially growth is suppressed. Resistance develops later. +- [MYC Signaling](MYC%20Signaling.md) +- **Binding site map**: +## Result + +### Chromatin reprogramming underpins ENZ resistance + +- Chromatin accessibility observed in ENZ-resistant cells: + - The ATAC-seq signal at transcription-start sites (TSS) decreased in ENZ-resistant cells compared with the parental + - RES-A and RES-B cells shared a large proportion of ENZ-resistance-specific open-chromatin regions not found in parental LNCaP + - the chromatin of ENZ-resistant cells is more open in the presence of androgens (_p_ < 0.001, _t_-test) (Supplementary Fig. [1e](https://www.nature.com/articles/s41467-021-25624-1#MOESM1)) and in castrate conditions + - In castrate condition, cells evolve to grow without androgens. + - Being resistant to ENZ makes them respond to androgen again? 
+ - _Idea: Compare parental (LNcaP) with resistant cell lines_ +- Clustered **all** cells by their chromatin accessibility profile + - Identified proportions of cells from each group in each cluster + - Unique clusters for RESA and RESB are named **ENZ-induced clusters** + - Unique clusters for DMSO and ENZ48 are named **initial clusters** + - **Persistant clusters**: similar proportions across the groups + - 74% of the cells share an overall similar chromatin-accessibility profile during the development of ENZ resistance + - [Differentially Accessible Regions](scATAC-seq.md#Differentially%20Accessible%20Regions) (DARs) were observed around _MYC_ and _TP53_ in several clusters during the short-term response to enzalutamide +- Neuroendocrine behavior in ENZ resistant cells + - The largest fold changes in chromatin accessibility based on average signal from all cells showed **over representation for neuronal system processes** between the parental (LNCaP or LNCaP–ENZ48) and resistant cells (RES-A or RES-B) + - Accordingly, we found elevated expression of NEPC-derived signatures in RES-A and RES-B cells (particularly _EZH2_, _AURKA_, _STMN1_, _DNMT1_, and _CDC25B_), as well as increased expression of NEPC-downregulated genes in initial clusters + + +| Cluster | Type | +| ------- | ----------- | +| 0 | Persistant | +| 1 | Persistant | +| 2 | Persistant | +| 3 | Persistant | +| 4 | Initial | +| 5 | Persistant | +| 6 | ENZ-induced | +| 7 | ENZ-induced | + + +### ENZ resistance reconfigures availability of TF binding DNA motifs in the chromatin + +After finding chromatin reprogramming happens in ENZ resistant cells, they found the availability of TF binding [motifs](1.%20MIT%20CompBio%20-%20Introduction.md#^6a1795) change in these cells. + +A motif is a sequence of DNA that has a biological function, like binding [TFs](Transcription%20Factor.md). + +Opening the chromatin exposes the TF binding motifs within. Hence, they studied the difference in the binding motifs within the [Open Chromatin Regions](Open%20Chromatin%20Regions.md) among the groups. + +They observed a significant increase in open chromatin at [MYC](MYC%20Signaling.md)-binding sites in ENZ-resistant cells. MYC are a family of TFs that are oncogenic. They also observed a reduction of open chromatin at AR-binding sites in **castrate** conditions, and an increase in open chromatin at AR-binding sites in androgen-exposed conditions. + +These findings align with previous reports that in ENZ resistant cells MYC signaling increases and AR signaling decreases. This could be understood as a compensatory mechanism to adapt to androgen scarcity. + +For each cell cluster based on scATAC profile, a set of marker DARs were identified. Those DARs would open up some TF binding motifs. TF binding motif enrichment analysis confirmed the enrichment of motifs for several PC-associated TFs such as AR and MYC. There are more, read the paper for details. + +Comparing open DARs in RES-A or RES-B to the LNCaP parental retrieved distinct sets of TFs, with _MYC_ and _ESR1_ being the most common across all clusters in RES-A and RES-B. Similarly, comparing open DARs in RES-A or RES-B vs LNCaP–ENZ48 showed enrichment of most of the PC-related TF motifs tested in most clusters +### Transcriptional patterns of ENZ resistance are induced by divergent chromatin reprogramming + +Did scRNA sequencing to verify if chromatin reprogramming leads to transcriptional differences. 
+ +Used [Cluster Label Transfer](Cluster%20Label%20Transfer.md) and different cell lines to verify if results observed in LNCaP cell line is generalizable. + +Clustered LNCaP parental, RESA, RESB, ENZ48 after integrating them. Identified 7 persistant, 3 ENZ-induced, and 3 initial clusters, a total of 13 scRNA clusters. _Idea: just study how cell states shift among three cluster types. Use bioib to do functional difference among the clusters_. + +scRNA seq confirms chromatin reprogramming induces transcriptional changes. Using annotated databases, they queried the **transcriptional targets** of the enriched TFs in the open DARs when comparing RES-A or B to the parental LNCaP in the matching scRNA-seq samples. DAR: differentially accessible regions between RESA or B and parental LNCaP. + +Used cluster label transfer again to transfer label from scRNA clusters to scATAC clusters and vice versa. Found that one chromatin state can correspond to multiple transcriptional state. + +### Prostate cancer cell subpopulations with features of stemness precede ENZ resistance + +Used Seurat to score the clusters for their [Cell Cycle](Cell%20Cycle.md) stage. +- Identified _persistent_ clusters are more actively cycling and proliferating. +- They are also characterized by genes associated with stemness, chromatin remodeling and organization, and DNA repair; considering they are actively cycling, it makes sense. +- This set of genes defined as Persist +Another set of genes that promotes tissue regeneration defined as PROSGenesis. Found PROSGenesis in _initial cluster_ 10 + +Analyzed how cell clusters develop: +- CytoTRACE: ranks developmental potential +- RNA velocity analysis + + + + diff --git a/LCMS/Lipid_slides.md b/LCMS/Lipid_slides.md new file mode 100644 index 0000000..ba892e6 --- /dev/null +++ b/LCMS/Lipid_slides.md @@ -0,0 +1,768 @@ +--- +title: Edits +theme: white +width: 1920 +height: 1080 +--- + + + +## Page 1 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Good afternoon everyone. I am Zhaoyameng from SRICI. Today I will present our work on isotope abundance determination in 13-C labeled intracellular sugar metabolites using hydrophilic interaction liquid chromatography.
  • +
+
+::: + +
+ +--- + +## Page 2 + + + +::: block +**PPT** +
+
    +
  • Background
  • +
  • LC-MS Method Development
  • +
  • Method Application
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • I will first briefly introduce the background of isotope tracing technology. Then I will describe our isotope abundance detection method. Finally, I will give some application examples of our work.
  • +
+
+::: + +
+ +--- + +## Page 3 + +😀 + +--- + +## Page 4 + + + +::: block +**PPT** +
+
    +
  • Manual enrichment -> artificial enrichment
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Isotopes are atoms of the same element with the same number of protons but different numbers of neutrons.
  • +
  • In 1912 (nineteen twelve), natural isotopes of Neon with mass numbers 20 and 22 were identified, marking the first discovery of natural isotopes.
  • +
  • Isotopes can arise naturally or be artificially enriched. They have different utilities in research.
  • +
+
+::: + +
+ +--- + +## Page 5 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Combined with analytical instruments such as mass spectrometers, stable isotope labeled tracers and internal standards find wide applications in fields such as agriculture, biomedical sciences, and geochemistry.
  • +
+
+::: + +
+ +--- + +## Page 6 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • The quality of stable isotope-labeled internal standards is crucial for their applications in clinical mass spectrometry, breath testing, and hyperpolarized PET-CT.
  • +
  • Stable isotope reagents have also been widely applied in proteomics and metabolomics. In recent years, metabolic flux analysis (MFA) using stable isotope-labeled glucose or glutamine has been developing rapidly and has led to publications in many high-impact journals.
  • +
+
+::: + +
+ +--- + +## Page 7 + + + +::: block +**PPT** +
+
    +
  • strain modification -> metabolic engineering
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • MFA has been applied for the study of metabolic mechanisms, biomarker discovery, and metabolic engineering.
  • +
  • The Mass Isotopomer Distribution (MID) calculated from isotope abundance is the critical parameter in MFA. Accurate and precise quantification of metabolite concentrations and their labeling states is essential for correct MFA results.
  • +
+
+::: + +
+ +--- + +## Page 8 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Since 1990, our institute has been developing and applying isotope analysis methods using gas isotope MS, LC-MS, and GC-MS.
  • +
  • In 2021 (==2022?==), we obtained the OE120 high resolution MS, which enables us to analyze organic compounds such as amino acids, organic acids, and other metabolites in complex biological samples.
  • +
+
+::: + +
+ +--- + +## Page 9 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • I will now present our work on quantifying sugar metabolite concentrations and mass isotopomer distributions using high-resolution LC-MS.
  • +
+
+::: + +
+ +--- + +## Page 10 + + + +::: block +**PPT** +
+
    +
  • Add a screenshot of a published paper here?
  • +
  • Sugar Metabolism
  • +
      +
    • Important metabolic pathway
    • +
    • Poor ionization efficiency
    • +
    • Multiple isomers
    • +
    • Lack of tested methods for isotope abundance analysis
    • +
    +
+
+::: + +::: block +**讲稿** +
+
    +
  • Sugar metabolism is crucial for cellular activities and is implicated in many diseases, including cancer. However, detecting and quantifying sugar metabolites has been challenging due to their poor ionization efficiency.
  • +
  • We have developed a method to simultaneously quantify sugar metabolite concentration and mass isotopomer distribution.
  • +
+
+::: + +
+ +--- + +## Page 11 + + + +::: block +**PPT** +
+
    +
  • Quantification of ... -> Metabolite Concentration
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • First, with multiple rounds of experimentation, we optimized MS parameters to ensure high ionization efficiency for sugar metabolites.
  • +
+
+::: + +
+ +--- + +## Page 12 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • We then selected the combination of LC column and buffer salt concentration that gave the best separation.
  • +
+
+::: + +
+ +--- + +## Page 13 + + + +::: block +**PPT** +
+
    +
  • Typo in "mixture"
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • After optimizing the LC-MS settings, we validated our method. With excellent recovery, accuracy, and precision, this method can reliably quantify sugar metabolite concentrations.
  • +
+
+::: + +
+ +--- + +## Page 14 + + + +::: block +**PPT** +
+
    +
  • Detection of ... -> Isotope Abundance
  • +
  • Revise the slide text to match the script
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Fully labeled glucose is a common tracer in MFA studies. In the mass spectrometer, it can have a peak for each isotopomer with a different number of 13C substitutions, such as M0, M1, M2, etc. These are the peaks we are interested in measuring. +
  • However, naturally occurring isotopes of carbon and other atoms in the molecule can form peaks whose masses are very close to those of our peaks of interest.
  • +
  • Using high-resolution MS, we are able to distinguish these peaks and accurately calculate the 13C abundance.
  • +
+
+::: + +
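(Possible addition for this slide: a small worked example of how the mass isotopomer distribution is obtained once the natural-isotope peaks have been resolved. The peak areas below are made-up numbers, not measured data.)

$$\text{MID}_i = \frac{A_{M_i}}{\sum_{j=0}^{n} A_{M_j}}, \qquad \text{e.g. } (A_{M_0}, A_{M_1}, A_{M_2}) = (2,\,3,\,5) \;\Rightarrow\; \text{MID} = (0.2,\,0.3,\,0.5)$$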
+ +--- + +## Page 15 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Our method was validated against a reference material. The result falls within the confidence interval of the RM's certified abundance, showing that the method is highly stable and accurate.
  • +
+
+::: + +
+ +--- + + +## Page 16 + + + +::: block +**PPT** +
+
    +
  • With concentrations > 1 μg/mL and MS intensities > 10^6, isotope abundance calculations stabilized with small fluctuations.
  • +
  • RSD < 0.02% for different scan numbers; no significant difference in calculated abundances found.
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • We then investigated whether factors such as concentration and scan number affect the abundance calculation.
  • +
+
+::: + +
+ +--- + +## Page 17 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • We compared gas isotope MS and HRMS for abundance calculation of a 15N-labeled amino acid. Both methods can reliably quantify the isotope abundance. However, the HRMS method requires a thousand times less sample and is ten times faster.
  • +
+
+::: + +
+ +--- + +## Page 18 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Our method can also be applied to quantify isotope abundances in doubly labeled compounds
  • +
  • Examples include saxitoxin reagents, steroid hormones, and other isotope reagents.
  • +
+
+::: + +
+ +--- + +## Page 19 + + + +::: block +**PPT** +
+
    +
  • based R -> based on R
  • +
  • Step 2: input element type, numbers, labeling mode, and other parameters
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • To facilitate processing of large amounts of data, we have developed an R app that automates isotope abundance calculation. It can calculate the isotope abundance of singly and doubly labeled compounds.
  • +
+
+::: + +
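(Possible addition for this slide: a toy sketch of the kind of calculation the app automates. This is not the actual R app; the numbers and variable names are hypothetical, and the snippet only shows how a mass isotopomer distribution and the average 13C enrichment follow from raw peak areas.)

```python
# Hypothetical peak areas for M0..M3 of a 3-carbon metabolite (made-up numbers)
peak_areas = [0.10, 0.15, 0.30, 0.45]
n_carbons = len(peak_areas) - 1

total = sum(peak_areas)
mid = [a / total for a in peak_areas]                           # mass isotopomer distribution
enrichment = sum(i * f for i, f in enumerate(mid)) / n_carbons  # average 13C atom fraction

print([round(f, 3) for f in mid], round(enrichment, 3))         # [0.1, 0.15, 0.3, 0.45] 0.7
```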
+ +--- + +## Page 20 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • We applied our method to study the biosynthesis of 13C-labeled amino acids in bacteria. We quantified the abundances and MIDs of 7 sugar metabolites, which provides insights into glucose utilization efficiency and ways to improve the yield of our target amino acid in the bacteria.
  • +
+
+::: + +
+ +--- + +## Page 21 + + + +::: block +**PPT** +
+
    +
  • Application of Stable Isotope Technology
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • 😀
  • +
+
+::: + +
+ +--- + +## Page 22 + + + +::: block +**PPT** +
+
    +
  • Typo in "Renal"
  • +
  • PCA: PC1 and PC2 together explain only 37.8% of the variance in this plot, meaning 62.2% of the between-sample variation is not reflected by PC1 and PC2; clustering along the PC1/PC2 axes alone is not sufficient to show that differences within the same group are small
  • +
  • The PLKO points are clearly more scattered; check whether something is wrong with PLKO1 and PLKO2
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Using our method, the concentrations and mass isotopomer distributions of sugar metabolites in kidney cancer cells can be quantified to study abnormal sugar metabolism in cancer.
  • +
+
+::: + +
+ +--- + +## Page 23 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Isotope abundance detection of 15N in PSTs, a class of toxins, could help us trace the accumulation and transformation of PSTs in oysters.
  • +
+
+::: + +
+ +--- + +## Page 24 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • In another study, isotope abundance detection was used in the development of an artificial nitrogen fixation method.
  • +
+
+::: + +
+ +--- + +## Page 25 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Complex proteins in milk could be quantified using labeled peptides as internal standards.
  • +
+
+::: + +
+ +--- + +## Page 26 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Relying on ... -> Backed by ...
  • +
+
+::: + +
+ +--- + +## Page 27 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • analytical -> analysis
  • +
+
+::: + +
+ +--- + +## Page 28 + + + +::: block +**PPT** +
+
    +
  • Reagents products -> Reagents
  • +
  • Biomedical Research
  • +
  • Food & Environmental Analysis
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • We also offer more than 1000 reagents, including 13C-labeled steroid hormones and amino acids, which are commonly used in metabolomics and proteomics. +
+
+::: + +
+ +--- + +## Page 29 + + + +::: block +**PPT** +
+
    +
  • 😀
  • +
+
+::: + +::: block +**讲稿** +
+
    +
  • Lastly, I would like to extend an invitation to all of you to attend the 3rd Symposium on Isotope Technology that will be held in Shanghai this November.
  • +
+
+::: + +
+ +--- \ No newline at end of file diff --git a/Machine Learning/Concepts/Bagging Ridge Regression.md b/Machine Learning/Concepts/Bagging Ridge Regression.md new file mode 100644 index 0000000..7cbbf8b --- /dev/null +++ b/Machine Learning/Concepts/Bagging Ridge Regression.md @@ -0,0 +1,40 @@ +# Bagging Ridge Regression + +Bagging Ridge Regression is an ensemble learning method that combines the predictions of multiple ridge regression models to improve the overall performance. "Bagging" stands for Bootstrap Aggregating, and it aims to reduce variance and prevent [Overfitting](Overfitting.md). + +Bagging Ridge Regression Enhances the performance of [ridge regression](L2%20Regularization.md) by leveraging the power of ensemble methods, specifically by reducing variance and improving model stability. + +## Key Concepts: + +1. **Bootstrap Sampling:** + - Multiple subsets of the training data are created by randomly sampling with replacement. Each subset is called a bootstrap sample. + +2. **Ridge Regression Models:** + - A separate ridge regression model is trained on each bootstrap sample. This results in multiple models that capture different aspects of the data. + +3. **Aggregation:** + - The predictions from all the ridge regression models are aggregated (usually by averaging) to produce the final prediction. This aggregation helps to reduce the variance of the predictions and improves robustness. + +## Benefits: + +- **Variance Reduction:** By averaging the predictions of multiple models, bagging reduces the variance and helps in creating a more stable and robust model. +- **Improved Performance:** Bagging often improves predictive performance, especially in cases where individual models have high variance. + +#### Implementation: + +In Python's scikit-learn library, Bagging Ridge Regression can be implemented using the `BaggingRegressor` with `Ridge` as the base estimator: + +```python +from sklearn.ensemble import BaggingRegressor +from sklearn.linear_model import Ridge + +# Create a Ridge regression model +ridge_model = Ridge(alpha=1.0) + +# Create a Bagging regressor with Ridge as the base estimator +bagging_model = BaggingRegressor(base_estimator=ridge_model, n_estimators=10, random_state=0) +bagging_model.fit(X_train, y_train) + +# Predict using the bagging model +y_pred = bagging_model.predict(X_test) +``` diff --git a/Machine Learning/Concepts/Bayesian Ridge Regression.md b/Machine Learning/Concepts/Bayesian Ridge Regression.md new file mode 100644 index 0000000..9d09e85 --- /dev/null +++ b/Machine Learning/Concepts/Bayesian Ridge Regression.md @@ -0,0 +1,140 @@ +# Bayesian Ridge Regression + +Bayesian Ridge Regression is a probabilistic approach to linear regression. Unlike [standard ridge regression](L2%20Regularization.md), which penalizes large weights with a fixed regularization parameter, Bayesian Ridge Regression introduces a probabilistic framework that allows for uncertainty in the model parameters. + +See also [The Bayesian View](MLE%20vs.%20Bayesian%20Parameter%20Estimation.md#The%20Bayesian%20View) +## Key Concepts: + +1. **Probabilistic Model:** + - Bayesian Ridge Regression models the weights as random variables with a prior distribution. Typically, a _Gaussian prior_ is used for the weights, which introduces regularization. + - The model also includes a likelihood function that describes the probability of the observed data given the weights. + +2. 
**Posterior Distribution:** + - By combining the prior distribution and the likelihood function using Bayes' theorem, the posterior distribution of the weights is obtained. This posterior distribution captures the uncertainty about the weights given the observed data. + +3. **Hyperparameters:** + - Bayesian Ridge Regression includes hyperparameters for the precision (inverse of the variance) of the weights and the noise in the data. These hyperparameters can be estimated from the data using techniques such as empirical Bayes or Markov Chain Monte Carlo (MCMC). + +4. **Regularization:** + - The regularization effect in Bayesian Ridge Regression arises naturally from the prior distribution on the weights. This regularization helps prevent [Overfitting](Overfitting.md) and improves generalization. + +## Prior, Likelihood function, posterior + +In Bayesian Ridge Regression, using a Gaussian (normal) prior for the regression coefficients introduces regularization by incorporating _prior knowledge_ about the likely values of these coefficients. This approach effectively penalizes large coefficients, similar to how L2 regularization works in traditional ridge regression. + +### Key Concepts + +#### 1. **Gaussian Prior:** + - The prior distribution for the regression coefficients $\mathbf{w}$ is assumed to be a Gaussian distribution centered at zero with a certain variance. + - Mathematically, this can be expressed as: + $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})$ + where $\tau^2$ is the variance (precision $\lambda = \frac{1}{\tau^2}$) and $\mathbf{I}$ is the identity matrix. + +#### 2. **Likelihood Function:** + - The likelihood of the observed data given the coefficients is also modeled as a Gaussian distribution: + $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \alpha) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \alpha^{-1} \mathbf{I})$ + where $\alpha$ is the precision (inverse of the variance) of the **noise**. + +#### 3. **Posterior Distribution:** + - The posterior distribution of the coefficients $\mathbf{w}$ given the data is obtained using Bayes' theorem: + $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}, \alpha, \tau) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \alpha) \cdot p(\mathbf{w} \mid \tau)$ + - This posterior combines the likelihood of the data and the prior, balancing the fit to the data and the prior belief about the coefficients. + +### Regularization Effect + +The Gaussian prior acts as a form of regularization in the following ways: + +1. **Penalizing Large Coefficients:** + - The Gaussian prior $\mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})$ has its peak at zero, meaning it assigns higher probabilities to smaller values of $\mathbf{w}$ and lower probabilities to larger values. This preference for smaller coefficients acts similarly to an L2 penalty in ridge regression. + - In the Bayesian framework, this prior encourages the posterior distribution of the coefficients to be centered around zero unless the data strongly suggests otherwise. + +2. **Regularization Term:** + - When deriving the posterior, the negative log of the prior distribution $-\log p(\mathbf{w} \mid \tau)$introduces an additional term to the loss function, akin to the regularization term in ridge regression: + $-\log p(\mathbf{w} \mid \tau) = \frac{1}{2\tau^2} \|\mathbf{w}\|^2 + \text{constant}$ + - This term directly penalizes the magnitude of the coefficients $\mathbf{w}$. + +3. 
**Balancing Data and Prior:** + - The posterior distribution represents a balance between the likelihood (fit to the data) and the prior (regularization). If the data strongly supports certain coefficient values, the posterior will reflect that. Otherwise, the prior will dominate, effectively shrinking the coefficients toward zero. + +### Mathematical Derivation + +To see this more formally, consider the Bayesian Ridge Regression loss function, which combines the likelihood of the data and the Gaussian prior: + +$\mathcal{L}(\mathbf{w}) = \frac{\alpha}{2} \sum_{i=1}^n (y_i - \mathbf{w} \cdot \mathbf{x}_i)^2 + \frac{1}{2\tau^2} \sum_{j=1}^p w_j^2$ + +Here: +- The first term is the data fit term (scaled by precision of the noise $\alpha$). +- The second term is the regularization term (scaled by precision $1/\tau^2$). + +Minimizing this loss function results in estimates of $\mathbf{w}$that balance fitting the data well while keeping the coefficients small, effectively achieving regularization. + +## Relationship to L2 Regularization + +Bayesian Ridge Regression is related to L2 regularization, but the terminology and conceptual framework are different. While L2 regularization explicitly uses the L2 norm in its penalty term, Bayesian Ridge Regression incorporates regularization through a probabilistic approach using Gaussian priors. +### Bayesian Ridge Regression + +Bayesian Ridge Regression uses a Bayesian framework to introduce regularization. In this approach, the coefficients $\mathbf{w}$ are treated as random variables with a Gaussian prior: + +$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})$ + +This prior distribution assumes that the coefficients are normally distributed with mean zero and variance $\tau^2$, which introduces a preference for smaller coefficients. The regularization effect comes from the combination of the prior and the likelihood of the data. + +### Relationship and Terminology + +- **Implicit L2 Regularization:** Although Bayesian Ridge Regression does not explicitly add an L2 norm penalty term to the loss function, the Gaussian prior $\mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})$ effectively imposes a similar constraint. The quadratic form of the Gaussian prior leads to a penalty on the sum of the squares of the coefficients, akin to the L2 norm. +- **Different Terminology:** In Bayesian Ridge Regression, the regularization is described in terms of prior distributions and posterior inference rather than as a direct penalty term. Thus, while it achieves similar regularization effects, it is not typically referred to as "L2 regularization." Instead, it is described in terms of Bayesian priors and posterior distributions. + +## Implementation: + +In Python's scikit-learn library, Bayesian Ridge Regression can be implemented as follows: + +```python +from sklearn.linear_model import BayesianRidge + +# Create a Bayesian Ridge regression model +model = BayesianRidge() +model.fit(X_train, y_train) + +# Predict using the model +y_pred = model.predict(X_test) +``` + +## Non-informative Prior + +In Bayesian Ridge Regression, a non-informative prior distribution refers to a prior distribution that exerts minimal influence on the posterior distribution, effectively allowing the data to dominate the inference process. This is particularly useful when there is little prior knowledge about the parameters. Here’s a detailed explanation: + +### Non-Informative Priors + +1. 
**Purpose**: + Non-informative priors are used when we want to avoid biasing the results with prior assumptions and instead rely primarily on the data to determine the posterior distribution. + +2. **Forms of Non-Informative Priors**: + - **Uniform Prior**: One common type of non-informative prior is a uniform prior, which assigns equal probability to all values within a certain range. For instance, a uniform prior over the coefficients would imply that any value of the coefficient is equally likely before observing the data. + - **Jeffreys Prior**: Another type of non-informative prior is the Jeffreys prior, which is invariant under reparameterization and often used for scale parameters. For a parameter $\theta$, the Jeffreys prior is proportional to the square root of the Fisher information. + +### Bayesian Ridge Regression with Non-Informative Priors + +In Bayesian Ridge Regression, we model the linear relationship between the predictors $X$ and the response $y$ with a Gaussian likelihood, and we place priors on the regression coefficients $\beta$ and the noise variance $\sigma^2$. + +1. **Model Specification**: + $y \sim \mathcal{N}(X\beta, \sigma^2I)$ + $\beta \sim \mathcal{N}(0, \tau^2I)$ + $\sigma^2 \sim \text{Inverse-Gamma}(a, b)$ + +2. **Non-Informative Priors on $\beta$**: + - A common choice for a non-informative prior on $\beta$ is a Gaussian prior with a very large variance, effectively making it close to a uniform distribution over a wide range. + - $\beta \sim \mathcal{N}(0, \tau^2I)$ + Here, $\tau^2$ is set to a very large value, making the prior almost flat. + +3. **Non-Informative Priors on $\sigma^2$**: + - For the noise variance $\sigma^2$, a common non-informative prior is an Inverse-Gamma distribution with very small shape and scale parameters (close to zero), which spreads the probability over a wide range of values. + - $\sigma^2 \sim \text{Inverse-Gamma}(\epsilon, \epsilon)$, with $\epsilon \rightarrow 0$. + +### Implications of Non-Informative Priors + +- **Data Dominance**: With non-informative priors, the posterior distribution is primarily influenced by the likelihood derived from the data, meaning that the resulting inference and parameter estimates are largely driven by the observed data rather than any strong prior beliefs. +- **Parameter Uncertainty**: Non-informative priors reflect high uncertainty about the parameter values before observing the data, making them a cautious choice when prior information is scarce or unreliable. + +### Conclusion + +Non-informative prior distributions in Bayesian Ridge Regression are designed to have minimal influence on the posterior, allowing the data to play the dominant role in determining the model parameters. This approach is particularly useful when there is little or no prior knowledge about the parameters, ensuring that the inference is primarily data-driven. \ No newline at end of file diff --git a/Machine Learning/Concepts/Cluster Label Transfer.md b/Machine Learning/Concepts/Cluster Label Transfer.md new file mode 100644 index 0000000..7d11871 --- /dev/null +++ b/Machine Learning/Concepts/Cluster Label Transfer.md @@ -0,0 +1,125 @@ +# Cluster Label Transfer + +**Cluster-label transfer** is a computational technique used in [scRNA-seq](scRNA-seq.md) and [scATAC-seq](scATAC-seq.md) studies to map and compare cell clusters identified in one dataset to another dataset. 
+ +It involves transferring the labels (or identities) of clusters from a reference dataset (where cell types or states have already been annotated) to a target dataset (where these annotations are not yet known). + +## Purpose of Cluster-Label Transfer + +1. **Identification of Similar Cell Populations:** + - Cluster-label transfer helps in identifying and annotating similar cell populations across different datasets, conditions, or experiments. It allows researchers to recognize if a particular cell state or cluster observed in one dataset is present in another. + +2. **Consistency and Validation:** + - By comparing clusters between datasets, researchers can validate the presence of specific cell states and ensure that findings are consistent across different experimental conditions or biological replicates. + +3. **Characterization of Cell States:** + - This technique helps in characterizing new or unknown cell states by leveraging information from well-annotated reference datasets. It provides insights into the functions and characteristics of cell clusters in the target dataset. + +## Steps + +1. **Pre-processing:** + - **Data Normalization:** Both the reference and target single-cell datasets are normalized to ensure comparability. + - **Feature Selection:** Highly variable genes or features are selected for downstream analysis to capture the most informative aspects of the data. + +2. **Dimensionality Reduction:** + - Techniques such as Principal Component Analysis (PCA) are applied to reduce the dimensionality of the data while retaining the most important features. + - This step is performed on both the reference and target datasets to facilitate alignment. + +3. **Integration and Alignment:** + - **Shared Space Mapping:** Methods like Canonical Correlation Analysis (CCA) or mutual nearest neighbors (MNN) are used to align the reference and target datasets in a shared low-dimensional space. + - **Batch Effect Correction:** Techniques such as Harmony or Seurat’s integration are employed to correct for batch effects and technical differences between datasets. + +4. **Cluster Identification in Reference Dataset:** + - Clustering algorithms (e.g., Louvain or Leiden) are used to identify distinct cell clusters in the reference dataset. + - These clusters are annotated based on known cell types, states, or conditions. + +5. **Label Transfer:** + - **Nearest Neighbor Search:** For each cell in the target dataset, the nearest neighbors in the reference dataset are identified based on their positions in the shared low-dimensional space. + - **Label Assignment:** The cluster labels from the nearest neighbors in the reference dataset are transferred to the corresponding cells in the target dataset. Majority voting or probabilistic methods are used to assign the most likely label to each cell. + +6. **Validation and Fine-Tuning:** + - The transferred labels are validated by comparing the expression patterns of known marker genes in both the reference and target datasets. + - Fine-tuning may involve re-clustering the target dataset with the transferred labels as initial conditions to refine the cluster assignments. + +## Tools and Algorithms + +Several computational tools and algorithms are commonly used for cluster label transfer: + +1. **Seurat:** + - The Seurat package in R includes functions for data normalization, dimensionality reduction, integration, and label transfer. The `FindTransferAnchors` and `TransferData` functions facilitate label transfer between datasets. + +2. 
**Scanpy:** + - The Scanpy package in Python also supports integration and label transfer. Functions like `scanpy.pp.neighbors` and `scanpy.tl.ingest` are used for these purposes. + +3. **Liger:** + - Liger (Linked Inference of Genomic Experimental Relationships) is another tool that integrates single-cell datasets and supports label transfer. + +## Why the Authors Used Cluster Label Transfer + +In the context of the [study](Single-cell%20ATAC%20and%20RNA%20sequencing%20reveal%20pre-existing%20and%20persistent%20cells%20associated%20with%20prostate%20cancer%20relapse.md) on enzalutamide (ENZ) resistance in prostate cancer: + +1. **Identification of Consistent Cell States:** + - The authors aimed to identify and validate specific cell clusters associated with ENZ resistance across different datasets. By transferring cluster labels from a well-annotated reference dataset to other datasets, they ensured that the identified cell states were consistently observed. + +2. **Validation Across Models:** + - Using cluster label transfer allowed the authors to validate that the ENZ-resistant cell states observed in one model (e.g., LNCaP-derived RES-A and RES-B cells) were also present in other independent models and conditions (e.g., LNCaP-ENZ168 and RES-C cells). + +3. **Robustness of Findings:** + - By confirming the presence of resistant cell clusters in multiple datasets, the authors strengthened the robustness and generalizability of their findings. This approach provided confidence that the observed mechanisms of resistance were not artifacts of a single experimental setup. + +### Methods used + + - **Integrated Clustering:** The authors first performed integrated clustering of scRNA-seq data from different samples (e.g., LNCaP parental, ENZ-resistant models RES-A, and RES-B) to identify distinct cell clusters. + - **Label Transfer:** They then transferred these cluster labels to independent scRNA-seq datasets, such as LNCaP ENZ-treated for one week (LNCaP-ENZ168) and an independent ENZ-resistant LNCaP-derived cell line (RES-C). + +### Findings + + - The label-transfer process confirmed the presence of initial, persistent, and ENZ-induced clusters in independent datasets. This demonstrated that specific cell states associated with ENZ resistance were consistently observed across different models, reinforcing the robustness of their findings. + +### Implications + + - **Validation of Resistant States:** The consistent identification of resistant cell states across different datasets validated the relevance of these states in ENZ resistance. + - **Generalizability:** The findings suggested that the mechanisms driving ENZ resistance were not restricted to a single cell line or experimental condition but were generalizable to other prostate cancer models. + - **Therapeutic Insights:** Understanding these consistent cell states can provide insights into potential therapeutic targets and strategies to overcome resistance in prostate cancer. + +## Why didn't they do it for sc-ATAC data + +Cluster label transfer can be more **challenging** for single-cell ATAC-seq (scATAC-seq) data compared to single-cell RNA-seq (scRNA-seq) data due to several reasons: + +1. **Data Characteristics:** + - **Sparse Data:** scATAC-seq data is inherently [Sparse](Sparsity.md) because only a small fraction of the genome is accessible and captured in each cell. This sparsity makes it difficult to identify robust clusters and transfer labels reliably. 
+ - **Different Signal Types:** scATAC-seq measures chromatin accessibility, which is a different type of signal compared to gene expression measured by scRNA-seq. The regions of open chromatin are fewer and more variable, complicating direct label transfer. + +2. **Lack of Established Reference Maps:** + - **Reference Datasets:** There are fewer well-annotated reference datasets for scATAC-seq compared to scRNA-seq. Without robust reference maps, it becomes challenging to perform reliable label transfer. + - **Complexity of Chromatin States:** The chromatin landscape is highly dynamic and context-dependent, which means that chromatin accessibility profiles can vary widely between different cell types and conditions. + +3. **Computational Tools and Methods:** + - **Tool Availability:** While there are many tools and methods available for integrating and transferring labels in scRNA-seq data, the computational methods for scATAC-seq are still developing. Tools like Seurat and Scanpy are primarily designed for scRNA-seq data. + - **Integration Challenges:** Integrating scATAC-seq data with scRNA-seq data or other scATAC-seq datasets requires sophisticated methods to align different data modalities, which can be computationally intensive and complex. + +### Alternative Approaches Used + +Instead of cluster label transfer, the authors might have used other methods to analyze and validate their scATAC-seq data: + +1. **Independent Clustering:** + - The authors likely performed independent clustering of the scATAC-seq data to identify distinct chromatin accessibility profiles within the dataset. + - This approach allows them to characterize the unique features of chromatin accessibility without relying on external labels. + +2. **Integration with scRNA-seq Data:** + - The authors might have integrated scATAC-seq data with scRNA-seq data to correlate chromatin accessibility with gene expression profiles. Techniques like joint embedding or linked analysis can help map the relationships between chromatin states and transcriptional programs. + - This integration can provide insights into how changes in chromatin accessibility influence gene expression and contribute to drug resistance. + +3. **Functional Validation:** + - The authors could have validated their findings through functional assays, such as identifying differentially accessible regions (DARs) and linking them to gene regulatory elements. + - By performing functional validation, they can ensure that the observed chromatin accessibility changes have biological significance and are relevant to the study's goals. + +### Specific Context in the Study + +1. **Focus on scRNA-seq for Label Transfer:** + - In the study on enzalutamide resistance, the authors might have focused on scRNA-seq data for cluster label transfer because gene expression profiles are more directly interpretable and easier to compare across different conditions. + - The scRNA-seq data provides a more straightforward way to identify and validate cell states associated with drug resistance. + +2. **Characterizing Chromatin Reprogramming:** + - For the scATAC-seq data, the primary goal might have been to characterize the chromatin reprogramming events underlying resistance. This involves identifying regions of differential accessibility and linking them to transcription factor binding sites and regulatory elements. + - The authors might have used scATAC-seq data to provide a complementary layer of information that supports the findings from scRNA-seq analysis. 
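+
+## Example: Label Transfer with Scanpy
+
+A minimal sketch of scRNA-seq label transfer using Scanpy's `ingest`, assuming `adata_ref` is a reference `AnnData` with annotated clusters in `adata_ref.obs["cluster"]` and `adata` is the query dataset; the variable names and the `"cluster"` key are placeholders, not the study's actual objects.
+
+```python
+import scanpy as sc
+
+# Keep only genes present in both datasets so the embeddings are comparable
+shared_genes = adata_ref.var_names.intersection(adata.var_names)
+adata_ref = adata_ref[:, shared_genes].copy()
+adata = adata[:, shared_genes].copy()
+
+# Build the reference embedding that ingest maps the query onto
+sc.pp.pca(adata_ref)
+sc.pp.neighbors(adata_ref)
+sc.tl.umap(adata_ref)
+
+# Project query cells into the reference space and copy over the cluster labels
+sc.tl.ingest(adata, adata_ref, obs="cluster")
+print(adata.obs["cluster"].value_counts())
+```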
diff --git a/Machine Learning/Concepts/Confusion Matrix.md b/Machine Learning/Concepts/Confusion Matrix.md new file mode 100644 index 0000000..0f55759 --- /dev/null +++ b/Machine Learning/Concepts/Confusion Matrix.md @@ -0,0 +1,34 @@ +# Confusion Matrix + +## Confusion Matrix + +| | Real Positive | Real Negative | +| ------------------ | ------------------- | ------------------- | +| Predicted Positive | True Positive (TP) | False Positive (FP) | +| Predicted Negative | False Negative (FN) | True Negative (TN) | + +## True Positive Rate (TPR) / Sensitivity / Recall + +- Measures the proportion of actual positives correctly identified. +- $\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}$ +- **Interpretation**: Measures the ability to correctly label actual positives as positives, indicating the model's ability to detect actual positives. + +## False Positive Rate (FPR) + +- Measures the proportion of actual negatives incorrectly identified as positives. +- $\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}$ +- **Interpretation**: Measures the rate at which false alarms occur. A lower FPR indicates that fewer incorrect positive predictions are made. + +## True Negative Rate (TNR) / Specificity + +- Measures the proportion of actual negatives correctly identified. +- $\text{TNR} = \frac{\text{TN}}{\text{TN} + \text{FP}}$ +- **Interpretation**: Measures the ability of the test to correctly identify negatives. Higher TNR (Specificity) means fewer false positives. + +## Positive Predictive Value (PPV) / Precision + +- Measures the proportion of positive predictions that are true positives. +- $\text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}$ +- **Interpretation**: Measures the proportion of positive predictions that are actually correct. Higher PPV means that when the test predicts a positive, it is more likely to be correct. + +These metrics provide a comprehensive understanding of the performance of a binary classifier by evaluating both its ability to correctly identify positive cases and its ability to correctly reject negative cases. diff --git a/Machine Learning/Concepts/Gamma Distribution.md b/Machine Learning/Concepts/Gamma Distribution.md new file mode 100644 index 0000000..e69de29 diff --git a/Machine Learning/Concepts/Gamma Function.md b/Machine Learning/Concepts/Gamma Function.md new file mode 100644 index 0000000..e69de29 diff --git a/Machine Learning/Concepts/Gaussian Smoothing.md b/Machine Learning/Concepts/Gaussian Smoothing.md new file mode 100644 index 0000000..3f92344 --- /dev/null +++ b/Machine Learning/Concepts/Gaussian Smoothing.md @@ -0,0 +1,63 @@ +# Gaussian Kernel Smoothing + +A method for image processing. Also used for other tasks. + +Gaussian kernel smoothing, also known as Gaussian smoothing or Gaussian blurring, is a technique used to smooth data by _averaging nearby points with weights given by a Gaussian (normal) distribution_. This method is commonly used in image processing, signal processing, and various statistical applications to reduce noise and make patterns in the data more apparent. + +### How Gaussian Kernel Smoothing Works + +1. **Gaussian Kernel**: + - A Gaussian kernel is defined by the Gaussian function: + $$ + G(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} + $$ + where $x$ is the distance from the center of the kernel, and $\sigma$ (sigma) is the standard deviation of the Gaussian distribution. The standard deviation controls the width of the Gaussian kernel. + +2. 
**Convolution**: + - Gaussian smoothing is performed by convolving the input data with the Gaussian kernel. Convolution involves sliding the kernel over the data and computing the weighted average of the data points within the kernel's window. + - For a one-dimensional signal $f(x)$, the smoothed value at point $x$ is given by: + $$ + (f * G)(x) = \int_{-\infty}^{\infty} f(t) G(x - t) \, dt + $$ + - For discrete data (e.g., pixels in an image or samples in a time series), the convolution is typically performed using a finite kernel and a summation: + $$ + (f * G)[i] = \sum_{j=-k}^{k} f[i-j] G[j] + $$ + where $k$ is the half-width of the kernel (usually chosen based on $\sigma$). + +### Applications of Gaussian Kernel Smoothing + +1. **Image Processing**: + - **Blurring**: Gaussian smoothing is often used to blur images, which helps to reduce noise and detail. This is useful in applications like edge detection, where noise reduction can improve the accuracy of edge detection algorithms. + - **Preprocessing**: Before applying more complex image processing algorithms, Gaussian smoothing can be used as a preprocessing step to remove high-frequency noise. + +2. **Signal Processing**: + - **Noise Reduction**: In time series data or signals, Gaussian smoothing helps to reduce noise and make underlying patterns more visible. + - **Feature Extraction**: Smoothing can be used to extract relevant features from noisy signals, improving the performance of subsequent analysis steps. + +3. **Statistics and Data Analysis**: + - **Density Estimation**: Gaussian kernel smoothing is used in kernel density estimation (KDE) to estimate the probability density function of a random variable. + - **Trend Analysis**: In time series analysis, Gaussian smoothing can help identify underlying trends by filtering out short-term fluctuations. + +### Example + +Suppose you have a one-dimensional signal (e.g., a time series) represented by the vector $\mathbf{f}$ and you want to smooth it using a Gaussian kernel with a standard deviation $\sigma$. + +1. **Create the Gaussian Kernel**: + - Choose a suitable kernel size (typically, a few standard deviations wide, e.g., 3 times $\sigma$). + - Compute the Gaussian weights: + $$ + G[i] = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{i^2}{2\sigma^2}} + $$ + where $i$ ranges from \(-k\) to $k$, and $k$ is the half-width of the kernel. + +2. **Convolve the Signal with the Kernel**: + - For each point in the signal, compute the weighted average using the Gaussian weights: + $$ + \text{smoothed}[i] = \sum_{j=-k}^{k} \mathbf{f}[i-j] G[j] + $$ + - Handle the boundaries by either extending the signal (e.g., mirroring, padding) or using a smaller kernel near the edges. + +### Conclusion + +Gaussian kernel smoothing is a powerful technique for reducing noise and highlighting patterns in data. By applying a Gaussian-weighted average, this method effectively smooths data while preserving important features, making it widely used in image and signal processing, as well as statistical analysis. \ No newline at end of file diff --git a/Machine Learning/Concepts/K-nearest Neighbor Imputation.md b/Machine Learning/Concepts/K-nearest Neighbor Imputation.md new file mode 100644 index 0000000..63bdc5a --- /dev/null +++ b/Machine Learning/Concepts/K-nearest Neighbor Imputation.md @@ -0,0 +1,58 @@ +# KNN Imputation + +K-Nearest Neighbor (KNN) imputation is a technique used to fill in missing values in a dataset by leveraging the similarity between instances. 
The basic idea is to find the 'k' closest instances (neighbors) to the instance with the missing value and use these neighbors to estimate the missing value. Here’s a step-by-step explanation of how KNN imputation works: + +1. **Identify Missing Values**: Determine which data points in the dataset have missing values. + +2. **Choose a Distance Metric**: Select a distance metric to measure the similarity between instances. Common choices include Euclidean distance, Manhattan distance, and cosine similarity. The choice of metric can influence the imputation results. + +3. **Find K Nearest Neighbors**: For each instance with a missing value, find the 'k' nearest neighbors that have complete data (i.e., no missing values for the relevant attributes). + +4. **Compute Imputation Value**: + - **Numerical Data**: For numerical attributes, the missing value is often imputed by taking the mean, median, or a weighted average of the corresponding attribute values from the 'k' nearest neighbors. The weighted average might use the inverse of the distance as weights. + - **Categorical Data**: For categorical attributes, the missing value can be imputed by taking the mode (most frequent value) of the corresponding attribute values from the 'k' nearest neighbors. + +5. **Replace Missing Values**: Substitute the missing value with the imputed value derived from the 'k' nearest neighbors. + +### Example + +Suppose we have a dataset with three features \( F1, F2, \) and \( F3 \), and we want to impute the missing value in \( F3 \) for a particular instance: + +| F1 | F2 | F3 | +|-----|-----|-----| +| 5.1 | 3.5 | 1.4 | +| 4.9 | 3.0 | 1.4 | +| 4.7 | 3.2 | 1.3 | +| 4.6 | 3.1 | NaN | +| 5.0 | 3.6 | 1.4 | + +1. **Identify Missing Value**: The missing value in the fourth instance for \( F3 \). + +2. **Choose Distance Metric**: Use Euclidean distance. We only use \( F1 \) and \( F2 \) for calculating distance since \( F3 \) has the missing value. + +3. **Find K Nearest Neighbors**: Suppose \( k=3 \). Compute distances from the fourth instance to all other instances: + - Distance to (5.1, 3.5): \(\sqrt{(5.1-4.6)^2 + (3.5-3.1)^2} = 0.64\) + - Distance to (4.9, 3.0): \(\sqrt{(4.9-4.6)^2 + (3.0-3.1)^2} = 0.31\) + - Distance to (4.7, 3.2): \(\sqrt{(4.7-4.6)^2 + (3.2-3.1)^2} = 0.14\) + - Distance to (5.0, 3.6): \(\sqrt{(5.0-4.6)^2 + (3.6-3.1)^2} = 0.64\) + +4. **Select the Nearest Neighbors**: The three nearest neighbors are (4.7, 3.2), (4.9, 3.0), and either (5.1, 3.5) or (5.0, 3.6). + +5. **Compute Imputation Value**: Take the mean of \( F3 \) values for these nearest neighbors: + - Neighbors' \( F3 \) values: 1.3, 1.4, 1.4 + - Mean: \( (1.3 + 1.4 + 1.4)/3 = 1.37 \) + +6. **Replace Missing Value**: Impute \( F3 \) in the fourth instance with 1.37. + +### Advantages and Disadvantages + +**Advantages:** +- Simple to understand and implement. +- Works well when data has a strong local structure. + +**Disadvantages:** +- Computationally intensive for large datasets. +- Requires a careful choice of 'k' and distance metric. +- Assumes that the distance metric used is appropriate for the data's distribution and relationships. + +KNN imputation is a powerful method when used correctly, especially in cases where the local patterns in the data are informative. 
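+
+### Example with scikit-learn
+
+A short sketch of the same imputation using scikit-learn's `KNNImputer` on the toy table above; with `n_neighbors=3` and the default uniform weighting it fills the missing F3 value with roughly the 1.37 computed by hand (more precisely, 1.3667).
+
+```python
+import numpy as np
+from sklearn.impute import KNNImputer
+
+X = np.array([
+    [5.1, 3.5, 1.4],
+    [4.9, 3.0, 1.4],
+    [4.7, 3.2, 1.3],
+    [4.6, 3.1, np.nan],  # instance with the missing F3 value
+    [5.0, 3.6, 1.4],
+])
+
+# Distances are computed on the non-missing features (here F1 and F2)
+imputer = KNNImputer(n_neighbors=3)
+X_imputed = imputer.fit_transform(X)
+print(X_imputed[3, 2])  # ~1.37
+```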
\ No newline at end of file
diff --git a/Machine Learning/Concepts/Kronecker Delta.md b/Machine Learning/Concepts/Kronecker Delta.md
new file mode 100644
index 0000000..e69de29
diff --git a/Machine Learning/Concepts/L1 Regularization.md b/Machine Learning/Concepts/L1 Regularization.md
new file mode 100644
index 0000000..c7dcd7f
--- /dev/null
+++ b/Machine Learning/Concepts/L1 Regularization.md
@@ -0,0 +1,9 @@
+# L1 Regularization
+
+Note that the [L2](L2%20Regularization.md) penalty shrinks coefficients towards zero but never to absolute zero; although model feature weights may become negligibly small, they never equal zero in ridge regression. Reducing a coefficient to zero effectively removes the paired predictor from the model. This is called feature selection, which is another means of correcting multicollinearity.[8](https://www.ibm.com/topics/ridge-regression#f8) Because ridge regression does not reduce regression coefficients to zero, it does not perform feature selection.[9](https://www.ibm.com/topics/ridge-regression#f9) This is often cited as a disadvantage of ridge regression. Moreover, another oft-cited disadvantage is ridge regression's inability to separate predictor effects in the face of severe multicollinearity.[10](https://www.ibm.com/topics/ridge-regression#f10)
+
+Lasso regression, also called L1 regularization, is one of several other regularization methods in linear regression. L1 regularization works by reducing coefficients to zero, essentially eliminating those independent variables from the model. Both lasso regression and ridge regression thus reduce model complexity, albeit by different means. Lasso regression reduces the number of independent variables affecting the output. Ridge regression reduces the weight each independent variable has on the output.
+
+Elastic net is an additional form of regularization. Whereas ridge regression obtains its regularization parameter from the sum of squared errors and lasso obtains its own from the sum of the absolute value of errors, elastic net incorporates both regularization parameters into the [RSS](Residual%20Sum%20of%20Squares.md) cost function.[11](https://www.ibm.com/topics/ridge-regression#f11)
+
+Principal component regression (PCR) can also act as a regularizing procedure. While PCR can resolve multicollinearity, it does not do so by enforcing a penalty on the RSS function as in ridge and lasso regression. Rather, PCR produces linear combinations of correlated predictors from which to create a new least squares model.
\ No newline at end of file
diff --git a/Machine Learning/Concepts/L2 Regularization.md b/Machine Learning/Concepts/L2 Regularization.md
new file mode 100644
index 0000000..f7bcc2d
--- /dev/null
+++ b/Machine Learning/Concepts/L2 Regularization.md
@@ -0,0 +1,266 @@
+# L2 Regularization
+
+[IBM post](https://www.ibm.com/topics/ridge-regression)
+
+L2 regularization, also called ridge regularization, Tikhonov regularization, or ridge regression, is an ML technique to prevent [Overfitting](Overfitting.md) by adding a penalty term to the loss function. The term is proportional to the sum of the squared values of the model's weights, thus penalizing **large weights**.
+
+Ridge regression is also applied to multiple regression data that suffer from multicollinearity. When independent variables are highly correlated, the least squares estimates remain _unbiased_, but their variances are _large_, which means that they may be far from the true value.
Ridge regression addresses this issue by imposing a penalty on the size of the coefficients. + +## General Form + +The ridge regression estimate is defined as the minimization of the following objective function: + +$\min_{\beta} \left( \| y - X\beta \|^2_2 + \lambda \| \beta \|^2_2 \right)$ + +Where: +- $y$ is the response variable. +- $X$ is the matrix of predictors. +- $\beta$ is the vector of coefficients. +- $\lambda$ is the regularization parameter. +- $\| y - X\beta \|^2_2$ is the residual sum of squares. +- $\| \beta \|^2_2$ is the sum of squares of the coefficients (also known as the L2 norm). + +The key idea is to add a penalty term, $\lambda \| \beta \|^2_2$, to the ordinary least squares (OLS) objective. The parameter $\lambda$ controls the strength of the penalty. When $\lambda = 0$, ridge regression is equivalent to OLS. As $\lambda$ increases, the impact of the penalty term grows, leading to smaller coefficient estimates. + +## Linear Model + +In the context of a linear model, the regularized loss function with L2 regularization is defined as: + +$$L(\mathbf{w}) = L_0(\mathbf{w}) + \lambda \sum_{j=1}^n w_j^2$$ + +where: + +- $L(\mathbf{w})$ is the regularized loss function. +- $L_0(\mathbf{w})$ is the original loss function (e.g., mean squared error for regression). +- $\lambda$ is the regularization parameter that controls the strength of the penalty. +- $w_j$​ are the weights of the model. +- $n$ is the number of weights. + +## Effects of L2 Regularization + +### Advantages + +1. **Shrinkage of Weights:** The penalty term $\lambda \sum_{j=1}^n w_j^2$​ forces the optimization process to keep the weights small, thus shrinking their values. This helps to prevent the model from fitting the noise in the training data. +2. **Smoothness:** By penalizing large weights, L2 regularization encourages smoother and more stable models, which are less sensitive to small changes in the input data. +3. **Bias-Variance Tradeoff:** L2 regularization introduces a tradeoff between bias and variance. While it may increase the bias slightly (leading to a simpler model), it significantly reduces the variance, thus improving the model's generalization performance. + 1. $\lambda$ controls the strength of regularization. The higher it is, the more bias introduced into the model, and the less variance there is. + 2. Mean square error (MSE) can help determine a suitable $\lambda$ value. MSE is closely related to residual sum of squares (RSS) and is a means of measuring the difference, on average, between predicted and true values. As $\lambda$ increases, MSE increases. Nevertheless, it is argued that there always exists a value of λ greater than zero such that MSE obtained through ridge regression is smaller than that obtained through OLS. One method for deducing a suitable λ value is to find the highest value for λ that does not increase MSE. Additional [cross-validation](https://www.ibm.com/docs/en/spss-modeler/18.0.0?topic=settings-cross-validation) techniques can help users select optimal λ values for tuning their model. +4. **Handles Multicollinearity**: Ridge regression can handle multicollinearity by imposing a penalty on the size of the coefficients, which helps to stabilize the estimation process. +5. **Informative Variable Selection**: Although L2 regularization does not perform variable selection in the strictest sense (like [L1 Regularization](L1%20Regularization.md), which can set some coefficients exactly to zero), it does help in highlighting informative variables. 
Variables that consistently have **non-zero**, **relatively large coefficients** after regularization are likely to be informative, as the regularization would have shrunk less important variables more significantly. + +### Disadvantages + +1. **Interpretability**: The coefficients are shrunk towards zero, making them harder to interpret. +2. **Choice of $\lambda$**: Selecting an appropriate value for $\lambda$ is crucial and can be done using techniques like cross-validation. + +## Practical Implementation + +In many machine learning libraries, L2 regularization can be easily applied by specifying the regularization parameter λ\lambdaλ. For example, in scikit-learn, L2 regularization is applied by default in models like `Ridge` regression and logistic regression. + + +``` python +from sklearn.linear_model import Ridge + +# Create a Ridge regression model with L2 regularization +model = Ridge(alpha=1.0) # alpha is the regularization parameter (lambda) +model.fit(X_train, y_train) +``` + +In neural networks, L2 regularization can be applied to the weights during the optimization process. In Keras, for example: + +``` python +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.regularizers import l2 + +model = Sequential([ + Dense(64, input_dim=input_dim, activation='relu', kernel_regularizer=l2(0.01)), + Dense(1, activation='linear', kernel_regularizer=l2(0.01)) + ]) +model.compile(optimizer='adam', loss='mse') +model.fit(X_train, y_train, epochs=100, batch_size=32)` +``` + +By adding L2 regularization, the model is less likely to overfit and should perform better on unseen data. + +## Why avoid large weights? + +Large weights in a machine learning model can be problematic for several reasons: + +### 1. **Overfitting:** + - **Sensitivity to Noise:** Large weights can make the model overly sensitive to the noise in the training data. This means that the model might fit the training data very well but perform poorly on new, unseen data because it has essentially memorized the training data rather than learning the underlying patterns. + - **Complexity:** Large weights often correspond to complex decision boundaries. While this complexity might fit the training data closely, it can lead to poor generalization to new data. + +### 2. **Stability:** + - **Numerical Instability:** Models with large weights can become numerically unstable, especially in deep learning. This instability can lead to large changes in the output for small changes in the input, which is undesirable. + - **Gradient Explosion:** In gradient-based optimization methods, large weights can cause gradients to explode, making the training process difficult and leading to convergence issues. + +### 3. **Regularization and Generalization:** + - **Bias-Variance Tradeoff:** Large weights typically reduce the bias but increase the variance of the model. High variance means the model is likely to change significantly with slight variations in the training data, indicating poor generalization. + - **Regularization:** Techniques like L2 regularization penalize large weights, encouraging the model to learn simpler patterns that generalize better to new data. This regularization helps in achieving a balance between bias and variance. + +### 4. **Interpretability:** + - **Less Interpretability:** Models with large weights are often more complex and harder to interpret. 
This lack of interpretability can be a significant drawback, especially in applications where understanding the model's decisions is crucial (e.g., healthcare, finance). + +### 5. **Physical Constraints:** + - **Real-World Constraints:** In some applications, weights correspond to physical quantities. Large weights might imply unrealistic or impractical values that don't make sense in the real-world context. + +### Example: + +Consider a simple linear regression problem: + +$y = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$ + +If the weights $w_1, w_2, \ldots, w_n$ are very large, a small change in any of the input features $x_i$ can lead to a large change in the output $y$. This makes the model highly sensitive to the input data, and such sensitivity is usually undesirable because it indicates that the model is not robust. + +### Visualization: + +Here's a visualization to illustrate the effect of large weights: + +- **Without Regularization (Large Weights):** + - The decision boundary is very complex. + - The model fits the training data very closely, including the noise. + - High variance and poor generalization to new data. + +- **With Regularization (Smaller Weights):** + - The decision boundary is smoother and simpler. + - The model captures the underlying pattern in the data without fitting the noise. + - Better generalization to new data. + +In summary, large weights are often not desirable because they can lead to overfitting, instability, poor generalization, and less interpretable models. Regularization techniques like L2 regularization help mitigate these issues by penalizing large weights, encouraging the model to find simpler, more generalizable solutions. + +## Why called ridge regression? + +The term "ridge regression" originates from the way the solution to the regularized regression problem is visualized geometrically. +### Geometric Interpretation + +In standard linear regression, the solution corresponds to finding the point in the parameter space that minimizes the sum of squared residuals. When multicollinearity is present, this solution can be unstable and may lie in a "ridge" or elongated region of the parameter space where small changes in the data can lead to large changes in the estimated coefficients. + +### Regularization and the Ridge + +Ridge regression adds a penalty term to the sum of squared residuals, which effectively shrinks the coefficients towards zero. This penalty term is based on the L2 norm of the coefficients, $\| \beta \|^2_2$. The constraint imposed by the regularization term can be visualized as adding an ellipsoidal contour (or ridge) to the error surface. The ellipsoid is centered at the origin, and its shape is determined by the regularization parameter $\lambda$. + +### Visual Analogy + +- **Ordinary Least Squares (OLS)**: The solution lies at the point where the residual sum of squares is minimized. This can be visualized as finding the point at the bottom of a bowl-shaped error surface. +- **Ridge Regression**: The solution lies at the point where the residual sum of squares plus the penalty term is minimized. This can be visualized as finding the point on the ridge of the ellipsoidal contour that intersects the bowl-shaped error surface. + +The term "ridge" reflects the added stability and control over the solution provided by the regularization term, which helps prevent the solution from lying on an unstable, elongated region of the parameter space. +## L2 regularization for other models + +### Non-Linear Models + +#### 1. 
Polynomial Regression + +Polynomial regression is a type of non-linear regression where the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $n$-degree polynomial. + +$y = w_0 + w_1 x + w_2 x^2 + \cdots + w_n x^n$ + +L2 regularization can be added to the polynomial regression model by including the penalty term: + +$\text{Loss}_{\text{regularized}} = \frac{1}{2m} \sum_{i=1}^m (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^n w_j^2$ + +#### 2. Support Vector Regression (SVR) + +Support Vector Regression (SVR) is a type of regression model that uses support vector machines (SVMs) for regression tasks. SVR can be kernelized to handle non-linear relationships. L2 regularization is commonly used in SVR to penalize the magnitude of the coefficients: + +$\text{Loss}_{\text{regularized}} = \frac{1}{2} \sum_{i=1}^m \max(0, |y_i - \hat{y}_i| - \epsilon) + \frac{1}{2} \sum_{j=1}^n w_j^2$ + +#### 3. Neural Networks + +Neural networks are highly flexible non-linear models that can learn complex relationships between inputs and outputs. L2 regularization, also known as weight decay in the context of neural networks, can be applied to the weights of the network: + +$\text{Loss}_{\text{regularized}} = \text{Loss}_{\text{original}} + \lambda \sum_{j=1}^n w_j^2$ + +In this case, the regularization term is added to the original loss function (e.g., cross-entropy loss or mean squared error). + +### Implementation in Python + +L2 regularization can be implemented using various machine learning libraries in Python. Here are examples for polynomial regression, SVR, and neural networks: + +#### Polynomial Regression with L2 Regularization + +```python +from sklearn.preprocessing import PolynomialFeatures +from sklearn.linear_model import Ridge +from sklearn.pipeline import make_pipeline + +# Example data +X = [[1], [2], [3], [4], [5]] +y = [1, 4, 9, 16, 25] + +# Create a polynomial regression model with L2 regularization +degree = 3 +model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0)) +model.fit(X, y) +y_pred = model.predict(X) +``` + +#### Support Vector Regression with L2 Regularization + +```python +from sklearn.svm import SVR + +# Example data +X = [[1], [2], [3], [4], [5]] +y = [1, 4, 9, 16, 25] + +# Create a support vector regression model with L2 regularization +model = SVR(kernel='poly', degree=3, C=1.0) +model.fit(X, y) +y_pred = model.predict(X) +``` + +#### Neural Networks with L2 Regularization + +```python +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.regularizers import l2 + +# Example data +import numpy as np +X = np.array([[1], [2], [3], [4], [5]]) +y = np.array([1, 4, 9, 16, 25]) + +# Create a neural network model with L2 regularization +model = Sequential([ + Dense(10, activation='relu', kernel_regularizer=l2(0.01), input_shape=(1,)), + Dense(1, kernel_regularizer=l2(0.01)) +]) + +model.compile(optimizer='adam', loss='mse') +model.fit(X, y, epochs=100, verbose=0) +y_pred = model.predict(X) +``` + +## L2 norm + +L2 regularization gets its name from the L2 norm, which is also known as the Euclidean norm. The L2 norm of a vector is defined as the square root of the sum of the squares of its components (like how the Euclidean distance is calculated). 
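+
+As a quick numerical sketch of this relationship (a NumPy example with an arbitrary weight vector; the formal definition follows below), the norm and the squared-norm penalty added to the loss can be computed directly:
+
+``` python
+import numpy as np
+
+w = np.array([0.5, -1.2, 3.0])     # example weight vector (arbitrary values)
+lam = 0.01                         # regularization strength (lambda)
+
+l2_norm = np.sqrt(np.sum(w ** 2))  # square root of the sum of squared components
+penalty = lam * np.sum(w ** 2)     # L2 penalty term added to the loss (no square root)
+
+print(l2_norm, np.linalg.norm(w))  # both print the Euclidean norm, ~3.27
+print(penalty)                     # ~0.107
+```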
+ +### L2 Norm + +For a vector $\mathbf{w} = [w_1, w_2, \ldots, w_n]$, the L2 norm is defined as: + +$\| \mathbf{w} \|_2 = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}$ + +### L2 Regularization in the Context of Machine Learning + +In the context of regularizing a regression model, L2 regularization involves adding a penalty term to the loss function that is proportional to the L2 norm of the coefficients (without the square root for simplicity). This penalty term is the sum of the squares of the coefficients: + +$\lambda \sum_{j=1}^n w_j^2$ + +Here, $\lambda$ is the regularization parameter that controls the strength of the penalty. + +### Mathematical Formulation + +For a linear regression model, the standard loss function (mean squared error) without regularization is: + +$L_0(\mathbf{w}) = \frac{1}{2m} \sum_{i=1}^m (y_i - \mathbf{w} \cdot \mathbf{x}_i - b)^2$ + +When we add L2 regularization, the loss function becomes: + +$L(\mathbf{w}) = \frac{1}{2m} \sum_{i=1}^m (y_i - \mathbf{w} \cdot \mathbf{x}_i - b)^2 + \lambda \sum_{j=1}^n w_j^2$ + +In this regularized loss function, the term $\lambda \sum_{j=1}^n w_j^2$ is the L2 regularization term. It penalizes large values of the weights $w_j$, encouraging them to be small. + diff --git a/Machine Learning/Concepts/L3 Regularization.md b/Machine Learning/Concepts/L3 Regularization.md new file mode 100644 index 0000000..e69de29 diff --git a/Machine Learning/Concepts/Mean Squared Error.md b/Machine Learning/Concepts/Mean Squared Error.md new file mode 100644 index 0000000..5ecc276 --- /dev/null +++ b/Machine Learning/Concepts/Mean Squared Error.md @@ -0,0 +1,69 @@ +# Mean Squared Error + +MSE measures the average of the squares of the residuals. It is essentially the [Residual Sum of Squares](Residual%20Sum%20of%20Squares.md) divided by the number of observations. + +$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$ + +## Steps to Calculate RSS and MSE + +1. **Compute the Residuals:** + - For each observation, subtract the predicted value from the actual value to get the residual (error). + +2. **Square the Residuals:** + - Square each residual to ensure they are all positive values. + +3. **Sum the Squared Residuals:** + - Sum all the squared residuals to get the RSS. + +4. **Calculate the Mean of the Squared Residuals:** + - Divide the RSS by the number of observations to get the MSE. + +### Example in Python + +Here is a simple example using Python to calculate RSS and MSE: + +```python +import numpy as np + +# Example actual values and predicted values +y_actual = np.array([3, -0.5, 2, 7]) +y_pred = np.array([2.5, 0.0, 2, 8]) + +# Calculate residuals +residuals = y_actual - y_pred + +# Calculate RSS +rss = np.sum(np.square(residuals)) +print(f"RSS: {rss}") + +# Calculate MSE +mse = np.mean(np.square(residuals)) +print(f"MSE: {mse}") +``` + +### Example in R + +Here is a simple example using R to calculate RSS and MSE: + +```r +# Example actual values and predicted values +y_actual <- c(3, -0.5, 2, 7) +y_pred <- c(2.5, 0.0, 2, 8) + +# Calculate residuals +residuals <- y_actual - y_pred + +# Calculate RSS +rss <- sum(residuals^2) +cat("RSS:", rss, "\n") + +# Calculate MSE +mse <- mean(residuals^2) +cat("MSE:", mse, "\n") +``` + +## RSS vs. MSE +- **RSS (Residual Sum of Squares):** The sum of the squared differences between the actual and predicted values. +- **MSE (Mean Squared Error):** The average of the squared differences between the actual and predicted values. 
+ +Both metrics are useful for evaluating the accuracy of a regression model, with MSE providing a normalized measure that accounts for the number of observations. \ No newline at end of file diff --git a/Machine Learning/Concepts/Multicollinearity.md b/Machine Learning/Concepts/Multicollinearity.md new file mode 100644 index 0000000..b36c287 --- /dev/null +++ b/Machine Learning/Concepts/Multicollinearity.md @@ -0,0 +1,137 @@ +# Multicollinearity + +Multicollinearity refers to a situation in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one predictor variable can be linearly predicted from the others with a substantial degree of accuracy. This situation can lead to several issues in the estimation of regression coefficients using ordinary least squares (OLS). Here’s why multicollinearity causes the effects mentioned: + +## Unbiased Estimates with High Variance + +1. **Unbiased Estimates**: + - OLS estimates are unbiased, meaning that the expected value of the estimated _coefficients_ is equal to the true value of the _coefficients_. This property holds regardless of multicollinearity. + - Mathematically, if $\beta$ is the true coefficient vector, then $\mathbb{E}[\hat{\beta}] = \beta$. + +2. **High Variance**: + - When predictor variables are highly correlated, the matrix $X^TX$ (where $X$ is the matrix of predictor variables) becomes close to singular, meaning it is nearly non-invertible. + - The variance of the OLS estimator $\hat{\beta}$ is given by: + $\text{Var}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}$ + where $\sigma^2$ is the variance of the error term. + - If $X^TX$ is close to singular, the entries of $(X^TX)^{-1}$ can become very large. This results in large variances for the coefficient estimates $\hat{\beta}$. + - High variance implies that the coefficient estimates $\hat{\beta}$ can **vary widely with different samples**, even if the true underlying relationship is stable. + +## Effects of Multicollinearity + +1. **Instability of Coefficient Estimates**: + - High variance in the coefficient estimates leads to instability. Small changes in the data can cause large changes in the estimates of the coefficients. This makes the model _sensitive to the particular sample used_. + +2. **Interpretation Difficulties**: + - When predictors are highly correlated, it becomes difficult to assess the individual effect of each predictor on the response variable. The estimated coefficients may not accurately reflect the true relationship between each predictor and the response. + +3. **Increased Standard Errors**: + - The standard errors of the coefficients increase due to multicollinearity. Larger standard errors mean wider confidence intervals for the coefficients, making it harder to determine if a predictor is statistically significant. + +### Why Multicollinearity Causes These Effects + +1. **Redundancy in Information**: + - When predictors are highly correlated, they provide redundant information about the response variable. The model cannot distinguish the unique contribution of each predictor to the response. + +2. **Numerical Instability**: + - The near-singularity of $X^TX$ leads to numerical instability in computing the inverse of this matrix, which is required for estimating the coefficients. This instability manifests as large variances in the estimates. 
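+
+To make the variance inflation concrete, here is a small simulation sketch (NumPy, with made-up sample sizes and noise levels): two nearly identical predictors are drawn repeatedly and OLS is refit on each sample. The estimates remain unbiased on average but swing widely from sample to sample.
+
+``` python
+import numpy as np
+
+rng = np.random.default_rng(0)
+n, n_sims = 100, 500
+true_beta = np.array([1.0, 1.0])
+estimates = []
+
+for _ in range(n_sims):
+    x1 = rng.normal(size=n)
+    x2 = x1 + rng.normal(scale=0.01, size=n)          # x2 nearly equals x1 -> multicollinearity
+    X = np.column_stack([x1, x2])
+    y = X @ true_beta + rng.normal(size=n)
+    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit on this sample
+    estimates.append(beta_hat)
+
+estimates = np.asarray(estimates)
+print("mean of estimates:", estimates.mean(axis=0))   # close to the true (1, 1): unbiased
+print("std of estimates: ", estimates.std(axis=0))    # large spread: high variance
+```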
+ +### Ridge Regression as a Solution + +[Ridge regression](L2%20Regularization.md) addresses the problem of multicollinearity by adding a penalty term to the loss function, which shrinks the coefficient estimates. This penalty term is controlled by the regularization parameter $\lambda$: + +$\hat{\beta}_{\text{ridge}} = \arg \min_{\beta} \left( \| y - X\beta \|^2_2 + \lambda \| \beta \|^2_2 \right)$ + +- The ridge penalty $\lambda \| \beta \|^2_2$ stabilizes the inversion of $X^TX$ by making the problem better conditioned. +- It reduces the variance of the coefficient estimates at the cost of introducing some bias (biased but with lower variance), leading to more reliable and interpretable models. + +In summary, multicollinearity causes high variance in OLS estimates because it makes the predictor matrix $X$nearly singular, leading to instability in the coefficient estimates. Ridge regression mitigates this issue by adding a penalty term, which regularizes the estimates and reduces their variance. + +## Coefficient Variance + +Reducing coefficients does not necessarily reduce variance. The relationship between the size of the coefficients and their variance depends on several factors, including the presence of multicollinearity, the amount of regularization applied, and the structure of the data. + +### Factors Affecting Coefficient Variance + +1. **Multicollinearity**: + - When predictors are highly correlated, even small changes in the data can lead to large changes in the estimated coefficients. This results in high variance, regardless of whether the coefficients themselves are small or large. + - For example, in a situation of perfect multicollinearity, the coefficients can be arbitrarily large or small with very high variance. + +2. **Regularization**: + - Techniques like ridge regression (L2 regularization) shrink the coefficients by adding a penalty proportional to the sum of the squares of the coefficients. This reduces the variance of the coefficients. + - However, regularization introduces bias. While regularized coefficients are generally smaller and have reduced variance compared to their unregularized counterparts, this does not imply that all small coefficients have small variance. + +3. **Sample Size and Data Structure**: + - The variance of the coefficient estimates also depends on the sample size and the variability in the data. Larger sample sizes tend to _reduce_ the variance of the estimates. + - The design matrix $X$ (the matrix of predictors) influences the variance. If the columns of $X$ are nearly linearly dependent (indicating multicollinearity), the variance of the coefficients increases. + +### Mathematical Perspective + +The variance of the OLS estimator $\hat{\beta}$ is given by: +$\text{Var}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}$ +where $\sigma^2$ is the variance of the error term, and $X^TX$ is the matrix of predictors. The elements of $(X^TX)^{-1}$ can be large if $X$ is close to being singular, leading to high variance in $\hat{\beta}$. + +In ridge regression, the estimator is: +$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$ +Here, the addition of $\lambda I$ to $X^TX$ stabilizes the inversion, reducing the variance of $\hat{\beta}_{\text{ridge}}$. This often results in smaller coefficients, but it is the regularization that reduces variance, not the size of the coefficients themselves. + +### Intuitive Example + +Consider two different scenarios with a simple linear model: + +1. 
**Scenario with Multicollinearity**: + - Predictors $X_1$ and $X_2$ are highly correlated. Even if the true coefficients are small (say $\beta_1 = 0.1$ and $\beta_2 = 0.2$), the high correlation means the estimates of $\hat{\beta}_1$ and $\hat{\beta}_2$ will have high variance. Small coefficients in this case do not imply small variance. + +2. **Scenario with Regularization**: + - Applying ridge regression with a suitable $\lambda$ reduces the magnitude of the coefficients and stabilizes their estimates. Here, the small coefficients result from the regularization process, which also reduces variance. But this is a controlled scenario where small coefficients are associated with reduced variance due to regularization. + +### Conclusion + +While regularization techniques like ridge regression can produce smaller coefficients with reduced variance, small coefficients in general do not imply small variance. The relationship between coefficient size and variance is influenced by multicollinearity, regularization, sample size, and the structure of the data. Understanding these factors is crucial in interpreting the estimates and their variances in regression models. + +## Stability of Inversion + +Stabilizing an inversion refers to improving the numerical stability and conditioning of the matrix inversion process, especially in the context of solving systems of linear equations or estimating parameters in regression models. This is particularly important when dealing with nearly singular or ill-conditioned matrices, where direct inversion can lead to large numerical errors and unstable solutions. + +### Context in Regression + +In the context of ridge regression, stabilizing the inversion relates to making the matrix $X^TX$ (the Gram matrix of the predictor variables) better conditioned, so that the inversion required to solve for the regression coefficients is more stable and reliable. + +### Ordinary Least Squares (OLS) Regression + +In OLS regression, the coefficients $\hat{\beta}$ are estimated using the formula: +$\hat{\beta} = (X^TX)^{-1}X^Ty$ + +If $X^TX$ is nearly singular or ill-conditioned, the inversion $(X^TX)^{-1}$ can be numerically unstable. **Small changes in the data can cause large changes in the coefficients**, leading to **high variance in the estimates**. + +### Ridge Regression + +Ridge regression addresses this problem by adding a regularization term to the diagonal of $X^TX$: +$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$ + +Here, $\lambda$ is the regularization parameter, and $I$ is the identity matrix. This addition has several effects: + +1. **Improved Conditioning**: The matrix $X^TX + \lambda I$ is better conditioned than $X^TX$. Conditioning refers to the _sensitivity of the solution to changes in the input data_. A well-conditioned matrix has a smaller condition number, meaning it is less sensitive to small changes or errors in the data. + +2. **Numerical Stability**: The inversion $(X^TX + \lambda I)^{-1}$ is more stable because the _added term_ $\lambda I$ ensures that the matrix is not close to singular. This reduces the risk of _large numerical errors_ during the inversion process. + +3. **Shrinkage**: The regularization term $\lambda I$ shrinks the coefficients towards zero, which helps to reduce their variance. This shrinkage is controlled by the regularization parameter $\lambda$. As $\lambda$ increases, the coefficients are shrunk more, and the inversion becomes more stable. 
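+
+A minimal sketch of this effect (NumPy, assuming an almost-collinear design like the one above and an arbitrary $\lambda$): adding $\lambda I$ sharply reduces the condition number of the matrix being inverted, and the ridge coefficients come from the better-conditioned system.
+
+``` python
+import numpy as np
+
+rng = np.random.default_rng(1)
+n, lam = 100, 1.0
+x1 = rng.normal(size=n)
+x2 = x1 + rng.normal(scale=0.01, size=n)                      # nearly collinear predictors
+X = np.column_stack([x1, x2])
+y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
+
+XtX = X.T @ X
+print(np.linalg.cond(XtX))                                    # huge: ill-conditioned
+print(np.linalg.cond(XtX + lam * np.eye(2)))                  # much smaller: well-conditioned
+
+beta_ols = np.linalg.solve(XtX, X.T @ y)                      # unstable OLS solution
+beta_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)  # stabilized ridge solution
+print(beta_ols, beta_ridge)
+```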
+
+### Why Stabilizing the Inversion Matters
+
+- **Numerical Precision**: Computers have finite precision, and operations like matrix inversion can introduce numerical errors. Stabilizing the inversion helps to minimize these errors.
+- **Model Reliability**: Stable inversion leads to more reliable coefficient estimates. In the presence of multicollinearity, ridge regression provides more consistent and interpretable results.
+- **Reduced Variance**: By stabilizing the inversion, ridge regression reduces the variance of the coefficient estimates, making the model less sensitive to small changes in the data.
+
+### Example
+
+Consider a simple example with two highly correlated predictors $X_1$ and $X_2$:
+
+1. **OLS Regression**:
+   $\hat{\beta} = (X^TX)^{-1}X^Ty$
+   If $X_1$ and $X_2$ are highly correlated, $X^TX$ is nearly singular, and the inversion is unstable, leading to high variance in $\hat{\beta}$.
+
+2. **Ridge Regression**:
+   $\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$
+   The term $\lambda I$ makes $X^TX + \lambda I$ well-conditioned, ensuring that the inversion is stable and the estimates $\hat{\beta}_{\text{ridge}}$ have lower variance.
+
+In summary, stabilizing an inversion involves improving the numerical stability and conditioning of the matrix inversion process, making it more reliable and reducing the variance of the estimates. Ridge regression achieves this by adding a regularization term, which ensures the matrix to be inverted is better conditioned.
\ No newline at end of file
diff --git a/Machine Learning/Concepts/Network Adjacency Matrix.md b/Machine Learning/Concepts/Network Adjacency Matrix.md
new file mode 100644
index 0000000..648da6f
--- /dev/null
+++ b/Machine Learning/Concepts/Network Adjacency Matrix.md
@@ -0,0 +1,106 @@
+# Network Adjacency Matrix
+
+A network adjacency matrix is a mathematical representation of a graph or network. It is a square matrix used to represent a finite graph, where the elements of the matrix indicate whether pairs of vertices (or nodes) are adjacent or not in the graph.
+
+## Definition and Structure
+
+- **Vertices/Nodes**: The entities in the network (e.g., people, computers, cities).
+- **Edges**: The connections between the entities (e.g., friendships, communication links, roads).
+
+The adjacency matrix $A$ for a graph with $n$ vertices is an $n \times n$ matrix where each element $A_{ij}$ is defined as:
+
+- $A_{ij} = 1$ if there is an edge from vertex $i$ to vertex $j$.
+- $A_{ij} = 0$ if there is no edge from vertex $i$ to vertex $j$.
+
+### Types of Graphs and Their Adjacency Matrices
+
+1. **Undirected Graphs**:
+   - Symmetric adjacency matrix.
+   - $A_{ij} = A_{ji}$.
+
+2. **Directed Graphs (Digraphs)**:
+   - Asymmetric adjacency matrix.
+   - $A_{ij} \neq A_{ji}$ if there is a directed edge from $i$ to $j$ but not from $j$ to $i$.
+
+3. **Weighted Graphs**:
+   - Entries in the matrix represent the weight of the edge.
+   - $A_{ij}$ could be any non-negative value indicating the weight of the edge from $i$ to $j$.
+
+4. **Unweighted Graphs**:
+   - Entries are either 0 or 1, indicating the absence or presence of an edge.
+
+### Example
+
+#### Undirected Unweighted Graph
+
+Consider a simple undirected graph with 4 vertices (A, B, C, D):
+
+- A is connected to B and C.
+- B is connected to A and D.
+- C is connected to A.
+- D is connected to B.
+ +The adjacency matrix $A$ is: + +$$A = \begin{pmatrix} +0 & 1 & 1 & 0 \\ +1 & 0 & 0 & 1 \\ +1 & 0 & 0 & 0 \\ +0 & 1 & 0 & 0 \\ +\end{pmatrix} +$$ + +#### Directed Unweighted Graph + +Consider a directed graph with 3 vertices (X, Y, Z): + +- X has a directed edge to Y. +- Y has a directed edge to Z. +- Z has a directed edge to X. + +The adjacency matrix $A$ is: + +$$ +A = \begin{pmatrix} +0 & 1 & 0 \\ +0 & 0 & 1 \\ +1 & 0 & 0 \\ +\end{pmatrix} +$$ + +#### Weighted Graph + +Consider a weighted graph with 3 vertices (P, Q, R): + +- P is connected to Q with weight 2 and to R with weight 3. +- Q is connected to R with weight 4. +- R has no outgoing edges. + +The adjacency matrix $A$ is: + +$$ +A = \begin{pmatrix} +0 & 2 & 3 \\ +0 & 0 & 4 \\ +0 & 0 & 0 \\ +\end{pmatrix} +$$ + +### Applications + +- **Social Networks**: Representing connections between people. +- **Computer Networks**: Representing connections between computers or routers. +- **Biological Networks**: Representing interactions between proteins, genes, or other biological entities. +- **Transportation Networks**: Representing routes between cities or locations. + +### Advantages + +- **Simple Representation**: Easy to understand and visualize. +- **Efficient Access**: Quick access to check if an edge exists between two vertices. + +### Disadvantages + +- **Memory Usage**: Requires $O(n^2)$ space, which can be inefficient for large, sparse graphs. +- **Scalability**: Not suitable for very large graphs due to space constraints. + +In summary, an adjacency matrix is a fundamental tool in graph theory and network analysis, offering a straightforward way to represent and manipulate graphs mathematically. \ No newline at end of file diff --git a/Machine Learning/Concepts/Numerical Overflow.md b/Machine Learning/Concepts/Numerical Overflow.md new file mode 100644 index 0000000..18cdcd7 --- /dev/null +++ b/Machine Learning/Concepts/Numerical Overflow.md @@ -0,0 +1,56 @@ +# Numerical Overflow + +Computational overflow due to large exponentials is a common issue in numerical computing, particularly when working with functions involving exponentiation, such as the [Softmax](Softmax.md) function in machine learning, exponential distributions, and other applications in scientific computing. + +### What is Computational Overflow? + +Computational overflow occurs when a number exceeds the maximum value that can be represented within a given numerical system. In floating-point arithmetic, this can lead to values becoming infinite (`inf`) or causing the program to crash or produce erroneous results. + +### Causes of Overflow with Large Exponentials + +When computing exponentials of large numbers, the result can grow very rapidly. For instance, $e^{100}$ is already a very large number, and $e^{1000}$ is vastly larger. Most programming environments have a maximum value for floating-point numbers (e.g., around $10^{308}$ for double-precision floating-point numbers in many systems). If a computation exceeds this range, an overflow occurs. + +### Example + +Consider the softmax function: +$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$ + +For large $z_i$, $e^{z_i}$ can be extremely large, leading to overflow. + +### Mitigation Strategies + +1. **Log-Sum-Exp Trick**: A common technique to avoid overflow in the softmax function and similar computations is to use the log-sum-exp trick. This involves subtracting the maximum value from the exponentials before computing the sum, thus preventing any term from becoming too large. 
+ + Instead of: + $\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$ + + Use: + $\sigma(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_{j=1}^{n} e^{z_j - \max(\mathbf{z})}}$ + + Here, $\max(\mathbf{z})$ is the maximum value in the vector $\mathbf{z}$. Subtracting this value shifts all $z_i$ so that the largest value becomes zero, which prevents any exponentials from becoming excessively large. + +2. **Clamping Values**: Another approach is to clamp values within a range before applying the exponential function. For instance, in certain applications, values can be capped to avoid reaching the overflow threshold. + +3. **Using Logarithms**: In many cases, working in the logarithmic domain can help prevent overflow. For example, instead of working with probabilities directly, which can involve exponentials, work with log-probabilities. + +4. **Numerical Libraries**: Use robust numerical libraries that handle large exponentials and potential overflow situations internally. Many high-level numerical libraries include optimized functions for these computations. + +### Example Calculation Using Log-Sum-Exp Trick + +Suppose you have a vector $\mathbf{z} = [1000, 1001, 1002]$. + +- Compute the maximum value: $\max(\mathbf{z}) = 1002$. +- Subtract the maximum value from each element in $\mathbf{z}$: $[1000 - 1002, 1001 - 1002, 1002 - 1002] = [-2, -1, 0]$. +- Compute the exponentials of the adjusted values: $[e^{-2}, e^{-1}, e^{0}] = [0.135, 0.368, 1]$. +- Compute the sum of these exponentials: $0.135 + 0.368 + 1 = 1.503$. +- Compute the softmax values: + + $\sigma(z_1) = \frac{0.135}{1.503} \approx 0.090$ + + $\sigma(z_2) = \frac{0.368}{1.503} \approx 0.245$ + + $\sigma(z_3) = \frac{1}{1.503} \approx 0.665$ + +### Conclusion + +Computational overflow caused by large exponentials is a significant issue in numerical computations. Techniques such as the log-sum-exp trick, clamping values, working in the logarithmic domain, and using robust numerical libraries can help mitigate these issues. Understanding and applying these strategies is crucial for ensuring numerical stability and accuracy in computational applications. \ No newline at end of file diff --git a/Machine Learning/Concepts/Pearson Correlation.md b/Machine Learning/Concepts/Pearson Correlation.md new file mode 100644 index 0000000..913dcca --- /dev/null +++ b/Machine Learning/Concepts/Pearson Correlation.md @@ -0,0 +1,63 @@ +# Pearson Correlation + +The Pearson correlation coefficient is a measure of the **linear relationship** between two vectors (or sets of data). It quantifies how much one variable changes in response to changes in another variable, giving both the direction and the strength of the relationship. + +### Definition + +Given two vectors $\mathbf{x}$ and $\mathbf{y}$ of length $n$, the Pearson correlation coefficient $r$ is defined as: + +$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$ + +where: +- $x_i$ and $y_i$ are the individual elements of the vectors $\mathbf{x}$ and $\mathbf{y}$, respectively. +- $\bar{x}$ and $\bar{y}$ are the means of the vectors $\mathbf{x}$ and $\mathbf{y}$, respectively. + +### Interpretation + +1. **Value Range**: The Pearson correlation coefficient $r$ ranges from -1 to 1. + - $r = 1$: Perfect positive linear relationship. As one variable increases, the other variable increases proportionally. + - $r = -1$: Perfect negative linear relationship. 
As one variable increases, the other variable decreases proportionally. + - $r = 0$: No linear relationship. Changes in one variable do not predict changes in the other variable. + +2. **Strength of Relationship**: + - $|r|$ close to 1: Strong linear relationship. + - $|r|$ close to 0: Weak or no linear relationship. + +### What Pearson Correlation Tells Us + +1. **Direction**: The sign of $r$ indicates the direction of the relationship: + - Positive $r$: Both variables tend to increase together. + - Negative $r$: One variable tends to decrease as the other increases. + +2. **Strength**: The magnitude of $r$ indicates the strength of the relationship: + - Values close to 1 or -1 indicate a strong linear relationship. + - Values close to 0 indicate a weak linear relationship. + +3. **Linearity**: Pearson correlation specifically measures the strength and direction of a linear relationship. It does not capture non-linear relationships. + +### Example + +Consider two vectors $\mathbf{x} = [1, 2, 3, 4, 5]$ and $\mathbf{y} = [2, 4, 6, 8, 10]$: + +1. Compute the means: + $\bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = 3, \quad \bar{y} = \frac{2 + 4 + 6 + 8 + 10}{5} = 6$ + +2. Compute the covariance: + $\sum_{i=1}^{5} (x_i - \bar{x})(y_i - \bar{y}) = (1-3)(2-6) + (2-3)(4-6) + (3-3)(6-6) + (4-3)(8-6) + (5-3)(10-6) = 20$ + +3. Compute the standard deviations: + $\sqrt{\sum_{i=1}^{5} (x_i - \bar{x})^2} = \sqrt{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2} = \sqrt{10}$ + $\sqrt{\sum_{i=1}^{5} (y_i - \bar{y})^2} = \sqrt{(2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2} = \sqrt{40}$ + +4. Compute $r$: + $r = \frac{20}{\sqrt{10} \cdot \sqrt{40}} = \frac{20}{\sqrt{400}} = \frac{20}{20} = 1$ + +This indicates a perfect positive linear relationship between $\mathbf{x}$ and $\mathbf{y}$. + +### Caveats + +- **Linearity**: Pearson correlation measures linear relationships only. If the relationship is non-linear, $r$ might be close to 0 even if there is a strong relationship. +- **Outliers**: Sensitive to outliers, which can disproportionately affect the correlation. +- **Assumption of Normality**: Assumes the data is _normally distributed_, though it can still be used for non-normal data with caution. + +In summary, the Pearson correlation coefficient provides a concise measure of the direction and strength of the linear relationship between two vectors. It is widely used in statistics and data analysis to understand the relationship between variables. \ No newline at end of file diff --git a/Machine Learning/Concepts/Residual Sum of Squares.md b/Machine Learning/Concepts/Residual Sum of Squares.md new file mode 100644 index 0000000..b6457db --- /dev/null +++ b/Machine Learning/Concepts/Residual Sum of Squares.md @@ -0,0 +1,14 @@ +# RSS + +RSS (Residual Sum of Squares) is a metric used to evaluate the performance of a regression model. + +## Residual Sum of Squares (RSS) + +RSS measures the total deviation of the observed values from the values predicted by the model. It is calculated as the sum of the squares of the residuals (errors). + +$$\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$ + +where: +- $y_i$ is the actual value for the $i$-th observation. +- $\hat{y}_i$ is the predicted value for the $i$-th observation. +- $n$ is the number of observations. 
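+
+A minimal NumPy sketch (reusing the example values from the [Mean Squared Error](Mean%20Squared%20Error.md) note):
+
+``` python
+import numpy as np
+
+y_actual = np.array([3.0, -0.5, 2.0, 7.0])  # observed values
+y_pred = np.array([2.5, 0.0, 2.0, 8.0])     # model predictions
+
+rss = np.sum((y_actual - y_pred) ** 2)      # residual sum of squares
+print(rss)                                  # 1.5
+```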
diff --git a/Machine Learning/Concepts/Softmax.md b/Machine Learning/Concepts/Softmax.md new file mode 100644 index 0000000..f1cf3d8 --- /dev/null +++ b/Machine Learning/Concepts/Softmax.md @@ -0,0 +1,166 @@ +# Softmax + +The softmax function is a mathematical function that converts a vector of real numbers into a **probability distribution**. It is commonly used in machine learning, especially in the context of multiclass classification problems, to represent the probabilities of different classes. + +## Softmax Function +### Definition + +Given a vector $\mathbf{z} = [z_1, z_2, \ldots, z_n]$, the softmax function $\sigma(\mathbf{z})$ is defined as: + +$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$ + +for $i = 1, 2, \ldots, n$. + +### Properties + +1. **Probability Distribution**: The output of the softmax function is a vector of probabilities that sum to 1. + + $\sum_{i=1}^{n} \sigma(z_i) = 1$ + +2. **Exponentiation**: The function uses the exponential function, which ensures that all output values are positive. +3. **Normalization**: Each exponentiated value is divided by the sum of all exponentiated values, normalizing the outputs to sum to 1. + +### Example + +Consider a vector $\mathbf{z} = [1.0, 2.0, 3.0]$. To compute the softmax of this vector: + +1. Compute the exponentials of each element: + + $e^{1.0} = 2.7183, \quad e^{2.0} = 7.3891, \quad e^{3.0} = 20.0855$ + +2. Compute the sum of the exponentials: + + $2.7183 + 7.3891 + 20.0855 = 30.1929$ + +3. Divide each exponential by the sum: + + $\sigma(1.0) = \frac{2.7183}{30.1929} \approx 0.0900, \quad \sigma(2.0) = \frac{7.3891}{30.1929} \approx 0.2447, \quad \sigma(3.0) = \frac{20.0855}{30.1929} \approx 0.6652$ + +So, the softmax output is $\mathbf{\sigma(z)} = [0.0900, 0.2447, 0.6652]$. + +### Applications + +1. **Multiclass Classification**: In neural networks, especially in the final layer of a classification model, softmax is used to convert the logits (raw output values) into probabilities for each class. +2. **Probability Distribution Modeling**: It is used in various probabilistic models where it is necessary to convert a set of scores into a probability distribution. + +### Advantages + +- **Interpretable Output**: The output probabilities are easy to interpret and can be directly used for decision-making. +- **Differentiability**: The softmax function is differentiable, which makes it suitable for optimization algorithms like gradient descent. + +### Derivative + +The derivative of the softmax function is often used in the backpropagation algorithm for training neural networks. For a vector $\mathbf{z}$, the derivative with respect to its inputs is: + + +$\frac{\partial \sigma(z_i)}{\partial z_j} = \sigma(z_i) (\delta_{ij} - \sigma(z_j))$ + +where $\delta_{ij}$ is the [Kronecker Delta](Kronecker%20Delta.md), which is 1 if $i = j$ and 0 otherwise. + +### Computational Stability + +In practice, to improve numerical stability, the softmax function is often implemented using the following equivalent formulation: + + +$\sigma(z_i) = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_{j=1}^{n} e^{z_j - \max(\mathbf{z})}}$ + +Subtracting the maximum value of the vector $\mathbf{z}$ from each element before exponentiating helps to prevent large exponentials, reducing the risk of [Numerical Overflow](Numerical%20Overflow.md). 
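+
+A short sketch of this stabilized form (a minimal NumPy implementation, not the version used by any particular library):
+
+``` python
+import numpy as np
+
+def softmax(z):
+    """Numerically stable softmax: shift by max(z) before exponentiating."""
+    z = np.asarray(z, dtype=float)
+    exp_z = np.exp(z - np.max(z))        # largest entry becomes 0, so exp() cannot overflow
+    return exp_z / np.sum(exp_z)
+
+print(softmax([1.0, 2.0, 3.0]))          # ~[0.090, 0.245, 0.665], matching the example above
+print(softmax([1000.0, 1001.0, 1002.0])) # same output, with no overflow
+```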
+ +## Intuitive Understanding + +Intuitively, the softmax function performs a kind of "squashing" operation that takes a vector of real numbers (which could be logits or scores from a model) and transforms it into a vector of probabilities. Here’s how it works and why it makes sense: + +### Intuitive Explanation + +1. **Exponential Emphasis**: + - **Highlighting Differences**: By applying the exponential function to each element, the softmax function accentuates the differences between values. Higher values become much larger after exponentiation, while smaller values become comparatively smaller. This emphasizes the relative importance of higher scores. + +2. **Normalization**: + - **Sum to One**: The sum of these exponentiated values is computed, and each exponentiated value is then divided by this sum. This ensures that the resulting values form a probability distribution (i.e., they are non-negative and sum to 1). + +### Why It Makes Sense + +- **Relative Importance**: The original vector might represent some form of raw, unbounded scores indicating the relative importance or likelihood of different classes or outcomes. By exponentiating these scores, we are effectively translating them into a scale where the differences become more pronounced. + +- **Normalization**: Once we have these pronounced differences, normalizing them ensures that they can be interpreted as probabilities. Probabilities inherently need to sum to one, providing a meaningful way to compare the relative likelihoods of different outcomes. + +### Example to Illustrate + +Consider a simple example where we have scores from a classification model: $[2.0, 1.0, 0.1]$. + +1. **Exponentiation**: + - $e^{2.0} \approx 7.39$ + - $e^{1.0} \approx 2.72$ + - $e^{0.1} \approx 1.11$ + +2. **Sum of Exponentials**: + - Sum $\approx 7.39 + 2.72 + 1.11 = 11.22$ + +3. **Normalization**: + - Probability for the first element: $\frac{7.39}{11.22} \approx 0.66$ + - Probability for the second element: $\frac{2.72}{11.22} \approx 0.24$ + - Probability for the third element: $\frac{1.11}{11.22} \approx 0.10$ + +This transforms the original scores into a probability distribution $[0.66, 0.24, 0.10]$, which is easy to interpret: the first class is the most likely, the second is less likely, and the third is the least likely. + +### Why Use Softmax Instead of Other Functions? + +- **Exponentiation and Differentiation**: The exponential function’s properties ensure that even small differences in scores become significant after exponentiation, which is crucial for distinguishing between classes. Moreover, the softmax function is differentiable, allowing it to be used effectively in gradient-based optimization methods common in neural networks. + +- **Handling Logits**: Raw scores (logits) can be negative, zero, or positive, and their scale can vary. The softmax function handles this by converting any range of input values into a well-defined probability distribution. + +### Summary + +The softmax function turns a vector of arbitrary real numbers into a vector of probabilities by emphasizing differences between the numbers and normalizing them to sum to one. This transformation makes sense because it leverages the properties of exponentiation to highlight relative importance and ensures a meaningful probabilistic interpretation through normalization. + +## Temperature + +The temperature parameter in the softmax function is a scaling factor that can be used to **control the "sharpness" or "smoothness" of the resulting probability distribution**. 
This parameter is often denoted by $T$.
+
+### Temperature in Softmax Function
+
+The standard softmax function for a vector $\mathbf{z}$ is:
+
+$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$
+
+Introducing a temperature parameter $T$, the modified softmax function becomes:
+
+$\sigma_T(z_i) = \frac{e^{z_i / T}}{\sum_{j=1}^{n} e^{z_j / T}}$
+
+### Effects of the Temperature Parameter
+
+1. **High Temperature ($T > 1$)**:
+   - When the temperature $T$ is greater than 1, the softmax function produces a smoother, more uniform probability distribution. This means that the differences between the probabilities of different outcomes are reduced, making the distribution more "soft".
+   - For example, if $T$ is very high, the softmax function approaches a uniform distribution where all probabilities are almost equal.
+
+2. **Low Temperature ($0 < T < 1$)**:
+   - When the temperature $T$ is less than 1, the softmax function produces a sharper, more peaked probability distribution. This means that the differences between the probabilities of different outcomes are accentuated, making the distribution more "hard".
+   - For example, if $T$ is very low, the softmax function approaches a one-hot encoding where the highest value dominates the probability distribution, and other values have probabilities close to zero.
+
+### Why Use Temperature?
+
+1. **Control Exploration and Exploitation**:
+   - In reinforcement learning and some probabilistic models, the temperature parameter is used to control the trade-off between exploration (trying out less likely actions) and exploitation (choosing the most likely action). A higher temperature promotes exploration, while a lower temperature promotes exploitation.
+
+2. **Adjust Sensitivity**:
+   - Temperature can be used to adjust the sensitivity of the softmax function to the differences in the input values. In neural networks, for example, temperature scaling can be used during inference to control the confidence of the predictions.
+
+3. **Smooth vs. Sharp Decisions**:
+   - When modeling probabilistic decisions, temperature can help in smoothing out the decision boundaries or making them sharper, depending on the application needs.
+
+### Example
+
+Consider the vector $\mathbf{z} = [1.0, 2.0, 3.0]$:
+
+- **Standard Softmax ($T = 1$)**:
+  $\sigma(1.0) = \frac{e^1}{e^1 + e^2 + e^3} \approx 0.090, \quad \sigma(2.0) = \frac{e^2}{e^1 + e^2 + e^3} \approx 0.245, \quad \sigma(3.0) = \frac{e^3}{e^1 + e^2 + e^3} \approx 0.665$
+
+- **High Temperature ($T = 2$)**:
+  $\sigma_T(1.0) = \frac{e^{1/2}}{e^{1/2} + e^{2/2} + e^{3/2}} \approx 0.186, \quad \sigma_T(2.0) = \frac{e^{2/2}}{e^{1/2} + e^{2/2} + e^{3/2}} \approx 0.307, \quad \sigma_T(3.0) = \frac{e^{3/2}}{e^{1/2} + e^{2/2} + e^{3/2}} \approx 0.506$
+
+- **Low Temperature ($T = 0.5$)**:
+  $\sigma_T(1.0) = \frac{e^{1/0.5}}{e^{1/0.5} + e^{2/0.5} + e^{3/0.5}} \approx 0.016, \quad \sigma_T(2.0) = \frac{e^{2/0.5}}{e^{1/0.5} + e^{2/0.5} + e^{3/0.5}} \approx 0.117, \quad \sigma_T(3.0) = \frac{e^{3/0.5}}{e^{1/0.5} + e^{2/0.5} + e^{3/0.5}} \approx 0.867$
+
+As seen, with $T = 2$, the probabilities are more uniform, while with $T = 0.5$, the probabilities are more skewed towards the highest value.
+
+In summary, the temperature parameter in the softmax function is a powerful tool to control the spread of the resulting probability distribution, allowing it to be adjusted from very smooth to very sharp depending on the application requirements.
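+
+A small sketch of the temperature-scaled softmax (a NumPy example using the same vector):
+
+``` python
+import numpy as np
+
+def softmax_t(z, T=1.0):
+    """Softmax with temperature T; the max-shift keeps it numerically stable."""
+    z = np.asarray(z, dtype=float) / T
+    exp_z = np.exp(z - np.max(z))
+    return exp_z / np.sum(exp_z)
+
+z = np.array([1.0, 2.0, 3.0])
+print(softmax_t(z, T=1.0))  # ~[0.090, 0.245, 0.665]  standard softmax
+print(softmax_t(z, T=2.0))  # ~[0.186, 0.307, 0.506]  smoother
+print(softmax_t(z, T=0.5))  # ~[0.016, 0.117, 0.867]  sharper
+```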
\ No newline at end of file diff --git a/Programming Environment/External Linux on SSD.md b/Programming Environment/External Linux on SSD.md new file mode 100644 index 0000000..b5ef1bf --- /dev/null +++ b/Programming Environment/External Linux on SSD.md @@ -0,0 +1,153 @@ +# External Linux on SSD + +## 1. SSD Partition + +Leave out 130 GBs for the system in ext4 + +## 2. Create bootable USB flashdrive + +Download Ubuntu 24.04 image +Use balenaEtcher to write it to a usb stick with 8GB storage + +## 3. Boot from USB + +Restart, enter BIOS by smashing F2 when the ROG logo shows, then change the boot sequence. +In other computers it's possible to disconnect the internal drives to avoid mistakenly overwriting them. +Restart to enter the Ubuntu menu + +## 4. Install Ubuntu + +Choose Ubuntu (Safe Graphics) +Follow the instructions +- Don +Choose `Manual Partition` +- In **most** cases, the internal HD will show up as `sda` and the external as `sdb`, but this is not guaranteed. +- `ESP Partition`: `VFAT`,500 MB +- `swap`: 8 GB, should not exceed computer RAM +- `/`: `ext4`, 60 GB, root partition +- `/boot/efi`: `VFAT`, 1 GB, created automatically +- `/home`: `ext4`, 40 GB, separate for easier system upgrade later +- The rest of space can be allocated for other use + +> [!info] The /home directory +> With all credit to Ubuntu, the home directory is: +> +> - Where your Desktop resides. +> - Where your documents, pictures, music, videos, audio, and pretty much everything else can be stored. +> - Where your application settings are stored, in hidden files and directories (their names start with a . ) that you should not touch unless you know what you are doing. Eg, /home/username/.thunderbird contains your Thunderbird emails and contacts, /home/username/.mozilla contains Firefox bookmarks, passwords, history, plugins… +> - The only place (with the exclusion of removable drives and the /tmp directory) within the Ubuntu file system where a user can freely create/modify/remove files and directories without needing root permissions or the sudo command. + +> [!info] Bootloader +> The /boot/efi partition is for installing the bootloader to boot the system. +> +> Make sure to choose bootloader installation to be on the external SSD. + +## 5. Installing upgrades + +### Recovery mode + +Enter Recovery mode (Advanced options for Ubuntu) and boot in secure mode by choosing the 1st option. + + + +### White screen error + +It may enter a white screen with 'Oh no .... contact system administrator' message. This is caused by gdm3 and gnome desktop. + +Boot through recovery mode, then use the terminal: +Or press `Ctrl+Alt+F3`, and use the terminal: + +working solution: +[source](https://askubuntu.com/questions/1050672/gdm3-does-not-start-in-ubuntu-18-04) +``` sh +sudo apt install ubuntu-gnome-desktop +sudo apt install gnome-shell gnome +sudo systemctl restart gdm3 +sudo reboot +``` + +or try this: +``` sh +sudo apt purge gdm3 +sudo reboot +``` + +When the computer boots again, only terminal is available. 
Enter + +``` +sudo apt install gdm3 +sudo service gdm start +``` + +``` sh +sudo apt-get clean && sudo apt-get autoremove && sudo reboot +``` + +Or try lightdm, did not work for me + +``` sh +sudo apt install lightdm +sudo dpkg-reconfigure lightdm +``` + +Or reinstall Nvidia driver +### Useful commands + +Display system info + +``` sh +inxi -G +``` + +Find package installed + +``` +dpkg -l | grep 'nvidia' +``` + +### Failed to start Nvidia persistence daemon + +First check + +``` sh +systemctl status nvidia-persistenced.service +``` + +If it runs don't worry about it + +If it does not: +[Source](https://ubuntu.com/server/docs/nvidia-drivers-installation) +``` sh +sudo apt purge nvidia-* +sudo apt --purge remove '*nvidia*${DRIVER_BRANCH}*' +sudo apt autoremove +``` +like `'*nvidia*535*'` + +Then install the driver + +``` sh +sudo apt install nvidia-driver-535 nvidia-dkms-535 +``` + +or + +``` sh +sudo ubuntu-drivers install +``` + +Check installation + +``` sh +nvidia-smi +``` + +Then install the Nvidia settings app + +``` sh +sudo apt install nvidia-settings +``` + + + + diff --git a/Programming Environment/Ubuntu 2204.md b/Programming Environment/Ubuntu 2204.md new file mode 100644 index 0000000..f23184f --- /dev/null +++ b/Programming Environment/Ubuntu 2204.md @@ -0,0 +1,232 @@ +# Ubuntu 22.04 + +## Install with Manual Partition + +https://www.youtube.com/watch?v=Wc4GjV_Ahb8 + +Installing 3rd party drivers (Nvidia) require a password. Make it and remember it: 12345678 +On reboot after installation. Choose enroll MOK. Then follow the instructions to enter your password. [Watch this](https://www.youtube.com/watch?v=A0gxy3xaJlE) + +Manual partition +- EFI partition: 200 MB sdb1 +- Root: 60 GB sdb2 +- Home: 40 GB sdb3 +- Swap: 17 GB (depend on RAM) sdb4 + +Separating home partition makes upgrading system or using multiple systems easier. +Why the size of EFI partition? [here](https://askubuntu.com/questions/670778/why-must-an-efi-partition-be-at-least-100mb) + +## Fix Grub Location + +> my fix is simply to use gparted to remove the boot flag before starting the +> installer, and the setting it back when the install is finished. It always +> works for me. 
+ +How to remove wrong grub and install correct one [here](https://www.reddit.com/r/Ubuntu/comments/nd2qzw/grub_installed_to_wrong_drive_bug/) +Install boot repair tool https://pimylifeup.com/ubuntu-boot-repair-tool/ + +## Change GRUB settings + +Allows you to see more detailed logs during boot + +Edit `/etc/default/grub` + +Change + +``` +GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" +``` + +to + +``` +GRUB_CMDLINE_LINUX_DEFAULT="" +``` + +and set + +``` +GRUB_TIMEOUT=10 +``` + +Finally + +``` +sudo update-grub +``` + +## Samba + +For smb file sharing, install samba + +``` sh +sudo apt update +sudo install samba +``` + +``` sh +whereis samba +``` + +[Set up sharing](https://ubuntu.com/tutorials/install-and-configure-samba#3-setting-up-samba) + +## V2ray + +Core: + +First download the core file at `https://github.com/v2fly/v2ray-core/releases/download/v5.16.1/v2ray-linux-64.zip` + +Send it to the system via smb + +Then in the smbshare directory +``` sh +curl -L https://raw.githubusercontent.com/v2fly/fhs-install-v2ray/master/install-release.sh > go.sh +``` + +``` sh +sudo bash ./go.sh --local ./v2ray-linux-64.zip +``` + +``` sh +systemctl start v2ray +systemctl enable v2ray +``` + +v2rayA: + +[Install and set auto-start](https://v2raya.org/en/docs/prologue/installation/debian/) + +``` sh +sudo apt install /installer_path +``` + +``` sh +sudo systemctl start v2raya.service +sudo systemctl enable v2raya.service +``` + +### V2rayA usage + +1. Choose up to 4 proxy servers +2. Set On: Proxt except CN Sites +3. Turn on at the top left +4. No need to configure bash again + +### Disable + +``` sh +sudo systemctl disable v2raya.service +sudo systemctl disable v2ray +sudo systemctl status v2ray +``` + +## Update and Upgrade + + +``` sh +sudo apt update && sudo apt upgrade +``` + +``` sh +sudo apt clean && sudo apt autoremove && sudo reboot +``` + +## rustdesk + +Desktop sharing app + +## Git + +``` sh +sudo apt install git +``` + +## tlp + +Power management + +``` sh +sudo apt install tlp +sudo tlp start +tlp-stat -s +tlp-stat -b +``` + +Edit charge level: + +``` sh +sudo vim /etc/tlp.conf +``` + +Change + +``` +START_CHARGE_THRESH_BAT0=50 +STOP_CHARGE_THRESH_BAT0=60 +``` + +## Time discrepancy + +https://www.howtogeek.com/323390/how-to-fix-windows-and-linux-showing-different-times-when-dual-booting/#:~:text=By%20default%2C%20Windows%20assumes%20the,make%20Windows%20use%20UTC%20time. +## CUDA + +[Source](https://qiyuan-z.github.io/2022/01/04/Ubuntu%E5%A4%9A%E7%89%88%E6%9C%ACcuda%E5%AE%89%E8%A3%85%E4%B8%8E%E5%88%87%E6%8D%A2/) + +## Zero Tier One + +https://www.zerotier.com/download/#entry-4 +Simulate LAN network for devices in different networks + +MacOS: package installer +Linux: command line install + +### Set-up + +Go to my.zerotier.com +Make account +Create a network +Add members and authorize: [how](https://docs.zerotier.com/cli) +Turn off auto-assign IP +Add IPs within the managed routes + +## ssh server + +To ssh to Linux, Linux should be set up as a server. The computer that starts the ssh should have the ssh client. +https://ubuntu.com/server/docs/openssh-server + +Use `ssh-keygen -t rsa` to generate a pair of public and private keys. The public key should be stored on the server you're trying to access. + +Use `ssh-copy-id username@ip` to add your public key to the host's authorized_keys + +## SFTP server + +Don't need to do anything extra. 
But for creating a user and assign to a group specifically for sftp, read below: +https://www.cybrosys.com/blog/how-to-setup-sftp-server-on-ubuntu-20-04 + +## conda + +Install [conda](conda.md) on Ubuntu [here](https://docs.anaconda.com/miniconda/) + +Install for multiple users [here](https://askubuntu.com/questions/1457726/how-and-where-to-install-conda-to-be-accessible-to-all-users). I used the group name `miniconda` + +## jupyter lab + +Install [jupyter](jupyter.md) lab on Ubuntu in [conda](conda.md). + +Also installed jupyverse and set autostart. See [jupyter](jupyter.md) note. + +## Turn off stupid sound + +https://ubuntuhandbook.org/index.php/2023/01/disable-event-sound-ubuntu/ + +## Filling up EFI System Partition + +https://askubuntu.com/questions/1410236/how-can-i-expand-the-boot-efi-partition-on-ubuntu-20-04 + +## The initramfs will attempt to resume from /dev/sdb4 + +https://askubuntu.com/questions/1189835/the-initramfs-will-attempt-to-resume-from-dev-dm-1 + +## Change Ubuntu mirror + +Go into Software & Updates, choose new server location. \ No newline at end of file diff --git a/Programming Environment/bash.md b/Programming Environment/bash.md index 6d1d102..d8cf737 100644 --- a/Programming Environment/bash.md +++ b/Programming Environment/bash.md @@ -16,4 +16,14 @@ In this specific case, the `find` command is searching for files in the director ## Setting proxy -https://askubuntu.com/questions/583797/how-to-set-a-proxy-for-terminal \ No newline at end of file +https://askubuntu.com/questions/583797/how-to-set-a-proxy-for-terminal + +## Remove last line + +```bash +sed -i '$ d' foo.txt +``` + +```bash +sed -i '' -e '$ d' foo.txt +``` diff --git a/Programming Environment/conda.md b/Programming Environment/conda.md index bada8c4..ae30146 100644 --- a/Programming Environment/conda.md +++ b/Programming Environment/conda.md @@ -1,6 +1,7 @@ # Conda > [Python](Python.md) package and environment manager +> [Common commands](https://developer.aliyun.com/article/1267271) ## Add conda to PATH on Windows @@ -22,6 +23,15 @@ `pip install pysocks` +## Add channels + +``` sh +conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ +conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ +conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge +conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/ +conda config --show channels +``` ## Create environment `conda create -n "name" pip` @@ -48,4 +58,52 @@ ## Use with pip `conda install pip` -Then do other pip installations \ No newline at end of file +Then do other pip installations + +## Create requirements.txt + +``` sh +conda list -e > requirements.txt +``` + +Use + +``` sh +conda install --file requirements.txt +``` + +## conda clean + +clean cache and unused package. 
+resolves `ValueError: unsupported format character 'T' (0x54) at index 1787` error + +``` sh +conda clean --all -y +``` + +## conda build + +Build R packages from github + +### Install conda build + +``` sh +conda activate base +conda install conda-build +``` + +### Build skeleton + +``` sh +conda skeleton cran https://github.com/cole-trapnell-lab/cicero-release +``` + +### Build package + +``` sh +conda-build r-fansi +``` +### Convert to other platforms + +https://docs.conda.io/projects/conda-build/en/latest/user-guide/tutorials/build-pkgs.html#converting-a-package-for-use-on-all-platforms + diff --git a/Programming Environment/jupyter.md b/Programming Environment/jupyter.md index 57b3553..3600c4f 100644 --- a/Programming Environment/jupyter.md +++ b/Programming Environment/jupyter.md @@ -6,7 +6,82 @@ Meaning: ``` bash conda activate selenium -pip install ipykernel +conda install ipykernel ipython kernel install --user --name=selenium conda deactivate +``` + +## irkernel + +[irkernel](irkernel.md) + +## nb_conda_kernels + +https://github.com/anaconda/nb_conda_kernels +Easily use all [R](R.md) and [Python](Python.md) [conda](conda.md) environments as kernels. + +## jupyverse + +Used to start jupyter lab and access it remotely. + +Single user use: +https://davidbrochart.github.io/jupyverse/usage/single_user/ + +``` sh +jupyverse --set auth.mode=noauth +``` + +## autostart + +https://medium.com/analytics-vidhya/auto-start-jupyter-lab-on-machine-boot-e4f6b3296034 +https://medium.com/@datamove/setup-jupyter-notebook-server-to-start-up-on-boot-and-use-different-conda-environments-147b091b9a5f + +## Jupyter lab configurations + +### starting folder + +First run + +``` sh +jupyter lab --generate-config +``` + +It will tell you where the config file is. Go open it. For it is at `~/.jupyter` + +Find `c.ServerApp.root_dir` and set it to a directory of your choice. + +Make sure to uncomment the line! + +### Starting port + +Similar as above, but change `c.ServerApp.port = 8080` or whatever number you like + +### ip listened + +`c.ServerApp.ip` defaults to localhost. Since I use port forwarding to forward my port to localhost on Ubuntu, it is ok I leave it at localhost. Otherwise change to 0.0.0.0 for remote access. + + +## Shutdown + +Find all running servers +``` sh +jupyter lab list +``` + +Ctrl + C or +``` sh +jupyter lab stop +``` + +## Remove kernel + +``` sh +jupyter kernelspec list +jupyter kernelspec uninstall unwanted-kernel +``` + +## Run jupyter lab + +``` sh +jupyter lab --port=8080 --no-browser ``` \ No newline at end of file diff --git a/Python/Python Magic Methods.md b/Python/Python Magic Methods.md index 5116c06..f401db3 100644 --- a/Python/Python Magic Methods.md +++ b/Python/Python Magic Methods.md @@ -1,3 +1,4 @@ # Magic Methods -> [Python](Python.md) \ No newline at end of file +> [Python](Python.md) + diff --git a/Python/celloracle.md b/Python/celloracle.md new file mode 100644 index 0000000..622f461 --- /dev/null +++ b/Python/celloracle.md @@ -0,0 +1,8 @@ +# celloracle + +## fa2 + +Used for plotting force directed graphs + +https://anaconda.org/conda-forge/fa2_modified works for later versions of python > 3.8. 
See [here](https://github.com/bhargavchippada/forceatlas2/issues/48) +python < 3.8 just do `pip install fa2` \ No newline at end of file diff --git a/Python/ipykernel.md b/Python/ipykernel.md new file mode 100644 index 0000000..41b60f4 --- /dev/null +++ b/Python/ipykernel.md @@ -0,0 +1,6 @@ +# ipykernel + +Install in [conda](conda.md) environment to use it in [jupyter](jupyter.md) notebook + +Should be detected automatically, if not, install kernel manually [here](https://ipython.readthedocs.io/en/stable/install/kernel_install.html) + diff --git a/Python/pip.md b/Python/pip.md new file mode 100644 index 0000000..ef17549 --- /dev/null +++ b/Python/pip.md @@ -0,0 +1,5 @@ +# pip + +Python package manager +[conda](conda.md) +[Common commands](https://developer.aliyun.com/article/1267271) \ No newline at end of file diff --git a/Python/scanpy.md b/Python/scanpy.md new file mode 100644 index 0000000..56e4f8e --- /dev/null +++ b/Python/scanpy.md @@ -0,0 +1,3 @@ +# scanpy + +Cluster tutorial, basic introduction: https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html \ No newline at end of file diff --git a/R/R.md b/R/R.md index a2744a7..c2ba1f8 100644 --- a/R/R.md +++ b/R/R.md @@ -1,3 +1,61 @@ # The R language -All R packages links to here. \ No newline at end of file +All R packages links to here. + +## Where is R + +``` r +R.home() +``` + +## Where are packages + +``` r +.libPaths() +``` + +## Can't find a library + +Sometimes R can't find a required library, like openssl. Download the source code and use this to install: + +``` sh +R CMD INSTALL --configure-vars='INCLUDE_DIR=/opt/homebrew/Caskroom/miniconda/base/pkgs/openssl-3.2.1-h0d3ecfb_0/include LIB_DIR=/opt/homebrew/Caskroom/miniconda/base/pkgs/openssl-3.2.1-h0d3ecfb_0/lib' openssl_2.2.0.tar.gz +``` + +First find where the library is installed with: + +``` sh +which openssl +``` + +or + +``` sh +whereis openssl +``` + +or find where [conda](conda.md) environment installs the library + +## Look in directory + +``` r +tempdir() +``` + +``` r +list.files() +``` + +## R and R packages from conda + +https://medium.com/@tortuecookie/using-r-with-conda-80953395bfe6 + +Some new packages are not there. See below: +### Install github r package in conda + +https://stackoverflow.com/questions/52061664/install-r-package-from-github-using-conda + +``` sh +conda skeleton cran +conda build --R= r- +``` \ No newline at end of file diff --git a/R/Ubuntu R.md b/R/Ubuntu R.md new file mode 100644 index 0000000..38dabc5 --- /dev/null +++ b/R/Ubuntu R.md @@ -0,0 +1,11 @@ +# Install R on Ubuntu + +https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-22-04 + +## Install packages for system + +``` sh +sudo -i R +``` + +Somehow installed packages are not found in projects. Just install them in projects. 
\ No newline at end of file diff --git a/R/cicero.md b/R/cicero.md new file mode 100644 index 0000000..9dd608b --- /dev/null +++ b/R/cicero.md @@ -0,0 +1,3 @@ +# Cicero + +https://cole-trapnell-lab.github.io/cicero-release/docs_m3/ \ No newline at end of file diff --git a/R/irkernel.md b/R/irkernel.md new file mode 100644 index 0000000..51748c0 --- /dev/null +++ b/R/irkernel.md @@ -0,0 +1,14 @@ +# irkernel + +Use R in Jupyter notebook + +https://github.com/IRkernel/IRkernel +[jupyter](jupyter.md) +[R](R.md) + +## install to a specific name + +``` r +IRkernel::installspec(name = 'ir33', displayname = 'R 3.3') +``` + diff --git a/R/monocle3.md b/R/monocle3.md new file mode 100644 index 0000000..e0f22e4 --- /dev/null +++ b/R/monocle3.md @@ -0,0 +1,3 @@ +# Monocle3 + +https://cole-trapnell-lab.github.io/monocle3/docs/installation/ \ No newline at end of file diff --git a/R/openssl.md b/R/openssl.md new file mode 100644 index 0000000..a1d72f9 --- /dev/null +++ b/R/openssl.md @@ -0,0 +1,19 @@ +# openssl + +## bug + +Multiple versions installed + +homebrew ver +/opt/homebrew/Cellar/openssl@3/3.3.1/lib + +miniconda ver +/opt/homebrew/Caskroom/miniconda/base/pkgs/openssl-3.2.1-h0d3ecfb_0/lib + +Encounter error when installing openssl package: + +Error: package or namespace load failed for ‘openssl’ in dyn.load(file, DLLpath = DLLpath, ...): + unable to load shared object '/Users/wangchuyao/R Projects/grnAnalysis/renv/library/R-4.2/x86_64-apple-darwin17.0/00LOCK-openssl/00new/openssl/libs/openssl.so': + dlopen(/Users/wangchuyao/R Projects/grnAnalysis/renv/library/R-4.2/x86_64-apple-darwin17.0/00LOCK-openssl/00new/openssl/libs/openssl.so, 0x0006): tried: '/Users/wangchuyao/R Projects/grnAnalysis/renv/library/R-4.2/x86_64-apple-darwin17.0/00LOCK-openssl/00new/openssl/libs/openssl.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/wangchuyao/R Projects/grnAnalysis/renv/library/R-4.2/x86_64-apple-darwin17.0/00LOCK-openssl/00new/openssl/libs/openssl.so' (no such file), '/Users/wangchuyao/R Projects/grnAnalysis/renv/library/R-4.2/x86_64-apple-darwin17.0/00LOCK-openssl/00new/openssl/libs/openssl.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64')) + +**Cause**: complier complies for arm64, but my r is x86. Can's solve without breaking the other renv environment. \ No newline at end of file diff --git a/note project.md b/note project.md new file mode 100644 index 0000000..d366d02 --- /dev/null +++ b/note project.md @@ -0,0 +1,23 @@ +# notes + +## Genome coordinate file + +aka. "chromosome size files" or "chromosome length files," needed for cicero to predict cis-regulatory relationships in scATAC data. + +Genome coordinate file [here](https://hgdownload.soe.ucsc.edu/downloads.html#human) +Find your reference genome. +Get the chrom.sizes file + +## scRNA + +https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html pbmc3k example +https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/03_scRNA-seq_data_preprocessing/scanpy_preprocessing_with_Paul_etal_2015_data.ipynb scRNA notebook + +## pseudotime + + + +## baseGRN + +done locally on my [Ubuntu 2204](Ubuntu%202204.md) +