Skip to content
This repository has been archived by the owner on Aug 25, 2019. It is now read-only.

2017.12.12

DMTR edited this page Dec 12, 2017 · 6 revisions

Good morning! Today I will try to finish the script for Piano.

Progress (written as it goes, so bear with me):

  • Started writing the script for Piano (using my limited R skills - though I feel like I'm getting better, which is good).
  • There are some complications though:
    • I found out that there are duplicates regarding the gene symbols after the processed DESeq2 files had been translated from Ensembl gene ID to gene symbols.
    • This is problematic because then it is not simple to load them as a dataframe in R, because it cannot have more than one row with the same row names.
    • I have checked their (most importantly) p-values and they seem to be the same, and since Piano requires only the p-value for each gene, I discarded all the duplicates and just retain one copy of them.
    • Another error is that there are rows with 'NA' values, as these cannot be proccesed anyway, they were also discarded.
    • Additionally, for some unknown reason, I keep getting the error message which I suspect because some gene symbols have extra characters that are not in the gene set files. I will try to remove these: from xx6jas.12 to xx6jas.
gsaRes <- runGSA(HS_stat, gsc=go, verbose=TRUE)
Running gene set analysis:
Checking arguments...done!
Final gene/gene-set association: 0 genes and 0 gene sets
  Details:
  Calculating gene set statistics from 0 out of 16076 gene-level statistics
  Using all 16076 gene-level statistics for significance estimation
  Removed 15578 genes from GSC due to lack of matching gene statistics
  Removed 4436 gene sets containing no genes after gene removal
  Removed additionally 0 gene sets not matching the size limits
  Loaded additional information for 0 gene sets
Calculating gene set statistics...
 Show Traceback
**Error in gsc[[iGeneSet]] : subscript out of bounds**
  • Ok, now it finally works. Lesson of the day: do not blindly follow the instruction manual. They are already the gene symbols, it was superfluous to insert the string "gene" preceding the actual gene symbol. Another thing that I found was unnecessary was removing the extra bits in the gene name.
names(HS_stat) <- paste("gene", HS$gene_symbol, sep="") # Nope!
names(HS_stat) <- paste(HS$gene_symbol) # Ja!
  • Now that the script has been tested to work on one DESeq2 output, I need to modify it to make it more 'modular', i.e. just requires the input of the processed DESeq2 file path instead of hardcoding everything. I'm not sure if I can do this with my limited knowledge in R, but since time is of the essence (the group is meeting tomorrow to discuss all the results!), I will hardcode it anyway. If I have the time, then I will try to make it work (maybe by making it into a function?).
  • In the meantime, I will convert this to an Rscript and send it to Uppmax for faster processing before taking my lunch break.

  • I got the results, but they lack the directionality of the pathway changes. After asking Arif, he asked me to check whether I have put the log2FoldChange and some other parameters in the function - which I had not. So here we go again.
gsaRes <- runGSA(HS_stat, fc, gsc=kegg, geneSetStat="reporter",
                signifMethod="nullDist", nPerm=1000, verbose=TRUE)
  • Got the results. Now going to fix my script for Limma. It was weird to get a adjusted p-val in the magnitude of 10-68 (otherwise would be happy), but in this case I realised that I forgot to take into account the biological replicates within the cohort.
Clone this wiki locally