Skip to content

Data Analysis

zhengxwen edited this page Dec 20, 2014 · 9 revisions

Data Analysis

We developed gdsfmt and SNPRelate (high-performance computing R packages for multi-core symmetric multiprocessing computer architectures) to accelerate two key computations in GWAS: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures.

> # Open the GDS file
> genofile <- snpgdsOpen(snpgdsExampleFileName())

> # Get population information
> #   or pop_code <- scan("pop.txt", what=character())
> #   if it is stored in a text file "pop.txt"
> pop_code <- read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group"))
> # Display the first six values
> head(pop_code)
[1] "YRI" "YRI" "YRI" "YRI" "CEU" "CEU"

1. LD-based SNP pruning

It is suggested to use a pruned set of SNPs which are in approximate linkage equilibrium with each other to avoid the strong influence of SNP clusters in principal component analysis and relatedness analysis.

> set.seed(1000)
> # Try different LD thresholds for sensitivity analysis
> snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2)
SNP pruning based on LD:
Removing 365 non-autosomal SNPs.
Removing 1 SNPs (monomorphic, < MAF, or > missing rate)
Working space: 279 samples, 8722 SNPs
	Using 1 (CPU) core
	Sliding window: 500000 basepairs, Inf SNPs
	|LD| threshold: 0.2
Chromosome 1: 75.42%, 540/716
Chromosome 2: 72.24%, 536/742
Chromosome 3: 74.71%, 455/609
Chromosome 4: 73.31%, 412/562
Chromosome 5: 77.03%, 436/566
Chromosome 6: 75.58%, 427/565
Chromosome 7: 75.42%, 356/472
Chromosome 8: 71.31%, 348/488
Chromosome 9: 77.88%, 324/416
Chromosome 10: 74.33%, 359/483
Chromosome 11: 77.40%, 346/447
Chromosome 12: 76.81%, 328/427
Chromosome 13: 75.58%, 260/344
Chromosome 14: 76.95%, 217/282
Chromosome 15: 76.34%, 200/262
Chromosome 16: 72.66%, 202/278
Chromosome 17: 74.40%, 154/207
Chromosome 18: 73.68%, 196/266
Chromosome 19: 85.00%, 102/120
Chromosome 20: 71.62%, 164/229
Chromosome 21: 76.98%, 97/126
Chromosome 22: 75.86%, 88/116
6547 SNPs are selected in total.

> names(snpset)
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22"
> head(snpset$chr1)  # snp.id
[1]  1  2  4  5  7 10
> # Get all selected snp id
> snpset.id <- unlist(snpset)

2. Principal Component Analysis

The functions in SNPRelate for PCA include calculating the genetic covariance matrix from genotypes, computing the correlation coefficients between sample loadings and genotypes for each SNP, calculating SNP eigenvectors (loadings), and estimating the sample loadings of a new dataset from specified SNP eigenvectors.

> # Run PCA
> pca <- snpgdsPCA(genofile)
Principal Component Analysis (PCA) on SNP genotypes:
Removing 365 non-autosomal SNPs.
Removing 1 SNPs (monomorphic, < MAF, or > missing rate)
Working space: 279 samples, 8722 SNPs
	Using 1 (CPU) core
PCA:	the sum of all working genotypes (0, 1 and 2) = 2446510
PCA:	Thu Jul 17 17:48:04 2014	0%
PCA:	Thu Jul 17 17:48:04 2014	100%
PCA:	Thu Jul 17 17:48:04 2014	Begin (eigenvalues and eigenvectors)
PCA:	Thu Jul 17 17:48:04 2014	End (eigenvalues and eigenvectors)

The code below shows how to calculate the percent of variation is accounted for by the principal component for the first 16 PCs. It is clear to see the first two eigenvectors hold the largest percentage of variance among the population, although the total variance accounted for is still less the one-quarter of the total.

> pc.percent <- 100 * pca$eigenval[1:16]/sum(pca$eigenval)
> pc.percent
 [1] 12.2251756  5.8397767  1.0111423  0.9475002  0.8437720  0.7363588
 [7]  0.7350545  0.7243414  0.7200081  0.7047742  0.7033253  0.6943755
[13]  0.6847090  0.6780401  0.6720635  0.6692861

In the case of no prior population information,

> # make a data.frame
> tab <- data.frame(sample.id = pca$sample.id,
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    stringsAsFactors = FALSE)
> head(tab)
  sample.id         EV1         EV2
1   NA19152  0.08411287  0.01226860
2   NA19139  0.08360644  0.01085849
3   NA18912  0.08110808  0.01184524
4   NA19160  0.08680864  0.01447106
5   NA07034 -0.03109761 -0.07709255
6   NA07055 -0.03228450 -0.08155730

> # Draw
> plot(tab$EV2, tab$EV1, xlab="eigenvector 2", ylab="eigenvector 1")
```R
![](http://corearray.sourceforge.net/images/SNPRelateTutorial-022.png)

If there are population information,
```R
> # Get sample id
> sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
> # Get population information
> #   or pop_code <- scan("pop.txt", what=character())
> #   if it is stored in a text file "pop.txt"
> pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group"))
> # assume the order of sample IDs is as the same as population codes
> head(cbind(sample.id, pop_code))
     sample.id pop_code
[1,] "NA19152" "YRI"   
[2,] "NA19139" "YRI"   
[3,] "NA18912" "YRI"   
[4,] "NA19160" "YRI"   
[5,] "NA07034" "CEU"   
[6,] "NA07055" "CEU"   

> # Make a data.frame
> tab <- data.frame(sample.id = pca$sample.id,
    pop = factor(pop_code)[match(pca$sample.id, sample.id)],
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    stringsAsFactors = FALSE)
> head(tab)
  sample.id pop         EV1         EV2
1   NA19152 YRI  0.08411287  0.01226860
2   NA19139 YRI  0.08360644  0.01085849
3   NA18912 YRI  0.08110808  0.01184524
4   NA19160 YRI  0.08680864  0.01447106
5   NA07034 CEU -0.03109761 -0.07709255
6   NA07055 CEU -0.03228450 -0.08155730

> # Draw
> plot(tab$EV2, tab$EV1, col=as.integer(tab$pop),
    xlab="eigenvector 2", ylab="eigenvector 1")
> legend("topleft", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))

Plot the principal component pairs for the first four PCs:

> lbls <- paste("PC", 1:4, "\n", format(pc.percent[1:4], digits=2), "%", sep="")
> pairs(pca$eigenvect[,1:4], col=tab$pop, labels=lbls)

To calculate the SNP correlations between eigenvactors and SNP genotypes:

> # Get chromosome index
> chr <- read.gdsn(index.gdsn(genofile, "snp.chromosome"))
> CORR <- snpgdsPCACorr(pca, genofile, eig.which=1:4)
SNP correlations:
Working space: 279 samples, 9088 SNPs
	Using 1 CPU core.
	Using the top 32 eigenvectors.
SNP Correlation:	the sum of all working genotypes (0, 1 and 2) = 2553065
SNP Correlation:	Thu Jul 17 17:48:05 2014	0%
SNP Correlation:	Thu Jul 17 17:48:05 2014	100%

> par( mfrow=c(3,1))
> for (i in 1:3)
  {
      plot(abs(CORR$snpcorr[i,]), ylim=c(0,1), xlab="SNP Index",
          ylab=paste("PC", i), col=chr, pch="+")
  }

3. Relatedness Analysis

For relatedness analysis, identity-by-descent (IBD) estimation in SNPRelate can be done by either the method of moments (MoM) (Purcell et al., 2007) or maximum likelihood estimation (MLE) (Milligan, 2003; Choi et al., 2009). Although MLE estimates are more reliable than MoM, MLE is significantly more computationally intensive. For both of these methods it is preffered to use a LD pruned SNP set.

> # YRI samples
> sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
> YRI.id <- sample.id[pop_code == "YRI"]

1. Estimating IBD Using PLINK method of moments (MoM)

> # Estimate IBD coefficients
> ibd <- snpgdsIBDMoM(genofile, sample.id=YRI.id, snp.id=snpset.id,
    maf=0.05, missing.rate=0.05)
IBD analysis (PLINK method of moment) on SNP genotypes:
Removing 1285 SNPs (monomorphic, < MAF, or > missing rate)
Working space: 93 samples, 5262 SNPs
	Using 1 (CPU) core
PLINK IBD:	the sum of all working genotypes (0, 1 and 2) = 484520
PLINK IBD:	Thu Jul 17 17:48:26 2014	0%
PLINK IBD:	Thu Jul 17 17:48:26 2014	100%

> # Make a data.frame
> ibd.coeff <- snpgdsIBDSelection(ibd)
> head(ibd.coeff)
      ID1     ID2        k0         k1     kinship
1 NA19152 NA19139 0.9548539 0.04514610 0.011286524
2 NA19152 NA18912 1.0000000 0.00000000 0.000000000
3 NA19152 NA19160 1.0000000 0.00000000 0.000000000
4 NA19152 NA18515 0.9234541 0.07654590 0.019136475
5 NA19152 NA19222 1.0000000 0.00000000 0.000000000
6 NA19152 NA18508 0.9833803 0.01661969 0.004154922
> plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1),
    xlab="k0", ylab="k1", main="YRI samples (MoM)")
> lines(c(0,1), c(1,0), col="red", lty=2)

2. Estimating IBD Using Maximum Likelihood Estimation (MLE)

> # Estimate IBD coefficients
> set.seed(1000)
> snp.id <- sample(snpset.id, 5000)  # random 5000 SNPs
> ibd <- snpgdsIBDMLE(genofile, sample.id=YRI.id, snp.id=snp.id,
    maf=0.05, missing.rate=0.05)

> # Make a data.frame
> ibd.coeff <- snpgdsIBDSelection(ibd)

> plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1),
    xlab="k0", ylab="k1", main="YRI samples (MLE)")
> lines(c(0,1), c(1,0), col="red", lty=2)

3. Relationship inference Using KING method of moments

Within- and between-family relationship could be inferred by the KING-robust method in the presence of population stratification.

> # Incorporate with pedigree information
> family.id <- read.gdsn(index.gdsn(genofile, "sample.annot/family.id"))
> family.id <- family.id[match(YRI.id, sample.id)]
> table(family.id)
family.id
101 105 112 117  12  13  16  17  18  23  24  28   4  40  42  43  45  47  48   5 
  3   3   3   4   4   3   3   3   3   3   4   3   3   3   3   3   3   3   3   3 
 50  51  56  58  60  71  72  74  77   9 
  3   3   3   3   3   3   3   3   3   3 

> ibd.robust <- snpgdsIBDKING(genofile, sample.id=YRI.id, family.id=family.id)
IBD analysis (KING method of moment) on SNP genotypes:
Removing 365 non-autosomal SNPs.
Removing 563 SNPs (monomorphic, < MAF, or > missing rate)
Working space: 93 samples, 8160 SNPs
	Using 1 (CPU) core
# of families: 30, and within- and between-family relationship are estimated differently.
Working space: 93 samples, 8160 SNPs
	Using 1 CPU core.
Relationship inference in the presence of population stratification.
KING IBD:	the sum of all working genotypes (0, 1 and 2) = 755648
KING IBD:	Thu Jul 17 17:48:26 2014	0%
KING IBD:	Thu Jul 17 17:48:26 2014	100%

> names(ibd.robust)
[1] "sample.id" "snp.id"    "afreq"     "IBS0"      "kinship"  

> # Pairs of individuals
> dat <- snpgdsIBDSelection(ibd.robust)
> head(dat)
      ID1     ID2       IBS0      kinship
1 NA19152 NA19139 0.05504926 -0.005516960
2 NA19152 NA18912 0.05738916 -0.003658537
3 NA19152 NA19160 0.06230760 -0.034086156
4 NA19152 NA18515 0.05602758  0.007874016
5 NA19152 NA19222 0.05923645 -0.012668574
6 NA19152 NA18508 0.05561722  0.002216848
> plot(dat$IBS0, dat$kinship, xlab="Proportion of Zero IBS",
    ylab="Estimated Kinship Coefficient (KING-robust)")

4. Identity-By-State Analysis

For the n individuals in a sample, snpgdsIBS can be used to create a n-by-n matrix of genome-wide average IBS pairwise identities:

> ibs <- snpgdsIBS(genofile, num.thread=2)
Identity-By-State (IBS) analysis on SNP genotypes:
Removing 365 non-autosomal SNPs.
Removing 1 SNPs (monomorphic, < MAF, or > missing rate)
Working space: 279 samples, 8722 SNPs
	Using 2 (CPU) cores
IBS:	the sum of all working genotypes (0, 1 and 2) = 2446510
IBS:	Thu Jul 17 17:48:26 2014	0%
IBS:	Thu Jul 17 17:48:26 2014	100%

The heat map is shown:

> library(lattice)
> L <- order(pop_code)
> levelplot(ibs$ibs[L, L], col.regions = terrain.colors)

To perform multidimensional scaling analysis on the n-by-n matrix of genome-wide IBS pairwise distances:

> loc <- cmdscale(1 - ibs$ibs, k = 2)
> x <- loc[, 1]; y <- loc[, 2]
> race <- as.factor(pop_code)

> plot(x, y, col=race, xlab = "", ylab = "",
    main = "Multidimensional Scaling Analysis (IBS Distance)")
> legend("topleft", legend=levels(race), text.col=1:nlevels(race))

To perform cluster analysis on the n-by-n matrix of genome-wide IBS pairwise distances, and determine the groups by a permutation score:

> set.seed(100)
> ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2))
Identity-By-State (IBS) analysis on SNP genotypes:
Removing 365 non-autosomal SNPs.
Removing 1 SNPs (monomorphic, < MAF, or > missing rate)
Working space: 279 samples, 8722 SNPs
	Using 2 (CPU) cores
IBS:	the sum of all working genotypes (0, 1 and 2) = 2446510
IBS:	Thu Jul 17 17:48:28 2014	0%
IBS:	Thu Jul 17 17:48:28 2014	100%

> # Determine groups of individuals automatically
> rv <- snpgdsCutTree(ibs.hc)
Determine groups by permutation (Z threshold: 15, outlier threshold: 5):
Create 3 groups.

> plot(rv$dendrogram, leaflab="none", main="HapMap Phase II")
> table(rv$samp.group)
G001 G002 G003 
  93   94   92 

Here is the population information we have known:

> # Determine groups of individuals by population information
> rv2 <- snpgdsCutTree(ibs.hc, samp.group=as.factor(pop_code))
Create 4 groups.

> plot(rv2$dendrogram, leaflab="none", main="HapMap Phase II")
> legend("topright", legend=levels(race), col=1:nlevels(race), pch=19, ncol=4)

> # Close the GDS file
> snpgdsClose(genofile)