Genome-wide association study

Introduction

Welcome to the lab practice. In today’s work, we will work on the human data that passed our quality control yesterday. There will be three sections in today’s practice:

A simple association study using R
A simple association study using PLINK
[ADVANCED] GWAS linear mixed model using R

The data set we will be using today is accessible through my Google Drive storage. https://drive.google.com/drive/folders/1LIuCaFqfURj0-yfculFcmHlxNeC1_i7m?usp=share_link

A simple GWAS using R

Here’s an example of conducting a chi-square test in R (the example in the lecture note). You will be conducting a GWAS test utilizing the chi-square method.

exampleData = data.frame(
    case = c(20, 20, 60),
    control = c(50, 30, 20)
)

result = chisq.test(exampleData)
print(result)


    Pearson's Chi-squared test

data:  exampleData
X-squared = 34.857, df = 2, p-value = 2.697e-08

Let’s start the GWAS test. First, read three data with human_all prefix binary data using read_plink() function in genio package.

library(genio)
### YOUR CODE HERE ###

A test run on the first locus

Step 1. Generate a data frame with only two columns. Remove samples with missing phenotypes or genotypes.

Phenotype: extract phenotype information from the fam data frame (1, 2 indicate this is a case/control study).
Genotype: extract genotype information from the X matrix (the first row). Remove samples which missing genotype information.

### YOUR CODE HERE ###

Step 2. Please use the function table() to convert the data frame into a contingency table. (HINT: ?table)

Note: the table() function will automatically remove rows with missing values.

### YOUR CODE HERE ###

Step 3. Chi-square test

### YOUR CODE HERE ###

Whole-genome association study

Please repeat the same steps you took with the first marker for all the other markers.

Step 1. Create a data frame with 4 columns.

CHR: Chromosome number
SNP: SNP marker ID
BP: Marker position
P: p-value (NA for all cells)

### YOUR CODE HERE ###

Step 2. A for loop can be utilized to conduct a chi-square test on all SNP markers. The p-values can be extracted from each iteration and added to the data frame created in Step 1. It’s essential to remember that R may provide a warning message if the expected value is small; you may ignore them.

### YOUR CODE HERE ###

Step 3. To create a Manhattan plot, utilize the manhattan() function within the qqman package. The data frame will already be in the default input format for this function if you use the same column names as shown in Step 1. Additionally, use the annotatePval = 0.01 option to annotate peaks with a p-value < 10-4.

Check this webpage for more information. https://r-graph-gallery.com/101_Manhattan_plot.html

### YOUR CODE HERE ###

Step 4. Q-Q plot. The QQ plot serves as a crucial tool for identifying issues in a GWAS. To use it, input the vector of p-values from your association result into the qq() function.

### YOUR CODE HERE ###

A simple GWAS using PLINK

There are more efficient ways to conduct association tests than using R. For instance, one can use well-developed software like PLINK. For more information, check this website. https://www.cog-genomics.org/plink/1.9/assoc

Step 1. Run association study by PLINK in the terminal.

### YOUR CODE HERE ###

Step 2. Read association test results (*.assoc) table by read.table() function.

### YOUR CODE HERE ###

Step 3. Manhattan plot. The result table generated from PLINK is the exact format for the plotting function.

### YOUR CODE HERE ###

Step 4. Q-Q plot

### YOUR CODE HERE ###

Question

You may notice a difference in results from R and PLINK. To prevent any confusion, please reference the help manual for more information on the chi-square test (chisq.test()) and Yate’s correction for continuity (https://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity).
Based on the Q-Q plot outcome, which result is superior?

[Advanced] GWAS linear mixed model

You can choose from different R packages for fitting GWAS linear mixed model. Some of these packages are GWAStools, rrBLUP, and sommer. Among these choices, my personal preference is the sommer package.

We will use a rice dataset from Cornell University in 2011 (https://www.nature.com/articles/ncomms1467). Raw data can be found on the rice diversity website (http://www.ricediversity.org/data/sets/44kgwas/).

library(sommer)

Loading required package: Matrix

Loading required package: MASS

Loading required package: lattice

Loading required package: crayon

Read PLINK binary format files (rice.bed, rice.bim, rice.fam). This dataset contains various trait measurements. You can select one from the file rice_phenotype.txt and incorporate phenotype information into the fam data frame.

Read binary files by the genio() function.

### YOUR CODE HERE ###

Read the phenotype table (I will choose plant height in the answer version)

### YOUR CODE HERE ###

Merge data$fam and peno. The NSFTVID in the phenotype table is its individual ID.

### YOUR CODE HERE ###

Start preparing input data for the GWAS() function in the sommer() package.

The genotype matrix needs to be transposed (row: samples, column: markers).
Genotype coding should be converted from (0, 1, 2) to (-1, 0, 1).
Missing values in genotype coded as 0. (Today, we’re keeping it simple, but there are more effective methods for imputing missing values.)
The row names of the genotype matrix should be identical to the ID in the phenotype table.
Make sure the sample id is factorized.

### YOUR CODE HERE ###

Now we need to generate an additive relationship matrix using A.mat() function.

### YOUR CODE HERE ###

Then, we can fit the linear mixed model by the function GWAS()

### YOUR CODE HERE ###

Create a data frame for plotting the function. You should have CHR, SNP, BP, and P in four columns. The p-value could be calculated by 10^(-as.numeric(fit$scores))).

### YOUR CODE HERE ###

Making Manhattan plot.

To create a Manhattan plot, please keep in mind that if the p-value is 0, its log value becomes infinity, which cannot be plotted. Therefore, kindly remove the data point before proceeding with the plot.

### YOUR CODE HERE ###