= data.frame(
exampleData case = c(20, 20, 60),
control = c(50, 30, 20)
)
= chisq.test(exampleData)
result print(result)
Pearson's Chi-squared test
data: exampleData
X-squared = 34.857, df = 2, p-value = 2.697e-08
Welcome to the lab practice. In today’s work, we will work on the human data that passed our quality control yesterday. There will be three sections in today’s practice:
The data set we will be using today is accessible through my Google Drive storage. https://drive.google.com/drive/folders/1LIuCaFqfURj0-yfculFcmHlxNeC1_i7m?usp=share_link
Here’s an example of conducting a chi-square test in R (the example in the lecture note). You will be conducting a GWAS test utilizing the chi-square method.
exampleData = data.frame(
case = c(20, 20, 60),
control = c(50, 30, 20)
)
result = chisq.test(exampleData)
print(result)
Pearson's Chi-squared test
data: exampleData
X-squared = 34.857, df = 2, p-value = 2.697e-08
Let’s start the GWAS test. First, read three data with human_all prefix binary data using read_plink()
function in genio
package.
Step 1. Generate a data frame with only two columns. Remove samples with missing phenotypes or genotypes.
Step 2. Please use the function table()
to convert the data frame into a contingency table. (HINT: ?table)
Note: the table()
function will automatically remove rows with missing values.
Step 3. Chi-square test
Please repeat the same steps you took with the first marker for all the other markers.
Step 1. Create a data frame with 4 columns.
Step 2. A for loop can be utilized to conduct a chi-square test on all SNP markers. The p-values can be extracted from each iteration and added to the data frame created in Step 1. It’s essential to remember that R may provide a warning message if the expected value is small; you may ignore them.
Step 3. To create a Manhattan plot, utilize the manhattan()
function within the qqman
package. The data frame will already be in the default input format for this function if you use the same column names as shown in Step 1. Additionally, use the annotatePval = 0.01 option to annotate peaks with a p-value < 10-4.
Check this webpage for more information. https://r-graph-gallery.com/101_Manhattan_plot.html
Step 4. Q-Q plot. The QQ plot serves as a crucial tool for identifying issues in a GWAS. To use it, input the vector of p-values from your association result into the qq() function.
There are more efficient ways to conduct association tests than using R. For instance, one can use well-developed software like PLINK. For more information, check this website. https://www.cog-genomics.org/plink/1.9/assoc
Step 1. Run association study by PLINK in the terminal.
Step 2. Read association test results (*.assoc
) table by read.table()
function.
Step 3. Manhattan plot. The result table generated from PLINK is the exact format for the plotting function.
Step 4. Q-Q plot
chisq.test()
) and Yate’s correction for continuity (https://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity).You can choose from different R packages for fitting GWAS linear mixed model. Some of these packages are GWAStools, rrBLUP, and sommer. Among these choices, my personal preference is the sommer package.
We will use a rice dataset from Cornell University in 2011 (https://www.nature.com/articles/ncomms1467). Raw data can be found on the rice diversity website (http://www.ricediversity.org/data/sets/44kgwas/).
Loading required package: Matrix
Loading required package: MASS
Loading required package: lattice
Loading required package: crayon
Read PLINK binary format files (rice.bed, rice.bim, rice.fam). This dataset contains various trait measurements. You can select one from the file rice_phenotype.txt and incorporate phenotype information into the fam data frame.
Read binary files by the genio()
function.
Read the phenotype table (I will choose plant height in the answer version)
Merge data$fam and peno. The NSFTVID in the phenotype table is its individual ID.
Start preparing input data for the GWAS() function in the sommer() package.
Now we need to generate an additive relationship matrix using A.mat()
function.
Then, we can fit the linear mixed model by the function GWAS()
Create a data frame for plotting the function. You should have CHR, SNP, BP, and P in four columns. The p-value could be calculated by 10^(-as.numeric(fit$scores))).
To create a Manhattan plot, please keep in mind that if the p-value is 0, its log value becomes infinity, which cannot be plotted. Therefore, kindly remove the data point before proceeding with the plot.