### YOUR CODE HERE ###
Quality Control
You are highly recommended to work on the server today. The first part of this practice will need to run PLINK and vcftools.
Dataset
- Copy the dataset to your working directory from
/proj/g2020004/nobackup/oumark/gwas_lab2/human.tar.gz
on UPPMAX. - Unzip the
human.tar.gz
- After unzipping the folder, you will have
human.bed
,human.bim
, andhuman.fam
three files. We call this kind of format as PLINK binary files.
A quick view
Before we start all QC jobs, let’s have a quick view of our data. First, find answers to the following questions from PLINK’s website.
FAM file
Have a look at human.fam
in terminal
- Which column is the family ID (FID)? What is the prefix?
- Which column is the Within-family ID (IID)? What is the prefix?
- Does this dataset provide parental information? What does 0 stands for?
- How many samples are included in this dataset?
BIM file
Have a look at human.bim
in terminal
- What does each column mean?
- How does the missing value present in the columns for reference and alternative alleles?
- How many markers are included in this dataset?
PLINK
- Homepage: https://www.cog-genomics.org/plink/1.9/
Data input
Several data formats are available for PLINK. The dataset we provide for today’s practice is the PLINK 1 binary format. Have a look at the help manual on how to load the dataset. LINK
Here are some tricks to make your work easier: - The prefix of the three files should be the same. - Make sure all three files are placed in the same directory.
Data output
One powerful function of PLINK is converting the dataset format used in other software. Have a look at the help manual for the output format option. LINK
- Input the PLINK binary file and generate a block-gzipped VCF file. (this will be used in the next section)
- What’s the function of
--out
option? - Have a look at the message in the terminal. How many variants are loaded from the BIM file? How many males and females sample were loaded?
- Do you find your output file human.vcf.gz?
- What’s the function of
### YOUR CODE HERE ###
Try following QC options
Try to filter the dataset by the following steps and write down the number of markers and samples passed in each step. (Note: output name needs to be changed in each run)
- Minor allele frequency (0.1)
### YOUR CODE HERE ###
- Individual genotyping rate (0.05 missingness)
### YOUR CODE HERE ###
- SNP genotype rate (0.05 missingness)
### YOUR CODE HERE ###
- Hardy-Weinberg P (0.001)
### YOUR CODE HERE ###
After doing all steps, - How many markers remain in the dataset? - How many samples remain in the dataset?
All in a run
Perform all steps from the raw data in one run by PLINK. - Does the number of remaining markers and samples differ from the step-by-step result? Why?
### YOUR CODE HERE ###
VCFtools
- Homepage: https://vcftools.github.io/index.html
- Use the
human.vcf.gz
you generated in the previous section.
Data input
- What are the avalible input formats for this software? Which option should we use for our dataset?
Data output
- What are the output options?
- What is the command for output VCF format? a 012 matrix for genotype? and PLINK format?
Try QC options
Try to filter your dataset by the following instructions and write down the number of remaining markers and samples after filtering.
- Minor allele frequency (0.1)
### YOUR CODE HERE ###
- Include only bi-allelic sites
### YOUR CODE HERE ###
- SNP genotype rate (0.05 missingness)
### YOUR CODE HERE ###
- Hardy-Weinberg P (0.001)
### YOUR CODE HERE ###
All in a run
Perform all QC steps in one run and compare the result with the step-wise method.
### YOUR CODE HERE ###
Visualization in R [ADVANCED]
Load data to R
read_plink()
function in thegenio
package is a powerful tool to read PLINK binary files.- Use PLINK binary files passed all QC in one run in the previous section.
#library(genio)
### YOUR CODE HERE ###
Genotype frequency
Calculate genotype (0: A1A1, 1: A1A2, 2: A2A2) frequencies of each loci - Generate an empty data frame (allele_frequency) with three columns (A1A1, A1A2, A2A2). - Number of rows = number of markers
### YOUR CODE HERE ###
Use for loop to count the number of genotypes in each marker and divide by the number of samples.
### YOUR CODE HERE
Bind frequency table with the marker information (HINT: cbind
function can bind two data-frame)
### YOUR CODE HERE ###
Make a plot using the following scripts
Questions:
- Why visualizing dataset important?
- Try calculate raw allele counts for each site by VCFtools.
- What is the problem with low MAF? Is it a good idea to have a stricter rule for MAF (a higher MAF threshold)? Why?