Quality Control

You are highly recommended to work on the server today. The first part of this practice will need to run PLINK and vcftools.

Dataset

Copy the dataset to your working directory from /proj/g2020004/nobackup/oumark/gwas_lab2/human.tar.gz on UPPMAX.
Unzip the human.tar.gz
After unzipping the folder, you will have human.bed, human.bim, and human.fam three files. We call this kind of format as PLINK binary files.

### YOUR CODE HERE ###

A quick view

Before we start all QC jobs, let’s have a quick view of our data. First, find answers to the following questions from PLINK’s website.

FAM file

Have a look at human.fam in terminal

Which column is the family ID (FID)? What is the prefix?
Which column is the Within-family ID (IID)? What is the prefix?
Does this dataset provide parental information? What does 0 stands for?
How many samples are included in this dataset?

BIM file

Have a look at human.bim in terminal

What does each column mean?
How does the missing value present in the columns for reference and alternative alleles?
How many markers are included in this dataset?

PLINK

Homepage: https://www.cog-genomics.org/plink/1.9/

Data input

Several data formats are available for PLINK. The dataset we provide for today’s practice is the PLINK 1 binary format. Have a look at the help manual on how to load the dataset. LINK

Here are some tricks to make your work easier: - The prefix of the three files should be the same. - Make sure all three files are placed in the same directory.

Data output

One powerful function of PLINK is converting the dataset format used in other software. Have a look at the help manual for the output format option. LINK

Input the PLINK binary file and generate a block-gzipped VCF file. (this will be used in the next section)
- What’s the function of --out option?
- Have a look at the message in the terminal. How many variants are loaded from the BIM file? How many males and females sample were loaded?
- Do you find your output file human.vcf.gz?

### YOUR CODE HERE ###

Try following QC options

Try to filter the dataset by the following steps and write down the number of markers and samples passed in each step. (Note: output name needs to be changed in each run)

Minor allele frequency (0.1)

### YOUR CODE HERE ###

Individual genotyping rate (0.05 missingness)

### YOUR CODE HERE ###

SNP genotype rate (0.05 missingness)

### YOUR CODE HERE ###

Hardy-Weinberg P (0.001)

### YOUR CODE HERE ###

After doing all steps, - How many markers remain in the dataset? - How many samples remain in the dataset?

All in a run

Perform all steps from the raw data in one run by PLINK. - Does the number of remaining markers and samples differ from the step-by-step result? Why?

### YOUR CODE HERE ###

VCFtools

Homepage: https://vcftools.github.io/index.html
Use the human.vcf.gz you generated in the previous section.

Data input

What are the avalible input formats for this software? Which option should we use for our dataset?

Data output

What are the output options?
What is the command for output VCF format? a 012 matrix for genotype? and PLINK format?

Try QC options

Try to filter your dataset by the following instructions and write down the number of remaining markers and samples after filtering.

Minor allele frequency (0.1)

### YOUR CODE HERE ###

Include only bi-allelic sites

### YOUR CODE HERE ###

SNP genotype rate (0.05 missingness)

### YOUR CODE HERE ###

Hardy-Weinberg P (0.001)

### YOUR CODE HERE ###

All in a run

Perform all QC steps in one run and compare the result with the step-wise method.

### YOUR CODE HERE ###

Visualization in R [ADVANCED]

Load data to R

read_plink() function in the genio package is a powerful tool to read PLINK binary files.
Use PLINK binary files passed all QC in one run in the previous section.

#library(genio)
### YOUR CODE HERE ###

Genotype frequency

Calculate genotype (0: A1A1, 1: A1A2, 2: A2A2) frequencies of each loci - Generate an empty data frame (allele_frequency) with three columns (A1A1, A1A2, A2A2). - Number of rows = number of markers

### YOUR CODE HERE ###

Use for loop to count the number of genotypes in each marker and divide by the number of samples.

### YOUR CODE HERE

Bind frequency table with the marker information (HINT: cbind function can bind two data-frame)

### YOUR CODE HERE ###

Make a plot using the following scripts

Questions:

Why visualizing dataset important?
Try calculate raw allele counts for each site by VCFtools.
What is the problem with low MAF? Is it a good idea to have a stricter rule for MAF (a higher MAF threshold)? Why?