Quality Control

You are highly recommended to work on the server today. The first part of this practice will need to run PLINK and vcftools.

Dataset

  • Copy the dataset to your working directory from /proj/g2020004/nobackup/oumark/gwas_lab2/human.tar.gz on UPPMAX.
  • Unzip the human.tar.gz
  • After unzipping the folder, you will have human.bed, human.bim, and human.fam three files. We call this kind of format as PLINK binary files.
### YOUR CODE HERE ###

A quick view

Before we start all QC jobs, let’s have a quick view of our data. First, find answers to the following questions from PLINK’s website.

FAM file

Have a look at human.fam in terminal

  • Which column is the family ID (FID)? What is the prefix?
  • Which column is the Within-family ID (IID)? What is the prefix?
  • Does this dataset provide parental information? What does 0 stands for?
  • How many samples are included in this dataset?

BIM file

Have a look at human.bim in terminal

  • What does each column mean?
  • How does the missing value present in the columns for reference and alternative alleles?
  • How many markers are included in this dataset?

VCFtools

  • Homepage: https://vcftools.github.io/index.html
  • Use the human.vcf.gz you generated in the previous section.

Data input

  • What are the avalible input formats for this software? Which option should we use for our dataset?

Data output

  • What are the output options?
  • What is the command for output VCF format? a 012 matrix for genotype? and PLINK format?

Try QC options

Try to filter your dataset by the following instructions and write down the number of remaining markers and samples after filtering.

  1. Minor allele frequency (0.1)
### YOUR CODE HERE ###
  1. Include only bi-allelic sites
### YOUR CODE HERE ###
  1. SNP genotype rate (0.05 missingness)
### YOUR CODE HERE ###
  1. Hardy-Weinberg P (0.001)
### YOUR CODE HERE ###

All in a run

Perform all QC steps in one run and compare the result with the step-wise method.

### YOUR CODE HERE ###

Visualization in R [ADVANCED]

Load data to R

  • read_plink() function in the genio package is a powerful tool to read PLINK binary files.
  • Use PLINK binary files passed all QC in one run in the previous section.
#library(genio)
### YOUR CODE HERE ###

Genotype frequency

Calculate genotype (0: A1A1, 1: A1A2, 2: A2A2) frequencies of each loci - Generate an empty data frame (allele_frequency) with three columns (A1A1, A1A2, A2A2). - Number of rows = number of markers

### YOUR CODE HERE ###

Use for loop to count the number of genotypes in each marker and divide by the number of samples.

### YOUR CODE HERE

Bind frequency table with the marker information (HINT: cbind function can bind two data-frame)

### YOUR CODE HERE ###

Make a plot using the following scripts

Questions:

  • Why visualizing dataset important?
  • Try calculate raw allele counts for each site by VCFtools.
  • What is the problem with low MAF? Is it a good idea to have a stricter rule for MAF (a higher MAF threshold)? Why?