Quickdraws UK Biobank RAP tutorial
The tutorial assumes you have access to UK Biobank data on RAP and have set up the DNAnexus Python API, dxpy (see https://pypi.org/project/dxpy/ for installation instructions).
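If you have not set up dxpy yet, a minimal setup could look like this (assuming Python and pip are available locally):
pip install dxpy   # install the DNAnexus CLI and Python bindings
dx login           # authenticate with your RAP credentials
dx select          # choose the project containing your dispensed UKB data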
Data Preparation and QC
Create a directory for your GWAS project on DNAnexus.
dx mkdir -p QuickdrawsGWAS/
We copy the raw genotypes from the UK Biobank folder Bulk/Genotype Results/Genotype calls/ to a new location
because the PLINK merge (next step) does not work when there are spaces in the file path. This script uses swiss-army-knife to copy the genotype files for each chromosome from the original location to a path without spaces. The new files are prefixed with “copy_” to differentiate them from the originals.
for CHR in {1..22}; do
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="Bulk/Genotype Results/Genotype calls/ukb22418_c${CHR}_b0_v2.bed" \
-iin="Bulk/Genotype Results/Genotype calls/ukb22418_c${CHR}_b0_v2.bim" \
-iin="Bulk/Genotype Results/Genotype calls/ukb22418_c${CHR}_b0_v2.fam" \
-icmd="cp ukb22418_c${CHR}_b0_v2.bed copy_ukb22418_c${CHR}_b0_v2.bed && cp ukb22418_c${CHR}_b0_v2.bim copy_ukb22418_c${CHR}_b0_v2.bim && cp ukb22418_c${CHR}_b0_v2.fam copy_ukb22418_c${CHR}_b0_v2.fam" \
--instance-type mem1_ssd1_v2_x8 \
--name move_genotypes \
--priority normal \
-y
done
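The copy jobs run in the background; you can check on them with the standard dx job commands, for example:
dx find jobs --name move_genotypes   # list the copy jobs and their states
dx watch <job-id>                    # stream the log of one job (job IDs are printed by dx run)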
We concatenate the genotypes across chromosomes to make a single PLINK fileset, which will be used for model fitting. We use the plink --merge-list functionality for this. First, we generate a list of genotype files to be merged:
echo -n '' > merge_list.txt
for CHR in {1..22}; do
echo /mnt/project/QuickdrawsGWAS/copy_ukb22418_c${CHR}_b0_v2 >> merge_list.txt
done
dx upload merge_list.txt --path /QuickdrawsGWAS/
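The uploaded merge_list.txt should contain one PLINK file prefix per line:
/mnt/project/QuickdrawsGWAS/copy_ukb22418_c1_b0_v2
/mnt/project/QuickdrawsGWAS/copy_ukb22418_c2_b0_v2
...
/mnt/project/QuickdrawsGWAS/copy_ukb22418_c22_b0_v2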
Next, we use the app-swiss-army-knife tool to run the PLINK command that merges these files. This process should take approximately one hour to finish concatenating all the genotypes.
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="QuickdrawsGWAS/merge_list.txt" \
-icmd="plink --bfile /mnt/project/QuickdrawsGWAS/copy_ukb22418_c1_b0_v2 --merge-list merge_list.txt --make-bed --out genotype_500k" \
--instance-type mem1_ssd1_v2_x8 \
--name merge_chromosomes \
--priority normal \
-y
The above script creates a single PLINK fileset called genotype_500k, which stores the raw genotype values for all the samples in the UK Biobank across approximately 800k variants. We can remove the per-chromosome files as we no longer need them:
for CHR in {1..22}; do
dx rm QuickdrawsGWAS/copy_ukb22418_c${CHR}_b0_v2.bed
dx rm QuickdrawsGWAS/copy_ukb22418_c${CHR}_b0_v2.bim
dx rm QuickdrawsGWAS/copy_ukb22418_c${CHR}_b0_v2.fam
done
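If you want to confirm that the merge worked, one option is a quick variant count on the merged .bim file, using the same swiss-army-knife pattern as above (the job name count_variants is just illustrative; expect roughly 800k lines):
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="QuickdrawsGWAS/genotype_500k.bim" \
-icmd="wc -l genotype_500k.bim" \
--instance-type mem1_ssd1_v2_x2 \
--name count_variants \
--priority normal \
-y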
Next, we perform various quality control steps for the genotype matrix before performing GWAS. Note that these are not extensive QC steps, and users should add or remove steps based on their needs. We will start by removing variants that did not pass QC for phasing in the UK Biobank dataset release.
dx run app-swiss-army-knife \
-iin="Bulk/Genotype Results/Genotype calls/ukb_snp_qc.txt" \
--folder="QuickdrawsGWAS/" \
-icmd="cat ukb_snp_qc.txt | awk '{ print \$1, \$159; }' > SNPlist.unfiltered.txt && \
cat SNPlist.unfiltered.txt | sed '1d' | awk '{ if (\$2 == 1) print \$1; }' > SNPlist.filtered.QC.txt" \
--instance-type mem1_ssd1_v2_x2 \
--name qc_snp \
--priority normal \
-y
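You can peek at the resulting SNP list with dx head to check that it contains one variant ID per line:
dx head QuickdrawsGWAS/SNPlist.filtered.QC.txt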
Next, we will perform QC based on samples and only retain genotypes for the white British subset of the UK Biobank, which consists of approximately 400k samples.
dx run app-swiss-army-knife \
--folder="/QuickdrawsGWAS/" \
-iin="Bulk/Genotype Results/Genotype calls/ukb_sqc_v2.txt" \
-iin="Bulk/Genotype Results/Genotype calls/ukb22418_c1_b0_v2.fam" \
-icmd="
cat ukb_sqc_v2.txt | cut -d ' ' -f 24 > ukb_white_british.txt && \
cat ukb22418_c1_b0_v2.fam | cut -d ' ' -f 1 > samples.tmp.txt && \
paste samples.tmp.txt ukb_white_british.txt > INDlist.unfiltered.txt && \
cat INDlist.unfiltered.txt | sed '1d' | awk '{ if (\$2 == 1) print \$1, \$1; }' > INDlist.filtered.QC.txt && \
rm samples.tmp.txt && rm ukb_white_british.txt && rm INDlist.unfiltered.txt" \
--instance-type mem1_ssd1_v2_x2 \
--name subset_genotype \
--priority normal \
-y
We perform variant QC using the plink2 --extract flag and sample QC using the plink2 --keep flag. Additionally, we filter the genotype matrix to retain only variants with MAF > 1%, as follows:
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="QuickdrawsGWAS/genotype_500k.bed" \
-iin="QuickdrawsGWAS/genotype_500k.bim" \
-iin="QuickdrawsGWAS/genotype_500k.fam" \
-iin="QuickdrawsGWAS/INDlist.filtered.QC.txt" \
-iin="QuickdrawsGWAS/SNPlist.filtered.QC.txt" \
-icmd="
plink2 --bfile genotype_500k --keep INDlist.filtered.QC.txt --extract SNPlist.filtered.QC.txt --make-bed --out genotype_400k && \
rm genotype_500k.bed && rm genotype_500k.bim && rm genotype_500k.fam && \
plink2 --bfile genotype_400k --maf 0.01 --make-bed --out genotype_400k_common && \
rm genotype_400k.bed && rm genotype_400k.bim && rm genotype_400k.fam" \
--instance-type mem1_ssd1_v2_x8 \
--name genotype_maf_filter \
--priority normal \
-y
The above command generates a new PLINK fileset called genotype_400k_common, which stores genotype information for the roughly 400k white British subset of the UK Biobank, retaining variants with MAF > 1%. To speed up the rest of the tutorial, we further subset the genotype data to the first 10,000 samples in the file. You can skip this command and use genotype_400k_common if you wish to run GWAS on the entire 400k subset of the data.
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="QuickdrawsGWAS/genotype_400k_common.bed" \
-iin="QuickdrawsGWAS/genotype_400k_common.bim" \
-iin="QuickdrawsGWAS/genotype_400k_common.fam" \
-icmd="
head -n 10000 genotype_400k_common.fam > genotype_10k_common.fam && \
plink2 --bfile genotype_400k_common --keep genotype_10k_common.fam --make-bed --out genotype_10k_common" \
--instance-type mem1_ssd1_v2_x8 \
--name genotype_10k \
--priority normal \
-y
We also need covariates to perform GWAS. In this tutorial, we only use the top 20 genetic PCs, but users can download other covariates such as Age and Sex (and their combinations) separately using the dx extract_dataset functionality; a sketch is given after the command below. We create a space-separated covariates.txt file to be used in Quickdraws as follows:
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="Bulk/Genotype Results/Genotype calls/ukb_sqc_v2.txt" \
-iin="Bulk/Genotype Results/Genotype calls/ukb22418_c1_b0_v2.fam" \
-icmd="
cat ukb_sqc_v2.txt | cut -d ' ' -f 26-45 > pcs.txt && \
cat ukb22418_c1_b0_v2.fam | cut -d ' ' -f 1-2 > samples.tmp.txt && \
echo FID IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 > covariates.txt && \
paste samples.tmp.txt pcs.txt >> covariates.txt && \
rm samples.tmp.txt && rm pcs.txt" \
--instance-type mem1_ssd1_v2_x2 \
--name make_covariates \
--priority normal \
-y
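As a sketch of how additional covariates could be pulled with dx extract_dataset (run on a RAP instance): the field IDs below are assumptions used for illustration (p21022 = age at recruitment, p31 = sex), and <your_dataset_record> should be replaced with the dataset record in your own project.
dx extract_dataset <your_dataset_record>.dataset \
--fields 'participant.eid,participant.eid,participant.p21022,participant.p31' \
--delimiter " " --output age_sex.txt
Any such columns would then need to be merged into covariates.txt by FID/IID before running Quickdraws.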
Quickdraws also needs access to the KING table and a list of unrelated homogeneous samples to calibrate its test statistics. We can download the KING table directly from UK Biobank RAP (available at: Bulk/Genotype Results/Genotype calls/ukb_rel.dat) and use it to build the list of unrelated samples with the code snippet below. The Python file make_unrel_homogenous.py is given in the FAQs section at the end of this tutorial (Make list of unrelated homogenous samples); save it locally before uploading it to the RAP server.
dx upload make_unrel_homogenous.py --path QuickdrawsGWAS/
dx run app-dxjupyterlab \
--folder="QuickdrawsGWAS/" \
-iin="Bulk/Genotype\ Results/Genotype\ calls/ukb_rel.dat" \
-iin="QuickdrawsGWAS/INDlist.filtered.QC.txt" \
-iin="QuickdrawsGWAS/make_unrel_homogenous.py" \
-icmd="
python make_unrel_homogenous.py INDlist.filtered.QC.txt ukb_rel.dat unrel_FID_IID.txt && \
cp ukb_rel.dat kinship.txt" \
--instance-type mem1_ssd1_v2_x2 \
--name make_unrel_homo_file \
--priority normal \
-y
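Before moving on, it is worth checking that both files produced by this job have landed in the project folder:
dx ls QuickdrawsGWAS/unrel_FID_IID.txt
dx ls QuickdrawsGWAS/kinship.txt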
Finally, we download five quantitative phenotypes (Height, BMI, Systolic_BP, FEV1_FVC, Hip_circumference) using the dx extract_dataset command; replace app43206_20221106204130.dataset below with the dataset record in your own project. We use the phenotypes as they are, but users should preprocess their phenotypes before running a GWAS; general guidelines on phenotype QC can be found in the Supplementary Note of the Quickdraws paper. We upload the resulting space-separated phenotypes.txt file to the RAP server. Note: perform this step on a RAP instance.
dx extract_dataset app43206_20221106204130.dataset --fields 'participant.eid,participant.eid,participant.p50_i0,participant.p21001_i0,participant.p4080_i0_a0,participant.p20258,participant.p49_i0' --delimiter " " --output phenotypes.txt
echo "FID IID Height BMI Systolic_BP FEV1_FVC Hip_circumference" | cat - <(tail -n +2 phenotypes.txt) > temp && mv temp phenotypes.txt
dx upload phenotypes.txt --path QuickdrawsGWAS/
Running Quickdraws
As we have downloaded and processed all the files required to run GWAS using Quickdraws, we can now proceed with Quickdraws step 1 (model fitting) followed by step 2 (test statistics calculation). We run Quickdraws using a Docker image. A tutorial on building the Docker image for Quickdraws can be found in Docker image. A Docker image built on UK Biobank RAP and ready for use on RAP is available at the URL in the wget command below.
Start by uploading the Docker image to the central RAP server:
dx mkdir docker/
wget https://www.stats.ox.ac.uk/publication-data/sge/ppg/quickdraws/docker_image/quickdraws.tar
dx upload quickdraws.tar --path docker/
The Docker image should now be on the central RAP server and ready to use.
To run step 1, we need a GPU instance, which can be requested as follows:
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="docker/quickdraws.tar" \
-iin="QuickdrawsGWAS/genotype_10k_common.bed" \
-iin="QuickdrawsGWAS/genotype_10k_common.bim" \
-iin="QuickdrawsGWAS/genotype_10k_common.fam" \
-iin="QuickdrawsGWAS/phenotypes.txt" \
-iin="QuickdrawsGWAS/covariates.txt" \
-iin="QuickdrawsGWAS/kinship.txt" \
-icmd="docker load -i quickdraws.tar && docker run --security-opt seccomp=unconfined --gpus all --rm --shm-size=16g -v ./:/mnt/ quickdraws quickdraws-step-1 --out /mnt/step1 --bed /mnt/genotype_10k_common --phenoFile /mnt/phenotypes.txt --covarFile /mnt/covariates.txt --kinship /mnt/kinship.txt && tar -cvf step1_output.tar ./step1*" \
--instance-type mem2_ssd2_gpu1_x8 \
--name qd_step1 \
--priority high \
-y
For the 10k samples, step 1 should take about 2 hours and cost roughly £1.50. It is also possible to run the analysis on CPUs; however, this is more expensive and time-consuming, so we recommend running step 1 on GPUs where possible. The command to run it on CPUs is as follows:
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="docker/quickdraws.tar" \
-iin="QuickdrawsGWAS/genotype_10k_common.bed" \
-iin="QuickdrawsGWAS/genotype_10k_common.bim" \
-iin="QuickdrawsGWAS/genotype_10k_common.fam" \
-iin="QuickdrawsGWAS/phenotypes.txt" \
-iin="QuickdrawsGWAS/covariates.txt" \
-iin="QuickdrawsGWAS/kinship.txt" \
-icmd="docker load -i quickdraws.tar && docker run --security-opt seccomp=unconfined --rm --shm-size=16g -v ./:/mnt/ quickdraws quickdraws-step-1 --out /mnt/step1 --bed /mnt/genotype_10k_common --phenoFile /mnt/phenotypes.txt --covarFile /mnt/covariates.txt --kinship /mnt/kinship.txt && tar -cvf step1_output.tar ./step1*" \
--instance-type mem1_ssd1_v2_x8 \
--name qd_step1 \
--priority high \
-y
Next, we can calculate test statistics for the variants in the genotype_10k_common PLINK fileset as follows:
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="docker/quickdraws.tar" \
-iin="QuickdrawsGWAS/genotype_10k_common.bed" \
-iin="QuickdrawsGWAS/genotype_10k_common.bim" \
-iin="QuickdrawsGWAS/genotype_10k_common.fam" \
-iin="QuickdrawsGWAS/covariates.txt" \
-iin="QuickdrawsGWAS/unrel_FID_IID.txt" \
-iin="QuickdrawsGWAS/step1_output.tar" \
-icmd="docker load -i quickdraws.tar && tar -xvf step1_output.tar && docker run --security-opt seccomp=unconfined --rm -v ./:/mnt/ quickdraws quickdraws-step-2 --out /mnt/step2_22 --bed /mnt/genotype_10k_common --out_step1 /mnt/step1 --covarFile /mnt/covariates.txt --unrel_sample_list /mnt/unrel_FID_IID.txt && rm -r step1*" \
--instance-type mem1_ssd1_v2_x8 \
--name qd_step2 \
--priority high \
-y
Note that step 1 archived its output as step1_output.tar; we pass this archive as an input and extract it before running step 2. This step requires less than 15 minutes and costs approximately £0.10 to finish the test statistics calculation.
Finally, we can calculate the test statistics for the imputed data using the BGEN files on the central server (available at: Bulk/Imputation/UKB_imputation_from_genotype/) as follows:
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="docker/quickdraws.tar" \
-iin="Bulk/Imputation/UKB_imputation_from_genotype/ukb22828_c22_b0_v3.bgen" \
-iin="Bulk/Imputation/UKB_imputation_from_genotype/ukb22828_c22_b0_v3.bgen.bgi" \
-iin="Bulk/Imputation/UKB_imputation_from_genotype/ukb22828_c22_b0_v3.sample" \
-iin="QuickdrawsGWAS/covariates.txt" \
-iin="QuickdrawsGWAS/step2.calibration" \
-iin="QuickdrawsGWAS/step1_output.tar" \
-icmd="docker load -i quickdraws.tar && tar -xvf step1_output.tar && docker run --security-opt seccomp=unconfined --rm -v ./:/mnt/ quickdraws quickdraws-step-2 --out /mnt/step2 --bgen /mnt/ukb22828_c22_b0_v3.bgen --sample /mnt/ukb22828_c22_b0_v3.sample --out_step1 /mnt/step1 --covarFile /mnt/covariates.txt --calibrationFile /mnt/step2.calibration && rm -r step1*" \
--instance-type mem1_ssd1_v2_x8 \
--name qd_step2_imputed \
--priority high \
-y
This run takes around 2 hours for chromosome 22 and costs less than £0.50 using the mem1_ssd1_v2_x8 instance.
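To obtain summary statistics for all autosomes, the same command can be wrapped in a per-chromosome loop. The sketch below simply parameterises the command above with ${CHR}; the output prefix step2_c${CHR} is illustrative and only chosen to keep per-chromosome outputs separate.
for CHR in {1..22}; do
dx run app-swiss-army-knife \
--folder="QuickdrawsGWAS/" \
-iin="docker/quickdraws.tar" \
-iin="Bulk/Imputation/UKB_imputation_from_genotype/ukb22828_c${CHR}_b0_v3.bgen" \
-iin="Bulk/Imputation/UKB_imputation_from_genotype/ukb22828_c${CHR}_b0_v3.bgen.bgi" \
-iin="Bulk/Imputation/UKB_imputation_from_genotype/ukb22828_c${CHR}_b0_v3.sample" \
-iin="QuickdrawsGWAS/covariates.txt" \
-iin="QuickdrawsGWAS/step2_22.calibration" \
-iin="QuickdrawsGWAS/step1_output.tar" \
-icmd="docker load -i quickdraws.tar && tar -xvf step1_output.tar && docker run --security-opt seccomp=unconfined --rm -v ./:/mnt/ quickdraws quickdraws-step-2 --out /mnt/step2_c${CHR} --bgen /mnt/ukb22828_c${CHR}_b0_v3.bgen --sample /mnt/ukb22828_c${CHR}_b0_v3.sample --out_step1 /mnt/step1 --covarFile /mnt/covariates.txt --calibrationFile /mnt/step2_22.calibration && rm -r step1*" \
--instance-type mem1_ssd1_v2_x8 \
--name qd_step2_imputed_c${CHR} \
--priority high \
-y
done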
FAQs
Make list of unrelated homogenous samples
This is a Python script that can be used to create a list of unrelated homogeneous samples for the calibration of test statistics. Save the file as make_unrel_homogenous.py before uploading it to the RAP server:
import sys

import numpy as np
import pandas as pd

# Usage: python make_unrel_homogenous.py <homogeneous sample list> <KING table> <output file>
homogenous_samples = sys.argv[1]
kinship_file = sys.argv[2]
output_file = sys.argv[3]

# Read the homogeneous sample list (FID IID); with a single name,
# the first column becomes the index and 'IID' holds the sample IDs.
iid = pd.read_csv(homogenous_samples, sep=r"\s+", names=["IID"])["IID"].values

# Read the KING table and keep only pairs where both samples are homogeneous
kinship = pd.read_csv(kinship_file, sep=r"\s+")
kinship = kinship[kinship["ID1"].isin(iid) & kinship["ID2"].isin(iid)]

# Keep only related pairs (kinship coefficient >= 0.044)
kinship = kinship[kinship.Kinship >= 0.044][["ID1", "ID2"]]

# Samples appearing in no related pair are unrelated by definition
unrel_list = set(iid) - set(kinship.ID1)
unrel_list = unrel_list - set(kinship.ID2)
unrel_list = list(unrel_list)

# Build the relatedness graph over the remaining samples
graph = {}
for i in range(len(kinship)):
    node1, node2 = kinship.iloc[i].values
    if node1 not in graph:
        graph[node1] = []
    if node2 not in graph:
        graph[node2] = []
    graph[node1].append(node2)
    graph[node2].append(node1)

# Greedily pick a maximal independent set of the relatedness graph
label = {}
for node in graph:
    label[node] = False
for node in graph:
    all_false = True
    for neighbor in graph[node]:
        if label[neighbor]:
            all_false = False
            break
    if all_false:
        label[node] = True
mis = [node for node in graph if label[node]]

# Combine the trivially unrelated samples with the maximal independent set
unrel_list = unrel_list + mis
unrel_list = np.unique(unrel_list)
print("Number of unrelated homogeneous samples = " + str(len(unrel_list)))

# Write FID/IID pairs for the unrelated homogeneous samples
df = pd.DataFrame(columns=["FID", "IID"])
df["FID"] = unrel_list
df["IID"] = unrel_list
df.to_csv(output_file, sep="\t", index=False)
If you use this software, please cite:
H. Loya, G. Kalantzis, F. Cooper, P. F. Palamara. A scalable variational inference approach for increased mixed-model association power. bioRxiv, 2024.