Quickdraws GWAS Software Documentation

Quickdraws is a scalable method to perform genome-wide association studies (GWAS) for quantitative and binary traits. To run GWAS using Quickdraws, you will need three main input files: bed (and bgen) files with model-building and testing genetic variants, phenotype files, and covariate files. For certain analyses, you may also need a list of model SNPs and a file describing close genetic relatives.

Table of Contents

  1. Installation
  2. Input Files
  3. Steps to Run Quickdraws
  4. List of Options
  5. Interpreting the Output
  6. FAQs
  7. Citation

Installation

It is strongly recommended to either set up a python virtual environment, or a conda environment:

Python virtual environment

python -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel

Conda environment

conda create -n quickdraws python=3.11 -y
conda activate quickdraws
pip install --upgrade pip setuptools wheel

Install pytorch and quickdraws

It is necessary for anything bigger than toy examples to use a cuda-enabled version of pytorch. Use the pytorch configuration helper to find suitable installation instructions for your system. The code snippet below will probably work for most systems:

pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install quickdraws

Input Files

Required Input Files

  1. Bed file: This contain the genotype data used to build the model (steps 0 and 1). Variants in a bed file can also be tested for association (Step 2).
  2. Phenotype file: This files have the phenotype information (see Additional Details on Input/Output File Formats for format details).
  3. Covariate file: This files include covariate data (see Additional Details on Input/Output File Formats for format details).
  4. Kinship file: Used to calibrate test statistics in the presence of close relatives (for UKB can be found in RAP at Bulk/Genotype Results/Genotype calls/ukb_rel.dat, see Additional Details on Input/Output File Formats for format details).
  5. Unrel Sample List file: A list of homogenous unrelated samples in the dataset (for UKB see resource 22021, see Additional Details on Input/Output File Formats for format details).

Optional Input Files

  1. Bgen file: These contain the genotype data to be tested for association (Step 2).
  2. Sample file: A .sample file associated with the bgen file recording sample FID and IID.
  3. Model SNPs file: A list of SNPs for step 1 (optional, see Generating Model SNPs).
  4. Extract SNPs file: A List of SNPs for which test statistics will be computed (one SNP ID per line, same format as Model SNPs file).

Steps to Run Quickdraws

Step 0: Convert Bed File to HDF5 (Optional)

Generate an HDF5 file corresponding to the Bed file for fast sample-wise access. This step is performed once per genetic dataset. This step can be skipped! - you can directly jump to Step 1 which will generate the HDF5 file as a preprocessing step, but it is best practice to generate the HDF5 file for the entire dataset once and reuse it later.

convert-to-hdf5 \
  --out master \
  --bed ${bed} \
  --modelSnps maf001.snps

This will convert the input ${bed} file into the HDF5 file master.hdf5. Only the SNPs listed in maf001.snps will be converted. The maf001.snps file should contain one SNP identifier per line; the identifier corresponds to the 2nd column of the BIM file.

Quantitative traits

Step 1

Use spike-and-slab Bayesian linear regression to estimate genetic effects.

quickdraws-step-1 \
    --out ${output_step1} \
    --hdf5 master.hdf5 \
    --bed ${bed} \
    --phenoFile ${pheno} \
    --covarFile ${covar} \
    --modelSnps maf001.snps \
    --kinship kinship.txt
  • ${output_step1}: File prefix where the output of Step 1 will be written.
  • master.hdf5: Generated in Step 0 (optional).
  • ${bed}: Bed file containing the model SNPs.
  • ${pheno}: File containing phenotype data. See input/output file formats section for format details.
  • ${covar}: File containing covariates. See input/output file formats section for format details.
  • maf001.snps: List of SNPs for the regression (optional), described in Step 0.
  • kinship.txt: Contains pairwise relatives up to the 3rd degree. This file can be produced using KING as implemented in PLINK (see FAQs for an example).

Step 2 (bed files)

Generate summary statistics for SNPs in a bed file (e.g., the model SNPs).

quickdraws-step-2 \
   --out ${output_step2} \
   --bed ${bed} \
   --out_step1 ${output_step1} \
   --covarFile ${covar} \
   --unrel_sample_list unrelated_white_British.FID_IID.txt
  • ${output_step2}: File prefix where results will be written.
  • ${bed}: The Bed file containing the SNPs to be tested.
  • ${output_step1}: The output file from Step 1.
  • ${covar}: File containing covariates. See input/output file formats section for format details.
  • unrelated_white_British.FID_IID.txt: List of unrelated homogeneous samples, or any other list of samples to be included in the calculation of test statistics. See input/output file formats section for format details.

Step 2 (bgen files)

Compute summary statistics for imputed data.

quickdraws-step-2 \
    --out ${output_step2} \
    --bgen ${bgen} \
    --sample ${sample} \
    --out_step1 ${output_step1}  \
    --calibrationFile ${output_step2}.calibration \
    --covarFile ${covar} \
    --extract snp_list.txt
  • ${output_step2}: File prefix where results will be written.
  • ${bgen}: The Bgen file containing the variants to be tested.
  • ${sample}: File containing the list of samples to be included. See input/output file formats section for format details.
  • ${output_step1}: The output file from Step 1.
  • ${output_step2}.calibration: Calibration coefficients (produced in Step 2).
  • ${covar}: File containing covariates. See input/output file formats section for format details.
  • snp_list.txt: List of SNPs to test. Should contain one SNP identifier per line; the identifier correponds to the 2nd column of a bim file.

Binary traits

Step 1

For binary traits, add the --binary and --h2_grid flags.

quickdraws-step-1 \
    --out ${output_step1} \
    --hdf5 master.hdf5 \
    --bed ${bed} \
    --phenoFile ${pheno} \
    --covarFile ${covar} \
    --modelSnps maf001.snps \
    --kinship kinship.txt \
    --binary \
    --h2_grid

Input/output files are as described for quantitative traits.

Step 2 (bed files)

Add the --binary and --firth flags for Firth logistic regression.

quickdraws-step-2 \
   --out ${output_step2} \
   --bed ${bed} \
   --out_step1 ${output_step1} \
   --covarFile ${covar} \
   --unrel_sample_list unrelated_white_British.FID_IID.txt \
   --binary \
   --firth

Input/output files are as described for quantitative traits.

Step 2 (bgen files)

Compute summary statistics for imputed data.

quickdraws-step-2 \
    --out ${output_step2} \
    --bgen ${bgen} \
    --sample ${sample} \
    --out_step1 ${output_step1}  \
    --calibrationFile ${output_step2}.calibration \
    --covarFile ${covar} \
    --extract snp_list.txt \
    --binary \
    --firth

Input/output files are as described for quantitative traits.

List of options

You can use --help to get a full list of options for step 1 and step 2 as follows:

quickdraws-step-1 --help
quickdraws-step-2 --help

Main options for step 1

Option Required Data-type Description
–bed Yes STRING Prefix for bed/bim/fam files
–phenoFile Yes STRING Phenotype file in “FID IID Trait1 Trait2 …” format, tab-separated
–out Yes STRING Prefix for output files
–covarFile No STRING Covariate file in “FID IID Var1 Var2 …” format, tab-separated
–hdf5 No STRING Path to the master HDF5 file (Generated in step 0, if unspecified is generated automatically)
–modelSnps No STRING Path to list of SNPs for model fitting (one SNP ID per line)
–kinship No STRING Path to KING table file, has ID1 ID2 Kinship as columns, tab-separated
–binary No - For binary traits
–h2_grid No - Perform a grid-search for heritability
–lowmem No - Run Quickdraws in low-memory mode
–predBetasFlag No - Calculate and store posterior mean effect estimates for prediction
–train_split No FLOAT Fraction of dataset for training (default = 0.9)
–phen_thres No FLOAT Phenotyping rate threshold (default = 0)
–rhe_random_vectors No INT Number of random vectors to be used by RHE to estimate h2 (default = 50)

Main options for step 2

Option Required Data-type Description
–bed Yes STRING Prefix for bed/bim/fam files
–bgen Yes STRING Prefix for bgen file
–sample Yes STRING Prefix for sample file
–out Yes STRING Prefix for output files
–out_step1 Yes STRING Prefix for output files from Step 1
–covarFile No STRING Covariate file in “FID IID Var1 Var2 …” format, tab-separated
–unrel_sample_list No STRING List of unrelated homogeneous samples in “FID IID” format, tab-separated
–binary No - For binary traits
–firth No - Firth logistic regression
–calibrate No - Use effective sample size estimate to calibrate test statistics (default=True)
–calibrationFile No STRING File with calibration constants; is estimated using step 2 on model fitting SNPs
–extract No STRING List of SNPs for which test statistics will be computed (one SNP ID per line)
–sample_weights No STRING Sampling weights file for participation bias correction in “FID IID Weight” format, tab-separated
–firth_pval_thresh No FLOAT P-value threshold for Firth regression (default = 0.05)
–firth_maf_thresh No FLOAT MAF threshold for Firth regression (default = 0.05)
–firth_prevalence_thresh No FLOAT Prevalence threshold for Firth regression (default = 0.05)
–num_threads No INT Number of threads (default = -1, which equals len(os.sched_getaffinity(0)) - 1)

Interpreting the output

Quickdraws generates the following output files:

Filename Step Description
{out}.traits 1 Postprocessed traits in “FID IID Trait1 Trait2 …” format, tab-separated
{out}.covar_effects 1 Covariate effects on traits in “FID IID Var1 Var2 …” format, tab-separated
{out}.hdf5 1 Genotype matrix HDF5 file
{out}.h2 1 Heritability for each trait, one heritability value per line, sorted based on trait order in {out}.traits
{out}.alpha 1 Optimal sparsity parameter for each trait, one sparsity value per line, sorted based on trait order in {out}.traits
{out}_pred.list 1 LOCO predictions for each trait, stores trait name and loco prediction filename per trait
{out}.{pheno}.loco 1 LOCO predictions for phenotype ${pheno}, “FID IID loco1 loco2 ..” format, tab-separated
{out}.{pheno}.posterior_betas 1 posterior betas for phenotype ${pheno}, “CHR GENPOS POS SNP BETA A1 A2” format, tab-separated
{out}.{pheno}.sumstats.gz 2 Quickdraws association statistics (gzip-compressed)
{out}.{pheno}_lrunrel.sumstats 2 Association statistics for linear regression on unrelated samples stored as an HDF file, use pd.read_hdf() to read
{out}.calibration 2 Calibration factor from Step 2, per phenotype (row) stores calibration constants for each LOCO run (columns)

FAQs

Generating Model SNPs

We only run Step 1 on common genotyped SNPs in the data. One can get the list using PLINK:

plink --bfile genotype --freq --out all_snps
awk '$5 > 0.01' all_snps.frq > maf001.snps

Generating KING table

You can use PLINK2 to calculate the KING table for up to 3rd degree relatives as follows:

plink2 --make-king-table --king-table-filter 0.044 --bfile genotype --out king

Concatenating summary statistics

To combine summary statistics across multiple chromosomes, you can run:

git clone https://github.com/PalamaraLab/quickdraws.git
cd quickdraws
python src/concat_sumstats.py --out ${out}

Using Regenie for step 2

Quickdraws saves LOCO predictions from Step 1 in the same format as Regenie, so can be easily combined with Regenie. To calculate test statistics through Regenie (tested on Regenie v3.1) For Quantitative traits, run:

regenie \
   --step 2 \
   --pred ${output}_pred.list \
   --bed ${bed} \ 
   --phenoFile ${output}.traits \
   --covarFile ${covarFile} \
   --out ${output}

where ${output}_pred.list is a list of files containing the LOCO predictions, one line per trait, and ${output}.traits is a list of files containing post-processed traits from Quickdraws, in the same order. For Binary traits:

regenie \
   --step 2 \
   --pred ${output}_pred.list \ 
   --bed ${bed} \
   --phenoFile ${output}.traits \
   --covarFile ${covarFile} \
   --out ${output} \
   --bt \
   --firth \ 
   --approx \
   --pThresh 0.05 

NOTE: the summary statistics produced through this approach are not necessarily calibrated, as Regenie does not perform a calibration step.

Participation-bias corrected GWAS

We perform a weighted linear regression to correct for participation-bias in Step 2 of the algorithm. This is done by specifying a sampling weights file (see Schooleler et al. 2023 for ways to generate this).

quickdraws-step-2 \
   --out ${output} \
   --bed ${bed} \
   --out_step1 ${output} \
   --covarFile ${covar} \
   --unrel_sample_list unrelated_white_British.FID_IID.txt \
   --sampling_weights sampling_weights.txt 

Minimal complete examples

Once you install quickdraws, two executables should be available in your path: quickdraws-step-1 and quickdraws-step-2. Clone the Git repository to access the example data and script demonstrating how these can be used:

git clone https://github.com/PalamaraLab/quickdraws.git
cd quickdraws
bash run_example.sh

Additional details on input/output file formats

The file formats for various input files are as follows:

  1. Bed files: Genotypes stored in plink v1 .bed/.fam/.bim files used for model fitting, concatenated across chromosomes.
  2. Bgen files: Dosages stored in plink v2 .bgen/.sample files used for test statistics calculation, can be separated across chromosomes.
  3. Phenotype files: tab-separated phenotype values for samples, each row corresponds to “FID IID Trait1 Trait2 ..” Example:
FID IID Trait1 Trait2 Trait3
10001 10001 3.121 2.121 nan
10002 10002 1.178 8.21 -1.23
10004 10004 -1.012 -5.12 1.02
  1. Covariate files: tab-separated phenotype values for samples, each row corresponds to “FID IID Var1 Var2 ..”, binary or ordinal covariates are to be encoded as numerics. Example:
FID IID Var1 Var2 Var3
10001 10001 1 5.1121 6.12
10002 10002 0 nan 81.23
10004 10004 1 -3.19 4.12
  1. Kinship file: Used to calibrate test statistics in the presence of close relatives. The file should be tab-separated and have the following columns “ID1”, “ID2” and “Kinship” which stores the Kinship between ID1 and ID2. Example:
ID1 ID2 Kinship
10001 10002 0.15
10004 10001 0.26
10005 10010 0.07
  1. Unrel Sample List file: A list of homogenous unrelated samples in the dataset. The file should be tab-separated and record FID IID pairs of individuals which are unrelated and are homogenous. Example:
10001 10001
10003 10003
10005 10005
  1. Model SNPs file: A list of SNP IDs (corresponding to the 2nd column of the .bim file) one per line. Example:
rs123131
rs321219
10_3512121_CT_T
  1. Extract SNPs file: Useful to run association step for only a subset of variants, contains a list of SNP IDs one per line similar to model SNPs file.

Running on UK Biobank RAP

see UK Biobank RAP Tutorial

Citation

If you use this software, please cite:

H. Loya, G. Kalantzis, F. Cooper, P. F. Palamara. A scalable variational inference approach for increased mixed-model association power. bioRxiv, 2024.