Quickdraws GWAS Software Documentation

Quickdraws is a scalable method to perform genome-wide association studies (GWAS) for quantitative and binary traits. To run GWAS using Quickdraws, you will need three main input files: bed (and bgen) files with model-building and testing genetic variants, phenotype files, and covariate files. For certain analyses, you may also need a list of model SNPs and a file describing close genetic relatives.

Installation
Input Files
- Required Input Files
- Optional Input Files
Steps to Run Quickdraws
List of Options
- Main Options for Step 1
- Main Options for Step 2
Interpreting the Output
FAQs
Citation

Installation

It is strongly recommended to either set up a python virtual environment, or a conda environment:

Python virtual environment

python -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel

Conda environment

conda create -n quickdraws python=3.11 -y
conda activate quickdraws
pip install --upgrade pip setuptools wheel

Install pytorch and quickdraws

It is necessary for anything bigger than toy examples to use a cuda-enabled version of pytorch. Use the pytorch configuration helper to find suitable installation instructions for your system. The code snippet below will probably work for most systems:

pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install quickdraws

Input Files

Required Input Files

Bed file: This contain the genotype data used to build the model (steps 0 and 1). Variants in a bed file can also be tested for association (Step 2).
Phenotype file: This files have the phenotype information (see Additional Details on Input/Output File Formats for format details).
Covariate file: This files include covariate data (see Additional Details on Input/Output File Formats for format details).
Kinship file: Used to calibrate test statistics in the presence of close relatives (for UKB can be found in RAP at Bulk/Genotype Results/Genotype calls/ukb_rel.dat, see Additional Details on Input/Output File Formats for format details).
Unrel Sample List file: A list of homogenous unrelated samples in the dataset (for UKB see resource 22021, see Additional Details on Input/Output File Formats for format details).

Optional Input Files

Bgen file: These contain the genotype data to be tested for association (Step 2).
Sample file: A .sample file associated with the bgen file recording sample FID and IID.
Model SNPs file: A list of SNPs for step 1 (optional, see Generating Model SNPs).
Extract SNPs file: A List of SNPs for which test statistics will be computed (one SNP ID per line, same format as Model SNPs file).

Steps to Run Quickdraws

Step 0: Convert Bed File to HDF5 (Optional)

Generate an HDF5 file corresponding to the Bed file for fast sample-wise access. This step is performed once per genetic dataset. This step can be skipped! - you can directly jump to Step 1 which will generate the HDF5 file as a preprocessing step, but it is best practice to generate the HDF5 file for the entire dataset once and reuse it later.

convert-to-hdf5 \
  --out master \
  --bed ${bed} \
  --modelSnps maf001.snps

This will convert the input ${bed} file into the HDF5 file master.hdf5. Only the SNPs listed in maf001.snps will be converted. The maf001.snps file should contain one SNP identifier per line; the identifier corresponds to the 2nd column of the BIM file. You can additionally adjust the chunk size of the genotype on RAM using the --chunksize argument (default = 8192), higher values mean faster compute but more memory usage.

Quantitative traits

Step 1

Use spike-and-slab Bayesian linear regression to estimate genetic effects.

quickdraws-step-1 \
    --out ${output_step1} \
    --hdf5 master.hdf5 \
    --bed ${bed} \
    --phenoFile ${pheno} \
    --covarFile ${covar} \
    --modelSnps maf001.snps \
    --kinship kinship.txt

${output_step1}: File prefix where the output of Step 1 will be written.
master.hdf5: Generated in Step 0 (optional).
${bed}: Bed file containing the model SNPs.
${pheno}: File containing phenotype data. See input/output file formats section for format details.
${covar}: File containing covariates. See input/output file formats section for format details.
maf001.snps: List of SNPs for the regression (optional), described in Step 0.
kinship.txt: Contains pairwise relatives up to the 3rd degree. This file can be produced using KING as implemented in PLINK (see FAQs for an example).

Step 2 (bed files)

Generate summary statistics for SNPs in a bed file (e.g., the model SNPs).

quickdraws-step-2 \
   --out ${output_step2} \
   --bed ${bed} \
   --out_step1 ${output_step1} \
   --covarFile ${covar} \
   --unrel_sample_list unrelated_white_British.FID_IID.txt

${output_step2}: File prefix where results will be written.
${bed}: The Bed file containing the SNPs to be tested.
${output_step1}: The output file from Step 1.
${covar}: File containing covariates. See input/output file formats section for format details.
unrelated_white_British.FID_IID.txt: List of unrelated homogeneous samples, or any other list of samples to be included in the calculation of test statistics. See input/output file formats section for format details.

Step 2 (bgen files)

Compute summary statistics for imputed data.

quickdraws-step-2 \
    --out ${output_step2} \
    --bgen ${bgen} \
    --sample ${sample} \
    --out_step1 ${output_step1}  \
    --calibrationFile ${output_step2}.calibration \
    --covarFile ${covar} \
    --extract snp_list.txt

${output_step2}: File prefix where results will be written.
${bgen}: The Bgen file containing the variants to be tested.
${sample}: File containing the list of samples to be included. See input/output file formats section for format details.
${output_step1}: The output file from Step 1.
${output_step2}.calibration: Calibration coefficients (produced in Step 2).
${covar}: File containing covariates. See input/output file formats section for format details.
snp_list.txt: List of SNPs to test. Should contain one SNP identifier per line; the identifier correponds to the 2nd column of a bim file.

Binary traits

Step 1

For binary traits, add the --binary and --h2_grid flags.

quickdraws-step-1 \
    --out ${output_step1} \
    --hdf5 master.hdf5 \
    --bed ${bed} \
    --phenoFile ${pheno} \
    --covarFile ${covar} \
    --modelSnps maf001.snps \
    --kinship kinship.txt \
    --binary \
    --h2_grid

Input/output files are as described for quantitative traits.

Step 2 (bed files)

Add the --binary and --firth flags for Firth logistic regression.

quickdraws-step-2 \
   --out ${output_step2} \
   --bed ${bed} \
   --out_step1 ${output_step1} \
   --covarFile ${covar} \
   --unrel_sample_list unrelated_white_British.FID_IID.txt \
   --binary \
   --firth

Input/output files are as described for quantitative traits.

Step 2 (bgen files)

Compute summary statistics for imputed data.

quickdraws-step-2 \
    --out ${output_step2} \
    --bgen ${bgen} \
    --sample ${sample} \
    --out_step1 ${output_step1}  \
    --calibrationFile ${output_step2}.calibration \
    --covarFile ${covar} \
    --extract snp_list.txt \
    --binary \
    --firth

Input/output files are as described for quantitative traits.

List of options

You can use --help to get a full list of options for step 1 and step 2 as follows:

quickdraws-step-1 --help
quickdraws-step-2 --help

Main options for step 1

Option	Required	Data-type	Description
–bed	Yes	STRING	Prefix for bed/bim/fam files
–phenoFile	Yes	STRING	Phenotype file in “FID IID Trait1 Trait2 …” format, tab-separated
–out	Yes	STRING	Prefix for output files
–covarFile	No	STRING	Covariate file in “FID IID Var1 Var2 …” format, tab-separated
–hdf5	No	STRING	Path to the master HDF5 file (Generated in step 0, if unspecified is generated automatically)
–modelSnps	No	STRING	Path to list of SNPs for model fitting (one SNP ID per line)
–kinship	No	STRING	Path to KING table file, has ID1 ID2 Kinship as columns, tab-separated
–binary	No	-	For binary traits
–h2_grid	No	-	Perform a grid-search for heritability
–lowmem	No	-	Run Quickdraws in low-memory mode
–predBetasFlag	No	-	Calculate and store posterior mean effect estimates for prediction
–train_split	No	FLOAT	Fraction of dataset for training (default = 0.9)
–phen_thres	No	FLOAT	Phenotyping rate threshold (default = 0)
–rhe_random_vectors	No	INT	Number of random vectors to be used by RHE to estimate h2 (default = 50)

Main options for step 2

Option	Required	Data-type	Description
–bed	Yes	STRING	Prefix for bed/bim/fam files
–bgen	Yes	STRING	Prefix for bgen file
–sample	Yes	STRING	Prefix for sample file
–out	Yes	STRING	Prefix for output files
–out_step1	Yes	STRING	Prefix for output files from Step 1
–covarFile	No	STRING	Covariate file in “FID IID Var1 Var2 …” format, tab-separated
–unrel_sample_list	No	STRING	List of unrelated homogeneous samples in “FID IID” format, tab-separated
–binary	No	-	For binary traits
–firth	No	-	Firth logistic regression
–calibrate	No	-	Use effective sample size estimate to calibrate test statistics (default=True)
–calibrationFile	No	STRING	File with calibration constants; is estimated using step 2 on model fitting SNPs
–extract	No	STRING	List of SNPs for which test statistics will be computed (one SNP ID per line)
–sample_weights	No	STRING	Sampling weights file for participation bias correction in “FID IID Weight” format, tab-separated
–firth_pval_thresh	No	FLOAT	P-value threshold for Firth regression (default = 0.05)
–firth_maf_thresh	No	FLOAT	MAF threshold for Firth regression (default = 0.05)
–firth_prevalence_thresh	No	FLOAT	Prevalence threshold for Firth regression (default = 0.05)
–num_threads	No	INT	Number of threads (default = -1, which equals len(os.sched_getaffinity(0)) - 1)

Interpreting the output

Quickdraws generates the following output files:

Filename	Step	Description
{out}.traits	1	Postprocessed traits in “FID IID Trait1 Trait2 …” format, tab-separated
{out}.covar_effects	1	Covariate effects on traits in “FID IID Var1 Var2 …” format, tab-separated
{out}.hdf5	1	Genotype matrix HDF5 file
{out}.h2	1	Heritability for each trait, one heritability value per line, sorted based on trait order in {out}.traits
{out}.alpha	1	Optimal sparsity parameter for each trait, one sparsity value per line, sorted based on trait order in {out}.traits
{out}_pred.list	1	LOCO predictions for each trait, stores trait name and loco prediction filename per trait
{out}.{pheno}.loco	1	LOCO predictions for phenotype ${pheno}, “FID IID loco1 loco2 ..” format, tab-separated
{out}.{pheno}.posterior_betas	1	posterior betas for phenotype ${pheno}, “CHR GENPOS POS SNP BETA A1 A2” format, tab-separated
{out}.{pheno}.sumstats.gz	2	Quickdraws association statistics (gzip-compressed)
{out}.{pheno}_lrunrel.sumstats	2	Association statistics for linear regression on unrelated samples stored as an HDF file, use pd.read_hdf() to read
{out}.calibration	2	Calibration factor from Step 2, per phenotype (row) stores calibration constants for each LOCO run (columns)

FAQs

Generating Model SNPs

We only run Step 1 on common genotyped SNPs in the data. One can get the list using PLINK:

plink --bfile genotype --freq --out all_snps
awk '$5 > 0.01' all_snps.frq > maf001.snps

Generating KING table

You can use PLINK2 to calculate the KING table for up to 3rd degree relatives as follows:

plink2 --make-king-table --king-table-filter 0.044 --bfile genotype --out king

Concatenating summary statistics

To combine summary statistics across multiple chromosomes, you can run:

git clone https://github.com/PalamaraLab/quickdraws.git
cd quickdraws
python src/concat_sumstats.py --out ${out}

Using Regenie for step 2

Quickdraws saves LOCO predictions from Step 1 in the same format as Regenie, so can be easily combined with Regenie. To calculate test statistics through Regenie (tested on Regenie v3.1) For Quantitative traits, run:

regenie \
   --step 2 \
   --pred ${output}_pred.list \
   --bed ${bed} \ 
   --phenoFile ${output}.traits \
   --covarFile ${covarFile} \
   --out ${output}

where ${output}_pred.list is a list of files containing the LOCO predictions, one line per trait, and ${output}.traits is a list of files containing post-processed traits from Quickdraws, in the same order. For Binary traits:

regenie \
   --step 2 \
   --pred ${output}_pred.list \ 
   --bed ${bed} \
   --phenoFile ${output}.traits \
   --covarFile ${covarFile} \
   --out ${output} \
   --bt \
   --firth \ 
   --approx \
   --pThresh 0.05

NOTE: the summary statistics produced through this approach are not necessarily calibrated, as Regenie does not perform a calibration step.

Participation-bias corrected GWAS

We perform a weighted linear regression to correct for participation-bias in Step 2 of the algorithm. This is done by specifying a sampling weights file (see Schooleler et al. 2023 for ways to generate this).

quickdraws-step-2 \
   --out ${output} \
   --bed ${bed} \
   --out_step1 ${output} \
   --covarFile ${covar} \
   --unrel_sample_list unrelated_white_British.FID_IID.txt \
   --sampling_weights sampling_weights.txt

Minimal complete examples

Once you install quickdraws, two executables should be available in your path: quickdraws-step-1 and quickdraws-step-2. Clone the Git repository to access the example data and script demonstrating how these can be used:

git clone https://github.com/PalamaraLab/quickdraws.git
cd quickdraws
bash run_example.sh

Additional details on input/output file formats

The file formats for various input files are as follows:

Bed files: Genotypes stored in plink v1 .bed/.fam/.bim files used for model fitting, concatenated across chromosomes.
Bgen files: Dosages stored in plink v2 .bgen/.sample files used for test statistics calculation, can be separated across chromosomes.
Phenotype files: tab-separated phenotype values for samples, each row corresponds to “FID IID Trait1 Trait2 ..” Example:

FID IID Trait1 Trait2 Trait3
10001 10001 3.121 2.121 nan
10002 10002 1.178 8.21 -1.23
10004 10004 -1.012 -5.12 1.02

Covariate files: tab-separated phenotype values for samples, each row corresponds to “FID IID Var1 Var2 ..”, binary or ordinal covariates are to be encoded as numerics. Example:

FID IID Var1 Var2 Var3
10001 10001 1 5.1121 6.12
10002 10002 0 nan 81.23
10004 10004 1 -3.19 4.12

Kinship file: Used to calibrate test statistics in the presence of close relatives. The file should be tab-separated and have the following columns “ID1”, “ID2” and “Kinship” which stores the Kinship between ID1 and ID2. Example:

ID1 ID2 Kinship
10001 10002 0.15
10004 10001 0.26
10005 10010 0.07

Unrel Sample List file: A list of homogenous unrelated samples in the dataset. The file should be tab-separated and record FID IID pairs of individuals which are unrelated and are homogenous. Example:

10001 10001
10003 10003
10005 10005

Model SNPs file: A list of SNP IDs (corresponding to the 2nd column of the .bim file) one per line. Example:

rs123131
rs321219
10_3512121_CT_T

Extract SNPs file: Useful to run association step for only a subset of variants, contains a list of SNP IDs one per line similar to model SNPs file.

Running on UK Biobank RAP

see UK Biobank RAP Tutorial

Citation

If you use this software, please cite:

H. Loya, G. Kalantzis, F. Cooper, P. F. Palamara. A scalable variational inference approach for increased mixed-model association power. Nature Genetics, 2025.

Quickdraws GWAS Software Documentation

Table of Contents

Installation

Python virtual environment

Conda environment

Install pytorch and quickdraws

Input Files

Required Input Files

Optional Input Files

Steps to Run Quickdraws

Step 0: Convert Bed File to HDF5 (Optional)

Quantitative traits

Step 1

Step 2 (bed files)

Step 2 (bgen files)

Binary traits

Step 1

Step 2 (bed files)

Step 2 (bgen files)

List of options

Main options for step 1

Main options for step 2

Interpreting the output

FAQs

Generating Model SNPs

Generating KING table

Concatenating summary statistics

Using Regenie for step 2

Participation-bias corrected GWAS

Minimal complete examples

Additional details on input/output file formats

Running on UK Biobank RAP

Citation