Quickdraws GWAS Software Documentation
Quickdraws is a scalable method to perform genome-wide association studies (GWAS) for quantitative and binary traits. To run GWAS using Quickdraws, you will need three main input files: bed (and bgen) files with model-building and testing genetic variants, phenotype files, and covariate files. For certain analyses, you may also need a list of model SNPs and a file describing close genetic relatives.
- Installation
- Input Files
- Steps to Run Quickdraws
- List of Options
- Interpreting the Output
- FAQs
- Citation
Installation
It is strongly recommended to either set up a python virtual environment, or a conda environment:
python -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
conda create -n quickdraws python=3.11 -y
conda activate quickdraws
pip install --upgrade pip setuptools wheel
It is necessary for anything bigger than toy examples to use a cuda-enabled version of pytorch. Use the pytorch configuration helper to find suitable installation instructions for your system. The code snippet below will probably work for most systems:
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install quickdraws
Input Files
- Bed file: This contain the genotype data used to build the model (steps 0 and 1). Variants in a bed file can also be tested for association (Step 2).
- Phenotype file: This files have the phenotype information (see Additional Details on Input/Output File Formats for format details).
- Covariate file: This files include covariate data (see Additional Details on Input/Output File Formats for format details).
- Kinship file: Used to calibrate test statistics in the presence of close relatives (for UKB can be found in RAP at
Bulk/Genotype Results/Genotype calls/ukb_rel.dat
, see Additional Details on Input/Output File Formats for format details). - Unrel Sample List file: A list of homogenous unrelated samples in the dataset (for UKB see resource 22021, see Additional Details on Input/Output File Formats for format details).
- Bgen file: These contain the genotype data to be tested for association (Step 2).
- Sample file: A .sample file associated with the bgen file recording sample FID and IID.
- Model SNPs file: A list of SNPs for step 1 (optional, see Generating Model SNPs).
- Extract SNPs file: A List of SNPs for which test statistics will be computed (one SNP ID per line, same format as Model SNPs file).
Steps to Run Quickdraws
Generate an HDF5 file corresponding to the Bed file for fast sample-wise access. This step is performed once per genetic dataset. This step can be skipped! - you can directly jump to Step 1 which will generate the HDF5 file as a preprocessing step, but it is best practice to generate the HDF5 file for the entire dataset once and reuse it later.
convert-to-hdf5 \
--out master \
--bed ${bed} \
--modelSnps maf001.snps
This will convert the input ${bed}
file into the HDF5 file master.hdf5
. Only the SNPs listed in maf001.snps
will be converted. The maf001.snps
file should contain one SNP identifier per line; the identifier corresponds to the 2nd column of the BIM file.
Use spike-and-slab Bayesian linear regression to estimate genetic effects.
quickdraws-step-1 \
--out ${output_step1} \
--hdf5 master.hdf5 \
--bed ${bed} \
--phenoFile ${pheno} \
--covarFile ${covar} \
--modelSnps maf001.snps \
--kinship kinship.txt
${output_step1}
: File prefix where the output of Step 1 will be written.master.hdf5
: Generated in Step 0 (optional).${bed}
: Bed file containing the model SNPs.${pheno}
: File containing phenotype data. See input/output file formats section for format details.${covar}
: File containing covariates. See input/output file formats section for format details.maf001.snps
: List of SNPs for the regression (optional), described in Step 0.kinship.txt
: Contains pairwise relatives up to the 3rd degree. This file can be produced using KING as implemented in PLINK (see FAQs for an example).
Generate summary statistics for SNPs in a bed file (e.g., the model SNPs).
quickdraws-step-2 \
--out ${output_step2} \
--bed ${bed} \
--out_step1 ${output_step1} \
--covarFile ${covar} \
--unrel_sample_list unrelated_white_British.FID_IID.txt
${output_step2}
: File prefix where results will be written.${bed}
: The Bed file containing the SNPs to be tested.${output_step1}
: The output file from Step 1.${covar}
: File containing covariates. See input/output file formats section for format details.unrelated_white_British.FID_IID.txt
: List of unrelated homogeneous samples, or any other list of samples to be included in the calculation of test statistics. See input/output file formats section for format details.
Compute summary statistics for imputed data.
quickdraws-step-2 \
--out ${output_step2} \
--bgen ${bgen} \
--sample ${sample} \
--out_step1 ${output_step1} \
--calibrationFile ${output_step2}.calibration \
--covarFile ${covar} \
--extract snp_list.txt
${output_step2}
: File prefix where results will be written.${bgen}
: The Bgen file containing the variants to be tested.${sample}
: File containing the list of samples to be included. See input/output file formats section for format details.${output_step1}
: The output file from Step 1.${output_step2}.calibration
: Calibration coefficients (produced in Step 2).${covar}
: File containing covariates. See input/output file formats section for format details.snp_list.txt
: List of SNPs to test. Should contain one SNP identifier per line; the identifier correponds to the 2nd column of a bim file.
For binary traits, add the --binary
and --h2_grid
flags.
quickdraws-step-1 \
--out ${output_step1} \
--hdf5 master.hdf5 \
--bed ${bed} \
--phenoFile ${pheno} \
--covarFile ${covar} \
--modelSnps maf001.snps \
--kinship kinship.txt \
--binary \
--h2_grid
Input/output files are as described for quantitative traits.
Add the --binary
and --firth
flags for Firth logistic regression.
quickdraws-step-2 \
--out ${output_step2} \
--bed ${bed} \
--out_step1 ${output_step1} \
--covarFile ${covar} \
--unrel_sample_list unrelated_white_British.FID_IID.txt \
--binary \
--firth
Input/output files are as described for quantitative traits.
Compute summary statistics for imputed data.
quickdraws-step-2 \
--out ${output_step2} \
--bgen ${bgen} \
--sample ${sample} \
--out_step1 ${output_step1} \
--calibrationFile ${output_step2}.calibration \
--covarFile ${covar} \
--extract snp_list.txt \
--binary \
--firth
Input/output files are as described for quantitative traits.
List of options
You can use --help
to get a full list of options for step 1 and step 2 as follows:
quickdraws-step-1 --help
quickdraws-step-2 --help
Option | Required | Data-type | Description |
---|---|---|---|
–bed | Yes | STRING | Prefix for bed/bim/fam files |
–phenoFile | Yes | STRING | Phenotype file in “FID IID Trait1 Trait2 …” format, tab-separated |
–out | Yes | STRING | Prefix for output files |
–covarFile | No | STRING | Covariate file in “FID IID Var1 Var2 …” format, tab-separated |
–hdf5 | No | STRING | Path to the master HDF5 file (Generated in step 0, if unspecified is generated automatically) |
–modelSnps | No | STRING | Path to list of SNPs for model fitting (one SNP ID per line) |
–kinship | No | STRING | Path to KING table file, has ID1 ID2 Kinship as columns, tab-separated |
–binary | No | - | For binary traits |
–h2_grid | No | - | Perform a grid-search for heritability |
–lowmem | No | - | Run Quickdraws in low-memory mode |
–predBetasFlag | No | - | Calculate and store posterior mean effect estimates for prediction |
–train_split | No | FLOAT | Fraction of dataset for training (default = 0.9) |
–phen_thres | No | FLOAT | Phenotyping rate threshold (default = 0) |
–rhe_random_vectors | No | INT | Number of random vectors to be used by RHE to estimate h2 (default = 50) |
Option | Required | Data-type | Description |
---|---|---|---|
–bed | Yes | STRING | Prefix for bed/bim/fam files |
–bgen | Yes | STRING | Prefix for bgen file |
–sample | Yes | STRING | Prefix for sample file |
–out | Yes | STRING | Prefix for output files |
–out_step1 | Yes | STRING | Prefix for output files from Step 1 |
–covarFile | No | STRING | Covariate file in “FID IID Var1 Var2 …” format, tab-separated |
–unrel_sample_list | No | STRING | List of unrelated homogeneous samples in “FID IID” format, tab-separated |
–binary | No | - | For binary traits |
–firth | No | - | Firth logistic regression |
–calibrate | No | - | Use effective sample size estimate to calibrate test statistics (default=True) |
–calibrationFile | No | STRING | File with calibration constants; is estimated using step 2 on model fitting SNPs |
–extract | No | STRING | List of SNPs for which test statistics will be computed (one SNP ID per line) |
–sample_weights | No | STRING | Sampling weights file for participation bias correction in “FID IID Weight” format, tab-separated |
–firth_pval_thresh | No | FLOAT | P-value threshold for Firth regression (default = 0.05) |
–firth_maf_thresh | No | FLOAT | MAF threshold for Firth regression (default = 0.05) |
–firth_prevalence_thresh | No | FLOAT | Prevalence threshold for Firth regression (default = 0.05) |
–num_threads | No | INT | Number of threads (default = -1, which equals len(os.sched_getaffinity(0)) - 1) |
Interpreting the output
Quickdraws generates the following output files:
Filename | Step | Description |
---|---|---|
{out}.traits | 1 | Postprocessed traits in “FID IID Trait1 Trait2 …” format, tab-separated |
{out}.covar_effects | 1 | Covariate effects on traits in “FID IID Var1 Var2 …” format, tab-separated |
{out}.hdf5 | 1 | Genotype matrix HDF5 file |
{out}.h2 | 1 | Heritability for each trait, one heritability value per line, sorted based on trait order in {out}.traits |
{out}.alpha | 1 | Optimal sparsity parameter for each trait, one sparsity value per line, sorted based on trait order in {out}.traits |
{out}_pred.list | 1 | LOCO predictions for each trait, stores trait name and loco prediction filename per trait |
{out}.{pheno}.loco | 1 | LOCO predictions for phenotype ${pheno}, “FID IID loco1 loco2 ..” format, tab-separated |
{out}.{pheno}.posterior_betas | 1 | posterior betas for phenotype ${pheno}, “CHR GENPOS POS SNP BETA A1 A2” format, tab-separated |
{out}.{pheno}.sumstats.gz | 2 | Quickdraws association statistics (gzip-compressed) |
{out}.{pheno}_lrunrel.sumstats | 2 | Association statistics for linear regression on unrelated samples stored as an HDF file, use pd.read_hdf() to read |
{out}.calibration | 2 | Calibration factor from Step 2, per phenotype (row) stores calibration constants for each LOCO run (columns) |
FAQs
We only run Step 1 on common genotyped SNPs in the data. One can get the list using PLINK:
plink --bfile genotype --freq --out all_snps
awk '$5 > 0.01' all_snps.frq > maf001.snps
You can use PLINK2 to calculate the KING table for up to 3rd degree relatives as follows:
plink2 --make-king-table --king-table-filter 0.044 --bfile genotype --out king
To combine summary statistics across multiple chromosomes, you can run:
git clone https://github.com/PalamaraLab/quickdraws.git
cd quickdraws
python src/concat_sumstats.py --out ${out}
Quickdraws saves LOCO predictions from Step 1 in the same format as Regenie, so can be easily combined with Regenie. To calculate test statistics through Regenie (tested on Regenie v3.1) For Quantitative traits, run:
regenie \
--step 2 \
--pred ${output}_pred.list \
--bed ${bed} \
--phenoFile ${output}.traits \
--covarFile ${covarFile} \
--out ${output}
where ${output}_pred.list
is a list of files containing the LOCO predictions, one line per trait, and ${output}.traits is a list of files containing post-processed traits from Quickdraws, in the same order. For Binary traits:
regenie \
--step 2 \
--pred ${output}_pred.list \
--bed ${bed} \
--phenoFile ${output}.traits \
--covarFile ${covarFile} \
--out ${output} \
--bt \
--firth \
--approx \
--pThresh 0.05
NOTE: the summary statistics produced through this approach are not necessarily calibrated, as Regenie does not perform a calibration step.
We perform a weighted linear regression to correct for participation-bias in Step 2 of the algorithm. This is done by specifying a sampling weights file (see Schooleler et al. 2023 for ways to generate this).
quickdraws-step-2 \
--out ${output} \
--bed ${bed} \
--out_step1 ${output} \
--covarFile ${covar} \
--unrel_sample_list unrelated_white_British.FID_IID.txt \
--sampling_weights sampling_weights.txt
Once you install quickdraws
, two executables should be available in your path: quickdraws-step-1
and quickdraws-step-2
.
Clone the Git repository to access the example data and script demonstrating how these can be used:
git clone https://github.com/PalamaraLab/quickdraws.git
cd quickdraws
bash run_example.sh
The file formats for various input files are as follows:
- Bed files: Genotypes stored in plink v1 .bed/.fam/.bim files used for model fitting, concatenated across chromosomes.
- Bgen files: Dosages stored in plink v2 .bgen/.sample files used for test statistics calculation, can be separated across chromosomes.
- Phenotype files: tab-separated phenotype values for samples, each row corresponds to “FID IID Trait1 Trait2 ..” Example:
FID IID Trait1 Trait2 Trait3
10001 10001 3.121 2.121 nan
10002 10002 1.178 8.21 -1.23
10004 10004 -1.012 -5.12 1.02
- Covariate files: tab-separated phenotype values for samples, each row corresponds to “FID IID Var1 Var2 ..”, binary or ordinal covariates are to be encoded as numerics. Example:
FID IID Var1 Var2 Var3
10001 10001 1 5.1121 6.12
10002 10002 0 nan 81.23
10004 10004 1 -3.19 4.12
- Kinship file: Used to calibrate test statistics in the presence of close relatives. The file should be tab-separated and have the following columns “ID1”, “ID2” and “Kinship” which stores the Kinship between ID1 and ID2. Example:
ID1 ID2 Kinship
10001 10002 0.15
10004 10001 0.26
10005 10010 0.07
- Unrel Sample List file: A list of homogenous unrelated samples in the dataset. The file should be tab-separated and record FID IID pairs of individuals which are unrelated and are homogenous. Example:
10001 10001
10003 10003
10005 10005
- Model SNPs file: A list of SNP IDs (corresponding to the 2nd column of the .bim file) one per line. Example:
rs123131
rs321219
10_3512121_CT_T
- Extract SNPs file: Useful to run association step for only a subset of variants, contains a list of SNP IDs one per line similar to model SNPs file.
If you use this software, please cite:
H. Loya, G. Kalantzis, F. Cooper, P. F. Palamara. A scalable variational inference approach for increased mixed-model association power. bioRxiv, 2024.