Threads
Threads is a threading-based method to reconstruct the ancestral recombination graph (ARG) from large datasets of sequenced, imputed, or genotyped genomes and to perform ARG-based genotype imputation.
- Installation
- ARG inference
- ARG conversion
- VCF conversion
- Allele age estimation
- Genotype imputation
- Citation
To install Threads, run:
pip install threads_arg
Alternatively, you can download and install from source via the project’s GitHub page.
You will need
- genotypes in Plink2 format (
.pgen
/.pvar
/.psam
or.pgen
/.bim
/.fam
) - genetic map in SHAPEIT format, with columns
pos
,chr
, andcM
- demography file with two columns: generations in the past, effective population size in haploids (note: effective population size is often measured in diploids, make sure to multiply by a factor of 2 before using diploid demographies)
Minimal usage using the provided example data:
threads infer \
--pgen example/example_data.pgen \
--map example/example_data.map \
--demography example/Ne10000.demo \
--out example/example_data.threads
threads convert \
--threads example/example_data.threads \
--argn example/example_data.argn
This will first write a .threads
file and then convert it into an ARG-Needle-format (.argn
) ARG.
threads infer
accepts more options:
threads infer \
--pgen path/to/input.pgen \
--map path/to/genetic_map.gz \
--demography path/to/demography \
--out path/to/output.threads \
--modality [wgs|array] (default: wgs) \
--query_interval (default: 0.01) \
--match_group_interval (default: 0.5) \
--max_sample_batch_size (default: None) \
--mutation_rate (default: 1.4e-8) \
--region 1234-56789 (default: whole region, end-inclusive) \
--num_threads 8 (default: 1) \
--fit_to_data (optional) \
--allele_ages path/to/allele_ages.txt (optional)
--modality array
can be set for inference from genotyping arrays.
The genome-wide mutation rate can be set with --mutation_rate
. This defaults to a realistic human rate of 1.4e-8
per site per generation.
Specifying a --region start-end
means the output ARG is truncated to those base-pair coordinates (end-inclusive). The whole input set will still be used for inference.
Running threads
using parallelism is strongly encouraged and can be enabled by specifying --num_threads
Genetic maps may be found e.g. here.
--query_interval
and --match_group_interval
can be increased to save memory for inference over long genomic regions, this will usually have little impact on accuracy, especially for sparse variants.
Threads does not by default guarantee that input mutations correspond to monophyletic clades in the inferred ARG. However, this can be achieved by providing the --fit_to_data
flag. When threads infer
is run with the --fit_to_data
flag, the threading instructions are amended post-hoc such that each input mutation is represented by a clade in a marginal coalescence tree. As part of this process, Threads estimates the age of each mutation and then amends the local ARG topology. Allele ages can also be provided externally with the --allele_ages
argument, which should be a path to a headerless file with columns representing variant id, base-pair position, and allele age (measured in generations in the past). This step can also be run separately with thethreads allele-ages
command (see below).
Note: We will soon release an updated version of
arg-needle-lib
that speeds up conversion of.threads
files inferred with--fit_to_data
and reduces the size of the resulting.argn
files.
Threads output (.threads
files) can be converted to ARG-Needle-format ARGs (.argn
) or compressed tskit format (.tsz
) using the following commands:
threads convert \
--threads arg.threads \
--argn arg.argn \
--add_mutations (optional)
and
threads convert \
--threads arg.threads \
--tsz arg.tsz \
--add_mutations (optional)
When the --add-mutations
flag is provided, Threads parsimoniously adds mutations to the converted ARG
objects corresponding to those used to infer the ARG. Note that this can result in a high number of mutations when the threading instructions were not inferred with the --fit-to-data
flag.
Threads-encoded genotypes can be extracted to VCF format and printed to stdout
using the following command:
threads vcf \
--threads arg.threads \
--samples samples.samples \
--variants variants.pvar
Threads does not by default save sample/variant metadata. These can be provided by setting --samples
to a file containing IDs for samples used to build the ARG, one per line, and by setting --variants
to the .pvar
or .bim
file used to construct the ARG. If threads infer
was run using the --save_metadata
flag, sample/variant data are included in the .threads
archive.
The ages of mutations used to infer a set of threading instructions can be obtained using the allele-ages
command.
threads allele-ages \
--threads arg.threads \
--num_threads (default: 1) \
--out ages.txt
We highly recommend using parallelism with the --num_threads
flag.
Note that threading instructions inferred with the --fit-to-data
flag already contain allele age estimates.
Threads does not currently support age estimation of arbitrary mutations. If you are interested in this functionality, please contact the developers.
Imputation using Threads proceeds in three steps. First, an ARG is inferred from the reference panel, using the threads infer
procedure described above and converted to .argn
format using threads convert
.
Second, variants carried by the reference panel are assigned edges from within the ARG. This is performed using the threads map
function, which accepts the following options:
threads map \
--argn path/to/arg.argn \
--maf [only variants with MAF below this threshold are mapped, default: 0.02] \
--input [panel genotypes in bcf/vcf/vcf.gz format] \
--region [region to map, end-inclusive, in "1:234-567"-format] \
--num_threads [number of computational threads to request, default: 1] \
--out path/to/output.mut
For imputation, we recommend setting the --maf
threshold to 10 / N_panel
, where N_panel
is the number of individuals in the panel. The above writes the mutation mapping to path/to/output.mut
.
Finally, imputation is performed using the threads impute
command, which takes the following arguments:
threads impute \
--panel [panel genotypes in bcf/vcf/vcf.gz format] \
--target [target genotypes (arrays) in bcf/vcf/vcf.gz format] \
--mut [output from "threads map" command above] \
--map [path to genetic map in SHAPEIT format] \
--mutation_rate [default 1.4e-8] \
--demography [demographic history as used by threads infer] \
--region [region to map, end-inclusive, in "1:234-567"-format] \
--out [path to output.vcf] file \
The phasing experiments reported in Gunnarsson et al. were intended as proof of concept and were not optimized, so they were not included in the software release. Please contact us if you are interested in scripts to help reproduce these experiments.
If you use this software, please cite:
Á. F. Gunnarsson, J. Zhu, B. C. Zhang, Z. Tsangalidou, A. Allmont, P. Palamara. A scalable approach for genome-wide inference of ancestral recombination graphs. bioRxiv, 2024. https://www.biorxiv.org/content/10.1101/2024.08.31.610248v1