threads-arg
Threads-arg is a threading-based method to reconstruct the ancestral recombination graph (ARG) from large datasets of sequenced, imputed, or genotyped genomes and to perform ARG-based genotype imputation.
You will need
- genotypes in Plink2 format (
.pgen
/.pvar
/.psam
or.pgen
/.bim
/.fam
) - genetic map in SHAPEIT format, with columns
pos
,chr
, andcM
- demography file with two columns: generations in the past, effective population size in haploids (note: effective population size is often measured in diploids, make sure to multiply by a factor of 2 before using diploid demographies)
Minimal usage using the provided example data:
threads infer \
--pgen example/example_data.pgen \
--map example/example_data.map \
--demography example/Ne10000.demo \
--out example/example_data.threads
threads convert \
--threads example/example_data.threads \
--argn example/example_data.argn
This will write a .threads
file to path/to/output.threads
.
threads infer
accepts more options:
threads infer \
--pgen path/to/input.pgen \
--map path/to/genetic_map.gz \
--demography path/to/demography \
--out path/to/output.threads \
--modality [wgs|array] (default: wgs) \
--query_interval (default: 0.01) \
--match_group_interval (default: 0.5) \
--max_sample_batch_size (default: None) \
--mutation_rate (default: 1.4e-8) \
--region 1234-56789 (default: whole region, end-inclusive) \
--num_threads 8 (default: 1) \
--fit_to_data (optional) \
--allele_ages path/to/allele_ages.txt (optional)
--modality array
can be set for inference from genotyping arrays.
The genome-wide mutation rate can be set with --mutation_rate
. This defaults to a realistic human rate of 1.4e-8
per site per generation.
Specifying a --region start-end
means the output ARG is truncated to those base-pair coordinates (end-inclusive). The whole input set will still be used for inference.
Running threads
using parallelism is strongly encouraged and can be enabled by specifying --num_threads
Genetic maps may be found e.g. here.
Threads does not by default guarantee that input mutations correspond to monophyletic clades in the inferred ARG. However, this can be achieved by providing the --fit_to_data
flag. When threads infer
is run with the --fit_to_data
flag, the threading instructions are amended post-hoc such that each input mutation is represented by a clade in a marginal coalescence tree. As part of this process, Threads estimates the age of each mutation and then amends the local ARG topology. Allele ages can also be provided externally with the --allele_ages
argument, which should be a path to a headerless file with columns representing variant id, base-pair position, and allele age (measured in generations in the past). This step can also be run separately with thethreads allele-ages
command (see below).
--query_interval
and --match_group_interval
can be increased to save memory for inference over long genomic regions, this will usually have little impact on accuracy, especially for sparse variants.
Threads output (.threads
files) can be converted to ARG-Needle-format ARGs (.argn
) or compressed tskit format (.tsz
) using the following commands:
threads convert \
--threads arg.threads \
--argn arg.argn \
--add_mutations (optional)
and
threads convert \
--threads arg.threads \
--tsz arg.tsz \
--add_mutations (optional)
When the --add-mutations
flag is provided, Threads parsimoniously adds mutations to the converted ARG
objects corresponding to those used to infer the ARG. Note that this can result in a high number of mutations when the threading instructions were not inferred with the --fit-to-data
flag.
The ages of mutations used to infer a set of threading instructions can be obtained using the allele-ages
command.
threads allele-ages \
--threads arg.threads \
--num_threads (default: 1) \
--out ages.txt
We highly recommend using parallelism with the --num_threads
flag.
Note that threading instructions inferred with the --fit-to-data
flag already contain allele age estimates.
Threads does not currently support age estimation of arbitrary mutations. If you are interested in this functionality, please contact the developers.
Imputation using Threads proceeds in three steps. First, an ARG is inferred from the reference panel, using the threads infer
procedure described above and converted to .argn
format using threads convert
.
Second, variants carried by the reference panel are assigned edges from within the ARG. This is performed using the threads map
function, which accepts the following options:
threads map \
--argn path/to/arg.argn \
--maf [only variants with MAF below this threshold are mapped, default: 0.02] \
--input [panel genotypes in bcf/vcf/vcf.gz format] \
--region [region to map, end-inclusive, in "1:234-567"-format] \
--num_threads [number of computational threads to request, default: 1] \
--out path/to/output.mut
For imputation, we recommend setting the --maf
threshold to 10 / N_panel
, where N_panel
is the number of individuals in the panel. The above writes the mutation mapping to path/to/output.mut
.
Finally, imputation is performed using the threads impute
command, which takes the following arguments:
threads impute \
--panel [panel genotypes in bcf/vcf/vcf.gz format] \
--target [target genotypes (arrays) in bcf/vcf/vcf.gz format] \
--mut [output from "threads map" command above] \
--map [path to genetic map in SHAPEIT format] \
--mutation_rate [default 1.4e-8] \
--demography [demographic history as used by threads infer] \
--region [region to map, end-inclusive, in "1:234-567"-format] \
--out [path to output.vcf] file \
If you use this software, please cite:
Á. F. Gunnarsson, J. Zhu, B. C. Zhang, Z. Tsangalidou, A. Allmont, P. Palamara. A scalable approach for genome-wide inference of ancestral recombination graphs. bioRxiv, 2024. https://www.biorxiv.org/content/10.1101/2024.08.31.610248v1