threads-arg

Threads-arg is a threading-based method to reconstruct the ancestral recombination graph (ARG) from large datasets of sequenced, imputed, or genotyped genomes and to perform ARG-based genotype imputation.

ARG inference

You will need

  • genotypes in Plink2 format (.pgen/.pvar/.psam or .pgen/.bim/.fam)
  • genetic map in SHAPEIT format, with columns pos, chr, and cM
  • demography file with two columns: generations in the past, effective population size in haploids (note: effective population size is often measured in diploids, make sure to multiply by a factor of 2 before using diploid demographies)

Minimal usage using the provided example data:

threads infer \
    --pgen example/example_data.pgen \
    --map example/example_data.map \
    --demography example/Ne10000.demo \
    --out example/example_data.threads

threads convert \
    --threads example/example_data.threads \
    --argn example/example_data.argn

This will write a .threads file to path/to/output.threads.

threads infer accepts more options:

threads infer \
    --pgen path/to/input.pgen \
    --map path/to/genetic_map.gz \
    --demography path/to/demography \
    --out path/to/output.threads \
    --modality [wgs|array] (default: wgs) \
    --query_interval (default: 0.01) \
    --match_group_interval (default: 0.5) \
    --max_sample_batch_size (default: None) \
    --mutation_rate (default: 1.4e-8) \
    --region 1234-56789 (default: whole region, end-inclusive) \
    --num_threads 8 (default: 1) \
    --fit_to_data (optional) \
    --allele_ages path/to/allele_ages.txt (optional)

--modality array can be set for inference from genotyping arrays.

The genome-wide mutation rate can be set with --mutation_rate. This defaults to a realistic human rate of 1.4e-8 per site per generation.

Specifying a --region start-end means the output ARG is truncated to those base-pair coordinates (end-inclusive). The whole input set will still be used for inference.

Running threads using parallelism is strongly encouraged and can be enabled by specifying --num_threads

Genetic maps may be found e.g. here.

Threads does not by default guarantee that input mutations correspond to monophyletic clades in the inferred ARG. However, this can be achieved by providing the --fit_to_data flag. When threads infer is run with the --fit_to_data flag, the threading instructions are amended post-hoc such that each input mutation is represented by a clade in a marginal coalescence tree. As part of this process, Threads estimates the age of each mutation and then amends the local ARG topology. Allele ages can also be provided externally with the --allele_ages argument, which should be a path to a headerless file with columns representing variant id, base-pair position, and allele age (measured in generations in the past). This step can also be run separately with thethreads allele-ages command (see below).

--query_interval and --match_group_interval can be increased to save memory for inference over long genomic regions, this will usually have little impact on accuracy, especially for sparse variants.

ARG conversion

Threads output (.threads files) can be converted to ARG-Needle-format ARGs (.argn) or compressed tskit format (.tsz) using the following commands:

threads convert \
    --threads arg.threads \
    --argn arg.argn \
    --add_mutations (optional)

and

threads convert \
    --threads arg.threads \
    --tsz arg.tsz \
    --add_mutations (optional)

When the --add-mutations flag is provided, Threads parsimoniously adds mutations to the converted ARG objects corresponding to those used to infer the ARG. Note that this can result in a high number of mutations when the threading instructions were not inferred with the --fit-to-data flag.

Allele age estimation

The ages of mutations used to infer a set of threading instructions can be obtained using the allele-ages command.

threads allele-ages \
    --threads arg.threads \
    --num_threads (default: 1) \
    --out ages.txt

We highly recommend using parallelism with the --num_threads flag.

Note that threading instructions inferred with the --fit-to-data flag already contain allele age estimates.

Threads does not currently support age estimation of arbitrary mutations. If you are interested in this functionality, please contact the developers.

Imputation

Imputation using Threads proceeds in three steps. First, an ARG is inferred from the reference panel, using the threads infer procedure described above and converted to .argn format using threads convert.

Second, variants carried by the reference panel are assigned edges from within the ARG. This is performed using the threads map function, which accepts the following options:

threads map \
    --argn path/to/arg.argn \
    --maf [only variants with MAF below this threshold are mapped, default: 0.02] \
    --input [panel genotypes in bcf/vcf/vcf.gz format] \
    --region [region to map, end-inclusive, in "1:234-567"-format] \
    --num_threads [number of computational threads to request, default: 1] \
    --out path/to/output.mut

For imputation, we recommend setting the --maf threshold to 10 / N_panel, where N_panel is the number of individuals in the panel. The above writes the mutation mapping to path/to/output.mut.

Finally, imputation is performed using the threads impute command, which takes the following arguments:

threads impute \
    --panel [panel genotypes in bcf/vcf/vcf.gz format] \
    --target [target genotypes (arrays) in bcf/vcf/vcf.gz format] \
    --mut [output from "threads map" command above] \
    --map [path to genetic map in SHAPEIT format] \
    --mutation_rate [default 1.4e-8] \
    --demography [demographic history as used by threads infer] \
    --region [region to map, end-inclusive, in "1:234-567"-format] \
    --out [path to output.vcf] file \

Citation

If you use this software, please cite:

Á. F. Gunnarsson, J. Zhu, B. C. Zhang, Z. Tsangalidou, A. Allmont, P. Palamara. A scalable approach for genome-wide inference of ancestral recombination graphs. bioRxiv, 2024. https://www.biorxiv.org/content/10.1101/2024.08.31.610248v1