Threads

Threads is a threading-based method to reconstruct the ancestral recombination graph (ARG) from large datasets of sequenced, imputed, or genotyped genomes and to perform ARG-based genotype imputation.

Table of contents

Installation

To install Threads, run:

pip install threads_arg

Alternatively, you can download and install from source via the project’s GitHub page.

ARG inference

You will need

  • genotypes in Plink2 format (.pgen/.pvar/.psam or .pgen/.bim/.fam)
  • genetic map in SHAPEIT format, with columns pos, chr, and cM
  • demography file with two columns: generations in the past, effective population size in haploids (note: effective population size is often measured in diploids, make sure to multiply by a factor of 2 before using diploid demographies)

Minimal usage using the provided example data:

threads infer \
    --pgen example/example_data.pgen \
    --map example/example_data.map \
    --demography example/Ne10000.demo \
    --out example/example_data.threads

threads convert \
    --threads example/example_data.threads \
    --argn example/example_data.argn

This will first write a .threads file and then convert it into an ARG-Needle-format (.argn) ARG.

threads infer accepts more options:

threads infer \
    --pgen path/to/input.pgen \
    --map path/to/genetic_map.gz \
    --demography path/to/demography \
    --out path/to/output.threads \
    --modality [wgs|array] (default: wgs) \
    --query_interval (default: 0.01) \
    --match_group_interval (default: 0.5) \
    --max_sample_batch_size (default: None) \
    --mutation_rate (default: 1.4e-8) \
    --region 1234-56789 (default: whole region, end-inclusive) \
    --num_threads 8 (default: 1) \
    --fit_to_data (optional) \
    --allele_ages path/to/allele_ages.txt (optional)

--modality array can be set for inference from genotyping arrays.

The genome-wide mutation rate can be set with --mutation_rate. This defaults to a realistic human rate of 1.4e-8 per site per generation.

Specifying a --region start-end means the output ARG is truncated to those base-pair coordinates (end-inclusive). The whole input set will still be used for inference.

Running threads using parallelism is strongly encouraged and can be enabled by specifying --num_threads

Genetic maps may be found e.g. here.

--query_interval and --match_group_interval can be increased to save memory for inference over long genomic regions, this will usually have little impact on accuracy, especially for sparse variants.

Data-consistent ARGs

Threads does not by default guarantee that input mutations correspond to monophyletic clades in the inferred ARG. However, this can be achieved by providing the --fit_to_data flag. When threads infer is run with the --fit_to_data flag, the threading instructions are amended post-hoc such that each input mutation is represented by a clade in a marginal coalescence tree. As part of this process, Threads estimates the age of each mutation and then amends the local ARG topology. Allele ages can also be provided externally with the --allele_ages argument, which should be a path to a headerless file with columns representing variant id, base-pair position, and allele age (measured in generations in the past). This step can also be run separately with thethreads allele-ages command (see below).

Note: We will soon release an updated version of arg-needle-lib that speeds up conversion of .threads files inferred with --fit_to_data and reduces the size of the resulting .argn files.

ARG conversion

Threads output (.threads files) can be converted to ARG-Needle-format ARGs (.argn) or compressed tskit format (.tsz) using the following commands:

threads convert \
    --threads arg.threads \
    --argn arg.argn \
    --add_mutations (optional)

and

threads convert \
    --threads arg.threads \
    --tsz arg.tsz \
    --add_mutations (optional)

When the --add-mutations flag is provided, Threads parsimoniously adds mutations to the converted ARG objects corresponding to those used to infer the ARG. Note that this can result in a high number of mutations when the threading instructions were not inferred with the --fit-to-data flag.

VCF conversion

Threads-encoded genotypes can be extracted to VCF format and printed to stdout using the following command:

threads vcf \
    --threads arg.threads \
    --samples samples.samples \
    --variants variants.pvar

Threads does not by default save sample/variant metadata. These can be provided by setting --samples to a file containing IDs for samples used to build the ARG, one per line, and by setting --variants to the .pvar or .bim file used to construct the ARG. If threads infer was run using the --save_metadata flag, sample/variant data are included in the .threads archive.

Allele age estimation

The ages of mutations used to infer a set of threading instructions can be obtained using the allele-ages command.

threads allele-ages \
    --threads arg.threads \
    --num_threads (default: 1) \
    --out ages.txt

We highly recommend using parallelism with the --num_threads flag.

Note that threading instructions inferred with the --fit-to-data flag already contain allele age estimates.

Threads does not currently support age estimation of arbitrary mutations. If you are interested in this functionality, please contact the developers.

Genotype imputation

Imputation using Threads proceeds in three steps. First, an ARG is inferred from the reference panel, using the threads infer procedure described above and converted to .argn format using threads convert.

Second, variants carried by the reference panel are assigned edges from within the ARG. This is performed using the threads map function, which accepts the following options:

threads map \
    --argn path/to/arg.argn \
    --maf [only variants with MAF below this threshold are mapped, default: 0.02] \
    --input [panel genotypes in bcf/vcf/vcf.gz format] \
    --region [region to map, end-inclusive, in "1:234-567"-format] \
    --num_threads [number of computational threads to request, default: 1] \
    --out path/to/output.mut

For imputation, we recommend setting the --maf threshold to 10 / N_panel, where N_panel is the number of individuals in the panel. The above writes the mutation mapping to path/to/output.mut.

Finally, imputation is performed using the threads impute command, which takes the following arguments:

threads impute \
    --panel [panel genotypes in bcf/vcf/vcf.gz format] \
    --target [target genotypes (arrays) in bcf/vcf/vcf.gz format] \
    --mut [output from "threads map" command above] \
    --map [path to genetic map in SHAPEIT format] \
    --mutation_rate [default 1.4e-8] \
    --demography [demographic history as used by threads infer] \
    --region [region to map, end-inclusive, in "1:234-567"-format] \
    --out [path to output.vcf] file \

Phasing

The phasing experiments reported in Gunnarsson et al. were intended as proof of concept and were not optimized, so they were not included in the software release. Please contact us if you are interested in scripts to help reproduce these experiments.

Citation

If you use this software, please cite:

Á. F. Gunnarsson, J. Zhu, B. C. Zhang, Z. Tsangalidou, A. Allmont, P. Palamara. A scalable approach for genome-wide inference of ancestral recombination graphs. bioRxiv, 2024. https://www.biorxiv.org/content/10.1101/2024.08.31.610248v1