ARG-Needle and arg-needle-lib

Note: Please contact us if you have any questions, suggestions, or bug reports related to the contents of this manual.

The ARG-Needle software suite enables inferring ARGs from genotyping array (or sequencing) data as well as performing several ARG-based analyses such as genealogy-wide association. To install the software suite, run

pip install arg-needle

Python wheels are currently available for Linux and macOS on x86_64 architecture, supporting Python 3.8–3.11 (inclusive).

The above command will install two main packages: arg-needle, which focuses on ARG inference, and arg-needle-lib, which contains everything else such as the ARG data structure and ARG-based analyses.

arg-needle-lib exposes a Python API with ARG analysis functions, some of which are described below.

arg-needle and arg-needle-lib can be imported in a Python session as follows:

import arg_needle      # ARG inference
import arg_needle_lib  # ARG-based analyses such as association

This manual describes the use of these libraries for ARG inference and analysis.

ARG inference
ARG-based analyses
Source code and contributing
File formats
Contact
Citation

ARG inference

Quickstart

Running pip install arg-needle will set up command-line scripts to quickly try out ARG-Needle. First, run the following to simulate 2 Mb of SNP array data with 200 diploid individuals.

arg_needle_prepare_example

The above writes files to example.hap.gz, example.sample, example.map. Now, you can run ARG inference in various ways:

arg_needle                  # Regular ARG-Needle
arg_needle --asmc_clust 1   # ASMC-clust
arg_needle --normalize 0    # ARG-Needle without ARG normalization

By default, arg_needle reads from example.hap[s][.gz], example.sample, and example.map and writes to example.argn. To adapt it to your own data, you can modify the --hap_gz, --map, and --out flags (see below).

ARG-Needle can analyze either SNP array (default) or sequencing data, although its parameters have been optimized for use with SNP data, so it may be slow when applied to sequencing data. When run in sequencing mode, ARG-Needle accepts the same input format but applies a different underlying ASMC model. To use sequencing mode with 0.5 Mb of simulated data, run:

arg_needle_prepare_example --mode sequence --length 5e5
arg_needle --mode sequence  # Sequencing instead of array

Basic options

We recommend becoming familiar with the following options:

--hap_gz                Path to .hap[s][.gz] file (default=example.hap.gz)
--map                   Path to genetic map (default=example.map)
--out                   Path to output without .argn extension (default=example)
--mode                  Data mode ("array" or "sequence", default="array")
--normalize             Whether to apply ARG normalization (default=1, nonzero
                            means true)
--normalize_demography  Demography with haploid population sizes for normalization
                            (default=/path/to/site-packages/arg_needle/resources/
                            CEU.demo)
--asmc_clust            Whether to use ASMC-clust instead of ARG-Needle (default=
                            0, nonzero means true)
--asmc_decoding_file    Where to find decoding quantities (default=/path/to/site-
                            packages/arg_needle/resources/30-100-2000_
                            CEU.decodingQuantities.gz)
--chromosome            Chromosome number to store in the ARG (default=1)

Input/output

The --hap_gz flag expects a file of the form root.hap, root.haps, root.hap.gz, or root.haps.gz. It then reads that haps file as well as root.sample. In the case that your haps and sample files have a different prefix, we recommend creating a symlink. (For instance, given name1.hap.gz and name2.sample, create a symlink using ln -s name2.sample name1.sample.) Note: ARG-Needle expects the first two columns of the .samples file (ID_1 and ID_2) to be identical for each sample.

The --map file provides the genetic and physical positions for the inference algorithms. Both the genetic and physical positions must be strictly increasing.

The formats of these files are briefly described here.

ARG-Needle and ASMC-clust can be applied to sequencing or SNP array data using the --mode flag, which is set to array by default. For analyses involving sequencing data, it should be set to sequence.

Inferred ARGs are written to the path specified using --out in .argn format and can be analyzed or converted as described below. A saved ARG in .argn format can later be loaded using this function (see other ARG analysis functions).

ARG normalization

ARG-Needle and ASMC-clust by default run an ARG normalization step after performing inference. This step aims to improve the estimated branch lengths within the ARG to match those expected in a population with a given demographic history and is performed by default. This can be changed using --normalize 0.

The demographic history used for normalization is provided using the --normalize_demography flag. The default --normalize_demography is the CEU.demo file from this repository and is included with package installation. It represents a European demographic model with haploid population sizes. Several other demographic models are available here, or you can provide your own, taking care that the population sizes are haploid.

ARG-Needle and ASMC-clust

The default algorithm used for ARG inference if ARG-Needle. To use ASMC-clust instead, you can set--asmc_clust 1.

ASMC decoding quantities

ARG-Needle and ASMC-clust rely on the ASMC algorithm to infer ARGs. ASMC uses a .decodingQuantities.gz file storing information about the input data, such as the demographic model, the mutation rate, and the allele frequency spectrum of the array data. This is set using --asmc_decoding_file.

The default --asmc_decoding_file is the 30-100-2000_CEU.decodingQuantities.gz file from this repository and is included with package installation. It corresponds to CEU.demo demography (which assumes a mutation rate of 1.65E-8), UK Biobank array frequency spectrum, and the 30-100-2000 time discretization (see README here). We recommend starting out with these decoding quantities even if working with a different population or genotyping array, as long as the ARG normalization step uses the correct demography. For strong deviations of both demographic history and genotype array density and frequency spectrum, you can prepare your own decoding quantities following this example. Note that the same decoding file is used for both array and sequence mode so you don’t need to generate multiple decoding quantities.

Chromosome

The --chromosome flag simply sets a chromosome number in the inferred ARG file and does not affect the inference process itself. This is used, for instance, to write out a chromosome value in files containing association summary statistics for a genealogy-wide association analysis. When reading in .hap[s][.gz] / .map files, ARG-Needle and ASMC-clust discard any chromosome information.

Additional ARG inference options

The following additional parameters may be customized when running the arg_needle script.

--verbose                 Whether to log verbose output (default=0, nonzero means
                              true)
--asmc_clust_chunk_sites  Number of sites per chunk for memory-efficient ASMC-clust
                              (default=-1 meaning not used)
--asmc_pad_cm             ASMC padding in cM (default=2.0)
--sequence_hash_cm        Sequence mode hashing window size in cM (default=0.3)
--snp_hash_cm             Array mode hashing window size in cM (default=0.5)
--hash_tolerance          Hashing algorithm tolerance (default=1)
--hash_topk               How many closest samples to return using hashing
                              (default=64)
--hash_word_size          Hashing word size, must be between 1 and 64 (default=16)
--backup_hash_word_size   Backup hashing word size (must be between 0 and 64, 0
                              means no backup for real data inference, default=8)

For details about these parameters, please see our manuscript. For instance, the --hash_word_size parameter can be incrementally increased to speed up computation in multi-step large-scale ARG inference tasks.

Larger analyses and ARG trimming

For larger biobank-scale analyses, ARG inference benefits from being split into multiple “chunks” throughout each chromosome. This reduces the memory requirements and runtime of each job.

In the UK Biobank analysis detailed in our manuscript, we partitioned array variants into 749 non-overlapping chunks for parallelized inference. However, the ARG-Needle algorithm benefits from having additional context when available, so we padded each of the non-overlapping chunks with 50 SNPs on either side, removing these regions when ARG-Needle had finished (see Methods). Additionally, we used one set of ARG-Needle parameters for the first 50K individuals, and switched to parameters allowing faster inference for the remaining individuals up to 337K (see Supplementary Note 1).

All the above options can be set using the more advanced script arg_needle_multistep. This includes the switching of parameters midway through inference as well as “trimming” the padding regions once ARG-Needle finishes. For your own data, you will need to prepare the input chunks as .hap[s][.gz] / .map files with padding SNPs included, but you can try out the script first using our example data.

If you have not already done so, generate the example SNP array data:

arg_needle_prepare_example

You can then run the multistep analysis in 3 steps:

arg_needle_multistep --step 1 --num_snp_samples 200
arg_needle_multistep --step 2
arg_needle_multistep --step 3

Step 1 builds an ARG on the first 200 haploid samples using either ARG-Needle or ASMC-clust. Step 2 threads the remaining samples in the data, and can accept different hashing parameters as needed. Step 3 trims 50 padding SNPs from each side and applies ARG normalization if specified.

When a chunk is at the start or end of the chromosome, there may be no padding SNPs available. In this case, the trimming should not operate on that side. To run step 3 only trimming 50 padding SNPs from the end, run:

arg_needle_multistep --step 3 --trim_num_snps 0,50

The script requires that you run all three steps in order. However, if you do not need to change any options for step 2, you can set --num_snp_samples (or --num_sequence_samples if running in sequence mode) to the full size of dataset for step 1. Running step 2 will then simply write out the same ARG to *.step2.argn as is already in *.step1.argn.

Below are the additional options provided by the arg_needle_multistep script.

--step                  Which of steps 1, 2, 3 to run (default=1)
--num_snp_samples       Number of haploid SNP samples at end of inference
                            (default=0)
--num_sequence_samples  Number of haploid sequence samples at end of inference
                            (default=0)
--trim_num_snps         Number of padding SNPs to trim off each side (default=50).
                            Can either be a single integer or a string of two
                            integers separated by a single comma without space,
                            e.g. 0,50. In the latter case the two values represent
                            the trimming from the beginning and the end.

Troubleshooting

The list below documents common errors and warnings. If you encounter any issues, please take a look at this list and get in touch for any additional questions or bug reports.

Click to expand a list of commonly encountered problems

Commonly encountered problems

Q: Why am I getting this warning? "Warning: few cousins found for ID ... If you see this >10% of the time, decrease the hash word size."
A: This warning is a recommendation to decrease the --hash_word_size. If --backup_hash_word_size is 0, meaning there is no backup hashing, it can also be useful to set it to a value such as 8 or 4. However, as the message says, if these warnings disappear after a few IDs, or are infrequent and comprise less than 10% of IDs, it should not be a problem.

Q: What does position mean in the resulting ARG object?
A: IMPORTANT: the positions in the resulting objects are shifted by an amount. To get the true position, add the ARG offset parameter to the position. So, if your map file has SNPs going from 10 Mb to 12 Mb, the offset will be around 10 Mb and the ARG length will be around 2 Mb.

Q: Why is the resulting ARG object shorter or longer than the simulated region?
A: ARG-Needle uses a range that goes from the first SNP to the last SNP, with a small interval of additional space on each side. This may not necessarily correspond with the original simulation.

Q: How do I determine appropriate hash word sizes?
A: The default parameters of --hash_word_size=16 and --backup_hash_word_size=8 should suffice for most cases. Use larger values to speed up runtime when building large ARGs, and smaller values for greater accuracy in the hashing. See also the Methods and Supplementary Note 1 of the manuscript.

Q: I get an error like the following: “ERROR. SNP snp1 has 200 non-missing individuals, but the CSFS requires 300”. Why?
A: The ASMC Hidden Markov model requires at least 300 haploid samples to build the conditional site frequency spectrum (CSFS). This corresponds to at least 150 diploids.

Q: I get an error like the following: “Please set reserved_samples to at least 400”. What should I do?
A: This error comes from the extend_arg() function, used to thread additional samples to an ARG. While deserializing the ARG, please set the reserved_samples argument to at least this value, to “make room” for the additional samples.

Q: I get an error like the following: “Trying to thread to 200 haploid samples, but initial ARG already has 200 haploid samples”. What should I do?
A: This error comes from the extend_arg() function, used to thread additional samples to an ARG. Please make sure that args.num_snp_samples is set to a value larger than the size of the existing ARG.

Q: What is the verbose output being shown during the ARG building?
A: ARG-Needle logs the following: (1) the sample ID being threaded, whenever ID % 10 == 0; (2) the hash results for the sample whenever ID % 100 == 1; (3) various notices and warnings about the hashing for all IDs.

ARG-based analyses

Running pip install arg-needle will automatically pull in the arg-needle-lib package, which implements ARG-based analyses. arg-needle-lib can also be used and/or installed independently from arg-needle by running pip install arg-needle-lib, for instance for performing analyses using previously inferred or simulated ARGs.

arg-needle-lib can be imported in a Python session as follows:

import arg_needle_lib

arg-needle-lib works with arg_needle_lib.ARG objects, which are the default output format of ARG-Needle ARG inference. arg-needle-lib comes pre-packaged with a script to perform ARG-based association of phenotypes, as well as scripts to convert between arg_needle_lib.ARG, Newick, and tskit.TreeSequence formats.

Genealogy-wide association

The script arg_association uses the arg-needle-lib library to perform ARG-based association, which we refer to as genealogy-wide association in the paper. It comes packaged with arg-needle-lib and should be automatically added to your path.

We have prepared an example to quickly try out genealogy-wide association. After installing arg-needle-lib with pip, simply run

arg_association_prepare_example
arg_association

You can view the source code of arg_association_prepare_example here and the source of arg_association here.

Running the first command will prepare files in the directory files/ relative to wherever you have called the script. In particular, it generates a phenotype files/example.pheno derived from a single variant in files/chr1.chunk1.argn with an allele count of 20. Running the second command then performs association of the ARG with the phenotype, with results output to files/results.tab.gz and files/results.haps.gz. To extract the p-values and MAC and display them with smallest p-value first, run

zcat < files/results.tab.gz | awk '{print $9" "$6}' | sort -g | less

You should observe a few highly significant ARG variants with MAC = 20.

The above run will yield an empty files/results.haps.gz file. To run the example with a different set of parameters, you might try

arg_association_prepare_example
arg_association --min_maf=0.05 --max_maf=0.20 --haps_threshold=1e-15 --sampling_rate=1e-8

The arguments of the arg_association script are documented below.

Basic options:

--arg_path
			Path to .argn (ARG format) or .trees/.tsz (tskit/tszip format) file (default='files/chr1.chunk1.argn').
--arg_sample_path
			Path to file listing diploid samples in the same order as haploid leaves of the ARG (default='files/example.sample').
--arg_id
			Identifier for input ARG used for writing output. Must start with chromosome number followed by a period, e.g., chr1.* or chr22.* (default='chr1.chunk1').
--residualised_pheno_path
			Path to the phenotype, with any covariates residualised out. Should also residualise out the LOCO predictor for ARG-MLMA workflows (default='files/example.pheno').
--out_path
			Output path prefix for *.tab.gz and *.haps.gz output (default='files/results').

Additional optional parameters:

--sampling_rate
			Mutation rate-based testing. If positive, generates random mutations on the ARG, which are tested for association. If 0, tests all branches (default=0).
--min_mac
			Minimum MAC to test (default=-1 which means no minimum). Cannot set both min_mac and min_maf.
--min_maf
			Minimum MAF to test (default=-1 which means no minimum). Cannot set both min_mac and min_maf.
--max_maf
			Maximum MAF to test (default=-1 which means no maximum).
--haps_threshold
			p-value threshold, variants crossing this threshold are written to .haps.gz output (default=0).
--random_seed
			Seed to use for mutation-based testing. If zero, seeds by time (default=1).
--calibration_factor
			Calibration factor. Used in ARG-MLMA workflows, can be set to 1 to instead run linear regression (default=1).
--descendant_list_threshold
			Advanced parameter determining how ARG branch variants are represented in memory. Uses a vector of sample IDs below and a bitset above the threshold. If less than 0, always uses a vector of sample IDs (default). If memory is concern, setting to num_individuals/16 may reduce memory at the cost of greater runtime.

The default settings perform linear regression association. You will want to input a phenotype with any covariates regressed out in --residualised_pheno_path. You can find an example script to residualize covariates here.

For ARG-MLMA analyses, the residualised phenotype must have MLMA LOCO predictions subtracted, which can be performed using BOLT-LMM and PLINK. The --calibration_factor will also be a value different than 1 and can be obtained from the BOLT-LMM analysis. We have included a Snakemake workflow that you can replicate with your own paths to run these steps of the ARG-MLMA analysis.

For a genome-wide ARG analysis, you will likely have several ARGs representing several genomic regions. The --arg_sample_path will likely remain the same for all ARGs being analyzed, while the --arg_path and --arg_id will vary together. When running ARG-MLMA genome-wide, the --residualised_pheno_path should be different across different chromosomes to capture LOCO effects.

We recommend tuning the --sampling_rate and --threshold based on your application. For instance, --threshold can be set to 0 if variant haps are not needed, and --sampling_rate can be tuned to either test all branches (value of 0) or to sample branches based on a mutation rate, for which we used values of 1e-5 and 1e-3 in our paper. For each of these cases, you should compute appropriate genome-wide significance thresholds based on null model simulations. An example for this is in this directory.

We recommend using a fixed --random_seed for reproducibility, either a --min_maf of choice or a --min_mac of 5, and keeping the default --descendant_list_threshold.

Obtaining genome-wide significance thresholds

In our paper, we computed genome-wide significance thresholds for ARG association by simulating phenotypes under the null model of randomly generated Gaussian noise. These null phenotypes are tested across the ARG as well as extensively for one ARG chunk to compute a genome-wide threshold at a family-wise error rate of 0.05.

A Snakemake workflow to perform this analysis can be found here, and the same procedure works for both linear regression and ARG-MLMA association.

First install Snakemake using pip install snakemake.

Next prepare the example files (same as above) and run the Snakemake workflow using (adapt the number of cores as needed):

arg_association_prepare_example
snakemake -R summarize_null_simulations --cores 2

The example runs two different sampling rates (1e-5 and 1e-6) by default, and performs 20 replicates for the ARG chunk. This is much lower than required for a precise genome-wide threshold, for which we recommend 1000 replicates (by setting N_JOBS = 100, as there are 10 seeds per job).

The workflow outputs threshold results to GeWAS_results/mu1e-5_thresholds.tsv and GeWAS_results/mu1e-6_thresholds.tsv. You should use the threshold column as the genome-wide significance threshold, for which we report a 95% confidence interval. The workflow runs through a few choices of minimum MAF. The workflow writes intermediate output to GeWAS_results/null_model; these files can be safely removed.

Filtering association signals

After association has been performed, you may want to extract approximately independent association signals using PLINK clumping and/or GCTA COJO analyses, as described in our manuscript.

We plan to upload scripts to help with these steps in the near future.

Other ARG analysis functions (Python API)

arg-needle-lib exposes a Python API with ARG analysis functions that can be accessed after running import arg_needle_lib in a Python session. Examples of using the Python API can be found in the example folder.

For instance, the functions arg_needle_lib.exact_arg_grm and arg_needle_lib.monte_carlo_arg_grm allow one to compute ARG-GRMs with varying values of alpha as well as MAF stratification. They are demonstrated in the file example_grm.py within the example folder. Another example demonstrates the use of the Python API for generating and iterating over mutations.

For more advanced users, arg-needle-lib contains lower-level C++ functions that can be used by including arg-needle-lib as a C++ module. However, we have designed arg-needle-lib so that users can in most cases simply use the Python API. Please reach out if you have a particular use case not included in the Python API and would like to discuss options for having it implemented.

Saving and loading an ARG

An arg-needle-lib ARG can be saved in .argn format using this function, and loaded back using this function.

Conversion scripts

arg-needle-lib can be used to convert between other formats. It can convert to Newick format and to/from tskit.TreeSequence format. This function can be used to convert to Newick format. To convert to/from tskit.TreeSequence format you can use this and this library functions, as well as the arg2tskit and tskit2arg command line scripts that are packaged with arg-needle-lib. Each script takes --arg_path and --ts_path arguments, checks that the file paths represent the correct formats, and applies the appropriate conversions. The --arg_path argument must end in .argn, while the --ts_path argument can end in either .trees (tskit) or .tsz (tszip) format.

Currently, tskit2arg does not convert mutations on the tskit.TreeSequence. However, mutations can be generated on the converted arg_needle_lib.ARG (see example_mutation.py).

arg2tskit does convert mutations. Additionally, while arg_needle_lib.ARGNode objects have a start and end location in the genome, we remove this information when converting to tskit.TreeSequence due to differences in how ARGs are represented.

See Supplementary Note 1 of our manuscript for a general description of how an ARG is represented in arg-needle-lib.

Source code and contributing

You can find the souce for arg-needle-lib here. If you would like to contribute, feel free to let us know and/or fork the code and open a pull request. Please use the same coding style and conventions currently used in the code base.

File formats

You may want to look at the files that are created when running arg_association_prepare_example to see examples of files provided in input to ARG-Needle.

Phased haplotypes in Oxford haps/sample format (*.hap/hap.gz, samples)

These files are provided in input to ARG-Needle. The file format explained here: haps, samples. These files are output by phasing programs like Eagle and Shapeit. Note: ARG-Needle expects the first two columns of the .samples file (ID_1 and ID_2) to be identical for each sample!

Genetic map (*.map/map.gz)

The genetic map provided in input to ARG-Needle is in Plink map format, in which each line has four columns with format “chromosome SNP_name genetic_position physical_position”. Genetic positions are in centimorgans, physical positions are in bp.

ASMC files

Additional files used by ASMC within ARG-Needle (demographic history *.demo; time discretization *.disc; decoding quantities *.decodingQuantities.gz; time discretization intervals *.intervalsInfo) are described here.

Contact

If you have any questions, suggestions, or bug reports, please contact Pier Palamara using <lastname>@stats.ox.ac.uk.

Citation

If you use this software, please cite:

B. C. Zhang, A. Biddanda, Á. F. Gunnarsson, F. Cooper, P. F. Palamara. Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits. Nature Genetics, 2023. https://doi.org/10.1038/s41588-023-01379-x