ARG-Needle and arg-needle-lib
The ARG-Needle software suite enables inferring ARGs from genotyping array (or sequencing) data as well as performing several ARG-based analyses such as genealogy-wide association. To install the software suite, run
pip install arg-needle
Python wheels are currently available for Linux and Mac (x86_64 and M1).
The above command will install two main packages: arg-needle
, which focuses on ARG inference, and arg-needle-lib
, which contains everything else such as the ARG data structure and ARG-based analyses.
arg-needle-lib
exposes a Python API with ARG analysis functions, some of which are described below.
arg-needle
and arg-needle-lib
can be imported in a Python session as follows:
import arg_needle # ARG inference
import arg_needle_lib # ARG-based analyses such as association
This manual describes the use of these libraries for ARG inference and analysis.
Running pip install arg-needle
will set up command-line scripts to quickly try out ARG-Needle. First, run the following to simulate 2 Mb of SNP array data with 200 diploid individuals.
arg_needle_prepare_example
The above writes files to example.hap.gz
, example.sample
, example.map
. Now, you can run ARG inference in various ways:
arg_needle # Regular ARG-Needle
arg_needle --asmc_clust 1 # ASMC-clust
arg_needle --normalize 0 # ARG-Needle without ARG normalization
By default, arg_needle
reads from example.hap[s][.gz]
, example.sample
, and example.map
and writes to example.argn
. To adapt it to your own data, you can modify the --hap_gz
, --map
, and --out
flags (see below).
ARG-Needle can analyze either SNP array (default) or sequencing data, although its parameters have been optimized for use with SNP data, so it may be slow when applied to sequencing data. When run in sequencing mode, ARG-Needle accepts the same input format but applies a different underlying ASMC model. To use sequencing mode with 0.5 Mb of simulated data, run:
arg_needle_prepare_example --mode sequence --length 5e5
arg_needle --mode sequence # Sequencing instead of array
We recommend becoming familiar with the following options:
--hap_gz Path to .hap[s][.gz] file (default=example.hap.gz)
--map Path to genetic map (default=example.map)
--out Path to output without .argn extension (default=example)
--mode Data mode ("array" or "sequence", default="array")
--normalize Whether to apply ARG normalization (default=1, nonzero
means true)
--normalize_demography Demography with haploid population sizes for normalization
(default=/path/to/site-packages/arg_needle/resources/
CEU.demo)
--asmc_clust Whether to use ASMC-clust instead of ARG-Needle (default=
0, nonzero means true)
--asmc_decoding_file Where to find decoding quantities (default=/path/to/site-
packages/arg_needle/resources/30-100-2000_
CEU.decodingQuantities.gz)
--chromosome Chromosome number to store in the ARG (default=1)
The --hap_gz
flag expects a file of the form root.hap
, root.haps
, root.hap.gz
, or root.haps.gz
. It then reads that haps file as well as root.sample
. In the case that your haps and sample files have a different prefix, we recommend creating a symlink. (For instance, given name1.hap.gz
and name2.sample
, create a symlink using ln -s name2.sample name1.sample
.)
The --map
file provides the genetic and physical positions for the inference algorithms. Both the genetic and physical positions must be strictly increasing.
The formats of these files are briefly described here.
ARG-Needle and ASMC-clust can be applied to sequencing or SNP array data using the --mode
flag, which is set to array
by default. For analyses involving sequencing data, it should be set to sequence
.
Inferred ARGs are written to the path specified using --out
in .argn
format and can be analyzed or converted as described below. A saved ARG in .argn
format can later be loaded using this function (see other ARG analysis functions).
ARG-Needle and ASMC-clust by default run an ARG normalization step after performing inference. This step aims to improve the estimated branch lengths within the ARG to match those expected in a population with a given demographic history and is performed by default. This can be changed using --normalize 0
.
The demographic history used for normalization is provided using the --normalize_demography
flag. The default --normalize_demography
is the CEU.demo
file from this repository and is included with package installation. It represents a European demographic model with haploid population sizes. Several other demographic models are available here, or you can provide your own, taking care that the population sizes are haploid.
The default algorithm used for ARG inference if ARG-Needle. To use ASMC-clust instead, you can set--asmc_clust 1
.
ARG-Needle and ASMC-clust rely on the ASMC algorithm to infer ARGs. ASMC uses a .decodingQuantities.gz
file storing information about the input data, such as the demographic model, the mutation rate, and the allele frequency spectrum of the array data. This is set using --asmc_decoding_file
.
The default --asmc_decoding_file
is the 30-100-2000_CEU.decodingQuantities.gz
file from this repository and is included with package installation. It corresponds to CEU.demo
demography (which assumes a mutation rate of 1.65E-8
), UK Biobank array frequency spectrum, and the 30-100-2000
time discretization (see README here). We recommend starting out with these decoding quantities even if working with a different population or genotyping array, as long as the ARG normalization step uses the correct demography. For strong deviations of both demographic history and genotype array density and frequency spectrum, you can prepare your own decoding quantities following this example. Note that the same decoding file is used for both array and sequence mode so you don’t need to generate multiple decoding quantities.
The --chromosome
flag simply sets a chromosome number in the inferred ARG file and does not affect the inference process itself. This is used, for instance, to write out a chromosome value in files containing association summary statistics for a genealogy-wide association analysis. When reading in .hap[s][.gz]
/ .map
files, ARG-Needle and ASMC-clust discard any chromosome information.
The following additional parameters may be customized when running the arg_needle
script.
--verbose Whether to log verbose output (default=0, nonzero means
true)
--asmc_clust_chunk_sites Number of sites per chunk for memory-efficient ASMC-clust
(default=-1 meaning not used)
--asmc_pad_cm ASMC padding in cM (default=2.0)
--sequence_hash_cm Sequence mode hashing window size in cM (default=0.3)
--snp_hash_cm Array mode hashing window size in cM (default=0.5)
--hash_tolerance Hashing algorithm tolerance (default=1)
--hash_topk How many closest samples to return using hashing
(default=64)
--hash_word_size Hashing word size, must be between 1 and 64 (default=16)
--backup_hash_word_size Backup hashing word size (must be between 0 and 64, 0
means no backup for real data inference, default=8)
For details about these parameters, please see our manuscript. For instance, the --hash_word_size
parameter can be incrementally increased to speed up computation in multi-step large-scale ARG inference tasks.
For larger biobank-scale analyses, ARG inference benefits from being split into multiple “chunks” throughout each chromosome. This reduces the memory requirements and runtime of each job.
In the UK Biobank analysis detailed in our manuscript, we partitioned array variants into 749 non-overlapping chunks for parallelized inference. However, the ARG-Needle algorithm benefits from having additional context when available, so we padded each of the non-overlapping chunks with 50 SNPs on either side, removing these regions when ARG-Needle had finished (see Methods). Additionally, we used one set of ARG-Needle parameters for the first 50K individuals, and switched to parameters allowing faster inference for the remaining individuals up to 337K (see Supplementary Note 1).
All the above options can be set using the more advanced script arg_needle_multistep
. This includes the switching of parameters midway through inference as well as “trimming” the padding regions once ARG-Needle finishes. For your own data, you will need to prepare the input chunks as .hap[s][.gz]
/ .map
files with padding SNPs included, but you can try out the script first using our example data.
If you have not already done so, generate the example SNP array data:
arg_needle_prepare_example
You can then run the multistep analysis in 3 steps:
arg_needle_multistep --step 1 --num_snp_samples 200
arg_needle_multistep --step 2
arg_needle_multistep --step 3
Step 1 builds an ARG on the first 200 haploid samples using either ARG-Needle or ASMC-clust. Step 2 threads the remaining samples in the data, and can accept different hashing parameters as needed. Step 3 trims 50 padding SNPs from each side and applies ARG normalization if specified.
When a chunk is at the start or end of the chromosome, there may be no padding SNPs available. In this case, the trimming should not operate on that side. To run step 3 only trimming 50 padding SNPs from the end, run:
arg_needle_multistep --step 3 --trim_num_snps 0,50
The script requires that you run all three steps in order. However, if you do not need to change any options for step 2, you can set --num_snp_samples
(or --num_sequence_samples
if running in sequence mode) to the full size of dataset for step 1. Running step 2 will then simply write out the same ARG to *.step2.argn
as is already in *.step1.argn
.
Below are the additional options provided by the arg_needle_multistep
script.
--step Which of steps 1, 2, 3 to run (default=1)
--num_snp_samples Number of haploid SNP samples at end of inference
(default=0)
--num_sequence_samples Number of haploid sequence samples at end of inference
(default=0)
--trim_num_snps Number of padding SNPs to trim off each side (default=50).
Can either be a single integer or a string of two
integers separated by a single comma without space,
e.g. 0,50. In the latter case the two values represent
the trimming from the beginning and the end.
The list below documents common errors and warnings. If you encounter any issues, please take a look at this list and get in touch for any additional questions or bug reports.
Click to expand a list of commonly encountered problems
Q: Why am I getting this warning? "Warning: few cousins found for ID ... If you see this >10% of the time, decrease the hash word size."
A: This warning is a recommendation to decrease the --hash_word_size
. If --backup_hash_word_size
is 0, meaning there is no backup hashing, it can also be useful to set it to a value such as 8 or 4. However, as the message says, if these warnings disappear after a few IDs, or are infrequent and comprise less than 10% of IDs, it should not be a problem.
Q: What does position mean in the resulting ARG object?
A: IMPORTANT: the positions in the resulting objects are shifted by an amount. To get the true position, add the ARG offset parameter to the position. So, if your map file has SNPs going from 10 Mb to 12 Mb, the offset will be around 10 Mb and the ARG length will be around 2 Mb.
Q: Why is the resulting ARG object shorter or longer than the simulated region?
A: ARG-Needle uses a range that goes from the first SNP to the last SNP, with a small interval of additional space on each side. This may not necessarily correspond with the original simulation.
Q: How do I determine appropriate hash word sizes?
A: The default parameters of --hash_word_size=16
and --backup_hash_word_size=8
should suffice for most cases. Use larger values to speed up runtime when building large ARGs, and smaller values for greater accuracy in the hashing. See also the Methods and Supplementary Note 1 of the manuscript.
Q: I get an error like the following: “ERROR. SNP snp1 has 200 non-missing individuals, but the CSFS requires 300”. Why?
A: The ASMC Hidden Markov model requires at least 300 haploid samples to build the conditional site frequency spectrum (CSFS). This corresponds to at least 150 diploids.
Q: I get an error like the following: “Please set reserved_samples to at least 400”. What should I do?
A: This error comes from the extend_arg()
function, used to thread additional samples to an ARG. While deserializing the ARG, please set the reserved_samples argument to at least this value, to “make room” for the additional samples.
Q: I get an error like the following: “Trying to thread to 200 haploid samples, but initial ARG already has 200 haploid samples”. What should I do?
A: This error comes from the extend_arg()
function, used to thread additional samples to an ARG. Please make sure that args.num_snp_samples
is set to a value larger than the size of the existing ARG.
Q: What is the verbose output being shown during the ARG building?
A: ARG-Needle logs the following: (1) the sample ID being threaded, whenever ID % 10 == 0; (2) the hash results for the sample whenever ID % 100 == 1; (3) various notices and warnings about the hashing for all IDs.
ARG-based analyses
Running pip install arg-needle
will automatically pull in the arg-needle-lib
package, which implements ARG-based analyses. arg-needle-lib
can also be used and/or installed independently from arg-needle
by running pip install arg-needle-lib
, for instance for performing analyses using previously inferred or simulated ARGs.
arg-needle-lib
can be imported in a Python session as follows:
import arg_needle_lib
arg-needle-lib
works with arg_needle_lib.ARG
objects, which are the default output format of ARG-Needle ARG inference. arg-needle-lib
comes pre-packaged with a script to perform ARG-based association of phenotypes, as well as scripts to convert between arg_needle_lib.ARG
, Newick, and tskit.TreeSequence
formats.
The script arg_association
uses the arg-needle-lib
library to perform ARG-based association, which we refer to as genealogy-wide association in the paper. It comes packaged with arg-needle-lib
and should be automatically added to your path.
We have prepared an example to quickly try out genealogy-wide association. After installing arg-needle-lib
with pip
, simply run
arg_association_prepare_example
arg_association
You can view the source code of arg_association_prepare_example
here and the source of arg_association
here.
Running the first command will prepare files in the directory files/
relative to wherever you have called the script. In particular, it generates a phenotype files/example.pheno
derived from a single variant in files/chr1.chunk1.argn
with an allele count of 20. Running the second command then performs association of the ARG with the phenotype, with results output to files/results.tab.gz
and files/results.haps.gz
. To extract the p-values and MAC and display them with smallest p-value first, run
zcat < files/results.tab.gz | awk '{print $9" "$6}' | sort -g | less
You should observe a few highly significant ARG variants with MAC = 20.
The above run will yield an empty files/results.haps.gz
file. To run the example with a different set of parameters, you might try
arg_association_prepare_example
arg_association --min_maf=0.05 --max_maf=0.20 --haps_threshold=1e-15 --sampling_rate=1e-8
The arguments of the arg_association
script are documented below.
--arg_path
Path to .argn (ARG format) or .trees/.tsz (tskit/tszip format) file (default='files/chr1.chunk1.argn').
--arg_sample_path
Path to file listing diploid samples in the same order as haploid leaves of the ARG (default='files/example.sample').
--arg_id
Identifier for input ARG used for writing output. Must start with chromosome number followed by a period, e.g., chr1.* or chr22.* (default='chr1.chunk1').
--residualised_pheno_path
Path to the phenotype, with any covariates residualised out. Should also residualise out the LOCO predictor for ARG-MLMA workflows (default='files/example.pheno').
--out_path
Output path prefix for *.tab.gz and *.haps.gz output (default='files/results').
--sampling_rate
Mutation rate-based testing. If positive, generates random mutations on the ARG, which are tested for association. If 0, tests all branches (default=0).
--min_mac
Minimum MAC to test (default=-1 which means no minimum). Cannot set both min_mac and min_maf.
--min_maf
Minimum MAF to test (default=-1 which means no minimum). Cannot set both min_mac and min_maf.
--max_maf
Maximum MAF to test (default=-1 which means no maximum).
--haps_threshold
p-value threshold, variants crossing this threshold are written to .haps.gz output (default=0).
--random_seed
Seed to use for mutation-based testing. If zero, seeds by time (default=1).
--calibration_factor
Calibration factor. Used in ARG-MLMA workflows, can be set to 1 to instead run linear regression (default=1).
--descendant_list_threshold
Advanced parameter determining how ARG branch variants are represented in memory. Uses a vector of sample IDs below and a bitset above the threshold. If less than 0, always uses a vector of sample IDs (default). If memory is concern, setting to num_individuals/16 may reduce memory at the cost of greater runtime.
The default settings perform linear regression association. You will want to input a phenotype with any covariates regressed out in --residualised_pheno_path
. You can find an example script to residualize covariates here.
For ARG-MLMA analyses, the residualised phenotype must have MLMA LOCO predictions subtracted, which can be performed using BOLT-LMM and PLINK. The --calibration_factor
will also be a value different than 1 and can be obtained from the BOLT-LMM analysis. We have included a Snakemake workflow that you can replicate with your own paths to run these steps of the ARG-MLMA analysis.
For a genome-wide ARG analysis, you will likely have several ARGs representing several genomic regions. The --arg_sample_path
will likely remain the same for all ARGs being analyzed, while the --arg_path
and --arg_id
will vary together. When running ARG-MLMA genome-wide, the --residualised_pheno_path
should be different across different chromosomes to capture LOCO effects.
We recommend tuning the --sampling_rate
and --threshold
based on your application. For instance, --threshold
can be set to 0 if variant haps are not needed, and --sampling_rate
can be tuned to either test all branches (value of 0
) or to sample branches based on a mutation rate, for which we used values of 1e-5
and 1e-3
in our paper. For each of these cases, you should compute appropriate genome-wide significance thresholds based on null model simulations. An example for this is in this directory.
We recommend using a fixed --random_seed
for reproducibility, either a --min_maf
of choice or a --min_mac
of 5, and keeping the default --descendant_list_threshold
.
In our paper, we computed genome-wide significance thresholds for ARG association by simulating phenotypes under the null model of randomly generated Gaussian noise. These null phenotypes are tested across the ARG as well as extensively for one ARG chunk to compute a genome-wide threshold at a family-wise error rate of 0.05.
A Snakemake workflow to perform this analysis can be found here, and the same procedure works for both linear regression and ARG-MLMA association.
First install Snakemake using pip install snakemake
.
Next prepare the example files (same as above) and run the Snakemake workflow using (adapt the number of cores as needed):
arg_association_prepare_example
snakemake -R summarize_null_simulations --cores 2
The example runs two different sampling rates (1e-5 and 1e-6) by default, and performs 20 replicates for the ARG chunk. This is much lower than required for a precise genome-wide threshold, for which we recommend 1000 replicates (by setting N_JOBS = 100
, as there are 10 seeds per job).
The workflow outputs threshold results to GeWAS_results/mu1e-5_thresholds.tsv
and GeWAS_results/mu1e-6_thresholds.tsv
. You should use the threshold
column as the genome-wide significance threshold, for which we report a 95% confidence interval. The workflow runs through a few choices of minimum MAF. The workflow writes intermediate output to GeWAS_results/null_model
; these files can be safely removed.
After association has been performed, you may want to extract approximately independent association signals using PLINK clumping and/or GCTA COJO analyses, as described in our manuscript.
We plan to upload scripts to help with these steps in the near future.
arg-needle-lib
exposes a Python API with ARG analysis functions that can be accessed after running import arg_needle_lib
in a Python session. Examples of using the Python API can be found in the example folder.
For instance, the functions arg_needle_lib.exact_arg_grm and arg_needle_lib.monte_carlo_arg_grm allow one to compute ARG-GRMs with varying values of alpha as well as MAF stratification. They are demonstrated in the file example_grm.py within the example folder. Another example demonstrates the use of the Python API for generating and iterating over mutations.
For more advanced users, arg-needle-lib
contains lower-level C++ functions that can be used by including arg-needle-lib
as a C++ module. However, we have designed arg-needle-lib
so that users can in most cases simply use the Python API. Please reach out if you have a particular use case not included in the Python API and would like to discuss options for having it implemented.
An arg-needle-lib
ARG can be saved in .argn
format using this function, and loaded back using this function.
arg-needle-lib
can be used to convert between other formats. It can convert to Newick format and to/from tskit.TreeSequence
format. This function can be used to convert to Newick format. To convert to/from tskit.TreeSequence
format you can use this and this library functions, as well as the arg2tskit
and tskit2arg
command line scripts that are packaged with arg-needle-lib
. Each script takes --arg_path
and --ts_path
arguments, checks that the file paths represent the correct formats, and applies the appropriate conversions. The --arg_path
argument must end in .argn
, while the --ts_path
argument can end in either .trees
(tskit
) or .tsz
(tszip
) format.
Currently, tskit2arg
does not convert mutations on the tskit.TreeSequence
. However, mutations can be generated on the converted arg_needle_lib.ARG
(see example_mutation.py
).
arg2tskit
does convert mutations. Additionally, while arg_needle_lib.ARGNode
objects have a start and end location in the genome, we remove this information when converting to tskit.TreeSequence
due to differences in how ARGs are represented.
See Supplementary Note 1 of our manuscript for a general description of how an ARG is represented in arg-needle-lib
.
You can find the souce for arg-needle-lib here. If you would like to contribute, feel free to let us know and/or fork the code and open a pull request. Please use the same coding style and conventions currently used in the code base.
You may want to look at the files that are created when running arg_association_prepare_example
to see examples of files provided in input to ARG-Needle.
These files are provided in input to ARG-Needle. The file format explained here. These files are output by phasing programs like Eagle and Shapeit.
The genetic map provided in input to ARG-Needle is in Plink map format, in which each line has four columns with format “chromosome SNP_name genetic_position physical_position”. Genetic positions are in centimorgans, physical positions are in bp.
Additional files used by ASMC within ARG-Needle (demographic history *.demo
; time discretization *.disc
; decoding quantities *.decodingQuantities.gz
; time discretization intervals *.intervalsInfo
) are described here.
If you have any questions, suggestions, or bug reports, please contact Pier Palamara using <lastname>@stats.ox.ac.uk
.
If you use this software, please cite:
B. C. Zhang, A. Biddanda, Á. F. Gunnarsson, F. Cooper, P. F. Palamara. Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits. Nature Genetics, 2023. https://doi.org/10.1038/s41588-023-01379-x