ARG-RHE and ARG-LMM

This manual provides instructions to run the ARG-RHE software for scalable variance component analysis using ancestral recombination graphs (ARGs), as well as other ARG-based linear mixed model analyses implemented in the ARG-LMM package. These methods are described in Zhu, Kalantzis et al., Cell Genomics, 2025. The source code, which relies on the arg-needle-lib library, is available in this repository. Additional scripts to reproduce the ARG-based linear mixed model analyses reported in Zhang et al., Nature Genetics, 2023 are available in this repository.

Installation

To install, use:

pip install arg-lmm

ARG-RHE

To run ARG-based association testing using randomized Haseman–Elston (ARG-RHE), use:

arg-rhe --assoc --argfile <path_to_argn_file> --pheno <path_to_phenotypes> --out <output_path> --mu <mutation_resampling_rate> --alpha <alpha> --mac <minimum_mac_to_include> --seed <seed>

This is a shorthand for running arg-lmm with the --rhe flag:

arg-lmm --rhe --assoc --argfile <path_to_argn_file> --pheno <path_to_phenotypes> --out <output_path> --mu <mutation_resampling_rate> --alpha <alpha> --mac <minimum_mac_to_include> --seed <seed>
  • <path_to_argn_file> should point to an ARG in .argn format.
  • <path_to_phenotypes> should point to a space-separated file without a header. The first two columns must contain the FID and IID of each sample, and the remaining columns contain phenotype values, one phenotype per column (preferably mean-centred and normalised using RINT; see paper). The sample order must exactly match the sample order in the ARG file.
  • <output_path> is the output file where the estimated h2_g values are written. These correspond to the estimated heritability on the GRM random component. Row ordering matches the phenotype columns in <path_to_phenotypes>.
  • <mutation_resampling_rate> (optional) specifies the rate at which new mutations are generated on the ARG. The default is 1e-6.
  • <alpha> (optional) is the normalising exponent applied after mean-centring, where each genotype entry is scaled by (af*(1-af))**alpha. The default is -1.
  • <minimum_mac_to_include> (optional) is the minimum MAC threshold for including resampled mutations in the analysis. The default is 1.
  • <seed> (optional) is the random seed. The default is 42.
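As an illustration of the phenotype file layout described above, the following is a minimal sketch that applies a rank-based inverse normal transform (RINT) to each phenotype and writes a space-separated file without a header. The function names rint and write_phenotypes are hypothetical and not part of arg-lmm. The Blom offset c = 3/8 and the average-rank treatment of ties are assumptions; the paper may use a different RINT convention. Missing values are not handled.

```python
from statistics import NormalDist


def rint(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform.

    The offset c = 3/8 (Blom) is an assumption of this sketch.
    Tied values receive the average of their ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


def write_phenotypes(path, sample_ids, phenotype_columns):
    """Write rows of 'FID IID pheno1 pheno2 ...' with no header.

    sample_ids: list of (FID, IID) tuples, already in ARG sample order.
    phenotype_columns: list of equal-length lists, one per phenotype.
    """
    transformed = [rint(column) for column in phenotype_columns]
    with open(path, "w") as handle:
        for row, (fid, iid) in enumerate(sample_ids):
            fields = [str(fid), str(iid)]
            fields += [f"{column[row]:.6f}" for column in transformed]
            handle.write(" ".join(fields) + "\n")
```

The caller is responsible for passing sample_ids in the same order as the samples in the ARG file; the sketch does not check this.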

ARGs in .argn format can be obtained using ARG-Needle or Threads, or converted from other formats (e.g. .tsz) using arg-needle-lib as described here.

Example

arg-rhe --assoc --argfile demo_files/10kb_region_20k_samples.argn --pheno demo_files/h2_5e-03_alpha_-0.5.phenos --out demo_files/output.csv --mu 1e-6 --alpha -0.5 --mac 1 --seed 42
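To make the --alpha and --mac options in the command above more concrete, the following sketch applies the scaling exactly as worded in the parameter description: each entry of a variant is mean-centred and multiplied by (af*(1-af))**alpha, and variants below the minimum MAC are dropped. It is a literal reading of that description, not the arg-lmm implementation, which operates on the ARG directly. The function name scale_variant and the 0/1 haploid coding are assumptions made for illustration.

```python
def scale_variant(genotypes, alpha=-1.0, min_mac=1):
    """Mean-centre one variant and scale by (af * (1 - af)) ** alpha.

    genotypes: list of 0/1 allele indicators, one per haploid sample
               (an assumption of this sketch).
    Returns None when the variant fails the minor-allele-count filter.
    """
    n = len(genotypes)
    allele_count = sum(genotypes)
    mac = min(allele_count, n - allele_count)
    # Monomorphic variants are always dropped: af * (1 - af) would be zero,
    # which cannot be raised to a negative power.
    if mac < max(min_mac, 1):
        return None
    af = allele_count / n
    weight = (af * (1.0 - af)) ** alpha
    return [(g - af) * weight for g in genotypes]
```

With a negative alpha, as in the defaults and in the example command, rarer variants receive larger weights.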

Notes

  • The input ARG usually covers a small genomic window (e.g. a gene region). The utils folder in the repository provides scripts to trim an inferred ARG in arg-needle’s .argn format or tskit’s compressed .tsz format. The trimming start and end positions are relative to the beginning of the ARG file (position 0); they do not include the offset between physical position and ARG position.
  • The input phenotypes should be residualised against all covariates. The utils folder in the repository provides an example routine for this.
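The coordinate convention in the first note can be sketched as a small conversion from physical positions to ARG-relative positions. The function to_arg_coordinates and the parameter arg_offset (the physical position that corresponds to ARG position 0) are hypothetical names used for illustration only; how the trimming scripts in utils accept their arguments is not shown here.

```python
def to_arg_coordinates(start_bp, end_bp, arg_offset):
    """Convert a physical interval to positions relative to the ARG start.

    arg_offset is the physical position corresponding to ARG position 0
    (a hypothetical parameter name for this sketch).
    """
    if end_bp <= start_bp:
        raise ValueError("end_bp must be greater than start_bp")
    if start_bp < arg_offset:
        raise ValueError("interval starts before the beginning of the ARG")
    return start_bp - arg_offset, end_bp - arg_offset
```

For example, if ARG position 0 corresponds to physical position 1,000,000, the physical interval 1,050,000 to 1,060,000 maps to trimming positions 50,000 and 60,000.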

BLUP or LOCO residuals

arg-lmm implements additional ARG-based LMM analyses using efficient ARG-matrix multiplication and numerical algorithms such as a conjugate gradient solver, as described in Zhu, Kalantzis et al., Cell Genomics, 2025. These analyses include:

  • Estimating genome-wide heritability with ARG-RHE
  • Computing the best linear unbiased predictor (BLUP)
  • Calculating leave-one-chromosome-out (LOCO) residuals for association testing

These analyses are more experimental than ARG-RHE, which has been more extensively tested and applied to UK Biobank data. They can be run using the additional flags described below:

arg-lmm --h2 --argfiles <list_of_paths_to_args> --pheno <path_to_phenotypes> --out <output_path> --blup --ncalib <num_of_snps_to_estimate_gamma> --alpha <alpha>
arg-rhe --h2 --argfiles <list_of_paths_to_args> --pheno <path_to_phenotypes> --out <output_path> --blup --ncalib <num_of_snps_to_estimate_gamma> --alpha <alpha>

where

  • <list_of_paths_to_args> is a list of filepaths to chromosome-specific .argn ARGs; at least two are required.
  • <num_of_snps_to_estimate_gamma> (optional, passed with --ncalib) is the number of markers per chromosome used to estimate the GRAMMAR–Gamma calibration factor.
  • <output_path> is a prefix for all output files (BLUP, LOCO residuals, calibration factors).
  • --blup is a flag to compute the BLUP (not performed by default).
  • --h2 is a flag to estimate genome-wide heritability and its jackknife standard error (treating each ARG as a single block); when present, per-phenotype h2 and se are saved to <output_path>.h2.csv.
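For readers unfamiliar with block jackknife standard errors, the following sketch shows the standard delete-one-block formula, with each ARG (chromosome) treated as one block as in the --h2 description above. The function name jackknife_se is hypothetical, and the sketch assumes the unweighted estimator; arg-lmm may use a weighted variant or differ in other details.

```python
import math


def jackknife_se(leave_one_out_estimates):
    """Delete-one block jackknife standard error.

    leave_one_out_estimates[i] is the estimate recomputed with block i
    (here: one ARG, i.e. one chromosome) removed.
    """
    k = len(leave_one_out_estimates)
    if k < 2:
        raise ValueError("need at least two blocks")
    mean = sum(leave_one_out_estimates) / k
    sum_sq = sum((est - mean) ** 2 for est in leave_one_out_estimates)
    return math.sqrt((k - 1) / k * sum_sq)
```

The formula is undefined for a single block, which is consistent with the requirement that at least two ARGs are supplied.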

We provide simulated ARGs (one per hypothetical chromosome) and five phenotypes for 5,000 diploid samples in demo_files, generated under a model with 10 chromosomes and h2 = 0.25.

Examples

Example genome-wide heritability analysis:

arg-rhe --h2 --argfiles demo_files/sims.N5000.chr{1..10}.arg --pheno demo_files/sims.N5000.phenotypes --out demo_files/arg_loco.new
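The chr{1..10} pattern in the command above is expanded by the shell (brace expansion in bash or zsh) into ten file paths. In environments without brace expansion, the same argument list can be built explicitly. The sketch below only assembles and prints the command; the helper name build_h2_command is hypothetical, and the flags and paths are the ones used in the example above.

```python
import shlex


def build_h2_command(arg_paths, pheno_path, out_prefix):
    """Assemble the arg-rhe --h2 command as a list of arguments."""
    return [
        "arg-rhe", "--h2",
        "--argfiles", *arg_paths,
        "--pheno", pheno_path,
        "--out", out_prefix,
    ]


# One ARG per chromosome, matching the demo file naming shown above.
arg_paths = [f"demo_files/sims.N5000.chr{c}.arg" for c in range(1, 11)]
command = build_h2_command(
    arg_paths,
    "demo_files/sims.N5000.phenotypes",
    "demo_files/arg_loco.new",
)
print(shlex.join(command))
# To execute instead of printing:
#   import subprocess; subprocess.run(command, check=True)
```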

Example BLUP analysis:

arg-lmm --argfiles demo_files/sims.N5000.chr{1..10}.arg --pheno demo_files/sims.N5000.phenotypes --out demo_files/arg_loco.new --blup

Template for a LOCO analysis with --ncalib (replace the placeholders with your own values):

arg-lmm --argfiles <list_of_paths_to_args> --pheno <path_to_phenotypes> --out <output_path> --blup --ncalib <num_of_snps_to_estimate_gamma> --alpha <alpha>

Licenses

The liu_sf method is copied from chiscore, which is distributed under the MIT License. It is included directly rather than installed from PyPI because of a broken chi2comb dependency.

Citation

ARG-RHE and other scalable ARG-LMM analyses are described in

Zhu, Kalantzis, et al., “Leveraging ancestral recombination graphs for scalable mixed-model analysis of complex traits”, Cell Genomics, 2025.

ARG-based LMM analyses based on explicit ARG-GRMs are described in

Zhang et al., “Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits”, Nature Genetics, 2023.