ARG-RHE and ARG-LMM
This manual provides instructions to run the ARG-RHE software for scalable variance component analysis using ancestral recombination graphs (ARGs), as well as other ARG-based linear mixed model analyses implemented in the ARG-LMM package. These methods are described in Zhu, Kalantzis et al., Cell Genomics, 2025. The source code, which relies on the arg-needle-lib library, is available in this repository. Additional scripts to reproduce the ARG-based linear mixed model analyses reported in Zhang et al., Nature Genetics, 2023 are available in this repository.
To install, use:
pip install arg-lmm
To run ARG-based association testing using randomized Haseman–Elston (ARG-RHE), use:
arg-rhe --assoc --argfile <path_to_argn_file> --pheno <path_to_phenotypes> --out <output_path> --mu <mutation_resampling_rate> --alpha <alpha> --mac <minimum_mac_to_include> --seed <seed>
This is a shorthand for running arg-lmm with the --rhe flag:
arg-lmm --rhe --assoc --argfile <path_to_argn_file> --pheno <path_to_phenotypes> --out <output_path> --mu <mutation_resampling_rate> --alpha <alpha> --mac <minimum_mac_to_include> --seed <seed>
- <path_to_argn_file> should point to an ARG in .argn format.
- <path_to_phenotypes> is a space-separated file without a header. The first two columns must contain the FID and IID of each sample, and the remaining columns contain phenotype values (preferably mean-centred and normalised using RINT; see paper). The sample order must exactly match the order in the ARG file.
- <output_path> is the output file where the estimated h2_g values are written. These correspond to the estimated heritability on the GRM random component. Row ordering matches the phenotype columns in <path_to_phenotypes>.
- <mutation_resampling_rate> (optional) specifies the rate at which new mutations are generated on the ARG. The default is 1e-6.
- <alpha> (optional) is the normalising exponent applied after mean-centring, where each genotype entry is scaled by (af*(1-af))**alpha. The default is -1.
- <minimum_mac_to_include> (optional) is the minimum MAC threshold for including resampled mutations in the analysis. The default is 1.
- <seed> (optional) is the random seed. The default is 42.
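To make the role of <alpha> concrete, the sketch below applies the normalisation described above to a small dense genotype matrix. It is illustrative only: ARG-RHE works on the ARG rather than on an explicit genotype matrix, and the function name normalise_genotypes, the haploid 0/1 coding, and the handling of monomorphic variants are assumptions made for this example, not part of the package.

```python
import numpy as np


def normalise_genotypes(genotypes, alpha=-1.0):
    """Mean-centre each variant, then scale it by (af*(1-af))**alpha.

    Hypothetical helper for illustration; assumes a samples x variants
    matrix with haploid 0/1 entries.
    """
    G = np.asarray(genotypes, dtype=float)
    af = G.mean(axis=0)                      # allele frequency of each variant
    keep = (af > 0) & (af < 1)               # drop monomorphic variants (af*(1-af) = 0)
    centred = G[:, keep] - af[keep]          # mean-centring
    scale = (af[keep] * (1.0 - af[keep])) ** alpha
    return centred * scale


# Toy matrix: 4 haploid samples, 3 variants (the second one is monomorphic).
G = np.array([[0, 1, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 1, 1]])
X = normalise_genotypes(G, alpha=-0.5)
```

Under this formula, alpha = -0.5 gives every retained variant unit variance, while more negative values such as the default -1 give rarer variants proportionally more weight.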
ARGs in .argn format can be obtained using ARG-Needle or Threads, or converted from other formats (e.g. .tsz) using arg-needle-lib as described here.
Example association analysis using the demo files:
arg-rhe --assoc --argfile demo_files/10kb_region_20k_samples.argn --pheno demo_files/h2_5e-03_alpha_-0.5.phenos --out demo_files/output.csv --mu 1e-6 --alpha -0.5 --mac 1 --seed 42
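The phenotype file layout described above (space-separated, no header, FID and IID first) can be produced with a few lines of Python. The sketch below also applies a rank-based inverse normal transform. The Blom offset c = 3/8, the helper names rint and write_phenotypes, and the sample IDs are assumptions for illustration; check the paper for the exact transform used by the authors.

```python
import os
import tempfile
from statistics import NormalDist

import numpy as np


def rint(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform (ties are not treated specially)."""
    values = np.asarray(values, dtype=float)
    n = values.size
    ranks = np.argsort(np.argsort(values)) + 1          # ranks 1..n
    quantiles = (ranks - c) / (n - 2.0 * c + 1.0)       # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(float(q)) for q in quantiles])


def write_phenotypes(path, fids, iids, pheno_columns):
    """Write one header-free line per sample: FID IID pheno1 pheno2 ..."""
    with open(path, "w") as handle:
        for i, (fid, iid) in enumerate(zip(fids, iids)):
            values = [f"{column[i]:.6f}" for column in pheno_columns]
            handle.write(" ".join([str(fid), str(iid)] + values) + "\n")


# Hypothetical samples, listed in the same order as in the ARG file.
sample_ids = ["s1", "s2", "s3", "s4"]
raw = [3.2, 0.5, 10.1, 1.7]
out_path = os.path.join(tempfile.mkdtemp(), "example.phenos")
write_phenotypes(out_path, sample_ids, sample_ids, [rint(raw)])
```

The resulting path can then be passed to --pheno. Additional phenotypes go in further columns, one list per phenotype in pheno_columns.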
- The input ARG usually covers a small genome window (e.g. a gene region). We provide scripts in the utils folder of the repo to trim an inferred ARG in either arg-needle’s .argn format or tskit’s compressed .tsz format. Note that the trimming start and end positions treat 0 as the beginning of the ARG file and do not include the offset between physical position and ARG position.
- The input phenotypes should be residualised against all covariates. We provide an example routine in the utils folder of the repo to do so.
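As a rough illustration of what residualising against covariates means, the sketch below regresses each phenotype on the covariates (plus an intercept) by ordinary least squares and keeps the residuals. This is not the routine shipped in the utils folder, which may differ; the function name residualise and the simulated data are assumptions.

```python
import numpy as np


def residualise(phenotypes, covariates):
    """Return OLS residuals of each phenotype column given covariates and an intercept."""
    Y = np.asarray(phenotypes, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]                       # treat a single phenotype as one column
    C = np.asarray(covariates, dtype=float)
    if C.ndim == 1:
        C = C[:, None]
    design = np.column_stack([np.ones(len(C)), C])       # intercept + covariates
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)    # least-squares fit
    return Y - design @ beta


# Simulated example: two hypothetical covariates and one phenotype.
rng = np.random.default_rng(0)
covs = rng.normal(size=(100, 2))
pheno = 1.0 + covs @ np.array([0.3, -0.2]) + rng.normal(size=100)
residuals = residualise(pheno, covs)
```

The residuals have mean zero and are uncorrelated with every covariate, so they can be transformed (e.g. with RINT) and written to the phenotype file.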
arg-lmm implements additional ARG-based LMM analyses using efficient ARG-matrix multiplication and numerical algorithms such as a conjugate gradient solver, as described in Zhu, Kalantzis et al., Cell Genomics, 2025. These include:
- Estimating genome-wide heritability with ARG-RHE
- Computing the best linear unbiased predictor (BLUP)
- Calculating leave-one-chromosome-out (LOCO) residuals for association testing
These analyses are more experimental than ARG-RHE, which has been more extensively tested and applied to UK Biobank data. They can be run using the additional flags described below:
arg-lmm --h2 --argfiles <list_of_paths_to_args> --pheno <path_to_phenotypes> --out <output_path> --blup --ncalib <num_of_snps_to_estimate_gamma> --alpha <alpha>
arg-rhe --h2 --argfiles <list_of_paths_to_args> --pheno <path_to_phenotypes> --out <output_path> --blup --ncalib <num_of_snps_to_estimate_gamma> --alpha <alpha>
where
- <list_of_paths_to_args> is a list of file paths to chromosome-specific .argn ARGs; at least two are required.
- <num_of_snps_to_estimate_gamma> (optional, passed via --ncalib) is the number of markers per chromosome used to estimate the GRAMMAR–Gamma calibration factor.
- <output_path> is a prefix for all output files (BLUP, LOCO residuals, calibration factors).
- --blup is a flag to compute the BLUP (not performed by default).
- --h2 is a flag to estimate genome-wide heritability and its jackknife standard error (treating each ARG as a single block); when present, per-phenotype h2 and se are saved to <output_path>.h2.csv.
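For readers unfamiliar with block jackknife standard errors, the sketch below shows the standard delete-one-block formula, with one block per ARG. It assumes you already have the heritability estimate obtained with each block left out; whether arg-lmm uses exactly this variant is an assumption, and the function name jackknife_se and the input values are hypothetical.

```python
import numpy as np


def jackknife_se(leave_one_out_estimates):
    """Delete-one-block jackknife standard error.

    Each entry is the estimate recomputed with one block (here, one ARG)
    removed from the data.
    """
    est = np.asarray(leave_one_out_estimates, dtype=float)
    n = est.size
    if n < 2:
        raise ValueError("at least two blocks are needed for a jackknife standard error")
    return float(np.sqrt((n - 1) / n * np.sum((est - est.mean()) ** 2)))


# Hypothetical leave-one-chromosome-out h2 estimates for 10 chromosomes.
loo_h2 = [0.24, 0.26, 0.25, 0.27, 0.23, 0.25, 0.24, 0.26, 0.25, 0.25]
se = jackknife_se(loo_h2)
```

With one block per chromosome, the number of blocks is small, so the resulting standard error is itself a fairly noisy estimate.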
We provide simulated ARGs (one per hypothetical chromosome) and five phenotypes for 5,000 diploid samples in demo_files, generated under a model with 10 chromosomes and h2 = 0.25.
Example genome-wide heritability analysis:
arg-rhe --h2 --argfiles demo_files/sims.N5000.chr{1..10}.arg --pheno demo_files/sims.N5000.phenotypes --out demo_files/arg_loco.new
Example BLUP analysis:
arg-lmm --argfiles demo_files/sims.N5000.chr{1..10}.arg --pheno demo_files/sims.N5000.phenotypes --out demo_files/arg_loco.new --blup
Example LOCO with ncalib:
arg-lmm --argfiles <list_of_paths_to_args> --pheno <path_to_phenotypes> --out <output_path> --blup --ncalib <num_of_snps_to_estimate_gamma> --alpha <alpha>
The liu_sf method is copied from chiscore (MIT License) rather than installed from PyPI, because of a broken chi2comb dependency.
ARG-RHE and other scalable ARG-LMM analyses are described in
Zhu, Kalantzis, et al., “Leveraging ancestral recombination graphs for scalable mixed-model analysis of complex traits”, Cell Genomics, 2025.
ARG-based LMM analyses based on explicit ARG-GRMs are described in
Zhang et al., “Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits”, Nature Genetics, 2023.