Skip to content

remove_vacant() is VERY slow #393

@bhaller

Description

@bhaller

I'm sorry to report that remove_vacant() seems to be extremely slow – so much so that it really destroys the performance of a pipeline involving tree-sequence recording. I've got a simulation of the full human genome with the Gravel model, using tree-sequence recording to produce a trees archive which I then load into python for recapitation and neutral mutation overlay. You can download the resulting trees archive here. Here's a script that loads, recapitates, and overlays each chromosome's tree sequence; it's a bit messy because it has timer code all over it, sorry:

# model3_analysis.py
#
# By Ben Haller and Murillo F. Rodrigues, 24 August 2025
# Messer Lab, Cornell University

# note that this requires pyslim 1.1.0 or later to run!

from pathlib import Path
import tskit, msprime, pyslim, warnings, os
from timeit import default_timer as timer
import polars as pl
import numpy as np

# NOTE: the base repository path needs to be configured for your setup here!
repository_path = Path('/Users/bhaller/Documents/Research/MesserLab/SLiM_project/Publication 2025 HumanPopGen/SimHumanity')

# set the current working directory to the SimHumanity repository
os.chdir(repository_path)

in_path = repository_path / "simhumanity_trees"
out_path = repository_path / "simhumanity_trees_RO"
out_path.mkdir(parents=False, exist_ok=False)

MU_TOTAL = 2.0e-8
MU_BENEF = 1e-12
MU_DELET = 1.2e-8

def get_recomb_map(chrom, seq_len):
    """
    Returns a recombination rate map (msprime.RateMap) for a given chromosome.
    Note the relative path to stdpopsim extraction is hardcoded in the function.
    """
    if chrom == 'MT':
        positions = np.array([0, seq_len])
        rates = np.array([0.0])
    else:
        rec_path = Path(f'stdpopsim extraction/extracted/chr{chrom}_recombination.txt')
        rec_df = pl.read_csv(rec_path, has_header=False, new_columns=["pos", "rate"])

        positions = np.append(rec_df['pos'].to_numpy(), seq_len) # need to include the end of the sequence
        rates = rec_df['rate'].to_numpy()
    return msprime.RateMap(position=positions, rate=rates)

def get_mut_map(chrom, seq_len, mt_multiplier = 20):
    """
    Returns a mutation rate map (msprime.RateMap) for a given chromosome.
    For the MT chromosome, the mutation rate is constant and equal to MU_TOTAL multiplied by mt_multiplier.
    Note the relative paths to stdpopsim extraction and the mutation rates are hardcoded in the function.
    """
    if chrom == 'MT':
        breakpoints = np.array([0, seq_len])
        rates = np.array([MU_TOTAL*mt_multiplier])
    else:
        ge_path = Path(f'stdpopsim extraction/extracted/chr{chrom}_genomic_elements.txt')
        ge_df = pl.read_csv(ge_path, has_header=False, new_columns=["type", "start", "end"])
        ge_df = ge_df.with_columns(pl.when(pl.col("type") == 0).then(pl.lit(MU_TOTAL)).otherwise(pl.lit(MU_TOTAL-MU_BENEF-MU_DELET)).alias("neutral_rate")) # computing the neutral rate for each genomic element
        breakpoints = np.append(ge_df["start"].to_numpy(), seq_len)
        rates = ge_df["neutral_rate"].to_numpy()
    return msprime.RateMap(position=breakpoints, rate=rates)


start_overall = timer()
total_load = 0
total_remove_vacant = 0
total_recap = 0
total_overlay_recap = 0
total_overlay_slim = 0
total_dump = 0

try:
    # iterate over the .trees files in the trees archive, in alphabetical order
    for file_path in sorted(in_path.iterdir(), key=lambda x:x.name):
        if file_path.is_file() and file_path.suffix == '.trees':
            print(f"Processing {file_path.name}...")
            print(f"   loading...")

            # extract the chromosome number from the file path
            chrom = str(file_path).split('.')[0].split('_')[-1]

            start_load = timer()
            ts = tskit.load(file_path)
            time_load = timer() - start_load
            total_load = total_load + time_load
            print("   ### time for tskit.load(): ", time_load)

            # remove vacant nodes from the sample, to avoid issues with "missing data" in stats computations
            start_remove_vacant = timer()
            ts = pyslim.remove_vacant(ts)
            time_remove_vacant = timer() - start_remove_vacant
            total_remove_vacant = total_remove_vacant + time_remove_vacant
            print("   ### time for pyslim.remove_vacant(): ", time_remove_vacant)

            print(f"Basic tree sequence information:")
            print(ts)


            # recapitate; note this issues a warning about large provenance, because it includes the rate map in the provenance
            start_recap = timer()

            print(f"   fetching the recombination rate map...")
            rec_map = get_recomb_map(chrom, ts.sequence_length)

            print(f"   recapitating...")
            ts = pyslim.recapitate(ts, ancestral_Ne=7310, recombination_rate=rec_map, random_seed=1)

            time_recap = timer() - start_recap
            total_recap = total_recap + time_recap
            print("   ### time for pyslim.recapitate(): ", time_recap)


            # compute where the recapitation period starts (in time ago)
            recap_start_time = ts.metadata["SLiM"]["tick"]-ts.metadata["SLiM"]["cycle"]
            print(f"   recapitation period starts at {recap_start_time} time ago")

            num_slim_muts = ts.num_mutations
            print(f"   {num_slim_muts} SLiM mutations")


            # overlay neutral mutations during the recapitation period
            print(f"   overlaying mutations over the recapitated portion...")
            next_id = pyslim.next_slim_mutation_id(ts)

            start_overlay_recap = timer()
            ts = msprime.sim_mutations(ts, rate=MU_TOTAL, random_seed=1, model=msprime.SLiMMutationModel(type=0, next_id=next_id), keep=True, start_time=recap_start_time) # start_time is in time ago, so we need to add mutations with a constant rate across the chromosome starting at the time the recapitated period starts (and older)
            time_overlay_recap = timer() - start_overlay_recap
            total_overlay_recap = total_overlay_recap + time_overlay_recap
            print("   ### time for overlay in recap interval: ", time_overlay_recap)

            num_postrecap_muts = ts.num_mutations
            print(f"   {num_postrecap_muts-num_slim_muts} mutations overlaid over the recapitated portion")


            # overlay neutral mutations during the SLiM period
            start_overlay_slim = timer()
            print(f"   fetching the mutation rate map for the SLiM period...")
            mut_map = get_mut_map(chrom, ts.sequence_length)

            print(f"   overlaying mutations over the SLiM period...")
            next_id = pyslim.next_slim_mutation_id(ts)

            ts = msprime.sim_mutations(ts, rate=mut_map, random_seed=1, model=msprime.SLiMMutationModel(type=0, next_id=next_id), keep=True, end_time=recap_start_time)
            time_overlay_slim = timer() - start_overlay_slim
            total_overlay_slim = total_overlay_slim + time_overlay_slim
            print("   ### time for overlay in SLiM interval: ", time_overlay_slim)


            num_total_muts = ts.num_mutations
            print(f"   {num_total_muts-num_slim_muts} mutations overlaid over the SLiM portion for a total of {num_total_muts} mutations")


            # write out the modified tree sequence to simhumanity_trees_RO
            print(f"   writing modified tree sequence...")

            start_dump = timer()
            ts.dump(str(out_path / file_path.name))
            time_dump = timer() - start_dump
            total_dump = total_dump + time_dump
            print("   ### time for dump(): ", time_dump)
except FileNotFoundError:
    print(f"Error: Directory '{directory_path}' not found.")
except Exception as e:
    print(f"An error occurred: {e}")

end_overall = timer()
print()
print("Elapsed time: ", end_overall - start_overall)
print("### total time in tskit.load(): ", total_load)
print("### total time in pyslim.remove_vacant(): ", total_remove_vacant)
print("### total time recapitating: ", total_recap)
print("### total time overlaying in the recapitation interval: ", total_overlay_recap)
print("### total time overlaying in the SLiM-simulated interval: ", total_overlay_slim)
print("### total time in dump(): ", total_dump)

Focusing on the output for chromosome 1 as an example, we get:

Processing chromosome_1.trees...
   loading...
   ### time for tskit.load():  1.5624354580650106
   ### time for pyslim.remove_vacant():  389.9872399580199
Basic tree sequence information:
╔═════════════════════════════╗
║TreeSequence                 ║
╠═══════════════╤═════════════╣
║Trees          │    5,282,216║
╟───────────────┼─────────────╢
║Sequence Length│2.4895642e+08║
╟───────────────┼─────────────╢
║Time Units     │  generations║
╟───────────────┼─────────────╢
║Sample Nodes   │      198,848║
╟───────────────┼─────────────╢
║Total Size     │      1.6 GiB║
╚═══════════════╧═════════════╝
╔═══════════╤══════════╤═════════╤════════════╗
║Table      │Rows      │Size     │Has Metadata║
╠═══════════╪══════════╪═════════╪════════════╣
║Edges      │23,110,403│705.3 MiB│          No║
╟───────────┼──────────┼─────────┼────────────╢
║Individuals│    99,424│  9.5 MiB│         Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Migrations │         0│  8 Bytes│          No║
╟───────────┼──────────┼─────────┼────────────╢
║Mutations  │    63,616│  3.8 MiB│         Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Nodes      │14,250,288│543.6 MiB│         Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Populations│         4│  2.7 KiB│         Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Provenances│         1│  8.9 KiB│          No║
╟───────────┼──────────┼─────────┼────────────╢
║Sites      │    62,941│  1.4 MiB│          No║
╚═══════════╧══════════╧═════════╧════════════╝

   fetching the recombination rate map...
   recapitating...
The provenance information for the resulting tree sequence is 8.38MB. This is nothing to worry about as provenance is a good thing to have, but if you want to save this memory/storage space you can disable provenance recording by setting record_provenance=False
   ### time for pyslim.recapitate():  123.96915929205716
   recapitation period starts at 65794 time ago
   63616 SLiM mutations
   overlaying mutations over the recapitated portion...
   ### time for overlay in recap interval:  3.5198231670074165
   4996 mutations overlaid over the recapitated portion
   fetching the mutation rate map for the SLiM period...
   overlaying mutations over the SLiM period...
   ### time for overlay in SLiM interval:  112.23789012490306
   10730034 mutations overlaid over the SLiM portion for a total of 10793650 mutations
   writing modified tree sequence...
   ### time for dump():  1.1096230839611962

So unless I have a typo in my timer code, remove_vacant() apparently takes 1.6 times longer than all of the other steps combined. And in this case, it doesn't even do much of anything, right? Chromosome 1 is an autosome, so there are in fact no vacant nodes. Maybe it sets a whole bunch of flags on nodes; but for this (perhaps typical) workflow those flags aren't even needed, since we never call restore_vacant(). If setting all the flags is the slow part, maybe it would be worthwhile to add a flag to remove_vacant() that says "no need to set the flags, I won't be using them". (Maybe there would be a way to note that, and then have restore_vacant() throw an error if it later gets called, saying "Hey, you said you didn't need the flags!")

I will mitigate this for my present use case by calling remove_vacant() only for the X/Y/MT chromosomes, which should get rid of most of the problem; but it will remain extremely slow for those three chromosomes. There must be a way to make it faster...? I'll be submitting this paper at the end of this week, and it'd be great to have a better performance story to tell, but I imagine that is too tight a timeframe for you to want to roll a new pyslim. This will bite everybody that uses remove_vacant(), though, I imagine (but perhaps won't be a huge deal for those not simulating whole genomes).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions