remove_vacant() is VERY slow

I'm sorry to report that `remove_vacant()` seems to be extremely slow – so much so that it really destroys the performance of a pipeline involving tree-sequence recording.  I've got a simulation of the full human genome with the Gravel model, using tree-sequence recording to produce a trees archive which I then load into python for recapitation and neutral mutation overlay.  You can download the resulting trees archive [here](http://benhaller.com/slim/simhumanity_trees.zip).  Here's a script that loads, recapitates, and overlays each chromosome's tree sequence; it's a bit messy because it has timer code all over it, sorry:

```
# model3_analysis.py
#
# By Ben Haller and Murillo F. Rodrigues, 24 August 2025
# Messer Lab, Cornell University

# note that this requires pyslim 1.1.0 or later to run!

from pathlib import Path
import tskit, msprime, pyslim, warnings, os
from timeit import default_timer as timer
import polars as pl
import numpy as np

# NOTE: the base repository path needs to be configured for your setup here!
repository_path = Path('/Users/bhaller/Documents/Research/MesserLab/SLiM_project/Publication 2025 HumanPopGen/SimHumanity')

# set the current working directory to the SimHumanity repository
os.chdir(repository_path)

in_path = repository_path / "simhumanity_trees"
out_path = repository_path / "simhumanity_trees_RO"
out_path.mkdir(parents=False, exist_ok=False)

MU_TOTAL = 2.0e-8
MU_BENEF = 1e-12
MU_DELET = 1.2e-8

def get_recomb_map(chrom, seq_len):
    """
    Returns a recombination rate map (msprime.RateMap) for a given chromosome.
    Note the relative path to stdpopsim extraction is hardcoded in the function.
    """
    if chrom == 'MT':
        positions = np.array([0, seq_len])
        rates = np.array([0.0])
    else:
        rec_path = Path(f'stdpopsim extraction/extracted/chr{chrom}_recombination.txt')
        rec_df = pl.read_csv(rec_path, has_header=False, new_columns=["pos", "rate"])

        positions = np.append(rec_df['pos'].to_numpy(), seq_len) # need to include the end of the sequence
        rates = rec_df['rate'].to_numpy()
    return msprime.RateMap(position=positions, rate=rates)

def get_mut_map(chrom, seq_len, mt_multiplier = 20):
    """
    Returns a mutation rate map (msprime.RateMap) for a given chromosome.
    For the MT chromosome, the mutation rate is constant and equal to MU_TOTAL multiplied by mt_multiplier.
    Note the relative paths to stdpopsim extraction and the mutation rates are hardcoded in the function.
    """
    if chrom == 'MT':
        breakpoints = np.array([0, seq_len])
        rates = np.array([MU_TOTAL*mt_multiplier])
    else:
        ge_path = Path(f'stdpopsim extraction/extracted/chr{chrom}_genomic_elements.txt')
        ge_df = pl.read_csv(ge_path, has_header=False, new_columns=["type", "start", "end"])
        ge_df = ge_df.with_columns(pl.when(pl.col("type") == 0).then(pl.lit(MU_TOTAL)).otherwise(pl.lit(MU_TOTAL-MU_BENEF-MU_DELET)).alias("neutral_rate")) # computing the neutral rate for each genomic element
        breakpoints = np.append(ge_df["start"].to_numpy(), seq_len)
        rates = ge_df["neutral_rate"].to_numpy()
    return msprime.RateMap(position=breakpoints, rate=rates)


start_overall = timer()
total_load = 0
total_remove_vacant = 0
total_recap = 0
total_overlay_recap = 0
total_overlay_slim = 0
total_dump = 0

try:
    # iterate over the .trees files in the trees archive, in alphabetical order
    for file_path in sorted(in_path.iterdir(), key=lambda x:x.name):
        if file_path.is_file() and file_path.suffix == '.trees':
            print(f"Processing {file_path.name}...")
            print(f"   loading...")

            # extract the chromosome number from the file path
            chrom = str(file_path).split('.')[0].split('_')[-1]

            start_load = timer()
            ts = tskit.load(file_path)
            time_load = timer() - start_load
            total_load = total_load + time_load
            print("   ### time for tskit.load(): ", time_load)

            # remove vacant nodes from the sample, to avoid issues with "missing data" in stats computations
            start_remove_vacant = timer()
            ts = pyslim.remove_vacant(ts)
            time_remove_vacant = timer() - start_remove_vacant
            total_remove_vacant = total_remove_vacant + time_remove_vacant
            print("   ### time for pyslim.remove_vacant(): ", time_remove_vacant)

            print(f"Basic tree sequence information:")
            print(ts)


            # recapitate; note this issues a warning about large provenance, because it includes the rate map in the provenance
            start_recap = timer()

            print(f"   fetching the recombination rate map...")
            rec_map = get_recomb_map(chrom, ts.sequence_length)

            print(f"   recapitating...")
            ts = pyslim.recapitate(ts, ancestral_Ne=7310, recombination_rate=rec_map, random_seed=1)

            time_recap = timer() - start_recap
            total_recap = total_recap + time_recap
            print("   ### time for pyslim.recapitate(): ", time_recap)


            # compute where the recapitation period starts (in time ago)
            recap_start_time = ts.metadata["SLiM"]["tick"]-ts.metadata["SLiM"]["cycle"]
            print(f"   recapitation period starts at {recap_start_time} time ago")

            num_slim_muts = ts.num_mutations
            print(f"   {num_slim_muts} SLiM mutations")


            # overlay neutral mutations during the recapitation period
            print(f"   overlaying mutations over the recapitated portion...")
            next_id = pyslim.next_slim_mutation_id(ts)

            start_overlay_recap = timer()
            ts = msprime.sim_mutations(ts, rate=MU_TOTAL, random_seed=1, model=msprime.SLiMMutationModel(type=0, next_id=next_id), keep=True, start_time=recap_start_time) # start_time is in time ago, so we need to add mutations with a constant rate across the chromosome starting at the time the recapitated period starts (and older)
            time_overlay_recap = timer() - start_overlay_recap
            total_overlay_recap = total_overlay_recap + time_overlay_recap
            print("   ### time for overlay in recap interval: ", time_overlay_recap)

            num_postrecap_muts = ts.num_mutations
            print(f"   {num_postrecap_muts-num_slim_muts} mutations overlaid over the recapitated portion")


            # overlay neutral mutations during the SLiM period
            start_overlay_slim = timer()
            print(f"   fetching the mutation rate map for the SLiM period...")
            mut_map = get_mut_map(chrom, ts.sequence_length)

            print(f"   overlaying mutations over the SLiM period...")
            next_id = pyslim.next_slim_mutation_id(ts)

            ts = msprime.sim_mutations(ts, rate=mut_map, random_seed=1, model=msprime.SLiMMutationModel(type=0, next_id=next_id), keep=True, end_time=recap_start_time)
            time_overlay_slim = timer() - start_overlay_slim
            total_overlay_slim = total_overlay_slim + time_overlay_slim
            print("   ### time for overlay in SLiM interval: ", time_overlay_slim)


            num_total_muts = ts.num_mutations
            print(f"   {num_total_muts-num_slim_muts} mutations overlaid over the SLiM portion for a total of {num_total_muts} mutations")


            # write out the modified tree sequence to simhumanity_trees_RO
            print(f"   writing modified tree sequence...")

            start_dump = timer()
            ts.dump(str(out_path / file_path.name))
            time_dump = timer() - start_dump
            total_dump = total_dump + time_dump
            print("   ### time for dump(): ", time_dump)
except FileNotFoundError:
    print(f"Error: Directory '{directory_path}' not found.")
except Exception as e:
    print(f"An error occurred: {e}")

end_overall = timer()
print()
print("Elapsed time: ", end_overall - start_overall)
print("### total time in tskit.load(): ", total_load)
print("### total time in pyslim.remove_vacant(): ", total_remove_vacant)
print("### total time recapitating: ", total_recap)
print("### total time overlaying in the recapitation interval: ", total_overlay_recap)
print("### total time overlaying in the SLiM-simulated interval: ", total_overlay_slim)
print("### total time in dump(): ", total_dump)
```

Focusing on the output for chromosome 1 as an example, we get:

```
Processing chromosome_1.trees...
   loading...
   ### time for tskit.load():  1.5624354580650106
   ### time for pyslim.remove_vacant():  389.9872399580199
Basic tree sequence information:
╔═════════════════════════════╗
║TreeSequence                 ║
╠═══════════════╤═════════════╣
║Trees          │    5,282,216║
╟───────────────┼─────────────╢
║Sequence Length│2.4895642e+08║
╟───────────────┼─────────────╢
║Time Units     │  generations║
╟───────────────┼─────────────╢
║Sample Nodes   │      198,848║
╟───────────────┼─────────────╢
║Total Size     │      1.6 GiB║
╚═══════════════╧═════════════╝
╔═══════════╤══════════╤═════════╤════════════╗
║Table      │Rows      │Size     │Has Metadata║
╠═══════════╪══════════╪═════════╪════════════╣
║Edges      │23,110,403│705.3 MiB│          No║
╟───────────┼──────────┼─────────┼────────────╢
║Individuals│    99,424│  9.5 MiB│         Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Migrations │         0│  8 Bytes│          No║
╟───────────┼──────────┼─────────┼────────────╢
║Mutations  │    63,616│  3.8 MiB│         Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Nodes      │14,250,288│543.6 MiB│         Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Populations│         4│  2.7 KiB│         Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Provenances│         1│  8.9 KiB│          No║
╟───────────┼──────────┼─────────┼────────────╢
║Sites      │    62,941│  1.4 MiB│          No║
╚═══════════╧══════════╧═════════╧════════════╝

   fetching the recombination rate map...
   recapitating...
The provenance information for the resulting tree sequence is 8.38MB. This is nothing to worry about as provenance is a good thing to have, but if you want to save this memory/storage space you can disable provenance recording by setting record_provenance=False
   ### time for pyslim.recapitate():  123.96915929205716
   recapitation period starts at 65794 time ago
   63616 SLiM mutations
   overlaying mutations over the recapitated portion...
   ### time for overlay in recap interval:  3.5198231670074165
   4996 mutations overlaid over the recapitated portion
   fetching the mutation rate map for the SLiM period...
   overlaying mutations over the SLiM period...
   ### time for overlay in SLiM interval:  112.23789012490306
   10730034 mutations overlaid over the SLiM portion for a total of 10793650 mutations
   writing modified tree sequence...
   ### time for dump():  1.1096230839611962
```

So unless I have a typo in my timer code, `remove_vacant()` apparently takes 1.6 times longer than all of the other steps _combined_.  And in this case, it doesn't even do much of anything, right?  Chromosome 1 is an autosome, so there are in fact no vacant nodes.  Maybe it sets a whole bunch of flags on nodes; but for this (perhaps typical) workflow those flags aren't even needed, since we never call `restore_vacant()`.  If setting all the flags is the slow part, maybe it would be worthwhile to add a flag to `remove_vacant()` that says "no need to set the flags, I won't be using them".  (Maybe there would be a way to note that, and then have `restore_vacant()` throw an error if it later gets called, saying "Hey, you said you didn't need the flags!")

I will mitigate this for my present use case by calling `remove_vacant()` only for the X/Y/MT chromosomes, which should get rid of most of the problem; but it will remain extremely slow for those three chromosomes.  There must be a way to make it faster...?  I'll be submitting this paper at the end of this week, and it'd be great to have a better performance story to tell, but I imagine that is too tight a timeframe for you to want to roll a new pyslim.  This will bite everybody that uses `remove_vacant()`, though, I imagine (but perhaps won't be a huge deal for those not simulating whole genomes).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

remove_vacant() is VERY slow #393

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

remove_vacant() is VERY slow #393

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions