-
Notifications
You must be signed in to change notification settings - Fork 23
Description
I'm sorry to report that remove_vacant()
seems to be extremely slow – so much so that it really destroys the performance of a pipeline involving tree-sequence recording. I've got a simulation of the full human genome with the Gravel model, using tree-sequence recording to produce a trees archive which I then load into python for recapitation and neutral mutation overlay. You can download the resulting trees archive here. Here's a script that loads, recapitates, and overlays each chromosome's tree sequence; it's a bit messy because it has timer code all over it, sorry:
# model3_analysis.py
#
# By Ben Haller and Murillo F. Rodrigues, 24 August 2025
# Messer Lab, Cornell University
# note that this requires pyslim 1.1.0 or later to run!
from pathlib import Path
import tskit, msprime, pyslim, warnings, os
from timeit import default_timer as timer
import polars as pl
import numpy as np
# NOTE: the base repository path needs to be configured for your setup here!
repository_path = Path('/Users/bhaller/Documents/Research/MesserLab/SLiM_project/Publication 2025 HumanPopGen/SimHumanity')
# set the current working directory to the SimHumanity repository
os.chdir(repository_path)
in_path = repository_path / "simhumanity_trees"
out_path = repository_path / "simhumanity_trees_RO"
out_path.mkdir(parents=False, exist_ok=False)
MU_TOTAL = 2.0e-8
MU_BENEF = 1e-12
MU_DELET = 1.2e-8
def get_recomb_map(chrom, seq_len):
"""
Returns a recombination rate map (msprime.RateMap) for a given chromosome.
Note the relative path to stdpopsim extraction is hardcoded in the function.
"""
if chrom == 'MT':
positions = np.array([0, seq_len])
rates = np.array([0.0])
else:
rec_path = Path(f'stdpopsim extraction/extracted/chr{chrom}_recombination.txt')
rec_df = pl.read_csv(rec_path, has_header=False, new_columns=["pos", "rate"])
positions = np.append(rec_df['pos'].to_numpy(), seq_len) # need to include the end of the sequence
rates = rec_df['rate'].to_numpy()
return msprime.RateMap(position=positions, rate=rates)
def get_mut_map(chrom, seq_len, mt_multiplier = 20):
"""
Returns a mutation rate map (msprime.RateMap) for a given chromosome.
For the MT chromosome, the mutation rate is constant and equal to MU_TOTAL multiplied by mt_multiplier.
Note the relative paths to stdpopsim extraction and the mutation rates are hardcoded in the function.
"""
if chrom == 'MT':
breakpoints = np.array([0, seq_len])
rates = np.array([MU_TOTAL*mt_multiplier])
else:
ge_path = Path(f'stdpopsim extraction/extracted/chr{chrom}_genomic_elements.txt')
ge_df = pl.read_csv(ge_path, has_header=False, new_columns=["type", "start", "end"])
ge_df = ge_df.with_columns(pl.when(pl.col("type") == 0).then(pl.lit(MU_TOTAL)).otherwise(pl.lit(MU_TOTAL-MU_BENEF-MU_DELET)).alias("neutral_rate")) # computing the neutral rate for each genomic element
breakpoints = np.append(ge_df["start"].to_numpy(), seq_len)
rates = ge_df["neutral_rate"].to_numpy()
return msprime.RateMap(position=breakpoints, rate=rates)
start_overall = timer()
total_load = 0
total_remove_vacant = 0
total_recap = 0
total_overlay_recap = 0
total_overlay_slim = 0
total_dump = 0
try:
# iterate over the .trees files in the trees archive, in alphabetical order
for file_path in sorted(in_path.iterdir(), key=lambda x:x.name):
if file_path.is_file() and file_path.suffix == '.trees':
print(f"Processing {file_path.name}...")
print(f" loading...")
# extract the chromosome number from the file path
chrom = str(file_path).split('.')[0].split('_')[-1]
start_load = timer()
ts = tskit.load(file_path)
time_load = timer() - start_load
total_load = total_load + time_load
print(" ### time for tskit.load(): ", time_load)
# remove vacant nodes from the sample, to avoid issues with "missing data" in stats computations
start_remove_vacant = timer()
ts = pyslim.remove_vacant(ts)
time_remove_vacant = timer() - start_remove_vacant
total_remove_vacant = total_remove_vacant + time_remove_vacant
print(" ### time for pyslim.remove_vacant(): ", time_remove_vacant)
print(f"Basic tree sequence information:")
print(ts)
# recapitate; note this issues a warning about large provenance, because it includes the rate map in the provenance
start_recap = timer()
print(f" fetching the recombination rate map...")
rec_map = get_recomb_map(chrom, ts.sequence_length)
print(f" recapitating...")
ts = pyslim.recapitate(ts, ancestral_Ne=7310, recombination_rate=rec_map, random_seed=1)
time_recap = timer() - start_recap
total_recap = total_recap + time_recap
print(" ### time for pyslim.recapitate(): ", time_recap)
# compute where the recapitation period starts (in time ago)
recap_start_time = ts.metadata["SLiM"]["tick"]-ts.metadata["SLiM"]["cycle"]
print(f" recapitation period starts at {recap_start_time} time ago")
num_slim_muts = ts.num_mutations
print(f" {num_slim_muts} SLiM mutations")
# overlay neutral mutations during the recapitation period
print(f" overlaying mutations over the recapitated portion...")
next_id = pyslim.next_slim_mutation_id(ts)
start_overlay_recap = timer()
ts = msprime.sim_mutations(ts, rate=MU_TOTAL, random_seed=1, model=msprime.SLiMMutationModel(type=0, next_id=next_id), keep=True, start_time=recap_start_time) # start_time is in time ago, so we need to add mutations with a constant rate across the chromosome starting at the time the recapitated period starts (and older)
time_overlay_recap = timer() - start_overlay_recap
total_overlay_recap = total_overlay_recap + time_overlay_recap
print(" ### time for overlay in recap interval: ", time_overlay_recap)
num_postrecap_muts = ts.num_mutations
print(f" {num_postrecap_muts-num_slim_muts} mutations overlaid over the recapitated portion")
# overlay neutral mutations during the SLiM period
start_overlay_slim = timer()
print(f" fetching the mutation rate map for the SLiM period...")
mut_map = get_mut_map(chrom, ts.sequence_length)
print(f" overlaying mutations over the SLiM period...")
next_id = pyslim.next_slim_mutation_id(ts)
ts = msprime.sim_mutations(ts, rate=mut_map, random_seed=1, model=msprime.SLiMMutationModel(type=0, next_id=next_id), keep=True, end_time=recap_start_time)
time_overlay_slim = timer() - start_overlay_slim
total_overlay_slim = total_overlay_slim + time_overlay_slim
print(" ### time for overlay in SLiM interval: ", time_overlay_slim)
num_total_muts = ts.num_mutations
print(f" {num_total_muts-num_slim_muts} mutations overlaid over the SLiM portion for a total of {num_total_muts} mutations")
# write out the modified tree sequence to simhumanity_trees_RO
print(f" writing modified tree sequence...")
start_dump = timer()
ts.dump(str(out_path / file_path.name))
time_dump = timer() - start_dump
total_dump = total_dump + time_dump
print(" ### time for dump(): ", time_dump)
except FileNotFoundError:
print(f"Error: Directory '{directory_path}' not found.")
except Exception as e:
print(f"An error occurred: {e}")
end_overall = timer()
print()
print("Elapsed time: ", end_overall - start_overall)
print("### total time in tskit.load(): ", total_load)
print("### total time in pyslim.remove_vacant(): ", total_remove_vacant)
print("### total time recapitating: ", total_recap)
print("### total time overlaying in the recapitation interval: ", total_overlay_recap)
print("### total time overlaying in the SLiM-simulated interval: ", total_overlay_slim)
print("### total time in dump(): ", total_dump)
Focusing on the output for chromosome 1 as an example, we get:
Processing chromosome_1.trees...
loading...
### time for tskit.load(): 1.5624354580650106
### time for pyslim.remove_vacant(): 389.9872399580199
Basic tree sequence information:
╔═════════════════════════════╗
║TreeSequence ║
╠═══════════════╤═════════════╣
║Trees │ 5,282,216║
╟───────────────┼─────────────╢
║Sequence Length│2.4895642e+08║
╟───────────────┼─────────────╢
║Time Units │ generations║
╟───────────────┼─────────────╢
║Sample Nodes │ 198,848║
╟───────────────┼─────────────╢
║Total Size │ 1.6 GiB║
╚═══════════════╧═════════════╝
╔═══════════╤══════════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪══════════╪═════════╪════════════╣
║Edges │23,110,403│705.3 MiB│ No║
╟───────────┼──────────┼─────────┼────────────╢
║Individuals│ 99,424│ 9.5 MiB│ Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼──────────┼─────────┼────────────╢
║Mutations │ 63,616│ 3.8 MiB│ Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Nodes │14,250,288│543.6 MiB│ Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Populations│ 4│ 2.7 KiB│ Yes║
╟───────────┼──────────┼─────────┼────────────╢
║Provenances│ 1│ 8.9 KiB│ No║
╟───────────┼──────────┼─────────┼────────────╢
║Sites │ 62,941│ 1.4 MiB│ No║
╚═══════════╧══════════╧═════════╧════════════╝
fetching the recombination rate map...
recapitating...
The provenance information for the resulting tree sequence is 8.38MB. This is nothing to worry about as provenance is a good thing to have, but if you want to save this memory/storage space you can disable provenance recording by setting record_provenance=False
### time for pyslim.recapitate(): 123.96915929205716
recapitation period starts at 65794 time ago
63616 SLiM mutations
overlaying mutations over the recapitated portion...
### time for overlay in recap interval: 3.5198231670074165
4996 mutations overlaid over the recapitated portion
fetching the mutation rate map for the SLiM period...
overlaying mutations over the SLiM period...
### time for overlay in SLiM interval: 112.23789012490306
10730034 mutations overlaid over the SLiM portion for a total of 10793650 mutations
writing modified tree sequence...
### time for dump(): 1.1096230839611962
So unless I have a typo in my timer code, remove_vacant()
apparently takes 1.6 times longer than all of the other steps combined. And in this case, it doesn't even do much of anything, right? Chromosome 1 is an autosome, so there are in fact no vacant nodes. Maybe it sets a whole bunch of flags on nodes; but for this (perhaps typical) workflow those flags aren't even needed, since we never call restore_vacant()
. If setting all the flags is the slow part, maybe it would be worthwhile to add a flag to remove_vacant()
that says "no need to set the flags, I won't be using them". (Maybe there would be a way to note that, and then have restore_vacant()
throw an error if it later gets called, saying "Hey, you said you didn't need the flags!")
I will mitigate this for my present use case by calling remove_vacant()
only for the X/Y/MT chromosomes, which should get rid of most of the problem; but it will remain extremely slow for those three chromosomes. There must be a way to make it faster...? I'll be submitting this paper at the end of this week, and it'd be great to have a better performance story to tell, but I imagine that is too tight a timeframe for you to want to roll a new pyslim. This will bite everybody that uses remove_vacant()
, though, I imagine (but perhaps won't be a huge deal for those not simulating whole genomes).