Skip to content

Recapitating from an incomplete pedigree #1865

@jeromekelleher

Description

@jeromekelleher

Consider the following code:

import msprime
import msprime.pedigrees

N = 5
tables = msprime.pedigrees.sim_pedigree(
    population_size=N, random_seed=1234, end_time=10
)

tables.sequence_length = 100
ped_ts = msprime.sim_ancestry(
    initial_state=tables,
    model="wf_ped",
    random_seed=12345,
)
final_ts = msprime.sim_ancestry(
    initial_state=ped_ts, population_size=N, model="dtwf", random_seed=123456
)

print(final_ts.draw_text())

The trees we get are:

41.00┊                              110        ┊   
     ┊                        ┏━━━━━━┻━━━━━━┓  ┊   
10.00┊                        8             9  ┊   
     ┊                        ┃             ┃  ┊   
9.00 ┊                        ┃             ┃  ┊   
     ┊                        ┃             ┃  ┊   
8.00 ┊                        ┃             ┃  ┊   
     ┊                        ┃             ┃  ┊   
7.00 ┊                       37             ┃  ┊   
     ┊                ┏━━━━━━━┻━━━━━━━┓     ┃  ┊   
6.00 ┊                ┃               ┃     ┃  ┊   
     ┊                ┃               ┃     ┃  ┊   
5.00 ┊                ┃              50     ┃  ┊   
     ┊                ┃             ┏━┻━┓   ┃  ┊   
4.00 ┊                ┃             ┃   ┃   ┃  ┊   
     ┊                ┃             ┃   ┃   ┃  ┊   
3.00 ┊                ┃             ┃   ┃   ┃  ┊   
     ┊                ┃             ┃   ┃   ┃  ┊   
2.00 ┊               83             ┃   ┃   ┃  ┊   
     ┊      ┏━━━━━━━┳━┻━━━┳━━━━━┓   ┃   ┃   ┃  ┊   
1.00 ┊     93       ┃    98     ┃   ┃   ┃   ┃  ┊   
     ┊  ┏━━━╋━━━┓   ┃   ┏━┻━┓   ┃   ┃   ┃   ┃  ┊   
0.00 ┊ 100 107 108 101 102 104 109 105 106 103 ┊   
   0.00                                     100.00 

This tree is a valid sample from a Wright-Fisher process with N=5 because we've simulated the pedigree under this model, and then completed the simulation under the same model. The pedigree finished at time 10, and we had nodes 8 and 9. These were then picked up as the samples in a DTWF simulation starting from this point, and all was good.

However, as @hyanwong and @petrelharp pointed out, there's a problem when we have an incompete pedigree. Suppose we break 10% of the links in the pedigree, like this:

N = 5
tables = msprime.pedigrees.sim_pedigree(
    population_size=N, random_seed=1234, end_time=10
)

parents = tables.individuals.parents
parents[0::10] = -1
tables.individuals.parents = parents

tables.sequence_length = 100
ped_ts = msprime.sim_ancestry(
    initial_state=tables,
    model="wf_ped",
    random_seed=12345,
)
print(ped_ts.draw_text())

the tree sequence we get out of the pedigree looks like this:

10.00┊          9                              ┊   
     ┊          ┃                              ┊   
9.00 ┊          ┃                              ┊   
     ┊          ┃                              ┊   
8.00 ┊          ┃                              ┊   
     ┊          ┃                              ┊   
7.00 ┊          ┃                              ┊   
     ┊          ┃                              ┊   
6.00 ┊          ┃                              ┊   
     ┊          ┃                              ┊   
5.00 ┊          ┃                 50           ┊   
     ┊          ┃            ┏━━━━━┻━━━━┓      ┊   
4.00 ┊          ┃           67          ┃      ┊   
     ┊          ┃       ┏━━━━┻━━━┓      ┃      ┊   
3.00 ┊          ┃       ┃        ┃      ┃      ┊   
     ┊          ┃       ┃        ┃      ┃      ┊   
2.00 ┊         83       ┃       85      ┃      ┊   
     ┊      ┏━━━╋━━━┓   ┃     ┏━━┻━━┓   ┃      ┊   
1.00 ┊      ┃   ┃   ┃   ┃    92     ┃   ┃  90  ┊   
     ┊      ┃   ┃   ┃   ┃   ┏━┻━┓   ┃   ┃   ┃  ┊   
0.00 ┊ 100 101 104 108 102 105 107 106 109 103 ┊   
   0.00                                     100.00 

So, we have nodes 100, 9, and 90 popping out of the pedigree at different times. Then, if we recapitate as before we get

31.00┊                      112                ┊   
     ┊             ┏━━━━━━━━━┻━━━━━━━━━━┓      ┊   
15.00┊            111                   ┃      ┊   
     ┊    ┏━━━━━━━━┻━━━━━━━┓            ┃      ┊   
14.00┊   110               ┃            ┃      ┊   
     ┊  ┏━┻━┓              ┃            ┃      ┊   
10.00┊  ┃   ┃              ┃            9      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
9.00 ┊  ┃   ┃              ┃            ┃      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
8.00 ┊  ┃   ┃              ┃            ┃      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
7.00 ┊  ┃   ┃              ┃            ┃      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
6.00 ┊  ┃   ┃              ┃            ┃      ┊   
     ┊  ┃   ┃              ┃            ┃      ┊   
5.00 ┊  ┃   ┃             50            ┃      ┊   
     ┊  ┃   ┃        ┏━━━━━┻━━━━┓       ┃      ┊   
4.00 ┊  ┃   ┃       67          ┃       ┃      ┊   
     ┊  ┃   ┃   ┏━━━━┻━━━┓      ┃       ┃      ┊   
3.00 ┊  ┃   ┃   ┃        ┃      ┃       ┃      ┊   
     ┊  ┃   ┃   ┃        ┃      ┃       ┃      ┊   
2.00 ┊  ┃   ┃   ┃       85      ┃      83      ┊   
     ┊  ┃   ┃   ┃     ┏━━┻━━┓   ┃   ┏━━━╋━━━┓  ┊   
1.00 ┊  ┃  90   ┃    92     ┃   ┃   ┃   ┃   ┃  ┊   
     ┊  ┃   ┃   ┃   ┏━┻━┓   ┃   ┃   ┃   ┃   ┃  ┊   
0.00 ┊ 100 103 102 105 107 106 109 101 104 108 ┊   
   0.00                                     100.00 

We have picked up the three pedigree founders and simulated their history out to coalescence, as required. However, this is not a valid Wright-Fisher sample.

When we start the recapitation simulation, msprime sees that there is an sample at node 90 in generation 1, so it sets things up accordingly and starts simulating a DTWF where 90 is a sample and there is a diploid population of 5 individuals which it can descend from. However, this is not true, because the size of the population it's in is effectively smaller due to the existence of the pedigree individuals.

It seems like we should be able to correct for this by reasoning about the number of lineages that are in the trees we're recapitating from at each time. This would be tricky, probably, so it would be good to get people's thoughts on what's happening here.

Pinging @sgravel @LukeAndersonTrocme @apragsdale for thoughts

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions