Experiment with alternative HMM implementation #258

jeromekelleher · 2024-09-04T11:08:24Z

Haven't tried this out at scale yet, but setting a threshold of mu**k to detect sequences with k mismatches seems much more intuitive to me than all the other fiddling around we had to do.
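As a rough illustration of the idea, here is a minimal sketch of the thresholds it implies. It assumes each mismatch scales the path likelihood by roughly mu, and uses mu = 0.125, the value quoted later in this thread. `likelihood_threshold` is a hypothetical helper for illustration, not a function from this PR or from tsinfer.

```python
def likelihood_threshold(mu: float, k: int) -> float:
    """Threshold intended to detect sequences with k mismatches.

    Assumption: every mismatch multiplies the likelihood by about mu,
    so a path with k mismatches has likelihood on the order of mu**k.
    """
    return mu**k


mu = 0.125  # the arbitrarily chosen value mentioned in the thread
thresholds = [likelihood_threshold(mu, k) for k in range(3)]
for k, threshold in enumerate(thresholds):
    print(f"k={k}: threshold={threshold}")
```

With this mu the first three thresholds are 1, 0.125 and 0.015625.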

Requires tskit-dev/tsinfer#959

jeromekelleher · 2024-09-05T10:09:27Z

This seems to be working very well. Here's how we perform on a particular day:

ms-match (1/0)100%|████████████████████████████████████████████████████████████████████| 1.34k/1.34k [00:58, 23.1it/s]
ms-match (1/0)100%|████████████████████████████████████████████████████████████████████████| 454/454 [01:02, 7.25it/s]
ms-match (1/0)100%|████████████████████████████████████████████████████████████████████████| 143/143 [02:07, 1.12it/s]
ms-match (1/0)100%|██████████████████████████████████████████████████████████████████████| 82.0/82.0 [02:29, 1.82s/it]

On 2020-12-27 we match against a ts with nodes=231163;samples=209704;mutations=137794.

We run the first batch of 1339 matches with a likelihood threshold of 1 (i.e., we essentially only consider two possible likelihoods, 0 and 1). These run in 1 minute, at a rate of 23 matches per second. We find 885 matches with either 0 or 1 mismatches, which we accept as final.

We then run the second batch of the remaining 454 with a likelihood threshold of 0.125 (the arbitrarily chosen mu). This runs in one minute at a rate of 7 matches per second. We find 311 final matches with 1 or 2 mismatches.

We then run the third batch of the remaining 143 with a likelihood threshold of 0.015625 (mu**2). This runs in 2 minutes at a rate of 1.1 matches per second. We find 61 final matches with 2 or 3 mismatches.

We then run the final batch of 82 sequences at the highest possible precision (likelihood threshold of 1e-200). This runs in 2.5 minutes at a rate of 1.82 seconds per match.
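The batching described above could be sketched roughly as follows. This is one reading of the thread, not the code in this PR: `multi_pass_match` and `match_fn` are hypothetical names, and the rule "accept at pass k if there are at most k + 1 mismatches" is inferred from the 0-or-1, 1-or-2 and 2-or-3 figures quoted above.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple


def multi_pass_match(
    samples: Sequence[Hashable],
    match_fn: Callable[[Hashable, float], int],
    mu: float = 0.125,
    num_passes: int = 3,
    final_threshold: float = 1e-200,
) -> Dict[Hashable, Tuple[int, float]]:
    """Match samples at progressively lower likelihood thresholds.

    match_fn(sample, threshold) is a hypothetical stand-in for one HMM
    run; it returns the number of mismatches on the best path found.
    Returns a map of sample -> (mismatches, threshold it was accepted at).
    """
    accepted = {}
    remaining = list(samples)
    for k in range(num_passes):
        threshold = mu**k
        unresolved = []
        for sample in remaining:
            mismatches = match_fn(sample, threshold)
            # Assumed acceptance rule: at most k + 1 mismatches is final.
            if mismatches <= k + 1:
                accepted[sample] = (mismatches, threshold)
            else:
                unresolved.append(sample)
        remaining = unresolved
    # Whatever is left is rerun once at the highest precision.
    for sample in remaining:
        accepted[sample] = (match_fn(sample, final_threshold), final_threshold)
    return accepted


# Toy usage: pretend the true mismatch count is known for each sample.
true_mismatches = {"a": 0, "b": 2, "c": 5}
result = multi_pass_match(list(true_mismatches), lambda s, t: true_mismatches[s])
print(result)
```

In the toy run, "a" is accepted in the first pass, "b" in the second, and "c" falls through to the final high-precision pass.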

The output is very similar to other recent inferences, although with substantially fewer recombinants (only 8 as of the end of 2020). I'm hoping this is because the final high-precision pass is actually finding 3-mutation paths that earlier versions, working at a lower precision, weren't.

Conclusions:

The general strategy of running the HMM multiple times at different likelihood thresholds is working really well and substantially speeds things up. If we were to do all the matches at full precision it would take about 40 minutes. We did this here in about 6.5 minutes, which is a ~6X speed-up.
The third pass to pick up 2- and 3-mutation matches probably isn't worth it, as we end up redoing most of those matches again anyway (82 of 143), and we can just do the 0 and 1 passes followed by the final pass.
We can probably set a higher threshold for the final pass if we think about it a bit harder. However, some of the sequences we're interested in will have a very low likelihood indeed, and it is important that we get good matches for them.
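The figures in these conclusions can be checked with a little arithmetic. This sketch uses the wall-clock times from the progress bars above and assumes, as one plausible way to arrive at the 40-minute estimate, that every match would cost the 1.82 seconds seen in the final high-precision pass.

```python
# Wall-clock time of each pass, read off the progress bars (seconds).
pass_seconds = [58, 62, 2 * 60 + 7, 2 * 60 + 29]
multi_pass_total = sum(pass_seconds)

# Assumption: at full precision every one of the 1339 matches costs
# about as much as a match in the final pass (1.82 s each).
full_precision_total = 1339 * 1.82

speedup = full_precision_total / multi_pass_total
# Fraction of third-pass sequences that had to be redone in the final pass.
redone_fraction = 82 / 143

print(f"multi-pass: {multi_pass_total / 60:.1f} min")
print(f"full precision: {full_precision_total / 60:.1f} min")
print(f"speed-up: {speedup:.1f}x")
print(f"third pass redone: {redone_fraction:.0%}")
```

Under that assumption the totals come out at roughly 6.6 minutes against 40.6 minutes, consistent with the ~6X figure, and a little over half of the third-pass sequences are matched again in the final pass.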

Experiment with alternative HMM implementation

694812d

Requires tskit-dev/tsinfer#959

jeromekelleher mentioned this pull request Sep 4, 2024

Identify lowest normalised HMM likelihood value given MMR #242

Closed

Minor tweak

2e4f09a

jeromekelleher merged commit 23be916 into main Sep 5, 2024
3 checks passed

jeromekelleher deleted the dynamic-precision-experimental-hmm branch September 5, 2024 10:23
