Experiment with alternative HMM implementation #258

Merged
merged 2 commits into from
Sep 5, 2024


jeromekelleher
Owner

Requires tskit-dev/tsinfer#959

Haven't tried this out at scale yet, but setting a threshold of mu**k to detect sequences with k mismatches seems much more intuitive to me than all the other fiddling around we had to do.
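To make the `mu**k` idea concrete, here is a minimal sketch. It assumes a simplified model in which each mismatch multiplies the path likelihood by `mu` and everything else (recombination, the `1 - mu` factors) is ignored; the actual HMM in tskit-dev/tsinfer#959 may well differ. The helper name `threshold_for` is hypothetical, and `MU = 0.125` is the arbitrarily chosen value mentioned later in this thread.

```python
MU = 0.125  # arbitrarily chosen mismatch probability quoted in this thread


def threshold_for(k, mu=MU):
    """Likelihood threshold mu**k associated with k mismatches.

    Hypothetical helper: assumes each mismatch contributes a factor of mu
    and nothing else lowers the likelihood.
    """
    return mu**k


# Thresholds for k = 0, 1, 2, 3 mismatches.
thresholds = [threshold_for(k) for k in range(4)]
print(thresholds)
```

With `mu = 0.125` the first three values are 1, 0.125 and 0.015625, which correspond to the thresholds used for the first three passes reported below.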

@jeromekelleher
Owner Author

jeromekelleher commented Sep 5, 2024

This seems to be working very well. Here's how we perform on a particular day:

ms-match (1/0)100%|████████████████████████████████████████████████████████████████████| 1.34k/1.34k [00:58, 23.1it/s]
ms-match (1/0)100%|████████████████████████████████████████████████████████████████████████| 454/454 [01:02, 7.25it/s]
ms-match (1/0)100%|████████████████████████████████████████████████████████████████████████| 143/143 [02:07, 1.12it/s]
ms-match (1/0)100%|██████████████████████████████████████████████████████████████████████| 82.0/82.0 [02:29, 1.82s/it]

On 2020-12-27 we match against a ts with nodes=231163;samples=209704;mutations=137794.

We run the first batch of 1339 matches with a likelihood threshold of 1 (i.e., we essentially only consider two possible likelihoods, 0 and 1). These run in 1 minute, at a rate of 23 matches per second. We find 885 matches with either 0 or 1 mismatches, which we accept as final.

We then run the second batch of the remaining 454 with a likelihood threshold of 0.125 (the arbitrarily chosen mu). This runs in one minute at a rate of 7 matches per second. We find 311 final matches with 1 or 2 mismatches.

We then run the third batch of the remaining 143 with a likelihood threshold of 0.015625 (mu**2). This runs in 2 minutes at a rate of 1.1 matches per second. We find 61 final matches with 2 or 3 mismatches.

We then run the final batch of 82 sequences at the highest possible precision (likelihood threshold of 1e-200). This runs in 2.5 minutes at a rate of 1.82 seconds per match.
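The four passes above can be sketched as a simple cascade. This is only an illustration of the control flow as I read it from the description, not the code in this PR: `match_fn` is a hypothetical stand-in for the real HMM call, and the acceptance rule (accept a match as final when its mismatch count is at most the limit for that pass) is an assumption.

```python
def cascade_match(samples, match_fn, passes):
    """Match samples at successively lower likelihood thresholds.

    passes is a list of (likelihood_threshold, max_accepted_mismatches).
    A limit of None means the pass accepts every match (the final pass).
    match_fn(sample, threshold) is a hypothetical callable returning the
    number of mismatches in the best path found at that threshold.
    """
    final = {}
    remaining = list(samples)
    for threshold, max_mismatches in passes:
        unresolved = []
        for sample in remaining:
            mismatches = match_fn(sample, threshold)
            if max_mismatches is None or mismatches <= max_mismatches:
                final[sample] = (threshold, mismatches)
            else:
                # Rerun this sample at a lower threshold in the next pass.
                unresolved.append(sample)
        remaining = unresolved
    return final


# Toy data: pretend we know the true mismatch count of each sample.
true_mismatches = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 5}
mu = 0.125
passes = [(1, 1), (mu, 2), (mu**2, 3), (1e-200, None)]
result = cascade_match(
    true_mismatches, lambda sample, threshold: true_mismatches[sample], passes
)
print(result)
```

In the toy run, samples `a` and `b` are settled in the cheap first pass and only `e` reaches the expensive final pass, which is the effect the batch sizes above (1339, 454, 143, 82) show on real data.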

The output is very similar to other recent inferences, although with substantially fewer recombinants (only 8 as of the end of 2020). I'm hoping this is because the final high-precision pass is actually finding 3-mutation paths that earlier versions working at a lower precision didn't.

Conclusions:

  • The general strategy of running the HMM multiple times at different likelihood thresholds is working really well and substantially speeds things up. If we were to do all the matches at full precision it would take about 40 minutes. Here it took about 6.5 minutes, which is a ~6X speedup.
  • The third pass to pick up 2- and 3-mutation matches probably isn't worth it, as we end up redoing most of those matches anyway; we can just do the 0, 1 and final passes.
  • We can probably set a higher threshold for the final pass if we think about it a bit harder. However, some of the sequences we're interested in will have a very low likelihood indeed, and it is important that we get good matches for them.
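As a rough check on the numbers in the conclusions, the arithmetic can be reproduced from the progress-bar output above. The full-precision estimate assumes the final-pass rate of 1.82 seconds per match would apply to all 1339 sequences, which is my assumption about where the "about 40 minutes" figure comes from.

```python
# Elapsed wall time per pass, in seconds, read from the progress bars.
elapsed_s = {"pass 1": 58, "pass 2": 62, "pass 3": 127, "final": 149}
total_s = sum(elapsed_s.values())

# Batch sizes: each pass reruns whatever the previous pass did not finalise.
batches = [1339, 454, 143, 82]
accepted = [a - b for a, b in zip(batches, batches[1:])]

# Assumption: every match at full precision costs 1.82 s, as in the final pass.
full_precision_s = 1.82 * batches[0]
speedup = full_precision_s / total_s

print(f"multi-pass total: {total_s / 60:.1f} min")
print(f"full-precision estimate: {full_precision_s / 60:.1f} min")
print(f"speedup: {speedup:.1f}x")
print(f"accepted per pass: {accepted}")
```

The accepted counts come out as 885, 311 and 61, matching the figures reported for the first three passes.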

@jeromekelleher jeromekelleher merged commit 23be916 into main Sep 5, 2024
3 checks passed
@jeromekelleher jeromekelleher deleted the dynamic-precision-experimental-hmm branch September 5, 2024 10:23