more reliable algorithms for choosing which questions to post #4

Open
njsmith opened this issue Aug 13, 2019 · 2 comments

Comments

@njsmith
Collaborator

njsmith commented Aug 13, 2019

I don't know if this matters at all. I suspect probably not. But it was an interesting puzzle that I got nerd-sniped by, so I figured I'd write down what I thought of :-)

The bot's goal is to post every question exactly once, with no duplicates and no misses, i.e. exactly-once delivery. As we know from distributed systems theory, reliable exactly-once delivery is ludicrously difficult or impossible, so we want to pick some "good enough" heuristic that isn't too complicated to implement. So the challenge is to optimize that reliability/simplicity tradeoff.

Right now the bot uses an extremely simple heuristic: it polls for questions every ten minutes, and then it posts any questions whose timestamps are less than ten minutes old. I love this because if you think of it from a distributed systems perspective it's obviously too simple to work (schedulers aren't accurate! We're comparing clocks across two completely separate platforms!), but in fact it (probably) works very well and it's so simple that it's actually hard to beat.
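For reference, the time-window heuristic can be sketched roughly like this. It is not the bot's actual code: the `questions_to_post` helper and the dict shape with `id` and `created` keys are assumptions made up for illustration.

```python
from datetime import datetime, timedelta, timezone

POLL_INTERVAL = timedelta(minutes=10)


def questions_to_post(questions, now, window=POLL_INTERVAL):
    """Return the questions created within the last `window` before `now`."""
    cutoff = now - window
    return [q for q in questions if q["created"] > cutoff]


# Hypothetical feed contents: one fresh question, one old one.
now = datetime(2019, 8, 13, 12, 0, tzinfo=timezone.utc)
feed = [
    {"id": 1, "created": now - timedelta(minutes=3)},
    {"id": 2, "created": now - timedelta(minutes=25)},
]
print([q["id"] for q in questions_to_post(feed, now)])  # prints [1]
```

Note that `now` comes from the bot's clock while `created` comes from the question feed, which is exactly the cross-platform clock comparison mentioned above.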

It does at least theoretically have flaws though: successive ten minute windows are never going to line up exactly; there will always be a gap or some overlap. I guess probably a few seconds every ten minutes. If a question happens to be posted during that time, then it will be either double-posted or lost entirely. Can this be avoided, without massively complicating the implementation?
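To make the gap/overlap problem concrete, here is a small simulation, assuming a scheduler that fires one poll three seconds late. Times are plain seconds, and `times_posted` is a hypothetical helper that counts how many polling windows contain a given question timestamp.

```python
WINDOW = 600  # ten minutes, in seconds


def times_posted(poll_times, question_time, window=WINDOW):
    """Count the polls whose window (poll - window, poll] contains the question."""
    return sum(1 for p in poll_times if p - window < question_time <= p)


# The second poll runs 3 seconds late; the third runs on time.
polls = [600, 1203, 1800]

for t in (601, 900, 1201):
    print(t, times_posted(polls, t))
# prints:
# 601 0   (falls in the gap between windows: lost)
# 900 1   (posted exactly once)
# 1201 2  (falls in the overlap: double-posted)
```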

I did have one clever idea: when we start up, fetch the feed and store a list in-memory of which questions are already there. Then on future iterations, fetch the list, and post all the questions that aren't in our in-memory list, and add them to the list. That totally removes the dependency on accurate clocks, and is still super simple. It does have the downside that if a question is posted a few minutes before the bot is restarted, it might get lost entirely. I guess we could try to minimize that chance by using a hybrid of the two approaches: schedule our regular checks at a known absolute time (like: not just every ten minutes, but at 1:00, 1:10, 1:20, etc.), and then at startup (only) use the question timestamps and assume that anything since the last scheduled check time was probably not posted by our previous incarnation, so we should post it.
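A minimal sketch of that hybrid, under stated assumptions: questions are dicts with hypothetical `id` and `created` keys, checks are aligned to ten-minute marks counted from midnight, and `SeenTracker` and `last_scheduled_check` are names invented here.

```python
from datetime import datetime, timedelta, timezone

CHECK_INTERVAL = timedelta(minutes=10)


def last_scheduled_check(now, interval=CHECK_INTERVAL):
    """Round `now` down to the most recent absolute check time (1:00, 1:10, ...)."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + ((now - midnight) // interval) * interval


class SeenTracker:
    def __init__(self, initial_feed, now):
        # At startup only, trust timestamps: anything created up to the last
        # scheduled check is assumed to have been posted by the previous
        # incarnation. Newer questions stay unseen, so they get posted.
        cutoff = last_scheduled_check(now)
        self.seen = {q["id"] for q in initial_feed if q["created"] <= cutoff}

    def new_questions(self, feed):
        """Return questions not seen before, and remember them (no clocks involved)."""
        fresh = [q for q in feed if q["id"] not in self.seen]
        self.seen.update(q["id"] for q in fresh)
        return fresh


utc = timezone.utc
start = datetime(2019, 8, 13, 13, 7, tzinfo=utc)  # bot restarts at 13:07
feed = [
    {"id": 1, "created": datetime(2019, 8, 13, 12, 40, tzinfo=utc)},
    {"id": 2, "created": datetime(2019, 8, 13, 13, 3, tzinfo=utc)},
]
tracker = SeenTracker(feed, start)
print([q["id"] for q in tracker.new_questions(feed)])  # prints [2]
print([q["id"] for q in tracker.new_questions(feed)])  # prints []
```

The in-memory set grows for as long as the process runs; at this scale that is unlikely to matter, but it could be trimmed to the IDs still present in the feed.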

Or, the other obvious approach would be to keep a persistent record of which questions have already been posted. Heroku makes it very easy to attach a database to a project, and at the scale we need it's free (less than 10k rows). This does add moderate complexity though, since we have to pick a Postgres client (async or sync?), set up a simple schema, remind ourselves how to do SQL, and at least think about garbage collection. It's still pretty simple though, and TBH it might score better overall than my Very Clever solution. Sigh.
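To give a feel for how small the database version could be, here is a sketch that uses the standard library's `sqlite3` as a stand-in for Postgres, so that it runs without any setup. The `posted` table, its columns, and the function names are all assumptions; a real Postgres version would need a client library and slightly different SQL (for example `ON CONFLICT DO NOTHING` instead of `INSERT OR IGNORE`).

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posted ("
        " question_id INTEGER PRIMARY KEY,"
        " posted_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def claim(conn, question_id):
    """Record a question as posted. Returns True only the first time."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO posted (question_id) VALUES (?)",
            (question_id,),
        )
    return cur.rowcount == 1


def garbage_collect(conn, keep_days=30):
    """Delete records older than `keep_days` so the table stays small."""
    with conn:
        conn.execute(
            "DELETE FROM posted WHERE posted_at < datetime('now', ?)",
            (f"-{keep_days} days",),
        )


conn = open_db()
print(claim(conn, 42))  # prints True: first time we see this question
print(claim(conn, 42))  # prints False: already recorded, skip it
```

This still isn't exactly-once: if `claim` runs before posting, a crash in between loses the question, and if it runs after posting, a crash in between produces a duplicate on restart.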

@njsmith
Collaborator Author

njsmith commented Aug 15, 2019

@jtrakk made an interesting suggestion on gitter:

"I wonder if the bot could just ask SO for this week's trio questions, and ask gitter for this week's posts by trio-bot, and post any missing ones?"

It doesn't look like there's any way to list the bot's posts specifically, but one could iterate backwards over the room's messages until reaching the last thing the bot said, or until going an hour back in time.

Still dubious that it's actually worth the effort to implement anything here, but it's a nice approach! Probably nicer than any of the other options mentioned above.
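One possible reading of that idea, sketched with plain dicts rather than any real gitter or SO API: `messages` and `questions` are hypothetical shapes, the links are placeholders, and the rule "stop at the question named in the bot's last message" is my interpretation, not something spelled out in the suggestion.

```python
from datetime import datetime, timedelta, timezone

MAX_LOOKBACK = timedelta(hours=1)


def last_bot_message(messages, bot_name, now, max_age=MAX_LOOKBACK):
    """Scan newest-first; stop at the bot's last message or after `max_age`."""
    for msg in sorted(messages, key=lambda m: m["sent"], reverse=True):
        if now - msg["sent"] > max_age:
            return None
        if msg["author"] == bot_name:
            return msg
    return None


def missing_questions(questions, messages, bot_name, now, max_age=MAX_LOOKBACK):
    """Return questions newer than the bot's last announcement, oldest first."""
    last = last_bot_message(messages, bot_name, now, max_age)
    announced = set(last["text"].split()) if last else set()
    missing = []
    for q in sorted(questions, key=lambda q: q["created"], reverse=True):
        if q["link"] in announced or now - q["created"] > max_age:
            break  # everything from here back is assumed handled
        missing.append(q)
    return missing[::-1]


utc = timezone.utc
now = datetime(2019, 8, 15, 12, 0, tzinfo=utc)
questions = [
    {"id": n, "link": f"https://example.com/q/{n}",
     "created": now - timedelta(minutes=m)}
    for n, m in [(1, 41), (2, 21), (3, 5)]
]
messages = [
    {"author": "trio-bot", "text": "New question: https://example.com/q/1",
     "sent": now - timedelta(minutes=40)},
    {"author": "trio-bot", "text": "New question: https://example.com/q/2",
     "sent": now - timedelta(minutes=20)},
    {"author": "alice", "text": "hi", "sent": now - timedelta(minutes=2)},
]
print([q["id"] for q in missing_questions(questions, messages, "trio-bot", now)])
# prints [3]
```

The appeal is that the chat room itself acts as the record of what was posted, so the bot needs no storage of its own and survives restarts.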

@Mariatta
Member

Mariatta commented Oct 2, 2019

If we're open to a commercial solution, we can use Zapier to detect new items in the RSS feed. It has a pretty sophisticated way of detecting whether an item is new or not (Note: I work at Zapier, so I have insider knowledge): https://zapier.com/apps/rss/integrations
So instead of maintaining our own database, writing code for detecting new items, etc., we use Zapier. We can even build our own gitter Zapier integration.
