Allow parallel testing, take #2 #181

Babar · 2021-12-09T02:35:57Z

Basically a re-do of #158.
This got rolled back because @Babar wrote 3 different PRs and one of them was bad.
This one wasn't. Technically it wasn't good either, as it only did test, not really untest, run or keeptesting.

This re-do fixes this.

Still the same features:

* The number of threads can be set in the config or on the command line, and defaults to the number of cores
* Reports failed hosts separately, depending on whether they failed on ssh or on something else
* Doesn't add any extra dependency, as the thread pool is done in a very simple way
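
To illustrate the first feature, here is a minimal sketch of how the thread count could be resolved. The helper name and its keyword arguments are hypothetical and not taken from the PR; only `Etc.nprocessors` is a real standard-library call.

```ruby
require 'etc'

# Hypothetical helper (not from the PR): pick the number of worker threads
# from the command line first, then the config, then the number of cores.
def resolve_parallel_hosts(cli_value: nil, config_value: nil)
  count = Integer(cli_value || config_value || Etc.nprocessors)
  raise ArgumentError, 'thread count must be positive' if count < 1

  count
end
```

With this ordering, a command-line value always wins over the config, and a machine with no explicit setting uses one thread per core.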

Test Plan:
Ran it on my machine. Tried test and untest, both worked.

Signed-off-by: Olivier Raginel [email protected]

jaymzh · 2021-12-09T03:10:49Z

IIRC, this broke internal wrappers... so CC @NaomiReeves on that.

But you definitely should make sure you've tested this with internal wrappers.

Babar · 2021-12-09T05:22:52Z

The other PR broke the wrappers, from what I remember: specifically the part where I moved the ARGS to be hostnames, 'cause some parts of our wrappers rely on TT just dropping unknown parameters.
I've fixed the wrapper too, but this is mostly to get speed improvements when testing a bunch of hosts.
Also, I've fixed the lint, so updating the PR shortly.

Babar · 2021-12-09T08:04:15Z

Oh one thing I'd love feedback on:

How do people think we should deal with failures?
How do we deal with logs?
In test, we report the failures and exit with:
* 0 if no failure
* 1 if all hosts failed and at least one failure was due to ssh connection issue
* 2 otherwise (meaning at least one host failed for whatever reason and at least one succeeded)
* 3 if all hosts failed because they were already tested by another user
I kept this logic for the other modes (so untest, run and keeptesting), even though 3 cannot happen.
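
One possible reading of the exit codes above, as a sketch. The method name, the shape of the `results` hash and the status symbols are assumptions made for illustration, and the case "all hosts failed, none on ssh, not all already in testing" is mapped to 2 because the list only says "otherwise".

```ruby
# Hypothetical: results maps hostname => :ok, :ssh_failure, :in_testing or :failed.
def exit_code_for(results)
  failures = results.values.reject { |status| status == :ok }
  return 0 if failures.empty?

  all_failed = failures.length == results.length
  # 3: every host was already being tested by another user
  return 3 if all_failed && failures.all? { |status| status == :in_testing }
  # 1: every host failed and at least one failure was an ssh failure
  return 1 if all_failed && failures.include?(:ssh_failure)

  # 2: anything else
  2
end
```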

As for logging, run can get very noisy with all the threads writing at the same time, and not knowing what's what. So I started writing something to prefix the logs with the hostname. But then I went down the rabbit hole, so I'll try to finish this some other night.
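
A minimal sketch of the hostname prefixing idea, assuming each host thread gets its own writer. The class name and interface are hypothetical and not what the PR implements: partial writes are buffered per host, and only complete lines are emitted, under a shared lock so lines from different threads don't interleave.

```ruby
# Hypothetical wrapper (not from the PR): one instance per host thread.
class PrefixedWriter
  LOCK = Mutex.new # shared by all writers, serializes output

  def initialize(hostname, out = $stdout)
    @prefix = "[#{hostname}] "
    @out = out
    @buffer = +''
  end

  # Accumulate the chunk and emit every complete line with the prefix.
  def write(chunk)
    @buffer << chunk
    while (newline = @buffer.index("\n"))
      line = @buffer.slice!(0..newline)
      LOCK.synchronize { @out.write(@prefix + line) }
    end
  end

  # Emit whatever is left once the host is done.
  def flush
    return if @buffer.empty?

    LOCK.synchronize { @out.write(@prefix + @buffer + "\n") }
    @buffer.clear
  end
end
```

Buffering until a newline trades a little latency for readable output; a live, unbuffered mode would need a different approach.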

gogsbread · 2021-12-09T19:07:28Z

lib/taste_tester/commands.rb

+        # Poor man thread pool manager: keeping it simple
+        nb_threads_over_max = host_threads.length - TasteTester::Config.parallel_hosts
+        if nb_threads_over_max >= 0
+          host_threads[nb_threads_over_max].join


shouldn't this join be rescued as well?

Good point. I guess this never happened in my testing 'cause I have a gazillion cores :)
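
A sketch of what rescuing the join could look like, keeping the same "poor man's" pool shape as the hunk above. The method names and the `max_threads` parameter (standing in for `TasteTester::Config.parallel_hosts`) are assumptions, not the PR's code. `Thread#join` re-raises an exception that terminated the joined thread, so both the mid-loop join and the final joins go through the same rescue.

```ruby
# Hypothetical helper: join a thread and record its failure instead of raising.
def join_and_record(host, thread, errors)
  thread.join
rescue StandardError => e
  errors[host] = e
end

# Hypothetical runner: at most max_threads un-joined threads at a time.
def run_in_parallel(hosts, max_threads, &block)
  workers = [] # [host, thread] pairs, in start order
  errors = {}
  hosts.each do |host|
    over_max = workers.length - max_threads
    # Same trick as the PR: wait for an older thread before starting a new one.
    join_and_record(*workers[over_max], errors) if over_max >= 0
    thread = Thread.new(host) do |h|
      Thread.current.report_on_exception = false # reported via errors instead
      block.call(h)
    end
    workers << [host, thread]
  end
  workers.each { |host, thread| join_and_record(host, thread, errors) }
  errors
end
```

Joining an already-finished thread again is harmless here: a failed thread re-raises the same exception and the entry in `errors` is simply overwritten.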

jaymzh · 2021-12-09T19:43:45Z

> Oh one thing I'd love feedback on:
>
> How do people think we should deal with failures?
> How do we deal with logs?
> In test, we report the failures and exit with:
>
> * 0 if no failure
> * 1 if all hosts failed and at least one failure was due to ssh connection issue
> * 2 otherwise (meaning at least one host failed for whatever reason and at least one succeeded)
> * 3 if all hosts failed because they were already tested by another user
>
> I kept this logic for the other modes (so untest, run and keeptesting), even though 3 cannot happen.
>
> As for logging, run can get very noisy with all the threads writing at the same time, and not knowing what's what. So I started writing something to prefix the logs with the hostname. But then I went down the rabbit hole, so I'll try to finish this some other night.

I'm glad you asked!

A lot of this is actually the kind of stuff that we (we = TM folks, not FB folks, so past-past-past-life) solved over in https://github.com/txcketmaster-xx/onall - it forks a bunch of ssh's, joins 'em all back, and handles output nicely. It has a few options - live, buffered, prefixed-by-hostname, in-a-directory one-file-per-host, etc. and I think it's a great model for how to manage output and make it usable for folks.

As for return values: 0 - no errors, we can all agree on that. But maybe some more nuance for errors:

* All failed, all SSH failures (no test-setup/run failures)
* All failed, all because in-testing
* All failed, other test-setup/run failures
* All failed, mix of issues
* Some failed, all SSH related
* Some failed, all because in-testing
* Some failed, other test-setup/run failures
* Some failed, mix of issues

Further, I might recommend outputting a JSON structure of host=>failure, perhaps on stderr, or perhaps to a configurable file.
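
A sketch of how the categories and the JSON report suggested above could fit together. The function names, the failure symbols and the JSON keys are assumptions made for illustration; no exit-code numbers are assigned to the eight cases, since none were proposed.

```ruby
require 'json'

# Hypothetical: failures maps hostname => :ssh, :in_testing or :other;
# total_hosts is the number of hosts that were attempted.
def summarize_failures(failures, total_hosts)
  return { 'scope' => 'none', 'kind' => nil, 'hosts' => {} } if failures.empty?

  kinds = failures.values.uniq
  {
    'scope' => failures.length == total_hosts ? 'all' : 'some',
    'kind' => kinds.length == 1 ? kinds.first.to_s : 'mixed',
    'hosts' => failures.transform_values(&:to_s),
  }
end

# Write the host => failure structure as one JSON line, on stderr by default.
def report_failures(failures, total_hosts, io = $stderr)
  io.puts(JSON.generate(summarize_failures(failures, total_hosts)))
end
```

The `scope` and `kind` pair covers the eight combinations listed above, and passing a `File` as `io` would cover the "configurable file" option.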

Babar · 2021-12-09T23:56:29Z

Stupid question, but what's the difference between onall and parallel? They're both written in Perl, and seem to be doing pretty much the same thing. I mean, `parallel --tag ssh {} ...`. Haven't read the entire code yet, but I know parallel's code fairly well, which is why I'm asking.

@Babar

Summary: Basically a re-do of facebook#158. This got rolled back because @Babar wrote 3 different PRs and one of them was bad. This one wasn't. Technically it wasn't good either, as it only did `test`, not really `untest`, `run` or `keeptesting`. This re-do fixes this.

Still the same features:

* Can parametrize the number of threads in the config, or on the command line, default to the number of cores
* Reports hosts depending whether they failed ssh or something else
* Doesn't have any extra dependency, as the thread pool is done in a very simple way

Test Plan: Ran it on my machine. Tried test and untest, both worked.

Reviewers: nreeves, dcavalca, jaymzh

Subscribers:

Tasks:

Tags:

Signed-off-by: Olivier Raginel <[email protected]>


facebook-github-bot added the CLA Signed label Dec 9, 2021

gogsbread reviewed Dec 9, 2021


Babar force-pushed the parallel-testing-v2 branch from 8652279 to 7b35279 Compare April 21, 2024 07:32

Babar added 4 commits April 21, 2024 00:36

Fix the lint and remove debug messages

bf6c476

Prefix runchef IO stream with the hostname

0647233

Parallelize cookbook upload too

76ef58a

Summary: As upload can take a very long time, we can just have a dedicated thread for it

Babar force-pushed the parallel-testing-v2 branch from ed550cf to 76ef58a Compare April 21, 2024 07:36

Olivier Raginel added 3 commits April 21, 2024 00:59

Fix all the lints

8258e7b

Handle the exceptions better, and also handle the threads we join when we run out of threads in our poor man thread pool implementation

acbceed

Enhance reporting when things go wrong, including when hosts takes a while to finish

9f5a2fe

Babar requested a review from gogsbread May 24, 2024 16:36
