
Add multi-armed bandit sampler #155

Merged
6 commits merged into optuna:main on Oct 8, 2024

Conversation

ryota717 (Contributor):

Contributor Agreements

Please read the contributor agreements and if you agree, please click the checkbox below.

  • I agree to the contributor agreements.

Tip

Please follow the Quick TODO list to smoothly merge your PR.

Motivation

#113

Description of the changes

This PR adds a multi-armed bandit sampler, as requested in #113.

TODO List towards PR Merge

Please remove this section if this PR is not an addition of a new package.
Otherwise, please check the following TODO list:

  • Copy ./template/ to create your package
  • Replace <COPYRIGHT HOLDER> in LICENSE of your package with your name
  • Fill out README.md in your package
  • Add import statements of your function or class names to be used in __init__.py
  • Apply the formatter based on the tips in README.md
  • Check whether your module works as intended based on the tips in README.md

Please Check Here

Please tell me if other options (such as an annealing epsilon, or _n_startup_trials like TPESampler, ...) are necessary.
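(For illustration only: an "annealing epsilon" option could look something like the sketch below. The function name, epsilon_init, and the decay schedule are invented for this example and are not part of the PR.)

import math

def annealed_epsilon(epsilon_init: float, trial_number: int) -> float:
    # Decay the exploration probability as more trials complete,
    # e.g. eps_t = eps_0 / sqrt(t + 1); the schedule itself is a free design choice.
    return epsilon_init / math.sqrt(trial_number + 1)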

ryota717 force-pushed the 113-add-bandit-sampler branch 3 times, most recently from b728e4e to 568c8c6 on September 25, 2024, 21:25
states = (TrialState.COMPLETE, TrialState.PRUNED)
trials = study._get_trials(deepcopy=False, states=states, use_cache=True)

rewards_by_choice: defaultdict = defaultdict(float)
ryota717 (Contributor Author):

[QUESTION] This defaultdict treats arms that have never been chosen as having a reward of 0. Should I replace this with some other approach?
(The alternatives I can think of are using _n_startup_trials like TPESampler does, or letting the user set a default reward instead of 0.)

nabenabe0928 (Contributor) commented Sep 30, 2024:

That is a very good point actually:)
I should have given the pseudocode of the $\epsilon$-greedy algorithm, but it usually works as follows:

  1. The control parameters of the algorithm are $\epsilon$ (i.e. the probability of random sampling), n_trials (which we define as $T$ hereafter), and the number of choices $K$.
  2. Try every single arm $\epsilon T / K$ times.
  3. Choose the empirically optimal arm for each dimension, based on the results observed so far (up to the $\epsilon T / K$ initial pulls, or up to the latest trial); see the sketch at the end of this comment.

So usually, we start from random initialization.
However, we do not have to stick strictly to this algorithm; it is totally acceptable not to follow the classic implementation.
Instead, we can do it in the UCB-policy fashion, where each arm is tried once at initialization.
In this way, your issue will be resolved and we can still retain most of your implementation.
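(A rough, self-contained Python sketch of the classic $\epsilon$-greedy procedure in steps 1-3 above; it is purely illustrative and not the code in this PR. pull(arm) stands for a hypothetical reward function.)

def epsilon_greedy_bandit(pull, choices, epsilon, n_trials, maximize=True):
    # Phase 1: try every arm eps * T / K times; Phase 2: always pick the empirical best.
    n_initial_per_arm = int(epsilon * n_trials / len(choices))
    rewards = {arm: 0.0 for arm in choices}
    counts = {arm: 0 for arm in choices}
    best = max if maximize else min
    history = []
    for t in range(n_trials):
        if t < n_initial_per_arm * len(choices):
            arm = choices[t % len(choices)]  # exploration: cycle through all arms
        else:
            # exploitation: empirically best arm given the results observed so far
            arm = best(choices, key=lambda a: rewards[a] / max(counts[a], 1))
        reward = pull(arm)
        rewards[arm] += reward
        counts[arm] += 1
        history.append((arm, reward))
    return history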

ryota717 (Contributor Author):

Thanks for your suggestion!

> Instead, we can do it in the UCB policy fashion where we try each arm once at the initialization.

This looks good to me, and I changed the initialization accordingly in 371556f.
(Random initialization seems difficult in Optuna because of the high flexibility of its objective functions 🙏)
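(For reference, a minimal sketch of the "try every arm once at initialization" behaviour discussed above. The names below are hypothetical; this is not the actual diff in 371556f.)

import random
from collections import defaultdict

def choose_arm(choices, past_results, epsilon, rng, maximize=True):
    # past_results: (arm, observed objective value) pairs from finished trials.
    rewards = defaultdict(float)
    counts = defaultdict(int)
    for arm, value in past_results:
        rewards[arm] += value
        counts[arm] += 1
    untried = [arm for arm in choices if counts[arm] == 0]
    if untried:
        return rng.choice(untried)  # initialization: every arm gets pulled once first
    if rng.random() < epsilon:
        return rng.choice(list(choices))  # epsilon-greedy exploration afterwards
    best = max if maximize else min
    # counts[arm] >= 1 is guaranteed here, so no max(counts[arm], 1) guard is needed.
    return best(choices, key=lambda arm: rewards[arm] / counts[arm])

# Example call with made-up values:
arm = choose_arm(("relu", "tanh"), [("relu", 0.42)], epsilon=0.1, rng=random.Random(0), maximize=False)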

y0z added the new-package and contribution-welcome labels on Sep 26, 2024
y0z (Member) commented Sep 26, 2024:

@nabenabe0928

Could you review this PR? (cf. #113 (comment))

@@ -0,0 +1,20 @@
import optuna
Contributor:

I confirmed that the example works!
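(For context: examples for OptunaHub sampler packages typically look something like the sketch below. The package path "samplers/mab_epsilon_greedy" and the class name MABEpsilonGreedySampler are assumptions for illustration, not the actual names in this PR.)

import optuna
import optunahub

# Load the sampler package from the OptunaHub registry (assumed package path).
module = optunahub.load_module(package="samplers/mab_epsilon_greedy")
sampler = module.MABEpsilonGreedySampler()  # assumed class name

def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_categorical("x", [-1, 0, 1])
    return x**2

study = optuna.create_study(sampler=sampler)
study.optimize(objective, n_trials=20)
print(study.best_trial.params)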

package/samplers/multi_armed_bandit/multi_armed_bandit.py (outdated review comment, resolved)

ryota717 (Contributor Author) left a comment:

@nabenabe0928 Thank you for the suggestion and comments! Could you check my revisions, please?


nabenabe0928 (Contributor):

@ryota717

Hi, thank you for the prompt action!
I will look into the changes asap:)
But would it perhaps be better to rename the directory to something like mab_epsilon_greedy?

ryota717 (Contributor Author) commented Oct 2, 2024:

@nabenabe0928 I renamed the modules and the directory in 0bc6cb7.

nabenabe0928 (Contributor):

@ryota717
Thank you so much! I will check the rest asap 🙇

nabenabe0928 (Contributor) left a comment:

Thank you for the modification and sorry for the late response:(
I added some comments, but you can choose whether you take the suggestions or not!
Feel free to tell me your opinion and then we can promptly merge this PR!

@@ -0,0 +1,25 @@
---
author: Ryota Nishijima
title: MAB Epsilon-Greedy Sampler
Contributor:

[nit]

Suggested change
title: MAB Epsilon-Greedy Sampler
title: A Sampler Based on Epsilon-Greedy Multi-Armed Bandit Algorithm

if study.direction == StudyDirection.MINIMIZE:
return min(
param_distribution.choices,
key=lambda x: rewards_by_choice[x] / max(cnt_by_choice[x], 1),
Contributor:

[nit]
Now, thanks to the last modification (every choice is tried once at initialization, so cnt_by_choice[x] is always at least 1), we no longer need the max(..., 1) guard here!

Suggested change
key=lambda x: rewards_by_choice[x] / max(cnt_by_choice[x], 1),
key=lambda x: rewards_by_choice[x] / cnt_by_choice[x],

else:
return max(
param_distribution.choices,
key=lambda x: rewards_by_choice[x] / max(cnt_by_choice[x], 1),
Contributor:

[nit]
Same here:)

Suggested change
key=lambda x: rewards_by_choice[x] / max(cnt_by_choice[x], 1),
key=lambda x: rewards_by_choice[x] / cnt_by_choice[x],

nabenabe0928 (Contributor) left a comment:

I will merge this PR as is, so if you would like to incorporate my suggestions, please open another PR!

nabenabe0928 merged commit 9e353ea into optuna:main on Oct 8, 2024
4 checks passed