# A Roadmap to Multi-Armed Bandit Algorithms
Recently, I've been reading
[Bandit Algorithms by Tor Lattimore and Csaba Szepesvári](https://tor-lattimore.com/downloads/book/book.pdf)
and discussing it in a reading group.
I thought it might be nice to step back for a moment
and summarize, at a very high level,
some of the major questions one might ask
when thinking about or characterizing a bandit problem!
If you aren't familiar with bandit algorithms yet,
no worries!
I hope to do a nice write-up on them as my group progresses through
the Bandit Algorithms text (it's really good!).
But for now, let me give a short-and-sweet summary:
Bandit algorithms are a type of learning algorithm
designed to act well in uncertain environments.
Specifically, they get their name from the idea
of a slot machine.
It may have many arms
and pulling an arm may give you a reward.
The idea is that through balancing exploration and exploitation,
a bandit algorithm may come to understand the underlying
reward distributions for each arm,
allowing for the learner to exploit
and attain the best cumulative reward possible.
So, why are they called bandits?
Because they take all your money (or rather, slot machines do lol).
![Will you trust this bandit algorithm?](/images/bandit_win.png)
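To make that exploration-exploitation balance concrete, here's a minimal epsilon-greedy sketch. This is my own toy example, not from the book: the Gaussian arms, unit variance, and parameter values are all invented for illustration.

```python
import random

def epsilon_greedy(arm_means, n_rounds=10000, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy learner on Gaussian arms.

    With probability epsilon, explore a random arm; otherwise exploit
    the arm with the best empirical mean so far."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k        # number of pulls per arm
    estimates = [0.0] * k   # running empirical mean reward per arm
    total_reward = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        # incremental update of the empirical mean
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward
```

Even this crude strategy ends up pulling the best arm most of the time, which is the core idea the fancier algorithms refine.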
As we'll see from some of these questions I'm pulling out and highlighting,
bandit algorithms provide a very general formulation
for a lot of learning problems,
so long as we retain certain key assumptions
(such as the assumption that choosing an action today doesn't change the actions available tomorrow).
For more complex learning problems,
reinforcement learning becomes necessary!
Some of the questions we'll consider in this post include:
![Questions to ask for bandit algorithms](/images/bandit_questions.png)
Without further ado, let's jump in!
## Action Space
What does your environment look like
and what type of action space does your learner need
in order to interact with the environment?
### Infinite versus Finite
One of the first distinctions to make is the difference between
infinite and finite action spaces.
Finite action spaces are easy to understand.
They amount to "select one of k arms" each round
and are likely how most people first imagine a bandit algorithm
interacting with its environment.
In contrast, an action space can also be infinite.
You might be thinking "but how?" because it seems almost an impossibility.
In fact, it's very possible, provided the space is suitably constrained.
What I mean by that is: imagine that you have only one action,
but can select it with a range of values from zero to one.
Now your space is infinite! Any number between zero and one is possible
(perhaps it's the amount of force we are applying to the arm).
The key is that there is some constraint here that makes the problem possible,
such as a [Lipschitz Condition](https://en.wikipedia.org/wiki/Lipschitz_continuity)
on the reward structure
so that the optimal action cannot be hidden in an arbitrarily small space.
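Here's a tiny sketch of why a Lipschitz condition helps (my own toy example, not from the text): if the mean-reward function is 1-Lipschitz on $[0,1]$, discretizing the interval into a uniform grid only costs us about one grid step of reward, so a finite-arm algorithm can be run on the grid points. The function `f` below is invented for illustration.

```python
def grid_arms(n_arms):
    """Discretize the continuous action interval [0, 1] into n_arms
    evenly spaced points, turning an infinite action space into a finite one."""
    return [i / (n_arms - 1) for i in range(n_arms)]

# Toy 1-Lipschitz mean-reward function with its optimum at x = 0.37
f = lambda x: 1 - abs(x - 0.37)

arms = grid_arms(101)       # grid spacing of 0.01
best = max(arms, key=f)
# The best grid arm sits within one grid step of the true optimum, so its
# mean reward is within (Lipschitz constant) * (grid spacing) of optimal.
```

Without such a constraint, the optimal action could hide in an arbitrarily small region and no finite amount of exploration would find it.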
### Single Action versus Combinatorial Actions
When your bandit interacts with its environment,
is it only allowed to pull one arm (play one action)
or can it play multiple at the same time?
This is the difference between single and combinatorial action spaces!
If you have an environment in which your learner can pull
multiple arms at the same time (play multiple actions at the same time),
then you have a combinatorial action space.
Imagine, for a moment, that you have a graph and your learner wants
to get from node S to node T.
Each round, the learner has an opportunity to select
the edges it will include in its path from S to T.
Then, each edge in the entire graph will be dropped with a probability
unique to that edge.
If your learner's path is fully intact, then it receives a reward
as it has navigated from S to T.
If not, it receives no reward.
This sub-selection of edges is an example of a combinatorial action space!
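A rough simulation of that edge-drop environment might look like this (the graph, edge names, and drop probabilities are all made up for illustration):

```python
import random

def path_reward(path_edges, drop_prob, rng):
    """Combinatorial action: the learner selects a set of edges forming a
    path from S to T. Each edge fails independently with its own probability;
    reward is 1 only if every edge on the chosen path survives."""
    for edge in path_edges:
        if rng.random() < drop_prob[edge]:
            return 0  # the path is broken somewhere: no reward this round
    return 1          # fully intact path: the learner reached T

rng = random.Random(0)
drop_prob = {("S", "A"): 0.1, ("A", "T"): 0.2, ("S", "T"): 0.5}
# Expected reward of the path S -> A -> T is (1 - 0.1) * (1 - 0.2) = 0.72,
# versus 0.5 for the direct edge S -> T.
wins = sum(path_reward([("S", "A"), ("A", "T")], drop_prob, rng)
           for _ in range(10000))
```

Note that the "action" here is a whole subset of edges, not a single arm, which is exactly what makes the space combinatorial.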
In summary:
![Bandit action space distinctions](/images/bandit_actions.png)
## Problem Structure
Another essential question for approaching a bandit problem
is whether your problem/environment has structure.
What I mean by that can be phrased succinctly:
Will selecting an action reveal any information about
an action that you did not play?
If yes, the problem has structure!
Otherwise, you have an unstructured bandit problem.
An unstructured bandit problem is not hopeless
(in fact, it seems pretty common).
This just means that if I have a two-arm bandit,
I cannot learn anything about arm two
by playing arm one.
I might know that they're both generating rewards by Gaussian distributions,
but that doesn't give me any connection **between** the arms.
In contrast, we could have a two-arm bandit where both arms generate
rewards from Bernoulli distributions that share a single parameter.
Say arm one is parameterized by $\theta$
and arm two is parameterized by $1-\theta$.
There's only one parameter to learn, underlying the whole problem,
and by playing the first arm, I gain information about the other arm!
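Here's a tiny numerical illustration of that shared-parameter structure (the value of $\theta$ and the pull counts are chosen arbitrarily):

```python
import random

def estimate_both_arms(theta=0.3, n_pulls=20000, seed=1):
    """Structured two-arm Bernoulli bandit: arm one pays with probability
    theta, arm two with probability 1 - theta. Playing ONLY arm one still
    pins down arm two, because both means share the single parameter."""
    rng = random.Random(seed)
    successes = sum(rng.random() < theta for _ in range(n_pulls))
    theta_hat = successes / n_pulls    # estimate from arm one alone
    return theta_hat, 1 - theta_hat    # implied mean reward of arm two

theta_hat, arm2_mean = estimate_both_arms()
```

Without ever touching arm two, the learner knows its mean reward to the same accuracy it knows arm one's.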
## External Information
Is there any additional information that your learner can make use of?
This can make a very big difference in the performance
of your algorithm (typically, in a good way).
For example, if you have a context, your learner can use that to its advantage
through contextual bandit algorithms.
Say you have a two-arm bandit, but the arms perform differently
on different days of the week
(but the same on identical days of the week).
If you attack this problem with a plain bandit algorithm,
the rewards will appear highly non-stationary and performance will suffer.
However, if you use the context (the days of the week)
in the action decision process
then your learner should perform much better!
As another example, imagine that your bandit algorithm
is attempting to learn advertisement placement
for a popular search engine.
It could just select an ad from its action set
and show it to a user,
but that would probably perform poorly overall.
Instead, if there were a vector representing some information about
the user using the search engine,
perhaps we could fold that user-context vector
into our bandit algorithm's learning process!
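To make the day-of-week idea concrete, here's a toy sketch that simply keeps a separate set of arm estimates per context. The contexts, reward means, and epsilon-greedy rule are all my invention; a real contextual bandit algorithm (LinUCB, for instance) would share structure across contexts rather than learning each one independently.

```python
import random

def contextual_eps_greedy(means_by_context, n_rounds=7000, epsilon=0.1, seed=0):
    """Contextual epsilon-greedy sketch: one set of empirical arm estimates
    per context (e.g. weekday vs. weekend), so each context's reward
    distribution is learned separately instead of being blurred into one
    non-stationary-looking stream."""
    rng = random.Random(seed)
    contexts = list(means_by_context)
    k = len(means_by_context[contexts[0]])
    counts = {c: [0] * k for c in contexts}
    est = {c: [0.0] * k for c in contexts}
    for t in range(n_rounds):
        c = contexts[t % len(contexts)]   # today's context arrives
        if rng.random() < epsilon:
            arm = rng.randrange(k)                            # explore
        else:
            arm = max(range(k), key=lambda a: est[c][a])      # exploit
        reward = 1.0 if rng.random() < means_by_context[c][arm] else 0.0
        counts[c][arm] += 1
        est[c][arm] += (reward - est[c][arm]) / counts[c][arm]
    return est

# Hypothetical setup: arm 0 is better on weekdays, arm 1 on weekends.
est = contextual_eps_greedy({"weekday": [0.8, 0.2], "weekend": [0.2, 0.8]})
```

A context-blind learner would see each arm's reward bounce between 0.8 and 0.2 and average them out; conditioning on the context recovers a clean stationary problem per context.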
## Reward Mechanism
How does the bandit problem generate rewards?
Are they stochastic and drawn from the same distribution?
Or, perhaps, the environment is adversarial
and rewards are selected to be the worst case
given knowledge of how the bandit algorithm is learning.
### Stochastic
Under the stochastic reward generation setting,
each action yields rewards drawn IID from a fixed underlying distribution:
every time an action is selected, its reward is sampled
from that action's own distribution.
Because these distributions never change, the learner simply needs
to explore each arm until it can properly bound
the shape of its reward-generating distribution.
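One classic way to turn "bound the distribution, then exploit" into an algorithm is optimism in the face of uncertainty, as in the standard UCB1 rule. The sketch below is my own simplified rendering on Bernoulli arms I made up, not an excerpt from the book:

```python
import math
import random

def ucb1(arm_means, n_rounds=5000, seed=0):
    """UCB1-style sketch for stochastic bandits: play the arm with the
    highest optimistic estimate (empirical mean + confidence radius).
    The radius shrinks as an arm is pulled more often."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    est = [0.0] * k
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1     # initialize: play each arm once
        else:
            arm = max(range(k),
                      key=lambda a: est[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        est[arm] += (reward - est[arm]) / counts[arm]
    return counts

counts = ucb1([0.2, 0.5, 0.8])
```

The confidence radius does the exploring automatically: rarely-pulled arms look optimistically good until they've been sampled enough to rule them out.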
### Non-Stationary
Technically, reward distributions can be non-stationary,
but the learning algorithm will pay a cost for this flexibility.
Given some sort of rule for **how** the rewards will be non-stationary,
analysis can factor in how that will affect overall performance.
This can be viewed as a relaxation of the stochastic setting,
with a cost.
### Adversarial
In the adversarial setting,
all assumptions are thrown out.
Rewards are selected adversarially,
aiming to select the worst-case results
that will throw off a learner.
Fortunately, there is still much a learner can do in settings like this
(randomization is key).
I hope to talk about this in greater detail in another post!
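As a taste of why randomization is key, here's a sketch in the spirit of the EXP3 algorithm (exponential weights over arms, sampled randomly, with importance-weighted reward estimates). I've simplified the details from memory, so treat it as illustrative rather than a faithful implementation:

```python
import math
import random

def exp3(reward_fn, k, n_rounds, gamma=0.1, seed=0):
    """EXP3-style sketch for adversarial bandits: sample arms at random
    from an exponential-weights distribution, and build unbiased reward
    estimates by importance weighting the single observed reward."""
    rng = random.Random(seed)
    weights = [1.0] * k
    for t in range(n_rounds):
        total = sum(weights)
        # Mix the weight distribution with uniform exploration (gamma)
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)         # assumed to lie in [0, 1]
        estimate = reward / probs[arm]     # importance-weighted estimate
        weights[arm] *= math.exp(gamma * estimate / k)
    return weights

# A trivial stand-in "adversary" that always pays arm 1 and never arm 0
final = exp3(lambda t, arm: 1.0 if arm == 1 else 0.0, k=2, n_rounds=1000)
```

Because the arm is sampled rather than chosen deterministically, an adversary who knows the algorithm still can't predict which arm will be played, which is exactly the leverage randomization provides.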
To summarize:
![Bandit reward Mechanisms](/images/bandit_rewards.png)
## Learner Feedback
A final question we'll highlight here is the nature of feedback that
a bandit learner receives.
In order for a problem to be a bandit problem,
the learner needs to receive a reward signal at the end of every
round/time step.
But how much feedback does the learner receive?
### Bandit Feedback
Bandit feedback refers to observing reward information
only for the action that the learner selected.
If arm one is played, only the reward from arm one is observed;
the other arms, although they could've been played, were not
so the learner gains no information about them or their rewards.
This is the traditional bandit setting and makes a lot of sense
when the environment is stochastic, for example.
In the combinatorial setting, this is very interesting,
as bandit feedback provides just one signal
for the whole combination of sub-actions.
That is, the learner does not see what each sub-action contributed,
but rather only the cumulative signal.
### Full Feedback
On the other end of the spectrum,
a learner can observe all of the action rewards each step.
This is very useful in an adversarial setting
and allows for the development of some interesting algorithms.
In a stochastic setting, this wouldn't make much sense
because the learner would never need to explore.
It would simply pull the arm with the best observed estimate,
updating its bounds each time step.
Full feedback makes the stochastic problem trivial.
### Partial Feedback / Semi-Bandit Feedback
Think of this feedback scenario as a middle ground between
bandit and full feedback.
It makes the most sense in the combinatorial setting
and refers to the situation in which the cumulative reward signal is observed,
but so are the sub-component signals for each sub-action
selected in the combinatorial space.
This can make a lot of sense if one is modeling the amount of time it takes
to get to work.
If I take a side street today and learn that it takes me about 5 minutes
to do so, then I can apply that sub-information
when I go to drive to work tomorrow.
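That commute story might look like this in code (the road names and travel times are invented for the sketch):

```python
import random

def commute(path_edges, edge_time_means, rng):
    """Semi-bandit feedback: one trip reveals the total commute time AND
    the time spent on each chosen segment, so every segment's estimate
    can be updated from a single round."""
    per_edge = {e: max(0.0, rng.gauss(edge_time_means[e], 1.0))
                for e in path_edges}
    return sum(per_edge.values()), per_edge

rng = random.Random(0)
means = {"highway": 15.0, "side_street": 5.0, "main_road": 12.0}
total, per_edge = commute(["side_street", "main_road"], means, rng)
# per_edge["side_street"] can inform tomorrow's route on its own, even
# in routes that pair it with different roads.
```

Under pure bandit feedback, only `total` would be observed, and untangling which segment was slow would take many more trips.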
### Partial Monitoring
In the setting where feedback is not received every round,
we are outside the realm of bandit algorithms.
However, if you're interested,
I'd encourage looking into [partial monitoring](https://arxiv.org/pdf/1902.00470.pdf).
Here's a summary graphic of these settings:
![Bandit feedback scenarios](/images/bandit_feedback.png)
## Wrapping Up
So, that's my current summary of some key questions to consider when
approaching multi-armed bandits and bandit problems.
How did I do?
Did I miss any key questions?
Let me know!
And if you found this post helpful / insightful,
maybe consider [buying me a coffee ☕️](https://www.buymeacoffee.com/hunterheiden)!
It helps to keep me choogling along and writing more content.