A Roadmap to Multi-Arm Bandit Algorithms

Recently, I’ve been reading Bandit Algorithms by Tor Lattimore and Csaba Szepesvári and discussing it in a reading group. I thought it might be nice to step back for a moment and summarize, at a very high-level some major questions one might ask when thinking about or characterizing a bandit problem!

If you aren’t familiar with bandit algorithms yet, no worries! I hope to do a nice write-up on them as my group progresses through the Bandit Algorithm text (it’s really good!). But for now, let me give a short-and-sweet summary:

Bandit algorithms are a type of learning algorithm that can perform in uncertain environments. Specifically, they get their name from the idea of a slot machine. It may have many arms and pulling an arm may give you a reward. The idea is that through balancing exploration and exploitation, a bandit algorithm may come to understand the underlying reward distributions for each arm, allowing for the learner to exploit and attain the best cumulative reward possible.

So, why are they called bandits? Because they take all your money (or rather, slot machines do lol).

A slot machine with a bandit algorithm

Will you trust this bandit algorithm?

As we’ll see from some of these questions I’m pulling out and highlighting, bandit algorithms provide a very general formulation for a lot of learning problems, so long as we retain certain key assumptions (such as choosing an action today doesn’t change the actions available tomorrow). For more complex learning problems, reinforcement learning becomes necessary!

The questions we’ll consider in this post include:

Questions to ask for bandit algorithms

Questions to ask for bandit algorithms

Without further ado, let’s jump in!

Action Space

What does your environment look like and what type of action space does your learner need in order to interact with the environment?

Infinite versus Finite

One of the first distinctions to make is the difference between infinite and finite action spaces. Finite action spaces are easy to understand. It’s akin to “select one of k arms” each time and is likely how most people first imagine a bandit algorithm interacting with their environment.

In contrast, an infinite action space is also possible. You might be thinking “but how?” because it seems almost an impossibility. But in fact, it is very possible if we have the space constrained. What I mean by that is imagine that you have only one action, but can select it with a range of values from zero to one. Now your space is infinite! Any number between zero and one is possible (perhaps it’s the amount of force we are applying to the arm). The key is that there is some constraint here that makes the problem possible, such as a Lipschitz Condition on the reward structure so that the optimal action cannot be hidden in an arbitrarily small space.

Single action versus combinatorial actions

When your bandit interacts with its environment, is it only allowed to pull one arm (play one action) or can it play multiple at the same time? This is the difference between single and combinatorial action spaces! If you have an environment in which your learner can pull multiple arms at the same time (play multiple actions at the same time), then you have a combinatorial action space.

Imagine, for a moment, that you have a graph and your learner wants to get from node S to node T. Each round, the learner has an opportunity to select the edges it will include in its path from S to T. Then, each edge in the entire graph will be dropped with a probability unique to that edge. If your learner’s path is fully intact, then it receives a reward as it has navigated from S to T. If not, it receives no reward. This sub-selection of edges is an example of a combinatorial action space!

In summary:

Bandit action space distinctions

Bandit action space distinctions

Problem Structure

Another essential question for approaching a bandit problem is whether your problem/environment has structure. What I mean by that can be succinctly phrased as so: Will selecting an action reveal any information about an action that you did not play? If yes, the problem has structure! Otherwise, you have an unstructured bandit problem.

An unstructured bandit problem is not hopeless (in fact, it seems pretty common). This just means that if I have a two-arm bandit, I cannot learn anything about arm two by playing arm one. I might know that they’re both generating rewards by Gaussian distributions, but that doesn’t give me any connection between the arms.

In contrast, we could have a two-arm bandit where both arms generate rewards with a Bernoulli distribution that have a single parameter. What I mean is say arm one is parameterized by $\theta$ and arm two is parameterized by $1-\theta$. There’s only one parameter to learn, underlying the whole problem, and by playing the first arm, I gain information about the other arm!

External Information

Is there any additional information that your learner can make use of? This can make a very big difference in the performance of your algorithm (typically, in a good way).

For example, if you have a context, your learner can use that to its advantage through contextual bandit algorithms. Say you have a two-arm bandit, but the arms perform differently on different days of the week (but the same on identical days of the week). If you attack this problem with a normal bandit algorithm, the rewards will appear highly non-stationary and will harm performance. However, if you use the context (the days of the week) in the action decision process then your learner should perform much better!

Again, imagine that your bandit algorithm is attempting to learn advertisement placement for a popular search engine. It could just select an ad from its action set and show it to a user, but that would probably perform poorly overall. Instead, if there was a vector representing some information about the user using the search engine, perhaps we could fold that user-context vector into our bandit algorithms learning processes!

Reward Mechanism

How does the bandit problem generate rewards? Are they stochastic and drawn from the same distribution? Or, perhaps, the environment is adversarial and rewards are selected to be the worst-case given knowledge about how a bandit algorithm is learning.


Under the stochastic reward generation setting, each action corresponds to an IID reward. That is, each action has an underlying distribution that it samples from when an action is selected. It never changes, therefore the learner simply needs to explore the arm until it can properly bound the shape of the reward generating distribution.


Technically, reward distributions can be non-stationary, but the learning algorithm will pay a cost for this fact. Given some sort of rule for how the rewards will be non-stationary, analysis can factor in how that will affect overall performance. This can be viewed as a relaxation of the stochastic setting, with a cost.


In the adversarial setting, all assumptions are thrown out. Rewards are selected adversarially, aiming to select the worst-case results that will throw off a learner. Fortunately, there is still much a learner can do in settings like this (randomization is key). I hope to talk about this in greater detail in another post!

To summarize:

Bandit reward mechanisms

Bandit reward mechanisms

Learner Feedback

A final question we’ll highlight here is the nature of feedback that a bandit learner receives. In order for a problem to be a bandit problem, the learner needs to receive a reward signal at the end of every round/time step. But how much feedback does the learner receive?

Bandit Feedback

Bandit feedback is a term that refers to only receiving reward information from the action that the learner selected. If arm one is played, only the reward from arm one is observed; the other arms, although they could’ve been played, were not so the learner gains no information about them or their rewards.

This is the traditional bandit setting and makes a lot of sense when the environment is stochastic, for example. In the combinatorial setting, it’s very interesting as a bandit feedback provides just one signal for the combination of sub-actions. That is, the learner does not immediately see what each sub-action provided, but rather the cumulative signal.

Full Feedback

On the other end of the spectrum, a learner can observe all of the action rewards each step. This is very useful in an adversarial setting and allows for the development of some interesting algorithms. In a stochastic setting, this wouldn’t make any sense because then a learner would never need to explore. They would simply pull the best arm per bounded observations, and then would update the bounds each time step. It would make the stochastic problem very trivial.

Partial Feedback / Semi-Bandit Feedback

Think of this feedback scenario as a middle-ground between bandit and full feedback. It makes the most sense in the combinatorial setting and refers to the situation in which the cumulative reward signal is observed, but so are the sub-component signals for each sub-action selected in the combinatorial space.

This can make a lot of sense if one is modeling the amount of time it takes to get to work. If I take a side street today and learn that it takes me about 5 minutes to do so, then I can apply that sub-information when I go to drive to work tomorrow.

Partial Monitoring

In the setting where feedback is not received every round, we are outside of bandit algorithms. However, if you’re interested, I’d encourage looking into partial monitoring.

Here’s a summary graphic of these settings:

Bandit feedback scenarios

Bandit feedback scenarios

Wrapping Up

So, that’s my current summary of some key questions to consider when approaching multi-arm bandits and bandit problems. How did I do? Did I miss any key questions? Let me know!