Multi-arm bandit algorithms tackle a fundamental challenge in machine learning: how to make good decisions when you don’t know which option is best. While reading Bandit Algorithms by Tor Lattimore and Csaba Szepesvári, I found it helpful to step back and identify the key questions that characterize any bandit problem.
Note: This post was originally written in 2020, but bandit algorithms have only become more relevant with the rise of large language models, personalized AI systems, and real-time optimization challenges.
If you’re new to bandit algorithms, here’s the core idea: imagine a slot machine with multiple arms. Each arm has an unknown probability of paying out. You want to maximize your total winnings, but you don’t know which arms are best. This creates the exploration-exploitation dilemma—do you pull the arm that seems best so far (exploit) or try a different arm to gather more information (explore)?
Bandit algorithms solve this by learning the reward distributions for each arm over time, balancing the need to explore new options with exploiting what they’ve learned.
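To make the trade-off concrete, here is a minimal epsilon-greedy sketch in Python. It is an illustrative toy rather than a recommended implementation: the payout probabilities and the value of epsilon are made up, and with probability epsilon the learner explores a random arm while otherwise exploiting its current best estimate.

```python
import random

# Hypothetical payout probabilities for a 3-arm Bernoulli bandit (unknown to the learner).
true_means = [0.3, 0.5, 0.7]
k = len(true_means)
epsilon = 0.1          # chance of exploring a random arm each round
counts = [0] * k       # number of pulls per arm
estimates = [0.0] * k  # empirical mean reward per arm

total_reward = 0.0
for t in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(k)                        # explore
    else:
        arm = max(range(k), key=lambda a: estimates[a])  # exploit current best estimate
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean update
    total_reward += reward

print("estimated means:", [round(e, 3) for e in estimates])
print("average reward:", round(total_reward / 10_000, 3))
```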

Multi-arm bandit algorithms balance exploration and exploitation
Bandit algorithms provide a general framework for many learning problems, as long as today's action doesn't change the environment the learner will face tomorrow. For more complex scenarios where actions have lasting consequences, reinforcement learning becomes necessary.
The questions we’ll consider in this post include:

Questions to ask for bandit algorithms
Quick Navigation
This post covers five key dimensions for characterizing bandit problems:
- Action Space - Finite vs. infinite, single vs. combinatorial
- Problem Structure - Whether actions provide information about other actions
- External Information - Availability of contextual information
- Reward Mechanism - Stochastic, non-stationary, or adversarial
- Learner Feedback - Bandit, full, partial, or semi-bandit feedback
Jump to Real-World Applications to see these dimensions in practice.
Mathematical Notation Guide
Throughout this post, we use the following notation:
- $k$ = number of arms/actions
- $\theta$ = parameter of reward distribution
- $t$ = time step or round
- $X_t$ = reward received at time $t$
- $\mathbb{E}[X]$ = expected value of random variable $X$
Action Space
The action space defines what your algorithm can do and how it interacts with the environment.
Finite versus Infinite
The first distinction is between finite and infinite action spaces. Finite action spaces are straightforward—select one of $k$ discrete options each time. This matches how most people initially think about bandit problems.
Real-world example: A/B testing for website design where you have 5 different homepage layouts to test. Each “arm” represents a different layout, and you must choose which one to show to each visitor.
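For a finite action space like this, upper-confidence-bound methods are a natural fit. Here is a rough UCB1 sketch under the assumption that each layout has a fixed, made-up conversion rate; the algorithm shows the layout with the highest optimistic estimate, so under-explored layouts get a large exploration bonus.

```python
import math
import random

# Hypothetical conversion rates for 5 homepage layouts (unknown to the learner).
true_rates = [0.040, 0.050, 0.055, 0.030, 0.060]
k = len(true_rates)
counts = [0] * k   # visitors shown each layout
means = [0.0] * k  # observed conversion rate per layout

for t in range(1, 50_001):
    if t <= k:
        layout = t - 1  # show each layout once to initialize
    else:
        # UCB1 index: empirical mean plus an exploration bonus that shrinks with data.
        layout = max(range(k),
                     key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
    converted = 1.0 if random.random() < true_rates[layout] else 0.0
    counts[layout] += 1
    means[layout] += (converted - means[layout]) / counts[layout]

print("visitors per layout:", counts)  # most traffic should flow to the best layout
```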
Infinite action spaces arise when actions have continuous parameters. Consider a single action that can be selected with any value between zero and one: the action space is suddenly infinite. The challenge is making the problem tractable, typically through assumptions like Lipschitz continuity, which ensures that similar actions yield similar rewards.
Real-world example: Bidding in online advertising auctions where you can bid any amount between 0.01 and 10.00. The bid amount is continuous, but performance changes smoothly with bid price.
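If the reward really does vary smoothly with the bid, one simple way to make the infinite space tractable is to discretize it into a grid of candidate bids and run any finite-arm algorithm over the grid. The sketch below reuses the epsilon-greedy loop from earlier; the profit curve is invented purely for illustration.

```python
import random

# Discretize the bid range [0.01, 10.00] into 50 candidate bids.
grid = [round(0.01 + i * (10.00 - 0.01) / 49, 2) for i in range(50)]

def simulated_profit(bid):
    # Hypothetical smooth reward: profit peaks near a bid of 3.0, plus noise.
    return max(0.0, 2.0 - 0.5 * abs(bid - 3.0)) + random.gauss(0, 0.3)

counts = [0] * len(grid)
means = [0.0] * len(grid)
for t in range(20_000):
    if random.random() < 0.1:
        i = random.randrange(len(grid))                        # explore a random bid
    else:
        i = max(range(len(grid)), key=lambda j: means[j])      # exploit the best bid so far
    profit = simulated_profit(grid[i])
    counts[i] += 1
    means[i] += (profit - means[i]) / counts[i]

print("best bid found:", grid[max(range(len(grid)), key=lambda j: means[j])])
```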
Single action versus combinatorial actions
The second distinction is whether your algorithm selects one action or multiple actions simultaneously. Combinatorial action spaces allow selecting multiple arms at once, which introduces interesting challenges.
Real-world example: Netflix recommendation system selecting which 10 movies to display on your homepage from thousands of options. The algorithm must choose a combination of movies that work well together to maximize engagement.
Consider a graph where your algorithm wants to navigate from node S to node T. Each round, it selects the edges to include in its path. Then each edge in the graph is randomly dropped with an edge-specific probability, and the algorithm receives a reward only if its selected path remains intact. The set of possible edge selections forms a combinatorial action space.
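A tiny simulation of that environment, using a hypothetical four-node graph with two candidate S-to-T paths, shows what a combinatorial action looks like: the action is a set of edges, and the reward is 1 only if every selected edge survives the round.

```python
import random

# Hypothetical edges with independent drop probabilities.
drop_prob = {("S", "A"): 0.10, ("A", "T"): 0.20,
             ("S", "B"): 0.30, ("B", "T"): 0.05}

# Candidate combinatorial actions: complete S->T paths (sets of edges).
paths = [
    [("S", "A"), ("A", "T")],
    [("S", "B"), ("B", "T")],
]

def play(path):
    # Reward 1 only if every edge in the chosen path survives this round.
    return 1.0 if all(random.random() > drop_prob[e] for e in path) else 0.0

# Monte Carlo estimate of each path's success rate.
for path in paths:
    wins = sum(play(path) for _ in range(10_000))
    print([f"{u}->{v}" for u, v in path], "success rate ~", round(wins / 10_000, 3))
```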
Summary
Action spaces can be characterized along two key dimensions:
- Size: Finite (select from $k$ discrete options) vs. infinite (continuous parameter selection)
- Complexity: Single action vs. combinatorial action selection
Understanding your action space is crucial as it determines which algorithmic approaches are feasible and what theoretical guarantees you can expect.

Bandit action space distinctions
Problem Structure
Problem structure determines whether selecting one action provides information about other actions. The key question is: does playing arm A tell me anything useful about arm B?
An unstructured bandit problem means actions are independent. Playing arm one reveals nothing about arm two, even if both follow similar distributions. Each arm must be learned separately.
Real-world example: Testing two completely different marketing strategies (email vs. social media campaigns) for different customer segments. Email campaign performance provides no information about social media effectiveness.
Structured problems share information between arms. Consider a two-arm bandit where arm one has parameter $\theta$ and arm two has parameter $1-\theta$. There’s only one underlying parameter, so playing arm one immediately tells you about arm two’s performance.
Real-world example: Testing different discount percentages (10% vs. 15% off). If you learn that 10% discounts perform well, this also tells you something about whether 15% discounts are worth the additional cost, since both offers depend on the same underlying customer price sensitivity.
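Returning to the two-arm $\theta$ / $1-\theta$ example, here is a toy sketch showing why structure helps: every pull of either arm updates a single pooled estimate of $\theta$, so neither arm has to be learned separately. The value of $\theta$ below is arbitrary.

```python
import random

# Unknown shared parameter: arm 0 pays Bernoulli(theta), arm 1 pays Bernoulli(1 - theta).
true_theta = 0.7

evidence, pulls = 0, 0
for _ in range(1_000):
    arm = random.randrange(2)  # either arm is equally informative about theta
    p = true_theta if arm == 0 else 1 - true_theta
    reward = 1 if random.random() < p else 0
    # A success on arm 1 is evidence of a *small* theta, so flip it before pooling.
    evidence += reward if arm == 0 else 1 - reward
    pulls += 1

theta_hat = evidence / pulls
print("pooled estimate of theta:", round(theta_hat, 3))
print("implied arm means:", round(theta_hat, 3), "and", round(1 - theta_hat, 3))
```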
Summary
Problem structure refers to whether actions provide information about other actions:
- Structured: Actions reveal information about unplayed actions (shared parameters, smoothness assumptions)
- Unstructured: Each action provides information only about itself
Structured problems often enable more sample-efficient learning through information sharing across actions.
External Information
Information beyond the reward signal can significantly improve algorithm performance. Context helps the algorithm make better decisions by capturing when and why certain actions work better.
Contextual Information
Contextual bandit algorithms leverage environmental information that affects reward distributions. For example, if your two-arm bandit performs differently on weekdays versus weekends, a standard algorithm will struggle with the apparent non-stationarity. However, using day-of-week as context allows the algorithm to learn these patterns explicitly.
Real-world example: An e-commerce site testing different homepage layouts. Performance might vary dramatically based on whether it’s a weekday (people browsing quickly during lunch) vs. weekend (people have more time to explore). Using day-of-week as context helps the algorithm learn these patterns.
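A minimal way to use context like this is to keep separate reward estimates for every (context, action) pair. The sketch below assumes a weekday/weekend context and two hypothetical layouts whose performance flips between the contexts, which is exactly where a context-blind algorithm would struggle.

```python
import random

# Hypothetical conversion rates: which layout wins depends on the context.
true_rates = {"weekday": [0.06, 0.03], "weekend": [0.02, 0.07]}
contexts = list(true_rates)

counts = {c: [0, 0] for c in contexts}     # per (context, layout) pull counts
means = {c: [0.0, 0.0] for c in contexts}  # per (context, layout) estimates

for t in range(20_000):
    context = random.choice(contexts)  # the environment reveals today's context
    if random.random() < 0.1:
        layout = random.randrange(2)   # explore
    else:
        layout = max(range(2), key=lambda a: means[context][a])  # exploit within context
    reward = 1.0 if random.random() < true_rates[context][layout] else 0.0
    counts[context][layout] += 1
    means[context][layout] += (reward - means[context][layout]) / counts[context][layout]

for c in contexts:
    print(c, "best layout:", max(range(2), key=lambda a: means[c][a]))
```

With many or continuous contexts, a separate table per context stops scaling; that is where methods like LinUCB, which share a model across contexts, come in.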
User Features
User characteristics provide valuable context for personalization. Rather than treating all users identically, algorithms can incorporate user feature vectors to make personalized decisions. This is particularly useful in recommendation systems and targeted advertising.
Real-world example: Spotify’s music recommendations use contextual information like time of day (upbeat music in morning, relaxing music in evening), device type (phone vs. desktop), and listening history to decide which songs to suggest.
Summary
External information can significantly improve algorithm performance:
- Contextual information: Environmental features that influence reward distributions
- User features: Characteristics that help personalize action selection
The key is identifying information that’s predictive of rewards and incorporating it into the decision-making process.
Reward Mechanism
The reward mechanism determines how the environment generates feedback. This fundamentally shapes which algorithms are appropriate and what performance you can expect.
Stochastic
In stochastic settings, each action has a fixed underlying reward distribution. The algorithm’s job is to identify these distributions through exploration, then exploit the best options. This is the most common and well-studied bandit setting.
Real-world example: Clinical drug trials where each treatment has a fixed probability of success. The underlying effectiveness doesn’t change, but you observe random variation in patient outcomes.
Non-Stationary
Reward distributions can change over time, but this comes with a performance cost. The algorithm must balance learning current distributions with adapting to changes. The rate and nature of change determines how much performance degrades compared to the stationary case.
Real-world example: Stock trading algorithms where market conditions change over time. A strategy that worked well in a bull market might perform poorly during a recession.
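One common practical adaptation is to replace the $1/n$ sample average with a constant step size, so old observations fade exponentially. The step size below is an arbitrary choice for illustration; larger values adapt faster but give noisier estimates. A sliding window over the most recent rewards achieves a similar effect.

```python
alpha = 0.1  # forgetting factor / constant step size (illustrative)

def update(estimate, reward):
    # Exponential recency weighting: recent rewards count more than old ones.
    return estimate + alpha * (reward - estimate)

estimate = 0.5
for reward in [1, 1, 1, 1, 0, 0, 0, 0]:  # an arm whose payout degrades over time
    estimate = update(estimate, reward)
    print(round(estimate, 3))
```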
Adversarial
In adversarial settings, rewards are chosen to minimize the algorithm’s performance. This worst-case scenario requires fundamentally different approaches, typically involving randomization to prevent the adversary from exploiting predictable behavior.
Real-world example: Cybersecurity applications where an attacker observes your defense strategies and adapts their attacks to exploit whatever defense you’re currently using most.
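The classic algorithm here is EXP3: keep a weight per arm, sample an arm at random in proportion to the weights, and boost only the played arm using an importance-weighted reward estimate, so the adversary can never fully predict the next move. A compact sketch follows; the exploration parameter and the toy switching adversary are made up for illustration.

```python
import math
import random

k = 3
gamma = 0.1            # exploration rate (illustrative, not tuned)
weights = [1.0] * k

def probabilities(weights):
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]

def adversarial_reward(t, arm):
    # Toy adversary: the rewarding arm switches every 500 rounds.
    return 1.0 if arm == (t // 500) % k else 0.0

for t in range(3_000):
    probs = probabilities(weights)
    arm = random.choices(range(k), weights=probs)[0]  # randomization is essential here
    reward = adversarial_reward(t, arm)               # rewards must lie in [0, 1]
    estimate = reward / probs[arm]                    # importance-weighted reward estimate
    weights[arm] *= math.exp(gamma * estimate / k)
    m = max(weights)
    weights = [w / m for w in weights]                # rescale for numerical stability

print("final play probabilities:", [round(p, 3) for p in probabilities(weights)])
```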
Summary
Reward mechanisms determine how the environment generates feedback:
- Stochastic: Fixed underlying distributions with random variation
- Non-stationary: Reward distributions that change over time
- Adversarial: Worst-case rewards selected to minimize algorithm performance
The reward mechanism fundamentally affects which algorithms are appropriate and what performance guarantees are possible.

Bandit reward mechanisms
Learner Feedback
The final dimension is how much information the algorithm receives after each decision. By definition, bandit problems provide some reward signal each round, but the amount of information varies significantly.
Bandit Feedback
Bandit feedback means observing rewards only for the selected action. If you choose arm one, you learn nothing about what would have happened with arm two. This is the traditional bandit setting and creates the fundamental exploration-exploitation trade-off.
In combinatorial settings, bandit feedback provides one signal for the entire combination of sub-actions. The algorithm doesn’t see individual sub-action rewards, only the cumulative result.
Real-world example: A/B testing where you show version A to one user and only see whether they converted or not. You learn nothing about how version B would have performed for that same user.
Full Feedback
Full feedback means observing rewards for all possible actions each round. This setting is mostly studied in the adversarial case (known as prediction with expert advice), where seeing every action's reward enables effective algorithms such as Hedge. In stochastic settings, full feedback would make the problem essentially trivial: estimate each action's mean from its observed rewards and select the best one, with no exploration cost to pay.
Real-world example: Historical stock market analysis where you can see the performance of all possible investment strategies for any given day, since all stock prices are publicly recorded.
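With full feedback, the classic approach is multiplicative weights (Hedge): after each round you see every action's reward and upweight each action exponentially in its reward. The reward vectors below come from a made-up environment; only the update rule matters.

```python
import math
import random

k = 4
eta = 0.1              # learning rate (illustrative)
weights = [1.0] * k
total_reward = 0.0

for t in range(2_000):
    total = sum(weights)
    probs = [w / total for w in weights]
    action = random.choices(range(k), weights=probs)[0]  # the action actually played
    # Full feedback: rewards of *all* actions are revealed, not just the one played.
    rewards = [random.random() * (a + 1) / k for a in range(k)]  # toy environment
    total_reward += rewards[action]
    weights = [w * math.exp(eta * r) for w, r in zip(weights, rewards)]
    m = max(weights)
    weights = [w / m for w in weights]  # rescale for numerical stability

total = sum(weights)
print("final probabilities:", [round(w / total, 3) for w in weights])
# The distribution concentrates on the action with the highest average reward.
```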
Partial Feedback / Semi-Bandit Feedback
Semi-bandit feedback provides a middle ground between bandit and full feedback. It’s most relevant for combinatorial actions, where you observe both the overall reward and the individual contributions of each sub-action in your selection.
Consider modeling commute time: if you take a side street and learn it takes 5 minutes to traverse, you can use this sub-component information when planning future routes.
Real-world example: Online advertising where you run multiple ads simultaneously and can see both the overall campaign performance and the individual performance of each ad creative, allowing you to learn about component effectiveness.
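Here is what that looks like for the commute example: the learner keeps a per-segment travel-time estimate and, thanks to semi-bandit feedback, updates every segment on the route it actually drove. The road segments and times are invented. Under pure bandit feedback only the total commute time would be observed, and the per-segment estimates could not be separated out so directly.

```python
import random

# Hypothetical road segments with true average travel times in minutes.
true_times = {"main_st": 8.0, "side_st": 5.0, "highway": 6.0, "exit_ramp": 3.0}
routes = {"surface": ["main_st", "side_st"], "freeway": ["highway", "exit_ramp"]}

counts = {seg: 0 for seg in true_times}
estimates = {seg: 0.0 for seg in true_times}  # 0 is optimistic when minimizing time

for day in range(2_000):
    if random.random() < 0.1:
        route = random.choice(list(routes))  # occasionally explore
    else:
        route = min(routes, key=lambda r: sum(estimates[s] for s in routes[r]))
    # Semi-bandit feedback: we observe the time of *each* segment on the chosen route.
    for seg in routes[route]:
        observed = true_times[seg] + random.gauss(0, 1.0)
        counts[seg] += 1
        estimates[seg] += (observed - estimates[seg]) / counts[seg]

print({seg: round(est, 1) for seg, est in estimates.items()})
print("preferred route:", min(routes, key=lambda r: sum(estimates[s] for s in routes[r])))
```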
Partial Monitoring
When the feedback received each round doesn't directly reveal the reward, we move outside traditional bandit algorithms. For those interested in this setting, partial monitoring provides the theoretical framework.
Summary
Feedback mechanisms determine how much information the algorithm receives:
- Bandit feedback: Only observe rewards for selected actions
- Full feedback: Observe rewards for all possible actions
- Partial/Semi-bandit feedback: Observe rewards for sub-components of selected actions
- Partial monitoring: Incomplete feedback (outside traditional bandit scope)
The feedback type significantly impacts learning efficiency and appropriate algorithmic choices.

Bandit feedback scenarios
Putting It All Together: Real-World Applications
Let’s see how these dimensions apply to some common real-world scenarios:
E-commerce Product Recommendations
- Action Space: Combinatorial (selecting multiple products to recommend)
- Problem Structure: Structured (similar products have correlated performance)
- External Information: Rich contextual data (user history, time of day, device)
- Reward Mechanism: Stochastic (user preferences are relatively stable)
- Feedback: Bandit (only see engagement with recommended products)
Online Advertising Auctions
- Action Space: Infinite (continuous bid amounts)
- Problem Structure: Structured (bid performance varies smoothly)
- External Information: User demographics, search query, ad relevance scores
- Reward Mechanism: Non-stationary (market conditions change)
- Feedback: Bandit (only observe results for your winning bids)
Clinical Trial Optimization
- Action Space: Finite (different treatment protocols)
- Problem Structure: May be structured (similar treatments have related efficacy)
- External Information: Patient characteristics, medical history
- Reward Mechanism: Stochastic (treatment effects are inherently variable)
- Feedback: Bandit (only observe outcomes for assigned treatments)
Content Personalization
- Action Space: Combinatorial (selecting articles/videos for homepage)
- Problem Structure: Structured (content categories have shared patterns)
- External Information: User profile, browsing history, current trends
- Reward Mechanism: Non-stationary (user interests evolve over time)
- Feedback: Semi-bandit (can see performance of individual content pieces)
LLM Prompt Optimization
- Action Space: Combinatorial (selecting prompt templates, examples, and parameters)
- Problem Structure: Structured (similar prompts have correlated performance)
- External Information: Task type, user intent, model capabilities
- Reward Mechanism: Stochastic (model outputs have inherent variability)
- Feedback: Bandit (only observe quality of generated responses)
A Decision Framework
When faced with a new bandit problem, consider this systematic approach:
1. Start with Action Space:
├─ Finite actions? → Consider UCB, Thompson Sampling
└─ Infinite actions? → Consider Lipschitz bandits, GP-UCB
2. Check for Structure:
├─ Structured? → Leverage linear bandits, kernel methods
└─ Unstructured? → Treat arms independently
3. Available Context?
├─ Yes → Use contextual bandits (LinUCB, contextual Thompson Sampling)
└─ No → Standard multi-arm bandit algorithms
4. Reward Nature:
├─ Stochastic? → UCB, Thompson Sampling work well
├─ Non-stationary? → Use windowing or forgetting factors
└─ Adversarial? → EXP3 (bandit feedback), Hedge (full feedback)
5. Feedback Type:
├─ Bandit feedback → Standard bandit algorithms
├─ Full feedback → Multiplicative weights, online gradient descent
└─ Semi-bandit → Exploit structure in combinatorial problems
This framework helps systematically identify the most appropriate algorithmic approaches for your specific problem.
Summary
This framework provides five key dimensions for characterizing multi-arm bandit problems:
- Action Space: Finite vs. infinite, single vs. combinatorial
- Problem Structure: Whether actions provide information about other actions
- External Information: Availability of contextual information
- Reward Mechanism: Stochastic, non-stationary, or adversarial
- Learner Feedback: Bandit, full, partial, or semi-bandit feedback
Understanding these dimensions helps identify appropriate algorithms and set realistic performance expectations. Structured problems enable more efficient learning, while adversarial settings require fundamentally different approaches.
Further Reading
- Bandit Algorithms (Lattimore & Szepesvári) - The detailed textbook this framework is based on
- Introduction to Multi-Armed Bandits - Excellent survey paper by Aleksandrs Slivkins