Do multi-armed and contextual bandits really adapt to new environments or regimes?

Kind of. But not really though.

This post assumes some familiarity with bandit algorithms.

What do bandits actually adapt to?

It is common to hear that contextual or multi-armed bandits (MABs) adapt to things like ‘changes in the business environment’ or ‘changes in customer behaviour’. However, ordinary contextual bandits or MABs only adapt to the new data they observe; they are not ‘really’ adapting to a change in the environment. In fact, these methods typically assume stationarity, i.e. that the environment or underlying phenomenon we are trying to understand doesn’t change over time! The difference is subtle, but it has a big impact on the accuracy and reliability of our bandit.

Don’t get me wrong, bandits are great, but we ought to understand how they work and what their limitations are.

A simple MAB in a stationary environment

Consider a scenario where we use a simple MAB to determine which of 4 different options or treatments has the best effect on conversion rate, whilst not spending too many resources on the options that do not appear promising (a classic explore-exploit setup). Below is an animation of the evolution of a simple 4-armed bandit using the BernTS(K, α, β) algorithm from page 15 of this excellent tutorial on Thompson sampling.

An evolution of a bandit

The solid curves represent the estimated reward distribution of each arm, and the dashed vertical lines represent the true conversion probabilities. We observe that the bandit quickly learns which arms are the most effective. We obtain sharp peaks near the true values for the superior arms, but large uncertainties for the arms that are not believed to be effective and thus were not selected very often (it is generally known that MABs are biased).
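For concreteness, here is a minimal sketch of Bernoulli Thompson sampling with Beta priors, in the spirit of the BernTS(K, α, β) routine from the tutorial. The true conversion probabilities below are made-up values for simulation only; they are not the ones used in the animation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true conversion probabilities for the 4 arms (simulation only).
true_probs = np.array([0.04, 0.06, 0.05, 0.08])
K = len(true_probs)

# Beta(1, 1) priors: alpha counts successes, beta counts failures.
alpha = np.ones(K)
beta = np.ones(K)

for t in range(5000):
    # Thompson sampling: draw one sample per arm from its posterior,
    # then play the arm with the largest sampled conversion probability.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))

    # Observe a Bernoulli reward and update that arm's posterior counts.
    reward = rng.random() < true_probs[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

posterior_means = alpha / (alpha + beta)
print(np.round(posterior_means, 3))
```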

What happens in a non-stationary environment?

Let’s create a completely artificial example where there is a simple regime change, or an instance of concept drift (as ML/AI people prefer to call it). Consider the same scenario as the example above, but where the effect of each arm on conversion probability changes at around round 2000. Observe what happens to the evolutions of the estimated distributions.

An evolution of a bandit with a regime change

The bandit quickly starts to figure out that previously superior arms are no longer as effective. By round 4000, the bandit correctly believes that arm 2 is the most effective and will continue to select it with increasingly higher probability. However, we must note:

  • That the reward distributions for arms 1 and 4 are unlikely to be estimated accurately any time soon

because…

  • While the means of the reward distributions for arms 1 and 4 start shifting to the left, the variances actually get smaller, because our “confidence” in the estimated conversion probability only goes up with each pull of an arm. Because our assumption that the underlying behaviour is stationary is violated, we get incorrect and very misleading estimates of the reward probabilities (a small numerical illustration follows this list).
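To see why the uncertainty keeps shrinking even while the estimate is wrong: a Beta(α, β) posterior’s variance depends only on the accumulated success/failure counts, so it decreases as pulls accumulate regardless of whether the underlying probability has moved. A rough illustration with made-up counts (the specific Beta parameters below are assumptions, not values from the animation):

```python
from scipy.stats import beta

# Hypothetical posterior for a frequently pulled arm earlier and later in the
# experiment. The variance only ever shrinks as pulls accumulate, even if the
# true conversion probability changed partway through.
before = beta(80, 920)     # ~1000 pulls, posterior mean ~0.08
after = beta(120, 2880)    # ~3000 pulls, posterior mean ~0.04

print(before.mean(), before.std())  # ~0.08, std ~0.0086
print(after.mean(), after.std())    # ~0.04, std ~0.0036 (tighter, yet possibly wrong)
```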

Nothing is easy in non-stationary environments

In reasonably well-behaved non-stationary environments, the estimates a bandit produces will not be great in terms of MSE, nor will they ever be unbiased. But the bandit will eventually adapt, in the sense that it will keep shifting the reward distributions of the previously best-performing arms until another arm starts to outperform them. In this setting, performance won’t be optimal and you can throw away all the theorems about optimally minimising regret. Additionally, as N gets larger and larger in an ever-changing environment, it becomes harder and harder for the bandit to adapt, because it only grows more confident in its almost certainly wrong estimates. If the effectiveness of the arms changes often and unpredictably, bandits may well perform worse than random selection.
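As a back-of-the-envelope illustration of why more data makes adaptation slower: suppose, hypothetically, that the most-pulled arm has accumulated a Beta(1500, 500) posterior (mean 0.75 after roughly 2000 pulls) and the true conversion probability then drops to 0.25. Even if the bandit keeps pulling that arm every round, in expectation it takes about another 2000 pulls before the posterior mean falls below 0.5; in practice it is slower still, because the arm is pulled less often once it starts losing.

```python
# Expected posterior mean of the previously best arm after n further pulls,
# assuming (hypothetically) it starts at Beta(1500, 500) and the true
# conversion probability has dropped to 0.25.
def expected_posterior_mean(n, a0=1500.0, b0=500.0, new_p=0.25):
    return (a0 + new_p * n) / (a0 + b0 + n)

for n in (0, 500, 1000, 2000, 4000):
    print(n, round(expected_posterior_mean(n), 3))
# 0 -> 0.75, 2000 -> 0.5, 4000 -> ~0.417: roughly as many new rounds as old
# ones are needed just to cancel out the stale evidence.
```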

What should we do if we are employing bandits?

Real-world problems involving people, businesses, or other agents, where feedback loops or other complicated interactions can take place, are unlikely to be truly stable, stationary environments, so care must be taken to avoid costly mistakes. If you use or are planning to use contextual or multi-armed bandits, you can try the following:

  • Employ covariate and concept drift detection methods to detect changes both in the input data and in the relationship between the input data and the response/outcomes.
  • Identify whether old training data has become stale or is no longer useful. If so, remove or down-weight it in bandit training (a minimal discounted-posterior sketch follows this list), or attempt to find features that may reasonably explain the change in the relationship.
  • Employ more sophisticated methods such as Efficient Contextual Bandits in Non-stationary Worlds.
  • Hope your environment is relatively stationary and that the effectiveness of each arm stays roughly the same for the foreseeable future.
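As one concrete example of down-weighting stale data (the second bullet above), here is a minimal sketch of a discounted Bernoulli Thompson sampling loop, in which the posterior counts are geometrically decayed towards the prior each round so that old evidence gradually washes out. The true probabilities, the change point, and the decay factor gamma are all assumptions for illustration, not values from the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conversion probabilities before and after a regime change.
probs_before = np.array([0.08, 0.04, 0.05, 0.07])
probs_after = np.array([0.03, 0.09, 0.05, 0.04])
change_at = 2000

K = 4
gamma = 0.99   # per-round decay factor; a tuning assumption, not from the post
prior = 1.0
alpha = np.full(K, prior)
beta = np.full(K, prior)

for t in range(4000):
    # Decay the counts towards the prior so old evidence is gradually forgotten,
    # keeping the effective sample size bounded at roughly 1 / (1 - gamma).
    alpha = gamma * alpha + (1 - gamma) * prior
    beta = gamma * beta + (1 - gamma) * prior

    # Standard Thompson sampling step on the decayed posteriors.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))

    p = probs_before[arm] if t < change_at else probs_after[arm]
    reward = rng.random() < p
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(np.round(alpha / (alpha + beta), 3))  # posterior means track the new regime
```

The decay caps how confident the bandit can ever become, and therefore how slowly it can un-learn; the trade-off is extra variance (more exploration) when the environment happens to be stationary after all.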