Week 8 Bandits

General offline contextual bandits paper: Learning from Logged Implicit Exploration Data

By: Alex Strehl, John Langford, Sham Kakade

Context: suggesting an ad based on a search query:

  1. World draws an input vector and a reward vector (over all actions). Announces input vector.
  2. Algorithm chooses an action given input vector and historical information.
  3. World announces reward of this action. Rewards from other actions are not revealed.
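
To make the protocol concrete, here is a minimal sketch of the interaction loop in Python. The reward model and the uniform-random action choice are placeholders for illustration, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rounds, n_actions, n_features = 1000, 2, 4
total_reward = 0.0

for t in range(n_rounds):
    # 1. World draws an input vector and a reward vector, announces only the input.
    x = rng.normal(size=n_features)
    reward_vector = rng.uniform(size=n_actions)   # never fully revealed to the learner

    # 2. Algorithm chooses an action from x and history (placeholder: uniform random).
    a = int(rng.integers(n_actions))

    # 3. World reveals the reward of the chosen action only.
    r = reward_vector[a]
    total_reward += r

print("average reward per round:", total_reward / n_rounds)
```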

Goal: maximize the sum of rewards over the rounds of interaction. (We can then compare our algorithm to the rat's behavior.)

Problem: we have previously recorded events from interaction with an unknown, uncontrolled rat policy.

Abstract:

We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned.

The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged.

We empirically verify our solution on a reasonably sized set of real-world data obtained from an online advertising company.

“Approaches that fail”:

There are several approaches that may appear to solve this problem, but turn out to be inadequate:

  1. Supervised learning. We could learn a regressor $s : X \times A \to [0, 1]$ which is trained to predict the reward on observed events, conditioned on the action $a$ and other information $x$. From this regressor, a policy is derived according to $h(x) = \operatorname{argmax}_{a \in A} s(x, a)$. A flaw of this approach is that the argmax may extend over a set of choices not included in the training data, and hence may not generalize at all (or only poorly). This can be verified by considering some extreme cases. Suppose that there are two actions $a$ and $b$ with action $a$ occurring $10^6$ times and action $b$ occurring $10^2$ times. Since action $b$ occurs only a $10^{-4}$ fraction of the time, a learning algorithm forced to trade off between predicting the expected values of $r_a$ and $r_b$ overwhelmingly prefers to estimate $r_a$ well at the expense of accurate estimation for $r_b$. And yet, in application, action $b$ may be chosen by the argmax. This problem is only worse when action $b$ occurs zero times, as might commonly occur in exploration situations. (A small sketch of this baseline appears after this list.)

  2. Bandit approaches. In the standard setting these approaches suffer from the curse of dimensionality, because they must be applied conditioned on X. In particular, applying them requires data linear in X × A, which is extraordinarily wasteful. In essence, this is a failure to take advantage of generalization.

  3. Contextual Bandits. Existing approaches to contextual bandits such as EXP4 (Auer et al., 2002) or Epoch Greedy (Langford & Zhang, 2008), require either interaction to gather data or require knowledge of the probability the logging policy chose the action a. In our case the probability is unknown, and it may in fact always be 1.

  4. Exploration Scavenging. It is possible to recover exploration information from action visitation frequency when a logging policy chooses actions independent of the input x (but possibly dependent on history) (Langford et al., 2008). This doesn’t fit our setting, where the logging policy is surely dependent on the query.
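
For later comparison, here is a minimal sketch of the naive supervised baseline from point 1: regress the reward on $(x, a)$ and take the argmax. The ridge regressor and the context-plus-one-hot-action featurization are illustrative assumptions, not the paper's recipe; actions are assumed to be integer-coded 0..n_actions-1.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_naive_regressor(X, actions, rewards, n_actions):
    """Fit s(x, a) by appending a one-hot encoding of the action to the context features."""
    X_xa = np.hstack([X, np.eye(n_actions)[actions]])
    model = Ridge(alpha=1.0)
    model.fit(X_xa, rewards)
    return model

def naive_policy(model, x, n_actions):
    """h(x) = argmax_a s(x, a); the argmax may land on actions barely seen in the log."""
    rows = np.vstack([np.concatenate([x, np.eye(n_actions)[a]]) for a in range(n_actions)])
    return int(np.argmax(model.predict(rows)))
```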

Adapting to this problem

Concern #1 is not really applicable here, because there are always only two actions. This is the simplest method to try.

Concern #2 isn’t such a big deal here (with only two actions and a small, discretized context, the data requirement is modest), so we should look into those methods.

Similarly, for concern #3: do we actually know the probability with which the rat’s logging policy chose each action? If so, we could use those methods as well.

Finally, for concern #4, perhaps the rats do “choose actions independent of the input x but possibly dependent on history”.

Thus we should try all of these other methods, as well.

Walk-through of this paper

What is the context here? We can perhaps use these variables as part of the context:

  • reaction time
  • time between trials
  • how far into a session we are / how many trials have happened (if rat is getting tired)
  • what kind of trial it is?

(Note that we should discretize these variables)
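
For example, a possible discretization with numpy bins; all the bin edges below are placeholders and would need to be chosen from the actual data. Trial type is already categorical and could be passed through as-is.

```python
import numpy as np

def discretize_context(reaction_time, inter_trial_interval, trial_index):
    """Map continuous session variables to a small, hashable context tuple.
    All bin edges are placeholder values, not taken from the data or the paper."""
    rt_bin = int(np.digitize(reaction_time, bins=[0.2, 0.5, 1.0]))           # seconds
    iti_bin = int(np.digitize(inter_trial_interval, bins=[2.0, 5.0, 10.0]))  # seconds
    fatigue_bin = min(trial_index // 50, 3)  # coarse "how far into the session" bucket
    return (rt_bin, iti_bin, fatigue_bin)
```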

Notation:

  • x is input / context
  • a is chosen action: l or r
  • r is the reward the rat earns

Step 1 is to model the logging policy with a simple empirical estimate using ALL the data:

$$ \hat\pi(a \vert x) = \frac{\# (a_t=a \cap x_t=x)}{\# (x_t=x)} $$

Their decision to use all the data (rather than estimating $\hat\pi$ on the training split only, or separately on the training and test splits) stems from their scenario, where the set of ads can differ between training and test time. For us, the actions are always the same, so we should experiment with this choice.

In the formula above, it is important that input x’s repeat themselves, i.e. we need to discretize the input variables sufficiently.
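
This counting estimate translates directly to code, assuming the contexts have already been discretized into hashable values as noted above:

```python
from collections import Counter

def estimate_logging_policy(contexts, actions):
    """Empirical estimate pi_hat(a | x) = #(a_t = a and x_t = x) / #(x_t = x)."""
    joint = Counter(zip(contexts, actions))   # counts of (x, a) pairs
    marginal = Counter(contexts)              # counts of x alone
    def pi_hat(a, x):
        return joint[(x, a)] / marginal[x] if marginal[x] else 0.0
    return pi_hat
```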

Step 2 is to predict the reward given the input and the chosen action.

The paper uses weighted least squares, i.e. it minimizes the weighted squared loss (per logged event):

$$ \frac{(r - f(x,a))^2}{\max(\hat\pi(a \vert x), \tau)} $$

The threshold parameter $\tau$ ensures stability by bounding the weights. They try $\tau = 0.01$ and $\tau = 0.05$.

But we can experiment with other techniques.
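
One way to implement this step (a sketch, assuming scikit-learn's ridge regression and the same context-plus-one-hot-action featurization as in the naive baseline above; `pi_hat_values` holds $\hat\pi(a_t \vert x_t)$ for each logged event):

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_weighted_regressor(X_xa, rewards, pi_hat_values, tau=0.05):
    """Minimize sum_t (r_t - f(x_t, a_t))^2 / max(pi_hat(a_t | x_t), tau)
    by passing the importance weights as sample weights."""
    weights = 1.0 / np.maximum(pi_hat_values, tau)
    model = Ridge(alpha=1.0)
    model.fit(X_xa, rewards, sample_weight=weights)
    return model
```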

Step 3 is to choose new actions by maximizing expected reward:

  • Set of available actions: $C(x) = \{a \in A : \hat\pi(a \vert x) > 0\}$
  • New action policy: $h(x) = \operatorname{argmax}_{a \in C(x)} f(x, a)$
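
A sketch of this step, assuming `pi_hat` is the function from Step 1 and `f` is a callable `f(x, a)` wrapping the Step 2 regressor:

```python
import numpy as np

def learned_policy(f, pi_hat, x, actions=("l", "r")):
    """h(x) = argmax over C(x) = {a : pi_hat(a | x) > 0} of the predicted reward f(x, a)."""
    allowed = [a for a in actions if pi_hat(a, x) > 0]
    if not allowed:                  # context never seen in the log; our fallback, not the paper's
        allowed = list(actions)
    scores = [f(x, a) for a in allowed]
    return allowed[int(np.argmax(scores))]
```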

Step 4 is to evaluate the new policy:

We compute the average reward per trial for the new policy, applied to $T$ test samples and restricted to those events where the suggested action matches the logged action.

Notation: $a_i$ is the logged action for event $i$; $h(x_i)$ is the new suggested action for event $i$; $1_{\ldots}$ is the indicator function.

Estimator:

$$ V = \frac{1}{T} \sum_{i=1}^{T} \frac{1_{a_i = h(x_i)} \, r_i}{\max(\hat\pi(a_i \vert x_i), \tau)} $$

Note that we can swap out the regressor behind the argmax for other models.

The importance weighting ensures that the training process “emphasizes good estimation for each action equally”. This won’t have much of an effect when we only have two actions.

This estimator is computed for this method, as well as for the Random Policy (i.e., choose an action uniformly at random) and the Naive Policy (approach #1 at the top).
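
The estimator itself is straightforward to compute once we have a policy `h`, the logged test events, and `pi_hat`; a sketch:

```python
def offline_value_estimate(policy, contexts, actions, rewards, pi_hat, tau=0.05):
    """V = (1/T) * sum_i 1{h(x_i) = a_i} * r_i / max(pi_hat(a_i | x_i), tau)."""
    T = len(contexts)
    total = 0.0
    for x, a, r in zip(contexts, actions, rewards):
        if policy(x) == a:                     # only events where the suggestion matches the log
            total += r / max(pi_hat(a, x), tau)
    return total / T
```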

Next steps

  1. Implement this paper and see how we do
  2. Compare to other approaches listed above
  3. Compare these “optimal” bandit results to rat’s behavior and performance
Written on November 18, 2015