Efficient Counterfactual Learning from Bandit Feedback
Yusuke Narita, Shota Yasui, Kohei Yata
Abstract
What is the most statistically efficient way to do off-policy evaluation and optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators of the expected reward of a counterfactual policy. Our estimators are shown to have the lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design at a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with greater statistical confidence than a state-of-the-art benchmark.
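The abstract does not reproduce the estimators themselves. As background, the sketch below shows the standard inverse propensity weighting (IPW) baseline that off-policy evaluation papers of this kind typically compare against: it estimates the expected reward of a counterfactual policy from logged tuples of (action, reward, logging propensity). All function and variable names are illustrative and assume a deterministic target policy with known logging propensities; this is not the paper's proposed estimator.

```python
import numpy as np

def ipw_value_estimate(actions, rewards, propensities, target_actions):
    """Standard inverse propensity weighting (IPW) estimate of the
    expected reward of a counterfactual (target) policy from logged
    bandit feedback. Names and setup are illustrative, not from the
    paper; a deterministic target policy is assumed.

    actions        : (n,) actions chosen by the logging policy
    rewards        : (n,) observed rewards for those actions
    propensities   : (n,) logging policy's probability of each chosen action
    target_actions : (n,) actions the counterfactual policy would choose
    """
    # The importance weight is 1/propensity when the target policy
    # agrees with the logged action, and 0 otherwise.
    match = (actions == target_actions).astype(float)
    weights = match / propensities
    return np.mean(weights * rewards)

# Tiny synthetic example (hypothetical data).
rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 2, size=n)              # logging policy: uniform over {0, 1}
propensities = np.full(n, 0.5)                    # known logging propensities
rewards = rng.binomial(1, 0.3 + 0.2 * actions)    # action 1 has higher mean reward
target_actions = np.ones(n, dtype=actions.dtype)  # target policy: always play action 1

print(ipw_value_estimate(actions, rewards, propensities, target_actions))
# ~0.5, the true expected reward of always playing action 1
```

This baseline is unbiased when the logging propensities are correct, but its variance can be large when propensities are small; reducing that variance within a class of such estimators is the kind of improvement the abstract claims.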