Sequential decision problems form an almost universal class, spanning dynamic resource allocation, control, discrete graph, and active learning problems, as well as two-agent games and multiagent systems. Application settings span engineering, the sciences, transportation, health services, medical decision making, energy, e-commerce and finance. A rich problem class involves systems that must actively learn about their environment, possibly via drones or robots. In multiagent settings, we may also need to learn about the behavior of other agents.
These problems have been addressed in the academic literature using a variety of modeling and algorithmic frameworks, including dynamic programming, stochastic programming, stochastic control, simulation optimization, approximate dynamic programming/reinforcement learning, and even the literature on multiarmed bandit problems.
I will describe a universal modeling framework that can be used for any sequential decision problem in the presence of different sources of uncertainty. The framework is centered on an optimization problem posed over policies (rules for making decisions), where I show that there are two fundamental strategies for designing policies (policy search and policies based on lookahead approximations), each of which further divides into two classes, creating four (meta)classes of policies that form the foundation of every solution approach that has been proposed for sequential problems. I will demonstrate these policies in two broad contexts: pure learning problems (“bandit problems”) and dynamic resource allocation problems, using a simple energy storage problem to show that each of the four classes (as well as a hybrid) can be made to work best, depending on the characteristics of the problem data.
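To make the idea of optimizing over policies concrete, here is a minimal sketch of the kind of objective such a framework poses; the specific notation is an assumption for illustration rather than the framework's canonical form. A policy X^\pi(S_t) maps a state S_t to a decision, the state evolves via a transition function driven by exogenous information W_{t+1}, and we maximize the expected sum of contributions C(S_t, x_t):

\max_{\pi} \; \mathbb{E}\left\{ \sum_{t=0}^{T} C\bigl(S_t, X^{\pi}(S_t)\bigr) \,\Big|\, S_0 \right\},
\qquad \text{where } S_{t+1} = S^{M}\bigl(S_t, X^{\pi}(S_t), W_{t+1}\bigr).

Under this reading, policy search tunes a parameterized family of functions X^\pi directly against the objective, while lookahead-based policies approximate the downstream impact of each decision at decision time.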