We develop reinforcement learning trials for discovering individualized treatment regimens for life-threatening diseases such as cancer. We use Q-learning, a temporal-difference learning method, to learn an optimal policy from a single training set of finite longitudinal patient trajectories. The Q-function with time-indexed parameters is approximated by support vector regression or extremely randomized trees. Within this framework, we demonstrate that the procedure can extract optimal strategies directly from clinical data without relying on the identification of any accurate mathematical model, unlike approaches based on adaptive design. We show that reinforcement learning has tremendous potential in clinical research because it can select actions that improve outcomes by accounting for delayed effects, even when the relationship between actions and outcomes is not fully known. A simulation study illustrates the practical utility of the methodology. In future research, we will apply this general strategy to identify new treatments for advanced metastatic stage IIIB/IV non-small cell lung cancer, whose treatment usually includes multiple lines of chemotherapy.
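To make the backward-induction Q-learning step concrete, the following is a minimal sketch on a hypothetical two-stage trial. The simulated state dynamics, the scalar "tumor burden" state, the binary actions, the reward definition, and the use of scikit-learn's ExtraTreesRegressor as the extremely randomized trees implementation are all illustrative assumptions, not the paper's actual simulation design:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
N = 500

# Hypothetical two-stage trial: state s = tumor burden in [0, 1],
# action a in {0, 1}, reward = reduction in burden at that stage.
s1 = rng.uniform(0, 1, N)
a1 = rng.integers(0, 2, N)
# Assumed dynamics: action 1 helps when burden is high, action 0 when low.
s2 = np.clip(s1 - 0.3 * (a1 == (s1 > 0.5)) + 0.05 * rng.standard_normal(N), 0, 1)
r1 = s1 - s2
a2 = rng.integers(0, 2, N)
s3 = np.clip(s2 - 0.3 * (a2 == (s2 > 0.5)) + 0.05 * rng.standard_normal(N), 0, 1)
r2 = s2 - s3

# Stage 2: regress observed reward on (state, action) with extremely
# randomized trees; this approximates the time-indexed Q-function.
Q2 = ExtraTreesRegressor(n_estimators=200, random_state=0)
Q2.fit(np.column_stack([s2, a2]), r2)

def max_Q2(s):
    """Value of the best stage-2 action at each state."""
    q0 = Q2.predict(np.column_stack([s, np.zeros_like(s)]))
    q1 = Q2.predict(np.column_stack([s, np.ones_like(s)]))
    return np.maximum(q0, q1)

# Stage 1: the regression target folds in the optimal future value,
# which is how delayed effects of the first action are accounted for.
Q1 = ExtraTreesRegressor(n_estimators=200, random_state=0)
Q1.fit(np.column_stack([s1, a1]), r1 + max_Q2(s2))

def policy_stage1(s):
    """Estimated optimal first-line action for each state in s."""
    q0 = Q1.predict(np.column_stack([s, np.zeros_like(s)]))
    q1 = Q1.predict(np.column_stack([s, np.ones_like(s)]))
    return (q1 > q0).astype(int)

# Usage: recommend first-line actions for a high- and a low-burden patient.
actions = policy_stage1(np.array([0.9, 0.1]))
print(actions)
```

The policy is learned directly from the trajectory data, without an explicit model of the disease dynamics, mirroring the model-free character of the approach described above.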


Clinical Trials | Disease Modeling | Statistical Methodology | Statistical Theory