* denotes equal contribution and joint lead authorship.
Blue - Conference Papers.
Red - Workshop and Doctoral Consortia Papers.
Orange - Journal Papers.


  1. Learning Interpretable, High-Performing Policies for Autonomous Driving

    Robotics: Science and Systems, 2022

    Gradient-based approaches in reinforcement learning have achieved tremendous success in learning policies for autonomous vehicles. While the performance of these approaches warrants real-world adoption, these policies lack interpretability, limiting deployability in the safety-critical and legally-regulated domain of autonomous driving (AD). AD requires interpretable and verifiable control policies that maintain high performance. We propose Interpretable Continuous Control Trees (ICCTs), a tree-based model that can be optimized via modern, gradient-based RL approaches to produce high-performing, interpretable policies. The key to our approach is a procedure for allowing direct optimization in a sparse decision-tree-like representation. We validate ICCTs against baselines across six domains, showing that ICCTs are capable of learning interpretable policy representations that match or outperform baselines by up to 33% in AD scenarios while achieving a 300x-600x reduction in the number of policy parameters against deep learning baselines. Furthermore, we demonstrate the interpretability and utility of our ICCTs through a 14-car physical robot demonstration.
  2. Scaling Multi-Agent Reinforcement Learning via State Upsampling

    RSS 2022 Workshop on Scaling Robot Learning (RSS22-SRL)

    We consider the problem of scaling Multi-Agent Reinforcement Learning (MARL) algorithms toward larger environments and team sizes. While it is possible to learn a MARL-synthesized policy on these larger problems from scratch, training is difficult as the joint state-action space is much larger. Policy learning will require a large amount of experience (and associated training time) to reach a target performance. In this paper, we propose a transfer learning method that accelerates the training performance in such high-dimensional tasks with increased complexity. Our method upsamples an agent’s state representation in a smaller, less challenging, source task in order to pre-train a target policy for a larger, more challenging, target task. By transferring the policy after pre-training and continuing MARL in the target domain, the information learned within the source task enables higher performance within the target task in significantly less time than training from scratch. As such, our method enables the scalability of coordination problems. Furthermore, as our method only changes the state representation of agents across tasks, it is agnostic to the policy’s architecture and can be deployed across different MARL algorithms. We provide results showing that a policy trained under our method is able to achieve up to a 7.88$\times$ performance improvement under the same amount of training time, compared to a policy trained from scratch. Moreover, our method enables learning in difficult target task settings where training from scratch fails.
  3. Learning Efficient Diverse Communication for Cooperative Heterogeneous Teaming

    International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2022

    High-performing teams learn intelligent and efficient communication and coordination strategies to maximize their joint utility. These teams implicitly understand the different roles of heterogeneous team members and adapt their communication protocols accordingly. Multi-Agent Reinforcement Learning (MARL) seeks to develop computational methods for synthesizing such coordination strategies, but formulating models for heterogeneous teams with different state, action, and observation spaces has remained an open problem. Without properly modeling agent heterogeneity, as in prior MARL work that leverages homogeneous graph networks, communication becomes less helpful and can even degrade cooperation and team performance. We propose Heterogeneous Policy Networks (HetNet) to learn efficient and diverse communication models for coordinating cooperative heterogeneous teams. Building on heterogeneous graph-attention networks, we show that HetNet not only facilitates learning heterogeneous collaborative policies per existing agent-class but also enables end-to-end training for learning highly efficient binarized messaging.
  4. Mutual Understanding in Human-Machine Teaming
    Rohan Paleja* and Matthew Gombolay.

    Association for the Advancement of Artificial Intelligence Conference (AAAI) Doctoral Consortium, 2022

    Collaborative robots (i.e., “cobots”) and machine learning-based virtual agents are increasingly entering the human workspace with the aim of increasing productivity, enhancing safety, and improving the quality of our lives. These agents will interact with a wide variety of people in dynamic and novel contexts, increasing the prevalence of human-machine teams in healthcare, manufacturing, and search-and-rescue. In this research, we enhance the mutual understanding within a human-machine team by enabling cobots to understand heterogeneous teammates via person-specific embeddings, identifying contexts in which xAI methods can help improve team mental model alignment, and enabling cobots to effectively communicate information that supports high-performance human-machine teaming.


  1. Using Machine Learning to Predict Perfusionists’ Critical Decision-Making during Cardiac Surgery
    Roger Dias, Marco Zenati, Geoff Rance, Rithy Srey, David Arney, Letian Chen, Rohan Paleja, Lauren Kennedy-Metz, and Matthew Gombolay.

    Computer Methods in Biomechanics and Biomedical Engineering, 2021

    The cardiac surgery operating room is a high-risk and complex environment in which multiple experts work as a team to provide safe and excellent care to patients. During the cardiopulmonary bypass phase of cardiac surgery, critical decisions need to be made and the perfusionists play a crucial role in assessing available information and taking a certain course of action. In this paper, we report the findings of a simulation-based study using machine learning to build predictive models of perfusionists’ decision-making during critical situations in the operating room (OR). Performing 30-fold cross-validation across 30 random seeds, our machine learning approach was able to achieve an accuracy of 78.2% (95% confidence interval: 77.8% to 78.6%) in predicting perfusionists’ actions, having access to only 148 simulations. The findings from this study may inform future development of computerised clinical decision support tools to be embedded into the OR, improving patient safety and surgical outcomes.
  2. The Utility of Explainable AI in Ad Hoc Human-Machine Teaming

    Conference on Neural Information Processing Systems (NeurIPS), 2021

    Recent advances in machine learning have led to growing interest in Explainable AI (xAI) to enable humans to gain insight into the decision-making of machine learning models. Despite this recent interest, the utility of xAI techniques has not yet been characterized in human-machine teaming. Importantly, xAI offers the promise of enhancing team situational awareness (SA) and shared mental model development, which are the key characteristics of effective human-machine teams. Rapidly developing such mental models is especially critical in ad hoc human-machine teaming, where agents do not have a priori knowledge of others’ decision-making strategies. In this paper, we present two novel human-subject experiments quantifying the benefits of deploying xAI techniques within a human-machine teaming scenario. First, we show that xAI techniques can support SA ($p<0.05$). Second, we examine how different SA levels induced via a collaborative AI policy abstraction affect ad hoc human-machine teaming performance. Importantly, we find that the benefits of xAI are not universal, as there is a strong dependence on the composition of the human-machine team. Novices benefit from xAI providing increased SA ($p<0.05$) but are susceptible to cognitive overhead ($p<0.05$). On the other hand, expert performance degrades with the addition of xAI-based support ($p<0.05$), indicating that the cost of paying attention to the xAI outweighs the benefits obtained from being provided additional information to enhance SA. Our results demonstrate that researchers must deliberately design and deploy the right xAI techniques in the right scenario by carefully considering human-machine team composition and how the xAI method augments SA.
  3. Towards Sample-efficient Apprenticeship Learning from Suboptimal Demonstration
    Letian Chen, Rohan Paleja, and Matthew Gombolay.

    AAAI Artificial Intelligence for Human-Robot Interaction (AI-HRI) Fall Symposium, 2021

    Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform novel tasks by providing demonstrations. However, as demonstrators are typically non-experts, modern LfD techniques are unable to produce policies much better than the suboptimal demonstration. A previously-proposed framework, SSRR, has shown success in learning from suboptimal demonstration but relies on noise-injected trajectories to infer an idealized reward function. A random approach such as noise-injection to generate trajectories has two key drawbacks: 1) Performance degradation could be random depending on whether the noise is applied to vital states and 2) Noise-injection generated trajectories may have limited suboptimality and therefore will not accurately represent the whole scope of suboptimality. We present Systematic Self-Supervised Reward Regression, S3RR, to investigate systematic alternatives for trajectory degradation.
  4. Multi-Agent Graph-Attention Communication and Teaming
    Yaru Niu*, Rohan Paleja*, and Matthew Gombolay.

    International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2021
    Best Workshop Paper Award Winner at ICCV MAIR2 Workshop

    High-performing teams learn effective communication strategies to judiciously share information and reduce the cost of communication overhead. Within multi-agent reinforcement learning, synthesizing effective policies requires reasoning about when to communicate, whom to communicate with, and how to process messages. We propose a novel multi-agent reinforcement learning algorithm, Multi-Agent Graph-attentIon Communication (MAGIC), with a graph-attention communication protocol in which we learn 1) a Scheduler to help with the problems of when to communicate and whom to address messages to, and 2) a Message Processor using Graph Attention Networks (GATs) with dynamic graphs to deal with communication signals. The Scheduler consists of a graph attention encoder and a differentiable attention mechanism, which outputs dynamic, differentiable graphs to the Message Processor, which enables the Scheduler and Message Processor to be trained end-to-end. We evaluate our approach on a variety of cooperative tasks, including Google Research Football. Our method outperforms baselines across all domains, achieving $\approx 10\%$ increase in reward in the most challenging domain. We also show MAGIC communicates $23.2\%$ more efficiently than the average baseline, is robust to stochasticity, and scales to larger state-action spaces. Finally, we demonstrate MAGIC on a physical, multi-robot testbed.
  5. Effects of Social Factors and Team Dynamics on Adoption of Collaborative Robot Autonomy
    Mariah Schrum*, Glen Neville*, Michael Johnson*, Nina Moorman, Rohan Paleja, Karen Feigh, and Matthew Gombolay.

    ACM/IEEE International Conference on Human Robot Interaction (HRI), 2021

    As automation becomes more prevalent, the fear of job loss due to automation increases. Workers may not be amenable to working with a robotic co-worker due to a negative perception of the technology. The attitudes of workers towards automation are influenced by a variety of complex and multi-faceted factors such as intention to use, perceived usefulness and other external variables. In an analog manufacturing environment, we explore how these various factors influence an individual’s willingness to work with a robot over a human co-worker in a collaborative Lego building task. We specifically explore how this willingness is affected by: 1) the level of social rapport established between the individual and his or her human co-worker, 2) the anthropomorphic qualities of the robot, and 3) factors including trust, fluency and personality traits. Our results show that a participant’s willingness to work with automation decreased due to lower perceived team fluency (p=0.045), rapport established between a participant and their co-worker (p=0.003), the gender of the participant being male (p=0.041), and a higher inherent trust in people (p=0.018).


  1. Learning from Suboptimal Demonstration via Self-Supervised Reward Regression
    Letian Chen, Rohan Paleja, and Matthew Gombolay.

    Conference on Robot Learning (CoRL), 2021
    Best Paper Award Finalist

    Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration. However, modern LfD techniques, e.g. inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations. This assumption fails to hold in most real-world scenarios. Recent attempts to learn from sub-optimal demonstration leverage pairwise rankings and follow the Luce-Shepard rule. However, we show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance. We overcome these limitations in developing a novel approach that bootstraps off suboptimal demonstrations to synthesize optimality-parameterized data to train an idealized reward function. We empirically validate that we learn an idealized reward function with ~0.95 correlation with ground-truth reward versus ~0.75 for prior work. We can then train policies achieving ~200% improvement over the suboptimal demonstration and ~90% improvement over prior work. We present a physical demonstration of teaching a robot a topspin strike in table tennis that achieves 32% faster returns and 40% more topspin than user demonstration.
  2. Interpretable and Personalized Apprenticeship Scheduling: Learning Interpretable Scheduling Policies from Heterogeneous User Demonstrations

    Conference on Neural Information Processing Systems (NeurIPS), 2020

    Resource scheduling and coordination is an NP-hard optimization requiring an efficient allocation of agents to a set of tasks with upper- and lower-bound temporal and resource constraints. Due to the large-scale and dynamic nature of resource coordination in hospitals and factories, human domain experts manually plan and adjust schedules on the fly. To perform this job, domain experts leverage heterogeneous strategies and rules-of-thumb honed over years of apprenticeship. What is critically needed is the ability to extract this domain knowledge in a heterogeneous and interpretable apprenticeship learning framework to scale beyond the power of a single human expert, a necessity in safety-critical domains. We propose a personalized and interpretable apprenticeship scheduling algorithm that infers an interpretable representation of all human task demonstrators by extracting decision-making criteria specified by an inferred, personalized embedding without constraining the number of decision-making strategies. We achieve near-perfect LfD accuracy in synthetic domains and 88.22% accuracy on a real-world planning domain, outperforming baselines. Further, a user study shows that our methodology produces both interpretable and highly usable models (p < 0.05).
  3. Joint Goal and Strategy Inference across Heterogeneous Demonstrators via Reward Network Distillation
    Rohan Paleja

    ACM/IEEE International Conference on Human Robot Interaction (HRI), 2020

    Reinforcement learning (RL) has achieved tremendous success as a general framework for learning how to make decisions. However, this success relies on the interactive hand-tuning of a reward function by RL experts. On the other hand, inverse reinforcement learning (IRL) seeks to learn a reward function from readily-obtained human demonstrations. Yet, IRL suffers from two major limitations: 1) reward ambiguity: there are an infinite number of possible reward functions that could explain an expert’s demonstration, and 2) heterogeneity: human experts adopt varying strategies and preferences, which makes learning from multiple demonstrators difficult due to the common assumption that demonstrators seek to maximize the same reward. In this work, we propose a method to jointly infer a task goal and humans’ strategic preferences via network distillation. This approach enables us to distill a robust task reward (addressing reward ambiguity) and to model each strategy’s objective (handling heterogeneity). We demonstrate our algorithm can better recover task reward and strategy rewards and imitate the strategies in two simulated tasks and a real-world table tennis task.


  1. Heterogeneous Learning from Demonstration
    Rohan Paleja and Matthew Gombolay.

    International Conference on Human Robot Interaction (HRI) Pioneers Workshop

    The development of human-robot systems able to leverage the strengths of both humans and their robotic counterparts has been greatly sought after because of the foreseen, broad-ranging impact across industry and research. We believe the true potential of these systems cannot be reached unless the robot is able to act with a high level of autonomy, reducing the burden of manual tasking or teleoperation. To achieve this level of autonomy, robots must be able to work fluidly with their human partners, inferring their needs without explicit commands. This inference requires the robot to be able to detect and classify the heterogeneity of its partners. We propose a framework for learning from heterogeneous demonstration based upon Bayesian inference and evaluate a suite of approaches on a real-world dataset of gameplay from StarCraft II. This evaluation provides evidence that our Bayesian approach can outperform conventional methods by up to 12.8%.