Vitchyr H. Pong
I am a research scientist at OpenAI.
I completed my PhD at UC Berkeley, where I was advised by Sergey
Levine. I studied how to apply deep reinforcement learning to
robotics. I completed my B.S. at Cornell University, where I worked with
Ross Knepper and Hadas Kress-Gazit.
CV (March 2019)  / 
LinkedIn  / 
GitHub  / 
PhD Thesis
vitchyr at berkeley dot edu
Offline Meta-Reinforcement Learning with Online Self-Supervision
Vitchyr H. Pong,
Ashvin Nair,
Laura Smith,
Catherine Huang,
Sergey Levine.
International Conference on Machine Learning. 2022.
[arXiv]
[website]
Meta-reinforcement learning (RL) can meta-train policies that adapt to
new tasks with orders of magnitude less data than standard RL, but
meta-training itself is costly and time-consuming. If we can
meta-train on offline data, then we can reuse the same static dataset,
labeled once with rewards for different tasks, to meta-train policies
that adapt to a variety of new tasks at meta-test time. Although this
capability would make meta-RL a practical tool for real-world use,
offline meta-RL presents additional challenges beyond online meta-RL
or standard offline RL settings. Meta-RL learns an exploration
strategy that collects data for adapting, and also meta-trains a
policy that quickly adapts to data from a new task. Since this policy
was meta-trained on a fixed, offline dataset, it might behave
unpredictably when adapting to data collected by the learned
exploration strategy, which differs systematically from the offline
data and thus induces distributional shift. We do not want to remove
this distributional shift by simply adopting a conservative
exploration strategy, because learning an exploration strategy enables
an agent to collect better data for faster adaptation. Instead, we
propose a hybrid offline meta-RL algorithm, which uses offline data
with rewards to meta-train an adaptive policy, and then collects
additional unsupervised online data, without any reward labels, to
bridge this distribution shift. By not requiring reward labels for
online collection, this data can be much cheaper to collect. We
compare our method to prior work on offline meta-RL on simulated robot
locomotion and manipulation tasks and find that using additional
unsupervised online data collection leads to a dramatic improvement in
the adaptive capabilities of the meta-trained policies, matching the
performance of fully online meta-RL on a range of challenging domains
that require generalization to new tasks.
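As a rough illustration of the two-phase recipe described above, the Python sketch below alternates offline meta-training with unsupervised online collection. The agent.meta_train_step, agent.explore, and reward_model interfaces are hypothetical placeholders, not the code used in the paper.

# Illustrative sketch of the hybrid offline meta-RL recipe (all interfaces are
# hypothetical placeholders, not the paper's implementation).
def hybrid_offline_meta_rl(offline_data, agent, reward_model, env, n_online_rounds=10):
    # Phase 1: meta-train the adaptive policy on reward-labeled offline data.
    for batch in offline_data.iterate_batches():
        agent.meta_train_step(batch)
    # Phase 2: collect unsupervised online data (no reward labels) with the learned
    # exploration strategy to bridge the distribution shift, labeling it with a
    # self-supervised reward model rather than ground-truth rewards.
    for _ in range(n_online_rounds):
        trajectory = agent.explore(env)
        trajectory.rewards = reward_model(trajectory)
        offline_data.add(trajectory)
        for batch in offline_data.iterate_batches():
            agent.meta_train_step(batch)
    return agent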
Outcome-Driven Reinforcement Learning via Variational Inference
Tim G. J. Rudner*,
Vitchyr H. Pong*,
Rowan McAllister,
Yarin Gal,
Sergey Levine.
Neural Information Processing Systems. 2021.
[arXiv]
While reinforcement learning algorithms provide
automated acquisition of optimal policies, practical application of
such methods requires a number
of design decisions, such as manually designing
reward functions that not only define the task, but
also provide sufficient shaping to accomplish it.
In this paper, we discuss a new perspective on
reinforcement learning, recasting it as the problem of inferring
actions that achieve desired outcomes, rather than a problem of
maximizing rewards. To solve the resulting outcome-directed inference
problem, we establish a novel variational
inference formulation that allows us to derive a
well-shaped reward function which can be learned
directly from environment interactions. From the
corresponding variational objective, we also derive a new
probabilistic Bellman backup operator reminiscent of the standard
Bellman backup
operator and use it to develop an off-policy algorithm to solve
goal-directed tasks. We empirically
demonstrate that this method eliminates the need
to design reward functions and leads to effective
goal-directed behaviors.
DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies
Soroush Nasiriany*,
Vitchyr H. Pong*,
Ashvin Nair*,
Alexander Khazatsky,
Glen Berseth,
Sergey Levine.
International Conference on Robotics and Automation, 2021.
[arXiv]
[videos]
Can we use reinforcement learning to learn general-purpose policies
that can perform a wide range of different tasks, resulting in
flexible and reusable skills? Contextual policies provide this
capability in principle, but the representation of the context
determines the degree of generalization and expressivity. Categorical
contexts preclude generalization to entirely new tasks.
Goal-conditioned policies may enable some generalization, but cannot
capture all tasks that might be desired. In this paper, we propose
goal distributions as a general and broadly applicable task
representation suitable for contextual policies. Goal distributions
are general in the sense that they can represent any state-based
reward function when equipped with an appropriate distribution class,
while the particular choice of distribution class allows us to trade
off expressivity and learnability. We develop an off-policy algorithm
called distribution-conditioned reinforcement learning (DisCo RL) to
efficiently learn these policies. We evaluate DisCo RL on a variety of
robot manipulation tasks and find that it significantly outperforms
prior methods on tasks that require generalization to new goal
distributions.
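As a rough illustration of how a goal distribution induces a reward, the sketch below scores a reached state by its log-likelihood under a diagonal Gaussian goal distribution. This is a minimal example of the idea, assuming a Gaussian distribution class; disco_style_reward is not the paper's implementation.

import numpy as np

def disco_style_reward(state, goal_mean, goal_std):
    """Log-density of `state` under a diagonal Gaussian goal distribution."""
    var = goal_std ** 2
    return -0.5 * np.sum((state - goal_mean) ** 2 / var + np.log(2.0 * np.pi * var))

# Example: a 3-D state evaluated against a goal distribution centered at the origin.
print(disco_style_reward(np.array([0.1, -0.2, 0.0]),
                         goal_mean=np.zeros(3),
                         goal_std=0.5 * np.ones(3)))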
Skew-Fit: State-Covering Self-Supervised Reinforcement Learning
Vitchyr H. Pong*,
Murtaza Dalal*,
Steven Lin*,
Ashvin Nair*,
Shikhar Bahl,
Sergey Levine.
International Conference on Machine Learning. 2020.
[arXiv]
[videos]
For an autonomous agent to fulfill a wide range of user-specified goals
at test time, it must be able to learn broadly applicable and
general-purpose skill repertoires. Furthermore, to provide the requisite
level of generality, these skills must handle raw sensory input such as
images. In this paper, we propose an algorithm that acquires such
general-purpose skills by combining unsupervised representation learning
and reinforcement learning of goal-conditioned policies. Since the
particular goals that might be required at test-time are not known in
advance, the agent performs a self-supervised "practice" phase where it
imagines goals and attempts to achieve them. We learn a visual
representation with three distinct purposes: sampling goals for
self-supervised practice, providing a structured transformation of raw
sensory inputs, and computing a reward signal for goal reaching. We also
propose a retroactive goal relabeling scheme to further improve the
sample-efficiency of our method. Our off-policy algorithm is efficient
enough to learn policies that operate on raw image observations and
goals for a real-world robotic system, and substantially outperforms
prior techniques.
Contextual Imagined Goals for Self-Supervised Robotic Learning
Ashvin Nair*,
Shikhar Bahl*,
Alexander Khazatsky*,
Glen Berseth,
Vitchyr H. Pong,
Sergey Levine.
Conference on Robot Learning. 2019.
[arXiv]
[website]
For an autonomous agent to fulfill a wide range of user-specified goals
at test time, it must be able to learn broadly applicable and
general-purpose skill repertoires. Furthermore, to provide the requisite
level of generality, these skills must handle raw sensory input such as
images. In this paper, we propose an algorithm that acquires such
general-purpose skills by combining unsupervised representation learning
and reinforcement learning of goal-conditioned policies. Since the
particular goals that might be required at test-time are not known in
advance, the agent performs a self-supervised "practice" phase where it
imagines goals and attempts to achieve them. We learn a visual
representation with three distinct purposes: sampling goals for
self-supervised practice, providing a structured transformation of raw
sensory inputs, and computing a reward signal for goal reaching. We also
propose a retroactive goal relabeling scheme to further improve the
sample-efficiency of our method. Our off-policy algorithm is efficient
enough to learn policies that operate on raw image observations and
goals for a real-world robotic system, and substantially outperforms
prior techniques.
Planning with Goal-Conditioned Policies
Soroush Nasiriany*,
Vitchyr Pong*,
Steven Lin,
Sergey Levine.
Neural Information Processing Systems. 2019.
[arXiv]
[videos/website]
[code]
Planning methods can solve temporally extended sequential decision
making problems by composing simple behaviors. However, planning
requires suitable abstractions for the states and transitions, which
typically need to be designed by hand. In contrast, model-free
reinforcement learning (RL) can acquire behaviors from low-level
inputs directly, but often struggles with temporally extended tasks.
Can we utilize reinforcement learning to automatically form the
abstractions needed for planning, thus obtaining the best of both
approaches? We show that goal-conditioned policies learned with RL can
be incorporated into planning, so that a planner can focus on which
states to reach, rather than how those states are reached. However,
with complex state observations such as images, not all inputs
represent valid states. We therefore also propose using a latent
variable model to compactly represent the set of valid states for the
planner, so that the policies provide an abstraction of actions, and
the latent variable model provides an abstraction of states. We
compare our method with planning-based and model-free methods and find
that our method significantly outperforms prior work when evaluated on
image-based robot navigation and manipulation tasks that require
non-greedy, multi-staged behavior.
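A minimal sketch of the planning idea: candidate subgoal sequences in the learned latent space are scored by a goal-conditioned value function that estimates reachability between consecutive waypoints. The random-shooting optimizer and the value_fn interface below are illustrative assumptions, not the paper's implementation.

import numpy as np

def score_subgoal_plan(latent_start, latent_subgoals, latent_goal, value_fn):
    """Higher is better: summed reachability scores along the subgoal chain."""
    waypoints = [latent_start] + list(latent_subgoals) + [latent_goal]
    return sum(value_fn(waypoints[i], waypoints[i + 1]) for i in range(len(waypoints) - 1))

def plan_random_shooting(latent_start, latent_goal, value_fn, n_subgoals=2,
                         n_candidates=256, latent_dim=16):
    """Pick the best of randomly sampled subgoal sequences (a simple stand-in optimizer)."""
    best_plan, best_score = None, -np.inf
    for _ in range(n_candidates):
        candidate = np.random.randn(n_subgoals, latent_dim)  # sample subgoals in latent space
        score = score_subgoal_plan(latent_start, candidate, latent_goal, value_fn)
        if score > best_score:
            best_plan, best_score = candidate, score
    return best_plan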
Visual Reinforcement Learning with Imagined Goals
Ashvin Nair*,
Vitchyr Pong*,
Murtaza Dalal,
Shikhar Bahl,
Steven Lin,
Sergey Levine.
Neural Information Processing Systems. 2018.
Spotlight.
[arXiv]
[videos]
[code]
[blog]
For an autonomous agent to fulfill a wide range of user-specified goals
at test time, it must be able to learn broadly applicable and
general-purpose skill repertoires. Furthermore, to provide the requisite
level of generality, these skills must handle raw sensory input such as
images. In this paper, we propose an algorithm that acquires such
general-purpose skills by combining unsupervised representation learning
and reinforcement learning of goal-conditioned policies. Since the
particular goals that might be required at test-time are not known in
advance, the agent performs a self-supervised "practice" phase where it
imagines goals and attempts to achieve them. We learn a visual
representation with three distinct purposes: sampling goals for
self-supervised practice, providing a structured transformation of raw
sensory inputs, and computing a reward signal for goal reaching. We also
propose a retroactive goal relabeling scheme to further improve the
sample-efficiency of our method. Our off-policy algorithm is efficient
enough to learn policies that operate on raw image observations and
goals for a real-world robotic system, and substantially outperforms
prior techniques.
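A minimal sketch of the retroactive goal relabeling idea, assuming states and goals are encoded into a shared latent space and the reward is negative latent distance; the sampling strategy and names below are simplified placeholders rather than the exact scheme in the paper.

import numpy as np

def relabel_trajectory(latent_states, latent_goal, k_future=4):
    """Return (state, goal, reward) tuples with goals replaced by later latent states."""
    transitions = []
    T = len(latent_states) - 1
    for t in range(T):
        # Keep the original goal.
        transitions.append((latent_states[t], latent_goal,
                            -np.linalg.norm(latent_states[t + 1] - latent_goal)))
        # Relabel with goals the agent actually reached later in the trajectory.
        for f in np.random.randint(t + 1, T + 1, size=k_future):
            g = latent_states[f]
            transitions.append((latent_states[t], g,
                                -np.linalg.norm(latent_states[t + 1] - g)))
    return transitions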
Composable Deep Reinforcement Learning for Robotic Manipulation
Tuomas Haarnoja,
Vitchyr Pong,
Aurick Zhou,
Murtaza Dalal,
Pieter Abbeel,
Sergey Levine.
International Conference on Robotics and Automation, 2018.
[arXiv]
[video]
[code]
Model-free deep reinforcement learning has been shown to exhibit good
performance in domains ranging from video games to simulated robotic
manipulation and locomotion. However, model-free methods are known to
perform poorly when the interaction time with the environment is
limited, as is the case for most real-world robotic tasks. In this
paper, we study how maximum entropy policies trained using soft
Q-learning can be applied to real-world robotic manipulation. The
application of this method to real-world manipulation is facilitated by
two important features of soft Q-learning. First, soft Q-learning can
learn multimodal exploration strategies by learning policies represented
by expressive energy-based models. Second, we show that policies learned
with soft Q-learning can be composed to create new policies, and that
the optimality of the resulting policy can be bounded in terms of the
divergence between the composed policies.
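As a rough illustration of the composition result, the sketch below averages two soft Q-functions over a discrete set of candidate actions and acts through the induced Boltzmann policy. This is a simplified stand-in for intuition; the paper works with continuous actions and energy-based policies.

import numpy as np

def composed_policy_probs(q1_values, q2_values, temperature=1.0):
    """Action probabilities of the composed policy over a discrete candidate action set."""
    q_composed = 0.5 * (q1_values + q2_values)  # average the constituent soft Q-values
    logits = q_composed / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: Q-values from two tasks over four candidate actions.
print(composed_policy_probs(np.array([1.0, 0.5, 0.0, -1.0]),
                            np.array([0.2, 1.2, 0.1, 0.0])))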
Temporal Difference Models: Model-Free Deep RL for Model-Based Control
Vitchyr Pong*,
Shixiang Gu*,
Murtaza Dalal,
Sergey Levine.
International Conference on Learning Representations. 2018.
[arXiv]
[code]
[blog]
Model-free reinforcement learning (RL) is a powerful, general tool for
learning complex behaviors. However, its sample complexity is often
impractically large for solving challenging real-world problems, even with
off-policy algorithms such as Q-learning.
We introduce temporal difference models (TDMs), a family of
goal-conditioned value functions that can be trained
with model-free learning and used for model-based control.
TDMs combine the benefits of model-free and model-based RL: they
leverage the rich information in state transitions to learn very
efficiently, while still attaining asymptotic performance that exceeds
that of direct model-based RL methods.
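A minimal sketch of the TDM bootstrap target under a negative-distance reward: when no steps remain, the value is how close the next state is to the goal; otherwise it bootstraps with one fewer step remaining. The q_function and candidate_actions arguments are hypothetical placeholders for a learned network and an action set.

import numpy as np

def tdm_target(next_state, goal, horizon, q_function, candidate_actions):
    """Target value for Q(state, action, goal, horizon) given the observed next state."""
    if horizon == 0:
        # At the end of the remaining horizon, value is (negative) distance to the goal.
        return -np.linalg.norm(next_state - goal)
    # Otherwise, bootstrap with one less step remaining, maximizing over candidate actions.
    return max(q_function(next_state, a, goal, horizon - 1) for a in candidate_actions)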
Uncertainty-Aware Reinforcement Learning for Collision Avoidance
Gregory Kahn,
Adam Villaflor,
Vitchyr Pong,
Pieter Abbeel,
Sergey Levine.
arXiv:1702.01182
[Video]
[arXiv]
Practical deployment of reinforcement learning methods must contend with
the fact that the training process itself can be unsafe for the robot.
In this paper, we consider the specific case of a mobile robot learning
to navigate an a priori unknown environment while avoiding collisions.
We present an uncertainty-aware model-based learning algorithm that
estimates the probability of collision together with a statistical
estimate of uncertainty. We evaluate our method on a simulated and
real-world quadrotor, and a real-world RC car.
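As a rough illustration of uncertainty-aware collision prediction, the sketch below uses an ensemble of predictors whose mean serves as the collision probability and whose spread serves as the uncertainty estimate; the ensemble here is a toy stand-in for the learned models in the paper.

import numpy as np

def collision_estimate(ensemble, observation):
    """Mean collision probability and a simple uncertainty estimate from an ensemble."""
    predictions = np.array([model(observation) for model in ensemble])
    return predictions.mean(), predictions.std()

# Toy example: three "models" mapping an observation to a collision probability.
toy_ensemble = [lambda obs, b=b: 1.0 / (1.0 + np.exp(-(obs.sum() + b)))
                for b in (-0.5, 0.0, 0.5)]
prob, uncertainty = collision_estimate(toy_ensemble, np.array([0.2, -0.1]))
print(prob, uncertainty)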
Learning Long-term Dependencies with Deep Memory States
Vitchyr Pong,
Shixiang Gu,
Sergey Levine.
Lifelong Learning: A Reinforcement Learning Approach Workshop,
International Conference on Machine Learning.
2017.
Training an agent to use past memories to adapt to new tasks and
environments is important for lifelong learning algorithms.
We propose a reinforcement learning method that addresses the
limitations of backpropagation through time (BPTT) and truncated BPTT by training
a critic to estimate truncated gradients and by saving and loading
the hidden states produced by recurrent neural networks.
We present results showing that our algorithm can learn long-term
dependencies while avoiding the computational constraints of BPTT.
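A minimal sketch of the memory-state idea: the recurrent network's hidden state is stored with each transition in the replay buffer so training can restart from saved memory instead of backpropagating through the entire history. The class and field names below are illustrative only.

import numpy as np

class MemoryReplayBuffer:
    def __init__(self):
        self.transitions = []

    def add(self, obs, action, reward, next_obs, hidden_state):
        # Store the recurrent hidden state alongside the transition so it can be
        # reloaded later as the starting memory for training.
        self.transitions.append(dict(obs=obs, action=action, reward=reward,
                                     next_obs=next_obs,
                                     hidden_state=np.array(hidden_state, copy=True)))

    def sample(self, batch_size):
        idx = np.random.randint(len(self.transitions), size=batch_size)
        return [self.transitions[i] for i in idx]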
Reactive high-level behavior synthesis for an Atlas
humanoid robot
Spyros Maniatopoulos,
Philipp Schillinger,
Vitchyr Pong,
David C. Conner,
Hadas Kress-Gazit.
IEEE International Conference on Robotics and Automation,
2016.
We present an end-to-end approach for the automatic generation of code
that implements high-level robot behaviors in a verifiably correct
manner. We start with Linear Temporal Logic (LTL) specifications and use them
to synthesize a reactive mission plan that is guaranteed to satisfy the
formal specifications.
Two evolving social network models
Sam Magura,
Vitchyr Pong,
Rick Durrett,
David Sivakoff
.
ALEA, Lat. Am. J. Probab. Math. Stat., 2015.
We study two different social network models. We prove that their
stationary distributions satisfy the detailed balance condition and give
explicit formulas for the stationary distributions. From this
distribution, we also obtain results about the degree distribution,
connectivity, and diameter for each model.
Chomp the Graph
Sam Magura,
Vitchyr Pong,
Elliot Cartee,
Kevin Valakuzhy.
Broad Street Scientific, 2012
Chomp the Graph is a terminating impartial game that adheres to the
normal play convention. By the Sprague-Grundy Theorem, every Chomp position
has a nimber, which determines whether that position leads to a win under
optimal play. We determine the nimber for certain types of graphs.
REPLAB: A Reproducible Low-Cost Arm Benchmark Platform for Robotic Learning
Brian Yang,
Jesse Zhang,
Vitchyr H. Pong,
Sergey Levine,
Dinesh Jayaraman.
arXiv preprint
[arXiv]
[website]
Standardized evaluation measures have aided in the progress of
machine learning approaches in disciplines such as computer vision and
machine translation. In this paper, we make the case that robotic
learning would also benefit from benchmarking, and present the
"REPLAB" platform for benchmarking vision-based manipulation tasks.
REPLAB is a reproducible and self-contained hardware stack (robot arm,
camera, and workspace) that costs about 2000 USD, occupies a cuboid of
size 70x40x60 cm, and permits full assembly within a few hours.
Through this low-cost, compact design, REPLAB aims to drive wide
participation by lowering the barrier to entry into robotics and to
enable easy scaling to many robots. We envision REPLAB as a framework
for reproducible research across manipulation tasks, and as a step in
this direction, we define a template for a grasping benchmark
consisting of a task definition, evaluation protocol, performance
measures, and a dataset of 92k grasp attempts. We implement, evaluate,
and analyze several previously proposed grasping approaches to
establish baselines for this benchmark. Finally, we also implement and
evaluate a deep reinforcement learning approach for 3D reaching tasks
on our REPLAB platform.
Keyboard Gloves
Vitchyr Pong, Gulnar Mirza, 2015
Demo / Video
Explanation
We designed and built gloves that allow users to type on any hard
surface as if they were using a QWERTY keyboard. The gloves map finger
movements to the standard QWERTY layout by detecting which finger is
pressed via push buttons and how bent each finger is via flex sensors.
We combined knowledge of analog circuit design, serial communication
protocols, and embedded programming to implement this project.
CS285: Deep Reinforcement Learning
Graduate Student Instructor.
University of California, Berkeley. Fall 2020.
CS188: Artificial Intelligence
Graduate Student Instructor.
University of California, Berkeley. Spring 2017.
CS4780 / CS5780: Machine Learning
Teaching Assistant.
Cornell University. Fall 2015.