June 20, 2024

Chongyi Zheng, Benjamin Eysenbach

Unsupervised learning is really powerful. It lies at the heart of large language models (just predict the next token), generative image models (predict what noise to remove from an image), and vision-language models (which text matches with which image). 

In recent work, we have explored how to build unsupervised RL methods. In particular, we aim to build an RL method that takes as input a description of the desired state of the world (a prompt), and then performs a sequence of actions in the world to attempt to get to this desired state. This problem is also sometimes known as goal-conditioned RL.

This problem statement highlights the close connection between RL and generative AI methods. A large language model takes as input a prompt and then does a bunch of computation before generating a textual response. A generative image model works similarly, starting with an image of pure noise and iteratively adding details and removing noise until it produces a high-resolution image. The RL problem similarly involves iterative computation: iteratively taking actions in an environment until it eventually arrives at a desired state.

We build upon prior work showing that this sort of unsupervised RL problem can be solved via representation learning: learning representations so that observations that occur nearby in time are mapped to similar representations. Once these representations are learned, we can determine good actions for reaching a goal simply by checking which actions would steer the current representation toward the representation of the goal.
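To make this concrete, here is a minimal sketch of the idea, assuming an InfoNCE-style contrastive objective over (state, future state) pairs and a linear encoder. The function names and architecture are illustrative, not the exact method from the papers:

```python
import numpy as np

def encode(W, obs):
    """Map raw observations to representations (here, a linear encoder)."""
    return obs @ W

def info_nce_loss(W, states, futures, temperature=1.0):
    """Contrastive objective: each state's representation should score high
    against the representation of a state reached shortly afterwards (the
    diagonal of the logits matrix) and low against future states from other
    trajectories in the batch (the off-diagonal negatives)."""
    z_s = encode(W, states)                      # (B, d)
    z_f = encode(W, futures)                     # (B, d)
    logits = z_s @ z_f.T / temperature           # (B, B)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def greedy_action(W, goal, candidate_next_states):
    """Pick the action whose predicted next state has a representation
    closest to the goal's representation."""
    z_goal = encode(W, goal)
    dists = [np.linalg.norm(encode(W, s) - z_goal) for s in candidate_next_states]
    return int(np.argmin(dists))
```

Once the encoder is trained, action selection reduces to the nearest-neighbor check in `greedy_action`: no reward function or planner is needed at test time.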

Our recent work [1, 2] makes two contributions to this exciting and growing area. First, while prior methods have been evaluated on simulated benchmarks, we show how these methods can be applied to real-world robotic manipulation. We identify a set of important architectural considerations that, taken together, allow us to solve complex manipulation tasks that stymie prior methods:

 

Design Decisions        Value
Network architecture    3-layer CNN + (1024, 4) MLP
Layer normalization     Yes
Cold initialization     Unif[-10^-12, 10^-12]
Data augmentation       Random cropping
Batch size              2048
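Two of these design decisions, layer normalization in the MLP and "cold" near-zero initialization of the final layer, are easy to sketch. The snippet below is an illustrative numpy version (the real networks also include the CNN encoder and are trained with the contrastive objective):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_value_mlp(in_dim, hidden=1024, depth=4, out_dim=16):
    """(1024, 4) MLP head with a 'cold' final layer: its weights are drawn
    from Unif[-1e-12, 1e-12], so value estimates start out near zero."""
    dims = [in_dim] + [hidden] * depth + [out_dim]
    params = [(rng.normal(scale=1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
              for m, n in zip(dims[:-1], dims[1:])]
    W_last, b_last = params[-1]
    params[-1] = (rng.uniform(-1e-12, 1e-12, size=W_last.shape), b_last)
    return params

def forward(params, x):
    """Hidden layers apply layer normalization followed by ReLU."""
    for W, b in params[:-1]:
        x = np.maximum(layer_norm(x @ W + b), 0.0)
    W, b = params[-1]
    return x @ W + b
```

The cold initialization keeps the initial value estimates close to zero, which avoids large, arbitrary targets at the start of training.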

One useful property of these methods is that they are trained without any rewards or human supervision. After training, the user provides an image of the desired goal state. Thus, if we want the robot to do something else, we don't need to retrain the system; we can just provide a single new goal image.

We also found that, unlike the representations learned by large language models and generative image models, our representations encode long-horizon causal information: linear interpolation in the representation space seems to correspond to planning.
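A simple way to probe this property is to interpolate between the start and goal representations and snap each intermediate point to the nearest dataset state. The sketch below is illustrative (the function name and setup are ours, not from the papers):

```python
import numpy as np

def interpolate_plan(z_start, z_goal, z_dataset, num_waypoints=5):
    """Linearly interpolate between the start and goal representations,
    snapping each intermediate point to its nearest neighbor among the
    representations of dataset states. The returned indices form a rough
    sequence of waypoints from start to goal."""
    plan = []
    for alpha in np.linspace(0.0, 1.0, num_waypoints):
        z = (1.0 - alpha) * z_start + alpha * z_goal
        plan.append(int(np.argmin(np.linalg.norm(z_dataset - z, axis=1))))
    return plan
```

If the representation space encodes temporal structure, the decoded waypoints trace out a plausible path between the two states.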


Second, we make a contribution to the algorithmic foundations of these methods by demonstrating how to perform counterfactual reasoning. This enables our method to determine what sorts of actions should be taken, even if we plan to act in a different way in the future. 

We illustrate this property with a simple didactic gridworld navigation task, where the actions move the agent up, down, left, and right. We collected a dataset of trajectories moving in diagonal and off-diagonal directions. Importantly, no trajectory goes from the top left to the bottom left. Our method successfully finds a path between the start and the goal, even though no single trajectory connected them in the training data.
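This "stitching" behavior can be sketched with a toy example. Our actual method learns values with temporal-difference (Bellman) backups rather than full-trajectory Monte Carlo targets; in the sketch below, plain breadth-first search over the dataset's transition graph plays the role of those backups, and the trajectories are ours for illustration:

```python
from collections import deque

def stitched_path(transitions, start, goal):
    """Search over the union of single-step transitions in the dataset.
    Chaining transitions from different trajectories lets us reach goals
    that no single trajectory reached; breadth-first search over the
    transition graph stands in for temporal-difference backups here."""
    graph = {}
    for s, s_next in transitions:
        graph.setdefault(s, []).append(s_next)
    frontier, parent = deque([start]), {start: None}
    while frontier:
        s = frontier.popleft()
        if s == goal:
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1]
        for nxt in graph.get(s, []):
            if nxt not in parent:
                parent[nxt] = s
                frontier.append(nxt)
    return None

# Two trajectories on a 3x3 grid; neither alone connects (0, 0) to (2, 0).
traj_a = [((0, 0), (0, 1)), ((0, 1), (1, 1)), ((1, 1), (1, 2))]
traj_b = [((2, 1), (1, 1)), ((1, 1), (1, 0)), ((1, 0), (2, 0))]
```

Because the two trajectories share the state (1, 1), combining their transitions yields a path from (0, 0) to (2, 0) that never appeared in the data.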


Empirically, this sort of counterfactual reasoning significantly boosts data efficiency, by up to 1500x in tabular settings. Experiments on continuous, high-dimensional simulated robotics tasks demonstrate that these ideas scale.


Summary

Our recent work provides some important tools for extending the capabilities of these unsupervised RL methods. Moving forward, these tools also provide new ideas for learning representations in language and vision tasks. We are excited about these methods because they have the potential to draw connections between all types of generative AI.

 


References

[1] Zheng, Chongyi, Ruslan Salakhutdinov, and Benjamin Eysenbach. "Contrastive Difference Predictive Coding." The Twelfth International Conference on Learning Representations. 2024. 

[2] Zheng, Chongyi, et al. "Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data." The Twelfth International Conference on Learning Representations. 2024.