Learning domain-adaptive policies that generalize to unseen transition dynamics remains a fundamental challenge in learning-based control. Substantial progress has been made through domain representation learning, which captures domain-specific information and thereby enables domain-aware decision making. We analyze the process of learning domain representations through dynamical prediction and find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such a mixture can confuse the conditioned policy, thereby constraining zero-shot adaptation.
To tackle this challenge, we propose DADP (Domain Adaptive Diffusion Policy), which achieves robust adaptation through unsupervised disentanglement and domain-aware diffusion injection. First, we introduce Lagged Context Dynamical Prediction, a strategy that conditions future state estimation on a historically offset context; by increasing this temporal gap, we unsupervisedly disentangle static domain representations by filtering out transient properties. Second, we integrate the learned domain representations directly into the generative process by biasing the prior distribution and reformulating the diffusion target. Extensive experiments on challenging benchmarks across locomotion and manipulation demonstrate the superior performance and generalizability of DADP over prior methods.
First, we identify an important issue in domain representation learning through dynamical prediction. We find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such a mixture can confuse the conditioned policy, thereby constraining zero-shot adaptation.
To disentangle time-varying properties from the unsupervisedly learned representation, we introduce Lagged Context Dynamical Prediction. Specifically, we break the temporal correlation between the context and the current step by introducing a large historical offset $\Delta t$ (in the limit $\Delta t \rightarrow \infty$) between the context and the current timestep, preventing time-varying information in the context from assisting dynamical prediction and thereby excluding it from the extracted representation during learning.
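Below is a minimal, self-contained sketch of this training objective. The module names (`ContextEncoder`, `DynamicsPredictor`), the MLP architectures, the context window length, and the tensor layout are illustrative assumptions rather than the exact DADP implementation; the only essential point is that the encoded context ends $\Delta t$ steps before the step being predicted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    """Encodes a lagged context segment of (s, a, s') tuples into a domain representation z."""
    def __init__(self, obs_dim, act_dim, ctx_len, z_dim):
        super().__init__()
        in_dim = ctx_len * (2 * obs_dim + act_dim)
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))

    def forward(self, ctx):                       # ctx: [B, ctx_len, 2*obs_dim + act_dim]
        return self.net(ctx.flatten(1))

class DynamicsPredictor(nn.Module):
    """Predicts s_{t+1} from (s_t, a_t), conditioned on the domain representation z."""
    def __init__(self, obs_dim, act_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim + z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, obs_dim))

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))

def lagged_prediction_loss(encoder, predictor, obs, act, delta_t, ctx_len=8):
    """obs: [B, T, obs_dim], act: [B, T, act_dim], assuming T >= delta_t + ctx_len + 2.
    The context window ends delta_t steps BEFORE the current step t, so only static,
    domain-level information in the context can help the prediction; time-varying
    details are filtered out of z during training."""
    t = obs.shape[1] - 2                          # current step index
    ctx_end = t - delta_t                         # lagged context boundary
    ctx = torch.cat([obs[:, ctx_end - ctx_len:ctx_end],                   # s
                     act[:, ctx_end - ctx_len:ctx_end],                   # a
                     obs[:, ctx_end - ctx_len + 1:ctx_end + 1]], dim=-1)  # s'
    z = encoder(ctx)
    pred_next = predictor(obs[:, t], act[:, t], z)
    return F.mse_loss(pred_next, obs[:, t + 1])
```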
We also provide t-SNE visualizations of the representations learned for Walker2d and HalfCheetah with different $\Delta t$ below. As $\Delta t$ increases, the representations from different domains become increasingly separated into distinct clusters. Compared with previous methods based on contrastive learning, our approach achieves strong embedding quality from a simple dynamical prediction objective, without any extra data-generation process.
[Figure: t-SNE embeddings — Unsupervised Representation vs. Supervised Representation, for Walker2d and HalfCheetah.]
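For reference, a t-SNE plot of this kind can be produced directly from the learned encoder. The sketch below assumes the `ContextEncoder` from the previous snippet and a hypothetical dictionary of per-domain context batches; it is an illustration of the visualization step, not the exact plotting code used for the figure.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_domain_embeddings(encoder, contexts_by_domain, out_path="tsne_domains.png"):
    """contexts_by_domain: dict mapping a domain name to a batch of lagged
    context segments shaped like the encoder input above."""
    feats, labels = [], []
    for name, ctx in contexts_by_domain.items():
        z = encoder(ctx).cpu().numpy()            # domain representations for this domain
        feats.append(z)
        labels += [name] * len(z)
    feats = np.concatenate(feats, axis=0)

    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)  # project to 2D

    for name in contexts_by_domain:
        idx = [i for i, l in enumerate(labels) if l == name]
        plt.scatter(emb[idx, 0], emb[idx, 1], s=5, label=name)
    plt.legend()
    plt.savefig(out_path, dpi=200)
```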
We also provide quantitative results for the learned representations. As $\Delta t$ increases, both representation quality metrics consistently improve, eventually becoming comparable to those of supervised representations, which are trained with access to ground-truth domain labels.
Second, we inject the learned representations directly into the generative process. Specifically, we start denoising from a representation-biased Gaussian mixture distribution and reformulate the diffusion target to include the learned representation.
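The exact DADP formulation is not reproduced here; the sketch below only illustrates the two ingredients named above under simplifying assumptions. It biases the mean of a single Gaussian prior by a projection of the domain representation $z$ (rather than a full mixture) and folds that bias into the regression target of an $\epsilon$-prediction diffusion model. All module names and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAwareDiffusion(nn.Module):
    """Simplified domain-aware diffusion head over action vectors."""
    def __init__(self, act_dim, z_dim, n_steps=50):
        super().__init__()
        self.proj = nn.Linear(z_dim, act_dim)           # maps z to a prior bias
        self.denoiser = nn.Sequential(                  # epsilon-style predictor
            nn.Linear(act_dim + z_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )
        betas = torch.linspace(1e-4, 2e-2, n_steps)
        self.register_buffer("alphas_bar", torch.cumprod(1.0 - betas, dim=0))
        self.n_steps = n_steps

    def training_loss(self, a0, z):
        """Forward-diffuse the clean action a0 toward a z-biased prior and
        regress the z-shifted noise target."""
        b = a0.shape[0]
        t = torch.randint(0, self.n_steps, (b,), device=a0.device)
        ab = self.alphas_bar[t].unsqueeze(-1)
        mu = self.proj(z)                               # representation-biased prior mean
        eps = torch.randn_like(a0)
        a_t = ab.sqrt() * a0 + (1 - ab).sqrt() * (eps + mu)  # drifts toward N(mu, I), not N(0, I)
        target = eps + mu                               # diffusion target includes z via mu
        t_feat = t.unsqueeze(-1).float() / self.n_steps
        pred = self.denoiser(torch.cat([a_t, z, t_feat], dim=-1))
        return F.mse_loss(pred, target)

    def sample_prior(self, z):
        """Start reverse denoising from the representation-biased prior N(proj(z), I)."""
        mu = self.proj(z)
        return mu + torch.randn_like(mu)
```

At inference time, `sample_prior` starts the reverse process near the target domain's representation before any denoising step is taken, which is the intuition behind biasing the prior rather than conditioning alone.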
We visualize the trajectories generated by different policies on a specific Walker2d domain, together with their corresponding embeddings. The green and gray areas indicate the embedding distributions of the target domain and of the other domains present in the dataset, respectively.
As shown in the videos below, the Conditional Policy often fails to generate stable rollouts, as it cannot effectively utilize the domain representation to locate the target domain. The MixedDDIM Policy shows improved stability and better locating capability, but still struggles to maintain consistent performance throughout the rollout. In contrast, our DADP Policy consistently generates stable and effective behaviours, demonstrating its superior ability to leverage the learned domain representations for robust domain-aware decision-making.
We compare DADP with prior methods and ablated variants on multiple challenging benchmarks across locomotion and manipulation; DADP achieves superior performance and generalizability on nearly all tasks.