Abstract and 1 Introduction

2 Preliminaries

3 TRANSIC: Sim-to-Real Policy Transfer by Learning from Online Correction and 3.1 Learning Base Policies in Simulation with RL

3.2 Learning Residual Policies from Online Correction

3.3 An Integrated Deployment Framework and 3.4 Implementation Details

4 Experiments

4.1 Experiment Settings

4.2 Quantitative Comparison on Four Assembly Tasks

4.3 Effectiveness in Addressing Different Sim-to-Real Gaps (Q4)

4.4 Scalability with Human Effort (Q5) and 4.5 Intriguing Properties and Emergent Behaviors (Q6)

5 Related Work

6 Conclusion and Limitations, Acknowledgments, and References

A. Simulation Training Details

B. Real-World Learning Details

C. Experiment Settings and Evaluation Details

D. Additional Experiment Results

4.2 Quantitative Comparison on Four Assembly Tasks

As shown in Fig. 4 and Table 1, TRANSIC achieves the best average performance and the best performance on each of the four tasks, with significant margins. We now compare the methods in detail and discuss the main findings.

TRANSIC is effective for sim-to-real transfer (Q1). It successfully achieves a high average success rate of 81% across four tasks on real robots. For the task Stabilize, it can achieve a 100% success rate while the best baseline method (BC Fine-Tune) only succeeds half the time. For challenging tasks that require precise and contact-rich manipulation, such as Insert and Screw, TRANSIC results in promising outcomes (45% and 85% success rates, respectively) while the Direct Transfer baseline never succeeds.

What are the reasons for successful transfer? We observe that adding real-world human correction data does not guarantee improvement. For example, among traditional sim-to-real methods, the best baseline, BC Fine-Tune, outperforms DR. & Data Aug. by 7%, while IQL Fine-Tune leads to worse performance. In contrast, TRANSIC effectively uses human correction data, boosting average performance by 124%. BC Fine-Tune’s improvement is presumably marginal because of the domain difference between simulation and reality, which cannot be easily bridged through naïve fine-tuning. Overall, TRANSIC not only achieves the best transfer performance, but also improves simulation policies the most among the various sim-to-real approaches.

TRANSIC can better incorporate human correction into the original learned policy (Q2). Due to the differences between human and robot behaviors [68], using real-world data to directly fine-tune a policy that has been largely trained on machine-generated trajectories can degrade performance. Specifically, TRANSIC outperforms interactive IL methods, including HG-DAgger [66] and IWR [67], by 75% on average. While both of them weight the intervention data more heavily during training (Sec. 2.2), we find that they tend to overwrite the original policy and lead to catastrophic forgetting. Therefore, in regions of the state space where no human intervention data exist, they behave suboptimally and suffer from out-of-distribution issues due to compounding errors. In contrast, by incorporating human correction through a separate residual policy and integrating the base and residual policies through gating, TRANSIC combines the best properties of both during deployment. It relies on the simulation policy for robust execution most of the time; when the base policy is likely to fail, it automatically applies the residual policy to prevent failures and correct mistakes.
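
To make the gating idea concrete, below is a minimal deployment sketch, not the authors' implementation: the function names, the returned gating score, and the threshold `GATE_THRESHOLD` are illustrative assumptions standing in for the learned gating mechanism described in Sec. 3.3.

```python
import numpy as np

GATE_THRESHOLD = 0.5  # illustrative cutoff; the actual gating rule is learned/tuned (assumption)


def deploy_step(obs, base_policy, residual_policy):
    """One gated control step (sketch): run the base policy, and apply the
    residual correction only when the gate predicts the base action will fail."""
    base_action = base_policy(obs)
    residual_action, gate = residual_policy(obs, base_action)
    if gate < GATE_THRESHOLD:
        # Base policy predicted to succeed: keep the robust simulation-trained behavior.
        return base_action
    # Base policy likely to fail in this state: add the learned human-correction residual.
    return base_action + residual_action


# Toy usage with stand-in policies (purely illustrative).
base = lambda obs: np.array([0.10, 0.00, -0.05])
residual = lambda obs, a: (np.array([0.02, 0.00, 0.00]), 0.8)
print(deploy_step(obs=None, base_policy=base, residual_policy=residual))
```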

TRANSIC requires significantly less real-world data (Q3). It only requires dozens of real-robot trajectories to achieve superior performance. In contrast, methods such as BC-RNN and IQL trained on such limited data suffer from overfitting and model collapse; TRANSIC achieves 3.6× better performance than they do. In fact, according to prior literature [68], as task complexity increases, methods that rely solely on real-world demonstration data may require exponentially more data. This result highlights the importance of training in simulation first and then leveraging sim-to-real transfer for robot learning practitioners.

Summary. We show that in sim-to-real transfer, a good base policy learned in simulation can be combined with limited real-world data to achieve success (Q3). However, effectively utilizing human correction data to address the sim-to-real gap is challenging (Q1), especially when we want to prevent catastrophic forgetting of the base policy (Q2).

4.3 Effectiveness in Addressing Different Sim-to-Real Gaps (Q4)

While TRANSIC is a holistic approach that addresses multiple sim-to-real gaps simultaneously, we also shed light on its ability to close each individual gap. To do so, we create five simulation-reality pairs, for each of which we intentionally introduce a large gap between simulation and the real world. These gaps are applied on the real-world side and include perception error, an underactuated controller, embodiment mismatch, dynamics difference, and object asset mismatch. Note that these are artificial settings for a controlled study. See Appendix Sec. C.2 for detailed setups.

As shown in Fig. 5, TRANSIC achieves an average success rate of 77% across the five simulation-reality pairs with deliberately exacerbated sim-to-real gaps, indicating a remarkable ability to close these individual gaps. In contrast, the best baseline method, IWR, only achieves an average success rate of 18%. We attribute this effectiveness to the residual policy design; our finding echoes Zeng et al. [83], who show that residual learning is an effective tool to compensate for domain factors that cannot be explicitly modeled. Furthermore, training with data specifically collected from a particular setting generally increases TRANSIC’s performance. This is not the case for IWR, where fine-tuning on new data can even lead to worse performance. These results show that TRANSIC is better not only at addressing multiple sim-to-real gaps as a whole, but also at handling individual gaps of very different nature.
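
As a rough illustration of why residual learning can absorb such unmodeled domain factors, the sketch below fits a residual policy to human corrections recorded during base-policy rollouts. The network architecture, batch keys, and loss are assumptions made for illustration, not the paper's exact training recipe (see Sec. 3.2).

```python
import torch
import torch.nn as nn


class ResidualPolicy(nn.Module):
    """Illustrative residual policy: predicts a corrective action and a gating
    logit from the observation and the base policy's proposed action."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.residual_head = nn.Linear(hidden, act_dim)  # corrective action
        self.gate_head = nn.Linear(hidden, 1)            # "intervention needed" logit

    def forward(self, obs, base_action):
        h = self.trunk(torch.cat([obs, base_action], dim=-1))
        return self.residual_head(h), self.gate_head(h)


def training_step(policy, batch, optimizer):
    """One supervised step on correction data (sketch): regress the human
    correction (corrected action minus base action) and classify whether an
    intervention happened at this state. Batch keys are illustrative."""
    residual_pred, gate_logit = policy(batch["obs"], batch["base_action"])
    target_residual = batch["corrected_action"] - batch["base_action"]
    loss = nn.functional.mse_loss(residual_pred, target_residual) \
        + nn.functional.binary_cross_entropy_with_logits(
            gate_logit.squeeze(-1), batch["intervened"].float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```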

Authors:

(1) Yunfan Jiang, Department of Computer Science;

(2) Chen Wang, Department of Computer Science;

(3) Ruohan Zhang, Department of Computer Science and Institute for Human-Centered AI (HAI);

(4) Jiajun Wu, Department of Computer Science and Institute for Human-Centered AI (HAI);

(5) Li Fei-Fei, Department of Computer Science and Institute for Human-Centered AI (HAI).


This paper is available on arXiv under a CC BY 4.0 DEED license.