Authors:

(1) Zhengkun Zhang, equal contribution; work done during an internship at Noah’s Ark Lab, Huawei Technologies;

(2) Wenya Guo, TKLNDST, CS, Nankai University, China;

(3) Xiaojun Meng, equal contribution, Noah’s Ark Lab, Huawei Technologies;

(4) Yasheng Wang, Noah’s Ark Lab, Huawei Technologies;

(5) Yadao Wang, Noah’s Ark Lab, Huawei Technologies;

(6) Xin Jiang, Noah’s Ark Lab, Huawei Technologies;

(7) Qun Liu, Noah’s Ark Lab, Huawei Technologies;

(8) Zhenglu Yang, TKLNDST, CS, Nankai University, China.

Abstract and 1. Introduction

  2. Related Work

  3. Preliminaries

  4. Proposed Method

  5. Experimental Setup

  6. Results and Analysis

  7. Discussion and Conclusion, and References

A. The Connection Between Prefix-tuning and Hypernetwork

B. Number of Tunable Parameters

C. Input-output formats

Abstract

The workflow of pretraining and fine-tuning has emerged as a popular paradigm for solving various NLP and V&L (Vision-and-Language) downstream tasks. With the capacity of pretrained models growing rapidly, parameter-efficient fine-tuning has become important for quick transfer learning and deployment. In this paper, we design a novel unified parameter-efficient transfer learning framework that works effectively on both pure language and V&L tasks. In particular, we use a shared hypernetwork that takes trainable hyper-embeddings as input and outputs weights for fine-tuning different small modules in a pretrained language model, such as the parameters inserted into multi-head attention blocks (i.e., prefix-tuning) and feedforward blocks (i.e., adapter-tuning). We define a set of embeddings (e.g., layer, block, task and visual embeddings) as the key components for calculating hyper-embeddings, which can thus support both pure language and V&L tasks. Our proposed framework adds fewer trainable parameters in multi-task learning while achieving superior performance and transfer ability compared to state-of-the-art methods. Empirical results on the GLUE benchmark and multiple V&L tasks confirm the effectiveness of our framework on both textual and visual modalities. [1]

1. Introduction

Pretraining followed by fine-tuning is now the prevalent paradigm in natural language processing, yielding state-of-the-art performance on a variety of downstream tasks (Devlin et al., 2019). With pre-trained language models (PLMs) growing rapidly in size, it becomes increasingly infeasible to perform conventional fine-tuning on all model parameters, i.e., full fine-tuning. Multi-task settings are even more time- and space-consuming if a separate replica of the model parameters is updated and saved for every task.

To mitigate these issues, there has recently been a line of research on Parameter-Efficient Language model Tuning (PELT). Several lightweight transfer learning methods have been proposed that update only a subset of the model parameters while freezing most of the remaining ones (Liu et al., 2021b). Extra trainable task-specific parameters can also be newly introduced into PLMs, as in the widely used adapter-tuning (Houlsby et al., 2019) and prefix-tuning (Li & Liang, 2021) methods. The former, adapter-tuning, adds new parameters between transformer layers, while the latter prepends tunable prefix vectors to the keys and values of multi-head attention at each layer. Although the number of parameters in the introduced adapter or prefix is far smaller than that of the original PLM, training these new parameters still requires a lot of resources due to the complex structure of PLMs.
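As a rough illustration of the two mechanisms just described, the following minimal sketch shows prefix vectors being prepended to the keys and values of a single attention head, and a bottleneck adapter with a residual connection. It is a toy in pure Python, not the implementation used in this paper: the function names (`attend`, `prefix_attend`, `adapter`), the single-query, single-head setting, and the ReLU bottleneck form are assumptions chosen for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (one head)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def prefix_attend(query, keys, values, prefix_keys, prefix_values):
    """Prefix-tuning sketch: trainable prefix vectors are prepended to the
    keys and values; the original keys and values are left untouched."""
    return attend(query, prefix_keys + keys, prefix_values + values)


def adapter(hidden, w_down, w_up):
    """Bottleneck adapter sketch (assumed form): down-project, ReLU,
    up-project, then add the input back as a residual.

    w_down: one row per bottleneck unit, each of length len(hidden).
    w_up:   one row per output unit, each of length len(w_down).
    """
    down = [max(0.0, sum(h * w for h, w in zip(hidden, row))) for row in w_down]
    up = [sum(d * w for d, w in zip(down, row)) for row in w_up]
    return [h + u for h, u in zip(hidden, up)]
```

In this sketch only `prefix_keys`, `prefix_values`, `w_down` and `w_up` would be trained; with an empty prefix or an all-zero `w_up`, both functions reduce to the behavior of the frozen model.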

Apart from traditional NLP tasks, fine-tuning language models pretrained on pure text corpora to perform various V&L tasks has emerged as a growing trend. Previous methods (e.g., VL-T5 from Cho et al. (2021)) often concatenate visual patch tokens and textual tokens as input to a pretrained language model (e.g., T5 from Raffel et al. (2020)), and then fine-tune the whole model on V&L tasks. This tuning towards vision-and-language has brought a noticeable improvement on V&L tasks (Cho et al., 2021). The key advantage is that language models with large capacity and strong semantic interpretation serve as a cornerstone for visual-language alignment and modelling in a wide range of V&L tasks.

Similarly, training all the parameters of PLMs to handle visual input is time-consuming. It is crucial to explore how a small number of trainable parameters can equip a language model with the ability to handle visual input and V&L tasks. Existing methods typically handle the visual input in a prompt-tuning form, prepending visual patch tokens (i.e., the visual prefix of Frozen in Tsimpoukelli et al. (2021)) to the textual sequence. To reduce the number of trainable parameters, VL-adapter (Sung et al., 2021) applies the adapter-tuning technique from NLP to the frozen VL-T5 model, which can match the performance of full fine-tuning.

Inspired by the recent progress of parameter-efficient tuning, we are motivated to build a unified transfer learning framework that supports both language and V&L models in tackling multiple tasks. We use a shared hypernetwork (Mahabadi et al., 2021) that takes multi-task and multi-modal information as input and generates weights for tuning different task-specific modules of PLMs in transfer learning. As shown in Figure 1, when fine-tuning on multiple tasks, only the shared hypernetwork and its input embedding (namely, the hyper-embedding), which consists of layer, block, task and visual embeddings, along with layer normalization, are trained. Such unified parameter-efficient tuning substantially reduces the number of trainable parameters.
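The idea of a shared hypernetwork that maps a hyper-embedding to module weights can be sketched as follows. This is only a toy under stated assumptions, not the architecture defined later in the paper: the class name `SharedHypernetwork`, the concatenate-then-project way of forming the hyper-embedding, the single linear generator, and the use of a zero vector as the visual embedding for text-only tasks are all hypothetical choices made for illustration.

```python
import random


def linear(vec, weight):
    """Multiply a vector by a weight matrix stored as a list of output rows."""
    return [sum(v * w for v, w in zip(vec, row)) for row in weight]


def make_matrix(rows, cols, rng):
    """Small random matrix standing in for trainable parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedHypernetwork:
    """One set of parameters shared by every task, layer and block."""

    def __init__(self, embed_dim, hyper_dim, d_model, bottleneck, seed=0):
        rng = random.Random(seed)
        self.embed_dim = embed_dim
        self.d_model = d_model
        self.bottleneck = bottleneck
        # Hypothetical projector: four concatenated embeddings -> hyper-embedding.
        self.projector = make_matrix(hyper_dim, 4 * embed_dim, rng)
        # Hypothetical generator: hyper-embedding -> flattened adapter weights.
        self.generator = make_matrix(bottleneck * d_model, hyper_dim, rng)

    def hyper_embedding(self, task_emb, layer_emb, block_emb, visual_emb=None):
        # Assumption: text-only tasks use a zero vector as the visual embedding.
        if visual_emb is None:
            visual_emb = [0.0] * self.embed_dim
        concat = task_emb + layer_emb + block_emb + visual_emb
        return linear(concat, self.projector)

    def adapter_down_weights(self, task_emb, layer_emb, block_emb, visual_emb=None):
        """Generate a (bottleneck x d_model) down-projection for one module."""
        hyper = self.hyper_embedding(task_emb, layer_emb, block_emb, visual_emb)
        flat = linear(hyper, self.generator)
        d = self.d_model
        return [flat[i * d:(i + 1) * d] for i in range(self.bottleneck)]
```

Under these assumptions, adding a new task only adds one new task embedding; the projector and generator are reused, which is the intuition behind the reduction in trainable parameters described above.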

We experiment with two task-specific modules that use the weights output by our hypernetwork: the prefix vectors of multi-head attention modules (Li & Liang, 2021) and task-specific adapters (Houlsby et al., 2019). Unlike previous methods, which use visual input in a prompt-tuning manner, we present a novel perspective of feeding visual input to the above prefix-tuning and adapter-tuning modules. Empirical results on the GLUE benchmark and multiple V&L tasks confirm the effectiveness of our unified framework.

In summary, we make the following contributions:

• We propose a unified parameter-efficient framework for vision and language transfer learning, which supports tuning both language and V&L models on multiple tasks.

• We present a novel method of leveraging the visual modality as input to a shared hypernetwork, which generates weights for prefix-tuning and adapter-tuning modules.

• We demonstrate that our framework scales more efficiently than prior work. Empirical results on the GLUE benchmark show the effectiveness of our proposed framework in multi-task learning, and empirical results on multiple vision-and-language tasks demonstrate its feasibility and usefulness in multi-modal transfer learning.

• We also perform extensive experiments on few-shot domain transfer in pure language and V&L scenarios, and the results show that the knowledge shared across tasks in our framework transfers positively to tasks in unseen domains.

This paper is available on arXiv under the CC BY 4.0 DEED license.

[1] We will release our code to facilitate future work.