Authors:
(1) Zhengkun Zhang (equal contribution; work done during an internship at Noah’s Ark Lab, Huawei Technologies);
(2) Wenya Guo, TKLNDST, CS, Nankai University, China ([email protected]);
(3) Xiaojun Meng (equal contribution), Noah’s Ark Lab, Huawei Technologies;
(4) Yasheng Wang, Noah’s Ark Lab, Huawei Technologies;
(5) Yadao Wang, Noah’s Ark Lab, Huawei Technologies;
(6) Xin Jiang, Noah’s Ark Lab, Huawei Technologies;
(7) Qun Liu, Noah’s Ark Lab, Huawei Technologies;
(8) Zhenglu Yang, TKLNDST, CS, Nankai University, China.
Table of Links
A. The Connection Between Prefix-tuning and Hypernetwork
B. Number of Tunable Parameters
Abstract
The workflow of pretraining and fine-tuning has emerged as a popular paradigm for solving various NLP and V&L (Vision-and-Language) downstream tasks. With the capacity of pretrained models growing rapidly, parameter-efficient fine-tuning has become crucial for quick transfer learning and deployment. In this paper, we design a novel unified parameter-efficient transfer learning framework that works effectively on both pure language and V&L tasks. In particular, we use a shared hypernetwork that takes trainable hyper-embeddings as input and outputs weights for fine-tuning different small modules in a pretrained language model, such as the parameters inserted into multi-head attention blocks (i.e., prefix-tuning) and feed-forward blocks (i.e., adapter-tuning). We define a set of embeddings (e.g., layer, block, task, and visual embeddings) as the key components for computing hyper-embeddings, which thus support both pure language and V&L tasks. Our proposed framework adds fewer trainable parameters in multi-task learning while achieving superior performance and transferability compared to state-of-the-art methods. Empirical results on the GLUE benchmark and multiple V&L tasks confirm the effectiveness of our framework on both textual and visual modalities. [1]
1. Introduction
Pretraining and fine-tuning are now the prevalent paradigm in natural language processing, yielding state-of-the-art performance on a variety of downstream tasks (Devlin et al., 2019). With pretrained language models (PLMs) growing rapidly in size, it becomes increasingly infeasible to perform conventional fine-tuning of all model parameters, i.e., full fine-tuning. It is even more time- and space-consuming in multi-task settings if a separate replica of the model parameters is updated and saved for each single task.
To mitigate these issues, one line of recent research focuses on Parameter-Efficient Language model Tuning (PELT). Several lightweight transfer learning methods have been proposed that update only a subset of model parameters while freezing most of the remaining ones (Liu et al., 2021b). Extra trainable task-specific parameters can also be introduced into PLMs, as in the widely used adapter-tuning (Houlsby et al., 2019) and prefix-tuning (Li & Liang, 2021) methods. The former adds new parameters between transformer layers, while the latter prepends tunable prefix vectors to the keys and values of multi-head attention at each layer. Although the adapter or prefix contains far fewer parameters than the original PLM, training these new parameters still requires substantial resources due to the complex structure of PLMs.
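To make the two techniques concrete, the following is a minimal NumPy sketch (not the paper's implementation; all dimensions, names, and initializations are illustrative) of a bottleneck adapter and of prefix vectors prepended to the keys and values of a single attention head:

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter (adapter-tuning): project the hidden state down,
    apply a nonlinearity, project back up, and add a residual connection.
    Only W_down and W_up are trained; the PLM itself stays frozen."""
    z = np.maximum(0.0, h @ W_down)      # down-projection + ReLU
    return h + z @ W_up                  # up-projection + residual

def prefixed_attention(q, K, V, P_k, P_v):
    """Prefix-tuning: prepend trainable prefix vectors P_k / P_v to the
    frozen keys and values before computing attention."""
    K_ext = np.concatenate([P_k, K], axis=0)
    V_ext = np.concatenate([P_v, V], axis=0)
    scores = (q @ K_ext.T) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over real + prefix keys
    return w @ V_ext

# Illustrative sizes: hidden size 8, adapter bottleneck 2, 2 prefix vectors.
rng = np.random.default_rng(0)
d_model, d_bottleneck, n_prefix = 8, 2, 2
h = rng.standard_normal((1, d_model))
W_down = 0.1 * rng.standard_normal((d_model, d_bottleneck))
W_up = 0.1 * rng.standard_normal((d_bottleneck, d_model))
q = rng.standard_normal((1, d_model))
K = rng.standard_normal((5, d_model))
V = rng.standard_normal((5, d_model))
P_k = rng.standard_normal((n_prefix, d_model))
P_v = rng.standard_normal((n_prefix, d_model))

print(adapter(h, W_down, W_up).shape)               # (1, 8)
print(prefixed_attention(q, K, V, P_k, P_v).shape)  # (1, 8)
```

Both modules leave the backbone weights untouched: only the small matrices (`W_down`, `W_up`) and the prefix vectors (`P_k`, `P_v`) would be updated during fine-tuning.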
Apart from traditional NLP tasks, fine-tuning language models pretrained on pure text corpora to perform various V&L tasks has emerged as a growing trend. Previous methods (e.g., VL-T5 from Cho et al. (2021)) often concatenate visual patch tokens and textual tokens as input to a pretrained language model (e.g., T5 from Raffel et al. (2020)), and then fine-tune the whole model on V&L tasks. This tuning towards vision-and-language has achieved noticeable improvements on V&L tasks (Cho et al., 2021). The key advantage is that language models, with their large capacity and semantic interpretation, serve as a cornerstone that benefits visual-language alignment and modeling in a wide range of V&L tasks.
Similarly, training all the parameters of a PLM to handle visual input is time-consuming. It is therefore crucial to explore how a small number of trainable parameters can equip a language model with the ability to handle visual input and V&L tasks. Existing methods typically handle the visual input in a prompt-tuning form, prepending visual patch tokens (i.e., the visual prefix of Frozen in Tsimpoukelli et al. (2021)) to the textual sequence. To reduce trainable parameters, VL-adapter (Sung et al., 2021) adapts the adapter-tuning technique from NLP to the frozen VL-T5 model, and can match the performance of full fine-tuning.
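As an illustration of this prompt-tuning-style handling of visual input, the following NumPy sketch (hypothetical; all names and dimensions are illustrative) projects visual patch features into a language model's embedding space and prepends them to the textual token embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_visual = 8, 16            # illustrative sizes

text_tokens = rng.standard_normal((5, d_model))   # 5 textual token embeddings
patch_feats = rng.standard_normal((4, d_visual))  # 4 visual patch features

# A trainable projection maps visual features into the PLM's embedding
# space; the projected patches form a "visual prefix" for the text.
W_proj = 0.1 * rng.standard_normal((d_visual, d_model))
visual_prefix = patch_feats @ W_proj

# The concatenated sequence is fed to the (frozen) language model.
inputs = np.concatenate([visual_prefix, text_tokens], axis=0)
print(inputs.shape)  # (9, 8): 4 visual tokens + 5 text tokens
```

Only the projection `W_proj` needs training here; the language model consumes the lengthened sequence exactly as if every position were a text token.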
Inspired by recent progress in parameter-efficient tuning, we are motivated to build a unified transfer learning framework that supports both language and V&L models in tackling multiple tasks. We use a shared hypernetwork (Mahabadi et al., 2021) that takes multi-task and multi-modal information as input and generates weights for tuning different task-specific modules of PLMs in transfer learning. As shown in Figure 1, when fine-tuning on multiple tasks, only the shared hypernetwork and its input embedding (namely, the hyper-embedding, consisting of layer, block, task, and visual embeddings), along with layer normalization, are trained. Such unified parameter-efficient tuning greatly reduces the number of trainable parameters.
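A minimal sketch of this idea, assuming the hyper-embedding is formed by summing the component embeddings and the shared hypernetwork is a single linear map (the paper's actual architecture and combination rule may differ; every name and size below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d_emb, d_model, d_bottleneck = 4, 8, 2

# Trainable hyper-embeddings (one entry per layer / block type / task):
layer_emb = rng.standard_normal((6, d_emb))   # per transformer layer
block_emb = rng.standard_normal((2, d_emb))   # e.g., attention vs. feed-forward
task_emb  = rng.standard_normal((3, d_emb))   # per task
vis_emb   = rng.standard_normal(d_emb)        # pooled, projected visual feature

# Shared hypernetwork: one linear map emitting flattened adapter weights.
n_out = d_model * d_bottleneck * 2
W_hyper = 0.1 * rng.standard_normal((d_emb, n_out))

def generate_adapter_weights(layer, block, task, visual=None):
    """Combine the component embeddings into one hyper-embedding, then let
    the shared hypernetwork emit that position's adapter weights."""
    e = layer_emb[layer] + block_emb[block] + task_emb[task]
    if visual is not None:          # V&L tasks add visual information
        e = e + visual
    flat = e @ W_hyper
    W_down = flat[: d_model * d_bottleneck].reshape(d_model, d_bottleneck)
    W_up = flat[d_model * d_bottleneck :].reshape(d_bottleneck, d_model)
    return W_down, W_up

W_down, W_up = generate_adapter_weights(layer=0, block=1, task=2, visual=vis_emb)
print(W_down.shape, W_up.shape)  # (8, 2) (2, 8)
```

The point of this structure is that only the embeddings and `W_hyper` are trained, yet every (layer, block, task, modality) combination receives its own generated weights, so parameter count no longer grows linearly with the number of tasks.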
We experiment with two task-specific modules that consume the weights output by our hypernetwork: the prefix vectors in multi-head attention modules (Li & Liang, 2021) and task-specific adapters (Houlsby et al., 2019). Unlike previous methods that use visual input in a prompt-tuning manner, we present a novel perspective of feeding visual input into the above prefix-tuning and adapter-tuning modules. Empirical results on the GLUE benchmark and multiple V&L tasks confirm the effectiveness of our unified framework.
In summary, we make the following contributions:
• We propose a unified parameter-efficient framework for vision and language transfer learning, which supports tuning both language and V&L models on multiple tasks.
• We present a novel method of leveraging visual modality as input for a shared hypernetwork, which generates weights for prefix-tuning and adapter-tuning modules.
• We demonstrate that our framework scales more efficiently than prior work. Empirical results on the GLUE benchmark show the effectiveness of our proposed framework in multi-task learning, and empirical results on multiple vision-and-language tasks demonstrate its feasibility and usefulness in multi-modal transfer learning.
• We also perform extensive experiments on few-shot domain transfer in pure language and V&L scenarios; the results reveal that the shared knowledge learned across multiple tasks in our framework transfers positively to unseen-domain tasks.
[1] We will release our code to facilitate future work.