2 Preliminaries
2.1 PELT Methods without Additional Parameters
PLMs can be used as feature extractors where only the top layers or prediction head are fine-tuned without additional parameters (Lee et al., 2019). However, such fine-tuning approaches generally lead to degenerate model performance that is much worse than fine-tuning all parameters (Lee et al., 2019; Pfeiffer et al., 2021). A recent method BitFit (Ben Zaken et al., 2021) only tunes the bias terms of the PLM and is shown to achieve performance comparable to fine-tuning on certain tasks when training data is limited. Therefore, we select BitFit as the representative of this category for analysis.
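As a rough illustration of bias-only tuning in the spirit of BitFit (a minimal sketch, not the authors' implementation), the snippet below freezes everything in a pretrained model except its bias terms and the task head; the model checkpoint and the head name "classifier" are illustrative assumptions.

```python
from transformers import AutoModelForSequenceClassification

# Illustrative sketch: train only bias terms plus the task-specific head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

for name, param in model.named_parameters():
    # Keep bias terms and the classification head trainable; freeze the rest.
    param.requires_grad = name.endswith(".bias") or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")
```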
2.2 PELT Methods with Additional Parameters
Alternatively, one may fix the entire PLM and introduce a small number of new trainable parameters. Notable examples in this category include adapter (Houlsby et al., 2019) and its extensions (Pfeiffer et al., 2021; Karimi Mahabadi et al., 2021b), prefix-tuning (Li and Liang, 2021) and its extensions (Lester et al., 2021), and additive methods (Guo et al., 2021; Hu et al., 2021).
Next, we will briefly describe these methods to facilitate the introduction of our proposed framework. An illustration is shown in Fig. 1 for better understanding.
Adapter. Adapter (Houlsby et al., 2019) adds a trainable bottleneck layer after the feedforward network in each Transformer layer of the PLM. A bottleneck layer consists of a down+up projection pair that shrinks and then recovers the size of the token hidden states. Mathematically, if we denote the output of the feedforward network after residual connection and layer normalization as h_FN, with hidden size D_hidden and bottleneck size D_mid, then the output of a bottleneck layer h_A is:

h_A = W_up φ(W_down h_FN),

where W_down ∈ R^(D_mid × D_hidden) and W_up ∈ R^(D_hidden × D_mid) are the down- and up-projection matrices and φ is a nonlinear activation.

Adapter has been shown to be on par with fine-tuning and sometimes exhibits better effectiveness in the low-resource setting (He et al., 2021). Later studies extend adapter to multi-lingual (Pfeiffer et al., 2020b) and multi-task (Karimi Mahabadi et al., 2021b) settings, or further reduce its trainable parameters (Karimi Mahabadi et al., 2021a); these variants can be easily incorporated into UNIPELT as replacements for the vanilla adapter.
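A minimal PyTorch sketch of such a bottleneck layer is given below (illustrative only, not the authors' implementation); the activation function, the default sizes, and the residual connection around the bottleneck are common choices assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative bottleneck adapter in the style of Houlsby et al. (2019)."""

    def __init__(self, d_hidden: int = 768, d_mid: int = 48):
        super().__init__()
        self.down = nn.Linear(d_hidden, d_mid)  # shrink: D_hidden -> D_mid
        self.up = nn.Linear(d_mid, d_hidden)    # recover: D_mid -> D_hidden
        self.act = nn.GELU()                    # assumed nonlinearity

    def forward(self, h_fn: torch.Tensor) -> torch.Tensor:
        # h_fn: feedforward-network output, shape (batch, seq_len, D_hidden)
        h_a = self.up(self.act(self.down(h_fn)))
        # Residual connection around the bottleneck (common in adapter variants).
        return h_a + h_fn
```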
Prefix-tuning. Prefix-tuning (Li and Liang, 2021) prepends trainable prefix vectors to the keys and values of the multi-head attention in each layer of the PLM. Prefix-tuning is originally evaluated on natural language generation tasks, and we adapt it to understanding tasks. A follow-up method named prompt-tuning (Lester et al., 2021) further reduces task-specific parameters by limiting the prefix to the first layer, but it only performs competitively with very large model sizes (billions of total parameters) and is thus not considered in our study. Note that prefix-tuning (or prompt-tuning) is different from prompt-based fine-tuning methods (Schick and Schütze, 2021; Gao et al., 2021) (see App. A for specific differences).
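For intuition, the sketch below shows single-head attention with trainable prefix vectors prepended to the keys and values. It is a deliberate simplification (no multi-head splitting or reparameterization of the prefix), and the prefix length and dimensions are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixAttention(nn.Module):
    """Illustrative single-head attention with trainable prefix keys/values."""

    def __init__(self, d_model: int = 768, prefix_len: int = 10):
        super().__init__()
        # Projections of the pretrained model (kept frozen in practice).
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Only these prefix vectors would be trained.
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, d_model))
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b = x.size(0)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Prepend the prefix so every query position can attend to it.
        k = torch.cat([self.prefix_k.expand(b, -1, -1), k], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), v], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5), dim=-1)
        return attn @ v
```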
3 Unifying PELT Methods
3.1 Task Formulation
3.2 Proposed Method
Motivation & Intuition. During the analysis of individual PELT methods, we observe that different PELT methods exhibit diverse characteristics and perform rather differently on the same task. For example, prefix-tuning generally performs well on natural language inference tasks regardless of the size of training data. Also, as can be seen in Fig. 1 and Sec. 2, different PELT methods often involve different parts of the PLM architecture (e.g., before multi-head attention for prefix-tuning and after feedforward layer for adapter), making it feasible to combine multiple PELT methods without (directly) interfering with each other.
In light of the two observations above, we propose a unified PELT framework, UNIPELT, which takes a hybrid approach by incorporating multiple PELT methods as submodules. At a high level, UNIPELT improves over single PELT methods due to two factors. First, UNIPELT learns to activate (upweight) the submodules that best suit the current task or specific data sample and deactivate (downweight) the rest. Second, we find that UNIPELT generally performs better than taking the best performance of all its submodules used individually on each task, suggesting that there could be some compounding effects that lead to better model effectiveness when multiple PELT methods (that modify different parts of the PLM) are used.
Next, we will introduce how different PELT methods can be incorporated into UNIPELT via a gating mechanism.
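As a rough, simplified sketch of this idea (not the paper's exact gating formulation), the snippet below wraps a single submodule with a per-token sigmoid gate computed from the hidden states, so the submodule's contribution can be upweighted or downweighted per input; the interpolation form and gate granularity are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class GatedSubmodule(nn.Module):
    """Illustrative gating of one PELT submodule (e.g., an adapter)."""

    def __init__(self, submodule: nn.Module, d_hidden: int = 768):
        super().__init__()
        self.submodule = submodule          # e.g., the BottleneckAdapter above
        self.gate = nn.Linear(d_hidden, 1)  # tiny trainable gate

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_hidden); gate value in (0, 1) per token
        g = torch.sigmoid(self.gate(h))
        # Interpolate between the original hidden states and the submodule output,
        # effectively activating or deactivating the submodule for this input.
        return g * self.submodule(h) + (1 - g) * h
```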
Despite the seeming simplicity of UNIPELT, we note that it is nontrivial for a unified approach to work well under different scenarios. Naively combining different PELT methods as a hybrid approach could lead to mixed or worse performance than using individual methods, as observed in both our experiments and prior studies (Hu et al., 2021).
Authors:
(1) Yuning Mao, University of Illinois Urbana-Champaign (work done during an internship at Meta AI);
(2) Lambert Mathias, Meta AI;
(3) Rui Hou, Meta AI;
(4) Amjad Almahairi, Meta AI;
(5) Hao Ma, Meta AI;
(6) Jiawei Han, University of Illinois Urbana-Champaign;
(7) Wen-tau Yih, Meta AI;
(8) Madian Khabsa, Meta AI.
This paper is available on arXiv.
[3] Unlike adapter or LoRA, the effect of prefix-tuning cannot be fully eliminated, due to the softmax operation in multi-head attention.