Why PEFT Beats FFT in Active Learning: Forgetting Dynamics and Stable Representations


Abstract and 1. Introduction

  2. Related Work
  3. Preliminaries
  4. Experiments
  5. Analysis
  6. Conclusion, Limitations, and References

A. Reproducibility

5 Analysis

In Section 4, we demonstrated that PEFT exhibits larger gains than FFT when combined with AL in low-resource settings, and that this advantage is accompanied by superior performance with passive learning. To better understand why PEFT behaves better with limited data, we now examine two specific properties of adapter-based and FFT models. First, we analyze the influence of TAPT on the forgetting dynamics during training. We then turn to an example-level representation analysis, in which we investigate how similar the representations of PEFT and FFT models remain to those of their respective base models.


5.1 Forgetting dynamics

We employ forgetting dynamics to compare the learning stability of PEFT and FFT and their impact on AL data selection. The underlying hypothesis is that fewer forgetting events in adapters would indicate a more stable and effective learning process. In utilizing forgetting dynamics, we draw upon the study by Toneva et al. (2019), focusing on the occurrence of forgetting events — cases where a specific training example transitions from correct to incorrect classification over the course of multiple learning epochs. More specifically, we divide the instances into three categories: (1) unforgettable instances, i.e., those that never experienced a forgetting event during training; (2) instances that encountered one or two forgetting events, labeled moderately forgettable; and (3) instances subjected to three or more forgetting events, referred to as highly forgettable. As pointed out in the original study, moderately forgettable, ambiguous instances are more valuable for the learning model than unforgettable, easy instances. However, it is worth noting that AL is often hindered by examples that are too hard or impossible to learn (Karamcheti et al., 2021), which roughly correspond to the highly forgettable examples.
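
To make the categorization concrete, the following minimal sketch shows one way to count forgetting events from per-epoch correctness records. It assumes a boolean matrix of per-example correctness is available; the function and variable names are illustrative, and, unlike Toneva et al. (2019), it does not treat examples that are never learned as a separate case.

```python
import numpy as np

def count_forgetting_events(correct: np.ndarray) -> np.ndarray:
    """Count forgetting events per training example.

    correct: boolean array of shape (n_epochs, n_examples); entry (t, i)
    indicates whether example i was classified correctly after epoch t.
    A forgetting event is a correct -> incorrect transition between
    consecutive epochs.
    """
    transitions = correct[:-1] & ~correct[1:]
    return transitions.sum(axis=0)

def categorize_examples(correct: np.ndarray) -> dict:
    """Split examples into the three forgetting categories described above."""
    events = count_forgetting_events(correct)
    return {
        "unforgettable": np.where(events == 0)[0],
        "moderately_forgettable": np.where((events >= 1) & (events <= 2))[0],
        "highly_forgettable": np.where(events >= 3)[0],
    }
```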

Figure 4 shows the distribution of instances across the three categories of forgetting events for SUBJ and TREC datasets. We focus on these two datasets as examples of a simple binary classification task and a more complex multi-class classification task, respectively. Specifically, we compare RND with MC, which achieves consistent performance improvements across all datasets. Our findings suggest that FFT tends to select a higher number of unforgettable instances and fewer moderately forgettable instances when compared to adapters. Interestingly, the adapters that perform best — Prefix-tuning and UniPELT — appear to favor moderately forgettable instances. However, when TAPT is applied, the discrepancies in forgetting profiles between FFT and the top two adapters, Prefix-tuning and UniPELT, seem to diminish. In contrast, TAPT amplifies the differences between FFT and the other two adapters, LoRA and Adapter, which typically show smaller improvements than Prefix-tuning and UniPELT. Given their superior AL performance, we hypothesize that the forgetting profiles of Prefix-tuning and UniPELT are more favorable compared to other adapters. Moreover, FFT with TAPT approaches the performance of the superior adapters and simultaneously develops a forgetting profile similar to theirs.


5.2 Representation analysis

To bolster our findings, we explore the representations of adapters and FFT models. As suggested in previous research (He et al., 2021; Li and Liang, 2021; Mao et al., 2022), adapters often display greater stability in terms of loss, especially in scenarios with limited resources. Our aim is to examine the stability of their representations and their relationship with overall AL performance.

We draw inspiration from research by Stephenson et al. (2021) and Baldock et al. (2021), which suggests that different layers of networks specialize in different features — earlier layers tend to acquire more generalized knowledge, while deeper layers are more focused on task-specific information. This leads us to a layerwise examination of similarity. To analyze the effect of PEFT and FFT on AL selection with respect to their layerwise similarity to the base model, we utilize centered kernel alignment (CKA) as a similarity measure between two sets of representations (Kornblith et al., 2019). It has been shown that PEFT methods result in representations closer to the base model at the token level (He et al., 2021). We extend the analysis to example-level representations to explore the behavior of models with AL. We opt for CKA because it is invariant to invertible linear transformations and can still measure meaningful similarities when the representation dimensionality exceeds the number of data points. This stands in contrast to other metrics, which frequently falter when dealing with high-dimensional representations.
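
For reference, the linear variant of CKA can be computed directly from two example-level representation matrices. The sketch below is a minimal implementation of linear CKA (Kornblith et al., 2019) under the assumption that representations of the same examples are stacked row-wise; it is an illustration rather than a description of our exact implementation.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices.

    X: (n_examples, d1) and Y: (n_examples, d2) hold example-level
    representations of the same examples from two models (or layers).
    """
    X = X - X.mean(axis=0, keepdims=True)  # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return float(numerator / denominator)
```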

Figure 4: Forgetting dynamics for random sampling (passive learning) and AL with MC without and with TAPT on SUBJ and TREC. The x-axis shows the number of instances in each of the forgetting categories: the “never” category representing unforgettable instances, moderately forgettable instances, and highly forgettable instances.

Figure 5: Layerwise difference in representation similarity for the UniPELT adapter and the FFT model on SUBJ. We observed similar patterns in other adapters and datasets we used (cf. Appendix B). Warm colors (positive values) illustrate layer pairs that demonstrate higher similarity to the base model with the adapter than with FFT. Conversely, cool colors (negative values) represent layer pairs that are more similar to the base model when using the FFT model. Best viewed on a computer screen.

For a more direct comparison between PEFT and FFT, we analyze the differences between their respective similarities to their base models. Specifically, we compute the difference CKA(adapter, base) − CKA(FFT, base), where base denotes the pre-trained model from which the adapter and FFT models are fine-tuned. We hypothesize that the superior performance of PEFT with AL compared to FFT will be accompanied by early-layer representations that remain more similar to the base model. Figure 5 visualizes the layerwise difference in similarity between the base model and the adapter model and between the base model and the FFT model. We find that PEFT representations are more similar to the base model in the early and middle layers when compared to FFT. This holds for all AL methods, with differences more pronounced than in passive learning. Specifically, up to the eighth layer, representations are much more similar to the base model in adapters than in FFT models. In the final four layers, the difference in CKA scores between the adapter and the FFT model is close to zero. Interestingly, the penultimate layer of the FFT model is more similar to the base model than that of the adapter.
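
The difference map in Figure 5 can be assembled layer by layer. The following sketch reuses the linear_cka function from the sketch above and assumes that example-level representations have already been extracted for every layer of the adapter, FFT, and base models; it is illustrative rather than our exact analysis code.

```python
import numpy as np

def layerwise_cka_difference(adapter_layers, fft_layers, base_layers):
    """CKA(adapter, base) - CKA(FFT, base) for every pair of layers.

    Each argument is a list of (n_examples, hidden_dim) arrays, one per
    Transformer layer; linear_cka is the function sketched earlier.
    """
    n = len(base_layers)
    diff = np.zeros((n, n))
    for i in range(n):      # layer of the fine-tuned model
        for j in range(n):  # layer of the base model
            diff[i, j] = (linear_cka(adapter_layers[i], base_layers[j])
                          - linear_cka(fft_layers[i], base_layers[j]))
    # Positive entries: adapter representations closer to the base model;
    # negative entries: FFT representations closer to the base model.
    return diff
```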

When fine-tuning on a downstream task, we believe that the increased stability of PEFT in earlier layers, relative to FFT, is instrumental in retaining the foundational knowledge from the PLM’s pretraining phase. Conversely, PEFT exhibits more substantial transformations in the latter, more task-specific layers. This ensures the preservation of essential pre-trained knowledge while allowing for task-relevant flexibility. We speculate that this strategic balance in PEFT influences its propensity to select moderately forgettable instances when combined with AL, contributing to its enhanced performance over FFT. These instances are neither too trivial to provide no learning value, nor are they too complex to risk misinterpretation, thereby enhancing the effectiveness of learning.


6 Conclusion

Our study has shed light on the advantages of parameter-efficient fine-tuning (PEFT) in low-resource settings, confirming its superiority over full fine-tuning (FFT) methods. Importantly, we have demonstrated that the integration of PEFT with active learning (AL) can offer substantial performance gains compared to passive learning, even in settings where labeled data is scarce. Furthermore, we highlighted the potential of task-adaptive pre-training (TAPT) to improve model performance further when used in conjunction with both PEFT and AL. We found that AL methods, in combination with PEFT, tend to select fewer unforgettable instances and more moderately forgettable examples. We further found that PEFT maintains the integrity of early and middle layer representations similar to the base model. We conjecture that this property mitigates forgetting during downstream task fine-tuning. These insights inform us of a possible underpinning mechanism that contributes to PEFT’s superior performance and stability in low-resource settings. Overall, our work highlights the potential of PEFT and AL and establishes a foundation for developing increasingly efficient and cost-effective approaches for training models in low-resource settings.


Limitations

While our study advances the understanding of PEFT and AL’s interaction in low-resource settings and uncovers intriguing insights about the forgetting dynamics during fine-tuning, it has a number of limitations.

To begin with, we have focused on text classification tasks, which are but one aspect of the wide range of potential applications for PLMs. Different tasks, such as question answering, translation, or summarization, might exhibit different behaviors under the same conditions. Consequently, the observed advantages of PEFT in the context of AL might not necessarily translate to other NLP tasks.

Next, our results are limited to the specific PLMs, AL strategies, and PEFT methods we have examined in this study. While we have attempted to be comprehensive in our experiments, the outcomes might vary with different models, strategies, or methods. For example, the effectiveness of AL combined with PEFT might differ if other AL strategies are employed. Similarly, different types of adapter architectures could potentially lead to different results.

Although we found that PEFT methods produce instance-level representations of early and middle layers more similar to the base PLM than FFT, a comprehensive understanding of how and why this similarity leads to increased stability and performance in low-resource settings is still lacking. Our hypothesis about the role of early and middle layer stability in mitigating the issue of forgetting the knowledge obtained during pre-training needs further substantiation.

Finally, it is important to acknowledge the complexity and multifaceted nature of forgetting dynamics. While our investigation provides valuable insights about the interaction of forgetting with PEFT and TAPT in AL scenarios, a deeper understanding of the mechanisms of forgetting in the context of large PLMs is needed. In particular, it would be interesting to investigate whether the balance between unforgettable and moderately forgettable instances selected by the AL methods changes as the size of the model or the amount of available data changes.

Future work should aim to address these limitations and further explore the mechanisms behind the promising results obtained with the combination of PEFT and AL. This will contribute to a more comprehensive understanding of the interaction between AL and PLMs, and help refine strategies for efficient fine-tuning in low-resource settings.


References

Alan Ansell, Edoardo Maria Ponti, Jonas Pfeiffer, Sebastian Ruder, Goran Glavaš, Ivan Vulić, and Anna Korhonen. 2021. MAD-G: Multilingual adapter generation for efficient cross-lingual transfer. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4762–4781, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Robert Baldock, Hartmut Maennel, and Behnam Neyshabur. 2021. Deep learning through the lens of example difficulty. In Advances in Neural Information Processing Systems, volume 34, pages 10876–10889. Curran Associates, Inc.

David A Cohn, Zoubin Ghahramani, and Michael I Jordan. 1996. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145.

Sanjoy Dasgupta. 2011. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781. Algorithmic Learning Theory (ALT 2009).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.

Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz, and Noam Slonim. 2020. Active Learning for BERT: An Empirical Study. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7949–7962, Online. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059, New York, New York, USA. PMLR.

Daniel Gissin and Shai Shalev-Shwartz. 2019. Discriminative active learning. arXiv preprint arXiv:1907.06347.

Daniel Grießhaber, Johannes Maucher, and Ngoc Thang Vu. 2020. Fine-tuning BERT for low-resource natural language understanding via active learning. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1158–1171, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations.

Ruidan He, Linlin Liu, Hai Ye, Qingyu Tan, Bosheng Ding, Liying Cheng, Jiawei Low, Lidong Bing, and Luo Si. 2021. On the effectiveness of adapter-based tuning for pretrained language model adaptation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2208–2222, Online. Association for Computational Linguistics.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.

Josip Jukić and Jan Šnajder. 2023. Smooth sailing: Improving active learning for pre-trained language models with representation smoothness analysis. In Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD), pages 11–24, Gothenburg, Sweden. Association for Computational Linguistics.

Siddharth Karamcheti, Ranjay Krishna, Li Fei-Fei, and Christopher Manning. 2021. Mind your outliers! Investigating the negative impact of outliers on active learning for visual question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7265–7281, Online. Association for Computational Linguistics.

Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, and James Henderson. 2021. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 565–576, Online. Association for Computational Linguistics.

Seungwon Kim, Alex Shum, Nathan Susanj, and Jonathan Hilgart. 2021. Revisiting pretraining with adapters. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 90–99, Online. Association for Computational Linguistics.

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR.

Jaeseong Lee, Seung-won Hwang, and Taesup Kim. 2022. FAD-X: Fusing adapters for cross-lingual transfer to low-resource languages. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 57–64, Online only. Association for Computational Linguistics.

David D Lewis and William A Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR’94, pages 3–12. Springer.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.

Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Scott Yih, and Madian Khabsa. 2022. UniPELT: A unified framework for parameter-efficient language model tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6253–6264, Dublin, Ireland. Association for Computational Linguistics.

Katerina Margatina, Loïc Barrault, and Nikolaos Aletras. 2022. On the importance of effectively adapting pretrained language models for active learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 825–836, Dublin, Ireland. Association for Computational Linguistics.

Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. Active learning by acquiring contrastive examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 650–663, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2021. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271–278, Barcelona, Spain.

Marinela Parović, Goran Glavaš, Ivan Vulić, and Anna Korhonen. 2022. BAD-X: Bilingual adapters improve zero-shot cross-lingual transfer. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1791–1799, Seattle, United States. Association for Computational Linguistics.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54, Online. Association for Computational Linguistics.

Jonas Pfeiffer, Sebastian Ruder, Ivan Vulić, and Edoardo Maria Ponti. 2023. Modular deep learning. arXiv preprint arXiv:2302.11529.

Christopher Schröder, Andreas Niekler, and Martin Potthast. 2022. Revisiting uncertainty-based query strategies for active learning with transformers. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2194–2203, Dublin, Ireland. Association for Computational Linguistics.

Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations.

Burr Settles. 2009. Active learning literature survey. Computer sciences technical report, University of Wisconsin-Madison.

Artem Shelmanov, Dmitri Puzyrev, Lyubov Kupriyanova, Denis Belyakov, Daniil Larionov, Nikita Khromov, Olga Kozlova, Ekaterina Artemova, Dmitry V. Dylov, and Alexander Panchenko. 2021. Active learning for sequence tagging with deep pre-trained models and Bayesian uncertainty estimates. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1698–1712, Online. Association for Computational Linguistics.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 455–465, Sofia, Bulgaria. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958.

Cory Stephenson, Suchismita Padhy, Abhinav Ganesh, Yue Hui, Hanlin Tang, and SueYeon Chung. 2021. On the geometry of generalization and memorization in deep neural networks. In International Conference on Learning Representations.

Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. 2019. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations.

Yue Yu, Lingkai Kong, Jieyu Zhang, Rongzhi Zhang, and Chao Zhang. 2022. AcTune: Uncertainty-based active self-training for active fine-tuning of pretrained language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1422–1436, Seattle, United States. Association for Computational Linguistics.

Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. 2020. Cold-start active learning through self-supervised language modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7935–7948, Online. Association for Computational Linguistics.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q Weinberger, and Yoav Artzi. 2021. Revisiting few-sample BERT fine-tuning. In International Conference on Learning Representations.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28.

Table 3: Dataset sizes by splits. Although we do not use a validation set (VAL) in our experiments, we report its size for completeness. For the AGN dataset, we performed uniform subsampling to ensure the computational feasibility of the experiments.


A Reproducibility

A.1 Dataset statistics

The sizes of the dataset splits are provided in Table 3. The datasets predominantly contain English texts.


A.2 Adapters

We use the implementation of adapters from AdapterHub (Pfeiffer et al., 2020).

Adapter We set the reduction factor to 16 and use the swish function as the nonlinearity.

LoRA We add LoRA to the self-attention weights and to the intermediate and output MLP weights of the model. We set both the rank of the LoRA layers and the scaling factor α to 8.

Prefix-tuning We use the tanh activation for Prefix-tuning, with the prefix length set to 30 and a bottleneck size of 512.

UniPELT We use Adapter, LoRA, and Prefix-tuning as components of UniPELT, with the same hyperparameters as described for the individual components. The only exception is that we set the prefix length for Prefix-tuning to 10 instead of 30.
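
For illustration, these configurations could be expressed with the configuration classes of the adapter-transformers (AdapterHub) library roughly as follows. Class and argument names reflect a recent version of the library and should be treated as assumptions rather than the exact code we used; in particular, the placement flags for the bottleneck adapter are our guess.

```python
# Sketch of the adapter configurations; names may differ across library versions.
from transformers.adapters import (
    AdapterConfig,        # bottleneck adapter configuration
    LoRAConfig,
    PrefixTuningConfig,
    ConfigUnion,
)

# Adapter: bottleneck adapters with reduction factor 16 and swish nonlinearity.
adapter_cfg = AdapterConfig(
    mh_adapter=True, output_adapter=True,   # placement flags are an assumption
    reduction_factor=16, non_linearity="swish",
)

# LoRA: applied to self-attention and to the intermediate/output MLP weights,
# with rank r and scaling factor alpha both set to 8.
lora_cfg = LoRAConfig(
    r=8, alpha=8,
    selfattn_lora=True, intermediate_lora=True, output_lora=True,
)

# Prefix-tuning: prefix length 30, bottleneck size 512, tanh nonlinearity.
prefix_cfg = PrefixTuningConfig(
    prefix_length=30, bottleneck_size=512, non_linearity="tanh",
)

# UniPELT: a union of the three components, with the prefix length reduced to 10.
unipelt_cfg = ConfigUnion(
    LoRAConfig(r=8, alpha=8),
    PrefixTuningConfig(prefix_length=10, bottleneck_size=512, non_linearity="tanh"),
    AdapterConfig(mh_adapter=True, output_adapter=True,
                  reduction_factor=16, non_linearity="swish"),
)

# The chosen configuration is then attached to the model, e.g.:
# model.add_adapter("task_adapter", config=adapter_cfg)
# model.train_adapter("task_adapter")
```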


A.3 AL methods

MC In our experiments, we use ten inference cycles to approximate the entropy of the output via Monte Carlo dropout sampling.
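
A minimal sketch of this estimate is given below. It assumes a Hugging Face style sequence-classification model whose dropout layers are activated by switching the model to train mode, which is a common simplification; variable names are illustrative.

```python
import torch

@torch.no_grad()
def mc_dropout_entropy(model, input_ids, attention_mask, n_samples: int = 10):
    """Approximate predictive entropy with Monte Carlo dropout.

    Runs n_samples stochastic forward passes with dropout enabled and
    computes the entropy of the averaged class distribution.
    """
    model.train()  # keep dropout active at inference time
    probs = []
    for _ in range(n_samples):
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        probs.append(torch.softmax(logits, dim=-1))
    mean_probs = torch.stack(probs).mean(dim=0)  # (batch_size, n_classes)
    entropy = -(mean_probs * torch.log(mean_probs + 1e-12)).sum(dim=-1)
    model.eval()
    return entropy  # higher entropy -> more uncertain -> selected earlier
```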

CS We use the [CLS] token representation from the Transformer’s penultimate layer. We follow the greedy method described in the original work (Sener and Savarese, 2018).
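
The greedy k-center selection can be sketched as follows, assuming a matrix of penultimate-layer [CLS] embeddings for the whole pool; this illustrates the core-set strategy rather than reproducing our exact implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def greedy_coreset(embeddings: np.ndarray, labeled_idx, budget: int):
    """Greedy k-center (core-set) selection over example embeddings.

    embeddings: (n_examples, dim) array, e.g. penultimate-layer [CLS] vectors;
    labeled_idx: indices of already labeled examples (the initial centers);
    budget: number of new examples to select.
    """
    # Distance from every example to its nearest labeled center.
    min_dists = cdist(embeddings, embeddings[labeled_idx]).min(axis=1)
    selected = []
    for _ in range(budget):
        new_center = int(np.argmax(min_dists))  # farthest point from all centers
        selected.append(new_center)
        # Update nearest-center distances with the newly added center.
        new_dists = np.linalg.norm(embeddings - embeddings[new_center], axis=1)
        min_dists = np.minimum(min_dists, new_dists)
    return selected
```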

Table 4: Experiment duration in minutes for all models across datasets. We report the average runtime over five different runs and five different sampling methods (five AL methods and random sampling).


A.4 Preprocessing

We apply a few preprocessing steps: we convert all tokens to lowercase, eliminate non-alphanumeric tokens, and limit token sequences to a maximum length of 200.
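
A minimal sketch of these steps is shown below, assuming whitespace-tokenized input and reading "non-alphanumeric tokens" as tokens containing no alphanumeric characters; the helper name is hypothetical.

```python
from typing import List

def preprocess(tokens: List[str], max_len: int = 200) -> List[str]:
    """Lowercase tokens, drop tokens without any alphanumeric character,
    and truncate the sequence to max_len tokens."""
    cleaned = [tok.lower() for tok in tokens if any(ch.isalnum() for ch in tok)]
    return cleaned[:max_len]

# Example: preprocess("The movie , released in 1999 , was GREAT !".split())
# -> ['the', 'movie', 'released', 'in', '1999', 'was', 'great']
```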


A.5 Hyperparameters


A.6 Computing infrastructure

We conducted our experiments on 4× AMD Ryzen Threadripper 3970X 32-Core Processors and 4× NVIDIA GeForce RTX 3090 GPUs with 24GB of RAM. We used PyTorch version 1.9.0 and CUDA 11.4.


A.7 Average runtime

We report the average runtime of experiments in Table 4.


B Additional Results

We report the results that were omitted from the main part of the paper due to space constraints. Table 5 shows AUC scores for different combinations of AL methods and adapters, complementing the relative improvement scores as AUC represents absolute scores for each configuration. In Figure 6, we display the difference in similarities of adapters and FFT compared to their base models on the remaining three datasets.

Table 5: AUC scores for AL methods with different adapters, shown separately without and with TAPT. We include random sampling for comparison with AL methods. Values in bold denote the best result for a particular dataset within the same regime (with or without TAPT).

Figure 6: Layerwise difference in representation similarity for the UniPELT adapter and the FFT model on TREC, SST, and AGN. The differences are computed as CKA(adapter, base) − CKA(FFT, base), where base is the corresponding pre-trained BERT model. Warm colors (positive values) illustrate layer pairs that demonstrate higher similarity to the base model with the adapter than with FFT. Conversely, cool colors (negative values) represent layer pairs that are more similar to the base model when using the FFT model. Best viewed on a computer screen.


Authors:

(1) Josip Jukić, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia (josip.jukic@fer.hr);

(2) Jan Šnajder, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia (jan.snajder@fer.hr).


This paper is available on arXiv under a CC BY 4.0 DEED license.

