Table of Links
- Abstract and Introduction
- Related Work
- Experiments
- Discussion
- Limitations and Future Work
- Conclusion, Acknowledgments and Disclosure of Funding, and References
6 Conclusion
Fine-tuning models via repositories such as the Hugging Face Model Hub has become increasingly popular as capable open models proliferate. This work has shown how fine-tuning can affect toxicity rates in hard-to-predict ways across models from different AI labs. Model creators’ efforts to reduce toxicity during instruction tuning can easily and inadvertently be undone when models are further fine-tuned on non-adversarial datasets. This phenomenon can be seen in practice in popular models fine-tuned by community contributors, where models adapted for purposes such as multilingual capability can show surprisingly variable toxicity rates. These results emphasize the need for model creators, community contributors, model users, and policy-makers to pay attention to the toxicity performance of fine-tuned models, even when fine-tuning does not target toxicity.
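For readers who want to act on this recommendation, the sketch below shows one simple way a contributor or user might estimate a toxicity rate for a fine-tuned model before releasing or adopting it. This is not the paper's evaluation pipeline: the model ID, the prompt file, the 0.5 threshold, and the use of the open-source Detoxify classifier are all illustrative assumptions.

```python
# Minimal sketch (not the paper's method): score a fine-tuned model's
# responses to a prompt set and report the fraction judged toxic.
# Assumptions: pip install transformers detoxify torch
from transformers import pipeline
from detoxify import Detoxify

MODEL_ID = "your-org/your-fine-tuned-model"  # hypothetical model ID
THRESHOLD = 0.5                              # illustrative toxicity cut-off

generator = pipeline("text-generation", model=MODEL_ID)
scorer = Detoxify("original")

# prompts.txt is a placeholder: one evaluation prompt per line.
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

toxic_count = 0
for prompt in prompts:
    reply = generator(prompt, max_new_tokens=128,
                      return_full_text=False)[0]["generated_text"]
    # Detoxify returns per-category scores; "toxicity" is the overall score.
    if scorer.predict(reply)["toxicity"] >= THRESHOLD:
        toxic_count += 1

print(f"Toxicity rate: {toxic_count / len(prompts):.2%} "
      f"over {len(prompts)} prompts")
```

Running such a check both before and after fine-tuning would make any regression in toxicity behaviour visible, which is the kind of monitoring the conclusion calls for.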
Acknowledgments and Disclosure of Funding
The authors would like to thank the following individuals for helpful discussions and feedback throughout the course of this project: Kevin McKee, Inga Campos, Seliem El-Sayed, Laura Weidinger, Ramona Comanescu, and Charvi Rastogi.
Brent Mittelstadt and Chris Russell’s contributions to this work have been supported through research funding provided by the Wellcome Trust (grant no. 223765/Z/21/Z), the Sloan Foundation (grant no. G2021-16779), the Department of Health and Social Care, the EPSRC (grant no. EP/Y019393/1), and Luminate Group. Their funding supports the Trustworthiness Auditing for AI project and the Governance of Emerging Technologies research programme at the Oxford Internet Institute, University of Oxford. During the course of this work, Will Hawkins was employed at Google DeepMind.
References
Anthropic. (2023). Claude 2. https://www.anthropic.com/news/claude-2
Biderman, D., Portes, J., Ortiz, J. J. G., Paul, M., Greengard, P., Jennings, C., King, D., Havens, S., Chiley, V., Frankle, J., Blakeney, C., & Cunningham, J. P. (2024). LoRA Learns Less and Forgets Less (arXiv:2405.09673). arXiv. http://arxiv.org/abs/2405.09673
Bilenko, M. (2024, April 23). Introducing Phi-3: Redefining what’s possible with SLMs. Microsoft Azure Blog. https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/
Cecchini, D., Nazir, A., Chakravarthy, K., & Kocaman, V. (2024). Holistic Evaluation of Large Language Models: Assessing Robustness, Accuracy, and Toxicity for Real-World Applications. In A. Ovalle, K.-W. Chang, Y. T. Cao, N. Mehrabi, J. Zhao, A. Galstyan, J. Dhamala, A. Kumar, & R. Gupta (Eds.), Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024) (pp. 109–117). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.trustnlp-1.11
Conover, M., Hayes, M., Mathur, A., Xie, J., Wan, J., Shah, S., Ghodsi, A., Wendell, P., Zaharia, M., & Xin, R. (2023, April 12). Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM. Databricks. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language (arXiv:1703.04009). arXiv. http://arxiv.org/abs/1703.04009
Dawson, N. V., & Weiss, R. (2012). Dichotomizing Continuous Variables in Statistical Analysis: A Practice to Avoid. Medical Decision Making, 32(2), 225–226. https://doi.org/10.1177/0272989X12437605
Fu, Z., Yang, H., So, A. M.-C., Lam, W., Bing, L., & Collier, N. (2022). On the Effectiveness of Parameter-Efficient Fine-Tuning (arXiv:2211.15583). arXiv. https://doi.org/10.48550/arXiv.2211.15583
Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models (arXiv:2009.11462). arXiv. http://arxiv.org/abs/2009.11462
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3), 515–534. https://doi.org/10.1214/06-BA117A
Gemini Team, Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., Silver, D., Johnson, M., Antonoglou, I., Schrittwieser, J., Glaese, A., Chen, J., Pitler, E., Lillicrap, T., Lazaridou, A., . . . Vinyals, O. (2024). Gemini: A Family of Highly Capable Multimodal Models (arXiv:2312.11805). arXiv. https://doi.org/10.48550/arXiv.2312.11805
Gemma Team, Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., Sessa, P. G., Chowdhery, A., Roberts, A., Barua, A., Botev, A., Castro-Ros, A., Slone, A., . . . Kenealy, K. (2024). Gemma: Open Models Based on Gemini Research and Technology (arXiv:2403.08295). arXiv. http://arxiv.org/abs/2403.08295
He, L., Xia, M., & Henderson, P. (2024). What’s in Your ‘Safe’ Data?: Identifying Benign Data that Breaks Safety (arXiv:2404.01099). arXiv. http://arxiv.org/abs/2404.01099
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685). arXiv. https://doi.org/10.48550/arXiv.2106.09685
HuggingFace. (2024, May 18). The Model Hub. https://huggingface.co/docs/hub/en/models-the-hub
Irwin, J. R., & McClelland, G. H. (2003). Negative Consequences of Dichotomizing Continuous Predictor Variables. Journal of Marketing Research, 40(3), 366–371. https://doi.org/10.1509/jmkr.40.3.366.19237
Kumar, D., Kumar, A., Agarwal, S., & Harshangi, P. (2024). Increased LLM Vulnerabilities from Fine-tuning and Quantization (arXiv:2404.04392). arXiv. http://arxiv.org/abs/2404.04392
Lermen, S., Rogers-Smith, C., & Ladish, J. (2023). LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B (arXiv:2310.20624). arXiv. https://doi.org/10.48550/arXiv.2310.20624
Liu, H., Liu, Z., Tang, R., Yuan, J., Zhong, S., Chuang, Y.-N., Li, L., Chen, R., & Hu, X. (2024). LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario (arXiv:2403.00108). arXiv. http://arxiv.org/abs/2403.00108
Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., & Zhang, Y. (2024). An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning (arXiv:2308.08747). arXiv. http://arxiv.org/abs/2308.08747
Meta. (2024a). Introducing Meta Llama 3: The most capable openly available LLM to date. Meta AI. https://ai.meta.com/blog/meta-llama-3/
Meta. (2024b). Our responsible approach to Meta AI and Meta Llama 3. Meta AI. https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/
Nadeau, D., Kroutikov, M., McNeil, K., & Baribeau, S. (2024). Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations (arXiv:2404.09785). arXiv. http://arxiv.org/abs/2404.09785
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., . . . Zoph, B. (2024). GPT-4 Technical Report (arXiv:2303.08774). arXiv. https://doi.org/10.48550/arXiv.2303.08774
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback (arXiv:2203.02155). arXiv. https://doi.org/10.48550/arXiv.2203.02155
Phi-3 Safety Post-Training: Aligning Language Models with a “Break-Fix” Cycle. (2024). Retrieved 27 September 2024, from https://arxiv.org/html/2407.13833v1
Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., & Henderson, P. (2023). Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (arXiv:2310.03693). arXiv. http://arxiv.org/abs/2310.03693
Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Statistics in Medicine, 25(1), 127–141. https://doi.org/10.1002/sim.2331
Sun, A. Y., Zemour, E., Saxena, A., Vaidyanathan, U., Lin, E., Lau, C., & Mugunthan, V. (2024). Does fine-tuning GPT-3 with the OpenAI API leak personally-identifiable information? (arXiv:2307.16382). arXiv. http://arxiv.org/abs/2307.16382
Taraghi, M., Dorcelus, G., Foundjem, A., Tambon, F., & Khomh, F. (2024). Deep Learning Model Reuse in the HuggingFace Community: Challenges, Benefit and Trends (arXiv:2401.13177). arXiv. http://arxiv.org/abs/2401.13177
Tian, K., Mitchell, E., Yao, H., Manning, C. D., & Finn, C. (2023). Fine-tuning Language Models for Factuality (arXiv:2311.08401). arXiv. http://arxiv.org/abs/2311.08401
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., . . . Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv:2307.09288). arXiv. http://arxiv.org/abs/2307.09288
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need (arXiv:1706.03762). arXiv. http://arxiv.org/abs/1706.03762
Vidgen, B., Thrush, T., Waseem, Z., & Kiela, D. (2020). Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection (arXiv:2012.15761). arXiv. https://arxiv.org/abs/2012.15761v2
Wan, A., Wallace, E., Shen, S., & Klein, D. (2023). Poisoning Language Models During Instruction Tuning. Proceedings of the 40th International Conference on Machine Learning, 35413–35425. https://proceedings.mlr.press/v202/wan23b.html
Wang, S., Wang, P., Zhou, T., Dong, Y., Tan, Z., & Li, J. (2024). CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models (arXiv:2407.02408). arXiv. https://doi.org/10.48550/arXiv.2407.02408
Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K. R., Wadden, D., MacMillan, K., Smith, N. A., Beltagy, I., & Hajishirzi, H. (2023). How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources (arXiv:2306.04751). arXiv. https://doi.org/10.48550/arXiv.2306.04751
Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., . . . Gabriel, I. (2021). Ethical and social risks of harm from Language Models (arXiv:2112.04359). arXiv. http://arxiv.org/abs/2112.04359
Yang, X., Wang, X., Zhang, Q., Petzold, L., Wang, W. Y., Zhao, X., & Lin, D. (2023). Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (arXiv:2310.02949). arXiv. https://doi.org/10.48550/arXiv.2310.02949
Zeng, Y., & Lee, K. (2024). The Expressive Power of Low-Rank Adaptation (arXiv:2310.17513). arXiv. http://arxiv.org/abs/2310.17513
Zhan, Q., Fang, R., Bindu, R., Gupta, A., Hashimoto, T., & Kang, D. (2024). Removing RLHF Protections in GPT-4 via Fine-Tuning (arXiv:2311.05553). arXiv. http://arxiv.org/abs/2311.05553
Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., & Wang, G. (2024). Instruction Tuning for Large Language Models: A Survey (arXiv:2308.10792). arXiv. http://arxiv.org/abs/2308.10792
Zhao, J., Deng, Z., Madras, D., Zou, J., & Ren, M. (2024). Learning and Forgetting Unsafe Examples in Large Language Models (arXiv:2312.12736). arXiv. http://arxiv.org/abs/2312.12736
Authors:
(1) Will Hawkins, Oxford Internet Institute, University of Oxford;
(2) Brent Mittelstadt, Oxford Internet Institute, University of Oxford;
(3) Chris Russell, Oxford Internet Institute, University of Oxford.
This paper is available on arXiv under a CC 4.0 license.