Reinforcement learning from human feedback

machine learning technique

From Wikipedia, the free encyclopedia

In machine learning, reinforcement learning from human feedback (RLHF), also called reinforcement learning from human preferences, is a technique that trains a "reward model" directly from human feedback and then uses that model as the reward function to optimize an agent's policy with reinforcement learning (RL).[1][2] RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.[3][4][5]
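
The reward-model step can be illustrated with a minimal sketch. It assumes that feedback arrives as pairs of (preferred, rejected) behaviors, that each behavior is described by a small feature vector, and that the reward model is linear and fitted with a logistic (Bradley-Terry style) loss. None of these choices come from the text above: the feature names, the data, and the function names are hypothetical, and practical systems typically use a neural network instead. The subsequent RL step that optimizes the policy against the learned reward is omitted.

```python
import math


def reward(weights, features):
    """Linear reward model r(x) = w . x (a stand-in for a neural network)."""
    return sum(w * f for w, f in zip(weights, features))


def fit_reward_model(preferences, n_features, lr=0.1, epochs=200):
    """Fit weights so that human-preferred items receive higher reward.

    preferences: list of (preferred_features, rejected_features) pairs.
    Minimizes -log sigmoid(r(preferred) - r(rejected)) by gradient descent.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in preferences:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient of the loss w.r.t. w is -(1 - sigmoid(margin)) * (preferred - rejected).
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(n_features):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per response: [helpfulness, rudeness].
preferences = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.3, 0.3]),
]
weights = fit_reward_model(preferences, n_features=2)

# The fitted model can now score behaviors no human has compared yet,
# which is what lets it act as a reward function during RL.
unseen_good = reward(weights, [0.9, 0.0])
unseen_bad = reward(weights, [0.1, 0.9])
print(weights, unseen_good > unseen_bad)
```

In this toy setting the learned weight on the first feature ends up positive and the weight on the second negative, so the model generalizes the annotators' preferences to unseen behaviors.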

Human feedback is collected by asking people to rank instances of the agent's behavior.[6][7][8] These rankings can then be used to score outputs, for example with the Elo rating system.[2]
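
As a rough illustration of how rankings can become scores, the sketch below applies the standard Elo update to a handful of pairwise comparisons between outputs. The output names, the starting rating of 1000, and the K-factor of 32 are assumptions made for the example, not values taken from any particular RLHF system.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, winner, loser, k=32):
    """Shift both ratings after a human preferred `winner` over `loser`."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical outputs, all starting from the same rating.
ratings = {"output_a": 1000.0, "output_b": 1000.0, "output_c": 1000.0}

# Each tuple is one human judgement: (preferred output, rejected output).
comparisons = [
    ("output_a", "output_b"),
    ("output_a", "output_c"),
    ("output_b", "output_c"),
]
for winner, loser in comparisons:
    update_elo(ratings, winner, loser)

# Outputs sorted from highest to lowest rating.
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking, ratings)
```

The resulting ratings give each output a scalar score that reflects how often, and against which competitors, it was preferred; such scores could then serve as training targets for a reward model.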

RLHF has been applied to various domains of natural language processing, such as conversational agents, text summarization, and natural language understanding.[9][10] Ordinary reinforcement learning, in which agents learn from their own actions according to a "reward function", is difficult to apply to natural language processing tasks because the rewards are often hard to define or measure, especially for complex tasks that involve human values or preferences. RLHF can enable language models to provide answers that align with these complex values, generate more verbose responses, and reject questions that are either inappropriate or outside the knowledge space of the model.[11] Examples of RLHF-trained language models include OpenAI's ChatGPT and its predecessor InstructGPT,[7][12][13][14] as well as DeepMind's Sparrow.[15][16][17]

RLHF has also been applied to other areas such as the development of video game bots. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences.[18][19] The agents achieved strong performance in many of the environments tested, often surpassing human performance.[20]

Challenges and limitations

One major challenge of RLHF is the scalability and cost of human feedback, which can be slow and expensive to collect compared to unsupervised learning. The quality and consistency of human feedback can also vary with the task, the interface, and the individual preferences of the annotators. Even when collecting human feedback is feasible, RLHF models may still exhibit undesirable behaviors that the feedback does not capture, or may exploit loopholes in the reward model, which highlights the challenges of alignment and robustness.[21]

References

  1. ^ Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019). "Fine-Tuning Language Models from Human Preferences". doi:10.48550/arXiv.1909.08593.
  2. ^ a b Lambert, Nathan; Castricato, Louis; von Werra, Leandro; Havrilla, Alex. "Illustrating Reinforcement Learning from Human Feedback (RLHF)". Retrieved 4 March 2023.
  3. ^ MacGlashan, James; Ho, Mark K; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. (6 August 2017). "Interactive learning from policy-dependent human feedback". Proceedings of the 34th International Conference on Machine Learning - Volume 70. 2285–2294.
  4. ^ Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). doi:10.1609/aaai.v32i1.11485.
  5. ^ Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". doi:10.48550/arXiv.2204.05862.
  6. ^ Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (31 October 2022). "Training language models to follow instructions with human feedback".
  7. ^ a b Edwards, Benj (1 December 2022). "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica. Retrieved 4 March 2023.
  8. ^ Gupta, Abhishek (5 February 2023). "Getting stakeholder engagement right in responsible AI". VentureBeat. Retrieved 4 March 2023.
  9. ^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (2022). "Training language models to follow instructions with human feedback". doi:10.48550/arXiv.2203.02155.
  10. ^ Stiennon, Nisan; Ouyang, Long; Wu, Jeffrey; Ziegler, Daniel; Lowe, Ryan; Voss, Chelsea; Radford, Alec; Amodei, Dario; Christiano, Paul F. (2020). "Learning to summarize with human feedback". Advances in Neural Information Processing Systems. 33.
  11. ^ Wiggers, Kyle (24 February 2023). "Can AI really be protected from text-based attacks?". TechCrunch. Retrieved 4 March 2023.
  12. ^ Farseev, Aleks. "Council Post: Is Bigger Better? Why The ChatGPT Vs. GPT-3 Vs. GPT-4 'Battle' Is Just A Family Chat". Forbes. Retrieved 4 March 2023.
  13. ^ Heikkilä, Melissa. "How OpenAI is trying to make ChatGPT safer and less biased". MIT Technology Review. Retrieved 4 March 2023.
  14. ^ Douglas Heaven, Will. "ChatGPT is OpenAI's latest fix for GPT-3. It's slick but still spews nonsense". MIT Technology Review. Retrieved 4 March 2023.
  15. ^ Glaese, Amelia; McAleese, Nat; Trębacz, Maja; Aslanides, John; Firoiu, Vlad; Ewalds, Timo; Rauh, Maribeth; Weidinger, Laura; Chadwick, Martin; Thacker, Phoebe; Campbell-Gillingham, Lucy; Uesato, Jonathan; Huang, Po-Sen; Comanescu, Ramona; Yang, Fan; See, Abigail; Dathathri, Sumanth; Greig, Rory; Chen, Charlie; Fritz, Doug; Elias, Jaume Sanchez; Green, Richard; Mokrá, Soňa; Fernando, Nicholas; Wu, Boxi; Foley, Rachel; Young, Susannah; Gabriel, Iason; Isaac, William; Mellor, John; Hassabis, Demis; Kavukcuoglu, Koray; Hendricks, Lisa Anne; Irving, Geoffrey (2022). "Improving alignment of dialogue agents via targeted human judgements". doi:10.48550/arXiv.2209.14375.
  16. ^ "Why DeepMind isn't deploying its new AI chatbot — and what it means for responsible AI". VentureBeat. 23 September 2022. Retrieved 4 March 2023.
  17. ^ "Building safer dialogue agents". Retrieved 4 March 2023.
  18. ^ "Learning from human preferences". Retrieved 4 March 2023.
  19. ^ "Learning through human feedback". Retrieved 4 March 2023.
  20. ^ Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". Advances in Neural Information Processing Systems. Curran Associates, Inc. 30. Retrieved 4 March 2023.
  21. ^ Christiano, Paul. "Thoughts on the impact of RLHF research". Retrieved 4 March 2023.
Original content from Wikipedia, shared with licence Creative Commons By-Sa - Reinforcement learning from human feedback