A large language model (LLM) is a language model consisting of a neural network with many parameters (typically billions of weights or more), trained on large quantities of unlabelled text using unsupervised learning. LLMs emerged around 2018 and perform well at a wide variety of tasks. This has shifted the focus of natural language processing research away from the previous paradigm of training specialized supervised models for specific tasks.
Though the term large language model has no formal definition, it generally refers to deep learning models having a parameter count on the order of billions or more. LLMs are general purpose models which excel at a wide range of tasks, as opposed to being trained for one specific task (such as sentiment analysis, named entity recognition, or mathematical reasoning). Though trained on simple tasks along the lines of predicting the next word in a sentence, neural language models with sufficient training and parameter counts are found to capture much of the syntax and semantics of human language. In addition, large language models demonstrate considerable general knowledge about the world, and are able to "memorize" a great quantity of facts during training.
Architecture and training
Large language models have most commonly used the transformer architecture, which, since 2018, has become the standard deep learning technique for sequential data (previously, recurrent architectures such as the LSTM were most common). LLMs are trained in an unsupervised manner on unannotated text. A left-to-right transformer is trained to maximize the probability assigned to the next word in the training data, given the previous context. Alternatively, an LLM may use a bidirectional transformer (as in the example of BERT), which assigns a probability distribution over words given access to both preceding and following context. In addition to the task of predicting the next word or "filling in the blanks", LLMs may be trained on auxiliary tasks which test their understanding of the data distribution such as Next Sentence Prediction (NSP), in which pairs of sentences are presented and the model must predict whether they appear side-by-side in the training corpus.
The earliest LLMs were trained on corpora having on the order of billions of words. The initial version of GPT was trained in 2018 on BookCorpus, consisting of 985 million words. In the same year, BERT was trained on a combination of BookCorpus and English Wikipedia, totalling 3.3 billion words. In the years since then, training corpora for LLMs have increased by orders of magnitude, reaching up to hundreds of billions or trillions of tokens.
LLMs are computationally expensive to train. A 2020 study estimated the cost of training a 1.5 billion parameter model (1-2 orders of magnitude smaller than the state of the art at the time) at $1.6 million.
A 2020 analysis found that neural language models' capability (as measured by training loss) increased smoothly in a power law relationship with number of parameters, quantity of training data, and computation used for training. These relationships were tested over a wide range of values (up to seven orders of magnitude) and no attenuation of the relationship was observed at the highest end of the range (including for network sizes up to trillions of parameters).
Application to downstream tasks
Between 2018 and 2020, the standard method for harnessing an LLM for a specific NLP task was to fine tune the model with additional task-specific training. It has subsequently been found that more powerful LLMs such as GPT-3 can solve tasks without additional training via "prompting" techniques, in which the problem to be solved is presented to the model as a text prompt, possibly with some textual examples of similar problems and their solutions.
Fine-tuning is the practice of modifying an existing pretrained language model by training it (in a supervised fashion) on a specific task (e.g. sentiment analysis, named entity recognition, or part-of-speech tagging). It is a form of transfer learning. It generally involves the introduction of a new set of weights connecting the final layer of the language model to the output of the downstream task. The original weights of the language model may be "frozen", such that only the new layer of weights connecting them to the output are learned during training. Alternatively, the original weights may receive small updates (possibly with earlier layers frozen).
In the prompting paradigm, popularized by GPT-3, the problem to be solved is formulated via a text prompt, which the model must solve by providing a completion (via inference). In "few-shot prompting", the prompt includes a small number of examples of similar (problem, solution) pairs. For example, a sentiment analysis task of labelling the sentiment of a movie review could be prompted as follows:
Review: This movie stinks. Sentiment: negative Review: This movie is fantastic! Sentiment:
If the model outputs "positive", then it has correctly solved the task. In zero-shot prompting, no solve examples are provided. An example of a zero-shot prompt for the same sentiment analysis task would be "The sentiment associated with the movie review 'This movie is fantastic!' is".
Few-shot performance of LLMs has been shown to achieve competitive results on NLP tasks, sometimes surpassing prior state-of-the-art fine-tuning approaches. Examples of such NLP tasks are translation, question answering, cloze tasks, unscrambling words, and using a novel word in a sentence. The creation and optimisation of such prompts is called prompt engineering and is now an active field of study.
Instruction tuning is a form of fine-tuning designed to facilitate more natural and accurate zero-shot prompting interactions. Given a text input, a pretrained language model will generate a completion which matches the distribution of text on which it was trained. A naive language model given the prompt "Write an essay about the main themes of Hamlet." might provide a completion such as "A late penalty of 10% per day will be applied to submissions received after March 17." In instruction tuning, the language model is trained on many examples of tasks formulated as natural language instructions, along with appropriate responses. Various techniques for instruction tuning have been applied in practice. OpenAI's InstructGPT protocol involves supervised fine-tuning on a dataset of human-generated (prompt, response) pairs, followed by reinforcement learning from human feedback (RLHF), in which a reward function was learned based on a dataset of human preferences. Another technique, "self-instruct", fine-tunes the language model on a training set of examples which are themselves generated by an LLM (bootstrapped from a small initial set of human-generated examples).
List of large language models
|Name||Release date[a]||Developer||Number of parameters[b]||Corpus size||License[c]||Notes|
|BERT||2018||340 million||3.3 billion words||Apache 2.0||early and influential language model|
|GPT-2||2019||OpenAI||1.5 billion||40GB (~10 billion tokens)||MIT||general-purpose model based on transformer architecture|
|Fairseq||2020||Meta||13 billion||MIT||general-purpose literature focused model|
|GPT-3||2020||OpenAI||175 billion||499 billion tokens||public web API||A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.|
|GPT-Neo||March 2021||EleutherAI||2.7 billion||825 GiB||MIT||The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.|
|GPT-J||June 2021||EleutherAI||6 billion||825 GiB||Apache 2.0||GPT-3-style language model|
|Claude||December 2021||Anthropic||52 billion||400 billion tokens||Closed beta||fine-tuned for desirable behavior in conversations|
|GLaM (Generalist Language Model)||December 2021||1.2 trillion||1.6 trillion tokens||Proprietary||sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3|
|LaMDA (Language Models for Dialog Applications)||January 2022||137 billion||1.56T words||Proprietary||specialized for response generation in conversations|
|Megatron-Turing NLG||October 2021||Microsoft and Nvidia||530 billion||338.6 billion tokens||Restricted web access||standard architecture but trained on a supercomputing cluster|
|GPT-NeoX||February 2022||EleutherAI||20 billion||825 GiB||Apache 2.0||based on the Megatron architecture|
|Chinchilla||March 2022||DeepMind||70 billion||1.3 trillion tokens||Proprietary||reduced-parameter model trained on more data|
|PaLM (Pathways Language Model)||April 2022||540 billion||768 billion tokens||Proprietary||aimed to reach the practical limits of model scale|
|OPT (Open Pretrained Transformer)||May 2022||Meta||175 billion||180 billion tokens||Non-commercial research[d]||GPT-3 architecture with some adaptations from Megatron|
|YaLM 100B||June 2022||Yandex||100 billion||1.7TB||Apache 2.0||English-russia model|
|BLOOM||July 2022||Large collaboration led by Hugging Face||175 billion||350 billion tokens (1.6TB)||Responsible AI||Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages)|
|AlexaTM (Teacher Models)||November 2022||Amazon||20 billion||1.3 trillion||public web API||bidirectional sequence-to-sequence architecture|
|LLaMA (Large Language Model Meta AI)||February 2023||Meta||65 billion||1.4 trillion||Non-commercial research[e]||trained on a large 20-language corpus to aim for better performance with fewer parameters.|
|GPT-4||March 2023||OpenAI||Unknown[f]||Unknown||public web API||Available for ChatGPT Plus users. Microsoft confirmed that GPT-4 model is used in Bing Chat.|
- ^ This is the date that documentation describing the model's architecture was first released.
- ^ In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.
- ^ This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.
- ^ The smaller models including 66B are publicly available, while the 175B model is available on request.
- ^ Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.
- ^ As stated in Technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."
- ^ a b c d e f g Manning, Christopher D. (2022). "Human Language Understanding & Reasoning". Daedalus.
- ^ Carlini, Nicholas; Tramer, Florian; Wallace, Eric; Jagielski, Matthew; Herbert-Voss, Ariel; Lee, Katherine; Roberts, Adam; Brown, Tom B; Song, Dawn; Erlingsson, Ulfar (2021). Extracting Training Data from Large Language Models (PDF). USENIX Security Symposium. Vol. 6.
- ^ a b c Wei, Jason. "Emergent Abilities of Large Language Models".
- ^ a b c d e Jurafsky, Dan; Martin, James H. (7 January 2023). Speech and Language Processing (PDF) (3rd edition draft ed.). Retrieved 24 May 2022.
- ^ a b c Wiggers, Kyle (28 April 2022). "The emerging types of language models and why they matter". TechCrunch.
- ^ a b Ananthaswamy, Anil (8 March 2023). "In AI, is bigger always better?". Nature.
- ^ a b Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". CoRR. abs/2001.08361. arXiv:2001.08361.
- ^ a b Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (Dec 2020). Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.F.; Lin, H. (eds.). "Language Models are Few-Shot Learners" (PDF). Advances in Neural Information Processing Systems. Curran Associates, Inc. 33: 1877–1901.
- ^ Bosma, Maarten; Wei, Jason (6 October 2021). "Introducing FLAN: More generalizable Language Models with Instruction Fine-Tuning". Google Research.
- ^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155.
- ^ Wang, Yizhong; Kordi, Yeganeh; Mishra, Swaroop; Liu, Alisa; Smith, Noah A.; Khashabi, Daniel; Hajishirzi, Hannaneh (2022). "Self-Instruct: Aligning Language Model with Self Generated Instructions". arXiv:2212.10560.
- ^ a b Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
- ^ "BERT". March 13, 2023 – via GitHub.
- ^ "GPT-2: 1.5B Release". OpenAI. 2019-11-05. Archived from the original on 2019-11-14. Retrieved 2019-11-14.
- ^ "Better language models and their implications". openai.com.
- ^ a b "OpenAI's GPT-3 Language Model: A Technical Overview". lambdalabs.com.
- ^ "gpt-2". GitHub. Retrieved 13 March 2023.
- ^ "fairseq/examples/moe_lm at main · facebookresearch/fairseq". GitHub. Retrieved 2023-03-18.
- ^ Requirements and Installation, Meta Research, 2023-03-18, retrieved 2023-03-18
- ^ Dev, KoboldAI (2023-03-18), KoboldAI/KoboldAI-Client, retrieved 2023-03-18
- ^ "ChatGPT: Optimizing Language Models for Dialogue". OpenAI. 2022-11-30. Retrieved 2023-01-13.
- ^ "GPT Neo". March 15, 2023 – via GitHub.
- ^ a b c Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027.
- ^ a b Iyer, Abhishek (15 May 2021). "GPT-3's free alternative GPT-Neo is something to be excited about". VentureBeat.
- ^ "GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront". www.forefront.ai. Retrieved 2023-02-28.
- ^ "Product". Anthropic. Retrieved 14 March 2023.
- ^ a b Askell, Amanda; Bai, Yuntao; Chen, Anna; et al. (9 December 2021). "A General Language Assistant as a Laboratory for Alignment". arXiv:2112.00861.
- ^ Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; et al. (15 December 2022). "Constitutional AI: Harmlessness from AI Feedback". arXiv:2212.08073.
- ^ a b Dai, Andrew M; Du, Nan (December 9, 2021). "More Efficient In-Context Learning with GLaM". ai.googleblog.com. Retrieved 2023-03-09.
- ^ a b Cheng, Heng-Tze; Thoppilan, Romal (January 21, 2022). "LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything". ai.googleblog.com. Retrieved 2023-03-09.
- ^ Alvi, Ali; Kharya, Paresh (11 October 2021). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model". Microsoft Research.
- ^ a b Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper, Jared; Liu, Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay; Zhang, Elton; Child, Rewon; Aminabadi, Reza Yazdani; Bernauer, Julie; Song, Xia (2022-02-04). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model". arXiv:2201.11990.
- ^ Black, Sidney; Biderman, Stella; Hallahan, Eric; et al. (2022-05-01). GPT-NeoX-20B: An Open-Source Autoregressive Language Model. Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. Vol. Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models. pp. 95–136. Retrieved 2022-12-19.
- ^ a b c Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Sifre, Laurent (12 April 2022). "An empirical analysis of compute-optimal large language model training". Deepmind Blog.
- ^ Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza; Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican, Katie; Driessche, George van den; Damoc, Bogdan; Guy, Aurelia; Osindero, Simon; Simonyan, Karen; Elsen, Erich; Rae, Jack W.; Vinyals, Oriol; Sifre, Laurent (29 March 2022). "Training Compute-Optimal Large Language Models". arXiv:2203.15556.
- ^ Narang, Sharan; Chowdhery, Aakanksha (April 4, 2022). "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance". ai.googleblog.com. Retrieved 2023-03-09.
- ^ "Democratizing access to large-scale language models with OPT-175B". ai.facebook.com.
- ^ Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer Language Models". arXiv:2205.01068.
- ^ a b Khrushchev, Mikhail; Vasilev, Ruslan; Petrov, Alexey; Zinov, Nikolay (2022-06-22), YaLM 100B, retrieved 2023-03-18
- ^ "bigscience/bloom · Hugging Face". huggingface.co.
- ^ "20B-parameter Alexa model sets new marks in few-shot learning". Amazon Science. 2 August 2022.
- ^ Soltan, Saleh; Ananthakrishnan, Shankar; FitzGerald, Jack; et al. (3 August 2022). "AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model". arXiv:2208.01448.
- ^ "AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog". aws.amazon.com. 17 November 2022. Retrieved 13 March 2023.
- ^ a b c "Introducing LLaMA: A foundational, 65-billion-parameter large language model". Meta AI. 24 February 2023.
- ^ "GPT-4 Technical Report" (PDF). OpenAI. 2023. Archived (PDF) from the original on March 14, 2023. Retrieved March 14, 2023.
- ^ Lardinois, Frederic (March 14, 2023). "Microsoft's new Bing was using GPT-4 all along". TechCrunch. Archived from the original on March 15, 2023. Retrieved March 14, 2023.