Language Models Are Unsupervised Multitask Learners

Groundbreaking research on the GPT-2 model, demonstrating unsupervised multitask learning for NLP.

Abstract

GPT-2 showed that large language models trained with a single unsupervised objective, next-token prediction over a diverse web corpus, begin to perform a wide range of NLP tasks zero-shot. One model, trained end to end without task-specific objectives or fine-tuning, transfers to tasks such as reading comprehension, summarization, translation, and question answering.

  • 📚
    Authors: Radford et al.
  • 📅
    Year: 2019
  • 📄
    Publisher: OpenAI Technical Report

Key Contribution

Demonstrated that large-scale unsupervised language models can perform many diverse tasks zero-shot, learning general-purpose representations without any task-specific fine-tuning.

Figure: the GPT-2 architecture, with unlabeled text as input to the language model.

Technical Implementation

Model Architecture

Decoder-only Transformer language model released in four sizes, from 12 layers (117M parameters) to 48 layers (1.5B parameters). It is trained with a single autoregressive objective, predicting the next token given all preceding tokens, rather than masked language modeling or next-sentence prediction.
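
Below is a minimal PyTorch sketch of this decoder-only setup. The class names (`CausalSelfAttention`, `Block`, `TinyGPT`) and the hyperparameter defaults are illustrative choices for exposition, not the released GPT-2 configuration.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask, so each position
    attends only to earlier positions (autoregressive)."""
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)

    def forward(self, x):
        T = x.size(1)
        # Upper-triangular boolean mask: True entries (future tokens) are blocked.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class Block(nn.Module):
    """One pre-norm transformer block: attention and MLP, each with a residual."""
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_head)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class TinyGPT(nn.Module):
    """Token + position embeddings, a stack of decoder blocks, and a vocab-sized head."""
    def __init__(self, vocab_size=50257, d_model=256, n_head=8, n_layer=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([Block(d_model, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):                      # idx: (batch, seq) token ids
        T = idx.size(1)
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))           # logits: (batch, seq, vocab)
```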

Training Process

Pretrained on WebText, a roughly 40GB corpus of text scraped from web pages linked on Reddit, with no task-specific fine-tuning. Training maximizes the standard language-modeling likelihood (cross-entropy on next-token prediction), and downstream tasks are performed zero-shot by conditioning the model on natural-language prompts.
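
As a hedged illustration of that objective, the snippet below computes the next-token cross-entropy for the `TinyGPT` sketch above; the optimizer settings and the random id batch are placeholders standing in for the paper's actual hyperparameters and WebText data.

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(model, tokens):
    """Standard autoregressive objective: maximize log p(x_t | x_<t), i.e.
    minimize next-token cross-entropy. `tokens` is a (batch, seq) id tensor."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]        # shift targets by one
    logits = model(inputs)                                 # (batch, seq-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Illustrative training step; random ids stand in for a tokenized WebText batch.
model = TinyGPT()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
batch = torch.randint(0, 50257, (8, 128))
optimizer.zero_grad()
loss = language_modeling_loss(model, batch)
loss.backward()
optimizer.step()
```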

Model Performance

Text Completion

Generates coherent continuations of open-ended prompts while staying consistent with the given context.
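
A minimal sketch of how such completions can be drawn from an autoregressive model is shown below, using top-k sampling (the paper notes its samples were generated with top-k truncation, k = 40). The function assumes the `TinyGPT` sketch above and operates on raw token ids rather than text.

```python
import torch

@torch.no_grad()
def sample_completion(model, prompt_ids, max_new_tokens=50, top_k=40, temperature=1.0):
    """Autoregressively extend `prompt_ids` (shape (1, seq)), sampling each new
    token from the top-k most likely candidates."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature       # next-token logits
        topk_vals, topk_idx = torch.topk(logits, top_k)   # keep the k best tokens
        probs = torch.softmax(topk_vals, dim=-1)
        next_id = topk_idx.gather(-1, torch.multinomial(probs, 1))
        ids = torch.cat([ids, next_id], dim=1)            # append and continue
    return ids

# Example: continue a dummy prompt of token ids with the (untrained) TinyGPT above.
completion = sample_completion(TinyGPT(), torch.tensor([[10, 20, 30]]))
```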

Question Answering

Answers questions zero-shot by conditioning on a passage and a question; on reading-comprehension benchmarks it matches or exceeds several supervised baselines without using their training data.
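
The sketch below illustrates the zero-shot recipe in spirit: the passage and question are presented purely as text ending with an "A:" cue, and the model's greedy continuation is read off as the answer. It uses the Hugging Face `transformers` pipeline with the small public `gpt2` checkpoint as a convenient stand-in; the prompt template is an illustrative assumption, not the paper's exact evaluation harness.

```python
from transformers import pipeline

# Zero-shot QA: the passage and question are given purely as text, and the
# answer is whatever the model writes after the "A:" cue (greedy decoding).
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Tom bought three apples and gave one to Mary.\n"
    "Q: How many apples does Tom have left?\n"
    "A:"
)
output = generator(prompt, max_new_tokens=8, do_sample=False)
print(output[0]["generated_text"][len(prompt):])
```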

Code Generation

Can also produce plausible code-like completions in several programming languages, although code generation is not among the benchmarks evaluated in the paper.
