AI Dataset Glossary

Training Data

The dataset, often labeled, used to teach an AI model how to perform a task by showing it many examples, typically input-output pairs, during the learning process.

Also known as: training dataset, training corpus, labeled data

Training data is the dataset used to teach a machine learning model how to perform a task. The model learns by processing many examples, typically input-output pairs, and adjusting its internal parameters to minimize prediction error. The larger, cleaner, and more representative the training data, the better the resulting model generally performs.
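As a rough illustration of "adjusting internal parameters to minimize prediction error", the sketch below fits a two-parameter linear model to a handful of input-output pairs with plain gradient descent. The toy data (points on y = 2x + 1), the function names, the learning rate, and the epoch count are all assumptions chosen for the example; real training pipelines are far larger and use dedicated libraries.

```python
# Hypothetical toy training data: input-output pairs drawn from y = 2x + 1.
training_data = [(float(x), 2.0 * x + 1.0) for x in range(-5, 6)]


def train(pairs, learning_rate=0.01, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # internal parameters, starting from an arbitrary guess
    n = len(pairs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in pairs:
            error = (w * x + b) - y          # prediction error on one example
            grad_w += 2.0 * error * x / n    # d(MSE)/dw contribution
            grad_b += 2.0 * error / n        # d(MSE)/db contribution
        # Move each parameter a small step in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mean_squared_error(pairs, w, b):
    """Average squared prediction error over all pairs."""
    return sum(((w * x + b) - y) ** 2 for x, y in pairs) / len(pairs)


w, b = train(training_data)
print(f"w={w:.3f}, b={b:.3f}, mse={mean_squared_error(training_data, w, b):.2e}")
```

On this noise-free toy data the learned parameters should approach the values that generated the pairs; with real data the error typically settles at a nonzero floor.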

For large language models, training data consists of massive text corpora: web pages, books, code repositories, scientific papers, and more. For specialized models (classification, object detection, recommendation), training data consists of domain-specific labeled examples.

Training data quality is often more important than training data quantity. Noisy, inconsistent, or biased training data produces models with corresponding limitations. This is why high-quality training datasets, especially those that are curated, labeled by domain experts, and documented for provenance, are valuable data products.
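One simple kind of inconsistency can be checked mechanically: the same input appearing with different labels. The sketch below is only an assumed example of such a check; the function name, the sample reviews, and the normalization rule (trimming whitespace and lowercasing) are hypothetical, and real curation involves many other checks.

```python
from collections import defaultdict


def find_conflicting_labels(examples):
    """Return inputs that appear with more than one distinct label.

    Inputs are compared after trimming whitespace and lowercasing,
    which is an assumed normalization for this example.
    """
    labels_by_input = defaultdict(set)
    for text, label in examples:
        labels_by_input[text.strip().lower()].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_input.items()
        if len(labels) > 1
    }


# Hypothetical labeled examples for a sentiment classifier.
examples = [
    ("Great product", "positive"),
    ("great product ", "negative"),   # same input, conflicting label
    ("Arrived late", "negative"),
    ("Arrived late", "negative"),     # exact duplicate, consistent label
]

conflicts = find_conflicting_labels(examples)
print(conflicts)
```

Flagged items would normally go back to a domain expert for review rather than being dropped automatically, since the conflict may reflect genuine ambiguity in the input.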