AI Dataset Glossary

Vector Embedding

A numerical representation of text, images, or other content as a high-dimensional vector, enabling machines to measure semantic similarity between items.

Also known as: embedding, text embedding, semantic vector

A vector embedding is a numerical representation of a piece of content ... text, an image, a document ... as a list of numbers (a vector) in a high-dimensional space. Embedding models are trained so that items with similar meaning are represented by vectors that are close together in this space.

Vector embeddings are the foundation of semantic search and RAG systems. When you embed both a user query and a set of documents, you can find the documents most semantically relevant to the query by measuring the distance between their vectors ... even if the documents use different words than the query.

For AI dataset work, embeddings matter because they enable retrieval beyond keyword matching. A RAG system that uses embeddings can find relevant documents based on meaning and context, not just literal word overlap. This is why dataset preparation for RAG includes an embedding step: each chunk of content is converted to a vector so the retrieval system can search by similarity.