New embedding models and API updates: OpenAI
What are embeddings?
OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:
- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
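As a rough illustration of "distance between two vectors", the sketch below uses cosine distance, which is one common choice and is assumed here rather than stated in the text above. The three-dimensional vectors are made-up stand-ins for real embeddings, which have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Smaller values mean the two vectors point in more similar directions."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors with made-up values, standing in for real embeddings.
query = [0.1, 0.9, 0.2]
close = [0.2, 0.8, 0.1]
far = [0.9, -0.1, 0.3]

# The "close" vector should have the smaller distance to the query.
print(cosine_distance(query, close) < cosine_distance(query, far))
```

With these toy values the related vector ends up at a smaller distance from the query than the unrelated one, which is the behavior the paragraph above describes.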
Visit our pricing page to learn about Embeddings pricing. Requests are billed based on the number of tokens in the input.
Embedding models
OpenAI offers two powerful third-generation embedding models, text-embedding-3-small and text-embedding-3-large (denoted by -3 in the model ID). You can read the embedding v3 announcement blog post for more details.
| MODEL GENERATION | TOKENIZER | MAX INPUT TOKENS | KNOWLEDGE CUTOFF |
|---|---|---|---|
| V3 | cl100k_base | 8191 | Sep 2021 |
| V2 | cl100k_base | 8191 | Sep 2021 |
Usage is priced per input token. Below is an example of how many pages of text each model can embed per US dollar (assuming ~800 tokens per page):
| MODEL | ROUGH PAGES PER DOLLAR | EXAMPLE PERFORMANCE ON MTEB EVAL |
|---|---|---|
| text-embedding-3-small | 62,500 | 62.3% |
| text-embedding-3-large | 9,615 | 64.6% |
| text-embedding-ada-002 | 12,500 | 61.0% |
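The pages-per-dollar figures can be turned back into a per-token price with a little arithmetic. The sketch below works only from the numbers in the table and the ~800 tokens per page assumption; the resulting prices are implied values, not official pricing, so check the pricing page for current rates.

```python
TOKENS_PER_PAGE = 800  # the "~800 tokens per page" assumption stated above

# Rough pages per dollar, copied from the table above.
pages_per_dollar = {
    "text-embedding-3-small": 62_500,
    "text-embedding-3-large": 9_615,
    "text-embedding-ada-002": 12_500,
}


def implied_usd_per_million_tokens(pages):
    """Convert pages per dollar into an implied price per one million tokens."""
    tokens_per_dollar = pages * TOKENS_PER_PAGE
    return 1_000_000 / tokens_per_dollar


for model, pages in pages_per_dollar.items():
    price = implied_usd_per_million_tokens(pages)
    print(f"{model}: ~${price:.2f} per 1M tokens")
```

For example, 62,500 pages at 800 tokens each is 50,000,000 tokens per dollar, which works out to roughly $0.02 per million tokens.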
Use cases
Here we show some representative use cases. We will use the Amazon fine-food reviews dataset for the following examples.
Obtaining the embeddings
The dataset contains a total of 568,454 food reviews that Amazon users left up to October 2012. We will use a subset of the 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text). For example:
| PRODUCT ID | USER ID | SCORE | SUMMARY | TEXT |
|---|---|---|---|---|
| B001E4KFG0 | A3SGXH7AUHU8GW | 5 | Good Quality Dog Food | I have bought several of the Vitality canned... |
| B00813GRG4 | A1D87F6ZCVE5NK | 1 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
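Selecting the 1,000 most recent reviews might look like the sketch below. It assumes each review is a dictionary carrying a numeric timestamp under a `Time` key; that field name is hypothetical and is not among the columns listed above, so adjust it to whatever your copy of the dataset provides.

```python
def most_recent(reviews, n=1000, time_key="Time"):
    """Return the n reviews with the largest timestamps, oldest first.

    `time_key` is a hypothetical field name holding a sortable timestamp.
    """
    if n <= 0:
        return []
    ordered = sorted(reviews, key=lambda review: review[time_key])
    return ordered[-n:]


# Toy records with made-up timestamps, only to show the call shape.
sample = [
    {"Summary": "first", "Time": 3},
    {"Summary": "second", "Time": 1},
    {"Summary": "third", "Time": 2},
]
latest_two = most_recent(sample, n=2)
print([review["Summary"] for review in latest_two])
```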
We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding.
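A minimal sketch of this step follows. The "Title: ...; Content: ..." template is an illustrative assumption about how the two fields might be joined, and `combine_review` and `get_embedding` are helper names introduced here. The `client.embeddings.create` call follows the OpenAI Python SDK (v1 and later); the client is passed in as an argument so the helpers can be defined and tried without network access.

```python
def combine_review(summary, text):
    """Join a review title and body into one string (assumed template)."""
    return f"Title: {summary.strip()}; Content: {text.strip()}"


def get_embedding(client, text, model="text-embedding-3-small"):
    """Request one embedding vector for `text` from the given client."""
    # Replacing newlines with spaces is a common preprocessing choice.
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding


combined = combine_review(
    "Good Quality Dog Food ",
    " I have bought several of the Vitality canned...",
)
print(combined)

# Usage with the real API (requires the `openai` package and an API key):
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# vector = get_embedding(client, combined)
# print(len(vector))
```

Passing the client in explicitly also makes it easy to substitute a stub when testing the surrounding pipeline.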