Text embeddings have revolutionized natural language processing by providing dense vector representations that capture semantic meaning. In the previous tutorial, you learned how to generate these embeddings using transformer models. In this post, you will explore advanced applications of text embeddings that go beyond basic tasks such as semantic search and document clustering.
Specifically, you will learn:
- How to build recommendation systems using text embeddings
- How to implement cross-lingual applications with multilingual embeddings
- How to create text classification systems with embedding-based features
- How to develop zero-shot learning applications
- How to visualize and analyze text embeddings
Let’s get started.
Example Applications of Text Embedding. Photo by Christina Winter. Some rights reserved.
Overview
This post is divided into five parts; they are:
- Recommendation Systems
- Cross-Lingual Applications
- Text Classification
- Zero-Shot Classification
- Visualizing Text Embeddings
Recommendation Systems
A simple recommendation system can be created by finding the few items most similar to a target item. In natural language processing, for example, you can suggest similar articles as “you may also like” items while the user is reading an article.
There are many ways to implement this, but the easiest is to check how similar two articles are to each other. You convert every article into a context embedding; the two articles whose embeddings have the highest similarity are the most similar in content. This may not be exactly what you expect from a recommendation engine, but it is often useful and is a good starting point.
Let’s implement this as follows:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Define a corpus of articles (title and content)
articles = [
    {
        "title": "Understanding Deep Learning",
        "content": ("Deep learning is a subset of machine learning where artificial neural networks, "
                    "algorithms inspired by the human brain, learn from large amounts of data.")
    },
    {
        "title": "Introduction to Natural Language Processing",
        "content": ("Natural Language Processing (NLP) is a field of AI that gives machines the "
                    "ability to read, understand, and derive meaning from human languages.")
    },
    {
        "title": "The Future of Computer Vision",
        "content": ("Computer vision is an interdisciplinary field that deals with how computers can "
                    "gain high-level understanding from digital images or videos.")
    },
    {
        "title": "Reinforcement Learning Explained",
        "content": ("Reinforcement learning is an area of machine learning concerned with how "
                    "software agents ought to take actions in an environment so as to maximize some "
                    "notion of cumulative reward.")
    },
    {
        "title": "Neural Networks and Their Applications",
        "content": ("Neural networks are a set of algorithms, modeled loosely after the human brain, "
                    "that are designed to recognize patterns in data.")
    }
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def create_article_embeddings(articles, model):
    """create embeddings for articles"""
    texts = [f"{article['title']}. {article['content']}" for article in articles]
    embeddings = model.encode(texts)
    return embeddings

def get_recommendations(article_id, articles, embeddings, top_n=2):
    """get recommendations for a given article ID based on cosine similarity"""
    similarities = cosine_similarity([embeddings[article_id]], embeddings)[0]
    similar_indices = np.argsort(similarities)[::-1][1:top_n+1]
    return [articles[idx] for idx in similar_indices]

# Create embeddings for all articles, and get recommendations for the first article
embeddings = create_article_embeddings(articles, model)
recommendations = get_recommendations(0, articles, embeddings)

# Print the recommendations
print(f'Recommendations for "{articles[0]["title"]}":')
for i, rec in enumerate(recommendations):
    print(f"{i+1}. {rec['title']}")
You set up a corpus at the beginning of the code because it is a toy example. In practice, you may want to retrieve the corpus from a database or from a file system.
In this program, you used the all-MiniLM-L6-v2 model, instantiated with SentenceTransformer. This is a pre-trained model that can encode text into a context embedding. You take all the articles defined in the corpus and convert each of them into a context embedding in the function create_article_embeddings(). The output is a vector of vectors, or a matrix. In this particular implementation, there are 5 items in the corpus, and the embedding vector has 384 dimensions, so the output embeddings is a matrix of shape (5, 384).
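If you want to confirm this, you can print the shape of the array that create_article_embeddings() returns. This is a tiny check you can append to the code above; the numbers assume this corpus and the all-MiniLM-L6-v2 model:

# embeddings comes from create_article_embeddings(articles, model) above
print(type(embeddings))    # <class 'numpy.ndarray'>
print(embeddings.shape)    # expected: (5, 384)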
In get_recommendations(), you calculate the cosine similarity between one embedding and all the others. The function cosine_similarity() from scikit-learn takes two lists of vectors and returns a matrix saying how similar each pair of vectors is. Since you are comparing one vector to all the others, the output matrix has only a single row. Then np.argsort(similarities) gives you the indices that would sort the similarity scores in ascending order. Since cosine similarity is 1 when the vectors are identical and 0 when they are orthogonal (i.e., totally different), you reverse the result to order the scores in descending order. The most similar items are then those at the beginning of this list, except the first one, which is the article itself. Once you have obtained the indices of the most similar items, you use a for-loop to print the recommendations.
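As a sanity check, cosine similarity is just the dot product of L2-normalized vectors, so you could reproduce the same row of scores with plain NumPy. The sketch below is for illustration only and is not part of the original program:

import numpy as np

def cosine_row(query_vec, all_vecs):
    """Cosine similarity between one vector and every row of a matrix."""
    query_norm = query_vec / np.linalg.norm(query_vec)
    all_norms = all_vecs / np.linalg.norm(all_vecs, axis=1, keepdims=True)
    return all_norms @ query_norm

# Should closely match cosine_similarity([embeddings[0]], embeddings)[0]
scores = cosine_row(embeddings[0], embeddings)
print(scores)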
When you run the full program above, you will get:
Recommendations for "Understanding Deep Learning":
1. Neural Networks and Their Applications
2. Reinforcement Learning Explained
These recommendations will be based on semantic similarity rather than just keyword matching, so you will get articles about neural networks or machine learning even if they don’t contain the exact phrase “deep learning.” This approach can be extended to more complex recommendation systems by incorporating user preferences, collaborative filtering, or hybrid approaches.
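As a rough illustration of such an extension, the hypothetical sketch below blends similarity to the current article with a simple user profile built from previously read articles. The function name, the averaging of read-article embeddings, and the 50/50 weighting are assumptions for illustration, not part of the original example:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_recommendations(read_ids, target_id, articles, embeddings, top_n=2, alpha=0.5):
    """Blend similarity to the current article with similarity to the user's reading history."""
    content_sim = cosine_similarity([embeddings[target_id]], embeddings)[0]
    if read_ids:
        # Average the embeddings of previously read articles as a crude user profile
        profile = np.mean([embeddings[i] for i in read_ids], axis=0)
        profile_sim = cosine_similarity([profile], embeddings)[0]
    else:
        profile_sim = np.zeros_like(content_sim)
    scores = alpha * content_sim + (1 - alpha) * profile_sim
    ranked = np.argsort(scores)[::-1]
    exclude = set(read_ids) | {target_id}
    return [articles[i] for i in ranked if i not in exclude][:top_n]

# Example: the user has already read article 3 and is now reading article 0
print([a["title"] for a in hybrid_recommendations([3], 0, articles, embeddings)])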
Cross-Lingual Applications
One of the powerful features of modern transformer models is their ability to generate embeddings for text in multiple languages. This enables cross-lingual applications where you can compare or process text across different languages.
Let’s implement a simple cross-lingual semantic search system:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    {
        "language": "English",
        "text": ("Machine learning is a field of study that gives computers the ability to learn "
                 "without being explicitly programmed.")
    },
    {
        "language": "Spanish",
        "text": ("El aprendizaje automático es un campo de estudio que da a las computadoras la "
                 "capacidad de aprender sin ser programadas explícitamente.")
    },
    {
        "language": "French",
        "text": ("L'apprentissage automatique est un domaine d'étude qui donne aux ordinateurs "
                 "la capacité d'apprendre sans être explicitement programmés.")
    },
    {
        "language": "German",
        "text": ("Maschinelles Lernen ist ein Studienbereich, der Computern die Fähigkeit gibt, "
                 "zu lernen, ohne explizit programmiert zu werden.")
    },
    {
        "language": "Italian",
        "text": ("Il machine learning è un campo di studio che conferisce ai computer la capacità "
                 "di apprendere senza essere esplicitamente programmati.")
    },
    {
        "language": "English",
        "text": ("Natural language processing is a subfield of linguistics, computer science, "
                 "and artificial intelligence.")
    },
    {
        "language": "English",
        "text": ("Computer vision is an interdisciplinary field that deals with how computers can "
                 "gain high-level understanding from digital images or videos.")
    }
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Generate embeddings for the corpus
texts = [doc["text"] for doc in corpus]
embeddings = model.encode(texts)

# Define a query in English and generate an embedding
query = "What is machine learning?"
query_embedding = model.encode(query)

# Sort the embeddings of the corpus by descending similarity
similarities = cosine_similarity([query_embedding], embeddings)[0]
ranked_indices = np.argsort(similarities)[::-1]

# Print ranked results
print(f"Query: {query}\n")
for i, idx in enumerate(ranked_indices[:3]):  # Show top 3 results
    print(f"{i+1}. [{corpus[idx]['language']}] {corpus[idx]['text']} (Similarity: {similarities[idx]:.4f})")
In this example, you use a multilingual Sentence Transformer model (paraphrase-multilingual-MiniLM-L12-v2) to create embeddings for documents in different languages. The corpus covers several languages and several topics. The program implements a simple question-answering system, except that the question may find its answer in a different language.
The example is very similar to the one in the previous section. The corpus is first converted into embeddings. Then the query, in its embedding form, is compared against the corpus by cosine similarity, and the top 3 results are printed. Running this code will give you:
Query: What is machine learning?
1. [Italian] Il machine learning è un campo di studio che conferisce ai computer la capacità di apprendere senza essere esplicitamente programmati. (Similarity: 0.8129)
2. [English] Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. (Similarity: 0.7788)
3. [French] L'apprentissage automatique est un domaine d'étude qui donne aux ordinateurs la capacité d'apprendre sans être explicitement programmés. (Similarity: 0.7470)
The top answer is in Italian, while the question, “What is machine learning?” is in English. This works because the embedding vector represents the semantic meaning of the text, regardless of the language. This cross-lingual capability is particularly useful for applications like multilingual search engines.
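Because all languages share the same embedding space, you can also ask the question in a language that barely appears in the corpus. Below is a small variation on the code above, reusing model, corpus, and embeddings; the German query string is only an illustrative assumption:

# Query in German: roughly "What is computer vision?"
query_de = "Was ist Computer Vision?"
query_de_embedding = model.encode(query_de)

similarities_de = cosine_similarity([query_de_embedding], embeddings)[0]
best_idx = int(np.argmax(similarities_de))
print(f"Best match [{corpus[best_idx]['language']}]: {corpus[best_idx]['text']}")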
Text Classification
Imagine you have a lot of text data, and it is growing every day. This may be because you are collecting new articles or emails. You want to classify them into different categories. This can be done by using text embeddings.
This task is similar to “topic modeling”. Topic modeling is an unsupervised learning task that groups text documents into different topics, using algorithms such as Latent Dirichlet Allocation (LDA) to find signature keywords for each topic. Here, however, the approach is supervised: you have a predefined set of categories and some labeled examples (perhaps classified manually), and any new text added to the collection is then classified automatically.
Text embeddings help by distilling the semantic meaning of the text into vectors. You can then train a machine learning model to classify the vectors into categories. This usually works better than bag-of-words or TF-IDF features because the vector represents the meaning of the text rather than its surface form.
There are many ways to implement the machine learning classifier. A simple one is a logistic regression from scikit-learn. Let’s implement this in the code:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

articles = [
    # Business articles
    {"text": "The stock market reached a new high today, with technology stocks leading the gains.", "category": "Business"},
    {"text": "The government announced a new tax policy that will affect small businesses.", "category": "Business"},
    {"text": "The central bank has decided to keep interest rates unchanged.", "category": "Business"},
    {"text": "Quarterly earnings reports exceeded expectations for most Fortune 500 companies.", "category": "Business"},
    {"text": "Inflation rates have decreased for the third consecutive month.", "category": "Business"},
    {"text": "The merger between two major corporations has been approved by regulators.", "category": "Business"},
    {"text": "Unemployment rates have fallen to a five-year low according to new data.", "category": "Business"},
    {"text": "The cryptocurrency market experienced significant volatility this week.", "category": "Business"},

    # Health articles
    {"text": "A new study shows that regular exercise can reduce the risk of heart disease.", "category": "Health"},
    {"text": "A clinical trial for a new cancer treatment has shown promising results.", "category": "Health"},
    {"text": "A balanced diet and regular sleep are essential for maintaining good health.", "category": "Health"},
    {"text": "Medical researchers have identified a new gene linked to Alzheimer's disease.", "category": "Health"},
    {"text": "The WHO has issued new guidelines for managing diabetes in elderly patients.", "category": "Health"},
    {"text": "A new technique for early detection of breast cancer has been developed.", "category": "Health"},
    {"text": "Studies show that mindfulness meditation can help reduce stress and anxiety.", "category": "Health"},
    {"text": "Public health officials warn of a potential flu outbreak this winter season.", "category": "Health"},

    # Technology articles
    {"text": "The latest smartphone from Apple features a better camera and longer battery life.", "category": "Technology"},
    {"text": "The new electric car from Tesla has a range of over 400 miles.", "category": "Technology"},
    {"text": "The latest update to the operating system includes new security features.", "category": "Technology"},
    {"text": "A new artificial intelligence system can detect diseases from medical images.", "category": "Technology"},
    {"text": "The tech company unveiled its new virtual reality headset at the annual conference.", "category": "Technology"},
    {"text": "Researchers have developed a quantum computer that can solve complex problems.", "category": "Technology"},
    {"text": "The new social media platform has gained millions of users in just a few months.", "category": "Technology"},
    {"text": "Cybersecurity experts warn of a new type of malware targeting smart home devices.", "category": "Technology"},

    # Science articles
    {"text": "Scientists have discovered a new species of frog in the Amazon rainforest.", "category": "Science"},
    {"text": "Astronomers have observed a supernova in a distant galaxy.", "category": "Science"},
    {"text": "Researchers have developed a new method for measuring ocean temperatures.", "category": "Science"},
    {"text": "A fossil discovery suggests that dinosaurs may have been warm-blooded.", "category": "Science"},
    {"text": "Climate scientists report that Arctic ice is melting at an unprecedented rate.", "category": "Science"},
    {"text": "Physicists have confirmed the existence of a new subatomic particle.", "category": "Science"},
    {"text": "A study of coral reefs shows signs of recovery in protected marine areas.", "category": "Science"},
    {"text": "Biologists have sequenced the genome of an endangered species of tiger.", "category": "Science"}
]

# Prepare data for classification training
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [article["text"] for article in articles]
X = model.encode(texts)
y = [article["category"] for article in articles]

# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# Train a logistic regression classifier with regularization
classifier = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
classifier.fit(X_train, y_train)

# Evaluate the classifier
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))

# Classify new articles
new_articles = [
    "The company reported a 20% increase in quarterly profits.",
    "A new vaccine has been approved for use against the flu.",
    "The new laptop features a faster processor and more memory.",
    "The Mars rover has sent back new images of the planet's surface."
]
new_embeddings = model.encode(new_articles)
new_embeddings_scaled = scaler.transform(new_embeddings)
new_predictions = classifier.predict(new_embeddings_scaled)
for article, prediction in zip(new_articles, new_predictions):
    print(f"Article: {article}\nPredicted Category: {prediction}\n")
When you run this, you will get:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00         2
      Health       0.50      1.00      0.67         1
     Science       1.00      1.00      1.00         2
  Technology       1.00      0.50      0.67         2

    accuracy                           0.86         7
   macro avg       0.88      0.88      0.83         7
weighted avg       0.93      0.86      0.86         7

Article: The company reported a 20% increase in quarterly profits.
Predicted Category: Business

Article: A new vaccine has been approved for use against the flu.
Predicted Category: Health

Article: The new laptop features a faster processor and more memory.
Predicted Category: Technology

Article: The Mars rover has sent back new images of the planet's surface.
Predicted Category: Science
In this example, the corpus is annotated with one of the four categories: business, health, technology, or science. The text is converted into embeddings, which, together with the category label, are used to train a logistic regression classifier.
The classifier is trained with 80% of the corpus and then evaluated with the remaining 20%. The results are printed in the form of a classification report. You can see that Business and Science are classified accurately, but Health and Technology less so. Once training is finished, you can use the trained classifier on new articles. The workflow is the same as in training: encode the text into embeddings, scale the embeddings using the trained scaler, and finally use the trained classifier to predict the category.
Note that you can use other classifiers like random forest or K-Nearest Neighbors. You can try them and see which one works better.
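For instance, swapping in a k-nearest-neighbors classifier changes only a couple of lines. This is a minimal sketch that reuses X_train, y_train, X_test, and y_test from the code above; the choice of n_neighbors=3 is an assumption you would want to tune:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))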
Zero-Shot Classification
In the previous example, you trained a classifier to assign text to one of the predefined categories. If the category labels are themselves meaningful text, why not use the meaning of the label for classification? You can simply convert the text into an embedding and compare it with the embeddings of the category labels. The text is then tagged with the most similar category label.
This is the idea of zero-shot learning. It is not a supervised learning task. Indeed, you never train a new model, but the classification and information retrieval tasks can still be done.
Let’s implement a zero-shot text classifier using text embeddings:
import torch
from sentence_transformers import SentenceTransformer, util

texts = [
    "The stock market reached a new high today, with technology stocks leading the gains.",
    "A new study shows that regular exercise can reduce the risk of heart disease.",
    "The latest smartphone from Apple features a better camera and longer battery life.",
    "Scientists have discovered a new species of frog in the Amazon rainforest."
]
categories = ["Business", "Health", "Technology", "Science"]

# Load a pre-trained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
text_embeddings = model.encode(texts, convert_to_tensor=True)
category_embeddings = model.encode(categories, convert_to_tensor=True)

# Calculate cosine similarity between texts and categories
similarities = util.cos_sim(text_embeddings, category_embeddings)

# Get the most similar category for each text
best_categories = torch.argmax(similarities, dim=1)
for i, text in enumerate(texts):
    category = categories[best_categories[i]]
    similarity = similarities[i][best_categories[i]].item()
    print(f"Text: {text}")
    print(f"Category: {category} (Similarity: {similarity:.4f})\n")
The output is:
Text: The stock market reached a new high today, with technology stocks leading the gains.
Category: Technology (Similarity: 0.2624)

Text: A new study shows that regular exercise can reduce the risk of heart disease.
Category: Health (Similarity: 0.3297)

Text: The latest smartphone from Apple features a better camera and longer battery life.
Category: Technology (Similarity: 0.1623)

Text: Scientists have discovered a new species of frog in the Amazon rainforest.
Category: Science (Similarity: 0.1940)
The result may not be as good as the previous example because the category labels are sometimes ambiguous, and you do not have a model trained for this task. Nevertheless, it produces meaningful results.
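One common way to reduce the ambiguity of one-word labels is to embed a short description of each category instead of the bare label. The sketch below reuses model, texts, text_embeddings, and categories from the code above; the description strings are illustrative assumptions:

# Richer label descriptions often give a stronger signal than single-word labels
category_descriptions = [
    "This text is about business, finance, markets, and the economy.",
    "This text is about health, medicine, and wellness.",
    "This text is about technology, gadgets, and software.",
    "This text is about science and new research discoveries."
]
description_embeddings = model.encode(category_descriptions, convert_to_tensor=True)
similarities = util.cos_sim(text_embeddings, description_embeddings)
best = torch.argmax(similarities, dim=1)
for text, idx in zip(texts, best):
    print(f"{categories[int(idx)]}: {text}")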
Zero-shot learning is particularly useful for tasks where labeled training data is scarce or unavailable. It can be applied to a wide range of NLP tasks, including classification, entity recognition, and question-answering.
Visualizing Text Embeddings
Although not an application in its own right, visualizing text embeddings can provide insights into the semantic relationships between texts. Since embeddings typically have hundreds of dimensions, you need dimensionality reduction techniques to visualize them in 2D or 3D.
PCA is probably the most popular dimensionality reduction technique. However, for visualization, t-SNE (t-Distributed Stochastic Neighbor Embedding) usually works better. Let’s implement a visualization of text embeddings using t-SNE:
import matplotlib.pyplot as plt
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

texts_with_categories = [
    {"text": "The stock market reached a new high today.", "category": "Business"},
    {"text": "Investors are optimistic about the economy.", "category": "Business"},
    {"text": "The company reported strong quarterly earnings.", "category": "Business"},
    {"text": "The central bank has decided to keep interest rates unchanged.", "category": "Business"},
    {"text": "A new study shows that regular exercise can reduce the risk of heart disease.", "category": "Health"},
    {"text": "A balanced diet is essential for maintaining good health.", "category": "Health"},
    {"text": "The new vaccine has been approved for use against the flu.", "category": "Health"},
    {"text": "Sleep is important for physical and mental health.", "category": "Health"},
    {"text": "The latest smartphone features a better camera and longer battery life.", "category": "Technology"},
    {"text": "The new laptop has a faster processor and more memory.", "category": "Technology"},
    {"text": "The software update includes new security features.", "category": "Technology"},
    {"text": "5G networks promise faster internet speeds for mobile devices.", "category": "Technology"},
    {"text": "Scientists have discovered a new species in the Amazon rainforest.", "category": "Science"},
    {"text": "Astronomers have observed a supernova in a distant galaxy.", "category": "Science"},
    {"text": "The Mars rover has sent back new images of the planet's surface.", "category": "Science"},
    {"text": "Researchers have developed a new method for measuring ocean temperatures.", "category": "Science"}
]

# Extract texts and categories
texts = [item["text"] for item in texts_with_categories]
categories = [item["category"] for item in texts_with_categories]

# Generate embeddings, then reduce dimension with t-SNE
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)

tsne = TSNE(n_components=2, perplexity=5, random_state=42)
reduced_embeddings = tsne.fit_transform(embeddings)

# Define colors for categories
unique_categories = list(set(categories))
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_categories)))
category_to_color = {category: color for category, color in zip(unique_categories, colors)}

# Create a scatter plot
plt.figure(figsize=(10, 8))
for i, (x, y) in enumerate(reduced_embeddings):
    category = categories[i]
    color = category_to_color[category]
    plt.scatter(x, y, color=color, alpha=0.7)
    plt.annotate(texts[i][:20] + "...", (x, y), fontsize=8)

# Add legend, mark the axes
for category, color in category_to_color.items():
    plt.scatter([], [], color=color, label=category)
plt.legend()
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.title("t-SNE Visualization of Text Embeddings")
plt.tight_layout()
plt.show()
You used scikit-learn's t-SNE implementation. It is easy to use: all you need to do is pass the rows of embedding vectors to the tsne.fit_transform() method. The output reduced_embeddings is an $N \times 2$ array, i.e., coordinates in 2D space.
Then, you used a for-loop to plot each transformed embedding as a point in the scatter plot. Each point is colored based on the category annotated in the original text. To avoid cluttering the plot, the legend is created afterward in another for-loop. The result is a scatter plot of all texts in 2D.
The visualization places texts with similar meanings close together, which suggests that the embeddings capture the semantic meaning of the texts. You can inspect the plot and check whether points from the same category cluster tightly enough to judge whether your embeddings are good.
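If you prefer a number to eyeballing the plot, one option is the silhouette score from scikit-learn, computed on the original high-dimensional embeddings and grouped by the category labels. This is a sketch under the assumption that the labels are a reasonable ground truth for clustering:

from sklearn.metrics import silhouette_score

# Ranges from -1 to 1; higher means points sit closer to their own category than to other categories
score = silhouette_score(embeddings, categories, metric="cosine")
print(f"Silhouette score: {score:.4f}")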
Other dimensionality reduction techniques exist, such as PCA (Principal Component Analysis) or UMAP (Uniform Manifold Approximation and Projection). You can try these to see if the visualization still makes sense.
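For reference, a PCA-based variant changes only the dimensionality-reduction step; this minimal sketch reuses embeddings from above, and the rest of the plotting code stays the same:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)
# Fraction of variance kept by the two components (often low for 384-dimensional embeddings)
print(pca.explained_variance_ratio_)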
Summary
In this tutorial, you learned a few applications of text embeddings. Particularly, you learned how to:
- Build a recommendation system using similarity in the embedding space
- Implement cross-lingual applications with multilingual embeddings
- Train a text classification system using embeddings as features
- Develop a zero-shot text labeling application using similarity metrics in embedding space
- Visualize and analyze text embeddings
Text embeddings are simple yet powerful tools for a wide range of NLP tasks. They enable machines to understand and process text in a way that captures semantic meaning.