Example Applications of Text Embedding


Text embeddings have revolutionized natural language processing by providing dense vector representations that capture semantic meaning. In the previous tutorial, you learned how to generate these embeddings using transformer models. In this post, you will learn about advanced applications of text embeddings that go beyond basic tasks such as semantic search and document clustering.
Specifically, you will learn:

  • How to build recommendation systems using text embeddings
  • How to implement cross-lingual applications with multilingual embeddings
  • How to create text classification systems with embedding-based features
  • How to develop zero-shot learning applications
  • How to visualize and analyze text embeddings

Let’s get started.

Example Applications of Text Embedding
Photo by Christina Winter. Some rights reserved.

Overview

This post is divided into five parts; they are:

  • Recommendation Systems
  • Cross-Lingual Applications
  • Text Classification
  • Zero-Shot Classification
  • Visualizing Text Embeddings

Recommendation Systems

A simple recommendation system can be created by finding a few items most similar to the target item. In natural language processing, for example, you can surface a few similar articles as "you may also like" suggestions while the user is reading an article.

There are many ways to implement this, but the easiest is to check how similar two articles are. You can convert every article into a context embedding; the two articles whose embeddings have the highest similarity are the most similar in content. This may not be exactly what you expect from a recommendation system, but it is often useful and it is a good starting point.

Let’s implement this as follows:
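Below is a minimal sketch of this idea. The five articles in the corpus are placeholder examples; the rest follows the workflow described below, using the all-MiniLM-L6-v2 model and cosine similarity from scikit-learn.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of articles; the texts are placeholder examples
corpus = [
    "Deep learning models require large amounts of data for training.",
    "Neural networks are loosely inspired by the human brain.",
    "Machine learning algorithms can identify patterns in data.",
    "Cloud computing provides scalable resources for modern applications.",
    "Good sleep and regular exercise improve overall health.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def create_article_embeddings(articles, model):
    """Convert each article into a context embedding; shape is (n_articles, 384)."""
    return model.encode(articles)

def get_recommendations(article_idx, embeddings, top_k=2):
    """Return the indices of the top-k articles most similar to the given one."""
    # Compare one embedding against all others; the result has a single row
    similarities = cosine_similarity([embeddings[article_idx]], embeddings)[0]
    # argsort gives ascending order; reverse it for descending similarity
    ranked = np.argsort(similarities)[::-1]
    # Skip the article itself, then keep the top-k most similar ones
    return [i for i in ranked if i != article_idx][:top_k]

embeddings = create_article_embeddings(corpus, model)
target = 0
print(f"Because you read: {corpus[target]}")
print("You may also like:")
for idx in get_recommendations(target, embeddings):
    print(f" - {corpus[idx]}")
```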

You set up a corpus at the beginning of the code because it is a toy example. In practice, you may want to retrieve the corpus from a database or from a file system.

In this program, you use the all-MiniLM-L6-v2 model, instantiated with SentenceTransformer. This is a pre-trained model that can encode a text into a context embedding. In the function create_article_embeddings(), you take all the articles defined in the corpus and convert each of them into a context embedding. The output is a vector of vectors, or a matrix. In this particular implementation, there are 5 items in the corpus and each embedding vector has 384 dimensions, so the output, embeddings, is a matrix of shape (5, 384).

In get_recommendations(), you calculate the cosine similarity between one embedding and all the others. The cosine_similarity() function from scikit-learn takes two lists of vectors and returns a matrix telling how similar each pair of vectors is. Since you are comparing one vector to all the others, the output matrix has only a single row. With np.argsort(similarities), you obtain the indices that sort the similarity scores in ascending order. Since cosine similarity is 1 when two vectors are identical and 0 when they are orthogonal (i.e., totally different), you reverse the result to order the scores in descending order. The most similar items are then those at the beginning of the list, except for the first one, which is the article itself.

Once you have the indices of the most similar items, you use a for-loop to print the recommendations.

When you run this code, you will get:

These recommendations will be based on semantic similarity rather than just keyword matching, so you will get articles about neural networks or machine learning even if they don’t contain the exact phrase “deep learning.” This approach can be extended to more complex recommendation systems by incorporating user preferences, collaborative filtering, or hybrid approaches.

Cross-Lingual Applications

One of the powerful features of modern transformer models is their ability to generate embeddings for text in multiple languages. This enables cross-lingual applications where you can compare or process text across different languages.

Let’s implement a simple cross-lingual semantic search system:
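Below is a minimal sketch. The documents in the corpus are placeholder examples in different languages; the model is the multilingual paraphrase-multilingual-MiniLM-L12-v2 discussed next.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Multilingual corpus; the documents are placeholder examples
corpus = [
    "Il machine learning è una branca dell'intelligenza artificiale.",  # Italian
    "La programmation est une compétence essentielle de nos jours.",    # French
    "Künstliche Intelligenz verändert viele Branchen.",                  # German
    "El cambio climático es un desafío global.",                         # Spanish
    "Deep learning uses neural networks with many layers.",              # English
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc_embeddings = model.encode(corpus)

# Encode the query and compare it against every document
query = "What is machine learning?"
query_embedding = model.encode([query])
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

# Print the top 3 most similar documents, regardless of language
top_indices = similarities.argsort()[::-1][:3]
for rank, idx in enumerate(top_indices, start=1):
    print(f"{rank}. (score={similarities[idx]:.4f}) {corpus[idx]}")
```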

In this example, you use a multilingual Sentence Transformer model (paraphrase-multilingual-MiniLM-L12-v2) to create embeddings for documents in different languages. The corpus covers several languages and several topics. The program implements a simple question-answering system, in which the answer to a question may be found in a different language than the question itself.

The example above is very similar to the one in the previous section. The corpus is first converted into embeddings. Then the query in its embedding form is compared with the corpus by cosine similarity. The top 3 results are printed. Running this code will give you:

The top answer is in Italian, while the question, “What is machine learning?” is in English. This works because the embedding vector represents the semantic meaning of the text, regardless of the language. This cross-lingual capability is particularly useful for applications like multilingual search engines.

Text Classification

Imagine you have a lot of text data, and it is growing every day. This may be because you are collecting new articles or emails. You want to classify them into different categories. This can be done by using text embeddings.

This task is similar to "topic modeling". Topic modeling is an unsupervised learning task that groups text documents into different topics, using algorithms such as Latent Dirichlet Allocation (LDA) to find the signature keywords of each topic. Here, however, is a supervised approach: you have a predefined set of categories and some labeled examples (perhaps classified manually), and new text added to the collection is then classified automatically.

Text embeddings help by capturing the semantic meaning of the text in a vector. You can then train a machine learning model to classify those vectors into categories. Because the vector represents the meaning of the text rather than its surface form, this works better than using bag-of-words or TF-IDF features.

There are many ways to implement the machine learning classifier. A simple one is a logistic regression from scikit-learn. Let’s implement this in the code:
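Below is a minimal sketch of this workflow. The sample texts, their labels, and the choice of the all-MiniLM-L6-v2 encoder are placeholder assumptions; the scaler and logistic regression come from scikit-learn.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy labeled corpus; the texts and labels are placeholder examples
texts = [
    "The central bank raised interest rates to curb inflation.",
    "The startup secured new funding to expand overseas.",
    "Quarterly profits exceeded analyst expectations.",
    "Retail sales dropped sharply during the holiday season.",
    "Regular exercise and a balanced diet reduce the risk of heart disease.",
    "The new vaccine showed strong results in clinical trials.",
    "Doctors recommend routine screening for early detection.",
    "Sleep quality has a large effect on mental health.",
    "The new smartphone ships with a faster chip and a better camera.",
    "The software update patches a critical security vulnerability.",
    "Cloud providers keep expanding their data center capacity.",
    "The laptop features a high-resolution display and long battery life.",
    "Astronomers discovered an exoplanet orbiting a distant star.",
    "The particle accelerator experiment confirmed a theoretical prediction.",
    "Researchers sequenced the genome of an ancient plant species.",
    "A new study explains how volcanic eruptions affect the climate.",
]
labels = (["business"] * 4 + ["health"] * 4
          + ["technology"] * 4 + ["science"] * 4)

# Convert the texts into embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)

# Hold out 20% of the corpus for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42, stratify=labels)

# Scale the embeddings, then train a logistic regression classifier
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)

# Evaluate on the held-out set and print a classification report
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
```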

When you run this, you will get:

In this example, each document in the corpus is annotated with one of four categories: business, health, technology, or science. The text is converted into embeddings, which, together with the category labels, are used to train a logistic regression classifier.

The classifier is trained on 80% of the corpus and then evaluated on the remaining 20%. The results are printed in the form of a classification report. You can see that Business and Science are classified accurately, but Health and Technology less so. Once training is finished, you can use the trained classifier on new articles, as shown below. The workflow is the same as in training: encode the text into an embedding, scale the embedding using the trained scaler, and finally use the trained classifier to predict the category.
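To make that inference workflow concrete, the snippet below continues the sketch above and classifies a new, hypothetical article with the trained model, scaler, and classifier:

```python
# Classify a new, unseen article (a placeholder example) using the objects trained above
new_article = "The tech giant unveiled its latest wearable device."
new_embedding = model.encode([new_article])       # encode the text into an embedding
new_embedding = scaler.transform(new_embedding)   # scale with the trained scaler
print(classifier.predict(new_embedding))          # predict the category
```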

Note that you can use other classifiers like random forest or K-Nearest Neighbors. You can try them and see which one works better.

Zero-Shot Classification

In the previous example, you trained a classifier to classify text into one of the predefined categories. If the category labels are meaningful text, why not use the meaning of the labels themselves for classification? In this way, you simply convert the text into an embedding and compare it with the embeddings of the category labels. The text is then tagged with the most similar category label.

This is the idea behind zero-shot learning. It is not a supervised learning task: you never train a new model, yet classification and information retrieval tasks can still be performed.

Let’s implement a zero-shot text classifier using text embeddings:
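Below is a minimal sketch. The candidate labels match the previous section, while the sample texts and the all-MiniLM-L6-v2 model are placeholder assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Candidate category labels and some sample texts; the texts are placeholder examples
labels = ["business", "health", "technology", "science"]
texts = [
    "The company reported record revenue in the last quarter.",
    "A new treatment shows promise against seasonal allergies.",
    "The latest graphics card doubles the performance of its predecessor.",
    "Researchers observed gravitational waves from merging black holes.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
label_embeddings = model.encode(labels)
text_embeddings = model.encode(texts)

# For each text, pick the label whose embedding is the most similar
similarities = cosine_similarity(text_embeddings, label_embeddings)
for text, row in zip(texts, similarities):
    best = row.argmax()
    print(f"{labels[best]:>12} ({row[best]:.4f}): {text}")
```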

The output is:

The results may not be as good as in the previous example because the category labels are sometimes ambiguous and no model is trained specifically for this task. Nevertheless, the output is meaningful.

Zero-shot learning is particularly useful for tasks where labeled training data is scarce or unavailable. It can be applied to a wide range of NLP tasks, including classification, entity recognition, and question-answering.

Visualizing Text Embeddings

While not an application in itself, visualizing text embeddings can provide insights into the semantic relationships between texts. Since embeddings typically have hundreds of dimensions, you need dimensionality reduction techniques to visualize them in 2D or 3D.

PCA is probably the most popular dimensionality reduction technique. However, for visualization, t-SNE (t-Distributed Stochastic Neighbor Embedding) usually works better. Let’s implement a visualization of text embeddings using t-SNE:
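Below is a minimal sketch. The texts and their categories are placeholder examples, and the all-MiniLM-L6-v2 model is an illustrative choice; the t-SNE step uses scikit-learn as described next.

```python
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

# Toy labeled corpus; the texts and categories are placeholder examples
texts = [
    "The stock market closed higher today.",
    "Investors reacted to the quarterly earnings report.",
    "A balanced diet supports a healthy immune system.",
    "Regular exercise lowers blood pressure.",
    "The new smartphone has a faster processor.",
    "The software update fixes several security bugs.",
    "Astronomers discovered a distant exoplanet.",
    "The experiment confirmed a long-standing theoretical prediction.",
]
categories = ["business", "business", "health", "health",
              "technology", "technology", "science", "science"]

# Encode the texts, then reduce the embeddings to 2D with t-SNE
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)
tsne = TSNE(n_components=2, perplexity=3, random_state=42)
reduced = tsne.fit_transform(embeddings)   # shape (N, 2)

# Plot each transformed embedding as a point, colored by its category
colors = {"business": "tab:blue", "health": "tab:green",
          "technology": "tab:orange", "science": "tab:red"}
for (x, y), category in zip(reduced, categories):
    plt.scatter(x, y, color=colors[category])

# Build the legend afterward, one entry per category, to avoid clutter
for category, color in colors.items():
    plt.scatter([], [], color=color, label=category)
plt.legend()
plt.title("t-SNE visualization of text embeddings")
plt.show()
```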

You use scikit-learn's t-SNE implementation. It is easy to use: all you need to do is pass the rows of embedding vectors to the tsne.fit_transform() method. The output is an $N \times 2$ array, i.e., coordinates in 2D space.

Then, you use a for-loop to plot each transformed embedding as a point in a scatter plot. Each point is colored based on the category annotated in the original text. To avoid cluttering the plot, the legend is created afterward in another for-loop. The plot produced looks like the following:

The visualization places texts with similar meanings close together, which shows that the embeddings capture the semantic meaning of the texts. You can look at the plot and check whether points from the same category form tight clusters to judge how good your embeddings are.

Other dimensionality reduction techniques exist, such as PCA (Principal Component Analysis) or UMAP (Uniform Manifold Approximation and Projection). You can try these to see if the visualization still makes sense.

Further Readings

Below are some further readings that you may find useful:

Summary

In this tutorial, you learned a few applications of text embeddings. Particularly, you learned how to:

  • Build a recommendation system using similarity in the embedding space
  • Implement cross-lingual applications with multilingual embeddings
  • Train a text classification system using embeddings as features
  • Develop a zero-shot text labeling application using similarity metrics in embedding space
  • Visualize and analyze text embeddings

Text embeddings are simple yet powerful tools for a wide range of NLP tasks. They enable machines to understand and process text in a way that captures semantic meaning.
