What Are Vision-Language Models (VLMs): The AI That’s Learning to See and Speak


With ChatGPT, we saw what LLMs can do; now Vision-Language Models (VLMs) are taking the stage: AI that can both see and speak. AI is evolving quickly, but VLMs stand out as one of the coolest innovations around. These models combine two areas of AI, computer vision and natural language processing (NLP), allowing them to both “see” images and “speak” about or understand text.

VLhuh? So, VLMs are AI systems that process both images and text. Imagine being able to upload a photo to an app and having the system not only describe it but also answer questions about what it sees: “Is that a man or a woman?” “How tall is the person?” “Are they carrying anything suspicious?” (useful for CCTV monitoring), or “Is there enough space between two objects for a car to pass through?” They say sufficiently advanced technology is indistinguishable from magic, and you might wonder how a Tesla can autonomously drive you from one city to another. That’s the power of VLMs: magic. Chances are, you’re already using them in your day-to-day life, and I’ll show you how.

Here are three definitions of VLMs with increasing complexity, from middle schooler to PhD level:

  1. For a Middle School Student: VLMs are how computers or phones can look at pictures and understand what they are, just like how we see things and describe them. For example, if you show it a picture of a dog, it can tell you, “That’s a dog,” and even describe what the dog is doing or how big the dog is.
  2. For a Normal Everyday Adult: VLMs are a branch of AI that brings together the ability to understand images as well as text. They can look at a photo and describe it in words, or read text and relate it to what’s in an image. This tech is used in self-driving cars to recognize road signs and navigate around other vehicles, and on social media platforms, where it automatically captions images and videos so they can be recommended to you on your For You page.
  3. For a PhD-Level Audience: VLMs integrate multimodal data processing, combining deep learning models for computer vision and natural language processing to create joint embeddings. They use techniques like contrastive learning to align visual and textual representations, enabling tasks like image captioning, visual question answering, and multimodal retrieval (a minimal sketch of this alignment follows below).
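
To make “joint embeddings” and “contrastive learning” a little more concrete, here is a minimal sketch of zero-shot image-text matching with a pretrained CLIP model. It assumes the Hugging Face transformers library, the public openai/clip-vit-base-patch32 checkpoint, and a hypothetical local photo called cat.jpg; treat it as an illustration, not the setup behind any particular product.

```python
# A minimal zero-shot image-text matching sketch, assuming the Hugging Face
# `transformers` library and the public "openai/clip-vit-base-patch32" checkpoint.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local photo
captions = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode the image and the candidate captions into the same embedding space.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability means the caption sits closer to the image in that space.
probs = outputs.logits_per_image.softmax(dim=1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f"{caption}: {p:.2f}")
```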

Vision-Language Models work by pairing images with text during training, which lets them learn how visual features relate to specific language. During this process, the model is shown enormous numbers of image-text pairs, like a picture of a cat next to the word “cat.” Over time, the VLM learns to associate specific features (fluffy pointed ears, whiskers, etc.) with words (“cat”) and concepts (animal). Through this, the model can recognize and caption new images or answer questions about photos (like “What type of cat is this?”).
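
Here is a toy sketch of what that pairing objective can look like in code: a symmetric contrastive loss over a batch of image and caption embeddings, in the style used to train models like CLIP. It assumes PyTorch, and the random tensors stand in for the outputs of hypothetical image and text encoders.

```python
# A toy version of the contrastive objective used to train CLIP-style models,
# assuming PyTorch; the random tensors stand in for real encoder outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product becomes cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity of every image to every caption in the batch.
    logits = image_embeds @ text_embeds.t() / temperature

    # The i-th image should match the i-th caption and nothing else.
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2

# A batch of 4 hypothetical image-caption pairs, each embedded into 512 dimensions.
loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```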

The two main components of a Vision-Language Model are the vision encoder and the text decoder.

  1. Vision Encoder: This part is responsible for analyzing the image and processing it by breaking it down into smaller pieces (often called “patches” or “tokens”). It then translates those elements into data the model can understand.
  2. Text Decoder: After the image has been processed by the encoder, the text decoder comes into play. This component is trained to turn the visual data into relevant text. Having been trained on huge amounts of paired images and text, it generates a descriptive sentence based on the visual data, like “A dog is sitting on the grass in a park,” by recognizing objects, actions, and context within the image and mapping them to the appropriate words it learned from the pairings (a minimal captioning sketch follows this list).
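
Putting the two components together, here is a minimal captioning sketch that runs an image through a vision encoder and lets a text decoder generate a sentence. It assumes the Hugging Face transformers library, the public Salesforce/blip-image-captioning-base checkpoint, and a hypothetical local photo; it is one way to try the idea, not the only one.

```python
# A minimal image captioning sketch: vision encoder in, caption out. Assumes the
# Hugging Face `transformers` library and the public
# "Salesforce/blip-image-captioning-base" checkpoint.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("dog_in_park.jpg")  # hypothetical local photo

# The processor prepares the image for the vision encoder; generate() lets the
# text decoder produce a caption one token at a time.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
# e.g. "a dog sitting on the grass in a park"
```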

In terms of inner workings, VLMs start by analyzing an image and breaking it down into smaller and smaller parts, better known as “features,” which can include objects, shapes, colors, and pixels. The VLM processes these features using its vision encoder, converting the visual data into numbers the machine can understand. As a simplified picture, imagine the model assigning the shape of a cat’s ears the value 42, while a slightly different ear shape from another image gets 43; when the model encounters a similar shape again, it recognizes the pattern and maps it back to 42, letting it categorize visual features consistently across images. In reality, rather than simple integers like 42 or 43, the encoder represents the shape of a cat’s ear as a vector, something like [0.32, -1.23, 2.54, …]. This vector lets the model treat complex shapes as points in a multidimensional space.
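
If you want to see those vectors for yourself, here is a short sketch using a plain Vision Transformer as the vision encoder. It assumes the Hugging Face transformers library and the public google/vit-base-patch16-224 checkpoint; the file name is hypothetical.

```python
# Peeking at the vectors a vision encoder produces, assuming the Hugging Face
# `transformers` library and the public "google/vit-base-patch16-224" checkpoint.
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.jpg")  # hypothetical local photo
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# Each 16x16 patch of the image becomes a 768-dimensional vector (plus one
# [CLS] summary token): points in a multidimensional space, not single numbers.
print(outputs.last_hidden_state.shape)  # torch.Size([1, 197, 768])
```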

After this, the text decoder connects this information to words or sentences by using what the model learned from huge amounts of images paired with text. That’s how it identifies features like ears and fur and links them to the word “cat.”

These models apply the same logic when trained on data that pairs images with captions describing placement or exact distance, enabling spatial reasoning: they learn how objects are arranged with respect to each other. So you can ask questions like, “Is the cat sitting next to a tree?” “Which cat is bigger?” “How far is the cat from the door?” or “How far is the car from the house?” To answer such questions, the model uses spatial cues to make appropriate decisions; this is exactly what I’m currently researching!
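
As a taste of that kind of question answering, here is a minimal visual question answering sketch. It assumes the Hugging Face transformers library and the public Salesforce/blip-vqa-base checkpoint; the image and question are hypothetical, and a small model like this only handles fairly simple spatial questions.

```python
# A minimal visual question answering sketch, assuming the Hugging Face
# `transformers` library and the public "Salesforce/blip-vqa-base" checkpoint.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("cat_and_tree.jpg")  # hypothetical local photo
question = "Is the cat sitting next to a tree?"

inputs = processor(images=image, text=question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))  # e.g. "yes"
```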

Social media platforms like Instagram, Facebook, and TikTok use Vision-Language Models (VLMs) to generate captions for images, recommend them to the right audiences, and suggest hashtags for you. These systems analyze visual content to identify key elements like a dog, a beach, or food, and that’s how millions of pieces of content get recommended to the appropriate audience on every user’s For You page. They also rely on these models for content moderation, helping identify inappropriate or unsafe material like explicit images so it doesn’t appear in places where it could be harmful, especially for younger audiences.

When it comes to identifying sensitive or inappropriate content, such as nudity, VLMs analyze images by breaking them down into visual features like skin tones, shapes, and body-part proportions. The vision encoder detects these patterns and compares them against pre-existing databases of flagged visual content. For instance, it can distinguish artistic nudity or everyday skin exposure from explicit content that violates community guidelines, ensuring such images are flagged and blocked for moderation before they reach users’ feeds.
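
As a deliberately simplified illustration of the idea (real moderation pipelines use dedicated classifiers, curated datasets, and human review), here is a zero-shot flagging sketch built on the same CLIP model as above; the labels and threshold are assumptions for the example.

```python
# A deliberately simplified zero-shot flagging sketch using the same CLIP model
# as above; the labels and threshold are assumptions for illustration only.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("upload.jpg")  # hypothetical user upload
labels = ["a safe everyday photo", "explicit adult content", "graphic violence"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]

# Send the image to human review if any unsafe label scores above a threshold.
if max(probs[1].item(), probs[2].item()) > 0.5:
    print("Flag for moderation review")
else:
    print("Looks safe")
```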

In autonomous driving, as with Tesla and Waymo, VLMs allow vehicles to interpret complex and nuanced road environments through the visual input from their cameras. The vision encoder processes these images to identify things like traffic signs, road markings, pedestrians, and moving cars (it does this by continuously analyzing frames as they stream in, a more complex, video-aware setup that tracks how things move over time, which we’re not covering today). The text decoder can then produce instructions like “Stop at the red light” or “Avoid the pedestrian.” These systems also understand spatial relationships, like the distance between your car and an obstacle, to know when to stop.

(Fun fact: autonomous vehicles like Waymo use a combination of LiDAR, radar, and high-resolution cameras to make real-time decisions for safe navigation, even in challenging conditions like low light, fog, or rain when cameras may be less effective.)

Here are some really creative and practical day-to-day use cases that people are already taking advantage of (and you can too) with ChatGPT’s GPT-4 (paid version), simply by pasting images into the chat:

For example, if you pay for GPT-4, its vision feature can read tiny, faded, or otherwise hard-to-read text from pictures you give it. Imagine trying to read the label or serial number on a tire that’s badly worn; GPT-4 can step in and identify the correct information, like the make and size. A rough sketch of doing the same thing through the API is below.
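
For anyone who prefers the API to the chat window, here is a rough sketch of the same tire-reading idea using OpenAI’s Python client. The model name gpt-4o and the file name are assumptions; model availability and naming change over time.

```python
# Reading text from a photo through OpenAI's API instead of the ChatGPT UI.
# Assumes the `openai` Python package, an OPENAI_API_KEY in the environment,
# and a vision-capable model; "gpt-4o" and the file name are assumptions.
import base64
from openai import OpenAI

client = OpenAI()

# Encode a hypothetical photo of a worn tire sidewall as a base64 data URL.
with open("tire_sidewall.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Read the make and tire size printed on this sidewall."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```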

Another use case is figuring out illegible writing, like that prescription with messy handwriting your doctor gave you. GPT-4’s optical character recognition (OCR) capabilities go beyond everyday text, even deciphering complex and historical content. Researchers have used it to help transcribe and analyze centuries-old manuscripts, like the notes of Robert Hooke (the 17th-century scientist famous for his pioneering work with the microscope).

If you find a photo (a stock image, say) you really like but don’t have the rights to use, you can have GPT-4V analyze it and write a descriptive prompt, which you can paste back into the chat to have DALL-E 3 recreate a similar image.

If you’re a programmer, you can hand it a rough drawing of a website dashboard or interface and have it generate the corresponding code, automating much of the initial build.

Finally, for students, GPT-4V can interpret complex visuals and infographics, like biology diagrams, and break them down into smaller explanations you can interact with. Some homework or school documents are screenshots of a book and can’t be copied directly, but GPT-4 can read and interpret them so you can get more help. As you can tell, the possibilities are really endless.

Of course, VLMs aren’t without their problems. Two big challenges stand out:

  1. Computational Inefficiency
    One of the big challenges with VLMs is how computationally demanding they are. These models need significant amounts of memory and processing power to work well, which is one reason access to ChatGPT’s vision feature requires a paid subscription. They’re usually built on large transformer architectures (a type of AI model that processes all of its input at once rather than step by step, which makes it good at recognizing patterns and relationships between different parts of the data, like words in a sentence or objects in an image). That size and compute demand leads to high latency, making it quite a challenge to deploy them in real-time applications like autonomous driving or healthcare. One big area of research right now is using techniques like model pruning and knowledge distillation to reduce the size of these models and their computational demand (a toy sketch of the distillation idea follows this list).
  2. Robustness Issues
    Another key issue that I’m actively researching is robustness: these models can perform well in controlled, clean environments but fail when faced with real-world variability. For instance, minor changes they weren’t trained on, like new lighting, unusual angles, or unseen objects, can throw off the model’s accuracy. This is a huge deal in autonomous driving, where there isn’t room for error. Another area of research is improving the generalization of VLMs so they can handle a wider variety of inputs.
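
To give a flavor of the distillation technique mentioned in the first point, here is a toy sketch in PyTorch: a smaller student model is trained to match the softened output distribution of a larger teacher. The random tensors stand in for the logits of hypothetical teacher and student heads.

```python
# A toy knowledge distillation loss, assuming PyTorch: a smaller "student" model
# is trained to match the softened predictions of a larger "teacher" VLM.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then push the student's
    # log-probabilities toward the teacher's probabilities via KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2

# Random logits standing in for a batch of 8 predictions over 100 classes.
loss = distillation_loss(torch.randn(8, 100), torch.randn(8, 100))
print(loss.item())
```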

Looking ahead, two directions stand out for the future of VLMs:

1. Expanding Modalities (Audio, Sensors)

The future of VLMs lies in expanding beyond just visual and textual inputs. Many models are beginning to incorporate multimodal data, like audio and sensor inputs, allowing for more interesting and unique use cases. This could be particularly useful in applications like emotion recognition in virtual assistants, where a model could interpret the user’s tone of voice (audio), facial expressions (vision), and words (text) to respond more empathetically.

2. Spatial Reasoning

Another important aspect is improving spatial reasoning in VLMs — enabling models to not only identify objects in an image but also understand their exact relative position to other objects. This spatial understanding is crucial in fields like robotics and autonomous navigation, where machines need to make sense of complex environments and interact effectively with their surroundings.

Even if you forget everything, here are the four key takeaways you should remember:

  1. VLMs are the ultimate combination of vision and language: These models can “see” images and “speak” by connecting visual data to text, enabling AI to describe, explain, and make decisions based on both factors.
  2. They are already part of your life: Whether through social media auto-captions or content moderation, VLMs help categorize and filter what you see every day online.
  3. Current challenges include computational inefficiency and robustness: These models require significant processing power and may struggle with real-world variability, posing hurdles for applications like real-time decision-making with limited computing resources.
  4. The future of VLMs is multimodal: Expanding VLMs to include audio, sensors, and other inputs will create more advanced systems capable of understanding the world in richer, more interesting ways.

As AI evolves, it’s not just about how far we can push these systems but how we, as users and creators, will shape the ways they understand our world.

The real question, however, is: With so much focus on protecting your data privacy on the internet, what are your concerns about the potential risks of image data? How can we ensure that AI’s ability to ‘see’ doesn’t introduce new privacy violations or surveillance risks that go beyond just text?
