Session 42: Vision-Language Models: The 2026 Multimodal Stack

Session recording: Vision-Language Models with Abheetha Pradhan — from CLIP to native multimodal transformers, production use cases, and OCR vs VLM tradeoffs.

Welcome to session 42! We have Abheetha Pradhan joining us this week for a deep dive into Vision-Language Models.

Abheetha is a Product Manager at GumboTech with deep expertise in both engineering and product, specializing in Vision AI — particularly OCR and document understanding. She's also a long-time member of, and volunteer for, the Applied AI Club community.

Session Overview

This session reviews the evolution and current state of Vision-Language Models (VLMs) — covering the architectural journey from CNNs to native multimodal transformers, when to use VLMs vs traditional OCR, and real production use cases from Pinterest, Instacart, and CWAT.

Key Takeaways

  • Architectural Evolution: VLMs evolved from dual-encoder models (e.g., CLIP) to native multimodal transformers (e.g., Gemini 3 Pro), which are trained on all data types (text, image, video) in a unified space, enabling deeper reasoning.
  • VLM vs. Traditional OCR: Use traditional OCR for deterministic workflows on structured documents where cost and edge deployment are priorities. Use VLMs for complex, unstructured tasks requiring reasoning, despite higher cost and GPU needs.
  • Hybrid Pipelines are Key: Practical applications often combine traditional CV (segmentation, OCR) for pre-processing with VLMs for high-level reasoning, as seen in Instacart's flyer automation.
  • Future Direction: The field is moving toward structured world models and embodied AI, which will enable not just understanding but also action within simulated environments.

Topics Covered

The Evolution of Vision AI

  • Early CV (2012–2017): CNNs and ResNets for object detection and segmentation
  • Dual Encoders (2017+): CLIP's zero-shot classification using separate text and image encoders mapped to a shared space
  • Adapter Architectures: BLIP, Flamingo — vision encoder projecting embeddings into an LLM
  • Native Multimodal Transformers (2023+): Gemini 3 Pro, Qwen3-VL, GPT-5 — ingesting all data types in a unified space
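The dual-encoder idea behind CLIP can be sketched with toy encoders: two separate functions map text and images into one shared vector space, and zero-shot classification reduces to picking the caption prompt most similar to the image embedding. The hash-based `embed_text` below is only a stand-in for CLIP's real transformer encoders, there to make the mechanics runnable:

```python
import numpy as np
from zlib import crc32

DIM = 64
rng = np.random.default_rng(0)

def embed_text(text: str) -> np.ndarray:
    """Toy text encoder: a stable random vector per token, summed and
    L2-normalized. Stands in for CLIP's text transformer."""
    vec = np.zeros(DIM)
    for tok in text.lower().split():
        vec += np.random.default_rng(crc32(tok.encode())).normal(size=DIM)
    return vec / np.linalg.norm(vec)

def embed_image(true_caption: str, noise: float = 0.02) -> np.ndarray:
    """Toy image encoder: contrastive training pulls an image's embedding
    toward its caption's, simulated here as caption embedding + small noise."""
    vec = embed_text(true_caption) + noise * rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)

def zero_shot_classify(image_vec: np.ndarray, labels: list[str]) -> str:
    """CLIP-style zero-shot classification: score the image against one
    text prompt per label and return the best match."""
    prompts = np.stack([embed_text(f"a photo of a {l}") for l in labels])
    sims = prompts @ image_vec  # cosine similarity (all vectors unit-norm)
    return labels[int(np.argmax(sims))]
```

Because both encoders land in the same space, adding a new class is just adding a new prompt — no retraining — which is what made CLIP's zero-shot transfer notable.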

VLM vs. Traditional OCR: When to Use Which

  • VLMs excel at unstructured documents and complex reasoning, and mitigate detail loss on large images via dynamic tiling
  • Traditional OCR wins on structured documents, deterministic workflows, low cost, and edge deployment (~50 MB RAM)
  • The choice depends on your cost constraints, deployment environment, and document complexity
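As a rough illustration (not from the talk itself), these criteria can be encoded as a small routing heuristic; the flag names and priorities are assumptions made for the sketch:

```python
def choose_engine(structured: bool, needs_reasoning: bool,
                  edge_deployment: bool) -> str:
    """Heuristic router over the tradeoffs above. Returns "ocr" or "vlm"."""
    if edge_deployment:
        # VLMs need GPUs; a traditional OCR engine fits in ~50 MB of RAM.
        return "ocr"
    if needs_reasoning or not structured:
        # Unstructured layouts and multi-step reasoning favor a VLM,
        # despite the higher cost.
        return "vlm"
    # Structured documents in deterministic workflows: OCR is cheaper
    # and more predictable.
    return "ocr"
```

In practice teams often route per document type rather than per request, but the ordering above — hardware constraints first, task complexity second, cost last — is a reasonable default.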

Production Use Cases

  • Pinterest Pin Landing: VLM extracts image attributes → LLM refines into topics → CLIP matches content via similarity scoring
  • Instacart Digital Flyer Automation: Hybrid pipeline using SAM segmentation → Paddle OCR → LLM query generation → ANN matching → LLM ranking
  • CWAT Concept-First Labeling: SAM3-based approach where a single prompt defines a concept, then the model finds, segments, and tracks all instances across video
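The matching stage shared by these pipelines (OCR'd text or extracted attributes → candidate catalog items) can be illustrated with brute-force nearest-neighbor search. The toy hash embedding and mini-catalog below are stand-ins for a learned encoder and a real ANN index (e.g. FAISS):

```python
import numpy as np
from zlib import crc32

DIM = 64

def embed(text: str) -> np.ndarray:
    """Toy text embedding: a stable random vector per token, summed and
    L2-normalized. A production pipeline would use a learned encoder."""
    vec = np.zeros(DIM)
    for tok in text.lower().split():
        vec += np.random.default_rng(crc32(tok.encode())).normal(size=DIM)
    return vec / np.linalg.norm(vec)

# Tiny stand-in catalog; the stacked matrix plays the role of an ANN index.
CATALOG = ["organic bananas", "whole milk", "sourdough bread", "cheddar cheese"]
INDEX = np.stack([embed(name) for name in CATALOG])

def match_flyer_item(ocr_text: str, k: int = 2) -> list[str]:
    """Return the top-k catalog candidates for one OCR'd flyer snippet.
    Brute-force cosine search; swap in an ANN library at real scale."""
    sims = INDEX @ embed(ocr_text)
    top = np.argsort(-sims)[:k]
    return [CATALOG[i] for i in top]
```

Brute force is fine at toy scale; the reason such pipelines reach for approximate nearest neighbors is that exact search over millions of catalog embeddings is too slow to run per flyer item.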

Q&A Highlights

  • Handwritten documents: VLMs (Gemini Flash, Qwen SLMs) generally outperform OCR
  • Deepfake detection: Fine-tuned model achieved 97% accuracy but requires continuous retraining
  • Model selection: Use leaderboards like Arena.ai and MMU Pro to compare performance
  • Learning resources: YouTube, blogs, and model leaderboards are the best sources

Here's the entire recording of the session.