Session 42: Vision-Language Models: The 2026 Multimodal Stack

Session recording: Vision-Language Models with Abheetha Pradhan — from CLIP to native multimodal transformers, production use cases, and OCR vs VLM tradeoffs.

Welcome to session 42! We have Abheetha Pradhan joining us this week for a deep dive into Vision-Language Models.

Abheetha is a Product Manager at GumboTech with deep expertise in both engineering and product, specializing in Vision AI — particularly OCR and document understanding. She's also a long-time member of, and volunteer for, the Applied AI Club community.

Session Overview

This session reviews the evolution and current state of Vision-Language Models (VLMs) — covering the architectural journey from CNNs to native multimodal transformers, when to use VLMs vs traditional OCR, and real production use cases from Pinterest, Instacart, and CWAT.

Key Takeaways

  • Architectural Evolution: VLMs evolved from dual-encoder models (e.g., CLIP) to native multimodal transformers (e.g., Gemini 3 Pro), which are trained on all data types (text, image, video) in a unified space, enabling deeper reasoning.
  • VLM vs. Traditional OCR: Use traditional OCR for deterministic workflows on structured documents where cost and edge deployment are priorities. Use VLMs for complex, unstructured tasks requiring reasoning, despite higher cost and GPU needs.
  • Hybrid Pipelines are Key: Practical applications often combine traditional CV (segmentation, OCR) for pre-processing with VLMs for high-level reasoning, as seen in Instacart's flyer automation.
  • Future Direction: The field is moving toward structured world models and embodied AI, which will enable not just understanding but also action within simulated environments.

Topics Covered

The Evolution of Vision AI

  • Early CV (2012–2017): CNNs and ResNets for object detection and segmentation
  • Dual Encoders (2017+): CLIP's zero-shot classification using separate text and image encoders mapped to a shared space
  • Adapter Architectures: BLIP, Flamingo — vision encoder projecting embeddings into an LLM
  • Native Multimodal Transformers (2023+): Gemini 3 Pro, Qwen3-VL, GPT-5 — ingesting all data types in a unified space
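The dual-encoder idea behind CLIP can be sketched with toy encoders: two separate functions map text and images into one shared vector space, and zero-shot classification reduces to picking the caption prompt most similar to the image embedding. The hash-based `embed_text` below is only a stand-in for CLIP's real transformer encoders, there to make the mechanics runnable:

```python
import numpy as np
from zlib import crc32

DIM = 64
rng = np.random.default_rng(0)

def embed_text(text: str) -> np.ndarray:
    """Toy text encoder: a stable random vector per token, summed and
    L2-normalized. Stands in for CLIP's text transformer."""
    vec = np.zeros(DIM)
    for tok in text.lower().split():
        vec += np.random.default_rng(crc32(tok.encode())).normal(size=DIM)
    return vec / np.linalg.norm(vec)

def embed_image(true_caption: str, noise: float = 0.02) -> np.ndarray:
    """Toy image encoder: contrastive training pulls an image's embedding
    toward its caption's, simulated here as caption embedding + small noise."""
    vec = embed_text(true_caption) + noise * rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)

def zero_shot_classify(image_vec: np.ndarray, labels: list[str]) -> str:
    """CLIP-style zero-shot classification: score the image against one
    text prompt per label and return the best match."""
    prompts = np.stack([embed_text(f"a photo of a {l}") for l in labels])
    sims = prompts @ image_vec  # cosine similarity (all vectors unit-norm)
    return labels[int(np.argmax(sims))]
```

Because both encoders land in the same space, adding a new class is just adding a new prompt — no retraining — which is what made CLIP's zero-shot transfer notable.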

VLM vs. Traditional OCR: When to Use Which

  • VLMs excel at unstructured documents and complex reasoning, and mitigate detail loss on large images via dynamic tiling
  • Traditional OCR wins on structured documents, deterministic workflows, low cost, and edge deployment (~50 MB RAM)
  • The choice depends on your cost constraints, deployment environment, and document complexity
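As a rough illustration (not from the talk itself), these criteria can be encoded as a small routing heuristic; the flag names and priorities are assumptions made for the sketch:

```python
def choose_engine(structured: bool, needs_reasoning: bool,
                  edge_deployment: bool) -> str:
    """Heuristic router over the tradeoffs above. Returns "ocr" or "vlm"."""
    if edge_deployment:
        # VLMs need GPUs; a traditional OCR engine fits in ~50 MB of RAM.
        return "ocr"
    if needs_reasoning or not structured:
        # Unstructured layouts and multi-step reasoning favor a VLM,
        # despite the higher cost.
        return "vlm"
    # Structured documents in deterministic workflows: OCR is cheaper
    # and more predictable.
    return "ocr"
```

In practice teams often route per document type rather than per request, but the ordering above — hardware constraints first, task complexity second, cost last — is a reasonable default.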

Production Use Cases

  • Pinterest Pin Landing: VLM extracts image attributes → LLM refines into topics → CLIP matches content via similarity scoring
  • Instacart Digital Flyer Automation: Hybrid pipeline using SAM segmentation → Paddle OCR → LLM query generation → ANN matching → LLM ranking
  • CWAT Concept-First Labeling: SAM3-based approach where a single prompt defines a concept, then the model finds, segments, and tracks all instances across video
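The matching stage shared by these pipelines (OCR'd text or extracted attributes → candidate catalog items) can be illustrated with brute-force nearest-neighbor search. The toy hash embedding and mini-catalog below are stand-ins for a learned encoder and a real ANN index (e.g. FAISS):

```python
import numpy as np
from zlib import crc32

DIM = 64

def embed(text: str) -> np.ndarray:
    """Toy text embedding: a stable random vector per token, summed and
    L2-normalized. A production pipeline would use a learned encoder."""
    vec = np.zeros(DIM)
    for tok in text.lower().split():
        vec += np.random.default_rng(crc32(tok.encode())).normal(size=DIM)
    return vec / np.linalg.norm(vec)

# Tiny stand-in catalog; the stacked matrix plays the role of an ANN index.
CATALOG = ["organic bananas", "whole milk", "sourdough bread", "cheddar cheese"]
INDEX = np.stack([embed(name) for name in CATALOG])

def match_flyer_item(ocr_text: str, k: int = 2) -> list[str]:
    """Return the top-k catalog candidates for one OCR'd flyer snippet.
    Brute-force cosine search; swap in an ANN library at real scale."""
    sims = INDEX @ embed(ocr_text)
    top = np.argsort(-sims)[:k]
    return [CATALOG[i] for i in top]
```

Brute force is fine at toy scale; the reason such pipelines reach for approximate nearest neighbors is that exact search over millions of catalog embeddings is too slow to run per flyer item.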

Q&A Highlights

  • Handwritten documents: VLMs (Gemini Flash, Qwen SLMs) generally outperform OCR
  • Deepfake detection: Fine-tuned model achieved 97% accuracy but requires continuous retraining
  • Model selection: Use leaderboards like Arena.ai and MMU Pro to compare performance
  • Learning resources: YouTube, blogs, and model leaderboards are the best sources

Here's the entire recording of the session.