Welcome to session 42! We have Abheetha Pradhan joining us this week for a deep dive into Vision-Language Models.
Abheetha is a Product Manager at GumboTech with expertise spanning both engineering and product, specializing in Vision AI, particularly OCR and document understanding. She is also a long-time member of and volunteer for the Applied AI Club community.
Session Overview
This session reviews the evolution and current state of Vision-Language Models (VLMs) — covering the architectural journey from CNNs to native multimodal transformers, when to use VLMs vs. traditional OCR, and real production use cases from Pinterest, Instacart, and CWAT.
Key Takeaways
- Architectural Evolution: VLMs evolved from dual-encoder models (e.g., CLIP) to native multimodal transformers (e.g., Gemini 3 Pro), which are trained on all data types (text, image, video) in a unified space, enabling deeper reasoning.
- VLM vs. Traditional OCR: Use traditional OCR for deterministic workflows on structured documents where cost and edge deployment are priorities. Use VLMs for complex, unstructured tasks requiring reasoning, despite higher cost and GPU needs.
- Hybrid Pipelines are Key: Practical applications often combine traditional CV (segmentation, OCR) for pre-processing with VLMs for high-level reasoning, as seen in Instacart's flyer automation.
- Future Direction: The field is moving toward structured world models and embodied AI, which will enable not just understanding but also action within simulated environments.
Topics Covered
The Evolution of Vision AI
- Early CV (2012–2017): CNNs and ResNets for object detection and segmentation
- Dual Encoders (2017+): CLIP's zero-shot classification using separate text and image encoders mapped to a shared space
- Adapter Architectures: BLIP, Flamingo — vision encoder projecting embeddings into an LLM
- Native Multimodal Transformers (2023+): Gemini 3 Pro, Qwen3-VL, GPT-5 — ingesting all data types in a unified space
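The dual-encoder idea behind CLIP's zero-shot classification can be sketched in a few lines: both encoders map into one shared space, and a label is chosen by softmax over cosine similarities. The toy below substitutes random vectors for real encoder outputs, so it only illustrates the inference step, not CLIP itself.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot product == cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """CLIP-style inference: softmax over cosine similarities between one
    image embedding and one text embedding per candidate label."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_embs)
    logits = txt @ img / temperature        # one cosine score per label
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings; in practice these come from the frozen image/text encoders.
rng = np.random.default_rng(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a receipt"]
text_embs = rng.normal(size=(3, 8))
image_emb = text_embs[2] + 0.1 * rng.normal(size=8)  # "looks like" a receipt

probs = zero_shot_classify(image_emb, text_embs)
print(labels[int(np.argmax(probs))])  # the receipt prompt should score highest
```

Adapter architectures (BLIP, Flamingo) keep this kind of vision encoder but project its embeddings into an LLM instead of a separate text tower; native multimodal transformers drop the split entirely.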
VLM vs. Traditional OCR: When to Use Which
- VLMs excel at unstructured documents and tasks requiring complex reasoning, and mitigate resolution loss on large inputs via dynamic tiling
- Traditional OCR wins on structured documents, deterministic workflows, low cost, and edge deployment (~50 MB RAM)
- The choice depends on your cost constraints, deployment environment, and document complexity
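The decision criteria above amount to a simple routing rule. The sketch below is an assumption of mine, not something shown in the session: the field names and thresholds are illustrative, and a real system would tune them per workload.

```python
from dataclasses import dataclass

@dataclass
class DocJob:
    structured: bool        # fixed-layout form/invoice vs. free-form document
    needs_reasoning: bool   # cross-field reasoning, summarization, etc.
    edge_deployment: bool   # must run on-device within a tight RAM budget

def pick_engine(job: DocJob) -> str:
    # Hypothetical routing heuristic based on the trade-offs discussed above.
    if job.edge_deployment:
        return "traditional_ocr"   # VLMs need GPUs; OCR fits ~50 MB of RAM
    if job.structured and not job.needs_reasoning:
        return "traditional_ocr"   # deterministic, cheap, predictable
    return "vlm"                   # unstructured input or reasoning required

print(pick_engine(DocJob(structured=True, needs_reasoning=False, edge_deployment=False)))
print(pick_engine(DocJob(structured=False, needs_reasoning=True, edge_deployment=False)))
```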
Production Use Cases
- Pinterest Pin Landing: VLM extracts image attributes → LLM refines into topics → CLIP matches content via similarity scoring
- Instacart Digital Flyer Automation: Hybrid pipeline using SAM segmentation → Paddle OCR → LLM query generation → ANN matching → LLM ranking
- CWAT Concept-First Labeling: SAM3-based approach where a single prompt defines a concept, then the model finds, segments, and tracks all instances across video
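Both the Pinterest and Instacart pipelines end in embedding-similarity matching. As a rough sketch (not the production implementation), the matching step reduces to top-k cosine retrieval; at catalog scale the brute-force scan below would be replaced by an ANN index such as FAISS or HNSW.

```python
import numpy as np

def top_k_matches(query_emb, catalog_embs, k=3):
    """Brute-force cosine top-k over a catalog of item embeddings.
    Production systems swap this scan for an approximate nearest-neighbor
    (ANN) index once the catalog grows large."""
    q = query_emb / np.linalg.norm(query_emb)
    c = catalog_embs / np.linalg.norm(catalog_embs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per catalog item
    idx = np.argsort(-scores)[:k]        # indices of the k best matches
    return list(zip(idx.tolist(), scores[idx].tolist()))

# Toy catalog; real embeddings would come from the pipeline's encoder.
rng = np.random.default_rng(1)
catalog = rng.normal(size=(100, 16))
query = catalog[42] + 0.05 * rng.normal(size=16)  # flyer item resembling item 42

matches = top_k_matches(query, catalog)
print(matches[0][0])  # item 42 should rank first
```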
Q&A Highlights
- Handwritten documents: VLMs (Gemini Flash, Qwen SLMs) generally outperform OCR
- Deepfake detection: Fine-tuned model achieved 97% accuracy but requires continuous retraining
- Model selection: Use leaderboards like Arena.ai and MMU Pro to compare performance
- Learning resources: YouTube, blogs, and model leaderboards are the best sources
Here's the entire recording of the session.