OCR-based text representations

Gold definition · Updated Apr 2, 2026

Definition

OCR-based text representations treat text as a visual signal, rendering item descriptions into images and encoding them with vision-based OCR models. This approach aims to improve semantic ID learning, especially for symbolic, attribute-centric text and in multimodal generative recommendation systems.

At a glance

Executive summary

OCR-based text representations convert text into images and encode those images with a vision-based OCR model, rather than passing the text through a conventional text encoder. This helps AI models handle product descriptions that are dense with numbers and symbols, and it makes text and image information easier to combine in recommendation systems.

TL;DR

It's a way to make AI understand text better, especially tricky product descriptions, by treating the text like a picture and using image recognition.

Key points

Renders text descriptions into images and encodes them using vision-based OCR models.
Addresses two problems: standard text encoders handle symbolic, attribute-centric text poorly, and text and image embeddings are often mismatched in multimodal systems.
Used by researchers and engineers working on Generative Recommendation (GR) models, multimodal AI, and semantic ID learning.
Unlike standard text encoders optimized for natural language, it processes text visually to better handle symbolic data and align with image embeddings.
Explores visual alternatives to standard text encoding, aiming for more robust representations and simpler multimodal integration.
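
The rendering step named in the first key point can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the Pillow library is installed, and the function name `render_text_to_image`, the canvas size, and the wrap width are hypothetical choices. The vision-based OCR encoder that would consume the image is not shown.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_text_to_image(text, width=384, height=128, chars_per_line=48, margin=8):
    """Render an item description as dark text on a white grayscale canvas.

    All sizes here are illustrative assumptions, not values from any
    particular OCR-based pipeline.
    """
    font = ImageFont.load_default()  # Pillow's built-in font, no font file needed
    image = Image.new("L", (width, height), color=255)  # "L" = 8-bit grayscale
    draw = ImageDraw.Draw(image)

    # Wrap by character count so long attribute lists stay on the canvas.
    lines = textwrap.wrap(text, width=chars_per_line)
    if lines:
        draw.multiline_text((margin, margin), "\n".join(lines), fill=0, font=font, spacing=4)
    return image


description = "Intel i9-13900K, 32GB DDR5 RAM, RTX 4090"
image = render_text_to_image(description)
# `image` would next be passed to a vision-based OCR encoder (not shown);
# the resulting embedding is what semantic ID learning would consume.
```

Rendering to a fixed-size canvas keeps every item on the same input shape, which is what most vision encoders expect. A real pipeline would likely tune the font, resolution, and layout to the encoder in use.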

Use cases

Product Recommendation Systems: Improving the understanding of item descriptions containing complex attributes, part numbers, or specifications (e.g., 'Intel i9-13900K, 32GB DDR5 RAM, RTX 4090').
Multimodal Search and Retrieval: Enabling more effective fusion of product images with their symbolic text descriptions (e.g., finding a specific electronic component based on both its visual appearance and technical specs).
Catalog Data Processing: Generating robust embeddings for large-scale e-commerce catalogs where item data is often attribute-centric and less like natural language.
Automated Data Entry/Extraction: Potentially improving the robustness of systems that extract structured information from semi-structured text by leveraging visual cues.
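
For the multimodal use cases above, one simple way to fuse an OCR-text embedding with a product-image embedding is a weighted average of unit vectors. The sketch below is only one possibility, not the fusion method of any specific system. It assumes both embeddings already share the same dimensionality, and the names `l2_normalize`, `cosine_similarity`, and `fuse` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity, a rough check of how well two embeddings align."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def fuse(text_emb, image_emb, text_weight=0.5):
    """Weighted average of the two unit vectors, re-normalized to unit length."""
    if len(text_emb) != len(image_emb):
        raise ValueError("embeddings must have the same dimensionality")
    t, v = l2_normalize(text_emb), l2_normalize(image_emb)
    mixed = [text_weight * x + (1.0 - text_weight) * y for x, y in zip(t, v)]
    return l2_normalize(mixed)


# Toy vectors standing in for real encoder outputs (illustrative values only).
ocr_text_emb = [0.9, 0.1, 0.3]
image_emb = [0.8, 0.2, 0.4]
item_emb = fuse(ocr_text_emb, image_emb)
alignment = cosine_similarity(ocr_text_emb, image_emb)
```

Averaging is only meaningful when the two embeddings are reasonably aligned, which is the property this approach is meant to encourage by encoding text visually.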

Also known as

OCR-text, visual text encoding, image-based text representation