OCR-based text representations treat text as a visual signal, rendering item descriptions into images and encoding them with vision-based OCR models. This approach aims to improve semantic ID learning, especially for symbolic, attribute-centric text and in multimodal generative recommendation systems.
OCR-based text representations convert text into images and then use vision AI to understand them, rather than traditional text processing. This helps AI models better understand product descriptions with lots of numbers and symbols, and makes it easier to combine text and image information in recommendation systems.
OCR-text, visual text encoding, image-based text representation
Was this definition helpful?