ARXIV:2604.04482 · EDUCATIONAL TECHNOLOGY · SUBMITTED 07 APR · 20:12 UTC · FRESHNESS UNKNOWN

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models

Dominik Glandorf · Fares Fawzi · Tanja Käser · arXiv

A predictive tool for optimizing educational videos by analyzing learner interactions using multimodal LLM embeddings.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A predictive tool for optimizing educational videos by analyzing learner interactions using multimodal LLM embeddings.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: Evidence unverified


PROBLEM

A predictive tool for optimizing educational videos by analyzing learner interactions using multimodal LLM embeddings. We propose a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and rewinding behavior as proxies for cognitive…

METHOD

Full abstract

Learners' use of video controls in educational videos provides implicit signals of cognitive processing and instructional design quality, yet the lack of scalable and explainable predictive models limits instructors' ability to anticipate such behavior before deployment. We propose a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and rewinding behavior as proxies for cognitive load from video content alone. Our approach leverages multimodal large language models (MLLMs) to compute embeddings of short video segments and trains a neural classifier to identify temporally fine-grained interaction peaks. Drawing from multimedia learning theory on instructional design for optimal cognitive load, we code features of the video segments using GPT-5 and employ them as a basis for interpreting model predictions via concept activation vectors. We evaluate our pipeline on 77 million video control events from 66 online courses. Our findings demonstrate that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts. Overall, our results show the feasibility of cost-efficient, interpretable pre-screening of educational video design and open new opportunities to empirically examine multimedia learning theory at scale.
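The pipeline in the abstract (segment embeddings in, interaction-peak predictions out) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the peak rule (count above mean + z · std), the hand-written 2-d toy embeddings standing in for MLLM embeddings, and a logistic regression standing in for the neural classifier. None of it is taken from the paper.

```python
import math

def label_peaks(counts, z=1.0):
    """Mark segments whose event count exceeds mean + z * std.
    Hypothetical rule; the paper's peak definition is not given here."""
    mean = sum(counts) / len(counts)
    std = math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))
    return [1 if c > mean + z * std else 0 for c in counts]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(w, b, x):
    """Probability that a segment with embedding x is an interaction peak."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logreg(X, y, lr=1.0, epochs=2000):
    """Batch gradient descent on the logistic loss.
    Stand-in for the neural classifier mentioned in the abstract."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy data: pause events per short segment of one video (made up).
pause_counts = [3, 4, 2, 20, 3, 18, 4, 2]
labels = label_peaks(pause_counts)

# Toy 2-d vectors standing in for MLLM segment embeddings (made up).
embeddings = [[0.1, 0.9], [0.2, 0.8], [0.0, 1.0], [0.9, 0.1],
              [0.1, 0.7], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]

w, b = train_logreg(embeddings, labels)
for x, y in zip(embeddings, labels):
    print(y, round(predict(w, b, x), 3))
```

In a real setting the embeddings would come from an MLLM and the evaluation would be run on held-out courses or fields rather than on the training segments, which is the generalization the abstract reports.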

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Our findings demonstrate that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts.…
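The claim that the embeddings encode interpretable instructional concepts rests on concept activation vectors. A minimal sketch of that idea follows, under loud assumptions: the concept ("dense on-screen text") is hypothetical, the embeddings are toy 2-d vectors, `peak_logit` is a made-up stand-in for a trained peak classifier, and the concept direction is a difference of class means rather than the normal of a fitted linear classifier as in the original TCAV formulation.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def concept_activation_vector(concept_embs, other_embs):
    """Unit vector pointing from the non-concept mean to the concept mean.
    Simplification: TCAV proper uses the normal of a linear classifier."""
    diff = [c - o for c, o in zip(mean_vector(concept_embs), mean_vector(other_embs))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def peak_logit(x):
    """Hypothetical stand-in for a trained peak classifier (fixed toy weights)."""
    hidden = math.tanh(2.0 * x[0] - 1.0 * x[1])
    return 3.0 * hidden - 0.5

def directional_derivative(f, x, direction, eps=1e-4):
    """Central finite difference of f at x along `direction`."""
    plus = [xi + eps * di for xi, di in zip(x, direction)]
    minus = [xi - eps * di for xi, di in zip(x, direction)]
    return (f(plus) - f(minus)) / (2 * eps)

def tcav_score(f, embeddings, cav):
    """Fraction of segments whose peak logit rises along the concept direction."""
    positive = sum(1 for x in embeddings if directional_derivative(f, x, cav) > 0)
    return positive / len(embeddings)

# Toy embeddings of segments coded as showing the concept vs. not (made up).
dense_text = [[0.9, 0.2], [0.8, 0.1], [1.0, 0.3]]
sparse_text = [[0.1, 0.7], [0.2, 0.9], [0.0, 0.8]]

cav = concept_activation_vector(dense_text, sparse_text)
score = tcav_score(peak_logit, dense_text + sparse_text, cav)
print(cav, score)
```

A score near 1 would suggest that moving an embedding toward the concept raises the predicted peak likelihood for most segments; a score near 0.5 would suggest the classifier is indifferent to that direction.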

WHY NOW

Educational Technology moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.




Competitive landscape

A predictive tool for optimizing educational videos by analyzing learner interactions using multimodal LLM embeddings.

Segment: Educational Technology

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Affiliation (all authors): EPFL, Switzerland

Published: 2026-04-06T07:12:46.000Z · arXiv: https://arxiv.org/abs/2604.04482 · Free to access

FAQ

What products could be built from this research?

This technology could be productized as a SaaS that integrates into learning management systems, providing real-time analytics and recommendations for video content optimization based on predicted learner interactions.

What are the practical use cases?

An educational platform feature that gives instructors insights into potential cognitive load challenges in their video content, enabling pre-emptive content adjustments.

What industries could this research disrupt?

This technology could replace manual annotation processes or generalized engagement metrics by providing targeted insights into learner-video interactions.


