ARXIV:2604.02324 · LLM VOCABULARY EXTENSION · SUBMITTED 03 APR · 20:50 UTC · FRESHNESS STALE

Source: PDF linked (verified) · PaperPack: citation fields available (verified) · Proof: unverified proof status (partial)

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Daiwei Chen · Zhoutong Fu · Chengming Jiang · Haichao Zhang · Ran Zhou · Tan Wang · +9 at arXiv

A novel method for initializing new vocabulary tokens in language models that significantly improves performance on generative recommendation tasks by grounding them in meaningful semantic space.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A novel method for initializing new vocabulary tokens in language models that significantly improves performance on generative recommendation tasks by grounding them in meaningful semantic space.

Evidence 0 refs | 0 sources | 33% coverage

Blocker Evidence unverified


PROBLEM

A novel method for initializing new vocabulary tokens in language models that significantly improves performance on generative recommendation tasks by grounding them in meaningful semantic space. The standard practice initializes these new tokens as…

METHOD

Full abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
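The collapse described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's code or its GTI stage: the embedding table is random, the sizes are made up, and `description_init` is a hypothetical stand-in that places each new token at the mean embedding of a paired text description, only to show what a non-degenerate starting point looks like under the same diagnostics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not taken from the paper).
vocab_size, dim, num_new = 1000, 64, 8

# Random stand-in for a pretrained embedding table.
pretrained = rng.normal(size=(vocab_size, dim))


def mean_init(emb, n_new):
    """Standard practice: every new token starts at the mean of existing embeddings."""
    return np.tile(emb.mean(axis=0), (n_new, 1))


def description_init(emb, description_ids):
    """Assumed stand-in for grounding (NOT the paper's GTI stage): each new token
    starts at the mean embedding of the tokens in its paired description."""
    return np.stack([emb[ids].mean(axis=0) for ids in description_ids])


def numerical_rank(new_emb, tol=1e-8):
    """Spectral diagnostic: count singular values above a relative tolerance."""
    s = np.linalg.svd(new_emb, compute_uv=False)
    return int((s > tol * s[0]).sum())


def mean_offdiag_cosine(new_emb):
    """Geometric diagnostic: average cosine similarity between distinct new tokens."""
    unit = new_emb / np.linalg.norm(new_emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = sim.shape[0]
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


# Hypothetical paired descriptions: five existing token ids per new token.
description_ids = [
    rng.choice(vocab_size, size=5, replace=False) for _ in range(num_new)
]

mean_emb = mean_init(pretrained, num_new)
desc_emb = description_init(pretrained, description_ids)

# Mean initialization yields identical rows, so the new-token matrix has rank 1
# and every pair of new tokens has cosine similarity 1.
print("mean init rank:", numerical_rank(mean_emb))
print("mean init avg cosine:", mean_offdiag_cosine(mean_emb))

# Distinct starting points keep the new tokens separable before fine-tuning.
print("description init rank:", numerical_rank(desc_emb))
print("description init avg cosine:", mean_offdiag_cosine(desc_emb))
```

With a real model, `pretrained` would be replaced by the model's input embedding matrix and `description_ids` by tokenized text paired with each new token. The same two diagnostics can then be rerun after fine-tuning to check whether inter-token structure persists.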

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a…

WHY NOW

LLM Vocabulary Extension moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score 7.0

Pain A novel method for initializing new vocabulary tokens in language models that significantly improves performance on generative recommendation tasks by grounding them in meaningful semantic space.

Evidence 0 refs | 0 sources | 33% coverage

Blocker no shell-level blocker reported

Analysis summary

A novel method for initializing new vocabulary tokens in language models that significantly improves performance on generative recommendation tasks by grounding them in meaningful semantic space.


Competitive landscape

A novel method for initializing new vocabulary tokens in language models that significantly improves performance on generative recommendation tasks by grounding them in meaningful semantic space.

Segment

LLM Vocabulary Extension

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Title: Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
arXiv: 2604.02324 (https://arxiv.org/abs/2604.02324)
Published: 2026-04-02T17:59:19.000Z
Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak
Research domain: LLM Vocabulary Extension
Viability score: 7
Commercial readiness: code
Free to access: yes
