ARXIV:2604.00715 · LLM PRETRAINING & RAG · SUBMITTED 03 APR · 20:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

Karan Singh · Michael Yu · Varun Gangal · Zhuofu Tao · Sachin Kumar · Emmy Liu · Steven Y. Feng

A framework for understanding the trade-offs between pretraining and retrieval for language models, guiding optimal data allocation.

Ship in 2-4 weeks › Score 4.0 · Evidence unverified

Opportunity summary

Pain: A framework for understanding the trade-offs between pretraining and retrieval for language models, guiding optimal data allocation.

Evidence: 37 refs | 5 sources | 67% coverage

Blocker: Evidence unverified

PROBLEM

A framework for understanding the trade-offs between pretraining and retrieval for language models, guiding optimal data allocation. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly…

METHOD

Full abstract

Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
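The abstract describes modeling performance as a function of model size, pretraining tokens, and retrieval corpus size, then using that surface to split a fixed data budget between pretraining and retrieval. The abstract does not state the fitted functional form, so the sketch below assumes a simple additive power law with made-up coefficients. `ToyScalingLaw`, `best_split`, and every constant are hypothetical and are not taken from the paper or its repository; the code only illustrates the mechanics of sweeping the pretraining/retrieval split under a fixed token budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyScalingLaw:
    """Hypothetical additive power-law loss surface.

    ASSUMPTION: this form and all coefficients are illustrative placeholders,
    not the scaling manifold fitted in the paper.
    """
    E: float = 1.7       # irreducible loss (made-up)
    A: float = 400.0     # model-size coefficient (made-up)
    alpha: float = 0.34  # model-size exponent (made-up)
    B: float = 410.0     # pretraining-token coefficient (made-up)
    beta: float = 0.28   # pretraining-token exponent (made-up)
    C: float = 30.0      # retrieval-store coefficient (made-up)
    gamma: float = 0.20  # retrieval-store exponent (made-up)

    def loss(self, n_params: float, pretrain_tokens: float,
             retrieval_tokens: float) -> float:
        """Predicted loss for a model size, pretraining set, and retrieval store."""
        return (self.E
                + self.A / n_params ** self.alpha
                + self.B / pretrain_tokens ** self.beta
                + self.C / retrieval_tokens ** self.gamma)


def best_split(law: ToyScalingLaw, n_params: float, budget_tokens: float,
               steps: int = 99) -> tuple[float, float]:
    """Grid-search the fraction of a fixed token budget given to pretraining.

    Returns (best_fraction, predicted_loss). Fractions stay strictly inside
    (0, 1) so neither the pretraining set nor the retrieval store is empty.
    """
    best = None
    for i in range(1, steps + 1):
        frac = i / (steps + 1)
        pretrain = frac * budget_tokens
        retrieval = budget_tokens - pretrain
        candidate = (law.loss(n_params, pretrain, retrieval), frac)
        if best is None or candidate < best:
            best = candidate
    return best[1], best[0]


if __name__ == "__main__":
    law = ToyScalingLaw()
    # Sweep a few model sizes inside the 30M to 3B range named in the abstract,
    # with an arbitrary budget of 20 tokens per parameter (an assumption).
    for n_params in (3e7, 3e8, 3e9):
        frac, predicted = best_split(law, n_params, budget_tokens=20 * n_params)
        print(f"params={n_params:.0e} pretrain_fraction={frac:.2f} loss={predicted:.3f}")
```

With a fitted surface in place of the toy one, the same sweep could be repeated per model size and per task to see how the preferred split shifts, which is the kind of dependence on scale and task type that the abstract reports.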

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. A public repository is linked, so…

WHY NOW

LLM Pretraining & RAG moved forward this cycle; last verified April 2026. Public score 4.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 4.0

Pain: A framework for understanding the trade-offs between pretraining and retrieval for language models, guiding optimal data allocation.

Evidence: 37 refs | 5 sources | 67% coverage

Blocker: no shell-level blocker reported

Competitive landscape

Segment: LLM Pretraining & RAG

Adoption evidence: Public code linked for build inspection

Commercial read: 4.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-04-01 · arXiv: https://arxiv.org/abs/2604.00715

Authors: Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng

Code: https://github.com/DegenAI-Labs/RAG-scaling-laws
