ARXIV:2603.18534 · LLM TRAINING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified (Source: PDF linked) · Verified (PaperPack: citation fields available) · Partial (Proof: unverified proof status)

Data-efficient pre-training by scaling synthetic megadocs

Konwoo Kim · Suhas Kotha · Yejin Choi · Tatsunori Hashimoto · Nick Haber · Percy Liang · arXiv

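This research proposes a method for data-efficient pre-training of language models: synthetic generations from the same source document are combined into a single, much longer 'megadoc'. At 32 generations per document, this raises data efficiency from 1.48× (simple rephrasing) to 1.80× and improves long-context loss.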

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain Pre-training can be constrained by data rather than compute; the paper addresses this by scaling synthetic data generation into 'megadocs', which improves data efficiency and long-context loss.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified


PROBLEM

METHOD

Full abstract

Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.
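The two megadoc constructions named in the abstract, stitching and stretching, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify separators, ordering, or how rationales are generated, so the function names (`stitch_megadoc`, `stretch_megadoc`, `build_stitched_corpus`), the blank-line separator, and the "one rationale between each pair of adjacent segments" layout are all assumptions made for illustration. The rephrases and rationales are taken as already-generated strings.

```python
from typing import Dict, List, Sequence


def stitch_megadoc(rephrases: Sequence[str], separator: str = "\n\n") -> str:
    """Stitching (sketch): join every synthetic rephrase of ONE source
    document into a single long document instead of many short ones.
    The separator is a hypothetical choice; empty rephrases are skipped."""
    return separator.join(r.strip() for r in rephrases if r.strip())


def stretch_megadoc(
    segments: Sequence[str],
    rationales: Sequence[str],
    separator: str = "\n\n",
) -> str:
    """Stretching (sketch): lengthen one document by inserting a rationale
    between each pair of adjacent segments. The one-rationale-per-gap
    layout is an assumption, not taken from the paper."""
    if not segments:
        raise ValueError("need at least one segment")
    if len(rationales) != len(segments) - 1:
        raise ValueError("expected exactly one rationale per gap between segments")
    parts: List[str] = []
    for i, segment in enumerate(segments):
        parts.append(segment.strip())
        if i < len(rationales):
            parts.append(rationales[i].strip())
    return separator.join(parts)


def build_stitched_corpus(rephrases_by_doc: Dict[str, List[str]]) -> List[str]:
    """Produce one megadoc per source document id (hypothetical helper)."""
    return [stitch_megadoc(rephrases) for rephrases in rephrases_by_doc.values()]
```

In the simple-rephrasing baseline each rephrase would be its own training document; here all generations that share a source document end up in one sequence, which is the property the abstract credits for the improved long-context loss.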

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute…

WHY NOW

LLM Training moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability, although the paper record does not yet list a public code link.




Competitive landscape

Segment: LLM Training

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-03-19 · arXiv: https://arxiv.org/abs/2603.18534 · Page: https://sciencetostartup.com/paper/data-efficient-pre-training-by-scaling-synthetic-megadocs

