ARXIV:2603.26164 · LLM TRAINING · SUBMITTED 30 MAR · 21:54 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Hao Liang · Zhengyang Zhao · Meiyi Qiang · Mingrui Chen · Lu Ma · Rongyi Yu · +19 at arXiv

DataFlex unifies and streamlines data-centric dynamic training for LLMs, offering improved performance and efficiency with a drop-in framework.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain DataFlex unifies and streamlines data-centric dynamic training for LLMs, offering improved performance and efficiency with a drop-in framework.

Evidence 58 refs | 9 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

Existing approaches to data selection, data mixture optimization, and data reweighting are often developed in…

METHOD

Full abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
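The three paradigms named in the abstract can be illustrated with a toy sketch. This is not DataFlex's API or any method evaluated in the paper: every function name and parameter below (`select_top_k`, `update_mixture`, `reweight`, `step_size`, `temperature`) is hypothetical, and the mixture update is a generic exponentiated-weights step, not DoReMi or ODM. It only shows what each paradigm changes: which samples are kept, how much each domain is sampled, and how much each sample contributes to the loss.

```python
import math


def select_top_k(scores, k):
    """Sample selection: return indices of the k highest-scoring samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def update_mixture(weights, domain_losses, step_size=1.0):
    """Domain mixture adjustment (hypothetical rule): multiply each domain
    weight by exp(step_size * loss), then renormalize so weights sum to 1.
    Domains with higher loss receive a larger sampling share."""
    raw = [w * math.exp(step_size * loss)
           for w, loss in zip(weights, domain_losses)]
    total = sum(raw)
    return [r / total for r in raw]


def reweight(losses, temperature=1.0):
    """Sample reweighting (hypothetical rule): softmax over per-sample
    losses, so higher-loss samples get larger weights that sum to 1."""
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_loss(losses, weights):
    """Combine per-sample losses using the given weights."""
    return sum(loss * w for loss, w in zip(losses, weights))


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    kept = select_top_k([0.1, 0.9, 0.5, 0.7], k=2)
    mixture = update_mixture([0.5, 0.5], [1.0, 0.0])
    sample_weights = reweight([2.0, 4.0])
    print(kept, mixture, weighted_loss([2.0, 4.0], sample_weights))
```

In a dynamic setting, each of these would be recomputed periodically during training from the current model's scores or losses, rather than fixed once before training starts.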

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original…

WHY NOW

LLM Training moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain DataFlex unifies and streamlines data-centric dynamic training for LLMs, offering improved performance and efficiency with a drop-in framework.

Evidence 58 refs | 9 sources | 50% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment

LLM Training

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Published 2026-03-27T08:28:02Z · arXiv: https://arxiv.org/abs/2603.26164 · Research domain: LLM Training · Commercial readiness: code

Full author list: Hao Liang · Zhengyang Zhao · Meiyi Qiang · Mingrui Chen · Lu Ma · Rongyi Yu · Hengyi Feng · Shixuan Sun · Zimo Meng · Xiaochen Ma · Xuanlin Yang · Qifeng Cai · Ruichuan An · Bohan Zeng · Zhen Hao Wong · Chengyu Shen · Runming He · Zhaoyang Han · Yaowei Zheng · Fangcheng Fu · Conghui He · Bin Cui · Zhiyu Li · Weinan E · Wentao Zhang
