ARXIV:2604.17930 · LLM TRAINING DATA · SUBMITTED 21 APR · 20:33 UTC · FRESHNESS STALE

Verified Source: PDF linked | Verified PaperPack: citation fields available | Partial Proof: unverified proof status

Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

H S V N S Kowndinya Renduchintala · Sumit Bhatia · arXiv

Injecting 1% of targeted synthetic data into the pre-training corpus of GPT-2 Small (124M) models substantially improves 8 of the 9 worst-performing BLiMP paradigms, suggesting that data composition matters for human-scale language modeling.

Ship in 2-4 weeks | Score 7.0 | Evidence unverified

Opportunity summary

Pain This research demonstrates that targeted data augmentation can significantly improve LLM formal linguistic competence, suggesting data composition is key for human-scale modeling.

Evidence 0 refs | 4 sources | 83% coverage

Blocker Evidence unverified

PROBLEM

LLMs learn some linguistic phenomena with near-perfect mastery, yet often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural…

METHOD

Full abstract

Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply the scarcity of these specific grammatical constructions in web-scale corpora. We pre-train simple GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms - notably the accuracy on a specific paradigm, only_npi_scope, surges from 20.9% to 69.4%. Furthermore, we observe that these interventions generally preserve or slightly improve aggregate performance. However, while we also identify a resistant phenomenon, principle_A_c_command, whose performance remains below chance even after our data augmentation, our findings do serve as an optimistic existence proof that even small language models can substantially improve on those linguistic phenomena on which models typically perform poorly, provided the pre-training data contains sufficient exposure to them. This suggests that efforts towards human-scale language modeling may benefit greatly by focusing on data composition. The code to reproduce our results is open-sourced at https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence.
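The intervention described in the abstract can be illustrated with a minimal sketch of a corpus mix in which about 1% of the token budget is synthetic. This is not the authors' code (their repository is linked above); `mix_corpus`, `count_tokens`, the whitespace tokenizer, and the choice to carve the 1% out of a fixed total budget rather than append it are all assumptions made for illustration.

```python
import random


def count_tokens(text):
    # Whitespace split is a stand-in for a real tokenizer (assumption).
    return len(text.split())


def mix_corpus(base_docs, synthetic_docs, total_tokens,
               synthetic_fraction=0.01, seed=0):
    """Return (docs, base_tokens, synthetic_tokens) for a mixed corpus.

    Roughly `synthetic_fraction` of `total_tokens` comes from
    `synthetic_docs`; the rest comes from `base_docs`.
    """
    rng = random.Random(seed)
    synth_budget = round(total_tokens * synthetic_fraction)
    base_budget = total_tokens - synth_budget

    def take(docs, budget):
        # Greedily pick shuffled documents that still fit in the budget.
        pool = list(docs)
        rng.shuffle(pool)
        picked, used = [], 0
        for doc in pool:
            n = count_tokens(doc)
            if used + n > budget:
                continue
            picked.append(doc)
            used += n
        return picked, used

    synth, synth_used = take(synthetic_docs, synth_budget)
    base, base_used = take(base_docs, base_budget)
    mixed = base + synth
    rng.shuffle(mixed)  # interleave synthetic examples with web text
    return mixed, base_used, synth_used


if __name__ == "__main__":
    web = ["web text " * 5] * 1000              # 10 tokens per document
    targeted = ["only the cat ever slept"] * 100  # 5 tokens per document
    docs, base_used, synth_used = mix_corpus(web, targeted, total_tokens=1000)
    print(len(docs), base_used, synth_used)
```

At the scale in the abstract, a 100M-token corpus with a 1% share would correspond to about 1M synthetic tokens under this reading; the abstract does not say whether the synthetic data replaces or supplements the FineWeb sample.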

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms - notably the accuracy on…
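For context on what the reported accuracies measure: BLiMP paradigms consist of minimal pairs, and a model is counted correct on a pair when it scores the grammatical sentence above the ungrammatical one, so chance is 50%. The sketch below shows that computation under stated assumptions. `minimal_pair_accuracy` and `toy_score` are hypothetical names, and `toy_score` is a deliberately naive stand-in for a language model's sentence log-probability, not anything from the paper.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs ranked correctly.

    `score` maps a sentence to a number; higher means more acceptable.
    With a real model this would be the sentence log-probability.
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence):
    # Hypothetical scorer: prefers shorter strings. It exists only to
    # make the example runnable and to show how a biased heuristic can
    # get some pairs right and others wrong.
    return -len(sentence)


if __name__ == "__main__":
    pairs = [
        ("The cats sleep.", "The cats sleeps."),  # toy scorer is right
        ("The cat sleeps.", "The cat sleep."),    # toy scorer is wrong
        ("The dogs bark.", "The dogs barks."),    # toy scorer is right
    ]
    print(minimal_pair_accuracy(pairs, toy_score))
```

On this metric, the only_npi_scope change quoted in the abstract, from 20.9% to 69.4%, is a gain of 48.5 percentage points, moving from well below chance to above it.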

WHY NOW

LLM Training Data moved forward this cycle; last verified April 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score 7.0

Pain This research demonstrates that targeted data augmentation can significantly improve LLM formal linguistic competence, suggesting data composition is key for human-scale modeling.

Evidence 0 refs | 4 sources | 83% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment: LLM Training Data

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

arXiv: https://arxiv.org/abs/2604.17930 · Published 2026-04-20 08:11 UTC · Code: https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence
