ARXIV:2603.16354 · LOW-RESOURCE LANGUAGE DEVELOPMENT · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Partial · PaperPack: 3 of 4 citation fields filled
Missing · Missing fields: authors
Partial · Proof: unverified proof status

PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

arXiv

Build cutting-edge NLP models for Pashto using the largest available Pashto language corpus, PashtoCorp.

Blocked on Code · Score 8.0 · Evidence unverified

Opportunity summary

Pain: Build cutting-edge NLP models for Pashto using the largest available Pashto language corpus, PashtoCorp.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified

PROBLEM

Build cutting-edge NLP models for Pashto using the largest available Pashto language corpus, PashtoCorp. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible…

METHOD

Full abstract

We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08->6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto-corpus, https://huggingface.co/ihanif/xlmr-pashto, and https://github.com/ihanif/pashto-corpus.
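The abstract names SHA-256 deduplication and quality filtering as pipeline stages but does not give their details here. The following is a minimal sketch of how exact-duplicate removal by content hash could look. The `normalize` helper, the `dedup_and_filter` name, and the `min_words` threshold are all illustrative assumptions, not taken from the paper or its repository.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: NFC-normalize and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def dedup_and_filter(docs, min_words=5):
    """Yield each document whose normalized text has not been seen before.

    `min_words` is an illustrative quality threshold, not a value from the paper.
    """
    seen = set()
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # drop documents that are too short
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield doc
```

Storing only the hex digest keeps memory use proportional to the number of unique documents rather than their total size. This only catches exact duplicates after normalization; near-duplicate detection would need a different technique.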

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears…
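The relative figures quoted for this paper can be checked from the rounded endpoints given in the abstract. This is a small arithmetic sketch; `relative_change` is a hypothetical helper name, and the inputs are the rounded values as published, so small differences from the reported percentages are expected.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Rounded endpoints as quoted in the abstract.
ppl = relative_change(8.08, 6.06)   # held-out perplexity
f1 = relative_change(19.0, 21.0)    # WikiANN entity F1

print(f"perplexity: {ppl:.1f}%")
print(f"entity F1: {f1:+.1f}%")
```

From the rounded values, the perplexity drop comes out at roughly 25.0% and the F1 gain at roughly 10.5%. The abstract reports 25.1% and 10%, which is consistent if those were computed from unrounded measurements.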

WHY NOW

Low-Resource Language Development moved forward this cycle; last verified April 2026. Public score 8.0/10.

Competitive landscape

Segment: Low-Resource Language Development

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-03-17T10:36:18.000Z · arXiv: https://arxiv.org/abs/2603.16354 · Free to access

FAQ

What products could be built from this research?

Now is the time because global AI adoption is accelerating, but low-resource languages like Pashto are being left behind, creating a competitive gap. With increasing digitalization in Pashto-speaking regions and growing demand for localized services, there's a first-mover advantage in deploying AI solutions that leverage this newly available, high-quality corpus before competitors catch up.

What are the practical use cases?

A Pashto-language customer service chatbot for telecommunications companies in Afghanistan and Pakistan, handling billing inquiries, plan changes, and technical support through voice or text interfaces, trained on this corpus to improve accuracy and reduce reliance on human agents.
