Evidence Receipt. Related Resources.
Compared to this week’s papers
Verification pending
Use This Via API or MCP
Signal Canvas is the citation-first public layer for turning one paper into a structured commercialization narrative. Use it to hand off into REST, MCP, Build Loop, and launch-pack execution without losing source lineage.
Route this paper proof surface into REST, MCP, or developer workflows while preserving the same evidence receipt and related-resource context.
Page Freshness
Canonical route: /signal-canvas/pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development
This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
Agent Handoff
Canonical ID: pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development
Route: /signal-canvas/pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development
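The REST and MCP examples that follow are both keyed on the canonical ID. As a minimal sketch in Python, assuming the URL pattern and payload shape are exactly as shown on this page, the two requests could be assembled as below. The helper names `handoff_url` and `mcp_payload` are hypothetical, and the response schema is not documented here, so the sketch stops short of parsing a response.

```python
import json
from urllib.parse import quote

# Base path copied from the REST example on this page.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/"

PAPER_REF = ("pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-"
             "reproducible-pipeline-for-low-resource-language-development")
PAPER_TITLE = ("PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and "
               "Reproducible Pipeline for Low-Resource Language Development")


def handoff_url(paper_ref: str) -> str:
    """Return the REST handoff URL for a canonical paper ID (hypothetical helper)."""
    # quote() leaves letters, digits, and hyphens untouched, so the slug passes through as-is.
    return BASE_URL + quote(paper_ref, safe="")


def mcp_payload(paper_ref: str, title: str) -> dict:
    """Return an MCP tool-call payload in the shape shown on this page (hypothetical helper)."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


if __name__ == "__main__":
    print(handoff_url(PAPER_REF))
    print(json.dumps(mcp_payload(PAPER_REF, PAPER_TITLE), indent=2))
```

Keeping the canonical ID in one constant means the REST URL and the MCP payload cannot drift apart, which is the lineage-preservation point the page makes.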
REST example
curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development
MCP example
{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development",
    "query_text": "Summarize PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development"
  }
}
source_context
{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development",
  "normalized_query": "2603.16354",
  "route": "/signal-canvas/pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development",
  "paper_ref": "pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}
Claims: 8
References: Pending verification
Proof: Verification pending
Freshness state: computing
Source paper: PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development
PDF: https://arxiv.org/pdf/2603.16354v1
Source count: Pending verification
Coverage: 17%
Last proof check: 2026-04-02T02:30:40.136Z
Signal Canvas receipt window
/buildability/pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development
Subject: PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development
Verdict
Preparing verified analysis
Dimensions overall score 8.0
No public code linked for this paper yet.
At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus.
Explicitly stated in the abstract with clear numeric comparisons.
partial
Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08 → 6.06).
Explicitly stated in the abstract with precise numeric results.
partial
On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0% → 21.0%) and reduces training variance nearly 7x.
Explicitly stated in the abstract with precise numeric results.
partial
A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%.
Explicitly stated in the abstract with clear numeric evidence.
partial
The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering.
Directly stated in the abstract describing the method.
partial
On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark.
Directly stated in the abstract with a specific result and claim of being first.
partial
Pashto is a language spoken by 60 million people that remains severely underrepresented in NLP.
Stated in the abstract as context. The 'severely underrepresented' claim is qualitative, but it is strongly implied by the research gap the paper addresses.
partial
The largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary.
Explicitly stated in the abstract with specific numeric results.
partial
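Several of the claims above quote a percentage alongside the two values it was derived from, so the arithmetic can be checked directly. The sketch below recomputes the relative changes from the rounded figures quoted on this page and derives the comparison-corpus sizes implied by the "40x" and "83x" statements. It assumes "Nx larger" is meant as a plain size ratio; the implied word counts are derived here and are not stated in the source.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


# Rounded figures as quoted in the claims above.
ppl_drop = -relative_change(8.08, 6.06)   # held-out perplexity reduction
ner_gain = relative_change(19.0, 21.0)    # entity F1 relative improvement

# Corpus sizes implied by the ratios (assumption: "Nx larger" is a size ratio).
corpus_words = 1.25e9
implied_oscar_words = corpus_words / 40
implied_prior_words = corpus_words / 83

print(f"perplexity reduction: {ppl_drop:.1%}")
print(f"NER relative gain:    {ner_gain:.1%}")
print(f"implied OSCAR Pashto subset:  {implied_oscar_words / 1e6:.2f}M words")
print(f"implied prior largest corpus: {implied_prior_words / 1e6:.2f}M words")
```

With the rounded inputs the perplexity reduction comes out at about 25.0%, marginally below the quoted 25.1%, which presumably reflects unrounded perplexities in the paper. The NER gain comes out at about 10.5%, consistent with the quoted "10% relative".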
Related resources will appear here when this paper maps cleanly to topic, benchmark, or dataset surfaces.
6mo ROI: 0.5-1x
3yr ROI: 6-15x
GPU-heavy products have higher costs but premium pricing. Expect break-even by 12mo, then 40%+ margins at scale.
Watch
Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.
Time to first demo: Insufficient data. No first-demo timestamp, owner estimate, or elapsed demo receipt is attached to this surface.
Structured compute envelope: Insufficient data. No data, compute, hardware, memory, latency, dependency, or serving requirement receipt is attached.
Receipt path: /buildability/pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development
Paper ref: pashtocorp-a-1-25-billion-word-corpus-evaluation-suite-and-reproducible-pipeline-for-low-resource-language-development
arXiv id: 2603.16354
Generated at: 2026-04-02T02:30:40.136Z
Evidence freshness: stale
Last verification: 2026-04-02T02:30:40.136Z
Sources: 0
References: 0
Coverage: 17%
Lineage hash: 266d304f71d8c04e9981f9e866ee0b1d7b34cd00349af312bca11e0e7c72c75a (canonical opportunity-kernel lineage hash)
External signature: unsigned_external. No founder, registry, pilot, or production-adoption signature is attached to this receipt.
Verification: not_verified. Verification is blocked until an external signature is provided.
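The receipt fields above can be read as a small state record that a consuming agent checks before acting. The following is a minimal sketch under stated assumptions: the `Receipt` class and `blocking_reasons` function are hypothetical names, and only the external-signature rule is stated on this page. The freshness and source-count checks are illustrative assumptions about what a cautious consumer might also gate on.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Receipt:
    """Subset of the receipt fields shown on this page (hypothetical container)."""
    evidence_freshness: str
    sources: int
    references: int
    external_signature: str


def blocking_reasons(r: Receipt) -> List[str]:
    """Return reasons a consumer might hold off on treating the receipt as verified."""
    reasons = []
    # Stated on the page: verification is blocked without an external signature.
    if r.external_signature == "unsigned_external":
        reasons.append("no external signature attached")
    # Assumption: a stale receipt should be re-checked before use.
    if r.evidence_freshness == "stale":
        reasons.append("evidence is outside the freshness window")
    # Assumption: zero recorded sources or references means the lineage is incomplete.
    if r.sources == 0 or r.references == 0:
        reasons.append("no sources or references recorded")
    return reasons


if __name__ == "__main__":
    # Values copied from the receipt on this page.
    current = Receipt(
        evidence_freshness="stale",
        sources=0,
        references=0,
        external_signature="unsigned_external",
    )
    for reason in blocking_reasons(current):
        print("blocked:", reason)
```

Returning a list of reasons rather than a single boolean keeps each unmet condition visible, which matches how the page reports several pending states side by side.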
Verification pending / evidence receipt incomplete
repo_url
references