ARXIV:2603.11743 · TRANSLATION QUALITY ESTIMATION · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: 3 of 4 citation fields filled (Partial) · Missing fields: authors (Missing) · Proof: unverified proof status (Partial)

Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair

arXiv

A semi-synthetic dataset for improving translation quality estimation in under-resourced language pairs.

Blocked on Code › Score 5.0 · Evidence unverified

Opportunity summary

Pain A semi-synthetic dataset for improving translation quality estimation in under-resourced language pairs.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified


PROBLEM

A semi-synthetic dataset for improving translation quality estimation in under-resourced language pairs. Yet, developing highly accurate, adaptable and reliable QE systems for under-resourced language pairs remains largely unsolved, due mainly to limited parallel corpora…

METHOD

Full abstract

Quality estimation (QE) plays a crucial role in machine translation (MT) workflows, as it serves to evaluate generated outputs that have no reference translations and to determine whether human post-editing or full retranslation is necessary. Yet, developing highly accurate, adaptable and reliable QE systems for under-resourced language pairs remains largely unsolved, due mainly to limited parallel corpora and to diverse language-dependent factors, such as with morphosyntactically complex languages. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, generated by creating English sentences based on examples of usage that illustrate typical linguistic patterns, translating them to Hebrew using multiple MT engines, and filtering outputs via BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated professionally translated English-Hebrew segments from our own resources, which were assigned the highest quality score. Controlled translation errors were introduced to address linguistic challenges, particularly regarding gender and number agreement, and we trained neural QE models, including BERT and XLM-R, on this dataset to assess sentence-level MT quality. Our findings highlight the impact of dataset size, distributed balance, and error distribution on model performance. We will describe the challenges, methodology and results of our experiments, and specify future directions aimed at improving QE performance. This research contributes to advancing QE models for under-resourced language pairs, including morphology-rich languages.
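
The abstract mentions "BLEU-based selection" across multiple MT engines but does not say how the score is computed or what it is compared against. As a rough illustration only, the sketch below assumes a consensus scheme: each engine's output is scored by its mean smoothed sentence-level BLEU against the other engines' outputs, and outputs below a threshold are dropped. The function names, the add-one smoothing, the 0.3 threshold, and the placeholder tokens are all assumptions made for this example, not details from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams found in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU of a candidate against one reference.

    Add-one smoothing of each n-gram precision is an assumption made here
    so that short segments do not collapse to a score of zero.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: penalise candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


def select_by_consensus(outputs, threshold=0.3):
    """Keep outputs whose mean BLEU against the other engines meets threshold.

    `outputs` maps a (hypothetical) engine name to its translation of the
    same source sentence. Returns {engine name: mean BLEU} for kept outputs.
    """
    kept = {}
    for name, hyp in outputs.items():
        others = [text for other, text in outputs.items() if other != name]
        if not others:
            continue
        score = sum(sentence_bleu(hyp, text) for text in others) / len(others)
        if score >= threshold:
            kept[name] = score
    return kept


# Placeholder tokens stand in for Hebrew MT output; engine names are invented.
outputs = {
    "engine_a": "w1 w2 w3 w4 w5 w6",
    "engine_b": "w1 w2 w3 w4 w5 w6",
    "engine_c": "x1 x2 x3 x4 x5 x6",
}
kept = select_by_consensus(outputs)
print(sorted(kept))
```

In this toy input the two agreeing engines are kept and the outlier is dropped. A real pipeline would more likely use an established BLEU implementation and proper Hebrew tokenization; the hand-rolled scorer is here only to keep the example self-contained.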

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. We will describe the challenges, methodology and results of our experiments, and specify future directions aimed at improving QE performance.

WHY NOW

Translation Quality Estimation moved forward this cycle; last verified April 2026. Public score 5.0/10.


Opportunity summary

Score 5.0

Pain A semi-synthetic dataset for improving translation quality estimation in under-resourced language pairs.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Analysis summary

A semi-synthetic dataset for improving translation quality estimation in under-resourced language pairs.


Competitive landscape

A semi-synthetic dataset for improving translation quality estimation in under-resourced language pairs.

Segment

Translation Quality Estimation

Adoption evidence

No public code link in the paper record yet

Commercial read

5.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


arXiv: https://arxiv.org/abs/2603.11743 · Published: 2026-03-12T09:48:34.000Z · Research domain: Translation Quality Estimation


