ARXIV:2603.10547 · DATA INTEGRATION · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: 3 of 4 citation fields filled (Partial) · Missing fields: authors (Missing) · Proof: unverified proof status (Partial)

Automatic End-to-End Data Integration using Large Language Models

arXiv

An automatic data integration pipeline using GPT-5.2 that reduces manual effort and costs significantly.

Blocked on Code › Score 8.0 · Evidence unverified

Opportunity summary

Pain An automatic data integration pipeline using GPT-5.2 that reduces manual effort and costs significantly.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

An automatic data integration pipeline using GPT-5.2 that reduces manual effort and costs significantly. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input…

METHOD

Full abstract

Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to that of human-designed pipelines across three case studies requiring the integration of video game, music, and company-related data. Our experiments show that the LLM-based pipeline produces results similar to, and for some tasks better than, those of the human-designed pipelines. End-to-end, the human and the LLM pipelines produce integrated datasets of comparable size and density. Having the LLM configure the pipelines costs approximately $10 per case study, which represents only a small fraction of the cost of having human data engineers perform the same tasks.
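To make the four artifact types in the abstract more concrete, here is a minimal Python sketch of how such artifacts might be represented and consumed. It is not the paper's implementation: the `PipelineArtifacts` container, the `apply_mappings` and `select_fusion_heuristic` helpers, the two toy heuristics, and all sample data are hypothetical names and values chosen for illustration. In the described pipeline the artifacts would be generated by the LLM; here they are written by hand.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PipelineArtifacts:
    """Hypothetical container for the four artifact types named in the abstract."""
    schema_mapping: dict          # source column -> target column
    value_mapping: dict           # target column -> {raw value -> normalized value}
    matching_training_pairs: list = field(default_factory=list)  # (record_a, record_b, is_match)
    fusion_validation: list = field(default_factory=list)        # (conflicting values, expected value)


def apply_mappings(record, artifacts):
    """Rename columns via the schema mapping, then normalize the values."""
    out = {}
    for source_col, value in record.items():
        target_col = artifacts.schema_mapping.get(source_col)
        if target_col is None:
            continue  # unmapped source columns are dropped in this sketch
        out[target_col] = artifacts.value_mapping.get(target_col, {}).get(value, value)
    return out


# Two toy conflict resolution heuristics for data fusion (assumed, not from the paper).
def most_frequent(values):
    return Counter(values).most_common(1)[0][0]


def longest(values):
    return max(values, key=len)


HEURISTICS = {"most_frequent": most_frequent, "longest": longest}


def select_fusion_heuristic(artifacts):
    """Return the name of the heuristic with the highest accuracy on the validation data."""
    def accuracy(fn):
        hits = sum(fn(vals) == expected for vals, expected in artifacts.fusion_validation)
        return hits / len(artifacts.fusion_validation)
    return max(HEURISTICS, key=lambda name: accuracy(HEURISTICS[name]))


# Hand-written sample artifacts standing in for LLM-generated ones.
artifacts = PipelineArtifacts(
    schema_mapping={"game_title": "name", "plat": "platform"},
    value_mapping={"platform": {"PS4": "PlayStation 4"}},
    fusion_validation=[
        (["Doom", "Doom", "DOOM Eternal"], "Doom"),
        (["Tetris", "Tetris", "Tetris Effect"], "Tetris"),
    ],
)

mapped = apply_mappings({"game_title": "Doom", "plat": "PS4", "internal_id": "17"}, artifacts)
print(mapped)                              # {'name': 'Doom', 'platform': 'PlayStation 4'}
print(select_fusion_heuristic(artifacts))  # most_frequent
```

The validation data plays the same role here as in the abstract: it is used only to choose among candidate conflict resolution heuristics, not to train anything. The entity matching training pairs are carried in the container but left unused, since the abstract does not say which matcher the pipeline trains.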

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. The paper's experiments show that the LLM-based pipeline produces results similar to, and for some tasks better than, those of the human-designed pipelines.

WHY NOW

Data Integration moved forward this cycle; last verified April 2026. Public score 8.0/10.

Opportunity summary

Score 8.0

Pain An automatic data integration pipeline using GPT-5.2 that reduces manual effort and costs significantly.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Competitive landscape

An automatic data integration pipeline using GPT-5.2 that reduces manual effort and costs significantly.

Segment: Data Integration

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Source: https://arxiv.org/abs/2603.10547 · Published: 2026-03-11T08:56:55.000Z · Canonical page: https://sciencetostartup.com/paper/automatic-end-to-end-data-integration-using-large-language-models
