Recent work on synthetic data generation targets domains where real data is scarce or sensitive. Pairing vision and language models makes synthetic data more interpretable, as in efforts to build large-scale remote sensing datasets that combine real and synthetic images to improve model performance. In finance, customizable dataset generators let researchers simulate complex anti-money laundering scenarios and evaluate models more robustly. Large language models are being used to synthesize realistic digital footprints, which broadens the scope of behavioral studies and personalized applications. Frameworks that build causal structure into the generation process improve the fidelity of synthetic tabular data, which is crucial for maintaining data integrity in predictive analytics. Taken together, these developments mark a shift toward application-specific synthetic data solutions designed for real-world problems in individual industries.
The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while ...
Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typ...
Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users' mental states. However, the ...
Electronic health records (EHRs) are invaluable for clinical research, yet privacy concerns severely restrict data sharing. Synthetic data generation offers a promising solution, but EHRs present uniq...
Efficient luggage trolley management is critical for reducing congestion and ensuring asset availability in modern airports. Automated detection systems face two main challenges. First, strict securit...
Financial datasets often suffer from bias that can lead to unfair decision-making in automated systems. In this work, we propose FairFinGAN, a WGAN-based framework designed to generate synthetic finan...
The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While Generative Large Language Models (LLMs) of...
AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, ...
Evaluating retrieval-augmented generation (RAG) pipelines requires corpora where ground truth is knowable, temporally structured, and cross-artifact, properties that real-world datasets rarely provide ...
Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However,...