A Survey of LLM × DATA

A collection of papers and projects related to LLMs and corresponding data-centric methods. [arXiv]

Other publicly-available materials: [Slide]

If you find our surveys useful, please cite the papers:

```bibtex
@article{LLMDATASurvey,
    title={A Survey of LLM × DATA},
    author={Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, Fan Wu},
    year={2025},
    journal={arXiv preprint arXiv:2505.18458},
    url={https://arxiv.org/abs/2505.18458}
}

@article{tangllmasanalyst,
    title={LLM/Agent-as-Data-Analyst: A Survey},
    author={Zirui Tang, Weizheng Wang, Zihang Zhou, Yang Jiao, Bangrui Xu, Boyu Niu, Xuanhe Zhou, Guoliang Li, Yeye He, Wei Zhou, Yitong Song, Cheng Tan, Bin Wang, Conghui He, Xiaoyang Wang, Fan Wu},
    year={2025},
    journal={arXiv preprint arXiv:2509.23988},
    url={https://arxiv.org/abs/2509.23988}
}
```

🌤 The IaaS Concept of DATA4LLM

The IaaS concept for LLM data (phonetically echoing Infrastructure as a Service) defines the characteristics of high-quality datasets along four key dimensions:

  1. Inclusiveness: broad coverage across domains, tasks, sources, languages, styles, and modalities.
  2. Abundance: sufficient and well-balanced data volume to support scaling, fine-tuning, and continual learning without overfitting.
  3. Articulation: clear, coherent, and instructive content with step-by-step reasoning to enhance model understanding and task performance.
  4. Sanitization: rigorous filtering to remove private, toxic, unethical, and misleading content, ensuring data safety, neutrality, and compliance.


🌟 LLM/Agent-as-Data-Analyst

We observe that the evolution of LLM/Agent-as-Data-Analyst techniques follows a five-dimension trajectory: (1) Data Modality (homogeneous → heterogeneous); (2) Analysis Functionality (literal → semantic); (3) Knowledge Scope (closed-world → open-world); (4) Tool Integration (tool-coupled → tool-assisted); (5) Development Autonomy (manual → fully autonomous).


Table of Contents

Datasets

  1. CommonCrawl: A massive web crawl dataset covering diverse languages and domains; widely used for LLM pretraining. [Source]

  2. The Stack: A large-scale dataset of permissively licensed source code in multiple programming languages; used for code LLMs. [HuggingFace]

  3. RedPajama: A replication of LLaMA’s training data recipe with open datasets; spans web, books, arXiv, and more. [Github]

  4. SlimPajama-627B-DC: A deduplicated and filtered subset of RedPajama (627B tokens); optimized for clean and efficient training. [HuggingFace]

  5. Alpaca-CoT: Instruction-following dataset enhanced with Chain-of-Thought (CoT) reasoning prompts; used for dialogue fine-tuning. [Github]

  6. LLaVA-Pretrain: A multimodal dataset with image-text pairs for training visual language models like LLaVA. [HuggingFace]

  7. Wikipedia: Structured and encyclopedic content; a foundational source for general-purpose language models. [HuggingFace]

  8. C4: A cleaned version of CommonCrawl data, widely used in models like T5 for high-quality web text. [HuggingFace]

  9. BookCorpus: Contains free fiction books; often used to teach models long-form language understanding. [HuggingFace]

  10. Arxiv: Scientific paper corpus from arXiv, covering physics, math, CS, and more; useful for academic language modeling. [HuggingFace]

  11. PubMed: Biomedical literature dataset from the PubMed database; key resource for medical domain models. [Source]

  12. StackExchange: Community Q&A data covering domains like programming, math, philosophy, etc.; useful for QA and dialogue tasks. [Source]

  13. OpenWebText2: A high-quality open-source web text dataset based on URLs commonly cited on Reddit; GPT-style training corpus. [Source]

  14. OpenWebMath: A large corpus of mathematical web text filtered from Common Crawl; designed to improve mathematical reasoning in LLMs. [HuggingFace]

  15. Falcon-RefinedWeb: Filtered web data used in training Falcon models; emphasizes data quality through rigorous preprocessing. [HuggingFace]

  16. CCI 3.0: A large-scale multi-domain Chinese web corpus, suitable for training high-quality Chinese LLMs. [HuggingFace]

  17. OmniCorpus: A large-scale multimodal dataset interleaving images with text; designed for general-purpose multimodal training. [Github]

  18. WanJuan3.0: A diverse and large-scale Chinese dataset including news, fiction, QA, and more; released by OpenDataLab. [Source]

0 Data Characteristics across LLM Stages

⬆️top

Data for Pretraining

  1. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, et al. NeurIPS 2023. [Paper]
  2. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
    Yukun Zhu, Ryan Kiros, Richard Zemel, et al. ICCV 2015. [Paper]

Data for Continual Pre-training

  1. MedicalGPT: Training Medical GPT Model
    Ming Xu. [Github]
  2. BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark
    Dakuan Lu, Hengkui Wu, Jiaqing Liang, et al. arXiv 2023. [Paper]

Data for Supervised Fine-Tuning (SFT)

General Instruction Following

  1. Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM
    Mike Conover, Matt Hayes, Ankit Mathur, et al. 2023. [Source]

Specific Domain Usage

  1. MedicalGPT: Training Medical GPT Model [Github]
  2. DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services
    Shengbin Yue, Wei Chen, Siyuan Wang, et al. arXiv 2023. [Paper]

Data for Reinforcement Learning (RL)

RLHF

  1. MedicalGPT: Training Medical GPT Model [Github]
  2. UltraFeedback: Boosting Language Models with Scaled AI Feedback
    Ganqu Cui, Lifan Yuan, Ning Ding, et al. ICML 2024. [Paper]

RoRL (Reasoning-Oriented RL)

  1. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
    DeepSeek-AI. arXiv 2025. [Paper]
  2. Kimi k1.5: Scaling Reinforcement Learning with LLMs
    Kimi Team. arXiv 2025. [Paper]

Data for Retrieval-Augmented Generation (RAG)

  1. DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue
    Feiyuan Zhang, Dezhi Zhu, James Ming, et al. arXiv 2025. [Paper]
  2. Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation
    Junde Wu, Jiayuan Zhu, Yunli Qi, et al. arXiv 2024. [Paper]
  3. ERAGent: Enhancing Retrieval-Augmented Language Models with Improved Accuracy, Efficiency, and Personalization
    Yunxiao Shi, Xing Zi, Zijing Shi, et al. arXiv 2024. [Paper]
  4. PersonaRAG: Enhancing Retrieval-Augmented Generation Systems with User-Centric Agents
    Saber Zerhoudi, Michael Granitzer. arXiv 2024. [Paper]
  5. DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services [Paper]

Data for LLM Evaluation

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
    Xiang Yue, Yuansheng Ni, Kai Zhang, et al. CVPR 2024. [Paper]
  2. LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models
    Haitao Li, You Chen, Qingyao Ai, et al. NeurIPS 2024. [Paper]
  3. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
    Di Jin, Eileen Pan, Nassim Oufattole, et al. AAAI 2021. [Paper]
  4. Evaluating Large Language Models Trained on Code
    Mark Chen, Jerry Tworek, Heewoo Jun, et al. arXiv 2021. [Paper]

Data for LLM Agents

  1. STeCa: Step-level Trajectory Calibration for LLM Agent Learning
    Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li. arXiv 2025. [Paper]
  2. Large Language Model-Based Agents for Software Engineering: A Survey
    Junwei Liu, Kaixin Wang, Yixuan Chen, et al. arXiv 2024. [Paper]
  3. Advancing LLM Reasoning Generalists with Preference Trees
    Lifan Yuan, Ganqu Cui, Hanbin Wang, et al. arXiv 2024. [Paper]
  4. Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents
    Zhengliang Shi, Shen Gao, Lingyong Yan, et al. arXiv 2024. [Paper]
  5. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
    Ning Ding, Yulin Chen, Bokai Xu, et al. EMNLP 2023. [Paper]

1 Data Processing for LLM

⬆️top

1.1 Data Acquisition

Data Sources

Public Data
  1. Project Gutenberg: A large collection of free eBooks from the public domain; supports training language models on long-form literary text. [Source]
  2. Open Library: A global catalog of books with metadata and some open-access content; useful for multilingual and knowledge-enhanced language modeling. [Source]
  3. GitHub: The world’s largest open-source code hosting platform; supports training models for code generation and understanding. [Source]
  4. GitLab: A DevOps platform for hosting both private and open-source projects; provides high-quality programming and documentation data. [Source]
  5. Bitbucket: A source code hosting platform by Atlassian; suitable for mining enterprise-level software development data. [Source]
  6. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
    Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, et al. LREC-COLING 2024. [Paper]
  7. The Stack: 3 TB of permissively licensed source code
    Denis Kocetkov, Raymond Li, Loubna Ben Allal, et al. arXiv 2022. [Paper]
  8. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
    Linting Xue, Noah Constant, Adam Roberts, et al. NAACL 2021. [Paper]
  9. Exploring the limits of transfer learning with a unified text-to-text transformer
    Colin Raffel, Noam Shazeer, Adam Roberts, et al. JMLR 2020. [Paper]
  10. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, et al. arXiv 2019. [Paper]
  11. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books [Paper]

Data Acquisition Methods

Website Crawling
  1. Beautiful Soup: A Python-based library for parsing HTML and XML documents; supports extracting structured information from static web pages. [Source]
  2. Selenium: A browser automation tool that enables interaction with dynamic web pages; suitable for scraping JavaScript-heavy content. [Github]
  3. Playwright: A browser automation framework developed by Microsoft; supports multi-browser environments and is ideal for high-quality, concurrent web scraping tasks. [Source]
  4. Puppeteer: A Node.js library that provides a high-level API to control headless Chrome or Chromium; useful for scraping complex pages, taking screenshots, or generating PDFs. [Source]
  5. An Empirical Comparison of Web Content Extraction Algorithms
    Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein. SIGIR 2023. [Paper]
  6. Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction
    Adrien Barbaresi. ACL 2021 Demo. [Paper]
  7. Fact or Fiction: Content Classification for Digital Libraries
    Aidan Finn, N. Kushmerick, Barry Smyth. DELOS Workshops / Conferences 2001. [Paper]
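
To make the extraction step concrete, here is a minimal sketch that uses Trafilatura (entry 6 above) to pull the main text out of a single page; the URL is a placeholder and error handling is omitted.

```python
# Fetch one page and extract its main text with Trafilatura; the URL is a
# placeholder. Assumes `pip install trafilatura`.
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")
if downloaded:
    text = trafilatura.extract(downloaded)   # boilerplate-stripped main content
    print(text)
```
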
Layout Analysis
  1. PaddleOCR: An open-source Optical Character Recognition (OCR) toolkit based on the PaddlePaddle deep learning framework; supports multilingual text detection and recognition, ideal for extracting text from images and document layout analysis. [Github]
  2. YOLOv10: Real-Time End-to-End Object Detection
    Ao Wang, Hui Chen, Lihao Liu, et al. NeurIPS 2024. [Paper]
  3. UMIE: Unified Multimodal Information Extraction with Instruction Tuning
    Lin Sun, Kai Zhang, Qingyuan Li, Renze Lou. AAAI 2024. [Paper]
  4. ChatEL: Entity linking with chatbots
    Yifan Ding, Qingkai Zeng, Tim Weninger. LREC-COLING 2024. [Paper]
  5. Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
    Haoran Wei, Lingyu Kong, Jinyue Chen, et al. ECCV 2024. [Paper]
  6. General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
    Haoran Wei, Chenglong Liu, Jinyue Chen, et al. arXiv 2024. [Paper]
  7. Focus Anywhere for Fine-grained Multi-page Document Understanding
    Chenglong Liu, Haoran Wei, Jinyue Chen, et al. arXiv 2024. [Paper]
  8. MinerU: An Open-Source Solution for Precise Document Content Extraction
    Bin Wang, Chao Xu, Xiaomeng Zhao, et al. arXiv 2024. [Paper]
  9. WebIE: Faithful and Robust Information Extraction on the Web
    Chenxi Whitehouse, Clara Vania, Alham Fikri Aji, et al. ACL 2023. [Paper]
  10. ReFinED: An Efficient Zero-shot-capable Approach to End-to-End Entity Linking
    Tom Ayoola, Shubhi Tyagi, Joseph Fisher, et al. NAACL 2022 Industry Track. [Paper]
  11. Alignment-Augmented Consistent Translation for Multilingual Open Information Extraction
    Keshav Kolluru, Muqeeth Mohammed, Shubham Mittal, et al. ACL 2022. [Paper]
  12. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
    Yupan Huang, Tengchao Lv, Lei Cui, et al. ACM Multimedia 2022. [Paper]
  13. Learning Transferable Visual Models From Natural Language Supervision
    Alec Radford, Jong Wook Kim, Chris Hallacy, et al. ICML 2021. [Paper]
  14. Tesseract: an open-source optical character recognition engine
    Anthony Kay. Linux Journal, 2007. [Paper]

1.2 Data Deduplication

⬆️top

  1. Analysis of the Reasoning with Redundant Information Provided Ability of Large Language Models
    Wenbei Xie. arXiv 2023. [Paper]
  2. Scaling Laws and Interpretability of Learning from Repeated Data
    Danny Hernandez, Tom Brown, Tom Conerly, et al. arXiv 2022. [Paper]

Exact Substring Matching

  1. BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
    Guosheng Dong, Da Pan, Yiding Sun, et al. arXiv 2024. [Paper]
  2. Deduplicating Training Data Makes Language Models Better
    Katherine Lee, Daphne Ippolito, Andrew Nystrom, et al. ACL 2022. [Paper]
  3. Suffix arrays: a new method for on-line string searches
    Udi Manber, Gene Myers. SIAM Journal on Computing 1993. [Paper]
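
A toy sketch of the exact-substring idea behind Lee et al. (entry 2): production systems build a suffix array over the tokenized corpus to locate repeated substrings of 50+ tokens, but the same signal can be illustrated with a hash set over fixed-length token windows. The drop-whole-document policy below is a simplification of the paper's span removal.

```python
# Toy version of exact-substring deduplication: drop any document that repeats
# a K-token window already seen elsewhere in the corpus.
K = 50  # minimum duplicated window length, following Lee et al.'s 50 tokens

def dedup_exact_substrings(docs):
    seen, kept = set(), []
    for doc in docs:
        tokens = doc.split()
        windows = {tuple(tokens[i:i + K])
                   for i in range(max(1, len(tokens) - K + 1))}
        if windows & seen:
            continue            # repeats a long span -> drop (simplified policy)
        seen |= windows
        kept.append(doc)
    return kept
```
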

Approximate Hashing-based Deduplication

  1. BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline [Paper]
  2. LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
    Arham Khan, Robert Underwood, Carlo Siebenschuh, et al. arXiv 2024. [Paper]
  3. SimiSketch: Efficiently Estimating Similarity of streaming Multisets
    Fenghao Dong, Yang He, Yutong Liang, et al. arXiv 2024. [Paper]
  4. DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication
    Igor Nunes, Mike Heddes, Pere Vergés, et al. KDD 2023. [Paper]
  5. Formalizing BPE Tokenization
    Martin Berglund, Brink van der Merwe. NCMA 2023. [Paper]
  6. SlimPajama-DC: Understanding Data Combinations for LLM Training
    Zhiqiang Shen, Tianhua Tao, Liqun Ma, et al. arXiv 2023. [Paper]
  7. Deduplicating Training Data Makes Language Models Better [Paper]
  8. Noise-Robust De-Duplication at Scale
    Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell. arXiv 2022. [Paper]
  9. In Defense of Minhash over Simhash
    Anshumali Shrivastava, Ping Li. AISTATS 2014. [Paper]
  10. Similarity estimation techniques from rounding algorithms
    Moses S. Charikar. STOC 2002. [Paper]
  11. On the Resemblance and Containment of Documents
    A. Broder. Compression and Complexity of SEQUENCES 1997. [Paper]
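
The MinHash line of work above (Broder 1997; Shrivastava and Li 2014) estimates the Jaccard similarity between documents from compact signatures. A hedged, illustrative sketch follows; the hashing choices are ours, not from any cited paper, and real systems add LSH banding to avoid all-pairs comparison.

```python
# Illustrative MinHash: each document gets a signature of NUM_PERM minima over
# seeded hashes of its shingles; the fraction of matching signature slots
# estimates Jaccard similarity.
import hashlib

NUM_PERM = 128

def shingles(text, n=5):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def minhash_signature(text):
    shs = shingles(text)
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shs)
            for seed in range(NUM_PERM)]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM
```
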

Approximate Frequency-based Down-Weighting

  1. SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training
    Nan He, Weichen Xiong, Hanwen Liu, et al. ACL 2024. [Paper]

Embedding-Based Clustering

  1. FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
    Eric Slyman, Stefan Lee, Scott Cohen, Kushal Kafle. CVPR 2024. [Paper]
  2. Effective Pruning of Web-Scale Datasets Based on Complexity of Concept Clusters
    Amro Abbas, Evgenia Rusak, Kushal Tirumala, et al. ICLR 2024. [Paper]
  3. D4: Improving LLM Pretraining via Document De-Duplication and Diversification
    Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari Morcos. NeurIPS 2023. [Paper]
  4. SemDeDup: Data-efficient learning at web-scale through semantic deduplication
    Amro Abbas, Kushal Tirumala, Dániel Simig, et al. ICLR 2023. [Paper]
  5. OPT: Open Pre-trained Transformer Language Models
    Susan Zhang, Stephen Roller, Naman Goyal, et al. arXiv 2022. [Paper]
  6. Learning Transferable Visual Models From Natural Language Supervision [Paper]
  7. OpenCLIP
    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, et al. 2021. [Paper]
  8. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
    Christoph Schuhmann, Richard Vencu, Romain Beaumont, et al. NeurIPS 2021. [Paper]
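
SemDeDup-style semantic deduplication (entry 4) can be sketched as: normalize embeddings, cluster them, and keep one representative among near-identical pairs within each cluster. The embeddings are assumed to come from any off-the-shelf encoder; the cluster count and threshold are illustrative.

```python
# Sketch of SemDeDup-style semantic deduplication: cluster normalized
# embeddings, then keep one representative among near-duplicate pairs
# inside each cluster.
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 100,
                   threshold: float = 0.95) -> np.ndarray:
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)
    keep = np.ones(len(emb), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = emb[idx] @ emb[idx].T          # pairwise cosine similarity
        for i in range(len(idx)):
            if not keep[idx[i]]:
                continue
            for j in range(i + 1, len(idx)):
                if keep[idx[j]] and sims[i, j] > threshold:
                    keep[idx[j]] = False      # drop the near-duplicate
    return keep                               # boolean mask of survivors
```
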

Non-Text Data Deduplication

  1. DataComp: In search of the next generation of multimodal datasets
    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, et al. NeurIPS 2023. [Paper]
  2. SemDeDup: Data-efficient learning at web-scale through semantic deduplication [Paper]
  3. Learning Transferable Visual Models From Natural Language Supervision [Paper]
  4. Contrastive Learning with Large Memory Bank and Negative Embedding Subtraction for Accurate Copy Detection
    Shuhei Yokoo. arXiv 2021. [Paper]

1.3 Data Filtering

⬆️top

Sample-level Filtering

(1) Statistical Evaluation
  1. Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
    Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, et al. ICLR 2025. [Paper]
  2. Data-efficient Fine-tuning for LLM-based Recommendation
    Xinyu Lin, Wenjie Wang, Yongqi Li, et al. SIGIR 2024. [Paper]
  3. SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning
    Yexiao He, Ziyao Wang, Zheyu Shen, et al. NeurIPS 2024. [Paper]
  4. SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
    Yu Yang, Siddhartha Mishra, Jeffrey Chiang, et al. NeurIPS 2024. [Paper]
  5. Effective Pruning of Web-Scale Datasets Based on Complexity of Concept Clusters [Paper]
  6. WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions
    Can Xu, Qingfeng Sun, Kai Zheng, et al. ICLR 2024. [Paper]
  7. Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
    Ming Li, Yong Zhang, Shwai He, et al. ACL 2024. [Paper]
  8. Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models
    Dheeraj Mekala, Alex Nguyen, Jingbo Shang. ACL 2024. [Paper]
  9. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
    Luca Soldaini, Rodney Kinney, Akshita Bhagia, et al. ACL 2024. [Paper]
  10. From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
    Ming Li, Yong Zhang, Zhitao Li, et al. NAACL 2024. [Paper]
  11. Improving Pretraining Data Using Perplexity Correlations
    Tristan Thrush, Christopher Potts, Tatsunori Hashimoto. arXiv 2024. [Paper]
  12. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs
    The Mosaic Research Team. 2023. [Paper]
  13. Instruction Tuning with GPT-4
    Baolin Peng, Chunyuan Li, Pengcheng He, et al. arXiv 2023. [Paper]
  14. DINOv2: Learning Robust Visual Features without Supervision
    Maxime Oquab, Timothée Darcet, Théo Moutakanni, et al. arXiv 2023. [Paper]
  15. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
    Leo Gao, Stella Biderman, Sid Black, et al. arXiv 2021. [Paper]
  16. Language Models are Unsupervised Multitask Learners
    Alec Radford, Jeffrey Wu, Rewon Child, et al. OpenAI blog 2019. [Paper]
  17. Bag of Tricks for Efficient Text Classification
    Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov. EACL 2017. [Paper]
  18. The Shapley Value: Essays in Honor of Lloyd S. Shapley
    A. E. Roth, Ed. Cambridge: Cambridge University Press, 1988. [Source]
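
As a concrete instance of statistical evaluation, perplexity-based pruning (entry 1 above) scores each document with a small reference model and keeps the low-perplexity ones. A minimal sketch with Hugging Face Transformers, using GPT-2 purely as an example reference model and an arbitrary threshold:

```python
# Sketch: score documents with a small reference model's perplexity and keep
# the least surprising ones. GPT-2 and the cutoff are example choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    enc = tok(text, return_tensors="pt", truncation=True, max_length=1024)
    out = model(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))   # exp of mean token negative log-likelihood

def filter_by_perplexity(docs, max_ppl=80.0):
    # In practice the cutoff is tuned, e.g. by keeping a quantile of the corpus.
    return [d for d in docs if perplexity(d) <= max_ppl]
```
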
(2) Model Scoring
  1. SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection
    Han Shen, Pin-Yu Chen, Payel Das, Tianyi Chen. ICLR 2025. [Paper]
  2. SCAR: Data Selection via Style-Consistency-Aware Response Ranking for Efficient Instruction Tuning of Large Language Models
    Zhuang Li, Yuncheng Hua, Thuy-Trang Vu, et al. ACL 2025. [Paper] [Github]
  3. QuRating: Selecting High-Quality Data for Training Language Models
    Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen. ICML 2024. [Paper]
  4. What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
    Wei Liu, Weihao Zeng, Keqing He, et al. ICLR 2024. [Paper]
  5. LAB: Large-Scale Alignment for ChatBots
    Shivchander Sudalairaj, Abhishek Bhandwaldar, Aldo Pareja, et al. arXiv 2024. [Paper]
  6. Biases in Large Language Models: Origins, Inventory, and Discussion
    Roberto Navigli, Simone Conia, Björn Ross. ACM JDIQ, 2023. [Paper]
(3) Hybrid Methods
  1. Emergent and predictable memorization in large language models
    Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, et al. NeurIPS 2023. [Paper]
  2. When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
    Max Marion, Ahmet Üstün, Luiza Pozzobon, et al. arXiv 2023. [Paper]
  3. Instruction Mining: Instruction Data Selection for Tuning Large Language Models
    Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun. arXiv 2023. [Paper]
  4. Llama 2: Open Foundation and Fine-Tuned Chat Models
    Hugo Touvron, Louis Martin, Kevin Stone, et al. arXiv 2023. [Paper]
  5. MoDS: Model-oriented Data Selection for Instruction Tuning
    Qianlong Du, Chengqing Zong, Jiajun Zhang. arXiv 2023. [Paper]
  6. Economic Hyperparameter Optimization With Blended Search Strategy
    Chi Wang, Qingyun Wu, Silu Huang, Amin Saied. ICLR 2021. [Paper]
  7. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. NAACL 2019. [Paper]
  8. Active Learning for Convolutional Neural Networks: A Core-Set Approach
    Ozan Sener, Silvio Savarese. ICLR 2018. [Paper]

Content-level Filtering

  1. spaCy: An industrial-strength Natural Language Processing (NLP) library that supports tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and more; well-suited for fast and accurate text processing and information extraction. [Source]
  2. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, et al. ICLR 2025. [Paper]
  3. HunyuanVideo: A Systematic Framework For Large Video Generative Models
    Weijie Kong, Qi Tian, Zijian Zhang, et al. arXiv 2025. [Paper]
  4. Wan: Open and Advanced Large-Scale Video Generative Models
    Team Wan et al. arXiv 2025. [Paper]
  5. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
    Hang Zhang, Xin Li, Lidong Bing. EMNLP 2023 (System Demonstrations). [Paper]
  6. Analyzing Leakage of Personally Identifiable Information in Language Models
    Nils Lukas, Ahmed Salem, Robert Sim, et al. IEEE S&P 2023. [Paper]
  7. DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4
    Zhengliang Liu, Yue Huang, Xiaowei Yu, et al. arXiv 2023. [Paper]
  8. Baichuan 2: Open Large-scale Language Models
    Aiyuan Yang, Bin Xiao, Bingning Wang, et al. arXiv 2023. [Paper]
  9. Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives
    Haoning Wu, Erli Zhang, Liang Liao, et al. arXiv 2022. [Paper]
  10. YOLOX: Exceeding YOLO Series in 2021
    Zheng Ge, Songtao Liu, Feng Wang, et al. arXiv 2021. [Paper]
  11. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs [Paper]
  12. FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP
    Alan Akbik, Tanja Bergmann, Duncan Blythe, et al. NAACL 2019 Demos. [Paper]

1.4 Data Selection

⬆️top

  1. A Survey on Data Selection for Language Models
    Alon Albalak, Yanai Elazar, Sang Michael Xie, et al. arXiv 2024. [Paper]

  2. A Survey on Data Selection for LLM Instruction Tuning
    Jiahao Wang, Bolin Zhang, Qianlong Du, et al. arXiv 2024. [Paper]

Similarity-based Data Selection

  1. spaCy: [Source]
  2. Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis
    Ruiyang Qin, Jun Xia, Zhenge Jia, et al. DAC 2024. [Paper]
  3. CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
    David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, et al. NeurIPS 2024. [Paper]
  4. Efficient Continual Pre-training for Building Domain Specific Large Language Models
    Yong Xie, Karan Aggarwal, Aitzaz Ahmad. Findings of ACL 2024. [Paper]
  5. Data Selection for Language Models via Importance Resampling
    Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang. NeurIPS 2023. [Paper]
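
Importance resampling in the style of DSIR (entry 5 above) weights raw documents by the log-likelihood ratio between a target-domain language model and a raw-pool model, then samples proportionally. The sketch below uses smoothed unigram models and Gumbel-top-k sampling as a deliberately simplified stand-in for the paper's hashed n-gram features.

```python
# Simplified importance-resampling selection: weight raw documents by a
# target-vs-raw unigram log-likelihood ratio, then take a weighted sample
# without replacement via Gumbel-top-k.
import math
import random
from collections import Counter

def logprob(text, counts, total, vocab):
    # add-one smoothed unigram log-probability of a document
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in text.split())

def dsir_select(raw_docs, target_docs, k):
    raw = Counter(w for d in raw_docs for w in d.split())
    tgt = Counter(w for d in target_docs for w in d.split())
    vocab = len(set(raw) | set(tgt))
    n_raw, n_tgt = sum(raw.values()), sum(tgt.values())
    weights = [logprob(d, tgt, n_tgt, vocab) - logprob(d, raw, n_raw, vocab)
               for d in raw_docs]
    # Gumbel-top-k: adding Gumbel noise and taking the top k realizes a
    # weighted sample without replacement.
    keys = [w - math.log(-math.log(random.random())) for w in weights]
    top = sorted(range(len(raw_docs)), key=lambda i: keys[i], reverse=True)[:k]
    return [raw_docs[i] for i in top]
```
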

Optimization-based Data Selection

  1. DSDM: model-aware dataset selection with datamodels
    Logan Engstrom, Axel Feldmann, Aleksander Mądry. ICML 2024. [Paper]
  2. LESS: Selecting Influential Data for Targeted Instruction Tuning
    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, et al. ICML 2024. [Paper]
  3. TSDS: Data Selection for Task-Specific Model Finetuning
    Zifan Liu, Amin Karbasi, Theodoros Rekatsinas. arXiv 2024. [Paper]
  4. Datamodels: Understanding Predictions with Data and Data with Predictions
    Andrew Ilyas, Sung Min Park, Logan Engstrom, et al. ICML 2022. [Paper]

Model-based Data Selection

  1. Autonomous Data Selection with Language Models for Mathematical Texts
    Yifan Zhang, Yifan Luo, Yang Yuan, et al. ICLR 2024. [Paper]

1.5 Data Mixing

⬆️top

  1. Mixtera: A Data Plane for Foundation Model Training
    Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, et al. arXiv 2025. [Paper]
  2. Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
    Clara Na, Ian Magnusson, Ananya Harsh Jha, et al. EMNLP 2024. [Paper]
  3. Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models
    Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, et al. COLING 2024. [Paper]

Heuristic Optimization

  1. BiMix: Bivariate Data Mixing Law for Language Model Pretraining
    Ce Ge, Zhijian Ma, Daoyuan Chen, et al. arXiv 2024. [Paper]
  2. Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining
    Steven Feng, Shrimai Prabhumoye, Kezhi Kong, et al. arXiv 2024. [Paper]
  3. SlimPajama-DC: Understanding Data Combinations for LLM Training [Paper]
  4. Evaluating Large Language Models Trained on Code [Paper]
  5. Exploring the limits of transfer learning with a unified text-to-text transformer [Paper]
  6. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
    Alon Talmor, Jonathan Herzig, Nicholas Lourie, et al. NAACL 2019. [Paper]
  7. A mathematical theory of communication
    C. E. Shannon. The Bell system technical journal 1948. [Paper]
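
At its simplest, heuristic mixing fixes per-domain weights and draws each training example from a domain sampled according to those weights. A minimal sketch; the weights below are illustrative placeholders, not a recommendation from any cited paper.

```python
# Fixed-weight data mixing: each example in a batch comes from a domain drawn
# according to the mixture.
import random

MIXTURE = {"web": 0.6, "code": 0.2, "books": 0.1, "math": 0.1}

def sample_batch(sources, batch_size=8):
    """sources maps a domain name to an (endless) iterator of examples."""
    domains, weights = zip(*MIXTURE.items())
    return [next(sources[random.choices(domains, weights)[0]])
            for _ in range(batch_size)]
```
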

Bilevel Optimization

  1. ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting
    Rui Pan, Jipeng Zhang, Xingyuan Pan, et al. ACL 2025. [Paper]
  2. DoGE: Domain Reweighting with Generalization Estimation
    Simin Fan, Matteo Pagliardini, Martin Jaggi. ICML 2024. [Paper]
  3. An overview of bilevel optimization
    Benoît Colson, Patrice Marcotte, Gilles Savard. AOR 2007. [Paper]

Distributionally Robust Optimization

  1. Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval
    Guangyuan Ma, Yongliang Ma, Xing Wu, et al. AAAI 2025. [Paper]
  2. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
    Sang Michael Xie, Hieu Pham, Xuanyi Dong, et al. NeurIPS 2023. [Paper]
  3. Qwen Technical Report
    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, et al. arXiv 2023. [Paper]

Model-Based Optimization

  1. RegMix: Data Mixture as Regression for Language Model Pre-training
    Qian Liu, Xiaosen Zheng, Niklas Muennighoff, et al. ICLR 2025. [Paper]
  2. Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
    Jiasheng Ye, Peiju Liu, Tianxiang Sun, et al. ICLR 2025. [Paper]
  3. CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models
    Jiawei Gu, Zacc Yang, Chuanghao Ding, et al. EMNLP 2024. [Paper]
  4. TinyLlama: An Open-Source Small Language Model
    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu. arXiv 2024. [Paper]
  5. BiMix: Bivariate Data Mixing Law for Language Model Pretraining [Paper]
  6. D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
    Haoran Que, Jiaheng Liu, Ge Zhang, et al. arXiv 2024. [Paper]
  7. Data Proportion Detection for Optimized Data Management for Large Language Models
    Hao Liang, Keshi Zhao, Yajie Yang, et al. arXiv 2024. [Paper]
  8. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining [Paper]
  9. Training compute-optimal large language models
    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. NeurIPS 2022. [Paper]
  10. LightGBM: a highly efficient gradient boosting decision tree
    Guolin Ke, Qi Meng, Thomas Finley, et al. NeurIPS 2017. [Paper]

1.6 Data Distillation and Synthesis

⬆️top

  1. How to Synthesize Text Data without Model Collapse?
    Xuekai Zhu, Daixuan Cheng, Hengli Li, et al. ICML 2025. [Paper]
  2. Differentially Private Synthetic Data via Foundation Model APIs 2: Text
    Chulin Xie, Zinan Lin, Arturs Backurs, et al. ICML 2024. [Paper]
  3. LLM See, LLM Do: Leveraging Active Inheritance to Target Non-Differentiable Objectives
    Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, et al. EMNLP 2024. [Paper]
  4. WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions [Paper]
  5. Augmenting Math Word Problems via Iterative Question Composing
    Haoxiong Liu, Yifan Zhang, Yifan Luo, et al. arXiv 2024. [Paper]

Knowledge Distillation

  1. Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation
    Jiachen Zhao, Wenlong Zhao, Andrew Drozdov, et al. ACL 2024. [Paper]
  2. PaD: Program-aided Distillation Can Teach Small Models Reasoning Better than Chain-of-thought Fine-tuning
    Xuekai Zhu, Biqing Qi, Kaiyan Zhang, et al. NAACL 2024. [Paper]
  3. Knowledge Distillation Using Frontier Open-source LLMs: Generalizability and the Role of Synthetic Data
    Anup Shirgaonkar, Nikhil Pandey, Nazmiye Ceren Abay, et al. arXiv 2024. [Paper]
  4. Training Verifiers to Solve Math Word Problems
    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. arXiv 2021. [Paper]
  5. Dialogue chain-of-thought distillation for commonsense-aware conversational agents
    Hyungjoo Chae, Yongho Song, Kai Tzu-iunn Ong, et al. arXiv 2023. [Paper]
  6. MCC-KD: Multi-CoT consistent knowledge distillation
    Hongzhan Chen, Siyue Wu, Xiaojun Quan, et al. arXiv 2023. [Paper]
  7. Large language models are reasoning teachers
    Namgyu Ho, Laura Schmid, Se-Young Yun. arXiv 2023. [Paper]
  8. Leveraging training data in few-shot prompting for numerical reasoning
    Zhanming Jie, Wei Lu. arXiv 2023. [Paper]
  9. Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks
    Minki Kang, Seanie Lee, Jinheon Baek, et al. NeurIPS 2023. [Paper]
  10. Symbolic chain-of-thought distillation: Small models can also "think" step-by-step
    Liunian Harold Li, Jack Hessel, Youngjae Yu, et al. arXiv 2024. [Paper]
  11. Explanations from large language models make small reasoners better
    Shiyang Li, Jianshu Chen, Yelong Shen, et al. arXiv 2022. [Paper]
  12. Distilling reasoning capabilities into smaller language models
    Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan. arXiv 2023. [Paper]
  13. SCOTT: Self-consistent chain-of-thought distillation
    Peifeng Wang, Zhengyang Wang, Zheng Li, et al. arXiv 2023. [Paper]
  14. Democratizing reasoning ability: Tailored learning from large language model
    Zhaoyang Wang, Shaohan Huang, Yuxuan Liu, et al. arXiv 2023. [Paper]
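
Most of the chain-of-thought distillation recipes above share one loop: sample step-by-step rationales from a teacher, keep those that reach the gold answer, and fine-tune the student on them. A hedged sketch, where `ask_teacher` is a hypothetical stand-in for any teacher LLM call and the answer check is deliberately naive:

```python
# Rationale distillation loop: sample teacher rationales, keep those whose
# final answer matches the gold label, and store them as SFT pairs.
# `ask_teacher` is a hypothetical callable wrapping any teacher LLM.
def build_distillation_set(problems, ask_teacher, n_samples=4):
    data = []
    for question, gold in problems:
        prompt = f"Q: {question}\nLet's think step by step."
        for _ in range(n_samples):
            rationale = ask_teacher(prompt)
            if rationale.strip().endswith(str(gold)):   # naive answer check
                data.append({"instruction": question, "output": rationale})
                break   # one verified rationale per problem is enough here
    return data
```
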

Pre-training Data Augmentation

  1. BERT-Tiny-Chinese: A lightweight Chinese BERT pre-trained model released by CKIP Lab, with a small number of parameters; suitable for use as an encoder in pre-training data augmentation tasks to enhance efficiency for compact models. [Source]
  2. Case2Code: Scalable Synthetic Data for Code Generation
    Yunfan Shao, Linyang Li, Yichuan Ma, et al. COLING 2025. [Paper]
  3. Advancing Mathematical Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages
    Zui Chen, Tianqiao Liu, Mi Tian, et al. ICLR 2025. [Paper]
  4. JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models
    Kun Zhou, Beichen Zhang, Jiapeng Wang, et al. arXiv 2024. [Paper]
  5. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
    Bin Xiao, Haiping Wu, Weijian Xu, et al. CVPR 2024. [Paper]
  6. DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models
    Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood, et al. CVPR 2024. [Paper]
  7. Magicoder: Empowering Code Generation with OSS-Instruct
    Yuxiang Wei, Zhe Wang, Jiawei Liu, et al. ICML 2024. [Paper]
  8. Instruction Pre-Training: Language Models are Supervised Multitask Learners
    Daixuan Cheng, Yuxian Gu, Shaohan Huang, et al. EMNLP 2024. [Paper]
  9. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [Paper]
  10. Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
    Pratyush Maini, Skyler Seto, Richard Bai, et al. ACL 2024. [Paper]
  11. VeCLIP: Improving CLIP Training via Visual-Enriched Captions
    Zhengfeng Lai, Haotian Zhang, Bowen Zhang, et al. ECCV 2024. [Paper]
  12. Diffusion Models and Representation Learning: A Survey
    Michael Fuest, Pingchuan Ma, Ming Gui, et al. arXiv 2024. [Paper]
  13. CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning
    Qingqing Cao, Mahyar Najibi, Sachin Mehta. arXiv 2024. [Paper]
  14. Qwen2 Technical Report
    An Yang, Baosong Yang, Binyuan Hui, et al. arXiv 2024. [Paper]
  15. TinyLlama: An Open-Source Small Language Model [Paper]
  16. On the Diversity of Synthetic Data and its Impact on Training Large Language Models
    Hao Chen, Abdul Waheed, Xiang Li, et al. arXiv 2024. [Paper]
  17. Towards Effective and Efficient Continual Pre-training of Large Language Models
    Jie Chen, Zhipeng Chen, Jiapeng Wang, et al. arXiv 2024. [Paper]
  18. Improving CLIP Training with Language Rewrites
    Lijie Fan, Dilip Krishnan, Phillip Isola, et al. NeurIPS 2023. [Paper]
  19. Effective Data Augmentation With Diffusion Models
    Brandon Trabucco, Kyle Doherty, Max Gurinas, et al. arXiv 2023. [Paper]
  20. Mistral 7B
    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. arXiv 2023. [Paper]
  21. Llama 2: Open Foundation and Fine-Tuned Chat Models [Paper]
  22. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
    Dustin Podell, Zion English, Kyle Lacey, et al. arXiv 2023. [Paper]
  23. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
    Jesse Dodge, Maarten Sap, Ana Marasović, et al. EMNLP 2021. [Paper]
  24. The Pile: An 800GB Dataset of Diverse Text for Language Modeling [Paper]
  25. First Steps of an Approach to the ARC Challenge based on Descriptive Grid Models and the Minimum Description Length Principle
    Sébastien Ferré. arXiv 2021. [Paper]
  26. TinyBERT: Distilling BERT for Natural Language Understanding
    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, et al. Findings of EMNLP 2020. [Paper]
  27. HellaSwag: Can a Machine Really Finish Your Sentence?
    Rowan Zellers, Ari Holtzman, Yonatan Bisk, et al. ACL 2019. [Paper]

SFT Data Augmentation

  1. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning
    Yiming Huang, Xiao Liu, Yeyun Gong, et al. arXiv 2024. [Paper]
  2. Augmenting Math Word Problems via Iterative Question Composing [Paper]
  3. AgentInstruct: Toward Generative Teaching with Agentic Flows
    Arindam Mitra, Luciano Del Corro, Guoqing Zheng, et al. arXiv 2024. [Paper]
  4. Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
    Haoran Li, Qingxiu Dong, Zhengyang Tang, et al. arXiv 2024. [Paper]
  5. Self-Instruct: Aligning Language Models with Self-Generated Instructions
    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, et al. ACL 2023. [Paper]

SFT Reasoning Data Augmentation

  1. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
  2. LIMO: Less is More for Reasoning
    Yixin Ye, Zhen Huang, Yang Xiao, et al. arXiv 2025. [Paper]
  3. LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters!
    Dacheng Li, Shiyi Cao, Tyler Griggs, et al. arXiv 2025. [Paper]
  4. Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
    Maohao Shen, Guangtao Zeng, Zhenting Qi, et al. arXiv 2025. [Paper]
  5. Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
    Zhenyu Hou, Xin Lv, Rui Lu, et al. arXiv 2025. [Paper]
  6. MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data
    Yinya Huang, Xiaohan Lin, Zhengying Liu, et al. ICLR 2024. [Paper]
  7. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
    Peiyi Wang, Lei Li, Zhihong Shao, et al. ACL 2024. [Paper]
  8. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions
    Jia Li, Edward Beeching, Lewis Tunstall, et al. 2024. [Paper]
  9. QwQ: Reflect Deeply on the Boundaries of the Unknown
    Qwen Team. 2024. [Source]
  10. Let's Verify Step by Step
    Hunter Lightman, Vineet Kosaraju, Yura Burda, et al. arXiv 2023. [Paper]

Reinforcement Learning

  1. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. NeurIPS 2023. [Paper]
  2. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
    Yuntao Bai, Andy Jones, Kamal Ndousse, et al. arXiv 2022. [Paper]

Retrieval-Augmentation Generation

  1. Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data
    Shenglai Zeng, Jiankun Zhang, Pengfei He, et al. arXiv 2024. [Paper]

1.7 End-to-End Data Processing Pipelines

⬆️top

1.7.1 Typical data processing frameworks

  1. Mixtera: A Data Plane for Foundation Model Training
    Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, et al. arXiv 2025. [Paper]
  2. Data-Juicer: A One-Stop Data Processing System for Large Language Models
    Daoyuan Chen, Yilun Huang, Zhijian Ma, et al. SIGMOD 2024. [Paper]
  3. An Integrated Data Processing Framework for Pretraining Foundation Models
    Yiding Sun, Feng Wang, Yutao Zhu, et al. SIGIR 2024. [Paper]
  4. Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
    Hyunbyung Park, Sukyung Lee, Gyoungjin Gim, et al. arXiv 2024. [Paper]

1.7.2 Typical data pipelines

  1. Common Crawl: A large-scale publicly accessible web crawl dataset that provides massive raw webpages and metadata. It serves as a crucial raw data source in typical pretraining data pipelines, where it undergoes multiple processing steps such as cleaning, deduplication, and formatting to produce high-quality corpora for downstream model training. [Source]
  2. The RefinedWeb dataset for falcon LLM: outperforming curated corpora with web data only
    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, et al. NeurIPS 2023. [Paper]
  3. Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction [Paper]
  4. Scaling Language Models: Methods, Analysis & Insights from Training Gopher
    Jack W. Rae, Sebastian Borgeaud, Trevor Cai, et al. arXiv 2021. [Paper]
  5. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
    Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, et al. LREC 2020. [Paper]
  6. Exploring the limits of transfer learning with a unified text-to-text transformer [Paper]
  7. Bag of Tricks for Efficient Text Classification [Paper]

1.7.3 Orchestration of data pipelines

  1. Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development [Paper]

2 Data Storage for LLM

⬆️top

2.1 Data Formats

Training Data Format

  1. TFRecord: A binary data storage format recommended by TensorFlow, suitable for efficient storage and reading of large-scale training data. [Source]
  2. MindRecord: An efficient data storage format used by MindSpore, supporting multi-platform data management. [Source]
  3. tf.data.Dataset: An abstract interface in TensorFlow representing collections of training data, enabling flexible data manipulation. [Source]
  4. COCO JSON: COCO JSON format uses structured JSON to store images and their corresponding labels, widely used in computer vision datasets. [Source]

Model Data Format

  1. PyTorch-specific formats (.pt, .pth): PyTorch’s .pt and .pth formats are used to save model parameters and architecture, supporting model storage and loading. [Source]
  2. TensorFlow(SavedModel, .ckpt): TensorFlow’s SavedModel and checkpoint formats save complete model information, facilitating model reproduction and deployment. [Source]
  3. Hugging Face Transformers library: Hugging Face offers a unified model format interface to facilitate saving and usage of various pretrained models. [Source]
  4. Pickle (.pkl): Pickle format is used for serializing models and data, suitable for quick saving and loading. [Source]
  5. ONNX: An open cross-platform model format supporting model conversion and deployment across different frameworks. [Source]
  6. An Empirical Study of Safetensors' Usage Trends and Developers' Perceptions
    Beatrice Casey, Kaia Damian, Andrew Cotaj, et al. arXiv 2025. [Paper]
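
For illustration, a few of the formats above side by side: PyTorch's pickle-based `.pt` checkpoints versus the safetensors format discussed in entry 6. A minimal sketch, assuming `torch` and `safetensors` are installed:

```python
# Saving the same weights as a pickle-based PyTorch checkpoint and as
# safetensors, then loading the latter back.
import torch
from safetensors.torch import save_file, load_file

model = torch.nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")           # classic .pt (pickle-based)
save_file(model.state_dict(), "model.safetensors")   # safetensors: no pickle,
                                                     # so loading avoids arbitrary code execution
model.load_state_dict(load_file("model.safetensors"))
```
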

2.2 Data Distribution

⬆️top

  1. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
  2. CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl
    Ilya Ilyankou, Meihui Wang, Stefano Cavazzi, et al. SIGSPATIAL 2024. [Paper]

Distributed Storage Systems

  1. JuiceFS: A high-performance cloud-native distributed file system designed for efficient storage and access of large-scale data. [Github]
  2. 3FS: A distributed file system designed for deep learning and large-scale data processing, emphasizing high throughput and reliability. [Github]
  3. S3: A widely used cloud storage service offering secure, scalable, and highly available object storage solutions. [Source]
  4. HDFS Architecture Guide
    D. Borthakur et al. Hadoop Apache Project, 2008. [Source]

Heterogeneous Storage Systems

  1. ProTrain: Efficient LLM Training via Memory-Aware Techniques
    Hanmei Yang, Jin Zhou, Yao Fu, et al. arXiv 2024. [Paper]
  2. ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning
    Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, et al. SC 2021. [Paper]
  3. ZeRO-Offload: Democratizing Billion-Scale Model Training
    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, et al. USENIX ATC 2021. [Paper]
  4. ZeRO: memory optimizations toward training trillion parameter models
    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, et al. SC 2020. [Paper]
  5. vDNN: virtualized deep neural networks for scalable, memory-efficient neural network design
    Minsoo Rhu, Natalia Gimelshein, Jason Clemons, et al. MICRO-49 2016. [Paper]

2.3 Data Organization

⬆️top

  1. Survey of Hallucination in Natural Language Generation
    Ziwei Ji, Nayeon Lee, Rita Frieske, et al. ACM Computing Surveys (2022). [Paper]
  2. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. NeurIPS 2020. [Paper]

Vector-Based Organization

  1. STELLA: A large-scale Chinese vector database supporting efficient vector search and semantic retrieval applications. [Source]
  2. Milvus: An open-source vector database focused on large-scale, high-performance similarity search and analysis. [Source]
  3. Weaviate: Weaviate offers a cloud-native vector search engine supporting intelligent search and knowledge graph construction for multimodal data. [Source]
  4. LanceDB: An efficient vector database designed for large-scale machine learning and recommendation systems. [Source]
  5. Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation
    Zijie Zhong, Hanwen Liu, Xiaoya Cui, et al. COLING 2025. [Paper]
  6. Dense X Retrieval: What Retrieval Granularity Should We Use?
    Tong Chen, Hongwei Wang, Sihao Chen, et al. EMNLP 2024. [Paper]
  7. Scalable and Domain-General Abstractive Proposition Segmentation
    Mohammad Javad Hosseini, Yang Gao, Tim Baumgärtner, et al. Findings of EMNLP 2024. [Paper]
  8. A Hierarchical Context Augmentation Method to Improve Retrieval-Augmented LLMs on Scientific Papers
    Tian-Yi Che, Xian-Ling Mao, Tian Lan, et al. KDD 2024. [Paper]
  9. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
    Jianlyu Chen, Shitao Xiao, Peitian Zhang, et al. Findings of ACL 2024. [Paper]
  10. Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation
    Kaikai An, Fangkai Yang, Liqun Li, et al. arXiv 2024. [Paper]
  11. GleanVec: Accelerating Vector Search with Minimalist Nonlinear Dimensionality Reduction
    Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, et al. arXiv 2024. [Paper]
  12. The Faiss Library
    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, et al. arXiv 2024. [Paper]
  13. Similarity Search in the Blink of an Eye with Compressed Indices
    Cecilia Aguerrebere, Ishwar Singh Bhati, Mark Hildebrand, et al. VLDB Endowment 2023. [Paper]
  14. LeanVec: Searching Vectors Faster by Making Them Fit
    Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, et al. arXiv 2023. [Paper]
  15. Towards General Text Embeddings with Multi-stage Contrastive Learning
    Zehan Li, Xin Zhang, Yanzhao Zhang, et al. arXiv 2023. [Paper]
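
A minimal Faiss sketch (entry 12 above) for the chunk-retrieval pattern in this subsection: build an exact inner-product index over L2-normalized embeddings, so scores behave like cosine similarity, and query for the top-k neighbors. The random vectors are stand-ins for real chunk embeddings.

```python
# Exact top-k vector search with Faiss over L2-normalized embeddings
# (inner product == cosine similarity on unit vectors).
import faiss
import numpy as np

dim = 128
chunks = np.random.rand(10000, dim).astype("float32")
faiss.normalize_L2(chunks)                # in-place L2 normalization

index = faiss.IndexFlatIP(dim)            # exact inner-product index
index.add(chunks)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)      # top-5 most similar chunks
```
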

Graph-Based Organization

  1. ArangoDB: A multi-model database that supports graph, document, and key-value data, suitable for handling complex relational queries. [Source]
  2. MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation
    Tianyu Fan, Jingyuan Wang, Xubin Ren, et al. arXiv 2025. [Paper]
  3. From Local to Global: A Graph RAG Approach to Query-Focused Summarization
    Darren Edge, Ha Trinh, Newman Cheng, et al. arXiv 2024. [Paper]
  4. LightRAG: Simple and Fast Retrieval-Augmented Generation
    Zirui Guo, Lianghao Xia, Yanhua Yu, et al. arXiv 2024. [Paper]
  5. Graph Databases Assessment: JanusGraph, Neo4j, and TigerGraph
    Jéssica Monteiro, et al. Perspectives and Trends in Education and Technology 2023. [Paper]
  6. Empirical Evaluation of a Cloud-Based Graph Database: the Case of Neptune
    Ghislain Auguste Atemezing. KGSWC 2021. [Paper]

2.4 Data Movement

⬆️top

Caching Data

  1. CacheLib: An open-source, high-performance embedded caching library developed by Meta to accelerate data access and increase system throughput. [Source]
  2. Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
    Mark Zhao, Satadru Pan, Niket Agarwal, et al. USENIX ATC 2023. [Paper]
  3. Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
    Rong Gu, Kai Zhang, Zhihao Xu, et al. ICDE 2022. [Paper]
  4. Quiver: An Informed Storage Cache for Deep Learning
    Abhishek Kumar, Muthian Sivathanu. USENIX FAST 2020. [Paper]

Data/Operator Offloading

  1. cedar: Optimized and Unified Machine Learning Input Data Pipelines
    Mark Zhao, et al. Proceedings of the VLDB Endowment, Volume 18, Issue 2, 2025. [Paper]
  2. Pecan: cost-efficient ML data preprocessing with automatic transformation ordering and hybrid placement
    Dan Graur, Oto Mraz, Muyu Li, et al. USENIX ATC 2024. [Paper]
  3. tf.data service: A Case for Disaggregating ML Input Data Processing
    Andrew Audibert, Yang Chen, Dan Graur, et al. SoCC 2023. [Paper]
  4. Cachew: Machine Learning Input Data Processing as a Service
    Dan Graur, Damien Aymon, Dan Kluser, et al. USENIX ATC 2022. [Paper]
  5. Borg: the next generation
    Muhammad Tirmazi, Adam Barker, Nan Deng, et al. EuroSys 2020. [Paper]

Overlapping of storage and computing

  1. Optimizing RLHF Training for Large Language Models with Stage Fusion
    Yinmin Zhong, Zili Zhang, Bingyang Wu, et al. NSDI 2025. [Paper]
  2. SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters
    Hanyu Zhao, Zhenhua Han, Zhi Yang, et al. EuroSys 2023. [Paper]
  3. Optimization by Simulated Annealing
    S. Kirkpatrick, C. D. Gelatt, Jr., M. P. Vecchi. Science, 220(4598):671–680, 1983. [Paper]

2.5 Data Fault Tolerance

⬆️top

Checkpoints

  1. PaddleNLP: PaddleNLP supports checkpoint saving and resuming during training, enabling fault tolerance and recovery for long-running training tasks. [Source]
  2. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
    Ziheng Jiang, Haibin Lin, Yinmin Zhong, et al. USENIX NSDI 2024. [Paper]
  3. ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
    Borui Wan, Mingji Han, Yiyao Sheng, et al. arXiv 2024. [Paper]
  4. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
    Zhuang Wang, Zhen Jia, Shuai Zheng, et al. SOSP 2023. [Paper]
  5. CheckFreq: Frequent, Fine-Grained DNN Checkpointing
    Jayashree Mohan, Amar Phanishayee, Vijay Chidambaram. USENIX FAST 2021. [Paper]

Redundant Computations

  1. ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
    Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, et al. SOSP 2024. [Paper]
  2. Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
    John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, et al. NSDI 2023. [Paper]
  3. Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
    Insu Jang, Zhenning Yang, Zhen Zhang, et al. SOSP 2023. [Paper]

2.6 KV Cache

⬆️top

Cache Space Management

  1. Efficient Memory Management for Large Language Model Serving with PagedAttention
    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al. SOSP 2023. [Paper]
  2. VTensor: Using Virtual Tensors to Build a Layout-oblivious AI Programming Framework
    Feng Yu, Jiacheng Zhao, Huimin Cui, et al. PACT 2020. [Paper]
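
The core idea of PagedAttention (entry 1 above) is to carve KV-cache memory into fixed-size blocks and give each sequence a block table mapping logical positions to physical blocks, so memory is allocated on demand rather than reserved at the maximum sequence length. A toy allocator sketch; the block size and bookkeeping are illustrative, not vLLM's implementation:

```python
# Toy PagedAttention-style KV block allocator: sequences acquire physical
# blocks lazily, one per BLOCK_SIZE tokens, and release them on completion.
BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:          # logical block boundary: grab a block
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE]    # physical block holding this token

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id, []))
```
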

KV Placement

  1. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
    Bin Gao, Zhuomin He, Puru Sharma, et al. USENIX ATC 2024. [Paper]
  2. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
    Chao Jin, Zili Zhang, Xuanlin Jiang, et al. arXiv 2024. [Paper]

KV Shrinking

  1. Adaptive KV-Cache Compression without Manually Setting Budget
    Chenxia Tang, Jianchun Liu, Hongli Xu, et al. arXiv 2025. [Paper]
  2. Fast State Restoration in LLM Serving with HCache
    Shiwei Gao, Youmin Chen, Jiwu Shu. EuroSys 2025. [Paper]
  3. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
    Yuhan Liu, Hanchen Li, Yihua Cheng, et al. SIGCOMM 2024. [Paper]
  4. MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
    Akide Liu, Jing Liu, Zizheng Pan, et al. NeurIPS 2024. [Paper]
  5. Animating rotation with quaternion curves
    Ken Shoemake. ACM SIGGRAPH Computer Graphics, Volume 19, Issue 3. 1985. [Paper]

KV Indexing

  1. ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
    Lu Ye, Ze Tao, Yong Huang, et al. ACL 2024. [Paper]
  2. BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching
    Zhen Zheng, Xin Ji, Taosong Fang, et al. arXiv 2024. [Paper]

3 Data Serving for LLM

⬆️top

3.1 Data Shuffling

Data Shuffling for Training

  1. Mixtera: A Data Plane for Foundation Model Training
    Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, et al. arXiv 2025. [Paper]
  2. Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training
    Zheheng Luo, Xin Zhang, Xiao Liu, et al. ACL 2025. [Paper]
  3. How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition
    Guanting Dong, Hongyi Yuan, Keming Lu, et al. ACL 2024. [Paper]
  4. Mixture-of-Skills: Learning to Optimize Data Usage for Fine-Tuning Large Language Models
    Minghao Wu, Thuy-Trang Vu, Lizhen Qu, et al. EMNLP 2024. [Paper]
  5. Strategic Data Ordering: Enhancing Large Language Model Performance through Curriculum Learning
    Jisu Kim, Juhwan Lee. arXiv 2024. [Paper]
  6. NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks
    Jean-michel Attendu, Jean-philippe Corbeil. SustaiNLP @ ACL 2023. [Paper]
  7. Efficient Online Data Mixing For Language Model Pre-Training
    Alon Albalak, Liangming Pan, Colin Raffel, et al. arXiv 2023. [Paper]
  8. Data Pruning via Moving-one-Sample-out
    Haoru Tan, Sitong Wu, Fei Du, et al. NeurIPS 2023. [Paper]
  9. BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning
    Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, et al. ENLSP @ NeurIPS 2022. [Paper]
  10. Scaling Laws for Neural Language Models
    Jared Kaplan, Sam McCandlish, Tom Henighan, et al. arXiv 2020. [Paper]
  11. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory
    James L. McClelland, Bruce L. McNaughton, Randall C. O’Reilly. Psychological Review 1995. [Paper]
  12. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem
    M. McCloskey, N. J. Cohen. Psychology of Learning and Motivation 1989. [Paper]

Data Selection for RAG

  1. Cohere rerank: Cohere's rerank model reorders initial retrieval results to improve relevance to the query, making it a key component for building high-quality RAG systems. [Source]
  2. ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval
    Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, et al. NAACL 2025. [Paper]
  3. MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation
    Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, et al. arXiv 2025. [Paper]
  4. ARAGOG: Advanced RAG Output Grading
    Matouš Eibich, Shivay Nagpal, Alexander Fred-Ojala. arXiv 2024. [Paper]
  5. Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples!
    Yubo Ma, Yixin Cao, YongChing Hong, et al. Findings of EMNLP 2023. [Paper]
  6. Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model
    Jiaxi Cui, Munan Ning, Zongjian Li, et al. arXiv 2023. [Paper]
  7. RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models
    Ronak Pradeep, Sahel Sharifymoghaddam, Jimmy Lin. arXiv 2023. [Paper]
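
A common implementation of the rerank step above is a cross-encoder that scores each (query, chunk) pair jointly. A hedged sketch with sentence-transformers; the model name is one popular public checkpoint, chosen here only as an example:

```python
# Rerank retrieved chunks for RAG with a cross-encoder: score each
# (query, chunk) pair and keep the top-k highest-scoring chunks.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_k=3):
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(scores, chunks), reverse=True)
    return [c for _, c in ranked[:top_k]]
```
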

3.2 Data Compression

⬆️top

RAG Knowledge Compression

  1. Context Embeddings for Efficient Answer Generation in RAG
    David Rau, Shuai Wang, Hervé Déjean, et al. WSDM 2025. [Paper]
  2. xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token
    Xin Cheng, Xun Wang, Xingxing Zhang, et al. NeurIPS 2024. [Paper]
  3. RECOMP: Improving Retrieval-Augmented LMs with Context Compression and Selective Augmentation
    Fangyuan Xu, Weijia Shi, Eunsol Choi. ICLR 2024. [Paper]
  4. Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation
    Kaize Shi, Xueyao Sun, Qing Li, et al. arXiv 2024. [Paper]
  5. Familiarity-Aware Evidence Compression for Retrieval-Augmented Generation
    Dongwon Jung, Qin Liu, Tenghao Huang, et al. arXiv 2024. [Paper]

Prompt Compression

  1. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
    Huiqiang Jiang, Qianhui Wu, Xufang Luo, et al. ACL 2024. [Paper]
  2. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, et al. Findings of ACL 2024. [Paper]
  3. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, et al. EMNLP 2023. [Paper]
  4. Learning to Compress Prompts with Gist Tokens
    Jesse Mu, Xiang Lisa Li, Noah Goodman. NeurIPS 2023. [Paper]
  5. Adapting Language Models to Compress Contexts
    Alexis Chevalier, Alexander Wettig, Anirudh Ajith, et al. EMNLP 2023. [Paper]
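
The LLMLingua family compresses prompts by dropping tokens that a small language model finds predictable. The toy sketch below substitutes corpus frequency for model perplexity, so it only illustrates the shape of the technique:

```python
# Toy prompt compression: drop the most predictable (common) tokens and keep
# rare, information-dense ones. Real systems such as LLMLingua score tokens
# with a small language model; raw frequency is a crude stand-in.
from collections import Counter

def compress_prompt(prompt: str, rate: float = 0.5) -> str:
    tokens = prompt.split()
    freq = Counter(t.lower() for t in tokens)
    # rarer token => higher "information" score, so sort ascending by frequency
    ranked = sorted(range(len(tokens)), key=lambda i: freq[tokens[i].lower()])
    keep = set(ranked[: int(len(tokens) * rate)])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

long_prompt = ("the the the system system must must reclaim dead tuples "
               "using VACUUM because dead tuples waste heap pages")
print(compress_prompt(long_prompt, rate=0.5))
```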

3.3 Data Packing

⬆️top

Short Sequence Insertion

  1. Fewer Truncations Improve Language Modeling
    Hantian Ding, Zijian Wang, Giovanni Paolini, et al. ICML 2024. [Paper]
  2. Bucket Pre-training is All You Need
    Hongtao Liu, Qiyao Peng, Qing Yang, et al. arXiv 2024. [Paper]

Sequence Combination Optimization

  1. Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
    Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, et al. NeurIPS 2024. [Paper]
  2. Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance
    Mario Michael Krell, Matej Kosec, Sergio P. Perez, et al. arXiv 2021. [Paper]
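
The packing papers above combine variable-length sequences into fixed-size context windows so less compute is wasted on padding and fewer documents are truncated. A minimal first-fit-decreasing sketch (real trainers also adjust attention masks at pack boundaries to avoid the cross-contamination discussed in the Krell et al. paper):

```python
# Minimal first-fit-decreasing sequence packing: combine variable-length
# sequences into fixed-size context windows to cut padding waste. Real
# implementations also reset attention masks at pack boundaries so unrelated
# documents cannot attend to each other.
def pack(lengths: list[int], window: int) -> list[list[int]]:
    bins: list[tuple[int, list[int]]] = []   # (tokens used, member lengths)
    for length in sorted(lengths, reverse=True):
        for i, (used, members) in enumerate(bins):
            if used + length <= window:      # first bin with room wins
                bins[i] = (used + length, members + [length])
                break
        else:                                # no bin fits: open a new window
            bins.append((length, [length]))
    return [members for _, members in bins]

seqs = [900, 700, 512, 300, 200, 120, 60]
for i, b in enumerate(pack(seqs, window=1024)):
    print(f"window {i}: {b} (used {sum(b)}/1024)")
```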

Semantic-Based Packing

  1. Structured Packing in LLM Training Improves Long Context Utilization
    Konrad Staniszewski, Szymon Tworkowski, Sebastian Jaszczur, et al. AAAI 2025. [Paper]
  2. In-context Pretraining: Language Modeling Beyond Document Boundaries
    Weijia Shi, Sewon Min, Maria Lomeli, et al. ICLR 2024. [Paper]

3.4 Data Provenance

⬆️top

  1. A comprehensive survey on data provenance: State-of-the-art approaches and their deployments for IoT security enforcement
    Md Morshed Alam, Weichao Wang. Journal of Computer Security, Volume 29, Issue 4, 2021. [Paper]

Embedding Markers

  1. Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature
    Tong Zhou, Xuandong Zhao, Xiaolin Xu, et al. NeurIPS 2024. [Paper]
  2. Undetectable Watermarks for Language Models
    Miranda Christ, et al. COLT 2024. [Paper]
  3. An Unforgeable Publicly Verifiable Watermark for Large Language Models
    Aiwei Liu, Leyi Pan, Xuming Hu, et al. ICLR 2024. [Paper]
  4. A Watermark for Large Language Models
    John Kirchenbauer, Jonas Geiping, Yuxin Wen, et al. ICML 2023. [Paper]
  5. Publicly-Detectable Watermarking for Language Models
    Jaiden Fairoze, Sanjam Garg, Somesh Jha, et al. arXiv 2023. [Paper]
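
The green-list scheme of "A Watermark for Large Language Models" hashes the previous token to partition the vocabulary, biases generation toward the "green" half, and detects the watermark with a z-test on the green-token count. A model-free toy that samples uniformly from the green list, purely to show detection working:

```python
# Toy green-list watermark in the style of Kirchenbauer et al.: the previous
# token seeds a pseudorandom vocabulary partition; generation favors "green"
# tokens; detection runs a z-test on the green count. No real LM here;
# sampling is uniform over the green list purely to illustrate detection.
import hashlib, math, random

VOCAB = [f"tok{i}" for i in range(100)]
GAMMA = 0.5  # green-list fraction

def green_list(prev_token: str) -> set[str]:
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    return set(random.Random(seed).sample(VOCAB, int(GAMMA * len(VOCAB))))

def generate(n: int) -> list[str]:
    out = ["tok0"]
    for _ in range(n):                       # always pick a green token
        out.append(random.choice(sorted(green_list(out[-1]))))
    return out[1:]

def z_score(tokens: list[str]) -> float:
    hits = sum(t in green_list(prev) for prev, t in zip(["tok0"] + tokens, tokens))
    n = len(tokens)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

print("watermarked z:", round(z_score(generate(200)), 1))                 # large
print("unmarked z:   ", round(z_score(random.choices(VOCAB, k=200)), 1))  # near 0
```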

Statistical Provenance

  1. A Watermark for Large Language Models [Paper]

4 LLM for Data Management

⬆️top

4.1 LLM for Data Manipulation

4.1.1 LLM for Data Cleaning

Data Standardization
  1. Exploring the Feasibility of Automated Data Standardization using Large Language Models for Seamless Positioning
    Max JL Lee, et al. 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN), IEEE, 2024. [Paper]

  2. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
    Simran Arora, et al. Proceedings of the VLDB Endowment, Volume 17, Issue 2, 2024. [Paper]

  3. CleanAgent: Automating Data Standardization with LLM-based Agents
    Danrui Qi, Jiannan Wang. arXiv 2024. [Paper]

  4. AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark
    Lan Li, Liri Fang, Vetle I. Torvik. arXiv 2024. [Paper]

  5. LLMs with User-defined Prompts as Generic Data Operators for Reliable Data Processing
    Luyi Ma, et al. 1st IEEE International Workshop on Data Engineering and Modeling for AI (DEMAI), IEEE BigData 2023. [Paper]

  6. Large Language Models as Data Preprocessors
    Haochen Zhang, et al. arXiv 2023. [Paper]
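
Most of the standardization systems above reduce to prompting a model to map messy values onto one canonical format. A minimal sketch using the OpenAI Python SDK as a stand-in client (the model id and prompt wording are assumptions, not any system's actual implementation):

```python
# Sketch of LLM-driven data standardization: ask a model to rewrite messy
# values in one canonical format. The OpenAI client is a stand-in and the
# model name is an assumption; any chat-capable LLM works the same way.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messy_dates = ["03/04/21", "April 3rd, 2021", "2021.4.3"]
prompt = (
    "Rewrite each date in ISO 8601 (YYYY-MM-DD), one per line, "
    "resolving day/month ambiguity to US ordering:\n"
    + "\n".join(messy_dates)
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model id
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```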

Data Error Processing
  1. Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets
    Tommaso Bendinelli, Artur Dox, Christian Holz. ICLR 2025 Workshop on Foundation Models in the Wild. [Paper]

  2. ZeroED: Hybrid Zero-shot Error Detection through Large Language Model Reasoning
    Wei Ni, et al. arXiv 2025. [Paper]

  3. GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models
    Mengyi Yan, et al. Proceedings of the ACM on Management of Data, Volume 2, Issue 6, 2024. [Paper]

  4. Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation
    Juhwan Choi, Jungmin Yun, Kyohoon Jin, et al. EMNLP 2024. [Paper]

  5. Data Cleaning Using Large Language Models
    Shuo Zhang, Zezhou Huang, Eugene Wu. arXiv 2024. [Paper]

  6. LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs
    Fabian Biester, Mohamed Abdelaal, Daniel Del Gaudio. arXiv 2024. [Paper]

  7. Anomaly Detection of Tabular Data Using LLMs
    Aodong Li, et al. arXiv 2024. [Paper]

  8. Cleaning Semi-Structured Errors in Open Data Using Large Language Models
    M. Mondal, J. Audiffren, L. Dolamic, et al. 2024 11th IEEE Swiss Conference on Data Science (SDS). [Paper]

  9. IterClean: An Iterative Data Cleaning Framework with Large Language Models
    Wei Ni, et al. Proceedings of the ACM Turing Award Celebration Conference - China 2024. [Paper]

Data Imputation
  1. Does Prompt Design Impact Quality of Data Imputation by LLMs?
    Shreenidhi Srinivasan, Lydia Manikonda. arXiv 2025. [Paper]

  2. On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing
    Jianwei Wang, et al. arXiv 2025. [Paper]

  3. RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes
    Zan Ahmad Naeem, et al. VLDB Endowment 2024. [Paper]

  4. Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges
    Bosheng Ding, et al. arXiv 2024. [Paper]

  5. A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models
    Ahatsham Hayat, Mohammad Rashedul Hasan. arXiv 2024. [Paper]

4.1.2 LLM for Data Integration

Entity Matching
  1. A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models
    Zeyu Zhang, et al. International Conference on Extending Database Technology (EDBT) 2025. [Paper]

  2. Large Language Models for Data Discovery and Integration: Challenges and Opportunities
    Juliana Freire, et al. IEEE Data Eng. Bull. 49(1): 3-31, 2025. [Paper]

  3. Entity Matching using Large Language Models
    Ralph Peeters, Christian Bizer. EDBT 2025. [Paper]

  4. Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching
    Tianshu Wang, Hongyu Lin, Xiaoyang Chen, et al. COLING 2025. [Paper]

  5. Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration
    Meihao Fan, Xiaoyue Han, Ju Fan, et al. ICDE 2024. [Paper]

  6. KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs
    Yongqin Xu, Huan Li, Ke Chen, Lidan Shou. arXiv 2024. [Paper]

  7. Jellyfish: A Large Language Model for Data Preprocessing
    Haochen Zhang, Yuyang Dong, Chuan Xiao, et al. EMNLP 2024. [Paper]

  8. Fine-tuning Large Language Models for Entity Matching
    Aaron Steiner, Ralph Peeters, et al. arXiv 2024. [Paper]
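
Pairwise LLM entity matching typically serializes both records and asks for a yes/no judgment; the prompt-construction sketch below follows that common pattern rather than any single paper's recipe:

```python
# Sketch of pairwise entity matching via prompting: serialize two records
# and ask for a yes/no judgment. The serialization format is illustrative;
# the papers above explore many variants (matching, comparing, selecting).
def serialize(record: dict) -> str:
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def match_prompt(a: dict, b: dict) -> str:
    return (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A: {serialize(a)}\n"
        f"Record B: {serialize(b)}\n"
        "Answer strictly 'yes' or 'no'."
    )

a = {"name": "iPhone 13 Pro 128GB", "brand": "Apple", "price": "999"}
b = {"title": "Apple iPhone13Pro (128 GB)", "cost": "$999.00"}
print(match_prompt(a, b))   # send to any chat LLM and parse the yes/no reply
```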

Schema Matching
  1. SCHEMORA: Schema Matching via Multi-stage Recommendation and Metadata Enrichment using Off-the-Shelf LLMs
    Osman Erman Gungor, Derak Paulsen, William Kang. arXiv 2025. [Paper]

  2. Towards Scalable Schema Mapping using Large Language Models
    Christopher Buss, et al. arXiv 2025. [Paper]

  3. Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching
    Chuangtao Ma, Sriom Chakrabarti, Arijit Khan, et al. arXiv 2025. [Paper]

  4. Interactive Data Harmonization with LLM Agents
    Aécio Santos, Eduardo H. M. Pena, Roque Lopez, et al. arXiv 2025. [Paper]

  5. Schema Matching with Large Language Models: an Experimental Study
    Marcel Parciak, Brecht Vandevoort, Frank Neven, et al. TaDA 2024 Workshop, collocated with VLDB 2024. [Paper]

  6. Magneto: Combining Small and Large Language Models for Schema Matching
    Yurong Liu, Eduardo Pena, Aecio Santos, et al. VLDB Endowment 2024. [Paper]

  7. Agent-OM: Leveraging LLM Agents for Ontology Matching
    Zhangcheng Qiang, et al. Proceedings of the VLDB Endowment, Volume 18, Issue 3, 2024. [Paper]

  8. Matchmaker: Self-Improving Large Language Model Programs for Schema Matching
    Nabeel Seedat, Mihaela van der Schaar. arXiv 2024. [Paper]

  9. TableGPT2: A Large Multimodal Model with Tabular Data Integration
    Aofeng Su, et al. arXiv 2024. [Paper]
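
Several of the systems above first generate cheap match candidates (string or embedding similarity over column names) and only then ask an LLM to verify them. A stdlib-only candidate-generation sketch:

```python
# Toy schema-matching candidate generation: rank column pairs by string
# similarity (stdlib difflib), then hand the top pairs to an LLM for
# verification. Real systems also use metadata, sample values, and embeddings.
from difflib import SequenceMatcher
from itertools import product

src = ["cust_name", "cust_addr", "order_dt"]
tgt = ["customer_name", "shipping_address", "order_date", "price"]

pairs = sorted(
    product(src, tgt),
    key=lambda p: SequenceMatcher(None, p[0], p[1]).ratio(),
    reverse=True,
)
for s, t in pairs[:3]:
    print(f"{s} -> {t}  (candidate for LLM verification)")
```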

4.1.3 LLM for Data Discovery

  1. ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models
    Benjamin Feuer, Yurong Liu, Chinmay Hegde, et al. VLDB 2024. [Paper]

Data Profiling
  1. Flexible metadata harvesting for ecology using large language models
    Zehao Lu, Thijs L van der Plas, Parinaz Rashidi, et al. arXiv 2025. [Paper]

  2. Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System
    Muhammad Imam Luthfi Balaka, David Alexander, Qiming Wang, et al. SIGMOD 2025. [Paper]

  3. AutoDDG: Automated Dataset Description Generation using Large Language Models
    Haoxiang Zhang, Yurong Liu, Wei-Lun (Allen) Hung, et al. arXiv 2025. [Paper]

  4. LEDD: Large Language Model-Empowered Data Discovery in Data Lakes
    Qi An, Chihua Ying, Yuqing Zhu, et al. arXiv 2025. [Paper]

  5. LLM-Aided Customizable Profiling of Code Data Based On Programming Language Concepts
    Pankaj Thorat, et al. arXiv 2025. [Paper]

  6. Cocoon: Semantic Table Profiling Using Large Language Models
    Zezhou Huang, et al. Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics. [Paper]

Data Annotation
  1. LLMs as Data Annotators: How Close Are We to Human Performance
    Muhammad Uzair Ul Haq, Davide Rigoni, et al. arXiv 2025. [Paper]

  2. Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index
    Yuxiang Guo, Zhonghao Hu, Yuren Mao, et al. VLDB 2025. [Paper]

  3. Mind the Data Gap: Bridging LLMs to Enterprise Data Integration
    Moe Kayali, Fabian Wenz, Nesime Tatbul, et al. CIDR 2025. [Paper]

  4. Evaluating Knowledge Generation and Self-Refinement Strategies for LLM-based Column Type Annotation
    Keti Korini, Christian Bizer. arXiv 2025. [Paper]

  5. Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models
    Ting Cai, Stephen Sheen, AnHai Doan. arXiv 2025. [Paper]

  6. An LLM Agent-Based Complex Semantic Table Annotation Approach
    Yilin Geng, Shujing Wang, Chuan Wang, et al. arXiv 2025. [Paper]

  7. Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning
    Meysam Alizadeh, et al. Journal of Computational Social Science 8.1 (2025): 1-25. [Paper]

  8. Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation
    Mingxuan Xia, et al. arXiv 2025. [Paper]

  9. Evaluating how LLM annotations represent diverse views on contentious topics
    Megan A. Brown, et al. arXiv 2025. [Paper]

  10. CHORUS: Foundation Models for Unified Data Discovery and Exploration
    Moe Kayali, et al. Proceedings of the VLDB Endowment, Volume 17, Issue 8, 2024. [Paper]

  11. RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph
    Lindsey Linxi Wei, Guorui Xiao, Magdalena Balazinska. arXiv 2024. [Paper]

  12. AutoLabel: Automated Textual Data Annotation Method Based on Active Learning and Large Language Model
    Xuran Ming, et al. International Conference on Knowledge Science, Engineering and Management, 2024. [Paper]

  13. The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection
    Tomas Horych, et al. arXiv 2024. [Paper]

  14. Large Language Models as Annotators: Enhancing Generalization of NLP Models at Minimal Cost
    Parikshit Bansal, Amit Sharma. arXiv 2023. [Paper]

4.2 LLM for Data System Optimization

⬆️top

4.2.1 LLM for Configuration Tuning

  1. ELMo-Tune-V2: LLM-Assisted Full-Cycle Auto-Tuning to Optimize LSM-Based Key-Value Stores
    Viraj Thakkar, Qi Lin, Kenanya Keandra Adriel Prasetyo, et al. arXiv 2025. [Paper]
  2. MLETune: Streamlining Database Knob Tuning via Multi-LLMs Experts Guided Deep Reinforcement Learning
    Wenlong Dong, Wei Liu, Rui Xi, et al. ICPADS 2024. [Paper]
Tuning-Task-Aware Prompt Engineering
  1. λ-Tune: Harnessing Large Language Models for Automated Database System Tuning
    Victor Giannakouris, Immanuel Trummer. SIGMOD 2025. [Paper]
  2. LLMIdxAdvis: Resource-Efficient Index Advisor Utilizing Large Language Model
    Xinxin Zhao, Haoyang Li, Jing Zhang, et al. arXiv 2025. [Paper]
  3. LATuner: An LLM-Enhanced Database Tuning System Based on Adaptive Surrogate Model
    Chongjiong Fan, Zhicheng Pan, Wenwen Sun, et al. ECML PKDD 2024. [Paper]
  4. Is Large Language Model Good at Database Knob Tuning? A Comprehensive Experimental Evaluation
    Yiyan Li, Haoyang Li, Zhao Pu, et al. arXiv 2024. [Paper]
RAG Based Tuning Experience Enrichment
  1. Automatic Database Configuration Debugging using Retrieval-Augmented Language Models
    Sibei Chen, Ju Fan, Bin Wu, et al. Proceedings of the ACM on Management of Data, Volume 3, Issue 1, 2025. [Paper]
  2. GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization
    Jiale Lao, Yibo Wang, Yufei Li, et al. VLDB 2024. [Paper]
Training Enhanced Tuning Goal Alignment
  1. E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model
    Xinmei Huang, Haoyang Li, Jing Zhang, et al. VLDB 2025. [Paper]
  2. DB-GPT: Large Language Model Meets Database
    Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. Data Science and Engineering 2024. [Paper]
  3. HEBO: Heteroscedastic Evolutionary Bayesian Optimisation
    Alexander I. Cowen-Rivers, Wenlong Lyu, Zhi Wang, et al. NeurIPS 2020. [Paper]

4.2.2 LLM for Query Optimization

Optimization-Aware Prompt Engineering
  1. E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence, and Efficiency
    Dongjie Xu, Yue Cui, Weijie Shi, et al. arXiv 2025. [Paper]
  2. LLM4Hint: Leveraging Large Language Models for Hint Recommendation in Offline Query Optimization
    Suchen Liu, Jun Gao, Yinjun Han, et al. arXiv 2025. [Paper]
  3. QUITE: A Query Rewrite System Beyond Rules with LLM Agents
    Yuyang Song, Hanxu Yan, Jiale Lao, et al. arXiv 2025. [Paper]
  4. Can Large Language Models Be Query Optimizer for Relational Databases?
    Jie Tan, Kangfei Zhao, Rui Li, et al. arXiv 2025. [Paper]
  5. A Query Optimization Method Utilizing Large Language Models
    Zhiming Yao, Haoyang Li, Jing Zhang, et al. arXiv 2025. [Paper]
  6. Query Rewriting via LLMs
    Sriram Dharwada, Himanshu Devrani, Jayant Haritsa, et al. arXiv 2025. [Paper]
  7. DB-GPT: Large Language Model Meets Database [Paper]
  8. LLM-R2: A Large Language Model Enhanced Rule-Based Rewrite System for Boosting Query Efficiency
    Zhaodonghui Li, Haitao Yuan, Huiming Wang, et al. VLDB 2024. [Paper]
  9. The Unreasonable Effectiveness of LLMs for Query Optimization
    Peter Akioyamen, Zixuan Yi, Ryan Marcus. ML for Systems Workshop at NeurIPS 2024. [Paper]
  10. R-Bot: An LLM-based Query Rewrite System
    Zhaoyan Sun, Xuanhe Zhou, Guoliang Li. arXiv 2024. [Paper]
  11. Query Rewriting via Large Language Models
    Jie Liu, Barzan Mozafari. arXiv 2024. [Paper]

4.2.3 LLM for Anomaly Diagnosis

Manually Crafted Prompts for Anomaly Diagnosis
  1. DBG-PT: A Large Language Model Assisted Query Performance Regression Debugger
    Victor Giannakouris, Immanuel Trummer. Proceedings of the VLDB Endowment, Volume 17, Issue 12, 2024. [Paper]
RAG Based Diagnosis Experience Enrichment
  1. DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs
    Wei Zhou, Peng Sun, Xuanhe Zhou, et al. arXiv 2025. [Paper]
  2. Query Performance Explanation through Large Language Model for HTAP Systems
    Haibo Xiu, Li Zhang, Tieying Zhang, et al. ICDE 2025. [Paper]
  3. D-Bot: Database Diagnosis System using Large Language Models
    Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. Proceedings of the VLDB Endowment, Volume 17, Issue 10. 2024. [Paper]
  4. LLM As DBA
    Xuanhe Zhou, Guoliang Li, Zhiyuan Liu. arXiv 2023. [Paper]
Multi-Agent Mechanism for Collaborative Diagnosis
  1. GaussMaster: An LLM-based Database Copilot System
    Wei Zhou, Ji Sun, Xuanhe Zhou, et al. arXiv 2025. [Paper]
  2. D-Bot: Database Diagnosis System using Large Language Models [Paper]
  3. Panda: Performance Debugging for Databases using LLM Agents
    Vikramank Singh, Kapil Eknath Vaidya, Vinayshekhar Bannihatti Kumar, et al. CIDR 2024. [Paper]
  4. LLM As DBA [Paper]
Localized LLM Enhancement via Specialized FineTuning
  1. D-Bot: Database Diagnosis System using Large Language Models [Paper]
  2. LLM for Data Management
    Guoliang Li, Xuanhe Zhou, Xinyang Zhao. PVLDB 17(12). 2024. [Paper]
  3. LLM-Enhanced Data Management
    Xuanhe Zhou, Xinyang Zhao, Guoliang Li. arXiv 2024. [Paper]

5 LLM as Data Analyst

5.1 LLM for Structured Data Analysis

5.1.1 Relational Data

  1. A relational model of data for large shared data banks. [Paper]

  2. Multilinear tensor regression for longitudinal relational data [Paper]

  3. Probabilistic classification and clustering in relational data [Paper]

  4. Outlier detection in relational data: A case study in geographical information systems [Paper]

NL2SQL
  1. Finsql: Model-agnostic llms-based text-to-sql framework for financial analysis [Paper]

  2. Pet-sql: A prompt-enhanced two-round refinement of text-to-sql with cross-consistency [Paper]

  3. Chess: Contextual harnessing for efficient sql synthesis [Paper]

  4. Codes: Towards building open-source language models for text-to-sql [Paper]

  5. Combining small language models and large language models for zero-shot nl2sql [Paper]

  6. Cracking SQL Barriers: An llm-based dialect translation system [Paper]

  7. Cracksql: A hybrid sql dialect translation system powered by large language models [Paper]

  8. Din-sql: Decomposed in-context learning of text-to-sql with self-correction [Paper]

  9. Opensearch-sql: Enhancing text-to-sql with dynamic few-shot and consistency alignment [Paper]

  10. Bridging the semantic gap between text and table: A case study on nl2sql [Paper]

  11. The dawn of natural language to sql: Are we fully ready? [Paper]

  12. A Survey of Text-to-SQL in the Era of LLMs: Where Are We, and Where Are We Going? [Paper]

  13. Natural Language to SQL: State of the Art and Open Problems [Paper]

  14. A survey on employing large language models for text-to-sql tasks [Paper]
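
The shared backbone of these NL2SQL systems is a schema-grounded prompt; a minimal sketch follows (toy DDL and question; production pipelines add few-shot demonstrations, schema linking, and execution-based self-correction):

```python
# Minimal schema-grounded NL2SQL prompt, the shared backbone of the systems
# above. The DDL and question are toy examples; real pipelines add few-shot
# demonstrations, schema linking, and execution-based self-correction.
SCHEMA = """CREATE TABLE orders (id INT, customer_id INT, total DECIMAL, placed_at DATE);
CREATE TABLE customers (id INT, name TEXT, city TEXT);"""

def nl2sql_prompt(question: str) -> str:
    return (
        "Given the database schema:\n"
        f"{SCHEMA}\n"
        "Write one SQLite query answering the question. Return only SQL.\n"
        f"Question: {question}"
    )

print(nl2sql_prompt("Total order value per city in 2024, highest first"))
```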

NL2Code
  1. Natural language to code generation in interactive data science notebooks [Paper]

  2. Palm: Scaling language modeling with pathways [Paper]

  3. Contextualized data-wrangling code generation in computational notebooks [Paper]

  4. Data interpreter: An llm agent for data science [Paper]

  5. Collaboration between intelligent agents and large language models: A novel approach for enhancing code generation capability [Paper]

  6. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. [Paper]

LLM for Semantic Analysis.

Multi-Step QA.

  1. Tat-llm: A specialized language model for discrete reasoning over financial tabular and textual data [Paper]

  2. S3HQA: A three-stage approach for multi-hop text-table hybrid question answering [Paper]

  3. Plugging schema graph into multi-table qa: A human-guided framework for reducing llm reliance. [Paper]

  4. TaPERA: Enhancing faithfulness and interpretability in long-form table QA by content planning and execution-based reasoning [Paper]

  5. Reactable: Enhancing react for table question answering [Paper]

  6. Chain-of-table: Evolving tables in the reasoning chain for table understanding [Paper]

End-to-End QA

  1. Table-gpt: Table-tuned gpt for diverse table tasks [Paper]

  2. Tablegpt2: A large multimodal model with tabular data integration [Paper]

  3. Cabinet: Content relevance based noise reduction for table question answering [Paper]

  4. Tablemaster: A recipe to advance table understanding with language models [Paper]

  5. Mmqa: Evaluating llms with multi-table multi-hop complex questions. [Paper]

  6. Multimodal table understanding [Paper]

  7. Improved baselines with visual instruction tuning [Paper]

  8. Tabpedia: Towards comprehensive visual table understanding with concept synergy [Paper]

  9. Judging llm-as-a-judge with mt-bench and chatbot arena. [Paper]

LLM for Time Series Analysis.
  1. Time series databases and influxdb [Paper]

  2. Towards cross-modality modeling for time series analytics: A survey in the llm era [Paper]

  3. A comparison of arima and lstm in forecasting time series [Paper]

  4. Association between forecasting models’ precision and nonlinear patterns of daily river flow time series [Paper]

  5. The performance of lstm and bilstm in forecasting time series [Paper]

  6. Hmckrautoencoder: An interpretable deep learning framework for time series analysis. [Paper]

TS2NL.

  1. Can large language models be anomaly detectors for time series? [Paper]

  2. Timerag: Boosting llm time series forecasting via retrieval-augmented generation. [Paper]

  3. Dynamic time warping algorithm review. [Paper]

  4. Temporal data meets llm–explainable financial time series forecasting. [Paper]

  5. Exploring large language models for climate forecasting [Paper]

  6. Timecap: Learning to contextualize, augment, and predict time series events with large language model agents [Paper]

  7. Explainable multi-modal time series prediction with llm-in-the-loop [Paper]

  8. From news to forecast: Integrating event analysis in llm-based time series forecasting with reflection [Paper]
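
Much of the TS2NL work above rests on serializing numeric series into text an LLM can reason over. A toy serialization sketch (digit formatting, windowing, and the statistics included vary widely across these papers):

```python
# Toy time-series-to-text serialization for LLM forecasting or analysis:
# render recent values plus simple statistics as a prompt. Formatting and
# windowing choices here are illustrative, not any paper's exact scheme.
from statistics import mean

def ts_prompt(series: list[float], horizon: int = 3) -> str:
    recent = ", ".join(f"{v:.1f}" for v in series[-12:])
    return (
        f"A sensor reported these hourly values: {recent}. "
        f"Mean so far: {mean(series):.1f}; last value: {series[-1]:.1f}. "
        f"Predict the next {horizon} values, comma-separated."
    )

print(ts_prompt([20.1, 20.4, 21.0, 22.3, 23.8, 24.9, 25.3, 25.1, 24.2]))
```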

Alignment.

  1. Time-llm: Time series forecasting by reprogramming large language models [Paper]

  2. Seed: A structural encoder for embedding-driven decoding in time series prediction with llms [Paper]

  3. Timecma: Towards llm-empowered multivariate time series forecasting via cross-modality alignment [Paper]

  4. Calf: Aligning llms for time series forecasting via cross-modal fine-tuning [Paper]

  5. S2IP-LLM: Semantic space informed prompt learning with LLM for time series forecasting [Paper]

  6. Llm4ts: Aligning pre-trained llms as data-efficient time-series forecasters [Paper]

  7. Large language models are few-shot multivariate time series classifiers. [Paper]

5.1.2 Graph Data Analysis

  1. A comparison of current graph database models [Paper]

Natural Language To Graph Analysis Query.

  1. Nat-nl2gql: A novel multi-agent framework for translating natural language to graph query language [Paper]

  2. r3-NL2GQL: A model coordination and knowledge graph alignment approach for NL2GQL [Paper]

  3. Aligning large language models to a domain-specific graph database for nl2gql [Paper]

  4. Graph learning in the era of llms: A survey from the perspective of data, models, and tasks [Paper]

  5. Leveraging biomolecule and natural language through multi-modal learning: A survey [Paper]

LLM-based Semantic Analysis.

  • Retrieval-Then-Reasoning.
  1. Subgraph retrieval enhanced model for multi-hop knowledge base question answering [Paper]

  2. Unikgqa: Unified retrieval and reasoning for solving multi-hop question answering over knowledge graph [Paper]

  3. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering [Paper]

  • Execution-Then-Reasoning
  1. Interactive-kbqa: Multi-turn interactions for knowledge base question answering with large language models [Paper]

  2. Mcts-kbqa: Monte carlo tree search for knowledge base question answering [Paper]

  3. Flexkbqa: A flexible llm-powered framework for few-shot knowledge base question answering [Paper]

Graph Task Based Fine-tuning Methods.

  1. Language is all a graph needs [Paper]

  2. Instruct-graph: Boosting large language models via graph-centric instruction tuning and preference alignment [Paper]

  3. Direct preference optimization: Your language model is secretly a reward model [Paper]

  4. Graphgpt: Graph instruction tuning for large language models [Paper]

  5. Inductive representation learning on large graphs [Paper]

  6. Semi-supervised classification with graph convolutional networks. [Paper]

  7. Glam: Fine-tuning large language models for domain knowledge graph alignment via neighborhood partitioning and generative sub-graph encoding [Paper]

  • Agent Based Methods.
  1. Structgpt: A general framework for large language model to reason over structured data [Paper]

  2. Kbqa-o1: Agentic knowledge base question answering with monte carlo tree search. [Paper]

  3. Call me when necessary: Llms can efficiently and faithfully reason over structured environments [Paper]

5.1.3 Structured Data Generation for LLM

  1. Compositional Semantic Parsing on Semi-Structured Tables [Paper]

  2. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task [Paper]

Relational Data Generation.
  1. REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers [Paper]

  2. Relational data generation with graph neural networks and latent diffusion models [Paper]

  3. Synthetic data generation of many-to-many datasets via random graph generation. [Paper]

  4. Mixed-type tabular data synthesis with score-based diffusion in latent space [Paper]

  5. Syntaxsqlnet: Syntax tree networks for complex and cross-domain text-to-sql task [Paper]

  6. Codes: Towards building open-source language models for text-to-sql [Paper]

  7. Itf-gan: Synthetic time series dataset generation and manipulation by interpretable features [Paper]

  8. ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning [Paper]

Graph Data Generation.
  1. A framework for large-scale synthetic graph dataset generation [Paper]

  2. A temporal knowledge graph generation dataset supervised distantly by large language models [Paper]

5.2 LLM for Semi-Structured Data Analysis

5.2.1 Markup Language

Markup Extraction.

  1. Language models enable simple systems for generating structured views of heterogeneous data lakes [Paper]

  2. Webformer: The web-page transformer for structure information extraction [Paper]

Markup Query.

  1. XPath Agent: An Efficient XPath Programming Agent Based on LLM for Web Crawler [Paper]

  2. Bridging the gap: Enabling natural language queries for nosql databases through text-to-nosql translation [Paper]

Markup Understanding.

  1. Dom-lm: Learning generalizable representations for html documents [Paper]

  2. Markuplm: Pre-training of text and markup language for visually-rich document understanding [Paper]

  3. Hierarchical multimodal pre-training for visually rich webpage understanding [Paper]

5.2.2 Semi-Structured Table

Table Representation.

  1. Tuta: Tree-based transformers for generally structured table pre-training [Paper]

  2. ST-Raptor: LLM-Powered Semi-Structured Table Question Answering [Paper]

  3. Reasoning and Retrieval for Complex Semi-structured Tables via Reinforced Relational Data Transformation [Paper]

  4. Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples [Paper]

  5. Can an LLM find its way around a Spreadsheet? [Paper]

Table Prompting.

  1. SpreadsheetLLM: encoding spreadsheets for large language models [Paper]

  2. HySem: A context length optimized LLM pipeline for unstructured tabular extraction [Paper]

Table Querying.

  1. SpreadsheetLLM: encoding spreadsheets for large language models [Paper]

  2. ST-Raptor: LLM-Powered Semi-Structured Table Question Answering [Paper]

5.3 LLM for Unstructured Data Analysis

5.3.1 Chart Analysis

Traditional Approaches

  1. DVQA: Understanding Data Visualizations via Question Answering [Paper]

Chart Captioning

  1. Describing Complex Charts in Natural Language: A Caption Generation System [Paper]

  2. An Architecture for Data-to-Text Systems [Paper]

  3. Chartthinker: A contextual chain-of-thought approach to optimized chart summarization [Paper]

  4. Chart-to-Text: Generating Natural Language Descriptions for Charts by Adapting the Transformer Model [Paper]

  5. FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback [Paper]

  6. Unichart: A universal vision-language pretrained model for chart comprehension and reasoning [Paper]

Chart Question Answering

  1. ChartLlama: A Multimodal LLM for Chart Understanding and Generation [Paper]

  2. ChartBench: A Benchmark for Complex Visual Reasoning in Charts [Paper]

  3. Evochart: A benchmark and a self-training approach towards real-world chart understanding [Paper]

  4. Chartinsights: Evaluating multimodal large language models for low-level chart question answering [Paper]

  5. Vizability: Enhancing chart accessibility with llm-based conversational interaction [Paper]

  6. Charts-of-Thought: Enhancing LLM Visualization Literacy Through Structured Data Extraction [Paper]

  7. ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding [Paper]

  8. ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild [Paper]

  9. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [Paper]

Chart-to-Code

  1. ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation [Paper]

  2. Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback [Paper]

  3. Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation [Paper]

5.3.2 Video Analysis

Temporally-Anchored Approaches
  1. Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability [Paper]

  2. Seq2time: Sequential knowledge transfer for video llm temporal grounding [Paper]

  3. Tempme: Video temporal token merging for efficient text-video retrieval [Paper]

  4. Video token merging for long-form video understanding [Paper]

  5. Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models [Paper]

Instruction-Aware Relative Temporal Localization
  1. From image to video, what do we need in multimodal llms? [Paper]

  2. LLMs meet long video: Advancing long video comprehension with an interactive visual adapter in llms [Paper]

Video Emotional Analysis
  1. Predicting Team Well-Being through Face Video Analysis with AI [Paper]

  2. AI based multimodal emotion and behavior analysis of interviewee [Paper]

Object Detection
  1. Videorefer suite: Advancing spatial-temporal object understanding with video llm [Paper]

  2. Video summarisation with incident and context information using generative ai [Paper]

  3. Abnormal event detection in surveillance videos through LSTM auto-encoding and local minima assistance [Paper]

Gesture and Behavior Detection
  1. Utilizing multimodal large language models for video analysis of posture in studying collaborative learning: A case study [Paper]

  2. Artificial intelligence–powered 3D analysis of video-based caregiver-child interactions [Paper]

Video Data for LLM
  1. VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding [Paper]

  2. Nuwa-infinity: Autoregressive over autoregressive generation for infinite visual synthesis [Paper]

  3. Text2video-zero: Text-to-image diffusion models are zero-shot video generators [Paper]

  4. Align your latents: High-resolution video synthesis with latent diffusion models [Paper]

  5. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation [Paper]

  6. DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models [Paper]

  7. Disco: Disentangled control for realistic human dance generation [Paper]

  8. Imagen video: High definition video generation with diffusion models [Paper]

  9. Make-a-video: Text-to-video generation without text-video data [Paper]

5.3.3 Document Analysis

  1. M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [Paper]

  2. SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding [Paper]

  3. VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding [Paper]

  4. DocFormer: End-to-End Transformer for Document Understanding [Paper]

  5. VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification [Paper]

  6. Efficient End-to-End Visual Document Understanding with Rationale Distillation [Paper]

  7. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [Paper]

  8. MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection [Paper]

  9. Unifying Layout Generation with a Decoupled Diffusion Model [Paper]

  10. LayoutDM: Discrete Diffusion Model for Controllable Layout Generation [Paper]

  11. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents [Paper]

  12. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision [Paper]

  13. VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [Paper]

  14. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [Paper]

  15. MissModal: Increasing Robustness to Missing Modality in Multimodal Sentiment Analysis [Paper]

  16. CREPE: Coordinate-Aware End-to-End Document Parser [Paper]

  17. LTSim: Layout Transportation-based Similarity Measure for Evaluating Layout Generation [Paper]

  18. AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models [Paper]

  19. Automatic generation of scientific papers for data augmentation in document layout analysis [Paper]

  20. PosterLlama: Bridging Design Ability of Language Model to Content-Aware Layout Generation [Paper]

  21. LayoutCoT: Chain-of-Thought Prompting for Layout Generation [Paper]

  22. SciPostLayout: A Dataset for Layout Analysis and Layout Generation of Scientific Posters [Paper]

  23. OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [Paper]

  24. DLAFormer: An End-to-End Transformer For Document Layout Analysis [Paper]

  25. DocLLM: A layout-aware generative language model for multimodal document understanding [Paper]

  26. LayoutLM: Pre-training of Text and Layout for Document Image Understanding [Paper]

  27. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [Paper]

  28. Corrective Retrieval Augmented Generation [Paper]

  29. RAFT: Adapting Language Model to Domain Specific RAG [Paper]

  30. VASCAR: Content-Aware Layout Generation via Visual-Spatial Self-Correction [Paper]

5.3.4 Program Analysis

  1. Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization) [Paper]

  2. Teaching Large Language Models to Self-Debug [Paper]

  3. Syntax-directed variational autoencoder for structured data [Paper]

  4. Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning [Paper]

  5. FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion [Paper]

  6. Composing graphical models with neural networks for structured representations and fast inference [Paper]

  7. Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning [Paper]

  8. REPOFUSE: Repository-Level Code Completion with Fused Dual Context [Paper]

  9. Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [Paper]

  10. Software Vulnerability Detection with GPT and In-Context Learning [Paper]

  11. Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks [Paper]

  12. WizardCoder: Empowering Code Large Language Models with Evol-Instruct [Paper]

  13. SCLA: Automated Smart Contract Summarization via LLMs and Semantic Augmentation [Paper]

  14. Self-Instruct: Aligning Language Models with Self-Generated Instructions [Paper]

  15. Magicoder: Empowering Code Generation with OSS-Instruct [Paper]

  16. Repoformer: Selective Retrieval for Repository-Level Code Completion [Paper]

  17. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [Paper]

  18. Vulnerability Detection by Learning From Syntax-Based Execution Paths of Code [Paper]

  19. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [Paper]

  20. Large Language Model for Vulnerability Detection: Emerging Results and Future Directions [Paper]

5.3.5 3D Model Analysis

3D-Language Fusion
  1. 3d-llm: Injecting the 3d world into large language models [Paper]

  2. 3ur-llm: An end-to-end multimodal large language model for 3d scene understanding [Paper]

  3. Towards 3d molecule-text interpretation in language models [Paper]

  4. Proteinchat: Towards achieving chatgpt-like functionalities on protein 3d structures [Paper]

  5. Protchatgpt: Towards understanding proteins with large language models [Paper]

3D-Derived Task Enhancement
  1. Do Large Language Models Truly Understand Geometric Structures? [Paper]

  2. 3DSMILES-GPT: 3D molecular pocket-based generation with token-only large language model [Paper]

  3. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules [Paper]

  4. ProtChat: An AI Multi-Agent for Automated Protein Analysis Leveraging GPT-4 and Protein Language Model [Paper]

  5. A multimodal protein representation framework for quantifying transferability across biochemical downstream tasks [Paper]

Cross-modal Capability Refinement
  1. Self-supervised image-based 3d model retrieval [Paper]

  2. Llmi3d: Empowering llm with 3d perception from a single 2d image [Paper]

3D Data for LLM
  1. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation [Paper]

  2. Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d [Paper]

  3. Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d [Paper]

  4. Zero-1-to-3: Zero-shot one image to 3d object [Paper]

  5. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation [Paper]

  6. Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner [Paper]

  7. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer [Paper]

  8. Meshanything: Artist-created mesh generation with autoregressive transformers [Paper]

  9. Llama-mesh: Unifying 3d mesh generation with language models [Paper]

5.4 LLM for Heterogeneous Data Analysis

5.4.1 LLM for Modality Alignment

  1. Unicorn: a unified multi-tasking matching model [Paper]

  2. Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes. [Paper]

5.4.2 LLM for Heterogeneous Data Retrieval

  1. Lotus: Enabling semantic queries with llms over tables of unstructured and structured data [Paper]

  2. Towards Operationalizing Heterogeneous Data Discovery [Paper]

  3. CAESURA: Language Models as Multi-Modal Query Planners [Paper]

5.4.3 Heterogeneous Data Analysis Agents

  1. Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent [Paper]

  2. An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models [Paper]

  3. Must: An effective and scalable framework for multimodal search of target modality [Paper]