Chengliang Chai
School of Computer Science, BIT

About

I am Chengliang Chai, an Associate Professor (Special Researcher) and Ph.D. supervisor at the School of Computer Science, Beijing Institute of Technology. I received my Bachelor's degree from Harbin Institute of Technology in 2015 and completed my Ph.D. at Tsinghua University in 2020. From 2020 to 2022, I conducted postdoctoral research at Tsinghua University. My research focuses on large AI models, data science, data lakes, and database systems.
I am always open to collaboration in these areas. If you would like to chat, feel free to drop me an email.

Interests

  • Data-centric AI: Algorithms, compute, and data are the three pillars of modern AI. I focus on the data: how to systematically empower large language models (LLMs) with better data. My work covers dataset discovery and selection, scalable cleaning and fusion, high-quality annotation (human-in-the-loop and weak supervision), and end-to-end data lineage and quality governance across the model lifecycle.
  • Data Lakes & Large Models: In the era of multi-source, heterogeneous data, data lakes have become a practical substrate for analytics and AI. I develop methods to index and retrieve lake data for LLM inference (retrieval-augmented generation, RAG), extract and represent knowledge, and enable efficient, accurate multimodal analysis with large models.

Education

Tsinghua University
  • Sep 2015 - Jul 2020. Ph.D., Dept. of Computer Science and Technology
  • Mentor: Prof. Guoliang Li
Harbin Institute of Technology
  • Sep 2011 - Jul 2015. Undergraduate, Dept. of Computer Science and Technology

Experiences

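Sep 2022 - Now
Associate Professor @ Beijing Institute of Technology
Sep 2020 - Jul 2022
Postdoctoral Researcher @ Tsinghua University
Sep 2015 - Jul 2020
Ph.D. Student @ Tsinghua University
Supervisor: Prof. Guoliang Li
Sep 2011 - Jul 2015
Undergraduate Student @ Harbin Institute of Technology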

Publications

QUEST: Query Optimization in Unstructured Document Analysis
Zhaoze Sun, Qiyan Deng, Chengliang Chai, Kaisen Jin, Xinyu Guo, Han Han, Ye Yuan, Guoren Wang, Lei Cao
VLDB 2025
[Paper]
OIE: An Interpretable System for Outlier Explanation and Summarization
Jingzhe Xu, Yuhao Deng, Chengliang Chai, Zequn Li, Yuping Wang, Lei Cao
VLDB 2025
[Paper]
Doctopus: A System for Budget-aware Structural Data Extraction from Unstructured Documents
Yuanhao Zhong, Yuhao Deng, Chengliang Chai, Ruixin Gu, Ye Yuan, Guoren Wang, Lei Cao
SIGMOD 2025
[Paper]
Two birds with one stone: Efficient deep learning over mislabeled data through subset selection
Yuhao Deng, Chengliang Chai, Kaisen Jin, Linan Zheng, Lei Cao, Ye Yuan, Guoren Wang
SIGMOD 2025
[Paper]
Not All Documents Are What You Need for Extracting Instruction Tuning Data
Chi Zhang, Huaping Zhong, Hongtao Li, Chengliang Chai, Jiawei Hong, Yuhao Deng, Jiacheng Wang, Tian Tan, Yizhou Yan, Jiantao Qiu, Ye Yuan, Guoren Wang, Conghui He, Lei Cao
arXiv 2025
[Paper]
Cost-effective Missing Value Imputation for Data-effective Machine Learning
Chengliang Chai, Kaisen Jin, Nan Tang, Ju Fan, Dongjing Miao, Jiayi Wang, Yuyu Luo, Guoliang Li, Ye Yuan, Guoren Wang
TODS 2025
[Paper]
Federated Data Analytics with Differentially Private Density Estimation Model
Jiayi Wang, Lei Cao, Chengliang Chai, Guoliang Li
ICDE 2025
[Paper]
LEAD: Iterative data selection for efficient LLM instruction tuning
Xiaotian Lin, Yanlin Qi, Yizhang Zhu, Themis Palpanas, Chengliang Chai, Nan Tang, Yuyu Luo
arXiv 2025
[Paper] [Code]
Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization
Kuan Zhang, Chengliang Chai, Jingzhe Xu, Chi Zhang, Ye Yuan, Guoren Wang, Lei Cao
arXiv 2025
[Paper] [Code]
Harnessing diversity for important data selection in pretraining large language models
Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ju Fan, Ye Yuan, Guoren Wang, Conghui He
arXiv 2024
[Paper]
LakeCompass: An End-to-End System for Data Maintenance, Search and Analysis in Data Lakes
Chengliang Chai, Yuhao Deng, Yutong Zhan, Ziqi Cao, Yuanfang Zhang, Lei Cao, Yuping Wang, Zhiwei Zhang, Ye Yuan, Guoren Wang, Nan Tang
VLDB 2024
[Paper]
IDE: A system for iterative mislabel detection
Yuhao Deng, Qiyan Deng, Chengliang Chai, Lei Cao, Nan Tang, Ju Fan, Jiayi Wang, Ye Yuan, Guoren Wang
SIGMOD 2024
[Paper]
The dawn of natural language to SQL: Are we fully ready?
Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, Nan Tang
VLDB 2024
[Paper] [Code]
Applications and challenges for large language models: From data management perspective
Meihui Zhang, Zhaoxuan Ji, Zhaojing Luo, Yuncheng Wu, Chengliang Chai
ICDE 2024
[Paper]
Representation learning for entity alignment in knowledge graph: A design space exploration
Peng Huang, Meihui Zhang, Ziyue Zhong, Chengliang Chai, Ju Fan
ICDE 2024
[Paper]
Mitigating data scarcity in supervised machine learning through reinforcement learning guided data generation
Chengliang Chai, Kaisen Jin, Nan Tang, Ju Fan, Lianpeng Qiao, Yuping Wang, Yuyu Luo, Ye Yuan, Guoren Wang
ICDE 2024
[Paper]
DMRNet: Effective network for accurate discharge medication recommendation
Jiyun Shi, Yuqiao Wang, Chi Zhang, Zhaojing Luo, Chengliang Chai, Meihui Zhang
ICDE 2024
[Paper]
Separation is for better reunion: data lake storage at Huawei
Xin Tang, Chengliang Chai, Dawei Zhao, Haohai Ma, Yong Zheng, Zhenyong Fan, Xin Wu, Jiaquan Zhang, Rui Zhang, Duanshun Li, Yi He, Keji Huang, Guangbin Meng, Yidong Wang, Yuefeng Zhou, Tao Tao, Lirong Jian, Jiwu Shu, Yuping Wang, Ye Yuan, Guoren Wang, Guoliang Li
ICDE 2024
[Paper]
Cost-effective in-context learning for entity resolution: A design space exploration
Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, Xiaoyong Du
ICDE 2024
[Paper]
MisDetect: Iterative mislabel detection using early loss
Yuhao Deng, Chengliang Chai, Lei Cao, Nan Tang, Jiayi Wang, Ju Fan, Ye Yuan, Guoren Wang
VLDB 2024
[Paper]
LakeBench: A benchmark for discovering joinable and unionable tables in data lakes
Yuhao Deng, Chengliang Chai, Lei Cao, Qin Yuan, Siyuan Chen, Yanrui Yu, Zhaoze Sun, Junyi Wang, Jiajun Li, Ziqi Cao, Kaisen Jin, Chi Zhang, Yuqing Jiang, Yuanfang Zhang, Yuping Wang, Ye Yuan, Guoren Wang, Nan Tang
VLDB 2024
[Paper] [Code]
PACE: Poisoning attacks on learned cardinality estimation
Jintao Zhang, Chao Zhang, Guoliang Li, Chengliang Chai
SIGMOD 2024
[Paper]
Cardinality estimation using normalizing flow
Jiayi Wang, Chengliang Chai, Jiabin Liu, Guoliang Li
VLDB 2024
[Paper]
A survey of multi-dimensional indexes: past and future trends
Mingxin Li, Hancheng Wang, Haipeng Dai, Meng Li, Chengliang Chai, Rong Gu, Feng Chen, Zhiyuan Chen, Shuaituan Li, Qizhi Liu, Guihai Chen
TKDE 2024
[Paper]
Efficient coreset selection with cluster-based methods
Chengliang Chai, Jiayi Wang, Nan Tang, Ye Yuan, Jiabin Liu, Yuhao Deng, Guoren Wang
KDD 2023
[Paper]
GoodCore: Data-effective and data-efficient machine learning through coreset selection over incomplete data
Chengliang Chai, Jiabin Liu, Nan Tang, Ju Fan, Dongjing Miao, Jiayi Wang, Yuyu Luo, Guoliang Li
SIGMOD 2023
[Paper]
Demystifying artificial intelligence for data preparation
Chengliang Chai, Nan Tang, Ju Fan, Yuyu Luo
SIGMOD 2023
[Paper]
Learned data-aware image representations of line charts for similarity search
Yuyu Luo, Yihui Zhou, Nan Tang, Guoliang Li, Chengliang Chai, Leixian Shen
SIGMOD 2023
[Paper]
HAIPipe: Combining human-generated and machine-generated pipelines for data preparation
Sibei Chen, Nan Tang, Ju Fan, Xuemi Yan, Chengliang Chai, Guoliang Li, Xiaoyong Du
SIGMOD 2023
[Paper] [Code]
A Topic-Aware Data Generation Framework for Math Word Problems
Tianyu Zhao, Chengliang Chai, Jiabin Liu, Guoliang Li, Jianhua Feng, Zitao Liu
DASFAA 2023
[Paper]
AutoCE: An accurate and efficient model advisor for learned cardinality estimation
Jintao Zhang, Chao Zhang, Guoliang Li, Chengliang Chai
ICDE 2023
[Paper]
HOFD: An outdated fact detector for knowledge bases
Shuang Hao, Chengliang Chai, Guoliang Li, Nan Tang, Ning Wang, Xiang Yu
TKDE 2023
[Paper] [Code]
Database meets AI: A survey
Xuanhe Zhou, Chengliang Chai, Guoliang Li, Ji Sun
TKDE 2023
[Paper]
Dynamic materialized view management using graph neural network
Yue Han, Chengliang Chai, Jiabin Liu, Guoliang Li, Chuangxian Wei, Chaoqun Zhan
ICDE 2023
[Paper]
Cost-based or learning-based? A hybrid query optimizer for query plan selection
Xiang Yu, Chengliang Chai, Guoliang Li, Jiabin Liu
VLDB 2022
[Paper] [Code]
Coresets over multiple tables for feature-rich and data-efficient machine learning
Jiayi Wang, Chengliang Chai, Nan Tang, Jiabin Liu, Guoliang Li
VLDB 2022
[Paper] [Code]
DADER: hands-off entity resolution with domain adaptation
Jianhong Tu, Xiaoyue Han, Ju Fan, Nan Tang, Chengliang Chai, Guoliang Li, Xiaoyong Du
VLDB 2022
[Paper] [Code]
Interactively discovering and ranking desired tuples by data exploration
Xuedi Qin, Chengliang Chai, Yuyu Luo, Tianyu Zhao, Nan Tang, Guoliang Li, Jianhua Feng, Xiang Yu, Mourad Ouzzani
VLDB 2022
[Paper]
LearnedSQLGen: Constraint-aware SQL generation using reinforcement learning
Lixi Zhang, Chengliang Chai, Xuanhe Zhou, Guoliang Li
SIGMOD 2022
[Paper]
Domain adaptation for deep entity resolution
Jianhong Tu, Ju Fan, Nan Tang, Peng Wang, Chengliang Chai, Guoliang Li, Ruixue Fan, Xiaoyong Du
SIGMOD 2022
[Paper] [Code]
RW-Tree: A learned workload-aware framework for R-tree construction
Haowen Dong, Chengliang Chai, Yuyu Luo, Jiabin Liu, Jianhua Feng, Chaoqun Zhan
ICDE 2022
[Paper]
Feature augmentation with reinforcement learning
Jiabin Liu, Chengliang Chai, Yuyu Luo, Yin Lou, Jianhua Feng, Nan Tang
ICDE 2022
[Paper]
Synthesizing privacy preserving entity resolution datasets
Xuedi Qin, Chengliang Chai, Nan Tang, Jian Li, Yuyu Luo, Guoliang Li, Yaoyu Zhu
ICDE 2022
[Paper]
RNE: computing shortest paths using road network embedding
Tianyu Zhao, Shuai Huang, Yong Wang, Chengliang Chai, Guoliang Li
VLDB 2022
[Paper]
Learned Query Optimizer: At the Forefront of AI-Driven Databases
Rong Zhu, Ziniu Wu, Chengliang Chai, Andreas Pfadler, Bolin Ding, Guoliang Li, Jingren Zhou
EDBT 2022
[Paper]
Selective data acquisition in the wild for model charging
Chengliang Chai, Jiabin Liu, Nan Tang, Guoliang Li, Yuyu Luo
VLDB 2022
[Paper]
Data management for machine learning: A survey
Chengliang Chai, Jiayi Wang, Yuyu Luo, Zeping Niu, Guoliang Li
TKDE 2022
[Paper]
AlphaQO: Robust Learned Query Optimizer
Xiang Yu, Chengliang Chai, Xinning Zhang, Nan Tang, Ji Sun, Guoliang Li
IJSI 2022
[Paper]
Natural language to visualization by neural machine translation
Yuyu Luo, Nan Tang, Guoliang Li, Jiawei Tang, Chengliang Chai, Xuedi Qin
TVCG 2021
[Paper] [Code]
FACE: A normalizing flow based cardinality estimator
Jiayi Wang, Chengliang Chai, Jiabin Liu, Guoliang Li
VLDB 2021
[Paper]
A learned query rewrite system using Monte Carlo tree search
Xuanhe Zhou, Guoliang Li, Chengliang Chai, Jianhua Feng
VLDB 2021
[Paper] [Code]
Automatic data acquisition for deep learning
Jiabin Liu, Fu Zhu, Chengliang Chai, Yuyu Luo, Nan Tang
VLDB 2021
[Paper]
Synthesizing natural language to visualization (NL2VIS) benchmarks from NL2SQL benchmarks
Yuyu Luo, Nan Tang, Guoliang Li, Chengliang Chai, Wenbo Li, Xuedi Qin
SIGMOD 2021
[Paper]
Ranking desired tuples by database exploration
Xuedi Qin, Chengliang Chai, Yuyu Luo, Tianyu Zhao, Nan Tang, Guoliang Li, Jianhua Feng, Xiang Yu, Mourad Ouzzani
ICDE 2021
[Paper]
Empowering natural language to visualization neural translation using synthesized benchmarks
Yuyu Luo, Jiawei Tang, Guoliang Li, Chengliang Chai
IEEE VIS 2021
[Paper]
A tree-based indexing approach for diverse textual similarity search
Minghe Yu, Chengliang Chai, Ge Yu
IEEE Access 2020
[Paper]
VisClean: Interactive cleaning for progressive visualization
Yuyu Luo, Chengliang Chai, Xuedi Qin, Nan Tang, Guoliang Li
VLDB 2020
[Paper]
Interactively discovering and ranking desired tuples without writing SQL queries
Xuedi Qin, Chengliang Chai, Yuyu Luo, Nan Tang, Guoliang Li
SIGMOD 2020
[Paper]
Human-in-the-loop outlier detection
Chengliang Chai, Lei Cao, Guoliang Li, Jian Li, Yuyu Luo, Samuel Madden
SIGMOD 2020
[Paper]
Human-in-the-loop Techniques in Machine Learning
Chengliang Chai, Guoliang Li
TKDE 2020
[Paper]
Database meets artificial intelligence: A survey
Xuanhe Zhou, Chengliang Chai, Guoliang Li, Ji Sun
TKDE 2020
[Paper]
Crowdsourcing-based data extraction from visualization charts
Chengliang Chai, Guoliang Li, Ju Fan, Yuyu Luo
ICDE 2020
[Paper]
Outdated fact detection in knowledge bases
Shuang Hao, Chengliang Chai, Guoliang Li, Nan Tang, Ning Wang, Xiang Yu
ICDE 2020
[Paper]
Reinforcement learning with Tree-LSTM for join order selection
Xiang Yu, Guoliang Li, Chengliang Chai, Nan Tang
ICDE 2020
[Paper]
Interactive cleaning for progressive visualization through composite questions
Yuyu Luo, Chengliang Chai, Xuedi Qin, Nan Tang, Guoliang Li
ICDE 2020
[Paper]
Steerable self-driving data visualization
Yuyu Luo, Xuedi Qin, Chengliang Chai, Nan Tang, Guoliang Li, Wenbo Li
TKDE 2020
[Paper]
CrowdChart: Crowdsourced data extraction from visualization charts
Chengliang Chai, Guoliang Li, Ju Fan, Yuyu Luo
TKDE 2020
[Paper]
Manually detecting errors for data cleaning using adaptive crowdsourcing strategies
Haojun Zhang, Chengliang Chai, AnHai Doan, Paraschos Koutris, Esteban Arcaute
EDBT 2020
[Paper]
Towards automatic mathematical exercise solving
Tianyu Zhao, Chengliang Chai, Yuyu Luo, Jianhua Feng, Yan Huang, Songfan Yang, Haitao Yuan, Haoda Li, Kaiyu Li, Fu Zhu, Kang Pan
DSE 2019
[Paper]
AnalyticDB: real-time OLAP database system at Alibaba cloud
Chaoqun Zhan, Maomeng Su, Chuangxian Wei, Xiaoqiang Peng, Liang Lin, Sheng Wang, Zhe Chen, Feifei Li, Yue Pan, Fang Zheng, Chengliang Chai
VLDB 2019
[Paper]
Crowdsourcing database systems: Overview and challenges
Chengliang Chai, Ju Fan, Guoliang Li, Jiannan Wang, Yudian Zheng
ICDE 2019
[Paper]
A partial-order-based framework for cost-effective crowdsourced entity resolution
Chengliang Chai, Guoliang Li, Jian Li, Dong Deng, Jianhua Feng
VLDB 2018
[Paper]
CDB: A crowd-powered database system
Guoliang Li, Chengliang Chai, Ju Fan, Xueping Weng, Jian Li, Yudian Zheng, Yuanbing Li, Xiang Yu, Xiaohang Zhang, Haitao Yuan
VLDB 2018
[Paper] [Code]
Crowd-powered data mining
Chengliang Chai, Ju Fan, Guoliang Li, Jiannan Wang, Yudian Zheng
arXiv 2018
[Paper] [Video] [Video]
Incentive-based entity collection using crowdsourcing
Chengliang Chai, Ju Fan, Guoliang Li
ICDE 2018
[Paper]
CDB: optimizing queries with crowd-based selections and joins
Guoliang Li, Chengliang Chai, Ju Fan, Xueping Weng, Jian Li, Yudian Zheng, Yuanbing Li, Xiang Yu, Xiaohang Zhang, Haitao Yuan
SIGMOD 2017
[Paper]
Cost-effective crowdsourced entity resolution: A partial-order approach
Chengliang Chai, Guoliang Li, Jian Li, Dong Deng, Jianhua Feng
SIGMOD 2016
[Paper]
Natural Language to SQL: State of the Art and Open Problems
Yuyu Luo, Guoliang Li, Ju Fan, Chengliang Chai, Nan Tang
VLDB
[Paper] [Code]
Retrieval Augmented Imputation Using Data Lake Tables
Chenyu Yang, Yuyu Luo, Chuanxuan Cui, Ju Fan, Chengliang Chai, Nan Tang
[Paper]
A technical report on dynamic materialized view management using graph neural network
Yue Han, Chengliang Chai, Jiabin Liu, Guoliang Li, Chuangxian Wei, Chaoqun Zhan
ICDE
[Paper]

GitHub

LEAD is an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop, eliminating the need for costly additional model inference.
LakeBench is a large-scale benchmark for table discovery methods. It provides a more comprehensive and realistic evaluation platform for the field, covering domains such as finance, retail, manufacturing, energy, media, and more.
Given a human-generated pipeline (HI-pipeline) for an ML task, HAIPipe uses a reinforcement-learning-based approach to search for an optimized ML-generated pipeline (ML-pipeline) and adopts an enumeration-sampling strategy to select the best-performing combined pipeline (HAI-pipeline).
HOFD is a human-in-the-loop approach for detecting outdated facts in knowledge bases. It trains a binary classifier on update signals (e.g., frequency and recency) to rank likely stale facts, has humans verify them, then uses logical rules to infer more outdated facts and feeds these labels back into the model. It shows strong results on YAGO and DBpedia.
A hybrid SQL query optimizer that combines learning-based and cost-based methods: it uses learning-derived hints to generate strong candidate plans, then picks the best using an uncertainty-aware execution-time predictor. On real workloads it beats state-of-the-art baselines, cutting total latency by ~25% and tail latency by ~65% vs. PostgreSQL.
It proposes a way to get both feature-rich and data-efficient ML by selecting a coreset without materializing the feature-augmented (joined) table. The method pushes coreset gradient estimation down to per-table partial feature similarities with theoretical bounds, delivering ~100× faster selection while keeping near-full-data accuracy.
A systematic study of domain adaptation for deep entity resolution: the paper proposes DADER, a framework spanning Feature Extractor, Matcher, and Feature Aligner, and comprehensively benchmarks DA methods to transfer from a labeled source ER dataset to unlabeled/low-label targets—summarizing what works, what doesn’t, and when.
A Transformer-based NL2VIS system: ncNet takes a natural-language query over tabular data and produces a visualization/spec, using attention-forcing to improve training and visualization-aware rendering to yield higher-quality charts. The goal is to let non-experts generate accurate visualizations directly from plain language.
evolveRewrite is a learned SQL transformation tool that takes as input a SQL query and corresponding statistics (e.g., schema, number of table rows), finds the optimal rewrite sequence, and outputs an optimized rewritten query. Currently evolveRewrite is developed based on Calcite.
CDB is a crowd-powered database system that supports crowd-based query optimizations with a focus on join and selection. CDB has fundamental differences from existing systems. First, CDB employs a graph-based query model that provides more fine-grained query optimization. Second, CDB adopts a unified framework to perform the multi-goal optimization based on the graph model. We have implemented our system and deployed it on Amazon Mechanical Turk, CrowdFlower and ChinaCrowd.
From this repository, you can view the 📚 latest advancements in Text-to-SQL (a.k.a. NL2SQL). This handbook corresponds to our survey paper [TKDE'2025]: 📖 A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? We also provide tutorial slides [Update soon for VLDB'2025 Tutorial] to summarize the key points of this survey. Based on language model trends, we've created a river diagram of Text-to-SQL methods to trace the field's evolution.

Selected Honors & Awards

Academic Services

  • Executive Committee Member, CCF Database Technical Committee
  • Academic Director, CCF Advanced Disciplines Lectures
  • Workshop Chair, DBML Workshop at ICDE
  • Workshop Chair, BDQM Workshop at DASFAA
  • Guest Editor, Journal of Computer Science and Technology (JCST)
  • Program Committee Member (multiple times): VLDB, ICDE, KDD, AAAI, etc.