People of ACM - Ke Yi
February 24, 2026
With massive amounts of data, exact query answers are not always meaningful. How did your work with approximate query processing (AQP) address this problem?
In my work, I have primarily taken the following three approaches to address the AQP problem: sampling, data summarization, and differential privacy. Sampling is a simple yet effective tool to reduce the data volume, and therefore the computational cost (the amount of time and resources required for processing and transferring data). But this is achieved at the expense of some small errors in the query results. While the algorithms and the statistical properties of various sampling methods are well studied on a single flat table, the problem becomes more challenging in a relational database, where data is organized into multiple relations (tables) and queries consist of complex selection, join, projection, and aggregation operators.
To ensure efficient and effective sampling for query processing, I have employed various techniques such as random walks, reservoir sampling, and importance sampling. I then adapted these techniques to the context of relational query processing. In addition to AQP, some of my relational sampling techniques have also been used for tasks such as training machine learning models and performing clustering analysis.
I will talk about data summarization and differential privacy in the next two questions.
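To make one of the sampling techniques named above concrete, here is a minimal Python sketch of classic reservoir sampling over a single stream, followed by a scaled-up estimate of a SUM. It is only an illustration on a flat list, not the relational sampling algorithms discussed in the interview; the function names `reservoir_sample` and `estimate_sum` and their parameters are assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    The stream is read once and only k items are kept in memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_sum(sample, population_size):
    """Estimate the sum over the full population from a uniform sample."""
    if not sample:
        return 0.0
    return population_size * sum(sample) / len(sample)
```

A caller might draw `reservoir_sample(values, 1000)` and pass the result to `estimate_sum` together with the total number of rows; the estimate is approximate, which is the trade-off described above.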
With Pankaj Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, and Zhewei Wei, you received the 2022 ACM PODS Test of Time Award for your paper “Mergeable Summaries.” Why are mergeable data summaries important? What was the key innovation you and your co-authors introduced in this paper?
In distributed databases, communication becomes the dominating cost in query processing. Exact query processing often requires the entire dataset to be shuffled across machines, which can be prohibitively expensive. Data summarization is an alternative method that effectively reduces the communication cost at the expense of some approximation error in the query result. In this approach, each part of the data is compressed into a small lossy summary, and these summaries are then merged with one another, possibly in some arbitrary order, to form a summary of the entire dataset. Then the approximate query result can be extracted from the merged summary. This paper initiated the study of mergeability of summaries, analyzed the mergeability of the popular existing summaries, and developed new mergeable summaries for quantiles, heavy hitters, and geometric data. The impact is significant: mergeability is now an essential property of sketches and stands as a core principle of the Apache DataSketches project as well as other products in industry. In my book, Small Summaries for Big Data with Graham Cormode, mergeability is also considered a key operation that all summaries should support.
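The summarize-then-merge pattern described above can be sketched with the Misra-Gries heavy-hitters summary, a well-known summary that can be merged. This is a simplified illustration under stated assumptions, not code from the paper or from any library: the names `mg_update`, `mg_build`, and `mg_merge` are hypothetical, and the merge rule used here (add the counters, then subtract the (k+1)-th largest count and drop non-positive entries) is one standard way to merge such summaries.

```python
from collections import Counter


def mg_update(summary, item, k):
    """Process one item with a Misra-Gries summary holding at most k counters."""
    if item in summary:
        summary[item] += 1
    elif len(summary) < k:
        summary[item] = 1
    else:
        # Summary is full: decrement every counter and drop those reaching zero.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]


def mg_build(items, k):
    """Summarize one partition of the data."""
    summary = {}
    for item in items:
        mg_update(summary, item, k)
    return summary


def mg_merge(a, b, k):
    """Merge two summaries into one that again holds at most k counters."""
    combined = Counter(a)
    combined.update(b)  # add counts item by item
    if len(combined) <= k:
        return dict(combined)
    # Subtract the (k+1)-th largest count and keep only positive counters.
    threshold = sorted(combined.values(), reverse=True)[k]
    return {key: c - threshold for key, c in combined.items() if c > threshold}
```

Each machine would call `mg_build` on its own partition and send only the small summary over the network. With n items in total, every count in the merged summary underestimates the true frequency by at most n / (k + 1), regardless of the order in which summaries are merged.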
What are the main contributions of “R2T: Instance-optimal Truncation for Differentially Private Query Evaluation with Foreign Keys,” which won the SIGMOD 2022 Best Paper Award?
Privacy requirements are another reason why query results need to be approximate. Notably, differential privacy, which has now become the de facto standard for ensuring personal information privacy, mandates that a certain amount of noise be injected into the query answer. Here, the key technical question is how to inject the minimum amount of noise (meaningless information) while meeting the requirement of differential privacy. Although the problem is well understood for queries over a single table, it becomes highly nontrivial for complex queries in a relational database where an individual’s data is scattered across multiple tables linked by foreign keys. This paper develops instance-optimal query processing algorithms for selection-join-aggregation queries that inject the smallest amount of noise. Based on these algorithms, we have developed a DP-SQL system prototype, which has been demonstrated at the VLDB 2024 conference.
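For the single-table case mentioned above, noise injection can be illustrated with the standard Laplace mechanism applied to a counting query. This sketch is not R2T and does not handle joins or foreign keys; the names `laplace_noise` and `private_count` are hypothetical, each row is assumed to belong to a different individual (so the count has sensitivity 1), and plain floating-point sampling like this is for illustration only, not for production use.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(rows, predicate, epsilon, rng=None):
    """Return a noisy count of rows satisfying predicate.

    Assumes each individual contributes at most one row, so adding or
    removing one individual changes the true count by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    sensitivity = 1.0  # assumption: one row per individual
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

A smaller `epsilon` gives stronger privacy but a larger noise scale. When one individual can affect many rows through joins, the sensitivity is no longer 1, which is where the difficulty described above arises.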
As the Program Chair for this year’s ACM PODS conference, do you expect to see any recurring and emerging topics among the papers submitted?
While strong work continues to appear in traditional areas such as logic foundations, data modeling, and query languages, PODS 2026 is also seeing a growing number of submissions aligned with current technological developments. Notable themes include new query processing algorithms, differential privacy, and similarity search. It is encouraging to see the database theory community remain vibrant and continue contributing foundational insights to modern data management.
As the Director of HKUST’s MSc Program in Big Data Technology, what have you learned about how students need to be prepared for careers in data management?
Students in the data management field must be equipped with both strong foundational knowledge and the ability to adapt to rapidly evolving technologies. A deep understanding of database systems, algorithms, and statistical principles remains essential, but it is equally important to gain hands‑on experience with modern data infrastructures, cloud‑native platforms, and large‑scale analytics systems—the latter being particularly important for MSc students who are primarily industry-oriented. Another skill that has become increasingly indispensable nowadays is AI literacy. Today’s data professionals are expected to understand how machine learning and AI models integrate with data management pipelines—from building and deploying models to working with AI‑augmented query processing, intelligent indexing, and automated optimization. Familiarity with both traditional data systems and AI‑driven components allows graduates to work effectively in hybrid environments where these technologies increasingly intersect.
Ke Yi is a Professor at the Hong Kong University of Science and Technology (HKUST), where he serves as the Director of the MSc program in Big Data Technology. Yi’s research interests include database theory and systems, query processing, data security and privacy, parallel and distributed algorithms, data streams, sampling, and data summarization. He is especially noted for his significant contributions to approximate query processing (AQP), a vital tool for dealing with massive amounts of data.
Yi’s honors include receiving two ACM SIGMOD Best Paper Awards, four ACM SIGMOD Research Highlight Awards, and an ACM PODS Test of Time Award. He was recently named an ACM Fellow for contributions to the theory and practice of query processing.