Overview of Research. My research interests lie broadly in data management and data mining. Specifically, I work on (a) high-dimensional vector data management (and its applications in large models, such as retrieval-augmented generative AI), (b) spatial data management with machine learning-based techniques (i.e., AI for DB, or AI4DB), (c) spatial data mining in the urban domain (e.g., traffic and mobility analysis), and (d) graph data mining (including dense subgraph mining and graphlet mining). Some of the grants that have supported my research are listed in Part (e), Selected Grants.
(a) High-dimensional Vector Data Management (VectorDB @NTU). Large-scale high-dimensional vector data has become ubiquitous. For instance, various forms of unstructured data, such as images, videos, text, and speech, are typically transformed into vectors using deep learning techniques. These vectors are subsequently used in downstream analytical tasks. Nearest neighbor (NN) search in high-dimensional vector space is a fundamental problem with a wide array of applications in information retrieval, recommendation, and retrieval-based large language models. We have developed several techniques for approximate NN (ANN) search, including (1) ADSampling for efficient and reliable distance comparisons (SIGMOD'23), (2) RaBitQ for quantizing high-dimensional vectors (SIGMOD'24), (3) iRangeGraph for attribute-filtered ANN (SIGMOD'25), (4) SymphonyQG for integrating graph-based ANN indices and quantization (SIGMOD'25), and (5) extended RaBitQ for more flexible quantization with varying compression rates (arXiv).
- Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search (arXiv, Sep 2024)
- SymphonyQG: Towards Symphonious Integration of Quantization and Graph for Approximate Nearest Neighbor Search (SIGMOD'25)
- iRangeGraph: Improvising Range-dedicated Graphs for Range-filtering Nearest Neighbor Search (SIGMOD'25)
- RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search (SIGMOD'24)
- High-Dimensional Approximate Nearest Neighbor Search: with Reliable and Efficient Distance Comparison Operations (SIGMOD'23)
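To make the NN search problem above concrete, here is a minimal sketch of the exact brute-force baseline that ANN methods approximate, together with the recall measure commonly used to compare an approximate answer with the exact one. It is not an implementation of ADSampling, RaBitQ, iRangeGraph, or SymphonyQG; the function names `exact_knn` and `recall_at_k` are hypothetical and chosen only for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def exact_knn(data, query, k):
    """Return indices of the k rows of `data` nearest to `query` (Euclidean)."""
    data = np.asarray(data, dtype=float)
    query = np.asarray(query, dtype=float)
    diffs = data - query
    # Squared Euclidean distance of every row to the query.
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)
    # A stable sort keeps the lower index first when distances tie.
    order = np.argsort(sq_dists, kind="stable")
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact k nearest neighbors that the approximate answer found."""
    exact = {int(i) for i in exact_ids}
    found = {int(i) for i in approx_ids}
    return len(exact & found) / len(exact)
```

The brute-force scan costs time linear in both the number of vectors and the dimensionality for every query, which is why approximate indices and quantization are used at scale; recall then quantifies how much accuracy an approximate method gives up.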
(b) Spatial Data Management. I have been developing machine learning-based techniques (mainly reinforcement learning-based ones) for pre-processing, indexing, and querying spatial data (e.g., trajectory data and point data). Machine learning uses data to train a model for a task and has become a new paradigm for developing algorithms in many fields; the resulting algorithms are called learning-based algorithms. Because learning-based algorithms are driven directly by both the data and the objective of the problem, they often outperform non-learning-based algorithms at optimizing that objective when sufficient training data is available. Some recent publications are listed as follows.
- BT-Tree: A Reinforcement Learning Based Index for Big Trajectory Data (SIGMOD'25)
- Collectively Simplifying Trajectories in a Database: A Query Accuracy Driven Approach (ICDE'24)
- The RLR-Tree: A Reinforcement Learning Based R-Tree for Spatial Data (SIGMOD'23)
- Towards Designing and Learning Piecewise Space-Filling Curves (VLDB'23)
(c) Spatial Data Mining. I focus on spatial data generated in the urban domain and on mining it for various applications in that domain. Such data includes road networks, points of interest (POIs), regions, human mobility, traffic, and vehicle trajectories. Mining spatial data in the urban domain (called spatial urban data mining) supports many applications, such as location-based services, logistics, mobility, traffic, and urban planning. Some recent publications are listed as follows.
- TimeCMA: Towards LLM-Empowered Time Series Forecasting via Cross-Modality Alignment (AAAI'25)
- KITS: Inductive Spatio-Temporal Kriging with Increment Training Strategy (AAAI'25)
- Spatial-Temporal Large Language Model for Traffic Prediction (MDM'24)
- AirPhyNet: Harnessing Physics-Guided Neural Networks for Air Quality Prediction (ICLR'24)
- Online Anomalous Subtrajectory Detection on Road Networks with Deep Reinforcement Learning (ICDE'23)
(d) Graph Data Mining. I have conducted extensive research on developing efficient algorithms for two graph data mining tasks: dense subgraph mining and graphlet mining. A dense subgraph of a graph is one whose vertices are connected by many edges. Finding dense subgraphs has a wide range of applications, including correlation mining, fraud detection, e-commerce, bioinformatics, frequent pattern mining, and community detection. Graphlets are subgraph structures that recur in a graph; examples include wedges, triangles, and k-node patterns. Each of these subgraphs, defined by a particular pattern of interactions between vertices, may reflect a framework in which particular functions are achieved efficiently. Some recent publications are listed as follows.
- Fast Maximum Common Subgraph Search: A Redundancy-Reduced Backtracking Approach (SIGMOD'25)
- Fast Maximal Quasi-clique Enumeration: A Pruning and Branching Co-Design Approach (SIGMOD'24)
- Efficient k-Clique Listing: An Edge-Oriented Branching Strategy (SIGMOD'24)
- Maximum k-Biplex Search on Bipartite Graphs: A Symmetric-BK Branching Approach (SIGMOD'23)
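As a small illustration of the two notions described above, the following sketch computes the edge density of a vertex subset (one simple way to quantify how "dense" a subgraph is) and counts triangles, one of the graphlets mentioned. It is a naive baseline on an undirected edge list, not any of the algorithms in the listed papers; the function names `edge_density` and `count_triangles` are hypothetical.

```python
from itertools import combinations


def edge_density(edges, vertices):
    """Fraction of vertex pairs within `vertices` that are joined by an edge."""
    vs = set(vertices)
    n = len(vs)
    if n < 2:
        return 0.0
    # Keep each undirected edge once, ignoring self-loops and outside vertices.
    inside = {frozenset((u, v)) for u, v in edges
              if u != v and u in vs and v in vs}
    return len(inside) / (n * (n - 1) / 2)


def count_triangles(edges):
    """Count triangles in an undirected graph given as a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops cannot be part of a triangle
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u, nbrs in adj.items():
        # Every adjacent pair of neighbors of u closes a triangle at u.
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:
                total += 1
    # Each triangle is found once from each of its three vertices.
    return total // 3
```

A density of 1.0 means the subset induces a clique; relaxed dense structures such as quasi-cliques allow a bounded number of missing edges.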
(e) Selected Grants. My research has been supported by various parties, including NTU, MOE Singapore, Alibaba, A*STAR, and NCS. Some selected grants are listed as follows.
- "On Developing Video Traffic Data based Intelligent Transport Fundamental Core Techniques", IAF-ICP, A*STAR, S$630.9K, 03/2023 - 06/2025, PI
- "Learn to Augment and Represent Spatial Urban Data", AcRF Tier-2, MOE, S$703.05K, 07/2022 - 06/2025, PI
- "Leveraging Machine Learning for Bipartite Matching", AcRF Tier-1, MOE, S$191.5K, 02/2022 - 02/2025, PI
- "Pre-Processing and Querying Big Trajectory Data with Reinforcement Learning", AcRF Tier-2, MOE, S$495K, 10/2021 - 10/2024, PI
- "KGConst: Towards Constructing Knowledge Graphs on Enterprise Data", SCALE Lab, S$546K, 09/2021 - 12/2023, PI
- "Towards Multi-Sourced Topological and Spatiotemporal Data Augmentation and Fusion with Deep Graph Neural Networks", Alibaba-NTU JRI, S$136K, 11/2020 - 02/2022, PI
- "Querying and Analyzing Groups of Trajectories with Representation Learning", AcRF Tier-1, MOE, S$90K, 11/2019 - 10/2022, PI
- "Efficient and Scalable Solutions to Big Graph Problems with High-Complexity: When Sampling Meets Distributed Systems", NTU Start-Up Grant, S$100K, 08/2018 - 08/2022, PI