Overview of Research. My research interests lie broadly in data management and data mining. Specifically, I work on (a) high-dimensional vector data management (and its applications in large models, such as retrieval-augmented generative AI), (b) spatial data management with machine-learning-based techniques (i.e., AI for DB, or AI4DB), (c) spatial data mining in the urban domain (e.g., traffic and mobility analysis), and (d) graph data mining (including dense subgraph mining and graphlet mining). Some of the grants that have supported my research are listed in Part (e), Selected Grants.

(a) High-dimensional Vector Data Management (VectorDB @NTU). Large-scale high-dimensional vector data has become ubiquitous. For instance, unstructured data such as images, videos, text, and speech is typically transformed into vectors using deep learning techniques, and these vectors are then used in downstream analytical tasks. Nearest neighbor (NN) search in high-dimensional vector space is a fundamental problem with a wide range of applications in information retrieval, recommendation, and retrieval-based large language models. We have developed several techniques for approximate NN (ANN) search, including (1) ADSampling for efficient and reliable distance comparisons (SIGMOD'23), (2) RaBitQ for quantizing high-dimensional vectors (SIGMOD'24), (3) iRangeGraph for attribute-filtered ANN (SIGMOD'25), (4) SymphonyQG for integrating graph-based ANN indices and quantization (SIGMOD'25), and (5) extended RaBitQ for more flexible quantization with varying compression rates (arXiv).
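
To make the NN search problem concrete, the following is a minimal sketch of the exact brute-force baseline that ANN techniques aim to approximate at lower cost. It is not an implementation of ADSampling, RaBitQ, iRangeGraph, or SymphonyQG; the function name `exact_knn` and the choice of Euclidean distance are assumptions made for illustration.

```python
import heapq
import math


def exact_knn(query, vectors, k):
    """Return the indices of the k vectors closest to `query`.

    Illustrative baseline only: it scans every vector and computes a full
    Euclidean distance, which costs O(n * d) per query. ANN methods trade
    a small loss in accuracy for avoiding most of this work.
    """
    return heapq.nsmallest(
        k,
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]),
    )


# Hypothetical toy data: four 2-dimensional vectors.
vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
nearest = exact_knn((1.0, 1.0), vectors, k=2)
```

In this toy example, `nearest` holds the indices of the two vectors closest to the query, ordered from nearest to farthest.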

(b) Spatial Data Management. I have been developing machine-learning-based techniques (mainly reinforcement-learning-based ones) for pre-processing, indexing, and query processing of spatial data (e.g., trajectory data and point data). Machine learning uses data to train a model for a task and has become a new paradigm for developing algorithms in many fields; the resulting algorithms are called learning-based algorithms. Because learning-based algorithms are driven directly by the data and by the objective of the problem, they often outperform non-learning-based algorithms in optimizing that objective when sufficient training data is available. Some recent publications are listed below.

(c) Spatial Data Mining. I have been focusing on spatial data generated in the urban domain and on mining this data for various applications in that domain. Urban spatial data includes road networks, points of interest (POIs), regions, human mobility, traffic, and vehicle trajectories, among others. Mining this data (called spatial urban data mining) helps with many applications, such as location-based services, logistics, mobility, traffic, and urban planning. Some recent publications are listed below.

(d) Graph Data Mining. I have conducted extensive research on developing efficient algorithms for two graph data mining tasks: dense subgraph mining and graphlet mining. A dense subgraph of a graph is one whose vertices are connected via many edges. Finding dense subgraphs has a wide range of applications, including correlation mining, fraud detection, e-commerce, bioinformatics, frequent pattern mining, and community detection. Graphlets are small subgraph structures that recur in a graph; examples include wedges, triangles, and k-node patterns. Each of these subgraphs, defined by a particular pattern of interactions between vertices, may reflect a framework in which particular functions are achieved efficiently. Some recent publications are listed below.
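
As a small illustration of the two notions above, the following sketch computes the edge density of a vertex subset and counts wedges and triangles in an undirected graph. It is a naive baseline under stated assumptions, not one of the efficient algorithms developed in this line of work: edge density (edges present divided by edges possible) is only one of several common density measures, wedges are counted here as all paths of length two (including those closed into triangles), and the helper names are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_density(adj, vertices):
    """Fraction of possible edges present among `vertices` (assumed measure)."""
    vs = set(vertices)
    n = len(vs)
    if n < 2:
        return 0.0
    present = sum(1 for u, v in combinations(vs, 2) if v in adj.get(u, ()))
    return present / (n * (n - 1) / 2)


def count_wedges_and_triangles(adj):
    """Count paths of length two (wedges) and triangles."""
    # A vertex of degree d is the center of d * (d - 1) / 2 wedges.
    wedges = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    # Each triangle is seen once per ordered adjacent pair, i.e. 6 times.
    triangles = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u]) // 6
    return wedges, triangles


# Hypothetical toy graph: a triangle on {1, 2, 3} plus a pendant vertex 4.
adj = build_adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
dense = edge_density(adj, {1, 2, 3})
sparse = edge_density(adj, {1, 2, 4})
wedges, triangles = count_wedges_and_triangles(adj)
```

On this toy graph the subset {1, 2, 3} is fully connected, so it is denser than {1, 2, 4}, which contains a single edge.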

(e) Selected Grants. My research has been supported by various parties, including NTU, MOE Singapore, Alibaba, A*STAR, and NCS. Some selected grants are listed below.