Overview of Research. My research interests lie broadly in data management and data mining. Specifically, I work on (a) high-dimensional vector data management (and its applications in large models, such as retrieval-augmented generative AI), (b) spatial data management with machine learning-based techniques (i.e., AI for DB or AI4DB), (c) spatial data mining in the urban domain (e.g., traffic and mobility analysis), and (d) graph data mining (including dense subgraph mining and graphlet mining). Some of the grants that have supported my research are listed in Part (e), Selected Grants.

(a) High-dimensional Vector Data Management. I have been developing techniques for approximate k nearest neighbor (AKNN) search. One problem that I have studied is the distance comparison operator (DCO), which determines whether the Euclidean distance between a data vector and the query vector is at most a given distance threshold and, if so, returns the distance. DCOs are used in many algorithms for managing and mining high-dimensional vector data, e.g., AKNN search, clustering, and outlier detection. Our contribution is a randomized algorithm called ADSampling, which in most cases conducts a DCO in time logarithmic in the dimensionality (SIGMOD'23). In addition, I have developed a new quantization method for high-dimensional vectors called RaBitQ, which provides a theoretical guarantee on the accuracy of the approximate distances, and applied it to AKNN search (SIGMOD'24). Some recent publications are listed as follows.
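To make the DCO definition above concrete, the following is a minimal sketch of an exact DCO with early termination. It is not ADSampling: it scans the original dimensions in order and stops once the partial squared distance already exceeds the threshold, so its worst-case cost is still linear in the dimensionality. The function name distance_comparison and its parameters are illustrative assumptions, and the threshold is assumed to be non-negative.

```python
import math
from typing import Optional, Sequence


def distance_comparison(
    data: Sequence[float], query: Sequence[float], threshold: float
) -> Optional[float]:
    """Illustrative exact DCO (hypothetical helper, not ADSampling).

    Returns the Euclidean distance between data and query if it is at most
    threshold (assumed non-negative), and None otherwise.
    """
    if len(data) != len(query):
        raise ValueError("data and query must have the same dimensionality")

    limit = threshold * threshold  # compare squared values to avoid sqrt in the loop
    partial = 0.0
    for x, q in zip(data, query):
        diff = x - q
        partial += diff * diff
        # The partial sum only grows, so once it exceeds the limit the
        # full distance must exceed the threshold as well.
        if partial > limit:
            return None
    return math.sqrt(partial)
```

For example, distance_comparison([0.0, 0.0], [3.0, 4.0], 5.0) returns the distance, while the same call with a threshold of 4.9 returns None after inspecting both dimensions.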

(b) Spatial Data Management. I have been developing machine learning-based techniques (mainly reinforcement learning-based ones) for pre-processing, indexing, and querying spatial data (e.g., trajectory data and point data). Machine learning trains a model for a task from data and has become a new paradigm for developing algorithms in many fields; the resulting algorithms are called learning-based algorithms. Because learning-based algorithms are driven directly by the data and by the objective of the problem, they often outperform non-learning-based algorithms in optimizing that objective when sufficient training data is available. Some recent publications are listed as follows.

(c) Spatial Data Mining. I have been focusing on spatial data generated in the urban domain and on mining it for various applications in that domain. Such data includes road networks, points of interest (POIs), regions, human mobility, traffic, and vehicle trajectories. Mining spatial data in the urban domain (called spatial urban data mining) supports many applications, such as location-based services, logistics, mobility, traffic, and urban planning. Some recent publications are listed as follows.

(d) Graph Data Mining. I have conducted extensive research on developing efficient algorithms for two graph data mining tasks, namely dense subgraph mining and graphlet mining. A dense subgraph of a graph is one whose vertices are connected by many edges. Finding dense subgraphs in a graph has a wide range of applications, including correlation mining, fraud detection, e-commerce, bioinformatics, frequent pattern mining, and community detection. Graphlets are subgraph structures that repeat themselves in a graph. Examples of graphlets include wedges, triangles, and k-node patterns. Each of these subgraphs, defined by a particular pattern of interactions between vertices, may reflect a framework in which particular functions are achieved efficiently. Some recent publications are listed as follows.
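As a small illustration of the graphlets mentioned above, the sketch below counts wedges and triangles in an undirected graph by brute force. It is only meant to show what is being counted and is not one of the efficient algorithms from my publications. Here a wedge is assumed to mean any pair of edges sharing a vertex (so each triangle contains three wedges), and the function name count_wedges_and_triangles is an illustrative assumption.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Set, Tuple


def count_wedges_and_triangles(
    edges: Iterable[Tuple[Hashable, Hashable]]
) -> Tuple[int, int]:
    """Brute-force count of (wedges, triangles) in an undirected simple graph.

    Illustrative helper: a wedge is a pair of edges sharing a vertex,
    whether or not the two endpoints are themselves connected.
    """
    # Build adjacency sets, ignoring self-loops and duplicate edges.
    adj: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # A vertex of degree d is the center of d * (d - 1) / 2 wedges.
    wedges = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())

    # A wedge centered at u closes into a triangle if its endpoints are adjacent.
    closed = 0
    for nbrs in adj.values():
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:
                closed += 1

    # Each triangle is found once from each of its three vertices.
    return wedges, closed // 3
```

For a triangle on vertices 1, 2, 3 with an extra edge from 3 to 4, the function returns 5 wedges and 1 triangle.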

(e) Selected Grants. My research has been supported by various parties, including NTU, MOE Singapore, Alibaba, A*STAR, and NCS. Some selected grants are listed as follows.