Data-driven science is widely regarded as the fourth paradigm of science, one that will fundamentally change society and our everyday lives. Indeed, artificial intelligence (AI) models have already transformed various data-intensive industries. Machine learning and deep learning models have achieved unprecedented performance in image, text, audio, video, and network data analysis. These successes are mainly due to three factors: the accumulation of gigantic amounts of data, ever-increasing computational power, and the design of highly efficient algorithms. Further, the remarkable achievement of AlphaFold2 on the protein folding problem has ushered in a new era of AI-based molecular data analysis for materials, chemistry, and biology.

With the excitement and opportunities come challenges. Currently, one of the central challenges for AI-based molecular data analysis is molecular representation, that is, identifying or designing appropriate molecular descriptors or fingerprints. Proper descriptors should preserve the most important and intrinsic molecular properties and information that directly determine molecular functions. In this way, they can be better “understood” by machine learning models. In fact, the performance of many learning methods depends heavily on the choice of data representation and featurization, which is a long-standing issue in cheminformatics and bioinformatics. Traditional molecular descriptors are properties obtained from structural geometry/topology, chemical conformation, and chemical graphs, as well as molecular formula, hydrophobicity, steric properties, and electronic properties. These descriptors are widely used in quantitative structure-activity relationship (QSAR) and learning models.

Mathematical AI for Molecular Sciences is proposed for molecular representation, featurization, and learning. As illustrated above, various types of data, in particular molecular data from materials, chemistry, and biology, can be represented using topological models, including graphs, simplicial complexes, and hypergraphs. From these representations, various mathematical invariants are obtained using advanced mathematical models from algebraic topology, discrete geometry, combinatorics, and related fields. These mathematical invariants are used as input features for learning models. In a dramatic departure from previous models, molecular data are modelled using higher-dimensional topologies, such as simplicial complexes and hypergraphs, together with filtration-induced multiscale representations. Further, mathematical invariant-based features characterize the most intrinsic and fundamental properties and offer better transferability for learning models.

A brief introduction to the area can be found in the 2021 winter school lectures at Dalian, the AATRN talk, and Prof. Guowei Wei's works (SIAM news, Harvard talk, D3R news).

We sincerely welcome highly motivated students and postdocs to join our group!

  • Persistent Spectral based machine learning (PerSpect ML) for drug design

The structure-function relationship is of essential importance to the analysis of biomolecular flexibility, dynamics, interactions, and functions. Topology studies the network and connection information within the data and provides an effective way of structure characterization. As illustrated in the figures, there are three basic topological representations, including graph, simplicial complex, and hypergraph, for molecular structures.  Features for learning models can be obtained from these representations. The essential idea is to use eigen-spectrum-based properties as molecular descriptors.

Our persistent spectral (PerSpect) theory covers three basic models, i.e., PerSpect graph, PerSpect simplicial complex, and PerSpect hypergraph. These models are filtration-based multidimensional spectral methods. Mathematically, spectral graph theory, spectral simplicial complex theory, and spectral hypergraph theory have been developed for graphs, simplicial complexes, and hypergraphs, respectively. These models use different types of connection matrices, in particular Hodge (combinatorial) Laplacian matrices, to represent structural connections. The multidimensional representation is achieved through a filtration process. The persistence and variation of the eigen-spectrum information during the filtration process are characterized by persistent functions or attributes, which are further used as molecular features or fingerprints.
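The filtration-plus-spectrum idea can be sketched in a few lines for the simplest case, a graph built from a distance cutoff. This is only a minimal illustration, not the PerSpect implementation: it uses the ordinary graph Laplacian (the 0-th order combinatorial Laplacian) rather than higher-order Hodge Laplacians, and the function names, the cutoff values, and the particular spectral statistics are assumptions chosen for the example.

```python
import numpy as np

def graph_laplacian(points, cutoff):
    """Graph Laplacian L = D - A, connecting points closer than `cutoff`."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adj = ((dist <= cutoff) & (dist > 0)).astype(float)  # no self-loops
    return np.diag(adj.sum(axis=1)) - adj

def spectral_features(points, cutoffs, tol=1e-8):
    """Track simple eigenvalue statistics along a cutoff-distance filtration."""
    rows = []
    for r in cutoffs:
        eig = np.linalg.eigvalsh(graph_laplacian(points, r))
        nonzero = eig[eig > tol]
        rows.append({
            "cutoff": r,
            # number of zero eigenvalues = number of connected components
            "multiplicity_of_zero": int(np.sum(eig <= tol)),
            # smallest nonzero eigenvalue (0.0 if there is none)
            "fiedler": float(nonzero.min()) if nonzero.size else 0.0,
            "mean_nonzero": float(nonzero.mean()) if nonzero.size else 0.0,
            "max": float(eig.max()),
        })
    return rows

# Hypothetical toy "molecule": three nearby atoms and one distant atom.
points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 0, 0]], dtype=float)
features = spectral_features(points, cutoffs=[0.5, 1.2, 1.5, 6.0])
```

For these four points the multiplicity of the zero eigenvalue goes from 4 to 2 to 2 to 1 as the cutoff grows, which is the component-merging information a persistent attribute would record; concatenating the per-cutoff statistics gives a fixed-length feature vector.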

Reference: Zhenyu Meng and Kelin Xia, "Persistent spectral based machine learning (PerSpect ML) for protein-ligand binding affinity prediction", Science Advances (2021)

  • Persistent Ricci curvature based machine learning

Ricci curvature is one of the fundamental concepts in differential geometry and theoretical physics. Two discrete Ricci curvature forms, i.e., Ollivier Ricci curvature (ORC) and Forman Ricci curvature (FRC), have been developed to characterize different aspects of the classical Ricci curvature. ORC is defined in terms of the Wasserstein distance between two associated probability measures on a metric space. It captures clustering and coherence properties of global and local structures in networks. In contrast, FRC is defined as a combinatorial property of upper-adjacent, lower-adjacent, and parallel simplexes on CW complexes. This combinatorial curvature can be directly derived from the combinatorial Bochner-Weitzenböck decomposition. It characterizes the geodesic dispersal property and algebraic topological information within networks. Even though the two discrete forms can take totally different values, sometimes even different signs, for network substructures, they are found to be highly correlated in various complex networks. Generally speaking, positive ORCs or FRCs are commonly found in densely packed clusters or “communities”, while negative ORCs or FRCs usually represent bridges or links between clusters.
The persistent Ricci curvature is proposed to combine filtration-based multiscale representations with Ricci curvatures for molecular featurization. Ricci curvatures are systematically evaluated on all the graphs/simplicial complexes/hypergraphs in the filtration process. The statistical and combinatorial properties of Ricci curvatures during the filtration are used as molecular descriptors.  
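As a rough illustration of evaluating a discrete curvature along a filtration, the sketch below computes an edge-level Forman-type curvature on cutoff-distance graphs and summarizes it at each filtration value. It assumes the commonly used unweighted "augmented" form, F(e) = 4 - deg(u) - deg(v) + 3 t(e), where t(e) is the number of triangles containing the edge e; this form, the function names, and the chosen statistics are assumptions for the example and are not taken from the FPRC or OPRC papers.

```python
import numpy as np

def forman_curvatures(points, cutoff):
    """Augmented Forman curvature of every edge of the cutoff-distance graph."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adj = ((dist <= cutoff) & (dist > 0)).astype(int)
    deg = adj.sum(axis=1)
    common = adj @ adj                       # common[i, j] = shared neighbours
    i, j = np.nonzero(np.triu(adj, k=1))     # each undirected edge once
    return 4 - deg[i] - deg[j] + 3 * common[i, j]

def curvature_statistics(points, cutoffs):
    """(cutoff, number of edges, min, max, mean) of edge curvature per cutoff."""
    stats = []
    for r in cutoffs:
        frc = forman_curvatures(points, r)
        if frc.size == 0:
            stats.append((r, 0, 0.0, 0.0, 0.0))
        else:
            stats.append((r, int(frc.size), float(frc.min()),
                          float(frc.max()), float(frc.mean())))
    return stats

# Hypothetical toy point cloud: three nearby atoms and one distant atom.
points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 0, 0]], dtype=float)
stats = curvature_statistics(points, cutoffs=[0.5, 1.2, 1.5, 6.0])
```

In this toy case the curvature of every edge rises as the graph fills in (a path, then a triangle, then a complete graph), consistent with the observation that densely packed clusters carry positive curvature.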

Reference: JunJie Wee and Kelin Xia, "Forman persistent Ricci curvature (FPRC) based machine learning models for protein–ligand binding affinity prediction", Briefings in Bioinformatics (2021)
JunJie Wee and Kelin Xia, "Ollivier persistent Ricci curvature (OPRC) based machine learning for protein-ligand binding affinity prediction", Journal of Chemical Information and Modeling (2021)
  • Persistent hypergraph based machine learning
Hypergraphs are powerful topological representations that can characterize more general structural information than graphs and simplicial complexes. A hypergraph is composed of hyperedges, which are sets of vertices. Essentially, a hyperedge can be viewed as a generalization of a simplex that drops the requirement of closure under the boundary operation. The interactions between molecules at the atomic level can be well represented as hypergraphs. Mathematically, a hyperedge can be defined as a set of vertices (atoms) that contains at least one atom from each molecule. For instance, in protein-ligand interactions, a hyperedge is defined among protein and ligand atoms, with at least one atom from the protein and at least one from the ligand. In this way, hyperedges represent (many-body) interactions between protein and ligand atoms.
Element-specific models are widely used to decompose molecular complexes into a series of atom-specific combinations. More specifically, proteins can be decomposed into at least 5 types of atom sets, i.e., C, O, N, S, and H, while ligands usually have at least 10 types of atoms, including C, N, O, S, P, F, Cl, Br, I, and H. In this way, up to 50 atom combinations can be obtained and the corresponding hypergraphs can be constructed. Topological and geometric invariants can be systematically obtained from these hyperedges and further used as features for machine learning models.
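The element-specific decomposition and the "at least one atom from each molecule" rule can be sketched as follows. The hyperedge criterion used here (all pairwise distances within a cutoff, hyperedge size capped at `max_size`) is a hypothetical choice made only for illustration, as are the function names and the toy coordinates.

```python
import math
from itertools import combinations, product

PROTEIN_ELEMENTS = ["C", "N", "O", "S", "H"]
LIGAND_ELEMENTS = ["C", "N", "O", "S", "P", "F", "Cl", "Br", "I", "H"]

def element_pairs():
    """All protein-element / ligand-element combinations (5 x 10 = 50)."""
    return list(product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS))

def select(atoms, element):
    """Coordinates of the atoms of one element; atoms are (element, xyz)."""
    return [xyz for el, xyz in atoms if el == element]

def cross_hyperedges(protein_xyz, ligand_xyz, cutoff, max_size=3):
    """Hyperedges with at least one protein and one ligand atom.

    Assumed rule: a vertex set is a hyperedge if every pair of its atoms
    lies within `cutoff` of each other.
    """
    verts = [("P", i, p) for i, p in enumerate(protein_xyz)]
    verts += [("L", i, p) for i, p in enumerate(ligand_xyz)]
    edges = []
    for size in range(2, max_size + 1):
        for combo in combinations(verts, size):
            if {v[0] for v in combo} != {"P", "L"}:
                continue  # must mix protein and ligand atoms
            if all(math.dist(a[2], b[2]) <= cutoff
                   for a, b in combinations(combo, 2)):
                edges.append(frozenset((v[0], v[1]) for v in combo))
    return edges

# Hypothetical toy complex.
protein = [("C", (0, 0, 0)), ("C", (1, 0, 0)), ("C", (10, 0, 0)), ("N", (2, 0, 0))]
ligand = [("C", (0.5, 0.5, 0)), ("O", (0, 1, 0))]

# Hypergraph for the C (protein) / C (ligand) combination only.
edges = cross_hyperedges(select(protein, "C"), select(ligand, "C"), cutoff=2.0)
```

Looping `cross_hyperedges` over `element_pairs()` would give one hypergraph per element combination, from which invariants could then be computed. Indices inside each hyperedge refer to positions within the selected element subset.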

Reference: Xiang Liu, Huitao Feng, Jie Wu, and Kelin Xia, "Persistent spectral hypergraph based machine learning (PSH-ML) for protein-ligand binding affinity prediction", Briefings in Bioinformatics (2021)
Xiang Liu, Xiangjun Wang, Jie Wu, and Kelin Xia, "Hypergraph based persistent cohomology (HPC) for molecular representations in drug design", Briefings in Bioinformatics (2021)

Geometric and Variational modeling
  • Variational multi-scale models
We develop geometric modeling and computational algorithms for biomolecular structures from two data sources, the Protein Data Bank (PDB) and the Electron Microscopy Data Bank (EMDB), in the Eulerian (or Cartesian) representation. The molecular surface (MS) contains non-smooth geometric singularities, such as cusps, tips, and self-intersecting facets, which often lead to computational instabilities in molecular simulations and violate the physical principle of surface free energy minimization. Variational multiscale surface definitions are proposed based on geometric flows and solvation analysis of biomolecular systems. The resulting surfaces are free of geometric singularities and minimize the total free energy of the biomolecular system. High-order partial differential equation (PDE)-based nonlinear filters are employed for EMDB data processing. After constructing protein multiresolution surfaces, we analyze and characterize surface morphology using Gaussian curvature, mean curvature, maximum curvature, minimum curvature, shape index, and curvedness. Based on the curvature and electrostatic analysis of our multiresolution surfaces, we introduce a new concept, the polarized curvature, for the prediction of protein binding sites.
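The six surface descriptors listed above can all be written in terms of the two principal curvatures at a surface point. The sketch below uses the textbook definitions; the sign convention for the shape index varies between authors, so the one used here (convex umbilic points map to +1 when curvatures are positive) is an assumption, as is the function name.

```python
import numpy as np

def curvature_descriptors(k1, k2):
    """Surface descriptors from principal curvatures (arrays or scalars)."""
    k1, k2 = np.asarray(k1, dtype=float), np.asarray(k2, dtype=float)
    kmax, kmin = np.maximum(k1, k2), np.minimum(k1, k2)
    return {
        "gaussian": kmax * kmin,
        "mean": 0.5 * (kmax + kmin),
        "maximum": kmax,
        "minimum": kmin,
        # arctan2 keeps umbilic points (kmax == kmin) finite; at a flat
        # point (kmax == kmin == 0) it returns 0, although the shape index
        # is strictly undefined there.
        "shape_index": (2.0 / np.pi) * np.arctan2(kmax + kmin, kmax - kmin),
        "curvedness": np.sqrt(0.5 * (kmax**2 + kmin**2)),
    }

# Two sample points: a sphere of radius 2 and a symmetric saddle.
desc = curvature_descriptors([0.5, 1.0], [0.5, -1.0])
```

For the sphere point this gives Gaussian curvature 0.25, mean curvature 0.5, shape index 1, and curvedness 0.5; for the saddle it gives Gaussian curvature -1, mean curvature 0, shape index 0, and curvedness 1.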
  • Protein flexibility and rigidity analysis
Protein structural fluctuation, typically measured by Debye-Waller factors, or B-factors, is a manifestation of protein flexibility, which strongly correlates with protein function. The flexibility-rigidity index (FRI) is a newly proposed method for constructing the atomic rigidity functions required in the theory of continuum elasticity with atomic rigidity, a new multiscale formalism for describing excessively large biomolecular systems. The FRI method analyzes protein rigidity and flexibility and is capable of predicting protein B-factors without resorting to matrix diagonalization. A fundamental assumption of the FRI is that protein structures are uniquely determined by various internal and external interactions, while protein functions, such as stability and flexibility, are solely determined by the structure. As such, one can predict protein flexibility without resorting to the protein interaction Hamiltonian. Additionally, we propose anisotropic FRI (aFRI) algorithms for the analysis of protein collective dynamics. Eigenvectors obtained from the proposed aFRI algorithms are able to demonstrate collective motions.
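A minimal sketch of the diagonalization-free idea: each atom's rigidity is a kernel-weighted sum over the distances to the other atoms, and its flexibility is the reciprocal. The Lorentz-type kernel, the parameter values (`eta`, `power`), the uniform weights, and the exclusion of the self-term are all assumptions made for this example rather than the published parameterization.

```python
import numpy as np

def flexibility_index(coords, eta=3.0, power=3.0, weights=None):
    """Rigidity and flexibility indices from atomic coordinates.

    rigidity_i    = sum_{j != i} w_j / (1 + (r_ij / eta)**power)
    flexibility_i = 1 / rigidity_i
    Requires at least two atoms, so that every rigidity is positive.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    kernel = 1.0 / (1.0 + (dist / eta) ** power)
    np.fill_diagonal(kernel, 0.0)   # drop the self-term (sketch choice)
    rigidity = kernel @ w
    return rigidity, 1.0 / rigidity

# Hypothetical chain of five C-alpha-like atoms spaced 3.8 apart.
coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
rigidity, flexibility = flexibility_index(coords)
```

In this toy chain the end atoms come out as the most flexible and the central atom as the most rigid. To compare with experiment, the flexibility values would typically be fitted linearly to measured B-factors; only distance evaluations and a matrix-vector product are needed, with no eigenvalue problem.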

Scientific Computing

  • MIB method for multi-material interface problems
Multi-material interface problems are omnipresent in science, engineering, and daily life. The solution to this class of problems becomes exceptionally challenging when more than two heterogeneous materials join at one point in space and form a geometric singularity. Based on the matched interface and boundary (MIB) method, several schemes have been constructed to solve 2D elliptic equations with discontinuous coefficients associated with three-material interfaces. The essential idea is to smoothly extend functions across the interface and employ the fictitious values at irregular points. For the geometric singularities, two sets of interface conditions are considered simultaneously. Intensive numerical experiments are carried out to validate the proposed schemes. Second-order accuracy is obtained for complex geometries and geometric singularities.
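For orientation, an elliptic interface problem with discontinuous coefficients is usually written in the following generic form. The symbols (coefficient \beta, source f, jump data \phi and \psi, interface \Gamma, normal \mathbf{n}) are notation assumed here, not taken from the text above.

```latex
\[
\begin{aligned}
\nabla \cdot \bigl(\beta(\mathbf{x})\,\nabla u(\mathbf{x})\bigr) &= f(\mathbf{x}),
  && \mathbf{x} \in \Omega \setminus \Gamma, \\
[u] := u^{+} - u^{-} &= \phi(\mathbf{x}),
  && \mathbf{x} \in \Gamma, \\
[\beta u_{n}] := \beta^{+}\,\nabla u^{+}\cdot\mathbf{n}
  - \beta^{-}\,\nabla u^{-}\cdot\mathbf{n} &= \psi(\mathbf{x}),
  && \mathbf{x} \in \Gamma .
\end{aligned}
\]
```

Here \beta is smooth within each material but jumps across \Gamma. At a point where three materials meet, two interfaces pass through the same location, so two such pairs of jump conditions hold there at once.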

  • Adaptive mesh based MIB method
Mesh deformation methods break down for elliptic PDE interface problems, as additional interface jump conditions are required to maintain the well-posedness of the governing equation. An adaptively deformed mesh strategy based on an interface technique is introduced for resolving elliptic interface problems. We take advantage of the high accuracy, flexibility, and robustness of the MIB method to construct an adaptively deformed mesh based interface method. The proposed method generates deformed meshes in the physical domain and solves the transformed governing equations in the computational domain, which maintains regular Cartesian meshes. The mesh deformation is realized by a mesh transformation PDE, which controls the mesh redistribution through a source term. The source term consists of a monitor function, which builds in mesh contraction rules. Both interface geometry based deformed meshes and solution gradient based deformed meshes are constructed to reduce errors in solving elliptic interface problems. The proposed adaptively deformed mesh based interface method is extensively validated by many numerical experiments. Numerical results indicate that the adaptively deformed mesh based interface method outperforms the original MIB method for elliptic interface problems.
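The role of a monitor function can be illustrated with a much simpler stand-in: one-dimensional equidistribution, where nodes are placed so that each cell carries the same integral of the monitor. This is not the mesh transformation PDE described above, only a one-dimensional analogue; the monitor function, the interface location, and all names are assumptions for the example.

```python
import numpy as np

def equidistribute(monitor, n_nodes, a=0.0, b=1.0, n_fine=2001):
    """Place n_nodes on [a, b] so each cell holds an equal share of the
    integral of a positive monitor function (1D equidistribution)."""
    fine = np.linspace(a, b, n_fine)
    m = monitor(fine)
    # cumulative trapezoidal integral of the monitor on the fine grid
    cum = np.concatenate(([0.0],
                          np.cumsum(0.5 * (m[1:] + m[:-1]) * np.diff(fine))))
    targets = np.linspace(0.0, cum[-1], n_nodes)
    # invert the (strictly increasing) cumulative integral
    return np.interp(targets, cum, fine)

def interface_monitor(x, x_interface=0.5, width=0.05, strength=20.0):
    """Hypothetical monitor: large near an interface, close to 1 elsewhere."""
    return 1.0 + strength * np.exp(-((x - x_interface) / width) ** 2)

mesh = equidistribute(interface_monitor, n_nodes=21)
spacing = np.diff(mesh)
```

The resulting nodes cluster around the assumed interface at x = 0.5 and spread out away from it, which is the mesh contraction behaviour a monitor function is meant to encode.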
  • MIB Galerkin method
A MIB Galerkin formulation is developed for solving the elliptic interface problem. In this approach, we build two sets of elements on two extended subdomains, both of which include the interface. As a result, the two sets of elements overlap each other near the interface. Fictitious solutions are defined on the overlapping part of the elements, so that the differentiation operations of the original PDEs can be discretized as if there were no interface. The extra coefficients of the polynomial basis functions, which furnish the overlapping elements and determine the fictitious solutions, are fixed by the interface jump conditions. Consequently, the interface jump conditions are rigorously enforced on the interface. The present method utilizes Cartesian meshes to avoid the mesh generation required in conventional finite element methods (FEMs). The accuracy, stability, and robustness of the proposed 3D MIB Galerkin method are extensively validated. Near second-order accuracy has been confirmed. To our knowledge, this is the first time an FEM has shown near second-order convergence in solving the Poisson equation with realistic protein surfaces. Additionally, the present work offers the first known near second-order accurate method for C^1 continuous or H^2 continuous solutions associated with a Lipschitz continuous interface.