CE7412: Computational and Systems Biology

Offered in Semester 2 for PhD students and MEng students

Instructor: Professor Jagath Rajapakse

SYLLABUS

1. Biological foundations

Basics of molecular biology, genes, proteins, DNA, central dogma of biology, transcription, translation, gene structure, analyzing a genome: PCR, cloning, electrophoresis, gene expression, DNA microarray

2. R basics

Preliminaries, vectors, matrices, factors, lists, data frames, tables, functions, probability, statistical tests, R packages: Bioconductor, Biostrings, and BSgenome, R Markdown

3. Probabilistic modelling

Mathematical foundations, probabilistic events and rules, discrete probability models, continuous probability models, multivariate densities and distributions, foundations of information theory, Lagrange theory for constrained optimization, probabilistic modelling, ML estimator, MAP estimator, Bayesian modelling

4. Probabilistic models of sequences

Independent models of sequences, die models with sequence and counts data, multiple dice models for multiple sequences, Markov chain modelling of sequences, first-order models of sequences, higher-order models of sequences, modelling CpG islands

5. Statistical analysis

Hypothesis testing, generalized likelihood ratio tests, Wilks’ theorem, chi-square statistic for goodness of fit, chi-square test of independence, extreme value analysis, testing for long repeats, exactly matching and well-matching sequences, Chargaff’s rule, Hardy-Weinberg equilibrium, sequence motifs and logos, protein sequence alignment, PAM matrices, BLAST

6. HMM and gene structure prediction

Definition of hidden Markov models (HMM); dice models of sequences; likelihood of sequences; forward algorithm; backward algorithm; Viterbi algorithm; posterior decoding; ML estimation of parameters; expectation-maximization (EM) algorithm; Baum-Welch algorithm; Baldi-Chauvin approach; gene structure prediction: VEIL; GENSCAN; profile HMM for multiple sequence alignment

7. R graphics

Base R plotting: plot, boxplot, histogram, bar plot, dot chart, pie chart, strip chart, scatter plot, QQ plot, multiple plots in one, ggplot2: violin plot, density plot, line plot, faceting, grammar of graphics, visualization of data, examples: gene expression data, genomic data

8. High throughput counts data

Types of counts data, example RNA-seq dataset: Pasilla, key terminology, modelling RNA-seq data, analysis with DESeq2, differential expression analysis, linear models, multifactorial designs, statistical insights

9. Gene ontology and gene set enrichment analysis

Gene ontology: molecular function, cellular component, biological process, GO hierarchy, GO terms and relationships, gene ontology enrichment analysis (GOEA), gene annotation, creating a topGO object, gene set enrichment analysis (GSEA), hypergeometric test, enrichment score, fgsea package, GOEA and GSEA with clusterProfiler, KEGG GSEA

10. Supervised learning

Predictive analytics, overfitting, linear discriminant analysis (LDA), predicting diabetes types from clinical variables, predicting embryonic cell state from gene expression, quadratic discriminant analysis, machine learning: generalizability, cross-validation, curse of dimensionality, objective functions, predicting colon cancer from stool microbiome composition, classifying mouse cells from their expression profiles, neural networks, support vector machines, bagging, random forests, boosting

11. Biological networks

Networks with igraph, network layouts, network and node descriptions: density, node degree, degree distributions, paths, centrality, hubs and authorities, subgraphs and communities, cliques, community detection, hypergraphs, trees, protein interaction networks, random networks, small-world property, scale-free graphs

12. Variant calling from DNA-seq data

Next-generation sequencing (NGS), Bioconductor resources for NGS analysis, whole genome sequencing (WGS) versus whole exome sequencing (WES), DNA-seq analysis pipeline, types of genomic variations, methods of variant calling, predicting open reading frames in long reference sequences, extracting information in genomic regions of interest, finding SNPs and indels from sequence data using VariantTools, plotting features of genetic maps, estimating the copy number at a locus of interest, finding phenotype and genotype associations with GWAS

REFERENCE TEXTS

1. R Bioinformatics Cookbook, Dan MacLean, Packt, 2019

2. Modern Statistics for Modern Biology, Susan Holmes and Wolfgang Huber, Cambridge University Press, 2019

ASSESSMENT

Two assignments 50% (individual)

Project 40% (the project is to be proposed and executed by groups of up to three students)

Class participation 5%