CE7412: Computational and Systems Biology
Offered in Semester 2 for PhD and MEng students
Instructor: Professor Jagath Rajapakse
SYLLABUS
1. Biological foundations
Basics of molecular biology, genes, proteins, DNA, central dogma of biology, transcription, translation, gene structure, analyzing a genome: PCR, cloning, electrophoresis, gene expression, DNA microarray
2. R basics:
Preliminaries, vectors, matrices, factors, lists, data frames, tables, functions, probability, statistical tests, R packages: Bioconductor, Biostrings, and BSgenome, R Markdown
3. Probabilistic modelling
Mathematical foundations, probabilistic events and rules, discrete probability models, continuous probability models, multivariate densities and distributions, foundations of information theory, Lagrange theory for constrained optimization, probabilistic modelling, ML estimator, MAP estimator, Bayesian modelling
4. Probabilistic models of sequences
Independent models of sequences, die models with sequence and counts data, multiple dice models for multiple sequences, Markov chain modelling of sequences, first-order models of sequences, higher-order models of sequences, modelling CpG islands
5. Statistical analysis
Hypothesis testing, generalized likelihood ratio tests, Wilks' theorem, Chi-square statistic for goodness of fit, Chi-square test of independence, extreme value analysis, testing for long repeats, exactly matching and well-matching sequences, Chargaff’s rule, Hardy-Weinberg equilibrium, sequence motifs and logos, protein sequence alignment, PAM matrices, BLAST
6. HMM and gene structure prediction
Definition of hidden Markov models (HMM); dice models of sequences; likelihood of sequences; forward algorithm; backward algorithm; Viterbi algorithm; posterior decoding; ML estimation of parameters; Expectation-Maximization (EM) algorithm; Baum-Welch algorithm; Baldi-Chauvin approach; gene structure prediction: VEIL; GENSCAN; profile HMM for multiple sequence alignment
7. R Graphics:
Base R plotting: plot, boxplot, histograms, bar plot, dot chart, pie chart, strip chart, scatter plot, QQ plot, multiple plots in one, ggplot2: violin plot, density plot, line plot, faceting, grammar of graphics, visualization of data, examples: gene expression data, genomic data
8. High throughput counts data
Types of counts data, example RNA-seq dataset: Pasilla, key terminology, modeling RNA-seq data, analysis with DESeq2, differential expression analysis, linear models, multifactorial designs, statistical insights
9. Gene ontology and gene set enrichment analysis:
Gene ontology: molecular function, cellular component, biological process, GO hierarchy, GO terms and relationships, gene ontology enrichment analysis (GOEA), gene annotation, creating a topGO object, gene set enrichment analysis (GSEA), hypergeometric test, enrichment score, fgsea package, GOEA and GSEA with clusterProfiler, KEGG GSEA
10. Supervised learning:
Predictive analytics, overfitting, linear discriminant analysis (LDA), predicting diabetes types from clinical variables, predicting embryonic cell state from gene expression, quadratic discriminant analysis, machine learning: generalizability, cross validation, curse of dimensionality, objective functions, predicting colon cancer from stool microbiome composition, classifying mouse cells from their expression profiles, neural networks, support vector machines, bagging, random forests, boosting
11. Biological networks
Networks with igraph, network layouts, network and node descriptions: density, node degree, degree distributions, paths, centrality, hubs and authorities, subgraphs and communities, cliques, community detection, hypergraphs, trees, protein interaction networks, random networks, small world property, scale-free graphs
12. Variant calling from DNA-seq data:
Next generation sequencing (NGS), Bioconductor resources for NGS analysis, whole genome sequencing (WGS) versus whole exome sequencing (WES), DNA-seq analysis pipeline, types of genomic variations, methods of variant calling, predicting open reading frames in long reference sequences, extracting information in genomic regions of interest, finding SNPs and indels from sequence data using VariantTools, plotting features on genetic maps, estimating the copy number at a locus of interest, finding phenotype and genotype associations with GWAS
REFERENCE TEXTS
1. R Bioinformatics Cookbook, Dan MacLean, Packt, 2019
2. Modern Statistics for Modern Biology, Susan Holmes and Wolfgang Huber, Cambridge, 2019
ASSESSMENT
Two assignments 50% (individual)
Project 40% (the project is to be proposed and executed by groups of up to three)
Class participation 5%