This is the course page for ‘Bayesian inference for researchers’, a lecture series that I have taught a number of times at Oxford.
The lecture slides and problem sheets for the course are here.
The material covered in each of the lectures is roughly as follows:
Lecture 1. Introduction to the course (lecture dates, syllabus, problem sets, group projects); the theory of inference – going from the Big World to the Small World; likelihoods defined by the boundaries of the Small World; all inference as inverting a likelihood; Classical versus Bayesian approaches to inversion; an introduction to probability distributions and their manipulation; the constituent parts of Bayes’ rule for inference.
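To make the last item concrete, here is a minimal sketch (not part of the course materials) of Bayes’ rule applied to a discrete parameter: a coin whose bias is assumed, within our Small World, to take one of three candidate values. The candidate values, prior weights and data are all invented for illustration.

```python
from math import comb

# Hypothetical Small World: the coin's bias theta takes one of three values.
thetas = [0.3, 0.5, 0.7]
prior = {0.3: 0.25, 0.5: 0.5, 0.7: 0.25}   # the prior, p(theta)

# Invented data: 7 heads out of 10 flips.
n, k = 10, 7

# The likelihood, p(data | theta), evaluated at each candidate value.
likelihood = {t: comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas}

# The denominator, p(data): a sum over the Small World of candidate values.
evidence = sum(likelihood[t] * prior[t] for t in thetas)

# The posterior, p(theta | data) = likelihood x prior / evidence.
posterior = {t: likelihood[t] * prior[t] / evidence for t in thetas}
print(posterior)
```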
Lecture 2. How to build a statistical model (applied example); how to use existing data to validate a model through graphical methods and posterior predictive checks (more theory on PPCs to follow in lecture 3); the theory of priors in Bayesian inference; the illusion of ‘uninformative’ priors; prior predictive distributions; posterior point summaries – the MAP, posterior median and posterior mean; posterior interval summaries – central posterior intervals (CPIs) versus highest density intervals (HDIs); posterior predictive distributions; the difficulty of Bayesian statistics due to its high-dimensional integrals; conjugate priors and sampling as ways around these difficulties; the definition of conjugate priors; the beta-binomial conjugate example.
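The beta-binomial conjugate example, and the posterior summaries listed above, can be sketched in a few lines of Python; the data and prior parameters below are invented, and scipy is just one convenient way to evaluate the beta distribution.

```python
from scipy.stats import beta

# Invented data: k successes in n Bernoulli trials.
n, k = 20, 14

# Beta(a, b) prior; with a binomial likelihood the posterior is Beta(a + k, b + n - k).
a, b = 2.0, 2.0
a_post, b_post = a + k, b + n - k

# Posterior point summaries.
post_mean = a_post / (a_post + b_post)
post_median = beta.median(a_post, b_post)
post_map = (a_post - 1) / (a_post + b_post - 2)   # mode, valid when a_post, b_post > 1

# Central posterior interval (CPI): equal tail probability on either side.
cpi_95 = beta.ppf([0.025, 0.975], a_post, b_post)
# (An HDI is instead the narrowest interval containing 95% of the posterior mass,
#  and generally requires a numerical search.)

print(post_mean, post_median, post_map, cpi_95)
```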
Lecture 3. Posterior predictive checks in detail – their use as a way of checking the appropriateness of the model for any aspect of the data; posterior predictive checks as shifting the boundaries of the Small World; recap of the difficulty of Bayesian inference – the need to calculate high-dimensional integrals; the restriction of using conjugate priors; potential solutions to this difficulty – discretising the posterior, and numerical quadrature; the problem that the complexity of these methods scales exponentially with the parameter dimensionality; an introduction to sampling through the computational die; independent sampling as a way to approximate high-dimensional integrals; the difficulty of creating an independent sampler in practice – rejection sampling, inverse transform sampling, and importance sampling; the impossibility of independent sampling from the posterior in practice (in part due to the difficulty of calculating the denominator of Bayes’ rule); the shift from independent to dependent sampling, which allows us to use relative, rather than absolute, posterior height; the cost of dependent sampling; quantifying the cost of dependent sampling through the effective sample size (Markovian die example).
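As a rough illustration of using independent samples to approximate posterior integrals without the normalising denominator, here is a sketch of self-normalised importance sampling for the coin example, together with the effective sample size of the weights; the target, proposal and data are all invented.

```python
import numpy as np
from scipy.stats import binom, norm

rng = np.random.default_rng(1)

# Unnormalised posterior for a coin's bias theta: flat prior on (0, 1) and
# invented data (14 successes in 20 trials). The normalising denominator is
# never needed, because the weights below are self-normalised.
def unnorm_post(theta):
    inside = (theta > 0) & (theta < 1)
    return np.where(inside, binom.pmf(14, 20, np.clip(theta, 0.0, 1.0)), 0.0)

# Independent draws from a proposal distribution that we *can* sample from.
proposal = norm(loc=0.5, scale=0.25)
theta = proposal.rvs(size=50_000, random_state=rng)

# Importance weights: target height over proposal density, then normalised.
w = unnorm_post(theta) / proposal.pdf(theta)
w /= w.sum()

# The posterior mean as a weighted average (in general, a high-dimensional integral).
post_mean = np.sum(w * theta)

# Effective sample size of the weighted draws (Kish's formula).
ess = 1.0 / np.sum(w ** 2)
print(post_mean, ess)
```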
Lecture 4. MCMC as a method to sample dependently from the posterior; the Metropolis algorithm explained through toy examples; the acceptance/rejection rule, and why it works; detailed balance and chain ergodicity underpinning chain convergence to the posterior; sampling efficiency as a function of step size; adaptive MCMC; the difficulty of monitoring convergence to the posterior; problems with convergence monitoring using a single chain; using parallel chains initialised in random parts of posterior space to judge convergence; Gelman and Rubin’s R-hat; the warm-up period; effective sample size revisited and defined more fully; why the effectiveness of an algorithm should be measured by the number of effective samples per second, not the raw number of samples; Gelman’s ‘Folk Theorem’.
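The sketch below shows two of the central ideas of this lecture, under invented assumptions (a one-dimensional target, an arbitrary step size and arbitrary starting points): the Random Walk Metropolis acceptance/rejection rule, and a basic Gelman and Rubin R-hat computed from parallel chains after discarding warm-up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented one-dimensional target: log of an unnormalised posterior.
def log_post(theta):
    return -0.5 * (theta - 3.0) ** 2          # a normal centred at 3

def random_walk_metropolis(n_iter, step_size, init):
    """Random Walk Metropolis: propose theta' = theta + Normal(0, step_size),
    accept with probability min(1, p(theta') / p(theta))."""
    chain = np.empty(n_iter)
    theta, lp = init, log_post(init)
    for i in range(n_iter):
        prop = theta + step_size * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # acceptance/rejection rule
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain

# Parallel chains initialised at over-dispersed starting points.
chains = np.array([random_walk_metropolis(2000, 1.0, init)
                   for init in (-10.0, 0.0, 10.0, 20.0)])
chains = chains[:, 1000:]                     # discard warm-up

# Basic Gelman and Rubin R-hat: between-chain (B) versus within-chain (W) variance.
m, n = chains.shape
B = n * np.var(chains.mean(axis=1), ddof=1)
W = np.mean(np.var(chains, axis=1, ddof=1))
var_hat = (n - 1) / n * W + B / n
r_hat = np.sqrt(var_hat / W)
print(r_hat)
```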
Lecture 5. An introduction to Gibbs sampling; Gibbs sampling as a specific example of the Metropolis sampler; Gibbs sampling versus Random Walk Metropolis; the problems shared by Random Walk Metropolis and Gibbs; an introduction to Hamiltonian Monte Carlo in Stan; using symplectic integrators to simulate volume-conserving paths in (position, momentum) space; the revised acceptance/rejection rule; comparing the results of Gibbs with the aforementioned algorithms. The following material will be mentioned in references, but will not be explicitly covered in the lectures: the problem of ‘U-turns’ in HMC; an introduction to the ‘No U-Turn Sampler’ (NUTS) used in Stan as a way of avoiding manual tuning; the problems that remain with HMC; Riemannian MCMC as a remedy, albeit with the extra computational cost of evaluating second derivatives; the problem of multimodality, and a potential solution through Adiabatic MCMC.
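A minimal sketch of Gibbs sampling, assuming an invented bivariate normal target whose full conditionals are known in closed form; with strong correlation the chain explores the target slowly, which is the kind of shortcoming that motivates the move to Hamiltonian Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented target: a bivariate standard normal with correlation rho.
# Its full conditionals are Normal(rho * other, 1 - rho^2), so each Gibbs
# update draws one coordinate from its conditional given the other.
rho = 0.95
n_iter = 5000
samples = np.empty((n_iter, 2))
x1, x2 = 0.0, 0.0

for i in range(n_iter):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # draw x1 | x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # draw x2 | x1
    samples[i] = x1, x2

# With rho close to 1 the chain takes small, highly dependent steps.
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])
```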
Lecture 6. An introduction to the Stan language; how Stan differs from JAGS/BUGS; Stan’s code blocks; variable declaration; access through ‘rstan’ or the Matlab/Python wrappers; live-coding a model in Stan, and how to debug with print statements; an introduction to ShinyStan; PPCs in Stan via the ‘generated quantities’ block; an introduction to Stan’s active Google Groups forum. An introduction to hierarchical models as lying on a spectrum from complete pooling to fully heterogeneous parameters; the benefits of hierarchical modelling; shrinkage of estimators towards the group mean; priors and hyper-priors; model testing in hierarchical models; parameter identification, and the importance of fake-data simulation; coding up a hierarchical model in Stan and comparing it to a non-hierarchical equivalent; an introduction to the student projects.
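To give a flavour of Stan’s code blocks and of PPCs via ‘generated quantities’, here is a hypothetical hierarchical binomial model written as a string and fitted through a Python wrapper (cmdstanpy here, though rstan or another interface would work equally well); the model, priors, variable names and data are all invented for illustration.

```python
from pathlib import Path
from cmdstanpy import CmdStanModel   # one of several wrappers around Stan

# A hypothetical hierarchical binomial model, showing Stan's main code blocks:
# data, parameters, model, and generated quantities (used here for PPCs).
stan_code = """
data {
  int<lower=1> N;                     // number of groups
  array[N] int<lower=0> K;            // trials per group
  array[N] int<lower=0> y;            // successes per group
}
parameters {
  real<lower=0, upper=1> mu;          // population-level success probability
  real<lower=0> kappa;                // concentration (how similar groups are)
  vector<lower=0, upper=1>[N] theta;  // group-level probabilities
}
model {
  kappa ~ normal(0, 10);                        // hyper-prior
  theta ~ beta(mu * kappa, (1 - mu) * kappa);   // hierarchical prior
  y ~ binomial(K, theta);                       // likelihood
}
generated quantities {
  array[N] int y_rep;                 // posterior predictive replicates for PPCs
  for (n in 1:N)
    y_rep[n] = binomial_rng(K[n], theta[n]);
}
"""

Path("hier_binomial.stan").write_text(stan_code)

# Invented data for three groups.
data = {"N": 3, "K": [20, 25, 30], "y": [12, 20, 14]}

model = CmdStanModel(stan_file="hier_binomial.stan")
fit = model.sample(data=data, chains=4)
print(fit.summary())
```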
Lecture 7. A focus on one or two case studies: fitting an ODE/PDE model using Random Walk Metropolis, and a nice hierarchical model. If we have time, we will look at how to estimate a model’s out-of-sample predictive power through AIC, DIC, WAIC, LOO-CV and K-fold CV, and at how the choice of cross-validation metric should reflect the eventual use of the model.
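For reference, WAIC can be computed directly from a matrix of pointwise log-likelihoods (one row per posterior draw, one column per data point), such as might be produced in a ‘generated quantities’ block; the matrix below is simulated purely so that the sketch runs.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)

# Stand-in for a matrix of pointwise log-likelihoods, one row per posterior
# draw (S draws) and one column per data point (N points); in practice this
# would come from the fitted model itself.
S, N = 4000, 50
log_lik = rng.normal(loc=-1.0, scale=0.3, size=(S, N))

# lppd: log pointwise predictive density, averaging likelihoods over draws.
lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))
# p_waic: effective number of parameters, via the variance of the log-likelihoods.
p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
# WAIC on the deviance scale, comparable to AIC and DIC.
waic = -2 * (lppd - p_waic)
print(waic)
```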