Course Title

COMPUTATIONAL STATISTICS II

Course code

STAT 6182

Credits

3

Level:

Graduate

Course Type:

Elective

Pre-requisites

Multivariate statistics, mathematical statistics, and familiarity with basic matrix algebra.

 

Rationale

Many of the problems that statisticians face today cannot be handled analytically, so computational methods must be used to find approximate solutions to real-world problems. This area has grown rapidly over the last two decades as the cost of computing power has fallen and the speed of computers has increased. It is therefore necessary for the MSc and PhD programs in Statistics to offer courses that develop the computational skills needed to solve demanding problems for which there are no analytic solutions. This is the second course on computational statistics in the Master’s in Statistics program. Data mining in particular is a fast-growing area, since much of the data collected today is large enough to be classified as big data. In addition, methods for spatial and temporal data are needed in areas such as disease mapping and climate modelling.

 

Course Description

This course covers statistical techniques that are computational in nature and would not ordinarily be covered by the other courses in the Master’s in Statistics program. Topics include spatial statistics, advanced Bayesian models and some data mining techniques. Both the theoretical underpinnings of the material and its application through computation are covered. The course will be hands-on and practical and will rely heavily on the statistical software R. Matlab will be used where numerical computation is needed. Both real and simulated data, drawn from different subject areas, will be used to illustrate the main concepts. The course is the second in a sequence of two computational statistics courses and is offered to address the needs set out in the rationale.

 

Content

Computational statistics is a branch of the mathematical sciences concerned with efficient methods for obtaining numerical solutions to statistically formulated problems. This course will introduce students to a variety of computationally intensive statistical techniques and to the role of computation as a tool of discovery. Topics include spatial and temporal modeling, Bayesian networks, and selected data mining techniques such as neural networks and support vector machines.

 

Aims and Goals

The main goals of the course are:

  • Introduce students to some common data mining techniques with both theoretical and practical coverage.
  • Illustrate the use of computers in solving problems such as classification in statistics.
  • Model both spatial and temporal data using appropriate statistical software.
  • Understand the role of computation as a tool of discovery in data analysis.

 

Objectives

Upon successful completion of this course, students MUST be able to:

  • Use appropriate methods for density estimation in both the univariate and multivariate setting.
  • Write computer code in R to perform density estimation and use suitable methods such as cross-validation to determine the effectiveness of the model.
  • Use R and WinBUGS to model data in the Bayesian framework.
  • Model spatial and temporal data using appropriate methods in R.
  • Discuss data mining techniques and the theory behind them, both verbally and in writing.
  • Use appropriate data mining methods for specific problems such as classification.
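
As an illustration of the density estimation objective above, the following is a minimal sketch of a univariate Gaussian kernel density estimate whose bandwidth is chosen by leave-one-out likelihood cross-validation. It is written in Python with the standard library only, purely for illustration; coursework is expected in R. The function names, the simulated sample and the candidate bandwidth grid are all assumptions made for this example and are not part of the course materials.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, data, h):
    """Kernel density estimate at point x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood: each point is scored by a density
    estimate built from all the other points."""
    n = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        s = sum(gaussian_kernel((xi - xj) / h)
                for j, xj in enumerate(data) if j != i)
        density = s / ((n - 1) * h)
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def select_bandwidth(data, candidates):
    """Return the candidate bandwidth with the highest leave-one-out score."""
    return max(candidates, key=lambda h: loo_log_likelihood(data, h))


# Simulated data (an assumption for illustration): 200 draws from N(0, 1).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Hypothetical grid of candidate bandwidths.
grid = [0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 2.0]
best_h = select_bandwidth(sample, grid)
```

Scoring each point with an estimate that excludes it prevents the criterion from rewarding very small bandwidths, which would otherwise place a spike on every observation. Other criteria, such as least-squares cross-validation, could be substituted in `select_bandwidth`.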

 

Mode of Delivery

Lectures are delivered face-to-face. All lectures, assignments, handouts, and review materials are available online to all students. Lectures are supplemented with laboratory work and tutorials.

 

Course content and structure

Week | Material | Notes

1 | Review of R: Review of course outline and expectations. Downloading R and related software for course. | Course introduction, format of delivery. Lab.

2 | Spatial Statistics: Geostatistics: Variogram/covariance function; kriging. Introduction to Spatial Statistics R package. | Lectures, with lecture notes made available. *Project topics to be investigated and finalised. Lab.

3 | Spatial Statistics: Spatio-temporal modeling – Disease mapping. | Lectures, with lecture notes made available. *Students start research on project and report on progress to Supervisor on weekly basis. Lab.

4 | Spatial Statistics: Spatio-temporal modeling – Climate Modeling. | Lectures, with lecture notes made available. Lab.

5 | Spatial Statistics: Spatio-temporal modeling – Review of ARCGIS. | Lectures, with lecture notes made available. Lab.

6 | Data Mining Techniques: Classification using Support Vector Machines | Lectures, with lecture notes made available. Lab.

7 | Data Mining Techniques: Classification using Support Vector Machines (continued) | Lectures, with lecture notes made available. Lab.

8 | Data Mining Techniques: Neural Networks Part I | Lectures, with lecture notes made available. Lab.

9 | Data Mining Techniques: Neural Networks Part II | Lectures, with lecture notes made available. Lab.

10 | Data Mining Techniques: Introduction to Bayesian Networks | Lectures, with lecture notes made available.

11 | Data Mining Techniques: Application of Bayesian Networks – Risk Assessment and Decision | Lectures, with lecture notes made available.

12 | Data Mining Techniques: Application of Bayesian Networks – Risk Assessment and Decision – Some more examples | Lectures, with lecture notes made available.

13 | Revision and Group Presentations |

 

 

 

Assessment

Coursework: 100%

This course will be assessed entirely through four individual assignments and one group project. Each assignment and the project will involve both theoretical and computer-based problems.

Individual Assignments (4) – 60%

Four homework assignments will be given, collected and graded throughout the semester.

While discussion of the homework is allowed, you must prepare your solutions separately. Direct copying of written work or computer code is considered cheating and will result in a zero on the assignment. Assignments are worth 60% of the course grade.

Group Project (1) – 40%

Each student will be required to take part in a group project during the second half of the semester. The minimum group size is 3; larger groups are encouraged. Topics will vary and can be discussed with the instructor. Groups will be required to present their projects in class during the last week of classes. Full details will be given around the fourth class session. The project is worth 40% of the course grade.

 

Resource requirements

This course is on computational methods and many of the assignments will require the use of a computer. An introduction to the statistical programming language R will be presented as part of the course and students will be highly encouraged to complete their assignments in R. Other programming languages will be allowed upon approval of the instructor. Students are expected to document and hand in all code used to complete their homework assignments. R can be downloaded for free from: http://www.r-project.org/

The statistical computing lab already has Stata and Matlab. Open-source statistical software such as R, WinBUGS and OpenBUGS will be used as far as possible.

 

PRESCRIBED TEXTS AND READING MATERIALS

Required reading

Computational Statistics, by G. H. Givens and J. A. Hoeting (Wiley, 2005).

Statistical Computing with R, by M. Rizzo (Chapman and Hall).

Statistics for Spatial Data, by N. Cressie.

 

Recommended reading

Hastie, T., Tibshirani, R. and Friedman, J. 2009. The Elements of Statistical Learning. Springer.