Class notes and slides available on Moodle platform.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning (Second Edition). New York: Springer.
Available online: https://hastie.su.domains/ISLR2/ISLRv2_website.pdf
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (Second Edition). Springer Series in Statistics. New York: Springer.
Available online: https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf
Learning Objectives
The course introduces students to statistical methods and statistical learning for studying multivariate and high-dimensional data. In particular, it provides the fundamental tools needed to understand and apply the recent scientific literature on the statistical aspects of machine learning and on multivariate models.
Labs in the R language complement the lectures to facilitate the understanding, interpretation, and use of the proposed methodologies. At the end of the course, students will understand multivariate statistics and the statistical aspects of machine learning, and will be able to choose and apply appropriate methods and algorithms in specific contexts. They will be able to critically examine the output of an algorithm or model and to visualize and present it. They will have the instruments to understand new techniques and compare them with existing ones.
For students in SDS, preliminary requirements: Statistical Inference; Probability and Mathematics for Statistics.
Teaching Methods
Lectures, labs, flipped classes, group contests.
Further information
Attendance is strongly recommended.
Type of Assessment
The exam consists of two parts:
(1) Homework, to be uploaded to Moodle by students attending classes and briefly presented in class. For students who do not complete 75% of the homework, a brief oral test (25% of the final score) will be added to part (2).
(2) Seminar-style presentation of two projects aimed at demonstrating personal mastery of the course topics.
For attending students, the first project can be prepared in a group and presented in a contest between groups (30% of the final grade). The topics of the projects are chosen by the students from among the topics covered in the course and extensions thereof.
Before the presentation, slides and code must be uploaded to the Moodle platform.
The following skills will be evaluated: comprehension of the research topic, application of theoretical and computational tools, rigor in using the selected methodologies, and the capacity to defend the conclusions obtained.
Course program
(1) Introduction to statistical learning. Definition of statistical learning and differences between machine learning and statistical models. Supervised and unsupervised learning. Regression and classification. Accuracy measures. Bias-variance trade-off.
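As a flavour of the lab material, the bias-variance trade-off in point (1) can be illustrated with a small Monte Carlo experiment. This is a minimal sketch in Python (the course labs use R); the data-generating process, sample size, and polynomial degrees are illustrative assumptions, not course material.

```python
import numpy as np

rng = np.random.default_rng(0)

def bias_variance(degree, n_rep=200, n=30, sigma=0.3, x0=0.25):
    """Estimate squared bias and variance of a degree-`degree`
    polynomial fit at the point x0, for the illustrative
    data-generating process y = sin(2*pi*x) + eps."""
    preds = []
    for _ in range(n_rep):
        x = rng.uniform(0, 1, n)
        y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)
        coef = np.polyfit(x, y, degree)          # least-squares fit
        preds.append(np.polyval(coef, x0))       # prediction at x0
    preds = np.array(preds)
    truth = np.sin(2 * np.pi * x0)
    return (preds.mean() - truth) ** 2, preds.var()

b1, v1 = bias_variance(degree=1)   # rigid model: high bias, low variance
b7, v7 = bias_variance(degree=7)   # flexible model: low bias, high variance
print(f"degree 1: bias^2 = {b1:.4f}, variance = {v1:.4f}")
print(f"degree 7: bias^2 = {b7:.4f}, variance = {v7:.4f}")
```

Repeating the fit over many simulated datasets makes the trade-off visible: increasing flexibility shrinks the squared bias while inflating the variance of the prediction.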
(2) Data Generating Process, Monte Carlo simulations, Resampling and cross-validation methods.
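The cross-validation idea in point (2) can be sketched in a few lines. This is an illustrative Python sketch (the course labs use R); the quadratic data-generating process and the candidate degrees are assumptions chosen only to make the example concrete.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a degree-`degree` polynomial fit
    by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))         # shuffle, then split into k folds
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((y[test] - np.polyval(coef, x[test])) ** 2))
    return float(np.mean(errs))

# Illustrative data: quadratic truth plus Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 100)
y = 1 + 2 * x - x ** 2 + rng.normal(0, 0.5, 100)

for d in (1, 2, 8):
    print(f"degree {d}: CV MSE = {kfold_mse(x, y, d):.3f}")
```

Because each fold is held out exactly once, the averaged error approximates out-of-sample performance and favours the correctly specified degree over the underfitting one.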
(3) Introduction to nonparametric regression, piecewise constant and polynomial regression, splines, kernel regression.
(4) Linear model selection and subset selection. Regularized estimators: ridge, lasso, elastic net, adaptive lasso.
(5) Tree-based algorithms: CART, conditional trees, oblique trees.
Tree-based ensembles: bagging, boosting, AdaBoost, gradient boosting (also in its non-tree-based version), random forests, BART.
(6) Dimension reduction methods: PCA and SVD and their relationship
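The PCA-SVD relationship in point (6) can be verified numerically: for a column-centred data matrix X with SVD X = U D V', the columns of V are the principal-component loadings and the eigenvalues of the sample covariance matrix equal d^2/(n-1). A minimal Python sketch on simulated data (the course labs use R):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)                      # centre the columns

# PCA via the eigendecomposition of the sample covariance matrix
cov = Xc.T @ Xc / (Xc.shape[0] - 1)
eigval, eigvec = np.linalg.eigh(cov)
order = np.argsort(eigval)[::-1]             # sort in decreasing order
eigval, eigvec = eigval[order], eigvec[:, order]

# PCA via the SVD of the centred data matrix
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

# Same variances: covariance eigenvalues equal d^2 / (n - 1)
print(np.allclose(eigval, d ** 2 / (Xc.shape[0] - 1)))   # True
# Same loadings, up to the sign of each component
print(np.allclose(np.abs(eigvec), np.abs(Vt.T)))         # True
```

Working through the SVD route is also numerically preferable in practice, since it avoids forming the covariance matrix explicitly.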
(7) Clustering: hierarchical and non-hierarchical algorithms and their characterization; probabilistic algorithms (Gaussian mixtures)
(8) Ensemble of strong classifiers: Super Learner
(9) Support vector machines (SVM) and kernel SVM
(10) Introduction to graphical models: graphs and conditional independence properties. Undirected graphs (networks / Markov random fields): Markov properties and factorization; Gaussian graphical models; log-linear graphical models. Directed graphs (Bayesian networks / DAGs): Markov properties and factorization; learning. Basics of chain graphs: Markov properties and factorization.