Course taught as: B025406 - MULTIVARIATE ANALYSIS AND STATISTICAL LEARNING Second Cycle Degree in STATISTICS AND DATA SCIENCE Curriculum GENERALE
Teaching Language
English
Course Content
Multivariate Gaussian distribution. Graphical models, with an outline of high-dimensional settings (lasso-type estimators). Principal components and factor analysis. Linear and quadratic discriminant analysis. Supervised learning via CART, boosting, random forests, super learner, BART.
Unsupervised learning algorithms for clustering: hierarchical clustering, k-means and model-based clustering.
Class notes and slides available on Moodle platform.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning (Second edition). New York: Springer.
Hastie, T., Tibshirani, R., & Friedman, J. (2013). The elements of statistical learning (Second edition). New York: Springer (Springer Series in Statistics).
Giudici, P. (2005). Applied data mining: statistical methods for business and industry. John Wiley & Sons.
Learning Objectives
The course introduces the student to the application and theory of statistical models for the study of high-dimensional and multivariate data and statistical learning methods. In particular, insights into classical multivariate analysis and some methodologies of data mining and supervised and unsupervised statistical learning will be covered. To facilitate the understanding, interpretation and use of the methodologies, the course includes exercises with the R language.
For students in SDS, preliminary requirements: Statistical inference; Probability and Mathematics for Statistics
Teaching Methods
Lectures, labs, flipped classes and contests.
Further information
Students attending the 6 CFU course must agree on the topics of their reduced syllabus, which covers 2/3 of the 9 CFU syllabus.
Type of Assessment
The exam consists of two parts:
(1) Homework exercises, to be submitted via Moodle by students attending classes. For students who do not submit 75% of the homework exercises, a short oral exam will be added to part (2) (25% of the final score).
(2) Seminar presentation of two projects aimed at demonstrating personal mastery of the course topics.
For attending students, the first project can be prepared in a group and presented in a contest between groups (30% of the final grade).
Before the presentation, slides and code must be uploaded to the Moodle platform.
The following skills will be assessed: comprehension of the research topic, application of theoretical and computational tools, rigor in using the selected methodologies, and the capacity to defend the conclusions obtained.
Course program
1. Multivariate Gaussian distribution: Bivariate and multivariate distributions; Marginal and conditional distributions; Correlation and marginal/conditional independence; Inference on the parameters of a multivariate Gaussian distribution
2. Introduction to graphical models: Graphs and conditional independence properties; Undirected graphs (networks / Markov random fields): Markov properties and factorization, Gaussian graphical models, log-linear graphical models; Directed graphs (Bayesian networks / DAGs): Markov properties and factorization, learning; Basics of chain graphs: Markov properties and factorization
3. Principal components analysis: Notation; Definition and properties of PCA; Interpretation of PCA
4. Introduction to statistical learning: Statistical learning versus machine learning; Supervised and unsupervised learning; Regression vs classification; Accuracy measures; Bias-variance trade-off; Resampling and cross-validation
5. Linear model selection and regularization: Subset selection; Shrinkage methods; Ridge regression; Lasso and elastic net
6. Tree-based methods: Basics of decision trees; Regression trees; Classification trees; Bagging and boosting; Random forests; BART
7. Super learner for regression and classification
8. Factor analysis: Introduction to exploratory factor analysis; Rotation of axes; Interpretation of the factorial axes; Outline of confirmatory factor analysis
9. Cluster analysis: Introduction to the problem of classification; Distances and metrics; Hierarchical and nonhierarchical methods; Probabilistic and fuzzy methods