Basic and advanced data mining algorithms on relational and transactional data: mining of association rules, sequential patterns, clustering and classification.
Data storage and algorithms in external memory: sorting, B-trees, B+-trees, hash methods, multidimensional data representation, data streams.
- J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Stanford InfoLab, 2010
- P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Pearson, 2006
- J. S. Vitter, Algorithms and Data Structures for External Memory, Now Publishers Inc., 2008
Learning Objectives
The purpose of this course is to introduce students to the main data structures for external memory and the main data mining techniques for relational and transactional data, together with hands-on experimentation. At the end of the course, students will have a good understanding of the structures for storing data in secondary memory and of the algorithms for analysing data and finding regularities in it. Students will be able to choose the most suitable data structure for storing and retrieving information and the most suitable data mining algorithm for analysing a real dataset.
Prerequisites
Basic concepts of algorithms, data structures and databases.
Basic concepts of algorithm analysis. SQL language.
Teaching Methods
CFU: 12
Total number of hours of the course: 300
Number of hours for personal study and other individual learning activities: 204
Number of hours of classroom activity: 96
Classroom activities include lectures and practical exercises with WeKA and other open-source software.
Further information
Office hours:
Prof. Donatella Merlini
By appointment via e-mail
Department of Statistics, Computer Science, Applications
Viale Morgagni, 65
50134 - Florence (FI)
Tel: 055 2751509
E-Mail: donatella.merlini@unifi.it
Prof. M. Cecilia Verri
By appointment via e-mail
Department of Statistics, Computer Science, Applications
Viale Morgagni, 65
50134 - Florence (FI)
Tel: 055 2751513
E-Mail: mariacecilia.verri@unifi.it
Type of Assessment
Written and oral exam on data mining and data organization techniques.
Project with WeKA.
Candidates must pass the written test to be admitted to the oral exam. The final grade is the average of the written and oral exam grades.
Course program
Data Mining - Basic and advanced data mining algorithms on relational and transactional data: mining of association rules, sequential patterns, clustering and classification. In particular, the following algorithms will be presented during the course: the K-means clustering algorithm and some of its variants; agglomerative hierarchical clustering with single link, complete link, group average and Ward's method; the DBSCAN density-based clustering algorithm; the Apriori (with hash tree structure for candidate item generation) and FP-Growth algorithms for association rule analysis; Apriori-like algorithms for sequential pattern analysis; classification algorithms based on decision trees, rules and nearest neighbors; an introduction to Naive Bayes and artificial neural network (ANN) classification. Application to text categorization. Some techniques for data preprocessing, postprocessing, exploration and visualization will also be presented. All the algorithms and methodologies presented during the course will be tried out in practice with WeKA and other open-source software.
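As an illustration of the level-wise idea behind Apriori mentioned in the programme, the following is a minimal Python sketch, not the course's reference implementation. It counts support with a plain scan of the transactions rather than a hash tree, and the function name `apriori`, the parameter `min_support` (an absolute count) and the toy baskets are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    transactions: iterable of iterables of hashable items
    min_support:  minimum absolute support count (hypothetical parameter)
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k:
                # Apriori pruning: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in frequent
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)

        # Support counting by a plain scan (no hash tree in this sketch).
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy data invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
frequent_itemsets = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step is what distinguishes Apriori from brute-force enumeration: a k-itemset is only counted if all of its (k-1)-subsets were already found frequent.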
Data Organization - Introduction to Big Data. Organization of external memories: memory hierarchies, memory hierarchy management.
Algorithms and data structures for external memory: fundamental operations and complexity limits. External sorting: mergesort.
External memory search: tree-based techniques (B-trees, B+-trees), hash techniques (static, dynamic, extensible, virtual, linear). Organization of multidimensional data (k-d trees, R-trees, R*-trees). Bitmap indexes. Management of data streams.
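To give a flavour of the external sorting topic listed above, here is a minimal Python sketch of a two-phase external mergesort: sorted runs are written to temporary files and then combined with a k-way merge. It is an illustration under simplifying assumptions (integer keys, one value per line, all runs merged in a single pass); the names `external_sort`, `run_size` and `read_run` are hypothetical and chosen for this example.

```python
import heapq
import os
import tempfile


def read_run(path):
    """Stream the integers stored one per line in a sorted run file."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values, run_size):
    """Yield the integers in `values` in ascending order.

    At most `run_size` values are held in memory while building runs
    (`run_size` stands in for the available main memory).
    """
    run_paths = []
    buffer = []

    def flush():
        # Sort the in-memory buffer and write it out as one sorted run.
        buffer.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    # Phase 1: run formation.
    for v in values:
        buffer.append(v)
        if len(buffer) >= run_size:
            flush()
    if buffer:
        flush()

    # Phase 2: k-way merge of all runs, reading each run sequentially.
    try:
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)


print(list(external_sort([5, 3, 8, 1, 9, 2, 7], run_size=3)))
```

With many runs and limited memory, a real implementation would merge in several passes, each combining only as many runs as there are buffer blocks available; the single-pass merge here keeps the sketch short.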