STOR767: Advanced Machine Learning

The goal of statistical machine learning and data mining is not to test a specific hypothesis or construct a confidence interval; instead, the goal is to find and understand an unknown systematic component in noisy, complex data.

Lectures: MW 12:20 – 1:35pm, Hanes 130. Syllabus

Instructor: Yufeng Liu

Office Hours: Mondays 1:35-2:30pm (Hanes 354); Fridays 11:30am-12:30pm (GSB 4250)

TAs:

  • Weibin Mo (Ph.D. student in Statistics). Email: harrymok@email.unc.edu. Office: Hanes Hall B-40. Office Hours: Tuesdays and Thursdays 3:30-4:30pm.
  • Jianyu Liu (Ph.D. student in Statistics). Email: liuoo@live.unc.edu. Office: Hanes Hall B-1.
  • Zhengling Qi (Ph.D. student in Statistics). Email: qizl1027@live.unc.edu. Office: Hanes Hall B-26.

Textbook: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman (2009). The electronic version can be downloaded for free.

Additional References:

  • The Nature of Statistical Learning Theory, by Vapnik (1999).
  • Statistics for High-Dimensional Data, by Bühlmann and van de Geer (2011).
  • An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, by Cristianini and Shawe-Taylor (2000).
  • Learning with Kernels, by Schölkopf and Smola (2002).
  • Convex Optimization, by Boyd and Vandenberghe (2004).
  • An Introduction to Statistical Learning, by James, Witten, Hastie, and Tibshirani (2013).
  • Linear Models with R, 2nd edition, by Faraway (2014).

Statistical Software: R

We will use R for this course. R is free, so you can use it anytime and anywhere; it can be downloaded from the R website. RStudio is a recommended interface for R: it is also free, and it runs on Windows, Mac, and Linux operating systems. We will also use R Markdown, which can produce high-quality documents and reports.

Reference: W. N. Venables, D. M. Smith, and the R Core Team (2017). An Introduction to R: Notes on R: A Programming Environment for Data Analysis and Graphics (version 3.4.3).
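As a quick sanity check of your setup, here is a minimal R sketch; the MASS package and its Boston data set are illustrative choices (not course requirements), and MASS ships with standard R installations:

    ## Minimal R session to verify the installation.
    ## MASS is bundled with R; its Boston housing data are used here
    ## only as a convenient example.
    library(MASS)

    fit <- lm(medv ~ lstat + rm, data = Boston)  # ordinary least squares
    summary(fit)                                 # coefficients, R^2, etc.

    ## To render an R Markdown report to HTML (requires the rmarkdown package):
    ## rmarkdown::render("report.Rmd")

If the model summary prints and a report renders, your R, RStudio, and R Markdown installation is ready for the homework assignments.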

Evaluation & Grading: There will be homework assignments throughout the semester covering both the theoretical and computational aspects of the course.
The course grade will be based on class participation, homework grades, a project (presentation & final report), and an exam.
The grade distribution is as follows:

• Homework: 25%
• Exam: 35%
• Presentation: 20%
• Final report: 20%

Homework Policy: Homework assignments will be posted on the course web page. Each assignment will be graded; late or missed assignments without permission will receive a grade of zero. Assignments will be collected at the beginning of class on the day they are due, so please be prepared to turn in your homework at that time.

Honor Code: Students are expected to adhere to the UNC honor code at all times. Violations of the honor code will be prosecuted.

Announcements, Assignments & Lectures:

Lecture | Date | Tentative Plan | Remark
1 | Jan 9 (W) | Introduction & Overview of Supervised Learning (Reading: Ch 1&2) | Notes 1; Homework 1 posted
2 | Jan 14 (M) | Overview of Supervised Learning (Reading: Ch 1&2) |
3 | Jan 16 (W) | Linear Regression and Extensions (Reading: Ch 3) | Notes 2; Homework 2 posted
  | Jan 21 (M) | No class |
4 | Jan 23 (W) | Linear Regression and Extensions (Reading: Ch 3) | Homework 1 due
5 | Jan 28 (M) | Linear Regression and Extensions (Reading: Ch 3) |
6 | Jan 30 (W) | Linear Classification Methods (Reading: Ch 4) | Notes 3
7 | Feb 4 (M) | Linear Classification Methods (Reading: Ch 4) | Homework 3 posted
8 | Feb 6 (W) | Splines (Reading: Ch 5&6) | Notes 4; Homework 2 due
9 | Feb 11 (M) | Splines (Reading: Ch 5&6) |
10 | Feb 13 (W) | Smoothing & Kernel Methods (Reading: Ch 5&6) | Notes 5
11 | Feb 18 (M) | Smoothing & Kernel Methods & Wavelets (Reading: Ch 5&6) | Homework 3 due; Homework 4 posted
12 | Feb 20 (W) | Density Estimation & Additive Models (Reading: Ch 5&6) |
13 | Feb 25 (M) | Cross Validation & Beyond (Reading: Ch 7&8) | Notes 6
14 | Feb 27 (W) | Support Vector Machines (Reading: Ch 4.5, 5.8, 12) | Notes 7; Homework 4 due; Homework 5 posted
15 | Mar 4 (M) | Support Vector Machines (Reading: Ch 4.5, 5.8, 12) |
16 | Mar 6 (W) | Tree-based Methods and Beyond (Reading: Ch 8, 9, 10) | Notes 8; Project proposal due (both hardcopy and email)
  | Mar 11-15 | UNC spring break |
17 | Mar 18 (M) | Tree-based Methods and Beyond (Reading: Ch 8, 9, 10) | Homework 5 due; Homework 6 posted
18 | Mar 20 (W) | Unsupervised Learning: Clustering (Reading: Ch 14) | Notes 9
19 | Mar 25 (M) | Unsupervised Learning: Dimension Reduction (Reading: Ch 14 & 17) |
20 | Mar 27 (W) | Unsupervised Learning: Graphical Models | Notes 10; Homework 6 due
21 | Apr 1 (M) | In-class Exam |
22 | Apr 3 (W) | Guest Lecture on Neural Networks and Deep Learning (Dr. Tao Wang, Senior Manager, AI and Machine Learning, R&D, SAS Institute) |
23 | Apr 8 (M) | In-class presentations: Xiaoyang Chen, Miheer Dewaskar, Zhenghan Fang, Gang Li |
24 | Apr 10 (W) | In-class presentations: Tianshe He, Kentaro Hoffman, Dayton Steele, Benjamin Leinwand, Bohan Li |
25 | Apr 15 (M) | In-class presentations: Daiqi Gao, Zichao Li, Deyi Liu, Wei Liu, Yiyun Luo |
26 | Apr 17 (W) | In-class presentations: Carson Mosso, Robert Niewoehner, Kevin O'Connor, Yifeng Shi, Nhan Pham |
27 | Apr 22 (M) | In-class presentations: Jack Prothero, Yukai Huang, Aleksandr Touzov, Haodong Wang |
28 | Apr 24 (W) | In-class presentations: Mingyi Wang, Xi Yang, Ai Ye, Jonghwan Yoo, Hang Yu | Final report due (both hardcopy and email)

Supplemental Reading:

  • Overview
      Data mining and statistics: what is the connection?, by Friedman (1997).
  • Linear Regression & Extensions (see the short R sketch after this list)
      Ridge regression: biased estimation for nonorthogonal problems, by Hoerl and Kennard (1970).
      Ridge regression: applications to nonorthogonal problems, by Hoerl and Kennard (1970).
      Regression shrinkage and selection via the lasso, by Tibshirani (1996).
      Better subset regression using the nonnegative garrote, by Breiman (1995).
      Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression, by Stone and Brooks (1990).
      Best subset selection via a modern optimization lens, by Bertsimas, King, and Mazumder (2016), Annals of Statistics, 44(2), 813-852.
  • Classification
      Flexible discriminant analysis by optimal scoring, by Hastie, Tibshirani, and Buja (1994).
  • Splines and Smoothing
      Spline Models for Observational Data, by Wahba (1990).
      Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, by Green and Silverman (1994), Chapman and Hall, London.
      Ideal spatial adaptation by wavelet shrinkage, by Donoho and Johnstone (1994), Biometrika, 81, 425-455.
      Multivariate Density Estimation: Theory, Practice, and Visualization, by Scott (1992), Wiley, New York.
  • Cross Validation
      Linear model selection by cross-validation, by Shao (1993).
      Estimation of prediction error, by Efron (2004).
      Improvements on cross-validation: the .632+ bootstrap method, by Efron and Tibshirani (1997).
  • Support Vector Machines
      Support vector machines and the Bayes rule in classification, by Lin (2002).
      Support vector machines for classification in nonstandard situations, by Lin et al. (2002).
      Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV, by Wahba (1998).
  • Tree-based Methods
      Classification and Regression Trees, by Breiman et al. (1984).
      Multivariate adaptive regression splines (MARS), by Friedman (1991).
      Experiments with a new boosting algorithm, by Freund and Schapire (1996).
      Additive logistic regression: a statistical view of boosting, by Friedman, Hastie, and Tibshirani (2000).
  • Unsupervised Learning Methods
      Clustering Algorithms, by Hartigan (1975).
      A brief introduction to independent component analysis, by J. V. Stone (2005).
      Independent Component Analysis, by Hyvärinen, Karhunen, and Oja (2001), introductory chapter.
      Multidimensional Scaling, by Kruskal and Wish (1978), Sage.
      http://www.personality-project.org/r/mds.html
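To connect the ridge, lasso, and cross-validation readings above to the course software, here is a minimal R sketch; it assumes the CRAN package glmnet (not listed on this page) is installed, and the simulated data are purely illustrative:

    ## Ridge and lasso fits with cross-validated tuning via glmnet.
    ## install.packages("glmnet")  # once, if needed
    library(glmnet)

    set.seed(1)
    n <- 100; p <- 20
    x <- matrix(rnorm(n * p), n, p)
    beta <- c(3, -2, rep(0, p - 2))        # sparse true coefficient vector
    y <- as.vector(x %*% beta + rnorm(n))

    ridge <- cv.glmnet(x, y, alpha = 0)    # alpha = 0: ridge penalty
    lasso <- cv.glmnet(x, y, alpha = 1)    # alpha = 1: lasso penalty

    ## The lasso should shrink most of the 18 noise coefficients to zero,
    ## while ridge only shrinks them toward zero.
    coef(lasso, s = "lambda.min")

Here cv.glmnet selects the penalty level by k-fold cross-validation, which previews the model-selection ideas in the Cross Validation readings.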