Course Description

An increasing amount of data is now generated in a variety of disciplines,
ranging from finance and economics to the natural and social sciences.
Making use of this information requires both statistical tools and an
understanding of how the substantive scientific questions should drive
the analysis. In this hands-on course, we learn to explore and analyze
real-world datasets. We cover techniques for summarizing and describing data,
methods for statistical inference, and principles for effectively communicating results.

Prerequisites:
MS&E 120 or equivalent,
and CS 106A or equivalent

We encourage you to attend our crash course on R on Saturday, January 14 and Sunday, January 15. Please sign up here.
You can view the R course materials here.

Instructors

Sharad Goel (email)

Lauren Gomez (TA) (email)

Jongbin Jung (TA) (email)

Chiraag Sumanth (TA) (email)

Ron Tidhar (Grader) (email)

Schedule

Class: Tuesdays & Thursdays @ 1:30 PM - 2:50 PM in 200-002

Discussion Section: Thursdays @ 3:00 PM - 3:50 PM in 380-380c

*No discussion section during the first week of class.*

We use Piazza to manage course questions and discussion. Please sign up here.

**Office Hours**

Mondays @ 3 PM - 5 PM in Shriram 054 (Chiraag)

Mondays @ 7 PM - 9 PM in Huang 305 (Lauren)

Tuesdays @ 3 PM - 5 PM in Huang 356 (Sharad)

Tuesdays @ 5 PM - 7 PM in Shriram 366 (Lauren)

Wednesdays @ 10 AM - 12 PM in Huang 203 (Jongbin)

Wednesdays @ 5 PM - 7 PM in Y2E2 335 (Chiraag)

Thursdays @ 4:30 PM - 6:30 PM in Y2E2 105 (Lauren)

*There are no office hours during the first week of class.
Feel free to schedule an appointment if you would like to meet.*

[ Optional ] Textbooks

All of Statistics by Larry Wasserman (available online)

R for Data Science by Garrett Grolemund and Hadley Wickham

Statistics by David Freedman, Robert Pisani, and Roger Purves

Natural Experiments in the Social Sciences by Thad Dunning

Computing Environment

We primarily use R (RStudio is the recommended interface),
including the plotting library ggplot2
and the data manipulation library dplyr.

Evaluation

8 homework assignments (80%)

Final project (20%)

Syllabus

Week 1: Data Exploration & Visualization

Summary statistics, data manipulation, group-wise operations, joins, principles of plotting

Week 2: Statistical Inference I

Chapter 6 of *All of Statistics*
Sampling distributions, statistical estimators, confidence intervals

Week 3: Statistical Inference II

Selected topics from Chapters 7, 8 & 9 of *All of Statistics*
Maximum likelihood estimation, method of moments, the bootstrap

Week 4: Linear Regression I

Part III of *Statistics*, and selected topics from Chapter 13 of *All of Statistics*
Correlation, simple linear regression, confidence & prediction intervals

Week 5: Linear Regression II

Selected topics from Chapter 13 of *All of Statistics*
Multiple regression, feature generation, model evaluation, normal equations

Week 6: Logistic Regression

Selected topics from Chapter 13 of *All of Statistics*
Logistic regression, multinomial logistic regression, model evaluation

Week 7: Bias-Variance Tradeoff

Overfitting, underfitting, cross-validation, regularization

Week 8: Natural Experiments & Causal Inference

Examples, regression discontinuity, Rubin causal model, instrumental variables

Week 9: Ethics & Privacy

Institutional review boards, (de-)anonymization, online tracking

Week 10: Project Presentations

Assignments

Unless otherwise stated, assignments are to be done individually.
You are welcome to work with others to master the principles and approaches used to
solve the homework problems, but the work you turn in should be your own.
Late homework will not be accepted, but your lowest homework grade will be dropped.

Assignment 0:

Due Date: Thursday, January 12, 11:59 pm PT

Complete chapters 1-6 of the online Try R tutorial
(you'll need to sign up for a free account).
After completing the tutorial, take a screenshot of the final page,
and submit it on Canvas.
In preparation for the R crash course this weekend,
install RStudio
(which in turn requires installing R).
Please also sign up for Piazza.

Assignment 1:

Due Date: Thursday, January 19, 11:59 pm PT

Exploring and visualizing data with dplyr and ggplot2 in R. Details here.

Assignment 2:

Due Date: Thursday, January 26, 11:59 pm PT

Statistical estimators and confidence intervals. Details here.

Assignment 3:

Due Date: Tuesday, February 7, 11:59 pm PT

The bootstrap, MLEs, and the method of moments. Details here.

Project proposal

Due Date: Thursday, February 9, 11:59 pm PT

In teams of 2-5 people, submit
a 2-3 page single-spaced proposal (as a PDF file) for your final
project. Clearly state your research question and potential
data sources, and outline a tentative methodology.
You are free to pursue any topic related to applied statistics.
At the end of the quarter, each team will prepare a written report
(approximately 10 single-spaced pages in length) detailing their work, and give a short in-class presentation.
To help determine the feasibility and suitability of your project,
please discuss your idea with the teaching staff before submitting your proposal.

Assignment 5:

Due Date: Thursday, February 16, 11:59 pm PT

Linear regression. Details here.

Assignment 6:

Due Date: Thursday, February 23, 11:59 pm PT

Logistic regression. Details here.

Assignment 7:

Due Date: Thursday, March 2, 11:59 pm PT

Bias-variance trade-offs, cross-validation, and regularization. Details here.

Final Project

Presentation slides (in PDF format) due on Monday, March 13, 11:59 pm PT;
please submit your slides on Canvas.
In-class presentations on Tuesday, March 14 and Thursday, March 16 (sign up)
Paper (in PDF format) due on Wednesday, March 22, 11:59 pm PT;
please submit your paper on Canvas.

In-class presentations are limited to 4 minutes, with an additional 1 minute for questions.
Your final paper should clearly state and motivate your research question, summarize
the related literature, describe your methods, detail your results
(and include the appropriate plots), and discuss the implications
of your findings. The paper should be approximately 10 single-spaced pages long.

Lectures

Lecture 1: Data Exploration

Lecture 2: Visualization

Lecture 3: Intro to Statistical Inference

Lecture 4: Confidence Intervals

Lecture 5: The Bootstrap

Lecture 6: Parametric Inference

Lecture 7: Correlation & Regression

Lecture 8: Simple Linear Regression

Lecture 9: Uncertainty in Regression

Lecture 10: Model Evaluation & Feature Generation

Lecture 11: Logistic Regression

Lecture 12: Multinomial Logistic Regression

Lecture 13: Bias-Variance Tradeoff

Lecture 14: Regularization

Lecture 15: Natural Experiments

Lecture 16: Causal Inference

Lecture 17: Ethics

Lecture 18: Privacy