Course Description
With a vast amount of information now collected on our online and offline actions — from what we buy,
to where we travel, to who we interact with — we have an unprecedented opportunity to
study complex social systems. This opportunity, however, comes with scientific, engineering,
and ethical challenges. In this hands-on course, we develop ideas from computer science and
statistics to address problems in sociology, economics, political science, and beyond. We
cover techniques for collecting and parsing data, methods for large-scale machine learning,
and principles for effectively communicating results. To see how these techniques are applied
in practice, we discuss recent research findings in a variety of areas. This course was previously
listed as MS&E 331.
Prerequisites: An introductory course in applied statistics,
and experience coding in R or Python.
There is a $25 course materials fee for running experiments on Mechanical Turk.
Instructors
Sharad Goel (email)
Imanol Arrieta Ibarra (TA) (email)
Schedule
Class: Tuesdays & Thursdays @ 1:30 - 2:50 in
Thornton 110
Lab Section: Thursdays @ 12:30 - 1:20 in
Thornton 110
Sharad's Office Hours
Tuesdays @ 3:00 - 5:00 in
HEC 356
Imanol's Office Hours
Wednesdays @ 10:00 - 12:00 in
HEC 314
During the first week of school, office hours are by appointment only.
We use Piazza
to manage course questions and discussion. Please sign up
here.
Computing Environment
A Unix-like setup is required (e.g., Linux, OS X, or Cygwin).
We primarily use R
(R Studio is recommended)
and Python 2.7 (Anaconda Python is recommended),
including ggplot2 for visualization and
dplyr for data manipulation.
We also use Vowpal Wabbit
(a fast online learning algorithm), MALLET
(a toolkit for natural language processing), and
Amazon Elastic MapReduce
(a web service for distributed computing).
Evaluation
4 assignments (60%)
Project proposal (10%)
Final project (30%)
Syllabus
Week 1: Introduction & Visualization
Examples in computational social science, principles of visualization
Lab section: Intro to R and ggplot2
Week 2: Data Exploration & Collection
Data manipulation, validation, and streaming computation
Lab section: Data manipulation with
dplyr,
streaming in Python, and bash
Week 3: Map Reduce
The map reduce model of computation, higher-level MR operations
Supplementary Material:
Week 4: Network Analysis
Network representations, definitions, applications, empirical properties, computation
Supplementary Material:
Lab section: Web scraping
Week 5: Supervised Learning
Regression & classification, (stochastic) gradient descent, regularization
Week 6: Text as Data
Supplementary Material:
Week 7: Crowdsourcing & Online Experiments
Experimental design, Mechanical Turk
Supplementary Material:
Week 8: Natural Experiments & Causal Inference
Simpson's paradox, Rubin causal model, causal inference with observational data
Lab section: No lab section this week
Week 9: Privacy & Ethics
Institutional review boards, anonymization, online tracking
Lab section: No lab section this week
Week 10 (End-Quarter Period): Project Presentations
Each team presents for 10 minutes (8 minute talk + 2 minutes for Q&A)
Assignments
Homework assignments are to be done in groups of 2-3, and the final project in groups of 2-5.
You are strongly encouraged to form interdisciplinary teams for both the homework and project.
All group members should be involved in completing each part of the homework assignments
(i.e., think
pair programming
as opposed to divide-and-conquer).
Late assignments are subject to a 10 percentage point deduction for each day late,
up to five days, after which the assignment may no longer be accepted.
Assignment 0:
This mini-assignment should be done individually, not in groups.
Due Date: Thursday, September 24, 11:59 pm PT
Complete chapters 1-6 of the online
Try R tutorial
(you'll need to
sign up for a free account).
After completing the tutorial, please take a screenshot of the final page, and submit it
here.
Also, in preparation for Thursday's discussion section,
please install
R Studio
(which in turn requires installing
R),
and sign up for
Piazza.
Assignment 1:
Due Date: Thursday, October 8, 11:59 pm PT
Twitter API, streaming computation, dplyr, ggplot, and more. Details here.
Project Proposal
Due Date: Thursday, October 15, 11:59 pm PT
In teams of 2-5 people, write a 2-3 page proposal for your final project.
Clearly state and motivate your research question, list potential data sources,
and outline a tentative methodology.
Before submitting your proposal, discuss the
feasibility and appropriateness of the project with either Sharad
or Imanol. Office hours are a good time to do this, but we can also schedule
alternative times to meet.
The final project consists of a report (approximately 10 single-spaced pages)
and a short in-class presentation.
Please submit your proposal here.
Assignment 2:
Due Date: Tuesday, October 27, 11:59 pm PT
Why is it hard to catch a cab in the rain? Details here.
Assignment 3:
Due Date: Thursday, November 5, 11:59 pm PST
Fun with fast, online learning. Details here.
Assignment 4:
Due Date: Tuesday, November 19, 11:59 pm PST
Mechanical Turk, online surveys, and natural experiments. Details here.
Lectures
Lecture 1: Intro to Computational Social Science
Lecture 2: Visualization
Lab Sessions 1 & 2: Intro to R
An introduction to data manipulation and visualization in R with
dplyr and
ggplot2. To follow along with the tutorial,
we suggest installing
R Studio.
The example scripts are available on
GitHub.
Slides
by Jongbin Jung, Houshmand Shirani-Mehr, and Jessica Su
Lecture 3: Data Collection & Manipulation
Lecture 4: Counting at Scale
Lab Session 2: Intro to Bash and Python Streaming
Lecture 5: MapReduce
Lecture 6: Promise & Perils of "Big Data"
Lab Session 3: Amazon Elastic MapReduce
Lecture 7: Intro to Networks
Lecture 8: Network Analysis
Lab Session 4: Web Scraping
An introduction to web crawling in Python, starting from basic use cases
to scraping AJAX-generated pages.
To work through the examples, we suggest installing
Anaconda Python,
Beautiful Soup,
and Selenium.
The example scripts are available on
GitHub.
Slides
by Jongbin Jung and Houshmand Shirani-Mehr.
Lecture 9: Supervised Learning I
Lecture 10: Supervised Learning II
Lab Session 5: Intro to Vowpal Wabbit
Lecture 11: Text as Data
Lab Session 6: Intro to MALLET and NLTK
Lecture 13: Crowdwork
Lab Session 7: Amazon Mechanical Turk
Assignment 1
In this assignment, you'll play with the public Twitter API to investigate:
(1) how the number of tweets varies over the course of the day;
and (2) the time trend for a specific topic of your choice.
The goal is to learn about APIs, streaming computation, data manipulation, and plotting.
This is a fairly involved assignment, so please start early.
Step 1.
The Twitter Streaming API
provides access to a small random sample of public tweets as they are generated,
and also lets you query for tweets that match a user-provided list of keywords.
We will use both of these features. To get started,
create a Twitter application
that you'll use to access the API. (The application will only be used by you,
so what you enter for the name, description and website are not terribly important.)
After creating your application, select the "Keys and Access Tokens" tab, and then
create an access token. You'll need to save four pieces of information from this page:
the Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret.
If you need to access or update your application information at a later date, you can
return to it via the management console.
Step 2.
Install Tweepy, a Python wrapper for
the Twitter API. The easiest way to install it is with pip:
pip install tweepy
If you don't have root permissions (e.g., on corn.stanford.edu), you can install to your home directory with the
"--user" option:
pip install --user tweepy
Step 3.
To help you get started with Tweepy and the Twitter Streaming API, we wrote a short Python script,
tweet_stream.py. The script requires that you create a file
with your API credentials in this format,
where you replace the parameters with your own credentials. Assuming you save the file as creds.txt in
your working directory, you can run the script with the command:
python tweet_stream.py --keyfile creds.txt
(Enter Control-C to terminate the script.)
The output of the above call is a near real-time
JSON stream of randomly selected tweets.
To save the results, you can redirect the output to a file like this:
python tweet_stream.py --keyfile creds.txt > tweets
However, as the API returns a fair amount of data, it's more efficient to store a compressed version of the data.
The script lets you store a gzipped version of the results as
follows:
python tweet_stream.py --keyfile creds.txt --gzip tweets.gz
You can view the compressed data with zless,
zcat, or similar utilities (e.g., gzcat on OS X).
In addition to simply returning a random sample of tweets, the Twitter API lets you fetch
all tweets that contain a specific word.
For example, via tweet_stream.py, you can obtain the stream of all tweets that
contain "stanford" (where the matching is case-insensitive) with the command:
python tweet_stream.py --keyfile creds.txt --filter stanford
(The API limits the number of results it will return, so if your filter term is too common,
you'll only get a subset of the tweets.)
After playing with the API, start collecting a random sample of tweets, and also collect a
filtered set of tweets matching a term of your choosing. Select a term that
you believe will show an interesting time trend. Collect at least 24 hours' worth of tweets.
To prevent your command from terminating when you disconnect from the session, you can use a
utility such as screen,
tmux, or nohup.
For example, with nohup, you can run the command in the background with:
nohup python tweet_stream.py --keyfile creds.txt --gzip tweets.gz &
Note that you are only allowed to connect to
at most one
streaming API endpoint at a time,
and attempting to connect to more than one simultaneously may result in some of the streams
being terminated. However, your teammates can each use their own credentials to poll the API in
parallel.
To terminate a script that's running in the background, first find its
process ID with:
ps -u<user-id>
where you replace <user-id> with your actual user ID. Then kill it with:
kill <pid>
where <pid> is the process ID you found with ps.
When using the
corn.stanford.edu machines,
there are
a few things to be aware of: (1) For jobs that run for more than about 24 hours, you need to first run the
command "keeptoken",
and then follow the directions. (2) ssh'ing into
corn.stanford.edu will redirect you to a specific corn machine (e.g., corn20.stanford.edu) depending on load.
When you start your process, it will be running on that specific machine, so to return to it later you'll
need to keep track of the machine and then directly ssh into it.
(3) Your home directory is not
big enough to store the tweet file, so write it to the /tmp directory with a command like this:
nohup python tweet_stream.py --keyfile creds.txt --gzip /tmp/tweets.gz &
Replace "tweets.gz" with a unique name so that no one accidentally writes over
your file. Also note that the /tmp directories are local to each specific corn machine
(i.e., they are not network directories), so you need to log back into the specific machine
to view it later. (4) If a file in the /tmp directory is not accessed for 24 hours, it
may automatically be deleted. So when your script has terminated, copy the results to a permanent
location (e.g., your own computer).
Step 4.
Each line of the output from the above commands is in JSON, and before you can analyze the results,
you need to convert the data to a more suitable format. Write a Python script (named parse_tweets.py)
that takes a stream of tweets as input
(via stdin),
and writes tab-separated output (to stdout) where each row corresponds to a tweet
and the three columns of the output are date, time rounded to the nearest 15-minute interval (e.g., 18:30),
and timezone. When generating the output, restrict
to tweets where the user has specified one of the four Twitter-normalized U.S. timezones
(e.g., "Eastern Time (US & Canada)"). Output the date and time in Pacific time.
The json module is useful for parsing the input,
as is pytz for dealing with time conversions.
Ensure that your script is not memory intensive (i.e., convert the data in a streaming fashion).
Your script should run with the following command, writing the results to stdout:
zcat tweets.gz | python parse_tweets.py
Using your parse_tweets.py script, parse the random sample of tweets and also the filtered set of tweets,
and save them to two separate files.
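As a sketch of how parse_tweets.py might be structured (the field names follow the Twitter JSON schema, but the fixed UTC-7 offset below is a simplification for illustration; your actual script should use pytz so that daylight saving time is handled correctly):

```python
import datetime
import json

# The four Twitter-normalized U.S. timezones (assumed spellings).
US_TIMEZONES = set([
    "Eastern Time (US & Canada)", "Central Time (US & Canada)",
    "Mountain Time (US & Canada)", "Pacific Time (US & Canada)",
])

def round_to_quarter_hour(dt):
    """Round a datetime to the nearest 15-minute mark."""
    extra = datetime.timedelta(minutes=dt.minute % 15,
                               seconds=dt.second,
                               microseconds=dt.microsecond)
    dt -= extra
    if extra >= datetime.timedelta(minutes=7, seconds=30):
        dt += datetime.timedelta(minutes=15)
    return dt

def parse_line(line):
    """Return one tab-separated (date, time, timezone) row, or None."""
    try:
        tweet = json.loads(line)
        tz = tweet["user"]["time_zone"]
        created = datetime.datetime.strptime(
            tweet["created_at"], "%a %b %d %H:%M:%S +0000 %Y")
    except (ValueError, KeyError, TypeError):
        return None  # skip deletion notices and malformed records
    if tz not in US_TIMEZONES:
        return None
    # Simplification: fixed UTC-7 offset; use pytz for DST-correct
    # conversion to Pacific time in the real script.
    local = round_to_quarter_hour(created - datetime.timedelta(hours=7))
    return "\t".join([local.strftime("%Y-%m-%d"),
                      local.strftime("%H:%M"), tz])

# In parse_tweets.py, wrap this in a loop over sys.stdin, printing each
# non-None row to stdout, so the data are converted in a streaming fashion.
```

Processing one line at a time keeps memory usage constant regardless of how many tweets you collected.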
Step 5.
Use ggplot2 in R to plot the volume of tweets over time,
with separate lines indicating the volume in each timezone. To generate the plot, first use
dplyr
to compute the volume of tweets
in each 15-minute interval in your data (separately for each timezone).
Make two plots: One for the random sample of tweets, and one for
the filtered set of tweets. Save your plot generation script as tweet_analysis.R.
Your script should be completely self-contained (e.g., load all necessary libraries),
and not contain any extraneous calculations.
In particular, if you run the script in a directory that contains the data files,
it should read in the data and output the plots in that same directory without any additional setup or user intervention.
Here are a few suggestions for making plots in R.
First, use a white background (the default is gray), which you can produce
by setting the appropriate ggplot2 theme at the top of your script:
theme_set(theme_bw())
Second, use the scales package to
format plot labels (e.g., to format numbers).
Third, for reports, save plots in PDF
(a vector format) rather
than PNG or JPEG
(raster formats).
Finally, explicitly set the height and width of plots to get the appropriate aspect ratio (this also has the
added benefit of helping to ensure the axis labels are appropriately sized). For example,
for a square plot you might use something like this:
ggsave(plot=my_plot, file='my_plot.pdf', width=5, height=5)
Step 6.
Prepare a short report detailing your results. The report should include the
two plots you generated along with a brief description of what you found. You should also discuss
any limitations of the methodology. Submit a single compressed tar file (tar.gz) that contains:
(1) parse_tweets.py; (2) tweet_analysis.R;
(3) the two files generated in Step 4 (i.e., the cleaned tweet data);
and (4) your report, as a single PDF file.
Assuming your files are in a folder called hw1, you can generate the tarball with the command:
tar -cvzf hw1.tar.gz hw1/*
Please submit your work here. Only one
member from the team should submit the final files (but ensure that all team members are
listed on your report).
Assignment 2
Why is it so hard to catch a cab in the rain?
Demand for taxis almost certainly increases, which explains at least part of the story.
Behavioral economics offers another intriguing possibility.
Cab drivers might be target earners, stopping for the day only when
they have reached their personal income target (e.g., $150).
When it rains, the theory posits, cab drivers hit their targets sooner (because more people are riding cabs)
and accordingly stop
earlier in the day, decreasing the supply of taxis.
Indeed,
Camerer, Babcock, Loewenstein, and Thaler (1997)
find evidence for this explanation.
Though plausible, this target earnings theory is at odds with neoclassical economics,
which predicts drivers would work more on days during which they can make more money per hour
(since, over their lifetimes, this strategy would allow them to accrue more money for a fixed amount of
time spent working). Consequently, if taxi drivers earn higher hourly wages when it rains,
this reasoning suggests the supply of taxis would increase, tempering, though perhaps not fully
offsetting, the increase in demand.
In recent work, Princeton economist Henry Farber looks at
a random subset of New York City taxi rides in 2009-2013, and finds evidence for the
neoclassical prediction.
In this assignment, we'll use Amazon's Elastic MapReduce service to replicate and extend Farber's
analysis to the full set of 700 million NYC taxi rides in 2010-2013
(the 2009 data are no longer publicly available).
If you haven't already, please apply for AWS Credits
so that you can complete the assignment without paying for the Amazon services out of pocket.
Also, create an S3 bucket in the
"US Standard" region to save your results and intermediate output files.
Be sure that your AWS region is set to "US East (N. Virginia)" when you start an Elastic MapReduce cluster.
This is important for both computation time and expense since the data reside in Virginia.
Step 1.
Read Camerer's (short and engaging) explanation
of the target earners theory;
also read Sections 1, 3, and 4 of Farber's paper.
Step 2.
We'll be using the complete dataset of
all 700M trips taken by NYC yellow cabs in 2010-2013, over 100GB of data in total.
As documented here,
the dataset is segmented into two parts:
one that contains trip origination and destination details, and a second that contains fare information.
For your convenience, we've uploaded all the data to an Amazon S3 bucket: s3://stanford-mse-231/yellowcab/.
We've also uploaded a small sample of the data to s3://stanford-mse-231/yellowcab-test/.
For testing your code, it's often useful to work with a small slice of the data.
Start by downloading the January trip
and fare data.
If you're working on the corn.stanford.edu machines, you can use wget to download the files.
For example:
wget https://5harad.com/data/nyctaxi/2013_trip_data_1.csv.gz
Even though this is just a sample of the full year's worth of data, it is still pretty big
and cumbersome to work with.
Write a Python script that pulls out all the trip information for a single day in January;
make sure you pull out the same trips in both data files.
Include the header in your output, so that your test files have the same structure as the original ones.
Finally, ensure that your script is not memory intensive
(e.g., do not load the entire dataset into memory before filtering it).
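One way to sketch the streaming filter (the column position of the pickup datetime is an assumption here; check the header of the actual files):

```python
import sys

def filter_day(lines, day, date_field):
    """Stream CSV lines, yielding the header plus rows whose pickup
    datetime (at position `date_field`) starts with `day`
    (e.g., '2013-01-15'). No line is held in memory longer than needed."""
    lines = iter(lines)
    yield next(lines)  # keep the header so test files match the originals
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields[date_field].startswith(day):
            yield line

# e.g.: for line in filter_day(sys.stdin, "2013-01-15", 5):
#           sys.stdout.write(line)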
Step 3.
Next we need to join the trip and fare datasets into a single, clean dataset such that each row
has all the information about the trip. Given the size of the datasets, we'll use
Amazon EMR.
Write two Python scripts, join_map.py and join_reduce.py, to carry out the join
in MapReduce.
As noted here, the data contain a variety
of errors.
In the reduce step, you can check for obviously corrupt records and only output those that
appear to be reasonable. As with all data analysis, there isn't a single best way to do this, so
just use your judgement.
Before attempting to join the full datasets, test your scripts on the single day of
trips that you generated in Step 2. As discussed in class, you can simulate MapReduce locally with:
cat trip.csv fare.csv | ./join_map.py | sort | ./join_reduce.py > join.tsv
where we assume trip.csv and fare.csv are the subsets of data for a single day.
Once you're confident your scripts are working locally, use EMR to join the test sample at
s3://stanford-mse-231/yellowcab-test/. Finally, use EMR to join the full set of 700M trips.
Be sure that your AWS region is set to "US East (N. Virginia)" for all EMR computation.
Also be sure to include the Python 2.7 shebang line
at the top of your Python scripts:
#!/usr/bin/env python2.7
A common mistake is to assume the input is streamed into the mappers in a specific order
(e.g., that the header appears at the top of the file).
While this is true in the local version above, it is not true for the real, distributed version of MapReduce.
(Indeed, in a distributed system, it is not even clear what it would mean to say the lines are ordered.)
Similarly for the reduce phase, all you can assume is that lines with the same key are adjacent in the stream,
but otherwise there is no structure to the order of input. One helpful test is to permute the order of the input
and then check if your scripts still work. You can permute the rows like this:
cat trip.csv fare.csv | shuf > permuted_data.csv
Now you can run your scripts on the permuted data:
cat permuted_data.csv | ./join_map.py | sort | ./join_reduce.py > join.tsv
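The join logic might be sketched like this (the assumption that the first three fields jointly identify a trip in both files is ours; verify it against the data documentation, and note that a real reducer should also tag which source each value came from, e.g., by field count):

```python
def join_map(line, n_key_fields=3):
    """Emit a tab-separated (key, value) pair, assuming the first
    three fields (e.g., medallion, hack license, pickup datetime)
    identify the trip in both the trip and fare files."""
    fields = line.rstrip("\n").split(",")
    return ",".join(fields[:n_key_fields]) + "\t" + ",".join(fields[n_key_fields:])

def join_reduce(sorted_lines):
    """Join records that share a key. After the sort phase, lines with
    the same key are adjacent, but otherwise in no particular order."""
    prev_key, values = None, []
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != prev_key:
            if prev_key is not None and len(values) == 2:
                yield prev_key + "," + ",".join(values)
            prev_key, values = key, []
        values.append(value)
    if prev_key is not None and len(values) == 2:
        yield prev_key + "," + ",".join(values)
```

Keys that appear exactly twice (one trip record, one fare record) are emitted; anything else is dropped as ambiguous or corrupt, which is one simple way to exercise the judgment the reduce step calls for.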
Step 4.
For each driver (identified by his or her anonymized hack license) and for each hour of each day in the dataset,
we're going to compute a variety of statistics regarding supply, demand, and earnings, as described below.
- t_onduty: the total amount of time (in units of hours) that the driver is on-duty during the hour.
There is not a perfect way to infer this from the data, but we will
assume that if a cab is unoccupied for at least 30 minutes,
then the driver is not on duty (e.g., the driver is taking a break or is between shifts) during that
unoccupied stretch.
- t_occupied: the total amount of time with passengers in the cab during the hour.
- n_pass: the total number of passengers picked up during the hour.
- n_trip: the total number of trips started during the hour.
- n_mile: the total number of miles traveled with passengers in the hour.
For trips that cross an hour boundary, assume the driver traveled at a constant speed for the
duration of the trip.
- earnings: the total amount of money the driver earned in that hour.
As with mileage, for trips that cross an hour boundary, assume drivers earn the final payment
at a constant rate throughout the trip.
Earnings consist of the fare plus the tip. Unfortunately, cash tips are not recorded in the data,
so this will underestimate total earnings.
The schema of the resulting output is thus:
date, hour, hack, t_onduty, t_occupied, n_pass, n_trip, n_mile, earnings
(Remember that hack IDs are re-anonymized each year.)
Write two scripts, driver_stats_map.py and driver_stats_reduce.py, to
compute the desired output via MapReduce.
Before running on the full dataset in EMR, be sure to test your scripts locally on the small joined dataset
you created in Step 3.
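The constant-rate assumption for trips that cross hour boundaries might be implemented with a helper like this (a sketch; the same function can apportion time, miles, or earnings):

```python
import datetime

def apportion_by_hour(start, end, total):
    """Split `total` across the clock hours a trip spans, assuming a
    constant rate over the trip. Returns (hour_start, share) pairs."""
    duration = (end - start).total_seconds()
    shares = []
    cur = start
    while cur < end:
        hour_start = cur.replace(minute=0, second=0, microsecond=0)
        seg_end = min(hour_start + datetime.timedelta(hours=1), end)
        frac = (seg_end - cur).total_seconds() / duration
        shares.append((hour_start, frac * total))
        cur = seg_end
    return shares
```

The mapper can emit one record per (date, hour, hack) share, and the reducer then sums shares within each key.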
Step 5.
Write a MapReduce job to aggregate the quantities computed in Step 4 by date and hour.
The schema of the output is thus:
date, hour, drivers_onduty, drivers_occupied, t_onduty, t_occupied, n_pass, n_trip, n_mile, earnings
where drivers_onduty is the number of drivers who were on duty for at least one minute during the hour,
and drivers_occupied is the number of drivers who were occupied for at least one minute during the hour.
The remaining quantities are the sums of the analogous ones in Step 4.
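The reduce-side aggregation could be sketched as follows (in-memory here for clarity; the MapReduce version processes one (date, hour) key at a time):

```python
def aggregate_hours(rows):
    """Aggregate per-driver-hour records from Step 4 by (date, hour).
    Each row: (date, hour, hack, t_onduty, t_occupied, n_pass, n_trip,
    n_mile, earnings). 'At least one minute' is 1/60 of an hour."""
    one_minute = 1 / 60.0
    totals = {}
    for date, hour, _hack, t_on, t_occ, n_pass, n_trip, n_mile, earn in rows:
        agg = totals.setdefault((date, hour),
                                [0, 0, 0.0, 0.0, 0, 0, 0.0, 0.0])
        agg[0] += 1 if t_on >= one_minute else 0   # drivers_onduty
        agg[1] += 1 if t_occ >= one_minute else 0  # drivers_occupied
        agg[2] += t_on
        agg[3] += t_occ
        agg[4] += n_pass
        agg[5] += n_trip
        agg[6] += n_mile
        agg[7] += earn
    return totals
```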
Step 6.
At this point the aggregated data from Step 5 is small enough that we can efficiently work
with it in R. The next step is to join the taxi data with hourly precipitation for the
Central Park weather station, available from
NOAA.
You can download the precipitation data here, and
the documentation here.
After you've joined the data (in R), the schema of the output should be:
date, hour, precip, drivers_onduty, drivers_occupied, t_onduty, t_occupied, n_pass, n_trip, n_mile, earnings
where precip is the amount of precipitation (in inches) during each hour.
Step 7.
Now that all the necessary data are compiled into a single, manageable dataset, you can investigate the
effect of precipitation on hourly wages for taxi drivers, and the supply of and demand for taxis.
At this point in the analysis,
dplyr
is quite useful.
As a first step, plot the average hourly wage of taxi drivers throughout the day,
both when it rains and when it doesn't. Specifically, create a single plot where the
x-axis corresponds to the 24 hours of the day, the y-axis is average hourly wage, and there
are separate lines for periods during which it rained and which it did not.
("Rain" is being used as shorthand for "precipitation", which may include both rain and snowfall.)
Carry out similar analyses for supply and demand.
In particular, explore various ways to quantify supply and demand
(there are multiple reasonable approaches), with the goal of understanding why
it's harder to catch a cab in the rain.
Step 8.
Prepare a short report detailing your methodology, results, and conclusions.
How do your results compare to those of Farber and of Camerer et al.?
Your report should tell a coherent story, and not simply be a collection of plots you
generated in the course of exploring the dataset.
Submit a single compressed tar file (tar.gz) that contains:
(1) your report as a PDF file;
(2) all your R and Python scripts;
(3) a README file that documents what each script does;
(4) a TSV file containing the data you generated in Step 6
(write.table is useful for
converting a data.frame in R to a TSV).
As always, your scripts should be clearly written, commented, and self-contained so that
we can easily run them to reproduce your analysis.
Please submit your work here. Only one
member from the team should submit the final files (but ensure that all team members are
listed on your report).
Assignment 3
This assignment is divided into two independent parts, as described below.
Part I.
Michel et al. analyzed a corpus of
five million books to quantitatively study cultural trends,
and White et al. mined
web search queries to detect drug interactions.
If you had access to the full digitized text of every book ever written
and/or the full log of search queries,
what scientific questions would you ask?
Write a short, 2-3 page paper addressing this topic.
Be sure to discuss the benefits and downsides of such data sources over traditional ones for
answering the scientific question(s) you propose.
Part II.
In this part of the assignment, you'll use stochastic gradient descent
to classify news stories based on the article text.
Step 1.
Download and install vowpal wabbit (VW), a fast online
learning algorithm. On the corn machines, you can simply clone the git repository:
git clone git://github.com/JohnLangford/vowpal_wabbit.git
Then, to build vw, enter the vowpal_wabbit directory and type:
make
This will compile a vw executable at vowpal_wabbit/vowpalwabbit/vw. In the examples below, the program
is invoked as ./vw (without the full path); when you run the commands, be sure to either include the
full path, add vw to your PATH, or run them from the appropriate directory.
You can learn more about vw here.
Step 2.
Download
the complete set of New York Times article metadata from January 1, 2014 to October 1, 2014.
The dataset was compiled via the
New York Times Article Search API,
using this script. (If you're interested in
running the script you'll need to
request
an API key, though that is not necessary for this assignment.)
Each line in the article dataset corresponds to a "page" of results (i.e., a set of up to 10 articles).
The first two columns specify the date of the articles and the page set, and
the third column contains JSON-formatted metadata for the articles.
Step 3.
Set up a binary classification problem. First select a news category (e.g., sports, finance, national, etc.)
that you wish to infer for each article.
(The classification for each article is listed in the "news_desk" field.)
Then determine what features you'll use for the inference. The full article text is not available, but
you can use various metadata (e.g., the article abstract or first paragraph). Think about how best to
turn that information into features (e.g., how to deal with capitalization and punctuation). Finally,
write a Python script called vw_format.py that takes the article data as streaming input and outputs (to stdout)
text formatted for vw.
Note that vw is very particular about the formatting, so pay close attention to that. Specifically,
for logistic regression, the labels must be +1 and -1.
To facilitate model evaluation, "tag" each example with the class label (see the
formatting
documentation for details.) Your script should run with the following command:
zcat nyt_articles.tsv.gz | python vw_format.py > vw_examples.txt
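One example-formatting helper might look like this (the tokenization is one of many reasonable choices, and the label/tag scheme here is illustrative; see the vw formatting documentation for the full grammar):

```python
import re

def vw_example(label, tag, text):
    """Format one example for vw logistic regression: the label must
    be +1 or -1, the tag (prefixed with a quote) lets us recover the
    true class at evaluation time, and features follow the pipe."""
    # vw treats ':', '|' and whitespace specially, so keeping tokens
    # purely alphabetic sidesteps escaping issues (a design choice).
    tokens = re.findall(r"[a-z]+", text.lower())
    return "%s '%s | %s" % ("+1" if label == 1 else "-1",
                            tag, " ".join(tokens))
```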
Step 4.
Divide the vw-formatted examples into an 80% training set and a 20% test set. Determine whether it
is better to randomly partition the data, or to split chronologically.
Step 5.
Train the model. You are free to explore various vw options when training your model
(e.g., the learning rate and number of passes),
but the basic logistic regression command, with the default parameter settings,
should yield reasonably good results:
./vw -d training_data.txt -f predictor.vw --loss_function logistic
where predictor.vw is the name of the file where the fitted model is saved.
(Note how quickly the model is fit!)
Step 6.
Generate model predictions for both the test and training sets:
./vw -d test_data.txt -t -i predictor.vw -p test_predictions.txt
./vw -d training_data.txt -t -i predictor.vw -p training_predictions.txt
Step 7.
Write an R script, called model_evaluation.R,
to assess performance of the model. Compute the ROC curve, AUC, precision, recall, and accuracy, both
on the training and on the test sets.
The ROCR package is useful
for computing the ROC curve and AUC.
Also generate a calibration plot, similar to the one we
discussed for stop-and-frisk.
(The stop-and-frisk plot is on a log-log scale, but a standard, linear scale is probably
best in the news classification case.)
Recall that the calibration plot helps to assess whether
among events that the model states are X% likely, X% in fact occur. For example,
among news articles that a sports classifier says are 70% likely to be sports, what percent
are in fact sports? To generate the plot, first round the model predictions to an appropriate
level (e.g., to the nearest 5 or 10 percentage points). For each of the resulting bins of rounded
predictions, compute: (a) the average model prediction; and (b) the empirical
frequency of "positive" outcomes. A scatter plot of these two statistics across bins indicates
the model's calibration, with points closer to the diagonal corresponding to better calibration.
To see the distribution of model predictions, it is useful
to size the points by the total number of events in each bin. Note that the vw predictions
are on the logit scale, so you will need to first transform them to the probability scale
before constructing the plot.
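The binning logic, sketched in Python (your model_evaluation.R would do the analogous computation; the 0.1 bin width is just an example):

```python
import math
from collections import defaultdict

def calibration_table(logits, labels, bin_width=0.1):
    """Bin vw's logit-scale predictions and compare the average
    predicted probability to the empirical positive rate per bin."""
    bins = defaultdict(list)
    for logit, label in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-logit))    # logit -> probability
        b = round(p / bin_width) * bin_width  # round to nearest bin
        bins[b].append((p, label))
    table = []
    for b in sorted(bins):
        preds = [p for p, _ in bins[b]]
        pos = sum(1 for _, y in bins[b] if y == 1)
        table.append((round(b, 2),
                      sum(preds) / len(preds),   # average prediction
                      pos / float(len(preds)),   # empirical frequency
                      len(preds)))               # bin size, for point sizing
    return table
```

Plotting average prediction against empirical frequency, with points sized by the bin counts, gives the calibration plot described above.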
Finally, inspect the model coefficients, with the vw-varinfo command
(the vw-varinfo executable can be found in the vowpal_wabbit/utl directory):
utl/vw-varinfo --loss_function logistic -d training_data.txt
Step 8.
Prepare a 1-3 page report detailing your results and analysis decisions.
Include a few high positive and negative weight model coefficients, and discuss whether
or not they seem reasonable.
Submission.
Submit a single compressed tarball (tar.gz) that contains:
(1) your report (as a PDF file) from Part I of the assignment; (2) vw_format.py; (3) model_evaluation.R;
and (4) your report (as a PDF file) from Part II of the assignment.
As usual, only one member from your team should submit the final files.
Assignment 4
This assignment is divided into two independent parts, as described below.
Part I.
In this part of the assignment you'll conduct a survey on Mechanical Turk, and statistically correct
for potential sample biases. Before beginning, please review the
guidelines for research
on Mechanical Turk.
Step 1.
To familiarize yourself with Mechanical Turk, create a
worker account and complete some HITs.
Step 2.
Design a short, 4-question survey.
The first two questions should ask for the respondent's sex and age.
The next two should elicit the respondent's attitudes or behaviors on a subject
of your choosing. Try to select questions for which you expect systematic
variation by age and gender.
Be sure to avoid sensitive or potentially offensive questions, and
do not ask for any
personally identifiable information
(e.g., the respondent's email address).
Step 3.
Sign up for a requester account, so you can submit your
survey.
Step 4.
Start setting up the project by going to the "create tab" and clicking on the
"new project" link. Select the survey template and click "create project".
Edit the HIT properties. You are free to select the payment rate as you see fit,
though $0.10 is common for a short survey. (If you find that it is taking too long to get responses,
you may need to increase the payment amount.) Initially, set the number of HITs to 20, so you can
conduct a trial run. Set the time allotted per assignment to 2 minutes, the HIT expiration to 2 days,
and auto approval to 1 day.
Under the advanced tab, set "custom worker requirements", and remove the Master worker requirement.
Also add a requirement that workers reside in the United States.
Go to the "Design Layout" step next, and build the survey. You can click the "source" button
to see and edit the raw HTML code for the survey. You should be able to copy-and-paste sections
of the template to build your own survey. Ensure that the "name" and "value" properties are appropriately
set. If they are not, your data will be corrupt! You can read about the basics of HTML forms
here.
Step 5.
Preview the HIT, and confirm that you're only running a small trial;
the total cost should be about two dollars. Then publish the batch.
You can watch the progress of your task under the "Manage" tab, and you can also
view the individual responses as they come in.
Check that the results from the trial run are reasonable, and make any necessary changes.
A common error is mislabeling a response value, which leads to data corruption. If necessary,
run a second, small trial run.
When you are satisfied with the survey design and implementation, run a batch of 200.
(The total cost should be about $20, and is part of the course materials fee.)
Step 6.
Download the results as a CSV file, and analyze them in R. Compute and plot
the age and gender distributions, and analyze the (unadjusted) answers for the
2 substantive questions. Generate adjusted estimates by
poststratifying the responses by gender.
Finally, generate adjusted estimates by poststratifying by both age and gender.
The age-by-gender distribution in the United States is available from the
Census.
You'll need to first bin the Mechanical Turk workers into age categories, and likely
your categories will have to be the union of 2 or more Census categories to avoid
cells with very low numbers of individuals.
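The poststratified estimate itself is just a reweighted average of within-cell means; as a sketch (the cell names and shares below are hypothetical placeholders, not real Census figures):

```python
def poststratify(cell_means, pop_shares):
    """Reweight within-cell sample means by each cell's share of the
    target population (shares should sum to 1). Cells with no
    respondents must be collapsed with neighbors before this step."""
    return sum(cell_means[cell] * share
               for cell, share in pop_shares.items())
```

With age-by-gender cells, cell_means would hold the mean response within each (age bin, gender) group of Turkers, and pop_shares the corresponding Census population shares.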
Part II.
Identify a plausible natural experiment
that could be used to answer a social scientific question of your choice.
You do not need to carry out any data analysis,
but the natural experiment you describe should be based on a specific instance of
as-if randomization that does in fact exist in the real world. That is,
it should be possible in theory to carry out your proposal. What
are the assumptions your approach relies on? What are the
advantages and limitations of your proposal?
Submission.
Prepare a 2-3 page report detailing:
(1) the HITs you completed in Part I, Step 1;
(2) the results from your survey and your analysis decisions;
and (3) your proposal from Part II.
Submit a single PDF file here.