Programming Assignment 6 — Project Proposal

Due: April 13th

The last assignment you will have to do in this class is a project, which will be due May 2nd. There are two phases to the project:

  1. Project Proposal: Explain to us what you will do for your project in a one-page writeup. Due April 13th.

  2. Project Submission: Write Python code and a document (as you did for previous homeworks) that perform the analyses you proposed in step 1 and discuss the results. (We will provide a new repository and writeup for the final submission after April 13th.)

You are allowed to work on this project with a partner, though you do not need to. You should feel free to re-use any code that you wrote for homeworks.

Project Proposal

In this course, we have discussed a number of different analyses you can do to reason about data, model that data, and make predictions:

  1. Exploratory modeling of data using things like histograms
  2. Reasoning about statistical differences using things like confidence intervals
  3. Building models of data using linear regression or autoregression
  4. Classifying data using supervised learning
  5. Clustering data using unsupervised learning
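
As a reminder of what item 3 looks like in code, here is a minimal sketch of fitting a line by ordinary least squares using only the standard library. The function name fit_line and the data points are made up for illustration; they are not taken from the course materials or the linked data sets.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1 (illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

On real data the points will not fall exactly on a line, so you would also want to look at the residuals on held-out data before trusting the fit.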

For your project, we are leaving things quite open-ended. Here are links to sources of interesting data, encompassing things like labeled spam email data, information about disease progression, demographic data, etc.

  1. SLS Data
  2. Elements of Statistics Data

We want you to propose analyses that can determine interesting things about this data. For example, you could use the spam data set to build a classifier to detect spam (indeed, early email spam filters were based on the naive Bayes approach we discussed in class). You could use the phoneme data set to see whether you can distinguish different phonemes, or distinguish the same phoneme uttered by different speakers (this one might be quite hard!).
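
To make the spam example concrete, here is a minimal sketch of a word-count naive Bayes classifier with add-one (Laplace) smoothing. It is one possible formulation and may differ in detail from the version presented in class. The function names and the four toy messages are hypothetical; the real spam data set has its own format and features.

```python
import math
from collections import Counter


def train_naive_bayes(messages, labels):
    """Count words per class and documents per class."""
    word_counts = {}
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(messages, labels):
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab


def predict(text, word_counts, class_counts, vocab):
    """Return the class with the highest log posterior (up to a constant)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            count = word_counts[label][word]
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy messages invented for illustration only.
messages = ["win cash now", "cheap cash prize", "meeting at noon", "lunch at noon tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(messages, labels)
print(predict("win a cash prize", *model))
print(predict("meeting tomorrow at noon", *model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.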

In your proposal, we would like you to:

  1. Pick two data sets that you will investigate

  2. Break those data sets up into training and testing data (you can do this randomly, or the data set may already be divided appropriately). Use the statistical analyses that we discussed in sampling and estimation to justify this split (or discuss why the split may not be a good one).

  3. For one (or both) data sets, tell us one type of predictive model you will build (e.g., a regression model). Explain what kind of model you will build and why you think it might be effective.

  4. For one (or both) data sets, tell us one type of classifier or clustering model you will build. Explain why you picked the approach you did, and why you think it will be effective (in the case of a classifier) or tell you something useful (in the case of a clustering approach).
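
For step 2, one way to justify a random split is to check that the training and testing portions look statistically similar. The sketch below shuffles the data with a fixed seed, holds out 20%, and computes an approximate 95% confidence interval for the difference in means between the two portions. The 80/20 ratio, the normal approximation (z = 1.96), the function names, and the synthetic values are all assumptions made for illustration; use whichever estimation technique from the course fits your data.

```python
import math
import random
import statistics


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_difference_ci(a, b, z=1.96):
    """Approximate confidence interval for mean(a) - mean(b)."""
    diff = statistics.mean(a) - statistics.mean(b)
    # Standard error of the difference of two independent sample means.
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return diff - z * se, diff + z * se


# Synthetic stand-in for one numeric feature (not real course data).
rng = random.Random(1)
values = [rng.gauss(50.0, 10.0) for _ in range(500)]

train, test = train_test_split(values)
low, high = mean_difference_ci(train, test)
print(len(train), len(test))
print(f"95% CI for difference in means: ({low:.2f}, {high:.2f})")
```

If the interval comfortably contains zero, you have no evidence that the two portions differ in that feature. If it excludes zero for an important feature, that is worth discussing in your proposal.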

You must perform at least one analysis on each of your two data sets, and you must perform at least one analysis that does modeling (#3) and one analysis that does classification or clustering (#4).

Your project writeup should be 1-2 pages. Feel free to come talk to us for suggestions on data sets to look at and ideas of what analyses to perform.

What to submit

For the project proposal, please submit a file called either proposal.doc or proposal.pdf containing your proposal writeup. If you are working with a partner, please provide your partner's name, GitHub userid, and Purdue username in the writeup. Tag your submission with the tag submission, as in previous homeworks.