STAT151A Final Project
Below are the details of your final project. As usual, these may be subject to change with notice.
1 Project outline
The final project will be an open-ended investigation into real data using tools from linear regression.
The final submission should take the form of a PDF of at most 10 pages, including figures but not including citations. We will ask that you submit reproducible code. The code will not be explicitly graded as part of the assignment, but we may use it to confirm that you actually did what you say you did.
The final project will be collected via Gradescope.
For the final project, you may form groups of up to three students. Please select your own groups. Each person may participate in only one group. All members of a group will get the same grade. Changing the group members after having your project approved will require instructor approval.
2 Key dates
Project proposals will be due at 9pm on Sunday April 14th, but may be submitted at any time before then.
To submit a proposal, please use this form. Only one member of the group needs to submit a proposal. There is a field on the form to indicate the group members. Please do not submit the same proposal multiple times.
I will review and approve project proposals by Tuesday, April 16th. If you would like to have your proposal reviewed early so you can get started sooner, I will be happy to do that; just ask.
The final projects will be due at 9pm on Sunday May 5th.
3 Project structure
A typical project should consist of
- A dataset which is openly available,
- A question or questions to be investigated with the dataset,
- An attempt to answer the question using tools from this course, and
- A critical analysis of assumptions and potential shortcomings of the analysis.
We will provide a basic R markdown template to follow.
Alternative projects
Thoughtful deviations from this template are welcome. Some possible examples are:
- Studying an advanced regression topic (e.g. double descent, regression trees, Bayesian methods) using real or simulated data
- Reproducing an existing study, then changing some of the methods to investigate the stability of the results
- Studying the numerical stability of R’s `lm` with a detailed analysis of its core linear algebra routines
Alternative proposals should still meaningfully engage with the content of the course. For example, a project that simply fits deep neural networks to some data is inappropriate, but a project that meaningfully compares a deep neural network to regression techniques on the same dataset could be appropriate.
If you want to deviate from the template, please describe your proposal in detail on the project form.
More detailed guidelines
Here are some more guidelines for the project proposals, as well as what we will look for in a good project.
Datasets
Datasets should be openly available. Furthermore, they should come with detailed and clear descriptions of how the data were collected. Ideally, the information about data collection is in the form of a published paper or study.
- Good example: The bodyfat dataset
- Bad example: The marketing dataset in An Introduction to Statistical Learning (no real information about how it was collected)
- Bad example: Proprietary data from your aunt’s internet startup (not open access)
I hope that good datasets may become part of future versions of this course.
Here are some potential places to look for datasets:
- Kaggle
- UCI ML repository
- Openintro
- Fox regression book
- IEEE Dataport (unfortunately, many cost money)
- ROS textbook
Please feel free to share other suggestions with me and with the rest of the class!
Questions
Questions should be about the real world, not framed directly in terms of statistical analyses.
- Good example: Can we increase household income by giving cash to poor families?
- Good example: Can we produce a useful predictor of bodyfat from simpler measurements?
- Bad example: Is there an association between \(x_n\) and \(y_n\) in this dataset?
- Bad example: Is coefficient \(\beta_1\) in such-and-such a regression statistically significant?
Please be clear about whether your problem is an inference or a prediction problem (or both, or neither).
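As a rough illustration of the distinction between inference and prediction, here is a minimal sketch in R. The data are simulated purely for illustration, and the variable names (`dat`, `train`, `holdout`) and the 150/50 split are assumptions of this sketch, not part of the project requirements. The same fitted model is used in two ways: to make a statement about a coefficient (inference) and to predict the response for observations not used in fitting (prediction).

```r
# Simulated data standing in for a real dataset (illustrative only).
set.seed(151)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Hold out some observations so that prediction can be assessed on unseen data.
train_idx <- sample(n, size = 150)
train <- dat[train_idx, ]
holdout <- dat[-train_idx, ]

fit <- lm(y ~ x, data = train)

# Inference: what can we say about the coefficient on x?
slope_ci <- confint(fit, "x", level = 0.95)
print(slope_ci)

# Prediction: how well do we predict y for observations not used in fitting?
holdout_pred <- predict(fit, newdata = holdout)
holdout_rmse <- sqrt(mean((holdout$y - holdout_pred)^2))
print(holdout_rmse)
```

Note that the confidence interval is only meaningful under assumptions about how the data were generated, whereas the held-out error can be computed without them. That is one reason it helps to state up front which kind of problem you are working on.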
Attempted answers
In order to attempt to answer your question with linear regression, you have to connect your real-world question with a regression analysis. Please be very clear about
- What assumptions you need to connect your regression to your question
- Whether those assumptions are reasonable
Clear thinking will be more important here than definitive answers to your question.
Critical analysis
Finally, please connect the results of your analysis to your question. Here are some of the kinds of questions you might ask:
- What is the answer to your question?
- Were you unable to answer your question because of some limitation you found in the data?
- How might you collect different data to successfully answer your question?
- What different statistical analyses might answer the question better?
- What evidence is there that your assumptions are satisfied?
- What evidence could you imagine collecting to establish that your assumptions were satisfied?
Again, clear thinking will be more important here than definitive answers to your question.