STAT151A Final Project


Below are the details of your final project. As usual, these may be subject to change (with notice).

1 Project outline

The final project will be an open-ended investigation into real data using tools from linear regression.

The final submission should take the form of a PDF of at most eight pages, including figures but not including citations. We will ask that you submit reproducible code. The code will not be explicitly graded as part of the assignment, but we may use it to confirm that you actually did what you say you did.

The final project will be collected via Gradescope.

Datasets

The project will analyze one of the datasets described and linked to in the first section of the datasets page:
Housing, Microcredit, Teaching, Bodyfat, or Spotify.

If you have an interesting, open-access dataset that you would like to make available to the whole class, I am open to considering it! Please reach out well before HW 5 is due.

Groups

For the final project, you must form groups of three students. Please select your own groups. Each person may participate in only one group. All members of a group will get the same grade. Changing the group members after having your project approved will require instructor approval.

Some groups need to have only two people!

The class size does not divide evenly into groups of three, so there will need to be two groups of two people. If you would like to submit a group of two, you are welcome to do so, but if more two-person groups are submitted than are needed to divide the class, we will

  • Randomly select the needed number of two-person groups and then
  • Require the rest of the groups to re-form into groups of three on their own.

If you have a better idea for how to do this fairly, please let me know.

Key dates

  • You must submit your group members and dataset by Monday, Nov 11th at 9pm.
    • Please use the Gradescope assignment “Final project group proposals” to submit your groups.
  • You must submit a first draft by Monday, Nov 25th at 9pm.
  • You must submit a final draft by Monday, Dec 9th at 9pm.

2 Grading

(Subject to change)

We will spend the class of Tuesday, Nov 26th providing critical feedback on each other’s first drafts. Your final project grade will consist of:

  • 10% peer grades of your first draft
  • 10% peer grades of your critical feedback on their first draft
  • 80% final draft

I reserve the right to adjust and inflate the peer grades as needed if there seem to be systematic biases. I will not decrease the peer grades.

3 Project structure

A typical project should consist of

  • A real-world question or questions to be investigated with the dataset,
  • An attempt to answer the question using tools from this course, and
  • A critical analysis of assumptions and potential shortcomings of the analysis.

We will provide a basic R markdown template to follow.

For the whole project, clear and critical thinking is what matters. There are no extra points for positive results, nor are there extra points for the most sophisticated methods (unless they are motivated by the problem). There are extra points for novel or creative perspectives and ideas (when justified by careful reasoning).

Questions

Questions should be about the real world, not framed directly in terms of statistical analyses.

  • Good example: Can we increase household income by giving cash to poor families?
  • Good example: Can we produce a useful predictor of bodyfat from simpler measurements?
  • Bad example: Is there an association between \(x_n\) and \(y_n\) in this dataset?
  • Bad example: Is coefficient \(\beta_1\) in such-and-such a regression statistically significant?

Here are some things to think about to make sure it’s a real-world question, and not a purely statistical question:

  • Different answers to the question would lead you to do something differently in the real world. (Maybe think about what you’d do differently and incorporate that into your question.)
  • Your question should, in principle, be possible to answer with datasets, real or hypothetical, other than the one you have. Some datasets might provide better answers, some worse.
  • You should be able to imagine an ideal (but maybe practically impossible) experiment which would give you a perfect answer to your question.
  • You should be able to articulate your question and its possible answers to someone who doesn’t know any statistics at all.

Please be clear about whether your problem is an inference or a prediction problem (or both, or neither).

Attempted answers

In order to attempt to answer your question with linear regression, you have to connect your real-world question with a regression analysis. Please be very clear about

  • What assumptions you need to connect your regression to your question
  • Whether those assumptions are reasonable

Clear thinking will be more important here than definitive answers to your question.
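
As a rough illustration of what the regression side of this connection can look like, here is a minimal sketch in Python. It is not the course template. The data are simulated stand-ins, the variable names are hypothetical, and the standard errors assume homoskedastic, independent errors, which is exactly the kind of assumption you would need to state and defend in your report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data (hypothetical, not one of the course datasets):
# n observations of a single regressor x and a response y.
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])

# Ordinary least squares estimate of the coefficients.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance estimate and standard errors.
# ASSUMPTION: errors are independent with constant variance.
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])
cov_hat = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_hat))

# Approximate 95% intervals using the normal critical value 1.96.
ci_lower = beta_hat - 1.96 * se
ci_upper = beta_hat + 1.96 * se

for name, b, lo, hi in zip(["intercept", "slope"], beta_hat, ci_lower, ci_upper):
    print(f"{name}: estimate {b:.3f}, approx. 95% interval ({lo:.3f}, {hi:.3f})")
```

The interval for the slope only speaks to your real-world question if the assumptions linking the two hold, for example that the regressor was not chosen in a way that depends on the response. The numbers alone do not establish that.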

Critical analysis

Finally, please connect the results of your analysis to your question. Here are some of the kinds of questions you might ask:

  • What is the answer to your question?
  • Were you not able to answer your question due to some limitation you found in the data?
  • How might you collect different data to successfully answer your question?
  • What different statistical analyses might answer the question better?
  • What evidence is there that your assumptions are satisfied?
  • What evidence could you imagine collecting to establish that your assumptions were satisfied?

Again, clear thinking will be more important here than definitive answers to your question.

4 Tentative Rubric

The final project will be graded on the basis of the following criteria.

  • Real-world question
    • \(\star\) The question is framed in terms that someone who knows no statistics can understand
    • \(\star\) It is clear whether the question or use case is prediction or inference
    • \(\star\) The real-world question is clearly addressed (it need not be answered definitively, but it should be clear what the data can and cannot tell us about the question)
    • The motivation or importance of the question or use case is especially clear
    • Future actions based on the statistical analysis are well-articulated
  • Application of statistical concepts
    • \(\star\) The authors carefully checked the dataset for missing or unusual values and cleaned the data as appropriate
    • \(\star\) The authors appropriately quantified the statistical uncertainty in their analysis given their assumptions
    • \(\star\) The results of their statistical analysis are meaningfully connected to the question or use case
    • The authors explored a range of plausible assumptions
    • The authors examined the data for evidence that their assumptions were or were not met
  • Critical analysis
    • \(\star\) The authors carefully consider how data collection might affect the validity of their analysis or assumptions
    • \(\star\) The authors clearly articulated how the dataset might answer or fail to answer the real-world question
    • \(\star\) The authors carefully investigated unusual or surprising results or patterns in the data
    • Alternative explanations or answers to the question or use case are carefully considered
    • Outside data or scientific knowledge is brought to bear on the problem
  • Clarity of exposition
    • \(\star\) Previous work and collaboration are clearly acknowledged
    • \(\star\) The plots and tables are readable and well-labeled
    • \(\star\) The report is well-structured, with a coherent introduction, analysis, and conclusion
    • The primary conclusion or recommendations are clear, and the report can be safely read quickly
    • The computational tools and methods are well-documented