STAT151A Final Project

Below are the details of your final project. As usual, these may be subject to change (with notice).

1 Project outline

The final project will be an open-ended investigation into real data using tools from linear regression.

Datasets

The project will analyze one of the datasets described and linked to in the first section of the datasets page: Housing, Microcredit, Teaching, Bodyfat, or Spotify.

Forming groups

For the final project, you must form groups of four students. Please select your own groups. Each person may participate in only one group. All members of a group will get the same grade. Changing the group members after having your project approved will require instructor approval.

To form a group, get your group members together and draw a random number using the following R command:

sample(1e9, size=1)

Please then submit the Gradescope assignment “Final project group proposals” as a group submission with this group number.

If a particular randomly generated group number has fewer than four members, the group may be broken up or randomly assigned another group member. If a particular group number has more than four members, the group will be randomly split up and its members assigned to other groups.

You may enter “no group” into the Gradescope assignment to be randomly assigned a group.

Key dates

You must submit your group members and dataset by Sunday, April 19th at 9pm (see above).
You must submit a first draft by Sunday, April 26th at 9pm.
You must submit a final draft by Sunday, May 10th at 9pm.

2 Grading

We will spend the class of Monday, April 27th providing critical feedback on each other’s first drafts. Lecture attendance by at least one group member on this day is mandatory.

Your final project grade will consist of:

10% peer grades of your first draft
10% peer grades of your critical feedback on their first draft
80% final submission

I reserve the right to adjust and inflate the peer grades as needed if there seem to be systematic biases. I will not decrease the peer grades.

Self-evaluation

Each group member – separately – will turn in a description of how the work was divided in the group: who did what for the different components of the project. If you have any qualms about the process and who contributed what, detail it here. If there was no problem that’s great, but you still must turn in the description of how the work was distributed. This should be a couple of paragraphs and each person should write it independently.

It is expected that some students contribute more than others. This can be a benefit, since less confident students can learn from their peers. However, the expectation is also that every group member contributes to the extent of their abilities.

If, in the self-evaluations, there is strong evidence that one or more group members deliberately did not contribute, then I may subjectively reduce that group member’s grade for the final project. If I do, I will let that group member know, and give them an opportunity to meet with me in person to demonstrate the work that they did and explain the situation.

In most cases, I expect that the final project grades will not require any modification. The purpose of this self-evaluation is primarily to detect misconduct, not to produce a gradation of students. If you are all content with each other’s contributions, and each indicate as much, then the final project grade will be the same for each group member.

3 Format and submitting

The final submission should take the form of a pdf with at most eight pages, including figures, but not including citations. Only the first eight pages of the report are subject to grading.

The final project will be collected via Gradescope as a group assignment.

We will ask that you submit reproducible code as a zip file. The code will not be explicitly graded as part of the assignment, but we may use it to confirm that you actually did what you say you did.
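
As a rough illustration of what "reproducible" can mean in practice, the sketch below fixes a random seed so that random draws repeat exactly, and records the R version used. The seed value `151` and the variable names are arbitrary choices made for this example, not course requirements.

```r
# A minimal reproducibility sketch (illustrative only; not a course requirement).
set.seed(151)                      # arbitrary fixed seed chosen for this example
draw_1 <- sample(100, size = 5)

set.seed(151)                      # resetting the same seed replays the same draws
draw_2 <- sample(100, size = 5)
print(identical(draw_1, draw_2))

# Record the R version so that a reader can match the environment.
# sessionInfo() also lists attached packages when printed in full.
info <- sessionInfo()
print(info$R.version$version.string)
```

Setting the seed once at the top of an analysis script is usually enough; it is reset here only to show that the draws repeat.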

4 Project structure

A typical project should consist of

A real-world question or questions to be investigated with the dataset,
An attempt to answer the question using tools from this course, and
A critical analysis of assumptions and potential shortcomings of the analysis.

We will provide a basic R markdown template to follow.

For the whole project, clear and critical thinking is what matters. There are no extra points for positive results, nor are there extra points for the most sophisticated methods (unless they are motivated by the problem). There are extra points for novel or creative perspectives and ideas (when justified by careful reasoning).

Examples

Three examples of successful final projects from past courses have been shared in the Files section of BCourses. Please do not circulate these projects.

Questions

Questions should be about the real world, not framed directly in terms of statistical analyses.

Good example: Can we increase household income by giving cash to poor families?
Good example: Can we produce a useful predictor of bodyfat from simpler measurements?
Bad example: Is there an association between \(x_n\) and \(y_n\) in this dataset?
Bad example: Is coefficient \(\beta_1\) in such-and-such a regression statistically significant?

Here are some things to think about to make sure it’s a real-world question, and not a purely statistical question:

Different answers to the question would lead you to do something differently in the real world. (Maybe think about what you’d do differently and incorporate that into your question.)
Your question should, in principle, be possible to answer with datasets, real or hypothetical, other than the one you have. Some datasets might provide better answers, some worse.
You should be able to imagine an ideal (but maybe practically impossible) experiment which would give you a perfect answer to your question.
You should be able to articulate your question and its possible answers to someone who doesn’t know any statistics at all.

Please be clear about whether your problem is an inference or a prediction problem (or both, or neither).
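
To make the distinction concrete, here is a small sketch of how the same regression can serve either purpose. It uses `mtcars`, which ships with base R and is not one of the project datasets; the seed, the held-out size of 8, and the variable names are assumptions made only for illustration.

```r
# Illustrative only: mtcars ships with base R and is NOT a project dataset.
set.seed(151)  # arbitrary seed so that the train/test split is repeatable

# Inference: how does fuel economy vary with weight, and how uncertain is that?
fit_all <- lm(mpg ~ wt, data = mtcars)
ci_wt <- confint(fit_all, "wt", level = 0.95)  # interval for the wt coefficient
print(ci_wt)

# Prediction: how accurately can we predict mpg for cars not used in fitting?
test_rows <- sample(nrow(mtcars), size = 8)    # hold out 8 rows at random
train <- mtcars[-test_rows, ]
test <- mtcars[test_rows, ]
fit_train <- lm(mpg ~ wt, data = train)
pred <- predict(fit_train, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))        # held-out root mean squared error
print(rmse)
```

In the inference use, the object of interest is the coefficient and its interval; in the prediction use, it is the held-out error, and the coefficient itself may be of little interest.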

Attempted answers

In order to attempt to answer your question with linear regression, you have to connect your real-world question with a regression analysis. Please be very clear about

What assumptions you need to connect your regression to your question
Whether those assumptions are reasonable

Clear thinking will be more important here than definitive answers to your question.

Critical analysis

Finally, please connect the results of your analysis to your question. Here are some of the kinds of questions you might ask:

What is the answer to your question?
Were you not able to answer your question due to some limitation you found in the data?
How might you collect different data to successfully answer your question?
What different statistical analyses might answer the question better?
What evidence is there that your assumptions are satisfied?
What evidence could you imagine collecting to establish that your assumptions were satisfied?

Again, clear thinking will be more important here than definitive answers to your question.
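
One possible way to look for evidence about an assumption is sketched below: a residual plot and a comparison against a slightly richer model, both probing whether a straight-line fit is adequate. This again uses the base R `mtcars` data as a stand-in, and the choice of a quadratic alternative is an assumption made for the example rather than a recommended procedure.

```r
# Illustrative only: mtcars ships with base R and is NOT a project dataset.
fit <- lm(mpg ~ wt, data = mtcars)
res <- resid(fit)
fitted_vals <- fitted(fit)

# Residuals against fitted values: curvature or a funnel shape suggests that
# the linear mean or constant-variance assumption may be doubtful.
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Compare with a model that allows curvature in wt (a hypothetical alternative).
fit_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)
print(anova(fit, fit_quad))  # F-test of the nested models
```

A check like this gives evidence, not proof; whether any pattern matters depends on the real-world question being asked.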

5 Grading Rubric

The final project will be graded on the basis of the following criteria. The grading will be structured so that a project that satisfies each \(\star\) criterion will get an A, but you can make up points by doing a particularly good job on the other criteria. This means there can be many ways to get a good grade.

Real-world question
- \(\star\) The question is framed in terms that someone who knows no statistics can understand
- \(\star\) It is clear whether the question or use case is prediction or inference
- \(\star\) The real-world question is clearly addressed (it need not be answered definitively, but it should be clear what the data can and cannot tell us about the question)
- The motivation or importance of the question or use case is especially clear
- Future actions based on the statistical analysis are well-articulated
Application of statistical concepts
- \(\star\) The authors carefully checked the dataset for missing or unusual values and cleaned the data as appropriate
- \(\star\) The authors appropriately quantified the statistical uncertainty in their analysis given their assumptions
- \(\star\) The results of their statistical analysis are meaningfully connected to the question or use case
- The authors explored a range of plausible assumptions
- The authors examined the data for evidence that their assumptions were or were not met
Critical analysis
- \(\star\) The authors carefully considered how data collection might affect the validity of their analysis or assumptions
- \(\star\) The authors clearly articulated how the dataset might answer or fail to answer the real-world question
- \(\star\) The authors carefully investigated unusual or surprising results or patterns in the data
- Alternative explanations or answers to the question or use case are carefully considered
- Outside data or scientific knowledge is brought to bear on the problem
Clarity of exposition
- \(\star\) Previous work and collaboration is clearly acknowledged
- \(\star\) The plots and tables are readable and well-labeled
- \(\star\) The report is well-structured, with a coherent introduction, analysis, and conclusion
- The primary conclusion or recommendations are clear, and the report can be safely read quickly
- The computational tools and methods are well-documented
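
As a sketch of what the criteria on missing values and statistical uncertainty might look like in code, the example below counts missing values per column, notes how many rows a regression silently drops, and reports confidence intervals. It uses `airquality`, a base R dataset that happens to contain missing values and is not one of the project datasets; the model formula is an arbitrary choice for illustration.

```r
# Illustrative only: airquality ships with base R and is NOT a project dataset.

# Count missing values in each column and look at the range of one variable.
na_counts <- colSums(is.na(airquality))
print(na_counts)
print(summary(airquality$Ozone))

# With default settings, lm() drops rows that have a missing value in any
# model variable, so check how many rows were actually used.
fit <- lm(Ozone ~ Temp + Wind, data = airquality)
n_used <- nobs(fit)
n_dropped <- nrow(airquality) - n_used
cat("Rows used:", n_used, " Rows dropped:", n_dropped, "\n")

# Confidence intervals for the coefficients, valid only under the usual
# linear model assumptions.
print(confint(fit, level = 0.95))
```

Reporting the number of dropped rows, and considering why those values are missing, is part of connecting the analysis back to how the data were collected.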

6 Use of AI

I consider the use of AI analogous to getting help from a friend who has done well in the course before. You may ask for help, but you must not let them do the work for you.

Here are some example uses of AI that are acceptable:

It’s okay to ask AI to check grammar and typos of text you wrote yourself.
It’s okay to ask AI to generate a figure according to precise instructions from you.
It’s okay to ask AI to check code or mathematical derivations for errors.

Here are some example uses of AI that are not acceptable:

It’s not okay to ask AI to generate text outright.
It’s not okay to give AI the rubric and ask it to suggest a project outline.
It’s not okay to submit a draft to AI and ask it to edit it so as to receive a higher grade.

The distinction is whether you have asked AI to do the work for you, or to help edit work you have already done.