- The instructor will place you into a group of 3-4 students
- Pick a data set that you and your group find interesting. (Example sources found below. Feel free to select your data from any other source as appropriate.)
- Form a research question
- Perform data pre-processing, data cleaning, outlier removal, and so on to sanitize your data as necessary.
- Save your data in a .csv file (or other format as appropriate for your data set and project scenario).
- Explore your data to reveal interesting/useful information based on your project scenario.
- Create at least 2 visualizations that you find interesting/useful.
- Do at least one of the following, depending on your interests and background:
  - compute meaningful statistical quantities (e.g., means, correlations)
  - perform a statistical test on the data (e.g., t-test)
  - fit a model to the data (e.g., regression)
- Write at least two Python classes, each of which has at least one method. These classes can be as simple as the ones in our lecture notes.
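The cleaning, saving, statistics, and class requirements above can be sketched together in one short script. This is only a minimal illustration, not a required design: the class names `DataCleaner` and `SummaryStats`, the helper `save_csv`, the column names, the made-up sample rows, the output location, and the 1.5 × IQR outlier rule are all assumptions chosen for the example. It uses only the Python standard library; a library such as pandas would work just as well.

```python
import csv
import os
import statistics
import tempfile


class DataCleaner:
    """Hypothetical helper: converts columns to numbers and drops bad rows."""

    def __init__(self, rows, columns):
        self.rows = rows          # list of dicts with raw string values
        self.columns = columns    # columns that must be numeric

    def clean(self):
        """Return rows where every listed column parses as a float."""
        cleaned = []
        for row in self.rows:
            try:
                cleaned.append({c: float(row[c]) for c in self.columns})
            except (KeyError, TypeError, ValueError):
                continue  # missing or non-numeric value: drop the row
        return cleaned

    def remove_outliers(self, rows, column):
        """Keep rows within 1.5 * IQR of the quartiles (one common rule of thumb)."""
        q1, _, q3 = statistics.quantiles([r[column] for r in rows], n=4)
        margin = 1.5 * (q3 - q1)
        return [r for r in rows if q1 - margin <= r[column] <= q3 + margin]


class SummaryStats:
    """Hypothetical helper: simple statistics for two numeric columns."""

    def __init__(self, rows, x, y):
        self.xs = [r[x] for r in rows]
        self.ys = [r[y] for r in rows]

    def means(self):
        """Return the mean of each column as a tuple (mean_x, mean_y)."""
        return statistics.mean(self.xs), statistics.mean(self.ys)

    def correlation(self):
        """Pearson correlation coefficient, computed from its definition."""
        mx, my = self.means()
        cov = sum((a - mx) * (b - my) for a, b in zip(self.xs, self.ys))
        var_x = sum((a - mx) ** 2 for a in self.xs)
        var_y = sum((b - my) ** 2 for b in self.ys)
        return cov / (var_x * var_y) ** 0.5

    def fit_line(self):
        """Least-squares slope and intercept for y = slope * x + intercept."""
        mx, my = self.means()
        cov = sum((a - mx) * (b - my) for a, b in zip(self.xs, self.ys))
        slope = cov / sum((a - mx) ** 2 for a in self.xs)
        return slope, my - slope * mx


def save_csv(rows, path):
    """Write cleaned rows to a .csv file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


# Made-up sample data for illustration only
raw = [
    {"hours": "1", "score": "52"},
    {"hours": "2", "score": "54"},
    {"hours": "3", "score": "56"},
    {"hours": "4", "score": "58"},
    {"hours": "5", "score": "60"},
    {"hours": "", "score": "70"},     # missing value, dropped by clean()
    {"hours": "40", "score": "61"},   # implausible value, removed as an outlier
]

cleaner = DataCleaner(raw, ["hours", "score"])
rows = cleaner.remove_outliers(cleaner.clean(), "hours")

# Hypothetical output location; any path that suits your project is fine
out_path = os.path.join(tempfile.gettempdir(), "cleaned_data.csv")
save_csv(rows, out_path)

stats = SummaryStats(rows, "hours", "score")
print("means:", stats.means())
print("correlation:", stats.correlation())
print("slope, intercept:", stats.fit_line())
```

Keeping cleaning and analysis in separate classes is one possible split; it makes each method easy to demonstrate on its own in the report's sample run.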
EXAMPLE DATA SOURCES
- Kaggle
- AWS Open Data
- data.world
- ICPSR
- The Google Dataset Search
- The UCI Machine Learning Repository
- The CMU data repository
- The datasets subreddit
- Tycho
- Data Portals
1. WRITTEN REPORT (max of 10 pages, due Dec 17) containing:
- Abstract: A one-paragraph summary of your question, what you did, and what you learned.
- Introduction: Describe your project scenario. Starting out, what did you hope to accomplish/learn?
- Data: Describe your dataset and its significance. Where did you obtain this dataset? Why did you choose it? Indicate whether you carried out any preprocessing, data cleaning, outlier removal, and so on to sanitize your data.
- Data Processing Methodology: Briefly describe your process for obtaining results/output.
- Results:
  - Show at least two visualizations.
  - Display and discuss the results. Describe what you have learned and mention the relevance/significance of the results you have obtained.
- Classes: Describe what classes you made. Describe methods in the classes that you wrote. Show a sample run of 1 or 2 of your methods (screen captures or copy-and-paste is fine).
- Conclusions: Summarize your findings, explain how these results could be used by others (if applicable), and describe ways you could improve your program. You could describe ways you might like to expand the functionality of your program if given more time.
2. CODE (due Dec 17)
- Clearly document, organize, and name your code file or files
- The files can be in Jupyter Notebooks or Python scripts
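As a rough illustration of what documented and organized code can look like, the sketch below uses a module docstring, a descriptive function name, per-function docstrings, and a `main()` entry point. The filename `analyze_scores.py`, the function `load_column`, and the in-memory sample data are hypothetical and chosen only for the example.

```python
"""analyze_scores.py (hypothetical filename): load a CSV column and report its mean.

Usage: python analyze_scores.py
"""
import csv
import io
import statistics


def load_column(csv_file, column):
    """Read one numeric column from an open CSV file and return a list of floats."""
    return [float(row[column]) for row in csv.DictReader(csv_file)]


def main():
    """Run the analysis on a small in-memory sample standing in for a real data file."""
    sample = io.StringIO("score\n52\n54\n56\n")  # made-up data for illustration
    scores = load_column(sample, "score")
    print(f"mean score: {statistics.mean(scores)}")
    return scores


if __name__ == "__main__":
    main()
```

The same structure carries over to a Jupyter Notebook, with a short Markdown cell taking the place of the module docstring.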
SUBMISSION
- By the deadline, submit (i) the written report and (ii) the code files as a single Zip file through Canvas.
RUBRIC
Total Points = 60
Assignment | Description | Possible Points |
---|---|---|
Paper | Paper includes abstract, introduction, and conclusions | 10 |
Paper | Paper discusses data source, data summary, and data processing methodology | 10 |
Paper | Paper includes at least two visualizations | 10 |
Paper | Paper includes answers to research questions | 10 |
Paper | Paper includes methods in user-defined classes | 10 |
Code | Code is clear and well-documented and presents Python classes | 10 |