11/7 meeting with professor:
1. Use 2014 "Birth Data Files" - for successful births (~5 GB data)
2. Use 2014 "Period Linked Birth-Infant Death Data Files" (~5 GB data)
Metrics:
------------
1. Which features are most predominant in successful births?
2. Which features are most predominant in unsuccessful births?
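One possible way to make "most predominant" concrete is to rank features by how far their means differ between the two groups, scaled by the overall spread. The sketch below is only an illustration: the rows are made up, and the field names (birth_weight_g, gestation_weeks, survived) are hypothetical, not taken from the CDC files.

```python
from statistics import mean, pstdev


def rank_features(rows, label_key):
    """Rank numeric features by absolute standardized mean difference
    between the label == 1 group and the label == 0 group."""
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    features = [k for k in rows[0] if k != label_key]
    scores = {}
    for f in features:
        spread = pstdev([r[f] for r in rows])
        if spread == 0:
            # A constant feature cannot separate the groups.
            scores[f] = 0.0
            continue
        diff = mean(r[f] for r in pos) - mean(r[f] for r in neg)
        scores[f] = abs(diff) / spread
    # Largest score first: the feature that differs most between groups.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy rows with hypothetical field names (not real data).
rows = [
    {"birth_weight_g": 3400, "gestation_weeks": 39, "survived": 1},
    {"birth_weight_g": 3100, "gestation_weeks": 40, "survived": 1},
    {"birth_weight_g": 1200, "gestation_weeks": 39, "survived": 0},
    {"birth_weight_g": 900, "gestation_weeks": 40, "survived": 0},
]
ranking = rank_features(rows, "survived")
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

On the full data the same per-feature statistic could be computed with Spark aggregations; that part is not shown here.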
Prediction:
-----------
1. Based on feature values, can we predict the infant's probability of survival (as a percentage)?
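A minimal sketch of how a survival percentage could be produced, assuming a logistic regression model (one form of the "Linear Modeling" concept; the notes do not commit to a specific model). The data is a made-up, single standardized feature, and the function names fit_logistic and survival_percent are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def survival_percent(w, b, x):
    """Predicted probability of the positive class, as a percentage."""
    return 100.0 * sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Made-up toy data: one standardized feature, label 1 = survived.
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
high = survival_percent(w, b, [2.0])
low = survival_percent(w, b, [-2.0])
print(f"high: {high:.1f}%  low: {low:.1f}%")
```

Since infant deaths are likely to be a small fraction of all births, a real model would probably need class weighting or resampling, and the output percentages would need calibration checks before being reported.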
Utilize at least 3 of the concepts below, with at least one from data pipelines:
----------------------------------------------------------------------------------
Data Pipelines (Frameworks): Spark
Analytics: Similarity Search, Link Analysis, Linear Modeling, Clustering and Dimensionality Reduction, Large-Scale Machine Learning, Distributed TensorFlow (need to figure out which ones will be most applicable besides machine learning)
Presentation tasks:
-------------------
All presentations should include:
● (Anurag) Big Picture Scope: Problem definition (if applicable) or need in society addressed.
● (Anurag) Motivation: Why big data? Why should people care?
● (Renu) Sustainable Development Goal: How are you addressing the SDG?
● (Shayan) Data Description: Source(s), number of variables / features, number of observations.
● (All) Methods Proposed: How are you going to solve / analyze the problem?
● (All) Specific Experiments / Analyses Proposed: What are the specific experiments or analyses you intend to run?
● (Shayan) Mock-up or Preliminary Results: What are the key figures / tables you hope will come out of your work? Show a “fake” example rather than just tell. Any prototyping on small data?
● (Keshav) Team Work: What will each team member be responsible for? (May deviate based on how the project develops, but teams at least need a plan.)
● (All) Summary of course concepts (1 slide): Identify how the project relates to the course.
● (Keshav) Conclusion (1 slide): Summarize the main contributions of your work.
Topics to figure out:
---------------------
1. Similarity Search (Anurag)
2. Distributed TensorFlow (Renu)
3. Clustering and Dimensionality Reduction (Keshav)
4. Large Scale Machine Learning (Shayan)
Google Doc:
-------------
https://docs.google.com/document/d/11KH5w_qdKpl2YnbgQ3I7wxlqMbIk1KV8DcSErnlUXXk/edit
Google Presentation:
-----------------------
https://docs.google.com/presentation/d/1mPmHjyJZVHrvnXTaRUJwl2dt0GkrhftjIX1aprHcqCY/edit#slide=id.p
Pipeline tasks:
---------------
1. Data Load
2. Data Cleaning
3. Feature Extraction
4. Dimensionality Reduction
5. Feature Enrichment
6. Model Training
7. Model Evaluation
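
The seven steps above can be sketched end to end on a toy input. This is a plain-Python stand-in, not Spark code, and everything in it is an assumption for illustration: the CSV layout, the column names, the made-up rows, dropping zero-variance columns as a placeholder for dimensionality reduction, and a one-feature threshold as a placeholder for the model.

```python
import csv
import io
from statistics import pvariance

# Made-up toy input with hypothetical column names (not real data).
RAW = """weight_g,gest_weeks,plurality,survived
3400,39,1,1
3100,40,1,1
,38,1,1
1200,28,1,0
900,26,1,0
3300,39,1,1
"""


def load(text):
    # 1. Data load: parse CSV text into a list of dict rows.
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    # 2. Data cleaning: drop rows that have any empty field.
    return [r for r in rows if all(v != "" for v in r.values())]


def extract(rows, label="survived"):
    # 3. Feature extraction: numeric feature dicts plus integer labels.
    X = [{k: float(v) for k, v in r.items() if k != label} for r in rows]
    y = [int(r[label]) for r in rows]
    return X, y


def reduce_dims(X):
    # 4. Dimensionality reduction (placeholder): drop zero-variance columns.
    keep = [k for k in X[0] if pvariance([x[k] for x in X]) > 0]
    return [{k: x[k] for k in keep} for x in X]


def enrich(X):
    # 5. Feature enrichment: add a derived ratio feature.
    return [dict(x, weight_per_week=x["weight_g"] / x["gest_weeks"]) for x in X]


def train(X, y, feature="weight_per_week"):
    # 6. Model training (placeholder): threshold halfway between class means.
    pos = [x[feature] for x, t in zip(X, y) if t == 1]
    neg = [x[feature] for x, t in zip(X, y) if t == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[feature] >= threshold else 0


def evaluate(model, X, y):
    # 7. Model evaluation: accuracy, here on the training rows only.
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)


rows = clean(load(RAW))
X, y = extract(rows)
X = enrich(reduce_dims(X))
model = train(X, y)
accuracy = evaluate(model, X, y)
print(f"rows kept: {len(rows)}, features: {sorted(X[0])}, accuracy: {accuracy}")
```

In the real project each stage would map onto a Spark job, and evaluation would use a held-out split rather than the training rows.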