# wysa
Date: 02-04-2021
Approach and steps
This work was done in R.
Overview of the dataset:
# variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
# 1 Show.Number 0 0.0000000000 0 0 0 0 integer 3640
# 2 Air.Date 0 0.0000000000 0 0 0 0 character 3640
# 3 Round 0 0.0000000000 0 0 0 0 character 4
# 4 Category 0 0.0000000000 0 0 0 0 character 27995
# 5 Value 0 0.0000000000 0 0 0 0 character 150
# 6 Question 0 0.0000000000 0 0 0 0 character 216124
# 7 Answer 24 0.0001106348 0 0 0 0 character 88269
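The overview above resembles the output of funModeling's df_status(); a minimal sketch under that assumption (the package and the file name jeopardy.csv are not confirmed by these notes):
library(funModeling)
jeopardy <- read.csv("jeopardy.csv", stringsAsFactors = FALSE)
df_status(jeopardy)   # per-column counts/percentages of zeros, NAs and Inf, plus type and unique values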
Problem Statement -> Build a model to predict the value of the question in the TV game show “Jeopardy!”.
Pre-checks before model building
Check for NA values and eliminate them if any exist.
The Value column contains a "$" prefix, so strip it out.
Questions contain special characters, numbers and other unwanted words, so clean them.
Convert both columns into the right format: Value -> integer, Question -> cleaned plain text.
Remove the "None" values from the Value column // found 3634 rows to eliminate
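A rough sketch of these pre-checks, assuming the data frame is named jeopardy (the actual object names and cleaning code are not in these notes):
jeopardy <- na.omit(jeopardy)                                     # drop rows with NA values, if any
jeopardy <- jeopardy[jeopardy$Value != "None", ]                  # drop the "None" values (~3634 rows)
jeopardy$Value <- as.integer(gsub("[$,]", "", jeopardy$Value))    # strip "$" and commas, convert to integer
jeopardy$Question <- gsub("[^A-Za-z ]", " ", jeopardy$Question)   # drop special characters and numbers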
Approach 1 -> Apply algorithms on a Document-Term Matrix
Create a corpus and perform the operations below (a tm sketch follows this list):
tolower
removePunctuation
removeNumbers
removeWords, using a stop-words file that contains a good number of unwanted words
stripWhitespace
stemDocument
PlainTextDocument
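A sketch of that pipeline with the tm package; the corpus is built here from the cleaned Question column only, and the English stop-word list stands in for the custom stop-words file mentioned above:
library(tm)
combine_corpus <- VCorpus(VectorSource(jeopardy$Question))
combine_corpus <- tm_map(combine_corpus, content_transformer(tolower))
combine_corpus <- tm_map(combine_corpus, removePunctuation)
combine_corpus <- tm_map(combine_corpus, removeNumbers)
combine_corpus <- tm_map(combine_corpus, removeWords, stopwords("english"))  # notes use a custom stop-words file
combine_corpus <- tm_map(combine_corpus, stripWhitespace)
combine_corpus <- tm_map(combine_corpus, stemDocument)                       # requires the SnowballC package
combine_corpus <- tm_map(combine_corpus, PlainTextDocument)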
Method 1 -> Create a Document-Term Matrix
Remove sparse terms at the 0.95 threshold; this left zero terms in the matrix, so it cannot be used further.
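A sketch of this step with tm's DocumentTermMatrix and removeSparseTerms, using the threshold described above:
dtm <- DocumentTermMatrix(combine_corpus)
dtm95 <- removeSparseTerms(dtm, 0.95)   # keep only terms absent from at most 95% of documents
dim(dtm95)                              # here this left zero terms, so the matrix is unusable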
Method 2 -> First create a Term-Document Matrix with the command below
adtm <- TermDocumentMatrix(combine_corpus, control = list(bounds = list(local = c(2,Inf))))
Convert it to a Document-Term Matrix.
Find the frequency of terms; those with frequency above 100 are listed below:
# "play" "your" "citi" "mean" "island" "number" "time" "day" "state" "word" "man"
# [12] "river" "clue" "lake" "call" "presid" "king" "love" "war" "year" "film" "type"
Create the final data frame from the Value column and these 22 term columns.
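A sketch of the conversion and of building that data frame (the findFreqTerms call and object names are assumptions, not taken from the notes):
dtm2 <- t(adtm)                                    # tm's t() turns the TDM into a DTM
freq_terms <- findFreqTerms(dtm2, lowfreq = 100)   # the 22 terms listed above
model_df <- as.data.frame(as.matrix(dtm2[, freq_terms]))
model_df$Value <- jeopardy$Value                   # append the target column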
Applied a multiple linear regression model; the R-squared value is 0.01015664, which is very poor.
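A sketch of that fit, assuming the data frame built above:
fit_lm <- lm(Value ~ ., data = model_df)
summary(fit_lm)$r.squared   # reported above as 0.01015664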
Method 3 -> Apply binning and outlier removal to the Value column, which is our target column
Found 1900 as the upper outlier fence, i.e. Q3 + 1.5 * IQR (where IQR = Q3 - Q1)
Applied binning and shrank the 150 distinct values of the Value column into 20 bins:
0 to 100 -> 100
100 to 200 -> 200
-----------------
-----------------
1900 - 2000 -> 2000
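A sketch of the outlier fence and the binning under those assumptions (exact boundary handling is not specified in the notes):
q1 <- quantile(model_df$Value, 0.25)
q3 <- quantile(model_df$Value, 0.75)
upper_fence <- q3 + 1.5 * (q3 - q1)                      # ~1900 here
model_df <- model_df[model_df$Value <= upper_fence, ]
model_df$Value <- ceiling(model_df$Value / 100) * 100    # 100-wide bins: 0-100 -> 100, ..., 1900-2000 -> 2000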
Applied multiple linear regression again; no visible improvement, with an R-squared value of 0.02419002.
Cause -> The Document-Term Matrix has all values as zero for all of the 197623 rows, and we cannot improve its
performance until this problem is solved.
Approach 2 -> Instead of creating a Document-Term Matrix, use tokenization
Create tokens with the command below
reviewtokens=tokens(df5$Question,what="word",remove_numbers=TRUE,remove_punct=TRUE, remove_symbols=TRUE, remove_hyphens=TRUE)
Perform the operations below on the tokens (a quanteda sketch follows the output tables):
tokens_tolower
Remove stop words
Stemming
Create n-grams: unigrams and bigrams
Create a dfm
Trim sparse features at the 0.98 sparsity threshold
Sort the dfm terms and keep every term with a document frequency of at least 30
Then prepare a data frame:
feature frequency rank docfreq group
1 name 15473 1 15051 all
2 one 12491 2 11913 all
3 href 11015 3 9004 all
4 first 9102 4 8871 all
5 target 9064 5 7242 all
6 blank 9015 6 7194 all
7 target_blank 8979 7 7160 all
8 citi 7075 8 6838 all
9 call 6229 9 6092 all
11 use 6002 11 5862 all
12 countri 5797 12 5737 all
13 state 5566 13 5263 all
14 us 5474 14 5390 all
15 man 4947 15 4832 all
16 like 4942 16 4775 all
17 type 4879 17 4760 all
18 film 4824 18 4734 all
19 seen 4735 19 4723 all
20 can 4642 20 4510 all
21 year 4494 21 4372 all
22 new 4455 22 4314 all
23 clue 4197 23 3851 all
24 titl 4045 24 4013 all
25 mean 3952 25 3825 all
26 made 3745 26 3690 all
27 word 3574 27 3408 all
28 known 3437 28 3395 all
29 crew 3327 29 3319 all
30 also 3321 30 3318 all
target name one href first target blank target_blank citi call play use countri state us man like type film
1 200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
2 200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 200 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
4 200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 200 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
6 200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
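A quanteda sketch of the token pipeline and data-frame construction above; object names other than reviewtokens and df5 are assumptions, and in recent quanteda versions textstat_frequency lives in the quanteda.textstats package:
library(quanteda)
reviewtokens <- tokens_tolower(reviewtokens)
reviewtokens <- tokens_remove(reviewtokens, stopwords("english"))
reviewtokens <- tokens_wordstem(reviewtokens)
reviewtokens <- tokens_ngrams(reviewtokens, n = 1:2)                 # unigrams and bigrams
reviewdfm <- dfm(reviewtokens)
reviewdfm <- dfm_trim(reviewdfm, sparsity = 0.98)                    # drop very sparse features
freqs <- textstat_frequency(reviewdfm)                               # feature / frequency / rank / docfreq table
keep <- freqs$feature[freqs$docfreq >= 30]                           # terms appearing in at least 30 documents
tokendf <- data.frame(target = df5$Value,                            # binned Value column (assumed name)
                      convert(reviewdfm[, keep], to = "data.frame")[, -1])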
Method 1 -> Applied the rpart algorithm, i.e. Recursive Partitioning and Regression Trees
Used the command below
tree=rpart(tokendf$target ~ ., data = tokendf, method="class",control = rpart.control(minsplit = 200, minbucket = 30, cp = 0.0001))
Got accuracy as follows
## 0.2143425 on the entire data set
## 0.2143425 on the 80 percent training split
## 0.2125595 on the 20 percent test split
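A sketch of how such accuracies can be computed, assuming a simple 80/20 split (the split code and seed are not in the notes):
library(rpart)
tokendf$target <- factor(tokendf$target)            # class labels (the binned values)
set.seed(123)                                       # hypothetical seed
idx <- sample(nrow(tokendf), 0.8 * nrow(tokendf))
trainData <- tokendf[idx, ]
testData  <- tokendf[-idx, ]
tree <- rpart(target ~ ., data = trainData, method = "class",
              control = rpart.control(minsplit = 200, minbucket = 30, cp = 0.0001))
pred <- predict(tree, testData, type = "class")
mean(pred == testData$target)                       # accuracy on the 20 percent test split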
Method 2 -> Using the random forest algorithm
Applied the command below
fit.forest <- randomForest(trainData$target~.,data=trainData, na.action=na.roughfix,importance=TRUE, ntree=100)
Got accuracy 0.2199 on the training data set
Got accuracy 0.2122 on the test data set
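A sketch of that evaluation, assuming the same train/test split as above and the randomForest package:
library(randomForest)
fit.forest <- randomForest(target ~ ., data = trainData,
                           na.action = na.roughfix, importance = TRUE, ntree = 100)
mean(predict(fit.forest, trainData) == trainData$target)   # ~0.2199 reported on the training set
mean(predict(fit.forest, testData) == testData$target)     # ~0.2122 reported on the test set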