-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcontributions_by_Texas.Rmd
1565 lines (1097 loc) · 98.6 KB
/
contributions_by_Texas.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
Exploration of Financial Contributions to Presidential Campaigns in Texas by Jehyeon Heo.
========================================================
```{r echo=FALSE, message=FALSE, warning=FALSE, Packages}
# Load all of the packages that you end up using in your analysis in this code
# chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
library(ggplot2)
library(gridExtra)
library(dplyr)
library(RColorBrewer)
library(GGally)
library(scales)
library(memisc)
library(reshape2)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Load_the_datas}
# Load the financial data.
Data <- read.csv('ctrbsTX.csv',
row.names = NULL,
header = TRUE)
# Get header names without 'row.names'.
data_headers <- colnames(Data)[-1]
# In the data, remove the last column which is all NAs.
Data$election_tp <- NULL
# Give the data right header names for each columns.
colnames(Data) <- data_headers
# Load the candidates data.
Cand <- read.csv('candidates.csv')
```
In 2016, there was presidential election in United States. At the result of it, Donald Trump became the 45th president of US. He gave sensational impact to Repulican party and became the nominee of it. His biggest rival was Hillary Clinton, Democratic party nominee. Lots of people forecasted that Hillary Clinton would win. But even if she got more votes than Trump in the election, she lost.
In this document, I'm going to explore about financial contributions in Texas to presidential campaigns of 2016 US election. I chose Texas because I was interested in the state.
I got the financial data from [FEC site](http://classic.fec.gov/disclosurep/pnational.do). And I used the data about each candidate. It has gender, party, height and age information of each candidates. I made the data by searching the internet.
Before I enter into analysis, I'm going to introduce the questions that I got about the data.
1. How was the distribution of the contributions?
2. How did the sum of contributions differ by parties or dates or candidates, etc?
3. I want to know the results of above questions by comparing Clinton and Trump. And also for Republican party and Democratic party.
4. Which contributer contributed the maximum amount? And who contributed most often?
I'm sure that I'll get more questions as I explore the datas. I'll show you which questions I get when I explore the data and show the results of the exploration.
# Basic Data Exploration Section
Fisrt, I'm going to explore financial data. I loaded the data as a name 'Data'.
```{r echo=FALSE, message=FALSE, warning=FALSE, Basic_Exploration_for_Data}
# Get the number of observations and columns.
dim(Data)
# Show the structure of the data.
str(Data)
# Show the distributions of the data by summary.
summary(Data)
# Get the number of candidates and columns.
print('Number of candidates in Data:')
length(unique(Data$cand_nm))
# Show some contributers' names.
print("Show top 5 most common contributers' names")
summary(Data$contbr_nm)[1:5]
```
The Data has about 550,000 observations and 18 variables. And there are 25 candidates' information in it.
And I can see from the structure and the summary of the Data that it would be difficult to identify unique contributers. If I try to sort out distinctive contributers by using names, there can be lots of different people with same name. Even if I try to use city or occupation variables too, I cannot know whether the contributions are made by diffrent people who live in the same city or by same person who moved to other city and changed his or her job.
I'm going to show an example using one of most common contributer name 'RUDOLPH, BONNIE' in the following.
```{r echo=FALSE, message=FALSE, warning=FALSE, Same_name_different_people_example}
# Show the states of 'ROBINSON, ROBBIE'.
bonnie <- subset(Data, contbr_nm == 'RUDOLPH, BONNIE')
print('How many unique city names are registered for him?')
length(unique(bonnie$contbr_city))
print('What are the city names?')
unique(bonnie$contbr_city)
print('How many unique zip codes are registered for him?')
length(unique(bonnie$contbr_zip))
print('What are the zip codes?')
unique(bonnie$contbr_zip)
print('How many of people are registered as his employer?')
length(unique(bonnie$contbr_employer))
print('Who are his employers?')
unique(bonnie$contbr_employer)
print('How many occupations are registered for him?')
length(unique(bonnie$contbr_occupation))
print('What are his occupations?')
unique(bonnie$contbr_occupation)
```
When I explored top 5 most common contributer names, I found that RUDOLPH, BONNIE contributed 463 times. It looked like he lives in Laredo, Texas. When I see the city name with zip codes and search about it using google, I think that Laredo is wrongly written to 'Latedo'.
But when I see the employer and occupation of him, I cannot but have a question about the name's uniqueness. Is he retired? Or is he working at a university? Or did he work at a university and retired? I cannot know about it using only this data.
Therefore I cannot answer the questions related with 'unique' contributers using this data.
I also found from the structure of Data that the type of contb_receipt_dt should be date, not string. I'm going to change the type. And I also think that it is better to make types of file_num and tran_id as string, not num or factor because there are too many different cases.
```{r echo=FALSE, message=FALSE, warning=FALSE, Change_types}
# Set locale of R.
print("Set locale as 'C'")
Sys.setlocale("LC_TIME", "C")
# Change the type to Date.
Data$contb_receipt_dt <- as.Date(
Data$contb_receipt_dt, '%d-%b-%y'
)
# Change the types of file_num and tran_id.
Data$file_num <- as.character(Data$file_num)
Data$tran_id <- as.character(Data$tran_id)
# Check the result using str function.
str(Data$contb_receipt_dt)
str(Data$file_num)
str(Data$tran_id)
```
The type of contb_receipt_dt values are changed to date. And the types of file_num and tran_id values are changed to string.
Now I'm going to explore candidates data, too. I loaded the data as a name 'Cand'.
```{r echo=FALSE, message=FALSE, warning=FALSE, Basic_exploration_for_Cand}
# Get the number of observations and columns.
dim(Cand)
# Show the structure of the data.
str(Cand)
# Show the distributions of the data by summary.
summary(Cand)
# Show whether the unique values of the candidates' names are all equal to the cand's X values.
print("If the unique values of the candidates' names in Data are all equal to the values in cand:")
all(unique(Data$cand_nm) == Cand$X)
```
There are 25 candidates' information in Cand. 17 of them are included in Repulican party and 5 of them are in Democratic party. Only 3 of all candidates are female.
I think that it is good to add new column which notify who became presidential nominee from each party. I'm going to name the column as 'if_nominee' and assign TRUE for nominees and FALSE for others. And I'm going to regard Evan McMullin, independent presidential candidate as a nominee.
```{r echo=FALSE, message=FALSE, warning=FALSE, Make_if_nominee_column}
nominees = c('Clinton, Hillary Rodham',
'Trump, Donald J.',
'Johnson, Gary',
'McMullin, Evan',
'Stein, Jill')
Cand$if_nominee <- ifelse(Cand$cand_nm %in% nominees, TRUE, FALSE)
summary(Cand$if_nominee)
```
5 candidates became nominees and others couldn't(or didn't).
When I explored the candidates' data, I found that the values of the candidates' names are same for Data and Cand that I can merge Data and Cand by outer join using 'cand_nm' column. I'm going to use only 'cand_nm', 'cand_party' and 'if_nominee' columns to join because I think that these columns would help me to understand more about contributions for parties and nominees.
```{r echo=FALSE, message=FALSE, warning=FALSE, Outer_join_Data_and_Cand}
# Merge the 'Data' and 'Cand' by outer join.
Data <- merge(x = Data,
y = Cand[,c('cand_nm', 'cand_party', 'if_nominee')],
by = 'cand_nm',
all.x = TRUE)
print('Show the names of the columns after merged')
names(Data)
```
I can see that 'cand_party' and 'if_nominee' columns are added to Data.
# Univariate Plots Section
Now I'm going to get univariate plots.
First, I'm going to show the distribution of the contribution counts for each candidates. And I'll also show what is the percentages of total for each count.
I decided to use bar plot because the x variable is names, which are categorical values and the y variable is counts. I think that I will use bar plot lots of times because most of variables in Data are categorical values.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_cand_nm_counts}
#From now on, I'll make some variables pass through table, sort and as.data.frame function. I'll do them to draw ggplot graph in order by counts. And I can filter the values by counts easily.
# I designated dnn in table function to give a name to the dimensions in the result. If I don't designate it, the default name is 'Var1'.
# I also designated responseName in as.data.frame function to give a name to the responses in the result. If I don't designate it, the default name is 'Freq'.
# I designated decreasing equal T in sort function to make names with more counts appear at head.
name_counts <- as.data.frame(sort(table(Data$cand_nm, dnn = 'name'),
decreasing = T),
responseName = 'count')
name_counts
# I'm going to often use theme function in ggplot. It is for rotating x labels 90 degrees when the label names are long. And I designated hjust = 1 to align the texts to the bottom of the plot and vjust = 0.5 to align them to each ticks.
ggplot(aes(x = name, y = count),
data = name_counts) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
# I chose count/sum(count) as y variable to express percentage.
ggplot(aes(x = name, y = count/sum(count)),
data = name_counts) +
geom_bar(stat = 'identity') +
scale_y_continuous(labels = percent_format()) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Clinton got the most number of contributions. After her, Cruz, Sanders, Trump, Carson, etc. follow. When I changed the counts to percentages, Clinton got about 38% and Cruz, Sanders, Trump and Carson followed by about 25%, 15%, 14%, 5% each. I wonder if a candidate who got more contributions, got bigger sum of contribution amounts. I'm going to explore it by adding amounts for each candidate in the bivariate section later.
In this time, I'm going to show the distribution of the count of contributions in each cities in Texas.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_city_counts}
cities_counts <- as.data.frame(sort(table(Data$contbr_city, dnn = 'name'),
decreasing = T),
responseName = 'count')
print("Number of contributers' cities")
length(cities_counts$name)
ggplot(aes(x = name, y = count),
data = cities_counts[1:10, ]) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
There are 2252 city names in the data that I showed top 10 results.
The top 10 cities of Texas by population ordered by rank are Houston, San Antonio, Dallas, Austin, Fort Worth, El paso, Arlington, Corpus Christi, Plano and Laredo. But when I see the top 10 cities by counts, the rank is a little different from the rank by population. San Antonio changed the rank with Austin. And while Corpus Christi and Laredo are included in populated top 10, they didn't appear at top 10 counts.
In this time, I'm going to show the histogram of contribution amounts. I can use histogram this time, because I can consider contribution amounts as continuous variable.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_contribution_amount}
summary(Data$contb_receipt_amt)
print('Top 6 most frequent contribution amounts:')
head(sort(table(Data$contb_receipt_amt), decreasing = T))
# I designated binwidth as $100, because the range of amounts is about $30,000.
# I think that $100 is the right binwidth compared to the range.
ggplot(aes(x = contb_receipt_amt),
data = Data) +
geom_histogram(binwidth = 100) +
scale_x_continuous(breaks = seq(-15000, 15000, 5000))
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
# Show the histogram for center 50% of amounts. I designated binwidth as 1$, because the range of IQR range is $80.
ggplot(aes(x = contb_receipt_amt),
data = subset(Data,
contb_receipt_amt >= quantile(Data$contb_receipt_amt, 0.25) & contb_receipt_amt <= quantile(Data$contb_receipt_amt, 0.75))) +
geom_histogram(binwidth = 1) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_continuous(limits = c(15, 105))
```
The minimum and maximum amount was -\$16,600 and \$15,000 each. It is weird that there are - values of contributions. I wonder what causes it.
The most common contribution amount is \$25. \$50, \$100 follow it. It is unexpected to see that 25$ is the most common one because I thought that it is convenient to contribute amounts in the tens.
I can see from the table and the histogram that most of contributions are in between \$20 and \$100.
I'm going to investigate about negative values in contribution amounts in the following.
```{r echo = FALSE, message=FALSE, warning=FALSE, Univariate_more_about_amounts}
minus_amounts <- subset(Data, contb_receipt_amt < 0)
summary(minus_amounts$contb_receipt_amt)
print('Top 6 negative amounts:')
head(sort(minus_amounts$contb_receipt_amt))
print('Datas whose contribution amounts are less than -$10,000:')
subset(minus_amounts, contb_receipt_amt < -10000)
print('Datas about DURHAM, JOE and CLARK, ELLOINE M.')
subset(Data, contbr_nm %in% c('DURHAM, JOE', 'CLARK, ELLOINE M.'))
# Get the descriptions about them to know the causes.
minus_amounts.description <- as.data.frame(sort(
table(minus_amounts$receipt_desc, dnn = 'description'),
decreasing = T),
responseName = 'count'
)
print('Top 6 most common descriptions for negative values:')
head(minus_amounts.description)
ggplot(aes(x = description, y = count),
data = head(minus_amounts.description)) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
# I also want to know why there are amounts whose values are above $5,400.
more_than_5400_amounts <- subset(Data, contb_receipt_amt > 5400)
# Get the descriptions about them to know the causes.
more_than_5400_amounts.description <- as.data.frame(sort(
table(more_than_5400_amounts$receipt_desc, dnn = 'description'),
decreasing = T),
responseName = 'count'
)
print('Show the counts of contribution amounts whose values are above $5,400.')
sort(table(more_than_5400_amounts$contb_receipt_amt),
decreasing = T)
print('Top 6 descriptions for amounts whose values are above $5,400:')
head(more_than_5400_amounts.description)
```
First I got the summary of the negative amounts to know the distribution of the values. And I found that the difference of mean and median are more than -\$700. I could guess that it happened because of the existence of outliers like -\$16,600. The result gave me one more question. Why there are outliers of amounts?
And then, I got the top 6 negative amounts to investigate outliers. The 2 biggest negative amounts were huge compared to the others. Therefore I printed all columns of the values from the data. And I found that they were refunds. How there can be refunds which are more than \$10,000? When I got datas about the contributers from Data, their record were all refunds.
Therefore I searched their names in FEC site. For 'Durham, Joe' there were some records that aren't in Data. For 'Clark, Elloine M.' there were lots of records but there were some variety of similar names. But I can't find the refund records. I couldn't answer the questions even by searching the site.
Let's come back to the original question. To know why there are negative values, in this time, I got the top 6 most common receipt descriptions and showed using graph, too. The most common reasons were refund, redesignation and reattribution. When I found about them using the internet, there were [contribution limits](https://www.opensecrets.org/overview/limits.php) and [remedying excessive contributions methods](https://www.fec.gov/help-candidates-and-committees/candidate-taking-receipts/remedying-excessive-contribution/). Therefore I found that negative values of contribution amounts aren't wrong datas even though -\$16,600 refund is still weird.
But I got one more question when I saw the contribution limits. An indivisual can contribute maximum \$5,400 for primary and general elections. I want to know why there are some contributions which are more than \$5,400. Therefore I got the counts of them. And I found that most of them are \$10,800. And when I got the counts of receipt descriptions of the contributions, I found that most of them are for reattribution and redesignation. When I saw the result, I guessed that it can be possible that \$10,800 are made for 2 people requesting reattribution. But I cannot know the exact reason using this data.
In this time, I'm going to show the distribution of contribution dates. I can use line graph this time because I thought that it is the right method to express time series data. I'll show the counts for each day and each month.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_contribution_date_counts}
dates_counts <- as.data.frame(table(Data$contb_receipt_dt, dnn = 'date'),
responseName = 'count')
# Make the type of date variable in date_counts as Date. I needed to do it because as I made new data.frame, the type of the variable changed to factor.
dates_counts$date <- as.Date(dates_counts$date, format = '%Y-%m-%d')
print('Top 6 most frequent contribution dates:')
head(dates_counts[order(-dates_counts$count),])
ggplot(aes(x = date, y = count),
data = dates_counts) +
geom_line() +
scale_x_date() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
# In this time, I needed to designate stat and fun.y in geom_line to aggregate daily data to each month. And I used sum as fun.y to get monthly counts of contributions.
# I made all day parts in date to '01' to aggregate daily counts and to get monthly counts. The type of result is character that I used as.Date to change the type to Date.
ggplot(aes(x = as.Date(format(date, '%Y-%m-01'), format = '%Y-%m-%d'),
y = count),
data = dates_counts) +
geom_line(stat = 'summary',
fun.y = sum) +
scale_x_date() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
When I see the top 6 dates of counts, there was the most number of contributions in July 12nd of 2016.
And I can see that the overall contribution counts are increased from March, 2015 to March, 2016. And then it decreased and increased 2 times peaking at July, 2016 and October, 2016. It is overall trend and I can see from the line graph for each date that there are really lots of and huge fluctuations.
In this time, I'm going to show the counts of contributions for each election type. I'll use bar plot again.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_election_type_counts}
election_types_counts <- as.data.frame(
sort(table(Data$election_tp, dnn = 'type'),
decreasing = T),
responseName = 'count')
election_types_counts
ggplot(aes(x = type, y = count),
data = election_types_counts) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Most of contributions are for P2016 and G2016 and the contributions for P2016 are more than 2 times of the contributions for G2016. US presidential elections are divided into 2 stages, primary and general. Therefore I think that it is right that most of financial contributions were for presidential election in 2016.
But there are some contributions whose election types, to my thinking, aren't related with it. I'm going to investigate deeply about them in the following.
```{r echo=FALSE, message=FALSE, warning=FALSE, Consider_election_types_other_than_P2016_and_G2016}
# I'm going to print head and tail of the o2016 data because there are 68 rows.
print('Head and tail parts of datas whose election type is O2016')
o2016 = subset(Data, election_tp == 'O2016')
head(o2016)
tail(o2016)
print('Are all O2016 contributions for Jill Stein?')
all(o2016$cand_nm == 'Stein, Jill')
print('Since When and till when the contributions are received?')
range(o2016$contb_receipt_dt)
print('Datas whose election type is P2012')
subset(Data, election_tp == 'P2012')
# I'm going to print head and tail of the data whose election type is '' because there are more than 1500 rows.
print("Head and tail parts of datas whose election type is ''")
election_type_vacant <- subset(Data, election_tp == '')
head(election_type_vacant)
tail(election_type_vacant)
print('For whom are the contributions?')
unique(election_type_vacant$cand_nm)
print('When the nominees received them?')
# I chose barplot to show the daily counts of type '' contributions.
ggplot(aes(x = contb_receipt_dt),
data = subset(election_type_vacant,
if_nominee == T)) +
geom_bar(stat = 'count') +
scale_x_date(date_breaks = '1 month') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
print("Since when and till when the candidates who aren't nominees received them?")
range(subset(election_type_vacant, if_nominee == F)$contb_receipt_dt)
ggplot(aes(x = contb_receipt_dt),
data = subset(election_type_vacant,
if_nominee == F)) +
geom_bar(stat = 'count') +
scale_x_date(date_breaks = '1 month') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
First, when I see the head and tail parts about datas whose type is O2016, it looked like they are all for Jill Stein. Thus, I checked whether the contributions are all for her and it was.
Jill Stein was Green party nominee that I want to know since when and till when the contributions are given. And I found that they are from 2016-11-23 to 2016-11-28. It is after the election. And compared to the the term of presidential election campaigns it is really short.
I searched about her and I found that it is related with [presidential election recount fundraising](https://en.wikipedia.org/wiki/Jill_Stein#2016). This is why they were classified as O2016, 'other' type election in 2016. It is not related with primary or general election. Therefore I think that it is better to exclude O2016 when exploring the data later.
Second, I got the datas whose type is P2012. I really wondered why the contributions related with 2012 election are in here.
There are only 2 contributions and I searched about ['Gunn, George'](https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2016&contributor_name=Gunn%2C+george&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016&contributor_employer=HSI) and ['Wylie, Wayne'](https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2016&contributor_name=wylie%2C+wayne&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016&contributor_employer=jpmorgan+chase) in FEC site. And I found that they did them to 'Rick Santorum for president, inc.(2012)'. Even though they contributed in 2015 and 2016, it looked like they did them for debt retirement for [Rick Santorum's 2012 US president election](https://en.wikipedia.org/wiki/Republican_Party_presidential_primaries,_2012). Therefore I think that the contributions are not for Santorum's 2016 election and I'll exclude P2012 when exploring the data.
Third, when I see the head and tail parts about datas whose type is '', it looked like they are all for several candidates. Thus, I checked who received the contributions and 11 candidates got them.
In the 11 candidates, there were nominees and those who aren't. Therefore I thought that it is good to investigate the data dividing by if_nominee variable.
When I got the distribution of daily counts of contributions, I found that most of them were contributed when the nominees were running campaigns for general election. It made me think that it is better to assign the type of contributions for nominees to G2016 and for other candidates to P2016.
I checked that it can be applied to candidates who weren't nominees by drawing the barplot of daily contribution counts for them. The contributions for them almost stopped from July, 2016. July, 2016 was the month that [Republican](https://en.wikipedia.org/wiki/Republican_Party_presidential_primaries,_2016) and [Democratic](https://en.wikipedia.org/wiki/Democratic_Party_presidential_primaries,_2016) presidential primaries took place.
But to be more confident to my thinking for assigning types, I searched for contributers whose names appear in the head and tail part of election_type_vacant using [FEC site](https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2016&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016).
Refunded contributions didn't appear in the site. Some contributions are recorded as primary(one example in [here](https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2016&contributor_name=WAY%2C+RICHARD+A+MR.&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016) 2016-08-26 data) instead of ''. And others are recorded '' as it is in Data.
I also found from [FEC contribution brochure site](https://transition.fec.gov/pages/brochures/contrib.shtml) that there can be presumptive redesignations for the cases of 'Is not designated in writing for a particular election;'.
Therefore I decided to leave the election type '' as it is. And I'll also use the data of the type too distinguishing from P2016 and G2016. I think that it is the case of not designating for a particular election in writing the contribution form.
In this time, I'm going to show the distribution of contributions for each parties in Texas. I'll use bar plot again.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univarite_party_counts}
parties_counts <- as.data.frame(sort(table(Data$cand_party, dnn = 'party'),
decreasing = T),
responseName = 'count')
parties_counts
ggplot(aes(x = party, y = count),
data = parties_counts) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
The most number of contributions were for Democratic party candidates. And they got about 20,000 more contributions than Republican party candidates. And rest of the parties and independent candidate got few contributions compared to the above 2 parties.
It is a lot amazing that even though the candidates who were included in Democratic party were just 5, the sum of contribution counts for them were more than the sum of counts for Republican party in which there were 17 candidates. I think that it means that in Democratic party there was at least one candidate who was famous in Texas and got support a lot. And I think that the candidate was Hillary Clinton, Democratic party nominee because when I drew univariate plot of contribution counts by each candidate name, I found that she got most number of them.
But related with contributions, the sum of amounts can be more important than the sum of counts. Therefore I'm going to explore how the sum of amounts were different by parties, candidates, election types, etc. in the bivariate sections.
I'm going to show the distribution of contributions for nominees and those who aren't. I want to compare the 2 groups. Therefore I need to use if_nominee column which I made.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_if_nominee_counts}
nominees_counts <- as.data.frame(sort(table(Data$if_nominee,
dnn = 'if_nominee'),
decreasing = T),
responseName = 'count')
nominees_counts
ggplot(aes(x = if_nominee, y = count),
data = nominees_counts) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
I wanted to compare the counts of the contributions for nominees and those who weren't. The number of contributions for nominees was about 20,000 more than the number of contributions for the others. But compared to the size of the numbers, the difference wasn't much big. I think that it means that there was at least one candidate who was likely to be a nominee but couldn't be. And I think that they were Ted Cruz, Senate from Texas and Bernie Sanders, Senate from Vermont because when I drew univariate plot of contribution counts by each candidate name, I found that they got distinctive number of contributions.
# Univariate Analysis
### What is the structure of your dataset?
There are 548396 contributions in the dataset with 20 features after merging. In these 20 features, 14 got factor type, 3 got string type and each of the rest 3 got Date, num, and logical types.
Other observations by summary:
1. There are 25 unique candidates in the dataset.
2. Hillary Clinton got the most number of contributions.
3. The mean of contribution amounts was \$138. And the median was \$38.
4. There are some amounts which are negatives.
5. IQR range of the contribution dates were from 2016-02-06 to 2016-08-12.
6. The counts of contributions for Democratic party were bigger than the counts of contributions for Republican party.
### What is/are the main feature(s) of interest in your dataset?
The main feature is contb_receipt_amt, the contribution amount. And I want to know how contribution counts, sums, quantiles, and averages became different by other variables.
### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
I think that cand_nm, cand_party, if_nominee, election_tp, contbr_city and contb_receipt_dt columns can help me to understand the political characteristics in Texas related with 2016 US presidential election. And I expect that receipt_desc column can sometimes give me information about the cause of some situations.
### Did you create any new variables from existing variables in the dataset?
I created the variable 'if_nominee' from cand_nm. If cand_nm is one of 'Clinton, Hillary Rodham', 'Johnson, Gary', 'McMullin, Evan', 'Stein, Jill' and 'Trump, Donald J.', the value is TRUE. And if cand_nm is not in above 5 names, the value is FALSE.
### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
I drew the histogram of the contb_receipt_amt to IQR range and got the top 6 most common contribution amounts as a table. And I found that the most common amount was \$25. It is not the number of tens that I thought that it was unusual. And when I drew the bar plot of contribution counts by parties, I thought that it was also unusual that a party in which had 5 candidates got more contributions by count than a party in which had 17 candidates.
I changed the type of contb_receipt_dt column from character to Date. I did it to express the character typed values as time series. And I expected that the variable can be used to express the trends of other variables and their values.
# Bivariate Plots Section
Before I start drawing bivariate plots, using what I learned from univariate analysis, I'm going to make a new dataframe by subsetting Data to get only concise and essential parts.
I'm going to select contb_receipt_amt, cand_nm, cand_party, if_nominee, election_tp, contbr_city, contb_receipt_dt and receipt_desc columns. And I'm going to exclude election type P2012 and O2016. And one more thing that I'm going to do is to set the order of types to make 'P2016' ahead of 'G2016'.
```{r echo=FALSE, message=FALSE, warning=FALSE, Make_new_concise_dataframe}
chosenColumns = c('contb_receipt_amt',
'cand_nm',
'cand_party',
'if_nominee',
'election_tp',
'contbr_city',
'contb_receipt_dt',
'receipt_desc')
new_Data <- Data[, chosenColumns]
new_Data <- subset(new_Data, !election_tp %in% c('P2012', 'O2016'))
new_Data$election_tp <- factor(new_Data$election_tp,
levels = c('P2016', 'G2016', ''))
str(new_Data)
```
There are 548326 contribution datas and 8 features in new dataframe.
I now start to draw bivariate plots. I want to investigate about how the contribution amounts are differed by different values in other variables first.
In this time, to be more specific, I want to know how the contribution amounts were changed as time went on. Therefore I'm going to use contb_receipt_amt and contb_receipt_dt columns. And I'll draw line graphs to show the trend and scatter plot to see how the distribution of contribution amounts are changed.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_date1}
ggplot(aes(x = contb_receipt_dt, y = contb_receipt_amt),
data = new_Data) +
geom_line(stat = 'summary', fun.y = sum) +
scale_x_date(date_breaks = '3 month') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
First plot is line graph which shows the trend of sum of the contribution amounts. The sum amounts sky rocketed 5 times and collapsed 2 times. And one collapsion which happened at about June, 2016 was amazingly severe. I wonder why it happened. It looked like lots of refund or redesignation or reattribution happened at the time. But I'm going to explore more about it in the multivariate section because I want to use receipt_desc variable too with candidates' names and contribution amounts.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_date2}
# I designated aes(color = 'method_name') in each geom_line to make legend. 'method_name' becomes one of the name of line in legend. And I used scale_color_brewer to assign each line colors automatically. I specified size in override.aes argument to make legend lines look thicker.
# And I aggregated daily data to monthly to see the overall trend of the statistical results without severe fluctuations.
ggplot(aes(x = as.Date(format(contb_receipt_dt, '%Y-%m-01'),
format = '%Y-%m-%d'),
y = contb_receipt_amt),
data = new_Data) +
geom_line(stat = 'summary', fun.y = mean,
aes(color = 'mean')) +
geom_line(stat = 'summary', fun.y = median,
aes(color = 'median')) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.05),
aes(color = '0.05quantile')) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.95),
aes(color = '0.95quantile')) +
scale_x_date(date_breaks = '3 month') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Methods',
override.aes = list(size = 2))) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Second plot is line graph which shows the trend of mean, median, 0.05 quantile, and 0.95 quantile amounts. And I chose to draw after aggregating daily dates to monthly to see the overall trends with less fluctuations. I can see that mean values were always bigger than median values. And I can also see that the difference from median values to 0.95 quantile values are much bigger than to 0.05 quantile values. I think it is because there are some outlier amounts. I think that I can check it by drawing scatter plot of dates and amounts. And I can see that 0.95 quantile and mean were biggest at March, 2015. I wonder why it happened especially at the time. But I'm going to explore more about it in the multivariate section because I want to use receipt_desc variable too with candidates' names and contribution amounts.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_date3}
ggplot(aes(x = contb_receipt_dt, y = contb_receipt_amt),
data = new_Data) +
geom_jitter(alpha = 0.05) +
scale_x_date(date_breaks = '3 month') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Third plot is scatter plot which shows how the distribution of contribution amounts were changed as time went on. I can see some outlier amounts which are about \$10,000 and some lines of dots. It makes me wonder how the outlier values are distributed by each party, candidates. I'll find about it later. And I can also see that almost all of contribution amounts are less than about $3,000. It means that there are some common contribution amounts. I think that I can see lines of dots more clearly as I transform the y axis to log scale. I chose log scale because there were less dots as amounts are bigger.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_date4}
# I subsetted to use only amounts equal to or above 0 to represent in log scale. And I plused 1 to amounts to make them above 0 after I transformed the scale to log. I think that adding $1 won't affect much to the values especially for amounts whose original values are more than $30.
ggplot(aes(x = contb_receipt_dt, y = contb_receipt_amt + 1),
data = subset(new_Data, contb_receipt_amt >= 0)) +
geom_jitter(alpha = 0.05) +
scale_x_date(date_breaks = '3 month') +
scale_y_log10(breaks = c(30, 100, 300, 1000, 3000, 10000)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Fourth plot is scatter plot whose y axis is transformed to log scale. And to represent in log scale, I used only amounts equal to or above 0. Therefore I showed the distribution of amounts without paying back values. I can see more clearly about lines of dots. And I can also find that from about August, 2015 lots of contributions are done with amounts of \$10 to \$300.
I could see how the contribution amounts were changed as time went on by drawing above plots.
In this time, I want to know how the contribution amounts are differed by nominees group and other candidates group. Therefore I'm going to use contb_receipt_amt and if_nominee columns. And I'll use boxplot and barplot to show the differences. Boxplot is for seeing the distribution of amounts for each group. And barplot is for comparing the sum amounts for each group.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_nominees}
amt_vs_if_nominee <- ggplot(aes(x = if_nominee, y = contb_receipt_amt),
data = new_Data) +
geom_boxplot(alpha = 0.2) +
scale_x_discrete()
amt_vs_if_nominee
# I designated aes(shape = 'mean') in geom_point to show the means in each group of boxplots and show what the symbols mean using legend. To designate shape I used scale_shape_manual.
# And I used coord_cartesian function to see the IQR range in detail. To designate proper ylim, I used trial and error method.
amt_vs_if_nominee +
geom_point(stat = 'summary', fun.y = mean, size = 2, aes(shape = 'mean')) +
scale_shape_manual('', values=c('mean' = 8)) +
coord_cartesian(ylim = c(-100, 300))
ggplot(aes(x = if_nominee, y = contb_receipt_amt, group = 1),
data = new_Data) +
geom_bar(stat = 'summary', fun.y = sum) +
scale_x_discrete()
```
When I see the first boxplot, I can find that lots of outlier amounts are for candidates who weren't nominees. To compare the distribution without outliers, I drew second boxplot limiting the amounts from -\$100 to \$300 values. And I also mark symbol of mean to compare between mean and median. And I can see that the mean and median of contribution amounts for nominees are a little smaller than those for the others. But I can find from barplot that the sum of amounts for nominees are a little bigger than the others.
In this time, I want to know how the contribution amounts are differed by parties. Therefore I'm going to use contb_receipt_amt and cand_party columns. And I'll use boxplot and barplot to show the differences as I did for if_nominee.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_party1}
amt_vs_party <- ggplot(aes(x = cand_party, y = contb_receipt_amt),
data = new_Data) +
geom_boxplot(alpha = 0.1) +
scale_x_discrete()
amt_vs_party
```
I can see from above boxplot that lots of outlier amounts, especially those of more than \$10,000 are for Republican party candidates. To see more clearly IQR range without outliers, I limited the result from -\$100 to \$600 in second boxplot. And it gave me interesting result.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_party2}
# I used again aes(shape = 'mean') and scale_shape_manual to show mean of each group. I also used again coord_cartesian to close up the IQR range.
amt_vs_party +
geom_point(stat = 'summary', fun.y = mean, size = 2, aes(shape = 'mean')) +
scale_shape_manual('', values = c('mean' = 8)) +
coord_cartesian(ylim = c(-100, 600))
new_Data %>%
group_by(cand_party) %>%
summarise(mean_amount = mean(contb_receipt_amt),
median_amount = median(contb_receipt_amt)) %>%
t() %>%
as.data.frame()
```
The mean and median amounts for Independent candidate and Libertarian party are bigger than those for the others. Even their IQR is large. But they almost don't have outliers. Considering the fact that almost all of the contributions are for Democratic and Republican parties which I found in univariate analysis, it looked like the contributions for independent candidate and Libertarian party are done with larger amounts at a time than those for the 2 main parties. The mean and median amounts for Green party are slightly bigger than those for Republican party. And Democratic party got smallest mean and median. It is interesting that the mean, median of the party are less than the values of Republican party even though the number of contributions for the party are biggest.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_party3}
ggplot(aes(x = cand_party, y = contb_receipt_amt),
data = new_Data) +
geom_bar(stat = 'summary', fun.y = sum) +
scale_x_discrete()
```
And third plot shows that the sum of contribution amounts for Republican party was bigger than the value of Democratic party. And compared to the 2 parties, what others got were really few.
In this time, I want to know how the contribution amounts are differed by candidates. Therefore I'm going to use contb_receipt_amt and cand_nm columns. And I'll use boxplot and barplot to show the differences as I did for cand_party.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_candidates1}
amt_vs_cands <- ggplot(aes(x = cand_nm, y = contb_receipt_amt),
data = new_Data) +
geom_boxplot(alpha = 0.2) +
scale_x_discrete() +
theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5))
amt_vs_cands
```
In first boxplot, the outlier amounts for Cruz are outstanding. Except for them, I can see that there are lots of outliers for Carson, Clinton, Paul, Sanders and Trump. And the distribution of the amounts for Jeb Bush is also interesting because IQR is large but there are some huge outlier amounts.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_candidates2}
# I used again aes(shape = 'mean') and scale_shape_manual to show mean of each group. I also used again coord_cartesian to close up the IQR range.
amt_vs_cands +
geom_point(stat = 'summary', fun.y = mean, size = 2, aes(shape = 'mean')) +
scale_shape_manual('', values = c('mean' = 8)) +
coord_cartesian(ylim = c(-100, 3000))
new_Data %>%
group_by(cand_nm) %>%
summarise(mean_amount = mean(contb_receipt_amt),
median_amount = median(contb_receipt_amt)) %>%
arrange(mean_amount)
```
I limited from -\$100 to \$3,000 amounts in second plot. And I also showed the mean and median amounts for all candidates arranged by mean values. The mean and median for Clinton and Sanders are smallest. But they have lots of outliers. Jeb Bush got second largest IQR. The distributions for Carson, Cruz, Paul and Rubio show that the values of mean and median are small compared to the others, but there are lots of outliers.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_candidates3}
# I arranged by n to order the candidates' names and their sum of amounts as I ordered the names by contribution counts.
cands_sum_amounts <- new_Data %>%
group_by(cand_nm, cand_party) %>%
summarise(sum_amounts = sum(contb_receipt_amt),
n = n()) %>%
arrange(desc(n))
# I used factor function to order the levels in the factor of candidates' names as I arranged the names in data.frame. Ordering the data.frame rows is different from ordering the factor's levels.
cands_sum_amounts$cand_nm <- factor(cands_sum_amounts$cand_nm,
levels = cands_sum_amounts$cand_nm)
ggplot(aes(x = cand_nm, y = sum_amounts),
data = cands_sum_amounts) +
geom_bar(stat = 'identity') +
scale_x_discrete() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
subset(cands_sum_amounts, select = -cand_party)
```
I drew the barplot of the sum of amounts for each candidates. When I represent the graph, I arranged the order by contribution counts, not by the sum of amounts. I did it intentionally to compare the order of counts and sums. And it shows really interesting result. I found that the top 5 candidates by contribution counts were Clinton, Cruz, Sanders, Trump and Carson in order. Clinton and Cruz still got first and second places. But Trump went to 3rd place beating Sanders by huge gap. What is most interesting is that Jeb Bush went to 4th place beating Sanders, Rubio and Carson. It looked like his outlier contribution amounts played huge role.
I think that it is also worth trying to know how the contribution amounts are differed by election types. Therefore I'm going to use contb_receipt_amt and election_tp columns. And I'll use boxplot and barplot to show the differences as I did for cand_nm.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_election_type}
# I won't show the result of the datas whose election type '' because I don't know what exact type they got.
amt_vs_type <- ggplot(aes(x = election_tp, y = contb_receipt_amt),
data = subset(new_Data, election_tp != '')) +
geom_boxplot(alpha = 0.2) +
scale_x_discrete()
amt_vs_type
# I used again aes(shape = 'mean') and scale_shape_manual to show mean of each group.
amt_vs_type +
geom_point(stat = 'summary', fun.y = mean, size = 2, aes(shape = 'mean')) +
scale_shape_manual('', values = c('mean' = 8)) +
coord_cartesian(ylim = c(-100, 200))
ggplot(aes(x = election_tp, y = contb_receipt_amt),
data = subset(new_Data, election_tp != '')) +
geom_bar(stat = 'summary', fun.y = sum) +
scale_x_discrete() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
I can see from the first boxplot that there are a lot more outliers for P2016 than G2016. Without outliers in second boxplot, I can see that the mean and median of P2016 values are bigger than those of G2016. In the barplot of the sum of amounts by election types I can see that the sum for P2016 is more than 2 times of the sum for G2016.
In this time, I want to know how the contribution amounts are differed by cities. Therefore I'm going to use contb_receipt_amt and contbr_city columns. But I found in the univariate analysis that there are 2252 different city names in the Data. Therefore I'm going to show only the result of the top 10 cities by contribution counts.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_city1}
sum_amounts_by_city <- new_Data %>%
group_by(contbr_city) %>%
summarise(sum_amount = sum(contb_receipt_amt),
mean_amount = mean(contb_receipt_amt),
median_amount = median(contb_receipt_amt),
n = n()) %>%
arrange(desc(n)) %>%
head(10)
top10_sum_amounts_city <- subset(new_Data,
contbr_city %in%
sum_amounts_by_city$contbr_city)
ggplot(aes(x = contbr_city, y = contb_receipt_amt),
data = top10_sum_amounts_city) +
geom_boxplot(alpha = 0.1) +
scale_x_discrete() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
First plot is boxplot which shows the distribution of amounts in each top 10 city. And I can see lots of outlier amounts in Houston and Dallas.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_city2}
# I designated position_nudge and width in geom_bar to represent 0.95quantiles, means and medians in a graph. Position_nudge moves the x position of each bar and width controls widths of bars.
# To show the legend, I used again aes(color = 'method_name') in geom_bar and scale_color_brewer.
ggplot(aes(x = contbr_city, y = contb_receipt_amt, group = 1),
data = top10_sum_amounts_city) +
geom_bar(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.95),
aes(fill = '0.95quantile'),
position = position_nudge(x = 0.33),
width = 0.33) +
geom_bar(stat = 'summary', fun.y = mean,
aes(fill = 'mean'),
width = 0.33) +
geom_bar(stat = 'summary', fun.y = median,
aes(fill = 'median'),
position = position_nudge(-0.33),
width = 0.33) +
scale_x_discrete() +
scale_fill_brewer(type = 'qual',
guide = guide_legend(title = 'Methods',
override.aes = list(size = 2))) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Second plot is bar graph which shows the median, mean and 0.95 quantile amounts by each top 10 city. It shows that 0.95 quantile value of Dallas was really big. The mean of Dallas was biggest too but not much different from Houston. In the following, I'm going to explore more about top 5% amounts from Dallas.
```{r echo=FALSE, message=FALSE, warning=FALSE, Explore_more_about_Dallas_top_0.05}
# I grouped by city and use filter to get top 5% amounts of each city. And I made new column 'if_dallas' using mutate to specify Dallas datas.
top_5percent_by_city <- new_Data %>%
group_by(contbr_city) %>%
filter(contb_receipt_amt >= quantile(contb_receipt_amt, 0.95)) %>%
mutate(if_dallas = ifelse(contbr_city == 'DALLAS', TRUE, FALSE))
# I used histogram because I want to know the distribution of contribution amounts. And I designated fill as if_dallas to compare Dallas datas to other cities' datas at a time.
ggplot(aes(x = contb_receipt_amt, fill = if_dallas),
data = top_5percent_by_city) +
geom_histogram(binwidth = 100, alpha = 0.7)
# To help know the distribution of amounts more in detail, I printed top 6 common values in top 5% amounts from Dallas.
head(as.data.frame(sort(table(subset(top_5percent_by_city,
if_dallas == T)$contb_receipt_amt,
dnn = 'amount'),
decreasing = T),
responseName = 'count'))
```
I can see that almost all of top 5% contributions from Dallas were \$2,700. And there were really small counts for the amounts less than \$2,700 which I could expect a lot more from the other cities. It looked like that is why their 0.95 quantile was biggest.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_city3}
ggplot(aes(x = contbr_city, y = contb_receipt_amt, group = 1),
data = top10_sum_amounts_city) +
geom_bar(stat = 'summary', fun.y = sum) +
scale_x_discrete() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Third plot is barplot of the sum of amounts of each top 10 city. It shows that the sum from Houston is bigger than the sum from Dallas. It looked like even if there were some high contributions from Dallas, but the sum from Houston was bigger because of bigger contribution counts.
==================================================
Contribution amounts are main feature that I was interested in. And I drew lots of bivariate graphs related with them.
From now on, I'm going to draw bivariate graphs taking support features. I want to know more about the political differences between cities in Texas.
First, I want to know how the contribution counts for each candidate are differed by each city. Therefore I'm going to use cand_nm and cand_party columns. And I'll use barplot because I need to show the differences of several candidates by each city, which are categorical values. And I used stacked bar to compare several candidates by each city at a time.
But before I'm going to draw the plot, I'm going to make a variable named main_cands_counts. Its one column contains the names of candidates and the other column contains the counts of contributions. And there will be only 7 names because I chose to select only 7 main candidates by contribution counts.
```{r echo=FALSE, message=FALSE, warning=FALSE, Get_main_candidates_by_contribution_counts}
main_cands_counts <- new_Data %>%
group_by(cand_nm) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
head(7)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_cand_nm_and_city}
# I used sum_amounts_by_city which I already made to make top10_sum_amounts_city. In its contbr_city column, there are top 10 city names by contribution counts.
# I also used main_cands_counts to show only the result of 7 main candidates.
ggplot(aes(x = contbr_city, fill = cand_nm, group = cand_nm),
data = subset(new_Data,
contbr_city %in% sum_amounts_by_city$contbr_city &
cand_nm %in% main_cands_counts$cand_nm)) +
geom_bar(stat = 'count', position = 'stack') +
scale_fill_brewer(type = 'qual',
palette = 2,
guide = guide_legend(title = 'Candidates',
override.aes = list(size = 2))) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
It also gave me really funny result. I can see that Clinton got the most number of contributions from all cities except Spring. But second places are changed by each city. In Austin Sanders got far more contributions than Cruz. But in Dallas and San Antonio Sanders got almost same to Cruz. And the relation became reversed in Houston. The result makes me wonder how the distribution for 2 main parties would be different by each city. I think that I will see a huge gap of Democratic party and Republican party in Austin. And I expect that there will be smaller gap in Houston. I already found from univariate section that almost all contributions from Texas were for Democratic or Republican party candidates. Therefore I decided to compare only 2 main parties. And I'll use barplot to compare between categorical values like city and party.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_party_and_city}
ggplot(aes(x = contbr_city, fill = cand_party, group = cand_party),
data = subset(new_Data,
contbr_city %in% sum_amounts_by_city$contbr_city &
cand_party %in% c('Democratic', 'Republican'))) +
geom_bar(stat = 'count', position = 'dodge') +
scale_fill_brewer(type = 'qual',
palette = 2,
guide = guide_legend(title = 'Party',
override.aes = list(size = 2))) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Yes, I can see the huge gap in Austin and smaller gap in Houston. The count to Democratic party was highest in Austin. And the count to Republican party was highest in Houston. And I can see that the contributions from almost all cities are higher for Democratic party than for Republican party.
==========================================================
In the above part, I drew bivariate graphs related with cities. In this time, I'm going to draw bivariate graphs related with dates. And I chose dates because I think that contribution date is useful to let me know about the trends.
First, I want to know how the daily contribution counts were changed as time went on. Especially, I want to compare between the candidates who became nominees and those who couldn't. Therefore I'm going to use contb_receipt_dt and if_nominee columns. And I'll use line graph to show the trend of contribution counts.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_date_and_nominees}
ggplot(aes(x = contb_receipt_dt, color = if_nominee),
data = new_Data) +
geom_line(stat = 'count') +
scale_x_date(date_breaks = '3 month') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
I can see that until about June, 2016 lots of contributions are for candidates who weren't nominees. It is likely because at the time nominees weren't decided yet. But the contributions for nominees increased from about February, 2016 and soared highly on about July, 2016. In this time the contributions for candidates who weren't nominees decreased to almost 0. But overall contributions for nominees decreased until about September, 2016 and again increased until the general election. I wonder what happened at September, 2016 that the contributions are decreased.
In this time, I want to know how the contribution counts were changed as time went on especially comparing between the parties. Therefore I'm going to use contb_receipt_dt and cand_party columns. And I'll use line graph this time too to show the trend of contribution counts.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_date_and_party}
ggplot(aes(x = contb_receipt_dt, color = cand_party),
data = new_Data) +
geom_line(stat = 'count') +
scale_x_date(date_breaks = '3 month') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
I can see that there are a lot more dramatic fluctuations of contributions for Republican party than those for Democratic party. The contributions for Republican party increased more highly and decreased more deeply but the counts are generally more than the counts for Democratic party. But I can see that the contributions for Democratic party catched up almost to the contributions for Republican party at about March, 2016. And I can also see some terms that the contributions for Democratic party are more than those for Republican party. They were about from May to June of 2016 and August to November of 2016. I guess that it is somewhat because of Trump. I think that I can find more about it when I draw the plots of dates and candidates.
Therefore in this part, I'm going to see how the contribution counts were changed as time went on, especially comparing between Trump and Clinton. I chose the 2 candidates because they became main parties' nominees. I'm going to use contb_receipt_dt and cand_nm columns. And I'll use line graph again.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_date_and_main_nominees}