contributions_by_Texas.Rmd

Exploration of Financial Contributions to Presidential Campaigns in Texas by Jehyeon Heo.
========================================================

```{r echo=FALSE, message=FALSE, warning=FALSE, Packages}
library(ggplot2)
library(gridExtra)
library(dplyr)
library(RColorBrewer)
library(GGally)
library(scales)
library(memisc)
library(reshape2)
```

```{r echo=FALSE, message=FALSE, warning=FALSE, Load_the_datas}

# Load the financial data.
Data <- read.csv('ctrbsTX.csv',
                  row.names = NULL,
                  header = TRUE)

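# With row.names = NULL, read.csv labels the first column 'row.names' and
# shifts every header one column to the right. Keep the real header names.
data_headers <- colnames(Data)[-1]

# After the shift, the last column (labelled 'election_tp') is all NAs; drop it.
Data$election_tp <- NULL

# Assign the correct header name to each column.
colnames(Data) <- data_headers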

# Load the candidates data.
Cand <- read.csv('candidates.csv')

```
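
 The header repair in the chunk above can be illustrated on a small in-memory CSV. This is only a sketch, assuming the FEC file has the same shape as the toy text below, where every data row ends with a trailing comma and so has one more field than the header. The toy values and the names `csv_text`, `raw`, `shifted_names` and `real_headers` are made up for illustration.

```r
# Toy CSV (hypothetical values): each data row ends with a trailing comma,
# so it has one more field than the header line.
csv_text <- "cand_nm,contb_receipt_amt,election_tp
Clinton,25,P2016,
Trump,50,G2016,"

# row.names = NULL stops read.csv from using the first field as row names,
# but the first column is then labelled 'row.names' and the headers shift.
raw <- read.csv(text = csv_text, row.names = NULL, header = TRUE)
shifted_names <- colnames(raw)

# Same repair as in the report: keep the real headers, drop the last
# (empty) column, then put the headers back on the remaining columns.
real_headers <- shifted_names[-1]
raw$election_tp <- NULL
colnames(raw) <- real_headers

print(shifted_names)
print(raw)
```

 After the repair, each value sits under the header it was written for in the file.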

 The United States held a presidential election in 2016. As a result, Donald Trump became the 45th president of the US. He made a sensational impact on the Republican party and became its nominee. His biggest rival was Hillary Clinton, the Democratic party nominee. Many people forecast that Hillary Clinton would win, but she lost even though she received more votes than Trump.

 In this document, I explore the financial contributions made in Texas to the presidential campaigns of the 2016 US election. I chose Texas because I was interested in the state.

 I got the financial data from the [FEC site](http://classic.fec.gov/disclosurep/pnational.do). I also used a second data set about the candidates, which holds the gender, party, height and age of each candidate. I built that data set myself by searching the internet.
 
 Before starting the analysis, here are the questions I had about the data.

 1. How are the contributions distributed?
 2. How does the sum of contributions differ by party, date, candidate, and so on?
 3. How do the answers to the questions above compare between Clinton and Trump, and between the Republican and Democratic parties?
 4. Which contributor gave the largest amount, and who contributed most often?

 I'm sure more questions will come up as I explore the data. I'll note each new question where it arises and show the results of exploring it.
 
# Basic Data Exploration Section

 First, I'm going to explore the financial data. I loaded it under the name 'Data'.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Basic_Exploration_for_Data}

# Get the number of observations and columns.
dim(Data)

# Show the structure of the data.
str(Data)

# Show the distributions of the data by summary.
summary(Data)

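# Get the number of candidates.
print('Number of candidates in Data:')
length(unique(Data$cand_nm))

# Show the most common contributor names. A sorted table is used instead of
# summary() so the result does not depend on the column being a factor.
print("Top 5 most common contributors' names:")
head(sort(table(Data$contbr_nm), decreasing = TRUE), 5)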
```

 Data has about 550,000 observations and 18 variables, and it covers 25 candidates.

 The structure and the summary of Data show that it would be difficult to identify unique contributors. If I try to tell contributors apart by name, many different people can share the same name. Even if I also use the city or occupation variables, I cannot know whether the contributions were made by different people who live in the same city or by one person who moved to another city and changed jobs.

 The following example uses one of the most common contributor names, 'RUDOLPH, BONNIE'.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Same_name_different_people_example}

# Select the rows registered under the name 'RUDOLPH, BONNIE'.
bonnie <- subset(Data, contbr_nm == 'RUDOLPH, BONNIE')

print('How many unique city names are registered for this name?')
length(unique(bonnie$contbr_city))

print('What are the city names?')
unique(bonnie$contbr_city)

print('How many unique zip codes are registered for this name?')
length(unique(bonnie$contbr_zip))

print('What are the zip codes?')
unique(bonnie$contbr_zip)

print('How many employers are registered for this name?')
length(unique(bonnie$contbr_employer))

print('Who are the employers?')
unique(bonnie$contbr_employer)

print('How many occupations are registered for this name?')
length(unique(bonnie$contbr_occupation))

print('What are the occupations?')
unique(bonnie$contbr_occupation)
```

 When I explored the top 5 most common contributor names, I found that RUDOLPH, BONNIE contributed 463 times. The contributor appears to live in Laredo, Texas. Judging from the zip codes and a Google search, I think 'Latedo' is a misspelling of Laredo.
 But the employer and occupation values make me doubt that the name identifies a single person. Is this contributor retired? Working at a university? Or someone who worked at a university and then retired? I cannot tell from this data alone.
 Therefore I cannot answer the questions about 'unique' contributors using this data.
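
 The ambiguity described above can be measured by counting how many distinct values of another column appear under each name. The following is a minimal sketch on made-up rows. The helper `n_distinct_by_name` and the toy names are hypothetical, and only the column names are taken from Data.

```r
# Toy rows (hypothetical): one name with conflicting city and occupation values.
toy <- data.frame(
  contbr_nm = c("DOE, JANE", "DOE, JANE", "DOE, JANE", "ROE, RICHARD"),
  contbr_city = c("LAREDO", "LATEDO", "LAREDO", "AUSTIN"),
  contbr_occupation = c("RETIRED", "PROFESSOR", "RETIRED", "ENGINEER"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct values recorded under each name.
n_distinct_by_name <- function(values, names) {
  tapply(values, names, function(v) length(unique(v)))
}

city_variants <- n_distinct_by_name(toy$contbr_city, toy$contbr_nm)
occupation_variants <- n_distinct_by_name(toy$contbr_occupation, toy$contbr_nm)

print(city_variants)
print(occupation_variants)
```

 A name with more than one city or occupation is either one person whose details changed or were mistyped, or several people who share a name. The counts alone cannot tell these cases apart.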
 
 The structure of Data also shows that contb_receipt_dt should be a date, not a string, so I'm going to change its type. I also think file_num and tran_id are better stored as strings than as num or factor, because they take too many distinct values.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Change_types}

# Set locale of R.
print("Set locale as 'C'")
Sys.setlocale("LC_TIME", "C")

# Change the type to Date.
Data$contb_receipt_dt <- as.Date(
  Data$contb_receipt_dt, '%d-%b-%y'
)

# Change the types of file_num and tran_id.
Data$file_num <- as.character(Data$file_num)
Data$tran_id <- as.character(Data$tran_id)

# Check the result using str function.
str(Data$contb_receipt_dt)
str(Data$file_num)
str(Data$tran_id)
```

 The type of contb_receipt_dt is now Date, and the types of file_num and tran_id are now character.
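
 As a small check of the format string used above, the sketch below parses one date written as day, abbreviated month and two-digit year. The sample string is made up. The 'C' locale is set first because `%b` is matched against month names in the current locale.

```r
# Use English month abbreviations regardless of the system language.
Sys.setlocale("LC_TIME", "C")

# Hypothetical sample value in day-month-year form.
sample_date <- "12-Jul-16"

# %d = day of month, %b = abbreviated month name, %y = two-digit year.
parsed <- as.Date(sample_date, format = "%d-%b-%y")

print(parsed)
print(class(parsed))
```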
 
 Now I'm going to explore the candidates data, too. I loaded it under the name 'Cand'.

```{r echo=FALSE, message=FALSE, warning=FALSE, Basic_exploration_for_Cand}

# Get the number of observations and columns.
dim(Cand)

# Show the structure of the data.
str(Cand)

# Show the distributions of the data by summary.
summary(Cand)

# Check whether the candidate names in Data match the X values in Cand.
# setequal() compares the two sets of names regardless of their order.
print("Do the unique candidate names in Data match the values in Cand?")

setequal(as.character(unique(Data$cand_nm)), as.character(Cand$X))

```

 Cand holds information on 25 candidates. 17 of them belong to the Republican party and 5 to the Democratic party. Only 3 of the candidates are female.

 I think it is useful to add a new column that indicates who became the presidential nominee of each party. I'm going to name the column 'if_nominee' and assign TRUE to nominees and FALSE to the others. I'm also going to treat Evan McMullin, an independent presidential candidate, as a nominee.

```{r echo=FALSE, message=FALSE, warning=FALSE, Make_if_nominee_column}

nominees <- c('Clinton, Hillary Rodham',
              'Trump, Donald J.',
              'Johnson, Gary',
              'McMullin, Evan',
              'Stein, Jill')

# %in% already returns TRUE/FALSE, so no ifelse() is needed.
Cand$if_nominee <- Cand$cand_nm %in% nominees

summary(Cand$if_nominee)
```

 5 candidates became nominees and the others could not (or did not).

 While exploring the candidates data, I found that the candidate names are the same in Data and Cand, so I can merge the two on the 'cand_nm' column with a left outer join that keeps every row of Data. I'm going to bring in only the 'cand_party' and 'if_nominee' columns, because I think they will help me understand the contributions for parties and nominees.

```{r echo=FALSE, message=FALSE, warning=FALSE, Outer_join_Data_and_Cand}

# Merge 'Data' and 'Cand' with a left outer join (all.x = TRUE keeps every row of Data).
Data <- merge(x = Data, 
              y = Cand[,c('cand_nm', 'cand_party', 'if_nominee')],
              by = 'cand_nm',
              all.x = TRUE)

print('Column names after merging:')
names(Data)
```

 I can see that the 'cand_party' and 'if_nominee' columns were added to Data.
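
 The effect of `all.x = TRUE` can be seen on a toy example. This is a sketch with made-up rows, not the real tables, and the party labels are assumptions. Every row of the left table is kept, and a candidate missing from the right table gets NA in the added columns.

```r
# Toy contribution rows (hypothetical); 'Unknown, Person' is not in the lookup.
contribs <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Cruz, Ted", "Unknown, Person"),
  contb_receipt_amt = c(25, 50, 100),
  stringsAsFactors = FALSE
)

# Toy candidate lookup with the nominee flag built by %in%.
cands <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Cruz, Ted"),
  cand_party = c("Democratic", "Republican"),
  stringsAsFactors = FALSE
)
cands$if_nominee <- cands$cand_nm %in% c("Clinton, Hillary Rodham")

# Left outer join: keep every row of contribs, add the matching columns.
merged <- merge(x = contribs, y = cands, by = "cand_nm", all.x = TRUE)

print(merged)
```

 Counting the NA values in an added column after the merge is a quick way to confirm that every candidate name found a match.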

# Univariate Plots Section

 Now I'm going to draw univariate plots.

 First, I'm going to show the distribution of contribution counts for each candidate, and also each candidate's percentage of the total count.

 I decided to use a bar plot because the x variable is the candidate name, which is categorical, and the y variable is a count. I expect to use bar plots often because most of the variables in Data are categorical.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_cand_nm_counts}

# From now on, I'll pass several variables through table(), sort() and
# as.data.frame(). This lets ggplot draw the bars in order of count and makes
# it easy to filter values by count.

# dnn in table() names the dimension of the result; the default name is 'Var1'.
# responseName in as.data.frame() names the count column; the default is 'Freq'.

# decreasing = TRUE in sort() puts the names with the most counts first.
name_counts <- as.data.frame(sort(table(Data$cand_nm, dnn = 'name'),
                                  decreasing = TRUE),
                             responseName = 'count')

name_counts

# I'll often use theme() to rotate the x labels 90 degrees when they are long.
# hjust = 1 aligns the text to the bottom of the plot and vjust = 0.5 centers
# it on each tick.
ggplot(aes(x = name, y = count),
       data = name_counts) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# I chose count/sum(count) as y variable to express percentage.
ggplot(aes(x = name, y = count/sum(count)),
       data = name_counts) +
  geom_bar(stat = 'identity') +
  scale_y_continuous(labels = percent_format()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 Clinton received the largest number of contributions, followed by Cruz, Sanders, Trump, Carson, and others. In percentages, Clinton got about 38%, and Cruz, Sanders, Trump and Carson followed with about 25%, 15%, 14% and 5% respectively. I wonder whether a candidate who received more contributions also received a larger total amount. I'll explore this by summing the amounts for each candidate in the bivariate section later.
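
 The table-to-data-frame pattern and the percentage calculation used above can be sketched on a short vector. The vector `cand` below is made up for illustration.

```r
# Toy vector of candidate names (hypothetical), one entry per contribution.
cand <- c("Clinton", "Cruz", "Clinton", "Sanders", "Clinton", "Cruz")

# Count each name, sort by count, and convert to a data frame whose
# columns are called 'name' and 'count'.
counts <- as.data.frame(sort(table(cand, dnn = "name"), decreasing = TRUE),
                        responseName = "count")

# Share of all contributions for each name.
counts$share <- counts$count / sum(counts$count)

print(counts)
```

 The `name` column comes out as a factor whose levels follow the sorted order, which is what makes ggplot draw the bars from the largest count to the smallest.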

 Next, I'm going to show the distribution of contribution counts for each city in Texas.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_city_counts}

cities_counts <- as.data.frame(sort(table(Data$contbr_city, dnn = 'name'),
                                    decreasing = T),
                               responseName = 'count')

print("Number of contributors' cities")
length(cities_counts$name)

ggplot(aes(x = name, y = count),
       data = cities_counts[1:10, ]) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 There are 2252 city names in the data, so I showed only the top 10.

 The top 10 cities of Texas by population are, in order, Houston, San Antonio, Dallas, Austin, Fort Worth, El Paso, Arlington, Corpus Christi, Plano and Laredo. The top 10 cities by contribution count are ranked a little differently. San Antonio and Austin swap places, and Corpus Christi and Laredo, which are in the top 10 by population, do not appear in the top 10 by count.

 Next, I'm going to show a histogram of contribution amounts. A histogram fits here because the contribution amount can be treated as a continuous variable.

```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_contribution_amount}

summary(Data$contb_receipt_amt)

print('Top 6 most frequent contribution amounts:')
head(sort(table(Data$contb_receipt_amt), decreasing = T))

# I designated binwidth as $100, because the range of amounts is about $30,000.
# I think that $100 is the right binwidth compared to the range.
ggplot(aes(x = contb_receipt_amt),
       data = Data) +
  geom_histogram(binwidth = 100) +
  scale_x_continuous(breaks = seq(-15000, 15000, 5000)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# Show the histogram for the middle 50% of amounts. I set binwidth to $1
# because the interquartile range is $80.
ggplot(aes(x = contb_receipt_amt),
       data = subset(Data,
                     contb_receipt_amt >= quantile(Data$contb_receipt_amt, 0.25) &
                       contb_receipt_amt <= quantile(Data$contb_receipt_amt, 0.75))) +
  geom_histogram(binwidth = 1) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  scale_x_continuous(limits = c(15, 105))

```
 
 The minimum and maximum amounts were -\$16,600 and \$15,000 respectively. It is strange that some contribution amounts are negative. I wonder what causes this.

 The most common contribution amount is \$25, followed by \$50 and \$100. It was unexpected that \$25 is the most common, because I thought people would find it convenient to contribute in multiples of ten.

 The table and the histogram show that most contributions are between \$20 and \$100.
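
 The middle-50% subset used for the second histogram can be sketched as follows. The amounts are made up, and `quantile()` is used with its default settings.

```r
# Toy amounts (hypothetical), including a refund and one large contribution.
amounts <- c(-100, 10, 25, 25, 50, 100, 2700)

# First and third quartiles of the amounts.
q <- quantile(amounts, c(0.25, 0.75))

# Keep only the values between the quartiles (inclusive).
middle <- amounts[amounts >= q[1] & amounts <= q[2]]

print(q)
print(middle)
```

 Restricting the plot to this range keeps the extreme refunds and large contributions from stretching the x axis.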

 Next, I'm going to investigate the negative contribution amounts.

```{r echo = FALSE, message=FALSE, warning=FALSE, Univariate_more_about_amounts}

minus_amounts <- subset(Data, contb_receipt_amt < 0)

summary(minus_amounts$contb_receipt_amt)

print('Top 6 negative amounts:')
head(sort(minus_amounts$contb_receipt_amt))

print('Rows whose contribution amounts are less than -$10,000:')
subset(minus_amounts, contb_receipt_amt < -10000)

print('Rows for DURHAM, JOE and CLARK, ELLOINE M.:')
subset(Data, contbr_nm %in% c('DURHAM, JOE', 'CLARK, ELLOINE M.'))

# Get the descriptions about them to know the causes.
minus_amounts.description <- as.data.frame(sort(
  table(minus_amounts$receipt_desc, dnn = 'description'),
  decreasing = T),
  responseName = 'count'
  )

print('Top 6 most common descriptions for negative values:')
head(minus_amounts.description)

ggplot(aes(x = description, y = count),
       data = head(minus_amounts.description)) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# I also want to know why there are amounts whose values are above $5,400.
more_than_5400_amounts <- subset(Data, contb_receipt_amt > 5400)

# Get the descriptions about them to know the causes.
more_than_5400_amounts.description <-  as.data.frame(sort(
  table(more_than_5400_amounts$receipt_desc, dnn = 'description'),
  decreasing = T),
  responseName = 'count'
  )

print('Show the counts of contribution amounts whose values are above $5,400.')
sort(table(more_than_5400_amounts$contb_receipt_amt),
     decreasing = T)

print('Top 6 descriptions for amounts whose values are above $5,400:')
head(more_than_5400_amounts.description)
```

 First I looked at the summary of the negative amounts to see how the values are distributed. The mean and the median differ by more than \$700. I guessed that outliers such as -\$16,600 cause this. That raised one more question: why are there such outliers?

 Then I listed the 6 most negative amounts to investigate the outliers. The 2 largest negative amounts were huge compared to the others, so I printed every column of those rows and found that they were refunds. How can there be refunds of more than \$10,000? When I pulled all records for these contributors from Data, every record was a refund.

 I therefore searched for their names on the FEC site. For 'Durham, Joe' there were some records that are not in Data. For 'Clark, Elloine M.' there were many records, but under several similar names. I could not find the refund records, so I could not answer the question even by searching the site.
 
 Let's return to the original question. To learn why there are negative values, I got the top 6 most common receipt descriptions and also showed them in a graph. The most common reasons were refund, redesignation and reattribution. Searching the internet, I found that there are [contribution limits](https://www.opensecrets.org/overview/limits.php) and [methods for remedying excessive contributions](https://www.fec.gov/help-candidates-and-committees/candidate-taking-receipts/remedying-excessive-contribution/). So negative contribution amounts are not bad data, although the -\$16,600 refund is still strange.
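
 The description counts for negative amounts can be sketched like this. The rows and the description strings are made up and may not match the exact wording stored in receipt_desc.

```r
# Toy rows (hypothetical amounts and descriptions).
toy <- data.frame(
  contb_receipt_amt = c(-50, -2700, 25, -100, 100, -25),
  receipt_desc = c("Refund", "REDESIGNATION", "", "Refund", "", "REATTRIBUTION"),
  stringsAsFactors = FALSE
)

# Keep the negative amounts and count their descriptions, most common first.
negative <- subset(toy, contb_receipt_amt < 0)
desc_counts <- sort(table(negative$receipt_desc), decreasing = TRUE)

print(desc_counts)
```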
 
 The contribution limits raised one more question. An individual can contribute at most \$5,400 in total to a candidate across the primary and general elections. I wanted to know why some contributions exceed \$5,400, so I counted them and found that most of them are \$10,800. When I counted the receipt descriptions of these contributions, most of them were for reattribution and redesignation. From this I guess that a \$10,800 contribution may come from 2 people, with a reattribution request splitting it between them. But I cannot know the exact reason from this data.
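
 The arithmetic behind that guess is short. The sketch below assumes the \$5,400 combined limit mentioned above and uses made-up amounts. It flags amounts above the limit and checks how many individual limits each one would cover.

```r
# Combined primary + general limit per individual, as stated in the text.
limit <- 5400

# Toy amounts (hypothetical).
amounts <- c(2700, 5400, 10800, 250)

# Amounts that exceed the individual limit.
over_limit <- amounts[amounts > limit]

# How many full individual limits each of those amounts corresponds to.
people_needed <- over_limit / limit

print(over_limit)
print(people_needed)
```

 A ratio of exactly 2 is consistent with a joint contribution that is later reattributed between 2 people, though it does not prove it.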
 
 Next, I'm going to show the distribution of contribution dates. I use a line graph because it is a suitable way to show time series data. I'll show the counts for each day and for each month.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_contribution_date_counts}

dates_counts <- as.data.frame(table(Data$contb_receipt_dt, dnn = 'date'),
                              responseName = 'count')

# Convert the date variable in dates_counts to Date. This is needed because
# building the new data.frame turned the variable into a factor.
dates_counts$date <- as.Date(dates_counts$date, format = '%Y-%m-%d')

print('Top 6 most frequent contribution dates:')
head(dates_counts[order(-dates_counts$count),])

ggplot(aes(x = date, y = count),
       data = dates_counts) +
  geom_line() +
  scale_x_date() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# Here I set stat and fun.y in geom_line to aggregate the daily data by month,
# with sum as fun.y to get the monthly contribution counts.
# I set the day part of every date to '01' so that the daily counts fall into
# one value per month. format() returns character, so I used as.Date to convert
# the result back to Date.
ggplot(aes(x = as.Date(format(date, '%Y-%m-01'), format = '%Y-%m-%d'), 
           y = count),
       data = dates_counts) +
  geom_line(stat = 'summary',
            fun.y = sum) +
  scale_x_date() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The top 6 dates by count show that the largest number of contributions was made on July 12, 2016.

 The overall contribution count increased from March 2015 to March 2016. After that it fell and rose twice, peaking in July 2016 and October 2016. That is the overall trend; the daily line graph shows many large fluctuations.
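
 The monthly aggregation behind the second plot can also be done outside ggplot. This is a sketch on made-up daily counts, using base R's `aggregate()` instead of the `stat = 'summary'` approach in the chunk above.

```r
# Toy daily counts (hypothetical).
daily <- data.frame(
  date = as.Date(c("2016-07-11", "2016-07-12", "2016-08-01")),
  count = c(3, 5, 2)
)

# Map every date to the first day of its month.
daily$month <- as.Date(format(daily$date, "%Y-%m-01"))

# Sum the daily counts within each month.
monthly <- aggregate(count ~ month, data = daily, FUN = sum)

print(monthly)
```

 Having the monthly totals in a data frame makes it easy to inspect the exact values behind the peaks seen in the plot.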
 
 Next, I'm going to show the contribution counts for each election type. I'll use a bar plot again.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_election_type_counts}

election_types_counts <- as.data.frame(
  sort(table(Data$election_tp, dnn = 'type'),
       decreasing = T), 
  responseName = 'count')

election_types_counts

ggplot(aes(x = type, y = count),
       data = election_types_counts) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 Most contributions are for P2016 and G2016, and P2016 has more than twice as many contributions as G2016. US presidential elections have 2 stages, primary and general, so it makes sense that most of the financial contributions were for the 2016 presidential election.

 But some contributions have election types that, in my view, are not related to it. I'm going to look into them more closely below.

```{r echo=FALSE, message=FALSE, warning=FALSE, Consider_election_types_other_than_P2016_and_G2016}

# Print only the head and tail of the O2016 data because it has 68 rows.
print('Head and tail of the rows whose election type is O2016')
o2016 <- subset(Data, election_tp == 'O2016')
head(o2016)
tail(o2016)

print('Are all O2016 contributions for Jill Stein?')
all(o2016$cand_nm == 'Stein, Jill')

print('From when until when were the contributions received?')
range(o2016$contb_receipt_dt)

print('Rows whose election type is P2012')
subset(Data, election_tp == 'P2012')

# Print only the head and tail of the rows whose election type is '' because
# there are more than 1500 rows.
print("Head and tail of the rows whose election type is ''")
election_type_vacant <- subset(Data, election_tp == '')
head(election_type_vacant)
tail(election_type_vacant)

print('Which candidates received these contributions?')
unique(election_type_vacant$cand_nm)

print('When did the nominees receive them?')

# I chose barplot to show the daily counts of type '' contributions.
ggplot(aes(x = contb_receipt_dt), 
       data = subset(election_type_vacant, 
                     if_nominee == T)) + 
  geom_bar(stat = 'count') + 
  scale_x_date(date_breaks = '1 month') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

print("From when until when did the candidates who aren't nominees receive them?")
range(subset(election_type_vacant, if_nominee == F)$contb_receipt_dt)

ggplot(aes(x = contb_receipt_dt), 
       data = subset(election_type_vacant, 
                     if_nominee == F)) + 
  geom_bar(stat = 'count') + 
  scale_x_date(date_breaks = '1 month') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 First, the head and tail of the rows whose type is O2016 looked like they were all for Jill Stein. I checked whether every such contribution was for her, and it was.
 Jill Stein was the Green party nominee, so I wanted to know the period over which the contributions were given. They run from 2016-11-23 to 2016-11-28. This is after the election, and it is a very short period compared to the length of a presidential campaign.
 I searched for information about her and found that these contributions are related to [presidential election recount fundraising](https://en.wikipedia.org/wiki/Jill_Stein#2016). This is why they were classified as O2016, the 'other' election type in 2016. They are not related to the primary or general election, so I think it is better to exclude O2016 when exploring the data later.
 
 
 Second, I looked at the rows whose type is P2012. I wondered why contributions related to the 2012 election appear here.
 There are only 2 such contributions, and I searched for ['Gunn, George'](https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2016&contributor_name=Gunn%2C+george&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016&contributor_employer=HSI) and ['Wylie, Wayne'](https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2016&contributor_name=wylie%2C+wayne&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016&contributor_employer=jpmorgan+chase) on the FEC site. Both contributed to 'Rick Santorum for president, inc.(2012)'. Even though they contributed in 2015 and 2016, the contributions appear to be for debt retirement for [Rick Santorum's 2012 US presidential campaign](https://en.wikipedia.org/wiki/Republican_Party_presidential_primaries,_2012). Therefore I think the contributions are not for Santorum's 2016 campaign, and I'll exclude P2012 when exploring the data.
 
 Third, the head and tail of the rows whose type is '' showed several different candidates. I checked who received these contributions, and 11 candidates did.
 The 11 candidates include both nominees and non-nominees, so I thought it would be useful to split the data by the if_nominee variable.
 The distribution of daily contribution counts shows that most of the nominees' contributions came while they were running their general election campaigns. This made me think it would be better to assign the nominees' contributions to G2016 and the other candidates' contributions to P2016.
 To check that this also holds for candidates who were not nominees, I drew a bar plot of their daily contribution counts. Contributions to them almost stopped from July 2016, the month in which the [Republican](https://en.wikipedia.org/wiki/Republican_Party_presidential_primaries,_2016) and [Democratic](https://en.wikipedia.org/wiki/Democratic_Party_presidential_primaries,_2016) presidential primary seasons ended with the national conventions.
 
 
 To be more confident about this way of assigning types, I searched the [FEC site](https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2016&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016) for the contributors whose names appear in the head and tail of election_type_vacant.
 Refunded contributions did not appear on the site. Some contributions are recorded there as primary (one example is the 2016-08-26 record [here](https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2016&contributor_name=WAY%2C+RICHARD+A+MR.&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016)) instead of ''. Others are recorded as '', just as in Data.
 I also found in the [FEC contribution brochure](https://transition.fec.gov/pages/brochures/contrib.shtml) that presumptive redesignation can apply to a contribution that 'Is not designated in writing for a particular election;'.
 Therefore I decided to leave the election type '' as it is. I'll also keep using the rows of this type, treating it as distinct from P2016 and G2016. I think these are contributions that were not designated in writing for a particular election on the contribution form.

 Next, I'm going to show the distribution of contributions for each party in Texas. I'll use a bar plot again.

```{r echo=FALSE, message=FALSE, warning=FALSE, Univarite_party_counts}

parties_counts <- as.data.frame(sort(table(Data$cand_party, dnn = 'party'),
                                     decreasing = T),
                                responseName = 'count')

parties_counts

ggplot(aes(x = party, y = count),
       data = parties_counts) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 Democratic party candidates received the largest number of contributions, about 20,000 more than Republican party candidates. The remaining parties and the independent candidate received few contributions compared to these 2 parties.

 It is striking that the 5 Democratic candidates together received more contributions than the 17 Republican candidates. I think this means that the Democratic party had at least one candidate who was well known in Texas and drew a lot of support. I think that candidate was Hillary Clinton, the Democratic party nominee, because the univariate plot of contribution counts by candidate showed that she received the most.

 But for contributions, the sum of amounts can matter more than the number of contributions. Therefore I'm going to explore how the sum of amounts differs by party, candidate, election type, and so on in the bivariate section.

 Next, I'm going to show the distribution of contributions for nominees and non-nominees. I want to compare the 2 groups, so I use the if_nominee column that I made.

```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_if_nominee_counts}

nominees_counts <- as.data.frame(sort(table(Data$if_nominee, 
                                            dnn = 'if_nominee'),
                                      decreasing = T),
                                 responseName = 'count')

nominees_counts

ggplot(aes(x = if_nominee, y = count),
       data = nominees_counts) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```

 I wanted to compare the contribution counts for nominees and non-nominees. Nominees received about 20,000 more contributions than the others, but relative to the size of the counts the difference is not large. I think this means there was at least one candidate who was likely to become a nominee but did not. I think they were Ted Cruz, a senator from Texas, and Bernie Sanders, a senator from Vermont, because the univariate plot of contribution counts by candidate showed that they received a notable number of contributions.


# Univariate Analysis

### What is the structure of your dataset?

 There are 548396 contributions in the dataset, with 20 features after merging. Of these 20 features, 14 are factors, 3 are strings, and the remaining 3 are of Date, num and logical type.

 Other observations from the summary:

 1. There are 25 unique candidates in the dataset.
 2. Hillary Clinton received the largest number of contributions.
 3. The mean contribution amount was \$138 and the median was \$38.
 4. Some amounts are negative.
 5. The interquartile range of the contribution dates runs from 2016-02-06 to 2016-08-12.
 6. Democratic party candidates received more contributions by count than Republican party candidates.

### What is/are the main feature(s) of interest in your dataset?
 
 The main feature is contb_receipt_amt, the contribution amount. I want to know how contribution counts, sums, quantiles and averages differ across the other variables.
 
### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

 I think the cand_nm, cand_party, if_nominee, election_tp, contbr_city and contb_receipt_dt columns can help me understand the political characteristics of Texas in the 2016 US presidential election. I also expect the receipt_desc column to explain the cause of some unusual cases.

### Did you create any new variables from existing variables in the dataset?

 I created the variable 'if_nominee' from cand_nm. The value is TRUE if cand_nm is one of 'Clinton, Hillary Rodham', 'Johnson, Gary', 'McMullin, Evan', 'Stein, Jill' and 'Trump, Donald J.', and FALSE otherwise.
 
### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

 I drew the histogram of the contb_receipt_amt to IQR range and got the top 6 most common contribution amounts as a table. And I found that the most common  amount was \$25. It is not the number of tens that I thought that it was unusual. And when I drew the bar plot of contribution counts by parties, I thought that it was also unusual that a party in which had 5 candidates got more contributions by count than a party in which had 17 candidates.

 I changed the type of the contb_receipt_dt column from character to Date so that the values can be treated as a time series. I expected the variable to be useful for showing how other variables change over time.
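
 The sketch below illustrates why the conversion matters. It assumes the raw strings look like '12-AUG-16' (day, month abbreviation, 2-digit year). That layout is an assumption, and `%b` depends on the locale, so the format string may need adjusting.

```r
# Assumed raw layout (not confirmed by this report): DD-MON-YY.
raw_dt <- c('12-AUG-16', '06-FEB-16', '03-MAY-16')

# As character, sorting is alphabetical and does not follow the calendar.
sort(raw_dt)

# %d = day, %b = abbreviated month name (locale dependent), %y = 2-digit year.
dt <- as.Date(raw_dt, format = '%d-%b-%y')

# As Date, sorting and arithmetic follow the calendar.
sort(dt)
diff(range(dt))
```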

# Bivariate Plots Section

 Before drawing bivariate plots, I'm going to use what I learned from the univariate analysis to make a new data frame that keeps only the essential parts of Data.
 
 I'm going to select the contb_receipt_amt, cand_nm, cand_party, if_nominee, election_tp, contbr_city, contb_receipt_dt, and receipt_desc columns, and exclude the election types P2012 and O2016. I'm also going to order the election types so that 'P2016' comes before 'G2016'.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Make_new_concise_dataframe}

chosenColumns = c('contb_receipt_amt', 
                  'cand_nm', 
                  'cand_party', 
                  'if_nominee', 
                  'election_tp',
                  'contbr_city', 
                  'contb_receipt_dt', 
                  'receipt_desc')

new_Data <- Data[, chosenColumns]

new_Data <- subset(new_Data, !election_tp %in% c('P2012', 'O2016'))

new_Data$election_tp <- factor(new_Data$election_tp,
                               levels = c('P2016', 'G2016', ''))

str(new_Data)

```

 The new data frame has 548326 contribution records and 8 features.

 Now I start drawing bivariate plots. First, I want to investigate how the contribution amounts differ across the values of the other variables.
 
 To be more specific, I first want to know how the contribution amounts changed over time. Therefore I'm going to use the contb_receipt_amt and contb_receipt_dt columns. I'll draw line graphs to show the trend and scatter plots to see how the distribution of contribution amounts changes.

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_date1}

ggplot(aes(x = contb_receipt_dt, y = contb_receipt_amt),
       data = new_Data) +
  geom_line(stat = 'summary', fun.y = sum) +
  scale_x_date(date_breaks = '3 month') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The first plot is a line graph of the daily sum of contribution amounts. The sum skyrocketed 5 times and collapsed 2 times. The collapse around June 2016 was remarkably severe, and I wonder why it happened. It looks like many refunds, redesignations, or reattributions happened at that time. I'm going to explore this further in the multivariate section because I also want to use the receipt_desc variable together with the candidates' names and contribution amounts.
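
 One way to locate such a collapse numerically is to sum the amounts per day and list the days with a negative total. The following is only a sketch on invented toy data, written in base R so that it runs without extra packages.

```r
# Toy data (invented for illustration): the negative amount stands for a refund.
toy <- data.frame(
  contb_receipt_dt = as.Date(c('2016-06-01', '2016-06-01', '2016-06-02',
                               '2016-06-02', '2016-06-03')),
  contb_receipt_amt = c(100, 50, -2700, 25, 200)
)

# Sum the amounts per day, the same quantity the line graph draws.
daily_sum <- aggregate(contb_receipt_amt ~ contb_receipt_dt,
                       data = toy, FUN = sum)

# Days whose net total is negative are candidates for a closer look.
negative_days <- daily_sum[daily_sum$contb_receipt_amt < 0, ]
negative_days
```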
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_date2}

# I set aes(color = 'method_name') in each geom_line to create a legend entry;
# 'method_name' becomes the label of that line in the legend.
# scale_color_brewer assigns the line colors automatically, and size in
# override.aes makes the legend lines thicker.
# I aggregated the daily data to months to see the overall trend of the
# statistics without severe fluctuations.
ggplot(aes(x = as.Date(format(contb_receipt_dt, '%Y-%m-01'), 
                       format = '%Y-%m-%d'), 
           y = contb_receipt_amt),
       data = new_Data) +
  geom_line(stat = 'summary', fun.y = mean,
            aes(color = 'mean')) +
  geom_line(stat = 'summary', fun.y = median,
            aes(color = 'median')) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.05),
            aes(color = '0.05quantile')) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.95),
            aes(color = '0.95quantile')) +
  scale_x_date(date_breaks = '3 month') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Methods',
                                          override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The second plot is a line graph of the mean, median, 0.05 quantile, and 0.95 quantile amounts. I aggregated the daily dates to months to see the overall trends with less fluctuation. The mean was always bigger than the median, and the gap between the median and the 0.95 quantile is much bigger than the gap between the median and the 0.05 quantile. I think this is because of some outlier amounts, which I can check by drawing a scatter plot of dates and amounts. The 0.95 quantile and the mean were biggest in March 2015, and I wonder why it happened at that time in particular. I'm going to explore this further in the multivariate section because I also want to use the receipt_desc variable together with the candidates' names and contribution amounts.
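
 The monthly aggregation can be sketched outside ggplot2 as follows. The toy data is invented, and `tapply` is used here only as a base R stand-in for the summary statistics in the plot.

```r
# Toy data (invented): one large amount in March.
toy <- data.frame(
  contb_receipt_dt = as.Date(c('2016-03-02', '2016-03-15', '2016-03-28',
                               '2016-04-05', '2016-04-20')),
  contb_receipt_amt = c(10, 25, 2700, 50, 100)
)

# Map every date to the first day of its month, as the x aesthetic above does.
toy$month <- as.Date(format(toy$contb_receipt_dt, '%Y-%m-01'))

# Mean and median per month.
monthly_mean   <- tapply(toy$contb_receipt_amt, toy$month, mean)
monthly_median <- tapply(toy$contb_receipt_amt, toy$month, median)
rbind(mean = monthly_mean, median = monthly_median)
```

 In the toy March data, a single large amount pulls the mean far above the median, which is the same pattern the plot shows.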
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_date3}

ggplot(aes(x = contb_receipt_dt, y = contb_receipt_amt),
       data = new_Data) +
  geom_jitter(alpha = 0.05) +
  scale_x_date(date_breaks = '3 month') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```
 
 The third plot is a scatter plot that shows how the distribution of contribution amounts changed over time. I can see some outlier amounts around \$10,000 and some horizontal lines of dots. The outliers make me wonder how they are distributed across parties and candidates, which I'll look at later. Almost all contribution amounts are less than about \$3,000. The lines of dots mean that there are some common contribution amounts. I think I can see the lines more clearly if I transform the y axis to a log scale. I chose a log scale because the dots become sparser as the amounts get bigger.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_date4}

# I kept only amounts equal to or above 0 so they can be shown on a log scale,
# and added 1 to every amount so that amounts of 0 stay finite after the log
# transformation. Adding $1 barely changes the values, especially for amounts
# above $30.
ggplot(aes(x = contb_receipt_dt, y = contb_receipt_amt + 1),
       data = subset(new_Data, contb_receipt_amt >= 0)) +
  geom_jitter(alpha = 0.05) +
  scale_x_date(date_breaks = '3 month') +
  scale_y_log10(breaks = c(30, 100, 300, 1000, 3000, 10000)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The fourth plot is a scatter plot whose y axis is transformed to a log scale. To use the log scale, I kept only amounts equal to or above 0, so the plot shows the distribution without refunded amounts. The lines of dots are now clearer. I can also see that from about August 2015 many contributions were between \$10 and \$300.

 The plots above showed how the contribution amounts changed over time.
 
 Next, I want to know how the contribution amounts differ between the nominees and the other candidates. Therefore I'm going to use the contb_receipt_amt and if_nominee columns. I'll use a boxplot to see the distribution of amounts in each group and a barplot to compare the sum of amounts in each group.
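
 The effect of adding 1 before the log transformation can be checked with a few example amounts. This is a small worked sketch, and the amounts are chosen only for illustration.

```r
amounts <- c(0, 1, 30, 100, 2700)

# log10(0) is -Inf, which is why every amount is shifted by 1 first.
log10(amounts[1])

# Shift in log10 units caused by adding 1; it shrinks quickly as amounts grow.
shift <- log10(amounts + 1) - log10(amounts)
data.frame(amount = amounts, shift = shift)
```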

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_nominees}

amt_vs_if_nominee <- ggplot(aes(x = if_nominee, y = contb_receipt_amt),
                                data = new_Data) +
                       geom_boxplot(alpha = 0.2) +
                       scale_x_discrete()

amt_vs_if_nominee

# I set aes(shape = 'mean') in geom_point to mark the mean of each group on the
# boxplot and to explain the symbol in the legend. scale_shape_manual sets the
# shape.
# coord_cartesian zooms in on the IQR; I found a suitable ylim by trial and
# error.
amt_vs_if_nominee +
  geom_point(stat = 'summary', fun.y = mean, size = 2, aes(shape = 'mean')) +
  scale_shape_manual('', values=c('mean' = 8)) +
  coord_cartesian(ylim = c(-100, 300))

ggplot(aes(x = if_nominee, y = contb_receipt_amt, group = 1),
       data = new_Data) +
  geom_bar(stat = 'summary', fun.y = sum) +
  scale_x_discrete()

```

 In the first boxplot, many of the outlier amounts belong to candidates who weren't nominees. To compare the distributions without the outliers, I drew the second boxplot zoomed in to amounts from -\$100 to \$300, and marked the mean with a symbol to compare it with the median. The mean and median of the contribution amounts for nominees are a little smaller than those for the others. But the barplot shows that the sum of amounts for nominees is a little bigger than the sum for the others.
 
 Next, I want to know how the contribution amounts differ by party. Therefore I'm going to use the contb_receipt_amt and cand_party columns. I'll use a boxplot and a barplot to show the differences, as I did for if_nominee.
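
 Zooming with coord_cartesian keeps every row, so the box is still computed from all of the data. Setting the limits with ylim() or subsetting first would drop rows before the statistics are computed. The base R sketch below, on invented amounts, shows how much the quartiles can change when rows are dropped.

```r
# Invented amounts with two large outliers.
amounts <- c(-50, 10, 25, 25, 50, 100, 250, 1000, 2700)

# Zooming keeps all rows, so the box still reflects these quartiles.
full_q <- quantile(amounts, c(0.25, 0.5, 0.75))

# Dropping rows outside the window first changes the quartiles.
in_window <- amounts[amounts >= -100 & amounts <= 300]
filtered_q <- quantile(in_window, c(0.25, 0.5, 0.75))

rbind(full = full_q, filtered = filtered_q)
```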

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_party1}

amt_vs_party <- ggplot(aes(x = cand_party, y = contb_receipt_amt),
                       data = new_Data) +
                  geom_boxplot(alpha = 0.1) +
                  scale_x_discrete()

amt_vs_party

```

 The boxplot above shows that many outlier amounts, especially those above \$10,000, belong to Republican party candidates. To see the IQR more clearly without the outliers, I zoomed in to the range from -\$100 to \$600 in the second boxplot. It gave me an interesting result.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_party2}

# Again, aes(shape = 'mean') and scale_shape_manual mark the mean of each group,
# and coord_cartesian zooms in on the IQR.
amt_vs_party +
  geom_point(stat = 'summary', fun.y = mean, size = 2, aes(shape = 'mean')) +
  scale_shape_manual('', values = c('mean' = 8)) +
  coord_cartesian(ylim = c(-100, 600))

new_Data %>%
  group_by(cand_party) %>%
  summarise(mean_amount = mean(contb_receipt_amt),
            median_amount = median(contb_receipt_amt)) %>%
  t() %>%
  as.data.frame()

```

 The mean and median amounts for the Independent candidate and the Libertarian party are bigger than those for the others. Their IQRs are large too, but they have almost no outliers. Since almost all of the contributions went to the Democratic and Republican parties, as I found in the univariate analysis, it looks like individual contributions to the Independent candidate and the Libertarian party were larger than those to the 2 main parties. The mean and median amounts for the Green party are slightly bigger than those for the Republican party. The Democratic party has the smallest mean and median. It is interesting that its mean and median are smaller than those of the Republican party even though it received the largest number of contributions.

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_party3}

ggplot(aes(x = cand_party, y = contb_receipt_amt),
       data = new_Data) +
  geom_bar(stat = 'summary', fun.y = sum) +
  scale_x_discrete()

```

 The third plot shows that the sum of contribution amounts for the Republican party was bigger than that for the Democratic party. Compared to these 2 parties, the other parties received very little.
 
 Next, I want to know how the contribution amounts differ by candidate. Therefore I'm going to use the contb_receipt_amt and cand_nm columns. I'll use a boxplot and a barplot to show the differences, as I did for cand_party.

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_candidates1}

amt_vs_cands <- ggplot(aes(x = cand_nm, y = contb_receipt_amt),
                       data = new_Data) +
                  geom_boxplot(alpha = 0.2) +
                  scale_x_discrete() +
                  theme(axis.text.x = element_text(angle = 90, 
                                                   hjust = 1, 
                                                   vjust = 0.5))

amt_vs_cands

```

 In the first boxplot, the outlier amounts for Cruz stand out. Besides them, there are many outliers for Carson, Clinton, Paul, Sanders, and Trump. The distribution of the amounts for Jeb Bush is also interesting because the IQR is large but there are still some huge outlier amounts.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_candidates2}

# Again, aes(shape = 'mean') and scale_shape_manual mark the mean of each group,
# and coord_cartesian zooms in on the IQR.
amt_vs_cands +
  geom_point(stat = 'summary', fun.y = mean, size = 2, aes(shape = 'mean')) +
  scale_shape_manual('', values = c('mean' = 8)) +
  coord_cartesian(ylim = c(-100, 3000))

new_Data %>%
  group_by(cand_nm) %>%
  summarise(mean_amount = mean(contb_receipt_amt),
            median_amount = median(contb_receipt_amt)) %>%
  arrange(mean_amount)

```

 I zoomed in to amounts from -\$100 to \$3,000 in the second plot. I also printed the mean and median amounts for all candidates, arranged by mean. The mean and median for Clinton and Sanders are the smallest, but they have many outliers. Jeb Bush has the second largest IQR. The distributions for Carson, Cruz, Paul, and Rubio show small means and medians compared to the others, but many outliers.

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_candidates3}

# I arranged by n so that the candidates and their sums of amounts appear in
# order of contribution counts.
cands_sum_amounts <- new_Data %>%
  group_by(cand_nm, cand_party) %>%
  summarise(sum_amounts = sum(contb_receipt_amt),
            n = n()) %>%
  arrange(desc(n))

# factor() sets the levels of cand_nm to the row order of the data frame,
# because ordering the data frame rows does not by itself reorder the factor's
# levels.
cands_sum_amounts$cand_nm <- factor(cands_sum_amounts$cand_nm,
                                    levels = cands_sum_amounts$cand_nm)

ggplot(aes(x = cand_nm, y = sum_amounts),
       data = cands_sum_amounts) +
  geom_bar(stat = 'identity') +
  scale_x_discrete() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

subset(cands_sum_amounts, select = -cand_party)
```

 I drew the barplot of the sum of amounts for each candidate. I ordered the bars by contribution counts, not by the sum of amounts, so that I could compare the order of counts with the order of sums. It shows a really interesting result. The top 5 candidates by contribution counts were Clinton, Cruz, Sanders, Trump, and Carson, in that order. By sum of amounts, Clinton and Cruz still took first and second place, but Trump moved to 3rd place, beating Sanders by a huge gap. Most interesting is that Jeb Bush moved to 4th place, beating Sanders, Rubio, and Carson. It looks like his outlier contribution amounts played a huge role.
 
 I think it is also worth looking at how the contribution amounts differ by election type. Therefore I'm going to use the contb_receipt_amt and election_tp columns. I'll use a boxplot and a barplot to show the differences, as I did for cand_nm.
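
 The comparison between the two orderings can also be made explicit with ranks. The numbers below are invented and the candidate labels A to D are hypothetical; the sketch only shows how a candidate with few but large contributions can climb in the ranking by sum.

```r
# Invented numbers, only to show how the two orderings can disagree.
toy <- data.frame(
  cand_nm = c('A', 'B', 'C', 'D'),
  n = c(1000, 800, 600, 50),
  sum_amounts = c(90000, 120000, 30000, 60000)
)

# Rank 1 is the largest value in each case.
toy$rank_by_count <- rank(-toy$n)
toy$rank_by_sum   <- rank(-toy$sum_amounts)

# The average amount per contribution explains the rank changes.
toy$mean_amount <- toy$sum_amounts / toy$n
toy[order(toy$rank_by_count), ]
```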

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_election_type}

# I exclude the rows whose election type is '' because I don't know which type
# they belong to.
amt_vs_type <- ggplot(aes(x = election_tp, y = contb_receipt_amt),
                      data = subset(new_Data, election_tp != '')) +
                 geom_boxplot(alpha = 0.2) +
                 scale_x_discrete()

amt_vs_type

# Again, aes(shape = 'mean') and scale_shape_manual mark the mean of each group.
amt_vs_type +
  geom_point(stat = 'summary', fun.y = mean, size = 2, aes(shape = 'mean')) +
  scale_shape_manual('', values = c('mean' = 8)) +
  coord_cartesian(ylim = c(-100, 200))

ggplot(aes(x = election_tp, y = contb_receipt_amt),
       data = subset(new_Data, election_tp != '')) +
  geom_bar(stat = 'summary', fun.y = sum) +
  scale_x_discrete() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The first boxplot shows far more outliers for P2016 than for G2016. In the second boxplot, zoomed in to exclude the outliers, the mean and median for P2016 are bigger than those for G2016. The barplot of the sum of amounts by election type shows that the sum for P2016 is more than twice the sum for G2016.

 Next, I want to know how the contribution amounts differ by city. Therefore I'm going to use the contb_receipt_amt and contbr_city columns. But I found in the univariate analysis that there are 2252 different city names in Data, so I'm going to show only the top 10 cities by contribution count.

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_city1}

sum_amounts_by_city <- new_Data %>%
  group_by(contbr_city) %>%
  summarise(sum_amount = sum(contb_receipt_amt),
            mean_amount = mean(contb_receipt_amt),
            median_amount = median(contb_receipt_amt),
            n = n()) %>%
  arrange(desc(n)) %>%
  head(10)

top10_sum_amounts_city <- subset(new_Data,
                                 contbr_city %in% 
                                   sum_amounts_by_city$contbr_city)

ggplot(aes(x = contbr_city, y = contb_receipt_amt),
       data = top10_sum_amounts_city) +
  geom_boxplot(alpha = 0.1) +
  scale_x_discrete() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The first plot is a boxplot of the distribution of amounts in each of the top 10 cities. There are many outlier amounts in Houston and Dallas.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_city2}

# I set position_nudge and width in geom_bar to show the 0.95 quantiles, means
# and medians side by side in one graph. position_nudge shifts the x position
# of each bar and width controls the bar widths.
# For the legend, I used aes(fill = 'method_name') in each geom_bar together
# with scale_fill_brewer.
ggplot(aes(x = contbr_city, y = contb_receipt_amt, group = 1),
       data = top10_sum_amounts_city) +
  geom_bar(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.95),
           aes(fill = '0.95quantile'), 
           position = position_nudge(x = 0.33),
           width = 0.33) +
  geom_bar(stat = 'summary', fun.y = mean,
           aes(fill = 'mean'), 
           width = 0.33) +
  geom_bar(stat = 'summary', fun.y = median,
           aes(fill = 'median'), 
           position = position_nudge(-0.33),
           width = 0.33) +
  scale_x_discrete() +
  scale_fill_brewer(type = 'qual',
                    guide = guide_legend(title = 'Methods',
                                         override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The second plot is a bar graph of the median, mean, and 0.95 quantile amounts in each of the top 10 cities. The 0.95 quantile of Dallas is really big. The mean of Dallas is the biggest too, but not much different from that of Houston. Next, I'm going to explore the top 5% of amounts from Dallas.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Explore_more_about_Dallas_top_0.05}

# I grouped by city and used filter to keep the top 5% of amounts in each city.
# mutate adds the column 'if_dallas' to flag the rows from Dallas.
top_5percent_by_city <- new_Data %>%
  group_by(contbr_city) %>%
  filter(contb_receipt_amt >= quantile(contb_receipt_amt, 0.95)) %>%
  mutate(if_dallas = contbr_city == 'DALLAS')

# I used a histogram to see the distribution of contribution amounts, with fill
# set to if_dallas to compare the Dallas rows with the other cities' rows in
# one plot.
ggplot(aes(x = contb_receipt_amt, fill = if_dallas), 
       data = top_5percent_by_city) +
  geom_histogram(binwidth = 100, alpha = 0.7)

# To show the distribution in more detail, I printed the 6 most common values
# among the top 5% of amounts from Dallas.
head(as.data.frame(sort(table(subset(top_5percent_by_city,
                                     if_dallas == TRUE)$contb_receipt_amt,
                              dnn = 'amount'),
                        decreasing = TRUE),
                   responseName = 'count'))

```

 Almost all of the top 5% contributions from Dallas were \$2,700. There were very few amounts below \$2,700, which were much more common in the other cities. That seems to be why the 0.95 quantile of Dallas was the biggest.
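
 The per-city top 5% filter can also be sketched in base R with `ave`, which returns each city's own cutoff for every row. The toy data is invented, and this is an alternative illustration rather than the dplyr code used in the chunk above.

```r
# Toy data (invented): Dallas has many small amounts and two of 2700.
toy <- data.frame(
  contbr_city = rep(c('DALLAS', 'HOUSTON'), each = 40),
  contb_receipt_amt = c(rep(25, 38), 2700, 2700, seq(5, 200, by = 5))
)

# ave() returns, for every row, the 0.95 quantile of that row's own city.
cutoff <- ave(toy$contb_receipt_amt, toy$contbr_city,
              FUN = function(x) quantile(x, 0.95))

# Keep the rows at or above their city's cutoff.
top_5percent <- toy[toy$contb_receipt_amt >= cutoff, ]
table(top_5percent$contbr_city, top_5percent$contb_receipt_amt)
```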
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_amount_and_city3}

ggplot(aes(x = contbr_city, y = contb_receipt_amt, group = 1),
       data = top10_sum_amounts_city) +
  geom_bar(stat = 'summary', fun.y = sum) +
  scale_x_discrete() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The third plot is a barplot of the sum of amounts in each of the top 10 cities. The sum from Houston is bigger than the sum from Dallas. It looks like, even though there were some high contributions from Dallas, the sum from Houston was bigger because of its larger number of contributions.

==================================================

 Contribution amounts are the main feature I was interested in, and I drew many bivariate graphs related to them.
 
 From now on, I'm going to draw bivariate graphs of the supporting features. I want to know more about the political differences between cities in Texas.
 
 First, I want to know how the contribution counts for each candidate differ by city. Therefore I'm going to use the contbr_city, cand_nm, and cand_party columns. I'll use a barplot because I need to compare several candidates across cities, which are categorical values, and a stacked bar shows all the candidates of a city at once.

 Before drawing the plot, I'm going to make a variable named main_cands_counts. One column contains the names of the candidates and the other contains their contribution counts. It will have only 7 rows because I keep only the 7 main candidates by contribution count.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Get_main_candidates_by_contribution_counts}

main_cands_counts <- new_Data %>%
  group_by(cand_nm) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(7)

```


```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_cand_nm_and_city}

# sum_amounts_by_city, which I made earlier to build top10_sum_amounts_city,
# holds the names of the top 10 cities by contribution count in its contbr_city
# column.
# main_cands_counts limits the plot to the 7 main candidates.
ggplot(aes(x = contbr_city, fill = cand_nm, group = cand_nm),
       data = subset(new_Data, 
                     contbr_city %in% sum_amounts_by_city$contbr_city &
                       cand_nm %in% main_cands_counts$cand_nm)) +
  geom_bar(stat = 'count', position = 'stack') +
  scale_fill_brewer(type = 'qual', 
                    palette = 2,
                    guide = guide_legend(title = 'Candidates',
                                         override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 This also gave me a really interesting result. Clinton received the most contributions in every city except Spring. But the second place changes from city to city. In Austin, Sanders received far more contributions than Cruz. In Dallas and San Antonio, Sanders and Cruz received almost the same number. In Houston the relation is reversed. The result makes me wonder how the split between the 2 main parties differs by city. I expect a huge gap between the Democratic and Republican parties in Austin and a smaller gap in Houston. I already found in the univariate section that almost all contributions from Texas went to Democratic or Republican candidates, so I decided to compare only the 2 main parties. I'll use a barplot because city and party are both categorical.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_party_and_city}

ggplot(aes(x = contbr_city, fill = cand_party, group = cand_party),
       data = subset(new_Data, 
                     contbr_city %in% sum_amounts_by_city$contbr_city &
                       cand_party %in% c('Democratic', 'Republican'))) +
  geom_bar(stat = 'count', position = 'dodge') +
  scale_fill_brewer(type = 'qual',
                    palette = 2,
                    guide = guide_legend(title = 'Party',
                                         override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 Indeed, there is a huge gap in Austin and a smaller gap in Houston. The Democratic count was highest in Austin, and the Republican count was highest in Houston. In almost all of the cities, the Democratic party received more contributions than the Republican party.
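
 Because the cities differ a lot in their total counts, party shares within each city can make the gaps easier to compare. The sketch below uses invented counts and `prop.table` to compute row-wise shares.

```r
# Toy data (invented counts) for two cities and the 2 main parties.
toy <- data.frame(
  contbr_city = c(rep('AUSTIN', 10), rep('HOUSTON', 10)),
  cand_party = c(rep('Democratic', 8), rep('Republican', 2),
                 rep('Democratic', 6), rep('Republican', 4))
)

counts <- table(toy$contbr_city, toy$cand_party)

# margin = 1 divides each row by its total, giving party shares within a city.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```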
 
========================================================== 
 
 In the part above, I drew bivariate graphs related to cities. Now I'm going to draw bivariate graphs related to dates, because the contribution date is useful for showing trends.
 
 First, I want to know how the daily contribution counts changed over time. In particular, I want to compare the candidates who became nominees with those who didn't. Therefore I'm going to use the contb_receipt_dt and if_nominee columns, and a line graph to show the trend of contribution counts.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_date_and_nominees}

ggplot(aes(x = contb_receipt_dt, color = if_nominee),
       data = new_Data) +
  geom_line(stat = 'count') +
  scale_x_date(date_breaks = '3 month') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 Until about June 2016, most contributions went to candidates who didn't become nominees, probably because the nominees weren't decided yet. The contributions for nominees increased from about February 2016 and soared around July 2016. At that time the contributions for the other candidates fell to almost 0. The contributions for nominees then decreased until about September 2016 and increased again until the general election. I wonder what happened in September 2016 to make the contributions decrease.
 
 Next, I want to know how the contribution counts changed over time, comparing the parties. Therefore I'm going to use the contb_receipt_dt and cand_party columns, and a line graph again to show the trend of contribution counts.
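
 The numbers behind a count line graph can be inspected with a two-way table of dates and groups. The sketch below uses invented toy rows.

```r
# Toy data (invented): five contributions over two days.
toy <- data.frame(
  contb_receipt_dt = as.Date(c('2016-07-01', '2016-07-01', '2016-07-01',
                               '2016-07-02', '2016-07-02')),
  if_nominee = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# One row per day and one column per group.
daily_counts <- table(toy$contb_receipt_dt, toy$if_nominee)
daily_counts
```

 Unlike the plotted lines, such a table also lists the days on which a group has zero contributions.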

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_date_and_party}

ggplot(aes(x = contb_receipt_dt, color = cand_party),
       data = new_Data) +
  geom_line(stat = 'count') +
  scale_x_date(date_breaks = '3 month') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The contributions for the Republican party fluctuate much more dramatically than those for the Democratic party. They rose higher and fell more sharply, but the Republican counts are generally higher than the Democratic counts. The contributions for the Democratic party almost caught up with those for the Republican party around March 2016. There are also some periods when the Democratic party received more contributions than the Republican party: about May to June 2016 and August to November 2016. I guess this is partly because of Trump. I think I can find out more when I draw the plots of dates and candidates.
 
 Therefore, in this part, I'm going to see how the contribution counts changed over time, comparing Trump and Clinton. I chose these 2 candidates because they became the main parties' nominees. I'm going to use the contb_receipt_dt and cand_nm columns, and a line graph again.

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_date_and_main_nominees}

ggplot(aes(x = contb_receipt_dt, color = cand_nm),
           data = subset(new_Data,
                         cand_nm %in% c('Trump, Donald J.', 
                                        'Clinton, Hillary Rodham'))) +
  geom_line(stat = 'count') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  scale_x_date(date_breaks = '3 month')

```

 The part from June to November 2016 is almost the same as in the date vs party graph. I think this is because Clinton and Trump were formally nominated in July 2016.
 
 There are many huge fluctuations for Trump between July and August 2016. I wonder if they are related to Trump's comments and the news about him. In contrast, Clinton generally received more and more contributions as time went on.
 
 The graph of dates vs contribution counts for Trump and Clinton does not explain the fall in contributions for the Republican party in May 2016. Therefore I'm now going to compare the Republican party candidates, using main_cands_counts to keep only the main candidates' names.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_date_and_main_republican_candidates}

ggplot(aes(x = contb_receipt_dt, color = cand_nm),
           data = subset(new_Data,
                         cand_nm %in% main_cands_counts$cand_nm &
                           cand_party == 'Republican')) +
  geom_line(stat = 'count') +
  scale_color_brewer(type = 'qual',
                     palette = 2,
                     guide = guide_legend(title = 'Republican main candidates',
                                          override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + 
  scale_x_date(date_breaks = '3 month')

```

 The result shows that the fall in May 2016 was because of Cruz. I wondered whether Cruz ended his campaign around May 2016, and I found on the [internet](https://en.wikipedia.org/wiki/Republican_Party_presidential_primaries,_2016#May_3.2C_2016:_Indiana_primary) that he did. It looks like so many Republican supporters in Texas backed Cruz, the senator from the state, that the contributions for the party collapsed after he dropped out of the race.
 
 Now I'm also going to compare the Democratic party candidates. I expect this to explain more about the contributions for the party.
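
 A simple way to quantify such a drop is to count contributions on each side of a cutoff date. In the sketch below, the rows and the short candidate labels are invented, and only the cutoff, the date of the Indiana primary from the linked page, comes from the source.

```r
# Toy rows (invented); 'Cruz' and 'Trump' are shortened labels, not the
# values stored in cand_nm.
toy <- data.frame(
  contb_receipt_dt = as.Date(c('2016-04-20', '2016-04-28', '2016-05-02',
                               '2016-05-04', '2016-05-20')),
  cand_nm = c('Cruz', 'Cruz', 'Cruz', 'Trump', 'Trump')
)

# Date of the Indiana primary, taken from the linked page.
cutoff <- as.Date('2016-05-03')
toy$period <- ifelse(toy$contb_receipt_dt < cutoff, 'before', 'after')

# Contribution counts per candidate on each side of the cutoff.
period_counts <- table(toy$cand_nm, toy$period)
period_counts
```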
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_for_date_and_main_democratic_candidates}

ggplot(aes(x = contb_receipt_dt, color = cand_nm),
           data = subset(new_Data,
                         cand_nm %in% main_cands_counts$cand_nm &
                           cand_party == 'Democratic')) +
  geom_line(stat = 'count') +
  scale_color_brewer(type = 'qual',
                     palette = 2,
                     guide = guide_legend(
                       title = 'Democratic main candidates',
                       override.aes = list(size = 2))
                     ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + 
  scale_x_date(date_breaks = '3 month')

```

 The contributions for Sanders decreased as those for Clinton increased from May, 2016. I can find that there were some time when Sanders got more contributions than Clinton. I wonder why he became a lot more support from Texas than Clinton at the time. Anyway it explains why the contributions for Democratic party catched up those for Republican party on about March, 2016.
 
 Next, I want to know how the contribution counts for each election type changed over time, so I'm going to use the contb_receipt_dt and election_tp columns, again with a line graph. I expect the switch from primary to general election contributions to appear around July 2016, because the primary season ended with the party conventions in that month.

```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_date_and_election_type}

ggplot(aes(x = contb_receipt_dt, color = election_tp),
       data = subset(new_Data, election_tp != '')) +
  geom_line(stat = 'count') +
  scale_x_date(date_breaks = '3 month') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 As I expected, the switch between election types occurred around July 2016. This graph is also very similar to the dates vs if_nominee graph, except for the period from May to August 2016.

# Bivariate Analysis

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

 I drew several bivariate plots to see how the contribution amounts differ across the values of the other variables.
 
 After drawing the date vs amount graphs, I could see that there was one date with a huge drop in amounts, and I found that it was caused by contributions to Cruz. I also saw some lines of dots in the scatter plot, which means that some contribution amounts are very common.
 
 When I plotted if_nominee, cand_party, cand_nm and election_tp one at a time on the x axis with amount on the y axis, I could see that many outliers belong to the primary election and to Cruz, a Republican candidate who did not become the nominee. I also found that the mean and median amounts for the Democratic party were the smallest among the parties. However, Clinton, the Democratic party nominee, got the biggest sum of contribution amounts. Cruz and Trump got the second and third biggest sums.
 
 When I plotted contbr_city against amount, I found that Dallas had the biggest mean amount, and its 0.95 quantile amount was exceptionally big. But when I summed the amounts by city, the sum for Dallas was smaller than the sum for Houston.
 
### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

 When I drew plots for contbr_city vs cand_party and contbr_city vs cand_nm, it was really interesting to see a big difference between Houston and Austin. In Austin, the sum of contribution amounts for Clinton and Sanders was huge compared to the sum for Cruz and Trump. In Houston, even though the sum of contributions was bigger for the Democratic party than for the Republican party, the Republican sum was the highest among all the cities, and I found that this was because of the contributions for Cruz. It was interesting to learn that there can be huge differences in political contributions between cities in the same state.
 
 When I drew the plots for date vs candidates, I found that around the end of the primary season, about July 2016, the contribution trends changed hugely for both of the 2 main parties. For the primary election, many contributions went to Cruz and Sanders. From about July 2016 they shifted to Clinton and Trump for the general election. Overall, contributions for Clinton kept increasing until the general election, while those for Trump decreased.
 
### What was the strongest relationship you found?

 I cannot run a correlation test because there is only 1 numeric column in the data. But from the graphs, I found that if_nominee and election_tp behave most similarly when the count is on the y axis and the date is on the x axis. If a contribution has if_nominee as TRUE, it is very likely to have election_tp as G2016. If it has if_nominee as FALSE, it is very likely to have election_tp as P2016. I think this is because many contributions for the primary election go to candidates who do not become nominees, while most contributions for the general election go to nominees.
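
 Even without a correlation coefficient, the overlap between two categorical columns such as if_nominee and election_tp can be summarized with a contingency table. The following is only a minimal sketch on made-up rows: the `toy` data frame is an assumption for illustration and does not come from the contribution data.

```r
# Toy rows standing in for new_Data (hypothetical values, not real data).
toy <- data.frame(
  if_nominee  = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  election_tp = c('G2016', 'G2016', 'P2016', 'P2016', 'P2016', 'P2016',
                  'G2016', 'G2016')
)

# Count how often each combination of the two columns occurs.
cross_tab <- table(if_nominee = toy$if_nominee,
                   election_tp = toy$election_tp)
print(cross_tab)

# Row-wise shares: within each if_nominee value, the share of each type.
row_shares <- prop.table(cross_tab, margin = 1)
print(round(row_shares, 2))
```

 If the two variables move together as described above, most of the weight in each row of the table falls in a single column.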
 
# Multivariate Plots Section

 I now start to draw multivariate plots. I left 2 questions unanswered in the bivariate section. The first was the collapse of the sum of contribution amounts around June 2016. The second was the outstanding 0.95 quantile value in March 2015. I want to know the causes of both, so I'm going to answer these questions first using multivariate plots.
 
 First I'll explore the collapse around June 2016. As I said before, I also want to use the receipt_desc variable to find the cause. I'm going to use a scatter plot to see the distribution of amounts by candidate, with receipt_desc as the color so that the explanations written in the data are visible.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Explore_more_about_worst_collapsion_date}

# Use mutate to attach the sum of amounts of each date to every row, then filter to keep only the contributions made on the date with the minimum sum.
contbs_in_worst_collapsion_date <- new_Data %>%
  group_by(contb_receipt_dt) %>%
  mutate(sum_amounts = sum(contb_receipt_amt)) %>%
  ungroup() %>%
  filter(sum_amounts == min(sum_amounts))

print('When did the worst collapse happen?')
unique(contbs_in_worst_collapsion_date$contb_receipt_dt)

print('How many contributions were made on that date?')
nrow(contbs_in_worst_collapsion_date)

# Map candidate names to x, amounts to y and receipt_desc to color to see the distribution of amounts on that date by candidate and the reasons recorded for them.
ggplot(aes(x = cand_nm, y = contb_receipt_amt, color = receipt_desc),
       data = contbs_in_worst_collapsion_date) +
  geom_jitter(alpha = 0.8) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
  
```

 The collapse occurred on July 2nd, 2016, and there were 2725 contributions on that date. The graph shows that most of the negative amounts are related to Cruz. There were many redesignations and some refunds of contributions to Cruz, and the redesignations were mostly related to his Senate activities. I think that because Cruz ended his presidential campaign in May 2016, many of his supporters decided to change the purpose of contributions they had already made. This exploration explains the mysterious collapse, but the data does not tell me why so many redesignations and refunds happened on that particular date.
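
 The same idea, finding the date with the lowest daily total and keeping only its rows, can also be written in base R. This is a sketch under stated assumptions: the `toy` data frame and its values are hypothetical, and only the column names follow the contribution data.

```r
# Toy contributions (hypothetical values); negative amounts mimic refunds.
toy <- data.frame(
  contb_receipt_dt  = as.Date(c('2016-07-01', '2016-07-01',
                                '2016-07-02', '2016-07-02', '2016-07-03')),
  contb_receipt_amt = c(100, 50, -2700, 25, 10)
)

# Sum the amounts for each date; the result is named by the date strings.
daily_sum <- tapply(toy$contb_receipt_amt, toy$contb_receipt_dt, sum)

# Pick the date whose daily sum is lowest, then keep only its rows.
worst_date <- as.Date(names(which.min(daily_sum)))
worst_rows <- toy[toy$contb_receipt_dt == worst_date, ]

print(worst_date)
print(nrow(worst_rows))
```

 Note that `which.min()` returns only the first minimum, whereas filtering on equality with the minimum sum keeps every tied date.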
 
 Now I'm going to explore the top 5% of contribution amounts in March 2015. Again, I'm going to use a scatter plot to see the distribution of amounts by candidate and the reasons recorded for them.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Explore_more_about_March2016_top_0.05}

# Filter for March 2015, then keep the top 5% of amounts within that month.
top_5percent_in_march2015 <- new_Data %>%
  filter(contb_receipt_dt >= as.Date('2015-03-01') &
           contb_receipt_dt <= as.Date('2015-03-31')) %>%
  filter(contb_receipt_amt >= quantile(contb_receipt_amt, 0.95))

ggplot(aes(x = cand_nm, y = contb_receipt_amt, color = receipt_desc),
       data = top_5percent_in_march2015) +
  geom_jitter()

# Count how often each receipt_desc value appears among those contributions.
head(as.data.frame(sort(table(top_5percent_in_march2015$receipt_desc,
                              dnn = 'receipt_desc'),
                        decreasing = TRUE),
                   responseName = 'count'))
     
```

 It is striking that all of the top 5% contribution amounts are for Cruz, and more than half of them carry a request for reattribution or redesignation. I think so many large contributions for Cruz mean that he started his campaign in that month, and this is [true](https://en.wikipedia.org/wiki/Ted_Cruz#2016_presidential_campaign).
 
 Using multivariate plots, I could answer the questions left over from the bivariate section and learn the causes of these events.
 
 From now on, I'm going to draw multivariate plots to explore the data further, raise new questions and answer them. In the first part of the bivariate plots section, I drew several plots to investigate how the contribution amounts differ across the values of other variables. In the latter part, I drew plots using the contribution receipt date to investigate how the values of other variables changed over time.
 
 This time, to see higher-dimensional relationships between variables, I want to use both the contb_receipt_amt and contb_receipt_dt variables, and then add other variables as color or facet. I expect this to let me compare how contributions changed in terms of both counts and sums. I'll draw line graphs to show the trends, and I'll aggregate the daily data to monthly data to see the overall trends with less fluctuation.
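
 The daily-to-monthly aggregation can be sketched as follows. The rows in `toy` are made up for illustration. The only part taken from this analysis is the trick of formatting each date as `'%Y-%m-01'`, so that every day of a month maps to the first day of that month.

```r
# Toy daily contributions (hypothetical values, not the real data).
toy <- data.frame(
  contb_receipt_dt  = as.Date(c('2016-04-03', '2016-04-20',
                                '2016-05-01', '2016-05-15', '2016-05-31')),
  contb_receipt_amt = c(100, 50, 200, 25, 75)
)

# Replace the day with 01 so every date in a month maps to the same value.
toy$month <- as.Date(format(toy$contb_receipt_dt, '%Y-%m-01'))

# Monthly count and monthly sum of the amounts.
monthly_count <- tapply(toy$contb_receipt_amt, toy$month, length)
monthly_sum   <- tapply(toy$contb_receipt_amt, toy$month, sum)

print(monthly_count)
print(monthly_sum)
```

 In the plots, `stat = 'summary'` with a sum function performs the same per-month summation inside ggplot2, so no separate summary table is needed there.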
 
 First, I'm going to draw a line graph which shows how the monthly sums of contributions for Clinton and Trump changed over time.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_line_for_amt_dt_main_nominees}

ggplot(aes(x = as.Date(format(contb_receipt_dt, '%Y-%m-01'), 
                       format = '%Y-%m-%d'), 
           y = contb_receipt_amt, color = cand_nm),
       data = subset(new_Data, 
                     cand_nm %in% c('Trump, Donald J.', 
                                    'Clinton, Hillary Rodham'))) +
  geom_line(stat = 'summary', fun.y = sum, size = 1.2) +
  scale_x_date(date_breaks = '3 month') +
  scale_color_brewer(type = 'qual',
                     palette = 2,
                     guide = guide_legend(title = 'Main nominees',
                                          override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 I can see that the sum of amounts for Trump increased a lot from May 2016, with a dramatic increase from April to May. I saw earlier that the contribution counts for him fell sharply from September and were far lower than those for Clinton. Yet the sum he received in that period was almost the same as what Clinton received, which is surprising. I think the average contribution amount for him at that time was higher than for Clinton.
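
 The reasoning here is that the average amount equals the sum divided by the count, so equal sums with fewer contributions imply a higher average. A minimal sketch with made-up numbers (the values in `toy` are assumptions, not figures from the data):

```r
# Toy contributions for one month (hypothetical values).
toy <- data.frame(
  cand_nm = c('Trump, Donald J.', 'Trump, Donald J.',
              'Clinton, Hillary Rodham', 'Clinton, Hillary Rodham',
              'Clinton, Hillary Rodham', 'Clinton, Hillary Rodham'),
  contb_receipt_amt = c(500, 300, 250, 200, 200, 150)
)

# Number of contributions, their sum, and their mean for each candidate.
counts <- tapply(toy$contb_receipt_amt, toy$cand_nm, length)
sums   <- tapply(toy$contb_receipt_amt, toy$cand_nm, sum)
means  <- sums / counts

print(cbind(count = counts, sum = sums, mean = means))
```

 In this toy case both candidates have the same sum, but the candidate with half as many contributions has twice the mean amount.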

 Because I compared Trump and Clinton, I think I should also compare the trends of the main candidates within each party. So this time I'm going to draw a line graph which shows how the monthly sums of contributions for the main Republican candidates changed over time.
  
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_line_for_amt_dt_Republican_main_cands}

ggplot(aes(x = as.Date(format(contb_receipt_dt, '%Y-%m-01'), 
                       format = '%Y-%m-%d'), 
           y = contb_receipt_amt, color = cand_nm),
       data = subset(new_Data, 
                     cand_nm %in% main_cands_counts$cand_nm &
                       cand_party == 'Republican')) +
  geom_line(stat = 'summary', fun.y = sum, size = 1.2) +
  scale_x_date(date_breaks = '3 month') +
  scale_color_brewer(type = 'qual',
                     palette = 2,
                     guide = guide_legend(title = 'Republican main candidates',
                                          override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The 2 most prominent lines belong to Cruz and Trump. The collapse of the contributions to Cruz in July 2016 stands out. I can also see a weaker collapse of the contributions to Rubio in May 2016.
 
 Now I'm going to draw a line graph which shows how the monthly sums of contributions for the main Democratic candidates changed over time.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_line_for_amt_dt_Democratic_main_cands}

ggplot(aes(x = as.Date(format(contb_receipt_dt, '%Y-%m-01'), 
                       format = '%Y-%m-%d'), 
           y = contb_receipt_amt, color = cand_nm),
       data = subset(new_Data, 
                     cand_nm %in% main_cands_counts$cand_nm &
                       cand_party == 'Democratic')) +
  geom_line(stat = 'summary', fun.y = sum, size = 1.2) +
  scale_x_date(date_breaks = '3 month') +
  scale_color_brewer(type = 'qual',
                     palette = 2,
                     guide = guide_legend(title = 'Democratic main candidates',
                                          override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The graph shows an interesting result. I already found that there were some periods when Sanders got more contributions than Clinton by count. But when I plotted the monthly sum of contribution amounts, I found that there was no month in which Sanders received more than Clinton. The difference between them was smallest in March 2016. I think the average contribution amount for Sanders was smaller than that for Clinton.
 
 Having compared the trends of the main candidates within each party, I think it is necessary to draw a line graph which shows how the monthly sums of contributions for each party changed over time. It will show the aggregate of the plots above.

```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_line_for_amt_dt_party}

ggplot(aes(x = as.Date(format(contb_receipt_dt, '%Y-%m-01'), 
                       format = '%Y-%m-%d'), 
           y = contb_receipt_amt, color = cand_party),
       data = new_Data) +
  geom_line(stat = 'summary', fun.y = sum, size = 1.2) +
  scale_x_date(date_breaks = '3 month') +
  scale_color_brewer(type = 'qual',
                     palette = 2,
                     guide = guide_legend(title = 'Party',
                                          override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 I already found that the overall contribution counts were higher for the Republican party than for the Democratic party, but that from May to June and from August to November of 2016 the Democratic party received more contributions. When I plotted the monthly sums of amounts, the overall trend was similar. Until May 2016 the monthly sum of contributions for the Republican party was far bigger than the sum for the Democratic party. In May and July of 2016, there were huge collapses for the Republican party. The big difference appears from September to November: even though the counts for the Republican party were smaller than those for the Democratic party in that period, the sums were almost the same for both parties. This matches what I saw in the graph of the sums for Clinton and Trump. This plot is clearly the aggregated version of the plots above.
 
 Until now, I drew line graphs using contb_receipt_amt and contb_receipt_dt as the main variables to see the trend of the sum of amounts by other variables. From now on, I'll keep using these 2 main variables but draw scatter plots instead. I already drew the scatter plot of amounts vs dates in the bivariate section, but I want to know more about the distribution of amounts. I'll use cand_party to see the relationship between amounts, dates and parties.

```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_scatter_for_amt_dt_party}

ggplot(aes(x = contb_receipt_dt, 
           y = contb_receipt_amt + 1,
           color = cand_party),
       data = subset(new_Data, contb_receipt_amt >= 0)) +
  geom_jitter(alpha = 0.05) +
  scale_x_date(date_breaks = '3 month') +
  scale_color_brewer(type = 'qual',
                     palette = 1,
                     guide = guide_legend(title = 'Party',
                                          override.aes = list(size = 2,
                                                              alpha = 1))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  scale_y_log10(breaks = c(10, 50, 1000))

```

 Again, I only used the amounts equal to or above 0 so that I could use a log scale, and I added 1 so that the transformed values are never below 0. The result is interesting. I think the graph can be divided into 4 parts, split at June 2016 on the date axis and at about \$50 on the amount axis. Until May 2016 the overall contribution amounts for the Republican party were bigger than those for the Democratic party, although the higher-amount contributions for the Democratic party increased over that period. Even though from June to August most contributions were for the Republican party regardless of amount, from May to November most contributions for the Republican party were more than \$50, and most contributions for the Democratic party were less than \$50. There were also a few contributions of less than \$50 for independent and Libertarian candidates.
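
 The reason for adding 1 before taking the logarithm can be checked directly. This is a small sketch with hypothetical amounts, not values from the data:

```r
# Hypothetical contribution amounts, including a zero.
amounts <- c(0, 9, 49, 999)

# log10(0) is -Inf, so a zero amount cannot be placed on a log axis.
print(log10(amounts))

# Adding 1 first maps 0 to log10(1) = 0 and keeps every value finite.
shifted <- log10(amounts + 1)
print(shifted)
```

 The shift barely changes large amounts but noticeably stretches values close to 0, which is worth keeping in mind when reading the lower part of the plot.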
 
 This graph shows why the Republican party got a much bigger sum of contribution amounts from Texas than the Democratic party even though the contribution counts for the Republican party were smaller than those for the Democratic party.
 
 I learned from the line plots above and from the bivariate plots that before May 2016 many contributions were for Cruz and Sanders, and that after that time many of them were for Trump and Clinton. I think drawing the scatter plot with the main candidates' names as the color variable will show more about this.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_scatter_for_amt_dt_main_cands}

ggplot(aes(x = contb_receipt_dt, 
           y = contb_receipt_amt + 1,
           color = cand_nm),
       data = subset(new_Data,
                     cand_nm %in% main_cands_counts$cand_nm &
                       contb_receipt_amt >= 0)) +
  geom_jitter(alpha = 0.05) +
  scale_x_date(date_breaks = '3 month') +
  scale_color_brewer(type = 'qual',
                     palette = 2,
                     guide = guide_legend(title = 'Main candidates',
                                          override.aes = list(size = 2,
                                                              alpha = 1))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  scale_y_log10()

```

 Yes, I can see that the distribution of contribution amounts is approximately divided into 4 parts. This graph confirms the pattern of Cruz vs Sanders before May 2016 and Trump vs Clinton after that.
 
------

 Until now, I drew graphs using contb_receipt_amt and contb_receipt_dt as the main variables. From now on, I'm going to use other combinations of variables as the main variables.
 
 First, I'm going to show the distribution of amounts by party and election type. I'm going to use a boxplot to show the distribution for each party, which is a categorical variable. I already know from the bivariate section that most outlier amounts belong to the primary election and to Cruz, so I expect to see many outliers for the Republican party in the P2016 facet.

```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_box_for_amt_party_facet_type}

ggplot(aes(x = cand_party, 
           y = contb_receipt_amt, color = cand_party),
       data = subset(new_Data,
                     election_tp != '')) +
  facet_wrap(~election_tp, ncol = 2, dir = 'v') +
  geom_boxplot() +
  scale_color_brewer(type = 'qual',
                     palette = 1,
                     guide = guide_legend(title = 'Party',
                                          override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 As I expected, there are many outlier amounts for the Republican party in P2016. The number of outliers decreased from P2016 to G2016 for the 2 main parties, while the number of outliers for the other parties increased. I can also see that all contributions for the independent candidate were for G2016.
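
 Outliers can also be counted instead of judged by eye. The sketch below uses base R's `boxplot.stats()` on made-up amounts. The `toy` data is an assumption, and `boxplot.stats()` computes its hinges slightly differently from the quartiles that `geom_boxplot()` uses, so the counts may not match the plot exactly.

```r
# Toy amounts for two parties (hypothetical values).
toy <- data.frame(
  cand_party = c(rep('Republican', 6), rep('Democratic', 6)),
  contb_receipt_amt = c(25, 50, 50, 100, 100, 5400,
                        10, 25, 25, 50, 50, 75)
)

# boxplot.stats() lists in $out the points lying more than 1.5 times the
# hinge spread beyond the hinges; count them for each party.
outliers_by_party <- tapply(toy$contb_receipt_amt, toy$cand_party,
                            function(x) length(boxplot.stats(x)$out))
print(outliers_by_party)
```

 Grouping by both party and election type in the same way would give one outlier count per facet and box.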
 
 This time, I'm going to show the number of contributions by city and election type. I'm going to use a bar plot again to show the counts for each city, which is a categorical variable. I already know from the bivariate section that the counts for the Democratic party were bigger than those for the Republican party in almost all of the top 10 cities, and that the contribution counts for the Republican party were similar to those for the Democratic party before May 2016. I expect most of the counts for G2016 to be for the Democratic party, but I wonder what the counts for P2016 look like. I also wonder about the difference between Houston and Austin.

```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_line_for_party_city_facet_type}

ggplot(aes(x = contbr_city, 
           group = cand_party,
           fill = cand_party),
       data = subset(new_Data, 
                     election_tp != '' & 
                       contbr_city %in% top10_sum_amounts_city$contbr_city &
                       cand_party %in% c('Democratic', 'Republican'))) +
  facet_wrap(~election_tp) +
  geom_bar(stat = 'count', position = 'dodge') +
  scale_fill_brewer(type = 'qual',
                    palette = 1,
                    guide = guide_legend(title = 'Party',
                                         override.aes = list(size = 2,
                                                             alpha = 1))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The counts of contributions for the primary election show that, except in Austin, the counts for the 2 main parties were almost the same in the top 10 cities. In Austin, the count for the Democratic party was far bigger than the count for the Republican party. The counts for the general election show that there is no top 10 city in which the Republican party received more contributions than the Democratic party. Even Houston shows this collapse.
 
 So far I compared counts. How will the result change if I replace the counts with the sum of amounts? I only need to change the y axis.

```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_line_for_amt_city_party_type}

ggplot(aes(x = contbr_city, 
           y = contb_receipt_amt,
           group = cand_party,
           fill = cand_party),
       data = subset(new_Data, 
                     election_tp != '' & 
                       contbr_city %in% top10_sum_amounts_city$contbr_city &
                       cand_party %in% c('Democratic', 'Republican'))) +
  facet_wrap(~election_tp) +
  geom_bar(stat = 'summary', fun.y = sum, position = 'dodge') +
  scale_fill_brewer(type = 'qual',
                    palette = 1,
                    guide = guide_legend(title = 'Party',
                                         override.aes = list(size = 2,
                                                             alpha = 1))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 The sums of contribution amounts for the primary election show that, except in Austin, the sums for the Republican party were bigger than those for the Democratic party. But the sums for the general election show that the sums for the Democratic party were similar to or bigger than those for the Republican party. Even Houston follows this trend.
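
 Counts and sums can rank the parties differently within the same city, which is why both views are useful. A minimal sketch with base R's `xtabs()` on made-up rows (the values in `toy` are hypothetical and only the column names follow the data):

```r
# Toy contributions from two cities (hypothetical values).
toy <- data.frame(
  contbr_city = c('AUSTIN', 'AUSTIN', 'AUSTIN',
                  'HOUSTON', 'HOUSTON', 'HOUSTON'),
  cand_party  = c('Democratic', 'Democratic', 'Republican',
                  'Democratic', 'Republican', 'Republican'),
  contb_receipt_amt = c(25, 50, 500, 100, 1000, 250)
)

# Number of contributions for each city and party.
counts <- xtabs(~ contbr_city + cand_party, data = toy)
print(counts)

# Sum of contribution amounts for each city and party.
sums <- xtabs(contb_receipt_amt ~ contbr_city + cand_party, data = toy)
print(sums)
```

 In this toy example the Democratic party leads Austin by count while the Republican party leads it by sum, which is the kind of reversal the two bar plots are meant to reveal.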
 
 The sums of amounts by themselves don't show how contributions changed over time. I think it would also be interesting to use the sum of amounts and the contribution receipt date as the main variables again, to see how the monthly sums from each city differ by party. So I'm going to use a line graph with party as the color and city as the facet to see the contribution trends of each city. I'll draw it for just 4 main cities in Texas. I expect Houston and Austin to show interesting results this time, too.
 
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_line_for_amt_dt_party_city}

ggplot(aes(x = as.Date(format(contb_receipt_dt, '%Y-%m-01'), 
                       format = '%Y-%m-%d'), 
           y = contb_receipt_amt, color = cand_party),
       data = subset(new_Data, 
                     contbr_city %in% c('AUSTIN', 'HOUSTON', 
                                        'DALLAS', 'SAN ANTONIO'))) +
  facet_wrap(~contbr_city, scales = 'free_y', ncol = 2, dir = 'v') +
  geom_line(stat = 'summary', fun.y = sum) +
  scale_x_date(date_breaks = '6 month') +
  scale_color_brewer(type = 'qual',
                     palette = 1,
                     guide = guide_legend(title = 'Party',
                                          override.aes = list(size = 2))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

```

 It is interesting that there were some periods when the monthly sums of contributions for the Republican party were bigger than those for the Democratic party in Austin. But those were in 2015, and in 2016 the difference between the parties became bigger. Houston and Dallas show a similar pattern. Until February 2016 the sums for the Democratic party were far smaller than the sums for the Republican party. After that, there were 2 collapses for the Republican party, and its sums only narrowly caught up with the sums for the Democratic party in September 2016. San Antonio shows much more dynamic fluctuations for both parties, but generally the sums for the Republican party were bigger than those for the Democratic party.
 
# Multivariate Analysis

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

 The contb_receipt_dt and cand_nm variables helped me understand more about the distribution of contribution amounts, and the contb_receipt_dt and cand_party variables helped in the same way.

### Were there any interesting or surprising interactions between features?
 
 When I drew the bar plot which shows the number of contributions by city, faceted by election type, I found that the numbers of contributions for the Democratic party from the top 10 cities did not change much between election types, while the numbers for the Republican party decreased a lot from P2016 to G2016. The only exception was Austin, where the contributions were always higher for the Democratic party regardless of election type. The collapse for the Republican party and the consistency for the Democratic party can also be seen after changing the y axis from counts to the sum of amounts.

### OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

 
 I cannot create models with my dataset because there is only 1 numeric variable.

# Final Plots and Summary

### Plot One
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_One}

# Use position_nudge and width in geom_bar to show n and sum in one graph without melting cands_sum_amounts. position_nudge shifts the x position of each bar and width controls the bar widths. n is multiplied by 100 so that the counts are visible on the same axis as the sums.
ggplot(aes(x = cand_nm),
       data = cands_sum_amounts) +
  geom_bar(aes(y = n*100, fill = 'count'),
           stat = 'identity', 
           position = position_nudge(x = -0.2),
           width = 0.4) +
  geom_bar(aes(y = sum_amounts, fill = 'sum'),
           stat = 'identity', 
           position = position_nudge(x = 0.2),
           width = 0.4) +
  scale_x_discrete() +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(type = 'qual',
                    palette = 2,
                    guide = guide_legend(title = 'Statistics',
                                         override.aes = list(size = 2))) +
  xlab('Names of Candidates') +
  ylab('Contribution Counts(x100)\nSum of Contribution Amounts($)') +
  ggtitle(
    'Contributions to 2016 US Presidential Candidates from Texas'
    ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        plot.title = element_text(hjust = 0.5, face = 'bold', size = 20),
        axis.title.x = element_text(face = 'bold', size = 15),
        axis.title.y = element_text(face = 'bold', size = 15),
        legend.position = 'top')

```

### Description One

 If a candidate got a larger number of contributions, the candidate was also likely to get a larger sum of contribution amounts. But this did not apply to Sanders, Trump and Bush. Trump and Bush got a large sum of contribution amounts relative to their numbers of contributions, and Sanders got a small sum relative to his number of contributions.
 
 This plot combines 2 bar plots that I made earlier.
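
 An alternative to nudging two bar layers is to reshape the summary table into long form, with one row per candidate and statistic. The following is only a sketch: `toy` is a made-up table whose columns mirror `cand_nm`, `n` and `sum_amounts`, and the candidate names in it are placeholders.

```r
# Toy per-candidate summary (hypothetical values), shaped like the
# cand_nm, n and sum_amounts columns used for Plot One.
toy <- data.frame(
  cand_nm     = c('Candidate A', 'Candidate B'),
  n           = c(300, 100),
  sum_amounts = c(20000, 45000)
)

# Stack the two statistics: one row per candidate and statistic.
# The counts are scaled by 100, as in the plot, to share one axis.
long <- data.frame(
  cand_nm   = rep(toy$cand_nm, times = 2),
  statistic = rep(c('count', 'sum'), each = nrow(toy)),
  value     = c(toy$n * 100, toy$sum_amounts)
)
print(long)
```

 With a table in this shape, a single `geom_bar(stat = 'identity', position = 'dodge')` layer with `fill = statistic` would place the two bars side by side without manual nudging.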
 
### Plot Two
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Two}

ggplot(aes(x = as.Date(format(contb_receipt_dt, '%Y-%m-01'), 
                       format = '%Y-%m-%d'), 
           y = contb_receipt_amt, color = cand_party),
       data = subset(new_Data, 
                     contbr_city %in% c('AUSTIN', 'HOUSTON', 
                                        'DALLAS', 'SAN ANTONIO'))) +
  facet_wrap(~contbr_city, scales = 'free_y', ncol = 2, dir = 'v') +
  geom_line(stat = 'summary', fun.y = sum) +
  scale_x_date(date_breaks = '6 month') +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(type = 'qual',
                     palette = 1,
                     guide = guide_legend(title = 'Party',
                                          override.aes = list(size = 2))) +
  xlab('Contribution Receipt Dates') +
  ylab('Sum of Contribution Amounts($)') +
  ggtitle(
    'Contributions to Political Parties from Texas\nfor 2016 US Presidential Election',
    subtitle = '- The Trend of Sum of Amounts by 4 Main Cities -'
    ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        plot.title = element_text(hjust = 0.5, face = 'bold', size = 20),
        plot.subtitle = element_text(hjust = 0.5, face = 'bold', size = 15),
        axis.title.x = element_text(face = 'bold', size = 15),
        axis.title.y = element_text(face = 'bold', size = 15),
        legend.position = 'top')

```

### Description Two

 In Austin, the Democratic party got a growing monthly sum of contribution amounts over time, and its sum became far bigger than what the Republican party got from the city. In Houston and Dallas, the Republican party got a bigger monthly sum of contribution amounts than the Democratic party until April 2016. After that month, the sum for the Republican party collapsed 2 times, and in this period the Democratic party got the bigger sum. In San Antonio, the sums for the Democratic and Republican parties changed more dynamically than in the other cities. The Green and Libertarian party candidates and the independent candidate got such small sums of contribution amounts that their values are hard to see in the graphs.

### Plot Three

```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Three}

ggplot(aes(x = contb_receipt_dt, 
           y = contb_receipt_amt + 1,
           color = cand_nm),
       data = subset(new_Data,
                     cand_nm %in% main_cands_counts$cand_nm &
                       contb_receipt_amt >= 0)) +
  geom_jitter(alpha = 0.05) +
  scale_x_date(date_breaks = '3 month') +
  scale_y_log10(labels = comma) +
  scale_color_brewer(type = 'qual',
                     palette = 2,
                     guide = guide_legend(title = 'Main Candidates',
                                          override.aes = list(size = 2,
                                                              alpha = 1))) +
  xlab('Contribution Receipt Dates') +
  ylab('Contribution Amounts+1($)') +
  ggtitle(
    'Contribution Amounts from Texas\nfor 2016 US Presidential Candidates',
    subtitle = '- Distributed by Time For Main 7 Candidates -'
    ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        plot.title = element_text(hjust = 0.5, face = 'bold', size = 20),
        plot.subtitle = element_text(hjust = 0.5, face = 'bold', size = 15),
        axis.title.x = element_text(face = 'bold', size = 15),
        axis.title.y = element_text(face = 'bold', size = 15),
        legend.position = 'top') 

```

### Description Three

 The distribution of contribution amounts can be divided into 4 parts. Before May 2016, Cruz received many more high-value contributions than Sanders. After May 2016, Trump received many more high-value contributions than Clinton.

------

# Reflection

 I used 2 datasets. One was the financial contribution data for Texas, which has about 555,000 contributions across 18 variables. The other was a dataset of 2016 US presidential candidates that I made myself. It has information on 25 candidates across 5 variables such as party and age.
 
 I first did a basic exploration of the datasets and created a useful variable, 'if_nominee'. Then I merged the datasets on the candidates' names. Through the merge, I could add each candidate's party and see whether a candidate was nominated or not.
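
 The merge step described above can be sketched as follows. This is a minimal illustration on toy data, not the code used for the report: the rows, the shortened names, the amounts, and the `party` column name are assumptions. Only `cand_nm`, `contb_receipt_amt`, and `if_nominee` are names taken from the analysis.

```r
# Toy stand-ins for the two datasets (amounts are made up).
contributions <- data.frame(
  cand_nm = c("Cruz", "Clinton", "Trump", "Cruz"),
  contb_receipt_amt = c(100, 50, 250, 25),
  stringsAsFactors = FALSE
)
candidates <- data.frame(
  cand_nm = c("Cruz", "Clinton", "Trump"),
  party = c("Republican", "Democratic", "Republican"),  # assumed column name
  if_nominee = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Left join: keep every contribution and attach candidate attributes by name.
merged <- merge(contributions, candidates, by = "cand_nm", all.x = TRUE)

# Contributions whose candidate name found no match would have NA here.
unmatched <- sum(is.na(merged$party))
print(merged)
print(unmatched)
```

 Using `all.x = TRUE` keeps contributions even when a name fails to match, so spelling differences between the two files show up as `NA` instead of silently dropping rows.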
 
 With the merged dataset, I did univariate analyses. Then, by representing higher-dimensional observations on plots, I could answer some of the questions raised by those analyses, and I found new questions in turn. Finally, I explored the contribution amounts in particular across many variables.
 
 I found contribution trends during the analyses. It was especially interesting to see that Cruz, a Republican candidate and a senator from Texas, received many contributions from Texas until May 2016, ahead of Sanders, Clinton and Trump. Both the number of contributions and their sum were far larger for Cruz than for the other candidates at that time. After that, contributions to Cruz almost stopped, and Clinton, the Democratic nominee, received more and more contributions, ahead of Trump. However, Trump's contributions were higher on average than Clinton's, so even though he received far fewer contributions than she did, he still got the third biggest sum of contributions.
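
 The comparison above rests on three per-candidate figures: the number of contributions, their sum, and their mean. A minimal sketch of that summary on toy data (candidates "A" and "B" and all amounts are hypothetical) shows how a candidate with fewer but larger contributions can still reach a bigger total:

```r
library(dplyr)

# Hypothetical candidates and amounts, for illustration only.
toy <- data.frame(
  cand_nm = c("A", "A", "A", "B"),
  contb_receipt_amt = c(10, 20, 30, 300),
  stringsAsFactors = FALSE
)

# Count, sum, and mean of contributions per candidate, largest sum first.
per_candidate <- toy %>%
  group_by(cand_nm) %>%
  summarise(
    n_contb = n(),
    sum_amt = sum(contb_receipt_amt),
    mean_amt = mean(contb_receipt_amt)
  ) %>%
  arrange(desc(sum_amt))

print(per_candidate)
```

 Here "B" has a single contribution but the larger sum, because its mean amount is much higher than that of "A".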
 
 While working out the contribution trends, I found some interesting facts. Most of the huge outlier contribution amounts were for Cruz, and he received many contributions from Houston. In contrast, many contributions to the Democratic party came from Austin.
 
 When I plotted the amounts by date, I saw many fluctuations, and I struggled to understand their causes. Sometimes I could explain a fluctuation from the data alone by exploring that part more deeply, but in some cases I needed to search the Internet. In one case, the cause was hard to find even with the Internet. This was one of the hard parts of exploring the data.
 
 There were some limitations of the financial data. First, there was only 1 numeric variable, so I could not build a model to make predictions about candidates. Second, it shows only the contributions from Texas. I wonder whether I could have predicted Trump's win using contribution data from all US states. Third, I couldn't get information about unique contributors; the existing variables were insufficient to identify them. Fourth, I couldn't match the rise and fall of contributions for specific candidates with the actions of their campaigns. I think I could understand these changes better by applying natural language processing (NLP) to news about each candidate. I expect that NLP could show how positive and negative words and news keywords about a candidate changed over time, and I could then match those trends with the contribution fluctuations.
 
# References

1. The source of the financial data:

http://classic.fec.gov/disclosurep/pnational.do

2. To make Cand data:

https://www.washingtonpost.com/news/the-fix/wp/2015/09/18/people-only-care-about-how-tall-the-2016-candidates-are-so-here/?utm_term=.aeb12816dfe3

http://presidential-candidates.insidegov.com/

3. To understand the variables in Data:

ftp://ftp.fec.gov/FEC/Presidential_Map/2016/DATA_DICTIONARIES/CONTRIBUTOR_FORMAT.txt

4. To know more about redesignations and reattributions:

https://www.fec.gov/help-candidates-and-committees/candidate-taking-receipts/remedying-excessive-contribution/

https://www.youtube.com/watch?v=BdWY_HF2KAM&feature=youtu.be

5. To know about contribution limits:

https://www.opensecrets.org/overview/limits.php

6. To compare and check weird contributions information with other data sources:

https://www.campaignmoney.com/

https://www.fec.gov/data/receipts/individual-contributions/?two_year_transaction_period=2016&min_date=01%2F01%2F2015&max_date=12%2F31%2F2016

7. To know the form of the project and writing:

https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html

8. To get the basic RMD form from Udacity:

https://www.udacity.com/