-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathR Project Group I.Rmd
465 lines (318 loc) · 16.4 KB
/
R Project Group I.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
---
title: "ST661 (R for Data Analytics)- Group Project"
author: "Group I- BHANU NAGA SAI VAMSI BITRA,SILVIU-DANIEL FÁZEKAS,SANTHOSH PRABHAKARAN,ANJU BABY"
output:
html_document:
code_folding: hide
fig_caption: true
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,warning = FALSE, message = FALSE)
```
```{r,include=FALSE}
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(plotly)
library(ggiraph)
library(lubridate)
```
# "Analyzing Trends and Patterns in Olympic Performance Using Athlete and Medal Data"{.tabset .tabset-fade .tabset-pills}
## Introduction {.tabset .tabset-fade .tabset-pills}
In this project, we analyze two datasets:
**Athletes Dataset:** Contains detailed information about athletes, including their names, sports, countries, physical attributes (height, weight), and date of birth.
**Medals Dataset:** Provides details about the medals won, including the type of medal (Gold, Silver, or Bronze), the event, the athlete's country, and the date the medal was awarded.
These datasets are sourced from reliable Olympic data repositories, ensuring the accuracy and depth of the analysis. The goal is to uncover insights such as:
Patterns in medal distribution across countries.
Trends in age and performance.
The dominance of certain countries or athletes in specific sports.
Relationship between athlete heights and their sports.
**Dataset Structure:**
Variables such as Full_Name, Sports, Country, Medal_Type, Event, Medal_Date and Height are used for analysis.
These variables allow exploration of athlete characteristics, performance trends, and medal distributions.
__Dataset information:__
**Athletes:**
```{r}
# Define file paths for the datasets (correcting file extensions and slashes)
athletes_path <- "C:/Users/USER/OneDrive/Desktop/ST661[A] PROJECT GROUP I/athletes.csv"
medals_path <- "C:/Users/USER/OneDrive/Desktop/ST661[A] PROJECT GROUP I/medals.csv"
# Load datasets
library(readr)
# Load the datasets using read_csv
athletes_df <- read_csv(athletes_path, show_col_types = FALSE)
medals_df <- read_csv(medals_path, show_col_types = FALSE)
# Preview the first few rows of each dataset
head(athletes_df)
```
**Medals:**
```{r}
head(medals_df)
```
## Data Wrangling {.tabset .tabset-fade .tabset-pills}
In this project, we handled missing values and cleaned the datasets to prepare them for analysis.
***Handling Missing Values:***
Critical columns like Full_Name, Country, and Sports were filtered to remove rows with missing or irrecoverable values.
**Column Selection and Renaming:**
Unnecessary columns from both datasets were removed to focus only on relevant fields.
Columns were renamed for consistency and clarity (e.g., name_tv to Full_Name and medal_type to Medal_Type).
**Data Cleaning:**
Brackets, quotes, and formatting inconsistencies in categorical columns were removed.
Columns such as Birth_Date and Medal_Date were converted to proper date formats.
**Merging Datasets:**
The datasets were merged using common columns (Full_Name, Sports, Country) to form a unified dataset (combined_df).
This combined dataset includes all essential variables like Medal_Type, Event, Height, Weight, and Medal_Date.
```{r}
## Data cleaning
# Remove the specified columns in athletes
df_clean_ath <- athletes_df %>%
select(-c(current, name, name_short, country_code, country,
country_long, nationality_long, nationality_code, birth_place,
birth_country, residence_place, residence_country, nickname,
hobbies, occupation, education, family, lang, coach, reason,
hero, influence, philosophy, sporting_relatives, ritual,
other_sports, events))
# Remove the specified columns in medals
Me_cleaned <- medals_df %>%
select(-c(medal_code,event_type,url_event,code,country_code,country_long,gender))
#Rename columns in athletes
df_ath <- df_clean_ath %>%
rename(
`Full_Name` = name_tv,
Sports = disciplines,
Country = nationality,
Height = height,
Weight = weight,
Birth_Date = birth_date,
Gender = gender,
Code = code
)
#Rename columns in medals
df_med <- Me_cleaned %>%
rename(
Medal_Type= medal_type,
Medal_Date= medal_date,
Full_Name = name,
Sports= discipline,
Country = country,
Event =event
)
# Athletes Data set
# Remove brackets and quotes from 'sports'
library(stringr)
df_cl <- df_ath %>%
mutate(
Sports = str_replace_all(Sports, "\\[|\\]|'", ""),
)
# Convert character to date in "dd-mm-yyyy" format
df_cl$Birth_Date <- as.Date(df_cl$Birth_Date, format="%d-%m-%Y")
#head(df_cl$Birth_Date)
#str(df_cl)
## Count NA and empty string values in the entire data frame
#count_na_empty <- apply(df_cl, 2, function(x) sum(is.na(x) | x == ""))
#count_na_empty
# removing rows with missing values for key columns
df_clea <- df_cl %>%
filter(!is.na(`Full_Name`), !is.na(Country), !is.na(Sports), !is.na(Birth_Date), !is.na(Height), !is.na(Weight))
#count_na_empty <- apply(df_clea, 2, function(x) sum(is.na(x) | x == ""))
#count_na_empty
#Remove duplicate rows, if any
df_cleaned <- df_clea %>% distinct()
#checking data types of dataframe
#str(df_cleaned)
#convert char to date
df_med$Medal_Date <- as.Date(df_med$Medal_Date, format="%d-%m-%Y")
#head(df_med$Medal_Date)
#str(df_med)
#checking null values
#count_na_empty <- apply(df_med, 2, function(x) sum(is.na(x) | x == ""))
#count_na_empty
## Merge athletes and medals datasets using common columns
combined_df <- inner_join(df_med, df_cleaned, by = c("Full_Name", "Sports", "Country"))
head(combined_df)
#write.csv(combined_df, "combined_Clen_dataset.csv", row.names = FALSE)
#Transform the Data
#combined_df <- read.csv("E:/New folder (4)/olympics/combined_Clen_dataset1.csv")
# Assuming medals_df has a column 'medal_date' with values like '2024-07-27'
medals_df <- combined_df %>%
mutate(year = year(as.Date(Medal_Date)))
#write.csv(combined_df, "combined_Clen_dataset1.csv", row.names = FALSE)
```
**Final Output:**
The cleaned and merged dataset has no missing values and is ready for analysis.
Numerical and categorical data were validated for correctness, and duplicates were removed.
This cleaned dataset provides a reliable foundation for uncovering trends and patterns in Olympic performance.
## Data Analysis {.tabset .tabset-pills}
### Story 1
__Mean Age of Medal-Winning Athletes by Sport and Gender __
The goal of this plot is to analyze the average age of medal-winning athletes across sports, highlighting differences between male and female athletes. It identifies sport-specific age trends and gender-based variations in peak performance. This insight helps understand how age impacts success in various disciplines.
```{r}
# Calculate the mean age and round to the nearest whole number
mean_age_df <- combined_df %>%
mutate(Age = as.numeric(difftime(as.Date(Medal_Date), Birth_Date, units = "days")) / 365.25) %>%
group_by(Sports, Gender) %>%
summarise(
Mean_Age = round(mean(Age, na.rm = TRUE)), # Round the mean age
.groups = "drop"
)
# Create a custom color palette for gender
gender_colors <- c("Male" = "blue", "Female" = "red")
# Create the line plot using plotly
plot <- plot_ly(mean_age_df,
x = ~Sports,
y = ~Mean_Age,
color = ~Gender,
colors = gender_colors, # Apply custom colors
type = 'scatter',
mode = 'lines+markers') %>%
layout(
title = "Mean Age of Medal-Winning Athletes by Sport and Gender",
xaxis = list(title = "Sports", tickangle = 45), # Rotate x-axis labels
yaxis = list(title = "Mean Age"),
legend = list(title = list(text = "Gender"))
)
# Print the plot
plot
```
### Story 2
__Athletes Who Won Multiple Medals In A Particular Sport__
The goal of this plot is to showcase athletes who have won multiple medals in a particular sport, emphasizing their exceptional performance and consistency. It highlights individual achievements and identifies athletes who dominate their respective disciplines. This insight helps recognize top performers and trends in medal-winning excellence.
```{r}
# Group by athlete to find athletes who won multiple medals
multi_event_medalists <- combined_df %>%
group_by(Full_Name, Sports, Country) %>%
summarise(medals_won = n_distinct(Event), .groups = "drop") %>%
filter(medals_won > 1) # Filter only athletes who won more than 1 medal
# Create Plotly scatter plot for multi-event medalists with both athlete's name, sport and country
multi_event_medalists <- multi_event_medalists %>%
arrange(desc(medals_won)) # Ensure medals_won is ordered
# Create the plot
plot <- plot_ly(
data = multi_event_medalists,
x = ~medals_won, # Number of medals on the x-axis
y = ~reorder(Full_Name, medals_won), # Reorder athletes by medals won
type = 'scatter',
mode = 'markers', # Scatter plot with markers
marker = list(color = 'purple', size = 10, opacity = 0.7), # Marker customization
text = ~paste(
"Athlete: ", Full_Name,
"<br>Sport: ", Sports,
"<br>Country: ", Country,
"<br>Medals Won: ", medals_won
), # Tooltip text
hoverinfo = "text" # Display tooltip text
) %>%
layout(
title = 'Athletes Who Won Multiple Medals In A Particular Sport',
xaxis = list(title = 'Number of Medals Won'),
yaxis = list(title = 'Athlete', categoryorder = "total ascending"), # Sort y-axis
template = 'plotly_white'
)
# Display the plot
plot
```
### Story 3
__Top 10 Countries by Medal Count__
The goal of this plot is to highlight the top 10 countries by total medal count, breaking it down by gold, silver, and bronze medals. It showcases which nations excel in international sports competitions and their relative strengths in achieving higher-tier medals. This insight reflects the countries' dominance and investment in athletics.
```{r}
# Summarize medal distribution by country and medal type
medal_distribution <- combined_df %>%
group_by(Country, Medal_Type) %>%
summarise(medal_count = n(), .groups = "drop")
# Summarize the total medal count per country
total_medals_by_country <- medal_distribution %>%
group_by(Country) %>%
summarise(total_medals = sum(medal_count), .groups = "drop") %>%
arrange(desc(total_medals)) %>%
slice_head(n = 10) # Get top 10 countries
# Filter for the top 10 countries in the original medal distribution
top_10_medal_distribution <- medal_distribution %>%
filter(Country %in% total_medals_by_country$Country)
# Set custom colors for the medals
medal_colors <- c("Gold Medal" = "gold", "Silver Medal" = "gray", "Bronze Medal" = "darkorange3")
# Ensure the Medal_Type column has the correct order
top_10_medal_distribution$Medal_Type <- factor(
top_10_medal_distribution$Medal_Type,
levels = c("Gold Medal", "Silver Medal", "Bronze Medal")
)
# Add tooltips for interactivity
top_10_medal_distribution <- top_10_medal_distribution %>%
mutate(tooltip = paste0(
"Country: ", Country, "<br>",
"Medal Type: ", Medal_Type, "<br>",
"Medal Count: ", medal_count
))
interactive_plot <- ggplot(top_10_medal_distribution, aes(
x = reorder(Country, -medal_count),
y = medal_count,
fill = Medal_Type
)) +
geom_bar_interactive(
stat = "identity",
position = "stack",
aes(tooltip = tooltip, data_id = interaction(Country, Medal_Type))
) +
theme_minimal() +
labs(
title = "Top 10 Countries by Medal Count",
x = "Country",
y = "Medal Count",
fill = "Medal Type"
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 20) # Increase the font size of the title
) +
scale_fill_manual(values = medal_colors)
# Render the interactive plot using 'girafe'
girafe(ggobj = interactive_plot, width_svg = 12, height_svg = 8)
```
### Story 4
__Analyzing the Relationship Between Athlete Heights and Their Sports__
The goal of this plot is to compare the height distributions of male and female athletes across different sports. It helps identify gender-based height differences and variations in height trends within each sport. This can provide insights into height's role in athletic performance and suitability for specific sports.
```{r}
# Load required libraries
library(ggplot2)
library(dplyr)
# Filter out rows where height is 0
filtered_data <- combined_df %>%
filter(Height > 0)
# Create a tooltip column for interactivity (optional for advanced features)
filtered_data <- filtered_data %>%
mutate(tooltip = paste("Athlete:", Full_Name,
"<br>Gender:", Gender,
"<br>Sport:", Sports,
"<br>Height:", Height, "cm"))
# Create the boxplot
plot <- ggplot(filtered_data, aes(x = Sports, y = Height, fill = Gender)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) + # Standard boxplot
labs(
title = "Boxplot of Height vs Sports Grouped by Gender",
x = "Sport",
y = "Height (cm)",
fill = "Gender"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels for readability
)
# Display the plot
print(plot)
```
## Conclusion
The analysis of Olympic datasets has provided significant insights into the performance trends of athletes and countries. Here are the key takeaways:
__1) Mean Age of Medal-Winning Athletes by Sport and Gender__
The analysis revealed significant variations in the average age of medal-winning athletes across different sports and genders. Certain sports, such as gymnastics for males(22) and skateboarding for females (16) tend to favor younger athletes, while others, like equestrian events for both males (40) and females (39), see more experienced participants excelling. The interactive visualization showcased these patterns, providing insights into age-specific competitiveness and the physical and mental readiness required for various disciplines.
__2) Athletes Who Won Multiple Medals in a Particular Sport__
This analysis identified athletes who secured multiple medals in a single sport, highlighting their exceptional performance and contribution to their country's success. The visualization emphasized individual dominance, allowing users to explore the achievements of these athletes. Athletes such as Summer McINTOSH from Canada and Leon MARCHAND from France both won four medals each highlighting exceptional achievement.
__3) Top 10 Countries by Medal Count:__
The analysis of the top 10 countries with the highest medal counts showcased their dominance in Olympic sports. Countries such as USA and China consistently performed well, excelling across various disciplines. Japan performed well in comparison to the gold medal ratio it has to the other countries. The stacked bar chart provided a clear breakdown of medal types (Gold, Silver, Bronze) for each country, offering a comprehensive understanding of their achievements.
__4) Height Correlation by Sport__
The analysis of athletes’ heights across various sports revealed notable trends. Sports like **athletics** and **badminton** were dominated by taller athletes, as their height offers a competitive edge in terms of reach and performance. On the other hand, sports such as **sport climbing** saw a higher prevalence of shorter athletes, where agility, flexibility, and a lower center of gravity are crucial for success.
Additionally, gender-based differences in height were observed, with male athletes generally displaying greater height variability compared to female athletes. The box-plot visualization effectively showcased these patterns, emphasizing how physical demands shape the height distribution across sports and genders.
**Authors’ statements :**
"I ,BHANU NAGA SAI VAMSI BITRA had primary responsibility for the material in the DATA WRANGLING section, involving the conceptualization of insights and their visualization."
"I,SILVIU-DANIEL FAZEKAS, had primary responsibility for the material in ANALYTICS section."
"I, SANTHOSH PRABHAKARAN, had primary responsibility for the material in the ANALYTICS section."
"I, ANJU BABY, had primary responsibility for the material in the INTRODUCTION and CONCLUSION section."