Lucio Colonna
Last updated: Tuesday, September 10, 2024
📌 Missing Data
- Approximately 15% to 16% of values are missing in the start/end station names and start/end station IDs columns
- Around 0.12% of values are missing in the end latitude/longitude columns
- All rows containing empty or NA values were dropped
📌 Data Adjustments
- Bike type: “docked bike” was replaced with “classic bike”, as “docked bike” is a legacy term
- Station names and IDs: filtered out entries with specific keywords in station IDs, and cleaned station names by removing unwanted characters and prefixes
📌 Data Consistency
- Ensured a bijective correspondence between (a) station IDs and station names, and (b) station IDs/names and sets of coordinates. Specifically:
  - for (a), the mode (most frequent value) of station names/IDs was used
  - for (b), the mean of the start/end station coordinates (latitude, longitude) was used
- Fixed stations whose coordinates showed significant standard deviation
📌 New features introduced
- New features were added to the dataframe, specifically:
  - trip duration, as the difference between end date/time and start date/time
  - additional temporal features (based on the start date) such as quarter, month, day of the week, and hour
  - weekday vs weekend rides, round trips
  - estimation of trip distances and speed
📌 Bad data removed
- Trip duration:
- Rides with negative duration were removed
- Minimum ride duration set to 1 minute, maximum to 1440 minutes (1 day). Rides outside these thresholds were removed
- Rides with impossible speed (higher than 20 MPH / 32 km/h) were removed
📌 Total rows dropped
- A total of 1,554,444 rows were dropped during the processing phase (27.18% of the raw DF)
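For context, the code below assumes a setup chunk along these lines (a sketch; the original setup is not shown in this section):

# Libraries assumed to be loaded throughout this section (a sketch)
library(tidyverse)  # dplyr, tidyr, stringr, ...
library(skimr)      # skim_without_charts()
library(lubridate)  # date-time helpers: quarter(), month(), wday(), hour()
library(geosphere)  # distGeo() for trip-distance estimation
library(scales)     # comma() and percent() for formatted output
library(knitr)      # kable() for tables
# Mode() used later is assumed to be available (see the helper sketched further below)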
To evaluate how many rows have empty or NA values, I’m using the `skim_without_charts` function from the `skimr` library, which is very useful for obtaining information about the dataset. Next, I’m selecting specific columns such as `skim_variable` (variable name), `n_missing` (NA count), and `character.empty` (empty-string count). Then, I’m creating new columns `total_empty_or_NA`, `total_available`, `p_not_available`, and `p_available` to add some useful information on data availability.
skimmer <- skim_without_charts(df) %>%
  select(skim_variable, n_missing, character.empty) %>%
  rename(`NA` = n_missing, empty = character.empty) %>%
  mutate(
    class = skim_without_charts(df)$skim_type,
    # Numeric columns have no "empty string" concept, so fill their NAs with 0
    empty = if_else(is.na(empty), 0, empty),
    `NA` = if_else(is.na(`NA`), 0, `NA`),
    total_empty_or_NA = `NA` + empty,
    total_available = nrow(df) - total_empty_or_NA,
    p_not_available = round(total_empty_or_NA / nrow(df) * 100, 2),
    p_available = 100 - p_not_available
  ) %>%
  select(skim_variable, class, total_available, everything()) %>%
  # Present the rows in the original column order of df
  arrange(factor(skim_variable, levels = colnames(df)))
knitr::kable(skimmer)
skim_variable | class | total_available | NA | empty | total_empty_or_NA | p_not_available | p_available |
---|---|---|---|---|---|---|---|
ride_id | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
rideable_type | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
started_at | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
ended_at | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
start_station_name | character | 4844161 | 0 | 875716 | 875716 | 15.31 | 84.69 |
start_station_id | character | 4844029 | 0 | 875848 | 875848 | 15.31 | 84.69 |
end_station_name | character | 4790675 | 0 | 929202 | 929202 | 16.25 | 83.75 |
end_station_id | character | 4790534 | 0 | 929343 | 929343 | 16.25 | 83.75 |
start_lat | numeric | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
start_lng | numeric | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
end_lat | numeric | 5712887 | 6990 | 0 | 6990 | 0.12 | 99.88 |
end_lng | numeric | 5712887 | 6990 | 0 | 6990 | 0.12 | 99.88 |
member_casual | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
Well, the results are quite disappointing 😥:

- Between ~15% and ~16% of the values in `start_station_name`, `start_station_id`, `end_station_name`, and `end_station_id` are empty. There are also some NA values in the `end_lat` and `end_lng` columns
- If this were a real-world project, I would (have to 😕) interface with the relevant stakeholders to understand what is happening here
- Since this is not the real world, I will happily drop the rows that contain empty or NA values, but please note that this is not a practice to be taken lightly
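Before dropping anything, a quick diagnostic can reveal whether the missingness is systematic rather than random. For example (a sketch, not part of the original workflow), one could check whether rides with missing station or coordinate info are concentrated in one bike type:

# Sketch: is missing station/coordinate info concentrated in one bike type?
df %>%
  filter(start_station_name == "" | end_station_name == "" | is.na(end_lat)) %>%
  count(rideable_type)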
First, I’m replacing any empty strings with NA values across the entire data frame. Then, I’m dropping any rows that contain missing values, leaving me with a cleaned-up version of the original data frame.
df <- df %>%
  replace(., . == "", NA) %>%
  drop_na()

# Re-examine the size of the cleaned dataframe
dim(df)
## [1] 4331707 13
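For reference, an equivalent and somewhat more explicit formulation would restrict the empty-string replacement to character columns (a sketch, not run here):

# Alternative: convert empty strings to NA only in character columns, then drop
df <- df %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  drop_na()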
Let’s find out how many rows we have dropped:
rows_dropped_empty_NA_rows <- original_df_rows - nrow(df)

paste0(
  "Rows dropped due to NA or empty values: ",
  comma(rows_dropped_empty_NA_rows),
  " (",
  round(rows_dropped_empty_NA_rows * 100 / original_df_rows, 2),
  "% of raw dataframe)"
)
## [1] "Rows dropped due to NA or empty values: 1,388,170 (24.27% of raw dataframe)"
Let’s examine the different bike types that appear in the DF:
unique(df$rideable_type)
## [1] "electric_bike" "classic_bike" "docked_bike"
I’ve discovered through online research that “classic bikes” and “docked bikes” refer to the same type of bike. Considering this, I’ll update the `rideable_type` column in the dataset by replacing `docked_bike` with `classic_bike`.
df <- df %>%
  mutate(
    rideable_type = if_else(
      rideable_type == "docked_bike",
      "classic_bike",
      rideable_type
    )
  )

unique(df$rideable_type) # check if the replacement was successful
## [1] "electric_bike" "classic_bike"
Some station names and IDs refer to test and/or charging stations and must therefore be eliminated, as they are not relevant to the analysis. I also want to standardize some names to ensure consistency in the DF (e.g. stations whose names start with “Public Rack” or stations with asterisks in their names).
To achieve this, I am using the stringr library, which is very useful for searching and modifying strings with regular expressions.
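As a quick illustration of what these patterns do (hypothetical station names, sketched for clarity):

# Remove a trailing parenthesized suffix
str_replace("Wells St & Concord Ln (Temp)", " \\s*\\(.*\\)$", "")
## [1] "Wells St & Concord Ln"

# Strip the "Public Rack - " prefix
str_replace("Public Rack - Kedzie Ave & 61st Pl", "^Public Rack - ", "")
## [1] "Kedzie Ave & 61st Pl"

Now, on to the actual filtering and cleanup: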
nrow_before_filtering_id_names <- nrow(df)

df <- df %>%
  filter(
    # Filter out station IDs that include "chargingstx" in their name
    !str_detect(start_station_id, "chargingstx") &
    !str_detect(end_station_id, "chargingstx") &
    # Filter out station IDs that include "test", "testing", or "repair"
    !str_detect(start_station_id, regex("test|testing|repair", ignore_case = TRUE)) &
    !str_detect(end_station_id, regex("test|testing|repair", ignore_case = TRUE)) &
    # Filter out station names that include "test", "testing", or "repair"
    !str_detect(start_station_name, regex("test|testing|repair", ignore_case = TRUE)) &
    !str_detect(end_station_name, regex("test|testing|repair", ignore_case = TRUE))
  ) %>%
  mutate(
    # Remove parentheses and the enclosed text at the end of station names
    start_station_name = str_replace(start_station_name, " \\s*\\(.*\\)$", ""),
    end_station_name = str_replace(end_station_name, " \\s*\\(.*\\)$", ""),
    # Remove trailing asterisks from station names
    start_station_name = str_replace(start_station_name, "\\*$", ""),
    end_station_name = str_replace(end_station_name, "\\*$", ""),
    # Remove the "Public Rack - " prefix from station names
    start_station_name = str_replace(start_station_name, "^Public Rack - ", ""),
    end_station_name = str_replace(end_station_name, "^Public Rack - ", "")
  )
Let’s see how many rows we have dropped due to inconsistent names:
rows_dropped_bad_names <- nrow_before_filtering_id_names - nrow(df)

paste0(
  "Rows dropped due to bad IDs/names: ",
  comma(rows_dropped_bad_names),
  " (",
  round(rows_dropped_bad_names * 100 / original_df_rows, 2),
  "% of raw dataframe)"
)
## [1] "Rows dropped due to bad IDs/names: 78,388 (1.37% of raw dataframe)"
I now want to find out whether there is a bijective correspondence between station IDs and station names. To achieve this, I am first combining the start and end station IDs and names from the dataframe into a new dataframe called `all_stations`. This will allow me to analyze the relationship between station IDs and names across all trips.
all_stations <- bind_rows(
  select(df, station_id = start_station_id, station_name = start_station_name),
  select(df, station_id = end_station_id, station_name = end_station_name)
)
head(all_stations)
## station_id station_name
## 1 TA1309000058 Lincoln Ave & Fullerton Ave
## 2 TA1309000037 Kimbark Ave & 53rd St
## 3 RP-005 Western Ave & Lunt Ave
## 4 TA1309000037 Kimbark Ave & 53rd St
## 5 TA1309000037 Kimbark Ave & 53rd St
## 6 TA1309000019 Lakeview Ave & Fullerton Pkwy
I now want to determine:

- How many IDs have multiple station names?
- How many station names have multiple IDs?

I will create a checking function called `check_problems_stations`, which I will invoke again after making any interventions, to verify whether I have resolved the issues.
check_problems_stations <- function(dataframe) {
  # IDs associated with more than one distinct station name
  by_station_id <- dataframe %>%
    group_by(station_id, station_name) %>%
    summarise(count = n()) %>%   # grouped by station_id after summarise
    filter(n() > 1) %>%
    arrange(station_id, desc(count))

  # Station names associated with more than one distinct ID
  by_station_name <- dataframe %>%
    group_by(station_name, station_id) %>%
    summarise(count = n()) %>%   # grouped by station_name after summarise
    filter(n() > 1) %>%
    arrange(station_name, desc(count))

  problem_check <- data.frame(
    Problem = c("Problem #1", "Problem #2"),
    Description = c(
      "IDs with multiple station names",
      "Station names with multiple IDs"
    ),
    Total = c(
      length(unique(by_station_id$station_id)),
      length(unique(by_station_name$station_name))
    )
  )
  return(problem_check)
}

check_problems_stations(all_stations)
check_problems_stations(all_stations)
## Problem Description Total
## 1 Problem #1 IDs with multiple station names 74
## 2 Problem #2 Station names with multiple IDs 31
To solve problem #1 (IDs with multiple station names), I will take the following steps:

- calculate the `Mode` (most frequent station name) for each station ID
- join this mode mapping with the original df twice, first (i) for the start station IDs and then (ii) for the end station IDs
- update the `start_station_name` and `end_station_name` columns with the mode station names where available, using the `coalesce` function to prioritize the mode station names over the original ones

The `coalesce` function selects the first non-missing value from a list of arguments. So, in this context, it selects the mode station name if available, and otherwise keeps the original station name.
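Note that `Mode` is not a base R function: it is assumed here to come from a loaded package (e.g. DescTools) or to be defined by hand. A minimal single-value version, plus a toy illustration of `coalesce`, could look like this (a sketch):

# Minimal mode helper: returns the single most frequent value
# (the first one encountered in case of ties)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# coalesce() picks the first non-missing value, element-wise:
coalesce(c("Clark St & Lake St", NA), c("(ignored)", "Canal St & Adams St"))
## [1] "Clark St & Lake St"   "Canal St & Adams St"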
# Most frequent station name for each station ID
mode_stations <- all_stations %>%
  group_by(station_id) %>%
  summarise(station_name = Mode(station_name))

# Update DF
df <- df %>%
  left_join(mode_stations, by = c(start_station_id = "station_id")) %>%
  rename(start_station_name_mode = station_name) %>%
  left_join(mode_stations, by = c(end_station_id = "station_id")) %>%
  rename(end_station_name_mode = station_name) %>%
  mutate(
    start_station_name = coalesce(start_station_name_mode, start_station_name),
    end_station_name = coalesce(end_station_name_mode, end_station_name)
  ) %>%
  select(-start_station_name_mode, -end_station_name_mode)
To solve problem #2 (station names with multiple IDs), I will take the same approach as in problem #1:

- calculate the `Mode` (most frequent station ID) for each station name
- join this mode mapping with the original df twice, first (i) for the start station names and then (ii) for the end station names
- update the `start_station_id` and `end_station_id` columns with the mode station IDs where available, using the `coalesce` function to prioritize the mode station IDs over the original ones
# Most frequent station ID for each station name
mode_ids <- all_stations %>%
  group_by(station_name) %>%
  summarise(station_id = Mode(station_id))

# Update DF
df <- df %>%
  left_join(mode_ids, by = c(start_station_name = "station_name")) %>%
  rename(start_station_id_mode = station_id) %>%
  left_join(mode_ids, by = c(end_station_name = "station_name")) %>%
  rename(end_station_id_mode = station_id) %>%
  mutate(
    start_station_id = coalesce(start_station_id_mode, start_station_id),
    end_station_id = coalesce(end_station_id_mode, end_station_id)
  ) %>%
  select(-start_station_id_mode, -end_station_id_mode)
Let’s now verify whether the steps taken above have solved the issues. We’ll do this by invoking the `check_problems_stations` function again. This time, however, I’ll provide the `all_stations_updated` dataframe as an argument, which reflects the updated data within our DF.
all_stations_updated <- bind_rows(
  select(df, station_id = start_station_id, station_name = start_station_name),
  select(df, station_id = end_station_id, station_name = end_station_name)
)
check_problems_stations(all_stations_updated)
## Problem Description Total
## 1 Problem #1 IDs with multiple station names 0
## 2 Problem #2 Station names with multiple IDs 0
It looks like there is now a bijective correspondence between station IDs and station names! We can move to the next task.
I now want to make sure that there is a bijective correspondence between station IDs/names and sets of coordinates. To achieve this, I am going to combine the start/end station IDs, names, and start/end latitudes and longitudes from the dataframe into a new dataframe called `all_coords`.
all_coords <- df %>%
  select(
    station_id = start_station_id,
    station_name = start_station_name,
    lat = start_lat,
    lng = start_lng
  ) %>%
  bind_rows(
    df %>%
      select(
        station_id = end_station_id,
        station_name = end_station_name,
        lat = end_lat,
        lng = end_lng
      )
  )
This time I reckon it is more appropriate to use the mean of the coordinates for each station in order to identify a unique set. To find the average coordinates for each station, I’m creating a new dataframe called `all_coords_mean` by further processing `all_coords`: I group the data by `station_id` and `station_name`, and then calculate the mean latitude and longitude for each group.
all_coords_mean <- all_coords %>%
  group_by(station_id, station_name) %>%
  summarise(mean_lat = mean(lat), mean_lng = mean(lng))
head(all_coords_mean)
## # A tibble: 6 × 4
## # Groups: station_id [6]
## station_id station_name mean_lat mean_lng
## <chr> <chr> <dbl> <dbl>
## 1 021320 MTV Hubbard St 41.9 -87.7
## 2 1011 Fullerton Ave & Narragansett Ave 41.9 -87.8
## 3 1012 Langley Ave & 49th St 41.8 -87.6
## 4 1015 Peterson Ave & Drake Ave 42.0 -87.7
## 5 1016 Peterson Ave & Bernard Ave 42.0 -87.7
## 6 1017 Foster Ave & Drake Ave 42.0 -87.7
I now want to determine:

- How many IDs have multiple sets of coordinates?
- How many sets of coordinates have multiple IDs?

Similarly to what I have done before, I will create a checking function called `check_problems_coords`, which I will invoke again after making any interventions, to verify whether I have resolved the issues.
check_problems_coords <- function(dataframe) {
  # Station IDs that appear with more than one set of mean coordinates
  check_by_station_id <- dataframe %>%
    mutate(conc_coordinates = paste(mean_lat, mean_lng, sep = ", ")) %>%
    group_by(station_id) %>%
    filter(n() > 1)

  # Sets of mean coordinates shared by more than one station ID
  check_by_conc_coordinates <- dataframe %>%
    mutate(conc_coordinates = paste(mean_lat, mean_lng, sep = ", ")) %>%
    group_by(conc_coordinates) %>%
    filter(n() > 1)

  problem_check_coords <- data.frame(
    Problem = c("Problem #3", "Problem #4"),
    Description = c(
      "IDs with multiple sets of coordinates",
      "Sets of coordinates with multiple IDs"
    ),
    Total = c(
      length(unique(check_by_station_id$station_id)),
      length(unique(check_by_conc_coordinates$conc_coordinates))
    )
  )
  return(problem_check_coords)
}

check_problems_coords(all_coords_mean)
check_problems_coords(all_coords_mean)
## Problem Description Total
## 1 Problem #3 IDs with multiple sets of coordinates 0
## 2 Problem #4 Sets of coordinates with multiple IDs 2
It seems that each station ID already has a unique set of coordinates, so there’s no need to address problem #3.
It seems that 2 sets of coordinates are assigned to more than one Station ID. To identify these cases, I’m grouping by concatenated coordinates, and filtering instances where there are multiple station IDs associated with the same set of coordinates.
all_coordinates_check_02 <- all_coords_mean %>%
  mutate(conc_coordinates = paste(mean_lat, mean_lng, sep = ", ")) %>%
  group_by(conc_coordinates) %>%
  filter(n() > 1)
print(all_coordinates_check_02)
## # A tibble: 4 × 5
## # Groups: conc_coordinates [2]
## station_id station_name mean_lat mean_lng conc_coordinates
## <chr> <chr> <dbl> <dbl> <chr>
## 1 526 Legler Regional Library 41.9 -87.7 41.88, -87.73
## 2 853 Keeler Ave & Madison St S 41.9 -87.7 41.88, -87.73
## 3 899 Longwood Dr & 111th St 41.7 -87.7 41.69, -87.67
## 4 901 Hale Ave & 111th St 41.7 -87.7 41.69, -87.67
Let’s see how many rides correspond to those station IDs:
duplicated_coordinates <- df %>%
  filter(
    # Station IDs are stored as character, hence the quoted values
    start_station_id %in% c("526", "853", "899", "901") |
    end_station_id %in% c("526", "853", "899", "901")
  ) %>%
  summarize(n_rows = n()) %>%
  pull(n_rows)

duplicated_coordinates
## [1] 11
Due to the lack of sufficient information to reallocate the coordinates in these instances, and considering the minimal number of rides linked to those station IDs (11 in total), I’ve decided to remove those rows from the dataset.
df <- df %>%
  filter(
    !start_station_id %in% c("526", "853", "899", "901"),
    !end_station_id %in% c("526", "853", "899", "901")
  )
Before continuing, I intend to check if there are stations with significant variations in their coordinates. These variations could potentially affect the accuracy of their representation on maps during the analysis phase.
In order to do this, I’m calculating the standard deviation (SD) of latitude and longitude for each station, filtering the results to identify where either the latitude or longitude SD is higher than 1.
high_sd <- all_coords %>%
  group_by(station_id, station_name) %>%
  summarise(dev_lat = sd(lat), dev_lng = sd(lng)) %>%
  filter(dev_lat > 1 | dev_lng > 1)
print(high_sd)
## # A tibble: 1 × 4
## # Groups: station_id [1]
## station_id station_name dev_lat dev_lng
## <chr> <chr> <dbl> <dbl>
## 1 653B Stony Island Ave & 63rd St 1.67 3.50
Luckily, only 1 station is affected by this issue. Let’s see how many rides correspond to that station ID:
df %>%
  filter(start_station_id == "653B" | end_station_id == "653B") %>%
  summarize(n_rows = n()) %>%
  pull(n_rows)
## [1] 585
In this case, the number of rides associated with this station is too high for them to be dropped; therefore, I am going to hardcode the coordinates in the DF to make sure that they are consistent:
# New coordinate values (approximate location of Stony Island Ave & 63rd St)
new_lat <- 41.78
new_lng <- -87.59

df <- df %>%
  mutate(
    start_lat = if_else(start_station_id == "653B", new_lat, start_lat),
    start_lng = if_else(start_station_id == "653B", new_lng, start_lng),
    end_lat = if_else(end_station_id == "653B", new_lat, end_lat),
    end_lng = if_else(end_station_id == "653B", new_lng, end_lng)
  )
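As an aside, a more data-driven alternative to hardcoding would be to derive robust coordinates from the data itself, since the median is insensitive to the few wildly wrong points (a sketch; it would replace the hardcoding step above, and is not run here):

# Alternative (sketch): robust per-station coordinates via the median
df %>%
  filter(start_station_id == "653B") %>%
  summarise(med_lat = median(start_lat), med_lng = median(start_lng))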
Let’s now verify if the actions taken above have solved the problems:
all_coords_mean_update <- df %>%
  select(
    station_id = start_station_id,
    station_name = start_station_name,
    lat = start_lat,
    lng = start_lng
  ) %>%
  bind_rows(
    df %>%
      select(
        station_id = end_station_id,
        station_name = end_station_name,
        lat = end_lat,
        lng = end_lng
      )
  ) %>%
  group_by(station_id, station_name) %>%
  summarise(
    mean_lat = mean(lat),
    mean_lng = mean(lng)
  )
check_problems_coords(all_coords_mean_update)
## Problem Description Total
## 1 Problem #3 IDs with multiple sets of coordinates 0
## 2 Problem #4 Sets of coordinates with multiple IDs 0
It looks like all problems are solved, therefore we can now update the DF:
all_coords_mean_update_clean <- all_coords_mean_update %>%
  select(station_id, mean_lat, mean_lng)

df <- df %>%
  left_join(all_coords_mean_update_clean, by = c(start_station_id = "station_id")) %>%
  rename(start_lat_mean = mean_lat, start_lng_mean = mean_lng) %>%
  left_join(all_coords_mean_update_clean, by = c(end_station_id = "station_id")) %>%
  rename(end_lat_mean = mean_lat, end_lng_mean = mean_lng) %>%
  mutate(
    start_lat = start_lat_mean,
    start_lng = start_lng_mean,
    end_lat = end_lat_mean,
    end_lng = end_lng_mean
  ) %>%
  select(-start_lat_mean, -start_lng_mean, -end_lat_mean, -end_lng_mean)
I now want to check if there are any other unpleasant surprises in my DF. Specifically, I want to:

- check if there are any duplicates in the `ride_id` column
- verify that the `member_casual` column contains only the values `member` and `casual`
sum(duplicated(df$ride_id))
## [1] 0
unique(df$member_casual)
## [1] "member" "casual"
Looks like everything is ok (some good news at last!), therefore we can proceed to the next task.
I am now adding new features to the DF, which will be useful for the later analysis. For this task, I will leverage the functionality of the lubridate and geosphere libraries. Specifically, I will:

- Convert the `started_at` and `ended_at` columns into `POSIXct` (date-time) format
- Calculate trip durations as the difference between the `ended_at` and `started_at` columns
- Derive additional temporal features (based on the `started_at` column) such as quarter, month, day of the week, and hour
- Implement features to distinguish whether a trip occurred on a weekday or weekend, and to identify round trips by comparing start and end station IDs
- Provide an estimation of trip distances and speeds
df <- df %>%
  mutate(
    started_at = as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    ended_at = as.POSIXct(ended_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    # difftime() with explicit units avoids surprises from automatic unit selection
    duration_minutes = round(as.numeric(difftime(ended_at, started_at, units = "mins")), 2),
    quarter_name = factor(
      case_when(
        quarter(started_at) == 1 ~ "1st Quarter",
        quarter(started_at) == 2 ~ "2nd Quarter",
        quarter(started_at) == 3 ~ "3rd Quarter",
        quarter(started_at) == 4 ~ "4th Quarter"
      ),
      levels = c("1st Quarter", "2nd Quarter", "3rd Quarter", "4th Quarter")
    ),
    month_name = factor(month.abb[month(started_at)], levels = month.abb),
    day_of_week_name = factor(
      wday(started_at, label = TRUE, abbr = TRUE, week_start = 1, locale = "en_US.UTF-8")
    ),
    hour_of_day = hour(started_at),
    weekend = ifelse(day_of_week_name %in% c("Sat", "Sun"), "Weekend", "Monday to Friday"),
    round_trip = if_else(start_station_id == end_station_id, TRUE, FALSE),
    # distGeo() expects c(lng, lat) points and returns meters; divide by 1000 for km
    trip_distance_km = distGeo(cbind(start_lng, start_lat), cbind(end_lng, end_lat)) / 1000,
    trip_speed_kmph = round(trip_distance_km / duration_minutes * 60, 2),
    member_casual = factor(member_casual, levels = c("casual", "member")) # modify existing feature
  )
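Note that `distGeo` computes the geodesic (“as the crow flies”) distance, so `trip_distance_km` underestimates the actual route length, and `trip_speed_kmph` should be read as a lower bound on the true average speed. A quick sanity check against the first ride shown below:

# Straight-line distance for the first ride in the glimpse() output below
distGeo(c(-87.64638, 41.92416), c(-87.64191, 41.93188)) / 1000
## ~0.934 (km), matching that ride's trip_distance_km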
Let’s check the updated df:
glimpse(df)
## Rows: 4,253,308
## Columns: 22
## $ ride_id <chr> "F96D5A74A3E41399", "13CB7EB698CEDB88", "BD88A2E670…
## $ rideable_type <chr> "electric_bike", "classic_bike", "electric_bike", "…
## $ started_at <dttm> 2023-01-21 20:05:42, 2023-01-10 15:37:36, 2023-01-…
## $ ended_at <dttm> 2023-01-21 20:16:33, 2023-01-10 15:46:05, 2023-01-…
## $ start_station_name <chr> "Lincoln Ave & Fullerton Ave", "Kimbark Ave & 53rd …
## $ start_station_id <chr> "TA1309000058", "TA1309000037", "RP-005", "TA130900…
## $ end_station_name <chr> "Hampden Ct & Diversey Ave", "Greenwood Ave & 47th …
## $ end_station_id <chr> "202480.0", "TA1308000002", "599", "TA1308000002", …
## $ start_lat <dbl> 41.92416, 41.79957, 42.00857, 41.79957, 41.79957, 4…
## $ start_lng <dbl> -87.64638, -87.59474, -87.69046, -87.59474, -87.594…
## $ end_lat <dbl> 41.93188, 41.80983, 42.03973, 41.80983, 41.80983, 4…
## $ end_lng <dbl> -87.64191, -87.59937, -87.69942, -87.59937, -87.599…
## $ member_casual <fct> member, member, casual, member, member, member, mem…
## $ duration_minutes <dbl> 10.85, 8.48, 13.23, 8.77, 15.32, 3.22, 14.00, 9.35,…
## $ quarter_name <fct> 1st Quarter, 1st Quarter, 1st Quarter, 1st Quarter,…
## $ month_name <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, J…
## $ day_of_week_name <ord> Sat, Tue, Mon, Sun, Thu, Tue, Sun, Wed, Wed, Fri, T…
## $ hour_of_day <int> 20, 15, 7, 10, 13, 7, 21, 10, 20, 16, 17, 17, 19, 2…
## $ weekend <chr> "Weekend", "Monday to Friday", "Monday to Friday", …
## $ round_trip <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ trip_distance_km <dbl> 0.9344032, 1.2025276, 3.5405369, 1.2025276, 1.20252…
## $ trip_speed_kmph <dbl> 5.17, 8.51, 16.06, 8.23, 4.71, 13.26, 5.15, 7.72, 5…
Now I aim to explore the duration of the trips to identify if there are any anomalies:
summary(df$duration_minutes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -54.57 5.62 9.78 15.94 17.48 6891.22
The `summary` function shows that there are some negative values in the trip duration, which are naturally impossible. Therefore, I will drop those rows (in this case too, it would be appropriate to discuss the matter with the relevant stakeholders).
rows_dropped_negative_duration <- df %>%
  filter(duration_minutes <= 0) %>%
  summarize(n_row = n()) %>%
  pull(n_row)

# Keep only strictly positive durations (this also drops zero-duration rides)
df <- subset(df, duration_minutes > 0)

paste0(
  "Rows dropped due to negative duration: ",
  comma(rows_dropped_negative_duration),
  " (",
  round(100 * rows_dropped_negative_duration / original_df_rows, 3),
  "% of raw dataframe)"
)
## [1] "Rows dropped due to negative duration: 550 (0.01% of raw dataframe)"
The result of the `summary` function for `duration_minutes` above suggests a wide range of ride durations, with potential outliers at both ends of the spectrum.
To identify outliers, we could apply different methodologies such as Tukey’s fences or the z-score (spoiler alert: neither yields results that I find acceptable for this case study; see the sketch below). Alternatively, we could employ common sense and industry knowledge.
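For the curious, here is what Tukey’s fences would look like on `duration_minutes` (a sketch):

# Tukey's fences: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q <- unname(quantile(df$duration_minutes, c(0.25, 0.75)))
c(lower = q[1] - 1.5 * (q[2] - q[1]), upper = q[2] + 1.5 * (q[2] - q[1]))

With the quartiles reported above (~5.6 and ~17.5 minutes), the upper fence falls at roughly 35 minutes, which would discard a huge number of perfectly plausible long leisure rides; hence the domain-based thresholds that follow.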
What constitutes too little and too much for a bike ride?
Quite arbitrarily, I’ve posited that:

- the minimum duration (relevant for this analysis) of a bike-sharing ride is 1 minute. Rides shorter than this could be due to technical malfunctions, user changes of mind, unforeseen circumstances (like sudden rain), or perhaps a desire to test out the service
- the maximum duration of a ride (relevant for this analysis) is 1440 minutes (1 day), which might align with the 1-day ride pass offered by the company. Rides exceeding this limit could be attributed to various factors such as payment issues, theft/loss, or other technical problems

On the basis of the above assumptions, I’ll filter the dataframe accordingly:
rows_dropped_outliers <- df %>%
  filter(duration_minutes < 1 | duration_minutes > 1440) %>%
  summarise(n_row = n()) %>%
  pull(n_row)

# Filter DF
df <- df %>%
  filter(duration_minutes >= 1, duration_minutes <= 1440)

paste0(
  "Rows dropped due to outliers: ",
  comma(rows_dropped_outliers),
  " (",
  round(100 * rows_dropped_outliers / original_df_rows, 2),
  "% of raw dataframe)"
)
## [1] "Rows dropped due to outliers: 85,498 (1.49% of raw dataframe)"
summary(df$trip_speed_kmph)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.82 10.73 10.54 13.59 1151.89
The `summary` function on `trip_speed_kmph` suggests the presence of values that indicate impossible speeds.
According to the company’s website, the maximum reachable speed for electric bikes is 20 MPH (approximately 32 km/h). Therefore, I will filter out all trips with speeds higher than 32 km/h to ensure data accuracy:
rows_dropped_impossible_speed <- df %>%
  filter(trip_speed_kmph > 32) %>%
  summarise(n_row = n()) %>%
  pull(n_row)

# Filter DF
df <- subset(df, trip_speed_kmph <= 32)

paste0(
  "Rows dropped due to impossible speed: ",
  comma(rows_dropped_impossible_speed),
  " (",
  round(100 * rows_dropped_impossible_speed / original_df_rows, 2),
  "% of raw dataframe)"
)
## [1] "Rows dropped due to impossible speed: 1,827 (0.03% of raw dataframe)"
I’ll drop the `ride_id`, `started_at`, and `ended_at` columns, as they will not be needed for the analysis phase:
df <- df %>%
  select(-ride_id, -started_at, -ended_at)
Let’s count how many rows in total were dropped during the whole processing phase and examine the dimensions of the df:
rows_dropped_df <- data.frame(
  dropping_reason = c(
    "Empty or NA Values",
    "Bad Station IDs/Names",
    "Duplicated Coordinates",
    "Negative trip duration",
    "Trip duration outliers",
    "Impossible speed"
  ),
  n_rows_dropped = c(
    rows_dropped_empty_NA_rows,
    rows_dropped_bad_names,
    duplicated_coordinates,
    rows_dropped_negative_duration,
    rows_dropped_outliers,
    rows_dropped_impossible_speed
  )
) %>%
  mutate(
    percentage_on_raw_DF = percent(n_rows_dropped / original_df_rows)
  )
print(rows_dropped_df)
## dropping_reason n_rows_dropped percentage_on_raw_DF
## 1 Empty or NA Values 1388170 24.2692%
## 2 Bad Station IDs/Names 78388 1.3704%
## 3 Duplicated Coordinates 11 0.0002%
## 4 Negative trip duration 550 0.0096%
## 5 Trip duration outliers 85498 1.4948%
## 6 Impossible speed 1827 0.0319%
rows_dropped_total <- original_df_rows - nrow(df)

paste0(
  "Total rows dropped during the whole data processing phase: ",
  comma(rows_dropped_total),
  " (",
  round(100 * rows_dropped_total / original_df_rows, 2),
  "% of raw dataframe)"
)
## [1] "Total rows dropped during the whole data processing phase: 1,554,444 (27.18% of raw dataframe)"
dim(df)
## [1] 4165433 19