Cyclistic 2023 | Data Processing

Lucio Colonna
Last updated: Tuesday September 10, 2024


⬅️ Back to Index


Summary


📌 Missing Data

  • Approximately 15% to 16% of values are missing in the start/end station names and start/end station IDs columns
  • Around 0.12% of values are missing in the end latitude/longitude columns
  • All rows containing empty or NA values were dropped

📌 Data Adjustments

  • Bike type: “docked bike” was replaced with “classic bike”, as “docked bike” is a legacy term
  • Station names and IDs: filtered out entries with specific keywords in station IDs, and cleaned station names by removing unwanted characters and prefixes

📌 Data Consistency

  • Established a one-to-one correspondence between station IDs and station names by mapping each ID/name to its most frequent (mode) counterpart
  • Standardized station coordinates by replacing them with the mean latitude/longitude per station; one station with highly scattered coordinates was assigned fixed values

📌 New features introduced

  • New features added to the dataframe, specifically:
    • trip duration, as the difference between end date/time and start date/time
    • additional temporal features (based on the start date) such as quarter, month, day of the week, and hour
    • weekday vs weekend rides, round trips
    • estimation of trip distances and speed

📌 Bad data removed

  • Trips with negative durations were removed
  • Trips shorter than 1 minute or longer than 1,440 minutes (1 day) were treated as outliers and removed
  • Trips with estimated speeds above 32 km/h (the stated 20 MPH limit for electric bikes) were removed

📌 Total rows dropped

  • A total of 1,554,444 rows were dropped during the process phase (27.18% of the raw DF)

1. Empty and NA Values


⬅️ Back to Summary


To evaluate how many rows have empty or NA values, I’m using the skim_without_charts function from the skimr library, which is very useful for extracting summary information from the dataset.

Next, I’m selecting specific columns such as skim_variable (name of variable), n_missing (NA values count), and character.empty (empty values count).

Then, I’m creating new columns total_empty_or_NA, percentage_missing, and percentage_available to add some useful information on data availability.

skimmer <- skim_without_charts(df) %>%
  select(skim_variable, n_missing, character.empty) %>%
  rename(`NA` = n_missing, empty = character.empty) %>%
  mutate(
    class = skim_without_charts(df)$skim_type, 
    empty = if_else(is.na(empty), 0, empty),
    `NA` = if_else(is.na(`NA`), 0, `NA`),
    total_empty_or_NA = `NA` + empty, 
    total_available = nrow(df) - total_empty_or_NA,
    p_not_available = round(total_empty_or_NA/nrow(df) * 100, 2), 
    p_available = 100 - p_not_available
  ) %>%
  select(skim_variable, class, total_available, everything()) %>%
  arrange(factor(skim_variable, levels = colnames(df)))

knitr::kable(skimmer)

| skim_variable | class | total_available | NA | empty | total_empty_or_NA | p_not_available | p_available |
|---|---|---|---|---|---|---|---|
| ride_id | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
| rideable_type | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
| started_at | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
| ended_at | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
| start_station_name | character | 4844161 | 0 | 875716 | 875716 | 15.31 | 84.69 |
| start_station_id | character | 4844029 | 0 | 875848 | 875848 | 15.31 | 84.69 |
| end_station_name | character | 4790675 | 0 | 929202 | 929202 | 16.25 | 83.75 |
| end_station_id | character | 4790534 | 0 | 929343 | 929343 | 16.25 | 83.75 |
| start_lat | numeric | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
| start_lng | numeric | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |
| end_lat | numeric | 5712887 | 6990 | 0 | 6990 | 0.12 | 99.88 |
| end_lng | numeric | 5712887 | 6990 | 0 | 6990 | 0.12 | 99.88 |
| member_casual | character | 5719877 | 0 | 0 | 0 | 0.00 | 100.00 |


Well, the results are quite disappointing 😥:

  • Between ~15% and ~16% of the values in start_station_name, start_station_id, end_station_name and end_station_id are empty. There are also some NA values (~0.12%) in the end_lat and end_lng columns

  • If this were a real-world project, I would (have to 😕) interface with the relevant stakeholders to understand what is happening here

  • Since this is not the real world, I will simply drop the rows that contain empty or NA values, but please note that this is not a practice to be taken lightly




First, I’m replacing any empty strings with NA values across the entire data frame. Then, I’m dropping any rows that contain missing values, leaving me with a cleaned-up version of the original data frame.

df <- df %>%
    replace(., . == "", NA) %>%
    drop_na()

# Re-examine the size of the cleaned dataframe
dim(df)
## [1] 4331707      13



Let’s find out how many rows we have dropped:

rows_dropped_empty_NA_rows <- original_df_rows - nrow(df)

paste0(
  "Rows dropped due to NA or empty values: ", 
  comma(rows_dropped_empty_NA_rows),
  " (", 
  round(rows_dropped_empty_NA_rows * 100 / original_df_rows, 2), 
  "% of raw dataframe)"
)
## [1] "Rows dropped due to NA or empty values: 1,388,170 (24.27% of raw dataframe)"

2. Bike type


⬅️ Back to Summary


Let’s examine the different bike types that appear in the DF:

unique(df$rideable_type)
## [1] "electric_bike" "classic_bike"  "docked_bike"

I’ve discovered through online research that “classic bikes” and “docked bikes” refer to the same type of bike.

Considering this, I’ll update the rideable_type column in the dataset by replacing docked_bike with classic_bike.

df <- df %>%
  mutate(
    rideable_type = if_else(
      rideable_type == "docked_bike", 
      "classic_bike", 
      rideable_type
    )
  )

unique(df$rideable_type)  # check if replace was successful
## [1] "electric_bike" "classic_bike"

3. Names and IDs adjustments


⬅️ Back to Summary


Some station names and IDs refer to test and/or charging stations and therefore they must be eliminated, as they are not relevant to the analysis. Also, I want to standardize some names to ensure consistency in the DF (e.g. stations whose names start with “Public Rack” or stations with asterisks in their name).

To achieve this, I am using the stringr library, which is very useful for searching and modifying strings using regular expressions:

nrow_before_filtering_id_names <- nrow(df)
df <- df %>%
  filter(
    # Filter out station IDs that include "chargingstx" in their name
    !str_detect(start_station_id, "chargingstx") &
    !str_detect(end_station_id, "chargingstx") &
    
    # Filter out station IDs that include "test", "testing", or "repair"
    !str_detect(start_station_id, regex("test|testing|repair", ignore_case = TRUE)) &
    !str_detect(end_station_id, regex("test|testing|repair", ignore_case = TRUE)) &
    
    # Filter out station names that include "test", "testing", or "repair"
    !str_detect(start_station_name, regex("test|testing|repair", ignore_case = TRUE)) &
    !str_detect(end_station_name, regex("test|testing|repair", ignore_case = TRUE))
  ) %>%

  mutate(
    # Remove parentheses and included text at the end of station names
    start_station_name = str_replace(start_station_name, " \\s*\\(.*\\)$", ""),
    end_station_name = str_replace(end_station_name, " \\s*\\(.*\\)$", ""),
    
    # Remove asterisks from station names
    start_station_name = str_replace(start_station_name, "\\*$", ""),
    end_station_name = str_replace(end_station_name, "\\*$", ""),
    
    # Remove prefix "Public Rack - " from station names
    start_station_name = str_replace(start_station_name, "^Public Rack - ", ""),
    end_station_name = str_replace(end_station_name, "^Public Rack - ", "")
  )

Let’s see how many rows we have dropped due to inconsistent names:

rows_dropped_bad_names <- nrow_before_filtering_id_names - nrow(df)

paste0(
  "Rows dropped due to bad IDs/names: ", 
  comma(rows_dropped_bad_names),
  " (", 
  round(rows_dropped_bad_names * 100 / original_df_rows, 2), 
  "% of raw dataframe)"
)
## [1] "Rows dropped due bad IDs/names: 78,388 (1.37% of raw dataframe)"

4. Station IDs VS Station Names


⬅️ Back to Summary


I now want to find out if there is a bijective correspondence between Station IDs and Station names.
To achieve this, first of all I am combining the start and end station IDs and names from the dataframe into a new dataframe called all_stations.

This will allow me to analyze the relationship between station IDs and names across all trips.

all_stations <- bind_rows(
  select(df, station_id = start_station_id, station_name = start_station_name),
  select(df, station_id = end_station_id, station_name = end_station_name)
)

head(all_stations)
##     station_id                  station_name
## 1 TA1309000058   Lincoln Ave & Fullerton Ave
## 2 TA1309000037         Kimbark Ave & 53rd St
## 3       RP-005        Western Ave & Lunt Ave
## 4 TA1309000037         Kimbark Ave & 53rd St
## 5 TA1309000037         Kimbark Ave & 53rd St
## 6 TA1309000019 Lakeview Ave & Fullerton Pkwy



4.1 Problems statement


⬅️ Back to Summary


I now want to determine:

  • How many IDs have multiple station names?
  • How many station names have multiple IDs?

I will create a checking function called check_problems_stations that I will invoke again after making any interventions to verify if I have resolved the issues.

check_problems_stations <- function(dataframe) {
  by_station_id <- dataframe %>%
    group_by(station_id, station_name) %>%
    summarise(count = n()) %>%
    filter(n() > 1) %>%
    arrange(station_id, desc(count))

  by_station_name <- dataframe %>%
    group_by(station_name, station_id) %>%
    summarise(count = n()) %>%
    filter(n() > 1) %>%
    arrange(station_name, desc(count))

  problem_check <- data.frame(
    Problem = c("Problem #1", "Problem #2"),
    Description = c(
      "IDs with multiple station names", 
      "Station names with multiple IDs"
    ),
    Total = c(
      length(unique(by_station_id$station_id)), 
      length(unique(by_station_name$station_name))
    )
  )

  return(problem_check)
}

check_problems_stations(all_stations)
##      Problem                     Description Total
## 1 Problem #1 IDs with multiple station names    74
## 2 Problem #2 Station names with multiple IDs    31



4.2 Solve Problem #1


⬅️ Back to Summary


To solve problem #1 (IDs with multiple station names), I will take the following steps:

  1. calculate the Mode (most frequent station name) for each station ID
  2. join this mode mapping data with the original df twice, first (i) for the start station IDs and then (ii) for the end station IDs
  3. update the start_station_name and end_station_name columns with the mode station names where available, using the coalesce function to prioritize the mode station names over the original ones

The coalesce function selects the first non-missing (non-NA) value from a set of vectors. So, in this context, it selects the mode station name when it is available and otherwise keeps the original station name.
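
As a side note, Mode() is not a base R function; in this write-up it presumably comes from a package such as DescTools or from a user-defined helper. Below is a minimal sketch of an equivalent helper (an assumption, not the exact function used), plus a quick illustration of coalesce():

# Hypothetical stand-in for the Mode() helper used below (assumption:
# an equivalent is provided by a loaded package or defined elsewhere);
# it returns the most frequent value of a vector
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# coalesce() keeps the first non-NA value, element by element
dplyr::coalesce(c("A", NA, NA), c("B", "B", NA))
## [1] "A" "B" NA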

mode_stations <- all_stations %>%
  group_by(station_id) %>%
  summarise(station_name = Mode(station_name))

# Update DF
df <- df %>%
  left_join(mode_stations, by = c(start_station_id = "station_id")) %>%
  rename(start_station_name_mode = station_name) %>%
  left_join(mode_stations, by = c(end_station_id = "station_id")) %>%
  rename(end_station_name_mode = station_name) %>%
  mutate(
    start_station_name = coalesce(start_station_name_mode, start_station_name),
    end_station_name = coalesce(end_station_name_mode, end_station_name)
  ) %>%
  select(-start_station_name_mode, -end_station_name_mode)



4.3 Solve Problem #2


⬅️ Back to Summary


To solve problem #2 (Station names with multiple IDs), I will take the same approach as in problem #1:

  1. calculate the Mode (most frequent station ID) for each station name
  2. join this mode mapping data with the original df twice, first (i) for the start station names and then (ii) for the end station names
  3. update the start_station_id and end_station_id columns with the mode station IDs where available, using the coalesce function to prioritize the mode station IDs over the original ones

mode_ids <- all_stations %>%
  group_by(station_name) %>%
  summarise(station_id = Mode(station_id))

# Update DF
df <- df %>%
  left_join(mode_ids, by = c(start_station_name = "station_name")) %>%
  rename(start_station_id_mode = station_id) %>%
  left_join(mode_ids, by = c(end_station_name = "station_name")) %>%
  rename(end_station_id_mode = station_id) %>%
  mutate(
    start_station_id = coalesce(start_station_id_mode, start_station_id),
    end_station_id = coalesce(end_station_id_mode, end_station_id)
  ) %>%
  select(-start_station_id_mode, -end_station_id_mode)



4.4 Verify problems resolution


⬅️ Back to Summary


Let’s now verify if the steps taken previously have solved the issues.

We’ll do this by invoking the check_problems_stations function again. However, this time, I’ll provide the all_stations_updated dataframe as an argument, which reflects the updated data within our DF.

all_stations_updated <- bind_rows(
  select(df, station_id = start_station_id, station_name = start_station_name),
  select(df, station_id = end_station_id, station_name = end_station_name)
)

check_problems_stations(all_stations_updated)
##      Problem                     Description Total
## 1 Problem #1 IDs with multiple station names     0
## 2 Problem #2 Station names with multiple IDs     0

It looks like there is now a bijective correspondence between station IDs and station names! We can move to the next task.


5. Station IDs/Names VS Coordinates


⬅️ Back to Summary


I now want to make sure that there is a bijective correspondence between station IDs/names and sets of coordinates.

To achieve this, I am going to combine the start/end station IDs, names, and start/end latitudes and longitudes from the dataframe into a new dataframe called all_coords.

all_coords <- df %>%
  select(
    station_id = start_station_id, 
    station_name = start_station_name,
    lat = start_lat, 
    lng = start_lng
  ) %>%
  bind_rows(
    df %>%
      select(
        station_id = end_station_id, 
        station_name = end_station_name,
        lat = end_lat, 
        lng = end_lng
      )
  )

This time I reckon that it is more appropriate to use the mean of coordinates for each station, in order to identify a unique set.

In order to find the average coordinates for each station, I’m creating a new dataframe called all_coords_mean by further processing all_coords, where I group the data by station_id and station_name and then I calculate the mean latitude and longitude for each group.

all_coords_mean <- all_coords %>%
    group_by(station_id, station_name) %>%
    summarise(mean_lat = mean(lat), mean_lng = mean(lng))

head(all_coords_mean)
## # A tibble: 6 × 4
## # Groups:   station_id [6]
##   station_id station_name                     mean_lat mean_lng
##   <chr>      <chr>                               <dbl>    <dbl>
## 1 021320     MTV Hubbard St                       41.9    -87.7
## 2 1011       Fullerton Ave & Narragansett Ave     41.9    -87.8
## 3 1012       Langley Ave & 49th St                41.8    -87.6
## 4 1015       Peterson Ave & Drake Ave             42.0    -87.7
## 5 1016       Peterson Ave & Bernard Ave           42.0    -87.7
## 6 1017       Foster Ave & Drake Ave               42.0    -87.7



5.1 Problems statement


⬅️ Back to Summary


I now want to determine:

  • How many IDs have multiple sets of coordinates?
  • How many sets of coordinates have multiple IDs?

Similarly to what I have done before, I will create a checking function called check_problems_coords that I will invoke again after making any interventions to verify if I have resolved the issues.

check_problems_coords <- function(dataframe) {
  check_by_station_id <- dataframe %>%
    mutate(conc_coordinates = paste(mean_lat, mean_lng, sep = ", ")) %>%
    group_by(station_id) %>%
    filter(n() > 1)

  check_by_conc_coordinates <- dataframe %>%
    mutate(conc_coordinates = paste(mean_lat, mean_lng, sep = ", ")) %>%
    group_by(conc_coordinates) %>%
    filter(n() > 1)

  problem_check_coords <- data.frame(
    Problem = c("Problem #3", "Problem #4"),
    Description = c(
      "IDs with multiple sets of coordinates",
      "Sets of coordinates with multiple IDs"
    ),
    Total = c(
      length(unique(check_by_station_id$station_id)),
      length(unique(check_by_conc_coordinates$conc_coordinates))
    )
  )

  return(problem_check_coords)
}

check_problems_coords(all_coords_mean)
##      Problem                           Description Total
## 1 Problem #3 IDs with multiple sets of coordinates     0
## 2 Problem #4 Sets of coordinates with multiple IDs     2



5.2 Solve Problem #3


⬅️ Back to Summary


N/A

It seems that each station ID already has a unique set of coordinates, so there’s no need to address problem #3.




5.3 Solve Problem #4


⬅️ Back to Summary


It seems that 2 sets of coordinates are each assigned to more than one station ID. To identify these cases, I’m grouping by the concatenated coordinates and filtering for instances where multiple station IDs share the same set of coordinates.

all_coordinates_check_02 <- all_coords_mean %>%
    mutate(conc_coordinates = paste(mean_lat, mean_lng, sep = ", ")) %>%
    group_by(conc_coordinates) %>%
    filter(n() > 1)

print(all_coordinates_check_02)
## # A tibble: 4 × 5
## # Groups:   conc_coordinates [2]
##   station_id station_name              mean_lat mean_lng conc_coordinates
##   <chr>      <chr>                        <dbl>    <dbl> <chr>           
## 1 526        Legler Regional Library       41.9    -87.7 41.88, -87.73   
## 2 853        Keeler Ave & Madison St S     41.9    -87.7 41.88, -87.73   
## 3 899        Longwood Dr & 111th St        41.7    -87.7 41.69, -87.67   
## 4 901        Hale Ave & 111th St           41.7    -87.7 41.69, -87.67

Let’s see how many rides correspond to those station IDs:

duplicated_coordinates <- df %>%
  filter(
    start_station_id %in% c(526, 853, 899, 901) | 
    end_station_id %in% c(526, 853, 899, 901)
  ) %>%
  summarize(n_rows = n()) %>%
  pull(n_rows)

duplicated_coordinates
## [1] 11

Due to the lack of sufficient information to correctly reassign the coordinates in these cases, and considering the minimal number of rides linked to those station IDs (11 in total), I’ve decided to remove those rows from the dataset:

df <- df %>%
  filter(
    !start_station_id %in% c(526, 853, 899, 901), 
    !end_station_id %in% c(526, 853, 899, 901)
  )



5.4 Coordinates with high SD


⬅️ Back to Summary


Before continuing, I intend to check if there are stations with significant variations in their coordinates. These variations could potentially affect the accuracy of their representation on maps during the analysis phase.

In order to do this, I’m calculating the standard deviation (SD) of latitude and longitude for each station, filtering the results to identify where either the latitude or longitude SD is higher than 1.
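
To put the threshold of 1 in perspective (a rough rule of thumb on my part, not something from the source data): one degree of latitude is about 111 km, and one degree of longitude at Chicago’s latitude is about 83 km, so a standard deviation above 1 means a station’s recorded positions are scattered over an implausibly large area.

# Rough scale of one degree at Chicago's latitude (~41.9 N):
# latitude degrees are ~111 km; longitude degrees shrink with cos(latitude)
lat_km <- 111.32
lng_km <- 111.32 * cos(41.9 * pi / 180)
c(lat_km = lat_km, lng_km = round(lng_km, 1))   # roughly 111 km and 83 km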

high_sd <- all_coords %>%
    group_by(station_id, station_name) %>%
    summarise(dev_lat = sd(lat), dev_lng = sd(lng)) %>%
    filter(dev_lat > 1 | dev_lng > 1)

print(high_sd)
## # A tibble: 1 × 4
## # Groups:   station_id [1]
##   station_id station_name               dev_lat dev_lng
##   <chr>      <chr>                        <dbl>   <dbl>
## 1 653B       Stony Island Ave & 63rd St    1.67    3.50

Luckily, only 1 station is affected by this issue. Let’s see how many rides correspond to that station ID:

df %>%
    filter(start_station_id == "653B" | end_station_id == "653B") %>%
    summarize(n_rows = n()) %>%
    pull(n_rows)
## [1] 585

In this case, the number of rides associated with this station is too high to justify dropping them, so I am going to hardcode the coordinates in the DF to make sure they are consistent:

# New coordinates values
new_lat <- 41.78
new_lng <- -87.59

df <- df %>%
  mutate(
    start_lat = if_else(start_station_id == "653B", new_lat, start_lat),
    start_lng = if_else(start_station_id == "653B", new_lng, start_lng),
    end_lat = if_else(end_station_id == "653B", new_lat, end_lat),
    end_lng = if_else(end_station_id == "653B", new_lng, end_lng)
  )



5.5 Verify problems resolution


⬅️ Back to Summary


Let’s now verify if the actions taken above have solved the problems:

all_coords_mean_update <- df %>%
  select(
    station_id = start_station_id, 
    station_name = start_station_name,
    lat = start_lat, 
    lng = start_lng
  ) %>%
  bind_rows(
    df %>%
      select(
        station_id = end_station_id, 
        station_name = end_station_name,
        lat = end_lat, 
        lng = end_lng
      )
  ) %>%
  group_by(station_id, station_name) %>%
  summarise(
    mean_lat = mean(lat), 
    mean_lng = mean(lng)
  )

check_problems_coords(all_coords_mean_update)
##      Problem                           Description Total
## 1 Problem #3 IDs with multiple sets of coordinates     0
## 2 Problem #4 Sets of coordinates with multiple IDs     0

It looks like all problems are solved, therefore we can now update the DF:

all_coords_mean_update_clean <- all_coords_mean_update %>%
  select(station_id, mean_lat, mean_lng)

df <- df %>%
  left_join(all_coords_mean_update_clean, by = c(start_station_id = "station_id")) %>%
  rename(start_lat_mean = mean_lat, start_lng_mean = mean_lng) %>%
  left_join(all_coords_mean_update_clean, by = c(end_station_id = "station_id")) %>%
  rename(end_lat_mean = mean_lat, end_lng_mean = mean_lng) %>%
  mutate(
    start_lat = start_lat_mean, 
    start_lng = start_lng_mean,
    end_lat = end_lat_mean, 
    end_lng = end_lng_mean
  ) %>%
  select(-start_lat_mean, -start_lng_mean, -end_lat_mean, -end_lng_mean)

6. Additional checks


⬅️ Back to Summary


I now want to check if there are any other unpleasant surprises in my DF.
Specifically, I want to:

  • check if there are any duplicates in the ride_id column
  • verify that the member_casual column contains only the values member and casual

sum(duplicated(df$ride_id))
## [1] 0
unique(df$member_casual)
## [1] "member" "casual"

Looks like everything is ok (some good news at last!), therefore we can proceed to the next task.


7. New features introduction


⬅️ Back to Summary


I am now adding new features to the DF, which will be useful for later analysis.

For this task, I will leverage the functionality of the lubridate and geosphere libraries.

Specifically I will:

  • Convert the columns started_at and ended_at into POSIXct (date-time) format
  • Calculate trip durations as the difference between ended_at and started_at columns
  • Derive additional temporal features (based on the started_at column) such as quarter, month, day of the week, and hour
  • Implement features to distinguish whether a trip occurred on a weekday or weekend, and to identify round trips by comparing start and end station IDs
  • Provide an estimation of trip distances and speeds

df <- df %>%
    mutate(
        started_at = as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
        ended_at = as.POSIXct(ended_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
        # trip duration in minutes (explicit difftime units avoid ambiguity)
        duration_minutes = round(as.numeric(difftime(ended_at, started_at, units = "mins")), 2),
        quarter_name = factor(
            case_when(
                quarter(started_at) == 1 ~ "1st Quarter",
                quarter(started_at) == 2 ~ "2nd Quarter",
                quarter(started_at) == 3 ~ "3rd Quarter",
                quarter(started_at) == 4 ~ "4th Quarter"
            ),
            levels = c("1st Quarter", "2nd Quarter", "3rd Quarter", "4th Quarter")
        ),
        month_name = factor(month.abb[month(started_at)], levels = month.abb),
        day_of_week_name = factor(
            wday(started_at, label = TRUE, abbr = TRUE, week_start = 1, locale = "en_US.UTF-8")
        ),
        hour_of_day = hour(started_at),
        weekend = ifelse(day_of_week_name %in% c("Sat", "Sun"), "Weekend", "Monday to Friday"),
        round_trip = if_else(start_station_id == end_station_id, TRUE, FALSE),
        trip_distance_km = distGeo(cbind(start_lng, start_lat), cbind(end_lng, end_lat)) / 1000,
        trip_speed_kmph = round(trip_distance_km / duration_minutes * 60, 2),
        member_casual = factor(member_casual, levels = c("casual", "member"))  # modify existing feature
    )
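
A quick note on units: distGeo() from the geosphere package takes points in longitude/latitude order and returns the great-circle distance in metres, which is why the result is divided by 1,000 above. A small sanity check on made-up coordinates (illustrative values, not taken from the dataset):

library(geosphere)

# distGeo() expects c(lng, lat) points and returns metres
p1 <- c(-87.6298, 41.8781)   # hypothetical point near the Chicago Loop
p2 <- c(-87.6244, 41.8827)   # hypothetical point a few blocks north-east
distGeo(p1, p2) / 1000       # distance in km, roughly 0.68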

Let’s check the updated df:

glimpse(df)
## Rows: 4,253,308
## Columns: 22
## $ ride_id            <chr> "F96D5A74A3E41399", "13CB7EB698CEDB88", "BD88A2E670…
## $ rideable_type      <chr> "electric_bike", "classic_bike", "electric_bike", "…
## $ started_at         <dttm> 2023-01-21 20:05:42, 2023-01-10 15:37:36, 2023-01-…
## $ ended_at           <dttm> 2023-01-21 20:16:33, 2023-01-10 15:46:05, 2023-01-…
## $ start_station_name <chr> "Lincoln Ave & Fullerton Ave", "Kimbark Ave & 53rd …
## $ start_station_id   <chr> "TA1309000058", "TA1309000037", "RP-005", "TA130900…
## $ end_station_name   <chr> "Hampden Ct & Diversey Ave", "Greenwood Ave & 47th …
## $ end_station_id     <chr> "202480.0", "TA1308000002", "599", "TA1308000002", …
## $ start_lat          <dbl> 41.92416, 41.79957, 42.00857, 41.79957, 41.79957, 4…
## $ start_lng          <dbl> -87.64638, -87.59474, -87.69046, -87.59474, -87.594…
## $ end_lat            <dbl> 41.93188, 41.80983, 42.03973, 41.80983, 41.80983, 4…
## $ end_lng            <dbl> -87.64191, -87.59937, -87.69942, -87.59937, -87.599…
## $ member_casual      <fct> member, member, casual, member, member, member, mem…
## $ duration_minutes   <dbl> 10.85, 8.48, 13.23, 8.77, 15.32, 3.22, 14.00, 9.35,…
## $ quarter_name       <fct> 1st Quarter, 1st Quarter, 1st Quarter, 1st Quarter,…
## $ month_name         <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, J…
## $ day_of_week_name   <ord> Sat, Tue, Mon, Sun, Thu, Tue, Sun, Wed, Wed, Fri, T…
## $ hour_of_day        <int> 20, 15, 7, 10, 13, 7, 21, 10, 20, 16, 17, 17, 19, 2…
## $ weekend            <chr> "Weekend", "Monday to Friday", "Monday to Friday", …
## $ round_trip         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ trip_distance_km   <dbl> 0.9344032, 1.2025276, 3.5405369, 1.2025276, 1.20252…
## $ trip_speed_kmph    <dbl> 5.17, 8.51, 16.06, 8.23, 4.71, 13.26, 5.15, 7.72, 5…

8. Trip duration


⬅️ Back to Summary


Now I aim to explore the duration of the trips to identify if there are any anomalies:

summary(df$duration_minutes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -54.57    5.62    9.78   15.94   17.48 6891.22

The summary function shows some negative trip durations, which are obviously impossible. Therefore, I will drop those rows (in a real-world setting, this too would be worth discussing with the relevant stakeholders).




8.1 Remove negative duration trips


⬅️ Back to Summary


rows_dropped_negative_duration <- df %>%
  filter(duration_minutes <= 0) %>%
  summarize(n_row = n()) %>%
  pull(n_row)

# Keep only positive durations
df <- subset(df, duration_minutes > 0)

paste0(
  "Rows dropped due to negative duration: ", 
  comma(rows_dropped_negative_duration), 
  " (", 
  round(100 * rows_dropped_negative_duration / original_df_rows, 3), 
  "% of raw dataframe)"
)
## [1] "Rows dropped due to negative duration: 550 (0.01% of raw dataframe)"



8.2 Remove outliers


⬅️ Back to Summary


The result of the summary function for duration_minutes above suggests a wide range of ride durations, with potential outliers at both ends of the spectrum.

To identify outliers, we could apply methodologies such as Tukey’s fences or the z-score (spoiler alert: neither yields results that I find acceptable for this case study). Alternatively, we can rely on common sense and industry knowledge.
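
For completeness, this is roughly how Tukey’s fences would be computed on this column (a sketch only, not applied here for the reason above):

# Tukey's fences on trip duration: values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] would be flagged as outliers
q1 <- unname(quantile(df$duration_minutes, 0.25))
q3 <- unname(quantile(df$duration_minutes, 0.75))
iqr_val <- q3 - q1
c(lower_fence = q1 - 1.5 * iqr_val, upper_fence = q3 + 1.5 * iqr_val)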

What constitutes too little and too much for a bike ride?

Quite arbitrarily, I’ve posited that:
- the minimum duration relevant to this analysis is 1 minute for a bike-sharing ride. Shorter rides could be due to technical malfunctions, users changing their mind, unforeseen circumstances (like sudden rain), or perhaps a desire to simply test out the service
- the maximum duration relevant to this analysis is 1,440 minutes (1 day), which might align with the 1-day ride pass offered by the company. Rides exceeding this limit could be attributed to factors such as payment issues, theft/loss, or other technical problems

On the basis of above assumptions, I’ll filter the dataframe accordingly:

rows_dropped_outliers <- df %>%
  filter(duration_minutes < 1 | duration_minutes > 1440) %>%
  summarise(n_row = n()) %>%
  pull(n_row)

# Filter DF
df <- df %>%
  filter(duration_minutes >= 1, duration_minutes <= 1440)

paste0(
  "Rows dropped due to outliers: ", 
  comma(rows_dropped_outliers), 
  " (", 
  round(100 * rows_dropped_outliers / original_df_rows, 2), 
  "% of raw dataframe)"
)
## [1] "Rows dropped due to outliers: 85,498 (1.49 % of raw dataframe)"


9. Trip Speed


⬅️ Back to Summary


summary(df$trip_speed_kmph)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.82   10.73   10.54   13.59 1151.89

The summary function on trip_speed_kmph suggests the presence of some values that indicate impossible speeds.

According to the company’s website, the maximum speed for electric bikes is 20 MPH (approximately 32 km/h). Therefore, I will filter out all trips with speeds higher than 32 km/h to ensure data accuracy:
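
For reference, the conversion behind the 32 km/h threshold (1 mile = 1.609344 km):

# 20 mph expressed in km/h
20 * 1.609344
## [1] 32.18688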

rows_dropped_impossible_speed <- df %>%
  filter(trip_speed_kmph > 32) %>%
  summarise(n_row = n()) %>%
  pull(n_row)

# Filter DF
df <- subset(df, trip_speed_kmph <= 32)

paste0(
  "Rows dropped due to impossible speed: ", 
  comma(rows_dropped_impossible_speed), 
  " (", 
  round(100 * rows_dropped_impossible_speed / original_df_rows, 2), 
  "% of raw dataframe)"
)
## [1] "Rows dropped due to impossible speed: 1,827 (0.03 % of raw dataframe)"


10. Concluding Activities


10.1 Remove columns not necessary for analysis


⬅️ Back to Summary


I’ll drop the columns ride_id, started_at and ended_at, as they will not be needed for the analysis phase:

df <- df %>%
    select(-ride_id, -started_at, -ended_at)



10.2 Total rows dropped during process phase


⬅️ Back to Summary


Let’s count how many rows in total were dropped during the whole process phase and examine the dimensions of the df:

rows_dropped_df <- data.frame(
  dropping_reason = c(
    "Empty or NA Values",
    "Bad Station IDs/Names", 
    "Duplicated Coordinates", 
    "Negative trip duration",
    "Trip duration outliers", 
    "Impossible speed"
  ), 
  n_rows_dropped = c(
    rows_dropped_empty_NA_rows,
    rows_dropped_bad_names, 
    duplicated_coordinates, 
    rows_dropped_negative_duration,
    rows_dropped_outliers, 
    rows_dropped_impossible_speed
  )
) %>%
  mutate(
    percentage_on_raw_DF = percent(n_rows_dropped / original_df_rows)
  )

print(rows_dropped_df)
##          dropping_reason n_rows_dropped percentage_on_raw_DF
## 1     Empty or NA Values        1388170             24.2692%
## 2  Bad Station IDs/Names          78388              1.3704%
## 3 Duplicated Coordinates             11              0.0002%
## 4 Negative trip duration            550              0.0096%
## 5 Trip duration outliers          85498              1.4948%
## 6       Impossible speed           1827              0.0319%
rows_dropped_total <- original_df_rows - nrow(df)

paste0(
  "Total rows dropped during the whole data processing phase: ", 
  comma(rows_dropped_total), 
  " (", 
  round(100 * rows_dropped_total / original_df_rows, 2), 
  "% of raw dataframe)"
)
## [1] "Total rows dropped during the whole data processing phase: 1,554,444 (27.18 % of raw dataframe)"
dim(df)
## [1] 4165433      19