This project is part of the Google Data Analytics Certificate on Coursera; the dataset and the fictional business problem come from Track 1, "How does a bike-share navigate speedy success?". I used SQL to append the datasets and R for data import, cleaning, and visualization, as well as for this final report.
Cyclistic is a bike-share program with over 5,800 bikes and 600 docking stations. Currently there are two types of riders: 1) casual riders (those who buy single-ride or full-day passes), and 2) annual members, a.k.a. Cyclistic members.
The goal is to maximize the number of annual memberships by converting more casual riders into annual members. Lily Moreno (director of marketing) and the Cyclistic finance analysts have concluded that annual members are more profitable than casual riders. Until now, Cyclistic has focused on building general awareness and appealing to broad consumer segments.
Currently, 8 percent of riders use assistive options (e.g. reclining bikes, hand tricycles, and cargo bikes), while the rest use traditional bikes. Those who use traditional bikes are more likely to be casual riders, but about 30 percent of them use the service to commute to work each day.
Of the three questions that will guide the broader future marketing program, this report answers only the first: How do annual members and casual riders use Cyclistic bikes differently?
The stakeholders are Lily Moreno (director of marketing), the Cyclistic marketing analytics team, and the Cyclistic executive team.
library(tidyverse)  # data import, wrangling, and ggplot2 graphics
library(skimr)      # summary statistics via skim()
library(janitor)    # data-cleaning helpers
library(lubridate)  # date handling
library(openxlsx)   # exporting data frames to Excel
library(anytime)    # parsing date-time strings
library(ggridges)   # ridgeline density plots
library(plotly)     # interactive maps
The dataset representing the Cyclistic bike-share service comes from Divvy data made available by Lyft Bikes and Scooters, LLC ("Bikeshare"); see the license agreement for more information: https://ride.divvybikes.com/data-license-agreement. The past 12 months (March 2022 to February 2023) of historical trip data were downloaded from the database in CSV format to a Google Drive location, then uploaded to BigQuery. There, the 12 files were appended and exported as a single CSV file to be read into this R Markdown report. Apart from appending the data, the SQL query left the data essentially unchanged so that I could showcase the data cleaning process in R; in practice, it would be more practical and efficient to do some of these steps in the SQL query instead.
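For reference, the same append could also be done directly in R. A minimal sketch, assuming the 12 monthly CSVs sit in a local "data" folder and share one schema (the folder path and file-name pattern here are placeholder assumptions, not part of the original pipeline):
# Hypothetical R alternative to the BigQuery append:
# read every monthly CSV and stack them into one data frame.
monthly_files <- list.files("data",
                            pattern = "divvy-tripdata\\.csv$",
                            full.names = TRUE)
df_appended <- monthly_files %>%
  purrr::map_dfr(read_csv, show_col_types = FALSE)
The appended CSV exported from BigQuery is read in below: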
df <- read_csv("202203-202302_AppendedData_withCoordinates.csv",
show_col_types = FALSE)
colnames(df)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "member_casual"
## [10] "start_lat" "start_lng" "end_lat"
## [13] "end_lng"
#View(df)
First the headers are renamed to make them clearer. Then the date columns are converted to the datetime format, the start dates are extracted from the datetime columns, and a trip time column is calculated as the difference between the end and start datetimes. The original datetime columns are then removed.
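The renaming chunk itself is not shown in this report; a minimal sketch consistent with the column names used later (df_1 is assumed to be the renamed data frame the next chunk starts from):
# Rename the raw Divvy headers to the clearer names used below
df_1 <- df %>%
  rename(
    Ride_ID             = ride_id,
    Bike_Type           = rideable_type,
    Start_Date_and_Time = started_at,
    End_Date_and_Time   = ended_at,
    Start_Station_Name  = start_station_name,
    Start_Station_ID    = start_station_id,
    End_Station_Name    = end_station_name,
    End_Station_ID      = end_station_id,
    Rider_Type          = member_casual,
    Start_Latitude      = start_lat,
    Start_Longitude     = start_lng,
    End_Latitude        = end_lat,
    End_Longitude       = end_lng
  )
With the columns renamed, the datetime handling follows: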
df_2 <- df_1 %>%
  mutate(
    # parse the raw timestamps first so the derived columns below
    # work from proper datetimes (mutate evaluates sequentially)
    Start_Date_and_Time = anytime(Start_Date_and_Time, tz = "UTC"),
    End_Date_and_Time = anytime(End_Date_and_Time, tz = "UTC"),
    Start_Date = date(Start_Date_and_Time),
    Trip_Time_Minutes = difftime(End_Date_and_Time,
                                 Start_Date_and_Time, units = "mins")
  )
Since the trip times and the starting dates have been “extracted” from the original datetime columns, the original columns can be removed.
df_3 <- select(df_2, -c(Start_Date_and_Time, End_Date_and_Time))
Some Trip_Time_Minutes values are negative, which doesn't make sense. These negative observations will be removed, and any trip under 2 minutes will also be removed to make sure the trip was intentional on the rider's part.
paste(sum(df_3$Trip_Time_Minutes < 2),
      "rows with trip times under 2 minutes have been removed, equal to",
      round(100*(sum(df_3$Trip_Time_Minutes < 2))/(nrow(df_3)), 1),
      "% of all trips.")
## [1] "235437 rows with trip times under 2 minutes have been removed, equal to 4 % of all trips."
df_clean <- df_3 %>%
  filter(Trip_Time_Minutes > 2) # keep only trips longer than 2 minutes
Now to look at missing values and unique values from the cleaned dataset, as well as some statistics by using the skim function.
df_skim <- skim_without_charts(df_clean) %>%
print()
## ── Data Summary ────────────────────────
## Values
## Name df_clean
## Number of rows 5590916
## Number of columns 13
## _______________________
## Column type frequency:
## character 7
## Date 1
## difftime 1
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 Ride_ID 0 1 16 16 0 5590916 0
## 2 Bike_Type 0 1 11 13 0 3 0
## 3 Start_Station_Name 792663 0.858 7 64 0 1687 0
## 4 Start_Station_ID 792773 0.858 3 37 0 1311 0
## 5 End_Station_Name 838246 0.850 9 64 0 1703 0
## 6 End_Station_ID 838380 0.850 3 37 0 1317 0
## 7 Rider_Type 0 1 6 6 0 2 0
##
## ── Variable type: Date ─────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max median
## 1 Start_Date 0 1 2022-03-01 2023-02-28 2022-08-01
## n_unique
## 1 365
##
## ── Variable type: difftime ─────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max
## 1 Trip_Time_Minutes 0 1 2.016667 mins 41447.25 mins
## median n_unique
## 1 10.6 mins 22265
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75
## 1 Start_Latitude 0 1 41.9 0.0461 41.6 41.9 41.9 41.9
## 2 Start_Longitude 0 1 -87.6 0.0292 -87.8 -87.7 -87.6 -87.6
## 3 End_Latitude 5934 0.999 41.9 0.0635 0 41.9 41.9 41.9
## 4 End_Longitude 5934 0.999 -87.6 0.0955 -88.1 -87.7 -87.6 -87.6
## p100
## 1 42.1
## 2 -87.5
## 3 42.4
## 4 0
df_skim %>%
ggplot(mapping = aes(x = 100*complete_rate, y = skim_variable)) +
geom_bar(stat= "identity", fill="azure3") +
labs(
title="Data completeness per column",
subtitle = "Station Names and IDs have missing data!",
x = "Completeness [%]",
y = ""
) +
theme_bw() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, color = "black",
size = 10, face = "bold")
) +
geom_text(aes(label = round(100*complete_rate, digits = 0)),
position = position_stack(vjust = 0.97))
* Between 15-17% of the data is missing for the start and end station names and IDs.
From the skim function output the following information about unique observations can be seen:
The number of unique station values is much greater than the "over 600 docking stations" stated in the case study description; the start and end station IDs come closest (1311 and 1317 unique values). This divergence merits investigation, but it is not the focus of this analysis.
The station IDs consist of a code, while the station names are one or two street names (presumably their intersection) separated by an "&" symbol. There are more unique station names than station IDs, which could be for a variety of reasons: misspellings, street names in reverse order, or perhaps more than one station per intersection. Delving deeper into the geographical data is not the focus of this analysis; in a real-world problem, this is something I would ask a data engineer about. However, to do my due diligence, I will export the unique values to Excel and present my observations and recommendations here.
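A minimal sketch of that export using the openxlsx package loaded earlier (the workbook file name is a placeholder):
# Collect the unique station name/ID pairs and write each set
# to its own sheet for manual review in Excel.
unique_stations <- list(
  start_stations = df_clean %>% distinct(Start_Station_Name, Start_Station_ID),
  end_stations   = df_clean %>% distinct(End_Station_Name, End_Station_ID)
)
write.xlsx(unique_stations, file = "unique_stations.xlsx")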
From the investigation I found that:
* There are 17 duplicate start station names (for station ID)
* There are 341 duplicate start station IDs (for station names)
* There are 17 duplicate end station names (for station ID)
* There are 351 duplicate end station IDs (for station names)
* Station ID codes are sometimes text, e.g. street names
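Such duplicate counts can be reproduced with the janitor package loaded earlier; for example, a sketch of finding start station IDs that map to more than one name:
# Unique name/ID pairs, then rows whose Start_Station_ID appears
# more than once, i.e. IDs carrying several names
df_clean %>%
  distinct(Start_Station_Name, Start_Station_ID) %>%
  drop_na() %>%
  get_dupes(Start_Station_ID)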
This overlap likely means that many locations should be aggregated into one if ride counts per location are to be compared fairly. For this reason, and because of the ~15% missing values, the station data carries significant bias; keep this in mind when looking at the results so as not to base conclusions and decisions on misleading data.
By splitting the data by rider type (casual riders and members) and by the three bike types (classic, docked, and electric bikes), we can compare how many trips each group makes per year:
df_clean %>%
ggplot() +
geom_bar(mapping = aes(Bike_Type, fill = Bike_Type), color = "black") +
facet_wrap(~Rider_Type) +
labs(title="Annual rides per bike and rider type", x = "", y = "Rides [count]") +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
Observations: notably, docked bikes are used only by casual riders (there is no member/docked bike group in the summary table below).
To get some stats on the trip time I’ll create a table and a boxplot of the ride times.
df_clean %>%
group_by(Rider_Type, Bike_Type) %>%
summarise(min = min(Trip_Time_Minutes),
max = max(Trip_Time_Minutes),
mean = mean(Trip_Time_Minutes),
median = median(Trip_Time_Minutes),
"standard deviation" = sd(Trip_Time_Minutes)
)
## `summarise()` has grouped output by 'Rider_Type'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 7
## # Groups: Rider_Type [2]
## Rider_Type Bike_Type min max mean median stand…¹
## <chr> <chr> <drtn> <drtn> <drtn> <drtn> <dbl>
## 1 casual classic_bike 2.016667 mins 1559.9333 mins 29.255… 14.80… 91.6
## 2 casual docked_bike 2.016667 mins 41447.2500 mins 125.119… 28.33… 969.
## 3 casual electric_bike 2.016667 mins 480.4333 mins 16.681… 11.25… 18.1
## 4 member classic_bike 2.016667 mins 1559.9500 mins 14.219… 9.65… 37.8
## 5 member electric_bike 2.016667 mins 614.4000 mins 11.989… 8.66… 14.4
## # … with abbreviated variable name ¹`standard deviation`
df_clean %>%
ggplot(aes(x=Rider_Type, y=Trip_Time_Minutes, fill=Bike_Type)) +
geom_boxplot() +
facet_wrap(~Rider_Type) +
labs(title="Annual ride times per bike and rider type",
x = "",
y = "duration [minutes]") +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
The boxplot isn't very helpful since the spread of the docked-bike rides is so wide. I'll instead plot only the mean times, keeping in mind the large spread of some of the values.
df_clean %>%
group_by(Rider_Type, Bike_Type) %>%
summarise(mean = mean(Trip_Time_Minutes), n= n()) %>%
ggplot(mapping = aes(x = Bike_Type,
y = mean,
fill = Bike_Type)) +
geom_bar(stat= "identity", color = "black") +
facet_wrap(~Rider_Type) +
labs(title="Annual Mean ride times per bike and rider type",
x = "",
y = "duration [minutes]") +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
This shows that although docked bikes account for fewer rides than the other bike types, their mean ride time is roughly 4-7 times longer. Casual riders also ride for longer than members, especially on classic bikes (about twice as long on average).
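The rough ratios quoted above can be checked directly from the data; a minimal sketch recomputing them for casual riders:
# Mean trip time per bike type for casual riders, plus how many
# times longer the docked-bike mean is than each bike type's mean
df_clean %>%
  filter(Rider_Type == "casual") %>%
  group_by(Bike_Type) %>%
  summarise(mean_min = as.numeric(mean(Trip_Time_Minutes)), .groups = "drop") %>%
  mutate(docked_vs_type = mean_min[Bike_Type == "docked_bike"] / mean_min)
A ridgeline plot of the trip-time distributions under one hour gives a fuller picture: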
ggplot(df_clean,
aes(x = Trip_Time_Minutes,
y = Bike_Type,
fill = Bike_Type)
) +
scale_x_continuous(limits=c(0, 60)) +
geom_density_ridges() +
facet_wrap(~Rider_Type, ncol = 1) +
theme_ridges() +
labs(title = "Distribution of trip times under 1 hour",
x = "Trip time [minutes]", y = "Rides [count]") +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
This last graph for the distribution of times shows that the docked bike ride times are spread out much more than other bike rides.
Now to plot the distribution of rides throughout the year, divided again into casual riders and members, as well as bike types.
ggplot(df_clean,
aes(x = Start_Date,
fill = Bike_Type)) +
geom_density(alpha = 0.4) +
scale_x_date(date_breaks = "1 months", date_labels = "%b %d") +
facet_wrap(~Rider_Type, ncol = 1) +
labs(title = "Trips distribution per day of the year",
x = "Trip day", y = "") +
theme(axis.text.y=element_blank(), #remove y axis labels
axis.ticks.y=element_blank() #remove y axis ticks
) +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
* Bike usage is generally higher between May and November and lower between November and May.
* Electric bikes see their heaviest use from July to October.
* Classic and docked bikes overlap more, with their heaviest use from May to August.
* Members show a more even distribution of usage throughout the year than casual riders: in the cooler months (September-May) casual riders use the service less relative to members, while in the warmer months (June-October) they use it more.
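These patterns can be tabulated directly; a sketch of monthly ride counts per rider type:
# Count rides per calendar month and rider type, one column per type
df_clean %>%
  mutate(Month = floor_date(Start_Date, unit = "month")) %>%
  count(Rider_Type, Month) %>%
  pivot_wider(names_from = Rider_Type, values_from = n)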
Since the summer months are when most casual riders use the bikes, there could be an opportunity to run campaigns during these months to hook them for the year.
Since there are a lot of data points, I will draw a random sample of 100,000 rides, split by rider type and bike type, to see how the starting locations of the rides differ.
(Clicking the legend toggles the markers on the map.)
df_clean_map <- df_clean %>%
select(Rider_Type,
Bike_Type,
Start_Longitude,
Start_Latitude)
fig <- plot_ly(df_clean_map %>% sample_n(100000),
lat = ~Start_Latitude,
lon = ~Start_Longitude,
marker = list(size = 6, opacity = 0.2),
type = 'scattermapbox',
colors = "Set1",
color = ~Rider_Type
) %>%
config(displayModeBar = FALSE) %>%
layout(
mapbox = list(
style = 'carto-positron',
zoom = 8,
center = list(lon = -87.63, lat = 41.88)))
fig
## No scattermapbox mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
There is quite a lot of overlap between casual riders and members, so this map doesn't really give any new insights. Looking at the bike types, however, reveals something interesting.
fig <- plot_ly(df_clean_map %>% sample_n(100000),
lat = ~Start_Latitude,
lon = ~Start_Longitude,
marker = list(size = 6, opacity = 0.2),
type = 'scattermapbox',
colors = "Set1",
color = ~Bike_Type
) %>%
config(displayModeBar = FALSE) %>%
layout(
mapbox = list(
style = 'carto-positron',
zoom = 8,
center = list(lon = -87.63, lat = 41.88)))
fig
## No scattermapbox mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
The markers for the docked bikes are much more centralized than those for classic and electric bikes. Classic bikes show more spread, and electric bikes the most of all, especially in the western and southern parts of Chicago.
From the analysis I found these to be the most pressing calls to action for Cyclistic: