This project is part of the Google Data Analytics Certificate on Coursera; the dataset and the fictional business problem come from Track 1, "How does a bike-share navigate speedy success?". I used SQL to append the datasets and R for data import, cleaning, and visualization, as well as for this final report.
Cyclistic is a bike-share program with over 5,800 bikes and 600 docking stations. Currently there are two types of riders: 1) casual riders (those who buy single-ride or full-day passes), and 2) annual members, a.k.a. Cyclistic members.
The goal is to maximize the number of annual memberships by converting more casual riders into annual members. Lily Moreno (director of marketing) and the Cyclistic finance analysts have concluded that annual members are more profitable than casual riders. Until now, Cyclistic has focused on building general awareness and appealing to broad consumer segments.
Currently, 8 percent of riders use assistive options (e.g. reclining bikes, hand tricycles, and cargo bikes), while the rest use traditional bikes. Those who use traditional bikes are more likely to be casual riders, but about 30 percent of them use the service to commute to work each day.
Of the three questions that will guide the broader future marketing program, this report answers only the first: How do annual members and casual riders use Cyclistic bikes differently?
The stakeholders are Lily Moreno (director of marketing), the Cyclistic marketing analytics team, and the Cyclistic executive team.
library(tidyverse)  # data import, wrangling, and ggplot2 graphics
library(skimr)      # summary statistics via skim()
library(janitor)    # data-cleaning helpers
library(lubridate)  # date handling
library(openxlsx)   # exporting data frames to Excel
library(anytime)    # parsing date-time strings
library(ggridges)   # ridgeline density plots
library(plotly)     # interactive maps
The dataset representing the Cyclistic bike-share service comes from Divvy data made available by Lyft Bikes and Scooters, LLC ("Bikeshare"); see the license agreement for more information: https://ride.divvybikes.com/data-license-agreement. The past 12 months (March 2022 to February 2023) of historical trip data were downloaded from the database in CSV format to a Google Drive location, then uploaded to BigQuery. There, the 12 files were appended and exported as a single CSV file to be read into this R Markdown report. Apart from appending the data, the SQL query left the data essentially unchanged so that I could showcase the data cleaning process in R; in practice, it would be more practical and efficient to do some of these steps in the SQL query instead.
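For reference, the same append could also be done directly in R. A minimal sketch, assuming the 12 monthly CSVs sit in a local "data" folder and share one schema (the folder path and file-name pattern here are placeholder assumptions, not part of the original pipeline):
# Hypothetical R alternative to the BigQuery append:
# read every monthly CSV and stack them into one data frame.
monthly_files <- list.files("data",
                            pattern = "divvy-tripdata\\.csv$",
                            full.names = TRUE)
df_appended <- monthly_files %>%
  purrr::map_dfr(read_csv, show_col_types = FALSE)
The appended CSV exported from BigQuery is read in below: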
df <- read_csv("202203-202302_AppendedData_withCoordinates.csv",
show_col_types = FALSE)
colnames(df)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "member_casual"
## [10] "start_lat" "start_lng" "end_lat"
## [13] "end_lng"
#View(df)
First the headers are renamed to make them clearer. Then the date columns are converted to the datetime format, the start dates are extracted from the datetime columns, and a trip time column is calculated as the difference between the end and start datetimes. The original datetime columns are then removed.
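The renaming chunk itself is not shown in this report; a minimal sketch consistent with the column names used later (df_1 is assumed to be the renamed data frame the next chunk starts from):
# Rename the raw Divvy headers to the clearer names used below
df_1 <- df %>%
  rename(
    Ride_ID             = ride_id,
    Bike_Type           = rideable_type,
    Start_Date_and_Time = started_at,
    End_Date_and_Time   = ended_at,
    Start_Station_Name  = start_station_name,
    Start_Station_ID    = start_station_id,
    End_Station_Name    = end_station_name,
    End_Station_ID      = end_station_id,
    Rider_Type          = member_casual,
    Start_Latitude      = start_lat,
    Start_Longitude     = start_lng,
    End_Latitude        = end_lat,
    End_Longitude       = end_lng
  )
With the columns renamed, the datetime handling follows: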
df_2 <- df_1 %>%
  mutate(
    # parse the raw timestamps first so the derived columns below
    # work from proper datetimes (mutate evaluates sequentially)
    Start_Date_and_Time = anytime(Start_Date_and_Time, tz = "UTC"),
    End_Date_and_Time = anytime(End_Date_and_Time, tz = "UTC"),
    Start_Date = date(Start_Date_and_Time),
    Trip_Time_Minutes = difftime(End_Date_and_Time,
                                 Start_Date_and_Time, units = "mins")
  )
Since the trip times and the starting dates have been “extracted” from the original datetime columns, the original columns can be removed.
df_3 <- select(df_2, -c(Start_Date_and_Time, End_Date_and_Time))
Some Trip_Time_Minutes values are negative, which doesn't make sense. These negative observations will be removed, and any trip under 2 minutes will also be removed to make sure the trip was intentional on the rider's part.
paste(sum(df_3$Trip_Time_Minutes < 2),
      "rows with trip times under 2 minutes have been removed, equal to",
      round(100*(sum(df_3$Trip_Time_Minutes < 2))/(nrow(df_3)), 1),
      "% of all trips.")
## [1] "235437 rows with trip times under 2 minutes have been removed, equal to 4 % of all trips."
df_clean <- df_3 %>%
  filter(Trip_Time_Minutes > 2) # keep only trips longer than 2 minutes
Now to look at missing values and unique values from the cleaned dataset, as well as some statistics by using the skim function.
df_skim <- skim_without_charts(df_clean) %>%
print()
## ── Data Summary ────────────────────────
## Values
## Name df_clean
## Number of rows 5590916
## Number of columns 13
## _______________________
## Column type frequency:
## character 7
## Date 1
## difftime 1
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 Ride_ID 0 1 16 16 0 5590916 0
## 2 Bike_Type 0 1 11 13 0 3 0
## 3 Start_Station_Name 792663 0.858 7 64 0 1687 0
## 4 Start_Station_ID 792773 0.858 3 37 0 1311 0
## 5 End_Station_Name 838246 0.850 9 64 0 1703 0
## 6 End_Station_ID 838380 0.850 3 37 0 1317 0
## 7 Rider_Type 0 1 6 6 0 2 0
##
## ── Variable type: Date ─────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max median
## 1 Start_Date 0 1 2022-03-01 2023-02-28 2022-08-01
## n_unique
## 1 365
##
## ── Variable type: difftime ─────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max
## 1 Trip_Time_Minutes 0 1 2.016667 mins 41447.25 mins
## median n_unique
## 1 10.6 mins 22265
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75
## 1 Start_Latitude 0 1 41.9 0.0461 41.6 41.9 41.9 41.9
## 2 Start_Longitude 0 1 -87.6 0.0292 -87.8 -87.7 -87.6 -87.6
## 3 End_Latitude 5934 0.999 41.9 0.0635 0 41.9 41.9 41.9
## 4 End_Longitude 5934 0.999 -87.6 0.0955 -88.1 -87.7 -87.6 -87.6
## p100
## 1 42.1
## 2 -87.5
## 3 42.4
## 4 0
df_skim %>%
ggplot(mapping = aes(x = 100*complete_rate, y = skim_variable)) +
geom_bar(stat= "identity", fill="azure3") +
labs(
title="Data completeness per column",
subtitle = "Station Names and IDs have missing data!",
x = "Completeness [%]",
y = ""
) +
theme_bw() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, color = "black",
size = 10, face = "bold")
) +
geom_text(aes(label = round(100*complete_rate, digits = 0)),
position = position_stack(vjust = 0.97))
* Between 15-17% of the data is missing for the start and end station names and IDs.
From the skim function output the following information about unique observations can be seen:
The number of unique station values is much greater than the "over 600 docking stations" stated in the case study description; the start and end station IDs come closest (1311 and 1317 unique values). This divergence merits investigation, but it is not the focus of this analysis.
The station IDs consist of a code, while the station names are one or two street names (presumably their intersection) separated by an "&" symbol. There are more unique station names than station IDs, which could be for a variety of reasons: misspellings, street names in reverse order, or perhaps more than one station per intersection. Delving deeper into the geographical data is not the focus of this analysis; in a real-world problem, this is something I would ask a data engineer about. However, to do my due diligence, I will export the unique values to Excel and present my observations and recommendations here.
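A minimal sketch of that export using the openxlsx package loaded earlier (the workbook file name is a placeholder):
# Collect the unique station name/ID pairs and write each set
# to its own sheet for manual review in Excel.
unique_stations <- list(
  start_stations = df_clean %>% distinct(Start_Station_Name, Start_Station_ID),
  end_stations   = df_clean %>% distinct(End_Station_Name, End_Station_ID)
)
write.xlsx(unique_stations, file = "unique_stations.xlsx")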
From the investigation I found that:
* There are 17 duplicate start station names (for station ID)
* There are 341 duplicate start station IDs (for station names)
* There are 17 duplicate end station names (for station ID)
* There are 351 duplicate end station IDs (for station names)
* Station ID codes are sometimes text, e.g. street names
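Such duplicate counts can be reproduced with the janitor package loaded earlier; for example, a sketch of finding start station IDs that map to more than one name:
# Unique name/ID pairs, then rows whose Start_Station_ID appears
# more than once, i.e. IDs carrying several names
df_clean %>%
  distinct(Start_Station_Name, Start_Station_ID) %>%
  drop_na() %>%
  get_dupes(Start_Station_ID)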
This overlap likely means that many locations should be aggregated into one if ride counts per location are to be compared fairly. For this reason, and because of the ~15% missing values, the station data carries significant bias; keep this in mind when looking at the results so as not to base conclusions and decisions on misleading data.
By splitting the data by rider type (casual riders and members) and by the three bike types (classic, docked, and electric bikes), we can compare how many trips each group makes per year:
df_clean %>%
ggplot() +
geom_bar(mapping = aes(Bike_Type, fill = Bike_Type), color = "black") +
facet_wrap(~Rider_Type) +
labs(title="Annual rides per bike and rider type", x = "", y = "Rides [count]") +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
Observations: notably, docked bikes are used only by casual riders (there is no member/docked bike group in the summary table below).
To get some stats on the trip time I’ll create a table and a boxplot of the ride times.
df_clean %>%
group_by(Rider_Type, Bike_Type) %>%
summarise(min = min(Trip_Time_Minutes),
max = max(Trip_Time_Minutes),
mean = mean(Trip_Time_Minutes),
median = median(Trip_Time_Minutes),
"standard deviation" = sd(Trip_Time_Minutes)
)
## `summarise()` has grouped output by 'Rider_Type'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 7
## # Groups: Rider_Type [2]
## Rider_Type Bike_Type min max mean median stand…¹
## <chr> <chr> <drtn> <drtn> <drtn> <drtn> <dbl>
## 1 casual classic_bike 2.016667 mins 1559.9333 mins 29.255… 14.80… 91.6
## 2 casual docked_bike 2.016667 mins 41447.2500 mins 125.119… 28.33… 969.
## 3 casual electric_bike 2.016667 mins 480.4333 mins 16.681… 11.25… 18.1
## 4 member classic_bike 2.016667 mins 1559.9500 mins 14.219… 9.65… 37.8
## 5 member electric_bike 2.016667 mins 614.4000 mins 11.989… 8.66… 14.4
## # … with abbreviated variable name ¹`standard deviation`
df_clean %>%
ggplot(aes(x=Rider_Type, y=Trip_Time_Minutes, fill=Bike_Type)) +
geom_boxplot() +
facet_wrap(~Rider_Type) +
labs(title="Annual ride times per bike and rider type",
x = "",
y = "duration [minutes]") +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
The boxplot isn't very helpful since the spread of the docked-bike rides is so wide. I'll instead plot only the mean times, keeping in mind the large spread of some of the values.
df_clean %>%
group_by(Rider_Type, Bike_Type) %>%
summarise(mean = mean(Trip_Time_Minutes), n= n()) %>%
ggplot(mapping = aes(x = Bike_Type,
y = mean,
fill = Bike_Type)) +
geom_bar(stat= "identity", color = "black") +
facet_wrap(~Rider_Type) +
labs(title="Annual Mean ride times per bike and rider type",
x = "",
y = "duration [minutes]") +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
This shows that although docked bikes account for fewer rides than the other bike types, their mean ride time is roughly 4-7 times longer. Casual riders also ride for longer than members, especially on classic bikes (about twice as long on average).
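The rough ratios quoted above can be checked directly from the data; a minimal sketch recomputing them for casual riders:
# Mean trip time per bike type for casual riders, plus how many
# times longer the docked-bike mean is than each bike type's mean
df_clean %>%
  filter(Rider_Type == "casual") %>%
  group_by(Bike_Type) %>%
  summarise(mean_min = as.numeric(mean(Trip_Time_Minutes)), .groups = "drop") %>%
  mutate(docked_vs_type = mean_min[Bike_Type == "docked_bike"] / mean_min)
A ridgeline plot of the trip-time distributions under one hour gives a fuller picture: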
ggplot(df_clean,
aes(x = Trip_Time_Minutes,
y = Bike_Type,
fill = Bike_Type)
) +
scale_x_continuous(limits=c(0, 60)) +
geom_density_ridges() +
facet_wrap(~Rider_Type, ncol = 1) +
theme_ridges() +
labs(title = "Distribution of trip times under 1 hour",
x = "Trip time [minutes]", y = "Rides [count]") +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
This last graph for the distribution of times shows that the docked bike ride times are spread out much more than other bike rides.
Now to plot the distribution of rides throughout the year, divided again into casual riders and members, as well as bike types.
ggplot(df_clean,
aes(x = Start_Date,
fill = Bike_Type)) +
geom_density(alpha = 0.4) +
scale_x_date(date_breaks = "1 months", date_labels = "%b %d") +
facet_wrap(~Rider_Type, ncol = 1) +
labs(title = "Trips distribution per day of the year",
x = "Trip day", y = "") +
theme(axis.text.y=element_blank(), #remove y axis labels
axis.ticks.y=element_blank() #remove y axis ticks
) +
theme(plot.title = element_text(hjust = 0.5, color = "black",
size = 14, face = "bold"))
* Bike usage is generally higher between May and November and lower between November and May.
* Electric bikes see their heaviest use from July to October.
* Classic and docked bikes overlap more, with their heaviest use from May to August.
* Members show a more even distribution of usage throughout the year than casual riders: in the cooler months (September-May) casual riders use the service less relative to members, while in the warmer months (June-October) they use it more.
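These patterns can be tabulated directly; a sketch of monthly ride counts per rider type:
# Count rides per calendar month and rider type, one column per type
df_clean %>%
  mutate(Month = floor_date(Start_Date, unit = "month")) %>%
  count(Rider_Type, Month) %>%
  pivot_wider(names_from = Rider_Type, values_from = n)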
Since the summer months are when most casual riders use the bikes, there could be an opportunity to run campaigns during these months to hook them for the year.
Since there are a lot of data points, I will draw a random sample of 100,000 rides, split by rider type and bike type, to see how the starting locations of the rides differ.
(Clicking the legend toggles the markers on the map.)
df_clean_map <- df_clean %>%
select(Rider_Type,
Bike_Type,
Start_Longitude,
Start_Latitude)
fig <- plot_ly(df_clean_map %>% sample_n(100000),
lat = ~Start_Latitude,
lon = ~Start_Longitude,
marker = list(size = 6, opacity = 0.2),
type = 'scattermapbox',
colors = "Set1",
color = ~Rider_Type
) %>%
config(displayModeBar = FALSE) %>%
layout(
mapbox = list(
style = 'carto-positron',
zoom = 8,
center = list(lon = -87.63, lat = 41.88)))
fig
## No scattermapbox mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
There is quite a lot of overlap between casual riders and members, so this map doesn't really give any new insights. Looking at the bike types, however, reveals something interesting.
fig <- plot_ly(df_clean_map %>% sample_n(100000),
lat = ~Start_Latitude,
lon = ~Start_Longitude,
marker = list(size = 6, opacity = 0.2),
type = 'scattermapbox',
colors = "Set1",
color = ~Bike_Type
) %>%
config(displayModeBar = FALSE) %>%
layout(
mapbox = list(
style = 'carto-positron',
zoom = 8,
center = list(lon = -87.63, lat = 41.88)))
fig
## No scattermapbox mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
The markers for the docked bikes are much more centralized than those for classic and electric bikes. Classic bikes show more spread, and electric bikes the most of all, especially in the western and southern parts of Chicago.
From the analysis I found these to be the most pressing calls to action for Cyclistic: