Hello, and welcome to my first blog post!
In this post we will combine two of my primary interests: cycling and data analysis. We will look at the Bergen Bicycle Data, which is a dataset (available through a public API) consisting of all data from the rentable bikes that have been set out all over Bergen.
In this first part, we will do some exploratory analysis and get familiar with the data at hand. Primarily, we investigate the following:
- What affects the duration of each bike ride?
- When and where are the bikes used the most?
Let’s get started!
Importing the data
We will use the lovely purrr package to iterate over the URLs of the monthly files and combine them into a single data set. All available data from June through December 2018 are included.
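library(tidyverse)

# One CSV file per month, June (06) through December (12) 2018
base_url <- "http://data.urbansharing.com/bergenbysykkel.no/trips/v1/2018/"
month <- seq(6, 12)
list_urls <- paste0(base_url, sprintf('%0.2d', month), ".csv")

# Read every file and bind the rows into one data frame
df <- map_dfr(list_urls, ~ read_csv(.x))
str(df)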
## Classes 'tbl_df', 'tbl' and 'data.frame': 102873 obs. of 13 variables:
##  $ started_at               : POSIXct, format: "2018-06-29 10:45:12" "2018-06-29 10:53:59" ...
##  $ ended_at                 : POSIXct, format: "2018-06-29 11:05:10" "2018-06-29 11:11:28" ...
##  $ duration                 : int 1197 1048 490 478 1779 506 518 116 636 1754 ...
##  $ start_station_id         : int 3 49 58 157 58 3 132 75 157 34 ...
##  $ start_station_name       : chr "Grieghallen" "Studentsenteret UIB" "Tårnplass" "Høyteknologisenteret" ...
##  $ start_station_description: chr NA NA NA "Høyteknologisenteret" ...
##  $ start_station_latitude   : num 60.4 60.4 60.4 60.4 60.4 ...
##  $ start_station_longitude  : num 5.33 5.32 5.32 5.33 5.32 ...
##  $ end_station_id           : int 83 75 157 83 116 82 34 75 83 58 ...
##  $ end_station_name         : chr "Bergen jernbanestasjon" "Akvariet" "Høyteknologisenteret" "Bergen jernbanestasjon" ...
##  $ end_station_description  : chr NA NA "Høyteknologisenteret" NA ...
##  $ end_station_latitude     : num 60.4 60.4 60.4 60.4 60.4 ...
##  $ end_station_longitude    : num 5.33 5.3 5.33 5.33 5.33 ...
As we can see, the dataset contains 102 873 observations. The data includes all completed trips with a duration greater than 1 minute, and contains useful information such as where the trip started, where it ended and the duration.
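As a quick sanity check on the import step, the URL construction can be run on its own without downloading anything. This is only a minimal sketch in base R; it uses the format "%02d", which should give the same zero-padded month numbers as the '%0.2d' format in the import code.

```r
# Build the monthly CSV URLs for June through December 2018 (nothing is downloaded).
base_url <- "http://data.urbansharing.com/bergenbysykkel.no/trips/v1/2018/"
month <- seq(6, 12)

# "%02d" pads single-digit months with a leading zero: 6 -> "06"
list_urls <- paste0(base_url, sprintf("%02d", month), ".csv")

print(length(list_urls))
print(basename(list_urls))
```

Printing the file names before calling read_csv makes it easy to spot a wrong month range or a missing leading zero.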
To get a good feel for the data one is about to analyse, visualisation is always a good start.
Let’s start by looking at where in Bergen the stations are located. We will use the “leaflet” package to create the map.
library(leaflet)

map_info <- df %>%
  group_by(start_station_name, start_station_longitude, start_station_latitude) %>%
  summarise(n_rides = n()) %>%
  ungroup() %>%
  mutate(n_rides_norm = 4 * n_rides / max(n_rides))

leaflet(map_info) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~ start_station_longitude,
                   lat = ~ start_station_latitude,
                   radius = ~ n_rides_norm,
                   popup = ~ paste0(start_station_name, ", number of rides = ", n_rides),
                   fill = TRUE)
Clicking the points of interest gives you the name of the station and the total number of rides started there. Immediately, one can tell that the stations closer to the city centre are used the most.
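The marker sizes come from the n_rides_norm column computed in the map code. The following sketch shows what that scaling does; the station names and ride counts are hypothetical and only serve as an illustration.

```r
# Hypothetical ride counts per station, for illustration only.
n_rides <- c(station_a = 5000, station_b = 2500, station_c = 500)

# Same scaling as in the map code: the busiest station gets radius 4,
# every other station a proportionally smaller radius.
n_rides_norm <- 4 * n_rides / max(n_rides)
print(n_rides_norm)
```

Since the radius is proportional to the number of rides, the area of a marker grows with the square of the count. If the differences between stations look exaggerated on the map, scaling by sqrt(n_rides) instead would be one possible alternative.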
Next, let’s have a look at how the number of rides has evolved over time.
ggplot(df, aes(x = started_at)) +
  geom_histogram(fill = "seagreen4", color = "black") +
  scale_x_datetime(breaks = pretty(df$started_at, n = 6))
Not surprisingly, we see that the number of rides peaked just after the summer vacation (weather still good, work commute started again), and has declined steadily since.
Next, let’s look at the distribution of the duration. I convert the given duration in seconds to whole minutes by using integer division and plot it using a combination of geom_histogram and geom_density.
df <- df %>%
  filter(duration < 8000) %>%
  mutate(duration_minutes = duration %/% 60)

ggplot(df, aes(x = duration_minutes)) +
  geom_histogram(aes(y = ..density..), fill = "white", color = "grey", binwidth = 0.9) +
  geom_density(fill = "seagreen4", color = "black", alpha = 0.7, size = 0.9) +
  geom_vline(xintercept = mean(df$duration_minutes), linetype = "dashed", color = "red", size = 0.9) +
  xlim(c(0, 60))
As we can see, the average ride length is just over 10 minutes, with very few rides approaching 1 hour. This is not surprising - the bikes are meant for riding in the city centre, and barring disaster you should be able to get anywhere in less than 1 hour!
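A small aside on the integer division used for duration_minutes: %/% drops the remainder rather than rounding. The sketch below applies it to the first few durations shown in the str() output above and compares the result with ordinary rounding.

```r
# First few durations (in seconds) from the str() output above.
duration <- c(1197, 1048, 490, 478, 1779)

# %/% is integer division: it drops the remainder instead of rounding.
duration_minutes <- duration %/% 60
print(duration_minutes)

# For comparison: rounding to the nearest minute.
print(round(duration / 60))
```

A ride of 1197 seconds therefore counts as 19 minutes, not 20. If the leftover seconds are spread fairly evenly, this should pull the mean of duration_minutes down by roughly half a minute compared with the exact mean.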
Does the average ride length vary by the day of the week? We can investigate this using a box plot.
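library(lubridate)

# First, add some time variables
df <- df %>%
  mutate(week_day = wday(started_at, label = TRUE),
         time_of_day = hour(started_at),
         month = month(started_at, label = TRUE))

# Create the plot and zoom in; coord_cartesian() zooms without dropping
# rides above 1000 seconds from the box plot statistics (ylim() would drop them)
ggplot(df, aes(x = week_day, y = duration)) +
  geom_boxplot(fill = "seagreen4") +
  coord_cartesian(ylim = c(0, 1000))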
There does not appear to be a large difference here, but there seems to be a weak tendency for rides at the weekends to be slightly longer on average.
Another hypothesis is that rides in the summer should be longer. Let’s visualise this using a scatter plot and an estimated trend.
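To make the derived time variables concrete, here is a minimal base-R sketch that extracts the same kind of information from the first two start times in the str() output above. It is not the lubridate code used in the post, and fixing the time zone to UTC is an assumption made only to keep the example reproducible.

```r
# The first two start times shown in the str() output above.
# Assumption: the time zone is set to UTC here just for reproducibility.
started_at <- as.POSIXct(c("2018-06-29 10:45:12", "2018-06-29 10:53:59"), tz = "UTC")

lt <- as.POSIXlt(started_at)
week_day <- lt$wday      # 0 = Sunday, ..., 5 = Friday, 6 = Saturday
time_of_day <- lt$hour   # hour of the day, 0-23
month <- lt$mon + 1      # POSIXlt months are 0-based, so add 1

print(data.frame(started_at, week_day, time_of_day, month))
```

Both rides started on a Friday in June, in the hour starting at 10:00. With label = TRUE, lubridate returns ordered factors with day and month names instead of these plain numbers.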
ggplot(df, aes(x = started_at, y = duration)) +
  geom_jitter(alpha = 0.5, color = "grey") +
  geom_smooth() +
  scale_y_log10()
As expected, there appears to be a slight downward trend, i.e. the rides were longer in the summer. Unfortunately we do not even have a full year of data yet, so it will be interesting to see how this develops.
Moreover, one would expect the duration to depend on the starting point, as some stations are closer to the primary points of interest.
library(ggridges)

# Create a vector of the 20 largest stations (by number of rides started)
largest_stations <- df %>%
  count(start_station_name) %>%
  top_n(20, n) %>%
  pull(start_station_name)

# Plot using ggridges
# Note: fct_reorder with -duration sorts the stations by descending median
# duration, so the station with the lowest median ends up at the top of the plot
df %>%
  filter(start_station_name %in% largest_stations) %>%
  ggplot(aes(x = duration_minutes, y = fct_reorder(start_station_name, - duration))) +
  geom_density_ridges(fill = "seagreen4", alpha = 0.5) +
  xlim(c(0, 25)) +
  labs(y = "Starting point")
Not surprisingly, the centrally located UIB station has the lowest median duration (students also tend to live near the city centre, hence the short commutes). Meanwhile, the rides started at the Aquarium are usually longer.
Finally, let’s have a look at which time of day most rides take place, partitioned by week day, using a heat map.
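The ordering of the stations in the ridge plot can be a little hard to reason about, so here is a toy sketch of the same idea. It uses base R's reorder() as a stand-in for fct_reorder(), and the station names and durations are hypothetical.

```r
# Toy data with hypothetical station names and durations (in seconds).
rides <- data.frame(
  station  = rep(c("A", "B", "C"), each = 3),
  duration = c(300, 360, 420,   # A: median 360
               60, 120, 180,    # B: median 120
               600, 660, 720)   # C: median 660
)

# Base-R counterpart of fct_reorder(station, -duration): levels are sorted by
# the median of -duration, i.e. from the longest median ride to the shortest.
ordered_station <- reorder(factor(rides$station), -rides$duration, FUN = median)
print(levels(ordered_station))
```

The levels come out as C, A, B. Since ggplot2 draws the first level at the bottom of a discrete y-axis, the station with the shortest median ride (B here) ends up at the top.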
df %>%
  count(week_day, time_of_day) %>%
  ggplot(aes(x = week_day, y = time_of_day, fill = n)) +
  geom_tile() +
  scale_fill_viridis_c()
As we can see, the pattern is pretty consistent and not very surprising - on weekdays the bikes are used the most at around 07:00 in the morning and around 14:00-15:00 in the afternoon. At the weekends, the bikes are used much less.
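The heat map is built from a table with one row per observed combination of week day and hour. The sketch below shows that counting step in base R; the start times are invented purely for illustration, and aggregate() is used here as a rough stand-in for count().

```r
# Invented start times, for illustration only (UTC to keep it reproducible).
started_at <- as.POSIXct(c("2018-09-03 07:10:00", "2018-09-03 07:40:00",
                           "2018-09-03 15:05:00", "2018-09-08 12:30:00"),
                         tz = "UTC")
lt <- as.POSIXlt(started_at)
rides <- data.frame(week_day = lt$wday, time_of_day = lt$hour, n = 1)

# One row per observed (week_day, time_of_day) pair with its ride count,
# similar in spirit to count(week_day, time_of_day).
counts <- aggregate(n ~ week_day + time_of_day, data = rides, FUN = sum)
print(counts)
```

Combinations without any rides do not get a row at all, so they would show up as empty tiles in the heat map rather than as tiles with a count of zero.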
Unfortunately the dataset itself contains fairly limited information, so to gain more interesting insight one could combine it with other relevant data sources. In particular, it would be interesting to combine weather data with the existing data set to see if there are any relations there (we would expect rainy days to have less activity - even though we “bergensere” have been hardened by decades of relentless, never-ending rain…).
In the next blog post, we will do some modelling and have a look at how we can predict the duration of any given bike ride given the available data.