Analyzing Strava data using R

Today I will look at how to connect to the Strava API and do some quick analysis of the activity data.

To use the Strava API, first register a developer app at strava.com. This gives you your API credentials, including your client id and personal secret. I stored my secret in the system keyring with keyring::key_set() beforehand, and fetch it with key_get() in the code below. Next, I get a data frame of Strava activities using the rStrava package (https://github.com/fawda123/rStrava).

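# Packages used throughout the post
library(tidyverse)
library(lubridate)
library(rStrava)
library(leaflet)
library(patchwork)

# Config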
client_id <- 38395
secret <- keyring::key_get("secret", "strava")
app_name <- "Bloggdata"

# Get token, then fetch data (the fetch is commented out to save time; the saved RDS is read instead)
stoken <- httr::config(token = strava_oauth(app_name, client_id, secret, app_scope="activity:read_all"))
#my_acts <- get_activity_list(stoken)

#df <- compile_activities(my_acts)
#saveRDS(df, file = here::here("./content/post/strava_df.rds"))
df <- readRDS(here::here("./content/post/strava_df.rds"))

Aggregated activity data

Let’s look at the cumulative riding distance per year, highlighting the last 3 years using the gghighlight package.

df <- df %>%
  mutate(start_date = as_date(start_date),
         year = year(start_date),
         day_of_year = yday(start_date),
         month = month(start_date),
         day = wday(start_date, label = TRUE),
         week = week(start_date))

df %>%
  group_by(year) %>%
  arrange(start_date) %>%
  mutate(cumulative_riding_distance = cumsum(distance)) %>%
  ungroup() %>%
  ggplot(aes(x = day_of_year, y = cumulative_riding_distance, color = factor(year))) +
  geom_line() +
  scale_color_brewer(palette = "Set1") +
  gghighlight::gghighlight(year > 2016) +
  labs(title = "Cumulative riding distance per year",
       subtitle = "Last 3 years highlighted")

We see that 2017 is currently my best year with 6000 registered kilometers, but 2019 is not looking too bad either. There is still some time left, too!
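
To make the grouped cumulative sum concrete, here is a minimal sketch on a handful of made-up rides. The `toy` data frame and its values are invented for illustration; only the column names mirror the ones used above.

```r
library(dplyr)
library(lubridate)

# Made-up rides (distance in km), purely for illustration
toy <- tibble::tibble(
  start_date = as_date(c("2018-01-05", "2018-03-10", "2018-07-01",
                         "2019-02-02", "2019-02-20")),
  distance = c(40, 60, 100, 30, 50)
)

toy_cum <- toy %>%
  mutate(year = year(start_date),
         day_of_year = yday(start_date)) %>%
  arrange(start_date) %>%
  group_by(year) %>%
  # cumsum() runs separately within each year, in date order
  mutate(cumulative_riding_distance = cumsum(distance)) %>%
  ungroup()

toy_cum$cumulative_riding_distance
# 40 100 200 30 80
```

The running total restarts at the first ride of 2019, which is why each year gets its own line starting near zero in the plot.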

We can also easily create a GitHub-style calendar heat map using ggplot:

ggplot(df %>% filter(year > 2011), aes(x = week, y = factor(day))) +
  geom_tile(aes(fill = distance)) +
  scale_fill_continuous(low = "lightgreen", high = "red") +
  facet_wrap(~ year, scales = "free_x") +
  labs(x = "Week", y = "") 
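
One detail worth keeping in mind: geom_tile draws one tile per row, so two activities on the same day end up as overlapping tiles. A possible refinement, sketched here on made-up data, is to sum the distance per day before plotting. The `toy` and `daily` names are hypothetical and not part of the original analysis.

```r
library(dplyr)
library(lubridate)

# Made-up activities: two rides on the same day, one ride three days later
toy <- tibble::tibble(
  start_date = as_date(c("2019-05-04", "2019-05-04", "2019-05-07")),
  distance = c(20, 35, 50)
)

daily <- toy %>%
  mutate(year = year(start_date),
         week = week(start_date),
         day = wday(start_date, label = TRUE)) %>%
  # One row per calendar day, with the total distance ridden that day
  group_by(year, week, day) %>%
  summarise(distance = sum(distance), .groups = "drop")

daily
```

Passing a data frame like `daily` to the ggplot call would then give exactly one tile per day.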

Moreover, let’s look at the relationship between distance and average speed using a hex-plot:

df %>%
  ggplot(aes(x = distance, y = average_speed)) + 
  geom_hex() +
  scale_fill_viridis_c()

The plot clearly shows that my typical ride is 25-75 kilometers at around 25 kph.
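
The hex plot is essentially a two-dimensional histogram. As a rough illustration of the idea, the sketch below counts made-up rides in rectangular bins using cut() instead of hexagons. The data and the bin widths are assumptions chosen for the example.

```r
library(dplyr)

# Made-up rides: distance in km, average speed in km/h
toy <- tibble::tibble(
  distance      = c(30, 45, 40, 60, 55, 110, 20),
  average_speed = c(24, 23, 25, 26, 27, 29, 21)
)

binned <- toy %>%
  mutate(distance_bin = cut(distance, breaks = seq(0, 125, by = 25)),
         speed_bin    = cut(average_speed, breaks = seq(20, 30, by = 5))) %>%
  # Count rides per (distance, speed) cell, most common cell first
  count(distance_bin, speed_bin, sort = TRUE)

binned
```

The first row of `binned` is the most populated cell, which corresponds to the brightest hexagon in a plot like the one above.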

Analyzing individual activities

It is also possible to look at individual activities in detail using the get_streams function. Here I will look at my ride at “Skjærgårdsrittet” in 2018.

I couldn’t get the compile activity function from the rStrava-package to work, as it seems to be incompatible with newer versions of tidyr. Instead, I adapted the code myself to make it work.

#types <- list("time", "latlng", "distance", "altitude", "velocity_smooth", "heartrate", "cadence", "watts", "temp", "moving", "grade_smooth")
ride_raw <- get_streams(stoken, id = "1586129481")

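# Reshape the raw stream list into a tibble with one row per stream type
ride <- ride_raw %>%
  purrr::transpose() %>%
  tibble::as_tibble() %>%
  dplyr::select(type, data) %>%
  dplyr::mutate(type = unlist(type),
                data = purrr::map(data, function(x) {
                  # Replace empty (length-0) entries with NA so they are not dropped later
                  idx <- !sapply(x, length)
                  x[idx] <- NA
                  return(x)
                }))

# Turn one (lat, lng) pair into a one-row tibble
lat_lng_to_df <- function(x) {
  purrr::set_names(x, nm = c("lat", "lng")) %>% tibble::as_tibble()
}

# One column per stream type, one row per sample; speed converted from m/s to km/h
ride <- ride %>%
  tidyr::spread(data = ., key = type, value = data) %>%
  tidyr::unnest() %>%
  dplyr::mutate(latlng = purrr::map(latlng, lat_lng_to_df)) %>%
  tidyr::unnest() %>%
  mutate(velocity_kph = 3.6 * velocity_smooth)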
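
To show what this reshaping does without calling the API, here is a minimal sketch on a mock stream list. The structure of `streams` (entries with a `type` and a `data` field, and `latlng` stored as pairs) is an assumption modeled on the code above, and the helper `to_numeric` is hypothetical. It uses only purrr and tibble, so it is an alternative illustration rather than the same code path.

```r
library(purrr)
library(tibble)

# Mock streams: three samples, with one missing altitude value
streams <- list(
  list(type = "distance", data = list(0, 12.5, 25.1)),
  list(type = "altitude", data = list(4.2, NULL, 5.0)),
  list(type = "latlng",
       data = list(list(59.10, 10.20), list(59.11, 10.21), list(59.12, 10.22)))
)

# Hypothetical helper: flatten a list of scalars, turning empty entries into NA
to_numeric <- function(x) {
  map_dbl(x, function(v) if (length(v) == 0) NA_real_ else as.numeric(v))
}

# Name each data list by its stream type
by_type <- set_names(map(streams, "data"), map_chr(streams, "type"))

ride_toy <- tibble(
  distance = to_numeric(by_type$distance),
  altitude = to_numeric(by_type$altitude),
  lat = map_dbl(by_type$latlng, 1),
  lng = map_dbl(by_type$latlng, 2)
)

ride_toy
```

The result has one row per sample and one column per measurement, which is the shape the map and the plots further down expect.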

Now we can, for example, plot the route on a map using the leaflet package.

Note: The rStrava package does contain functionality for creating maps, but unfortunately it uses the ggmap package, which now requires registering a credit card with your Google account to get access to the API.

leaflet(ride) %>%
  addTiles() %>%
  addPolylines(lng = ~ lng, 
               lat = ~ lat)

Or we can look at the altitude profile together with the speed, using the elegant “patchwork”-package to combine the plots.

ride <- ride %>%
  mutate(distance_km = distance / 1000)

altitude <- ride %>%
  ggplot(aes(x = distance_km, y = altitude)) +
  geom_area() +
  coord_cartesian(ylim = c(0, 300))

speed <- ride %>%
  ggplot(aes(x = distance_km, y = velocity_kph)) +
  geom_line() +
  # na.rm = TRUE so that missing speed samples do not turn the mean into NA
  geom_hline(yintercept = mean(ride$velocity_kph, na.rm = TRUE), linetype = "dashed", color = "red")

altitude + speed + plot_layout(ncol = 1, nrow = 2)

KOMs

You can also get a list of “King of the Mountains” (KOMs), i.e. segments where you have the all-time record, using the get_KOMs function from the rStrava-package.

koms <- get_KOMs(stoken,
                 id = "1601683")

# Convert the data from list to tibble:
koms_df <- koms %>%
  purrr::transpose() %>%
  tibble::as_tibble() %>%
  tidyr::unnest(name, elapsed_time, average_watts)
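
As an illustration of this list-to-tibble step, here is a minimal sketch on a mock list. The entries in `koms_toy` are made up, with only the three fields used above, and the sketch builds one small tibble per segment and stacks them instead of using transpose and unnest.

```r
library(purrr)
library(dplyr)

# Made-up KOM entries with the three fields used in this post
koms_toy <- list(
  list(name = "Hill A", elapsed_time = 95, average_watts = 410),
  list(name = "Hill B", elapsed_time = 320, average_watts = 335)
)

koms_toy_df <- koms_toy %>%
  # One single-row tibble per segment
  map(~ tibble::tibble(name = .x$name,
                       elapsed_time = .x$elapsed_time,
                       average_watts = .x$average_watts)) %>%
  # Stack them into one data frame
  bind_rows()

koms_toy_df
```

This variant assumes every entry has all three fields; a missing field would need to be replaced with NA first.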

Now we can, for example, plot all my KOMs, of which there are not that many anymore (I guess I should go KOM-hunting again this summer…)

koms_df %>%
  ggplot(aes(x = fct_reorder(name, elapsed_time), y = elapsed_time, fill = average_watts)) +
  geom_col(color = "black") +
  coord_flip() +
  labs(x = NULL) +
  theme(legend.position = "bottom")

Not surprisingly, we see that the estimated power output (in watts) tends to be higher the shorter the segment is.

There’s a lot more to discover using the Strava API, so I will definitely explore this further in the future.

André Waage Rivenæs
Data science consultant
