Stocking Rental Bikes¶
This exercise is based on Rachel Tatman's tutorial on BigQuery ML, Google Cloud Platform's in-database machine learning option.
The exercise that goes along with it can be found here.
The problem to solve is the following:
"You stock bikes for a bike rental company in Austin, ensuring stations have enough bikes for all their riders. You want to build a model to predict how many riders will start from each station during each hour, capturing patterns in seasonality, time of day, day of the week, etc."
To get started, create a project in GCP and connect to it by running the code cell below. Make sure you have connected the kernel to your GCP account in Settings by enabling BigQuery.
# Set your own project id here
# as a string, the name of the BigQuery project you have created beforehand
PROJECT_ID = 'kaggle-bq-mwptje-exercise'
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
dataset = client.create_dataset('model_dataset', exists_ok=True)
from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID
Load Google Cloud Bigquery extension¶
In order to run BigQuery queries directly in the notebook, you need to load this Jupyter extension first:
%load_ext google.cloud.bigquery
Check out the dataset¶
This dataset contains data from bike sharing in Austin, TX.
# create a reference to our table
table = client.get_table("bigquery-public-data.austin_bikeshare.bikeshare_trips")
# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()
2 - Evaluate the data using a dataframe¶
Create a dataframe from the query using the BigQuery magic command and have a quick look.
The %%bigquery magic command stores the query result in a dataframe, in this case named df_rides, so you can review the data.
Notes:
- The select groups by starting station and by the date and hour of the timestamp column
- The function timestamp_trunc truncates the timestamp column to the hour
- Rides that start and end at the same station in under an hour are excluded, since the bike is available again within that hour
- We select as training data only records with a date before the first of January 2018
%%bigquery df_rides
select start_station_name, timestamp_trunc(start_time,hour) as start_hour, count(trip_id) as num_rides
from `bigquery-public-data.austin_bikeshare.bikeshare_trips`
where start_time < '2018-01-01'
and ( ( start_station_id != end_station_id ) or
( start_station_id = end_station_id and duration_minutes > 59 ) )
group by start_station_name, start_hour
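The TIMESTAMP_TRUNC(start_time, HOUR) call and the GROUP BY ... COUNT() can be mimicked in pandas with Series.dt.floor and groupby, which is handy for sanity-checking the query logic locally. A minimal sketch on made-up data (column names match the query; the timestamps are invented):

```python
import pandas as pd

# Toy trips: two rides in the same hour, one in the next hour
trips = pd.DataFrame({
    "start_station_name": ["22nd & Pearl"] * 3,
    "start_time": pd.to_datetime([
        "2017-06-01 08:15:00", "2017-06-01 08:45:00", "2017-06-01 09:05:00",
    ]),
    "trip_id": [1, 2, 3],
})

# Equivalent of TIMESTAMP_TRUNC(start_time, HOUR)
trips["start_hour"] = trips["start_time"].dt.floor("h")

# Equivalent of GROUP BY start_station_name, start_hour with COUNT(trip_id)
num_rides = (trips.groupby(["start_station_name", "start_hour"])["trip_id"]
             .count()
             .rename("num_rides")
             .reset_index())
print(num_rides)
```

The 08:00 hour gets two rides and the 09:00 hour gets one, matching what the SQL aggregation would produce.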
Initial look at the query results ...
# check the first few rows
df_rides.head()
# how many records are we dealing with here?
df_rides.shape
See if there are any issues with the data. It looks like there are no null values and each column has a consistent data type:
df_rides.info()
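df.info() prints dtypes and non-null counts; the same checks can also be made programmatically, which is useful if you want the notebook to fail loudly on bad data. A small sketch on a toy stand-in for df_rides:

```python
import pandas as pd

# Toy stand-in for df_rides with the same three columns
df = pd.DataFrame({
    "start_station_name": ["22nd & Pearl", "City Hall"],
    "start_hour": pd.to_datetime(["2017-01-01 08:00", "2017-01-01 09:00"]),
    "num_rides": [3, 5],
})

# Programmatic versions of what df.info() shows
assert df.isnull().sum().sum() == 0          # no missing values anywhere
assert df["num_rides"].dtype.kind == "i"     # ride counts are integers
print(df.dtypes)
```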
3 - Create and train the model¶
Create the model based on the query. We use rows with a start date before 2018-01-01 as training data; later we will use rows on or after this date as test data.
The "label" (the column to be predicted) is the number of rides per station per hour, modelled with linear regression:
%%bigquery
CREATE OR REPLACE MODEL `model_dataset.bike_trips`
OPTIONS(model_type='linear_reg', OPTIMIZE_STRATEGY='batch_gradient_descent') AS
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
COUNT(trip_id) as label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_time < "2018-01-01"
GROUP BY start_station_name, start_hour
Look at the training iterations of the model:
%%bigquery
SELECT
*
FROM
ML.TRAINING_INFO(MODEL `model_dataset.bike_trips`)
ORDER BY iteration
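ML.TRAINING_INFO returns one row per iteration with loss columns, so plotting loss against iteration shows whether gradient descent converged. A sketch with made-up loss values (in the notebook you would capture the real query result with something like %%bigquery df_training; the numbers below are invented):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the ML.TRAINING_INFO result
df_training = pd.DataFrame({
    "iteration": [0, 1, 2, 3],
    "loss":      [190.0, 60.0, 25.0, 13.5],
    "eval_loss": [200.0, 65.0, 28.0, 14.0],
})

# Loss should fall and flatten out as training converges
df_training.plot(x="iteration", y=["loss", "eval_loss"], marker="o")
plt.ylabel("mean squared error")
plt.title("Training loss per iteration")
plt.show()
```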
4 - Model Evaluation¶
Run the evaluation on the test data: rows with a start date on or after 2018-01-01.
We are especially interested in the r2_score as the evaluation criterion.
%%bigquery
SELECT *
FROM
ML.EVALUATE(MODEL `model_dataset.bike_trips`, (
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
COUNT(trip_id) as label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_time >= "2018-01-01"
GROUP BY start_station_name, start_hour
))
With a negative r2_score of -0.087843 we are doing worse than simply predicting the mean, so what's up?
Note: for an explanation of r2_score see http://www.fairlynerdy.com/what-is-r-squared/
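To make the "worse than the mean" claim concrete: R² = 1 - SS_res/SS_tot, where SS_tot is the squared error of a constant model that always predicts the mean of the labels. That constant model scores exactly 0, so any negative R² means the model's predictions are further from the truth than the mean is. A small numeric illustration (all numbers made up):

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y_true = [2, 4, 6, 8]
mean_pred = [5, 5, 5, 5]         # always predict the mean -> R^2 == 0
bad_pred = [8, 6, 4, 2]          # anti-correlated -> negative R^2
good_pred = [2.5, 4.5, 5.5, 7]   # close to the truth -> R^2 near 1

print(r2_score(y_true, mean_pred))   # 0.0
print(r2_score(y_true, bad_pred))    # -3.0
print(r2_score(y_true, good_pred))   # 0.9125
```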
5 - Theories for poor performance¶
- There is no random train/test split used
- The year 2018 shows a definite increase in rides compared to earlier years
- Since we are looking at hourly data, bikes returned to the same station within an hour should not be counted, as they are available again within that same hour
- We should be looking at stations that existed for a while before 2018 and then continued to exist in 2018
- We should be looking at "real" stations and not temporary or administrative ones
Have a look at rides by year and notice the spike in 2018:
%%bigquery df_rides_by_year
select extract(year from start_time) as start_year, count(trip_id) as num_rides
from `bigquery-public-data.austin_bikeshare.bikeshare_trips`
group by start_year
order by start_year
import matplotlib.pyplot as plt
df_rides_by_year.set_index('start_year',inplace=True)
df_rides_by_year.plot(kind='bar')
plt.show()
6 - Exercise looking at predictions¶
A good way to figure out where your model is going wrong is to look closer at a small set of predictions. Use your model to predict the number of rides for the 22nd & Pearl station in 2018, and compare the mean values of predicted vs. actual rides.
%%bigquery
SELECT AVG(predicted_label) as avg_predicted_trips,
AVG(label) as avg_actual_trips
FROM ML.PREDICT(MODEL `model_dataset.bike_trips`, (
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
COUNT(trip_id) as label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE EXTRACT(YEAR from start_time) = 2018
AND start_station_name = '22nd & Pearl'
GROUP BY start_station_name, start_hour))
7 - Exercise: Average daily rides per station¶
Either something is wrong with the model or something surprising is happening in the 2018 data.
What could be happening in the data? Write a query to get the average number of rides per station for each year in the dataset, ordered by year so you can see the trend. You can use the EXTRACT function to get the year from the start_time timestamp.
%%bigquery df_avg_rides_by_year
WITH stations_by_year AS (
SELECT start_station_name,
EXTRACT(year FROM start_time) AS year,
COUNT(trip_id) as rides
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY start_station_name, year
ORDER BY start_station_name, year
)
SELECT year, AVG(rides) AS avg_rides
FROM stations_by_year
GROUP by year
ORDER by year
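The same two-level aggregation (rides per station per year, then the average over stations per year) can be checked in pandas. A sketch on a toy frame (station names and dates are invented):

```python
import pandas as pd

# Toy trips across two years and two stations
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "A", "B", "B"],
    "start_time": pd.to_datetime([
        "2017-03-01", "2017-07-01", "2017-05-01",
        "2018-02-01", "2018-06-01", "2018-09-01",
    ]),
    "trip_id": range(6),
})

# Equivalent of EXTRACT(year FROM start_time)
trips["year"] = trips["start_time"].dt.year

# Inner aggregation: rides per station per year
rides = trips.groupby(["start_station_name", "year"])["trip_id"].count()

# Outer aggregation: average over stations for each year
avg_rides = rides.groupby("year").mean()
print(avg_rides)
```

In 2017 station A has 2 rides and B has 1, so the average is 1.5; in 2018 it is 1.5 again (A: 1, B: 2).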
df_avg_rides_by_year.set_index('year',inplace=True)
df_avg_rides_by_year.plot(kind='bar')
plt.title('Average Rides per Station per Year')
plt.show()
8 - What do your results tell you?¶
Given the daily average riders per station over the years, does it make sense that the model is failing?
- Due to the spike in 2018 it does make sense the model is failing
- There are several stations that have been added during 2018 so no history data to predict on
- A suggestion would be to only take stations into account that have existed within a certain time frame like from 2015 to 2019
9 - Next Steps¶
Evimar Principal de Soto's kernel presents a solution in which a random selection splits the data into an 80/20 train/test split; this gave a much better r2_score.
To improve the score further, I have also restricted the selection to rides with a different start and end station, plus rides that returned the bicycle to the same station after more than an hour. The idea is that a bike returned to the same station within the hour is available again in that same hour, so the ride need not be counted.
%%bigquery
CREATE OR REPLACE MODEL `model_dataset.bike_trips80`
OPTIONS(model_type='linear_reg',
OPTIMIZE_STRATEGY='batch_gradient_descent') AS
WITH stations AS
(
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
MAX(trip_id) AS trip_id,
COUNT(trip_id) AS label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_id != end_station_id OR
( start_station_id = end_station_id AND duration_minutes > 59 )
GROUP BY start_station_name, start_hour
)
SELECT start_station_name, start_hour, label
FROM stations
WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10) < 8
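The MOD(FARM_FINGERPRINT(...), 10) trick works because FARM_FINGERPRINT is a deterministic hash: each trip_id always lands in the same bucket, so the 80/20 split is reproducible across the model-creation and evaluation queries. A Python analogue of the same bucketing logic, using hashlib's md5 as a stand-in for FARM_FINGERPRINT (the hash function differs, but the bucket arithmetic is the same):

```python
import hashlib

def bucket(trip_id, n_buckets=10):
    """Deterministic bucket in [0, n_buckets) derived from a trip id."""
    digest = hashlib.md5(str(trip_id).encode()).hexdigest()
    return int(digest, 16) % n_buckets

trip_ids = range(10_000)
train = [t for t in trip_ids if bucket(t) < 8]    # buckets 0-7, ~80%
test = [t for t in trip_ids if bucket(t) >= 8]    # buckets 8-9, ~20%

print(len(train) / len(trip_ids))  # roughly 0.8

# Determinism: the same id always lands in the same bucket
assert bucket(1234) == bucket(1234)
```

Because the split is a pure function of the id, rerunning either query never shuffles rows between train and test, unlike a RAND()-based split.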
Evaluate the model based on the new train/test split:
%%bigquery
SELECT
*
FROM ML.EVALUATE(MODEL `model_dataset.bike_trips80`, (
WITH stations AS
(
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
MAX(trip_id) AS trip_id,
COUNT(trip_id) AS label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_id != end_station_id OR
( start_station_id = end_station_id AND duration_minutes > 59 )
GROUP BY start_station_name, start_hour
)
SELECT start_station_name, start_hour, label
FROM stations
WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10) >= 8
))
- The mean squared error improved to 3.900408 from the previous model evaluation's 13.494581
- The r2_score is now positive: 0.202602, compared to the previous negative value of -0.087843
Try again with start stations that have existed from 2016 to 2018¶
Let's see if the r2_score improves when we only query stations that existed throughout the period from 2016 to 2018. Note: we still use the random train/test split here.
%%bigquery
CREATE OR REPLACE MODEL `model_dataset.bike_trips80_2`
OPTIONS(model_type='linear_reg',
OPTIMIZE_STRATEGY='batch_gradient_descent') AS
WITH conseq_stations AS
(
SELECT start_station_name,
EXTRACT(year FROM start_time) AS year,
COUNT(trip_id) AS rides
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE EXTRACT(year FROM start_time) BETWEEN 2016 and 2018
GROUP BY start_station_name, year
ORDER BY start_station_name, year
),
stations_2016_2018 AS
(
SELECT start_station_name
FROM conseq_stations
GROUP BY start_station_name
HAVING COUNT(*) = 3
ORDER BY start_station_name
),
stations AS
(
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
MAX(trip_id) AS trip_id,
COUNT(trip_id) AS label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE EXTRACT(year FROM start_time) BETWEEN 2016 AND 2018
GROUP BY start_station_name, start_hour
)
SELECT stations.start_station_name, stations.start_hour, stations.label
FROM stations INNER JOIN stations_2016_2018
ON stations.start_station_name = stations_2016_2018.start_station_name
WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10) < 8
%%bigquery
SELECT
*
FROM ML.EVALUATE(MODEL `model_dataset.bike_trips80_2`, (
WITH seq_stations AS
(
SELECT start_station_name,
EXTRACT(year FROM start_time) AS year,
COUNT(trip_id) AS rides
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE EXTRACT(year FROM start_time) BETWEEN 2016 and 2018
GROUP BY start_station_name, year
ORDER BY start_station_name, year
),
stations_2016_2018 AS
(
SELECT start_station_name
FROM seq_stations
GROUP BY start_station_name
HAVING COUNT(*) = 3
ORDER BY start_station_name
),
stations AS
(
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
MAX(trip_id) AS trip_id,
COUNT(trip_id) AS label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY start_station_name, start_hour
)
SELECT stations.start_station_name, stations.start_hour, stations.label
FROM stations INNER JOIN stations_2016_2018
ON stations.start_station_name = stations_2016_2018.start_station_name
WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10) >= 8
))
- The mean squared error improved to 2.833299 from the previous model evaluation's 3.900408
- The r2_score did not improve: 0.10704 versus the previous 0.202602
So restricting the data to long-lived stations has not improved the results.