Stocking Rental Bikes¶
This exercise is based on Rachel Tatman's tutorial on BigQuery ML, Google Cloud Platform's in-database machine learning option.
The exercise that goes along with it can be found here.
The problem to solve is the following:
"You stock bikes for a bike rental company in Austin, ensuring stations have enough bikes for all their riders. You want to build a model to predict how many riders will start from each station during each hour, capturing patterns in seasonality, time of day, day of the week, etc."
To get started, create a project in GCP and connect to it by running the code cell below. Make sure you have connected the kernel to your GCP account in Settings by enabling BigQuery.
# Set your own project id here
# as a string, the name of the BigQuery project you have created beforehand
PROJECT_ID = 'kaggle-bq-mwptje-exercise'
from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
dataset = client.create_dataset('model_dataset', exists_ok=True)
from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID
Load Google Cloud Bigquery extension¶
In order to run BigQuery queries directly in the notebook, you need to load this Jupyter extension first:
%load_ext google.cloud.bigquery
Check out the dataset¶
This dataset contains data from bike sharing in Austin, TX.
# create a reference to our table
table = client.get_table("bigquery-public-data.austin_bikeshare.bikeshare_trips")
# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()
2 - Evaluate the data using a dataframe¶
Create a dataframe from the query using the BigQuery magic command and have a quick look.
The %%bigquery magic command stores the query result in a dataframe, in this case named df_rides, so you can review the data.
Notes:
- The select groups by starting station and by the date and hour of the timestamp column
- The function timestamp_trunc truncates the timestamp column to the hour
- Rides that start and end at the same station in under an hour are excluded, since the bike is available again within that hour
- We select as training data only records with a date before the first of January 2018
%%bigquery df_rides
select start_station_name, timestamp_trunc(start_time,hour) as start_hour, count(trip_id) as num_rides
from `bigquery-public-data.austin_bikeshare.bikeshare_trips`
where start_time < '2018-01-01'
and ( ( start_station_id != end_station_id ) or
( start_station_id = end_station_id and duration_minutes > 59 ) )
group by start_station_name, start_hour
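The TIMESTAMP_TRUNC(start_time, HOUR) call and the GROUP BY ... COUNT() can be mimicked in pandas with Series.dt.floor and groupby, which is handy for sanity-checking the query logic locally. A minimal sketch on made-up data (column names match the query; the timestamps are invented):

```python
import pandas as pd

# Toy trips: two rides in the same hour, one in the next hour
trips = pd.DataFrame({
    "start_station_name": ["22nd & Pearl"] * 3,
    "start_time": pd.to_datetime([
        "2017-06-01 08:15:00", "2017-06-01 08:45:00", "2017-06-01 09:05:00",
    ]),
    "trip_id": [1, 2, 3],
})

# Equivalent of TIMESTAMP_TRUNC(start_time, HOUR)
trips["start_hour"] = trips["start_time"].dt.floor("h")

# Equivalent of GROUP BY start_station_name, start_hour with COUNT(trip_id)
num_rides = (trips.groupby(["start_station_name", "start_hour"])["trip_id"]
             .count()
             .rename("num_rides")
             .reset_index())
print(num_rides)
```

The 08:00 hour gets two rides and the 09:00 hour gets one, matching what the SQL aggregation would produce.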
Initial look at the query results ...
# check the first few rows
df_rides.head()
# how many records are we dealing with here?
df_rides.shape
See if there are any issues with the data. It looks like there are no null values and each column has a consistent data type:
df_rides.info()
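df.info() prints dtypes and non-null counts; the same checks can also be made programmatically, which is useful if you want the notebook to fail loudly on bad data. A small sketch on a toy stand-in for df_rides:

```python
import pandas as pd

# Toy stand-in for df_rides with the same three columns
df = pd.DataFrame({
    "start_station_name": ["22nd & Pearl", "City Hall"],
    "start_hour": pd.to_datetime(["2017-01-01 08:00", "2017-01-01 09:00"]),
    "num_rides": [3, 5],
})

# Programmatic versions of what df.info() shows
assert df.isnull().sum().sum() == 0          # no missing values anywhere
assert df["num_rides"].dtype.kind == "i"     # ride counts are integers
print(df.dtypes)
```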
3 - Create and train the model¶
Create the model based on the query. We use rows with a start date before 2018-01-01 as training data; later we will use rows on or after this date as test data.
The "label" (the column to be predicted) is the number of rides per station per hour, modelled with linear regression:
%%bigquery
CREATE OR REPLACE MODEL `model_dataset.bike_trips`
OPTIONS(model_type='linear_reg', OPTIMIZE_STRATEGY='batch_gradient_descent') AS
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
COUNT(trip_id) as label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_time < "2018-01-01"
GROUP BY start_station_name, start_hour
Look at the training iterations of the model:
%%bigquery
SELECT
*
FROM
ML.TRAINING_INFO(MODEL `model_dataset.bike_trips`)
ORDER BY iteration
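ML.TRAINING_INFO returns one row per iteration with loss columns, so plotting loss against iteration shows whether gradient descent converged. A sketch with made-up loss values (in the notebook you would capture the real query result with something like %%bigquery df_training; the numbers below are invented):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the ML.TRAINING_INFO result
df_training = pd.DataFrame({
    "iteration": [0, 1, 2, 3],
    "loss":      [190.0, 60.0, 25.0, 13.5],
    "eval_loss": [200.0, 65.0, 28.0, 14.0],
})

# Loss should fall and flatten out as training converges
df_training.plot(x="iteration", y=["loss", "eval_loss"], marker="o")
plt.ylabel("mean squared error")
plt.title("Training loss per iteration")
plt.show()
```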
4 - Model Evaluation¶
Run the evaluation on the test data: rows with a start date on or after 2018-01-01.
We are especially interested in the r2_score as the evaluation criterion.
%%bigquery
SELECT *
FROM
ML.EVALUATE(MODEL `model_dataset.bike_trips`, (
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
COUNT(trip_id) as label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_time >= "2018-01-01"
GROUP BY start_station_name, start_hour
))
With a negative r2_score of -0.087843 we are doing worse than simply predicting the mean, so what's up?
Note: for an explanation of r2_score see http://www.fairlynerdy.com/what-is-r-squared/
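To make the "worse than the mean" claim concrete: R² = 1 - SS_res/SS_tot, where SS_tot is the squared error of a constant model that always predicts the mean of the labels. That constant model scores exactly 0, so any negative R² means the model's predictions are further from the truth than the mean is. A small numeric illustration (all numbers made up):

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y_true = [2, 4, 6, 8]
mean_pred = [5, 5, 5, 5]         # always predict the mean -> R^2 == 0
bad_pred = [8, 6, 4, 2]          # anti-correlated -> negative R^2
good_pred = [2.5, 4.5, 5.5, 7]   # close to the truth -> R^2 near 1

print(r2_score(y_true, mean_pred))   # 0.0
print(r2_score(y_true, bad_pred))    # -3.0
print(r2_score(y_true, good_pred))   # 0.9125
```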
5 - Theories for poor performance¶
- There is no random train/test split used
- The year 2018 shows a definite increase in rides compared to earlier years
- Since we are looking at hourly data, bikes returned to the same station within an hour should not be counted, as they are available again within that same hour
- We should be looking at stations that existed for a while before 2018 and then continued to exist in 2018
- We should be looking at "real" stations and not temporary or administrative ones
Have a look at rides by year and notice the spike in 2018:
%%bigquery df_rides_by_year
select extract(year from start_time) as start_year, count(trip_id) as num_rides
from `bigquery-public-data.austin_bikeshare.bikeshare_trips`
group by start_year
order by start_year
import matplotlib.pyplot as plt
df_rides_by_year.set_index('start_year',inplace=True)
df_rides_by_year.plot(kind='bar')
plt.show()
6 - Exercise looking at predictions¶
A good way to figure out where your model is going wrong is to look closer at a small set of predictions. Use your model to predict the number of rides for the 22nd & Pearl station in 2018, and compare the mean values of predicted vs. actual rides.
%%bigquery
SELECT AVG(predicted_label) as avg_predicted_trips,
AVG(label) as avg_actual_trips
FROM ML.PREDICT(MODEL `model_dataset.bike_trips`, (
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
COUNT(trip_id) as label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE EXTRACT(YEAR from start_time) = 2018
AND start_station_name = '22nd & Pearl'
GROUP BY start_station_name, start_hour))
7 - Exercise: Average daily rides per station¶
Either something is wrong with the model or something surprising is happening in the 2018 data.
What could be happening in the data? Write a query to get the average number of rides per station for each year in the dataset, ordered by year so you can see the trend. You can use the EXTRACT function to get the year from the start_time timestamp.
%%bigquery df_avg_rides_by_year
WITH stations_by_year AS (
SELECT start_station_name,
EXTRACT(year FROM start_time) AS year,
COUNT(trip_id) as rides
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY start_station_name, year
ORDER BY start_station_name, year
)
SELECT year, AVG(rides) AS avg_rides
FROM stations_by_year
GROUP by year
ORDER by year
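The same two-level aggregation (rides per station per year, then the average over stations per year) can be checked in pandas. A sketch on a toy frame (station names and dates are invented):

```python
import pandas as pd

# Toy trips across two years and two stations
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "A", "B", "B"],
    "start_time": pd.to_datetime([
        "2017-03-01", "2017-07-01", "2017-05-01",
        "2018-02-01", "2018-06-01", "2018-09-01",
    ]),
    "trip_id": range(6),
})

# Equivalent of EXTRACT(year FROM start_time)
trips["year"] = trips["start_time"].dt.year

# Inner aggregation: rides per station per year
rides = trips.groupby(["start_station_name", "year"])["trip_id"].count()

# Outer aggregation: average over stations for each year
avg_rides = rides.groupby("year").mean()
print(avg_rides)
```

In 2017 station A has 2 rides and B has 1, so the average is 1.5; in 2018 it is 1.5 again (A: 1, B: 2).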
df_avg_rides_by_year.set_index('year',inplace=True)
df_avg_rides_by_year.plot(kind='bar')
plt.title('Average Rides per Station per Year')
plt.show()
8 - What do your results tell you?¶
Given the daily average riders per station over the years, does it make sense that the model is failing?
- Due to the spike in 2018 it does make sense the model is failing
- There are several stations that have been added during 2018 so no history data to predict on
- A suggestion would be to only take stations into account that have existed within a certain time frame like from 2015 to 2019
9 - Next Steps¶
Evimar Principal de Soto's kernel presents a solution in which a random selection splits the data into an 80/20 train/test split; this gave a much better r2_score.
To improve the score further, I have also restricted the selection to rides with a different start and end station, plus rides that returned the bicycle to the same station after more than an hour. The idea is that a bike returned to the same station within the hour is available again in that same hour, so the ride need not be counted.
%%bigquery
CREATE OR REPLACE MODEL `model_dataset.bike_trips80`
OPTIONS(model_type='linear_reg',
OPTIMIZE_STRATEGY='batch_gradient_descent') AS
WITH stations AS
(
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
MAX(trip_id) AS trip_id,
COUNT(trip_id) AS label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_id != end_station_id OR
( start_station_id = end_station_id AND duration_minutes > 59 )
GROUP BY start_station_name, start_hour
)
SELECT start_station_name, start_hour, label
FROM stations
WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10) < 8
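The MOD(FARM_FINGERPRINT(...), 10) trick works because FARM_FINGERPRINT is a deterministic hash: each trip_id always lands in the same bucket, so the 80/20 split is reproducible across the model-creation and evaluation queries. A Python analogue of the same bucketing logic, using hashlib's md5 as a stand-in for FARM_FINGERPRINT (the hash function differs, but the bucket arithmetic is the same):

```python
import hashlib

def bucket(trip_id, n_buckets=10):
    """Deterministic bucket in [0, n_buckets) derived from a trip id."""
    digest = hashlib.md5(str(trip_id).encode()).hexdigest()
    return int(digest, 16) % n_buckets

trip_ids = range(10_000)
train = [t for t in trip_ids if bucket(t) < 8]    # buckets 0-7, ~80%
test = [t for t in trip_ids if bucket(t) >= 8]    # buckets 8-9, ~20%

print(len(train) / len(trip_ids))  # roughly 0.8

# Determinism: the same id always lands in the same bucket
assert bucket(1234) == bucket(1234)
```

Because the split is a pure function of the id, rerunning either query never shuffles rows between train and test, unlike a RAND()-based split.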
Evaluate the model based on the new train/test split:
%%bigquery
SELECT
*
FROM ML.EVALUATE(MODEL `model_dataset.bike_trips80`, (
WITH stations AS
(
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
MAX(trip_id) AS trip_id,
COUNT(trip_id) AS label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_id != end_station_id OR
( start_station_id = end_station_id AND duration_minutes > 59 )
GROUP BY start_station_name, start_hour
)
SELECT start_station_name, start_hour, label
FROM stations
WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10) >= 8
))
- The mean squared error improved to 3.900408 from the previous model evaluation's 13.494581
- The r2_score is now positive: 0.202602, compared to the previous negative value of -0.087843
Try again with start stations that have existed from 2016 to 2018¶
Let's see if the r2_score improves when we only query stations that existed throughout the period from 2016 to 2018. Note: we still use the random train/test split here.
%%bigquery
CREATE OR REPLACE MODEL `model_dataset.bike_trips80_2`
OPTIONS(model_type='linear_reg',
OPTIMIZE_STRATEGY='batch_gradient_descent') AS
WITH conseq_stations AS
(
SELECT start_station_name,
EXTRACT(year FROM start_time) AS year,
COUNT(trip_id) AS rides
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE EXTRACT(year FROM start_time) BETWEEN 2016 and 2018
GROUP BY start_station_name, year
ORDER BY start_station_name, year
),
stations_2016_2018 AS
(
SELECT start_station_name
FROM conseq_stations
GROUP BY start_station_name
HAVING COUNT(*) = 3
ORDER BY start_station_name
),
stations AS
(
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
MAX(trip_id) AS trip_id,
COUNT(trip_id) AS label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE EXTRACT(year FROM start_time) BETWEEN 2016 AND 2018
GROUP BY start_station_name, start_hour
)
SELECT stations.start_station_name, stations.start_hour, stations.label
FROM stations INNER JOIN stations_2016_2018
ON stations.start_station_name = stations_2016_2018.start_station_name
WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10) < 8
%%bigquery
SELECT
*
FROM ML.EVALUATE(MODEL `model_dataset.bike_trips80_2`, (
WITH seq_stations AS
(
SELECT start_station_name,
EXTRACT(year FROM start_time) AS year,
COUNT(trip_id) AS rides
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE EXTRACT(year FROM start_time) BETWEEN 2016 and 2018
GROUP BY start_station_name, year
ORDER BY start_station_name, year
),
stations_2016_2018 AS
(
SELECT start_station_name
FROM seq_stations
GROUP BY start_station_name
HAVING COUNT(*) = 3
ORDER BY start_station_name
),
stations AS
(
SELECT start_station_name,
TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
MAX(trip_id) AS trip_id,
COUNT(trip_id) AS label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY start_station_name, start_hour
)
SELECT stations.start_station_name, stations.start_hour, stations.label
FROM stations INNER JOIN stations_2016_2018
ON stations.start_station_name = stations_2016_2018.start_station_name
WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10) >= 8
))
- The mean squared error improved to 2.833299 from the previous model evaluation's 3.900408
- The r2_score did not improve: 0.10704 versus the previous 0.202602
So restricting the data to long-lived stations has not improved the results.