Stocking Rental Bikes


This exercise is based on Rachel Tatman's tutorial on BigQuery ML, GCP BigQuery's built-in machine learning option.

The exercise that goes along with it can be found here.

The problem to solve is the following:
"You stock bikes for a bike rental company in Austin, ensuring stations have enough bikes for all their riders. You want to build a model to predict how many riders will start from each station during each hour, capturing patterns in seasonality, time of day, day of the week, etc."

To get started, create a project in GCP and connect to it by running the code cell below. Make sure you have connected the kernel to your GCP account in Settings by enabling BigQuery.

In [1]:
# Set your own project id here
# as a string, the name of the BigQuery project you have created beforehand
PROJECT_ID = 'kaggle-bq-mwptje-exercise'

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID, location="US")
dataset = client.create_dataset('model_dataset', exists_ok=True)

from google.cloud.bigquery import magics
from kaggle.gcp import KaggleKernelCredentials
magics.context.credentials = KaggleKernelCredentials()
magics.context.project = PROJECT_ID

Load the Google Cloud BigQuery extension

To run BigQuery queries directly in the notebook, you first need to load this Jupyter extension.

In [2]:
%load_ext google.cloud.bigquery

Check out the dataset

This dataset contains bike-sharing data from Austin, TX.

In [3]:
# create a reference to our table
table = client.get_table("bigquery-public-data.austin_bikeshare.bikeshare_trips")

# look at five rows from our dataset
client.list_rows(table, max_results=5).to_dataframe()
Out[3]:
| | trip_id | subscriber_type | bikeid | start_time | start_station_id | start_station_name | end_station_id | end_station_name | duration_minutes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9900285908 | Annual Membership (Austin B-cycle) | 400 | 2014-10-26 14:12:00+00:00 | 2823 | Capital Metro HQ - East 5th at Broadway | 2544 | East 6th & Pedernales St. | 10 |
| 1 | 9900289692 | Walk Up | 248 | 2015-10-02 21:12:01+00:00 | 1006 | Zilker Park West | 1008 | Nueces @ 3rd | 39 |
| 2 | 9900285987 | 24-Hour Kiosk (Austin B-cycle) | 446 | 2014-10-26 15:12:00+00:00 | 2712 | Toomey Rd @ South Lamar | 2712 | Toomey Rd @ South Lamar | 31 |
| 3 | 9900285989 | 24-Hour Kiosk (Austin B-cycle) | 203 | 2014-10-26 15:12:00+00:00 | 2712 | Toomey Rd @ South Lamar | 2712 | Toomey Rd @ South Lamar | 31 |
| 4 | 9900285991 | 24-Hour Kiosk (Austin B-cycle) | 101 | 2014-10-26 15:12:00+00:00 | 2712 | Toomey Rd @ South Lamar | 2712 | Toomey Rd @ South Lamar | 30 |

2 - Evaluate the data using a dataframe

Create a dataframe from the query using the BigQuery magic command and have a quick look.

The %%bigquery magic command allows you to create a dataframe based on the selection, in this case named df_rides, in order to review the data.

Notes:

  • The query groups by starting station and by the date and hour of the timestamp column
  • The TIMESTAMP_TRUNC function truncates the timestamp column to the hour
  • We select as training data only records with a start date before the first of January 2018
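To build intuition for what `TIMESTAMP_TRUNC(start_time, HOUR)` does to each row, here is a minimal Python sketch (plain `datetime`; it assumes the BigQuery semantics amount to zeroing out everything below the hour):

```python
from datetime import datetime

# TIMESTAMP_TRUNC(start_time, HOUR) keeps the date and hour and
# zeroes out minutes, seconds and sub-seconds.
ts = datetime(2015, 10, 2, 21, 12, 1)
truncated = ts.replace(minute=0, second=0, microsecond=0)
print(truncated)  # 2015-10-02 21:00:00
```

Grouping by this truncated value is what turns individual trips into per-station, per-hour ride counts.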
In [4]:
%%bigquery df_rides
SELECT start_station_name,
       TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
       COUNT(trip_id) AS num_rides
  FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
 WHERE start_time < '2018-01-01'
   AND ( start_station_id != end_station_id OR
         ( start_station_id = end_station_id AND duration_minutes > 59 ) )
 GROUP BY start_station_name, start_hour

Initial look at the query results ...

In [5]:
# check the first few rows
df_rides.head()
Out[5]:
| | start_station_name | start_hour | num_rides |
|---|---|---|---|
| 0 | Zilker Park West | 2015-10-02 21:00:00+00:00 | 80 |
| 1 | Zilker Park West | 2015-10-02 20:00:00+00:00 | 20 |
| 2 | Zilker Park West | 2015-10-03 21:00:00+00:00 | 77 |
| 3 | Nueces @ 3rd | 2015-10-03 12:00:00+00:00 | 10 |
| 4 | Zilker Park West | 2015-10-03 16:00:00+00:00 | 12 |
In [6]:
# how many records are we dealing with here?
df_rides.shape
Out[6]:
(292074, 3)

See if there are any issues with the data. It looks like there are no null values and each column holds a single, consistent data type:

In [7]:
df_rides.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292074 entries, 0 to 292073
Data columns (total 3 columns):
start_station_name    292074 non-null object
start_hour            292074 non-null datetime64[ns, UTC]
num_rides             292074 non-null int64
dtypes: datetime64[ns, UTC](1), int64(1), object(1)
memory usage: 6.7+ MB

3 - Create and train the model

Create the model based on the query. We use the rows with a start date before 2018-01-01 as the training data. Later on we will use the rows on or after this date as the test (evaluation) data.

The "label" column, i.e. the value to be predicted, is the number of rides per hour, modeled with linear regression:

In [8]:
%%bigquery

CREATE OR REPLACE MODEL `model_dataset.bike_trips`
OPTIONS(model_type='linear_reg', OPTIMIZE_STRATEGY='batch_gradient_descent') AS
SELECT start_station_name, 
       TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
       COUNT(trip_id) as label
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_time < "2018-01-01"
GROUP BY start_station_name, start_hour
Out[8]:

Look at the iterations of the model

In [9]:
%%bigquery
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `model_dataset.bike_trips`)
ORDER BY iteration 
Out[9]:
| | training_run | iteration | loss | eval_loss | learning_rate | duration_ms |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 4.172457 | 3.812986 | 0.1 | 7083 |
| 1 | 0 | 1 | 3.944248 | 3.603601 | 0.2 | 12934 |
| 2 | 0 | 2 | 3.817862 | 3.473669 | 0.2 | 11296 |
| 3 | 0 | 3 | 3.744511 | 3.412568 | 0.2 | 11264 |
| 4 | 0 | 4 | 3.700585 | 3.365890 | 0.2 | 12100 |
| 5 | 0 | 5 | 3.673713 | 3.347049 | 0.2 | 13605 |

4 - Model Evaluation

Run the evaluation on the test data: rows with a start date on or after 2018-01-01.

We are especially interested in the r2_score as the evaluation criterion.

In [10]:
%%bigquery

SELECT *
FROM
ML.EVALUATE(MODEL `model_dataset.bike_trips`, (
SELECT start_station_name, 
       TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
       COUNT(trip_id) as label
  FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
 WHERE start_time >= "2018-01-01"
 GROUP BY start_station_name, start_hour
))
Out[10]:
| | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 1.751198 | 13.494581 | 0.334176 | 0.876819 | -0.087843 | -0.019927 |

With a negative r2_score of -0.087843 we are doing worse than simply predicting the mean, so what's going on?

Note: for an explanation of r2_score see http://www.fairlynerdy.com/what-is-r-squared/
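To see why a negative r2_score is possible at all, it helps to compute it by hand: R² = 1 − SS_res/SS_tot, so any model whose squared error exceeds that of the constant mean predictor goes below zero. A small sketch with illustrative values (not taken from this dataset):

```python
def r2(actual, predicted):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
print(r2(actual, [2.5] * 4))   # 0.0 -> exactly as good as predicting the mean
print(r2(actual, [10.0] * 4))  # negative -> worse than predicting the mean
```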

5 - Theories for poor performance

  • There is no random test train split used
  • The year 2018 has a definite increase of rides compared to the other years
  • As we are looking at hourly data, any bikes returned to the same station within an hour should not be taken into account,
    as they will be available again within the same hour
  • We should be looking at stations that have existed for a while before 2018 and then continued to exist in 2018
  • We should be looking at "real" stations and not temporary or administrative ones
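The "returned within the hour" theory translates into a simple row filter, which is the same condition used in the SQL later on. A minimal sketch on hypothetical trip records (the tuples stand in for the BigQuery table's columns):

```python
# Hypothetical trips: (start_station_id, end_station_id, duration_minutes)
trips = [
    (2823, 2544, 10),  # different stations: keep
    (2712, 2712, 31),  # same station, back within the hour: drop
    (2712, 2712, 75),  # same station, but gone for over an hour: keep
]

# Keep a trip unless the bike came back to its start station within the hour.
kept = [t for t in trips
        if t[0] != t[1] or (t[0] == t[1] and t[2] > 59)]
print(len(kept))  # 2 of the 3 trips survive the filter
```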

Have a look at rides by year and notice the spike in 2018:

In [11]:
%%bigquery df_rides_by_year
select extract(year from start_time) as start_year, count(trip_id) as num_rides
  from `bigquery-public-data.austin_bikeshare.bikeshare_trips`
 group by start_year
 order by start_year
In [12]:
import matplotlib.pyplot as plt
df_rides_by_year.set_index('start_year',inplace=True)
df_rides_by_year.plot(kind='bar')
plt.show()

6 - Exercise looking at predictions

A good way to figure out where your model is going wrong is to look closely at a small set of predictions. Use your model to predict the number of rides for the 22nd & Pearl station in 2018, then compare the mean values of predicted vs. actual riders.

In [13]:
%%bigquery
SELECT AVG(predicted_label) as avg_predicted_trips,
       AVG(label) as avg_actual_trips  
  FROM ML.PREDICT(MODEL `model_dataset.bike_trips`, (
    SELECT start_station_name,
           TIMESTAMP_TRUNC(start_time, HOUR) as start_hour,
           COUNT(trip_id) as label
     FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    WHERE EXTRACT(YEAR from start_time) = 2018 
      AND start_station_name = '22nd & Pearl'
    GROUP BY start_station_name, start_hour))
Out[13]:
| | avg_predicted_trips | avg_actual_trips |
|---|---|---|
| 0 | 1.563598 | 3.954864 |

7 - Exercise: Average daily rides per station

Either something is wrong with the model or something surprising is happening in the 2018 data.

What could be happening in the data? Write a query to get the average number of rides per station for each year in the dataset, ordered by year so you can see the trend. You can use the EXTRACT function to pull the year from the start_time timestamp.

In [14]:
%%bigquery df_avg_rides_by_year

WITH stations_by_year AS (
  SELECT start_station_name,
         EXTRACT(year FROM start_time) AS year,
         COUNT(trip_id) as rides
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
   GROUP BY start_station_name, year
   ORDER BY start_station_name, year
)
SELECT year, AVG(rides) AS avg_rides
  FROM stations_by_year
 GROUP by year
 ORDER by year
In [15]:
df_avg_rides_by_year.set_index('year',inplace=True)
df_avg_rides_by_year.plot(kind='bar')
plt.title('Average Rides per Station per Year')
plt.show()

8 - What do your results tell you?

Given the daily average riders per station over the years, does it make sense that the model is failing?

  • Due to the spike in 2018 it does make sense that the model is failing
  • Several stations were added during 2018, so there is no historical data to predict on for them
  • A suggestion would be to only take into account stations that have existed over a certain time frame, for example from 2015 to 2019

9 - Next Steps

Evimar Principal de Soto presented a solution in her kernel in which a random 80/20 train/test split was used; this gave a much better r2_score.

In an attempt to improve the score I have also restricted the selection to rides with a different start and end station, plus rides that returned the bicycle to the same station after more than an hour. The idea is that bikes returned to their start station within the hour would be available again within that same hour, so they need not be counted.
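The random split below is done by hashing each trip id into ten buckets with FARM_FINGERPRINT and sending buckets 0-7 (roughly 80%) to training. The same bucketing idea can be mimicked in Python with any deterministic hash; MD5 is used here as a stand-in, since FARM_FINGERPRINT itself has no stdlib equivalent:

```python
import hashlib

def in_train_split(trip_id: str) -> bool:
    # Hash the id into one of 10 buckets; buckets 0-7 (~80%) go to training,
    # mirroring MOD(FARM_FINGERPRINT(CAST(trip_id AS STRING)), 10) < 8.
    bucket = int(hashlib.md5(trip_id.encode()).hexdigest(), 16) % 10
    return bucket < 8

ids = [str(i) for i in range(10_000)]
train_frac = sum(in_train_split(i) for i in ids) / len(ids)
print(train_frac)  # roughly 0.80, and the split is reproducible across runs
```

Because the split depends only on the id, the same row always lands in the same set, unlike a `RAND()`-based split.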

In [16]:
%%bigquery

CREATE OR REPLACE MODEL `model_dataset.bike_trips80`
OPTIONS(model_type='linear_reg',
        OPTIMIZE_STRATEGY='batch_gradient_descent') AS
WITH stations AS
(
    SELECT start_station_name,
           TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
           MAX(trip_id) AS trip_id,
           COUNT(trip_id) AS label
      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
     WHERE start_station_id != end_station_id OR
        ( start_station_id = end_station_id AND duration_minutes > 59 ) 
     GROUP BY start_station_name, start_hour
)
SELECT start_station_name, start_hour, label
  FROM stations
 WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10)  < 8
Out[16]:

Evaluate the model based on the new train/test split:

In [17]:
%%bigquery
SELECT
  *
FROM ML.EVALUATE(MODEL `model_dataset.bike_trips80`, (
  WITH stations AS
(
    SELECT start_station_name,
           TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
           MAX(trip_id) AS trip_id,
           COUNT(trip_id) AS label
      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
     WHERE start_station_id != end_station_id OR
         ( start_station_id = end_station_id AND duration_minutes > 59 ) 
     GROUP BY start_station_name, start_hour
)
SELECT start_station_name, start_hour, label
  FROM stations
 WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10)  >= 8
))
Out[17]:
| | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 1.216696 | 3.900408 | 0.17694 | 0.889362 | 0.202602 | 0.202982 |
  • The mean squared error of 3.900408 has improved compared to the previous model evaluation's 13.494581
  • The r2_score is now positive: 0.202602, compared to the previous negative value of -0.087843

Try again with start stations that have existed from 2016 to 2018

Let's see if this will improve the r2_score by only querying stations that have existed in the period from 2016 to 2018. Note: still doing the random train/test split here.
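The "existed in every year from 2016 to 2018" condition is expressed in the query below as HAVING COUNT(*) = 3 over per-station, per-year rows. The same idea in plain Python, on hypothetical (station, year) activity records:

```python
from collections import defaultdict

# Hypothetical (station, year) activity records; station B is missing 2016.
activity = [("A", 2016), ("A", 2017), ("A", 2018),
            ("B", 2017), ("B", 2018)]

years_seen = defaultdict(set)
for station, year in activity:
    years_seen[station].add(year)

# Keep only stations active in every year from 2016 through 2018.
stable = sorted(s for s, ys in years_seen.items() if len(ys) == 3)
print(stable)  # ['A']
```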

In [18]:
%%bigquery 

CREATE OR REPLACE MODEL `model_dataset.bike_trips80_2`
OPTIONS(model_type='linear_reg',
        OPTIMIZE_STRATEGY='batch_gradient_descent') AS
WITH conseq_stations AS 
(
  SELECT start_station_name,
         EXTRACT(year FROM start_time) AS year,
         COUNT(trip_id) AS rides
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
   WHERE EXTRACT(year FROM start_time) BETWEEN 2016 and 2018
   GROUP BY start_station_name, year
   ORDER BY start_station_name, year
),
stations_2016_2018 AS 
(
  SELECT start_station_name
    FROM conseq_stations
   GROUP BY start_station_name
  HAVING COUNT(*) = 3
   ORDER BY start_station_name
),
stations AS
(
    SELECT start_station_name,
           TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
           MAX(trip_id) AS trip_id,
           COUNT(trip_id) AS label
      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
     WHERE EXTRACT(year FROM start_time) BETWEEN 2016 AND 2018
     GROUP BY start_station_name, start_hour
)
SELECT stations.start_station_name, stations.start_hour, stations.label
  FROM stations INNER JOIN stations_2016_2018
    ON stations.start_station_name = stations_2016_2018.start_station_name
 WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10)  < 8
Out[18]:
In [19]:
%%bigquery
SELECT
  *
FROM ML.EVALUATE(MODEL `model_dataset.bike_trips80_2`, (
WITH seq_stations AS 
(
  SELECT start_station_name,
         EXTRACT(year FROM start_time) AS year,
         COUNT(trip_id) AS rides
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
   WHERE EXTRACT(year FROM start_time) BETWEEN 2016 and 2018
   GROUP BY start_station_name, year
   ORDER BY start_station_name, year
),
stations_2016_2018 AS 
(
  SELECT start_station_name
    FROM seq_stations
   GROUP BY start_station_name
  HAVING COUNT(*) = 3
   ORDER BY start_station_name
),
stations AS
(
    SELECT start_station_name,
           TIMESTAMP_TRUNC(start_time, HOUR) AS start_hour,
           MAX(trip_id) AS trip_id,
           COUNT(trip_id) AS label
      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
     GROUP BY start_station_name, start_hour
)
SELECT stations.start_station_name, stations.start_hour, stations.label
  FROM stations INNER JOIN stations_2016_2018
    ON stations.start_station_name = stations_2016_2018.start_station_name
 WHERE MOD(FARM_FINGERPRINT(CAST(trip_id as STRING)), 10)  >= 8
))
Out[19]:
| | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 1.196344 | 3.491311 | 0.174068 | 0.914008 | 0.123441 | 0.123603 |
  • The mean squared error of 3.491311 has improved compared to the previous model evaluation's 3.900408
  • The r2_score has not improved: 0.123441 compared to the previous value of 0.202602

It looks like restricting the model to long-lived stations hasn't improved the results.