Hosting simple scripts for cheap on GCP

On giving an old MacBook Air some rest

In the past few weeks I wanted to get a sense of how many remote-friendly jobs are posted on Stack Overflow on a given day. To do this, I wrote a small Python utility that parses their XML feed and uploads the results to a spreadsheet on Google Drive. So far so good.
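
The parsing half of the utility is little more than fetching an RSS feed and pulling a few fields out of each item. A minimal sketch, assuming a generic feed URL and standard RSS field names (the ones the real script uses may differ):

import urllib.request
import xml.etree.ElementTree as ET

# Illustrative feed URL; the real script may point at a filtered feed.
FEED_URL = "https://stackoverflow.com/jobs/feed"

def fetch_jobs(url=FEED_URL):
    """Download the RSS feed and return (title, link, published) tuples."""
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    return [
        (
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("pubDate", default=""),
        )
        for item in tree.getroot().iter("item")
    ]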

I initially hosted this script as a crontab job on my old MacBook Air that sits collecting dust on a bookshelf. Everything went smoothly until I forgot to attach the power cord for a few days (consequently missing a few data collections). It wasn't the end of the world, and no human life depends on it, but it was probably the right time to learn how to host simple scripts on a cloud provider for cheap (or, better yet, for free).

Alternatives

I chose GCP for no specific reason beyond the fact that I already had an account and some euros' worth of free credits. I am aware that the same end result can be achieved with AWS Lambda or by spinning up a VM on AWS.
The requirements were quite basic:

  • Something cheap (the script will run 4 times a day)
  • Something more efficient than a VM (without k8s auto-scaling I’d end up paying for idle time); with Cloud Functions I’m billed only for the time my function actually runs
  • Easy and quick to deploy

I ended up choosing Cloud Functions, as at my usage levels I’d stay within the free tier limits (at the time of writing):

Free tier (per month):

  • Invocations: 2,000,000
  • Compute time (memory): 400,000 GB-seconds
  • Compute time (CPU): 200,000 GHz-seconds
  • Outbound networking: 5 GB
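
As a rough sanity check (pessimistically assuming every run takes the full 300 s timeout at 128 MB of memory): 4 runs a day is about 120 invocations a month, which works out to roughly 120 × 300 s × 0.125 GB ≈ 4,500 GB-seconds, a tiny fraction of the free allowance.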

Preparing and hosting the repo

In order to host my function in the cloud, my original script needed a couple of adjustments. First, since the function will be invoked via an HTTP request, I had to add request as an argument of my main function so that Cloud Functions can call it.

def main(request):
    """Entry point used by Cloud Functions; `request` is the incoming HTTP request."""
    # Do stuff
    return "OK"

if __name__ == "__main__":
    # When run locally there is no HTTP request to pass along.
    main(None)

Secondly, gspread (the Python module used to interact with Google Sheets) requires a JSON file with API auth keys for authentication. I put that file in a Cloud Storage bucket and have the function retrieve it when invoked.
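
Sketched out, the retrieval looks roughly like the following, assuming a recent version of gspread and using a placeholder bucket name; the key is written to /tmp, the only writable path inside a Cloud Function:

import os

import gspread
from google.cloud import storage

def open_worksheet(sheet_id, worksheet_name):
    """Fetch the service-account key from the bucket and open the target worksheet."""
    key_path = "/tmp/google-auth-key.json"
    bucket = storage.Client().bucket("your-auth-bucket")  # placeholder bucket name
    bucket.blob(os.environ["GOOGLE_AUTH_KEY"]).download_to_filename(key_path)
    client = gspread.service_account(filename=key_path)
    return client.open_by_key(sheet_id).worksheet(worksheet_name)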

At this point there are essentially three ways to host the script code:

  1. Deploy directly from a local folder.
  2. Compress the script folder in a .zip archive and pass it in as an argument during deployment.
  3. Mirror my GitHub repo to a GCP Source Repository. This will play nicely with Cloud Build if and when I decide to implement some basic CI that redeploys the function on new commits.

I decided to go with option 3, as the mirroring process is straightforward enough.

Deployment

Functions can be easily deployed via the gcloud command-line utility. Depending on the use case, a number of arguments can be set. In this specific case, the important ones are:

  • --set-env-vars: sets the environment variables the script needs to run properly (see the snippet after this list):
    • JOB_SHEET_ID: id of the Google Sheet that will store the parsed job listings.
    • SHEET_NAME: name of the worksheet within the spreadsheet mentioned above.
    • GOOGLE_AUTH_KEY: name of the file storing the Google API keys. When run locally, this points to a file in the script folder; in the Cloud Function it points to the right file in the storage bucket.
  • --no-allow-unauthenticated: this flag prevents the function from being invoked from the outside world. It is particularly important as it ensures I won’t incur any charges generated by fraudulent activity. This was the aspect I struggled to understand the most, as I kept finding different authentication approaches. After asking a question on Stack Overflow, I decided to stick with the approach described in this post, as it relies on GCP service accounts and doesn’t require implementing any auth flow inside the function.
  • --timeout: it is safe to set this a tad longer than the average run time, as execution is killed once the timeout is exceeded (the default is 60 s). It is worth pointing out that Cloud Functions have a maximum execution time of 9 minutes.
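
Inside the script, those variables are read with plain os.environ lookups; a minimal sketch:

import os

# These are injected via the --set-env-vars flags at deploy time
# (or exported in the shell when running the script locally).
JOB_SHEET_ID = os.environ["JOB_SHEET_ID"]
SHEET_NAME = os.environ["SHEET_NAME"]
GOOGLE_AUTH_KEY = os.environ["GOOGLE_AUTH_KEY"]

With the arguments sorted out, the full deploy command looks like this:
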
gcloud functions deploy so-parse \
    --memory 128MB \
    --runtime python37 \
    --entry-point main \
    --source https://source.developers.google.com/projects/your-project/repos/repo-name/path/to/function \
    --timeout 300 \
    --set-env-vars JOB_SHEET_ID="your-gsheet-id" \
    --set-env-vars SHEET_NAME="worksheet-name" \
    --set-env-vars GOOGLE_AUTH_KEY="json-file-with-google-auth-keys.json" \
    --trigger-http \
    --no-allow-unauthenticated

In order to run the function automatically at set intervals, it is possible to rely on Cloud Scheduler. The idea is to set up a cron job that fires an HTTP request to the function’s trigger URL on a given schedule.
Before setting up the cron job, we need to create a service account and make sure it has the right permissions to invoke Cloud Functions.

gcloud iam service-accounts create cloud-scheduler
gcloud functions add-iam-policy-binding \
    --member=serviceAccount:cloud-scheduler@your-project.iam.gserviceaccount.com \
    --role=roles/cloudfunctions.invoker so-parse
gcloud scheduler jobs create http stack-scrape \
    --schedule="*/3 * * * *" \
    --uri=https://your-func-trigger-url \
    --oidc-service-account-email=cloud-scheduler@your-project.iam.gserviceaccount.com 

To confirm that everything went according to plan, I quickly peeked at the logs after waiting a few minutes (to be sure the function had actually had time to run at least once).