Google BigQuery Data Safety Guide

Part of Google Cloud Platform (GCP), BigQuery (BQ) is one of Google’s petabyte-scale SQL databases, ideal for warehousing lots of data, with lightning fast queries and plenty of ways to access data. One of the most common BQ use cases we see in schools is the storage of large sets of Student Information System (SIS) data, like attendance, behavior, grades, and enrollments. This provides many possibilities for reporting, dashboards, metric analysis, and more, but how can you secure that data? Protecting student information should be paramount, so there are additional concerns when storing this information in a cloud-based service like BigQuery.

As with most best practices for security in IT, a model of “least privilege” should be applied to the entire process. We recommend reviewing three areas of action: how data gets imported to BQ, how users access data in BQ, and how data is rendered.

Importing data to BigQuery

As with any Extract, Transform, Load (ETL) or similar process, data has to come from somewhere. A common example is generating an automated export from your SIS platform in CSV format, delivering that file to a server via SFTP or mapped drive, then using Amplified Labs’ Local Hero to automatically import that CSV to a BQ table on a daily or hourly schedule. There are alternative solutions, including uploading CSV files to Google Storage and then loading them into BQ, or streaming inserts to BQ from a local script. Regardless of the import method, it needs to be secure.

If you’re exporting data from a database and preparing for import to BQ, consider who has access to the infrastructure powering that process. Some important questions to consider about importing to BQ:

 

  • If you set up an import on your laptop for ease of access, what happens if you leave your laptop at an airport, with all those student records sitting on the hard drive?

  • If you run the import process from a scheduled task on a server, you’re less likely to have the hardware walk away. But then who has access to that server and its resources?

  • Do you need to keep the source data files around after a successful import, or can they be deleted, reducing the amount of time you’re storing unencrypted student data?

  • If you’re using service account keys or OAuth tokens to programmatically access GCP services (BigQuery/Cloud Storage/etc.), who has access to those keys? What other data can those keys provide access to?

 

The first step to secure data in BigQuery is to secure the infrastructure and processes responsible for exporting source data and importing to BQ. But what about data as it sits in BQ – how is that secured?

 

Storing and accessing data directly in BigQuery

Once we get data in BQ, we need to control access to that data and how it is stored. This can be done a number of ways, from encrypting data at rest, to restricting who has access to the actual BQ service. Keep in mind that all Super Admins on your G Suite domain can access your GCP projects, and other users can be granted specific access as needed.

To quote the BQ documentation on encrypting data:

By default, BigQuery encrypts customer content stored at rest. BigQuery handles and manages this default encryption for you without any additional actions on your part.”

See more about that and taking on individual responsibility for encrypting BQ data here. Ultimately this is useful if you manage your own keys for other encryption needs.

Since BQ is a SQL database, you’re likely going to want to give users access to it. That can be done via third-party connectors like Data Studio (see the next section), programmatically, or through the web UI. BQ is a part of the Google Cloud Platform and is associated with an Organization and a Project in GCP. You should already be implementing best practices in security and control in your GCP project, but if not, now is a great time to start.

Following the model of least privilege, with IAM you can give a standard G Suite user access to specific areas of GCP projects. For BigQuery, you could give someone specific access to operations using predefined roles, such as read-only access. You can create custom roles to get even more granular. Use caution with giving blanket project level access, as that could give someone access to every entity in the project, like Compute Engine or others. Only give users the access they need to perform their specific duties.

You can also specify roles on service account keys, so it is recommended to restrict any script or application with programmatic access to BQ via the exact same method.

Once you’ve set up the correct roles for your BQ users, their accounts still need basic security. Ensure you’re enforcing strong password policies and use 2-factor authentication when you can.

Exporting & rendering data from BigQuery

It’s important to recognize the many ways to get data out of BQ. You can do it the same way you import: through the web UI, a script running on a local server, by exporting to Google Sheets, or connecting to Data Studio, etc. Controlling where your data goes is an immediate challenge as soon as you give someone access to the dataset. This means your BQ users are now also responsible for securing any data they access and export. This can get complicated, so let’s break it down with a few scenarios and you’ll see a trend pretty quickly.

Exporting BigQuery to Google Sheets

First, let’s say you have a multi-million-row table in BQ, and you want to give someone access to a subset of that data without giving them access to the entire BQ dataset. You can create a query or specific view in BQ, run the query, then export to Google Sheets. Share that sheet with the user and voilà, the user can sort, filter, and search through data as desired without the need for SQL skills or training, or access to your BQ project. You will want to inform the user how to secure that data; i.e. don’t share it outside of the domain, don’t set it to “anyone with the link can access”, etc. (Side note, you can now automatically update exports from BQ -> sheets!) Making data portable and accessible to Google Sheets lets the non-SQL muggles use the data in a meaningful way, but this portable data can be an increased security risk.

Connecting BigQuery to Data Studio

Another scenario to consider involves the use of Data Studio. We use Data Studio a lot here at Amplified IT: for tracking deals with our customers, forecasting growth, and highlighting anomalies. Data Studio is a fantastic tool for creating graphical renderings of your data. It’s easy to connect to BQ, and you can blend data from multiple data sets (akin to doing a SQL join on multiple tables) and drag and drop columns into charts.

One great feature of Data Studio is that you can give users access to a Data Studio dashboard that you have connected to a BQ dataset without giving them direct access to the BQ dataset itself! Users can then modify and share their dashboards as needed. This lets you transform and obfuscate data as needed via views and custom SQL queries in BQ, while still giving users the flexibility needed to manipulate the data based on their needs. Keep in mind that users by default have the option to export a table in a Data Studio dashboard to a CSV or Google Sheet, so ensure your users know how to keep that data secure.

Unless you’re using BQ as a backup or data dump location, users will need to access it. How will you manage that access, and how will you ensure they keep this data secure as well? End-user training is critical. Set standards around the enforcement of best practice policies, and make sure your users know how to secure their own accounts.

Summary and Next Steps

BigQuery is an excellent service that can consolidate data from multiple sources into one highly scalable, secure, and easy to access SQL environment. From importing to storing and exporting data, you control every point of access and privilege. While the data can be very well secure, ensuring users are informed and knowledgeable about how they access and store data is a critical step to ensuring privacy.

If you’re just getting started with Google Platform and BigQuery, we can help! Google offers credits to schools who are new to GCP, and our team can ensure that you get up and running and have a PO-based billing cycle for future usage. Set up a time to chat with our team about using GCP in your district.

Find this article useful? Share it!

  • Melanie Long
    Data, Implementation, and Engagement Consultant

  • About the Author:

    Melanie lives in Virginia and is based in Amplified IT’s home office located in Norfolk. One of the first members to join the Amplified IT team, Melanie has worn many hats at the company.  She most enjoys interfacing with customers and helping them implement tools that solve common pain points and frustrations. Today she leads the onboarding and interfacing with Labs tool clients, making lives easier and breezier one implementation at a time.