Introduction

As data engineers, we often sit in the uncomfortable middle ground between “Business needs data fast” and “Legal says we can’t share PII.” In the past, this meant creating endless authorized views or duplicating tables with sensitive columns removed — a maintenance and cost nightmare.

BigQuery’s Column-Level Security (CLS) and Dynamic Data Masking features allow us to solve this elegantly. We can keep a single physical table but serve different versions of the data (clear, masked, or nullified) based on the user’s IAM permissions.

In this guide, we’ll implement a robust security policy for a users table containing sensitive email and salary data, then integrate it with dbt and explore monitoring strategies.


The Architecture of BigQuery Security

Before writing code, we need to understand the hierarchy. BigQuery security relies on Dataplex (formerly Data Catalog) taxonomies:

  1. Taxonomy: A logical group of policy tags (e.g., “PII Data”). Think of it as a security domain.
  2. Policy Tags: Specific labels within a taxonomy (e.g., “High Sensitivity”, “Medium Sensitivity”).
  3. Data Policies: Rules attached to tags that define how data is hidden (e.g., “Mask with SHA256”, “Block access”).

The flow is: BigQuery → Policy Tags (in Governance section) → Data Policies Creation → Application to Columns


Scenario: The users Table

We have a simple table in our raw dataset with varying sensitivity levels:

  • user_id (Public)
  • full_name (Public)
  • email (Sensitive - Needs Masking for analysts but readable for joins)
  • salary (Highly Sensitive - Restricted Access, no masking option)

Let’s create this table with some dummy data:

CREATE OR REPLACE TABLE `my_project.raw.users` AS
SELECT 1 as user_id, 'Alice Bob' as full_name, 'alice@example.com' as email, 120000 as salary
UNION ALL
SELECT 2, 'Charlie Dave', 'charlie@example.com', 95000
UNION ALL
SELECT 3, 'Eve Foster', 'eve@example.com', 150000;

Part 1: The Manual Way (For Understanding)

I recommend doing this once manually (console clicks, sometimes called “ClickOps”) to understand the flow, then switching to Terraform for everything that follows.

You may probably have to activate Google Cloud Data Catalog API.

  1. Navigate to Policy Tags:

    • Option A (Recommended): Go to the BigQuery Console, click on your project in the left sidebar, then click the “Policy tags” tab
    • Option B: Go to Dataplex → Govern → Data Products in the GCP Console (we won’t cover this)
  2. Create Taxonomy: Click “Create Taxonomy”

    • Name: Data Governance
    • Region: us or eu (must match your BigQuery dataset region)
  3. Create Policy Tags: Within your taxonomy, add two tags:

    • PII - Email
    • Confidential - Salary
  4. Set Masking Rules:

    • Select each time the Policy Tag and click on “Manage Data Policies
    • Click on PII - Email“Create data policy”
    • Choose masking function: “Hash (SHA256)”
    • Click on Confidential - Salary“Create data policy”
    • Choose masking rule: Default Masking Value
  5. Give Masked Reader / Fine-Grained Reader roles:

  • You can give the Masked Reader role when you create the “Masking Rule” in the “Principal” field
  • You can give the Fine-Grained Reader role by selecting a policy tag and in the “Info Panel” adding a “Principal” the following roles/datacatalog.categoryFineGrainedReader role

You should have something like this in the Policy tags page:

List of Policy Tags created

  1. Apply to BigQuery Table:
    • Go to BigQuery Console
    • Navigate to your table: my_project.raw.users
    • Click “Edit schema” (pencil icon)
    • For the email column: Click on “Add policy tag” → Select PII - Email
    • For the salary column: Add policy tag → Select Confidential - Salary
    • Save changes

Now if you don’t have any permission you would see something like this (but you can query the table without the columns you don’t have access to):

With the Masked Reader role you would have access to the masked data:

With the right permission Fine-Grained Reader you would have access to the real data

Please note that even if you’re the Owner of the GCP project you won’t have access to the columns without the dedicated permissions.


Part 2: The “Engineer” Way (Terraform)

Clicking through consoles doesn’t scale. The companion repository for this article provisions the full stack via Terraform:

github.com/p-munhoz/tf-bq-column-masking

tf-bq-column-masking/
├── modules/
│   └── column_security/    # Reusable module: taxonomy + tags + policies + IAM
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
└── environments/
    └── dev/                # Wire everything together + BQ dataset + table
        ├── main.tf
        ├── variables.tf
        ├── outputs.tf
        └── terraform.tfvars.example

1. Define the Taxonomy and Tags

resource "google_data_catalog_taxonomy" "this" {
  provider     = google-beta
  project      = var.project_id
  region       = var.region
  display_name = "Data Governance"
  description  = "Taxonomy for PII and Confidential data — managed by Terraform"
 
  # Required to actually enforce column-level access control.
  # Without this, policy tags exist but do nothing.
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}
 
resource "google_data_catalog_policy_tag" "pii_email" {
  provider     = google-beta
  taxonomy     = google_data_catalog_taxonomy.this.id
  display_name = "PII - Email"
  description  = "Email addresses — SHA256 masked for non-privileged users"
}
 
resource "google_data_catalog_policy_tag" "confidential_salary" {
  provider     = google-beta
  taxonomy     = google_data_catalog_taxonomy.this.id
  display_name = "Confidential - Salary"
  description  = "Salary data — NULL for Masked Readers"
}

Key notes:

  • The region must match your BigQuery dataset location (eu, us, europe-west1, etc.). Taxonomies are regional and cannot be applied across regions.
  • google-beta provider is required for these resources.
  • activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"] is what actually arms the taxonomy. Without it, policy tags are decorative.

2. Define the Masking Policies

# Email → SHA256 hash (still useful for JOINs and GROUP BY)
resource "google_bigquery_datapolicy_data_policy" "email_masking" {
  provider         = google-beta
  project          = var.project_id
  location         = var.region
  data_policy_id   = "pii_email_sha256"
  policy_tag       = google_data_catalog_policy_tag.pii_email.name
  data_policy_type = "DATA_MASKING_POLICY"
 
  data_masking_policy {
    predefined_expression = "SHA256"
    # Other options: DEFAULT_MASKING_VALUE (NULL), EMAIL_MASK,
    # LAST_FOUR_CHARS, FIRST_FOUR_CHARS, DATE_YEAR_MASK
  }
 
  depends_on = [google_data_catalog_policy_tag.pii_email]
}
 
# Salary → NULL (no analytical value as a masked figure)
resource "google_bigquery_datapolicy_data_policy" "salary_masking" {
  provider         = google-beta
  project          = var.project_id
  location         = var.region
  data_policy_id   = "confidential_salary_null"
  policy_tag       = google_data_catalog_policy_tag.confidential_salary.name
  data_policy_type = "DATA_MASKING_POLICY"
 
  data_masking_policy {
    predefined_expression = "DEFAULT_MASKING_VALUE"
  }
 
  depends_on = [google_data_catalog_policy_tag.confidential_salary]
}

3. Grant IAM Roles

The module separates permissions per column, you can give a user masked access to email without giving them any access to salary:

# Masked Readers: see SHA256 hash for email
resource "google_bigquery_datapolicy_data_policy_iam_member" "email_masked_readers" {
  for_each = toset(var.email_masked_reader_members)
 
  provider       = google-beta
  project        = var.project_id
  location       = var.region
  data_policy_id = google_bigquery_datapolicy_data_policy.email_masking.data_policy_id
  role           = "roles/bigquerydatapolicy.maskedReader"
  member         = each.value
}
 
# Fine-Grained Readers: see cleartext email
resource "google_data_catalog_policy_tag_iam_member" "email_fine_grained_readers" {
  for_each = toset(var.email_fine_grained_reader_members)
 
  provider   = google-beta
  policy_tag = google_data_catalog_policy_tag.pii_email.id
  role       = "roles/datacatalog.categoryFineGrainedReader"
  member     = each.value
}

4. Apply Tags to the Table Schema

Policy tags are embedded directly in the Terraform schema definition:

resource "google_bigquery_table" "users" {
  dataset_id = google_bigquery_dataset.raw_tf.dataset_id
  table_id   = "users"
 
  schema = jsonencode([
    { name = "user_id",   type = "INTEGER", mode = "REQUIRED" },
    { name = "full_name", type = "STRING",  mode = "REQUIRED" },
    {
      name = "email"
      type = "STRING"
      mode = "REQUIRED"
      policyTags = {
        names = [module.column_security.pii_email_policy_tag_id]
      }
    },
    {
      name = "salary"
      type = "INTEGER"
      mode = "REQUIRED"
      policyTags = {
        names = [module.column_security.confidential_salary_policy_tag_id]
      }
    }
  ])
 
  depends_on = [module.column_security]
}

5. Deployment

cd environments/dev
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your project ID, region, and user lists
 
terraform init
terraform plan
terraform apply

Due to GCP API propagation delays, you may occasionally need two passes:

terraform apply -target=module.column_security
terraform apply

After applying, a useful output gives you the exact policy tag IDs ready to paste into any other table schema:

terraform output dbt_schema_yml_snippet

Part 3: Managing Access (The IAM Layer)

Three Permission Levels

RoleWhere It’s GrantedWhat They See
No roleAccess Denied on any query that touches a protected column
Masked Reader (roles/bigquerydatapolicy.maskedReader)On the data policyemail: SHA256 hash — salary: 0
Fine-Grained Reader (roles/datacatalog.categoryFineGrainedReader)On the policy tagCleartext values

Policy tags restrict access regardless of project ownership. Even a GCP project Owner will receive Access Denied on a protected column unless they are explicitly granted the Fine-Grained Reader role on that policy tag. This surprises most engineers the first time they encounter it, plan for it when onboarding new team members or granting break-glass access.

Data Analyst (Masked View)

Give them BigQuery Data Viewer on the dataset plus Masked Reader on the data policies:

resource "google_bigquery_dataset_iam_member" "analyst_viewer" {
  dataset_id = google_bigquery_dataset.raw_tf.dataset_id
  role       = "roles/bigquery.dataViewer"
  member     = "user:analyst@company.com"
}

What they see:

user_id | full_name    | email                                                            | salary
--------|--------------|------------------------------------------------------------------|-------
1       | Alice Bob    | 3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b | 0
2       | Charlie Dave | fc5e038d38a57032085441e7fe7010b0f6f5b84e3c0e74e1e7e1e4e7e2e5f3e5 | 0
3       | Eve Foster   | 7e8d9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8 | 0

They can still query, aggregate, and JOIN, they just can’t read the raw values.

HR Admin (Clear View)

Grant Fine-Grained Reader on both policy tags:

resource "google_data_catalog_policy_tag_iam_member" "hr_email_reader" {
  policy_tag = google_data_catalog_policy_tag.pii_email.id
  role       = "roles/datacatalog.categoryFineGrainedReader"
  member     = "user:hr-admin@company.com"
}
 
resource "google_data_catalog_policy_tag_iam_member" "hr_salary_reader" {
  policy_tag = google_data_catalog_policy_tag.confidential_salary.id
  role       = "roles/datacatalog.categoryFineGrainedReader"
  member     = "user:hr-admin@company.com"
}

What they see:

user_id | full_name    | email               | salary
--------|--------------|---------------------|--------
1       | Alice Bob    | alice@example.com   | 120000
2       | Charlie Dave | charlie@example.com | 95000
3       | Eve Foster   | eve@example.com     | 150000

Part 4: Common Gotchas and Edge Cases

1. JOINs on Masked Columns

SHA256 masking is deterministic, the same email always produces the same hash. This means JOINs still work:

-- As analyst@company.com (email is masked with SHA256)
SELECT u.user_id, u.email, o.order_id
FROM `my_project.raw.users` u
JOIN `my_project.raw.orders` o
  ON u.email = o.customer_email  -- Works if both columns use the same masking rule

✅ Works when both sides of the JOIN use the same SHA256 masking policy. ❌ Breaks when one column is masked and the other isn’t, or when they use different masking functions.

2. BI Tool Behaviour

Looker, Tableau, Metabase, and similar tools fully respect column-level security. A few things to know:

  • If using a service account, all users of that BI tool share its permissions
  • If using OAuth (user-level credentials), each user sees data according to their own IAM roles
  • Some tools cache schema metadata - refresh the connection after applying new policy tags

Recommendation: Use a dedicated service account per access tier (one for analysts, one for HR) and grant each the appropriate role.

3. Performance Impact

SHA256 masking is applied during the scan phase and adds negligible overhead (under 2% in our testing on tables up to 100M rows). Costs remain unchanged, you’re still scanning the same bytes.

One pattern to be aware of:

-- Can be slower: mask is applied before the filter
SELECT * FROM users WHERE email = 'alice@example.com'
 
-- Faster: filter on an unmasked column
SELECT * FROM users WHERE user_id = 1

4. Region Locking

Taxonomies are regional and must match your BigQuery dataset location exactly. You cannot apply a taxonomy from eu to a dataset in us-central1. Plan your regions before you start.

5. Wildcard Table Queries

Policy tags work with wildcard queries, but all tables in the wildcard must have the tags applied consistently. If users_2024 has the PII - Email tag but users_2023 doesn’t, the query will fail.

6. dbt and Policy Tag Persistence

If you use dbt, be aware that --full-refresh drops and recreates tables, which removes policy tags unless they are explicitly defined in your schema.yml:

models:
  - name: users
    columns:
      - name: email
        policy_tags:
          - "projects/my_project/locations/eu/taxonomies/123/policyTags/456"
      - name: salary
        policy_tags:
          - "projects/my_project/locations/eu/taxonomies/123/policyTags/789"

Use terraform output pii_email_policy_tag_id to get the exact path to paste here. More information in the official dbt documentation.


Part 5: Audit Logging - Who Accessed Your Masked Data?

Compliance isn’t just about restricting access, it’s about knowing who accessed what and when.

Enable Data Access Audit Logs

resource "google_project_iam_audit_config" "bigquery_audit" {
  project = var.project_id
  service = "bigquery.googleapis.com"
 
  audit_log_config { log_type = "DATA_READ" }
  audit_log_config { log_type = "DATA_WRITE" }
}

This is already included in the Terraform repository.

Query Audit Logs

SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail as user_email,
  protopayload_auditlog.resourceName as table_accessed,
  protopayload_auditlog.metadataJson as query_metadata
FROM `my_project.my_dataset.cloudaudit_googleapis_com_data_access_*`
WHERE
  resource.type = 'bigquery_resource'
  AND protopayload_auditlog.methodName = 'jobservice.query'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY timestamp DESC

Alert on Suspicious Access Patterns

Set up a scheduled query that flags anomalies, a user suddenly querying 10x more PII records than usual, salary data accessed outside business hours, or a service account touching tables it normally doesn’t.

WITH user_baseline AS (
  SELECT
    principalEmail,
    AVG(record_count) AS avg_records,
    STDDEV(record_count) AS stddev_records
  FROM historical_access_logs
  GROUP BY principalEmail
),
recent_access AS (
  SELECT principalEmail, COUNT(*) AS recent_record_count
  FROM todays_access_logs
  WHERE table_name LIKE '%users%'
  GROUP BY principalEmail
)
SELECT
  r.principalEmail,
  r.recent_record_count,
  b.avg_records,
  (r.recent_record_count - b.avg_records) / b.stddev_records AS z_score
FROM recent_access r
JOIN user_baseline b USING (principalEmail)
WHERE (r.recent_record_count - b.avg_records) / b.stddev_records > 3

Part 6: Cost & Production Checklist

  • No additional storage costs: you’re storing the same data, not duplicating tables
  • No additional query costs: masking does not change bytes scanned, but if you filter on masked columns it can prevent predicate pushdown optimizations.
  • Audit logging: typically under $10/month for most projects
  • ⚠️ Taxonomy limits: free, but capped at 1,000 policy tags per project

The real saving is in simplicity. The old pattern of maintaining users_full, users_anonymized, and users_restricted as separate tables doubles or triples your storage and creates a lineage nightmare. One table with column-level security replaces all of that.


Masking vs. Blocking: A Decision Matrix

Use CaseSHA256 MaskingNULL MaskingHard Block (no role)
Email for JOINs/grouping
Phone number for distinct counts
Salary figures
Social Security Numbers
Credit card numbers
IP addresses for fraud analysis

Rule of thumb: mask when the column still has analytical value in disguised form (JOINs, counts); return NULL when there’s no legitimate use case for non-privileged users; hard-block only when even a NULL would reveal something (e.g., the presence of a salary record itself is sensitive).


Alternative Masking Strategies

SHA256 and NULL are just two options. BigQuery supports several others:

predefined_expression = "EMAIL_MASK"       # f***@example.com
predefined_expression = "LAST_FOUR_CHARS"  # ****1234
predefined_expression = "FIRST_FOUR_CHARS" # 1234****
predefined_expression = "DATE_YEAR_MASK"   # 1987-01-01 → 1987-00-00

For fully custom logic (e.g., mask external emails but not internal ones):

CREATE OR REPLACE FUNCTION `my_project.my_dataset.mask_email`(email STRING) AS (
  IF(
    ENDS_WITH(email, '@company.com'),
    email,
    CONCAT(LEFT(email, 2), '***@', SPLIT(email, '@')[OFFSET(1)])
  )
);

Then reference it in Terraform:

data_masking_policy {
  routine = "projects/my_project/datasets/my_dataset/routines/mask_email"
}

Real-World Implementation Checklist

Before rolling this out to production:

  • Identify which columns contain PII or sensitive data
  • Define your taxonomy structure, start with 3–5 tags
  • Test masking rules in a dev environment first
  • Verify BI tools and downstream pipelines work with masked/NULL values
  • Enable audit logging before granting access
  • Create runbooks for granting/revoking Fine-Grained Reader access
  • If using dbt, add policy_tags to schema.yml for all sensitive models
  • Set up alerting for unusual access patterns
  • Schedule quarterly reviews of who holds Fine-Grained Reader permissions

Conclusion: Security as Code

By implementing Column-Level Security, we move away from “security by obscurity” or managing hundreds of views. We treat data privacy as a first-class citizen in our infrastructure.

Key takeaways:

  • Decouple data from policy. Your table schema defines what the data is; policy tags define who sees it.
  • Masking is not all-or-nothing. SHA256 for joinable fields, NULL for financial figures, hard blocks for the most sensitive data — use the right tool for each column.
  • Policy tags restrict access regardless of project ownership, unless the user is explicitly granted the appropriate fine-grained role.
  • Automate everything. Terraform prevents configuration drift. One manual change in the console can silently expose PII.
  • Security without visibility is theater. Audit logs and anomaly detection are as important as the restrictions themselves.

The beauty of this approach is that it scales with your team. Whether you have 10 analysts or 1,000, a single users table with proper column-level security is infinitely more maintainable than a sprawling mess of _masked, _anonymized, and _restricted variants.

Security isn’t a checkbox — it’s a practice. And with BigQuery’s column-level security, we finally have the tools to make that practice both robust and developer-friendly.


Want to dive deeper?