bigquery-expert

Expert knowledge of BigQuery performance, cost optimization, clustering, partitioning, BigQuery ML, Authorized Views, materialized views, and advanced SQL patterns. Trigger phrases: when working with BigQuery, BigQuery cost optimization, BigQuery partitioning clustering.

MoltbotDen
Data & Analytics

BigQuery Expert

BigQuery's billing model fundamentally changes how you write SQL. Every byte scanned costs money; every query that doesn't use partition pruning or column selection is a cost leak. The expert mindset is: filter partition columns first, select only needed columns, use clustering to reduce rows scanned within a partition, and understand that storage is cheap while query compute (slots) is the primary cost lever. BigQuery's columnar format means SELECT * is particularly expensive — it reads every column.

Core Mental Model

BigQuery is a serverless, column-oriented, distributed SQL engine. It stores data in Capacitor (columnar format), which means it reads only the columns you select. Partitioning divides the table into segments by a date/timestamp/integer column — queries filtering on the partition column skip non-matching segments entirely. Clustering sorts data within each partition by up to 4 columns — reducing rows scanned within a partition. Always: partition first, cluster second, and filter on both in every query.


Partitioning and Clustering

-- Partitioned + clustered table (the production-standard DDL)
CREATE TABLE `myproject.analytics.events`
(
  event_id    STRING       NOT NULL,
  user_id     STRING       NOT NULL,
  event_type  STRING       NOT NULL,
  payload     JSON,
  event_date  DATE         NOT NULL,  -- partition column
  created_at  TIMESTAMP    NOT NULL
)
PARTITION BY event_date
CLUSTER BY user_id, event_type       -- clustering columns (up to 4)
OPTIONS (
  partition_expiration_days = 365,
  require_partition_filter  = TRUE,  -- prevents accidental full table scans
  description = "User events, partitioned by day, clustered by user+type"
);

-- Timestamp partitioning (auto-creates DATE partition from TIMESTAMP)
CREATE TABLE `myproject.analytics.orders`
(
  order_id    STRING    NOT NULL,
  customer_id STRING    NOT NULL,
  amount      NUMERIC,
  status      STRING,
  created_at  TIMESTAMP NOT NULL
)
PARTITION BY DATE(created_at)
CLUSTER BY customer_id, status;

-- Integer range partitioning (useful for non-time data)
CREATE TABLE `myproject.analytics.users`
(
  user_id     INT64     NOT NULL,
  email       STRING,
  tier        STRING
)
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 10000000, 1000000));
-- Creates 10 partitions of 1M users each
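Filters on the bucketed column prune integer-range partitions the same way date filters prune date partitions; a sketch against the users table above:

-- ✅ Integer-range pruning: this filter touches only two of the ten partitions
SELECT user_id, email, tier
FROM `myproject.analytics.users`
WHERE user_id BETWEEN 1500000 AND 2499999;  -- scans buckets [1M,2M) and [2M,3M) only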

Partition Pruning

-- ✅ Always filter on partition column (DATE, TIMESTAMP, or DATE() of TIMESTAMP)
SELECT event_id, user_id, event_type
FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)  -- partition pruning ✅
  AND user_id = 'user_123';

-- ❌ Wrapping the partition column in a function (breaks partition pruning)
WHERE DATE_ADD(event_date, INTERVAL 1 DAY) > '2024-01-01'  -- no partition pruning!

-- ⚠️ Comparing a partition TIMESTAMP against date strings
WHERE created_at BETWEEN '2024-01-01' AND '2024-01-31'
-- The strings coerce to midnight TIMESTAMPs, so most of Jan 31 is silently excluded.
-- ✅ For a table partitioned by DATE(created_at), filter on DATE(created_at) instead:
WHERE DATE(created_at) BETWEEN '2024-01-01' AND '2024-01-31'  -- matches the partitioning expression, still prunes
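Partition-level metadata is queryable per dataset, which helps verify that pruning will actually pay off before tuning queries; a sketch using INFORMATION_SCHEMA.PARTITIONS:

-- Inspect partition sizes for a table
SELECT partition_id, total_rows, total_logical_bytes
FROM `myproject.analytics`.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'events'
ORDER BY partition_id DESC
LIMIT 30;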

Query Cost Analysis

-- INFORMATION_SCHEMA: analyze query costs per project/user/table
SELECT
  job_id,
  user_email,
  query,
  total_bytes_processed / POW(2, 30)          AS gb_scanned,
  total_bytes_processed / POW(2, 30) * 0.00625 AS estimated_cost_usd,  -- $6.25/TB on-demand
  total_slot_ms / 1000.0                       AS slot_seconds,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND state = 'DONE'
  AND error_result IS NULL
ORDER BY total_bytes_processed DESC
LIMIT 50;

-- Find most expensive queries per user
SELECT
  user_email,
  COUNT(*) AS query_count,
  SUM(total_bytes_processed) / POW(2, 40) AS total_tb_scanned,
  SUM(total_bytes_processed) / POW(2, 40) * 6.25 AS estimated_cost_usd  -- $6.25/TB on-demand
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email
ORDER BY total_tb_scanned DESC;

-- Tables with most scanned bytes (candidates for partitioning/clustering)
SELECT
  referenced_table.table_id,
  COUNT(*)                                          AS query_count,
  SUM(total_bytes_processed) / POW(2, 30)           AS total_gb_scanned,
  AVG(total_bytes_processed) / POW(2, 30)           AS avg_gb_per_query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT j,
  UNNEST(referenced_tables) AS referenced_table
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY referenced_table.table_id
ORDER BY total_gb_scanned DESC;

Standard SQL: ARRAY, STRUCT, UNNEST

-- ARRAY: list of values in one column
SELECT
  order_id,
  ARRAY_AGG(product_id ORDER BY created_at) AS product_ids,
  ARRAY_AGG(STRUCT(product_id, quantity, price) ORDER BY created_at) AS line_items
FROM order_items
GROUP BY order_id;

-- UNNEST: explode array into rows
SELECT
  o.order_id,
  item.product_id,
  item.quantity,
  item.price
FROM orders o,
UNNEST(o.line_items) AS item;  -- cross join lateral on array

-- STRUCT: nested record
SELECT
  user_id,
  STRUCT(
    first_name AS first,
    last_name  AS last,
    email
  ) AS contact_info
FROM users;

-- Access nested fields
SELECT
  user_id,
  contact_info.email,
  contact_info.first
FROM users_with_struct;

-- Combine ARRAY + STRUCT for complex analytics
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(amount) AS total_spent,
  ARRAY_AGG(
    STRUCT(order_id, amount, status, DATE(created_at) AS order_date)
    ORDER BY created_at DESC
    LIMIT 5
  ) AS recent_orders
FROM orders
GROUP BY customer_id;
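Aggregated arrays can be read back positionally with OFFSET/SAFE_OFFSET; a sketch assuming the aggregation above was saved as a hypothetical customer_order_summary table:

-- Read the most recent order straight out of the array (SAFE_OFFSET returns NULL
-- instead of erroring when the array is empty)
SELECT
  customer_id,
  recent_orders[SAFE_OFFSET(0)].order_id AS latest_order_id,
  recent_orders[SAFE_OFFSET(0)].amount   AS latest_amount
FROM customer_order_summary;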

Window Functions

-- Window functions: computations across related rows without collapsing
SELECT
  user_id,
  order_id,
  amount,
  created_at,
  
  -- Running total per customer
  SUM(amount) OVER (
    PARTITION BY user_id
    ORDER BY created_at
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_spend,
  
  -- Rank within category by amount
  RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS rank_in_category,
  DENSE_RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS dense_rank_in_category,
  
  -- Lag/lead for time-series comparisons
  LAG(amount, 1) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_order_amount,
  LEAD(amount, 1) OVER (PARTITION BY user_id ORDER BY created_at) AS next_order_amount,
  
  -- Row number for deduplication
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS recency_rank
  
FROM orders;

-- QUALIFY: filter on a window function result (BigQuery extension)
-- Find each customer's most recent order without a subquery
SELECT user_id, order_id, amount, created_at
FROM orders
WHERE TRUE  -- BigQuery requires a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) = 1;

-- vs standard SQL (requires subquery)
SELECT user_id, order_id, amount, created_at
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
  FROM orders
)
WHERE rn = 1;

BigQuery ML

-- Train a logistic regression model to predict churn
CREATE OR REPLACE MODEL `myproject.analytics.customer_churn_model`
OPTIONS (
  model_type             = 'logistic_reg',
  input_label_cols       = ['churned'],
  data_split_method      = 'auto_split',
  auto_class_weights     = TRUE,
  l2_reg                 = 0.1
) AS
SELECT
  total_orders,
  days_since_last_order,
  avg_order_value,
  order_frequency_30d,
  support_tickets,
  customer_tenure_days,
  churned
FROM `myproject.analytics.training_features`
WHERE training_date < '2024-01-01';

-- Evaluate model
SELECT *
FROM ML.EVALUATE(MODEL `myproject.analytics.customer_churn_model`,
  (SELECT * FROM `myproject.analytics.training_features`
   WHERE training_date >= '2024-01-01'));

-- Make predictions
SELECT
  customer_id,
  churned AS actual,
  predicted_churned,
  predicted_churned_probs
FROM ML.PREDICT(MODEL `myproject.analytics.customer_churn_model`,
  (SELECT * FROM `myproject.analytics.customer_features`));

-- Supported model types: linear_reg, logistic_reg, kmeans, matrix_factorization,
-- boosted_tree_classifier, boosted_tree_regressor, random_forest_*,
-- dnn_classifier, dnn_regressor, automl_*, tensorflow (BQML imports TF SavedModels)
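Training diagnostics are queryable as well; a sketch against the churn model above:

-- Inspect per-iteration training metrics (loss curve, timing)
SELECT iteration, loss, eval_loss, duration_ms
FROM ML.TRAINING_INFO(MODEL `myproject.analytics.customer_churn_model`);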

Authorized Views

-- Authorized views: column/row level security without exposing underlying tables
-- Step 1: Create the restricted view
CREATE VIEW `myproject.analytics_restricted.customer_pii_view`
AS
SELECT
  customer_id,
  -- Mask PII for non-admin users
  -- Mask PII for non-admin users (addresses are illustrative; substitute your own)
  CASE
    WHEN SESSION_USER() IN ('admin@example.com', 'privacy@example.com')
    THEN email
    ELSE CONCAT(LEFT(email, 3), '***@***')
  END AS email,
  CASE
    WHEN SESSION_USER() IN ('privacy@example.com')
    THEN full_name
    ELSE NULL
  END AS full_name,
  tier,
  created_at
FROM `myproject.analytics.customers`
WHERE
  -- Row-level security: analysts see only their region
  region = (
    SELECT region
    FROM `myproject.access_control.analyst_assignments`
    WHERE analyst_email = SESSION_USER()
  )
  OR SESSION_USER() IN ('admin@example.com');

-- Step 2: Authorize the view on the dataset that CONTAINS the base table
-- (authorization is a dataset-level setting, not a table or view flag).
-- In the BigQuery Console: open the analytics dataset, choose Sharing and
-- authorize analytics_restricted.customer_pii_view.
-- Via bq CLI, edit the dataset's access policy:
--   bq show --format=prettyjson myproject:analytics > dataset.json
--   # add to the "access" array:
--   #   {"view": {"projectId": "myproject", "datasetId": "analytics_restricted",
--   #             "tableId": "customer_pii_view"}}
--   bq update --source dataset.json myproject:analytics

Materialized Views

-- Materialized view: auto-refreshes when base table changes
CREATE MATERIALIZED VIEW `myproject.analytics.daily_revenue_mv`
PARTITION BY event_date
CLUSTER BY region
OPTIONS (
  enable_refresh = TRUE,
  refresh_interval_minutes = 60
)
AS
SELECT
  DATE(created_at)     AS event_date,
  region,
  SUM(amount)          AS total_revenue,
  COUNT(*)             AS order_count,
  AVG(amount)          AS avg_order_value,
  COUNT(DISTINCT user_id) AS unique_customers
FROM `myproject.analytics.orders`
WHERE status IN ('completed', 'shipped')
GROUP BY event_date, region;

-- Smart tuning: BigQuery transparently rewrites eligible queries against the
-- BASE table to read from a materialized view that covers them.
-- This query names only the orders table, but can be served by daily_revenue_mv:
SELECT region, SUM(amount) AS total_revenue
FROM `myproject.analytics.orders`
WHERE status IN ('completed', 'shipped')
  AND DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY region;
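A refresh can also be forced ahead of schedule, rather than waiting for refresh_interval_minutes, via the system procedure:

-- Force an immediate refresh of the materialized view
CALL BQ.REFRESH_MATERIALIZED_VIEW('myproject.analytics.daily_revenue_mv');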

Slot Allocation and Reservations

BigQuery pricing models:
  On-demand:    $6.25 per TB scanned. No slot management. Good for exploration.
  Capacity:     Flat monthly cost for dedicated slots. Good for predictable workloads.
  
Slot math:
  1 slot ≈ 1 virtual CPU
  100 GB query with complex joins ≈ 200-500 slot-seconds
  
Reservations:
  Baseline:     Always available to assignment
  Max:          Can burst to this if idle slots exist
  Autoscale:    Extra slots added automatically during bursts, billed per slot-second (pay-as-you-go)

INFORMATION_SCHEMA for slot analysis:
-- Slot utilization analysis
SELECT
  TIMESTAMP_TRUNC(start_time, HOUR) AS hour,
  SUM(total_slot_ms) / (1000.0 * 3600) AS slot_hours_used,
  COUNT(*) AS queries
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY hour
ORDER BY hour;
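To weigh on-demand pricing against a slot commitment, estimate monthly spend from total_bytes_billed (which, unlike total_bytes_processed, reflects billing minimums and rounding); a sketch:

-- Estimated monthly on-demand spend, for comparison against a capacity commitment
SELECT
  FORMAT_TIMESTAMP('%Y-%m', creation_time)      AS month,
  SUM(total_bytes_billed) / POW(2, 40)          AS tb_billed,
  SUM(total_bytes_billed) / POW(2, 40) * 6.25   AS on_demand_cost_usd,
  SUM(total_slot_ms) / (1000.0 * 3600)          AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
  AND job_type = 'QUERY'
GROUP BY month
ORDER BY month;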

Wildcard Tables and Exports

-- Wildcard tables: query date-sharded tables (legacy pattern)
SELECT event_id, user_id, event_type
FROM `myproject.analytics.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
  AND user_id = 'user_123';

-- Export to GCS
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/orders-*.parquet',
  format = 'PARQUET',
  overwrite = TRUE,
  compression = 'SNAPPY'
) AS
SELECT order_id, customer_id, amount, created_at
FROM `myproject.analytics.orders`
WHERE DATE(created_at) = CURRENT_DATE();
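The reverse direction exists as SQL too; a sketch loading the export back in (orders_restored is a hypothetical target table):

-- Load exported Parquet files back from GCS
LOAD DATA INTO `myproject.analytics.orders_restored`
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/exports/orders-*.parquet']
);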

Anti-Patterns

-- ❌ SELECT * on large tables (reads ALL columns — kills performance and costs)
SELECT * FROM `myproject.analytics.events`
WHERE event_date = CURRENT_DATE();
-- ✅ Select only needed columns
SELECT event_id, user_id, event_type
FROM `myproject.analytics.events`
WHERE event_date = CURRENT_DATE();

-- ❌ No partition filter (full table scan)
SELECT COUNT(*) FROM `myproject.analytics.events` WHERE user_id = 'abc';
-- ✅ Always filter partition column
SELECT COUNT(*) FROM `myproject.analytics.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)  -- partition pruning
  AND user_id = 'abc';

-- ❌ CROSS JOIN without UNNEST (accidental cartesian product)
SELECT * FROM table_a, table_b;  -- implicit cross join
-- ✅ Explicit join with a join condition (column names illustrative)
SELECT a.id, b.value
FROM table_a AS a
JOIN table_b AS b ON a.id = b.a_id;

-- ❌ Repeated identical subqueries (each evaluated separately)
SELECT
  (SELECT MAX(amount) FROM orders WHERE status = 'completed') AS max_order,
  (SELECT MIN(amount) FROM orders WHERE status = 'completed') AS min_order;
-- ✅ Single subquery or CTE
WITH order_stats AS (
  SELECT MAX(amount) AS max_order, MIN(amount) AS min_order
  FROM orders WHERE status = 'completed'
)
SELECT max_order, min_order FROM order_stats;

-- ❌ Wrapping the partition column in a function (breaks pruning)
WHERE event_date > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)  -- ✅ fine: column stays bare
WHERE event_date BETWEEN '2024-01-01' AND '2024-12-31'       -- ✅ fine: range on the bare column
WHERE EXTRACT(YEAR FROM event_date) = 2024                   -- ❌ function on the column, no pruning
WHERE TO_JSON_STRING(event_date) = '"2024-01-01"'            -- ❌ breaks pruning

Quick Reference

Table Design:
  Always partition:    by a DATE/TIMESTAMP (or integer-range) column
  Always cluster:      by the 1-4 columns queries filter on most, after the partition column
  require_partition_filter: TRUE for big tables (prevents accidental full scans)

Cost Reduction:
  1. Select only needed columns (columnar = column cost)
  2. Filter partition column first in WHERE
  3. Use clustering to reduce rows within partition
  4. Materialized views for repeated expensive aggregations
  5. Use INFORMATION_SCHEMA.JOBS to find expensive queries

Standard SQL Reminders:
  QUALIFY             → filter on window function (no subquery needed)
  UNNEST              → explode ARRAY into rows
  STRUCT              → nested record
  ARRAY_AGG           → aggregate rows into array

BigQuery ML:
  Train:   CREATE MODEL ... OPTIONS(model_type=...) AS SELECT ...
  Eval:    ML.EVALUATE(MODEL ..., (SELECT ...))
  Predict: ML.PREDICT(MODEL ..., (SELECT ...))

Security:
  Column masking:      CASE WHEN SESSION_USER() IN (...) THEN ... END
  Row filtering:       WHERE in view definition
  Authorized view:     grant view access to base table
