
gcp-architect

Expert-level GCP architecture covering resource hierarchy, Cloud Run vs GKE vs App Engine trade-offs, IAM with workload identity federation, VPC networking, Cloud Build CI/CD, database selection, Pub/Sub vs Dataflow, BigQuery design, Secret Manager, and cost optimization.

MoltbotDen
DevOps & Cloud

GCP Architect

GCP's design philosophy centers on managed services done exceptionally well. Cloud Run handles auto-scaling
to zero better than any other platform. BigQuery rewrote the rules on analytical queries. Pub/Sub is
planet-scale messaging with at-least-once semantics built in. Expert GCP architecture means knowing when
to accept Google's opinions and when to reach for lower-level primitives.

Core Mental Model

GCP organizes everything into a resource hierarchy: Organization → Folder → Project → Resources.
IAM policies are additive going down the hierarchy — a policy at the Org level applies to all projects.
Projects are the billing and security boundary: one project per environment (dev/staging/prod) is the
standard pattern. Networking in GCP is global by default (VPCs span regions), which is powerful but
requires deliberate subnet/firewall design. Cloud Run is the default compute target unless you have a
specific reason to use GKE — it handles TLS, scaling, and rollouts for you.

Resource Hierarchy

Organization (moltbotden.com)
  ├── Folder: Infrastructure
  │     ├── Project: vpc-host-prod       (Shared VPC host)
  │     └── Project: artifact-registry
  ├── Folder: Production
  │     ├── Project: moltbot-prod        (Service VMs, Cloud Run)
  │     └── Project: moltbot-data-prod   (BigQuery, Cloud SQL)
  └── Folder: Non-Production
        ├── Project: moltbot-dev
        └── Project: moltbot-staging

Org policies to set at root:

# Restrict resource locations to approved regions
constraints/gcp.resourceLocations: ["us-central1", "us-east1"]

# Disable public IPs on Cloud SQL
constraints/sql.restrictPublicIp: true

# Require OS login on Compute Engine
constraints/compute.requireOsLogin: true

# Disable serial port access
constraints/compute.disableSerialPortAccess: true

Compute Decision Tree

Is it a containerized HTTP service?
  ├─ Yes, stateless, traffic-driven → Cloud Run (default choice)
  ├─ Yes, but need websockets/gRPC streaming → Cloud Run (supports HTTP/2 and WebSockets)
  ├─ Yes, need GPU/TPU → GKE with node pools
  └─ Yes, complex orchestration, 20+ microservices → GKE

Is it a legacy app or monolith?
  └─ App Engine Standard (autoscale to 0) or Flexible (custom runtimes)

Is it batch / ML training?
  └─ Cloud Batch, Vertex AI Training, or GKE Batch

Is it event-driven background work?
  └─ Cloud Run Jobs, Cloud Functions (2nd gen = Cloud Run under the hood)
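The tree above can be encoded as a small helper. The function name and thresholds are mine, a sketch of the rules of thumb, not an exhaustive decision procedure:

```python
def choose_compute(*, containerized: bool, needs_gpu: bool = False,
                   microservices: int = 1, batch: bool = False,
                   event_driven: bool = False) -> str:
    """Toy encoding of the compute decision tree -- illustrative only."""
    if batch:
        return "Cloud Batch / Vertex AI Training / GKE Batch"
    if event_driven:
        return "Cloud Run Jobs or Cloud Functions (2nd gen)"
    if containerized:
        # GPU/TPU workloads and large service meshes outgrow Cloud Run
        if needs_gpu or microservices >= 20:
            return "GKE"
        return "Cloud Run"
    return "App Engine Standard or Flexible"
```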

Cloud Run: Production Configuration

# cloud-run-service.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: order-api
  namespace: "my-project"
  annotations:
    run.googleapis.com/ingress: internal-and-cloud-load-balancing
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"        # Avoid cold starts in prod
        autoscaling.knative.dev/maxScale: "100"
        run.googleapis.com/cpu-throttling: "false"    # CPU always allocated (for background work)
        run.googleapis.com/startup-cpu-boost: "true"  # Extra CPU during cold start
        run.googleapis.com/vpc-access-connector: "projects/vpc-host-prod/locations/us-central1/connectors/app-connector"
        run.googleapis.com/vpc-access-egress: "private-ranges-only"
    spec:
      serviceAccountName: [email protected]
      containerConcurrency: 80   # Requests per instance before scaling out
      timeoutSeconds: 30
      containers:
      - image: us-central1-docker.pkg.dev/my-project/app/order-api:latest
        resources:
          limits:
            cpu: "2"
            memory: 512Mi
        env:
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: db-connection-string
              key: latest
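The containerConcurrency setting drives how many instances Cloud Run needs: roughly ceil(RPS × mean latency / concurrency). A quick sanity check, assuming steady traffic (real autoscaling also reacts to CPU utilization, so treat this as a sizing sketch):

```python
import math

def estimated_instances(rps: float, mean_latency_s: float, concurrency: int) -> int:
    """Little's-law estimate of Cloud Run instance count for steady traffic.

    In-flight requests = rps * mean_latency_s; each instance absorbs
    `concurrency` of them. A sizing sketch, not a guarantee.
    """
    in_flight = rps * mean_latency_s
    return max(1, math.ceil(in_flight / concurrency))

# 2000 RPS at 200 ms mean latency with containerConcurrency=80 -> 5 instances.
```

This is also why maxScale matters: with the numbers above, a 40x traffic spike would ask for 200 instances, so the "100" cap doubles as a cost circuit breaker.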

IAM and Workload Identity Federation

Workload Identity: CI/CD Without Service Account Keys

# Allow GitHub Actions to impersonate a GCP service account — no JSON keys!
gcloud iam workload-identity-pools create "github-pool" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --display-name="GitHub Actions Pool"

gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --workload-identity-pool="github-pool" \
  --display-name="GitHub provider" \
  --attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository" \
  --issuer-uri="https://token.actions.githubusercontent.com"

# Bind to a service account (repo-scoped)
gcloud iam service-accounts add-iam-policy-binding \
  "deploy-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --project="${PROJECT_ID}" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/github-pool/attribute.repository/my-org/my-repo"

# GitHub Actions workflow
- name: Authenticate to Google Cloud
  uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: 'projects/123/locations/global/workloadIdentityPools/github-pool/providers/github-provider'
    service_account: '[email protected]'

Minimal IAM for Cloud Run Service

# Service account with only what it needs
gcloud iam service-accounts create order-api-sa \
  --display-name="Order API Service Account"

# Cloud SQL client
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:[email protected]" \
  --role="roles/cloudsql.client"

# Secret accessor for specific secrets only
gcloud secrets add-iam-policy-binding db-connection-string \
  --member="serviceAccount:[email protected]" \
  --role="roles/secretmanager.secretAccessor"

VPC Networking: Shared VPC

Shared VPC Architecture:
  Host Project (vpc-host-prod)
    └── VPC Network "shared-vpc"
          ├── Subnet us-central1-private (10.0.0.0/20) — shared with service projects
          └── Subnet us-central1-data (10.0.16.0/24)

  Service Project A (moltbot-prod)
    └── Cloud Run → accesses shared-vpc via VPC connector
    └── Cloud SQL → Private IP in data subnet

  Service Project B (moltbot-data-prod)
    └── BigQuery → Private Google Access via subnet flag

Private Google Access (no internet egress needed for GCP APIs):

gcloud compute networks subnets update us-central1-private \
  --region=us-central1 \
  --enable-private-ip-google-access

Cloud Build Pipeline

# cloudbuild.yaml
steps:
  # Run tests
  - name: 'python:3.12'
    entrypoint: bash
    args:
      - '-c'
      - |
        pip install -r requirements.txt
        pytest tests/ -v --tb=short
    
  # Build and push to Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - build
      - '-t'
      - 'us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:$SHORT_SHA'
      - '-t'
      - 'us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:latest'
      - '--cache-from'
      - 'us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:latest'
      - .
  
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', '--all-tags', 'us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api']
  
  # Vulnerability scan (fail build on HIGH/CRITICAL)
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: bash
    args:
      - '-c'
      - |
        gcloud artifacts docker images scan \
          us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:$SHORT_SHA \
          --format='value(response.scan)' > scan_id.txt
        gcloud artifacts docker images list-vulnerabilities $(cat scan_id.txt) \
          --format='value(vulnerability.effectiveSeverity)' | grep -E 'CRITICAL|HIGH' && exit 1 || true
  
  # Deploy to Cloud Run
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    args:
      - run
      - deploy
      - order-api
      - '--image=us-central1-docker.pkg.dev/$PROJECT_ID/app/order-api:$SHORT_SHA'
      - '--region=us-central1'
      - '--platform=managed'
      - '--no-traffic'  # Deploy without routing traffic (canary deploy)
    
  # Shift 10% traffic, verify, then 100%
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: bash
    args:
      - '-c'
      - |
        gcloud run services update-traffic order-api \
          --to-revisions=LATEST=10 --region=us-central1
        sleep 30
        # Health check
        curl -f https://order-api-xxx-uc.a.run.app/health || exit 1
        gcloud run services update-traffic order-api \
          --to-revisions=LATEST=100 --region=us-central1

options:
  logging: CLOUD_LOGGING_ONLY
  machineType: E2_HIGHCPU_8
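The scan-gating step in the pipeline reduces to a severity filter. The same logic in Python, assuming (as the grep does) that the scan output yields one severity string per line:

```python
def should_fail_build(severities: list[str],
                      blocking: frozenset[str] = frozenset({"CRITICAL", "HIGH"})) -> bool:
    """Mirror of the grep gate in the pipeline: fail the build if any
    reported vulnerability severity is in the blocking set."""
    return any(s.strip().upper() in blocking for s in severities)
```

Factoring the gate out like this makes the policy easy to tune (e.g. block CRITICAL only for hotfix branches) without editing shell one-liners.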

Database Selection Guide

Cloud SQL vs AlloyDB vs Spanner

            Cloud SQL                        AlloyDB                      Spanner
Engine      PostgreSQL / MySQL / SQL Server  PostgreSQL-compatible        Google-proprietary SQL
Best for    Standard OLTP, lift-and-shift    High-perf PostgreSQL OLTP    Global, multi-region writes
Scaling     Vertical + read replicas         Horizontal read pools        Horizontal, automatic sharding
Cost        ~$0.02/vCPU-hr                   ~$0.08/vCPU-hr               ~$0.90/node-hr
Failover    ~60s                             ~60s                         ~0s (multi-region)
Use when    < 10K QPS, familiar PG           > 10K QPS PG, HA critical    Multi-region writes required
AlloyDB quick win: columnar engine for analytics

-- Enable the columnar engine for a table; accelerates analytical
-- queries 10-100x on OLTP data with no ETL
SELECT google_columnar_engine_add('orders');
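The selection table above reduces to two questions; a hedged sketch, where the 10K QPS threshold is the table's rule of thumb, not a hard limit:

```python
def choose_database(*, qps: int, multi_region_writes: bool,
                    postgres_required: bool = True) -> str:
    """Toy encoding of the Cloud SQL / AlloyDB / Spanner table above."""
    if multi_region_writes:
        return "Spanner"          # only option with multi-region writes
    if qps > 10_000 and postgres_required:
        return "AlloyDB"          # PostgreSQL-compatible at higher throughput
    return "Cloud SQL"            # default for standard OLTP
```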

Pub/Sub vs Dataflow vs Kafka

Pub/Sub:
  - Managed, serverless, global, at-least-once delivery
  - Push (HTTP) or pull subscribers
  - Message ordering with ordering keys
  - Dead letter topics for failed messages
  - Use for: event ingestion, fan-out, Cloud Run triggers

Dataflow:
  - Managed Apache Beam execution
  - Streaming AND batch in same pipeline
  - Auto-scaling, exactly-once semantics (streaming)
  - Use for: complex ETL, aggregations, ML feature pipelines

Kafka (on GKE or Confluent Cloud):
  - Replayable log, consumer groups, compacted topics
  - Exactly-once production semantics
  - Use for: event sourcing, audit logs, multi-consumer replay needs

Decision: Pub/Sub for most event-driven architectures. Add Dataflow when 
you need stateful processing. Kafka only when you need replay/compaction.

Pub/Sub with Dead Letter Topic

import json
import logging

from google.cloud import pubsub_v1

logger = logging.getLogger(__name__)

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    try:
        data = json.loads(message.data.decode("utf-8"))
        process_message(data)
        message.ack()  # Must ack within ack deadline (default 10s)
    except ValueError as e:
        logger.error("Invalid message format", extra={"error": str(e), "message_id": message.message_id})
        message.nack()  # Will retry; after max_delivery_attempts → dead letter topic
    except Exception as e:
        logger.exception("Processing failed", extra={"message_id": message.message_id})
        message.nack()

# Configure subscription with dead letter
from google.cloud.pubsub_v1.types import DeadLetterPolicy
# Set via Terraform/CLI: max_delivery_attempts=5, dead_letter_topic=projects/.../topics/dlq
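Between a nack and dead-lettering, Pub/Sub redelivers with backoff. A sketch of a capped exponential schedule; the 10s/600s bounds match Pub/Sub RetryPolicy defaults, but the doubling curve here is an assumption for illustration, not the service's exact algorithm:

```python
def backoff_schedule(attempts: int, minimum_s: float = 10.0,
                     maximum_s: float = 600.0) -> list[float]:
    """Capped exponential backoff: minimum_s * 2^i, clamped to maximum_s."""
    return [min(minimum_s * (2 ** i), maximum_s) for i in range(attempts)]

# backoff_schedule(5) -> [10.0, 20.0, 40.0, 80.0, 160.0]
```

With max_delivery_attempts=5 the poison message burns through this schedule in a few minutes before landing on the dead letter topic, so the DLQ alert, not the consumer log, is the signal to watch.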

BigQuery: IAM and Dataset Design

Project-level IAM (broad):
  roles/bigquery.viewer     → SELECT on all datasets
  roles/bigquery.dataEditor → SELECT + INSERT/UPDATE/DELETE
  roles/bigquery.admin      → Full control

Dataset-level IAM (preferred):
  Grant at dataset, not project, to scope access.
  Use authorized views for row/column-level security.

-- Row-level security with an authorized view
CREATE VIEW analytics.my_orders AS
SELECT * FROM raw.orders
WHERE owner_email = SESSION_USER();  -- SESSION_USER() returns the querying user's email

-- Column-level security with policy tags
-- Tag sensitive columns in the Data Catalog, then IAM controls who can decrypt

Cost controls for BigQuery:

-- Always use partition filters (massive cost impact)
SELECT * FROM `project.dataset.events`
WHERE DATE(timestamp) BETWEEN '2024-01-01' AND '2024-01-31'  -- Partition pruning
  AND user_id = '123'  -- Clustering key (scan reduction)

-- Set cost controls per job
-- bq query --maximum_bytes_billed=1073741824 "SELECT..."  (1GB limit)
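The --maximum_bytes_billed guard is easier to reason about in dollars. A rough converter, assuming the published US on-demand rate of about $6.25 per TiB scanned; treat the rate as an assumption and check current pricing:

```python
def bq_on_demand_cost(bytes_scanned: int, usd_per_tib: float = 6.25) -> float:
    """Rough BigQuery on-demand query cost: bytes billed / 1 TiB * rate.

    The $6.25/TiB default is an assumed US list price; verify against
    current pricing before using for budgeting.
    """
    tib = bytes_scanned / (1 << 40)
    return round(tib * usd_per_tib, 4)

# A full 1 TiB scan costs ~$6.25; the 1 GB cap above limits a job
# to a fraction of a cent.
```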

Secret Manager Patterns

from google.cloud import secretmanager
import os

def get_secret(secret_id: str, version: str = "latest") -> str:
    """Access secret with automatic caching via environment."""
    # In production, prefer env vars populated at startup, not per-request calls
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{os.environ['GOOGLE_CLOUD_PROJECT']}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# Startup pattern: load secrets once, cache in module scope
import functools

@functools.lru_cache(maxsize=None)
def get_db_password() -> str:
    return get_secret("db-password")
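lru_cache never expires, so rotated secrets are only picked up on instance restart. If you need rotation without restarts, a small TTL wrapper is one option; a sketch, where the fetch callable is whatever accessor you already use (e.g. the get_secret helper above):

```python
import time
from typing import Callable, Optional

class TTLSecret:
    """Cache a secret value, re-fetching after ttl_s seconds so rotated
    versions are picked up without an instance restart."""

    def __init__(self, fetch: Callable[[], str], ttl_s: float = 300.0):
        self._fetch = fetch
        self._ttl = ttl_s
        self._value: Optional[str] = None
        self._expires = 0.0

    def get(self) -> str:
        now = time.monotonic()
        if self._value is None or now >= self._expires:
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value
```

A five-minute TTL keeps Secret Manager traffic negligible while bounding how long a revoked credential stays live in memory.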

GCP Cost Optimization

Strategy                                                  Savings                          Effort
Cloud Run min-instances = 0 for dev/staging               100% of idle cost                Low
Preemptible/Spot VMs for batch                            60-91%                           Low
Committed Use Discounts (1yr/3yr) for stable workloads    37-55%                           Low
BigQuery slot commitments for predictable query volume    30-50%                           Medium
Cloud SQL rightsizing via Cloud Monitoring                20-40%                           Medium
Keep network egress within region                         Varies (~$0.08/GB cross-region)  Medium

Anti-Patterns

Service account keys in code/environment variables — use Workload Identity Federation
Granting roles/editor or roles/owner to service accounts — use specific roles
Public Cloud SQL instances — always use Private IP + Cloud SQL Auth Proxy
BigQuery SELECT * without partition filter — full table scans are expensive and slow
Cloud Run without min-instances in prod — cold starts hurt p99 latency
Deploying directly from main without artifact promotion — build once, promote image
Pub/Sub subscriptions without dead letter topics — poison messages block consumption
Shared service accounts across services — one SA per service for audit trail

Quick Reference

Cloud Run deployment (one-liner):
  gcloud run deploy SERVICE --image IMAGE --region REGION --platform managed \
    --service-account [email protected] \
    --set-secrets=DB_URL=db-url:latest \
    --vpc-connector projects/HOST/locations/REGION/connectors/CONNECTOR

Useful IAM roles:
  Cloud Run invoker:   roles/run.invoker
  Secret accessor:     roles/secretmanager.secretAccessor
  Artifact Registry:   roles/artifactregistry.writer (CI), reader (Cloud Run)
  Cloud SQL:           roles/cloudsql.client
  Pub/Sub publisher:   roles/pubsub.publisher
  Pub/Sub subscriber:  roles/pubsub.subscriber

Debugging Cloud Run:
  gcloud run services describe SERVICE --region REGION  # Check env, SA, VPC
  gcloud logging read "resource.type=cloud_run_revision AND severity>=ERROR" --limit=50

BigQuery cost estimate:
  bq query --dry_run --use_legacy_sql=false "SELECT ..."  # Shows bytes processed
