sast-scanner

Expert SAST (Static Application Security Testing) guide: Semgrep custom rules (pattern syntax, metavariables, taint tracking, autofix), CodeQL dataflow analysis, Bandit for Python, ESLint security plugins, SonarQube integration, false positive management, taint

SAST Scanner Expert

Static Application Security Testing (SAST) analyzes source code without executing it to find security vulnerabilities. When configured correctly, it catches SQL injection, XSS, hardcoded credentials, and insecure API usage before code reaches production. When configured poorly, it generates hundreds of false positives that developers learn to ignore. This skill covers expert-level SAST configuration that actually improves security without destroying developer productivity.

Core Mental Model

SAST tools work at three levels of sophistication: pattern matching (grep-like; Semgrep's basic mode), data flow analysis (tracking data from source to sink; Semgrep's taint mode, CodeQL), and semantic analysis (understanding code intent; CodeQL, SonarQube). Pattern matching is fast but generates false positives; data flow analysis is slower but much more precise. The right approach: use pattern matching for high-confidence, low-FP rules (hardcoded secrets, banned functions) and data flow analysis for complex vulnerabilities (SQL injection, XSS) where user input travels through multiple functions.
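
To make the precision gap concrete, here is a minimal sketch comparing a raw text search with a syntax-aware match for the same banned call. The sample source and both helper functions are illustrative assumptions; real tools such as Semgrep match on a parsed syntax tree rather than on raw text, which is roughly what the second function imitates using Python's standard ast module.

```python
import ast
import re

# Illustrative sample: one real md5 call, one mention inside a comment.
SAMPLE = '''
import hashlib

# hashlib.md5( is mentioned in this comment only
digest = hashlib.sha256(b"data").hexdigest()
legacy = hashlib.md5(b"data").hexdigest()
'''


def grep_md5(source: str) -> list:
    """Line numbers where the text 'hashlib.md5(' appears anywhere."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if re.search(r"hashlib\.md5\(", line)
    ]


def ast_md5(source: str) -> list:
    """Line numbers of actual calls to hashlib.md5, found by walking the AST."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "md5"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "hashlib"
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

On this sample the text search reports the comment as well as the call, while the syntax-aware version reports only the call. Neither knows whether the hashed data is security-relevant; that question is what data flow analysis addresses.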

Taint Analysis Concepts

Taint analysis tracks "tainted" (untrusted) data from SOURCE → through transforms → to SINK.
A vulnerability exists when tainted data reaches a dangerous SINK without a SANITIZER.

SOURCES (untrusted input):
  - HTTP request parameters, headers, body
  - Database reads (if source data was originally user-supplied)
  - File reads from user-uploaded files
  - Environment variables in some contexts

SINKS (dangerous operations):
  - SQL execution (SQLi if tainted)
  - HTML rendering (XSS if tainted)
  - Shell command execution (command injection)
  - File path operations (path traversal)
  - HTTP requests (SSRF if URL is tainted)
  - Serialization/deserialization (RCE if tainted)

SANITIZERS (functions that neutralize taint):
  - Parameterized query (sanitizes SQL sink)
  - HTML encoding (sanitizes HTML sink)
  - URL allowlist check (sanitizes SSRF sink)
  - Input validation with strict allowlist

Example taint flow (VULNERABLE):
  user_id = request.args.get("user_id")  # SOURCE: HTTP param
  ↓
  username = format_username(user_id)  # PROPAGATION: taint flows through
  ↓
  db.execute(f"SELECT * FROM users WHERE id = {username}")  # SINK: SQL exec → VULNERABILITY

Example taint flow (SAFE):
  user_id = request.args.get("user_id")  # SOURCE
  ↓
  db.execute("SELECT * FROM users WHERE id = %s", (user_id,))  # SANITIZER + SINK: parameterized
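
The source, propagation, sanitizer, and sink roles above can be imitated at runtime with a toy model. This is only a sketch of the idea, not how a static analyzer works internally: the Tainted class and every function name here are hypothetical helpers invented for illustration.

```python
class Tainted(str):
    """A string marked as untrusted (hypothetical marker type)."""


def source(value: str) -> Tainted:
    """SOURCE: anything arriving from the outside world is tainted."""
    return Tainted(value)


def propagate(template: str, value: str) -> str:
    """PROPAGATION: the result stays tainted if the input was tainted."""
    result = template.format(value)
    return Tainted(result) if isinstance(value, Tainted) else result


def sanitize_int(value: str) -> str:
    """SANITIZER: strict allowlist (digits only) returns a plain str."""
    if not value.isdigit():
        raise ValueError("expected digits only")
    return str(int(value))


def sql_sink(query: str) -> str:
    """SINK: refuse any query that still carries taint."""
    if isinstance(query, Tainted):
        raise RuntimeError("tainted data reached SQL sink")
    return query
```

Passing source("1 OR 1=1") through propagate() and into sql_sink() raises, while routing the value through sanitize_int() first lets the query through. A SAST tool reaches the same verdicts without running the code, by reasoning over all paths between the source and the sink.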

Semgrep: Pattern Syntax

# Basic Semgrep patterns

# 1. Exact match
rules:
  - id: use-of-md5
    pattern: hashlib.md5(...)
    message: MD5 is cryptographically broken. Use SHA-256 or better.
    languages: [python]
    severity: WARNING

# 2. Metavariables (capture any expression)
  - id: pickle-loads
    patterns:
      - pattern: pickle.loads($DATA)
    message: "pickle.loads() with untrusted data leads to RCE. Use json.loads() instead."
    languages: [python]
    severity: ERROR

# 3. Multiple patterns (AND logic)
  - id: flask-debug-mode
    patterns:
      - pattern: app.run(...)
      - pattern: app.run(..., debug=True, ...)  # Ellipses allow other arguments around debug=True
    message: "Flask debug mode exposes interactive debugger — never enable in production."
    languages: [python]
    severity: ERROR

# 4. Pattern-not (exclude false positives)
  - id: sql-string-format
    patterns:
      - pattern: $DB.execute($QUERY % ...)
      - pattern-not: $DB.execute("SELECT 1" % ...)  # Exclude health checks
      # Semgrep patterns cannot match comments. To suppress a single line,
      # add an inline "# nosemgrep: sql-string-format" comment in the code.
    message: "Possible SQL injection via string formatting. Use parameterized queries."
    languages: [python]
    severity: ERROR

# 5. Metavariable regex and path excludes (narrow the match)
  - id: hardcoded-credential
    patterns:
      - pattern: $VAR = "..."
      # Filter on the variable name; metavariable-regex is anchored at the start
      - metavariable-regex:
          metavariable: $VAR
          regex: '(?i).*(password|secret|token|api_key).*'
    message: "Hardcoded credential found."
    languages: [python, javascript, typescript]
    severity: WARNING
    paths:
      exclude:
        - "**/test_fixtures/**"
        - "**/*.test.*"

Semgrep: Taint Mode (Data Flow)

# Semgrep taint mode tracks data from source to sink. The open-source engine
# follows flows within a single function; cross-function and cross-file
# tracking requires the Semgrep Pro engine.

rules:
  - id: sql-injection-taint
    mode: taint
    pattern-sources:
      # pattern-either = OR; a plain "patterns" list would require all to match at once
      - pattern-either:
          - pattern: request.args.get(...)
          - pattern: request.form.get(...)
          - pattern: request.json
          - pattern: flask.request.get_json()
    pattern-sinks:
      - patterns:
          - pattern-either:
              - pattern: $DB.execute($QUERY, ...)
              - pattern: $CURSOR.executemany($QUERY, ...)
          - focus-metavariable: $QUERY  # Only flag when QUERY is tainted (not the params)
    pattern-sanitizers:
      # Casting to int neutralizes the value. sqlalchemy.text() alone is not a
      # sanitizer: it does not escape input unless bound parameters are used.
      - pattern: int(...)
    message: |
      SQL injection: User-controlled input reaches SQL execution without parameterization.
      Use: db.execute("SELECT ... WHERE id = %s", (user_input,))
    languages: [python]
    severity: ERROR
    metadata:
      cwe: "CWE-89"
      owasp: "A03:2021"
      confidence: high

  - id: xss-taint
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form.get(...)
    pattern-sinks:
      - patterns:
          - pattern: flask.render_template_string($TEMPLATE, ...)
          - focus-metavariable: $TEMPLATE
      - patterns:
          - pattern: Markup($HTML)
          - focus-metavariable: $HTML
    pattern-sanitizers:
      - pattern: markupsafe.escape(...)
      - pattern: bleach.clean(...)
    message: "XSS: User input rendered as HTML without sanitization."
    languages: [python]
    severity: ERROR
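
The difference the sql-injection-taint rule is looking for can be reproduced with the standard library. The sketch below uses an in-memory SQLite database with a made-up users table; note that sqlite3 uses ? placeholders where the rule message shows the %s style used by some other drivers.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    return conn


def find_user_vulnerable(conn: sqlite3.Connection, user_id: str) -> list:
    # Tainted value is formatted into the query text: the taint rule's sink.
    return conn.execute(f"SELECT name FROM users WHERE id = {user_id}").fetchall()


def find_user_safe(conn: sqlite3.Connection, user_id: str) -> list:
    # Value is passed as a bound parameter and never parsed as SQL.
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()
```

With the input "1 OR 1=1", the vulnerable function returns every row because the payload becomes part of the WHERE clause, while the safe function treats the whole string as a single value and matches nothing.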

Semgrep Autofix

# Semgrep can automatically fix some patterns
rules:
  - id: assert-in-production
    pattern: assert $CONDITION, $MSG
    fix: |
      if not $CONDITION:
          raise AssertionError($MSG)
    message: "assert statements are disabled with Python -O flag. Use explicit checks."
    languages: [python]
    severity: WARNING

  - id: print-to-logger
    pattern: print($MSG)
    fix: logger.info($MSG)
    message: "Replace print() with logger for production code."
    languages: [python]
    severity: INFO
    paths:
      include:
        - "src/**"
      exclude:
        - "scripts/**"

CodeQL: Deep Semantic Analysis

// CodeQL query: Find SQL injection via dataflow analysis
// CodeQL tracks flows across functions, classes, and imports, which goes
// beyond what single-function taint tracking can see.
// To render source-to-sink paths, the query file also needs "@kind path-problem"
// in its metadata comment and the PathGraph import below.

import python
import semmle.python.security.dataflow.SqlInjectionQuery
import SqlInjectionFlow::PathGraph

// Using the built-in SQL injection library
from SqlInjectionFlow::PathNode source, SqlInjectionFlow::PathNode sink
where SqlInjectionFlow::flowPath(source, sink)
select sink.getNode(), source, sink,
  "SQL query constructed from user-controlled $@", source.getNode(), "value"

// Custom CodeQL query (a separate .ql file): find unvalidated redirects (open redirect)
import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.ApiGraphs

class FlaskRedirect extends DataFlow::CallCfgNode {
  FlaskRedirect() {
    this = API::moduleImport("flask").getMember("redirect").getACall()
  }

  // Not named getLocation(): DataFlow::Node already defines getLocation()
  // with a different return type, so reusing the name would not compile.
  DataFlow::Node getRedirectUrl() {
    result = this.getArg(0)
  }
}

class UserRequest extends DataFlow::Node {
  UserRequest() {
    this = API::moduleImport("flask").getMember("request")
      .getMember("args").getMember("get").getACall()
  }
}

// Track user input to flask redirect.
// localFlow only follows data within one function; use a global data flow
// configuration to follow it across function boundaries.
from UserRequest source, FlaskRedirect redirect
where DataFlow::localFlow(source, redirect.getRedirectUrl())
select redirect, "Potential open redirect: user-controlled URL passed to redirect()"
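
Once the open redirect query flags a call, the usual remediation is an allowlist check on the target before redirecting. The following is a minimal sketch of such a check using only urllib.parse; the ALLOWED_HOSTS set and the is_safe_redirect name are assumptions for illustration, and a production check may need to handle more edge cases.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts this application may redirect to.
ALLOWED_HOSTS = {"example.com"}


def is_safe_redirect(target: str) -> bool:
    """Accept same-site relative paths or absolute URLs on an allowlisted host."""
    if "\\" in target:
        # Some browsers treat a backslash like a forward slash.
        return False
    parsed = urlparse(target)
    if not parsed.scheme and not parsed.netloc:
        # Relative target: require a path starting with exactly one slash.
        return target.startswith("/") and not target.startswith("//")
    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_HOSTS
```

A handler would call is_safe_redirect(next_url) and fall back to a fixed page when it returns False. To stop the query from flagging validated calls, the check would also need to be modeled as a barrier in the CodeQL configuration.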

# GitHub Actions: CodeQL integration
name: CodeQL Analysis

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1'  # Weekly full scan on Mondays

jobs:
  analyze:
    name: CodeQL
    runs-on: ubuntu-latest
    permissions:
      actions: read
      contents: read
      security-events: write
    
    strategy:
      matrix:
        language: ['python', 'javascript', 'go']
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: ${{ matrix.language }}
          queries: security-extended  # More thorough than default
          config-file: .github/codeql-config.yml
      
      - name: Autobuild
        uses: github/codeql-action/autobuild@v3
      
      - name: Analyze
        uses: github/codeql-action/analyze@v3
        with:
          category: "/language:${{ matrix.language }}"
          output: sarif-results
          upload: true

Bandit: Python Security Linting

# Bandit configuration (.bandit, INI format)
cat > .bandit << 'EOF'
[bandit]
exclude = tests,docs,.venv
tests = B102,B103,B301,B302,B303,B304,B305,B306,B307,B321,B323,B324,B401,B403,B404,B501,B502,B503,B504,B505,B506,B601,B602,B603,B604,B605,B606,B607,B608,B609,B610,B611
# Skipped: B101 (assert_used), B311 (random; not always security-relevant)
skips = B101,B311
EOF

# Run Bandit, reporting medium severity and confidence or higher.
# Comments cannot follow a trailing backslash, so they live up here.
# SARIF output assumes the SARIF formatter is installed (for example via the bandit[sarif] extra).
bandit -r src/ \
  --ini .bandit \
  --severity-level medium \
  --confidence-level medium \
  --format sarif \
  --output bandit-results.sarif

# Bandit key checks:
# B301-B302: pickle/marshal (RCE risk)
# B501-B506: SSL/TLS misconfig
# B601-B611: Injection (shell, SQL, code execution)
# B303-B305: Crypto (MD5/SHA1, weak ciphers, weak cipher modes)
# B104:      Hardcoded bind all interfaces

# Common Bandit findings and fixes

# B602 — subprocess shell injection
import subprocess

# ❌ B602: shell=True with user input
subprocess.call(f"echo {user_input}", shell=True)  # Command injection!

# ✅ Correct: list args, shell=False (default)
subprocess.call(["echo", user_input], shell=False)

# B303 — MD5 for security purposes
import hashlib
# ❌ B303: MD5 is cryptographically broken
hashlib.md5(password.encode()).hexdigest()

# ✅ Correct: SHA-256 minimum (but use Argon2id for passwords)
hashlib.sha256(data.encode()).hexdigest()

# B501 — SSL verification disabled
import requests
# ❌ B501: Never disable SSL verification in production
requests.get(url, verify=False)

# ✅ Correct
requests.get(url)  # verify=True is default
# Or specify CA bundle: requests.get(url, verify='/path/to/ca-bundle.crt')
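
The B303 fix above notes that a plain SHA-256 digest is still the wrong tool for passwords. Argon2id needs a third-party package, so this sketch substitutes PBKDF2-HMAC-SHA256 from the standard library to show the general shape of salted, slow password hashing. The iteration count and function names are assumptions, not a recommendation from the original text.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Assumed work factor; tune it to your hardware and current guidance.
ITERATIONS = 600_000


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 with a random 16-byte salt."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

The per-password salt and the deliberately slow derivation are the properties that a single hashlib.sha256() call lacks.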

ESLint Security Plugin

// .eslintrc.js — security-focused ESLint configuration
module.exports = {
  plugins: ['security', 'no-secrets', 'xss', 'react'],  // 'react' provides the react/no-danger rule used below
  extends: [
    'plugin:security/recommended',
  ],
  rules: {
    // Detect potential ReDoS (regex denial of service)
    'security/detect-unsafe-regex': 'error',
    
    // Detect non-literal RegExp constructor (user-controlled regex)
    'security/detect-non-literal-regexp': 'warn',
    
    // Detect eval() and similar (code injection)
    'security/detect-eval-with-expression': 'error',
    'no-eval': 'error',
    'no-new-func': 'error',
    
    // Detect possible object prototype injection
    'security/detect-object-injection': 'warn',
    
    // Detect hardcoded secrets
    'no-secrets/no-secrets': ['error', {tolerance: 4.0}],
    
    // Flag any use of dangerouslySetInnerHTML (rule comes from eslint-plugin-react)
    'react/no-danger': 'warn',
    
    // Detect fs calls with a non-literal filename (path traversal risk)
    'security/detect-non-literal-fs-filename': 'warn',
  }
};

SonarQube Integration

# GitHub Actions: SonarQube scan with quality gate
  sonarqube:
    name: SonarQube Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for blame annotations
      
      - name: SonarQube Scan
        uses: sonarsource/sonarqube-scan-action@master
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
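        with:
          # sonar.qualitygate.wait=true makes the scan step fail if the quality gate fails.
          # Comments must stay outside the folded block, or they are passed as arguments.
          args: >
            -Dsonar.projectKey=my-project
            -Dsonar.python.coverage.reportPaths=coverage.xml
            -Dsonar.python.bandit.reportPaths=bandit-results.json
            -Dsonar.qualitygate.wait=true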
      
      - name: Check Quality Gate
        uses: sonarsource/sonarqube-quality-gate-action@master
        timeout-minutes: 5
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}

False Positive Management

# Inline suppression — use when false positive is confirmed
# MUST include justification comment

# nosec B603  — subprocess with list args is safe; no shell injection possible
result = subprocess.run(  # nosec B603
    ["git", "log", "--oneline"],  # All hardcoded, no user input
    capture_output=True,
)

# noqa: S608  — this is a test fixture, not production SQL
TEST_QUERY = "SELECT * FROM test_table"  # noqa: S608

# Semgrep suppression uses "nosemgrep", optionally followed by a rule ID.
# Justification: value is loaded from config, not hardcoded.
secret_key = config.SECRET_KEY  # nosemgrep: hardcoded-secret

# Semgrep baseline: report only findings introduced since a given Git commit
# Use:    semgrep --config=auto --baseline-commit "$(git merge-base origin/main HEAD)"
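
The requirement that every suppression carries a justification can be enforced mechanically. Below is a minimal sketch of an audit that flags nosec comments lacking either a Bandit test ID or a reason; the expected comment format and the two-word minimum are assumptions chosen for illustration.

```python
import re

NOSEC = re.compile(r"#\s*nosec\b(?P<rest>.*)$")
TEST_ID = re.compile(r"\bB\d{3}\b")


def unjustified_nosec(source: str) -> list:
    """Return line numbers of nosec comments missing a test ID or a reason."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = NOSEC.search(line)
        if match is None:
            continue
        rest = match.group("rest")
        # Remove test IDs and separators; what remains should be the reason.
        reason = TEST_ID.sub("", rest).strip(" ,:-")
        if not TEST_ID.search(rest) or len(reason.split()) < 2:
            flagged.append(number)
    return flagged
```

Running this in CI over changed files turns the justification policy into a check that fails fast, instead of a convention reviewers have to remember.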

# Severity thresholds in GitHub Actions
  - name: Check for blocking findings
    run: |
      CRITICAL=$(jq '[.results[] | select(.extra.severity == "ERROR")] | length' semgrep-results.json)
      HIGH=$(jq '[.results[] | select(.extra.severity == "WARNING")] | length' semgrep-results.json)
      
      echo "Critical findings: $CRITICAL"
      echo "High findings: $HIGH"
      
      if [ "$CRITICAL" -gt 0 ]; then
        echo "❌ BLOCKING: $CRITICAL critical security findings. Fix before merging."
        exit 1
      fi
      
      if [ "$HIGH" -gt 10 ]; then
        echo "⚠️ WARNING: $HIGH high-severity findings. Review before merging."
        # Don't block on HIGH unless exceeds threshold
      fi
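
The same gate can be written in Python when jq is unavailable or the logic needs to grow. This sketch reads the results[].extra.severity field used by the jq filters above; the function names and the max_warnings default are assumptions mirroring the shell version.

```python
import json
from typing import Dict, List, Tuple


def load_report(path: str) -> dict:
    """Load a Semgrep JSON report from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_by_severity(report: dict) -> Dict[str, int]:
    """Count findings per severity label."""
    counts: Dict[str, int] = {}
    for finding in report.get("results", []):
        severity = finding.get("extra", {}).get("severity", "UNKNOWN")
        counts[severity] = counts.get(severity, 0) + 1
    return counts


def gate(report: dict, max_warnings: int = 10) -> Tuple[int, List[str]]:
    """Return (exit_code, messages): block on ERROR, only warn on many WARNINGs."""
    counts = count_by_severity(report)
    errors = counts.get("ERROR", 0)
    warnings = counts.get("WARNING", 0)
    messages = [f"ERROR findings: {errors}", f"WARNING findings: {warnings}"]
    if errors > 0:
        messages.append("BLOCKING: fix ERROR findings before merging")
        return 1, messages
    if warnings > max_warnings:
        messages.append("WARNING: review findings before merging")
    return 0, messages
```

A CI step could call gate(load_report("semgrep-results.json")), print the messages, and pass the returned code to sys.exit().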

SARIF Upload to GitHub (PR Annotations)

# Most SAST tools can emit SARIF; upload it to GitHub for PR annotations
# Security tab in GitHub shows all findings across tools

      - name: Upload SARIF results
        uses: github/codeql-action/upload-sarif@v3
        if: always()  # Upload even on scan failure
        with:
          # sarif_file takes one file or one directory, not a list of files.
          # Assumes earlier steps wrote semgrep-results.sarif, bandit-results.sarif,
          # and trivy-results.sarif into sarif-results/.
          sarif_file: sarif-results/
          # Results appear in PR "Files changed" view as inline annotations
          # Also visible in repo Security → Code scanning alerts tab
          category: "sast-${{ github.job }}"
          wait-for-processing: true
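
Before uploading, it can help to print a per-tool summary of what each SARIF file contains. This sketch relies on the standard SARIF 2.1.0 layout (runs[].tool.driver.name and runs[].results); the function name is an assumption.

```python
from typing import Dict


def summarize_sarif(sarif: dict) -> Dict[str, int]:
    """Count results per tool across all runs in one SARIF document."""
    totals: Dict[str, int] = {}
    for run in sarif.get("runs", []):
        tool = run.get("tool", {}).get("driver", {}).get("name", "unknown")
        totals[tool] = totals.get(tool, 0) + len(run.get("results", []))
    return totals
```

Loading each .sarif file with json.load() and logging summarize_sarif() for it makes an empty or missing report visible in the job output rather than only in the Security tab.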

Anti-Patterns

❌ Running all rules without tuning
Default rulesets generate hundreds of false positives. Tune severity thresholds, exclude test directories, and create baselines before enforcing in CI.

❌ Blocking CI on medium-severity findings without triage
A rule that blocks all Medium findings will generate bypass pressure. Block on Critical/High with high confidence; warn on Medium; never block on Low/Informational.

❌ Ignoring without justification
A bare "# nosec" with no explanation creates technical debt and makes audits impossible. Always require the rule ID and a reason, in a format such as "# nosec B603 - reason: list args, no user input".

❌ Only running SAST, skipping SCA
Your code may be perfect; your dependencies are not. Run SCA (Snyk, Dependabot) alongside SAST — they catch different vulnerability classes.

❌ Not writing custom rules for business logic
Generic rules won't find that your app is supposed to always validate user ownership but doesn't. Write custom Semgrep rules for your domain-specific security invariants.
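
As an illustration of a domain-specific invariant, the sketch below uses Python's ast module to flag handlers that never call an ownership check. A Semgrep rule would normally express this in YAML; this is the same idea in plain Python. The check_ownership name and the update_/delete_ naming convention are hypothetical stand-ins for whatever your codebase uses.

```python
import ast

# Hypothetical application code: delete_post forgets the ownership check.
SAMPLE = '''
def update_profile(user, data):
    check_ownership(user, data)
    save(data)

def delete_post(user, post):
    remove(post)

def list_posts():
    return []
'''


def handlers_missing_check(source: str, required_call: str = "check_ownership") -> list:
    """Names of update_/delete_ functions that never call required_call."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        if not node.name.startswith(("update_", "delete_")):
            continue
        called = {
            call.func.id
            for call in ast.walk(node)
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
        }
        if required_call not in called:
            missing.append(node.name)
    return sorted(missing)
```

On the sample, only delete_post is reported. The check is deliberately coarse (it ignores method calls and helper indirection), which is the usual trade-off for a first custom rule.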

Quick Reference

Tool selection by use case:
  Fast pattern matching        → Semgrep (YAML rules, easy to write)
  Deep semantic analysis       → CodeQL (QL queries, more setup)
  Python security              → Bandit (fast, Python-only)
  JavaScript/TypeScript        → ESLint-security + Semgrep
  Multi-language comprehensive → Semgrep + CodeQL + SonarQube

Severity threshold for CI blocking:
  BLOCK:  Critical (ERROR in Semgrep) — high confidence, exploitable
  WARN:   High (WARNING) — review required, but don't block PR
  INFORM: Medium/Low — show in PR but never block

Taint analysis coverage:
  Sources to always track: HTTP params, headers, body, file uploads
  Sinks to always check: SQL, HTML render, shell exec, file paths, HTTP fetch
  Sanitizers to define: parameterized queries, HTML encode, URL validate

Semgrep rule writing checklist:
  ☐ Test with a known-vulnerable code sample (rule fires)
  ☐ Test with a safe equivalent (rule doesn't fire)
  ☐ Add pattern-not for common false positive patterns
  ☐ Include fix suggestion in message
  ☐ Add CWE and OWASP metadata
  ☐ Test performance (avoid patterns that time out on large files)
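
The first two checklist items map onto Semgrep's test annotations: a comment of the form "ruleid: <rule-id>" marks a line where the rule must fire, and "ok: <rule-id>" marks a line where it must stay silent. The sketch below is a test file for the use-of-md5 rule shown earlier; the function names are illustrative, and it assumes the rule and this file sit side by side (for example use-of-md5.yaml and use-of-md5.py) so that semgrep --test can pair them.

```python
import hashlib


def fingerprint_legacy(data: bytes) -> str:
    # ruleid: use-of-md5
    return hashlib.md5(data).hexdigest()


def fingerprint(data: bytes) -> str:
    # ok: use-of-md5
    return hashlib.sha256(data).hexdigest()
```

Keeping one vulnerable and one safe sample per rule makes regressions visible whenever a pattern is tightened or loosened.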

Skill Information

Source: MoltbotDen
Category: Security & Passwords