incident-response
Complete incident response lifecycle: detection, triage, containment, eradication, recovery, and lessons learned. IR runbooks, forensic preservation, cloud-specific IR (CloudTrail, GuardDuty), communication templates, IOC hunting with SIEM queries, a
Installation
npx clawhub@latest install incident-response
Documentation
Incident Response
A security incident handled well is a company stress test you survive. Handled poorly, it becomes a data breach disclosure, a regulatory fine, or a company-ending event. The difference between the two is almost always preparation — documented runbooks, practiced procedures, and clear communication chains — not technical sophistication.
Core Mental Model
The six-phase IR lifecycle popularized by SANS (PICERL) runs Preparation → Identification → Containment → Eradication → Recovery → Lessons Learned. NIST SP 800-61 Rev. 2 covers the same ground in four phases (Preparation; Detection and Analysis; Containment, Eradication, and Recovery; Post-Incident Activity). In a real incident, these phases overlap and loop back. Containment may reveal new scope that requires returning to identification. Eradication may trigger another containment step. Think of it as a cycle, not a waterfall. The most important phase is Preparation: everything you do before an incident happens.
IR Lifecycle
Phase 1: PREPARATION (before incident)
✓ Document asset inventory and crown jewels
✓ Deploy detection: SIEM, EDR, CloudTrail logs
✓ Write and test runbooks
✓ Establish contact tree (legal, PR, exec, IR team)
✓ Practice with tabletop exercises quarterly
Phase 2: IDENTIFICATION
✓ Alert fires from SIEM / EDR / user report
✓ Triage: Is this a real incident? Severity? Scope?
✓ Declare incident and open incident channel
✓ Assign Incident Commander (IC) and Comms Lead
Phase 3: CONTAINMENT
✓ Preserve evidence BEFORE isolating or wiping
✓ Short-term: Stop the bleeding (network isolation, account lock)
✓ Long-term: Apply patches, rotate credentials, segment
Phase 4: ERADICATION
✓ Remove malware / malicious access
✓ Patch the vulnerability
✓ Harden the environment
Phase 5: RECOVERY
✓ Restore from clean backups
✓ Monitor closely for 72 hours
✓ Gradual service restoration
Phase 6: LESSONS LEARNED
✓ Post-incident review within 5 business days
✓ Root cause analysis
✓ Action items with owners and due dates
Triage Checklist
# Incident Triage — First 15 Minutes
**Incident ID:** INC-YYYY-NNN
**Declared:** [timestamp + timezone]
**Incident Commander:** [name]
**Comms Lead:** [name]
## Scope Assessment
- [ ] What systems are potentially affected?
Systems: _______________
- [ ] What data may have been accessed?
Data types: _______________
- [ ] What is the earliest possible compromise date?
Est. start: _______________
- [ ] Is the attacker still active?
Active: YES / NO / UNKNOWN
## Detection Source
- [ ] SIEM alert: [alert name]
- [ ] EDR detection: [detection]
- [ ] User report
- [ ] Third-party notification
- [ ] Automated scan finding
## Severity Classification
- P1 CRITICAL: Active breach, data exfiltration in progress, production down
- P2 HIGH: Confirmed breach, contained; sensitive data at risk
- P3 MEDIUM: Indicators of compromise, investigation ongoing
- P4 LOW: Security event, likely not a breach
**Current Severity:** ___
## Immediate Actions Required
- [ ] Open #incident-INC-YYYY-NNN Slack channel
- [ ] Notify IC chain per severity level
- [ ] Start forensic evidence collection NOW (before any remediation)
- [ ] Begin incident timeline log
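The severity levels in the checklist can be turned into a small decision helper so the first responder does not have to interpret them under pressure. The sketch below is one possible reading, not an official mapping: the three yes/no inputs and the function names `classify_severity`, `notify_targets`, and `incident_channel` are assumptions made for illustration. The notification targets are copied from the Quick Reference severity table in this document.

```shell
#!/usr/bin/env bash
# Hypothetical triage helpers. The yes/no inputs are an assumed
# simplification of the triage checklist, not part of the original skill.

# classify_severity ATTACKER_ACTIVE BREACH_CONFIRMED IOC_FOUND -> P1..P4
classify_severity() {
  local attacker_active="$1" breach_confirmed="$2" ioc_found="$3"
  if [ "$attacker_active" = "yes" ]; then
    echo "P1"   # active breach or exfiltration in progress
  elif [ "$breach_confirmed" = "yes" ]; then
    echo "P2"   # confirmed breach, contained
  elif [ "$ioc_found" = "yes" ]; then
    echo "P3"   # indicators of compromise, investigation ongoing
  else
    echo "P4"   # security event, likely not a breach
  fi
}

# notify_targets SEVERITY -> who to notify, per the severity table
notify_targets() {
  case "$1" in
    P1) echo "IC + exec NOW" ;;
    P2) echo "IC + legal within 1h" ;;
    P3) echo "IC + IR team" ;;
    P4) echo "IR team" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# incident_channel INCIDENT_ID -> channel name
# Slack channel names must be lowercase, so the ID is normalized.
incident_channel() {
  printf 'incident-%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}
```

For example, `notify_targets "$(classify_severity no yes no)"` resolves a confirmed but contained breach to the P2 notification chain. An "unknown" attacker status does not escalate on its own in this sketch; teams that prefer to assume the worst could treat it as "yes".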
Containment Runbook
Order matters: preserve evidence first, then isolate, then investigate.
# AWS Containment Runbook — Compromised EC2 Instance
# STEP 1: Snapshot everything BEFORE touching the instance
INSTANCE_ID="i-0abc123"
REGION="us-east-2"
# Create a forensic snapshot of every attached EBS volume, not only the first mapping
VOLUME_IDS=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
--query 'Reservations[0].Instances[0].BlockDeviceMappings[].Ebs.VolumeId' \
--output text --region "$REGION")
for VOLUME_ID in $VOLUME_IDS; do
  SNAPSHOT_ID=$(aws ec2 create-snapshot \
  --volume-id "$VOLUME_ID" \
  --description "FORENSIC: Incident INC-2024-042 - $(date -u +%Y%m%dT%H%M%SZ)" \
  --tag-specifications "ResourceType=snapshot,Tags=[{Key=incident,Value=INC-2024-042},{Key=forensic,Value=true}]" \
  --query 'SnapshotId' --output text --region "$REGION")
  echo "Forensic snapshot created: $SNAPSHOT_ID (volume $VOLUME_ID)"
done
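# STEP 2: Capture instance memory (via SSM before isolation)
# send-command is asynchronous, so wait for the dump and upload to finish before isolating.
COMMAND_ID=$(aws ssm send-command \
--instance-ids "$INSTANCE_ID" \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["sudo avml /tmp/memory.lime && aws s3 cp /tmp/memory.lime s3://forensic-evidence-bucket/INC-2024-042/memory.lime"]' \
--query 'Command.CommandId' --output text --region "$REGION")
# The built-in waiter times out fairly quickly; re-run it for large memory images.
aws ssm wait command-executed \
--command-id "$COMMAND_ID" --instance-id "$INSTANCE_ID" --region "$REGION"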
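# STEP 3: Isolate by applying a restrictive security group (deny all traffic)
# The group must be created in the instance's VPC.
VPC_ID=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
--query 'Reservations[0].Instances[0].VpcId' --output text --region "$REGION")
ISOLATE_SG=$(aws ec2 create-security-group \
--group-name "FORENSIC-ISOLATION-INC-2024-042" \
--description "Blocks all traffic for forensic isolation" \
--vpc-id "$VPC_ID" \
--query 'GroupId' --output text --region "$REGION")
# A new group has no ingress rules but DOES get a default allow-all egress rule.
# Revoke it so the group really denies all traffic (IPv6 VPCs may also carry a ::/0 rule).
aws ec2 revoke-security-group-egress \
--group-id "$ISOLATE_SG" \
--ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]' \
--region "$REGION"
aws ec2 modify-instance-attribute \
--instance-id "$INSTANCE_ID" \
--groups "$ISOLATE_SG" --region "$REGION"
echo "Instance $INSTANCE_ID isolated with SG $ISOLATE_SG"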
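# STEP 4: Revoke compromised IAM credentials
# The association returns an instance profile ARN. Resolve it to the role it wraps,
# because the profile name and the role name are not guaranteed to match.
PROFILE_NAME=$(aws ec2 describe-iam-instance-profile-associations \
--filters "Name=instance-id,Values=$INSTANCE_ID" \
--query 'IamInstanceProfileAssociations[0].IamInstanceProfile.Arn' \
--output text --region "$REGION" | awk -F/ '{print $NF}')
ROLE_NAME=$(aws iam get-instance-profile \
--instance-profile-name "$PROFILE_NAME" \
--query 'InstanceProfile.Roles[0].RoleName' --output text)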
# Revoke all active sessions for the role
aws iam put-role-policy \
--role-name $ROLE_NAME \
--policy-name "INCIDENT-REVOKE-ALL" \
--policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":"*","Resource":"*","Condition":{"DateLessThan":{"aws:TokenIssueTime":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"}}}]}'
echo "All IAM sessions revoked for role $ROLE_NAME"
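The inline policy in STEP 4 is hard to review because of its nested quoting. As a sketch, the same document can be produced by a small helper that takes the cutoff timestamp as an argument. The function name `revoke_policy_json` is hypothetical; the policy content mirrors the one-liner above (deny everything for tokens issued before the cutoff).

```shell
#!/usr/bin/env bash
# revoke_policy_json [CUTOFF] -> prints the session-revocation policy.
# CUTOFF defaults to the current UTC time; tokens issued before it are denied.
# The function name is an illustrative assumption, not part of the AWS CLI.
revoke_policy_json() {
  local cutoff="${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": {"DateLessThan": {"aws:TokenIssueTime": "$cutoff"}}
  }]
}
EOF
}

# Possible usage (not executed here):
#   aws iam put-role-policy --role-name "$ROLE_NAME" \
#     --policy-name "INCIDENT-REVOKE-ALL" \
#     --policy-document "$(revoke_policy_json)"
```

Passing an explicit cutoff makes the revocation time part of the incident timeline instead of whatever the clock happened to read when the command ran.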
SIEM Queries for IOC Hunting
-- Splunk: Detect lateral movement via unusual internal connections
index=vpc_flow action=ACCEPT
| eval is_internal=if(match(dst_ip,"^10\.|^172\.(1[6-9]|2[0-9]|3[0-1])\.|^192\.168\."), 1, 0)
| stats count by src_ip, dst_ip, dst_port, is_internal
| where is_internal=1 AND count > 50
| sort -count
-- AWS CloudTrail: Detect privilege escalation attempts
-- (AttachRolePolicy, CreateAccessKey, PutUserPolicy from unusual IAM)
index=cloudtrail eventSource=iam.amazonaws.com
(eventName=AttachRolePolicy OR eventName=CreateAccessKey OR
eventName=PutUserPolicy OR eventName=CreateLoginProfile)
| where 'userIdentity.type' != "AWSService"
| fillnull value="None" errorCode
| stats count by userIdentity.arn, eventName, sourceIPAddress, errorCode
| where errorCode="None"
| sort -count
-- GuardDuty: High-severity findings in last 24h
-- (via Athena on GuardDuty findings exported to S3)
SELECT
type,
severity,
title,
description,
json_extract_scalar(resource, '$.instanceDetails.instanceId') as instance_id,
updatedAt
FROM guardduty_findings
WHERE severity >= 7.0
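-- assumes updatedAt is stored as an ISO 8601 string, as in the raw S3 export
AND from_iso8601_timestamp(updatedAt) > date_add('hour', -24, now())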
ORDER BY severity DESC;
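-- Okta: Impossible travel detection (login from geographically distant locations)
-- geo_distance() is a placeholder for your own GeoIP distance UDF.
-- date_diff() is Presto/Athena syntax and assumes published is a timestamp column.
WITH sessions AS (
SELECT
actor_id,
actor_login,
client_ip,
outcome_result,
published,
LAG(client_ip) OVER (PARTITION BY actor_id ORDER BY published) AS prev_ip,
LAG(published) OVER (PARTITION BY actor_id ORDER BY published) AS prev_published
FROM okta_system_log
WHERE event_type = 'user.session.start'
AND outcome_result = 'SUCCESS'
)
SELECT *
FROM sessions
WHERE prev_ip IS NOT NULL
AND geo_distance(client_ip, prev_ip) > 500 -- km
AND date_diff('minute', prev_published, published) < 120;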
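The impossible-travel query depends on a distance function that most SQL engines do not ship. When the login coordinates have already been resolved from a GeoIP database, the check can be reproduced outside the SIEM. The sketch below computes great-circle distance with the haversine formula on a spherical Earth and applies the same thresholds as the query (more than 500 km in under 120 minutes). The function names are hypothetical, and `awk` is assumed to be available.

```shell
#!/usr/bin/env bash
# haversine_km LAT1 LON1 LAT2 LON2 -> great-circle distance in km
# Spherical-Earth approximation with R = 6371 km; inputs in decimal degrees.
haversine_km() {
  awk -v lat1="$1" -v lon1="$2" -v lat2="$3" -v lon2="$4" 'BEGIN {
    rad  = atan2(0, -1) / 180            # pi / 180
    dlat = (lat2 - lat1) * rad
    dlon = (lon2 - lon1) * rad
    a = sin(dlat / 2) ^ 2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2) ^ 2
    printf "%.1f\n", 2 * 6371 * atan2(sqrt(a), sqrt(1 - a))
  }'
}

# is_impossible_travel KM MINUTES -> exit 0 when the pair of logins is suspicious
# Thresholds mirror the query: more than 500 km apart, less than 120 minutes between.
is_impossible_travel() {
  awk -v km="$1" -v minutes="$2" 'BEGIN {
    exit (km > 500 && minutes < 120 ? 0 : 1)
  }'
}
```

A call such as `is_impossible_travel "$(haversine_km 51.5074 -0.1278 40.7128 -74.0060)" 45` would flag a London login followed by a New York login 45 minutes later. Fixed thresholds produce false positives for VPN and mobile-carrier egress points, so treat a hit as a lead rather than a verdict.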
Forensic Log Collection
#!/bin/bash
# forensic_collect.sh — Collect volatile evidence before containment changes
INCIDENT="INC-2024-042"
OUTPUT_DIR="/forensic/${INCIDENT}/$(hostname)"
mkdir -p "$OUTPUT_DIR"
echo "[$(date -u)] Starting forensic collection for $INCIDENT" | tee "$OUTPUT_DIR/collection.log"
# 1. Running processes (volatile — collect first)
ps aux > "$OUTPUT_DIR/processes.txt"
ps auxf > "$OUTPUT_DIR/process_tree.txt"
# 2. Network connections
netstat -tulpn > "$OUTPUT_DIR/netstat.txt" 2>&1
ss -tulpn > "$OUTPUT_DIR/ss.txt" 2>&1
# 3. Active logins
who > "$OUTPUT_DIR/who.txt"
last -F > "$OUTPUT_DIR/last.txt"
lastlog > "$OUTPUT_DIR/lastlog.txt"
# 4. Scheduled tasks (common persistence mechanism)
crontab -l > "$OUTPUT_DIR/crontab_root.txt" 2>&1
ls -la /etc/cron* > "$OUTPUT_DIR/cron_dirs.txt" 2>&1
cat /etc/cron.d/* >> "$OUTPUT_DIR/cron_dirs.txt" 2>&1
systemctl list-units --type=service > "$OUTPUT_DIR/systemd_services.txt"
# 5. Recent file modifications (last 7 days)
find /etc /usr /bin /sbin -mtime -7 -type f 2>/dev/null > "$OUTPUT_DIR/recent_modifications.txt"
find /tmp /var/tmp -type f 2>/dev/null -ls >> "$OUTPUT_DIR/recent_modifications.txt"
# 6. Auth logs
cp /var/log/auth.log "$OUTPUT_DIR/" 2>/dev/null
cp /var/log/secure "$OUTPUT_DIR/" 2>/dev/null
# 7. Close the log, then hash all collected files for chain of custody
# (hash last: any write after this point would invalidate the checksums)
echo "[$(date -u)] Collection complete, hashing and uploading" | tee -a "$OUTPUT_DIR/collection.log"
sha256sum "$OUTPUT_DIR"/* > "$OUTPUT_DIR/CHECKSUMS.sha256"
# 8. Upload to forensic evidence bucket (immutable, versioned)
# S3 bucket names must be lowercase, so the incident ID goes in the key prefix.
aws s3 cp "$OUTPUT_DIR" "s3://forensic-evidence-bucket/${INCIDENT}/$(hostname)/" --recursive \
--no-guess-mime-type \
--metadata "incident=${INCIDENT},collected=$(date -u +%Y%m%dT%H%M%SZ),collector=$(whoami)"
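Checksums only support chain of custody if someone can re-verify them later, often on a different machine. The collection script records absolute paths, which ties verification to the original host layout. The sketch below is one alternative under stated assumptions: it writes relative paths so the evidence directory can be verified after download. The function names `hash_evidence` and `verify_evidence` are hypothetical, and GNU coreutils (`sha256sum`, `sort -z`, `xargs -r`) are assumed.

```shell
#!/usr/bin/env bash
# hash_evidence DIR -> writes DIR/CHECKSUMS.sha256 with paths relative to DIR
hash_evidence() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    # Exclude the checksum file itself; sort for a stable, diffable listing.
    find . -type f ! -name CHECKSUMS.sha256 -print0 \
      | sort -z \
      | xargs -0 -r sha256sum > CHECKSUMS.sha256
  )
}

# verify_evidence DIR -> exit 0 only if every listed file still matches
verify_evidence() {
  local dir="$1"
  ( cd "$dir" && sha256sum --check --quiet CHECKSUMS.sha256 )
}
```

An analyst who downloads the evidence prefix can run `verify_evidence ./INC-2024-042/host01` before opening any file, and record the result in the incident timeline.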
Communication Templates
# Internal Escalation (P1 Incident — send within 15 minutes)
**TO:** [CISO, CTO, Legal, CEO]
**SUBJECT:** [P1 SECURITY INCIDENT] INC-2024-042 — Active Investigation
We have declared a P1 security incident at [time] UTC.
**What we know:**
- Detection source: [GuardDuty / EDR / user report]
- Affected systems: [system names]
- Potential data exposure: [data types or "investigating"]
- Attacker status: [active / contained / unknown]
**Actions taken:**
- Incident Commander assigned: [Name]
- Systems isolated: [yes/no]
- Evidence preservation: [in progress / complete]
**Next update:** [time + 30 minutes] or sooner if material changes.
Incident channel: #incident-INC-2024-042
IC: [Name] | [phone]
---
# Regulatory Notification Template (GDPR — 72-hour deadline)
[Company] hereby notifies [supervisory authority] of a personal data breach pursuant to
Article 33 of the GDPR.
**Nature of the breach:** Unauthorized access to [system] resulting in potential exposure of
[data categories] affecting approximately [N] data subjects.
**Date of breach:** [date or "investigation ongoing"]
**Date discovered:** [date]
**Date of notification:** [date]
**Categories of personal data:** [names, emails, etc.]
**Approximate number of data subjects:** [N]
**Categories of recipients:** [internal / third parties if shared]
**Likely consequences:** [risk assessment]
**Measures taken:**
1. [Containment action]
2. [Remediation action]
3. [Prevention measure]
**Contact:** [DPO name, email, phone]
Tabletop Exercise Design
# Tabletop Scenario: Ransomware via Phishing
Duration: 90 minutes | Participants: IR team, IT, legal, comms, exec
## Inject Timeline
T+0:00 — User reports their files have strange extensions
T+0:05 — EDR shows Emotet → Cobalt Strike → ransomware chain on 3 endpoints
T+0:10 — Business asks: should we pay the ransom?
**Discussion 1:** What is your immediate containment action?
T+0:20 — Backup systems found encrypted (attacker had 14-day dwell time)
T+0:25 — PR receives press inquiry from reporter
**Discussion 2:** Who approves the PR response? What do you say?
T+0:40 — Legal confirms customer PII was on compromised systems
**Discussion 3:** What is your GDPR/CCPA notification timeline and obligation?
T+0:55 — Attacker posts sample data on darkweb forum
**Discussion 4:** How does this change your response strategy?
## Questions to Drive Discussion
- Who has authority to isolate production systems?
- What's the process for notifying regulators in each jurisdiction?
- At what point do we engage external IR firm?
- How do we communicate with customers before we know full scope?
- What evidence must we preserve for law enforcement?
Anti-Patterns
❌ Remediating before preserving evidence
The instinct is to patch and clean immediately. This destroys forensic evidence. Always take snapshots, capture a memory dump, and collect logs before any remediation action.
❌ No pre-approved communication templates
During an incident, you don't have time to write communications from scratch. Legal approval takes hours. Pre-approve templates for all scenarios before an incident.
❌ IC trying to do everything
The IC coordinates and does not execute. Assign specific roles: forensics lead, comms lead, legal liaison, exec briefer. An IC who does not delegate becomes a bottleneck.
❌ Not practicing with tabletop exercises
Incident response is a skill that degrades without practice. Teams that have never run a tabletop exercise will make basic coordination mistakes in a real incident.
❌ Declaring victory too early
Attackers frequently maintain persistence after initial remediation. Monitor for 72 hours after "eradication." Many breaches are re-breaches within 30 days.
Quick Reference
Severity levels:
P1 CRITICAL → Active breach, data exfil, production down → IC + exec NOW
P2 HIGH → Confirmed breach, contained → IC + legal within 1h
P3 MEDIUM → IOCs found, investigation → IC + IR team
P4 LOW → Security event, no breach → IR team
Containment order:
1. Preserve evidence (snapshot, memory dump, logs)
2. Isolate (network block, account disable)
3. Investigate (forensics on preserved evidence)
NEVER: remediate before preserving
Regulatory timelines:
GDPR → 72 hours after becoming aware
CCPA → No mandatory timeline (notify "expeditiously")
HIPAA → 60 days after discovery
Evidence preservation:
EC2: EBS snapshot → memory dump via avml → VPC flow logs
SaaS: Export audit logs immediately (often 90-day retention)
Endpoints: EDR telemetry, process dump, disk image
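The fixed regulatory timelines above are easy to miscount during an incident, especially across time zones. As a sketch, the deadline can be computed from the moment of awareness in UTC. The function name `notification_deadline` is hypothetical, only the two fixed windows listed above are encoded (GDPR 72 hours, HIPAA 60 days), and GNU `date` is assumed for the `-d` option.

```shell
#!/usr/bin/env bash
# notification_deadline REGIME AWARE_AT -> latest notification time (UTC, ISO 8601)
# AWARE_AT is an ISO 8601 UTC timestamp, e.g. 2024-03-01T10:00:00Z.
# Requires GNU date. Regimes without a fixed window return an error.
notification_deadline() {
  local regime="$1" aware_at="$2" hours aware_epoch
  case "$regime" in
    GDPR)  hours=72 ;;             # 72 hours after becoming aware
    HIPAA) hours=$((60 * 24)) ;;   # 60 days after discovery
    *) echo "no fixed deadline recorded for: $regime" >&2; return 1 ;;
  esac
  # Work in epoch seconds to avoid ambiguity in relative-date parsing.
  aware_epoch=$(date -u -d "$aware_at" +%s) || return 1
  date -u -d "@$((aware_epoch + hours * 3600))" +%Y-%m-%dT%H:%M:%SZ
}
```

The result is an outer bound, not a target: both regimes expect notification without undue delay, and legal counsel decides when the clock actually started.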