linux-sysadmin
Expert Linux system administration covering process and service management with systemd, advanced networking with modern tools, storage and LVM, performance analysis toolkit, user permissions and SSH hardening, log management, and system internals via /proc and /sys.
Linux Sysadmin
Linux system administration mastery is the foundation of everything in the cloud. Understanding how the
kernel manages processes, memory, and I/O lets you diagnose problems that no monitoring tool will surface
for you. The tools are decades old, but the patterns are universal.
Core Mental Model
Every Linux system is a hierarchy: the kernel manages hardware, the init system (systemd) manages services,
and everything else is a process in a tree rooted at PID 1. Most problems trace back to a concrete root cause: a process
consuming too much CPU, a leaked file descriptor, a network socket stuck in TIME_WAIT, a disk filling up.
The skill is navigating the tool chain (top → strace → lsof → ss → tcpdump) to narrow from symptom
to cause. The /proc filesystem is the kernel's real-time self-portrait, and most of these tools simply read from it.
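To make the "tree rooted at PID 1" idea concrete, here is a minimal sketch that walks a process's ancestors by reading /proc/<PID>/status directly. The function name `ancestry` and its optional second argument (an alternate proc root, handy for testing) are illustrative, not part of any standard tool.

```shell
# Print the chain of ancestors for a PID, one "PID NAME" per line,
# by reading /proc/<pid>/status directly.
# proc_root is a hypothetical second argument (default /proc) so the
# function can be pointed at a copy of /proc for testing.
# Names containing spaces are truncated to their first word.
ancestry() (
    pid=$1
    proc_root=${2:-/proc}
    while [ "$pid" -gt 0 ] && [ -r "$proc_root/$pid/status" ]; do
        name=$(awk '$1 == "Name:" { print $2 }' "$proc_root/$pid/status")
        ppid=$(awk '$1 == "PPid:" { print $2 }' "$proc_root/$pid/status")
        echo "$pid $name"
        pid=${ppid:-0}
    done
)
```

Usage: `ancestry $$` prints the current shell, then its parent, and so on until it reaches a process whose parent PID is 0 (normally PID 1).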
systemd: Service Management
Unit File for a Production Service
# /etc/systemd/system/order-api.service
# systemd only treats "#" as a comment at the start of a line,
# so every comment below sits on its own line.
[Unit]
Description=Order API Service
After=network-online.target postgresql.service
Wants=network-online.target
Requires=postgresql.service
# Start rate limiting lives in [Unit]: max 3 starts in 60s before giving up
StartLimitIntervalSec=60
StartLimitBurst=3
[Service]
# systemd waits for sd_notify() before marking the unit "active"
Type=notify
User=order-api
Group=order-api
WorkingDirectory=/opt/order-api
ExecStart=/opt/order-api/bin/server --config /etc/order-api/config.yaml
# Reload config without restart
ExecReload=/bin/kill -HUP $MAINPID
# Environment
EnvironmentFile=/etc/order-api/env
Environment=PORT=8080
# Restart behavior
Restart=on-failure
RestartSec=5s
# Security hardening
# Prevent privilege escalation
NoNewPrivileges=yes
# Isolated /tmp
PrivateTmp=yes
# Whole filesystem read-only except /dev, /proc, /sys and ReadWritePaths
ProtectSystem=strict
ReadWritePaths=/var/lib/order-api /var/log/order-api
ProtectHome=yes
# Only needed capability (required only when binding ports below 1024)
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
# Raise file descriptor limit
LimitNOFILE=65536
# Resource limits (cgroup v2; MemoryMax replaces the v1-era MemoryLimit)
MemoryMax=512M
# Max 2 CPU cores
CPUQuota=200%
[Install]
WantedBy=multi-user.target
# Essential systemd commands
systemctl start|stop|restart|reload|status order-api
systemctl enable|disable order-api # Enable/disable on boot
systemctl daemon-reload # After editing unit files
# journalctl for logs
journalctl -u order-api # All logs for unit
journalctl -u order-api -f # Follow (tail -f equivalent)
journalctl -u order-api --since "1 hour ago"
journalctl -u order-api -n 100 --no-pager
journalctl -u order-api -p err # Priority: emerg alert crit err warning notice info debug
journalctl --disk-usage # How much journal space used
journalctl --vacuum-size=1G # Trim old journals
# Analyze startup time
systemd-analyze blame
systemd-analyze critical-chain order-api.service
Networking: Modern Toolkit
ip Commands (replace ifconfig/route)
# Interface management
ip addr show
ip addr add 10.0.1.10/24 dev eth0
ip addr del 10.0.1.10/24 dev eth0
# Route management
ip route show
ip route add 192.168.2.0/24 via 10.0.1.1 dev eth0
ip route add default via 10.0.1.1
# Network namespace (container networking internals)
ip netns list
ip netns exec my-ns ip addr show
# Socket stats (replace netstat -tulpn)
ss -tulpn # TCP+UDP, listening, with process
ss -s # Summary statistics
ss -t state established # All established TCP connections
ss -o state time-wait # Connections in TIME_WAIT (with timer info)
ss 'sport = :8080' # Connections on port 8080
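When hunting for sockets piling up in one state, a per-state tally is often quicker to read than the raw list. A minimal sketch, assuming `ss -tan` style input where the first column is the state; `count_states` is a hypothetical helper name:

```shell
# Count TCP sockets per state from `ss -tan` output on stdin.
# The first column of each row is the state; the header row is skipped.
# Output is "COUNT STATE", highest count first, ties sorted by state name.
count_states() {
    awk '$1 != "State" && NF { n[$1]++ } END { for (s in n) print n[s], s }' |
        sort -k1,1nr -k2,2
}
```

Usage: `ss -tan | count_states`. A large TIME-WAIT or CLOSE-WAIT count at the top is usually the first thing to investigate.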
iptables vs nftables
# iptables: legacy but ubiquitous
iptables -L -n -v # List all rules with packet counts
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
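iptables -A INPUT -i lo -j ACCEPT # Allow loopback
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -P INPUT DROP # Default deny (set last, after the accept rules are in place)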
iptables-save > /etc/iptables/rules.v4 # Persist
# nftables: modern replacement
# /etc/nftables.conf
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        # Allow established connections
        ct state established,related accept
        # Allow loopback
        iifname "lo" accept
        # Allow ICMP
        ip protocol icmp accept
        ip6 nexthdr icmpv6 accept
        # Rate limit new SSH connections to slow brute force.
        # These rules must come before any blanket accept for port 22,
        # otherwise the limit is never evaluated.
        tcp dport 22 ct state new limit rate 5/minute accept
        tcp dport 22 ct state new drop
        # Allow HTTP, HTTPS
        tcp dport { 80, 443 } accept
    }
    chain forward {
        type filter hook forward priority 0; policy drop;
    }
    chain output {
        type filter hook output priority 0; policy accept;
    }
}
tcpdump Patterns
# Capture HTTP traffic on eth0
tcpdump -i eth0 -nn 'tcp port 80' -w /tmp/http.pcap
# Show DNS queries
tcpdump -i any -nn 'udp port 53'
# Capture traffic between two hosts
tcpdump -i eth0 'host 10.0.1.10 and host 10.0.1.20'
# Show packet contents (ASCII)
tcpdump -i eth0 -A 'tcp port 8080 and (tcp[tcpflags] & tcp-push != 0)'
# Capture and read without resolving hostnames
tcpdump -i eth0 -nn -s0 -w /tmp/capture.pcap
tcpdump -r /tmp/capture.pcap -nn 'tcp port 443'
Storage: LVM and Performance
# Disk overview
lsblk -f # Tree view with filesystem info
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,FSTYPE
df -h # Disk usage
df -i # Inode usage (can run out of inodes!)
du -sh /var/log/* | sort -rh # Find large directories
# LVM management
pvs; vgs; lvs # Show PVs, VGs, LVs
pvcreate /dev/sdb
vgcreate data_vg /dev/sdb
lvcreate -L 50G -n app_lv data_vg
mkfs.ext4 /dev/data_vg/app_lv
mount /dev/data_vg/app_lv /data
# Extend LV online (no unmount needed for ext4/xfs)
lvextend -L +20G /dev/data_vg/app_lv
resize2fs /dev/data_vg/app_lv # ext4
xfs_growfs /data # xfs
# Performance analysis
iostat -xz 1 5 # Extended I/O stats (util%, await, r/w/s)
iotop -o # Which processes doing I/O
fio --name=randread --ioengine=libaio --iodepth=32 \
--rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 # Disk benchmark
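Since inode exhaustion does not show up in `df -h`, a small filter over `df -iP` output can flag it early. A minimal sketch; `inode_alert` and its threshold argument are hypothetical, and mount points containing spaces are not handled:

```shell
# Print mount points whose inode usage is at or above a threshold percent.
# Reads `df -iP` output on stdin; column 5 is IUse%, column 6 the mount point.
# Filesystems that report "-" for IUse% are skipped.
inode_alert() {
    awk -v limit="${1:-90}" 'NR > 1 && $5 ~ /^[0-9]+%$/ {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}
```

Usage: `df -iP | inode_alert 90` prints nothing when every filesystem is below 90% inode usage.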
Performance Analysis Toolkit
# CPU analysis
top -d 1 # Refresh every second
htop # Better UI, tree view
atop # Historical view, shows killed processes
# Load average interpretation
# load average: 1.2, 0.8, 0.5 (1min, 5min, 15min)
# On a 4-core system: > 4.0 = saturation
nproc # Number of processors
uptime # Quick load average view
# Memory analysis
free -h
vmstat 1 5 # Virtual memory, CPU, I/O snapshot
grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree" /proc/meminfo
# Process investigation
ps aux --sort=-%mem | head -20 # Top memory consumers
ps aux --sort=-%cpu | head -20 # Top CPU consumers
pmap -x <PID> # Memory map of a process
# strace: trace system calls
strace -p <PID> # Attach to running process
strace -c -p <PID> # Count syscalls (statistics)
strace -e trace=openat ls /etc # Trace only openat() calls
strace -f -e trace=network curl http://example.com # Network syscalls only
# lsof: list open files and sockets
lsof -p <PID> # All files opened by PID
lsof -i :8080 # Sockets using port 8080 (add -sTCP:LISTEN for listeners only)
lsof -i tcp -n # All TCP connections
lsof +D /var/log # All files open in directory
lsof -u username # All files by user
# Find who is using a file
fuser -v /var/log/app.log
fuser -k 8080/tcp # Kill process using port 8080
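The load average rule above (on a 4-core system, > 4.0 means saturation) boils down to dividing load by core count. A minimal sketch with a hypothetical helper, `load_per_core`, that takes both numbers as arguments so the arithmetic is easy to check:

```shell
# Print a load average divided by the number of CPUs, to two decimals.
# A result above 1.00 means more runnable (or uninterruptibly waiting)
# tasks than cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}
```

Usage: `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`. Note that Linux counts tasks in uninterruptible I/O wait toward the load average, so a high ratio can point at disk rather than CPU; check iostat before concluding.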
SSH Hardening
# /etc/ssh/sshd_config hardening
PermitRootLogin no
# Key auth only (comment kept on its own line; older sshd versions reject trailing comments)
PasswordAuthentication no
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
# Disable less-used auth methods
ChallengeResponseAuthentication no
KerberosAuthentication no
GSSAPIAuthentication no
UsePAM yes
# Restrict to specific users/groups
AllowGroups ssh-users admin
# Timeout settings
LoginGraceTime 30
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 3
# Disable X11/agent forwarding if not needed
X11Forwarding no
AllowAgentForwarding no
# Strict: disable port forwarding
AllowTcpForwarding no
# Use stronger algorithms only
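# Check what your build supports with: ssh -Q kex; ssh -Q cipher; ssh -Q mac
KexAlgorithms curve25519-sha256,curve25519-sha256@libssh.org
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com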
# Non-default port. ListenAddress 0.0.0.0 listens on all IPv4 addresses;
# set a specific interface IP to actually restrict where sshd listens
Port 2222
ListenAddress 0.0.0.0
# Banner (legal notice)
Banner /etc/ssh/banner
# Test new sshd_config before disconnecting
sshd -t # Test config syntax
sshd -T # Dump effective configuration
# Always test in a SECOND session before disconnecting the first!
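The validate-then-reload habit can be wrapped in a tiny guard so the reload never runs on a broken config. A minimal sketch; `reload_if_valid` is a hypothetical helper, and the sshd commands in the usage note are assumptions about the local setup (the unit is named `ssh` on Debian/Ubuntu):

```shell
# Run a validation command and run the reload command only if it succeeds.
# Both arguments are command strings executed with sh -c.
reload_if_valid() (
    validate=$1
    reload=$2
    if ! sh -c "$validate"; then
        echo "validation failed; not reloading" >&2
        exit 1
    fi
    sh -c "$reload"
)
```

Usage: `reload_if_valid 'sshd -t' 'systemctl reload sshd'`. This does not replace keeping a second session open; a syntactically valid config can still lock you out.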
Log Management
# logrotate configuration
# /etc/logrotate.d/order-api
/var/log/order-api/*.log {
    daily
    rotate 30
    compress
    # Keep the most recent rotated file uncompressed (the app may still be writing to it)
    delaycompress
    # Don't error if log missing
    missingok
    # Don't rotate empty files
    notifempty
    # Permissions for new log file
    create 0640 order-api adm
    postrotate
        # Signal app to reopen log files
        systemctl kill -s HUP order-api.service
    endscript
}
# rsyslog: forward logs to centralized server
# /etc/rsyslog.d/50-forward.conf
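# queue.saveonshutdown only takes effect for disk-assisted queues, which
# need queue.filename; "fwd_queue" is a placeholder name
*.* action(
    type="omfwd"
    target="logs.internal"
    port="514"
    protocol="tcp"
    action.resumeRetryCount="100"
    queue.type="linkedList"
    queue.filename="fwd_queue"
    queue.size="10000"
    queue.saveonshutdown="on"
)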
/proc and /sys Deep Dive
# /proc: kernel view of running system
cat /proc/cpuinfo # CPU info
cat /proc/meminfo # Memory stats
cat /proc/net/dev # Network interface stats
cat /proc/net/tcp # TCP connection table (hex!)
cat /proc/<PID>/maps # Memory mappings
cat /proc/<PID>/status # Process status, memory, threads
ls -la /proc/<PID>/fd # Open file descriptors (fd is a directory of symlinks)
tr '\0' ' ' < /proc/<PID>/cmdline # Command line (arguments are NUL-separated)
cat /proc/sys/net/core/somaxconn # Current listen backlog limit
# /sys: kernel parameter tuning
cat /sys/block/sda/queue/scheduler # I/O scheduler (mq-deadline, none)
echo mq-deadline > /sys/block/sda/queue/scheduler
# sysctl: runtime kernel parameter tuning
sysctl -a | grep net.core
sysctl net.core.somaxconn # Current value
sysctl -w net.core.somaxconn=65535 # Set immediately (lost on reboot)
# Persist in /etc/sysctl.d/99-custom.conf
cat /etc/sysctl.d/99-custom.conf
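The address columns in /proc/net/tcp are hex, which is why the table looks unreadable at first. A minimal sketch of decoding one IPv4 `ADDR:PORT` field; `decode_tcp_addr` is a hypothetical helper, and it assumes a little-endian host (x86, most ARM), where the four address bytes appear reversed:

```shell
# Decode an IPv4 "ADDR:PORT" field from /proc/net/tcp, e.g. 0100007F:1F90.
# Assumes a little-endian host, where the address bytes are stored in
# reverse order; the port is plain hex.
decode_tcp_addr() (
    hex_ip=${1%:*}
    hex_port=${1#*:}
    b4=$(printf '%s' "$hex_ip" | cut -c1-2)
    b3=$(printf '%s' "$hex_ip" | cut -c3-4)
    b2=$(printf '%s' "$hex_ip" | cut -c5-6)
    b1=$(printf '%s' "$hex_ip" | cut -c7-8)
    printf '%d.%d.%d.%d:%d\n' "0x$b1" "0x$b2" "0x$b3" "0x$b4" "0x$hex_port"
)
```

Usage: `awk 'NR > 1 { print $2 }' /proc/net/tcp | while read -r a; do decode_tcp_addr "$a"; done` lists local addresses in readable form. In practice `ss -tan` does this for you; the sketch only shows what the raw table encodes.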
Critical sysctl for Production Servers
# /etc/sysctl.d/99-production.conf
# Network: increase connection limits
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 65535
# TCP optimization
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_fin_timeout = 30
# Reuse TIME_WAIT sockets for new outbound connections
net.ipv4.tcp_tw_reuse = 1
# File descriptor limits
fs.file-max = 2097152
fs.inotify.max_user_watches = 524288
# VM: memory behavior (comments must sit on their own line in sysctl.d files)
# Prefer RAM, use swap sparingly
vm.swappiness = 10
# Always allow overcommit (needed for Redis fork); the OOM killer remains the backstop
vm.overcommit_memory = 1
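Each sysctl key maps to a file under /proc/sys (dots become slashes), which makes it easy to check whether a persisted file matches the running kernel. A minimal sketch; `sysctl_drift` and its optional second argument (an alternate /proc/sys root for testing) are hypothetical, and it only handles single-value keys without dots inside a component:

```shell
# Compare "key = value" lines in a sysctl config file against live values.
# proc_sys is a hypothetical second argument (default /proc/sys) so the
# check can run against a test directory.
# Prints one line per key whose live value differs from the file.
sysctl_drift() (
    conf=$1
    proc_sys=${2:-/proc/sys}
    grep -Ev '^[[:space:]]*([#;]|$)' "$conf" | while IFS='=' read -r key want; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        want=$(printf '%s' "$want" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        path="$proc_sys/$(printf '%s' "$key" | tr . /)"
        have=$(cat "$path" 2>/dev/null)
        [ "$have" = "$want" ] || echo "$key: want=$want have=${have:-unreadable}"
    done
)
```

Usage: `sysctl_drift /etc/sysctl.d/99-production.conf` prints nothing when the running values match; run `sysctl --system` to apply the files if it reports drift.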
Anti-Patterns
❌ Running services as root: always create dedicated service accounts
❌ chmod 777 on any file or directory: always use minimal permissions
❌ Disabling SELinux/AppArmor entirely: fix policy violations, don't disable
❌ PasswordAuthentication yes in sshd_config: keys only in production
❌ No RestartSec in systemd units: a tight crash loop can DoS your system
❌ Ignoring inode exhaustion: df -h shows free space but the system can't create files (check df -i)
❌ Not testing sshd config before reload: run sshd -t first and always keep a second session open
❌ nohup my-script & for long-running processes: use systemd, not nohup/screen
❌ Infinite log retention: logrotate configuration is mandatory for every app
Quick Reference
Service management:
systemctl {start|stop|restart|status|enable|disable} SERVICE
journalctl -u SERVICE -f # Follow service logs
journalctl -u SERVICE --since "10 min ago" # Recent logs
File descriptor limit troubleshooting:
ulimit -n # Current limit for shell
cat /proc/<PID>/limits # Per-process limits
# Fix: LimitNOFILE=65536 in service unit file
Find process using port:
ss -tulpn | grep :8080
fuser -v 8080/tcp
lsof -i :8080
Disk space emergency:
df -h # Find full filesystem
du -sh /* 2>/dev/null | sort -rh | head # Find largest dirs
find /var/log -name "*.log" -size +100M # Large log files
journalctl --vacuum-size=1G # Trim systemd journal
Performance triage order:
1. top/htop → CPU, memory, load average
2. iostat -xz 1 → I/O wait, disk utilization
3. ss -s → Connection counts, socket states
4. vmstat 1 → Memory pressure, swap activity
5. strace -c -p <PID> → What syscalls is it blocking on?
Skill Information
- Source: MoltbotDen
- Category: DevOps & Cloud