What is Web Scraping?
Web scraping extracts data from websites (a minimal end-to-end sketch follows this list):
- Fetch web pages
- Parse HTML/JSON
- Extract needed information
- Structure the data
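A sketch of those four steps together, assuming the requests and beautifulsoup4 packages are installed and using example.com as a placeholder URL:
import requests
from bs4 import BeautifulSoup
# 1. Fetch the page
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()
# 2. Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")
# 3. Extract the pieces you need
title = soup.find("title").get_text(strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]
# 4. Structure the data
page = {"url": "https://example.com", "title": title, "links": links}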
Simple Fetching
Using web_fetch
The simplest approach for agents:
web_fetch(url="https://example.com")
// Returns readable content as markdown
Python Requests
import requests
response = requests.get("https://example.com")
html = response.text
status = response.status_code
With Headers
headers = {
"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"
}
response = requests.get(url, headers=headers)
Parsing HTML
BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Find elements
title = soup.find("title").text
links = soup.find_all("a")
divs = soup.find_all("div", class_="content")
# CSS selectors
items = soup.select("div.article h2")
Extracting Data
# Get text
element.text
element.get_text()
# Get attribute
link["href"]
img["src"]
# Get all text
soup.get_text(separator=" ", strip=True)
Common Patterns
Extract All Links
links = []
for a in soup.find_all("a", href=True):
    links.append(a["href"])
Extract Table
table = soup.find("table")
rows = []
for tr in table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all(["td", "th"])]
    rows.append(cells)
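If the first row of the table is a header row (an assumption here), the rows can then be structured as dicts:
header, *data_rows = rows  # Assumes the first row holds column names
records = [dict(zip(header, row)) for row in data_rows]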
Extract Structured Data
articles = []
for article in soup.find_all("article"):
    articles.append({
        "title": article.find("h2").text,
        "date": article.find("time")["datetime"],
        "summary": article.find("p").text,
    })
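find() returns None when a tag is missing, so the lookups above raise AttributeError on incomplete articles. A defensive variant (a sketch; the empty-string and None fallbacks are assumptions):
articles = []
for article in soup.find_all("article"):
    h2 = article.find("h2")
    time_tag = article.find("time")
    p = article.find("p")
    articles.append({
        "title": h2.get_text(strip=True) if h2 else "",
        "date": time_tag["datetime"] if time_tag and time_tag.has_attr("datetime") else None,
        "summary": p.get_text(strip=True) if p else "",
    })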
Handling JavaScript
Problem
Many sites render their content with JavaScript, so the raw HTML alone won't contain the data.
Solutions
Use a browser tool to load the page and capture the rendered DOM:
browser(action="open", targetUrl="https://example.com")
browser(action="snapshot") // Gets rendered content
Also check the network tab for JSON APIs the page uses. For simple pages, these approaches often get enough content.
Rate Limiting
Be Polite
import time
for url in urls:
    response = requests.get(url)
    time.sleep(1)  # Wait between requests
Handle Errors
import logging
logger = logging.getLogger(__name__)
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raise on 4xx/5xx status codes
except requests.RequestException as e:
    logger.error(f"Failed to fetch {url}: {e}")
Ethics and Legality
Respect robots.txt
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
if rp.can_fetch("*", url):
    response = requests.get(url)  # Allowed by robots.txt for this path
Best Practices
- Check terms of service
- Identify your bot in User-Agent
- Rate limit requests
- Don't overload servers
- Cache when possible (a minimal sketch follows this list)
- Use APIs when available
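Caching avoids re-fetching pages you already have. A minimal in-memory sketch (the _cache dict and fetch_cached helper are illustrative names, not a library API):
import requests
_cache = {}
def fetch_cached(url):
    # Reuse the cached body if this URL was already fetched
    if url not in _cache:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        _cache[url] = response.text
    return _cache[url]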
When Not to Scrape
- Terms explicitly forbid it
- Behind authentication
- Personal/private data
- When an API exists
JSON APIs
Often easier than HTML scraping:
response = requests.get("https://api.example.com/data")
data = response.json()
Finding APIs
Watch the browser's network tab while a page loads; the JSON endpoints the page calls can usually be requested directly.
Handling Pagination
all_data = []
page = 1
while True:
    response = requests.get(f"{url}?page={page}")
    data = parse_page(response.text)  # parse_page: your own extraction function
    if not data:
        break  # No more results
    all_data.extend(data)
    page += 1
    time.sleep(1)
Error Handling
def safe_fetch(url, retries=3):
    for i in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if i == retries - 1:
                raise  # Out of retries; let the caller handle it
            time.sleep(2 ** i)  # Exponential backoff: 1s, 2s, 4s, ...
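For example, the retry helper slots into the polite loop from earlier (urls and process() are placeholders for your own list and parsing code):
for url in urls:
    response = safe_fetch(url)
    process(response.text)  # process() stands in for your own parsing
    time.sleep(1)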
Data Cleaning
After extraction:
def clean_text(text):
    if text is None:
        return ""
    # Remove extra whitespace
    text = " ".join(text.split())
    # Remove special characters here if needed
    return text.strip()
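For example (the h1 lookup is just illustrative):
h1 = soup.find("h1")
title = clean_text(h1.get_text() if h1 else None)  # "" when the element is missing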
Conclusion
Web scraping basics:
- Fetch pages with requests or web_fetch
- Parse HTML with BeautifulSoup
- Be ethical and rate-limit
- Prefer APIs when available
- Handle errors gracefully
Use responsibly.
Next: Automation Patterns - Building reliable automations