
Housefly: A Hands-On Web Scraping Playground

Housefly is an interactive learning project designed to teach web scraping through structured challenges. Each chapter includes a companion website built specifically to be scraped, allowing you to practice in a controlled environment.

Features

  • Realistic Web Scraping Challenges – Work with purpose-built websites.
  • Structured Learning – Progress through guided exercises.
  • Automated Solution Checking – Verify your scrapers against expected outputs.

Getting Started

  1. Clone the Repository

```shell
git clone https://github.com/yourusername/housefly.git
cd housefly
```

  2. Navigate to Chapter 1

Each chapter contains a simple website to scrape, along with an expected.txt file defining the correct output.

  3. Write Your Scraper

Implement your solution inside the corresponding solution{number}/ directory.
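As an illustration of what a solution file might look like, here is a minimal, dependency-free sketch. The markup, helper name, and URL are assumptions for illustration, not the actual chapter 1 page; a real solution would typically fetch the chapter site and use a proper HTML parser such as cheerio.

```typescript
// Hypothetical sketch for a solution1/index.ts — not the actual
// chapter 1 answer. Extracts the text of every <li> element.
function extractListItems(html: string): string[] {
  // A real solution would prefer an HTML parser over regex; this
  // keeps the sketch self-contained.
  const matches = html.matchAll(/<li[^>]*>(.*?)<\/li>/gs);
  return Array.from(matches, (m) => m[1].trim());
}

// Inline sample standing in for the chapter's page content.
const sampleHtml = `
  <ul>
    <li>alpha</li>
    <li>beta</li>
  </ul>
`;

// A real solution would first fetch the chapter site, e.g.:
//   const html = await (await fetch("http://localhost:3000")).text();
console.log(extractListItems(sampleHtml).join("\n"));
```

The idea is simply to print output in the shape that `expected.txt` defines, so the validation script can diff the two.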

  4. Check Your Answer

Run the validation script to compare your scraper’s output against expected.txt:

```shell
npm run ca 1
```

  5. Add Env Vars (Optional)

Some challenges require third-party APIs (e.g., OpenAI). For those, fill in the provided .env.template file and rename it to .env:

```shell
mv .env.template .env
```
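The file follows the usual dotenv `KEY=value` format. The exact variable names are defined in .env.template in the repo; the key below is purely hypothetical:

```
# Hypothetical example — check .env.template for the real variable names.
OPENAI_API_KEY=your-key-here
```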

Project Structure

```
housefly/
├── apps/
│   ├── chapter1/  # Website for Chapter 1
│   │   ├── index.html
│   │   ├── package.json
│   ├── chapter2/
│   ├── chapter3/
│   ├── solution1/  # Place your Chapter 1 solution here
│   │   ├── expected.txt
│   │   ├── index.ts
│   │   ├── package.json
├── scripts/
│   ├── check_answers.sh  # Script to validate solutions
```

Roadmap

  1. Basic HTML Scraping
  • Single static HTML file with simple text
  • Single HTML file with structured data (tables, lists, divs with classes)
  • Single HTML file with unstructured text requiring AI-based structuring (e.g., extracting key information from free-form text)
  2. JavaScript-Rendered Content
  • Single-page site where content loads dynamically via JavaScript
  • Scraping sites with infinite scroll and lazy-loaded content
  3. Multi-Page Crawling
  • Crawling multiple pages within the same subdomain (/content/*)
  • Sitemap crawling and extracting internal links
  • Managing duplicate data (indexing URLs vs. storing new content)
  4. Scraping API-Driven Websites
  • Extracting data from JSON responses in API-driven websites
  • Scraping sites where data loads via AJAX calls
  • Handling GraphQL APIs
  5. Interacting with Websites (Forms & Sessions)
  • Automating form submissions (e.g., login forms, search bars)
  • Handling session-based authentication (cookies, tokens)
  • Navigating paginated content
  6. Media & Non-Text Scraping
  • Extracting images and metadata (alt text, filenames)
  • Downloading and parsing PDFs
  • Scraping embedded video metadata (YouTube, Vimeo)
  7. Handling Web Crawling Defenses
  • Rate limiting and polite crawling (respecting robots.txt)
  • Handling CAPTCHAs with solver services
  • Dealing with anti-scraping mechanisms (e.g., Cloudflare, bot traps)
  8. Large-Scale & Unstructured Web Crawling
  • Scraping random websites with different path structures and data formats
  • AI-assisted parsing for messy and unstructured data
  • Building a search crawler (e.g., using searxng for discovering new content)
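The "index of URL -> data" idea from the multi-page crawling chapter can be sketched as a simple in-memory index keyed by URL, so re-visited pages are skipped rather than stored twice. The type and class names below are hypothetical, not part of the repo:

```typescript
// Hypothetical sketch: deduplicating crawled pages by URL.
type PageData = { url: string; title: string };

class CrawlIndex {
  private seen = new Map<string, PageData>();

  // Stores the page; returns false if the URL was already crawled.
  add(page: PageData): boolean {
    if (this.seen.has(page.url)) return false;
    this.seen.set(page.url, page);
    return true;
  }

  size(): number {
    return this.seen.size;
  }
}

const index = new CrawlIndex();
index.add({ url: "/content/a", title: "A" });
index.add({ url: "/content/b", title: "B" });
index.add({ url: "/content/a", title: "A again" }); // duplicate, skipped
console.log(index.size()); // 2
```

A persistent crawler would back this with a database instead of a Map, but the contract is the same: check the URL index before storing new content.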

Contributing

Pull requests and suggestions are welcome! Feel free to open issues for bug reports or feature requests.

License

MIT License

Ready to Start Scraping?

👉 Try Housefly Now

Disclaimer

Housefly is intended for educational purposes. Scraping websites that don't want to be scraped can violate their Terms of Service and, particularly at industrial scale, could get you into trouble.