Awesome AI Web Scraping

A curated list of tools, libraries, and resources for AI-powered web scraping.
Frameworks, hosted APIs, browser infrastructure, MCP servers, and research for turning the web into clean, structured data for LLMs, RAG pipelines, and agents.
Scope: Tools where AI or LLMs play a meaningful role in extraction, navigation, or content understanding. General-purpose scrapers (Scrapy, BeautifulSoup) belong in awesome-web-scraping. Autonomous browser agents belong in awesome-web-agents.
Frameworks & Libraries
Self-hosted, open-source. Most pair a headless browser with an LLM for schema-based or prompt-based extraction.
- Crawl4AI - LLM-friendly web crawler with Markdown output and JSON-schema or LLM-based extraction. Python.

- Scrapling - Adaptive Python framework with smart element tracking that relocates elements after site changes. Cloudflare Turnstile bypass, spider framework with pause/resume, and a built-in MCP server.

- ScrapeGraphAI - Python scraper using LLM + graph pipelines. Describe data in natural language, get typed JSON. Works with OpenAI, Anthropic, Groq, Gemini, Ollama.

- llm-scraper - TypeScript library for structured extraction with Zod schemas. Supports GPT, Claude, Gemini, Llama, Qwen.

- Reader - Jina AI's URL-to-Markdown converter. Engine behind r.jina.ai.
- Stagehand - Browser automation framework with act, extract, and observe primitives over Playwright.
- Browser-Use - Agent framework commonly used for scraping complex, login-walled sites.

- Skyvern - Browser automation for forms, logins, and dynamic content.

- LaVague - Natural language web automation framework.

- CyberScraper 2077 - LLM scraper with Streamlit UI. Supports OpenAI, Gemini, and Ollama. Tor support included.

- ScraperAI - AI scraper with auto-detection of page types, pagination, and catalog cards.

- SpiderCreator - Generates Playwright spiders from natural language prompts.

- PulsarRPA - AI-powered browser automation and data extraction.
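
The schema-based pattern that most of these libraries share can be sketched without committing to any one of them. The example below is a minimal illustration, not the API of any listed tool: `call_llm` is a hypothetical callable standing in for whichever model client you use, and `PRODUCT_SCHEMA` is a simplified field-to-type mapping invented for this sketch.

```python
import json
from typing import Callable

# Simplified schema: field name -> expected Python type. This is an assumption
# for illustration; real libraries typically use JSON Schema, Pydantic, or Zod.
PRODUCT_SCHEMA = {"name": str, "price": float, "in_stock": bool}


def build_prompt(page_text: str, schema: dict) -> str:
    """Ask the model for a single JSON object containing the schema's fields."""
    fields = ", ".join(f'"{key}" ({typ.__name__})' for key, typ in schema.items())
    return (
        "Extract the following fields from the page and reply with one JSON "
        f"object only: {fields}.\n\nPAGE:\n{page_text}"
    )


def validate(record: dict, schema: dict) -> dict:
    """Reject missing fields and wrong types instead of trusting the model."""
    clean = {}
    for key, expected in schema.items():
        if key not in record:
            raise ValueError(f"missing field: {key}")
        value = record[key]
        # Accept ints where floats are expected, but never bools as numbers.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"{key}: expected {expected.__name__}")
        clean[key] = value
    # Fields the model added beyond the schema are dropped.
    return clean


def extract(page_text: str, schema: dict, call_llm: Callable[[str], str]) -> dict:
    """call_llm is hypothetical: any function mapping a prompt to model text."""
    raw = call_llm(build_prompt(page_text, schema))
    return validate(json.loads(raw), schema)
```

Validating the parsed output against the schema is the step that makes the result "typed": a model reply with a missing or mistyped field raises instead of flowing downstream.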

Hosted APIs
Managed services that turn URLs into LLM-ready Markdown or JSON. JS rendering, proxies, and anti-bot handled internally.
- Firecrawl - Scrape, crawl, map, search, agent, and interact endpoints. LLM-ready Markdown. 500 free credits, paid plans from $16/mo.
- Jina Reader - Prepend r.jina.ai/ to any URL for LLM-friendly text. Free tier with no API key required.
- Diffbot - Computer vision and NLP extraction with a knowledge graph layer. Paid.
- Apify - Marketplace of 10,000+ pre-built scrapers ("Actors") plus a runtime for your own. Free tier and paid plans.
- Bright Data - Scraping with 150M+ proxies and pre-built APIs for 120+ sites. Free tier and paid plans.
- Zyte - Scraping API with AI extraction. Formerly Scrapinghub. Paid.
- ScrapingBee - JS rendering, AI extraction, Markdown, and Google SERP support. Free trial and paid plans.
- ZenRows - Anti-bot focused scraping API with Markdown output. Free trial and paid plans.
- Oxylabs - Proxies plus a Web Scraper API with adaptive parsing. Paid.
- Spider - Concurrent crawler and scraper API with LLM-ready output. Free tier and paid plans.
- WebScraping.AI - Scraping API with question-answering and field-extraction endpoints. Free tier and paid plans.
- Scrapeless - Scraping API with anti-bot bypass and structured extraction. Free tier and paid plans.
- Kadoa - Self-healing extraction that adapts when sites change. Paid.
- Expand.ai - Turns any website into a type-safe API. Paid.
- Reworkd - Agentic AI for no-code structured extraction. Paid.
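
The URL-prefix style used by Jina Reader is simple enough to sketch with the standard library. The snippet below is a rough example under stated assumptions: it assumes the endpoint is served over HTTPS at `https://r.jina.ai/`, and the helper names `reader_url` and `fetch_llm_text` are made up for illustration. Check the service's own documentation for rate limits and optional headers.

```python
from urllib.request import Request, urlopen

# Assumption: the Reader endpoint is reachable over HTTPS at this prefix.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Prepend the Reader prefix to a full target URL."""
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("target_url must include the http(s) scheme")
    return READER_PREFIX + target_url


def fetch_llm_text(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the Reader's text rendering of a page (performs a network call)."""
    request = Request(reader_url(target_url), headers={"User-Agent": "reader-example"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_llm_text("https://example.com")` would request `https://r.jina.ai/https://example.com` and return the response body as a string.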
Browser Infrastructure for AI
Headless browsers designed for AI agents and scrapers.
- Steel.dev - Open-source headless browser API for AI agents. Self-host or use the hosted service.

- Browserbase - Hosted headless browser. Powers Stagehand. Paid.
- Hyperbrowser - Browser platform with stealth, scraping, and agent endpoints. Free tier and paid plans.
- Anchor Browser - Browser API with built-in auth and session persistence. Paid.
- Browserless - Headless Chrome as a service. Free tier and paid plans.
- Obscura - Rust-based headless browser. CDP-compatible with Puppeteer and Playwright. Built-in stealth and tracker blocking.

- Browserable - Open-source, self-hostable browser automation library.

No-Code AI Scrapers
Visual or point-and-click tools that use AI to extract data without writing code.
- Browse AI - Chrome extension and SaaS for AI-assisted scraping with scheduled monitoring.
- Bardeen.ai - Chrome extension combining AI scraping with automation across 100+ apps.
- Thunderbit - Two-click Chrome extension with AI "Suggest Fields" for instant extraction.
- Gumloop - Visual workflow builder for scraping, LLM calls, and data transforms.
- Octoparse - Visual scraper with AI-assisted field detection.
- ParseHub - Visual scraper with template-based extraction.
MCP Servers for Scraping
Model Context Protocol servers that expose scraping capabilities to Claude, Cursor, Windsurf, and other LLM clients.
- Firecrawl MCP - Official MCP wrapper for Firecrawl's scrape, crawl, and extract endpoints.

- Bright Data MCP - Search, scrape, and extract from 60+ sources with anti-bot bypass. 5,000 free requests/month.

- Scrapling MCP - Built-in MCP server bundled with Scrapling. Install with pip install "scrapling[ai]".
- Fetch - Anthropic's official fetch MCP server. URL-to-Markdown.
- Browserbase MCP - MCP server exposing Browserbase sessions and Stagehand primitives.

- Puppeteer MCP - Browser automation for scraping and interaction.
- Apify MCP - Run any Apify Actor as an MCP tool.

- WebScraping.AI MCP - MCP integration for WebScraping.AI's extraction tools.
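
Most MCP clients register servers through a JSON config that maps a server name to a launch command. The sketch below generates such a config in Python. Two things here are assumptions to verify against your client and server documentation: the top-level `mcpServers` key (used by several desktop clients) and the `uvx mcp-server-fetch` launch command for the Fetch server. The helper `mcp_config` is hypothetical.

```python
import json


def mcp_config(servers: dict) -> str:
    """Render a client config mapping server names to launch commands.

    `servers` maps a name to a (command, args) pair.
    """
    entries = {
        name: {"command": command, "args": list(args)}
        for name, (command, args) in servers.items()
    }
    # Assumption: the client reads servers from a top-level "mcpServers" key.
    return json.dumps({"mcpServers": entries}, indent=2)


# Assumption: the Fetch server is launched with `uvx mcp-server-fetch`.
config_text = mcp_config({"fetch": ("uvx", ["mcp-server-fetch"])})
print(config_text)
```

Paste the printed JSON into the client's MCP settings file, adjusting the command for each server you add.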
Web Search APIs for LLMs
Search APIs that return structured, LLM-friendly results with full-page content.
- Exa - Neural search API. Returns clean content alongside results.
- Tavily - Search API optimized for LLMs and RAG.
- Linkup - Search API with verified sources.
- Perplexity Sonar - Perplexity's online search and answer API.
- Serper - Fast, low-cost Google search API.
- SerpAPI - Search engine results API.
- Brave Search API - Independent search index.
- You.com API - Web, news, and snippet endpoints.
- Kagi Search API - Premium, ad-free search results.
Proxy & Anti-Bot Infrastructure
- Bright Data - 150M+ proxies, Web Unblocker, browser cloud.
- Oxylabs - Residential, datacenter, and ISP proxies plus Web Unblocker.
- Decodo (Smartproxy) - Residential proxies and scraping APIs.
- NetNut - ISP and residential proxy network.
- ZenRows - Anti-bot proxy and scraping API.
- ScraperAPI - Proxy rotation and CAPTCHA handling.
Datasets
Pre-scraped web data for RAG, training, or benchmarking.
- Common Crawl - The largest public web crawl. Petabytes of pages, monthly updates.
- FineWeb - 15T-token deduplicated web dataset from Hugging Face.
- RedPajama-Data-v2 - 30T-token open web dataset.
- C4 - Colossal Clean Crawled Corpus derived from Common Crawl.
- The Pile - 825 GiB diverse text corpus including web data.
Benchmarks & Research
- SWDE - Structured Web Data Extraction benchmark from Microsoft Research.
- WebSRC - Dataset for web-based structural reading comprehension.
- AXE - Research on DOM pruning for token-efficient LLM extraction.
- NEXT-EVAL - Benchmark comparing HTML representations for LLM extraction accuracy.
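
To make the idea of DOM pruning concrete, here is a generic sketch using only the standard library. It is not the method from AXE or any benchmark above; the choice of tags to drop (`SKIP_TAGS`) and attributes to keep (`KEEP_ATTRS`) is an assumption for illustration. It removes subtrees that rarely carry extractable content, strips most attributes, and collapses whitespace so fewer tokens reach the LLM.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative choices, not taken from any paper.
SKIP_TAGS = {"script", "style", "noscript", "svg", "iframe", "template"}
KEEP_ATTRS = {"href", "alt", "title"}


class DomPruner(HTMLParser):
    """Re-emit HTML without skipped subtrees, comments, or noisy attributes."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = ""
        for name, value in attrs:
            if name in KEEP_ATTRS:
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.parts.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.parts.append(escape(text, quote=False))


def prune_html(html: str) -> str:
    """Return a smaller HTML string suitable for an extraction prompt."""
    pruner = DomPruner()
    pruner.feed(html)
    pruner.close()
    return "".join(pruner.parts)
```

Comments are dropped because the parser's default comment handler emits nothing. A real pipeline would likely measure the token count before and after pruning and tune the tag and attribute sets per site.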
Tutorials & Guides
Contributing
Contributions welcome. Open a pull request to add a new tool or resource.
Guidelines:
- Keep entries focused on AI/LLM-powered scraping. Generic scrapers belong elsewhere.
- Follow the format: `- [Name](url) - One-line description.`
- Add the GitHub stars badge for open-source projects.
- Mention pricing in the description if relevant (free tier, paid, etc.).
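
A quick way to check a new entry against the format guideline before opening a pull request is a small regex, sketched below. This is not an official linter for this list: `ENTRY_RE` and `check_entry` are hypothetical names, the requirement that the description end with a period is inferred from the format example, and the pattern ignores the optional stars badge.

```python
import re

# Matches "- [Name](url) - Description" with an http(s) URL.
ENTRY_RE = re.compile(
    r"^- \[(?P<name>[^\]]+)\]\((?P<url>https?://[^\s)]+)\) - (?P<desc>\S.*)$"
)


def check_entry(line: str) -> list:
    """Return a list of problems found in one list entry (empty if none)."""
    match = ENTRY_RE.match(line.rstrip())
    if match is None:
        return ["does not match '- [Name](url) - One-line description.'"]
    problems = []
    # Assumption: descriptions end with a period, as in the format example.
    if not match["desc"].endswith("."):
        problems.append("description should end with a period")
    return problems
```

For example, `check_entry("- [Example Tool](https://example.com) - Hypothetical LLM scraper. Free tier.")` returns an empty list.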
License

To the extent possible under law, the contributors have waived all copyright and related rights to this work.