AI-first web scraping engine with stealth bypass, MCP server, and multimodal output (Markdown, JSON, PDF) for agents and
Expertly crafted by Roy Dawson IV
Web Scraper Toolkit is a production-grade scraping and browser automation platform for:
You can run it as:
- CLI (web-scraper)
- MCP server (web-scraper-server)

- […] (chrome, msedge, chromium) when blocked.
- […] host_profiles.json.

Default behavior is tuned for safety + resilience:
- […] on_blocked.
- […] (host_profiles_read_only=true) to apply-only with no writes.
- […] 429 Too Many Requests (sorry/index) bans.

pip install web-scraper-toolkit
playwright install
Optional desktop solver support:
pip install web-scraper-toolkit[desktop]
playwright install
Run a first scrape:
web-scraper --url https://example.com --format markdown --export


These diagrams are rendered from Mermaid source files for GitHub/PyPI compatibility. Sources:
docs/diagrams/*.mmd
Minimal:
web-scraper --url https://example.com --format markdown --export
Batch + merge:
web-scraper --input urls.txt --workers auto --format text --merge --output-name merged.txt
Diagnostics wrapper:
web-scraper --run-diagnostic challenge_matrix --diagnostic-url https://target-site.tld/resource --diagnostic-runs-per-variant 2
Optional toolkit auto-commit (off by default):
web-scraper --run-diagnostic toolkit_route --diagnostic-url https://target-site.tld/resource --diagnostic-auto-commit-host-profile
Strict progression gating + artifact capture:
web-scraper \
  --run-diagnostic toolkit_route \
  --diagnostic-url https://target-site.tld/resource \
  --diagnostic-require-2xx \
  --diagnostic-save-artifacts \
  --diagnostic-artifacts-dir ./scripts/out/artifacts
Deterministic fixture replay / recording for regression analysis:
python scripts/diag_toolkit_route.py --fixture-replay ./tests/fixtures/challenge/cloudflare_blocked.json
python scripts/diag_toolkit_route.py --url https://target-site.tld/resource --fixture-record ./tests/fixtures/challenge/latest_toolkit_fixture.json
python scripts/diag_challenge_matrix.py --fixture-replay ./tests/fixtures/challenge/zoominfo_px_then_cf_loop.json
Cloudflare stealth-strategy matrix testing:
python scripts/diag_cloudflare_matrix.py --url https://target-site.tld/challenge
Local stdio:
web-scraper-server --stdio
Remote transport:
web-scraper-server --transport streamable-http --host 127.0.0.1 --port 8000 --path /mcp
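Most MCP clients launch a stdio server from a small JSON entry that names the command and its arguments. The sketch below builds such an entry for the stdio command above and derives the endpoint URL for the remote transport example. It assumes the widely used `mcpServers` layout and plain HTTP; the label `web-scraper-toolkit` and both function names are illustrative and not part of the toolkit, so check your client's documentation for its exact schema.

```python
import json


def stdio_server_entry(label: str = "web-scraper-toolkit") -> dict:
    """Build a client config entry that launches the server over stdio.

    Assumption: the client reads the common "mcpServers" layout.
    """
    return {
        "mcpServers": {
            label: {
                "command": "web-scraper-server",  # console script from this README
                "args": ["--stdio"],              # flag from the local stdio example
            }
        }
    }


def streamable_http_url(host: str = "127.0.0.1", port: int = 8000, path: str = "/mcp") -> str:
    """Compose the endpoint for the remote transport example (assumes plain HTTP)."""
    return f"http://{host}:{port}{path}"


if __name__ == "__main__":
    print(json.dumps(stdio_server_entry(), indent=2))
    print(streamable_http_url())
```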
import asyncio

from web_scraper_toolkit.browser.config import BrowserConfig
from web_scraper_toolkit.browser.playwright_handler import PlaywrightManager


async def main() -> None:
    cfg = BrowserConfig.from_dict({
        "headless": True,
        "browser_type": "chromium",
        "native_fallback_policy": "on_blocked",
        "host_profiles_enabled": True,
        "host_profiles_path": "./host_profiles.json",
        "host_profiles_read_only": False,
    })
    async with PlaywrightManager(cfg) as manager:
        content, final_url, status = await manager.smart_fetch("https://example.com")
        print({"status": status, "url": final_url, "has_content": bool(content)})


asyncio.run(main())
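To fetch several URLs from Python, one option is a small helper that caps concurrency and collects failures instead of raising. This is a sketch rather than toolkit API: `fetch_many` is a hypothetical name, it accepts any async callable with the same shape as `manager.smart_fetch`, and whether one `PlaywrightManager` is safe for concurrent calls is an assumption you should verify (use `limit=1` if unsure).

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def fetch_many(
    fetch: Callable[[str], Awaitable[Any]],
    urls: Iterable[str],
    limit: int = 3,
) -> dict:
    """Run `fetch` over `urls` with at most `limit` calls in flight.

    Returns a dict mapping each URL to its result, or to the exception
    raised for that URL, so one failure does not abort the batch.
    """
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url: str):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:  # keep going; report the error per URL
                return url, exc

    pairs = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(pairs)
```

Inside the `async with PlaywrightManager(cfg) as manager:` block you would call `results = await fetch_many(manager.smart_fetch, urls)`.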
When toolkit enters OS-level mouse challenge solving:
- pyautogui failsafe remains active (move cursor to a screen corner to abort).

Optional env override:
- WST_OS_INPUT_WARNING_SECONDS (default: 3)

Precedence order:
- Environment variables (WST_*)
- settings.local.cfg / settings.cfg
- config.json

Key files:
- config.example.json
- settings.example.cfg
- host_profiles.example.json
- INSTRUCTIONS.md (full operations runbook)

For exhaustive setup, deployment, troubleshooting, CLI/MCP option coverage, and diagnostics workflows, read:
- INSTRUCTIONS.md
- docs/config_schema.md (config + host profile schema contract)
- docs/api_stability.md (API/deprecation policy)
- docs/support_matrix.md (platform/browser support matrix)
- docs/release_checklist.md (ship checklist)

Canonical script diagnostics now use scripts/diag_*.py names.
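The precedence order listed above can be pictured as layers merged from lowest to highest priority. The sketch below is not the toolkit's loader. It assumes the list runs from highest to lowest precedence, that settings.local.cfg overrides settings.cfg, that config.json is a flat object, and that a `WST_FOO` variable maps to the key `foo`; the section name `settings` and the function name are hypothetical.

```python
import configparser
import json
import os
from pathlib import Path
from typing import Any, Mapping, Optional

ENV_PREFIX = "WST_"  # prefix seen in WST_OS_INPUT_WARNING_SECONDS


def load_layered_config(
    base_dir: Path,
    env: Optional[Mapping[str, str]] = None,
    section: str = "settings",  # hypothetical .cfg section name
) -> dict[str, Any]:
    """Merge config sources so that higher-precedence layers win."""
    env = os.environ if env is None else env
    merged: dict[str, Any] = {}

    # Lowest precedence: config.json (assumed flat for this sketch).
    json_path = base_dir / "config.json"
    if json_path.is_file():
        merged.update(json.loads(json_path.read_text(encoding="utf-8")))

    # Middle: settings.cfg, then settings.local.cfg layered on top.
    for name in ("settings.cfg", "settings.local.cfg"):
        parser = configparser.ConfigParser()
        parser.read(base_dir / name, encoding="utf-8")  # missing files are skipped
        if parser.has_section(section):
            merged.update(parser[section])  # values arrive as strings

    # Highest: WST_* environment variables, e.g. WST_TIMEOUT -> "timeout".
    for key, value in env.items():
        if key.startswith(ENV_PREFIX):
            merged[key[len(ENV_PREFIX):].lower()] = value

    return merged
```

Values from .cfg files and the environment stay strings here; a real loader would coerce them against the schema in docs/config_schema.md.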
Truthfulness note:
The following output blocks are copied from deterministic command runs in this repository.
diag_toolkit_zoominfo --help
Command:
python scripts/diag_toolkit_zoominfo.py --help
Expected output:
usage: diag_toolkit_zoominfo.py [-h] [--url URL] [--timeout-ms TIMEOUT_MS]
                                [--skip-interactive]
                                [--include-headless-stage]
                                [--log-level {DEBUG,INFO,WARNING,ERROR}]
                                [--auto-commit-host-profile]
                                [--host-profiles-path HOST_PROFILES_PATH]
                                [--read-only] [--require-2xx]
                                [--save-artifacts]
                                [--artifacts-dir ARTIFACTS_DIR]
Command:
python -m web_scraper_toolkit.cli --help
Expected excerpt:
  --diagnostic-require-2xx
                        Require final HTTP 2xx status for toolkit diagnostic
                        stage success.
  --diagnostic-save-artifacts
                        Persist per-stage diagnostic artifacts for toolkit
                        route diagnostics.
  --diagnostic-artifacts-dir DIAGNOSTIC_ARTIFACTS_DIR
                        Optional artifacts directory override for toolkit
                        route diagnostics.
File/fixture expectation used in tests/test_script_diagnostics.py:
{
  "summary": {
    "progressed_stages": 1
  }
}
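A check against that expectation could look like the sketch below. It assumes only the JSON shape shown above; the function name is hypothetical and this is not the code in tests/test_script_diagnostics.py.

```python
import json


def progressed_stages(report_text: str) -> int:
    """Return summary.progressed_stages from a diagnostic report, or 0 if absent."""
    report = json.loads(report_text)
    return int(report.get("summary", {}).get("progressed_stages", 0))
```

A test can then assert `progressed_stages(text) >= 1` for a run that is expected to get past at least one stage.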
Before release tags, execute and verify:
ruff format --check .
ruff check src
mypy
pytest -q -m "not integration"
python -m build
python -m twine check dist/*
python scripts/clean_workspace.py --dry-run
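If you prefer to run these gates from one script that stops at the first failure, a minimal sketch follows. The command list mirrors the steps above; `run_gates` and its injectable `run` parameter are illustrative names, `python` is replaced by the current interpreter, and the `dist/*` pattern is expanded in Python because no shell is involved.

```python
import glob
import subprocess
import sys
from typing import Callable, Optional, Sequence

GATES = [
    ["ruff", "format", "--check", "."],
    ["ruff", "check", "src"],
    ["mypy"],
    ["pytest", "-q", "-m", "not integration"],
    [sys.executable, "-m", "build"],
    [sys.executable, "-m", "twine", "check", "dist/*"],
    [sys.executable, "scripts/clean_workspace.py", "--dry-run"],
]


def _expand(argv: Sequence[str]) -> list:
    """Expand glob patterns at run time (dist/ only exists after the build step)."""
    expanded = []
    for arg in argv:
        matches = sorted(glob.glob(arg)) if "*" in arg else []
        expanded.extend(matches or [arg])  # keep the literal if nothing matches
    return expanded


def _default_run(argv: Sequence[str]) -> int:
    return subprocess.run(list(argv), check=False).returncode


def run_gates(run: Callable[[Sequence[str]], int] = _default_run) -> Optional[list]:
    """Run each gate in order; return the first failing command, or None."""
    for gate in GATES:
        argv = _expand(gate)
        if run(argv) != 0:
            return argv
    return None


if __name__ == "__main__":
    failed = run_gates()
    if failed:
        print("gate failed:", " ".join(failed))
        sys.exit(1)
```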
For full release/security gates, see docs/release_checklist.md.
Details and limitations: docs/support_matrix.md.
Created by: Roy Dawson IV
GitHub: https://github.com/imyourboyroy
PyPI: https://pypi.org/user/ImYourBoyRoy/
Host-learning now has an explicit operator CLI so you can inspect, diff, and manage learned routing without digging through JSON manually.
web-scraper-hosts --path ./host_profiles.json summary
web-scraper-hosts --path ./host_profiles.json inspect zoominfo.com
web-scraper-hosts --path ./host_profiles.json diff zoominfo.com
web-scraper-hosts --path ./host_profiles.json promote zoominfo.com
web-scraper-hosts --path ./host_profiles.json demote zoominfo.com
web-scraper-hosts --path ./host_profiles.json reset zoominfo.com
JSON output is available for automation:
web-scraper-hosts --path ./host_profiles.json --json inspect zoominfo.com
This keeps host-learning mutations explicit:
- inspect / diff / summary are read-only
- promote / demote / reset mutate the store intentionally
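When scripting around web-scraper-hosts, the read-only versus mutating split can be enforced before anything is executed. The sketch below only builds the argument list, using the flag order from the examples above; `hosts_argv` and `allow_mutation` are hypothetical names, and `--json` is shown in this README only for `inspect`, so its support on other subcommands is an assumption.

```python
from typing import Optional

READ_ONLY = {"summary", "inspect", "diff"}
MUTATING = {"promote", "demote", "reset"}


def hosts_argv(
    path: str,
    subcommand: str,
    host: Optional[str] = None,
    *,
    as_json: bool = False,
    allow_mutation: bool = False,
) -> list:
    """Build a web-scraper-hosts command, refusing mutations unless opted in."""
    if subcommand not in READ_ONLY | MUTATING:
        raise ValueError(f"unknown subcommand: {subcommand}")
    if subcommand in MUTATING and not allow_mutation:
        raise PermissionError(f"{subcommand} mutates the store; pass allow_mutation=True")

    argv = ["web-scraper-hosts", "--path", path]
    if as_json:
        argv.append("--json")  # placed before the subcommand, as in the example above
    argv.append(subcommand)
    if host is not None:
        argv.append(host)      # summary takes no host in the examples above
    return argv
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)`, and the JSON output parsed with `json.loads`.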