WebSearchMCP
Web search, page fetching, and content extraction as MCP tools.
Part of the MCP AI Suite.
Features
- 4 MCP tools: web_search, web_answer, fetch_webpage, browser_fetch
- 4 search engines: SearXNG (priority), DuckDuckGo, Mojeek, Brave
- web_answer: search + rerank → top relevant sources plus an extractive summary, ready for the client LLM to synthesize a cited answer (no key required)
- Optional cross-encoder reranking: reorders results by relevance via fastembed.TextCrossEncoder (optional dependency; no-op fallback if not installed, no torch)
- WebExtractor: HTML → clean markdown with search result parsing
- Browser fetch: Playwright Chromium with stealth mode, JS rendering, screenshots
- CAPTCHA detection: auto-skips blocked engines
- Four surfaces: Python library, websearchmcp CLI, websearchmcp-server MCP server, and an optional FastAPI app (websearchmcp-api)
- MCP-native: works with Claude Desktop, Cursor, VS Code
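The optional reranking step can be pictured as a function that reorders results when a scorer is available and returns them unchanged otherwise. The sketch below is illustrative only: `rerank_results`, the `Scorer` type, and `overlap_scorer` are hypothetical names, and the toy word-overlap scorer merely stands in for a real cross-encoder. It does not reproduce the package's own implementation.

```python
from typing import Callable, Optional, Sequence

# Hypothetical scorer type: maps (query, documents) to one relevance score per document.
Scorer = Callable[[str, Sequence[str]], Sequence[float]]


def rerank_results(query: str, results: list, scorer: Optional[Scorer] = None) -> list:
    """Reorder results by descending score; return them unchanged when no scorer is given."""
    if scorer is None or not results:
        return list(results)  # no-op fallback, e.g. when the optional dependency is missing
    scores = scorer(query, [r.get("snippet", "") for r in results])
    # Sort by score (highest first); ties keep their original order via the index.
    ranked = sorted(zip(scores, range(len(results))), key=lambda pair: (-pair[0], pair[1]))
    return [results[i] for _, i in ranked]


def overlap_scorer(query: str, documents: Sequence[str]) -> list:
    """Toy stand-in for a cross-encoder: counts query words present in each document."""
    words = set(query.lower().split())
    return [float(len(words & set(doc.lower().split()))) for doc in documents]
```

Passing the scorer as a parameter keeps the fallback path trivial: without one, the original engine order is preserved.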
Installation
pip install mcpaisuite-websearchmcp
Quick Start
Python API
import asyncio
from websearchmcp import WebSearchFactory

async def main():
    ws = WebSearchFactory.from_env()
    results = await ws.search("latest AI news", max_results=10)  # web search
    page = await ws.fetch("https://example.com")  # HTTP fetch + extraction
    rendered = await ws.browser_fetch("https://spa-app.com")  # browser rendering

asyncio.run(main())
MCP Server
websearchmcp-server
CLI
websearchmcp search "Python best practices 2026"
websearchmcp fetch https://example.com
websearchmcp fetch --browser https://spa-app.com
Configuration
| Env Variable | Description | Default |
|---|---|---|
| SEARXNG_URL | SearXNG instance URL | (none; uses fallback engines) |
| WEBSEARCH_ENGINES | Comma-separated engines to use | duckduckgo,mojeek,brave |
| WEBSEARCH_MAX_LENGTH | Max extracted content length (chars) | 8000 |
| WEBSEARCH_PROXIES | Comma-separated proxy URLs | (none) |
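To make the defaults in the table concrete, here is a minimal sketch of how these variables could be parsed. `Settings` and `load_settings` are hypothetical names used for illustration; the package's own `WebSearchFactory.from_env()` may parse them differently.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


def _split_csv(value: str) -> List[str]:
    """Split a comma-separated string, dropping blanks and surrounding whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


@dataclass
class Settings:  # hypothetical container, not the library's own type
    searxng_url: Optional[str]
    engines: List[str]
    max_length: int
    proxies: List[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the defaults from the table above."""
    return Settings(
        searxng_url=env.get("SEARXNG_URL") or None,
        engines=_split_csv(env.get("WEBSEARCH_ENGINES", "duckduckgo,mojeek,brave")),
        max_length=int(env.get("WEBSEARCH_MAX_LENGTH", "8000")),
        proxies=_split_csv(env.get("WEBSEARCH_PROXIES", "")),
    )
```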
SearXNG Setup (recommended)
docker run -d -p 9999:8080 searxng/searxng
export SEARXNG_URL=http://localhost:9999
Architecture
Query → SearXNG (JSON API, no CAPTCHA)
  ↓ fallback if unavailable
→ DuckDuckGo HTML → WebExtractor
  ↓ fallback
→ Mojeek HTML → WebExtractor
  ↓ fallback
→ Brave HTML → WebExtractor
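The fallback chain above can be sketched as a loop over engines in priority order. This is an illustration under stated assumptions, not the package's code: `search_with_fallback`, the `Engine` signature, and the demo engines are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical engine signature: an async callable taking a query and returning result dicts.
Engine = Callable[[str], Awaitable[list]]


async def search_with_fallback(query: str, engines: List[Tuple[str, Engine]]) -> Tuple[str, list]:
    """Try engines in order; return (engine_name, results) from the first that yields results."""
    for name, engine in engines:
        try:
            results = await engine(query)
        except Exception:
            continue  # engine unavailable: fall through to the next one
        if results:
            return name, results
    return "", []  # every engine failed or returned nothing


async def _demo() -> None:
    async def searxng(query: str) -> list:
        raise ConnectionError("instance unavailable")  # simulated outage

    async def duckduckgo(query: str) -> list:
        return [{"title": "Example", "url": "https://example.com", "snippet": "..."}]

    name, results = await search_with_fallback(
        "test", [("searxng", searxng), ("duckduckgo", duckduckgo)]
    )
    print(name, len(results))


if __name__ == "__main__":
    asyncio.run(_demo())
```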
Resilience & performance
The pipeline wraps every engine in a resilience layer so transient failures, bot challenges, and repeated queries don't degrade results. All of the components below are in-memory and per-process (no external dependencies).
Engine selection
Engines are chosen via WEBSEARCH_ENGINES (comma-separated, default
duckduckgo,mojeek,brave). SearXNG is enabled separately by setting
SEARXNG_URL. When configured, SearXNG is always tried first (priority 1,
reliable and CAPTCHA-free); the engines listed in WEBSEARCH_ENGINES are then
rotated through in order as fallbacks. Unknown engine names are silently
skipped. Supported engines: searxng, duckduckgo, mojeek, brave.
If an engine returns results that look like a CAPTCHA challenge (titles too short / empty), the result is discarded, the engine's circuit records a failure, and the engine is not retried for that query.
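A heuristic of this kind might look like the sketch below. The threshold and the "every title" rule are assumptions made for illustration (the source gives neither), and `looks_like_captcha` is a hypothetical name.

```python
MIN_TITLE_LENGTH = 4  # assumed threshold; the actual value is not documented here


def looks_like_captcha(results: list, min_title_length: int = MIN_TITLE_LENGTH) -> bool:
    """Return True when every result title is empty or shorter than the threshold."""
    if not results:
        return False  # an empty result set is not treated as a challenge in this sketch
    return all(
        len((result.get("title") or "").strip()) < min_title_length
        for result in results
    )
```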
Circuit breaker
Each engine has its own circuit breaker. After 3 consecutive failures, the engine's circuit opens and it is skipped for a 300-second (5-minute) cooldown. Once the cooldown elapses, the circuit half-opens and the engine is tried again; a successful call resets the failure count to zero.
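The behaviour described above maps onto a small state machine. The following is a minimal sketch using the documented numbers (3 failures, 300-second cooldown); the class and method names are hypothetical, and the injectable `clock` exists only to make the sketch testable.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Per-engine breaker: opens after N consecutive failures, half-opens after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True when the circuit is closed, or open but past its cooldown (half-open)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        """A successful call closes the circuit and resets the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; at or above the threshold, (re)open and restart the cooldown."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would check `allow()` before querying an engine and report the outcome with `record_success()` or `record_failure()`. A failed trial call in the half-open state reopens the circuit for another full cooldown.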
Per-engine rate limiter
A sliding-window rate limiter allows at most 10 requests per minute per engine (60-second window). Engines over the limit are skipped for that query rather than queued.
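A sliding-window limiter with these parameters can be sketched with a deque of timestamps. The names below are hypothetical, and the injectable `clock` is there only so the behaviour can be checked without waiting.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls within any trailing `window` seconds."""

    def __init__(self, max_requests: int = 10, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window = window
        self._clock = clock
        self._stamps: deque = deque()  # timestamps of accepted requests, oldest first

    def try_acquire(self) -> bool:
        """Record a request and return True if under the limit; otherwise return False."""
        now = self._clock()
        # Drop timestamps that have slid out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_requests:
            return False  # over the limit: the caller skips this engine instead of waiting
        self._stamps.append(now)
        return True
```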
In-memory TTL cache
Search results are cached in memory keyed by query:max_results. Entries expire
after a 300-second TTL, and the cache holds up to 100 entries (when full,
the oldest entry is evicted). A cache hit short-circuits the entire engine
pipeline, so repeated identical queries return instantly without hitting any
engine.
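The cache semantics described above (a `query:max_results` key, 300-second TTL, 100 entries, oldest-first eviction) can be sketched as follows. `TTLCache` and its methods are hypothetical names, not the package's API.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLCache:
    """In-memory cache with per-entry expiry and oldest-first eviction when full."""

    def __init__(self, ttl: float = 300.0, max_entries: int = 100,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.max_entries = max_entries
        self._clock = clock
        self._entries: OrderedDict = OrderedDict()  # key -> (stored_at, value)

    @staticmethod
    def make_key(query: str, max_results: int) -> str:
        return f"{query}:{max_results}"

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl:
            del self._entries[key]  # expired
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries.pop(key, None)  # re-inserting moves the key to the newest position
        while len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry
        self._entries[key] = (self._clock(), value)
```

A search function would call `get(TTLCache.make_key(query, max_results))` first and only run the engine pipeline on a miss.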
Result deduplication
Results aggregated across engines are deduplicated by normalized URL:
netloc + path, lower-cased with the trailing slash and query string stripped.
This collapses the same page surfaced by multiple engines (or with tracking
params) into a single result. Deduplication runs before results are truncated to
max_results and cached.
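The normalization rule above translates directly into a few lines with the standard library. This is a sketch of the described rule; `normalize_url` and `dedupe` are hypothetical names, and results are assumed to be dicts with a `url` key.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """netloc + path, lower-cased, with the query string and trailing slash removed."""
    parts = urlsplit(url)
    return (parts.netloc + parts.path).lower().rstrip("/")


def dedupe(results: list) -> list:
    """Keep the first result for each normalized URL, preserving order."""
    seen = set()
    unique = []
    for result in results:
        key = normalize_url(result["url"])
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique
```

Because only netloc and path are compared, `http://` and `https://` variants of the same page also collapse into one result.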
Proxy support
Set WEBSEARCH_PROXIES to a comma-separated list of proxy URLs (e.g.
http://p1:8080,socks5://p2:1080). The pipeline applies them in round-robin
rotation to outbound HTTP fetches. When unset, no proxy is used. The pipeline
also rotates the User-Agent header on each fetch.
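Round-robin proxy rotation can be sketched with `itertools.cycle`. The helper name `proxy_rotation` is hypothetical; it takes the raw `WEBSEARCH_PROXIES` value and yields the proxy to use for each successive fetch, or `None` when the variable is unset.

```python
import itertools
from typing import Iterator, Optional


def proxy_rotation(proxies_env: str) -> Iterator[Optional[str]]:
    """Yield proxies in round-robin order; yield None forever when none are configured."""
    proxies = [p.strip() for p in proxies_env.split(",") if p.strip()]
    if not proxies:
        return itertools.repeat(None)  # no proxy configured
    return itertools.cycle(proxies)


# Usage: take one proxy per outbound request.
rotation = proxy_rotation("http://p1:8080,socks5://p2:1080")
first_three = [next(rotation) for _ in range(3)]
```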
Explicit configuration
WebSearchFactory.from_env() reads the environment variables above. For
programmatic control, WebSearchFactory.create() takes the same settings
explicitly:
from websearchmcp import WebSearchFactory
ws = WebSearchFactory.create(
    searxng_url="http://localhost:9999",        # optional; tried first when set
    engines=["duckduckgo", "mojeek", "brave"],
    max_length=8000,                            # max extracted content length (chars)
    proxies=["http://p1:8080"],                 # optional; round-robin rotation
)
Tools
web_search
Search the web. Returns a numbered list of {title, url, snippet}.
web_answer
Search and rerank in one call. Returns the top relevant sources plus an extractive summary, ready for the client LLM to synthesize a cited answer. No key required.
fetch_webpage
Fetch a URL and extract clean markdown. Falls back automatically to browser_fetch for JS-heavy sites.
browser_fetch
Full Playwright Chromium rendering with stealth mode. Intended for SPAs, JS-rendered content, and sites that block plain HTTP clients.
Integration with MCP AI Suite
websearchmcp is automatically integrated when used with kernelmcp:
# ~/.kernelmcp/config.yaml
# No config needed; websearchmcp is auto-detected
The kernel's orchestrator routes the web_search, fetch_webpage, and browser_fetch tools to websearchmcp.
License
AGPL-3.0-or-later (a commercial license is available; contact the maintainer).