WebSearchMCP

Web search, page fetching, and content extraction as MCP tools.

Features

4 MCP tools: web_search, web_answer, fetch_webpage, browser_fetch
4 search engines: SearXNG (priority), DuckDuckGo, Mojeek, Brave
web_answer: search + rerank → top relevant sources plus an extractive summary, ready for the client LLM to synthesize a cited answer (no key required)
Optional cross-encoder reranking: reorders results by relevance via fastembed.TextCrossEncoder (an optional dependency that does not require torch; reranking is a no-op if it is not installed)
WebExtractor: HTML → clean markdown with search result parsing
Browser fetch: Playwright Chromium with stealth mode, JS rendering, screenshots
CAPTCHA detection: auto-skips blocked engines
Four surfaces: Python library, websearchmcp CLI, websearchmcp-server MCP server, and an optional FastAPI app (websearchmcp-api)
MCP-native: works with Claude Desktop, Cursor, VS Code

Installation

pip install mcpaisuite-websearchmcp

Quick Start

Python API

import asyncio

from websearchmcp import WebSearchFactory

async def main():
    ws = WebSearchFactory.from_env()
    results = await ws.search("latest AI news", max_results=10)  # web search
    page = await ws.fetch("https://example.com")  # plain HTTP fetch
    rendered = await ws.browser_fetch("https://spa-app.com")  # JS-rendered fetch

asyncio.run(main())

MCP Server

websearchmcp-server

CLI

websearchmcp search "Python best practices 2026"
websearchmcp fetch https://example.com
websearchmcp fetch --browser https://spa-app.com

Configuration

Env Variable	Description	Default
`SEARXNG_URL`	SearXNG instance URL	(none — uses fallback engines)
`WEBSEARCH_ENGINES`	Comma-separated engines to use	`duckduckgo,mojeek,brave`
`WEBSEARCH_MAX_LENGTH`	Max extracted content length (chars)	`8000`
`WEBSEARCH_PROXIES`	Comma-separated proxy URLs	(none)
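To illustrate how these variables and their defaults fit together, the sketch below parses them into a plain settings object. `Settings`, `load_settings`, and `_split_csv` are hypothetical names used only for this example and are not part of websearchmcp; in practice `WebSearchFactory.from_env()` reads the environment for you.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


def _split_csv(raw: str) -> List[str]:
    """Split a comma-separated value, dropping blanks and surrounding spaces."""
    return [item.strip() for item in raw.split(",") if item.strip()]


@dataclass
class Settings:  # hypothetical container, for illustration only
    searxng_url: Optional[str]
    engines: List[str]
    max_length: int
    proxies: List[str]


def load_settings(env=None) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        searxng_url=env.get("SEARXNG_URL") or None,
        engines=_split_csv(env.get("WEBSEARCH_ENGINES", "duckduckgo,mojeek,brave")),
        max_length=int(env.get("WEBSEARCH_MAX_LENGTH", "8000")),
        proxies=_split_csv(env.get("WEBSEARCH_PROXIES", "")),
    )
```

Passing a dict instead of relying on `os.environ` keeps the sketch easy to try without touching the real environment.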

SearXNG Setup (recommended)

docker run -d -p 9999:8080 searxng/searxng
export SEARXNG_URL=http://localhost:9999

Architecture

Query → SearXNG (JSON API, no CAPTCHA)
     ↓ fallback if unavailable
     → DuckDuckGo HTML → WebExtractor
     ↓ fallback
     → Mojeek HTML → WebExtractor  
     ↓ fallback
     → Brave HTML → WebExtractor
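The fallback chain in the diagram can be sketched roughly as follows. This is a simplified illustration, not the library's code: `search_with_fallback` is a hypothetical helper, each engine is modelled as an async callable, and the sketch stops at the first engine that returns results, whereas the real pipeline may combine results from several engines.

```python
import asyncio


async def search_with_fallback(query, engines):
    """Try each (name, async_callable) pair in order; return the first hit.

    Returns a (engine_name, results) tuple, or (None, []) if every engine
    failed or came back empty.
    """
    for name, engine in engines:
        try:
            results = await engine(query)
        except Exception:
            # Engine unavailable or errored: fall through to the next one.
            continue
        if results:
            return name, results
    return None, []
```

The engine list would be ordered SearXNG, DuckDuckGo, Mojeek, Brave to mirror the diagram.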

Resilience & performance

The pipeline wraps every engine in a resilience layer so transient failures, bot challenges, and repeated queries don’t degrade results. All of the components below are in-memory and per-process (no external dependencies).

Engine selection

Engines are chosen via WEBSEARCH_ENGINES (comma-separated, default duckduckgo,mojeek,brave). SearXNG is enabled separately by setting SEARXNG_URL. When configured, SearXNG is always tried first (priority 1, reliable and CAPTCHA-free); the engines listed in WEBSEARCH_ENGINES are then rotated through in order as fallbacks. Unknown engine names are silently skipped. Supported engines: searxng, duckduckgo, mojeek, brave.
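The ordering rules above can be sketched as a small function. `resolve_engine_order` and `FALLBACK_ENGINES` are hypothetical names for illustration; the sketch also assumes that a `searxng` entry inside WEBSEARCH_ENGINES is ignored (SearXNG being controlled by SEARXNG_URL alone) and that duplicate names are dropped, neither of which is stated here.

```python
import os

# Fallback engine names listed as supported in this README.
FALLBACK_ENGINES = ("duckduckgo", "mojeek", "brave")


def resolve_engine_order(env=None):
    """Return engine names in the order they would be tried."""
    env = os.environ if env is None else env
    order = []
    if env.get("SEARXNG_URL"):
        order.append("searxng")  # always first when configured
    raw = env.get("WEBSEARCH_ENGINES", "duckduckgo,mojeek,brave")
    for name in (part.strip() for part in raw.split(",")):
        if name in FALLBACK_ENGINES and name not in order:
            order.append(name)
        # Any other name is skipped silently, as described above.
    return order
```

For example, `WEBSEARCH_ENGINES="brave, bing,mojeek"` with no SEARXNG_URL would yield `brave` then `mojeek` under these assumptions.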

If an engine returns results that look like a CAPTCHA challenge (titles that are empty or too short), the results are discarded, the engine’s circuit breaker records a failure, and the engine is not retried for that query.
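A title-based check of this kind might look like the sketch below. The function name, the 5-character threshold, the "every title" rule, and the treatment of an empty result list are all assumptions made for illustration; the actual heuristic may differ.

```python
def looks_like_captcha(results, min_title_len=5):
    """Heuristic: True when every result title is empty or very short.

    `min_title_len` is an assumed threshold, not a documented value.
    """
    if not results:
        # Assumption: an empty list means "no results", not a challenge page.
        return False
    return all(len(r.get("title", "").strip()) < min_title_len for r in results)
```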

Circuit breaker

Each engine has its own circuit breaker. After 3 consecutive failures the engine’s circuit opens and it is skipped for a 300-second (5-minute) cooldown. Once the cooldown elapses the circuit half-opens and the engine is tried again; a successful call resets the failure count to zero.
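A minimal sketch of a breaker with these parameters is shown below. The `CircuitBreaker` class and its method names are hypothetical, and the injectable `clock` exists only to make the example testable; the library's internal implementation may be structured differently.

```python
import time


class CircuitBreaker:
    """Per-engine breaker: opens after N consecutive failures, then cools down."""

    def __init__(self, failure_threshold=3, cooldown=300.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if the engine may be tried right now."""
        if self._opened_at is None:
            return True
        # Half-open: allow a trial call once the cooldown has elapsed.
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self):
        """A successful call closes the circuit and resets the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

In this sketch a failure during the half-open trial restarts the cooldown, which is a common choice but an assumption here.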

Per-engine rate limiter

A sliding-window rate limiter allows at most 10 requests per minute per engine (60-second window). Engines over the limit are skipped for that query rather than queued.
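The sliding-window behaviour can be sketched with a deque of timestamps. `SlidingWindowLimiter` and `try_acquire` are hypothetical names, and the injectable `clock` is there only so the example can be exercised deterministically.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls within any `window`-second span."""

    def __init__(self, max_requests=10, window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self._clock = clock
        self._stamps = deque()  # timestamps of recent requests, oldest first

    def try_acquire(self):
        """Record a request and return True, or return False if over the limit."""
        now = self._clock()
        # Drop timestamps that have slid out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_requests:
            return False  # caller skips this engine instead of waiting
        self._stamps.append(now)
        return True
```

Returning False rather than sleeping matches the "skipped rather than queued" behaviour described above.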

In-memory TTL cache

Search results are cached in memory keyed by query:max_results. Entries expire after a 300-second TTL, and the cache holds up to 100 entries (when full, the oldest entry is evicted). A cache hit short-circuits the entire engine pipeline, so repeated identical queries return instantly without hitting any engine.
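A cache with these properties could be sketched with an `OrderedDict`, as below. The `TTLCache` class is a hypothetical illustration; whether the real cache refreshes an entry's age on overwrite, as this sketch does, is an assumption.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small in-memory cache keyed by "query:max_results" with TTL expiry."""

    def __init__(self, ttl=300.0, max_entries=100, clock=time.monotonic):
        self.ttl = ttl
        self.max_entries = max_entries
        self._clock = clock
        self._data = OrderedDict()  # key -> (stored_at, value), oldest first

    @staticmethod
    def _key(query, max_results):
        return f"{query}:{max_results}"

    def get(self, query, max_results):
        """Return the cached value, or None on a miss or an expired entry."""
        key = self._key(query, max_results)
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl:
            del self._data[key]  # expired
            return None
        return value

    def put(self, query, max_results, value):
        """Store a value, evicting the oldest entry when the cache is full."""
        key = self._key(query, max_results)
        self._data.pop(key, None)  # re-inserting makes this entry the newest
        if len(self._data) >= self.max_entries:
            self._data.popitem(last=False)  # evict the oldest entry
        self._data[key] = (self._clock(), value)
```

Because max_results is part of the key, the same query with a different max_results is a separate cache entry.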

Result deduplication

Results aggregated across engines are deduplicated by normalized URL — netloc + path, lower-cased with the trailing slash and query string stripped. This collapses the same page surfaced by multiple engines (or with tracking params) into a single result. Deduplication runs before results are truncated to max_results and cached.
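The normalization rule can be sketched with the standard library's `urllib.parse`. `normalize_url` and `dedupe` are hypothetical helper names, and results are assumed to be dicts with a `url` key; the sketch keeps the first occurrence of each page.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to lower-cased netloc + path without a trailing slash."""
    parts = urlsplit(url)
    key = (parts.netloc + parts.path).lower()  # scheme, query, fragment dropped
    return key[:-1] if key.endswith("/") else key


def dedupe(results):
    """Keep the first result for each normalized URL, preserving order."""
    seen = set()
    unique = []
    for result in results:
        key = normalize_url(result["url"])
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique
```

Under this rule `https://Example.com/Page/?utm_source=x` and `http://example.com/page` map to the same key, `example.com/page`, so only one of them survives.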

Proxy support

Set WEBSEARCH_PROXIES to a comma-separated list of proxy URLs (e.g. http://p1:8080,socks5://p2:1080). The pipeline applies them in round-robin rotation to outbound HTTP fetches. When unset, no proxy is used. The pipeline also rotates the User-Agent header on each fetch.
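Round-robin rotation over a proxy list can be sketched with `itertools.cycle`. `ProxyRotator` and `next_proxy` are hypothetical names used only for illustration; how the library actually hands the chosen proxy to its HTTP client is not shown here.

```python
import itertools


class ProxyRotator:
    """Hand out proxies in round-robin order; None when no proxy is configured."""

    def __init__(self, proxies):
        proxies = list(proxies)
        self._cycle = itertools.cycle(proxies) if proxies else None

    def next_proxy(self):
        """Return the next proxy URL, or None if the list is empty."""
        return next(self._cycle) if self._cycle is not None else None
```

With the example value above, successive calls would return `http://p1:8080`, `socks5://p2:1080`, then `http://p1:8080` again.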

Explicit configuration

WebSearchFactory.from_env() reads the environment variables above. For programmatic control, WebSearchFactory.create() takes the same settings explicitly:

from websearchmcp import WebSearchFactory

ws = WebSearchFactory.create(
    searxng_url="http://localhost:9999",   # optional; tried first when set
    engines=["duckduckgo", "mojeek", "brave"],
    max_length=8000,                        # max extracted content length
    proxies=["http://p1:8080"],             # optional; round-robin rotation
)

Tools

web_search

Search the web. Returns a numbered list of {title, url, snippet} results.
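The exact text layout of the tool output is not specified here. As one plausible rendering, the sketch below turns a list of {title, url, snippet} dicts into a numbered list; `format_results` is a hypothetical helper, not the tool's actual formatter.

```python
def format_results(results):
    """Render results as a numbered list, one block per result."""
    blocks = []
    for index, result in enumerate(results, start=1):
        blocks.append(
            f"{index}. {result['title']}\n"
            f"   {result['url']}\n"
            f"   {result['snippet']}"
        )
    return "\n\n".join(blocks)
```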

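web_answer

Search and rerank, then return the top relevant sources plus an extractive summary, ready for the client LLM to synthesize a cited answer. No key required.

fetch_webpage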

Fetch a URL and extract clean markdown. Falls back automatically to browser_fetch for JS-heavy sites.

browser_fetch

Full Playwright Chromium rendering with stealth mode. Use it for SPAs, JS-rendered content, and sites that block plain HTTP clients.

Integration with MCP AI Suite

websearchmcp is automatically integrated when used with kernelmcp:

# ~/.kernelmcp/config.yaml
# No config needed — websearchmcp is auto-detected

The kernel’s orchestrator routes web_search, fetch_webpage, and browser_fetch tools to websearchmcp.

License

Apache-2.0 — free and open source for any use, including commercial.