WebSearchMCP

Web search, page fetching, and content extraction as MCP tools.

Part of the MCP AI Suite.

Features

  • 4 MCP tools: web_search, web_answer, fetch_webpage, browser_fetch
  • 4 search engines: SearXNG (priority), DuckDuckGo, Mojeek, Brave
  • web_answer: search + rerank → top relevant sources plus an extractive summary, ready for the client LLM to synthesize a cited answer (no key required)
  • Optional cross-encoder reranking: reorders results by relevance via fastembed.TextCrossEncoder (optional dependency; no-op fallback if not installed, no torch)
  • WebExtractor: HTML → clean markdown with search result parsing
  • Browser fetch: Playwright Chromium with stealth mode, JS rendering, screenshots
  • CAPTCHA detection: auto-skips blocked engines
  • Four surfaces: Python library, websearchmcp CLI, websearchmcp-server MCP server, and an optional FastAPI app (websearchmcp-api)
  • MCP-native: works with Claude Desktop, Cursor, VS Code

Installation

pip install mcpaisuite-websearchmcp

Quick Start

Python API

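import asyncio

from websearchmcp import WebSearchFactory

async def main() -> None:
    ws = WebSearchFactory.from_env()
    results = await ws.search("latest AI news", max_results=10)
    page = await ws.fetch("https://example.com")
    rendered = await ws.browser_fetch("https://spa-app.com")

asyncio.run(main())  # the API is async, so calls must run inside an event loop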

MCP Server

websearchmcp-server

CLI

websearchmcp search "Python best practices 2026"
websearchmcp fetch https://example.com
websearchmcp fetch --browser https://spa-app.com

Configuration

| Env Variable | Description | Default |
|---|---|---|
| SEARXNG_URL | SearXNG instance URL | (none; uses fallback engines) |
| WEBSEARCH_ENGINES | Comma-separated engines to use | duckduckgo,mojeek,brave |
| WEBSEARCH_MAX_LENGTH | Max extracted content length (chars) | 8000 |
| WEBSEARCH_PROXIES | Comma-separated proxy URLs | (none) |

To run a local SearXNG instance with Docker and point websearchmcp at it:

docker run -d -p 9999:8080 searxng/searxng
export SEARXNG_URL=http://localhost:9999

Architecture

Query → SearXNG (JSON API, no CAPTCHA)
     ↓ fallback if unavailable
     → DuckDuckGo HTML → WebExtractor
     ↓ fallback
     → Mojeek HTML → WebExtractor
     ↓ fallback
     → Brave HTML → WebExtractor

Resilience & performance

The pipeline wraps every engine in a resilience layer so transient failures, bot challenges, and repeated queries don't degrade results. All of the components below are in-memory and per-process (no external dependencies).

Engine selection

Engines are chosen via WEBSEARCH_ENGINES (comma-separated, default duckduckgo,mojeek,brave). SearXNG is enabled separately by setting SEARXNG_URL. When configured, SearXNG is always tried first (priority 1, reliable and CAPTCHA-free); the engines listed in WEBSEARCH_ENGINES are then rotated through in order as fallbacks. Unknown engine names are silently skipped. Supported engines: searxng, duckduckgo, mojeek, brave.
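
As a rough illustration of the selection rules above, the sketch below resolves an ordered engine list from an environment mapping. `resolve_engines` is a hypothetical helper, not part of the websearchmcp API. Treating names case-insensitively, and enabling SearXNG only through `SEARXNG_URL`, are assumptions made for this example.

```python
SUPPORTED = ("searxng", "duckduckgo", "mojeek", "brave")
DEFAULT_ENGINES = "duckduckgo,mojeek,brave"


def resolve_engines(env: dict[str, str]) -> list[str]:
    """Return the engine order: SearXNG first when configured, then fallbacks."""
    raw = env.get("WEBSEARCH_ENGINES", DEFAULT_ENGINES)
    names = [name.strip().lower() for name in raw.split(",")]

    # SearXNG is enabled by SEARXNG_URL, not by the engine list (assumption).
    order = ["searxng"] if env.get("SEARXNG_URL") else []
    for name in names:
        # Unknown names are silently skipped; duplicates are ignored.
        if name in SUPPORTED and name != "searxng" and name not in order:
            order.append(name)
    return order


with_searxng = resolve_engines(
    {"WEBSEARCH_ENGINES": "brave, bing ,mojeek", "SEARXNG_URL": "http://localhost:9999"}
)
defaults = resolve_engines({})
```

Passing `os.environ` instead of a literal dict would apply the same logic to the real process environment.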

If an engine returns results that look like a CAPTCHA challenge (titles that are empty or too short), the result is discarded, the engine's circuit records a failure, and the engine is not retried for that query.
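
A heuristic of this kind might look like the sketch below. The threshold value and the "every title is too short" rule are assumptions: the text above does not state the exact cutoff or how many titles must fail. `looks_like_captcha` and `MIN_TITLE_LENGTH` are hypothetical names.

```python
MIN_TITLE_LENGTH = 3  # assumed threshold; the documentation does not give one


def looks_like_captcha(results: list[dict]) -> bool:
    """Flag a result set whose titles are all empty or too short."""
    if not results:
        return False  # an empty result set is treated as "no results", not a challenge
    return all(
        len(result.get("title", "").strip()) < MIN_TITLE_LENGTH for result in results
    )


blocked = looks_like_captcha([{"title": ""}, {"title": "a"}])
normal = looks_like_captcha([{"title": "Python best practices"}])
```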

Circuit breaker

Each engine has its own circuit breaker. After 3 consecutive failures the engine's circuit opens and it is skipped for a 300-second (5-minute) cooldown. Once the cooldown elapses the circuit half-opens and the engine is tried again; a successful call resets the failure count to zero.
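
The behavior described above can be modeled with a small state machine. The following is a minimal sketch using the stated numbers (3 failures, 300-second cooldown). The `CircuitBreaker` class and its method names are illustrative, not the library's own, and the injectable `clock` exists only to make the example testable.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Per-engine breaker: opens after N consecutive failures, half-opens after a cooldown."""

    def __init__(
        self,
        threshold: int = 3,
        cooldown: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if the engine may be tried (closed or half-open)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        """A successful call closes the circuit and resets the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()
```

In this sketch a failure during the half-open trial restarts the cooldown, which is a common choice but an assumption here.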

Per-engine rate limiter

A sliding-window rate limiter allows at most 10 requests per minute per engine (60-second window). Engines over the limit are skipped for that query rather than queued.
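
One common way to build such a limiter is to keep a deque of recent request timestamps. The sketch below assumes that approach. `SlidingWindowLimiter` and `try_acquire` are hypothetical names, and the injectable `clock` is there only so the example can be exercised without waiting.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls within any `window`-second span."""

    def __init__(
        self,
        max_requests: int = 10,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window = window
        self._clock = clock
        self._stamps: deque = deque()

    def try_acquire(self) -> bool:
        """Record a request and return True, or return False if over the limit."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_requests:
            return False  # caller skips this engine instead of queueing
        self._stamps.append(now)
        return True
```

Returning `False` rather than sleeping matches the "skipped rather than queued" behavior described above.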

In-memory TTL cache

Search results are cached in memory keyed by query:max_results. Entries expire after a 300-second TTL, and the cache holds up to 100 entries (when full, the oldest entry is evicted). A cache hit short-circuits the entire engine pipeline, so repeated identical queries return instantly without hitting any engine.
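
A cache with these properties can be sketched with an `OrderedDict`. The class below is illustrative only (`TTLCache`, `make_key`, and the `clock` parameter are hypothetical names), and it assumes "oldest" means the entry that was written earliest.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLCache:
    """Bounded in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(
        self,
        ttl: float = 300.0,
        max_entries: int = 100,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl
        self.max_entries = max_entries
        self._clock = clock
        self._data: OrderedDict = OrderedDict()  # key -> (stored_at, value)

    @staticmethod
    def make_key(query: str, max_results: int) -> str:
        """Build the `query:max_results` cache key."""
        return f"{query}:{max_results}"

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl:
            del self._data[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        if key in self._data:
            del self._data[key]  # re-insert so the entry counts as newest
        elif len(self._data) >= self.max_entries:
            self._data.popitem(last=False)  # evict the oldest entry
        self._data[key] = (self._clock(), value)
```

A caller would check `cache.get(TTLCache.make_key(query, max_results))` first and only run the engine pipeline on a miss.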

Result deduplication

Results aggregated across engines are deduplicated by normalized URL (netloc + path, lower-cased, with the trailing slash and query string stripped). This collapses the same page surfaced by multiple engines (or with tracking params) into a single result. Deduplication runs before results are truncated to max_results and cached.
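
The normalization rule above maps directly onto the standard library's `urllib.parse.urlsplit`. The sketch below is one possible reading of it. `normalize_url` and `dedupe` are hypothetical helper names, and keeping the first occurrence of each URL is an assumption.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to lower-cased netloc + path without trailing slash or query."""
    parts = urlsplit(url)
    return (parts.netloc + parts.path).lower().rstrip("/")


def dedupe(results: list[dict]) -> list[dict]:
    """Keep the first result seen for each normalized URL, preserving order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for result in results:
        key = normalize_url(result["url"])
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


merged = dedupe(
    [
        {"title": "Page", "url": "https://Example.com/Page/?utm_source=x"},
        {"title": "Page (other engine)", "url": "http://example.com/page"},
        {"title": "Other", "url": "https://example.com/other"},
    ]
)
```

Because the scheme is not part of the key, the `http` and `https` variants of the same page collapse into one result.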

Proxy support

Set WEBSEARCH_PROXIES to a comma-separated list of proxy URLs (e.g. http://p1:8080,socks5://p2:1080). The pipeline applies them in round-robin rotation to outbound HTTP fetches. When unset, no proxy is used. The pipeline also rotates the User-Agent header on each fetch.
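
Round-robin rotation over a comma-separated list can be sketched with `itertools.cycle`. `ProxyRotator` is a hypothetical class written for this example and does not reflect how the library wires proxies into its HTTP client.

```python
import itertools
from typing import Optional


class ProxyRotator:
    """Hand out proxies from a comma-separated list in round-robin order."""

    def __init__(self, raw: str = "") -> None:
        # Ignore empty segments so "" or a trailing comma yields no proxies.
        self.proxies = [p.strip() for p in raw.split(",") if p.strip()]
        self._cycle = itertools.cycle(self.proxies) if self.proxies else None

    def next_proxy(self) -> Optional[str]:
        """Return the next proxy URL, or None when no proxy is configured."""
        return next(self._cycle) if self._cycle else None


rotator = ProxyRotator("http://p1:8080, socks5://p2:1080")
picked = [rotator.next_proxy() for _ in range(3)]
unset = ProxyRotator("").next_proxy()
```

In real use the constructor argument would come from `os.environ.get("WEBSEARCH_PROXIES", "")`.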

Explicit configuration

WebSearchFactory.from_env() reads the environment variables above. For programmatic control, WebSearchFactory.create() takes the same settings explicitly:

from websearchmcp import WebSearchFactory

ws = WebSearchFactory.create(
    searxng_url="http://localhost:9999",   # optional; tried first when set
    engines=["duckduckgo", "mojeek", "brave"],
    max_length=8000,                        # max extracted content length
    proxies=["http://p1:8080"],             # optional; round-robin rotation
)

Tools

web_search

Search the web. Returns a numbered list of {title, url, snippet}.

fetch_webpage

Fetch a URL and extract clean markdown. Automatically falls back to browser_fetch for JS-heavy sites.

browser_fetch

Full Playwright Chromium rendering with stealth mode. Use it for SPAs, JS-rendered content, and sites that block plain HTTP clients.

Integration with MCP AI Suite

websearchmcp is automatically integrated when used with kernelmcp:

# ~/.kernelmcp/config.yaml
# No config needed; websearchmcp is auto-detected

The kernel's orchestrator routes web_search, fetch_webpage, and browser_fetch tools to websearchmcp.

License

AGPL-3.0-or-later (a commercial license is available; contact the maintainer).