XML Sitemap Generator
XML Sitemap Generator is a high-performance web crawler and sitemap discovery engine designed to create SEO-compliant XML sitemaps for both static and dynamically rendered websites.
By implementing a hybrid crawling architecture, the application dynamically chooses between high-speed static HTML parsing and headless browser rendering depending on the page type. It streams live crawl logs to the browser using Server-Sent Events (SSE) and automatically discovers, crawls, validates, and merges existing sitemap data for maximum coverage.
The Challenge
Modern web applications are hard for standard web crawlers to cover. Traditional search crawlers scan static HTML, but client-rendered frameworks (like React, Vue, or Next.js SPAs) inject their links at runtime via JavaScript.
To crawl these Client-Side Rendered (CSR) sites, a crawler must boot up a headless browser (like Puppeteer) to execute JavaScript and extract URLs. However, headless browsers are extremely resource-intensive, slow to boot, and prone to memory leaks. Running Puppeteer inside serverless execution environments (like Vercel API routes) adds strict memory limits (typically capped around 1024MB of RAM) and request timeouts (10-60 seconds).
The challenge was to design a crawler that:
- Automatically detects if a page is SSR or CSR, falling back to Puppeteer only when absolutely necessary.
- Respects robots.txt disallow rules and site indexing rules.
- Manages memory and concurrency efficiently within serverless constraints.
- Streams crawl progress live to the client UI to prevent connection timeouts.
My Role: I was the sole developer and architect for this project, building the hybrid parsing engine, the Server-Sent Events (SSE) streaming pipeline, and the React/Next.js frontend.
The Solution
XML Sitemap Generator utilizes a multi-stage discovery and crawling engine that balances speed, compliance, and developer insights.
Core Features
- Hybrid Crawling Engine: Employs high-speed parsing (via Cheerio) for traditional Server-Side Rendered (SSR) HTML and falls back to headless rendering (via Puppeteer) for JavaScript-heavy Client-Side Rendered (CSR) single-page applications.
- Sitemap Discovery & Merging: Automatically parses robots.txt to discover existing sitemaps, downloads their entries, crawls all discovered links recursively, and merges the datasets to ensure 100% indexing coverage.
- Real-Time Progress Streaming: Uses Server-Sent Events (SSE) to push live crawl logs, discovered URLs, and queue counts to the client UI in real time without refreshing.
- Compliance & Depth Prioritization: Automatically respects robots.txt disallow criteria, computes page priority based on folder depth (1.0 for root, decreasing per level), and extracts last-modified metadata.
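
As a rough illustration of the robots.txt handling described in the list above, the sketch below collects Sitemap directives and the Disallow rules that apply to all crawlers. The names parseRobotsTxt and isAllowed are hypothetical and not taken from the project. The matching is deliberately simplified: it only honours the wildcard user-agent group and plain path prefixes, with no Allow rules or wildcard patterns.

```javascript
// Hypothetical helper (not the project's actual API).
// Simplification: only the "User-agent: *" group and plain prefix rules are handled.
function parseRobotsTxt(text) {
  const sitemaps = [];
  const disallow = [];
  let inWildcardGroup = false; // true while inside a group that includes "User-agent: *"
  let lastWasAgent = false;    // consecutive User-agent lines share one group

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      inWildcardGroup = lastWasAgent ? inWildcardGroup || value === "*" : value === "*";
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;

    if (field === "sitemap") {
      sitemaps.push(value); // Sitemap lines apply regardless of user-agent group
    } else if (field === "disallow" && inWildcardGroup && value !== "") {
      disallow.push(value); // an empty Disallow means "allow everything"
    }
  }
  return { sitemaps, disallow };
}

// A path is allowed when no collected Disallow prefix matches it.
function isAllowed(pathname, disallow) {
  return !disallow.some((prefix) => pathname.startsWith(prefix));
}
```

In this sketch the sitemaps array would seed the discovery stage, while isAllowed would be consulted before a URL is enqueued.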
Technical Architecture
The architecture separates the crawler engine into a Next.js serverless route that dispatches crawl jobs, executes checks, and streams logs back to the React UI:
                ┌─────────────────┐
                │    React UI     │
                └────────┬────────┘
                         │ (1) POST /api/generate-sitemap
                         ▼
            ┌────────────────────────┐
            │ Next.js API Route (SSE)│◄────────────────┐
            └────────────┬───────────┘                 │
                         │ (2) Read robots.txt         │
                         ▼                             │
             ┌──────────────────────┐                  │
             │   Discovery Engine   ├──────────────────┤
             └───────────┬──────────┘                  │
                         │ (3) Enqueue URLs            │ (4) Stream
                         ▼                             │     Progress
                 ┌───────────────┐                     │
      ┌─────────►│  Crawl Queue  ├─────────────────────┘
      │          └───────┬───────┘
      │                  │ (5) Dequeue (Batch of 5)
      │                  ▼
      │       ┌─────────────────────┐
      │       │ CSR Heuristic Check │
      │       └────┬────────────┬───┘
      │            │ (Static)   │ (JS Rendered)
      │            ▼            ▼
      │       ┌─────────┐ ┌───────────┐
      │       │ Cheerio │ │ Puppeteer │
      │       └────┬────┘ └─────┬─────┘
      │            │            │
      └────────────┴─────┬──────┘
                         │ (6) Parse Links & Recurse
                         ▼
                ┌──────────────────┐
                │ XML Map Compiler │
                └──────────────────┘

Engineering Deep Dives
1. Hybrid Dispatcher & Heuristic CSR Detection
To crawl modern applications efficiently, the crawler uses a fast HTTP parser by default. When downloading a URL, it performs a heuristic scan on the raw static HTML. If the page is identified as Client-Side Rendered (CSR), the crawler delegates the request to a Puppeteer headless browser instance.
The heuristic evaluates HTML properties, including script tags, DOM structure, and known framework root container selectors (like #root or #__next):
// Dynamic detection settings in src/utils/sitemapGenerator.js
const config = {
  csr: {
    minimalContentLength: 200, // If HTML length is below this, it is likely empty / a placeholder
    minimalChildNodes: 5, // Minimal body elements
    scriptCountThreshold: 10, // Excessive script count indicates a heavy bundle app
    contentScriptRatio: 1000, // Ratio of HTML length to script count
    rootSelectors: ["#root", "#__next", "#app"], // Common framework hydration mounting points
  },
  puppeteer: {
    waitForSelectorsTimeout: 10000,
    gotoTimeout: 60000,
    waitUntil: "networkidle2",
  },
};

If the static HTML lacks text content but contains framework mounting points and heavy JavaScript script imports, isCSR evaluates to true. Puppeteer boots up, loads the page, waits for the JS hydration (networkidle2), and extracts the fully rendered HTML DOM for link parsing. This hybrid fallback reduces execution times by over 80% on standard static content.
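
To make the heuristic more concrete, here is a minimal sketch of how such a check could be written. It is not the project's isCSR implementation: the function name looksClientRendered is hypothetical, it uses plain regular expressions instead of Cheerio so that it runs without dependencies, and the way the signals are combined (thin visible text plus either an empty mount point or many scripts) is an assumption. Only the threshold values and root ids are borrowed from the config shown earlier.

```javascript
// Threshold values borrowed from the config excerpt; the combination logic is an assumption.
const csrConfig = {
  minimalContentLength: 200,
  scriptCountThreshold: 10,
  rootIds: ["root", "__next", "app"],
};

// Hypothetical stand-in for the project's isCSR check (regex-based, no Cheerio).
function looksClientRendered(html, cfg = csrConfig) {
  const scriptCount = (html.match(/<script\b/gi) || []).length;

  // Approximate visible text: drop scripts, styles, and tags, then collapse whitespace.
  const text = html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();

  // An empty framework mount point such as <div id="root"></div>.
  const hasEmptyRoot = cfg.rootIds.some((id) =>
    new RegExp(`<div[^>]*\\bid=["']${id}["'][^>]*>\\s*</div>`, "i").test(html)
  );

  const thinContent = text.length < cfg.minimalContentLength;
  const scriptHeavy = scriptCount >= cfg.scriptCountThreshold;

  // Route to the headless browser only when little text arrived with the static HTML.
  return thinContent && (hasEmptyRoot || scriptHeavy);
}
```

A page that ships an empty #root container and a single bundle script would be routed to Puppeteer under this sketch, while a server-rendered article with a populated #root would stay on the fast path.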
2. Real-Time Streaming with Server-Sent Events (SSE)
Crawling up to 1000 pages takes time. If a Next.js serverless route handles this synchronously, it will exceed the request timeout limit on serverless hosts.
To solve this, we implemented Server-Sent Events (SSE) at app/api/generate-sitemap/route.js. The client initiates a single GET connection, and the server chunks and streams real-time status packets as pages are crawled:
// Schema of streamed SSE data packet
{
  "status": "crawling",
  "url": "https://example.com/blog/slug",
  "count": 42,
  "depth": 3,
  "queueSize": 12
}

This keeps the HTTP connection active, preventing serverless host timeouts while updating the UI with live progress indicators, current crawling URLs, and discovered counts.
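
For readers unfamiliar with the SSE wire format, the sketch below shows how a packet like the one above could be framed. An SSE response is served with the Content-Type text/event-stream, and each message is one or more "field: value" lines terminated by a blank line. The helper names toSseFrame and parseSseChunk are hypothetical and are not taken from the project's route handler.

```javascript
// Serializes one progress packet as an SSE frame: optional "event:" line,
// a "data:" line carrying JSON, then a blank line that ends the message.
function toSseFrame(packet, eventName) {
  const lines = [];
  if (eventName) lines.push(`event: ${eventName}`);
  lines.push(`data: ${JSON.stringify(packet)}`); // JSON.stringify emits a single line here
  return lines.join("\n") + "\n\n";
}

// Hypothetical client-side counterpart: splits received text back into packets.
// Assumes each chunk contains only complete frames.
function parseSseChunk(chunk) {
  return chunk
    .split("\n\n")
    .map((frame) => frame.split("\n").find((line) => line.startsWith("data: ")))
    .filter(Boolean)
    .map((line) => JSON.parse(line.slice("data: ".length)));
}
```

On the server, each frame would be written to the response stream as soon as a page finishes crawling, which is what keeps the connection from going idle.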
3. Concurrency & Resource Pooling
Spawning a headless Chrome instance inside Puppeteer consumes roughly 150MB of RAM. If we crawled 20 pages concurrently, we would hit serverless memory limits instantly, resulting in a fatal process crash.
To keep memory consumption stable, we designed a batch queue with a strict concurrency limit of 5 simultaneous pages:
// Concurrency control in src/utils/sitemapGenerator.js
// (excerpt: initialUrls, maxPages, baseUrl, fetchPageContent, and extractLinks come from the surrounding module)
const batchSize = 5;
const queue = [...initialUrls];
const visited = new Set();

while (queue.length > 0 && visited.size < maxPages) {
  // Take at most batchSize URLs so no more than 5 pages are in flight at once
  const currentBatch = queue.splice(0, batchSize);
  await Promise.all(
    currentBatch.map(async (url) => {
      // Skip duplicates, and stop mid-batch once the page cap is reached
      if (visited.has(url) || visited.size >= maxPages) return;
      visited.add(url);
      const html = await fetchPageContent(url); // Decides Cheerio vs. Puppeteer
      const links = extractLinks(html, baseUrl);
      queue.push(...links.filter((link) => !visited.has(link)));
    })
  );
}

This pooling strategy keeps active Chromium instances constrained to safe bounds, maintaining the execution runtime footprint under Vercel's limits.
Technical Challenges & Trade-offs
1. Serverless Environments vs. Persistent Web Crawlers
Web crawlers are traditionally long-running background processes. In contrast, Next.js serverless functions are ephemeral, stateless, and shut down after request completion or timeout.
- Decision: We designed the crawler to run on-demand, streaming progress via SSE. To prevent serverless execution timeout on larger crawls (e.g. over 500 pages), the client slider allows users to bound the crawler between 10 and 1000 pages. For larger sites, self-hosting via Docker (which bypasses serverless time limits) is provided as a pre-configured option.
2. Puppeteer Resource Footprint on Lambda
Running Puppeteer inside serverless lambdas requires packaging Chromium. The Chromium binary exceeds standard serverless bundle size limits and adds significant execution overhead.
- Decision: We integrated puppeteer-core with a lightweight serverless-chromium layer. While this required complex bundle configuration during deployment, it minimized our deployment bundle sizes, reducing function cold start times and avoiding serverless bundle limits.
3. Infinite Recursion and Dynamic Parameter Loops
Modern sites often have infinite dynamic paths (e.g., calendar pages with /date?day=1, /date?day=2, etc.) which can trap a recursive crawler in an infinite loop.
- Decision: We added an aggressive normalization step that strips common tracking parameters (like utm_*) and caps folder-depth recursion to 5 levels. If the crawler exceeds this depth, it stops enqueuing links from that branch, saving memory and crawl budget.
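
The normalization and depth cap described above could look roughly like the following sketch, built on the standard URL API. The helper names normalizeUrl, pathDepth, and shouldEnqueue are hypothetical. Dropping the fragment and sorting the remaining query parameters are extra assumptions, added so that equivalent URLs collapse to one entry in the visited set.

```javascript
const MAX_DEPTH = 5; // depth cap stated in the text

// Hypothetical helper: removes utm_* parameters and the fragment,
// then sorts what is left so equivalent URLs compare equal.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = "";
  // Copy the keys first, because deleting while iterating would skip entries.
  for (const key of [...url.searchParams.keys()]) {
    if (key.toLowerCase().startsWith("utm_")) url.searchParams.delete(key);
  }
  url.searchParams.sort();
  return url.toString();
}

// Folder depth = number of non-empty path segments ("/" has depth 0).
function pathDepth(rawUrl) {
  return new URL(rawUrl).pathname.split("/").filter(Boolean).length;
}

// Links deeper than the cap are not enqueued.
function shouldEnqueue(rawUrl) {
  return pathDepth(rawUrl) <= MAX_DEPTH;
}
```

Note that this sketch leaves non-tracking parameters such as day=1 in place, so in the calendar example it is the page limit and the depth cap, not the parameter stripping, that bound the crawl.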
Results & Impact
- Hybrid Scalability: Crawls up to 1000 pages per session, dynamically parsing complex client-side applications and simple SSR sites under the same interface.
- Complete Indexing Coverage: Leverages a sitemap merging algorithm that cross-references existing sitemaps with the crawler output, reducing indexing gaps.
- Stable Memory Profile: Maintained a memory footprint below 350MB under maximum concurrency constraints by utilizing strict Chromium pooling.
- Zero-Config Compliant Outputs: Generates standards-compliant XML sitemaps containing automatic depth-based priority calculations (1.0 to 0.1) and last-modified dates, ready for search engine indexing.
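
As an illustration of the depth-based priority and XML output mentioned in the results above, the sketch below assigns 1.0 to the root and floors at 0.1, matching the stated range. The 0.1 step per folder level is an assumption, since the exact decrement is not documented here, and the names priorityForUrl, escapeXml, and buildSitemap are hypothetical.

```javascript
// Assumed rule: priority drops by 0.1 per path segment, floored at 0.1.
function priorityForUrl(rawUrl) {
  const depth = new URL(rawUrl).pathname.split("/").filter(Boolean).length;
  // Round to one decimal to avoid floating-point noise such as 0.7000000000000001.
  return Math.max(0.1, Math.round((1 - 0.1 * depth) * 10) / 10);
}

// Escapes the five characters that are reserved in XML text and attributes.
function escapeXml(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// entries: [{ loc: string, lastmod?: string }] where lastmod is a W3C date such as "2024-01-15".
function buildSitemap(entries) {
  const urls = entries.map(({ loc, lastmod }) =>
    [
      "  <url>",
      `    <loc>${escapeXml(loc)}</loc>`,
      lastmod ? `    <lastmod>${lastmod}</lastmod>` : null, // omitted when unknown
      `    <priority>${priorityForUrl(loc).toFixed(1)}</priority>`,
      "  </url>",
    ]
      .filter(Boolean)
      .join("\n")
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}
```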