Keep HTML documents under crawl limits
Checks HTML document size against Googlebot crawl limits
- Googlebot stops parsing HTML beyond approximately 15MB; content after that point is not indexed
- Large HTML is usually caused by inline JSON data dumps, excessive inline SVG, or unminified JavaScript in `<script>` tags
- Target HTML document size under 2MB for optimal crawl efficiency; investigate anything over 5MB
Rule Details
Googlebot has a documented parse limit of approximately 15MB per HTML document. Content beyond this threshold is silently ignored. Even well below that limit, large HTML documents waste crawl budget.
Code Examples
❌ Avoid — massive inline JSON payload
<!-- Next.js pages router with too much data in getServerSideProps -->
<script id="__NEXT_DATA__" type="application/json">
{
"props": {
"pageProps": {
"allProducts": [/* 5,000 products × 2KB each = 10MB of JSON */]
}
}
}
</script>
✅ Fix — fetch only what you render
// pages/products/index.tsx
export async function getStaticProps() {
// Only pass the 20 products visible on this page
const products = await getProducts({ limit: 20, page: 1 })
return { props: { products } }
// Load additional pages via client-side API calls
}
❌ Avoid — inline SVG that should be an external file
<!-- 200KB SVG inlined directly in HTML -->
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1000 500">
<!-- thousands of path elements... -->
</svg>
✅ Fix — external SVG file
<!-- Serve as an image (cannot be styled with CSS) -->
<img src="/images/illustration.svg" alt="Product illustration" width="500" height="250">
<!-- Or reference a symbol from an external SVG sprite for icons (small, reusable) -->
<svg aria-hidden="true"><use href="/icons.svg#arrow-right"></use></svg>
Why It Matters
- Index completeness: Content after Googlebot's parse limit is not indexed.
- Crawl budget: Large pages take longer to fetch and parse, leaving fewer resources for other pages.
- Core Web Vitals: Large HTML documents take longer to generate, download, and parse, which can delay Time to First Byte (TTFB) and Largest Contentful Paint (LCP).
Common Causes of Oversized HTML
| Cause | Typical Size Contribution |
|---|---|
| Inline Next.js `__NEXT_DATA__` JSON | 100KB–5MB |
| Inline SVG files | 10KB–500KB each |
| Base64-encoded images | Varies (about 33% larger than the binary file) |
| Unminified `<script>` blocks | 50KB–1MB |
| Massive inline CSS (utility classes) | 50KB–300KB |
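To see which of the causes in the table dominates a given page, the HTML can be scanned for inline blocks and their byte sizes. The following is a minimal sketch, not a definitive implementation: `findInlineBlocks` and the `InlineBlock` type are hypothetical names, and regex matching is only an approximation of real HTML parsing.

```typescript
// Hypothetical shape for one inline contributor found in the HTML.
interface InlineBlock {
  kind: "json-script" | "script" | "svg" | "base64";
  bytes: number;
}

const encoder = new TextEncoder();
// UTF-8 byte length, since string length undercounts non-ASCII text.
const byteLen = (s: string): number => encoder.encode(s).length;

// Assumption: regexes are good enough for a rough audit of well-formed markup.
function findInlineBlocks(html: string): InlineBlock[] {
  const blocks: InlineBlock[] = [];
  let m: RegExpExecArray | null;

  // Inline <script> bodies; JSON payloads such as __NEXT_DATA__ are tagged separately.
  const scriptRe = /<script\b([^>]*)>([\s\S]*?)<\/script>/gi;
  while ((m = scriptRe.exec(html)) !== null) {
    const attrs = m[1];
    const body = m[2];
    if (body.trim() === "") continue; // external scripts have no inline body
    const isJson = /type\s*=\s*["']application\/(?:ld\+)?json["']/i.test(attrs);
    blocks.push({ kind: isJson ? "json-script" : "script", bytes: byteLen(body) });
  }

  // Inline <svg> elements, measured including their tags.
  const svgRe = /<svg\b[\s\S]*?<\/svg>/gi;
  while ((m = svgRe.exec(html)) !== null) {
    blocks.push({ kind: "svg", bytes: byteLen(m[0]) });
  }

  // Base64 data URIs (for example inlined images).
  const dataRe = /data:[a-z0-9.+\/-]+;base64,[A-Za-z0-9+\/=]+/gi;
  while ((m = dataRe.exec(html)) !== null) {
    blocks.push({ kind: "base64", bytes: byteLen(m[0]) });
  }

  // Largest contributors first.
  return blocks.sort((a, b) => b.bytes - a.bytes);
}
```

Calling `findInlineBlocks(html)` on a saved page and reading the first few entries gives a quick view of where the bytes go. Inline CSS is not covered by this sketch.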
How to Measure HTML Size
# Measure uncompressed HTML size
curl -so /dev/null -w '%{size_download}\n' https://yoursite.com/page
# Measure the compressed transfer size, which is closer to what Googlebot downloads
curl -H 'Accept-Encoding: gzip, br' -so /dev/null -w '%{size_download}\n' https://yoursite.com/page
Target ranges:
- Under 100KB: Excellent
- 100KB–2MB: Acceptable (investigate large sections)
- 2MB–5MB: Needs optimisation
- Over 5MB: Critical — risk of partial indexing
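The target ranges above can be applied in a script when auditing many pages. This is a minimal sketch under stated assumptions: `rateHtmlSize`, `htmlByteLength`, and `measureUrl` are hypothetical helpers, binary units (1KB = 1024 bytes) are assumed, and a value of exactly 2MB or 5MB is placed in the lower band.

```typescript
const KB = 1024; // assumption: binary units
const MB = 1024 * KB;

type SizeRating = "excellent" | "acceptable" | "needs-optimisation" | "critical";

// Maps an uncompressed HTML byte count onto the target ranges.
function rateHtmlSize(bytes: number): SizeRating {
  if (bytes < 100 * KB) return "excellent";
  if (bytes <= 2 * MB) return "acceptable";
  if (bytes <= 5 * MB) return "needs-optimisation";
  return "critical";
}

// UTF-8 byte length of an HTML string (string length undercounts non-ASCII text).
function htmlByteLength(html: string): number {
  return new TextEncoder().encode(html).length;
}

// Fetches a page and returns the size of the decoded body in bytes.
// Assumes a runtime with a global fetch (for example Node 18+), which
// decodes gzip/Brotli responses before exposing the body.
async function measureUrl(url: string): Promise<number> {
  const res = await fetch(url);
  const body = await res.arrayBuffer();
  return body.byteLength;
}
```

For example, `rateHtmlSize(await measureUrl("https://yoursite.com/page"))` returns one of the four ratings for the live page.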
Exceptions
- Staging, utility, login, account, or internal search pages may intentionally use different crawl or index signals if they are not meant to rank.
- Temporary migration states can produce noisy intermediate signals; flag the live production URL pattern, not one-off transition artifacts.
- When redirects, canonicals, robots directives, or indexability signals conflict, fix the strongest final signal first instead of reporting every downstream symptom as a separate blocker.
Verification
Automated Checks
- Inspect rendered HTML and HTTP headers to confirm the expected metadata or crawlability signal is present.
- Test the affected URL with Google Search Console or equivalent tooling where relevant.
- Re-crawl a representative page set after deployment.
Manual Checks
- Confirm the change does not create conflicting canonical-url, robots, or structured-data signals.
Use with AI
Copy these prompts to use with your AI assistant, or install the MCP server to use directly from Claude, Cursor, or Windsurf.
Check
Verify implementation
Measure the raw HTML response size (before compression) for each page. Flag pages over 2MB (investigate) and over 5MB (critical). Identify the cause of oversized HTML: (1) Large inline JSON (`<script id='__NEXT_DATA__'>` or similar). (2) Inline SVG files. (3) Base64-encoded images in HTML. (4) Inline CSS with large amounts of utility classes. (5) Unminified scripts in `<script>` tags.
Fix
Auto-fix issues
1. Measure: `curl -so /dev/null -w '%{size_download}' https://yoursite.com/page | awk '{print $1/1024 " KB"}'`
2. If large inline JSON is the cause (common in Next.js `__NEXT_DATA__`):
   - Reduce data passed to `getServerSideProps`/`getStaticProps`: only pass what the page renders
   - Use React Server Components (Next.js 13+) to reduce client hydration payloads
3. If inline SVG is the cause: move SVGs to external files and load with `<img>` or `<use>`.
4. If base64 images are the cause: serve images from a CDN and reference via URL.
5. Enable gzip/Brotli compression on the server. Googlebot fetches the compressed response, so this cuts transfer time, but it does not shrink the raw HTML that the size thresholds refer to.
6. Minify HTML output in production (remove whitespace and comments).
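Step 2 above (passing only what the page renders) can be sketched as a small mapping applied before the data is returned as props. The `Product` shape, its field names, and the helpers `toProductCard` and `buildPageProps` are assumptions made for illustration, not part of any real API.

```typescript
// Hypothetical full record as it might come back from a database or API.
interface Product {
  id: string;
  name: string;
  price: number;
  thumbnailUrl: string;
  description: string; // long text the listing page never shows
  relatedIds: string[]; // only needed on the detail page
}

// Only the fields a listing card actually renders.
type ProductCard = Pick<Product, "id" | "name" | "price" | "thumbnailUrl">;

function toProductCard(p: Product): ProductCard {
  return { id: p.id, name: p.name, price: p.price, thumbnailUrl: p.thumbnailUrl };
}

// Builds the props for one page of results (page numbers start at 1),
// so the serialized JSON contains one page of trimmed records only.
function buildPageProps(
  all: Product[],
  page: number,
  pageSize: number
): { products: ProductCard[] } {
  const start = (page - 1) * pageSize;
  return { products: all.slice(start, start + pageSize).map(toProductCard) };
}
```

In a pages-router data function, the return value would then look like `return { props: buildPageProps(allProducts, 1, 20) }`, which keeps the serialized payload proportional to what is visible on the page.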
Explain
Learn more
Google's crawl infrastructure parses only the first ~15MB of an HTML document. Pages that exceed this limit have their tail content silently omitted from Google's index. Beyond the hard limit, large HTML documents consume more crawl budget, meaning fewer of your pages are crawled per day. This particularly affects large e-commerce sites or pages that server-render large datasets into HTML.
Review
Code review
Check the response `Content-Length` header or measure the raw HTML byte count. Inspect `<script type='application/json'>` or `<script id='__NEXT_DATA__'>` blocks — count their size in bytes. Flag any single block over 500KB. Check for inline SVG elements (look for `<svg>` in body HTML) that should be external files. Verify HTML is served with gzip or Brotli encoding.
