SEO · Medium

Keep HTML documents under crawl limits

Checks HTML document size against Googlebot crawl limits

Quick take
Typical fix time 10 min
  • Googlebot stops parsing HTML beyond approximately 15MB; content after that point is not indexed
  • Large HTML is usually caused by inline JSON data dumps, excessive inline SVG, or unminified JavaScript in `<script>` tags
  • Keep HTML documents under 2MB for efficient crawling; investigate anything over 2MB and treat anything over 5MB as critical
Why it matters: Googlebot has a documented crawl size limit of approximately 15MB per HTML document, and content beyond this threshold is not parsed or indexed. Excessively large HTML also slows crawling, so fewer of your pages are fetched within the same crawl budget.

Rule Details

Googlebot has a documented parse limit of approximately 15MB per HTML document, measured on the uncompressed data. Content beyond this threshold is silently ignored. Even well below that limit, large HTML documents waste crawl budget.
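
To make the truncation risk concrete, the sketch below estimates the byte offset at which a piece of critical content first appears in the uncompressed HTML and compares it with the limit. The names `CRAWL_LIMIT_BYTES`, `firstByteOffset`, and `isBeforeCrawlLimit` are hypothetical, and the constant only mirrors the approximate 15MB figure above; it is not an exact cutoff.

```typescript
// Approximate limit taken from the rule text (assumption: 1MB = 1024 * 1024 bytes).
const CRAWL_LIMIT_BYTES = 15 * 1024 * 1024;

// UTF-8 byte offset of the first occurrence of `marker` in `html`, or -1 if absent.
function firstByteOffset(html: string, marker: string): number {
  const index = html.indexOf(marker);
  if (index === -1) return -1;
  return new TextEncoder().encode(html.slice(0, index)).length;
}

// True when the whole marker ends at or before the byte limit.
function isBeforeCrawlLimit(
  html: string,
  marker: string,
  limit: number = CRAWL_LIMIT_BYTES
): boolean {
  const offset = firstByteOffset(html, marker);
  if (offset === -1) return false;
  const markerBytes = new TextEncoder().encode(marker).length;
  return offset + markerBytes <= limit;
}
```

A check like this is most useful for content near the end of the document, such as footer links or structured data placed after a large inline payload.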

Code Examples

❌ Avoid — massive inline JSON payload

<!-- Next.js pages router with too much data in getServerSideProps -->
<script id="__NEXT_DATA__" type="application/json">
{
  "props": {
    "pageProps": {
      "allProducts": [/* 5,000 products × 2KB each = 10MB of JSON */]
    }
  }
}
</script>

✅ Fix — fetch only what you render

// pages/products/index.tsx
export async function getStaticProps() {
  // Only pass the 20 products visible on this page;
  // load additional pages via client-side API calls
  const products = await getProducts({ limit: 20, page: 1 })
  return { props: { products } }
}
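
Pagination limits how many items are serialized into the page; trimming each item to the fields the page actually renders limits how large each one is. The following is a minimal sketch under assumptions: the `Product` shape and the `toCardProps` and `toPageProps` helpers are hypothetical and are not part of Next.js or the example above.

```typescript
// Hypothetical full record as it might come from a database or API.
interface Product {
  id: string;
  name: string;
  price: number;
  description: string; // long text, not shown on the listing page
  reviews: string[];   // not shown on the listing page
}

// Only the fields a listing card renders.
type ProductCard = Pick<Product, "id" | "name" | "price">;

// Copy just the rendered fields so the rest never reaches the serialized props.
function toCardProps(product: Product): ProductCard {
  const { id, name, price } = product;
  return { id, name, price };
}

// Take the first `pageSize` products and trim each one.
function toPageProps(products: Product[], pageSize: number): ProductCard[] {
  return products.slice(0, pageSize).map(toCardProps);
}
```

The result of `toPageProps` could be returned as `props` in place of the full records, so the inline JSON grows with what is displayed rather than with what is stored.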

❌ Avoid — inline SVG that should be an external file

<!-- 200KB SVG inlined directly in HTML -->
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1000 500">
  <!-- thousands of path elements... -->
</svg>

✅ Fix — external SVG file

<!-- Serve as an image (cannot be styled with CSS) -->
<img src="/images/illustration.svg" alt="Product illustration" width="500" height="250">
 
<!-- Or reference an external SVG sprite for icons (small, reusable) -->
<svg aria-hidden="true"><use href="/icons.svg#arrow-right"></use></svg>

Why It Matters

  • Index completeness: Content after Googlebot's parse limit is not indexed.
  • Crawl budget: Large pages take longer to fetch and parse, leaving fewer resources for other pages.
  • Core Web Vitals: Large HTML documents take longer to download and parse, which can delay Largest Contentful Paint (LCP).

Common Causes of Oversized HTML

Cause | Typical size contribution
Inline Next.js `__NEXT_DATA__` JSON | 100KB–5MB
Inline SVG files | 10KB–500KB each
Base64-encoded images | Varies (about 33% larger than the binary file)
Unminified `<script>` blocks | 50KB–1MB
Massive inline CSS (utility classes) | 50KB–300KB
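
To see which of these causes dominates a given page, the sketch below totals the bytes taken by inline scripts, inline styles, inline SVG, and base64 data URIs in an HTML string. It is regex-based, so the numbers are estimates rather than a real parse, and categories can overlap (a data URI inside an inline SVG is counted in both). The `InlineSizeReport` type and the `auditInlineSizes` function are hypothetical names.

```typescript
interface InlineSizeReport {
  totalBytes: number;
  inlineScriptBytes: number;
  inlineStyleBytes: number;
  inlineSvgBytes: number;
  base64DataUriBytes: number;
}

function utf8Bytes(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Sum the UTF-8 size of every match of a global regex.
function matchedBytes(html: string, pattern: RegExp): number {
  const matches: string[] = html.match(pattern) ?? [];
  let total = 0;
  for (const match of matches) total += utf8Bytes(match);
  return total;
}

function auditInlineSizes(html: string): InlineSizeReport {
  return {
    totalBytes: utf8Bytes(html),
    // <script> elements without a src attribute, including JSON blocks
    inlineScriptBytes: matchedBytes(html, /<script\b(?![^>]*\bsrc=)[^>]*>[\s\S]*?<\/script>/gi),
    inlineStyleBytes: matchedBytes(html, /<style\b[^>]*>[\s\S]*?<\/style>/gi),
    inlineSvgBytes: matchedBytes(html, /<svg\b[\s\S]*?<\/svg>/gi),
    base64DataUriBytes: matchedBytes(html, /data:[a-z0-9.+\/-]+;base64,[A-Za-z0-9+\/=]+/gi),
  };
}
```

Running this over the raw server response, before any client-side rendering, gives a quick ranking of which fix is likely to remove the most bytes.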

How to Measure HTML Size

# Measure uncompressed HTML size
curl -so /dev/null -w '%{size_download}\n' https://yoursite.com/page
 
# Measure the compressed transfer size (bytes on the wire); the crawl limit applies to the uncompressed size above
curl -H 'Accept-Encoding: gzip, br' -so /dev/null -w '%{size_download}\n' https://yoursite.com/page

Target ranges:

  • Under 100KB: Excellent
  • 100KB–2MB: Acceptable (investigate large sections)
  • 2MB–5MB: Needs optimisation
  • Over 5MB: Critical — risk of partial indexing
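
The target ranges above can be expressed as a small classifier, for example to label pages in a crawl report. This is a sketch under assumptions: `classifyHtmlSize` and the `SizeBand` labels are hypothetical names, 1KB is taken as 1024 bytes, and a size exactly on a boundary (2MB or 5MB) is placed in the lower band, which the list above does not specify.

```typescript
type SizeBand = "excellent" | "acceptable" | "needs-optimisation" | "critical";

const KB = 1024;
const MB = 1024 * KB;

// Map an uncompressed HTML size in bytes to one of the target ranges.
function classifyHtmlSize(bytes: number): SizeBand {
  if (bytes < 100 * KB) return "excellent";
  if (bytes <= 2 * MB) return "acceptable";       // boundary choice is an assumption
  if (bytes <= 5 * MB) return "needs-optimisation";
  return "critical";
}
```

The input would typically be the `size_download` value from the first `curl` command above.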

Exceptions

  • Staging, utility, login, account, or internal search pages may intentionally use different crawl or index signals if they are not meant to rank.
  • Temporary migration states can produce noisy intermediate signals; flag the live production URL pattern, not one-off transition artifacts.
  • When redirects, canonicals, robots directives, or indexability signals conflict, fix the strongest final signal first instead of reporting every downstream symptom as a separate blocker.

Verification

Automated Checks

  • Inspect rendered HTML and HTTP headers to confirm the expected metadata or crawlability signal is present.
  • Test the affected URL with Google Search Console or equivalent tooling where relevant.
  • Re-crawl a representative page set after deployment.

Manual Checks

  • Confirm the change does not create conflicting canonical-url, robots, or structured-data signals.

Use with AI

Copy these prompts to use with your AI assistant, or install the MCP server to use directly from Claude, Cursor, or Windsurf.

Check

Verify implementation

Measure the raw HTML response size (before compression) for each page. Flag pages over 2MB (investigate) and over 5MB (critical). Identify the cause of oversized HTML: (1) Large inline JSON (`<script id='__NEXT_DATA__'>` or similar). (2) Inline SVG files. (3) Base64-encoded images in HTML. (4) Inline CSS with large amounts of utility classes. (5) Unminified scripts in `<script>` tags.

Fix

Auto-fix issues

1. Measure: `curl -so /dev/null -w '%{size_download}' https://yoursite.com/page | awk '{print $1/1024 " KB"}'`
2. If large inline JSON is the cause (common in Next.js `__NEXT_DATA__`):
  • Reduce data passed to `getServerSideProps`/`getStaticProps`: only pass what the page renders
  • Use React Server Components (Next.js 13+) to reduce client hydration payloads
3. If inline SVG is the cause: move SVGs to external files and load with `<img>` or `<use>`.
4. If base64 images are the cause: serve images from a CDN and reference via URL.
5. Enable gzip/Brotli compression on the server to cut transfer time. Googlebot's size limit applies to the uncompressed HTML, so compression alone does not fix an oversized document.
6. Minify HTML output in production (remove whitespace and comments).

Explain

Learn more

Google's crawl infrastructure parses only the first ~15MB of an HTML document. Pages that exceed this limit have their tail content silently omitted from Google's index. Beyond the hard limit, large HTML documents consume more crawl budget, meaning fewer of your pages are crawled per day. This particularly affects large e-commerce sites or pages that server-render large datasets into HTML.

Review

Code review

Check the response `Content-Length` header (which reports the compressed size when compression is enabled) or measure the raw HTML byte count. Inspect `<script type='application/json'>` or `<script id='__NEXT_DATA__'>` blocks and count their size in bytes. Flag any single block over 500KB. Check for inline SVG elements (look for `<svg>` in body HTML) that should be external files. Verify HTML is served with gzip or Brotli encoding.

Sources

References used to support the guidance in this rule.

Further Reading

Tools and supplementary material for exploring the topic in more depth.

Crawl Budget Management | Google Crawling Infrastructure  |  Crawling infrastructure  |  Google for Developers

Learn what crawl budget is and how you can optimize Google's crawling of large and frequently updated sites.

Google for Developers · Guide
Google Crawler (User Agent) Overview | Google Crawling Infrastructure  |  Crawling infrastructure  |  Google for Developers

Google crawlers discover and scan websites. This overview will help you understand the common Google crawlers including the Googlebot user agent.

Google for Developers · Guide

