Sitemap Best Practices

Published: April 12, 2026 | Author: Aubrey Yung


A sitemap is a file that lists the URLs on your website and tells search engines which pages exist and when they were last updated. Think of it as a directory you hand directly to Googlebot so it doesn't have to figure everything out on its own.

Sitemaps are particularly useful for large sites, new sites with few inbound links, sites with rich media content, or any site where important pages risk being missed during a routine crawl.

That said, having a sitemap doesn't guarantee crawling: Google emphasizes that a sitemap is only a hint, and it may not use the sitemap at all.

This guide covers everything you need to know: what to include, where to place it, how to keep it healthy, and the mistakes that quietly undermine your crawl efficiency.

Types of sitemaps and when to use each

Not all sitemaps serve the same purpose, and understanding the differences will help you decide what your site actually needs.

  • XML sitemap is the standard format for search engines. It lists your URLs in a structured format that Googlebot, Bingbot, and other crawlers can parse efficiently. This is the one that matters most for SEO.
  • HTML sitemap is a page on your website written in plain HTML that lists your site's pages in a human-readable format. It primarily serves users who can't find what they're looking for through navigation. It has minimal direct SEO value today but can improve internal linking for very large sites.
  • Image sitemap extends your XML sitemap with image-specific metadata. It helps Google discover images that might not be found through crawling alone — particularly images loaded via JavaScript.
  • Video sitemap works the same way for video content, allowing you to provide metadata like video title, description, duration, and thumbnail URL directly to Google.
  • News sitemap is reserved for Google News–approved publishers. It surfaces articles published within the last 48 hours and follows stricter formatting requirements than standard XML sitemaps.

Most sites need only an XML sitemap. If your site relies heavily on images or video for traffic, add the relevant specialist extensions. HTML sitemaps are optional and rarely a priority unless your site has deep navigation or a large page count.
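To make the "specialist extension" idea concrete, here is roughly what an image-extended sitemap entry looks like. It uses Google's image sitemap extension namespace; the URLs are hypothetical placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/products/widget</loc>
    <!-- image entries ride along inside the regular <url> element -->
    <image:image>
      <image:loc>https://example.com/images/widget-hero.jpg</image:loc>
    </image:image>
  </url>
</urlset>
```

Video and news sitemaps follow the same pattern with their own namespaces and tags.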

Does your sitemap need to be in the root directory?

No, your sitemap does not technically have to live at the root of your domain, but placing it there is the strong default recommendation.

According to the Sitemap protocol:

The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://example.com/catalog/sitemap.xml can include any URLs starting with http://example.com/catalog/ but can not include URLs starting with http://example.com/images/.

This path-scope rule has practical implications depending on how your site is structured:

  • Standard single-domain sites: Placing the sitemap at https://example.com/sitemap.xml covers your entire site with no issues. This is the default for most CMS platforms (WordPress with Yoast, Rank Math, etc.) and the safest choice.
  • Subdirectory-based sites (e.g. example.com/blog/): If your blog section is managed separately and has its own sitemap at https://example.com/blog/sitemap.xml, that sitemap can only reference blog URLs. You cannot include pages from example.com/shop/ or the homepage in that file. To cover the whole site, you either need a root-level sitemap or a sitemap index file at the root that references all section-level sitemaps.
  • Subdomain sites (e.g. blog.example.com): A sitemap on a subdomain can only reference URLs within that subdomain. blog.example.com/sitemap.xml cannot list pages on www.example.com. Each subdomain needs its own sitemap, submitted separately in Google Search Console under the respective property.
  • Sitemaps hosted on a different host (CDN or centralised multi-domain setup): If your sitemap is hosted on a different host than the URLs it references — whether that's a CDN subdomain like cdn.example.com covering example.com, or a central sitemap host managing several domains — each target host must add a Sitemap: directive in its own robots.txt pointing to the externally hosted file. This proves ownership and resolves the cross-submission error that would otherwise occur. Even with this in place, each sitemap file must still contain URLs from one host only.
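For the cross-host case, the ownership handshake is a single line in each target host's robots.txt. A hypothetical setup where example.com serves its sitemap from a CDN subdomain would look like:

```
# robots.txt served at https://example.com/robots.txt
Sitemap: https://cdn.example.com/sitemaps/example-com.xml
```

The sitemap file on cdn.example.com may then list example.com URLs, but still only URLs from that one host.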

Regardless of where you place your sitemap, you should declare its location in your robots.txt file:

Sitemap: https://example.com/sitemap.xml

This line tells any crawler where to find your sitemap, without requiring them to guess. It's especially important if your sitemap is not at the conventional root path. This declaration overrides the path-scope discovery mechanism; Google reads the full URL from robots.txt and fetches it directly.

What URLs to include (and exclude) in your sitemap?

The rule of thumb: only include URLs you want indexed and that return a 200 status code. Every URL in your sitemap is an implicit recommendation to Google that the page is worth crawling and indexing. Polluting your sitemap with low-quality or technically incorrect URLs wastes crawl budget and sends mixed signals.

Include:

  • Canonical versions of all important pages
  • Pages that return HTTP 200
  • Pages without a noindex directive
  • Hreflang alternates (when using multilingual sitemaps)

Exclude:

  • Pages with a noindex meta tag or X-Robots-Tag header
  • Redirect URLs (301, 302) — list the final destination only
  • Pages returning 4xx or 5xx errors
  • Duplicate content pages — list the canonical URL only
  • Paginated pages (e.g. /page/2/, /page/3/) unless each page has unique, indexable content
  • Faceted navigation URLs (e.g. /shoes?color=red&size=10)
  • Admin, login, or thank-you pages
  • Staging or development environment URLs

A common mistake is letting your CMS auto-generate a sitemap and never auditing what ended up in it. Run a crawl with Screaming Frog or Sitebulb periodically and cross-reference sitemap URLs against their actual status codes and indexability.
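That cross-referencing can also be scripted. The sketch below, using only Python's standard library, parses a sitemap and flags non-200 URLs; `parse_sitemap` and `audit` are hypothetical helper names, and the status fetcher is injected so you can wire in urllib, requests, or exported crawler data:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return (loc, lastmod) pairs from a <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        entries.append((loc, lastmod))
    return entries

def audit(entries, fetch_status):
    """Flag entries whose URL does not return HTTP 200.

    fetch_status is a callable (url -> int); plug in a real HTTP
    client here, or a stub when testing.
    """
    return [(loc, code) for loc, _lastmod in entries
            if (code := fetch_status(loc)) != 200]
```

Run it on a schedule and any redirecting or dead URL that sneaks into your sitemap surfaces immediately.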

How large can a sitemap be?

Each individual sitemap file has hard limits: a maximum of 50,000 URLs and a maximum uncompressed file size of 50MB. If your site exceeds either limit, you must split the sitemap into multiple files.

The correct way to manage multiple sitemaps is with a sitemap index file — an XML file that lists the locations of your individual sitemap files rather than listing URLs directly. A sitemap index file follows this structure:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2025-04-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2025-04-01</lastmod>
  </sitemap>
</sitemapindex>
```

The sitemap index file itself also has limits: it can reference a maximum of 50,000 child sitemaps. In practice, this is almost never a constraint.

Splitting by content type (pages, posts, products, images) is a common and sensible approach. It makes it easier to identify crawl issues by section and to update only the relevant child sitemap when content in that section changes.

Do <priority> and <changefreq> tags matter?

The XML sitemap protocol supports two optional tags that sound useful in theory: <priority> (a value from 0.0 to 1.0 indicating how important a URL is relative to others on your site) and <changefreq> (a hint about how often the page changes — hourly, daily, weekly, monthly, etc.).

In practice, Google has publicly stated in its documentation that it ignores both <priority> and <changefreq> as crawl signals. Bing's position is similar. These tags were part of the original sitemaps.org protocol but never gained traction with the major search engines as actual ranking or crawling inputs.

You can safely omit both tags. If your CMS generates them automatically, they won't cause harm; they'll simply be ignored. The only exception worth noting: if you're generating sitemaps manually and <changefreq> values are wildly incorrect (e.g. daily on pages that haven't changed in three years), it reflects poorly on the overall accuracy of your sitemap data, even if Google ignores the tag itself.

The <lastmod> tag, by contrast, does matter. Google uses it as a signal to prioritise recrawling pages that have changed recently. Only populate <lastmod> with the genuine last-modified date of the page content. Do not set it to today's date on every URL, as that defeats the purpose and trains Google to distrust your sitemap's timestamps.
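One way to keep <lastmod> honest in a generator is to tie it to an actual content change rather than the build date. This is a minimal sketch of that idea; the function name and the hash-based change check are illustrative, not a standard API:

```python
import hashlib
from datetime import date

def updated_lastmod(content, prev_hash, prev_lastmod, today=None):
    """Return (content_hash, lastmod) for a page.

    Only bumps lastmod to today's date when the page content has
    genuinely changed since the last build; otherwise the stored
    hash and date are carried forward untouched.
    """
    h = hashlib.sha256(content.encode()).hexdigest()
    if h == prev_hash:
        return prev_hash, prev_lastmod      # unchanged: keep the old date
    return h, (today or date.today()).isoformat()
```

Persist the hash and date between builds and your timestamps stay trustworthy even when the whole sitemap is regenerated daily.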

10 Sitemap best practices

1. Only include canonical, indexable URLs

Every URL in your sitemap should be the canonical version of a page that returns a 200 status and has no noindex directive. If a page points to a canonical elsewhere, list the canonical — not the page itself. Sitemaps with non-canonical, redirecting, or noindexed URLs confuse crawlers and waste your crawl budget.

2. Place your sitemap in the root directory — or declare it in robots.txt

Your default sitemap location should be https://yourdomain.com/sitemap.xml. If this isn't possible for technical reasons, place it as high in the path hierarchy as the URLs it covers, and always declare its exact location in robots.txt with a Sitemap: directive. See the placement section above for a full breakdown of scope implications.

3. Use accurate <lastmod> timestamps

Populate the <lastmod> field only when a page's content has genuinely changed. Use the ISO 8601 date format (2025-04-01 or 2025-04-01T09:00:00+00:00). Avoid setting all timestamps to today's date — this erodes Google's trust in your sitemap and reduces the signal's usefulness for prioritising recrawls.

4. Keep your sitemap up to date

Your sitemap should reflect the current state of your site at all times. For most sites this means generating it dynamically (via your CMS or a plugin) rather than maintaining a static file manually. If you're on WordPress, Yoast SEO and Rank Math both handle this automatically. For custom builds, schedule a sitemap regeneration script to run whenever content is published or updated.
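For a custom build, the regeneration step can be as small as the sketch below, which serializes a list of (URL, lastmod) pairs into a valid <urlset> with Python's standard library. The `build_sitemap` name and the page list are hypothetical; wire it to your real content store and publish hook:

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build a <urlset> sitemap from (loc, lastmod_iso) pairs.

    lastmod may be None for pages without a reliable modification
    date. Returns the serialized XML document as bytes.
    """
    ET.register_namespace("", SM_NS)          # emit a default xmlns, no prefix
    urlset = ET.Element(f"{{{SM_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SM_NS}}}url")
        ET.SubElement(url, f"{{{SM_NS}}}loc").text = loc
        if lastmod:                           # <lastmod> is optional
            ET.SubElement(url, f"{{{SM_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Call it from your publish/update hook and write the bytes to /sitemap.xml, and the file can never drift out of sync with the site.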

5. Use a sitemap index file for large sites

If your site has more than a few thousand URLs, or if different sections are managed separately, use a sitemap index file at the root that references child sitemaps. Organise child sitemaps by content type or section. This makes crawl monitoring easier and ensures large sites stay within per-file limits.

6. Submit your sitemap to Google Search Console and Bing Webmaster Tools

Manual submission in both platforms gives you indexing status data and error reporting. In Google Search Console, go to Sitemaps under the Indexing section and submit your root sitemap URL. In Bing Webmaster Tools, go to Sitemaps and do the same. Submission is not a one-time task — resubmit if you create new child sitemaps or make significant structural changes.

Pro Tip

I recommend submitting all your child sitemaps individually in Search Console as well, so you can use the sitemap filter in the Page Indexing report to monitor the indexing status of particular site sections.

7. Declare your sitemap in robots.txt

Add the following line to your robots.txt file regardless of whether you've submitted via Search Console:

Sitemap: https://yourdomain.com/sitemap.xml

This ensures any crawler that reads robots.txt — including crawlers other than Googlebot — can find your sitemap without guessing. It's a passive, low-maintenance way to broadcast your sitemap's location to all bots simultaneously.

8. Use consistent URL formats

Pick one format and apply it everywhere: trailing slash or no trailing slash, https or http, www or non-www. Your sitemap URLs must match your canonical URLs exactly. Inconsistencies — even minor ones like a missing trailing slash — can cause Google to treat sitemap URLs and crawled URLs as different pages, undermining the signal entirely.
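If URLs enter your sitemap from several sources, it's worth normalizing them through one function so the policy lives in exactly one place. A minimal sketch, where the policy flags are illustrative and should match whatever your canonical tags actually use:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url, *, force_https=True, strip_www=True, trailing_slash=True):
    """Canonicalize a URL to one consistent format before it enters
    the sitemap. Pick the flag values that match your canonical URLs
    and apply them everywhere."""
    scheme, netloc, path, query, _frag = urlsplit(url)
    if force_https:
        scheme = "https"
    netloc = netloc.lower()                      # hostnames are case-insensitive
    if strip_www and netloc.startswith("www."):
        netloc = netloc[4:]
    if trailing_slash and not path.endswith("/"):
        path += "/"
    return urlunsplit((scheme, netloc, path, query, ""))  # drop fragments
```

Run every candidate URL through it at generation time and mismatches like http vs https or a stray www can't slip in.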

9. Don’t set-and-forget after a site migration

Site migrations are the most common moment when sitemaps break silently. After a migration, audit your sitemap to ensure all URLs reflect the new domain, new URL structure, and correct canonical setup.

Remove any old domain or obsolete URL references immediately. Submit the updated sitemap in Search Console and verify there are no crawl errors on the new property.

Pro Tip:

Don’t remove your old sitemap immediately; keeping it around lets you monitor the indexing status of the old pages. This is particularly useful if you have set up redirects.

10. Skip <priority> and <changefreq> unless you have a clear reason

Both tags are ignored by Google and largely by Bing. Omit them to keep your sitemap lean and accurate. The one field worth spending effort on is <lastmod> — keep it honest and it will earn you faster recrawls on updated content.

Conclusion

A well-maintained sitemap is one of the lower-effort, higher-leverage things you can do for your site's crawl health. It doesn't guarantee indexing, but it gives search engines a clear, authoritative list of the pages you consider important.

Aubrey Yung

Aubrey is an SEO Manager and Schema Markup Consultant with years of B2B and B2C marketing experience. Outside of work, she loves traveling and learning languages.
