The Implementation of a Sitemap
What is a sitemap?
You may have heard of the term sitemap and never actually seen one. Well, it’s nothing special and may not appear useful to you or me.
A traditional sitemap is a list of all published web pages on your website. It has a standard format to enable various systems to understand your website's content architecture and it’s not intended for people to read directly.
The sitemap contains limited information about your published content, like the modified date for each of your web pages, the priority used for re-indexing a page compared to other pages, and how frequently each web page is likely to be updated.
If you’re providing content in multiple languages for your website, the sitemap should include all language versions for each of the web pages and they should be grouped by the specific web page, as this helps the indexer (or search engine) understand where to find alternate language variations of the same web page and content.
The sitemap is designed for computer systems to read the content of your website (or at least a list of links to the content) without having to crawl each web page and follow links found there. It’s like a machine readable table of contents for your website.
At a minimum, the sitemap needs to contain the web page links for your website. It doesn’t have to include the other data, although the search engines will reward you for providing the additional details.
Providing the alternate language versions for each page, if you have them, will help the search engine provide the most applicable web page for the visitor - ideally in their language.
Sitemaps are created using XML to describe the page links of your website.
XML, or Extensible Markup Language, is a language designed to store and transport data. XML focuses on describing data in a structured format that can be easily shared and processed by different systems and applications.
Here’s an example of a sitemap:
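The URLs, dates and values below are placeholders purely for illustration; a minimal sitemap listing two pages could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per published web page -->
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2025-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/about.html</loc>
    <lastmod>2024-11-02</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>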
Why is a sitemap important?
The sitemap forms part of the steps required for your website to rank highly in search results, by helping search engines and other tools to find pages and updates to pages on your website.
For a search engine (e.g. Google, Bing or others) to discover the pages of a website, it can start at the home page and crawl each of the links provided there. However, this is considered a resource-hungry and slow approach.
The sitemap provides the list of web page links along with language variations and this allows the search engine to index these web pages before, or instead of, trying to crawl the page links found.
This approach is faster than web crawling, and with the addition of reindexing priorities and modified dates, it makes the search engine's job much faster indeed.
Sitemaps are not only for use by search engines (although that is the traditional use case). Other tools which might find a sitemap useful include SEO ranking tools, content mapping tools, Content Management Systems and AI information gathering/web-crawling tools. The sitemap will help these understand the frequency of updated content, priority pages and alternate language variations, which helps save time and resources in the same way it does for search engine crawling.
What about robots.txt?
Normally you would create this text file along with the sitemap. It doesn’t change very often and it contains references to the sitemap or sitemaps (more on that in a minute).
Whilst a sitemap is traditionally found at ‘/sitemap.xml’, it doesn’t have to be. This is where the robots.txt file comes in: it provides information to bots and tools, indicating the rules for the website and where to find the sitemap.
The robots.txt file includes the allowed and disallowed website paths for different user-agents (the user-agent is a header sent with web page requests by bots, tools and browsers).
Search engine bots and other crawler type tools should read the robots.txt to find the correct location of your sitemap and to check where they’re allowed to crawl when following links on pages.
The robots.txt file may list one or more sitemaps. That’s right: you can list multiple sitemaps.
Here’s an example of a robots.txt:
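The paths and sitemap location below are placeholders; a simple robots.txt could look like this:

# Rules for all user-agents
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /search/

# Where to find the sitemap(s)
Sitemap: https://www.example.com/sitemap.xml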
Sitemaps have limits, which means you may need to manage more than one sitemap. The basic limits are a 50MB file size and 50,000 URLs per sitemap. You can read more about this on the sitemaps.org website.
If you do find yourself hitting these limits, you can split your sitemap into several files (or URLs) and then choose to either list them individually in the robots.txt or reference them through a sitemap index.
The sitemap index is basically an index of the sitemap URLs.
Here’s an example of a sitemap index:
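Again, the file names below are placeholders; a sitemap index pointing at two separate sitemaps could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <sitemap> entry points at a separate sitemap file -->
  <sitemap>
    <loc>https://www.example.com/sitemap-pages.xml</loc>
    <lastmod>2025-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-articles.xml</loc>
    <lastmod>2025-01-16</lastmod>
  </sitemap>
</sitemapindex>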
Does Generative AI use a sitemap or robots.txt?
This is an emerging technology and there’s no single rule. Some AI models will use the robots.txt and sitemap files to locate the pages of a website, when they were modified, which languages the content is presented in, and so on. This is certainly useful for any indexing of content, whether it’s a search engine or an AI model. However, that doesn’t mean all AI models will use this approach, or that they will adhere to the rules defined in the robots.txt file.
Model Context Protocol (MCP) is an open standard that lets AI models use external tools and services through a unified interface. This includes tools that read content from different sources (like a website), and tools that scrape each page of a website are commonly used.
The sitemap is still the most appropriate content index for websites, although new standards are emerging. One standard from early 2025 is the LLMS.txt file and another is the Cloudflare Content Signals Policy. It is important to understand that you could use these approaches now, even though the impact is currently unproven.
Who are arcast? We are experienced technology advisers, providing modernisation and AI process-led solutions in digital experience and content management. arcast’s people demonstrate exceptional listening skills and then find the right technical solution for your problems. We are known as ‘M-shaped people’, whereby we pride ourselves on being able to operate at various levels within organisations. We roll up our sleeves and help you through the tough times, whilst improving your digital estate and processes. Reach out to a member of the arcast team or contact us to speak to somebody who cares.
Important parts of a sitemap
When setting up your sitemap, there are a number of elements that need to be considered, as mentioned above. Most pages will use a common priority and frequency, but the home page and key or feature pages are likely to require a higher priority and a more immediate frequency.
An example would be for the home page to have a priority of 1.0 and a frequency of hourly or daily. A search results page, or a page that contains the latest news on a fast-moving website, might have a frequency set to always, meaning the page is always changing and should therefore always be checked.
Pages that are less likely to change, like article pages, may have a longer frequency of monthly, yearly or even never (although I’d be wary of using ‘never’).
If you’re providing alternate language variations for a page in your sitemap, you should be sure to include all language versions in the list. Here’s an example:
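The page URLs below are placeholders; a page available in English, French and German could be listed like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/page.html</loc>
    <!-- Every language version is listed, including the page itself -->
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/page.html"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page.html"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.example.com/de/page.html"/>
  </url>
  <!-- The French and German pages each get their own <url> entry with the same set of alternate links -->
</urlset>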
Notice that the "https://www.example.com/page.html" page link is provided in the ‘loc’ field and in the alternate language links. This ensures the page is included in the list of available languages. When a page has multiple language versions, these alternate language links should be provided in the web page header as well as the sitemap.
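For reference, the equivalent alternate language links in the web page header would look like this (using the same placeholder URLs):

<head>
  <link rel="alternate" hreflang="en" href="https://www.example.com/page.html" />
  <link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page.html" />
  <link rel="alternate" hreflang="de" href="https://www.example.com/de/page.html" />
</head>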
Content Management Systems
We’ve worked with many Content Management Systems (CMS) over the years and the principles of them haven’t really changed. They’re great tools to manage your content and deliver it through APIs and templates to websites, mobile apps and other channels.
When all of the content for your website is accessible from the CMS, there is a good opportunity to dynamically create the sitemap.
Content Management Systems come in all shapes, from the likes of Umbraco and WordPress that are all-in-one applications (back-office editing in the same application as the front-end website), to scaled systems like SDL Tridion/Web, where you publish static and dynamic content to separate web-serving infrastructure, through to headless solutions like Storyblok that provide an API for your web server to retrieve the content.
With all of these Content Management Systems you’re able to construct a dynamic sitemap. Ultimately this saves you time and money and it enables you to use best practice to ensure your sitemap is working for you.
Without a CMS, it’s common to use third-party tools to crawl your website and then allow manual edits. One popular tool is Screaming Frog - which I remember for all the wrong reasons.
Providing editorial controls
If you’re using a CMS to manage your website(s) content and find there is no feature built in to manage the sitemap, check the plugins for that CMS, as often a third-party has provided this feature.
As a minimum, the expectation is to have the option to exclude or include a web page in the sitemap. Ideally there would be a setting at page/document and folder level to allow easy control over what is and isn’t included in the sitemap.
Taking this a step further, it's possible to add the priority and frequency settings, although this isn’t always required.
The multiple language variations should be automated, so that you do not have to manually specify each of the language page variations. This only becomes awkward when you decide to show completely different content for each language, at which point you should ask yourself whether a separate site is required.
How do you create a sitemap when using a CMS?
As mentioned, this is normally something the CMS can provide either as a built-in feature or through a third-party plugin, however sometimes that isn’t the case or you just want a different implementation.
We’ve built countless websites with Content Management Systems and implemented bespoke sitemap generation for them. It’s an easy feature to build and generally runs in the background without any user interaction.
There are some things to consider when creating your own sitemap generation feature:
- Does it generate the sitemap on a schedule (i.e. once per day) or does it update the sitemap whenever there is a page change (or new page)?
- Do you reindex the entire sitemap or only insert/update the specific page?
- Does the CMS provide the necessary API/SDK to retrieve web pages or content items and their descendants?
- Do you have dynamic content pages and do they also need to be included in the sitemap?
- How do you exclude pages or sections from the sitemap?
- Is the generation task resource intensive and will it take time to create or impact other users, or website visitors?
All these considerations will be different for each implementation.
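To illustrate the general shape, here is a minimal sketch in Python of a scheduled sitemap generation task. The Page fields and the get_published_pages() function are hypothetical stand-ins for whatever API or SDK your CMS actually provides:

from dataclasses import dataclass
from datetime import date
from xml.sax.saxutils import escape

@dataclass
class Page:
    url: str
    last_modified: date
    priority: float = 0.5
    change_frequency: str = "monthly"
    include_in_sitemap: bool = True  # editorial include/exclude flag

def get_published_pages() -> list[Page]:
    """Hypothetical call into your CMS API/SDK to list published pages."""
    return [
        Page("https://www.example.com/", date(2025, 1, 15), 1.0, "daily"),
        Page("https://www.example.com/articles/launch.html", date(2024, 11, 2)),
    ]

def build_sitemap(pages: list[Page]) -> str:
    """Build the sitemap XML from the pages the CMS reports as published."""
    entries = []
    for page in pages:
        if not page.include_in_sitemap:
            continue  # honour the editorial exclude setting
        entries.append(
            "  <url>\n"
            f"    <loc>{escape(page.url)}</loc>\n"
            f"    <lastmod>{page.last_modified.isoformat()}</lastmod>\n"
            f"    <changefreq>{page.change_frequency}</changefreq>\n"
            f"    <priority>{page.priority:.1f}</priority>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>\n"
    )

if __name__ == "__main__":
    # Run on a schedule (e.g. once per day) and write the result to the web root.
    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write(build_sitemap(get_published_pages()))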
Caching the sitemap
Once you’ve built your sitemap generator, verified that the contents are correct and updated on time, and you’re ready to take it live, consider what happens if multiple requests for the sitemap arrive at the same time.
Performance testing your website and especially any dynamic or functional content (like the sitemap) is an important step. We advise caching resource hungry pages and this includes dynamic content pages and sitemaps.
Using a cache to store a copy of the latest version of the sitemap will improve time to view for the search engine or system. You can cache on the web server, in the CMS or at a Content Delivery Network (CDN).
It’s important to remember to flush the content/pages from cache when you have an update. Remember too that if you’re sending cache headers to the browser (a CDN will do this), you cannot always control when the visitor will see the update; in those situations you have to review your cache policy and perhaps implement a different type of cache or lower the time in cache.
Without a cache on dynamic content or functional content pages, you risk high resource usage on your web server and unintentional denial of service (DoS) situations.
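As a rough sketch only, a simple time-based cache in front of the generator could look like this in Python. The generate_sitemap() function is a stand-in for the expensive generation step (for example, the build_sitemap() sketch shown earlier):

import time
from typing import Optional

_CACHE_TTL_SECONDS = 3600  # serve the cached copy for up to an hour
_cached_sitemap: Optional[str] = None
_cached_at: float = 0.0

def generate_sitemap() -> str:
    # Stand-in for the expensive generation step, e.g. a call into your CMS
    # or the build_sitemap() sketch shown earlier.
    return '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>\n'

def get_sitemap() -> str:
    """Return a cached copy of the sitemap, regenerating it only when stale."""
    global _cached_sitemap, _cached_at
    if _cached_sitemap is None or (time.time() - _cached_at) > _CACHE_TTL_SECONDS:
        _cached_sitemap = generate_sitemap()
        _cached_at = time.time()
    return _cached_sitemap

def flush_sitemap_cache() -> None:
    """Call this from your publish/update hook so changes appear promptly after an update."""
    global _cached_sitemap
    _cached_sitemap = None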
Checking your sitemap
There are various online services that can be used to check your sitemap’s validity, including those provided by Google and other search engines.
I prefer to use a simple tool, such as the service provided by XML Sitemaps.
It’s important that you check that your sitemaps are valid, since an invalid sitemap can easily go unnoticed and can have a big impact on your search engine ranking.
Extending the sitemap
Google is one example of a search engine that supports sitemap extensions that allow you to provide additional content in the sitemap. The search engine will use additional content to improve search results for visitors.
This includes listing images that form part of the content of a page (not ‘furniture’ type images), or videos that have contextual meaning to the page’s content.
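For example, Google’s image sitemap extension adds an image namespace so that content images can be listed against the page (the URLs below are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://www.example.com/article.html</loc>
    <!-- Content images that belong to the page, not decorative 'furniture' images -->
    <image:image>
      <image:loc>https://www.example.com/images/diagram.png</image:loc>
    </image:image>
  </url>
</urlset>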
Google also recommends managing a separate news sitemap, although it isn’t strictly required to be separate. Google recommends it because it helps them categorise those pages and, similar to the frequency value, it helps the search engine understand that the pages will not update frequently.
It will be interesting to see whether the sitemap standards are expanded for AI models in the future, or whether they continue with new standards like the LLMS.txt approach mentioned above.
Summary
If you care about visitors finding your website content when they search using search engines like Google or AI models like Gemini or ChatGPT, you should ensure you have an up-to-date robots.txt and sitemap. These are fundamental structural items that every website should implement. Not having them, or not keeping them up to date, will have a negative impact on your audience trying to find content on your website.
Sitemaps are trivial to implement and have a positive impact for your visitors and for your business.
If you’re in any doubt or have questions, please do get in touch and a member of the arcast team will happily arrange a call.