You can use IMPORTXML function in Google Sheets to get the URLs from an XML sitemap:
=IMPORTXML("YOUR_SITEMAP_URL", "//*[local-name() ='url']/*[local-name() ='loc']")
By regularly extracting and analysing the URLs in a sitemap, you can monitor changes and updates to the website’s structure. This can help identify issues like broken links or missing pages. You can also extract URLs from a sitemap to build a content inventory and facilitate content audits.
In this guide, we’ll walk you through the process step by step.
Before we dive into the extraction process, ensure you have the following prerequisites in place:
- Access to Google Sheets: You’ll need a Google account to access Google Sheets. If you don’t have one, you can create a free Google account.
- A Sitemap URL: Obtain the sitemap URL of the website you want to extract URLs from. A sitemap is an XML file that contains a list of URLs from a website, making it easier for search engines to crawl and index web pages.
How to find the sitemap of a website?
To find the sitemap XML file for a website, you can follow these steps:
Check the Default Location
Many websites use the default location for their sitemap XML file, which is often https://www.example.com/sitemap.xml or https://example.com/sitemap.xml. Try entering these URLs directly into your web browser to see if the sitemap is available there.
Check the Robots.txt File
If you can’t find the sitemap in the default location, you can also check the website’s robots.txt file. This file often contains information about the location of the sitemap. You can access it by adding /robots.txt to the end of the website’s URL.
For example, if the website is example.com, you would go to example.com/robots.txt.
Look for a line in the robots.txt file that specifies the location of the sitemap. It may look something like this:
Sitemap: https://www.example.com/sitemap.xml
If you find this line, it will provide you with the URL of the sitemap XML file.
⭐️ Pro Tips:
Remember that not all websites have sitemap XML files, and some may have them hidden or located in non-standard locations. Additionally, the availability of the sitemap may depend on the website’s configuration and whether they choose to make it publicly accessible.
How to Extract URLs from a Sitemap Using Google Sheets
Google Sheets is a versatile tool that can be used for various data-related tasks. One of its handy features is the ability to extract URLs from a sitemap, making it easier to analyse and work with website data.
Step 1: Create a new Google Sheets
First, create a new sheet and rename the document as anything you want, so you can easily come back to this file for further analysis.
Or you can also type sheets.new into the address bar of your web browser to use the shortcut to instantly create a new Google Sheets.
Step 2: Use ImportXML Function
In cell A1, paste the following Google Sheet formula:
=IMPORTXML("YOUR_SITEMAP_URL", "//*[local-name() ='url']/*[local-name() ='loc']")
Replace “YOUR_SITEMAP_URL” with the actual sitemap URL you want to extract data from.
Press Enter. Google Sheets will start importing the data from the sitemap.
Wait for Google Sheets to fetch the data. It may take a few moments depending on the size of the sitemap.
Column A will now contain a list of URLs extracted from the sitemap.
⭐️ Pro Tips:
The function doesn’t work with Sitemap Index. If you have a Sitemap index, use the URL of individual sitemap instead.
Step 3: Data Cleanup (Optional)
If you want to clean up the data or perform additional tasks, consider these optional steps:
Removing Duplicates
- To remove duplicates, select column A by clicking on the column header.
- Go to Data > Data Cleanup > Remove duplicates. Follow the on-screen prompts to remove any duplicate URLs.
Sorting Data
- To sort the data alphabetically, select column A again.
- Go to Data > Sort sheet A-Z (ascending) or Sort sheet Z-A (descending) as per your preference.
Step 4: Analyse your sitemap
You can now analyse the extracted URLs from the sitemap.
If you have access to the website’s Google Search Console, compare the sitemap data with the indexed pages. This can help you identify any discrepancies or issues that might affect search engine rankings.
You can also compare with your website to see if there are any missing pages or broken links included in your sitemap.
Common Errors and Troubleshooting with IMPORTXML in Google Sheets
When using the ImportXML function in Google Sheets to extract URLs from an XML sitemap, you can troubleshoot the issues by following these:
- Check the URL of the sitemap. Make sure that it is correct and complete.
- Use the correct XPATH query
- Make sure your syntax of the ImportXML function is correct.
Error 1. #N/A – “Imported Content Empty”
The error message “Imported content is empty” typically occurs when the IMPORTXML function in Google Sheets fails to retrieve any data from the specified URL using the provided XPath expression.
In the IMPORTXML function, we use the XPath expression to specify the location of the data you want to extract from the XML document.
However, as your sitemap index has a different structure, the above ImportXML function will not return a list of URLs.
In such cases, you should open your sitemap index and get the correct sitemap URL from there.
Error 2. #N/A – “Imported XML content cannot be parsed”
This error indicates that there is an issue with the structure or format of the XML syntax.
The XML content retrieved by IMPORTXML may have an invalid structure, such as missing tags, incorrect nesting, or other XML syntax errors. This can prevent Google Sheets from parsing the XML content correctly.
To resolve the issue, check if your XPath expression used in IMPORTXML is correct and accurately targets the desired elements containing URLs within the XML sitemap.
Error 3. #ERROR – “Formula parse error”
When the IMPORTXML function returns a “Formula parse error,” it indicates issues with the structure or syntax of the function.
The most common reason is that you forget to put your sitemap URL in quotes. Unless you put your sitemap in an individual cell, then you should use the cell reference without quotes.
So, check for any syntax errors within the IMPORTXML function, such as mismatched quotation marks. Ensure that the function is properly structured according to Google Sheets syntax or XPath syntax.
Conclusion
Extracting URLs from a sitemap using Google Sheets is a straightforward process that can save you time and effort when dealing with website data. By following these steps, you can easily obtain and work with a list of URLs from any sitemap.
Whether you’re an SEO professional, a web developer, or simply curious about a website’s structure, this method can be a valuable addition to your toolkit.