How to Create and Analyze XML Sitemap using pysitemaps?

11-03-2023
Python

Sitemaps are used to systematically inform 3rd parties (users, search engines, websites) about the structure of your websites. They should be carefully designed to avoid any unwanted errors, (possible)redirect codes, or invalid URLs.

Sitemaps can be submitted to articles, pages, images, pdf files or for videos. They help search engines find appropriate content without in-depth crawling. These sitemaps also help users to avoid extra crawl depths (usually, search engines only go up to a crawl depth of 6).

pysitemaps is ideal for such tasks for static sites generated using python.

What is pysitemaps python package ?

pysitemps is a python package to systematically generate such sitemaps for static sites using python. There is no additional library dependency as pystitemps only uses the python standard library (tested with version 3.10.)

Which Porblems can be solved with pysitemaps package?

  1. No additional packages need to be installed
  2. Neat and clean code
  3. It can be added to any python static site generator (your own or something like pelican or mkdocs)
  4. can read, write and analyze sitemaps

How to Create and Analyze XML Sitemaps using pysitemaps?

First, install it from PyPi using the following command

pip install pysitemaps

Or you can install it from the source by downloading the source code. Open the cmd console and then navigate to download the source code. Now install it using the following command.

pip install -e .

Now import Sitemap, `XmlDocument and Urlfrom pysitemaps module and use them as shown in the following example.

from pysitemaps import Sitemap, Url, XmlDocument

How to Read Sitemap

Reading a sitemap from a XML file stored on a local computer is very easy with pysitemap.

You only need to create a Sitemap object with the website name and then call the .read function. In the ‘.read()’ function you can also specify the full path of your sitemap.xml file.

An example of this procedure is shown below.

smp = Sitemap(website_name="seowings.org")
smp.read("sitemap.xml")
print(smp.as_dict())

How to Create Sitemap

You can create a website sitemap by using the Sitemap Object with website_name, file_path of the sitemap.xml file, and optional xsl_file.

Then append Urls to the sitemap objects. Later, you can use the .write() function to save the files to your local disk. The file will be saved to file_path, which you specified at the time of Sitemap object creation.

smp = Sitemap(
    website_name="https://www.seowings.org/",
    file_path="sitemap.xml",
    xsl_file="https://www.seowings.org/main-sitemap.xsl",
)
smp.append(
    Url(
        loc="https://www.seowings.org/a.html",
        lastmod="2022-12-25",
        images_loc=["https://www.seowings.org/a1.png"],
    )
)
smp.append(
    {
        "loc": "https://www.seowings.org/b.html",
        "lastmod": "2023-05-01",
        "images_loc": [
            "https://www.seowings.org/b1.png",
            "https://www.seowings.org/b2.png",
        ],
    }
)
smp.write()
print(smp.as_dict())

How to Locate/Find Sitemap of any Website

If you want to locate the sitemap of any website on the internet, then create a Sitemap object and use .fetch(include_urls=False). This function will look for all possible locations where a website can have a sitemap.

The following snippet will locate the sitemap of the seowings.org website.

smp = Sitemap(website_name="https://www.seowings.org/")
smp.fetch(include_urls=False)
print(smp.as_dict())

How to Fetch Sitemap of any Website

You can also use the fetch function to download the complete sitemap of a website. For this, you need to enable include_urls=True in the fetch function as shown in the python snippet below.

This snippet will locate and download the sitemap of the seowings.org website.

smp = Sitemap(website_name="https://www.seowings.org/")
smp.fetch(include_urls=True)
print(smp.as_dict())

How to Fetch the Index Sitemap of any Website?

smp = Sitemap(website_name="https://www.dw.com/")
smp.fetch(include_urls=False)
print(smp.as_dict())

How to Extract Urls from XML Sitemap Document?

There are certain use cases when you might want to extract URLs from the sitemap’s url_set tag. For this, you can create XmlDocument Objet and then use the add_from_text function to extract Urls from the raw text (text needs to be a valid sitemap XML file).

In the following snippet, we are showing how to create an XmlDocument and then add Urls to it using three different approaches. - using Url Objet - using url paramters i.e. loc, lastmod and images_loc - extracting Urls from raw XML text.

from datetime import datetime

news_sitemap = XmlDocument(
    "sitemap-news.xml",
    lastmod=datetime.now().strftime("%Y-%m-%d"),
    include_urls=False,
)
news_sitemap.add_object(Url("b.html", "2023-05-02", ["img1.png", "img2.png"]))
news_sitemap.add_url(
    loc="c.html", lastmod="2023-01-02", images_loc=["img4.png", "img5.png"]
)
news_sitemap.add_from_text(
    """
    <urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
            <loc>z.html</loc>
            <lastmod>2022</lastmod>
            <image:image>
                <image:loc>z.png</image:loc>
            </image:image>
        </url>
        <url>
            <loc>dz.html</loc>
            <lastmod>2022</lastmod>
            <image:image>
                <image:loc>z.png</image:loc>
            </image:image>
            <image:image>
                <image:loc>a.png</image:loc>
            </image:image>
        </url>
    </urlset>
    """
)
print(news_sitemap.as_dict())

How to Contribute?

Suggestions, criticism, comments and contributions are welcome. Please open an issue on the repository home page on GitHub.

If you have fixed any bug or added any useful feature, please clone this repository. Include your changes and create a pull request. We appreciate your efforts and will add them at earliest.

Documentation

Documentation of pysitemaps is live at pysitemaps.pages.dev.