How to use the site crawler?

The site crawler is a way to import an existing website and create a sitemap from it.


To use the site crawler, follow these steps:

  1. In the dashboard, click the New Project button or open an existing project. Note that importing a sitemap with the site crawler overwrites the project's current sitemap.

  2. Click on the Import button on the toolbar.

    HELP-600-import.gif

  3. In the import panel, from the available import options select Website crawler.

    In the blank field, enter your existing website's URL.

    help-600-crawler.gif

  4. Select one of the following options:

    1. Cell Text:

      1. Select File/Directory name to display a file/directory name in your sitemap page label. This will use the "path" part of the URL as cell text.

        For example, in this link the cell text is test/second-link/page: http://example.com/test/second-link/page

      2. Select Header <h1> Tag to include the main header in your sitemap page label. This will use the text from the first <h1> element.

      3. Select Page <title> Tag to include the page title in your sitemap page label.

        This will use the text from the <title> tag.

      4. Optionally, check the Exclude Common Text From Pages On Import option to remove text that recurs in page titles throughout your website. For example, you can remove repetitive SEO text strings like "| Company Name" that precede or follow the title text.

    2. Follow Mode:

      1. Select Domain And All Subdomains if you want the Site Crawler to crawl the domain and all of its subdomains.
        For example, if you enter http://example.com/, our Site Crawler will fetch pages from the example.com domain as well as all subdomains, e.g. foo.example.com or bar.example.com.

      2. Select Domain Only if you would like our Site Crawler to follow links only from the specified domain, e.g. example.com (and also www.example.com).

      3. Select Domain And Directory Path Only if you would like to restrict the crawl to a specific domain and directory path. For example, if you enter http://www.example.com/dir/, the Site Crawler will only follow links beginning with http://example.com/dir/ and http://www.example.com/dir/.

      4. Check Don't Follow Query String Variables to exclude links containing ? and & characters, for example: http://example.com/link?param=1&page=2

        This option is helpful if your website has many pagination pages or a dynamically generated calendar.

    3. Check additional options to:

      1. Add Links to add a URL to each imported page.

      2. Add Meta Description Note to add a note with the content from the <meta name="description"> tag.

      3. Limit Number of Pages to set the maximum number of pages you want the import tool to fetch. For example, entering 10 will download a maximum of 10 pages from the given website.

      4. Filter Directories to add directories you want the site crawler to avoid.

        1. Enter a directory name in the blank field. Include /* after the directory name to exclude all subdirectories.
          For example, if you enter /articles/* and /blog/test/, our Site Crawler will ignore http://example.com/articles/ and all of its subdirectories, as well as the http://example.com/blog/test/ page.

        2. Click the plus sign to add another directory name. You can add as many directories as you want.

      5. If your website is password protected, select Use Basic HTTP Authentication and enter your username and password. Note: currently only the HTTP Basic Auth method is supported.

      6. Check the Ignore ROBOTS.TXT file rules option if your server has a restrictive robots.txt file that prevents web crawlers from accessing your website.

      7. Check Ignore HTTP Cookies to disregard cookies that your website tries to store.

      8. Use a custom User Agent string if your website is behind a firewall that blocks any client with "crawler" in its name. By default, this field is pre-populated with your browser's user agent string, so our crawler identifies itself as a regular browser and your firewall does not block it.

      9. Import SEO meta data to Content Planner to import SEO details (metadata and URL slug) for each page into the Content Planner tool.

      10. Import Into a Section to import the sitemap as a section under a specific page without overwriting the whole project.

  5. Click Import when completed.
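
The Follow Mode, query string, and Filter Directories rules from step 4 can be pictured as a single check applied to every link the crawler finds. The sketch below is only an illustration of the rules as described above, not the crawler's actual code; the function name `in_scope`, the mode strings, and the parameter names are all hypothetical.

```python
from urllib.parse import urlsplit


def _strip_www(host: str) -> str:
    """Treat www.example.com and example.com as the same domain."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def in_scope(link, start_url, mode="domain_only",
             skip_query_links=False, excluded=()):
    """Return True if `link` would be followed under the given options.

    mode is one of the hypothetical labels "domain_and_subdomains",
    "domain_only", or "domain_and_directory".
    """
    start, target = urlsplit(start_url), urlsplit(link)
    base = _strip_www(start.hostname or "")
    host = _strip_www(target.hostname or "")
    path = target.path or "/"

    # Don't Follow Query String Variables: drop links such as /link?param=1&page=2
    if skip_query_links and target.query:
        return False

    # Filter Directories: "/articles/*" excludes the directory and everything
    # below it; "/blog/test/" excludes only that one page.
    for rule in excluded:
        if rule.endswith("/*"):
            prefix = rule[:-1]  # "/articles/*" -> "/articles/"
            if path.startswith(prefix) or path == prefix.rstrip("/"):
                return False
        elif path.rstrip("/") == rule.rstrip("/"):
            return False

    if mode == "domain_and_subdomains":
        return host == base or host.endswith("." + base)
    if mode == "domain_only":
        return host == base
    if mode == "domain_and_directory":
        return host == base and path.startswith(start.path or "/")
    raise ValueError(f"unknown mode: {mode}")
```

For example, with http://example.com/ as the start URL, `in_scope("http://foo.example.com/a", "http://example.com/", mode="domain_only")` is False, while the same call with `mode="domain_and_subdomains"` is True.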

Depending on the size of the existing site, you may wait several minutes before your sitemap is built. The number on the import screen shows how many pages have been downloaded so far. For example, Gathering Links (13) means the crawler has scanned 13 pages.

HELP-600-loading-page.gif

If you don't see any progress for a few minutes, click the Cancel button and try again. You can also stop the crawl manually: click the Stop & Save button to generate a sitemap from the pages that have been scanned successfully so far.

Note: Keep in mind that the crawler can only crawl publicly accessible websites. It cannot crawl intranet sites or websites with custom authentication, such as pages where you log in through a form with a username and password. Only HTTP Basic authentication is supported.
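
For background, HTTP Basic authentication, the one method the crawler supports, sends the username and password in a single request header, which is why it works without a login form. The snippet below is a minimal sketch of how such a header is built according to the HTTP Basic scheme; the function name `basic_auth_header` is hypothetical and this is not the crawler's own code.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP Basic authentication."""
    # The credentials are joined with a colon and Base64-encoded (not encrypted).
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Because the credentials are only encoded, not encrypted, Basic authentication is normally used over HTTPS.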

During the crawling process you can close the browser window, edit another sitemap, or crawl another website on another sitemap.

You have now successfully built a sitemap using the site crawler. Your new sitemap is now ready for editing and customization!
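
The page labels in the new sitemap depend on the Cell Text option chosen in step 4. As a rough illustration, the sketch below shows how a File/Directory name label can be derived from a URL, and one naive way to remove text that recurs at the end of every title. The helper names `path_label` and `strip_common_suffix` are hypothetical, the suffix-only approach is an assumption, and the real Exclude Common Text option may work differently (for instance, it also handles text that precedes the title).

```python
import os
from urllib.parse import urlsplit


def path_label(url: str) -> str:
    """Use the path part of the URL, without outer slashes, as the label."""
    return urlsplit(url).path.strip("/")


def strip_common_suffix(titles):
    """Remove text shared at the end of every title (naive sketch)."""
    if len(titles) < 2:
        return list(titles)
    # The common suffix is the common prefix of the reversed strings.
    common = os.path.commonprefix([t[::-1] for t in titles])[::-1]
    # Start the cut at a space so shared trailing letters of words are kept.
    cut = common.find(" ")
    if cut == -1:
        return list(titles)
    common = common[cut:]
    # Fall back to the original title if stripping would leave nothing.
    return [t[: -len(common)].rstrip() or t for t in titles]
```

With the URL from step 4, `path_label("http://example.com/test/second-link/page")` gives `test/second-link/page`.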

The site crawler feature is only available to users with a Pro, Team, or Agency subscription.

Note: For security reasons, the site crawler is limited to 10 000 pages (you can fetch up to 10 000 pages during one crawling process).
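
To see how a page cap such as Limit Number of Pages or the 10 000-page ceiling affects the result, consider the sketch below. It assumes a breadth-first crawl over an in-memory link map; the order in which the real crawler visits pages is not documented here, and the names `crawl` and `get_links` are hypothetical.

```python
from collections import deque


def crawl(start, get_links, max_pages=10_000):
    """Visit pages breadth-first and stop once max_pages have been fetched."""
    seen = {start}
    queue = deque([start])
    fetched = []
    while queue and len(fetched) < max_pages:
        page = queue.popleft()
        fetched.append(page)
        for link in get_links(page):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return fetched


# A tiny stand-in for a website: each page maps to the links it contains.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/"], "/c": []}
first_three = crawl("/", lambda page: site.get(page, []), max_pages=3)
```

Here `first_three` contains `/`, `/a`, and `/b`; the page `/c` was discovered but never fetched because the limit was reached first.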
