Once you have researched relevant keywords and created the copy, the next step in the process is getting your site indexed. However, your website has to be crawled before it can be indexed. You may be wondering what the difference between crawling and indexing is.
Well, crawling is the process through which search engine bots discover new and updated content on your website, while indexing is the process of analyzing that content and storing it in the search engine's database, organized by criteria such as the keywords used and the relevance of the content. If your content cannot be found online (that is, it is not indexed), you may have a crawling issue. So, how do you identify any crawling issues you have?
Full Website Is Not Indexed
If you cannot find any of your pages online, use the site:domain.com search operator to find out whether they are really not indexed. If none of them show up, there may have been a problem crawling your whole website. A common way to find out why is by looking at your robots.txt file. This file contains instructions on what bots can and cannot do on your website. If you find "Disallow: /" in your robots.txt file, that means you have blocked bots from accessing your whole website, so it will neither be crawled nor indexed.
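For illustration, this is what a fully blocking robots.txt looks like; the asterisk applies the rule to every crawler:

```
# Applies to every crawler
User-agent: *
# A bare "/" blocks the entire site from being crawled
Disallow: /
```

An empty "Disallow:" line, by contrast, permits crawling of everything.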
A related issue is the noindex rule. Contrary to what many guides suggest, this rule does not belong in robots.txt; Google stopped honoring noindex directives in robots.txt in 2019. Instead, noindex is set through a robots meta tag or an X-Robots-Tag HTTP header, and it tells bots not to index a page even if they can crawl it. Sorting this out is as easy as removing the directive from the affected pages.
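For reference, a noindex directive placed in a page's HTML head looks like this (the equivalent HTTP header is "X-Robots-Tag: noindex"):

```html
<!-- Asks compliant bots not to index this page; they may still crawl it -->
<meta name="robots" content="noindex">
```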
For WordPress users, firewall plugins can block bots from crawling a website. These rules are put in place to prevent malicious bots from crawling the site and scraping its content. Start by checking whether your firewall is blocking Googlebot or any other legitimate crawler. You can then whitelist these bots so they are free to crawl the website, while keeping the rules that shut malicious bots out.
Internal Server Issues
If your website cannot be reached by search engine bots, one common cause is an internal server issue. This usually presents itself as a 5xx error such as 500, 502 or 503. If you are seeing such internal server errors, human visitors cannot access your website either, even when they arrive via a direct link rather than a search engine.
Identifying these errors is often as simple as opening any URL in a browser. Ideally, you should start with high-authority pages, as these are the most likely to be indexed. Using third-party tools that check for server errors is another way of finding out whether internal server issues are the reason your website is not being crawled, and therefore not indexed.
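As a rough sketch, you can also check status codes in bulk with a short script. This one uses only Python's standard library; fetch_status and is_server_error are illustrative names, not part of any SEO tool:

```python
from urllib import request, error

def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a URL, including error responses."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful
        return exc.code

def is_server_error(status: int) -> bool:
    """True for 5xx internal server errors such as 500, 502 and 503."""
    return 500 <= status <= 599
```

For example, is_server_error(fetch_status("https://yoursite.com/")) tells you whether your homepage is currently returning an internal server error.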
4xx Page Errors
4xx errors are another common crawling issue that many website owners run into from time to time. These errors tell you that a resource (a URL or page) cannot be reached, the most common being the 404 (Not Found) error. As with 5xx errors, you can identify them by visiting the affected URL in a browser.
If the content exists but the URL is wrong, fix the URL and then request a recrawl, for example through the URL Inspection tool in Google Search Console. If the content is gone, you will need to either restore it from a backup or another source, or redirect the broken URL to a working one.
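If your site runs on an Apache server, for instance, a permanent redirect can be added to the .htaccess file; the paths below are placeholders:

```apache
# 301 = moved permanently; sends visitors and bots from the dead URL to a live one
Redirect 301 /old-page /new-page
```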
Checking the URL
The URL displayed in your address bar or your analytics console can tell you a lot about the specific crawling issues you have. If you are seeing 404 errors and your URL contains the character sequences "%20" or "%0A", you probably have a stray space or line break somewhere in the URL: %20 is the percent-encoded form of a space and %0A of a line break. URLs are encoded this way to keep them valid and to prevent the injection of malicious requests.
This issue often stems from your sitemap, so check that the sitemap is encoded correctly and that each affected link appears as a single line, with no spaces anywhere in the URL.
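A quick way to scan for these characters is a short script like the one below; find_encoded_whitespace is an illustrative helper, and you would feed it the URLs extracted from your sitemap:

```python
def find_encoded_whitespace(urls):
    """Return the URLs that contain a percent-encoded space (%20)
    or line break (%0A), which usually indicate stray whitespace."""
    suspects = []
    for url in urls:
        upper = url.upper()
        if "%20" in upper or "%0A" in upper:
            suspects.append(url)
    return suspects
```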
Another common crawling issue is seeing two URLs joined together. This usually stems from incorrectly written links in the site's source code, particularly relative links: if the protocol (https:// or http://) is missing from a link, it is treated as a relative path and appended to the current URL.
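You can see how a missing protocol produces joined URLs with Python's urljoin, which resolves links the same way a browser does; example.com stands in for your domain:

```python
from urllib.parse import urljoin

base = "https://example.com/blog/post-1"

# A correct absolute link resolves as expected
good = urljoin(base, "https://example.com/about")

# Without the protocol, "example.com/about" is treated as a relative
# path, so the two URLs end up joined together
bad = urljoin(base, "example.com/about")
```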
4xx Errors Resulting from a Redirect
Redirects are important for structuring content and ensuring visitors are sent to a working page when the URL they requested no longer exists. In some cases, however, the URL you are redirecting to is itself unavailable, which leads to crawling issues. You can use crawling software to spot URLs that produce few or no further crawls; this tells you the crawler stopped there because the redirect target was broken.
You can then fix the redirect the same way you would fix any other 4xx error: correct the target URL or point the redirect at a resource that works.
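If your crawling software can export redirects and status codes, cross-referencing them is straightforward; the data structures below are hypothetical stand-ins for that export:

```python
def broken_redirects(redirect_map, status_by_url):
    """List redirect sources whose target URL returns a 4xx error.

    redirect_map: {source_url: target_url} taken from a crawl export
    status_by_url: {url: http_status}; unknown targets count as 404
    """
    return [
        source
        for source, target in redirect_map.items()
        if 400 <= status_by_url.get(target, 404) <= 499
    ]
```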
SEO Tag Issues
Google uses certain tags to understand your content. These include the hreflang tag, which tells bots which language (and optionally which region) a page targets, and the canonical tag, which tells bots which URL is the preferred version of a page. If these tags are incorrect, missing or duplicated, they can confuse crawlers and lead to crawling issues.
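For reference, both tags live in the page's HTML head; example.com is a placeholder for your own domain:

```html
<!-- Canonical: the preferred URL for this page's content -->
<link rel="canonical" href="https://example.com/page/">

<!-- hreflang: language (and optional region) versions of the same page -->
<link rel="alternate" hreflang="en" href="https://example.com/page/">
<link rel="alternate" hreflang="de-DE" href="https://example.com/de/page/">
```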
To find out whether this is why your website is not indexed, start with Google Search Console. These issues will appear there, such as an incorrect canonical tag causing two copies of the same page to be indexed, but they may not be flagged as errors. It is still better to correct them to avoid any confusion with duplicate content.
You can also analyze the results of a crawl of your website and compare the extracted SEO tags against the ones you expect to see. Start with your key pages, where incorrect or missing values are easiest to spot.
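As a sketch of that comparison, the script below extracts canonical tags from a page's HTML with Python's standard-library parser and classifies the result; the class and function names are illustrative:

```python
from html.parser import HTMLParser

class CanonicalCollector(HTMLParser):
    """Collects the href of every rel="canonical" link tag."""
    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("rel") == "canonical":
            self.canonicals.append(attributes.get("href"))

def canonical_issue(html, expected_url):
    """Return "missing", "duplicated", "incorrect" or None for a page."""
    parser = CanonicalCollector()
    parser.feed(html)
    found = parser.canonicals
    if not found:
        return "missing"
    if len(found) > 1:
        return "duplicated"
    if found[0] != expected_url:
        return "incorrect"
    return None
```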
Every webmaster should be aiming for a website that does not have any crawling issues. That said, there are different types of crawling issues that can pop up and several tools that webmasters can use to identify and correct them.