Creating high-quality, original content is difficult. You have to think of interesting topics, do a lot of research, write the content, proofread it, format it, work on the SEO, and then publish it. Once you have done all this, you then have to worry about scrapers who will steal the content and use it on their own sites.
Plagiarism and content scraping have been problems for the better part of the last two decades, and not only for news websites. So, what can you do to stop scrapers from stealing your content, poaching your audience, and monetizing your hard work?
One of the ways scrapers steal content is by simply selecting it and copying it. You can stop this by disabling both text selection and right-clicking. Doing so is a bit technical, so if you run a WordPress website, use a plugin to make things easier.
If you want to dig into the code, locate the opening body tag on your website. It will look something like this: <body>. Replace it with the following: <body onmousedown="return false" onselectstart="return false">.
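The inline attributes above work, but a sketch of the same idea can also be added without touching the body tag, using standard CSS and DOM event listeners. The snippet below is a hypothetical example, not a guarantee: scrapers that fetch your raw HTML directly will bypass it entirely.

```html
<!-- Hypothetical example: block text selection and the right-click menu site-wide. -->
<style>
  /* Disable selection in browsers that support the user-select property. */
  body { -webkit-user-select: none; user-select: none; }
</style>
<script>
  // Suppress the context menu (right-click) and selection events.
  document.addEventListener("contextmenu", function (e) { e.preventDefault(); });
  document.addEventListener("selectstart", function (e) { e.preventDefault(); });
</script>
```

Keep in mind this only deters casual copy-and-paste; automated scrapers never render the page, so pair it with the server-side measures below.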
There are several ways you can use links to stop scrapers. One of these is adding as many internal links to your website as possible. Many scrapers leave your internal links in the content they post on their websites, meaning they do not benefit from the theft as much as they expected. You end up stealing traffic back from them, especially if those links are anchored to high-value keywords.
Also, if they leave the internal links intact, you can find out where the scraped content is being published by checking your referring links in Google Search Console or Ahrefs. You can then use these details in the next step.
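Once you have a list of suspect URLs from your referring-link reports, you can check whether each page still contains links back to your domain. This is a minimal sketch using only the Python standard library; the function names and the sample HTML are illustrative, not from any particular tool.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect every href attribute found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_to_domain(html, domain):
    """Return the hrefs in `html` that point at `domain` --
    evidence that scraped content kept your internal links intact."""
    parser = LinkFinder()
    parser.feed(html)
    return [link for link in parser.links if domain in link]


# Hypothetical scraped page that kept an internal link to our site.
sample = '<p>Stolen post <a href="https://example.com/guide">guide</a></p>'
found = links_to_domain(sample, "example.com")
print(found)
```

In practice you would fetch each suspect URL and run its HTML through the same check, logging every page that still links to you as evidence for the takedown process described next.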
A great way to deal with scrapers is to get their websites taken down. You may have to do this repeatedly, as they can simply start new sites, but it works and drains the scrapers' resources. The first step is to contact them and ask them to remove the content.
If they do not comply, it is time for a takedown. You do this by filing a Digital Millennium Copyright Act (DMCA) request with their host. To avoid the hassle of digging through their website for contact information, run a Whois lookup. This will give you the details of the person or company that owns the website.
The lookup will also tell you who is hosting the website, and you can then contact the host to file the DMCA request. Many hosts take this very seriously and will often take down websites that violate their DMCA policies.
Next, you need the IP address of the offending party. A DNS lookup of their domain will reveal their server's IP address, and your own access logs can show which addresses have been pulling your content. Once you have the IP address, it is time to block it.
.htaccess is a configuration file that, on Apache servers, controls who can visit your website and how. It sits at the root of your site's directory and lets you deny access to areas such as the admin section or specific pages. Check your access logs to see whether the IP address you found above has been visiting the website. If it has, block it by adding the line "Deny from <IP address>" (without the angle brackets) to your .htaccess file. You can also edit the file to redirect that IP address to dummy content.
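As a concrete sketch, the block rules look like this in .htaccess. The address 203.0.113.42 is a documentation placeholder, not a real scraper; substitute the IP you found in your logs. Note that the "Deny from" syntax belongs to Apache 2.2, while Apache 2.4 and later uses the Require directive instead.

```
# Hypothetical .htaccess entries; 203.0.113.42 is a placeholder address.

# Apache 2.2 syntax:
Order allow,deny
Allow from all
Deny from 203.0.113.42

# Apache 2.4+ equivalent (mod_authz_core):
<RequireAll>
  Require all granted
  Require not ip 203.0.113.42
</RequireAll>
```

Use whichever form matches your server version; mixing the two on Apache 2.4 can produce confusing results.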
Some developers even recommend redirecting the scraper's requests back to its own IP address. This can trap a naive automated scraper in a loop and waste its resources, although it will not literally crash a well-built system.
Some scrapers do not visit your website for the content at all; they pull from your RSS feed instead. The feed can be an important tool in limiting scraping: simply set it to show a summary of each post rather than the full text. That way, scraping the feed yields only snippets, not entire articles.
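In WordPress this is a settings toggle, but if you generate your own feed, the idea is simple: truncate each post before it goes into the item's description. The helper names below are illustrative, and the word limit is an arbitrary choice.

```python
import xml.etree.ElementTree as ET


def summarize(text, max_words=40):
    """Trim full post text down to a short, feed-safe excerpt."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + "..."


def feed_item(title, link, body, max_words=40):
    """Build an RSS <item> whose description carries only an excerpt,
    so feed scrapers never receive the full article."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "description").text = summarize(body, max_words)
    return ET.tostring(item, encoding="unicode")


# Hypothetical 100-word post truncated to a 5-word snippet.
item_xml = feed_item("My Post", "https://example.com/my-post",
                     "word " * 100, max_words=5)
print(item_xml)
```

A scraper republishing this feed gets only the excerpt, and the link element still points readers back to your site for the full article.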
There are good bots and bad bots. Good bots include the search engine crawlers that index and rank your content; bad bots exist mainly to scrape it. Major search engines publish ways to identify their crawlers, such as official IP ranges and reverse-DNS hostnames. By comparing a visiting bot's IP address and user agent against that published information, you can tell the good bots from the bad ones and block the bad ones.
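Google, for example, documents a double-DNS check for verifying Googlebot: reverse-resolve the visiting IP, confirm the hostname belongs to a Google crawler domain, then forward-resolve that hostname and make sure it maps back to the same IP. The sketch below implements that check with the standard library; the function names are my own, and the full verification needs live DNS, so only the pure hostname check runs offline.

```python
import socket

# Crawler domains Google documents for Googlebot reverse-DNS results.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(hostname):
    """Pure check: does a hostname end in one of Google's crawler domains?"""
    return hostname.endswith(GOOGLE_SUFFIXES)


def verify_googlebot(ip):
    """Double reverse-DNS verification: reverse lookup, domain check,
    then forward lookup back to the original IP. Requires live DNS."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
        if not is_google_hostname(hostname):
            return False
        return socket.gethostbyname(hostname) == ip      # forward confirmation
    except (socket.herror, socket.gaierror):
        return False
```

A bad bot can fake a Googlebot user agent, but it cannot fake the reverse-DNS result, which is why the double lookup is the reliable test.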
Many scrapers steal your content and try to rank it as their own. A sitemap can help prevent this: it records when each page was published and last updated, usually down to the second. By keeping your sitemap up to date, you signal to search engines that your content is the original version.
When that happens, the duplicate copy tends to be filtered out of search results and rarely ranks. This will not stop every scraper, but it renders their efforts largely moot or at least slows them down.
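If you generate your sitemap yourself, the key is the lastmod element with a precise timestamp. This is a minimal sketch under the standard sitemap schema; the URL and date are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap(pages):
    """Build a sitemap where every URL carries a second-precision
    <lastmod> stamp, evidencing when the original was published."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # W3C datetime format, accurate to the second.
        ET.SubElement(url, "lastmod").text = modified.strftime(
            "%Y-%m-%dT%H:%M:%S+00:00")
    return ET.tostring(urlset, encoding="unicode")


xml = sitemap([("https://example.com/guide",
                datetime(2024, 5, 1, 9, 30, 15, tzinfo=timezone.utc))])
print(xml)
```

Regenerate and resubmit the file whenever you publish or update a post, so the timestamps stay fresh and ahead of any scraped copies.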
Content scraping is a problem every website with valuable information must worry about. Eliminating it entirely is almost impossible, but there is plenty you can do to protect your content and your effort while frustrating the offending parties.