I’ve recently been dealing with a website whose home page has stopped ranking in Google. I’ve seen this a number of times before, and it is often caused by another site scraping the content. Scraping happens all the time, and in the majority of cases it has no effect on the original site, but occasionally it does.
Sites such as the ones below often scrape your site and can appear in the results for your content when you do not:
- whois.domaintools.com
- aboutus.org
The specific problem with whois.domaintools.com is that it spiders your content and then displays it within an “SEO browser”. Google can spider that copy too and, as stated above, it can sometimes rank higher than the original source.
Solution
Well, the first solution, which works in the majority of cases, is simply to rewrite your home page content! But DomainTools can spider your website again and take your new content, so we are left with a lose-lose situation. Or are we?
When researching this subject I came across one of the best Google webmaster forum threads I have ever read. Why? Because two top contributors had a two-page argument over the following:
- One said you should block DomainTools via robots.txt
- The other said you should use IP blocking
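For comparison, the IP-blocking approach boils down to a check like the one below. This is only a minimal sketch: the `BLOCKED_NETWORKS` list and the `is_blocked` helper are hypothetical names, and 192.0.2.0/24 is a reserved documentation range standing in for whatever addresses a scraper actually uses, which you would have to find out and keep up to date yourself.

```python
import ipaddress

# Hypothetical block list. 192.0.2.0/24 is a documentation-only range,
# used here as a placeholder for a scraper's real addresses.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)


print(is_blocked("192.0.2.15"))    # inside the blocked range
print(is_blocked("198.51.100.7"))  # outside the blocked range
```

In practice a check like this would normally live in the web server or firewall configuration rather than in application code; the sketch only shows the logic.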
After reading the two pages, which are over a year old, I checked to see whether “Phil Payne”, who advised the robots.txt method, was correct.
And unfortunately for the other contributor, it looks like he was. Check: http://whois.domaintools.com/hotlines.co.uk
You will notice that the site “hotlines” no longer has its text displayed on DomainTools. (The other server information is still there.)
So how did he do it?
He simply looked at DomainTools’ information about its spider (http://www.domaintools.com/webmasters/surveybot.php) and disallowed it in the robots.txt file:
# DomainTools
User-agent: SurveyBot
Disallow: /
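If you want to check the rule before uploading it, Python’s standard `urllib.robotparser` module can evaluate it locally. The sketch below is just one way to do that: the `is_allowed` helper is a hypothetical name, www.example.com stands in for your own domain, and it only tells you what a well-behaved crawler should do, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# The rules from this post, exactly as they would appear in robots.txt.
ROBOTS_TXT = """\
# DomainTools
User-agent: SurveyBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# www.example.com is a placeholder for your own site.
print(is_allowed(ROBOTS_TXT, "SurveyBot", "http://www.example.com/"))  # False
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/"))  # True
```

SurveyBot is disallowed everywhere, while crawlers that are not named in the file, such as Googlebot, are unaffected.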
I hope you find this as useful as I did.
That’s great news! Will be getting that added to my robots file ASAP!
The problem with using robots.txt is that it’s up to the person scraping your site to respect it. Phil is wrong to use the word “blocks” with regard to your robots.txt file. Luckily, no site that I know of gets big enough to matter and then dares to disobey robots.txt. The problem with blocking IP addresses is that they change. This means you could unintentionally block someone else in the future, as well as fail to block the new IPs the scraper may have taken. Also, if a crawler sees a robots.txt file that disallows it (which can’t happen if you’re blocking the IP), it usually takes that to mean “remove me from your site”, which is what most people want anyway.
Given the current environment and sense of fair play, robots.txt is definitely the better choice.