Content Thieves: Is a Scraper Stealing Your Content?
Posted on November 30th, 2007 by DotComMogul under Most Recent
Recently I’ve become aware of a scraper site that is scraping content from a lot of blogs. I’ve reported them to their host with little success and reported them to Google, but they are still indexed. I emailed the owner and 5 of my articles were removed, but many are still there. So … I took matters a little step further. I have a hellish little blackhat tool and instead of using it to my benefit, I used it to his detriment. Don’t bother going there. I just tried it again and apparently I did something right, because the whole site is redirecting to Yahoo now. You could just type site:myfreelcd.com into Google and click on “cache” to view the site and articles. You might even find your own articles there … ROFLMAO
Definition of Scraper Site from Wikipedia
A scraper site is a website that copies all of its content from other websites using web scraping.[1] No part of a scraper site is original. A search engine is not a scraper site: sites such as Yahoo and Google gather content from other websites and index it so that the index can be searched with keywords. Search engines then display snippets of the original site content which they have scraped in response to your search.
In the last few years, and due to the advent of the Google Adsense web advertising program, scraper sites have proliferated at an amazing rate for spamming search engines.[1] Open content sites such as Wikipedia are a common source of material for scraper sites.
The scraper site I found has hundreds of pages of scraped content indexed in Google, including most of the pages in this blog. No credit is given to the source of the articles, and there is no link back unless your link is in the article itself. They even have an article in the recent articles list about scraping and how to make scraped sites look more natural: ironic, since the article was probably scraped.
I have used Google’s spam reporting link to report this scraper site so we’ll see if that has any effect on his search results.
I found a very helpful list of the 20 Best Anti-Plagiarism Tools here as well as some very good articles and remedies on content theft. There are a lot more articles on content theft here and here is a site that tells you specifically what to do in case of copyright infringement.
I recently installed the Digital Fingerprint WordPress plugin. It inserts a Digital Fingerprint (a unique term that you make up) in each post to make it easy for you to find your content in the search engines. I found the HotCPA Scraper site through the incoming links in my WordPress management console. A Google search for site:myfreelcd.com returns hundreds of pages of scraped … stolen content, including pages from this blog. I emailed the contact listed in the Whois record for the domain and listed about 5 of my articles that I had found. He removed them without replying to my email, but on further investigation I found many more of my pages buried in his content.
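For anyone curious how the fingerprint idea works in practice, here is a rough Python sketch. It is not the plugin's code: the function names, the hashing step, and the example URLs are all my own assumptions for illustration. The plugin simply lets you type in a unique term; here I derive one from a private phrase so it is unlikely to appear anywhere else by accident.

```python
import hashlib
from urllib.parse import urlparse


def make_fingerprint(blog_name: str, secret: str) -> str:
    """Derive a short token that is unlikely to occur in anyone else's text.

    Hypothetical helper: the real plugin lets you choose the term yourself.
    """
    digest = hashlib.sha256(f"{blog_name}:{secret}".encode("utf-8")).hexdigest()
    return "fp" + digest[:12]


def tag_post(body: str, fingerprint: str) -> str:
    """Append the fingerprint to the end of a post body."""
    return f"{body}\n\n{fingerprint}"


def find_copies(pages: dict[str, str], fingerprint: str, own_host: str) -> list[str]:
    """Return URLs (other than your own host) whose text contains the fingerprint.

    `pages` maps a URL to page text you have already fetched or pulled
    from search results; no network access happens here.
    """
    return sorted(
        url
        for url, text in pages.items()
        if fingerprint in text and urlparse(url).hostname != own_host
    )
```

In use, you would tag every post once, then periodically search for the fingerprint and feed whatever pages turn up into `find_copies` to get a list of sites to follow up on. A scraper that strips the last line of each post would defeat this simple version, so treat it as a first check rather than proof.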
He’s also very good at getting it indexed very quickly. I wrote the article on Google Trends yesterday and within hours his site was indexed for that article. That means that when Google crawls my site and finds the same article, I’m the one who could get slapped with a duplicate content penalty or not get listed at all.
I also contacted his host, Hostgator, and told them about the problem. At first they said that if you have an RSS feed published, it is fair game. I responded that I didn’t consider copyright infringement, whether via RSS feed or any other method, to be fair game. They then sent an email stating that I could pursue it with them if I sent a ton of documentation: PDFs of my content, PDFs of his content, proof that the content is mine, etc. So in reality, I think pursuing the matter through the Report Spam to Google form might achieve better results.
Here’s another scraper site that is scraping my content on a regular basis. The difference here is that they only publish an excerpt and then link back to me. There are two or three of these all with exactly the same design and same advertising widget … probably all one owner running a bunch of scrapers/splogs to get views to their widget/offer and Adsense. This method of scraping content could actually benefit me since I am getting a backlink every time they scrape an excerpt from my site, so scrape away scrapers.
UPDATE: Received an apology from the site owner for scraping the content. I’m a reasonable person and have accepted the apology. I am hoping to see the site deindexed from Google and am still pursuing that so that the original authors don’t suffer duplicate content penalties. The site owner has removed all of the content (it is no longer redirecting to Yahoo, but is returning 404 Not Found errors), so it’s probably just a matter of time before it is deindexed. Thank you for doing that.
November 30th, 2007 at 5:12 pm
The link to the 20 tools is broken… otherwise, great article, which I intend to refer to in a blog post (with full attribution, of course!) over the weekend!
Thanks for the info!
November 30th, 2007 at 6:02 pm
Good article, thanks for the info. I will link to it from my blog.
Well done.
Ted.
November 30th, 2007 at 7:23 pm
Thanks …. fixed the link to the 20 tools.
December 1st, 2007 at 4:08 pm
Thank you for this information. I am very new to blogging and can use all the help I can get. I don’t know much about this and don’t have a lot of time to investigate it right now but it is on my list.
It’s pretty scary, isn’t it? I would like to think that honest, hard-working bloggers would prevail and cheaters like this would be ignored by people looking for real information from real people, and that they would want to develop some kind of actual relationship or bond around whatever area of interest you are blogging about.
But in the real world, too many people are just looking for as much information as they can find in one place and don’t really care where the information came from.
We can only hope that if we get the word out about this kind of blatant thievery among our blog sites that people might recognize a scrape site for what it is, report them and boycott them.
In the long run it will pay to be honest…maybe not in this world, but certainly in the next.
Thanks again and God bless you.
Barry
December 2nd, 2007 at 3:05 am
First off, I am sorry that Hostgator treated you like that. However it is more than a little surreal to see you advertise for them right beside your frustration with them.
The truth is, contrary to what many would have you believe, that copyright law does not change based upon the format the content is distributed in. The fact a song is sent over radio to millions does not give you permission to rebroadcast it, for example.
I would advise a different tack with Hostgator. I have worked with them dozens of times in the past and always had good results with this tack. File a full DMCA notice with them; the template is on my site. Send it to their abuse account and the site should go down.
Hosts will never cooperate with you on these matters unless you provide a full, legal notice. If you do that, they are generally very eager to please.
Still, there is no reason for them to say such things. They should have just directed you to their DMCA policy and let that be that. I am sorry they decided to populate the Web with half-truths and misunderstandings.
I hope that this helps and please do not hesitate to write me if I can help in any way.