How annoying is it when you make a post and five other posts rank above yours in the search engines, all of them with your content wrapped around huge AdSense units? When you go to the site, not only is it copied word for word, but there is zero attribution to the source.
Fighting this battle is a full-time job in itself. I have many friends who will spend days worrying about a post that stole their content and ranks over them but does not give them credit. Sometimes they will even get a lawyer to file a cease and desist, even though many times the scrapers are located in other countries.
So what can you do?
I came up with this idea a while back: put a link back to my site in my blog feed. This works because if search engines think a blog is worthy enough to outrank yours, then it should pass you juice as the authority for the article. If the site doesn’t rank (let’s face it, 100% of the traffic to these scrapers is search engine generated), then it’s a wash, because the search engine has already identified it and the site never had any link juice (PageRank) to pass in the first place.
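To make the idea concrete, here is a minimal sketch in Python of appending an attribution footer to each feed item before it goes out. This is not the code of the plugin mentioned below; the function name `add_source_link`, its parameters, and the footer wording are all made up for illustration. The anchor text is just the post title and the site name, nothing spammy.

```python
from html import escape


def add_source_link(item_html, post_url, post_title, site_name, site_url):
    """Return the feed item's HTML with a plain attribution footer appended.

    All names here are hypothetical; adapt them to whatever builds your feed.
    """
    footer = (
        '<p>This post, <a href="{url}">{title}</a>, first appeared on '
        '<a href="{site_url}">{site}</a>.</p>'
    ).format(
        url=escape(post_url, quote=True),       # safe inside an href attribute
        title=escape(post_title),               # safe as link text
        site_url=escape(site_url, quote=True),
        site=escape(site_name),
    )
    # The footer goes after the content, so the post itself is unchanged.
    return item_html + "\n" + footer
```

A scraper that republishes the item verbatim republishes the footer with it, which is the whole point.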
I talked to Joost De Valk about the idea and he has made a plugin for it.
So of course this raises the question… isn’t this gaming the search engines to get backlinks?
I do not think so, but it would be nice to hear a comment from a search engine engineer (you listening, Matt?). In my opinion it’s all about intent. If the keyword in the link back to your site is “buy Viagra”, then… yeah. In my case, if I am using the keyword “shoemoney”, I think that is fair.
UPDATE:
Matt Cutts, Google’s lead spam engineer, has responded in the comments:
Comment by Matt Cutts
2008-01-10 14:41:09
Don’t cloak the link or make the anchortext spammy, but otherwise: sure. See the interview I did with Stephan here: http://www.stephanspencer.com/search-engines/matt-cutts-interview where I said that syndicating articles with a link to the original article was smart.
Thanks Matt!
It’s funny, but since I started doing this two things have happened. One, I get scraped less. Two, I am building up more links, which is increasing traffic and Technorati rank.
We have problems with bots/scrapers at work too. We have a little script that checks how many requests/hour they make, what their entry page is, and where their IP is from. Based upon that, if they look spammy, we route them to 127.0.0.1. Oops.
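A rough sketch of that kind of check, in Python, might look like the following. It is not the script described above: the class name `ScraperDetector`, the threshold of 600 requests/hour, and the `/feed` entry path are all assumptions, and the "where their IP is from" part is left out because it needs a geolocation database.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600          # look at the last hour of traffic
MAX_REQUESTS_PER_HOUR = 600    # assumed threshold, tune for your site


class ScraperDetector:
    """Tracks per-IP request rate and entry page (hypothetical helper)."""

    def __init__(self, max_per_hour=MAX_REQUESTS_PER_HOUR, feed_paths=("/feed",)):
        self.max_per_hour = max_per_hour
        self.feed_paths = tuple(feed_paths)
        self.hits = defaultdict(deque)   # ip -> timestamps inside the window
        self.entry_page = {}             # ip -> first path this ip requested

    def record(self, ip, path, now):
        """Log one request; `now` is a timestamp in seconds."""
        self.entry_page.setdefault(ip, path)
        window = self.hits[ip]
        window.append(now)
        # Drop timestamps that have fallen out of the one-hour window.
        while window and window[0] <= now - WINDOW_SECONDS:
            window.popleft()

    def looks_spammy(self, ip):
        """True if the IP is hammering the site and came in through a feed."""
        too_fast = len(self.hits[ip]) > self.max_per_hour
        entered_via_feed = self.entry_page.get(ip, "").startswith(self.feed_paths)
        return too_fast and entered_via_feed
```

What happens to an IP once `looks_spammy` returns true (block it, redirect it, serve it something else) is a separate decision and not shown here.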
Not publishing isn’t an option.
I’ve watched automation, followed it closely actually, and even spammers use analytics. They cut content that’s not working and replace it with content that is. The content that’s working… well, it outranks the source. Yes, they do rather enjoy that too, all the way to the bank. I haven’t opened my ass, I swear.
Thanks for the tip. I am new to this and was really shocked by how much of this goes on!
Awesome, I will definitely be trying this out. I’ve had a rash of scraped posts lately – one website in particular seems to have subscribed to every single one of my RSS feeds and uses almost all of my updates on his porn sites (regardless of the topic of my post). He at least links back, but many others don’t. Either way, shit’s weak.
Indeed, there’s nothing wrong with using such a plugin since the principle is the same as with article directories – you get credit as the author of a certain post.
Most respectable hosting companies will react and remove that website from their server if they are confronted with something like this; they simply don’t want to be associated with such sites and prefer to lose a customer.
Indeed, linking to other relevant articles you’ve written from within your posts is definitely an approach which helps since some people may remove the link if they notice it at the end of a post but if it is contained within the post it will most likely not get noticed.
Or borrowed content scumbag seo, same thing.
This might be harsh but you’re talking out of your ass and haven’t learned shit from the spammers you’ve been “watching” for the last 6 months.
Do you really think they go out and hand pick feeds to scrape due to quality of the content and poor markup?
If you really want to be aware of “blackhat” then you might want to delve into it more than just being aware of the term.
Think automation. They need to create thousands of these sites… where are some places that you can find an unlimited supply of RSS feeds? Blogsearch? Weblogs?
The only way to stop your feed from being scraped is…not to publish one.
Great. Thanks a lot Matt. I’ll be sure to start manually linking to my own articles this way.
Hey. That actually sounds cool. Playing fire with fire.
Seriously? I’ve had my fair share of problems with these scrapers but my analytics code? LOL.
Oh God. I hope it doesn’t come to that point where we can’t even link to our own content with decent keywords.
I tore ATVStyle a new one, err I mean, I hacked it to no end while trying out all the open source and php code I was playing with. Yes, I made the “please wait while page loads” part of the header background that sits under the adsense box. You can’t see that part of the background unless adsense is slow to load. When adsense is slow to load, odds are that the page is too, the text helps keep visitors patient a second or two longer. I noticed ads were slow to load a couple of times so besides optimizing I tested out a few ideas to see if they would help, that did so I kept it.
Great plugin, I saw it on digg so I had to add it! I can imagine that this will help you out.
It’s not gaming the search engines because a scraper used your content and you shoved a backlink in. It’s right on point with the idea behind using backlinks as a measure. Someone on another site thought enough of your site to link to you. That’s the idea.
Nice site, http://atvstyle.com/index.php
but why does the image say “Please wait a second while page loads… ” ? Did you really make that part of your graphic? heeh, funny
Nice post.
Amen to that. At least once a day I get some guy scraping my content and expecting a trackback. No thanks.
hey good idea. I don’t think I’ve had this problem yet but it’s good to know there’s a plugin for it.
Well, the majority that don’t are using stock scripts anyone can find, and chances are they couldn’t code one themselves if they wanted to. They typically don’t last long doing it this way.
There’s a related posts in feeds plugin that works a bit better than this
http://www.earnersblog.com/rss-deep-links/
I fully understand your annoyance. There is a site from India that repeatedly steals content from my site, and possibly other sites as well. They change the title tags slightly, they remove all links to my site, and they rank better for keywords in the search engines, possibly because they have a slightly higher PageRank. I would say the site is a mixture of stolen content from several sources, and some original content from site visitors. Perhaps that is how they get away with it.
The site stealing the content has a page full of google ads before you can even read any content at all, clearly a made for adsense site.
This has been reported in a google spam report, nothing appears to have been done in that regard. Ok, that is fair enough.
But when you report it to google adsense, and they fail to take any action, it feels as though they are happy to profit from stolen content themselves, though I know that is clearly not the case. I still consider google as a company with good ethics, and they are quite simply still the best at what they do.
Google put a lot of resources into the link buying/selling problem, it would be nice to see them make a similar effort to the problem of dealing with reported cases of content theft.
I will use the fridge to store water and share it with my neighbours around my house here in Africa.
Jeremy was talking about AdSense spam pages stealing his content. Doesn’t Google actively search for paid links and find them in “bad neighborhoods”? If that’s the case it may be wiser to avoid placing “backlinks” in articles that are to be syndicated and instead rely on a webmaster to give you due credit (spammers rarely do).
Finding AdSense spam pages outranking you with your own content is bad enough, but getting downgraded by Google because of links from bad neighborhoods is likely worse IMO. How can we protect ourselves from Google too?
Bleh, that post left an aftertaste in my mouth. I should have added that I love both blogs. I’m wishing I had Matt on speed dial to ask if he’s really suggesting that links from junk sites are something you want. Google filters out link buyers, and slams them hard, by finding those.
OMG I just threw up into my mouth!
Got your attention? Good. I’ve been watching several spammers closely over the past 6 months, since they are extremely creative in their methods and unfortunately to succeed you need to practice whitehat while being aware of the black. Here’s something much more effective than a plugin: a peek at WHY they target you.
They target you because you write great articles and are kind enough to place them on crappy websites. By crappy I mean extremely easy to beat in functionality with only a little S E – and fricken – O.
This might sound harsh – If you’re weak, don’t read it.
Shoemoney.com – well over 200 markup mistakes, 3 critical design flaws and a myriad of things that could be done “better”.
JohnChow.com – Same deal, tons of markup mistakes, lots of critical flaws, ummm do ya really need TWO sets of description tags and TWO sets of keyword tags John?
You get the idea. Search engines really do like the little details, like proper and accurate alt tags, lack of “marketing and cool speak”, strong, descriptive content placed highly on the code page (as opposed to buried underneath a massive head area followed by three dozen javascript calls). I could easily go on.
Get the point? Great articles on poorly SEO’d blogs = money in the bank for scrapers. Google bans the scraper sites right and left, they just sign up again. Heck some of them are even completely legit (and surprisingly good) sites 6 days a week and then on a random day they switch in adsense spam pages for 24 hours on whatever term Google is giving them love for.
Writing articles is like driving – you can be the best at both but you still need the horsepower underneath the hood to get anywhere. Oh and no, I’m not really a prick, I’m just passionate about SEO and double keyword tags bug me… and I bet Matt too.
Well, that is reassuring, Shoe. I know you mentioned having a lot of smaller sites linking to your own site before. But I was a little nervous about doing that in case I got banned. So, thanks for asking Matt the question.
Just wanted to say that this is already working GREAT for me within hours of setting it up. Thanks Shoe and Joost! =)
he’s having an existential crisis
This is a clever idea, plus it’ll notify you of scrapers by the trackbacks.
Great plugin, one which i’m going to be installing
You’ve just become a commentor by posting this comment
Read a few comments up… Matt Cutts has commented.
This should usually happen. However, it’s not always the case, as sometimes scraper content can outrank the original content.
The majority don’t because they simply aren’t smart enough to do so.
nope. most of the scrapers available today don’t offer this function.
It’s a thing called “blackhatseo”
That’s a nice way.
My suggestion at another place was to employ a decent hacker instead of a lawyer and give them hell for a while.
But that’s not fair, no?
My blog is not that popular but i have come across the same scenario and i must say it’s annoying. I’ll give the plugin a try.
Nice work man, but how the feck are they outranking you in the first place?
This is a great idea for a plugin. I have been having major issues with scraper sites over the past few weeks. Glad there’s a possible fix for it! Thanks Joost!
Great idea to get back at the scrapers. I’ll be using this plug in for sure.
And if they are smart… they are running the scrapers through a front host that they put up themselves, so your complaints will be going to the person doing the scraping anyway.
Ha…I’m not black, white, or gray. The lines move so much for those stereotypes that it’s become pointless to try and categorize. Just like with the paid links deal…a lot of “white hats” were turned gray->black over night.
I liked the solution reported on SEO Black Hat, where you find the IP addresses of the worst scrapers and then use IP delivery to serve them different content.
http://seoblackhat.com/2006/07/14/ip-delivery-to-stop-rss-content-thieves/
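For anyone wondering what IP delivery looks like in practice, here is a minimal Python sketch. It is not taken from the linked post: the function name `choose_feed_body`, the idea of a separate "decoy" feed, and the addresses in the usage note (reserved documentation ranges) are assumptions for illustration only.

```python
import ipaddress


def choose_feed_body(client_ip, full_feed, decoy_feed, scraper_networks):
    """Return decoy_feed for known scraper addresses, full_feed otherwise.

    scraper_networks is a list of strings, each a single IP ("198.51.100.7")
    or a CIDR block ("203.0.113.0/24") that you collected from your logs.
    """
    address = ipaddress.ip_address(client_ip)
    for network in scraper_networks:
        # A bare IP is treated as a one-address network.
        if address in ipaddress.ip_network(network):
            return decoy_feed
    return full_feed
```

With `scraper_networks = ["203.0.113.0/24"]`, a request from 203.0.113.42 gets the decoy feed and everyone else gets the normal one.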
I agree. There’s nothing unethical about placing a link in your own content.
I think most hosting companies won’t take down content from a paying customer, unless it is illegal.
In a perfect world this would be the case, but I found that sometimes the scraper content does outrank the original. I think we cannot be passive about it and depend on Google and Yahoo to get it straight.
Jeremy, wonderful blog. Thanks for all that you put into it. I just discovered it a few days ago after reading Amit’s blog which I have also really enjoyed. I will definitely become a frequent reader and possibly commenter.
A respectable hosting provider would most likely have this approach but there are a lot of “questionable” hosts out there who would not proceed in this manner.
Matt Cutts has dropped by and confirmed that such an approach is allowed, as long as you don’t go overboard with your anchor text.
If your website is popular and if you do a good job of optimizing then yes, you will probably rank better but you never know what other SEOs may be up to and, as such, surprises are not out of the question.
It definitely makes sense to assume that, if you have links pointing to the original article from all sorts of different sources, the odds of being outranked by some of them decrease considerably.
Under normal circumstances, you should indeed have no problems outranking those websites, but on the other hand, never say never
You could even use anchor text such as “this post was originally written by”, but you would stand a higher chance of having your link removed this way.
While there may be some who will filter out your link I agree, most such websites will not be doing anything about it.
Under normal circumstances, there’s no reason why they should have anything against it since the principle is the same as with syndicated articles: you are receiving credit as the author.
It’s really a shame that it has come to this: making such websites work for you when they have no business stealing your content in the first place.
Alan Johnson
Thanks … this is a great plugin and really needed. Now let those content sucking thieves scrape my site … go ahead … I want you to. lol Actually, I had one scraper who scraped my entire site with no link back to me and I got the site banned from Google … in one week, it no longer existed.
Complain to the hosting company and ask them to put a block on the site or take it down; they are liable if they publish copied work. They are more likely to take the content down than the actual person copying it.
I can see a benefit to this plugin for sure… thanks.
Fishing for scrapers… that’s like black blackhat stuff there… do two black hats make a white?
It wouldn’t surprise me if the smart ones did but the bulk of the people that scrape my stuff don’t.
I’ve had scrapers pull the weirdest stuff from my sites. One guy scraped everything including my analytics code lol.
I’d agree with you, Joost. It’s our own content and adding links to it to your own site should never be considered spam. Even if you did use keywords as the anchor text. Pretty soon we’ll have Google telling us that we can only link to our own content with the anchor text of “here” and occasionally “home”.
A while back, there was a similar issue with Google at least trying to determine who was the original creator of content, when Matt Cutts did a post on bacon polenta and another site outranked him for that particular term, with his article and everything.
http://www.google.com/search?num=100&hl=en&q=bacon+polenta&btnG=Search
A well-optimized blog will usually outrank the sites that scrape it, though it’s best to make sure one’s blog is in good repair – we had a rough time with the transition to Wordpress 2.3.x and all of the scrapers outranked us before we eventually noticed that all of our URLs were totally fsck’d up. Once this was fixed we were back on top of the world.
We’ve found it helpful to hack our blogging software to eliminate duplicate content (which also helps us outrank scrapers). Good suggestions for this at http://sebastians-pamphlets.com/how-to-seo-sanitize-a-wordpress-theme/
It seems fair to me also, but I guess we need Google or someone to confirm it so we don’t end up getting punished for ‘gaming the system’.
How is it possible that your post is not the top result when you made the content first? Since your site is very popular, Google must be here all the time and get the content first from you and not from the copycats.
I got a new blog I created up to PR 2 within a short time, possibly because it was targeted by scrapers really quickly. I have links to my tags in every post
I think it is a great idea but I too link back to my “business site” in every post, usually more than once.
I also thought that scrapers removed any links, particularly because they yank my content and never post the author bio information. As I write this, it appears that Joost’s site is getting hit pretty hard, as it’s taking forever to load with my T-3 connection.
My company has been blogging for a few years. We publish posts with embedded links a lot of the time. When we get scraped, whoever’s doing it has thus far removed/replaced all links. This may be because there’s a certain scraper with a certain MO, who likes our particular keyword niche for his MFA stuff- so I obviously can’t say that this happens across the www. Lately there’s been a pattern where the scraper strips out all links but inserts a full-URL-as-anchor-text link under each post, probably so they look more like a real blog and less like a spamgasm, and also helps them try to trackback spam us.
That having been said, Joost has always struck me as a pretty smart guy, and I’m not inclined to doubt him right off the bat – is there anything that this plugin does in particular which can prevent the link it inserts from being stripped out?
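Whether a given scraper strips the link is something you can check yourself once you have a copy of the scraped page. A minimal Python sketch, assuming you already have the page’s HTML as a string; the names `links_back_to` and `_LinkCollector` are made up, and nothing here comes from the plugin:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_back_to(scraped_html, my_domain):
    """True if any link in scraped_html points at my_domain or a subdomain."""
    collector = _LinkCollector()
    collector.feed(scraped_html)
    for href in collector.hrefs:
        host = urlparse(href).hostname or ""
        if host == my_domain or host.endswith("." + my_domain):
            return True
    return False
```

Run it over a handful of scraped copies and you get a rough idea of how many scrapers in your niche leave links intact.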
that’s what i would say “ripagarli con la stessa moneta” (pay them back with the same currency)
Don’t cloak the link or make the anchortext spammy, but otherwise: sure. See the interview I did with Stephan here: http://www.stephanspencer.com/search-engines/matt-cutts-interview where I said that syndicating articles with a link to the original article was smart:
“Stephan Spencer: When one’s articles or product info is syndicated, is it better to have the syndicated copies linked to the original article on the author’s site, or is it just as good if it links to the home page of the author?
Matt Cutts: I would recommend the linking to the original article on the author’s site. The reason is: imagine if you have written a good article and it is so nice that you have decided to syndicate it out. Well, there is a slight chance that the syndicated article could get a few links as well, and could get some PageRank. And so, whenever Google bot or Google’s crawl and indexing system see two copies of that article, a lot of the times it helps to know which one came first; which one has higher PageRank.
So if the syndicated article has a link to the original source of that article, then it is pretty much guaranteed the original home of that article will always have the higher PageRank, compared to all the syndicated copies.”
Now just consider that scraping is syndication of your article that you didn’t give permission for, and bob’s your uncle.
Rather than using a plugin, I just try to link back to relevant articles from my own blog as and when there’s an appropriate link. I think that’s better from an SEO perspective and also has the benefit of getting more RSS subscribers back on the actual site.
depends if they clean the stolen content or not and most don’t
It depends on how you set up your scraper site.
You can auto-generate content from RSS feeds and post little blurbs about the content you steal, like “Check out what so and so said (then stolen content)”. That is done to lessen the complaints.
Been using scrapers to get links for a long time. I wrote a script that auto-generates a rolling RSS feed targeting whatever keywords I want to fish for, and then I ping it at intervals. The RSS scrapers pick up my feed and post the content I use to fish for them, along with a link to my site.
Your site will rank lower for the same content only if search engine crawlers crawl that content after they crawl the higher-ranked sites. There are some other factors as well, but for a high-traffic blog like yours it’s pretty much true.
To avoid that, the best technique would have been to submit your content to social networking sites like Digg.com and others. Since those sites have many more visitors than yours and their content changes more frequently, crawlers crawl them very frequently. If you submit your content there, crawlers will see that and crawl your content, making sure you get the credit.
If the search engine algorithm is right and your site has enough authority, your original post will rank over the stolen content. So as you build authority over time, eventually you will not need to worry about that.
most do not
I was also under the impression that scrapers will remove all links. Anybody want to elaborate on that?
Would it make more sense to link back using the title of the post as the anchor text?
Don’t these scrapers intentionally remove all links? Wouldn’t they just remove this one?
ahh wow what timing …. we spoke about making this a couple weeks ago but he just posted it today.
Yeah, heard about this earlier today over at blogstorm too, must’ve been the time difference Shoe.
Thought it was good and downloaded it, but haven’t had any time to try it out.
heh, i coded exactly same thing about a half year ago
I wouldn´t see why Google would have anything against this, but since sites are being penalized left and right for no good reason I guess an opinion would be nice
Sounds like FeedFooter, I’ll have to check it out. Blogstorm had a post on this subject earlier today too, must be “anti-splogger” day!
Yeah well, I look at it differently: the content is ours, SE’s should make sure NOT to index the scraped content, we’re giving them another hint as to whether the content is original or not