Thursday, April 27, 2006

How spammers are beating CAPTCHA.

(Ok, this is not exactly SEO, but I know you would be interested in this.)
Just in case you do not know, CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart.

CAPTCHAs are the pictures containing words you have to type before you can post a comment on blogs, write something on Digg or make a free mail account.
Now spammers need a lot of free email accounts, and they want to comment-spam your blog. For this they need to beat the CAPTCHA.
Spammers are beating CAPTCHA in two ways. Unless the image is very blurred or grainy, image-processing software can be used to read the words in it. The guy at http://www.mperfect.net/aiCaptcha/ gives an example of how CAPTCHA can be beaten using software. But there is an even better way: social engineering.
What is the internet most used for? I do not have the statistics, but I am willing to bet that PORN is right there at the top. And what is even better than porn? Free porn, obviously.
When Mr. BigSpammer needs to break a million CAPTCHAs, he ties up with BigFreePornSite.com. His software grabs the CAPTCHA images and sends them to BigFreePornSite.com. When Joe TeenHighOnSex visits BigFreePornSite.com, he is asked to type in the text from a CAPTCHA image, and whatever he types is relayed back to Mr. BigSpammer's servers. Lo, the CAPTCHA is broken. Now Mr. BigSpammer can comment-spam, Digg-spam and Yahoo-spam.
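To make the mechanics a little clearer, here is a rough sketch of how such a relay could be wired up. Everything in it (the function names, the in-memory queue) is made up for illustration; it is not anybody's actual spam kit, just the flow described above written out as code.

    # Hypothetical sketch of a CAPTCHA relay, for illustration only.
    # In reality this sits behind two web sites; here the flow is shown
    # as plain function calls on the spammer's side.

    import queue

    pending = queue.Queue()   # CAPTCHA images waiting to be solved
    solved = {}               # captcha_id -> text typed by a human

    def enqueue_captcha(captcha_id, image_bytes):
        """The spammer's bot hits a signup page, downloads the CAPTCHA
        image and parks it here instead of trying to solve it."""
        pending.put((captcha_id, image_bytes))

    def serve_to_visitor():
        """The 'free porn' site asks for the next image and shows it to
        a visitor as its own access check."""
        return pending.get()

    def submit_solution(captcha_id, text):
        """Whatever the visitor types is relayed straight back; the bot
        then pastes it into the original signup form."""
        solved[captcha_id] = text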

Monday, April 24, 2006

Add Links for Del.icio.us, Digg, and More to Blogger Posts

Social bookmarking sites can be a very effective way to get visitors to your sites. If your readers like what you say, why not give them a chance to bookmark you at del.icio.us and other similar sites?
To add a quick link on your blog to all of the popular traffic-boosting sites, simply add the code below to your template. I generally add it just below the content part, but you can put it anywhere.

(How to edit the Blogger template.)

Del.icio.us Link:
http://del.icio.us/post?url=<$BlogItemPermalinkURL$>&title=<$BlogItemTitle$>

Digg Link:
http://digg.com/submit?phase=2&url=<$BlogItemPermalinkURL$>

Technorati Cosmos Link:
http://technorati.com/cosmos/search.html?url=<$BlogItemPermalinkURL$>

Furl Link:
http://furl.net/storeIt.jsp?t=<$BlogItemTitle$>&u=<$BlogItemPermalinkURL$>

reddit Link:
http://reddit.com/submit?url=<$BlogItemPermalinkURL$>&title=<$BlogItemTitle$>

Do automatic content generators work?

If you are in a hurry and cannot wait to read the rest of the article: no, they do not.

What are automatic content generators?

Automatic content generators are software which claim to create content, in the form of articles, automatically. Writing articles without human intervention might seem an amazing capability, but the software simply rewrites existing articles to create new ones.

There are three main ways in which these content generators work.

  1. Scraping: the software takes different parts of articles from different places and joins them together to create a new article.
  2. Thesaurus substitution: synonyms are substituted into the original article to create the new article (a toy sketch follows this list).
  3. Markov chains: a statistical model of the existing article is built, and the new article is generated from that model.
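As promised, here is a toy illustration of method 2. The synonym table and the spin() function are invented for this example; a commercial tool would ship with a far larger thesaurus and smarter matching.

    # Toy "thesaurus substitution": every word with an entry in the
    # synonym table is swapped out. The table here is made up.

    SYNONYMS = {
        "amazing": "astonishing",
        "create": "produce",
        "software": "program",
    }

    def spin(text):
        words = []
        for word in text.split():
            # Strip trailing punctuation so "create," still matches "create".
            core = word.rstrip(".,;:!?")
            tail = word[len(core):]
            words.append(SYNONYMS.get(core.lower(), core) + tail)
        return " ".join(words)

    print(spin("This software can create an amazing article."))
    # -> This program can produce an astonishing article.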

Of these methods, the Markov chain holds the most promise, as it is the hardest to detect.

So what are Markov chains?

Apart from being a lot of bullshit in computer science, they are a tool to create pseudo-random text from a statistical model of another text. Since the model is built from non-random text, most of the time the output will follow the rules of English grammar. Given a large enough source text to build the statistical model from, it will generate text which can sometimes pass the scrutiny of humans.

A Markov chain keeps track of which words follow a given set of words. Based on this data the new text is created, one word at a time.
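Here is a minimal word-level sketch of the idea in Python. This is my own illustration, not the code from the MarkovSeo.zip linked in the references below, and seed_article.txt is just a placeholder for whatever source text you feed it.

    # A minimal word-level Markov chain text generator.

    import random
    from collections import defaultdict

    def build_model(text, order=2):
        """Map every run of `order` consecutive words to the list of
        words seen following it in the source text."""
        words = text.split()
        model = defaultdict(list)
        for i in range(len(words) - order):
            key = tuple(words[i:i + order])
            model[key].append(words[i + order])
        return model

    def generate(model, length=50):
        """Start from a random key, then repeatedly pick a random word
        that followed the current key in the source text."""
        key = random.choice(list(model.keys()))
        output = list(key)
        for _ in range(length):
            followers = model.get(key)
            if not followers:    # dead end: key only appeared at the very end
                break
            output.append(random.choice(followers))
            key = tuple(output[-len(key):])
        return " ".join(output)

    source = open("seed_article.txt").read()   # placeholder: any existing article
    print(generate(build_model(source)))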

My experiments with Markov chains.

Most black-hat SEO techniques leave some footprint which the SEs use to identify an article as automatically generated, and this leaves commercial automatic content generators vulnerable. I wanted to check whether the SEs are able to identify Markov chain content, so I wrote my own software. I also tried to remove other signs which might flag the content as automatically generated: the sizes of the files were varied, trailing sentences which ended abruptly were removed, and paragraph breaks were introduced.
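The clean-up step could look something like the sketch below. This is my reconstruction for illustration only, not the actual code from the zip in the references.

    import random

    def clean_up(generated_text, sentences_per_paragraph=4):
        """Trim the trailing fragment and re-introduce paragraph breaks
        so the page does not look like one machine-made wall of text."""
        # Drop whatever comes after the last full stop.
        text = generated_text[:generated_text.rfind(".") + 1]

        # Group the sentences into paragraphs of slightly varying size.
        sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
        paragraphs, i = [], 0
        while i < len(sentences):
            n = sentences_per_paragraph + random.randint(-1, 1)
            paragraphs.append(" ".join(sentences[i:i + n]))
            i += n
        return "\n\n".join(paragraphs)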

A site was created with such content and hosted on Tripod. It was given a link from a PR 3 page. I checked the position of the web pages in the SEs from time to time. After a period of four months, no references to the automatically created web pages were found.

So the final words.

Since the web pages were not included in the SEs' indexes, the value of creating such web pages is very limited. There is some commercial software which claims to create articles automatically. I have tried only one of these products, so I cannot make claims about their effectiveness, but they basically all use the same algorithms, so the results should hold for the others as well.

References.

  1. URLs of the created web pages: list at http://seo-experiments.blogspot.com/.
  2. Source and binaries of the software used to create the web pages: http://www.fileshack.us/files/1058/MarkovSeo.zip

Saturday, April 01, 2006

"You must be feeling a bit like alice now?"
Morpheus to Neo, the Matrix.
I surely am an alice wandering the SEO wonderland. And from what I gather, no one knows any thing in SEO. Ok let me rephrase it to, no one knows most of the things in the SEO. The SEO wondeland I have been wandering consists of the forums of highrankings, digitalpoint and WMW.
Are reciprocal links dead? Almost, but *mutual* links are in!
Are directories the next big thing? Umm, erm, only if they are niche, or they are DMOZ.
Ok, these were the easy ones.
If I name a page link.html, do the SEs ignore it?
Do SEs respect the nofollow tag?
For once, guys, give me an honest-to-god, clear answer.
I have a real grudge against Google. Why do they not make the PR data publicly available, say via their API? There are ways to get PR data, via sites such as www.prchecker.info/check_page_rank.php. By not making PR data publicly available, Google is only hurting everyone.
If I really want to know the PR of a site, I do have tricks with which I can get it. But, well, it's against the TOS. So they hurt the webmaster community, and it does not help them in any way. When people get PR without an API, methinks the server load on Google will be higher. Not that Google would be concerned about it or anything.
So why not:
1. Make the PR publicly available via, say, its API?
2. If they do not want to do so, why not do away with PR data altogether? Use it only internally, never show it to an outside guy. Or is PR just a way to force you to use the Google toolbar?
The SEO world, as seen by the completely brain-dead.