Monday, April 24, 2006

Do automatic content generators work?

If you are in a hurry and cannot wait to read the rest of the article: no, they do not.

What are automatic content generators?

Automatic content generators are programs that claim to create content, in the form of articles, automatically. Writing articles without human intervention might seem an amazing capability, but in practice the software rewrites existing articles to create new ones.

There are three main ways in which these content generators work.

  1. Scraping. The software takes parts of articles from different places and joins them together to create a new article.
  2. Thesaurus substitution. Words in the original article are replaced with synonyms to create the new article.
  3. Markov chains. A statistical model of the existing article is built, and the new article is generated from that model.
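To make the second method concrete, here is a minimal sketch of thesaurus substitution in Python. The synonym table and the function name are made up for illustration; a real tool would presumably load a full thesaurus, and nothing here is taken from any particular commercial product.

```python
import re

# Hypothetical, hand-made synonym table; a real tool would load a full thesaurus.
SYNONYMS = {
    "create": "produce",
    "new": "fresh",
    "article": "write-up",
}

def substitute_synonyms(text, synonyms=SYNONYMS):
    """Replace each known word with its synonym, keeping a leading capital."""
    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word  # no synonym known: leave the word alone
        return replacement.capitalize() if word[0].isupper() else replacement
    # Only runs of letters are treated as words; punctuation is left untouched.
    return re.sub(r"[A-Za-z]+", swap, text)

print(substitute_synonyms("Create a new article."))
```

Because every word is swapped blindly, without looking at context, the result tends to read badly once the table grows beyond a handful of safe pairs.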

Of all these methods, Markov chains hold the most promise, as their output is the hardest to detect.

So what are Markov chains?

Apart from being lots of bullshit in computer science, they are a tool to create pseudo-random text from a statistical model of another text. Since the model is based on non-random text, the output will follow the rules of English grammar most of the time. Given a large non-random text to build the statistical model from, it will generate text which can sometimes pass the scrutiny of humans.

A Markov chain takes into account which words follow a given set of words in the source text. Based on this data, the new text is created.
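The idea can be sketched in a few lines of Python. This is only an illustration of a word-level Markov chain, not the code from the software linked in the references; the function names, the `order` parameter (how many words make up "a given set of words") and the `seed` parameter are assumptions made for the example.

```python
import random
from collections import defaultdict

def build_model(text, order=2):
    """Map each run of `order` consecutive words to the words seen after it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model[key].append(words[i + order])
    return model

def generate(model, length=50, seed=None):
    """Walk the model, picking each next word at random from the observed followers."""
    if not model:
        return ""
    rng = random.Random(seed)
    key = rng.choice(list(model.keys()))  # start from a random word run
    output = list(key)
    while len(output) < length:
        followers = model.get(key)
        if not followers:  # dead end: this run only occurs at the very end of the source
            break
        output.append(rng.choice(followers))
        key = tuple(output[-len(key):])
    return " ".join(output)

source = "the cat sat on the mat and the cat ran to the door"
print(generate(build_model(source), length=12))
```

Followers are stored with repeats, so a word that follows a given pair often in the source is picked proportionally more often. A larger `order` gives text that stays closer to the source; a smaller one gives more variety and more nonsense.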

My experiments with Markov chains.

Most black hat SEO techniques leave some footprint which the search engines (SEs) use to identify an article as automatically generated. This leaves commercial automatic content generators vulnerable. I wanted to check whether the SEs are able to identify Markov chain content, so I wrote my own software. I tried to remove other signs which might flag the content as automatically generated. In particular, I changed the size of the files, removed trailing sentences which ended abruptly, and introduced paragraph breaks.
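The clean-up steps can be sketched roughly as follows. This is a guess at the general approach, not the code from my software: the function name, the fixed `sentences_per_paragraph` value and the simple rule that a sentence ends in `.`, `!` or `?` are all assumptions for the example.

```python
import re

def tidy(text, sentences_per_paragraph=4):
    """Drop a trailing unfinished sentence and group the rest into paragraphs."""
    # Keep only runs that end in ., ! or ?; an unterminated tail is discarded.
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
    paragraphs = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]
    # A blank line between groups acts as the paragraph break.
    return "\n\n".join(paragraphs)

print(tidy("One. Two. Three. Four and", sentences_per_paragraph=2))
```

The file size could then be changed by asking the generator for a different number of words for each page.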

A site was created with this content and hosted on Tripod. It was given a link from a PR 3 page. We checked the position of the web pages in the SEs from time to time. After a period of four months, no references to the automatically created web pages were found.

So the final words.

Since the web pages were not included in the SEs' indexes, the value of creating such web pages is very limited. There is some commercial software which claims to create articles automatically. I have tried only one of them, so I cannot make claims about their effectiveness. But basically all of them use the same algorithms, so the results should hold for the others as well.

References.

  1. URLs to created web pages. List at http://seo-experiments.blogspot.com/.
  2. Source and Binaries of the SW used to create web pages. http://www.fileshack.us/files/1058/MarkovSeo.zip

9 comments:

NORTH said...

What's that saying, "there is nothing free in this world"? Automatic content generation is easily possible, but man it sucks.

-NORTH

YOLKSMOKE <-- curious, aren't you?
http://www.yolksmoke.com

shabda said...

When you say "man it sucks", do you mean that:
1. The idea sucks? Yes, I agree. It destroys the quality of articles found on the web.
2. The software sucks? I disagree. It was not my intention to provide turnkey spamming software, just a proof of concept.

pierre said...

I'm not sure I agree with the conclusion you guys reached. It's unclear whether one experiment proves anything.

It is my opinion the problem with your experiment is in *your* content on the website not being worthy of linking to in the first place. Nothing personal - my opinion.

That being said, all automated Markov content algorithms do is learn patterns in the text and simply reproduce them in another way. So, from a semantic point of view, the output should be very similar to the original text that the model was trained on.

I have some ideas on how you can test this from a more scientific point of view. Namely, because I do have a Ph.D. in Computer Science and work with Markov chains all the time (i.e., queueing theory)

You can see my website at

www.cs.usm.maine.edu/~pfiorini

If you want to collaborate on an experiment with content generators, I'll give you an edu link.

e-mail me at pfiorini@maine.edu

- PMF

pierre said...

Hi,

We e-mailed each other earlier regarding automated content (i.e., Markov generated) not being indexed by Google. This is *not* the case, as I have done this. Amazingly, Google does index this gibberish.

Also, you can visit my page at http://www.cs.usm.maine.edu/~pfiorini/research/LSI_ArticleBot.htm
for an example of an article that was synonymized by ArticleBot. Basically, it's completely unreadable.

Suggestion: if you want to get a Markov generated page indexed by Google, I think you need some link to it that has a reasonable bit of authority. In my study, I used a PR4 page (my home page), and that worked very well. In fact, in a matter of 3 days, Google indexed the page.

Cheers,

- Pierre

pierre said...

Oh,

Just to be clear, some pages were generated using a Markov Generator (I forget which), and Google *did* index them.

- PMF

Bali Villa Rental said...

Do not let a machine write up your mind; the sense will be lost...

I created some vacation rental websites. Should I use a unique content generator to produce unique content for the villa descriptions???

No, no and no...

A unique content generator is okay if you just want to make money without giving something useful.

hl67 said...

How does this affect programs like WP Robot?

Android app developers said...

Such a wonderful post. I like your imagination. Nice to share with us. Great work.
Android app developers

Webby99 said...

Anyone heard of Argo-content generator?