Thursday, August 23, 2007
Python fun with reddit URLs
Scraping the 1000 highest-rated submissions at reddit.com/top:
1. The sites with the most entries are www.nytimes.com, reddit.com, www.flickr.com, www.youtube.com, and www.washingtonpost.com. xkcd.com beats en.wikipedia.org by getting 11 entries to Wikipedia's 10.
2. http://reddit.com/goto?id=1328g got the most points ever: 1937. (Hint, hint)
3. The average score for the top 1000 submissions is 682.851.
4. The longest title has 83 words and reads:
Barak Obama in 2002: "I know that even a successful war against Iraq will require a US occupation of undetermined length, at undetermined cost, with undetermined consequences. I know that an invasion of Iraq without a clear rationale and without strong international support will only fan the flames of the Middle East, and encourage the worst, rather than best, impulses of the Arab world, and strengthen the recruitment arm of al-Qaeda.
I am not opposed to all wars. I’m opposed to dumb wars."
5. There are 637 unique sites.
6. The average title length is 11.774 words.
7. 516 sites have only one submission.
8. The most common uncommon word in the title is [pic] (54 repetitions).
Raw output from the Python program
Scraping the top 1000 submissions on reddit.com:
1. The sites with the most submissions are news.yahoo.com, news.bbc.co.uk, www.youtube.com, www.nytimes.com, and www.washingtonpost.com.
2. The average score is 55.138.
3. The longest title is 51 words.
4. There are 566 unique sites.
5. The average title length is 10.308 words.
6. 365 sites have only one submission.
7. The most common uncommon word in the title is iraq (34 repetitions).
Raw output from the Python program
Scraping the all-time top submissions on programming.reddit.com/top:
1. The sites with the most submissions are www.codinghorror.com, www.joelonsoftware.com, groups.google.com, xkcd.com, and thedailywtf.com.
2. The average score is 221.961.
3. The longest title is 47 words.
4. The average title length is 8.385 words.
5. 559 sites have only one submission.
6. There are 675 unique sites.
7. The most common uncommon word in the title is programming (obviously), with 58 repetitions.
8. Lisp is the most common language name in the title, followed by Python.
9. The maximum score is 1609, earned by http://upload.wikimedia.org/wikipedia/commons/1/17/Metric_system.png
Raw output from the Python program
The Python program can be found at paste.lisp.org. It needs BeautifulSoup to work, and it can run against any subreddit if you modify the base_url in the script. Running this script is a heavy resource drain on the reddit servers, so please do not abuse it. If you need the output files, just mail me and I will be happy to send them to you.
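For reference, here is a minimal sketch of the approach, not the actual script from paste.lisp.org. The class names and score markup are assumptions about reddit's 2007 HTML, so treat the selectors as placeholders.

# Minimal sketch, assuming pages of 25 entries and guessed class names.
import urllib2
from urlparse import urlparse
from BeautifulSoup import BeautifulSoup

base_url = 'http://reddit.com/top?offset='      # change this for other subreddits
titles, sites, points = [], [], []

for offset in range(0, 1000, 25):               # 40 pages x 25 entries = 1000
    soup = BeautifulSoup(urllib2.urlopen(base_url + str(offset)).read())
    for link in soup.findAll('a', {'class': 'title'}):      # submission titles
        titles.append(link.string or '')
        sites.append(urlparse(link['href'])[1])              # site, e.g. www.nytimes.com
    for score in soup.findAll('span', {'class': 'score'}):  # submission scores
        points.append(int(score.string.split()[0]))

print 'total unique sites', len(set(sites))
print 'average points', float(sum(points)) / len(points)
print 'average title length', float(sum(len(t.split()) for t in titles)) / len(titles)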
****fun with reddit urls(base_url = http://reddit.com/top?offset=)****
total sites are 1000
total unique sites 637
top 20 sites are [(u'reddit.com', 24), (u'www.nytimes.com', 24), (u'www.flickr.com', 19), (u'www.youtube.com', 15), (u'www.washingtonpost.com', 14), (u'news.bbc.co.uk', 12), (u'news.yahoo.com', 12), (u'xkcd.com', 11), (u'en.wikipedia.org', 10), (u'www.guardian.co.uk', 10), (u'www.craigslist.org', 9), (u'consumerist.com', 7), (u'www.google.com', 7), (u'www.msnbc.msn.com', 7), (u'www.snopes.com', 7), (u'money.cnn.com', 6), (u'www.crooksandliars.com', 6), (u'www.dailymail.co.uk', 6), (u'community.livejournal.com', 5), (u'pressesc.com', 5)]
Sites with only one entry 516
maximum points are 1937 by http://reddit.com/info/1328g/comments
average points are 682.851
average title length 11.774
largest title has length 83 and is Barak Obama in 2002: "I know that even a successful war against Iraq will require a US occupation of undetermined length, at undetermined cost, with undetermined consequences. I know that an invasion of Iraq without a clear rationale and without strong international support will only fan the flames of the Middle East, and encourage the worst, rather than best, impulses of the Arab world, and strengthen the recruitment arm of al-Qaeda.
I am not opposed to all wars. I’m opposed to dumb wars."
50 most common words are [(u'the', 354), (u'to', 291), (u'of', 243), (u'a', 223), (u'in', 144), (u'and', 133), (u'The', 111), (u'you', 105), (u'for', 104), (u'is', 95), (u'on', 82), (u'-', 71), (u'I', 56), (u'that', 52), (u'with', 51), (u'from', 49), (u'it', 49), (u'A', 47), (u'are', 46), (u'at', 45), (u'this', 39), (u'What', 38), (u'by', 37), (u'not', 37), (u'an', 36), (u'How', 35), (u'You', 33), (u'about', 33), (u'as', 33), (u'your', 33), (u'This', 29), (u'his', 29), (u'[pic]', 27), (u'Bush', 26), (u'be', 26), (u'have', 26), (u'like', 26), (u'up', 26), (u'if', 25), (u'no', 25), (u'Why', 24), (u'can', 24), (u'do', 21), (u'they', 21), (u'what', 21), (u'US', 20), (u'get', 20), (u'or', 20), (u'we', 20), (u'Google', 19)]
50 most common words, ignoring case are [('the', 467), ('to', 307), ('a', 270), ('of', 252), ('in', 162), ('and', 143), ('you', 138), ('for', 117), ('is', 108), ('on', 92), ('-', 71), ('this', 71), ('that', 64), ('it', 61), ('what', 59), ('i', 58), ('from', 56), ('with', 56), ('[pic]', 54), ('are', 53), ('not', 50), ('at', 48), ('your', 48), ('an', 47), ('how', 47), ('if', 41), ('by', 40), ('about', 39), ('as', 36), ('can', 34), ('why', 34), ('no', 33), ('we', 33), ('have', 32), ('do', 31), ('his', 31), ('they', 31), ('(pic)', 30), ('like', 29), ('up', 28), ('bush', 27), ('one', 27), ('be', 26), ('who', 25), ('all', 23), ('it's', 23), ('so', 23), ('was', 23), ('when', 23), ('but', 22)]
****fun with reddit urls(http://reddit.com/?offset=)****
total sites are 1000
total unique sites 566
top 20 sites are [(u'news.yahoo.com', 22), (u'news.bbc.co.uk', 21), (u'www.youtube.com', 18), (u'www.nytimes.com', 16), (u'www.washingtonpost.com', 12), (u'www.wired.com', 11), (u'www.cnn.com', 9), (u'thinkprogress.org', 8), (u'www.guardian.co.uk', 8), (u'www.salon.com', 7), (u'blog.wired.com', 6), (u'www.chinapost.com.tw', 6), (u'www.dailymail.co.uk', 6), (u'www.myfoxdfw.com', 6), (u'www.opednews.com', 6), (u'www.reuters.com', 6), (u'www.telegraph.co.uk', 6), (u'www.timesonline.co.uk', 6), (u'apnews.myway.com', 5), (u'en.wikipedia.org', 5)]
Sites with only one entry 365
maximum points are 895
average points are 55.138
average title length 10.308
largest title has length 51 and is [Quote] A tyrant must put on the appearance of uncommon devotion to religion. Subjects are less apprehensive of illegal treatment from a ruler whom they consider god-fearing and pious. On the other hand, they do less easily move against him, believing that he has the gods on his side - Aristotle
50 most common words are [(u'the', 306), (u'of', 232), (u'to', 221), (u'a', 159), (u'in', 157), (u'and', 125), (u'The', 105), (u'for', 94), (u'-', 90), (u'on', 79), (u'is', 68), (u'with', 48), (u'by', 41), (u'that', 39), (u'A', 37), (u'Iraq', 37), (u'from', 34), (u'Bush', 33), (u'are', 31), (u'New', 30), (u'at', 29), (u'as', 28), (u'have', 26), (u'you', 26), (u'How', 25), (u'your', 25), (u'Of', 24), (u'US', 24), (u'about', 23), (u'In', 22), (u'not', 22), (u'For', 21), (u'I', 20), (u'To', 19), (u'be', 19), (u'this', 19), (u'Vietnam', 18), (u'an', 18), (u'they', 18), (u'American', 17), (u'no', 17), (u'U.S.', 16), (u'was', 16), (u'their', 15), (u'will', 15), (u'Is', 14), (u'What', 14), (u'Why', 14), (u'You', 14), (u'has', 14)]
50 most common words, ignoring case are [('the', 414), ('of', 257), ('to', 241), ('a', 196), ('in', 179), ('and', 132), ('for', 117), ('on', 94), ('-', 90), ('is', 83), ('with', 62), ('that', 46), ('by', 44), ('are', 40), ('new', 40), ('you', 40), ('not', 39), ('iraq', 37), ('your', 37), ('from', 35), ('at', 34), ('bush', 33), ('how', 33), ('as', 30), ('us', 30), ('about', 28), ('an', 28), ('have', 28), ('be', 25), ('do', 25), ('they', 25), ('no', 24), ('this', 24), ('war', 23), ('will', 23), ('it', 21), ('i', 20), ('my', 20), ('out', 20), ('what', 20), ('police', 19), ('has', 18), ('vietnam', 18), ('we', 18), ('why', 18), ('american', 17), ('if', 17), ('says', 17), ('their', 17), ('was', 17)]
****fun with reddit urls(base_url = http://programming.reddit.com/top?offset=)****
total sites are 1000
total unique sites 675
top 20 sites are [(u'www.codinghorror.com', 31), (u'www.joelonsoftware.com', 22), (u'groups.google.com', 17), (u'xkcd.com', 16), (u'thedailywtf.com', 12), (u'programming.reddit.com', 10), (u'worsethanfailure.com', 10), (u'paulgraham.com', 9), (u'blogs.msdn.com', 8), (u'blogs.sun.com', 8), (u'www.defmacro.org', 8), (u'arstechnica.com', 7), (u'en.wikipedia.org', 7), (u'kerneltrap.org', 7), (u'steve-yegge.blogspot.com', 7), (u'weblog.raganwald.com', 7), (u'codist.biit.com', 6), (u'scienceblogs.com', 6), (u'www.paulgraham.com', 6), (u'diveintomark.org', 5)]
Sites with only one entry 559
maximum points are 1609 by http://upload.wikimedia.org/wikipedia/commons/1/17/Metric_system.png
average points are 221.961
average title length 8.385
largest title has length 47 and is "The "you don't own your computer" paradigm is not merely wrong. It is violently, disastrously wrong, and the consequences of this error are likely to be felt for generations to come, unless steps are taken to prevent it." On the need for a Hippocratic Oath for programmers.
50 most common words are [(u'the', 186), (u'to', 168), (u'of', 159), (u'a', 148), (u'The', 137), (u'in', 103), (u'and', 89), (u'-', 79), (u'for', 77), (u'on', 71), (u'is', 66), (u'Why', 54), (u'you', 54), (u'I', 46), (u'How', 45), (u'A', 38), (u'Programming', 38), (u'with', 36), (u'your', 33), (u'Google', 32), (u'What', 30), (u'by', 30), (u'Lisp', 29), (u'about', 26), (u'from', 26), (u'Software', 25), (u'it', 25), (u'not', 25), (u'an', 24), (u'are', 24), (u'code', 22), (u'that', 22), (u'Python', 21), (u'do', 21), (u'Linux', 20), (u'be', 20), (u'programming', 20), (u'software', 20), (u'Web', 18), (u'To', 17), (u'at', 17), (u'this', 17), (u'Is', 16), (u'all', 16), (u'as', 16), (u'how', 16), (u'why', 15), (u'--', 14), (u'Microsoft', 14), (u'Ruby', 14)]
50 most common words, ignoring case are [('the', 323), ('a', 186), ('to', 186), ('of', 164), ('in', 110), ('and', 98), ('for', 83), ('is', 82), ('on', 80), ('-', 79), ('you', 71), ('why', 69), ('how', 61), ('programming', 58), ('i', 46), ('software', 45), ('your', 41), ('with', 40), ('what', 39), ('not', 35), ('code', 34), ('it', 34), ('lisp', 34), ('an', 33), ('about', 32), ('by', 32), ('google', 32), ('are', 30), ('from', 30), ('do', 29), ('web', 29), ('all', 25), ('be', 25), ('computer', 25), ('my', 25), ('this', 25), ('that', 24), ('one', 22), ('language', 21), ('linux', 21), ('python', 21), ('can', 20), ('at', 19), ('new', 19), ('things', 18), ('when', 18), ('as', 17), ('it's', 17), ('like', 17), ('programmers', 17)]
Friday, January 19, 2007
The war for your search bar
If you are anything like me, you probably have the Google Toolbar installed in your primary browser. It does many things, but foremost, it lets you search Google without going to google.com.
But there is a search bar built right into your browser. It sits right next to the address bar.

Now you might think that no one would care about such a puny, teeny-weeny search bar. And sir, could you be more wrong?
It all started when I wanted to install Picasa; this is what I got in the last step of the installation.

Now Picasa is image-management software. Why should it try to reset my search preferences? Oh, and by the way, the default option is to switch the default search engine, not to retain your preferences.

Aw! Not so fast, Yahoo baby. Cap'n Google won't let you change the default option.

Well then, let's try the MSN toolbar.

So does opening Gmail change search preferences too? Looks like it does not. Thank God for small mercies.


Do they do this with Firefox too?

Looks like they do.
When you install a toolbar, aren't you already reserving a part of your screen real estate for that search engine? And then shouldn't the toolbar leave your search bar alone, instead of trying to capture it?
Thursday, January 18, 2007
ACAPTCHA - Almost Completely Automated Public Turing test to tell Computers and Humans Apart
CAPTCHAs generally (but not always) solve the problem of comment spam and other spam. But this comes at a price. Users with low vision and other disabilities find solving CAPTCHAs hard. And blind users can't solve them at all unless you provide an alternative audio CAPTCHA. Why, even Seth hates it!
Negative CAPTCHA - where you hide form fields via CSS so a user can't see them and hence won't fill them in, while bots will - is an interesting possibility. But let me introduce ACAPTCHA - "Almost Completely Automated Public Turing test to tell Computers and Humans Apart" - to you. This is what you do.
1. There are some questions which are very easy for humans to answer but very difficult for bots to understand. Take "What color is a blue towel?" or "Is a green towel red?". Any (well, most) humans can answer such a question in a snap, but probably no bot can.
2. Create a centralized AND rapidly changing repository of such questions. Maybe allow users to submit new questions and answers there. Maybe peer-review questions before accepting them; whatever you do, get a large and fast-changing repository.
3. Create a plugin/architecture where you get a random question from the repository (a la Akismet, which is a distributed anti-spam engine) and ask users to solve it.
There are already some sites which try to do something similar. They ask something like "What is 2 + 2?". The problem is, it is probably very easy to break this. As soon as it becomes mainstream, you can be sure that the bots will break through and abuse it. To beat completely automated systems, you need to bring in human intelligence.
Updates -
Foo asked: "The repo would have to include the *answers* and be as easily downloadable, right? Right. So Mr. Spammer wins again."
And I say: Well, no. The idea is that the central repository has, say, a million questions and answers. Whenever any site wants to check using an ACAPTCHA, it asks for a question-answer pair (using an API). Now no one except the repository has all the questions, and each time the spammers get a new question. This is why you need the repository to get new questions quickly, so that spammers cannot build up a bank of questions over time and learn their answers.
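To make the flow concrete, here is a rough sketch of the site-side plugin. The repository URL, the API key and the JSON field names are all hypothetical; they only illustrate the question-answer-pair exchange described above.

# Hypothetical sketch only: acaptcha.example.com, the API key and the JSON
# field names are made up to illustrate the flow, not a real service.
import json
import urllib2

REPO_URL = 'http://acaptcha.example.com/api/random-question?key=YOUR_API_KEY'

def get_challenge():
    # Fetch one random question/answer pair from the central repository.
    data = json.loads(urllib2.urlopen(REPO_URL).read())
    return data['question'], data['answer']

def check_answer(expected, submitted):
    # Loose comparison so "Blue" and " blue " both pass.
    return expected.strip().lower() == submitted.strip().lower()

# In a comment-form handler: show the question with the form, keep the answer
# in the session, and accept the comment only if check_answer(answer, user_input).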
Sunday, June 25, 2006
What does not work with SEO.
So instead of adding to that garbage and telling you what works in SEO, let me tell you what does not.
If there is a technique which everyone is using, run. Run fast and away from it. It is going to be abused by shady SEO guys, and then you can be sure that the SEs will penalise it.
Article submission and directory submission are the things to do right now. But with everyone using them, I wonder how long that is going to stay that way!
Wednesday, May 03, 2006
How Google helps spammers and destroys your internet experience.

Are you a webmaster? Quick, name one term which just spoils your day. Was it MFA? MFA: Made-for-AdSense sites. Automated sites which just copy content and add no value.
Google's lax enforcement of the AdSense TOS means that spammers can show AdSense on crappy sites and get away with it. It means that you are forced to see pages with a few sentences and 3 ad units. It means search-engine spammers can get away with anything: visitors lose time, publishers lose money, and valid AdSense ads get a bad rep.
But that is not even the worst part.
Google is actively, OK almost actively, promoting these black-hat techniques.
What does Joe BlackHatter need to create an MFA site? Software. Software to spew out a bazillion automated sites. Now Google has declared a jihad against automated/scraped content, so you would think that they would not touch it with a 10-foot barge pole. OK, let's just ask Google: http://www.google.com/search?q=AUTOMATIC+CONTENT+GENERATOR
The result

As a leading search engine, one would expect Google to take a pro-active role in weeding them out.
Just for fun, here are some more search results where Google lists spammy software. (Only sponsored results are shown.)

http://www.google.com/search?q=Adsense
http://www.google.com/search?q=CLOAKING
So next time you see Matt Cutts crying about black-hat SEOs being the scum of the earth, just drop him a line.
(If you liked this story, why not digg it?)
Thursday, April 27, 2006
How spammers are beating CAPTCHA.
Just in case you do not know, CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart.

CAPTCHAs are the pictures containing words you have to type before you can post a comment on blogs, write something on Digg, or create a free mail account.
Now spammers need a lot of free email accounts, and they want to comment-spam your blog. For this they need to beat the CAPTCHA.
Spammers are beating CAPTCHAs in two ways. Unless the image is very blurred or grainy, image-processing software can be used to extract the words in them. The guy at http://www.mperfect.net/aiCaptcha/ gives an example of how CAPTCHAs can be beaten using software. But there is an even better way: social engineering.
What is the internet most used for? I do not have the statistics, but I am willing to bet that PORN is right there at the top. And what is even better than porn? Free porn, obviously.
When Mr. BigSpammer needs to break a million CAPTCHAs, he makes a tie-up with BigFreePornSite.com. His software gets the CAPTCHA images and sends them to BigFreePornSite.com. When Joe TeenHighOnSex visits BigFreePornSite.com, he is asked to type in the text from the CAPTCHA image, which is sent back to Mr. BigSpammer's servers. Lo, the CAPTCHA is broken. Now Mr. BigSpammer can comment-spam, Digg-spam, and Yahoo-spam.
Monday, April 24, 2006
Add Links for Del.icio.us, Digg, and More to Blogger Posts
To add a quick link on your blog to all of the popular traffic-boosting sites, simply add the code below to your template. I generally add it just below the content part, but you can put it anywhere. (A small sketch after the list shows what these URLs expand to.)
(How to edit your Blogger template.)
Del.icio.us Link:
http://del.icio.us/post?url=<$BlogItemPermalinkURL$>&title=<$BlogItemTitle$>
Digg Link:
http://digg.com/submit?phase=2&url="<$BlogItemPermalinkURL$>"
Technorati Cosmos Link:
http://technorati.com/cosmos/search.html?url=<$BlogItemPermalinkURL$>
Furl Link:
http://furl.net/storeIt.jsp?t=<$BlogItemTitle$>&u=<$BlogItemPermalinkURL$>
reddit Link:
http://reddit.com/submit?url=<$BlogItemPermalinkURL$>&title=<$BlogItemTitle$>
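If you are curious what those links expand to, here is a tiny Python sketch that builds the same submit URLs for a made-up post (the title and permalink below are invented examples); in the Blogger template, the <$BlogItemTitle$> and <$BlogItemPermalinkURL$> tags do this substitution for you.

# Illustration only: the title and permalink below are made-up examples.
import urllib

title = 'My example post'                                                 # <$BlogItemTitle$>
permalink = 'http://example.blogspot.com/2006/04/my-example-post.html'   # <$BlogItemPermalinkURL$>
q = {'url': urllib.quote(permalink, safe=''), 'title': urllib.quote(title)}

print 'http://del.icio.us/post?url=%(url)s&title=%(title)s' % q
print 'http://digg.com/submit?phase=2&url=%(url)s' % q
print 'http://reddit.com/submit?url=%(url)s&title=%(title)s' % q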
Do automatic content generators work?
What are automatic content generators?
Automatic content generators are software which claim to create content, in the form of articles, automatically. Writing articles without human intervention might seem an amazing capability, but in practice the software just rewrites existing articles to create new ones.
There are three main ways in which these content generators work.
- Scraping. The software gets different parts of the articles from different places and joins them all together to create a new article.
- Thesaurus substitution. Synonyms are substituted in the original article to create the new article.
- Markov chains: a statistical model of the existing article is built, and the new article is generated from that model.
Of all these methods, Markov chains hold the most promise, as they are the hardest to detect.
So what are Markov chains?
Apart from being lots of bullshit in computer science, they are a tool to create pseudo-random text from a statistical model of another text. Since it is based on non-random text, most of the time it will follow the rules of English grammar. Given a large non-random text to build the statistical model from, it will generate text which can sometimes pass the scrutiny of humans.
A Markov chain keeps track of which words follow a given set of words; based on this data, the new text is created.
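As a concrete illustration, here is a tiny word-level Markov chain generator. This is just a sketch, not the MarkovSeo tool linked in the references, and source_article.txt is a placeholder filename. It builds the "which words follow which" table from a source text and then walks that table to spit out new text.

# Tiny word-level Markov chain text generator (illustration only).
import random

def build_model(text, order=2):
    # Map each `order`-word prefix to the list of words seen right after it.
    words = text.split()
    model = {}
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        model.setdefault(prefix, []).append(words[i + order])
    return model

def generate(model, length=100, order=2):
    # Start from a random prefix and keep picking a random continuation.
    out = list(random.choice(model.keys()))
    for _ in range(length):
        choices = model.get(tuple(out[-order:]))
        if not choices:                      # dead end: no known continuation
            break
        out.append(random.choice(choices))
    return ' '.join(out)

source = open('source_article.txt').read()   # placeholder input file
print generate(build_model(source))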
My experiments with Markov chains.
Most black-hat SEO techniques leave some footprint which the SEs use to identify an article as automatically generated. This leaves commercial automatic content generators vulnerable. I wanted to check whether the SEs are able to identify Markov chain content, and for this purpose I wrote my own software. I tried to remove other signs which might flag the content as automatically generated: in particular, the file sizes were varied, trailing sentences which ended abruptly were removed, and paragraph breaks were introduced.
A site was created with such content and hosted on Tripod. It was given a link from a PR 3 page. We checked the position of the web pages in the SEs from time to time. After a period of four months, no references to the automatically created web pages were found.
So, the final word.
Since the web pages were not included in the SEs' indexes, the value of creating such web pages is very limited. There are some commercial software packages which claim to create articles automatically. I have tried only one of them, so I cannot make claims about their effectiveness. But they all use basically the same algorithms, so the results should hold for the others as well.
References.
- URLs to created web pages. List at http://seo-experiments.blogspot.com/.
- Source and Binaries of the SW used to create web pages. http://www.fileshack.us/files/1058/MarkovSeo.zip
Saturday, April 01, 2006
Morpheus to Neo, the Matrix.
I surely am an Alice wandering the SEO wonderland. And from what I gather, no one knows anything in SEO. OK, let me rephrase that: no one knows most of the things in SEO. The SEO wonderland I have been wandering consists of the forums of HighRankings, DigitalPoint and WMW.
Are reciprocal links dead? Almost, but *mutual* links are in!
Are directories the next big thing? Umm, erm, if they are niche, or if they are DMOZ.
OK, those were the easy ones.
If I name a page link.html, do the SEs ignore it?
Do SEs respect the nofollow tag?
For once, guys, give me an honest-to-god, clear answer.
If I really want to know the PR of a site, I do have tricks with which I can get it. But, well, it's against the TOS. So they hurt the webmaster community, and it does not help them in any way either. When people get PR without the API, methinks the server load on Google will be higher. Not that Google would be concerned about it or anything.
So why not
1. Make the PR publicly available via, say, an API?
2. If you do not want to do so, why not do away with public PR data altogether? Use it only internally, and never show it to an outside guy. Or is PR just a way to force you to use the Google toolbar?