Thursday, August 23, 2007

Python fun with reddit URLs

These days I spend a lot of time on reddit, so I got an itch to find out which sites are the most popular there, what people comment on, and how many points a submitted URL gets on average. I wrote a quick Python program (Description) to scrape reddit, and here is what I found.

Scraping the 1000 highest-rated submissions at reddit.com/top:
1. The sites with the most entries are www.nytimes.com, reddit.com, www.flickr.com, www.youtube.com, and www.washingtonpost.com. xkcd.com beats en.wikipedia.org, with 11 entries to Wikipedia's 10.
2. http://reddit.com/goto?id=1328g got the most points ever, 1937. (Hint, hint)
3. The average score for the top 1000 submissions is 682.851 points.
4. The longest title has 83 words and reads:
Barak Obama in 2002: "I know that even a successful war against Iraq will require a US occupation of undetermined length, at undetermined cost, with undetermined consequences. I know that an invasion of Iraq without a clear rationale and without strong international support will only fan the flames of the Middle East, and encourage the worst, rather than best, impulses of the Arab world, and strengthen the recruitment arm of al-Qaeda.
I am not opposed to all wars. I’m opposed to dumb wars."
5. There are 637 unique sites.
6. Average title length is 11.774 words.
7. 516 sites have only one submission.
8. The most common uncommon word in the titles is [pic] (54 repetitions).
raw output from python program

Scraping the top 1000 submissions on reddit.com:
1. The sites with the most submissions are news.yahoo.com, news.bbc.co.uk, www.youtube.com, www.nytimes.com, and www.washingtonpost.com.
2. The average score is 55.138 points.
3. The longest title has 51 words.
4. There are 566 unique sites.
5. Average title length is 10.308 words.
6. 365 sites have only one submission.
7. The most common uncommon word in the titles is iraq (34 repetitions).
raw output from python program

Scraping the all-time top submissions on programming.reddit.com/top:
1. Sites with most submissions are www.codinghorror.com, www.joelonsoftware.com, groups.google.com, xkcd.com, thedailywtf.com.
2. Average points are 221.961
3. Longest title is 47 words.
4. Average title length is 8.385
5. 559 sites have only one submission.
6. Total unique sites are 675
7. The most common uncommon word in the titles is programming (obviously) (58 repetitions).
8. Lisp is the most common language name in the titles, followed by Python.
9. Maximum points are 1609 by http://upload.wikimedia.org/wikipedia/commons/1/17/Metric_system.png
raw output from python program
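
For anyone curious how a "most common uncommon word" could be computed, here is a minimal sketch. It is not the original script: it is written for Python 3, the name most_common_uncommon_word and the COMMON_WORDS list are my own assumptions, and the real program may have filtered words differently. Titles are lowercased and split on whitespace, so tokens like [pic] and (pic) stay separate, as they do in the raw output.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only; the words the
# original script treated as "common" are not given in the post.
COMMON_WORDS = {
    "the", "to", "of", "a", "in", "and", "you", "for", "is", "on",
    "-", "this", "that", "it", "what", "i", "from", "with", "are",
    "not", "at", "your", "an", "how", "if", "by", "about", "as", "why",
}


def most_common_uncommon_word(titles, common_words=COMMON_WORDS):
    """Return (word, count) for the most frequent word that is not in
    common_words, ignoring case, or None if no such word exists."""
    counts = Counter(
        word
        for title in titles
        for word in title.lower().split()
        if word not in common_words
    )
    return counts.most_common(1)[0] if counts else None


if __name__ == "__main__":
    sample_titles = [
        "The history of Lisp",
        "Why Lisp is fun [pic]",
        "A look at Python",
    ]
    print(most_common_uncommon_word(sample_titles))
```

The result depends entirely on which words you decide are "common", so treat the list above as a starting point rather than a definition.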


The Python program can be found at paste.lisp.org. It needs BeautifulSoup to work, and it can run against any subreddit if you modify the base_url in the script. Running this script is a heavy resource drain on the reddit servers, so please do not abuse it. If you need the output files from these runs, just mail me and I will be happy to send them to you.
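
If you already have the scraped data, the statistics themselves are easy to reproduce. The sketch below is not the script from paste.lisp.org: it is a Python 3 approximation, it leaves out the scraping and BeautifulSoup parts entirely, and the summarize function and the (url, title, points) tuple layout are assumptions I made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize(submissions):
    """Compute summary statistics for scraped submissions.

    submissions is a list of (url, title, points) tuples. This layout is
    an assumption for the sketch, not the original script's data format.
    """
    if not submissions:
        raise ValueError("no submissions to summarize")

    # Count submissions per site, using the host part of each URL.
    sites = Counter(urlparse(url).netloc for url, _, _ in submissions)

    # Title length is measured in whitespace-separated words.
    title_lengths = [len(title.split()) for _, title, _ in submissions]
    points = [p for _, _, p in submissions]

    # The submission with the highest score.
    top_url, _, top_points = max(submissions, key=lambda s: s[2])

    return {
        "total": len(submissions),
        "unique_sites": len(sites),
        "top_sites": sites.most_common(20),
        "single_entry_sites": sum(1 for c in sites.values() if c == 1),
        "max_points": (top_points, top_url),
        "average_points": sum(points) / len(points),
        "average_title_length": sum(title_lengths) / len(title_lengths),
    }


if __name__ == "__main__":
    sample = [
        ("http://xkcd.com/1/", "A comic about Lisp", 10),
        ("http://xkcd.com/2/", "Another comic", 20),
        ("http://en.wikipedia.org/wiki/Python", "Python", 30),
    ]
    for key, value in summarize(sample).items():
        print(key, value)
```

Working from a saved list like this also means you only have to hit the reddit servers once, no matter how many times you rerun the numbers.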

****fun with reddit urls(base_url = http://reddit.com/top?offset=)****
total sites are 1000
total unique sites 637
top 20 sites are [(u'reddit.com', 24), (u'www.nytimes.com', 24), (u'www.flickr.com', 19), (u'www.youtube.com', 15), (u'www.washingtonpost.com', 14), (u'news.bbc.co.uk', 12), (u'news.yahoo.com', 12), (u'xkcd.com', 11), (u'en.wikipedia.org', 10), (u'www.guardian.co.uk', 10), (u'www.craigslist.org', 9), (u'consumerist.com', 7), (u'www.google.com', 7), (u'www.msnbc.msn.com', 7), (u'www.snopes.com', 7), (u'money.cnn.com', 6), (u'www.crooksandliars.com', 6), (u'www.dailymail.co.uk', 6), (u'community.livejournal.com', 5), (u'pressesc.com', 5)]
Sites with only one entry 516
maximum points are 1937 by http://reddit.com/info/1328g/comments
average points are 682.851
average title length 11.774
largest title has length 83 and is Barak Obama in 2002: "I know that even a successful war against Iraq will require a US occupation of undetermined length, at undetermined cost, with undetermined consequences. I know that an invasion of Iraq without a clear rationale and without strong international support will only fan the flames of the Middle East, and encourage the worst, rather than best, impulses of the Arab world, and strengthen the recruitment arm of al-Qaeda.
I am not opposed to all wars. I’m opposed to dumb wars."
50 most common words are [(u'the', 354), (u'to', 291), (u'of', 243), (u'a', 223), (u'in', 144), (u'and', 133), (u'The', 111), (u'you', 105), (u'for', 104), (u'is', 95), (u'on', 82), (u'-', 71), (u'I', 56), (u'that', 52), (u'with', 51), (u'from', 49), (u'it', 49), (u'A', 47), (u'are', 46), (u'at', 45), (u'this', 39), (u'What', 38), (u'by', 37), (u'not', 37), (u'an', 36), (u'How', 35), (u'You', 33), (u'about', 33), (u'as', 33), (u'your', 33), (u'This', 29), (u'his', 29), (u'[pic]', 27), (u'Bush', 26), (u'be', 26), (u'have', 26), (u'like', 26), (u'up', 26), (u'if', 25), (u'no', 25), (u'Why', 24), (u'can', 24), (u'do', 21), (u'they', 21), (u'what', 21), (u'US', 20), (u'get', 20), (u'or', 20), (u'we', 20), (u'Google', 19)]
50 most common words, ignoring case are [('the', 467), ('to', 307), ('a', 270), ('of', 252), ('in', 162), ('and', 143), ('you', 138), ('for', 117), ('is', 108), ('on', 92), ('-', 71), ('this', 71), ('that', 64), ('it', 61), ('what', 59), ('i', 58), ('from', 56), ('with', 56), ('[pic]', 54), ('are', 53), ('not', 50), ('at', 48), ('your', 48), ('an', 47), ('how', 47), ('if', 41), ('by', 40), ('about', 39), ('as', 36), ('can', 34), ('why', 34), ('no', 33), ('we', 33), ('have', 32), ('do', 31), ('his', 31), ('they', 31), ('(pic)', 30), ('like', 29), ('up', 28), ('bush', 27), ('one', 27), ('be', 26), ('who', 25), ('all', 23), ('it's', 23), ('so', 23), ('was', 23), ('when', 23), ('but', 22)]


****fun with reddit urls(http://reddit.com/?offset=)****
total sites are 1000
total unique sites 566
top 20 sites are [(u'news.yahoo.com', 22), (u'news.bbc.co.uk', 21), (u'www.youtube.com', 18), (u'www.nytimes.com', 16), (u'www.washingtonpost.com', 12), (u'www.wired.com', 11), (u'www.cnn.com', 9), (u'thinkprogress.org', 8), (u'www.guardian.co.uk', 8), (u'www.salon.com', 7), (u'blog.wired.com', 6), (u'www.chinapost.com.tw', 6), (u'www.dailymail.co.uk', 6), (u'www.myfoxdfw.com', 6), (u'www.opednews.com', 6), (u'www.reuters.com', 6), (u'www.telegraph.co.uk', 6), (u'www.timesonline.co.uk', 6), (u'apnews.myway.com', 5), (u'en.wikipedia.org', 5)]
Sites with only one entry 365
maximum points are 895
average points are 55.138
average title length 10.308
largest title has length 51 and is [Quote] A tyrant must put on the appearance of uncommon devotion to religion. Subjects are less apprehensive of illegal treatment from a ruler whom they consider god-fearing and pious. On the other hand, they do less easily move against him, believing that he has the gods on his side - Aristotle
50 most common words are [(u'the', 306), (u'of', 232), (u'to', 221), (u'a', 159), (u'in', 157), (u'and', 125), (u'The', 105), (u'for', 94), (u'-', 90), (u'on', 79), (u'is', 68), (u'with', 48), (u'by', 41), (u'that', 39), (u'A', 37), (u'Iraq', 37), (u'from', 34), (u'Bush', 33), (u'are', 31), (u'New', 30), (u'at', 29), (u'as', 28), (u'have', 26), (u'you', 26), (u'How', 25), (u'your', 25), (u'Of', 24), (u'US', 24), (u'about', 23), (u'In', 22), (u'not', 22), (u'For', 21), (u'I', 20), (u'To', 19), (u'be', 19), (u'this', 19), (u'Vietnam', 18), (u'an', 18), (u'they', 18), (u'American', 17), (u'no', 17), (u'U.S.', 16), (u'was', 16), (u'their', 15), (u'will', 15), (u'Is', 14), (u'What', 14), (u'Why', 14), (u'You', 14), (u'has', 14)]
50 most common words, ignoring case are [('the', 414), ('of', 257), ('to', 241), ('a', 196), ('in', 179), ('and', 132), ('for', 117), ('on', 94), ('-', 90), ('is', 83), ('with', 62), ('that', 46), ('by', 44), ('are', 40), ('new', 40), ('you', 40), ('not', 39), ('iraq', 37), ('your', 37), ('from', 35), ('at', 34), ('bush', 33), ('how', 33), ('as', 30), ('us', 30), ('about', 28), ('an', 28), ('have', 28), ('be', 25), ('do', 25), ('they', 25), ('no', 24), ('this', 24), ('war', 23), ('will', 23), ('it', 21), ('i', 20), ('my', 20), ('out', 20), ('what', 20), ('police', 19), ('has', 18), ('vietnam', 18), ('we', 18), ('why', 18), ('american', 17), ('if', 17), ('says', 17), ('their', 17), ('was', 17)]


****fun with reddit urls(base_url = http://programming.reddit.com/top?offset=)****
total sites are 1000
total unique sites 675
top 20 sites are [(u'www.codinghorror.com', 31), (u'www.joelonsoftware.com', 22), (u'groups.google.com', 17), (u'xkcd.com', 16), (u'thedailywtf.com', 12), (u'programming.reddit.com', 10), (u'worsethanfailure.com', 10), (u'paulgraham.com', 9), (u'blogs.msdn.com', 8), (u'blogs.sun.com', 8), (u'www.defmacro.org', 8), (u'arstechnica.com', 7), (u'en.wikipedia.org', 7), (u'kerneltrap.org', 7), (u'steve-yegge.blogspot.com', 7), (u'weblog.raganwald.com', 7), (u'codist.biit.com', 6), (u'scienceblogs.com', 6), (u'www.paulgraham.com', 6), (u'diveintomark.org', 5)]
Sites with only one entry 559
maximum points are 1609 by http://upload.wikimedia.org/wikipedia/commons/1/17/Metric_system.png
average points are 221.961
average title length 8.385
largest title has length 47 and is "The "you don't own your computer" paradigm is not merely wrong. It is violently, disastrously wrong, and the consequences of this error are likely to be felt for generations to come, unless steps are taken to prevent it." On the need for a Hippocratic Oath for programmers.
50 most common words are [(u'the', 186), (u'to', 168), (u'of', 159), (u'a', 148), (u'The', 137), (u'in', 103), (u'and', 89), (u'-', 79), (u'for', 77), (u'on', 71), (u'is', 66), (u'Why', 54), (u'you', 54), (u'I', 46), (u'How', 45), (u'A', 38), (u'Programming', 38), (u'with', 36), (u'your', 33), (u'Google', 32), (u'What', 30), (u'by', 30), (u'Lisp', 29), (u'about', 26), (u'from', 26), (u'Software', 25), (u'it', 25), (u'not', 25), (u'an', 24), (u'are', 24), (u'code', 22), (u'that', 22), (u'Python', 21), (u'do', 21), (u'Linux', 20), (u'be', 20), (u'programming', 20), (u'software', 20), (u'Web', 18), (u'To', 17), (u'at', 17), (u'this', 17), (u'Is', 16), (u'all', 16), (u'as', 16), (u'how', 16), (u'why', 15), (u'--', 14), (u'Microsoft', 14), (u'Ruby', 14)]
50 most common words, ignoring case are [('the', 323), ('a', 186), ('to', 186), ('of', 164), ('in', 110), ('and', 98), ('for', 83), ('is', 82), ('on', 80), ('-', 79), ('you', 71), ('why', 69), ('how', 61), ('programming', 58), ('i', 46), ('software', 45), ('your', 41), ('with', 40), ('what', 39), ('not', 35), ('code', 34), ('it', 34), ('lisp', 34), ('an', 33), ('about', 32), ('by', 32), ('google', 32), ('are', 30), ('from', 30), ('do', 29), ('web', 29), ('all', 25), ('be', 25), ('computer', 25), ('my', 25), ('this', 25), ('that', 24), ('one', 22), ('language', 21), ('linux', 21), ('python', 21), ('can', 20), ('at', 19), ('new', 19), ('things', 18), ('when', 18), ('as', 17), ('it's', 17), ('like', 17), ('programmers', 17)]
