Sunday, September 02, 2007

Parasite hosting - Or why social networking sites need to review user generated content.

To see the list of highly ranked spammy user pages within reddit directly, jump to the list at the end of this post.
From Wikipedia: Parasite hosting is the process of hosting a site on someone else's server without their consent, generally for the purpose of search engine benefit.

One of the most competitive keywords that everyone wants to rank for is buy viagra online.

On that and many similar searches, you will notice that user pages of social bookmarking sites like reddit get very prominent rankings.
These sites have a very high TrustRank and amazing authority scores, so even off-topic pages created on these domains rank highly with ease.
I was playing with the sitexplorer Python library I wrote. The user pages on the reddit domain outrank even the subreddit pages.
Using a little Python, I found that of the top 400 pages on the reddit.com domain, 102 are user pages of the form reddit.com/user/{username}/. All of these are pages promoting prescription pills like Viagra or Cialis.
(The Python program [1] and the list of user pages in the top 400 pages on reddit are given below.)

How can the social bookmarking sites combat this?
1. Use a robots.txt. For reddit.com this can be as simple as using

User-Agent: *
Disallow: /user

2. Have kill words which are not allowed in user names. This lets the pages stay indexed without inviting parasite hosting (see the sketch below).
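
A minimal sketch of such a filter; the kill-word list here is illustrative, not exhaustive:

KILL_WORDS = ['viagra', 'cialis', 'levitra', 'tramadol', 'phentermine', 'xanax']

def is_spammy_username(username):
    # Reject sign-ups whose user name contains any kill word.
    name = username.lower()
    return any(word in name for word in KILL_WORDS)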



import re

# get_page_data comes from the Yahoo Site Explorer wrapper described in the
# post below; each call returns up to 100 results starting at 'start'.
all_pages = []
for i in range(4):
    start = i*100 + 1
    results = get_page_data('YahooDemo', u'http://reddit.com', start = start, results = 100)
    pages = [el['Url'] for el in results]
    all_pages.extend(pages)

# Pull the user name out of page URLs of the form reddit.com/user/{username}/.
pat = '/user/([a-zA-Z0-9_]*)'
rep = re.compile(pat)
users = [rep.search(el) for el in all_pages]
users_ = [el.groups()[0] for el in users if el is not None]



List of the users on the reddit.com site. (These links are nofollowed).
[u'Buy_viagra_online_', u'BUY_VIAGRA_MEDS', u'CHEAP_VIAGRA_PRICE', u'ORDER_VIAGRA_NOW', u'tylerton', u'VIAGRA_ONLINE_CHEAP', u'Buy_viagra_online', u'DISCOUNT_VIAGRA_NOW', u'order_viagra_cheap', u'order_viagra_online', u'order_cialis_cheap', u'Viagra_', u'viagraagain', u'cialisagain', u'BUY_VIAGRA_ONLINEE', u'tramadolagain', u'phentermineagain', u'levitraagain', u'BUY_VIAGRA_ONLINE3', u'BUY_FLAGYL_ONLINE', u'CIALIS_LOWEST_PRICES', u'BUY_HOODIA_ONLINE', u'ORDER_VIAGRA_TODAY', u'Ephedra_Pills', u'Viagra_online', u'BUY_VIAGRA_ONLINE2', u'dans_movies', u'panda_movies', u'cialispills', u'BUY_VIAGRA1', u'Viagrapills_Online', u'phenterminepills', u'tramadolpills', u'free_porn_movies', u'phenterminepharm', u'Soma_Carisoprodol', u'cialispharm', u'viagraonline', u'Cheapest_Fioricet', u'Meridia_Diet_Pills', u'viagrapills', u'valiumpills', u'levitrapharm', u'cialis_buy', u'Buy_Percocet', u'viagrapharm', u'BUY_VIAGRA_MD', u'Tramadol_Hcl', u'Generic_Propecia', u'xanaxpill', u'BUY_LEVITRA_ONLINE1', u'tramadolpill', u'cialis_cheap', u'tramadolpharm', u'CIALIS_BEST_PRICES', u'ordertramadol', u'phenterminepill', u'order_levitra_med', u'orderxanax', u'cialis_online_drug', u'orderphentermine', u'generic_cialis_pill', u'ordercialis', u'CIALIS_ONLINE', u'BUY_VIAGRA_ONLINE1', u'online_buy_cialis', u'CHEAP_VIAGRA_PRICES', u'generic_levitra_pill', u'orderviagra', u'LEVITRA_SALE', u'CHEAP_VIAGRA_ONLINE', u'Order_Cialis_Online0', u'BUY_DISCOUNT_VIAGRA', u'order_cialis_online', u'BUY_VIAGRA_TODAY', u'VIAGRA_LOWEST_PRICE', u'CHEAP_VIAGRA_PILL', u'viagracialis', u'Suboxone', u'VIAGRA_BEST_PRICES', u'FDA_CIALIS_ONLINE', u'DISCOUNT_VIAGRA_A', u'cialistop', u'Buy_viagra_', u'phenermine', u'FDA_LEVITRA', u'phenterminetop', u'insura', u'viagrapharmacy', u'cialispharmacy', u'viagratop', u'levitrapharmacy', u'phenterminepharmacy', u'autoverzekering', u'discount_viagra', u'goba', u'levitratop', u'generic_cialis']

Getting started on SEO programming (using Python)

The Python code is here.
Usage instructions are here.

You own a website and want to keep track of its placement in search engines. You want to know who is linking to you and how many of your pages are indexed. You want to tell the search engines when you update your sitemap or your website.

The Site Explorer API from Yahoo makes this extremely convenient. And with Google discontinuing its SOAP Search API, it is the only feasible choice.
The Site Explorer API is a REST service. You construct a URL and make a request, from your browser, from your command line, or anywhere else, and then parse the server's response to get the data in the format of your choice.
We will write a thin Python wrapper over this REST service so that we can construct our queries in Python.

(To follow these examples, you need this Python code and the simplejson library.)

Some simple examples.
1. We want to get the top 1000 sites which link to reddit

all_urls = []
for i in range(10):
    start = i*100 + 1
    results = get_inlink_data('YahooDemo', u'http://reddit.com', start = start, results = 100)
    urls = [el['Url'] for el in results]
    all_urls.extend(urls)


2. We want the 400 highest rated pages on reddit.

all_pages = []
for i in range(4):
    start = i*100 + 1
    results = get_page_data('YahooDemo', u'http://reddit.com', start = start, results = 100)
    pages = [el['Url'] for el in results]
    all_pages.extend(pages)


3. Google.com has updated its sitemap. We want to let Yahoo know about it.

do_ping(u'http://www.google.com/sitemap.xml')


4. I have updated SeoDummy. Let's tell Yahoo about that.

do_update_notification('YahooDemo', 'http://www.seodummy.blogspot.com/')


5. You can use these methods in conjunction to get some advanced functionality. For example, you can use get_inlink_data and get_page_data together to get a breakdown of who links to each of your subpages, as sketched below.
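
A rough sketch of that combination, assuming the same wrapper functions as above (the page and result limits here are arbitrary):

pages = [el['Url'] for el in get_page_data('YahooDemo', u'http://reddit.com', start = 1, results = 20)]
for page in pages:
    # Ask how many pages link to this particular subpage.
    inlinks = get_inlink_data('YahooDemo', page, start = 1, results = 100)
    print page, len(inlinks)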
For examples of some cool SEO tools, you can go here.


You will need simplejson to use this library. We get the response from Yahoo in JSON, and simplejson parses that.
There are four methods corresponding to the four Yahoo API calls. The arguments for each method are exactly the same as the required arguments of the REST API, except output and callback, which are never used.

get_inlink_data(inLinkData)
get_page_data(pageData)
do_ping(ping)
do_update_notification(update_notification)
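
To give an idea of how thin the wrapper is, here is a minimal sketch of how get_page_data can be written. The V1 pageData endpoint and the ResultSet/Result nesting in the JSON response are assumptions from Yahoo's documentation; the real library also handles errors:

import urllib
import simplejson

PAGE_DATA_URL = 'http://search.yahooapis.com/SiteExplorerService/V1/pageData'

def get_page_data(appid, query, start = 1, results = 50):
    # Build the REST URL, forcing the output format to JSON.
    params = urllib.urlencode({'appid': appid, 'query': query,
                               'start': start, 'results': results,
                               'output': 'json'})
    response = urllib.urlopen(PAGE_DATA_URL + '?' + params).read()
    # The result list sits under ResultSet -> Result in the JSON payload.
    return simplejson.loads(response)['ResultSet']['Result']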

Thursday, August 23, 2007

Python fun with reddit URLs

These days I spend a lot of time on reddit. So I got an itch to find out which sites are the most popular on reddit, what people comment on, and how many points a submitted URL gets on average. So I wrote a quick Python program (Description) to scrape reddit, and found the following:

Scraping the 1000 highest rated submissions at reddit.com/top:
1. The sites with the most entries are www.nytimes.com, reddit.com, www.flickr.com, www.youtube.com, and www.washingtonpost.com. Xkcd.com beats en.wikipedia.org, with 11 entries to wikipedia's 10.
2. http://reddit.com/goto?id=1328g got the maximum points ever, 1937. (Hint, hint)
3. The average points for the top 1000 submissions are 682.851.
4. The longest title has 83 words and says:
Barak Obama in 2002: "I know that even a successful war against Iraq will require a US occupation of undetermined length, at undetermined cost, with undetermined consequences. I know that an invasion of Iraq without a clear rationale and without strong international support will only fan the flames of the Middle East, and encourage the worst, rather than best, impulses of the Arab world, and strengthen the recruitment arm of al-Qaeda.
I am not opposed to all wars. I’m opposed to dumb wars."
5. There are 637 unique sites.
6. Average title length is 11.774 words.
7. 516 sites have only one submission.
8. The most common uncommon word in the titles is [pic] (54 repetitions).
raw output from python program

Scraping the top 1000 submissions on reddit.com's front page:
1. Sites with most submissions are news.yahoo.com, news.bbc.co.uk, www.youtube.com, www.nytimes.com, www.washingtonpost.com
2. Average points are 55.138
3. The longest title has 51 words.
4. Total unique sites 566
5. Average title length is 10.308 words.
6. 365 sites have only one submission.
7. The most common uncommon word in the titles is iraq (34 repetitions).
raw output from python program

Scraping the all-time top submissions on programming.reddit.com/top:
1. Sites with most submissions are www.codinghorror.com, www.joelonsoftware.com, groups.google.com, xkcd.com, thedailywtf.com.
2. Average points are 221.961
3. Longest title is 47 words.
4. Average title length is 8.385 words.
5. 559 sites have only one submission.
6. Total unique sites are 675
7. The most common uncommon word in the titles is programming (obviously; 58 repetitions).
8. Lisp is the most common language name in the title, followed by python.
9. Maximum points are 1609 by http://upload.wikimedia.org/wikipedia/commons/1/17/Metric_system.png
raw output from python program


The Python program can be found at paste.lisp.org. It needs BeautifulSoup to work. It can work on any subreddit if you modify the base_url in the script. Running this script is a heavy resource drain on the reddit servers, so please do not abuse it. If you need the output files, just mail me and I will be happy to send them to you.
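
For the curious, here is a rough sketch of the approach (not the original script); the link selector is an assumption about reddit's markup at the time:

import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

base_url = 'http://reddit.com/top?offset='
site_counts = {}
for offset in range(0, 1000, 25):
    soup = BeautifulSoup(urllib.urlopen(base_url + str(offset)).read())
    # Assumed markup: each submission title is a link with class 'title'.
    for link in soup.findAll('a', {'class': 'title'}):
        host = urlparse.urlparse(link['href'])[1]
        site_counts[host] = site_counts.get(host, 0) + 1

top_sites = sorted(site_counts.items(), key=lambda kv: kv[1], reverse=True)
print top_sites[:20]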

****fun with reddit urls(base_url = http://reddit.com/top?offset=)****
total sites are 1000
total unique sites 637
top 20 sites are [(u'reddit.com', 24), (u'www.nytimes.com', 24), (u'www.flickr.com', 19), (u'www.youtube.com', 15), (u'www.washingtonpost.com', 14), (u'news.bbc.co.uk', 12), (u'news.yahoo.com', 12), (u'xkcd.com', 11), (u'en.wikipedia.org', 10), (u'www.guardian.co.uk', 10), (u'www.craigslist.org', 9), (u'consumerist.com', 7), (u'www.google.com', 7), (u'www.msnbc.msn.com', 7), (u'www.snopes.com', 7), (u'money.cnn.com', 6), (u'www.crooksandliars.com', 6), (u'www.dailymail.co.uk', 6), (u'community.livejournal.com', 5), (u'pressesc.com', 5)]
Sites with only one entry 516
maximum points are 1937 by http://reddit.com/info/1328g/comments
average points are 682.851
average title length 11.774
largest title has length 83 and is Barak Obama in 2002: "I know that even a successful war against Iraq will require a US occupation of undetermined length, at undetermined cost, with undetermined consequences. I know that an invasion of Iraq without a clear rationale and without strong international support will only fan the flames of the Middle East, and encourage the worst, rather than best, impulses of the Arab world, and strengthen the recruitment arm of al-Qaeda.
I am not opposed to all wars. I’m opposed to dumb wars."
50 most common words are [(u'the', 354), (u'to', 291), (u'of', 243), (u'a', 223), (u'in', 144), (u'and', 133), (u'The', 111), (u'you', 105), (u'for', 104), (u'is', 95), (u'on', 82), (u'-', 71), (u'I', 56), (u'that', 52), (u'with', 51), (u'from', 49), (u'it', 49), (u'A', 47), (u'are', 46), (u'at', 45), (u'this', 39), (u'What', 38), (u'by', 37), (u'not', 37), (u'an', 36), (u'How', 35), (u'You', 33), (u'about', 33), (u'as', 33), (u'your', 33), (u'This', 29), (u'his', 29), (u'[pic]', 27), (u'Bush', 26), (u'be', 26), (u'have', 26), (u'like', 26), (u'up', 26), (u'if', 25), (u'no', 25), (u'Why', 24), (u'can', 24), (u'do', 21), (u'they', 21), (u'what', 21), (u'US', 20), (u'get', 20), (u'or', 20), (u'we', 20), (u'Google', 19)]
50 most common words, ignoring case are [('the', 467), ('to', 307), ('a', 270), ('of', 252), ('in', 162), ('and', 143), ('you', 138), ('for', 117), ('is', 108), ('on', 92), ('-', 71), ('this', 71), ('that', 64), ('it', 61), ('what', 59), ('i', 58), ('from', 56), ('with', 56), ('[pic]', 54), ('are', 53), ('not', 50), ('at', 48), ('your', 48), ('an', 47), ('how', 47), ('if', 41), ('by', 40), ('about', 39), ('as', 36), ('can', 34), ('why', 34), ('no', 33), ('we', 33), ('have', 32), ('do', 31), ('his', 31), ('they', 31), ('(pic)', 30), ('like', 29), ('up', 28), ('bush', 27), ('one', 27), ('be', 26), ('who', 25), ('all', 23), ('it's', 23), ('so', 23), ('was', 23), ('when', 23), ('but', 22)]


****fun with reddit urls(http://reddit.com/?offset=)****
total sites are 1000
total unique sites 566
top 20 sites are [(u'news.yahoo.com', 22), (u'news.bbc.co.uk', 21), (u'www.youtube.com', 18), (u'www.nytimes.com', 16), (u'www.washingtonpost.com', 12), (u'www.wired.com', 11), (u'www.cnn.com', 9), (u'thinkprogress.org', 8), (u'www.guardian.co.uk', 8), (u'www.salon.com', 7), (u'blog.wired.com', 6), (u'www.chinapost.com.tw', 6), (u'www.dailymail.co.uk', 6), (u'www.myfoxdfw.com', 6), (u'www.opednews.com', 6), (u'www.reuters.com', 6), (u'www.telegraph.co.uk', 6), (u'www.timesonline.co.uk', 6), (u'apnews.myway.com', 5), (u'en.wikipedia.org', 5)]
Sites with only one entry 365
maximum points are 895
average points are 55.138
average title length 10.308
largest title has length 51 and is [Quote] A tyrant must put on the appearance of uncommon devotion to religion. Subjects are less apprehensive of illegal treatment from a ruler whom they consider god-fearing and pious. On the other hand, they do less easily move against him, believing that he has the gods on his side - Aristotle
50 most common words are [(u'the', 306), (u'of', 232), (u'to', 221), (u'a', 159), (u'in', 157), (u'and', 125), (u'The', 105), (u'for', 94), (u'-', 90), (u'on', 79), (u'is', 68), (u'with', 48), (u'by', 41), (u'that', 39), (u'A', 37), (u'Iraq', 37), (u'from', 34), (u'Bush', 33), (u'are', 31), (u'New', 30), (u'at', 29), (u'as', 28), (u'have', 26), (u'you', 26), (u'How', 25), (u'your', 25), (u'Of', 24), (u'US', 24), (u'about', 23), (u'In', 22), (u'not', 22), (u'For', 21), (u'I', 20), (u'To', 19), (u'be', 19), (u'this', 19), (u'Vietnam', 18), (u'an', 18), (u'they', 18), (u'American', 17), (u'no', 17), (u'U.S.', 16), (u'was', 16), (u'their', 15), (u'will', 15), (u'Is', 14), (u'What', 14), (u'Why', 14), (u'You', 14), (u'has', 14)]
50 most common words, ignoring case are [('the', 414), ('of', 257), ('to', 241), ('a', 196), ('in', 179), ('and', 132), ('for', 117), ('on', 94), ('-', 90), ('is', 83), ('with', 62), ('that', 46), ('by', 44), ('are', 40), ('new', 40), ('you', 40), ('not', 39), ('iraq', 37), ('your', 37), ('from', 35), ('at', 34), ('bush', 33), ('how', 33), ('as', 30), ('us', 30), ('about', 28), ('an', 28), ('have', 28), ('be', 25), ('do', 25), ('they', 25), ('no', 24), ('this', 24), ('war', 23), ('will', 23), ('it', 21), ('i', 20), ('my', 20), ('out', 20), ('what', 20), ('police', 19), ('has', 18), ('vietnam', 18), ('we', 18), ('why', 18), ('american', 17), ('if', 17), ('says', 17), ('their', 17), ('was', 17)]


****fun with reddit urls(base_url = http://programming.reddit.com/top?offset=)****
total sites are 1000
total unique sites 675
top 20 sites are [(u'www.codinghorror.com', 31), (u'www.joelonsoftware.com', 22), (u'groups.google.com', 17), (u'xkcd.com', 16), (u'thedailywtf.com', 12), (u'programming.reddit.com', 10), (u'worsethanfailure.com', 10), (u'paulgraham.com', 9), (u'blogs.msdn.com', 8), (u'blogs.sun.com', 8), (u'www.defmacro.org', 8), (u'arstechnica.com', 7), (u'en.wikipedia.org', 7), (u'kerneltrap.org', 7), (u'steve-yegge.blogspot.com', 7), (u'weblog.raganwald.com', 7), (u'codist.biit.com', 6), (u'scienceblogs.com', 6), (u'www.paulgraham.com', 6), (u'diveintomark.org', 5)]
Sites with only one entry 559
maximum points are 1609 by http://upload.wikimedia.org/wikipedia/commons/1/17/Metric_system.png
average points are 221.961
average title length 8.385
largest title has length 47 and is "The "you don't own your computer" paradigm is not merely wrong. It is violently, disastrously wrong, and the consequences of this error are likely to be felt for generations to come, unless steps are taken to prevent it." On the need for a Hippocratic Oath for programmers.
50 most common words are [(u'the', 186), (u'to', 168), (u'of', 159), (u'a', 148), (u'The', 137), (u'in', 103), (u'and', 89), (u'-', 79), (u'for', 77), (u'on', 71), (u'is', 66), (u'Why', 54), (u'you', 54), (u'I', 46), (u'How', 45), (u'A', 38), (u'Programming', 38), (u'with', 36), (u'your', 33), (u'Google', 32), (u'What', 30), (u'by', 30), (u'Lisp', 29), (u'about', 26), (u'from', 26), (u'Software', 25), (u'it', 25), (u'not', 25), (u'an', 24), (u'are', 24), (u'code', 22), (u'that', 22), (u'Python', 21), (u'do', 21), (u'Linux', 20), (u'be', 20), (u'programming', 20), (u'software', 20), (u'Web', 18), (u'To', 17), (u'at', 17), (u'this', 17), (u'Is', 16), (u'all', 16), (u'as', 16), (u'how', 16), (u'why', 15), (u'--', 14), (u'Microsoft', 14), (u'Ruby', 14)]
50 most common words, ignoring case are [('the', 323), ('a', 186), ('to', 186), ('of', 164), ('in', 110), ('and', 98), ('for', 83), ('is', 82), ('on', 80), ('-', 79), ('you', 71), ('why', 69), ('how', 61), ('programming', 58), ('i', 46), ('software', 45), ('your', 41), ('with', 40), ('what', 39), ('not', 35), ('code', 34), ('it', 34), ('lisp', 34), ('an', 33), ('about', 32), ('by', 32), ('google', 32), ('are', 30), ('from', 30), ('do', 29), ('web', 29), ('all', 25), ('be', 25), ('computer', 25), ('my', 25), ('this', 25), ('that', 24), ('one', 22), ('language', 21), ('linux', 21), ('python', 21), ('can', 20), ('at', 19), ('new', 19), ('things', 18), ('when', 18), ('as', 17), ('it's', 17), ('like', 17), ('programmers', 17)]


Friday, January 19, 2007

The war for your search bar

(Welcome reddit users)
If you are anything like me, you probably have the Google Toolbar installed in your primary browser. It does many things, but foremost, it lets you search Google without going to google.com.
But there is a search bar built right into your browser. It sits right next to the address bar.

Now you might think that no one would care about such a puny, teeny-weeny search bar. And sir, could you be more wrong?

It all started when I wanted to install Picasa; this is what I got in the last step of the installation.

Now Picasa is image-management software. Why should it try to reset my search preferences? Oh, and by the way, the default option is to switch the default search engine, not to retain your preferences.

Bad, bad Google. Stealing my search bar! Surely Yahoo would not do anything like that. Let's install the Yahoo search bar.

Aw! Not so fast, Yahoo baby. Cap'n Google won't let you change the default option.

Well then, let's try the MSN toolbar.

So does opening Gmail change search preferences too? Looks like it does not. Thank God for small mercies.

Let's see what happens if I manually change the search settings.

Do they do this with Firefox too?

Looks like they do.

When you install a toolbar, aren't you already reserving a part of your screen real estate for that search engine? Shouldn't the toolbar then offer to leave your search bar alone, instead of trying to capture it?

Thursday, January 18, 2007

ACAPTCHA - Almost Completely Automated Public Turing test to tell Computers and Humans Apart

(Welcome reddit users)

Captchas generally (but not always) solve the problem of comment and other spam. But this comes at a price. Users with low vision and other disabilities find solving captchas hard, and blind users can't solve them at all unless you provide an alternative audio captcha. Why, even Seth hates it!
Negative captchas, where you hide form fields via CSS so a human can't see them and hence won't fill them in, while bots will, are an interesting possibility. But let me introduce ACAPTCHA, the "Almost Completely Automated Public Turing test to tell Computers and Humans Apart". This is what you do.

1. There are some questions which are very easy for humans to answer but very difficult for bots to understand. Take "What color is a blue towel?" or "Is a green towel red?". Any (well, most) humans can answer such a question in a snap, but probably no bot can.
2. Create a centralized AND rapidly changing repository of such questions. Maybe allow users to submit new questions and answers. Maybe peer review questions before accepting them. Whatever you do, get a large and fast-changing repository.
3. Create a plugin/architecture where you fetch a random question from the repository (à la Akismet, which is a distributed anti-spam engine) and ask users to solve it.
There are already some sites which try to do something similar: they ask a question like "What is 2 + 2?". The problem is that this is probably very easy to break. As soon as it becomes mainstream, you can be sure that the bots will break through and abuse it. To beat completely automated systems, you need to bring in human intelligence.
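
A minimal sketch of that flow in Python; the repository endpoint and its JSON fields are hypothetical, since no such service exists yet:

import urllib
import simplejson

# Hypothetical repository API; nothing like this exists yet.
REPO_API = 'http://acaptcha.example.com/api/random_question'

def fetch_challenge():
    # Get a fresh question/answer pair from the central repository.
    data = simplejson.loads(urllib.urlopen(REPO_API).read())
    return data['question'], data['answer']

def check_answer(expected, submitted):
    # Accept case-insensitive, whitespace-trimmed matches.
    return submitted.strip().lower() == expected.strip().lower()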

Updates -
Foo asked: "The repo would have to include the *answers* and be as easily downloadable, right? Right. So Mr. Spammer wins again."
And I say: Well, no. The idea is that the central repository has, say, a million questions and answers, and whenever a site wants to run an ACaptcha check, it asks for a question-answer pair (using an API). Now no one except the repository has all the questions, and each time the spammers get a new question. This is why you need the repository to gain new questions quickly, so that spammers cannot build up a bank of questions over time and learn their answers.