Sunday, September 02, 2007

Parasite hosting - Or why social networking sites need to review user generated content.

To jump directly to the list of highly ranked spammy user pages on reddit, skip to the list further down in this post.
From Wikipedia: Parasite hosting is the practice of hosting a site on someone else's server without their consent, generally for search engine benefit.

One of the most competitive keywords, which everyone wants to rank for, is buy viagra online.

On that search and many similar ones, you will notice that user pages on social bookmarking sites like reddit get very prominent rankings.
These sites have very high TrustRank and amazing authority scores, so even off-topic pages created on these domains rank highly with little effort.
I was playing with the Site Explorer Python library I wrote, and noticed that user pages on the reddit domain outrank even the subreddit pages.
Using a little Python, I found that of the top 400 pages on the reddit.com domain, 102 are user pages of the form reddit.com/user/{username}/. All of these are pages promoting prescription pills like Viagra or Cialis.
(The Python program is listed below [1], along with the list of user pages found among the top 400 pages on reddit.)

How can the social bookmarking sites combat this?
1. Use a robots.txt. For reddit.com this can be as simple as:

User-Agent: *
Disallow: /user

2. Have kill words which are not allowed in user names. This lets the pages remain indexed without inviting parasite hosting.
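A kill-word check can be as simple as rejecting any sign-up whose username contains a blacklisted term. A minimal sketch (the word list and function name here are illustrative, not taken from any real site's code):

```python
# Reject usernames containing spammy "kill words" at sign-up time.
# This word list is illustrative; a real deployment would curate its own.
KILL_WORDS = ('viagra', 'cialis', 'levitra', 'tramadol', 'phentermine', 'xanax')

def is_allowed_username(username):
    """Return False if the username contains any kill word (case-insensitive)."""
    lowered = username.lower()
    return not any(word in lowered for word in KILL_WORDS)
```

Checking at sign-up time is cheap, and it blocks the spammy pages from ever being created rather than cleaning them up afterwards.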



import re

# Fetch the top 400 pages on reddit.com, 100 at a time.
all_pages = []
for i in range(4):
    start = i * 100 + 1
    results = get_page_data('YahooDemo', u'http://reddit.com', start=start, results=100)
    pages = [el['Url'] for el in results]
    all_pages.extend(pages)

# Pull the username out of every /user/... URL.
pat = '/user/([a-zA-Z0-9_]*)'
rep = re.compile(pat)
users = [rep.search(el) for el in all_pages]
users_ = [el.groups()[0] for el in users if el is not None]



List of the users on the reddit.com site. (These links are nofollowed).
[u'Buy_viagra_online_', u'BUY_VIAGRA_MEDS', u'CHEAP_VIAGRA_PRICE', u'ORDER_VIAGRA_NOW', u'tylerton', u'VIAGRA_ONLINE_CHEAP', u'Buy_viagra_online', u'DISCOUNT_VIAGRA_NOW', u'order_viagra_cheap', u'order_viagra_online', u'order_cialis_cheap', u'Viagra_', u'viagraagain', u'cialisagain', u'BUY_VIAGRA_ONLINEE', u'tramadolagain', u'phentermineagain', u'levitraagain', u'BUY_VIAGRA_ONLINE3', u'BUY_FLAGYL_ONLINE', u'CIALIS_LOWEST_PRICES', u'BUY_HOODIA_ONLINE', u'ORDER_VIAGRA_TODAY', u'Ephedra_Pills', u'Viagra_online', u'BUY_VIAGRA_ONLINE2', u'dans_movies', u'panda_movies', u'cialispills', u'BUY_VIAGRA1', u'Viagrapills_Online', u'phenterminepills', u'tramadolpills', u'free_porn_movies', u'phenterminepharm', u'Soma_Carisoprodol', u'cialispharm', u'viagraonline', u'Cheapest_Fioricet', u'Meridia_Diet_Pills', u'viagrapills', u'valiumpills', u'levitrapharm', u'cialis_buy', u'Buy_Percocet', u'viagrapharm', u'BUY_VIAGRA_MD', u'Tramadol_Hcl', u'Generic_Propecia', u'xanaxpill', u'BUY_LEVITRA_ONLINE1', u'tramadolpill', u'cialis_cheap', u'tramadolpharm', u'CIALIS_BEST_PRICES', u'ordertramadol', u'phenterminepill', u'order_levitra_med', u'orderxanax', u'cialis_online_drug', u'orderphentermine', u'generic_cialis_pill', u'ordercialis', u'CIALIS_ONLINE', u'BUY_VIAGRA_ONLINE1', u'online_buy_cialis', u'CHEAP_VIAGRA_PRICES', u'generic_levitra_pill', u'orderviagra', u'LEVITRA_SALE', u'CHEAP_VIAGRA_ONLINE', u'Order_Cialis_Online0', u'BUY_DISCOUNT_VIAGRA', u'order_cialis_online', u'BUY_VIAGRA_TODAY', u'VIAGRA_LOWEST_PRICE', u'CHEAP_VIAGRA_PILL', u'viagracialis', u'Suboxone', u'VIAGRA_BEST_PRICES', u'FDA_CIALIS_ONLINE', u'DISCOUNT_VIAGRA_A', u'cialistop', u'Buy_viagra_', u'phenermine', u'FDA_LEVITRA', u'phenterminetop', u'insura', u'viagrapharmacy', u'cialispharmacy', u'viagratop', u'levitrapharmacy', u'phenterminepharmacy', u'autoverzekering', u'discount_viagra', u'goba', u'levitratop', u'generic_cialis']
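A quick tally over that list shows which drugs dominate the spam usernames. A sketch using a small sample of the names above (the keyword list is mine; the full list behaves the same way):

```python
from collections import Counter

# A few usernames from the list above, as a sample.
sample = ['Buy_viagra_online_', 'BUY_VIAGRA_MEDS', 'order_cialis_cheap',
          'tramadolpills', 'phenterminepharm', 'CIALIS_ONLINE', 'viagratop']

DRUGS = ('viagra', 'cialis', 'tramadol', 'phentermine', 'levitra')

def drug_tally(usernames):
    """Count how many usernames mention each drug (case-insensitive)."""
    tally = Counter()
    for name in usernames:
        for drug in DRUGS:
            if drug in name.lower():
                tally[drug] += 1
    return tally
```

On the sample above, viagra leads with three mentions, followed by cialis with two.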

Getting started on SEO programming (using Python)

The Python code is here.
Usage instructions are here.

You own a website and want to keep track of its placement in search engines. You want to know who is linking to you and how many of your pages are indexed. You want to tell the search engines when you update your sitemap or your website.

The Site Explorer API from Yahoo makes this extremely convenient. And with Google discontinuing its SOAP Search API, it is the only feasible choice.
The Site Explorer API is a REST service: you construct a URL and make a request, from your browser, from your command line, or anywhere else, then parse the server's response to get the data in the format of your choice.
We will write a thin Python wrapper over this REST service so that we can construct our queries in Python.
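The wrapper only has to build the query URL and parse the JSON that comes back. A rough sketch of the URL construction (the endpoint path below matches the Site Explorer documentation of the time, but treat it as illustrative rather than authoritative):

```python
from urllib.parse import urlencode

# Base URL of the Yahoo Site Explorer REST service (as documented at the time).
BASE = 'http://search.yahooapis.com/SiteExplorerService/V1/'

def build_query_url(service, appid, query, start=1, results=50):
    """Build the REST URL for a Site Explorer call.

    output=json asks the service to return JSON instead of XML,
    which is what simplejson then parses on our side.
    """
    params = urlencode({'appid': appid, 'query': query,
                        'start': start, 'results': results,
                        'output': 'json'})
    return BASE + service + '?' + params
```

Everything else in the wrapper is fetching that URL and handing the body to simplejson.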

(To follow these examples, you need this Python code and the simplejson library.)

Some simple examples.
1. We want to get the top 1000 sites which link to reddit:

all_urls = []
for i in range(10):
    start = i * 100 + 1
    results = get_inlink_data('YahooDemo', u'http://reddit.com', start=start, results=100)
    urls = [el['Url'] for el in results]
    all_urls.extend(urls)


2. We want the 400 highest rated pages on reddit:

all_pages = []
for i in range(4):
    start = i * 100 + 1
    results = get_page_data('YahooDemo', u'http://reddit.com', start=start, results=100)
    pages = [el['Url'] for el in results]
    all_pages.extend(pages)


3. Google.com has updated its sitemap. We want to let Yahoo know about it:

do_ping(u'http://www.google.com/sitemap.xml')


4. I have updated SeoDummy. Let's tell Yahoo about that:

do_update_notification('YahooDemo', 'http://www.seodummy.blogspot.com/')


5. You can use these methods in conjunction to get more advanced functionality. For example, you can use get_inlink_data and get_page_data together to get a breakup of who links to each of your subpages.
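That combination could look roughly like this: fetch your top pages with get_page_data, call get_inlink_data once per page, and tally the results. The tallying helper below is my own sketch, written as a pure function so it works on any mapping of page to inlinks:

```python
from collections import Counter

def inlink_counts(inlinks_by_page):
    """Tally how many distinct sites link to each subpage.

    inlinks_by_page maps a page URL to the list of URLs linking to it,
    e.g. built by calling get_inlink_data() once for each page returned
    by get_page_data() (the wrapper methods described above).
    Returns a Counter, so .most_common() gives the most-linked pages first.
    """
    return Counter({page: len(set(links))
                    for page, links in inlinks_by_page.items()})
```

Sorting the result shows at a glance which subpages attract the most links.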
For examples of some cool SEO tools, you can go here.


You will need simplejson to use this library: the response from Yahoo comes back as JSON, and simplejson is needed to parse it.
There are four methods corresponding to the four Yahoo API calls. The arguments for each method are exactly the same as the required arguments of the REST API, excepting output and callback, which are never used.

get_inlink_data (inLinkData)
get_page_data (pageData)
do_ping (ping)
do_update_notification (updateNotification)