Archive for January, 2009

The Bots are Back!

Saturday, January 10th, 2009

That was much quicker than I expected, mostly due to my very helpful friend Michael. Without getting all technically geeky on you, Michael showed me a very neat way to pull out the bot data efficiently. The bot analysis is now 100x faster than it used to be, and it uses less memory.

End result? The bot analytics are back and the Clig hardware lives to die some other day :) Thanks Michael!

Temporarily Disabling Bot Analysis

Friday, January 9th, 2009

Following yesterday’s post about Cligs slowness, it’s clear that trimming the latest clig hits count from 100 to 10 helped a lot. The service is still occasionally acting up, so now I’ve disabled the bot analytics.

This is temporary! The bot analytics are being re-built as part of the changes I described yesterday to make them less draining on the server. I’m hoping everything will be done by early next week.

Speaking of which, the backend changes are half done, so I’m slightly ahead of the schedule I set for them - yeah, that happens sometimes! Fingers crossed the second half works out well too!

Service Updates Coming Soon

Thursday, January 8th, 2009

Cligs has been a bit slower than usual which means that we’ve found the next growth bottleneck that needs fixing. I’d like to explain to you what’s going on and tell you how it’s going to be fixed.

The Problem

The biggest problem is the My Cligs page, specifically, the "Cligs Getting Last 100 Hits" feature. I had reports from someone whose My Cligs page takes a full 45 seconds to load, and that’s not that unusual. All that time, the Cligs server is building the page, which slows down the service. At peak hours when everyone is checking their stats, Cligs can get really slow.

The second biggest problem is the introduction of the new bot detection analytics. This feature makes Cligs much slower than it should be, and I was not able to foresee that before launch. Now that it’s launched and everyone is commenting on how cool it is (i.e. using it!), Cligs is starting to strain.

The Fixes

Firstly, as a temporary measure, I’ve tweaked the "Cligs Getting Last 100 Hits" feature to list only the last 10 cligs that got hits. This makes the page load much faster and doesn’t strain the server.

In the next few days, I’ll be tweaking the database and changing the way it stores data. This has two side effects:

  • It will make the analytics run much faster.
  • It will enable me to deploy new types of analytics and build new features that I’ve always wanted to add. One of those features will be unique to Cligs just like the Right Clig geotargeting.

When? The backend is being changed tonight and tomorrow, and I’m hoping to make the switch over the weekend quiet period just in case things go into digital lala land. The new features will be built once the changes are confirmed as working.

Analysis of Linking Patterns on Twitter: Cligs scores well!

Monday, January 5th, 2009

Summary

This post documents an analysis of 10.2 million tweets containing 2 million links. Those links were analyzed to understand which domains are most used. The analysis shows that:

  • tinyurl.com is the most used domain name;
  • cli.gs is the 18th most used domain name;
  • Of the top 50 domains used in links, 21 belong to URL shortening services (18 services once sister domains are grouped);
  • Of the URL shortening services, Cligs is 10th most used.

As the owner of a URL shortening service here on Cligs, I’m very interested in the Twitter market, hence this analysis. I’m sharing the results because they are very interesting in their own right.

Note: A list of references is found at the end of this post.

Introduction

On December 22, a massive Twitter data scrape was released, with the blessing of Twitter, no less. How massive? Let’s look at some of the numbers:

  • 10.2 million tweets…
  • …from 8 million users…
  • …containing 2 million links…
  • …and 219 thousand hashtags.

And a lot more.

Of particular interest is the link data, handily extracted into a separate file. The file contained data for 2071291 links. These approximately 2.1 million links were analyzed to extract the domain name from each link’s URL. For example, for the link URL http://cli.gs/abc123, the domain name is cli.gs.

The analysis counted the occurrences of each domain name in the 2.1 million links. The domain names were then sorted by their occurrence count in descending order. The top 50 domains were analyzed in detail.
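
The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the analysis: the function names and the sample links are made up, and `urlparse(...).hostname` lowercases the host, so a domain such as EzineArticles.com would be counted as ezinearticles.com. Like the table below, it treats www.flickr.com and flickr.com as separate domains.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the host part of a URL, e.g. 'cli.gs' for 'http://cli.gs/abc123'."""
    return urlparse(url).hostname or ""


def top_domains(urls, n=50):
    """Count how often each domain occurs and return the n most common."""
    counts = Counter(domain_of(u) for u in urls)
    counts.pop("", None)  # drop links whose host could not be parsed
    return counts.most_common(n)


# Hypothetical sample; the real input was a file of about 2.1 million links.
sample_links = [
    "http://tinyurl.com/a1",
    "http://tinyurl.com/b2",
    "http://cli.gs/abc123",
    "http://is.gd/xyz",
    "http://tinyurl.com/c3",
]

for domain, count in top_domains(sample_links, n=3):
    print(domain, count)
```

`Counter.most_common` already returns the domains sorted by count in descending order, which is the ordering used for the top 50 list.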

Top 50 Domains Used on Twitter

First the data, which you can download as a PDF file at the end:

Domain                Count    URL Shortener?  Notes
tinyurl.com           1048240  Yes
is.gd                 107093   Yes
twitpic.com           88871    No              Application to post photos to Twitter.
bit.ly                67515    Yes
ff.im                 40260    Yes             Automatic shortener for Friendfeed. Counted as a URL shortener because it doesn’t use the friendfeed.com domain name.
twurl.nl              37575    Yes
blip.fm               25658    No              Music app that posts to Twitter what you’re listening to.
bkite.com             24276    No              Location based social network.
snipurl.com           23562    Yes
ping.fm               17780    No              App to update social media sites.
snurl.com             12316    Yes
tr.im                 12154    Yes
snipr.com             11933    Yes
loopt.us              8649     No              Mobile social compass thing.
budurl.com            7076     Yes
www.flickr.com        6248     No              Image sharing site.
twitter.com           5963     No              Ummm…
cli.gs                4840     Yes             Woohoo!
www.nicovideo.jp      3716     No              Looks like a Japanese video site
be-a-magpie.com       3609     No              Ad network for Twitter
movapic.com           3574     No              Looks like a Japanese image site
tiny.cc               3296     Yes
hellotxt.com          3281     No              App to update social media sites.
aweber.com            3103     No              Marketing tools
raptr.com             3054     No              Social platform for people who like to play and discover games
tgr.me                2877     Yes             Automatic shortener for Twitter Groups.
zi.ma                 2877     Yes
flickr.com            2779     No              Image sharing site.
ad.vu                 2684     Yes             Adjix alternative.
twittgroups.com       2605     No              Groups for Twitter
mrtweet.net           2549     No              Social graph analysis.
EzineArticles.com     2522     No              Article directory
qik.com               2506     No              Mobile video sharing
www.myspace.com       2479     No              Social network
www.last.fm           2475     No              Music app that posts to Twitter what you’re listening to.
activerain.com        2266     No              Real estate network
adjix.com             2243     Yes
www.desktoptopia.com  2227     No              Desktop background manager
f.hatena.ne.jp        2113     No              Looks like a Japanese image site
poprl.com             2087     Yes
www.squidoo.com       2060     No              Creates topic-specific pages
piurl.com             1972     Yes
ow.ly                 1961     Yes             By Brightkit
www.ustream.tv        1864     No              Live streaming video
zz.gd                 1710     Yes
www.blogtv.com        1697     No              Video sharing site
www.youtube.com       1662     No              Video sharing site
xrl.us                1621     Yes
vimeo.com             1599     No              Video sharing site
d.hatena.ne.jp        1589     No              Looks like a Japanese image site

Analysis of Top URL Shorteners Used on Twitter

The top 18 URL shorteners accounted for 1395892 links, or 67.4% of all links. Some URL shortening services offer several alternative domains that their users can opt to use. If we group these sister domains and look only at the URL shorteners, the data looks as follows:

Domain(s)                          Grouped Count  As % of Shorteners
tinyurl.com                        1048240        75.09%
is.gd                              107093         7.67%
bit.ly                             67515          4.84%
snipurl.com, snurl.com, snipr.com  47811          3.43%
ff.im                              40260          2.88%
twurl.nl                           37575          2.69%
tr.im                              12154          0.87%
budurl.com                         7076           0.51%
ad.vu, adjix.com                   4927           0.35%
cli.gs                             4840           0.35%
tiny.cc                            3296           0.24%
tgr.me                             2877           0.21%
zi.ma                              2877           0.21%
poprl.com                          2087           0.15%
piurl.com                          1972           0.14%
ow.ly                              1961           0.14%
zz.gd                              1710           0.12%
xrl.us                             1621           0.12%

Or in a more graphical form:

[Chart: URL shorteners of Twitter]
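
For anyone who wants to check the grouped figures, here is a rough Python sketch of the grouping and percentage step. The counts are copied from the top 50 list and the 1395892 total is the one quoted above; the `SISTERS` mapping and the function names are illustrative choices, not part of the original analysis.

```python
from collections import defaultdict

# Counts copied from the top 50 list for a handful of shortener domains.
counts = {
    "tinyurl.com": 1048240,
    "snipurl.com": 23562,
    "snurl.com": 12316,
    "snipr.com": 11933,
    "ad.vu": 2684,
    "adjix.com": 2243,
    "cli.gs": 4840,
}

# Sister domains mapped onto one label per service (labels are illustrative).
SISTERS = {
    "snurl.com": "snipurl.com",
    "snipr.com": "snipurl.com",
    "adjix.com": "ad.vu",
}

# Links made through all 18 shortening services, as quoted in the post.
SHORTENER_TOTAL = 1395892


def group_counts(domain_counts, sisters):
    """Add up the counts of sister domains under a single service label."""
    grouped = defaultdict(int)
    for domain, n in domain_counts.items():
        grouped[sisters.get(domain, domain)] += n
    return dict(grouped)


def share(n, total=SHORTENER_TOTAL):
    """Return n as a percentage of total, rounded to two decimals."""
    return round(100 * n / total, 2)


grouped = group_counts(counts, SISTERS)
for service, n in sorted(grouped.items(), key=lambda item: item[1], reverse=True):
    print(service, n, f"{share(n)}%")
```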

Thoughts & Conclusions

  • We have an absolute classic of a long-tail data set: The top 50 domains accounted for 1628666 links (78.6%). This is a very narrow head and a very long tail of domain usage.
  • The tinyurl.com domain dominates, accounting for 75% of the links made through the top 18 URL shorteners on Twitter. Thoughts on how much valuable traffic Twitter is simply throwing away by not owning its own shortener, on a postcard please.
  • Cligs is the 10th most used on Twitter which is not bad for a one-man show in 3 months!
  • Look at ff.im. That’s Friendfeed’s posting to Twitter mechanism. That’s a lot of posting going on there!
  • Twitpic is much more popular than I thought it would be. It’s the top non-shortening service used in links, and number 3 overall in the link rankings. I already had great respect for the Twitpic guy (yes, it’s also a one-man show) and now we have numbers to show just how great the service is.
  • Sanity check: I watch Twitter with a very keen eye, and my gut feeling is that the usage data above actually feels right. Nothing in it is very surprising and a lot of the rankings can be explained fairly easily. It’s important in any data analysis to do this kind of sanity check and I urge you to look at the numbers again and see if they make sense to you too.

References

New Feature: Detecting Non-Human Hits

Friday, January 2nd, 2009

As of just now, Clig Details pages have a new feature: analytics showing you bot traffic.

What’s a bot? A bot/robot/crawler/spider is software that follows links. There are good bots, like GoogleBot which indexes the web to build Google’s search index, and there are bad bots that scrape content and do other nasties.

Since launching Cligs, the analytics showed you all traffic clumped together without breaking it down into humans clicking through versus bots automatically requesting the clig. Today’s update is a basic start to building more on this kind of breakdown and this type of analytics.

So what does today’s update do? Go to any Clig Details page (the bar graph icon) and under the Total Hits section you’ll see the breakdown; for a recent clig, this is what I see:

Screenshot of the Cligs Bots Analytics

As you can see, it’s a very simple breakdown: the total number of hits (as was shown previously) with two items underneath it showing how many of those hits came from bots and how many from humans.

So how does Cligs detect bots? Well, some of them are very easy to detect and some are not. Given the millions of clig requests Cligs has seen since launching, finding bots is easier than usual: there is a lot more data to mine for interesting traffic patterns.

The traffic analysis I did identified a lot of IP addresses that exhibit bot-like behavior. I manually checked the top 100 IP addresses (yes, manually) and confirmed they are bots. These IP addresses were then added to a special list that the Cligs analytics check to produce the breakdown.

This technique has one very important side effect: not all bots will be detected, so the number of bots Cligs gives you is a guaranteed minimum. The rest of the traffic may contain bot traffic that has not been detected. No technique can guarantee 100% separation of human and bot traffic, but we can have an honest crack at it.
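
To make the idea concrete, here is a small Python sketch of an IP-list breakdown. It is not the Cligs code: the hit format, the function names, and the sample addresses are all made up, and ranking IP addresses purely by hit volume is an assumption standing in for whatever "bot-like behavior" the real analysis looked at.

```python
from collections import Counter


def top_ips(hits, n=100):
    """Rank IP addresses by hit volume as candidates for manual review.

    Ranking by volume alone is an assumption made for this sketch.
    """
    counts = Counter(ip for ip, _clig in hits)
    return [ip for ip, _count in counts.most_common(n)]


def hit_breakdown(hits, known_bot_ips):
    """Split hits into bots and humans using a list of confirmed bot IPs.

    'bots' is a lower bound: a bot that is not on the list yet is counted
    under 'humans'.
    """
    total = len(hits)
    bots = sum(1 for ip, _clig in hits if ip in known_bot_ips)
    return {"total": total, "bots": bots, "humans": total - bots}


# Hypothetical hits as (ip_address, clig_id) pairs, using documentation IP ranges.
hits = [
    ("192.0.2.10", "abc123"),
    ("192.0.2.10", "abc123"),
    ("192.0.2.10", "xyz789"),
    ("198.51.100.7", "abc123"),
    ("203.0.113.5", "abc123"),
]

candidates = top_ips(hits, n=1)   # addresses to check by hand
confirmed_bots = set(candidates)  # pretend the manual check confirmed them
print(hit_breakdown(hits, confirmed_bots))
```

Keeping the confirmed addresses in a set makes each lookup cheap, and adding more addresses later only ever moves hits from the humans count to the bots count.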

With time, more IP addresses will be added to detect more bots. Also, Cligs will soon be able to show you which bots have generated traffic on your cligs. Eventually, I’d like to see how to best merge the Latest Search Engine Bot Sightings section with the detailed bot analytics as search engine bots are one type of bots. Ideas and thoughts welcome :)