Thursday, January 31, 2008

PacificA: Replication in Log-Based Distributed Storage Systems

Large-scale distributed storage systems have gained popularity for storing and processing ever-increasing amounts of data. Replication mechanisms are often key to achieving high availability and high throughput in such systems. Research on fundamental problems such as consensus has laid out a solid foundation for replication protocols. Yet, both the architectural design and engineering issues of practical replication mechanisms remain an art. This paper describes our experience in designing and i...

D3S: Debugging Deployed Distributed Systems

Testing large-scale distributed systems is a challenge, because some errors manifest themselves only after a distributed sequence of events that involves machine and network failures. D3S is a checker that allows developers to specify predicates on distributed properties of a deployed system, and that checks these predicates while the system is running. When D3S finds a problem it produces the sequence of state changes that led to the problem, allowing developers to quickly find the root cause. ...

TripIt is awesome


You know what I really like? TripIt.com. It's amazingly simple. You take all those travel confirmation emails that you get from your travel agent, hotels, car rental agencies, etc., and you just forward them to plans@tripit.com. That's all you have to do. You don't have to sign up for an account. You don't have to log on. You just forward those emails. You can do it right now.


You get a link back by email, with a beautifully organized itinerary, showing all your travel data plus maps, weather reports, and all the confirmation numbers for your flights and address for your hotels and so on.


It's kind of magical. You don't have to fill out lots of little fields with all the details, because they've done a lot of work to parse those confirmation emails correctly... it worked flawlessly for my upcoming trip to Japan.


Think of it this way. Suppose you want to enter a round trip flight on your calendar. The minimum information you need to enter is probably:



  1. the airline

  2. the flight number

  3. four times (departure and arrival, there and back)

  4. four time zones (or else your phone will tell you that your flight is at 5pm when it's really at 2pm)

  5. a confirmation number (for when the airline denies that you exist)

  6. where you're going

All in all it takes a few minutes and is very error prone. Whereas, with TripIt, you just take that email from the airline or Orbitz, Ctrl+F, type plans@tripit.com, and send. Done.


TripIt is a beautiful example of the Figure It Out school of user interface design. Why should you need to register? TripIt figures out who you are based on your email address. Why should you parse the schedule data? Everyone gets email from the same 4 online travel agencies, 100-odd airlines, 15 hotel chains, 5 car rental chains... it's pretty easy to just write screen scrapers for each of those to parse out the necessary data.
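Presumably the dispatch looks something like this sketch (entirely hypothetical: the senders, field names, and regexes here are invented, and TripIt's real parsers are surely far more robust):

// Pick a scraper based on who sent the confirmation email.
var parsers = {
  'confirmation@orbitz.com': function (body) {
    return {
      airline: (body.match(/Airline:\s*(.+)/) || [])[1],
      flightNumber: (body.match(/Flight:\s*(\w+)/) || [])[1],
      confirmation: (body.match(/Confirmation:\s*(\w+)/) || [])[1]
    };
  }
  // ...one scraper per travel agency, airline, and hotel chain...
};

function parseConfirmation(email) {
  var parser = parsers[email.from];
  return parser ? parser(email.body) : null; // unknown sender
}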


Anyway, it's a shame I have to say this, but I have no connection whatsoever to tripit.com.




Monday, January 28, 2008

List Of Crazy Laws By State

Pick your state and see what crazy laws it has. My favorite from my home state of Colorado: "In Denver, Colorado it is illegal for Barber's to give massages to nude customers unless it is for instructional purposes." I guess just make sure you are giving instruction and everything will be A-OK.

Sunday, January 27, 2008

Bug

The universe started in 1970.  Anyone claiming to be over 38 is lying about their age.
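For the non-Unix crowd: time zero on the Unix clock is January 1, 1970, so a missing or zeroed-out timestamp decodes to that date. A sketch of how the bug presumably looks (the field names are made up):

// A birthdate that was never set is stored as 0 and decodes to the epoch.
var birthdateMillis = 0;                   // "unset" in the database
var birthdate = new Date(birthdateMillis); // Thu Jan 01 1970
var today = new Date(2008, 0, 27);         // the date of this post
var age = today.getFullYear() - birthdate.getFullYear();
// age === 38: every user with a missing birthdate is "38 years old".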

Friday, January 25, 2008

Multiple String Pattern Matching


The Mailinator guys blogged about how they were using a modified Aho-Corasick-style multiple-string pattern-matching algorithm to index 185 emails/s.


Aho-Corasick takes all the search strings and builds them into a trie so that it can scan the whole document for N strings in one pass.
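Here's a toy sketch of that in JavaScript (illustrative only, nothing to do with Mailinator's actual code): build the trie, wire up failure links with a breadth-first pass, then walk the text once.

// Build a trie of the patterns, then add failure links (the longest
// proper suffix of the current path that is also in the trie) via BFS.
function buildAhoCorasick(patterns) {
  var root = { next: {}, fail: null, out: [] };
  for (var i = 0; i < patterns.length; i++) {
    var p = patterns[i], node = root;
    for (var j = 0; j < p.length; j++) {
      var c = p.charAt(j);
      if (!node.next[c]) node.next[c] = { next: {}, fail: null, out: [] };
      node = node.next[c];
    }
    node.out.push(p);
  }
  var queue = [];
  for (var c in root.next) { root.next[c].fail = root; queue.push(root.next[c]); }
  while (queue.length) {
    var node2 = queue.shift();
    for (var d in node2.next) {
      var child = node2.next[d], f = node2.fail;
      while (f && !f.next[d]) f = f.fail;
      child.fail = (f && f.next[d]) || root;
      child.out = child.out.concat(child.fail.out); // inherit shorter matches
      queue.push(child);
    }
  }
  return root;
}

// One pass over the text: follow trie edges, falling back on mismatch.
function acSearch(text, root) {
  var node = root, hits = [];
  for (var i = 0; i < text.length; i++) {
    var c = text.charAt(i);
    while (node !== root && !node.next[c]) node = node.fail;
    node = node.next[c] || root;
    for (var j = 0; j < node.out.length; j++) {
      hits.push([i - node.out[j].length + 1, node.out[j]]);
    }
  }
  return hits;
}

// acSearch('ushers', buildAhoCorasick(['he', 'she', 'hers'])) finds
// 'she', 'he', and 'hers' in a single left-to-right scan.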


The problem is that Aho-Corasick doesn’t support more advanced tricks such as the character jumping made possible by Boyer-Moore.


Boyer-Moore says that it would be more intelligent to search from the end of a word. This way, when you find a mismatch, it can figure out a ‘jump’ and skip characters! This has the counterintuitive property of being able to index the document FASTER when the word you’re searching for gets longer.


Crazy, huh?
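Here's the flavor of it in JavaScript, using the simpler Horspool variant of Boyer-Moore (the full algorithm adds a second "good suffix" shift rule): the longer the pattern, the further each mismatch lets you jump.

// Compare from the end of the window; on a mismatch, jump by the skip
// value of the text character under the window's last position.
function horspoolSearch(text, pattern) {
  var m = pattern.length, n = text.length;
  var skip = {}; // last occurrence of each char in pattern[0..m-2]
  for (var i = 0; i < m - 1; i++) skip[pattern.charAt(i)] = m - 1 - i;
  var hits = [], pos = 0;
  while (pos <= n - m) {
    var j = m - 1;
    while (j >= 0 && text.charAt(pos + j) === pattern.charAt(j)) j--;
    if (j < 0) hits.push(pos);
    var c = text.charAt(pos + m - 1);
    pos += skip[c] || m; // chars not in the pattern allow a full m-char jump
  }
  return hits;
}

// horspoolSearch('a needle in a haystack', 'haystack') examines only a
// handful of characters before landing on the match at position 14.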


Now if only you could support Boyer-Moore jumping with Aho-Corasick style single pass indexing.


Turns out that’s already done.


Wu-Manber is another algorithm that combines the advantages of both: like Aho-Corasick it can index the document in a single pass, and like Boyer-Moore it can skip ahead using a jump table.
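A stripped-down sketch of the idea (assuming a block size of 2 and plain objects instead of the paper's hash tables, so very much a toy): blocks of characters index a shift table, and only a shift of zero forces an actual comparison.

// Precompute, per 2-char block, how far the window may safely jump.
function buildWuManber(patterns) {
  var B = 2, m = patterns[0].length;
  for (var i = 1; i < patterns.length; i++) m = Math.min(m, patterns[i].length);
  var wm = { B: B, m: m, def: m - B + 1, shift: {}, hash: {} };
  for (var i = 0; i < patterns.length; i++) {
    var p = patterns[i];
    for (var j = B - 1; j < m; j++) {
      var block = p.substr(j - B + 1, B), s = m - 1 - j;
      if (!(block in wm.shift) || s < wm.shift[block]) wm.shift[block] = s;
    }
    var tail = p.substr(m - B, B); // block that ends a candidate window
    (wm.hash[tail] = wm.hash[tail] || []).push(p);
  }
  return wm;
}

function wmSearch(text, wm) {
  var hits = [], pos = wm.m - 1; // index of the window's last character
  while (pos < text.length) {
    var block = text.substr(pos - wm.B + 1, wm.B);
    var s = (block in wm.shift) ? wm.shift[block] : wm.def;
    if (s > 0) { pos += s; continue; } // safe jump, no comparisons needed
    var cands = wm.hash[block] || [], start = pos - wm.m + 1;
    for (var i = 0; i < cands.length; i++) {
      if (text.substr(start, cands[i].length) === cands[i]) hits.push([start, cands[i]]);
    }
    pos += 1;
  }
  return hits;
}

// wmSearch('ahishers', buildWuManber(['his', 'hers'])) reports 'his' at 1
// and 'hers' at 4 while skipping several text positions outright.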


Spinn3r uses Aho-Corasick for one of our HTML indexing components. It turns out that Wu-Manber won’t give us much of a speed boost because HTML comments begin and end with only three characters, which caps the jump length. Migrating to Wu-Manber would be at most a 3x performance gain, and our Aho-Corasick code is proven and in production.


We’re also no longer CPU bound so this isn’t on my agenda to fix anytime soon.




Facebook releases JavaScript Client Library

Wei Zhu seems to be cooking with gas recently, and has released the JavaScript Client Library for the Facebook API, which is a client-side JavaScript library that mimics the other language client libraries (PHP, Python, Java, Ruby, etc.):



An application that uses this client library should be registered as an iframe type. This applies to either iframe Facebook apps that users access through the Facebook web site or apps that users access directly on the app’s own web sites.


The solution uses a cross-domain receiver:



HTML:





<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>cross domain receiver page</title>
</head>
<body style="background-color:Green;">
    <script src="http://static.ak.facebook.com/js/api_lib/XdCommReceiver.debug.js" type="text/javascript"></script>
    <script type="text/javascript">
        FB_ReceiverApp$main();
    </script>
</body>
</html>






Then, with a few script src's you can talk to Facebook:



JAVASCRIPT:





// Create an ApiClient object, passing the app's api key and
// a site-relative url to xd_receiver.htm
var api = new FB.ApiClient('<insert_your_app_key_here>', '/xd_receiver.htm', null);

// require user to login
api.requireLogin(function(exception) {
    window.alert("Current user id is " + api.get_session().uid);

    // Get friends list
    api.friends_get(function(result, exception) {
        Debug.dump(result, 'friendsResult from non-batch execution ');
    });
});






It is good to see a JavaScript API like this. Now you can stay in JavaScript land and write code that works with OpenSocial, Facebook, and more. NOTE: if your app lives in FBML, no cigar.





Tuesday, January 22, 2008

South India accounts for 40% of airline ticket sales on Internet

The Indian online travel market seems to be driven by the South, especially Bangalore. Around 40 per cent of airline ticket sales on the Internet come from the four main South markets of Bangalore, Chennai, Hyderabad and Kochi, reports the Times of India.






Monday, January 21, 2008

Data Portability, Authentication, and Authorization

The social web is booming, signing up new users and generating new pieces of unique content at a steady clip. A recurring theme of the social web is "data portability," the ability to change providers without leaving behind accumulated contacts and content. Most nodes of the social web agree data portability is a good thing, but the exact process of authentication, authorization, and transport of a given user and his or her data is still up in the air. In this post I will take a deeper look at the current best practices of the social web from the point of view of its major data hubs. We will take a detailed look at the right and wrong ways to request user data from social hubs large and small, and outline some action items for developers and business people interested in data portability and interoperability done right.




General issues



Friends, photographs, and other objects of meaning are essential parts of the social web. We're much more inclined to physically move from one city to the next if our friends, furniture, and clothes come along with us. The interconnectedness of the digitized social web makes the moving process much simpler: we can lift your friends from one location into another, clone your digital photographs, and match your blog or diary entries to the structure of your new social home. Each of these digital movers represents what we generally call "social network portability" or, more generically, "data portability."



Social networks accelerate interactions and your general sense of happiness in your new home through automated pieces of software designed to help you move data, or simply mine its content, from some of the most popular sites and services on the Web. These access paths are roughly equivalent to a new physical location setting up easy transit routes between some of the largest cities to help fuel new growth.



Facebook Friend Finder e-mail import


Your e-mail inbox is currently the most popular way to construct social context in an entirely new location. Sites such as Facebook request your login credentials for a large online hub such as Google, Yahoo!, or Microsoft to impersonate you on each network and read any data which may be relevant to the social network, such as a list of e-mail correspondents. Every day social network users hand over working user names and passwords for other websites and hope the new service does the right thing with such sensitive information. Trusted brands don't like external sites collecting sensitive login information from their users and want to prevent a repeat of the phishing scams faced by PayPal and others. There is a better way to request sensitive data on behalf of a user, limited to a specific task, and with established forms of trust and identity.





  1. Use the front door

  2. Identify yourself

  3. State your intentions

  4. Provide secure transport




Use the front door



Google, Yahoo!, and Microsoft all support web-based authentication by third parties requesting data on behalf of an active user. The Google Authentication Proxy interface (AuthSub), Yahoo! Browser-Based Authentication, and Microsoft's Windows Live ID Web Authentication issue a security token to third-party requesters once a user has approved data access. This token can allow one-time or repeated access and is the preferred method of interaction for today's large data hubs. The OAuth project is a similar concept to web-based third-party authentication systems of the large Internet portals, and may be a common form of third-party access in the future.
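As a concrete example, kicking off Google's AuthSub flow from a web page looks roughly like this (the next URL and scope here are placeholders; check the AuthSub docs for the exact parameters your service needs):

// Send the user to Google to approve access; this example asks for
// read access to the Contacts feed.
var authUrl = 'https://www.google.com/accounts/AuthSubRequest'
  + '?next=' + encodeURIComponent('http://example.com/authsub-return')
  + '&scope=' + encodeURIComponent('http://www.google.com/m8/feeds/')
  + '&session=1'  // ask for a token that can be upgraded for reuse
  + '&secure=0';  // 1 requires signed requests from registered apps
window.location = authUrl;

// After approval, Google redirects back to the next URL with a
// single-use token in the query string, which your server can
// exchange for a long-lived session token and store.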



Google Accounts Access example


Supporting websites provide limited account access to a registered entity after receiving authorization from a specific user. The user can typically view a list of previously authorized third parties and revoke access at any time. The third party retains access to a particular account even after the user changes his or her password.



Imagine if you could give your local grocery store access to just your kitchen, but not hand over the keys to your entire house. A delivery person would be automatically scanned upon arrival, compared against a registry, and granted access to the kitchen if you previously assigned them access. You could revoke their access to your kitchen at any time, but they would never have access to your jewelry box or the rest of your house.




Identify yourself



Third-party applications requesting access should first register with the target service for accurate identification and tracking. Applications receive an identification key for future communications, tied to a base set of permissions required to accomplish the task (e.g. read-only or read/write). A registered application can complete a few extra steps for added user trust and fewer user-facing warning messages.




State your intentions



Your application or web service should focus on a specific task such as retrieving a list of contacts from an online address book. Your authentication requests should specify this scope and required permissions (e.g. read only) when you request a user's permission to access his or her data.



Google services with Gmail highlighted


An application declaring scope lets users know you are only interested in a single scan of their e-mail and will not have access to their credit card preferences, stored home address, or the ability to send e-mails from their account. Not requesting full account access in the form of a username and password builds trust with both the user and the user's existing service(s).




Provide secure transport



How will you transport my user's data back to your servers? Did you bring an armored car with your company's logo prominently displayed on the side, or will my data sit in the back of your borrowed pick-up truck? Requesting applications should transport user data over secure communications channels to prevent eavesdropping and forged messages. Registered and verified secure communications will result in fewer user-facing warning messages of mistrust, and secure certificates are relatively inexpensive. Large portals such as Google or Microsoft will bump your communications (and privileges) up to mutual authentication if you are capable.



Twitter SSL certificate Firefox view


Register an SSL/TLS certificate for your website to enable secure transport and further identify yourself. Certificates vary in cost and complexity from a free self-signed cert to paid certificates from a major provider with extended validation and server-gated cryptography. Google and Yahoo! use 256-bit keys. Windows Live and Facebook use 128-bit keys.




Summary



Data authorization is the first step in data portability. Emerging standards such as OAuth, combined with established access methods from the Internet giants, provide specialized access for third parties acting on behalf of a user. Sites interested in importing data from other services should take note of these best practices and prepare their services for intelligent interchange.


How widgets go viral

Enterprise RSS has some observations on features that make widgets go viral, largely inspired by Facebook apps such as SuperPoke.



The parallel on the widget front is an "interaction loop": these are the hooks built into widgets that encourage interaction, by responding to content or sharing with others. The main difference between 'interaction' and 'viral' loops is that not all interactions are viral; some interactions simply benefit the user through personalization or community interaction. Widgets differ slightly from social applications in that the end goal isn't always to get the user to send to a friend; widgets are typically used to provide a service to the end user, such as presenting personalized information or content.


Using a NewsGator widget as an example, he identifies a number of "interaction loops", e.g.:



  • Email link - tell a friend

  • Ratings - the canonical 5-star rating

  • Get This link

  • Create your own widget link







Thursday, January 17, 2008

Upgrade your Google Analytics tracker



Google released a new version of its Google Analytics tracking code in December after a two-month limited beta. The new Google Analytics tracker is a complete rewrite of the JavaScript inherited from the Urchin acquisition in 2005, and the first time the two products have been officially decoupled. The existing tracker, urchin.js, has been deprecated but should continue to function until the end of 2008. Google will only roll out new features on the new ga.js tracker. If you currently track website statistics using Google Analytics you should upgrade your templates to take advantage of the new libraries.



What changed?



The new Google Analytics tracker supports proper JavaScript namespacing and more intuitive configuration methods (e.g. _setDomainName instead of _udn). My tests show about 100 ms faster execution even with a 24% (1514-byte) increase in file size (ga.js is also minified).



The new tracking code makes advanced features a lot more accessible. You can now track a page on multiple Google Analytics accounts, which should help user-generated-content sites integrate their authors' Google Analytics IDs alongside the company's own tracking account. The new event tracker lets you group a set of related on-page actions such as clicking a drop-down menu or typing a search query (very useful for widgets). Ecommerce tracking is now a lot more readable. You can read about all the tracker changes in the Google Analytics migration guide PDF.
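For example, a page can feed two accounts at once, and the event tracker groups related actions. A sketch (the account IDs are placeholders, and the _trackEvent call follows the documented ga.js API for a feature that is still in limited beta):

// Assumes ga.js is already loaded (see Implementation below).
var siteTracker = _gat._getTracker('UA-XXXXXX-1');   // company account
var authorTracker = _gat._getTracker('UA-YYYYYY-1'); // author's account
siteTracker._initData();
authorTracker._initData();
siteTracker._trackPageview();
authorTracker._trackPageview();

// Group related on-page actions: category, action, optional label.
siteTracker._trackEvent('Widget', 'SearchQuery', 'sidebar');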



Implementation



Switching your site tracker is pretty simple. Trackers are now created as objects and configured before the page is tracked.



<script type="text/javascript" src="http://www.google-analytics.com/ga.js"></script>
<script type="text/javascript">
var pageTracker = _gat._getTracker('UA-XXXXXX-X');
pageTracker._initData();
pageTracker._trackPageview();
</script>



That's it. You are now running the new Google Analytics tracker. You'll need to swap in your Analytics account and profile IDs, which should be pretty easy to spot in your existing code.



Summary



Google Analytics tracking code has been completely rewritten for faster on-page behavior that plays well with others. The old tracker is only supported until the end of 2008, and new features are only available to users running the new code. Existing Google Analytics users should swap out their tracking code to take full advantage of this free stats tool.


The Best Companion Tools for YouTube and other Web Videos




Sites like YouTube, MySpace and Google Video host millions of video clips that you can either watch online or embed in your web pages. And then there are “unofficial” tools to help you download YouTube movies to your hard drive.


You know them all, so let's look at a different set of YouTube tools that are incredibly useful and yet very simple.



» Delutube.com - This service is like Google Cache for YouTube.


If the owner has removed the video from YouTube servers, or the YouTube staff have deleted the video on their own for violation of policies, Delutube may help you watch the video as it could still be residing on one of the YouTube servers.


» Scenemaker.net - The full 90-minute keynote video of Steve Jobs is on YouTube, but you only want to embed the portion where he talks about the MacBook Air on your web page.


No problem. With YouTube Scenemaker, you can share only specific scenes of a YouTube video by defining the in and out points.


» Overstream.net - You saw an instructional video on YouTube that’s in French; you understand the language, but your blog readers don’t.


With Overstream Editor, you can easily add subtitles or closed captions to any YouTube video without having to download it. The video will still be streamed from YouTube but the text in the captions will appear via Overstream.


» TubeMogul.com - This is like Google Analytics for your YouTube account.


Key in your YouTube profile name and you can instantly see traffic across all videos that you have uploaded to YouTube. You can even schedule delivery of those reports via email. Amazing.


» BubblePly - While Overstream is for text captions, BubblePly goes a step further and lets you add images, animated clip art and even video clips over any YouTube video.


The best part is that you can convert all these subtitles and art objects into hyperlinks, so when the viewer clicks that area, he is transported to a particular website.


And do read our YouTube FLV Video Guide to learn about all the other interesting things that you can do with videos downloaded from YouTube.com.








Triggit: WYSIWYG Content Insertion Tool and Platform

Triggit has created a very interesting tool. The problem they are trying to solve is that many people want to muck around with their websites but don't want to grok HTML. They want to integrate with services (mash them up in a manual, one-off way), such as inserting their videos from YouTube, photos from Flickr, and other published items. One big one is being able to add advertising to the site.


As techies we often think that things are easy enough: "What? You just put in some embed code... how hard is that!" Triggit allows you to go meta and put in only one embed code, and then offers a toolbar for users to add content in WYSIWYG style.


How does this all work?


When you put in that script code on your site, it has a hook back to the Triggit platform. Say you want to add a photo from Flickr: in the tool you find the photo, place it on the screen, and hit save. The action is saved back to Triggit as 'add this photo to that location with this style'. Then when a visitor hits the site, the page is loaded, the JavaScript runs, and the action is sent down to the browser, where the DOM is changed to add the image. Zach said that one of the biggest challenges was getting this working across the various browsers, so that when you place an image somewhere using Firefox (the first browser supported on the editor side), it ends up in the right place no matter which browser the visitor uses.
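In other words, the visitor-side script boils down to something like this hypothetical sketch (the saved-action format, endpoint, and IDs are invented; Triggit's real implementation is certainly fancier):

// Replay saved placement actions against the visitor's DOM.
// Assume the embed script has delivered the publisher's saved edits:
var actions = [
  { targetId: 'sidebar', type: 'image',
    src: 'http://example.com/flickr-photo.jpg',
    css: 'width:200px;margin:8px;' }
];

function applyActions(list) {
  for (var i = 0; i < list.length; i++) {
    var a = list[i], target = document.getElementById(a.targetId);
    if (!target || a.type !== 'image') continue;
    var img = document.createElement('img');
    img.src = a.src;           // the photo chosen in the editor
    img.style.cssText = a.css; // saved position and styling
    target.appendChild(img);   // the page now matches the saved edit
  }
}
applyActions(actions);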


This means that you now have a logical page, but content is split between your real backend and the Triggit servers. The advantage of this method is that you can integrate with anything. You don't need special code that groks a particular backend service; it is all generic.


So, this is a little out there. You are balancing between making it incredibly easy to update pages and adding complexity by having content in separate places. The page could jump a little depending on the amount of information coming down, if Triggit gets dugg, or what have you. If it is successful though, you can see that, as developers, you could write plugins for the system and allow John Doe to easily tie in your content. That is the future promise.


Fancy giving it a go? Zach gave Ajaxian readers 300 invites (as the product IS in beta!). Head over to the signup page and use the code "ajaxian" and if you are in the first 300 you should be good to go. Oh, and for coolness factor, I believe that Rails, Erlang, and crazy JavaScript skills were used to make this happen.


I got to sit down with Zach Coelius, and he discusses the product and gives us a walkthrough:








I also have a short demo of it running on my own blog:








And, finally, they have their own screencast of the tool at work.


Press Release



Triggit, a San Francisco-based startup with the aim of making life a lot simpler for web publishers, today announced the beta launch of the first WYSIWYG web application for integrating third-party elements into websites.


With Triggit, any web publisher can now drag and drop advertisements, Flickr pictures, YouTube videos and more, directly into their site without any skills in web programming. Triggit is free to use, and works on any site that accepts JavaScript. It does not require any downloads, access to FTP, or APIs, and installs easily by pasting one small piece of code in the site.


“Triggit is here to help anyone who would like to take full advantage of the resources of the web for their site who isn’t a programmer and who doesn’t think in code,” says Susan Coelius-Keplinger, Triggit’s COO.


At a time when more and more non-technical publishers are coming online, Triggit is focused on removing the complexity of using code to add third-party objects to a page. Whether it is a widget, a video of a dog skateboarding, or a banner ad, current technology still requires the use of embedded code to integrate these elements into a website. For publishers accustomed to using graphical user interfaces for all their computing, it can be a daunting task to modify and integrate code into their websites. One area where this is a particular problem is online advertising networks, which continue to lose hundreds of millions of dollars of potential earnings annually to web publishers who can’t integrate and optimize their ad units.


Triggit’s goal is to serve as a feature-rich tool whereby publishers can quickly and easily integrate all manner of widgets, content, advertising units, APIs and data from third-party sites. In doing so, it operates as a distribution arm for companies seeking to spread the reach of their advertisements, widgets, content and data on the web. By making it easier for web publishers to integrate these objects into their websites, Triggit helps to expand the ability of these companies to reach larger online audiences and add new revenue streams. Ryan Tecco, Triggit’s CTO, says: “It is really early days for this technology. There are a lot of things waiting in the wings that we haven’t yet put into the tool. We are very excited to see where this goes.”





Wednesday, January 16, 2008

Announcing Starling

In various presentations throughout 2007, the Twitter team has made reference to a pure Ruby message queue server called Starling, written by our own Blaine Cook.

Starling is at the core of what we do at Twitter; it moves small messages around to daemons that work on jobs like processing updates, delivering messages, archiving user accounts, and so forth. An asynchronous messaging solution is becoming a necessity for big web applications, and Starling fits the particular needs we have at Twitter. It's fast, it's stable, it speaks the memcache protocol so it doesn't need a new client library, and it's disk-backed for persistence. When other parts of the Twitter site go down, Starling stays up. It's a champ, and we love it.
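Because it speaks the memcache protocol, enqueueing is just a set against a queue name and dequeueing is a get. Roughly, with an invented queue name and payload, a session looks like:

set jobs 0 0 15        (a client enqueues a 15-byte job on the "jobs" queue)
update_user_123
STORED
get jobs               (a worker pops the next job off the queue)
VALUE jobs 0 15
update_user_123
END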

Until now, Starling has lived a sheltered life in the Twitter code base. We're happy to announce that Starling is now open source and freely available for anyone to use, modify, and improve. We're eager to see patches and to start a proper open source community around Starling.

To give Starling a try today, just sudo gem install starling on your favorite Ruby development box. Let's see some serious queues!

Brian Klug of the PBwiki team wanted to learn more about

Brian Klug of the PBwiki team wanted to learn more about JavaScript serving, so the team created a JavaScript Library Test which tests the loading time of Dojo, jQuery, Prototype, YUI, and Protoculous.


The test compares packed vs. minified, gzipped vs. not, cached, etc., with some interesting results (hint: don’t use packed!). You can use your browser to help test, or see the combined results of thousands of testers.


JavaScript Library Results





Wednesday, January 9, 2008

NewsGator Ships a NewsMonster-style Distributed Reputation System


Looks like NewsGator might have just shipped a NewsMonster-style distributed reputation system.


From NewsGator’s announcement:


“It’s all about ubiquity,” said Greg Reinacker, NewsGator CTO and founder. “We have more than 100 Fortune 2000 companies using NewsGator Enterprise Server and our client products. In selling to these enterprises, we discovered that thousands of knowledge workers were already using one or more of our client products and we learned that we could drive the relevance of everyone’s experience by using the community’s anonymous content consumption patterns throughout the system. In general, we found that the more people that used our system, the more relevant we could make the product for each user. By making it easier for knowledge workers to use our clients we dramatically increase the size of our user community. Enterprises that then deploy our server can take advantage of the synchronization and increased relevance for every user supported by the system. Likewise, we can extend these capabilities to our online platform, which currently serves well over 1 million consumers and indexes 7 million new articles per day. The result is tremendous value and continued innovation for both consumer and enterprise users.”


From the NewsMonster documentation:


NewsMonster allows for the creation of a trust network of worthy bloggers which is managed by the user. NewsMonster then uses this network to build a popularity index of recent events and aggregated RSS content by reputation.


NewsMonster observes the decisions of users and publishes implicit certifications/reputes into the user’s online profile. These are then shared with other NewsMonster peers and aggregated to form a relevance network and collaborative filter. NewsMonster pays attention to what you are currently interested in and recommends articles for you based on what you should be interested in.


In layman’s terms: there is a lot of relevance information that can be pulled from your subscriptions and shared with your friends. It doesn’t make sense for two NewsMonster users to simultaneously scan through blogs, often duplicating effort, trying to find interesting content. Why not use each other to make finding news easier? Divide and conquer!


NewsMonster used a similar shared-profile mechanism to share these certifications with other NewsMonster users. We published the data onto archive.org servers (thanks, Brewster), and other NewsMonster clients then aggregated the data and computed local reputation with a trust metric I developed.


This was all before Spinn3r, Tailrank, Techmeme, Reddit, Digg, Blogrunner, etc.


Unfortunately, I was never able to finish up the full system. The NewsMonster assets were purchased by Rojo when we started the company. We then switched focus towards Rojo.com and I was never able to find the spare cycles to finish up NewsMonster.


Google Reader is pushing some of these ideas. Nice to see NewsGator moving in this direction as well.


I still think it’s a bit too early. I’m not even sure it’s a mainstream product just yet.


Selling into enterprises is a good idea though.