Friday, May 30, 2008

One of the details of Yahoo! BrowserPlus that people picked up on was the fact that it only runs against Yahoo! properties.


However, some folks have hacked around that restriction so they can play with it locally, assuming that Yahoo! wouldn't like this.


Skylar Woodward of Yahoo! has posted that this isn't the case at all:



BrowserPlus was more-or-less designed to be hacked. Not hacked in the “I want to steal innocent users’ data and delete their files” sort of way, but in a manner that allows experimentation and freedom without compromising the security of pedestrian users. There’s more there to be mined, but enabling local development is a good place to start.


He goes on to show how you can get rid of the restriction:



Currently, BrowserPlus is restricted to Yahoo! sites; that includes restrictions for running local files. A simple addition to our test file exposes the error:



JAVASCRIPT:

else {greeting = "BrowserPlus is hiding. ("+res.verboseError+")";}

The error BP_EC_UNAPPROVED_DOMAIN confirms the local domain (file://) isn’t permitted. That means it’s time to dig into the BP configuration files. On Mac these are in


/Users/[you]/Library/Application Support/Yahoo!/BrowserPlus/


On Windows XP, you’ll find them in something akin to


c:\Documents and Settings\[you]\Local Settings\Application Data\Yahoo!\BrowserPlus\


and on Windows Vista…


c:\Users\[you]\AppData\Local\Yahoo!\BrowserPlus\


In the Permissions folder is a similarly named file, which is what we’re looking for. Opening it up, we see:



JAVASCRIPT:

"whitelist" : [
        "^http(s?)://(.*)\\.yahoo\\.com$",
        "^http(s?)://(.*)\\.yahoo\\.com:[0-9]+$"
    ],

The intuitive addition to this list is:



JAVASCRIPT:

"whitelist" : [
        "^http(s?)://(.*)\\.yahoo\\.com$",
        "^http(s?)://(.*)\\.yahoo\\.com:[0-9]+$",
        "^file://$"
    ],

The file is modified, but BrowserPlus hasn’t picked up the changes yet. The clean way to force this is to close all open browser windows. (BrowserPlus shuts down when no pages are using it.) The dirty way to do this is to search for BrowserPlusCore in your process list and kill it using your favorite platform-available tool. Either way, after opening test.html back up we should see our “Hello World.” Sweet - now we’re ready to start playing.


There is one final catch. BrowserPlus is fairly proactive about security so it helps to know that the permissions file will be overwritten on a regular basis. The savvy way around this would be a simple build script or at least a handy copy of our modified permissions file that we can use to reapply the changes in between development sessions. We might also test for BP_EC_UNAPPROVED_DOMAIN somewhere in our init callback to scream if the temporary development environment is disrupted.
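As a rough sketch of that last idea, assuming an init call and callback along the lines of the test page (the exact signature and the res.error / res.verboseError field names are placeholders, not the documented API):

// Hypothetical sketch: adjust the field names to whatever the real init callback provides.
BrowserPlus.init(function (res) {
  if (res.error == "BP_EC_UNAPPROVED_DOMAIN") {
    // The permissions file was refreshed; time to reapply the file:// whitelist entry.
    greeting = "Dev whitelist lost! (" + res.verboseError + ")";
  } else if (res.error) {
    greeting = "BrowserPlus is hiding. (" + res.verboseError + ")";
  } else {
    greeting = "Hello World";
  }
});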





Thursday, May 29, 2008

Udi on Google search quality

Google VP Udi Manber offers a high level description of what goes into Google's relevance rank in his recent post, "Introduction to Google Search Quality".

Some excerpts:
Ranking is hard ... We need to be able to understand all web pages, written by anyone, for any reason ... We also need to understand the queries people pose [and their needs], which are on average fewer than three words, and map them to our understanding of all documents ... And we have to do all of that in a few milliseconds.

PageRank is still in use today, but it is now a part of a much larger system. Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (it's not just the language, it's how people use it today), time models (some queries are best answered with a 30-minutes old page, and some are better answered with a page that stood the test of time), and personalized models (not all people want the same thing).

In 2007, we launched more than 450 new improvements, about 9 per week on the average. Some of these improvements are simple and obvious -- for example, we fixed the way Hebrew acronym queries are handled (in Hebrew an acronym is denoted by a (") next to the last character, so IBM will be IB"M), and some are very complicated -- for example, we made significant changes to the PageRank algorithm in January.
Please see also Barry Schwartz's post, "A Deeper Look At Google's Search Quality Efforts", which provides some additional commentary on Udi's post.

Please see also my earlier post, "The perils of tweaking Google by hand", which talks about whether these thousands of twiddles to the search engine and variations of them should be constantly tested rather than just evaluating one version of them at the time they are created.

Yahoo builds two petabyte PostgreSQL database

James Hamilton writes about Yahoo's "over 2 petabyte repository of user click stream and context data with an update rate for 24 billion events per day".

It apparently is built on top of a modified version of PostgreSQL and runs on about 1k machines. In his post, James speculates on the details of the internals. Very interesting.

Please see also Eric Lai's article in ComputerWorld, "Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest". On that, note that the Google Bigtable paper from 2006 says Bigtable handles "petabytes of data", so the Yahoo claim may depend on what you consider a database.

Calvin and Hobbes for May 29, 2008


Testing IE Versions Just Got a Little Easier

Testing your sites on different versions of Internet Explorer has always been notoriously difficult, mainly because Microsoft prevents you from running two different versions of the browser on Windows. Sure, there have been solutions to get around this limitation, but in my experience they've always caused unexpected results and instability for the operating system, or required you to run a VM. Not ideal.


Jean-Fabrice RABAUTE, the man behind the IE debugger DebugBar, has come up with a nice solution he's called IETester. This free tool allows you to have the rendering and JavaScript engines of IE8 beta 1, IE7, IE6, and IE5.5 on Vista and XP, as well as the installed IE, all in the same process.



You can check out IETester in action below:







ScreenCast IETester from WebInventif.fr on Vimeo.





Tuesday, May 27, 2008

Speed up access to your favorite frameworks via the AJAX Libraries API

Google engineers spend a lot of time working on speeding up their Web applications. Performance is a key factor for our teams, and we recognize how important it is for the entire Web.

When you take a look at the effort it takes to set up something that should be simple, such as caching shared JavaScript libraries, you quickly realize that the Web could be faster than it currently is.

The AJAX Libraries API is an attempt to make Web applications faster for developers in simple ways:
  • Developers won't have to worry about getting caching set up correctly, as we will do that for you
  • If another application uses the same library (much more likely), there is a much better chance that it will already be cached on the user's machine
  • The network and bandwidth of the user's system will not be taxed.

What exactly is the AJAX Libraries API?

We have worked with a subset of the most popular JavaScript frameworks to host their work on the Google infrastructure. The AJAX Libraries API then becomes a content distribution network and loading architecture for these libraries.

We realize that there are a huge number of useful libraries out there, but we wanted to start small with the program, beginning with a handful of the most popular libraries such as Prototype and jQuery (see the documentation for the full list).

We work with the key stakeholders for these libraries to make sure that the latest stable versions of their work get into our system as they are released. Once we host a release of a given library, we are committed to hosting that release indefinitely.

You can access the libraries in two ways, and either way we take the pain out of hosting the libraries, correctly setting cache headers, staying up to date with the most recent bug fixes, etc.

The first way to access the scripts is simply by using a standard <script src=".."> tag that points to the correct place.

For example, to load Prototype version 1.6.0.2 you would place the following in your HTML:

<script src="http://ajax.googleapis.com/ajax/libs/prototype/1.6.0.2/prototype.js"></script>

The second way to access the scripts is via the Google AJAX API Loader's google.load() method.

Here is an example using that technique to load and use jQuery for a simple search mashup:


<script src="http://www.google.com/jsapi"></script>
<script>
// Load jQuery
google.load("jquery", "1");

// on page load complete, fire off a jQuery json-p query
// against Google web search
google.setOnLoadCallback(function() {
$.getJSON("http://ajax.googleapis.com/ajax/services/search/web?q=google&;v=1.0&;callback=?",

// on search completion, process the results
function (data) {
if (data.responseDate.results &&
data.responseDate.results.length>0) {
renderResults(data.responseDate.results);
}
});
});
</script>


You will notice that the version used was just "1". This is a smart versioning feature that allows your application to specify a desired version with as much precision as it needs. By dropping version fields, you end up wild carding a field. For instance, consider a set of versions: 1.9.1, 1.8.4, 1.8.2.

Specifying a version of "1.8.2" will select the obvious version. This is because a fully specified version was used. Specifying a version of "1.8" would select version 1.8.4 since this is the highest versioned release in the 1.8 branch. For much the same reason, a request for "1" will end up loading version 1.9.1.

Note that these versioning semantics work the same way when using google.load and when using direct script URLs.
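For instance, with the hypothetical 1.9.1 / 1.8.4 / 1.8.2 release set described above, the three levels of precision resolve like this:

// Version selection with google.load, given releases 1.9.1, 1.8.4, and 1.8.2:
google.load("jquery", "1.8.2"); // fully specified: loads exactly 1.8.2
google.load("jquery", "1.8");   // highest release on the 1.8 branch: loads 1.8.4
google.load("jquery", "1");     // highest release on the 1 branch: loads 1.9.1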

By default, the JavaScript that gets sent back by the loader will be minified, if a minified version is available. Thus, for the example above we would return the minified version of jQuery. If you specifically want the raw JavaScript itself, you can add the "uncompressed" parameter like so:

google.load("jquery", "1.2", {uncompressed:true});

Today we are starting with the current versions of the libraries, but moving forward we will archive every version we host, so you can be sure it remains available.

For a full listing of the currently supported libraries, see the documentation.

We are really excited to offer something that we feel can truly help you out. Please give us feedback in our Google Group to let us know how the feature is working for you, and whether you have a craving for a particular library to be included.



Monday, May 26, 2008

Comic for 27 May 2008


Sunday, May 25, 2008

Comic for 26 May 2008


Twitter Pager Rotation.


It dawned on me that if I were working for Twitter, I would just assume the service is down unless told otherwise.


This led me to the conclusion that one should invert the monitoring: send off a notification when Twitter is online.


Seriously. I like those guys but this is getting kind of embarrassing.


As someone interested in distributed, scalable, and reliable web services, I think I might stop using it out of protest.


Things could be worse though - they could be using Hadoop! :-)


You can see a picture of Twitter’s main database server below:


[Image: Twitter’s main database server]




Friday, May 23, 2008

YouTube - 21 Accents


philgyford : YouTube - 21 Accents - An actress says more or less the same sentence in 21 different accents. Very good. I'd love to be able to do, well, any accent other than my own. (via Boing Boing)


Tags : accents amywalker top via:boingboing video youtube


The Breakdown of Modern Web Design

(via drawohara)


Monday, May 19, 2008

Comic for May 19, 2008

Shared by Madhu


Yet to meet a single person who hasn't been in a similar situation




Sunday, May 18, 2008

Fortune Cookies

'You will have hot, steamy, sweaty sex ... IN BED!'

Friday, May 16, 2008

Lots of Bits

In January of 2008 we announced that the Amazon Web Services now consume more bandwidth than the entire global network of Amazon.com retail sites does.



Amazon.com CEO Jeff Bezos has been showing a chart of the relative bandwidth usage and I just received permission to post it here:



[Chart: AWS bandwidth vs. the Amazon.com global retail sites]



Pretty cool, huh?



-- Jeff;




Pearls Before Swine

Shared by Madhu


Funny Comic

Today's Comic

Tuesday, May 13, 2008

Facebook Chat

One of the things I like most about working at Facebook is the ability to launch products that are (almost) immediately used by millions of people. Unlike a three-guys-in-a-garage startup, we don't have the luxury of scaling out infrastructure to keep pace with user growth; when your feature's userbase will go from 0 to 70 million practically overnight, scalability has to be baked in from the start. The project I'm currently working on, Facebook Chat, offered a nice set of software engineering challenges:

Real-time presence notification:


The most resource-intensive operation performed in a chat system is not sending messages. It is rather keeping each online user aware of the online-idle-offline states of their friends, so that conversations can begin.

The naive implementation of sending a notification to all friends whenever a user comes online or goes offline has a worst case cost of O(average friendlist size * peak users * churn rate) messages/second, where churn rate is the frequency with which users come online and go offline, in events/second. This is wildly inefficient to the point of being untenable, given that the average number of friends per user is measured in the hundreds, and the number of concurrent users during peak site usage is on the order of several millions.
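To put rough numbers on that worst case (these figures are illustrative, chosen only to match the "hundreds of friends" and "several million concurrent users" mentioned above; the per-user churn rate is an outright guess):

// Back-of-the-envelope illustration with made-up but plausible numbers:
var avgFriends = 300;       // average friend-list size ("measured in the hundreds")
var peakUsers  = 5000000;   // concurrent users at peak ("several millions")
var churnRate  = 0.001;     // assumed fraction of online users changing state per second
// Every online/offline transition fans out to all of that user's friends:
var notificationsPerSecond = avgFriends * peakUsers * churnRate;
// => 1,500,000 presence notifications per second, before a single chat message is sent.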

Surfacing connected users' idleness greatly enhances the chat user experience but further compounds the problem of keeping presence information up-to-date. Each Facebook Chat user now needs to be notified whenever one of his/her friends
(a) takes an action such as sending a chat message or loads a Facebook page (if tracking idleness via a last-active timestamp) or
(b) transitions between idleness states (if representing idleness as a state machine with states like "idle-for-1-minute", "idle-for-2-minutes", "idle-for-5-minutes", "idle-for-10-minutes", etc.).
Note that approach (a) turns sending a chat message or loading a Facebook page from a one-to-one communication into a multicast to all online friends, while approach (b) ensures that users who are neither chatting nor browsing Facebook are nonetheless generating server load.

Real-time messaging:


Another challenge is ensuring the timely delivery of the messages themselves. The method we chose to get text from one user to another involves loading an iframe on each Facebook page, and having that iframe's Javascript make an HTTP GET request over a persistent connection that doesn't return until the server has data for the client. The request gets reestablished if it's interrupted or times out. This isn't by any means a new technique: it's a variation of Comet, specifically XHR long polling, and/or BOSH.
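A bare-bones sketch of that long-polling loop on the client side might look like the following; the endpoint, response handling, and handleIncomingMessages are all invented for illustration, and Facebook's actual implementation lives inside the iframe described above:

// Minimal XHR long-polling sketch (illustrative only).
function handleIncomingMessages(text) {
  // In a real client this would parse the payload and update the chat UI.
}

function longPoll() {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/poll_for_messages", true);    // hypothetical endpoint
  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) return;             // not finished yet
    if (xhr.status === 200 && xhr.responseText) {
      handleIncomingMessages(xhr.responseText);   // the server had data for us
    }
    longPoll();  // reestablish the connection whether it returned data, timed out, or failed
  };
  xhr.send(null);
}

longPoll();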

Having a large number of long-running concurrent requests makes the Apache part of the standard LAMP stack a dubious implementation choice. Even without accounting for the sizeable overhead of spawning an OS process that, on average, twiddles its thumbs for a minute before reporting that no one has sent the user a message, the waiting time could be spent servicing 60-some requests for regular Facebook pages. The result of running out of Apache processes over the entire Facebook web tier is not pretty, nor is the dynamic configuration of the Apache process limits enjoyable.

Distribution, Isolation, and Failover:


Fault tolerance is a desirable characteristic of any big system: if an error happens, the system should try its best to recover without human intervention before giving up and informing the user. The results of inevitable programming bugs, hardware failures, et al., should be hidden from the user as much as possible and isolated from the rest of the system.

The way this is typically accomplished in a web application is by separating the model and the view: data is persisted in a database (perhaps with a separate in-memory cache), with each short-lived request retrieving only the parts relevant to that request. Because the data is persisted, a failed read request can be re-attempted. Cache misses and database failure can be detected by the non-database layers and either reported to the user or worked around using replication.

While this architecture works pretty well in general, it isn't as successful in a chat application due to the high volume of long-lived requests, the non-relational nature of the data involved, and the statefulness of each request.

For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users' conversations in-memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover. Why Erlang? In short, because the problem domain fits Erlang like a glove. Erlang is a functional concurrency-oriented language with extremely low-weight user-space "processes", share-nothing message-passing semantics, built-in distribution, and a "crash and recover" philosophy proven by two decades of deployment on large soft-realtime production systems.

Glueing with Thrift:


Despite those advantages, using Erlang for a component of Facebook Chat had a downside: that component needed to communicate with the other parts of the system. Glueing together PHP, Javascript, Erlang, and C++ is not a trivial matter. Fortunately, we have Thrift. Thrift translates a service description into the RPC glue code necessary for making cross-language calls (marshalling arguments and responses over the wire) and has templates for servers and clients. Since going open source a year ago (we had the gall to release it on April Fool's Day, 2007), the Thrift project has steadily grown and improved (with multiple iterations on the Erlang binding). Having Thrift available freed us to split up the problem of building a chat system and use the best available tool to approach each sub-problem.

Ramping up:


The secret for going from zero to seventy million users overnight is to avoid doing it all in one fell swoop. We chose to simulate the impact of many real users hitting many machines by means of a "dark launch" period in which Facebook pages would make connections to the chat servers, query for presence information and simulate message sends without a single UI element drawn on the page. With the "dark launch" bugs fixed, we hope that you enjoy Facebook Chat now that the UI lights have been turned on.
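For illustration only, a dark launch along those lines might boil down to something like this on each page view (every name below is invented; the real code obviously lives inside Facebook's pages):

// Illustrative dark-launch sketch; exercise the chat back end but never draw any UI.
function darkLaunchChat(friendIds) {
  var xhr = new XMLHttpRequest();
  // Query presence for the viewer's real friends to generate realistic back-end load...
  xhr.open("GET", "/chat/presence?ids=" + friendIds.join(","), true);
  xhr.onreadystatechange = function () {
    // ...and silently discard the results: no chat UI is rendered during the dark launch.
  };
  xhr.send(null);
}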

Eugene is a Facebook Engineer

Comic for 14 May 2008


05/12/08 PHD comic: 'Vicious Cycle'

Shared by Madhu


Story of my life














[Comic: "Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com - "Vicious Cycle", originally published 5/12/2008]







Friday, May 9, 2008

How Hash works with a block in Ruby

There are many different implementations of the Fibonacci series in Ruby. Here is one of them:





fibs = Hash.new do |hash, key|
  if key < 2
    hash[key] = 1
  else
    hash[key] = hash[key - 1] + hash[key - 2]
  end
end

(1..10).each { |i| puts fibs[i] }

# output
1
2
3
5
8
13
21
34
55
89




Ruby is full of surprises, and this is one of them. Look at the API for Hash: it mentions that Hash.new can accept a block, but it leaves out something very important, without which the code above is difficult to understand.




If a block is passed when the Hash is created, here is what happens: Ruby calls the block every time the hash is asked for a key it has no value for. If the hash already has a value for the key, the block is not invoked. When the block is invoked, Ruby passes it two parameters: the hash itself and the key for which no value was found.




With that in mind, it should be easy to understand how and why the code above works.


Thursday, May 8, 2008

High Performance Multithreaded Access to Amazon SimpleDB

[Image: SimpleDB/S3 query sample]
We have just released a new code sample.



Written in Java, this new sample shows how Amazon SimpleDB can be used as a repository for metadata which describes objects stored in Amazon S3. The code was written to illustrate best practices for indexing S3 data and for getting the best indexing and query performance from SimpleDB.



Indexing is implemented at two levels. At the first level, multiple threads (implemented using the Java Executor) are used to ensure that a number of S3 reads and a number of SimpleDB writes are taking place simultaneously. At the second level, Amazon SQS is used to coordinate index tasks running on multiple systems, leading to an even higher degree of concurrency.



Bulk queries are implemented using a pair of thread pools. The first pool runs SimpleDB queries and the second retrieves SimpleDB attributes. With the proper balance between the two pools, a Small Amazon EC2 instance was able to make over 300 requests per second.



Check it out!



-- Jeff;




Bad concurrency advice: interned Strings

I just read Thread Signaling from Jacob Jenkov. It is fine as far as it goes to introduce the reader to Object.wait() and Object.notify().

But it has one fatal flaw: it uses a literal java.lang.String for coordinating between threads. Why is this wrong?

Strings are interned by the compiler. To quote the javadocs: "All literal strings and string-valued constant expressions are interned." Using a literal string therefore means that any other code anywhere in the JVM, even in other libraries, that uses the same String literal value shares the same object for wait() and notify(). This code:

public void dastardly() {
    synchronized ("") {  // must hold the monitor before calling notify()
        "".notify();
    }
}

will wake up a thread waiting on the empty string, including one in utterly unrelated code.

Don't do that. Instead, create a fresh Object for coordinating threads (for example, a private final Object lock = new Object() field) and wait() and notify() on that. This age-worn advice for lock objects (synchronized (lock) { ... }) applies just as much to objects used to coordinate threads.


Comic for May 8, 2008



Wednesday, May 7, 2008

Crawling is harder than it looks

The best paper award at WWW 2008 went to a paper on large-scale crawling titled "IRLbot: Scaling to 6 Billion Pages and Beyond" (PDF) by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov.

I have never been that interested in crawling, so I almost missed this talk. But, I am glad I didn't.

The paper is a fascinating look at the difficulty of doing very large scale web crawling, focusing on avoiding web spam and pathological link structures that could trap a crawler, making sure to not overwhelm crawled webservers, and using high performance techniques for duplicate detection.

Some extended excerpts:
The web has changed significantly since the days of early crawlers, mostly in the area of dynamically generated pages and web spam. With server-side scripts that can create infinite loops, high-density link farms, and unlimited number of hostnames, the task of web crawling has changed from simply doing a BFS scan of the WWW to deciding in real time which sites contain useful information and giving them higher priority as the crawl progresses.

The first performance bottleneck we faced was caused by the complexity of verifying uniqueness of URLs and their compliance with robots.txt. As N scales into many billions ... [prior] algorithms ... no longer keep up with the rate at which new URLs are produced by our crawler (i.e., up to 184K per second) ... [A] new technique called Disk Repository with Update Management (DRUM) ... can store large volumes of arbitrary hashed data on disk and implement very fast check, update, and check+update operations using bucket sort ... DRUM can be thousands of times faster than prior disk-based methods.

In order to determine the legitimacy of a given domain, we use a very simple algorithm based on the number of incoming links from assets that spammers cannot grow to infinity. Our algorithm, which we call Spam Tracking and Avoidance through Reputation (STAR), dynamically allocates the budget of allowable pages for each domain and all of its subdomains in proportion to the number of in-degree links from other domains.

We ran IRLbot on a [single] quad-CPU AMD Opteron 2.6 GHz server (16 GB RAM, 24-disk RAID-5) attached to a 1-gb/s link ... [for a] total active crawling span of 41.27 days. During this time, IRLbot attempted 7,606,109,371 connections and received 7,437,281,300 valid HTTP replies ... IRLbot ended up with N = 6,380,051,942 responses with status code 200 and content-type text/html.

The average download rate during this crawl was 319 mb/s (1,789 pages/s) with the peak 10-minute average rate of 470 mb/s (3,134 pages/s). The crawler received 143 TB of data, out of which 254 GB were robots.txt files, and transmitted 1.8 TB of HTTP requests. The parser processed 161 TB of HTML code.
At very large scale and from a single box, those are pretty remarkable crawling rates, 2k pages/second. The details of the bottlenecks they encountered and how they overcame them made this paper a quite enjoyable read.
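As a toy illustration of the STAR budget rule quoted above (this is not the paper's code, just a sketch of the proportional-allocation idea using in-degree counts from other domains):

// Toy sketch: give each domain a page budget proportional to its in-degree
// from other domains, which spammers cannot inflate cheaply.
function allocateBudgets(inDegreeByDomain, totalPageBudget) {
  var domain, totalInDegree = 0, budgets = {};
  for (domain in inDegreeByDomain) {
    totalInDegree += inDegreeByDomain[domain];
  }
  for (domain in inDegreeByDomain) {
    budgets[domain] = Math.floor(totalPageBudget * inDegreeByDomain[domain] / totalInDegree);
  }
  return budgets;
}

// e.g. allocateBudgets({ "example.com": 120, "linkfarm.biz": 2 }, 10000)
//      => { "example.com": 9836, "linkfarm.biz": 163 }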

I did end with a question about part of the work. The authors say (in 7.1 and Figure 6) that the probability of finding a unique page never dropped below .11 and, earlier (in 1.4), say that "we believe a good fraction of the 35B URLs not crawled in this experiment [contain] useful content." However, they define a unique page by its URL, so it seems like it could easily be the case that those 35B URLs seen but not crawled have content that duplicates or nearly duplicates the 6B pages crawled.

Update: One topic the IRLbot paper ignored was how frequently we should recrawl a page. Another WWW 2008 paper, "Recrawl Scheduling Based on Information Longevity" (PDF) by Chris Olston and Sandeep Pandey, has a nice overview of that issue and then extends the prior work by focusing on the persistence of a web page. The authors end up arguing that pages that change very rapidly are not worth recrawling frequently, because it is impossible to ever get the index to match the actual content of the page.

Update: On the question of what content is useful, both in terms of crawling and recrawling, I always find myself wondering how much we should care about pages that never get visited by anyone. Shouldn't our focus in broadening a crawl beyond 6B pages and in recrawling pages in our index depend on how much users seem to be interested in the new content?

Amazon S3 Copy API Ready for Testing

[Image: Copying S3 objects]
A few weeks ago we asked our developer community for feedback on a proposed Copy feature for Amazon S3. The feedback was both voluminous and helpful to us as we finalized our plans and designed our implementation.



This feature is now available for beta use; you can find full documentation here (be sure to follow the links to the detailed information on the use of this feature via SOAP and REST). Copy requests are billed at the same rate as PUT requests: $.01 per 1,000 in the US, and $.012 per 1,000 in Europe.



In addition to the obvious use for this feature -- creating a new S3 object from an existing one -- you can also use it to rename an object within a bucket or to move an object to a new bucket. You can also update the metadata for an object by copying it to itself while supplying new metadata.
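As a rough illustration of that last use case, the REST form of a copy is a PUT that names its source in a header. The sketch below follows the S3 REST conventions (x-amz-copy-source, x-amz-metadata-directive) but uses made-up bucket, key, and metadata names and omits the date and signature details; check the documentation linked above for the authoritative request format:

PUT /report.txt HTTP/1.1
Host: my-bucket.s3.amazonaws.com
x-amz-copy-source: /my-bucket/report.txt
x-amz-metadata-directive: REPLACE
x-amz-meta-reviewed-by: jeff
Authorization: AWS [AccessKeyID]:[Signature]

Because the source and destination are the same object, this "copy onto itself" simply replaces the object's metadata in place.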



Still on the drawing board is support for copying between US and Europe, and a possible conditional copy feature. Both of these items surfaced as a result of developer feedback.



Tool and library support for this new feature is already starting to appear; read more about that in this discussion board thread.



-- Jeff;