Thursday, June 12, 2008

Thrift: (slightly more than) one year later

A little over a year ago, Facebook released Thrift as open source software. (See the original announcement.) Thrift is a lightweight software framework for enabling communication between programs written in different programming languages, running on different computers, or both. We decided to release it because we thought it could help other groups solve some of the same technical problems we have faced, and because we hoped that developers who found it useful would contribute improvements back to the project.
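
To make the idea concrete, here is a minimal, hedged sketch of what calling a Thrift service from Java looks like. The Ping service is hypothetical: it stands in for code the Thrift compiler would generate from an interface definition, and the host, port, and package names are illustrative (older Facebook releases used com.facebook.thrift packages; the Apache tree uses org.apache.thrift).

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class PingClient {
    public static void main(String[] args) throws Exception {
        // Ping.Client is a hypothetical stub generated by the Thrift compiler
        // from an IDL such as: service Ping { string ping(1: string msg) }
        TTransport transport = new TSocket("localhost", 9090);
        transport.open();
        TProtocol protocol = new TBinaryProtocol(transport);
        Ping.Client client = new Ping.Client(protocol);
        System.out.println(client.ping("hello"));   // a remote call that reads like a local one
        transport.close();
    }
}

The generated stub hides the serialization and transport details, which is what lets the same service definition be consumed from C++, Python, PHP, and the other supported languages.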

A lot has happened in the year since we released Thrift. First and foremost, Thrift has gained a lot of cool features:
  • Support for C#, Perl, Objective C, Erlang, Smalltalk, OCaml, and Haskell.
  • More idiomatic style in Java and Ruby.
  • Two new protocols: one for dense encoding and one using JSON syntax.
  • Significant speed boosts in Python (via a C module) and in PHP (via an extension).
As we had hoped, a lot of these features were developed not inside Facebook, but by other companies and individuals using Thrift and contributing their changes back to the project. Thanks so much to our biggest corporate contributors: Powerset, imeem, Evernote, and Amie Street, as well as our many individual contributors.

I'm even more excited about some of the stuff being worked on now.
  • Here at Facebook, we're working on a fully asynchronous client and server for C++. This server uses event-driven I/O like the current TNonblockingServer, but its interface to the application code is all based on asynchronous callbacks. This will allow us to write servers that can service thousands of simultaneous requests (each of which requires making calls to other Thrift or Memcache servers) with only a few threads.
  • Powerset's Chad Walters and I are working on templatizing the C++ API. This will be an almost entirely backward-compatible change that will preserve all the flexibility of the Thrift API, but it will allow performance-conscious developers to add type-annotations that will improve serialization and deserialization speed by about 5x and 3x respectively.
  • Thrift's Ruby mapping, which never got much attention here at Facebook, has had a surge of popularity amongst our external contributors. It's getting some much-needed attention from Powerset's Kevin Clark and our newest corporate contributor: RapLeaf. They've already got an accelerator extension in testing (which works like the existing Python and PHP accelerators) and are working on some serious style overhauls. At least, that's what they tell me. I don't know Ruby, so I mostly leave them alone. :)
  • Ross McFarland has been working on a C mapping for Thrift using glib. A C mapping has been one of our oft-requested features, so it's great to see this finally taking shape.
  • There are a host of other features that are "on the back burner" for now, but which I expect to be incorporated eventually. These include patches that we received for an asynchronous Perl client, an SSL transport for C++ (based on GNU TLS), and a more robust file-based transport.
I think the most exciting thing going on with Thrift right now is our acceptance into the Apache Incubator. For those who are not familiar with it, the Apache Incubator is a program for open source projects to integrate themselves into the Apache Software Foundation in preparation for becoming a real Apache project (or sub-project). This is a great opportunity for Thrift. Becoming an Apache project will get us a lot of attention and brand-recognition. We are hoping it will pave the way for getting Thrift integrated into a lot of other open source projects, especially those that are already Apache projects themselves. Apache is quickly becoming the de facto home for large-scale open source server software, so we think this could be the perfect long-term home for Thrift.

If you are interested in Thrift, the best documentation is still the original whitepaper. You can also check out Thrift's new homepage. Most of the interesting updates are on the Thrift mailing lists (subscription info on the homepage). Thrift's Subversion repository has just moved into the Apache Incubator repository. Information on accessing it is available on the Thrift homepage. A lot of experimental development is published in the unofficial Git repository.

Wednesday, June 11, 2008

That’s probably why they call it that.




10 Worst Woman-Bashing Ads

It all started with domestic ads in the 1950s. “Honey, I want to get your shirts whiter! But how do I do it?” It was a serious quandary women appeared to face at the time.


These days, there’s no question those old ads are sexist. But woman-bashing is far from over. It has only taken on a different, more sexual guise. According to these ads, women are built for male satisfaction. If problems arise–too much talking, disobedience, ugliness–simply deflate them, drink a lot of beer, or appease them with diamond jewelry.


These ads say it all. Here are ten of the industry’s finest:


10. This ad depicts a plain-looking woman getting more attractive with each sip of a beer. When the beer runs out, she’s–gasp–plain again.




9. “Made by hand?” Wishful thinking.




8. They mean kicking from the inside, right? Not her abusive boyfriend kicking her belly from the outside?




7. According to this ad, it’s easy to buy her good behavior:




6. Do not get married. Your wife will become a hideous grandmother with pinned-up hair. Your mistress, however, will always remain hot and available.




5. England’s Sun newspaper ran this billboard ad on the sides of buses. She looks like she’s for sale—and not for very much.




4. Notice the double meaning of the word. American Apparel’s entire ad campaign is based on borderline ads like these.




3. This Australian commercial has been dubbed the “Smartest Man in the World” commercial. Moral of the story: women always trap you into large, messy family situations, and they never use birth control.




2. This Czech ad starts with a couple on the beach. The woman is complaining about something. The man, thirsty and tired of her yapping, deflates her and goes to have a beer with his buddies.




1. This notorious German Heineken commercial created the ultimate Fembot. She’s a hot, roboticized, self-cloning Christina Aguilera lookalike who serves beer out of a keg in her uterus. They call her “Minnie Draughter,” and she’s the ideal beer wench-cum-hot dancing chick.






Tuesday, June 10, 2008


LinkedIn Architecture

Tag: Scalability | Oren Hurvitz @ 12:20 am

At JavaOne 2008, LinkedIn employees presented two sessions about the LinkedIn architecture. The slides are available online.

The slides are hosted at SlideShare; if you register, you can download them as PDFs.

This post summarizes the key parts of the LinkedIn architecture. It’s based on the presentations above, and on additional comments made during the presentation at JavaOne.

Site Statistics

  • 22 million members
  • 4+ million unique visitors/month
  • 40 million page views/day
  • 2 million searches/day
  • 250K invitations sent/day
  • 1 million answers posted
  • 2 million email messages/day

Software

  • Solaris (running on Sun x86 platform and Sparc)
  • Tomcat and Jetty as application servers
  • Oracle and MySQL as DBs
  • No ORM (such as Hibernate); they use straight JDBC
  • ActiveMQ for JMS (partitioned by message type; backed by MySQL)
  • Lucene as a foundation for search
  • Spring as glue

Server Architecture

2003-2005

  • One monolithic web application
  • One database: the Core Database
  • The network graph is cached in memory in The Cloud
  • Member Search is implemented using Lucene. It runs on the same server as The Cloud, because member searches must be filtered according to the searching user’s network, and it’s convenient to have Lucene on the same machine.
  • WebApp updates the Core Database directly. The Core Database updates The Cloud.

2006

  • Added Replica DB’s, to reduce the load on the Core Database. They contain read-only data. A RepDB server manages updates of the Replica DB’s.
  • Moved Search out of The Cloud and into its own server.
  • Changed the way updates are handled, by adding the Databus. This is a central component that distributes updates to any component that needs them. This is the new updates flow:
    • Changes originate in the WebApp
    • The WebApp updates the Core Database
    • The Core Database sends updates to the Databus
    • The Databus sends the updates to: the Replica DB’s, The Cloud, and Search
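
The Databus is an internal LinkedIn component, so the following is only an illustrative sketch (with invented names) of the fan-out pattern described in this flow: the Core Database layer publishes each change once, and every registered consumer (the Replica DBs, The Cloud, Search) receives it through a listener callback.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative only; not LinkedIn's actual Databus API.
class MemberUpdate {
    final long memberId;
    final String payload;

    MemberUpdate(long memberId, String payload) {
        this.memberId = memberId;
        this.payload = payload;
    }
}

interface UpdateListener {
    void onUpdate(MemberUpdate update);   // implemented by the Replica DBs, The Cloud, Search, ...
}

class Databus {
    private final List<UpdateListener> listeners = new CopyOnWriteArrayList<UpdateListener>();

    void register(UpdateListener listener) {
        listeners.add(listener);
    }

    // Called once per change by the Core Database layer; fans the update out to every consumer.
    void publish(MemberUpdate update) {
        for (UpdateListener listener : listeners) {
            listener.onUpdate(update);
        }
    }
}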

2008

  • The WebApp doesn’t do everything itself anymore: they split parts of its business logic into Services.
    The WebApp still presents the GUI to the user, but now it calls Services to manipulate the Profile, Groups, etc.
  • Each Service has its own domain-specific database (i.e., vertical partitioning).
  • This architecture allows other applications (besides the main WebApp) to access LinkedIn. They’ve added applications for Recruiters, Ads, etc.

The Cloud

  • The Cloud is a server that caches the entire LinkedIn network graph in memory.
  • Network size: 22M nodes, 120M edges.
  • Requires 12 GB RAM.
  • There are 40 instances in production
  • Rebuilding an instance of The Cloud from disk takes 8 hours.
  • The Cloud is updated in real-time using the Databus.
  • Persisted to disk on shutdown.
  • The cache is implemented in C++, accessed via JNI. They chose C++ instead of Java for two reasons:
    • To use as little RAM as possible.
    • Garbage Collection pauses were killing them. [LinkedIn said they were using advanced GC's, but GC's have improved since 2003; is this still a problem today?]
  • Having to keep everything in RAM is a limitation, but as LinkedIn have pointed out, partitioning graphs is hard.
  • [Sun offers servers with up to 2 TB of RAM (Sun SPARC Enterprise M9000 Server), so LinkedIn could support up to 1.1 billion users before they run out of memory. (This calculation is based only on the number of nodes, not edges). Price is another matter: Sun say only "contact us for price", which is ominous considering that the prices they do list go up to $30,000.]

The Cloud caches the entire LinkedIn Network, but each user needs to see the network from his own point of view. It’s computationally expensive to calculate that, so they do it just once when a user session begins, and keep it cached. That takes up to 2 MB of RAM per user. This cached network is not updated during the session. (It is updated if the user himself adds/removes a link, but not if any of the user’s contacts make changes. LinkedIn says users won’t notice this.)
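
As a rough sketch of what that per-session computation could look like, assume the graph is exposed as a map from member ID to the IDs of that member's direct connections (the representation and the degree limit below are assumptions, not details from the presentation):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NetworkView {
    // Walk outward a fixed number of degrees from one member; the resulting set is
    // computed once at session start and cached (the talk quotes up to 2 MB per user).
    static Set<Integer> visibleNetwork(Map<Integer, int[]> graph, int memberId, int maxDegrees) {
        Set<Integer> seen = new HashSet<Integer>();
        seen.add(memberId);
        List<Integer> frontier = Collections.singletonList(memberId);
        for (int degree = 0; degree < maxDegrees && !frontier.isEmpty(); degree++) {
            List<Integer> next = new ArrayList<Integer>();
            for (int id : frontier) {
                int[] connections = graph.get(id);
                if (connections == null) continue;
                for (int other : connections) {
                    if (seen.add(other)) {
                        next.add(other);   // reached for the first time at this degree
                    }
                }
            }
            frontier = next;
        }
        seen.remove(memberId);   // a member is not part of his own visible network
        return seen;
    }
}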

As an aside, they use Ehcache to cache members’ profiles. They cache up to 2 million profiles (out of 22 million members). They tried caching using the LFU algorithm (Least Frequently Used), but found that Ehcache would sometimes block for 30 seconds while recalculating LFU, so they switched to LRU (Least Recently Used).
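
Ehcache's eviction policy is set through configuration rather than code; the snippet below is not the Ehcache API, just a minimal illustration of the LRU behaviour they settled on, using LinkedHashMap's access-order mode.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: entries are re-ordered on each access, and the
// least-recently-used entry is dropped once the cache exceeds its capacity.
public class LruProfileCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruProfileCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}

LRU only has to touch the entry being accessed, which is consistent with the stalls they saw when Ehcache periodically recalculated LFU statistics over millions of entries.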

Communication Architecture

Communication Service

The Communication Service is responsible for permanent messages, e.g. InBox messages and emails.

  • The entire system is asynchronous and uses JMS heavily
  • Clients post messages via JMS (see the sketch after this list)
  • Messages are then routed via a routing service to the appropriate mailbox or directly for email processing
  • Message delivery: either Pull (clients request their messages), or Push (e.g., sending emails)
  • They use Spring, with proprietary LinkedIn Spring extensions, and HTTP-RPC.
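
A hedged sketch of the "clients post messages via JMS" step, using the plain JMS API with an ActiveMQ connection factory; the broker URL and queue name are invented, and LinkedIn's own Spring extensions are not shown.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class InboxPoster {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("member.inbox");   // hypothetical queue name
            MessageProducer producer = session.createProducer(queue);
            TextMessage message = session.createTextMessage("You have a new connection request");
            producer.send(message);   // the routing service later delivers it to a mailbox or to email
        } finally {
            connection.close();
        }
    }
}

Because the producer only hands the message to the broker, the web tier never waits on mailbox writes or email delivery, which fits the "everything is asynchronous" theme above.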

Scaling Techniques

  • Functional partitioning: sent, received, archived, etc. [a.k.a. vertical partitioning]
  • Class partitioning: Member mailboxes, guest mailboxes, corporate mailboxes
  • Range partitioning: Member ID range; Email lexicographical range. [a.k.a. horizontal partitioning]
  • Everything is asynchronous
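
As an illustration of the range-partitioning item above, here is a small router that maps a member ID to one of several horizontally partitioned databases. The shard size and the DataSource wiring are assumptions made for the example, not LinkedIn's actual values.

import java.util.List;
import javax.sql.DataSource;

public class MailboxShardRouter {
    private static final long MEMBERS_PER_SHARD = 5000000L;   // assumed range size

    private final List<DataSource> shards;   // one DataSource per partitioned database

    public MailboxShardRouter(List<DataSource> shards) {
        this.shards = shards;
    }

    // Contiguous member-ID ranges map to successive shards; IDs beyond the last
    // configured range fall into the final shard until a new one is provisioned.
    public DataSource shardFor(long memberId) {
        int index = (int) (memberId / MEMBERS_PER_SHARD);
        return shards.get(Math.min(index, shards.size() - 1));
    }
}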

Network Updates Service

The Network Updates Service is responsible for short-lived notifications, e.g. status updates from your contacts.

Initial Architecture (up to 2007)

  • There are many services that can contain updates.
  • Clients make separate requests to each service that can have updates: Questions, Profile Updates, etc.
  • It took a long time to gather all the data.

In 2008 they created the Network Updates Service. The implementation went through several iterations:

Iteration 1

  • Client makes just one request, to the NetworkUpdateService.
  • NetworkUpdateService makes multiple requests to gather the data from all the services. These requests are made in parallel.
  • The results are aggregated and returned to the client together.
  • Pull-based architecture.
  • They rolled out this new system to everyone at LinkedIn, which caused problems while the system was stabilizing. In hindsight, they should have tried it out on a small subset of users first.

Iteration 2

  • Push-based architecture: whenever events occur in the system, add them to the user’s "mailbox". When a client asks for updates, return the data that’s already waiting in the mailbox.
  • Pros: reads are much quicker since the data is already available.
  • Cons: might waste effort on moving around update data that will never be read. Requires more storage space.
  • There is still post-processing of updates before returning them to the user. E.g.: collapse 10 updates from a user to 1.
  • The updates are stored in CLOB’s: 1 CLOB per update-type per user (for a total of 15 CLOB’s per user).
  • Incoming updates must be added to the CLOB. They use optimistic locking to avoid lock contention (see the sketch after this list).
  • They had set the CLOB size to 8 KB, which was too large and led to a lot of wasted space.
  • Design note: instead of CLOB’s, LinkedIn could have created additional tables, one for each type of update. They said that they didn’t do this because of what they would have to do when updates expire: Had they created additional tables then they would have had to delete rows, and that’s very expensive.
  • They used JMX to monitor and change the configuration in real-time. This was very helpful.
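
A hedged sketch of the optimistic-locking write mentioned above, in plain JDBC (consistent with the no-ORM choice); the table, columns, and version scheme are invented for illustration, and only the pattern reflects the approach described.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class UpdateClobWriter {
    // Assumed schema: network_updates(member_id, update_type, payload CLOB, version INT).
    // The version check in the WHERE clause is the optimistic lock: if another writer
    // got there first, zero rows match and the caller re-reads and retries.
    public boolean writeUpdates(Connection conn, long memberId, int updateType,
                                String newPayload, int expectedVersion) throws SQLException {
        String sql = "UPDATE network_updates SET payload = ?, version = version + 1 "
                   + "WHERE member_id = ? AND update_type = ? AND version = ?";
        PreparedStatement stmt = conn.prepareStatement(sql);
        try {
            stmt.setString(1, newPayload);
            stmt.setLong(2, memberId);
            stmt.setInt(3, updateType);
            stmt.setInt(4, expectedVersion);
            return stmt.executeUpdate() == 1;   // false = conflict, retry with a fresh version
        } finally {
            stmt.close();
        }
    }
}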

Iteration 3

  • Goal: improve speed by reducing the number of CLOB updates, because CLOB updates are expensive.
  • Added an overflow buffer: a VARCHAR(4000) column where data is added initially. When this column is full, dump it to the CLOB. This eliminated 90% of CLOB updates.
  • Reduced the size of the updates.
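
A sketch of that write path under assumed names: updates accumulate in the cheap VARCHAR buffer, and the expensive CLOB is only touched when the buffer fills, which is how roughly 90% of CLOB updates were eliminated.

// Illustrative only; the storage interface stands in for the JDBC access layer.
interface UpdateStore {
    String readBuffer(long memberId, int type);
    void writeBuffer(long memberId, int type, String value);
    void appendToClob(long memberId, int type, String value);
}

public class BufferedUpdateWriter {
    private static final int BUFFER_CAPACITY = 4000;   // the VARCHAR(4000) overflow buffer

    private final UpdateStore store;

    public BufferedUpdateWriter(UpdateStore store) {
        this.store = store;
    }

    public void write(long memberId, int type, String update) {
        String buffer = store.readBuffer(memberId, type);
        if (buffer.length() + update.length() <= BUFFER_CAPACITY) {
            store.writeBuffer(memberId, type, buffer + update);   // cheap, common case
        } else {
            store.appendToClob(memberId, type, buffer);           // rare, expensive flush
            store.writeBuffer(memberId, type, update);
        }
    }
}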

[LinkedIn have had success in moving from a Pull architecture to a Push architecture. However, don't discount Pull architectures. Amazon, for example, use a Pull architecture. In A Conversation with Werner Vogels, Amazon's CTO, he said that when you visit the front page of Amazon they typically call more than 100 services in order to construct the page.]



The presentation ends with some tips about scaling. These are oldies but goodies:

  • Can’t use just one database. Use many databases, partitioned horizontally and vertically.
  • Because of partitioning, forget about referential integrity or cross-domain JOINs.
  • Forget about 100% data integrity.
  • At large scale, cost is a problem: hardware, databases, licenses, storage, power.
  • Once you’re large, spammers and data-scrapers come a-knocking.
  • Cache!
  • Use asynchronous flows.
  • Reporting and analytics are challenging; consider them up-front when designing the system.
  • Expect the system to fail.
  • Don’t underestimate your growth trajectory.

Thursday, June 5, 2008

Radiohead's "Nude" performed by a ZX Spectrum, dot matrix printer, scanner, and hard disk array


Andy Baio : Radiohead's "Nude" performed by a ZX Spectrum, dot matrix printer, scanner, and hard disk array - starts at about 1:10; best remix ever


nelson : LoFi Radiohead / Nude - deeply fucked; give it to 1:30 before giving up


Tags : music radiohead via:waxy


Wednesday, June 4, 2008

06/4/08 PHD comic: 'When to meet with your advisor'














[Comic: "When to meet with your advisor", Piled Higher & Deeper by Jorge Cham, www.phdcomics.com, originally published 6/4/2008]







Yahoo is releasing an Address Book API today that will give 3rd-party developers access to Yahoo users’ contact lists without the traditional, but primitive, method of page scraping.


In addition to searching for specific contacts and fields and reading their data, developers can use it to add contacts and change existing records (although to start, only pre-approved developers will have the right to make edits).


Chris Yeh, the head of the Yahoo developer network, considers this release the second major “proof point” of Yahoo’s Open Services (YOS) campaign, which kicked off at the Web 2.0 Expo in March. The first point was Search Monkey, which makes it possible for anyone to enhance the way website results are displayed in Yahoo search.


As with Microsoft and Google’s own contact APIs, Yahoo has decided to implement a proprietary permission system (theirs is called BBAuth) rather than an open protocol like OAuth. Yeh says he hopes to see OAuth adopted by Yahoo in the near term, although he couldn’t say when that might happen.


LinkedIn and Plaxo are two launch partners who have already implemented the new API and even used it publicly over the past several months.


Yeh says there is no policy in place for restricting how long developers can store and use the data they pull from the API. But, as with many of its developer initiatives, Yahoo reserves the right to stop what it deems bad behavior.


As Dave McClure suggested to me recently, it would be very powerful if developers could not only retrieve basic contact information from webmail services like Yahoo, Hotmail and Gmail, but could also determine the types of relationships a user has with those contacts. For example, if I wanted to pull out a user’s top 5 contacts, I could do so by looking at the frequency of messages sent to all contacts. This lookup could be refined by targeting only messages with certain keywords so that contacts belonging to particular categories (say, golf enthusiasts) could be identified by their messages.


Unfortunately, no such advanced querying is available with Yahoo’s new API, at least to start. Yeh does assure me that other groups within the YOS campaign are looking at how to identify relationships within the address book, so hopefully we’ll see this type of functionality down the line.








Monday, June 2, 2008

Express Your Music Mood with Muzicons Widgets


Social networking woes got you down? Why don’t you let the world know how frustrated you are by expressing your emotions through a music widget!


Muzicons is a new music sharing site where you can easily create a widget (or Muzicon) to host on your blog. The name, a nod to ‘emoticon,’ comes from the ability to use icons to show mood, emotions, and whatever it is you are thinking at the time.



Unlike existing music widgets such as eSnips, MxPlay, and Sonific, Muzicons is the only service that gives users the option of choosing an emoticon and a mood status. You can customize the look of your widget and place it on your Blogger, LiveJournal, WordPress, or MySpace page.



Muzicons may seem ordinary, but they’re David against the Goliath that is the music industry. Since Muzicons was created in Russia, it will not have to adhere to the demands of the DMCA as we saw happen with Imeem when it was sued by Warner Music. Users are free to upload copyrighted music as they see fit.


In my opinion, Muzicons are simply the cutest music widgets I have ever seen.


Check out my Muzicon:










Facebook certainly chose a peculiar time to announce the imminent death of a platform (and dare I say, operating system) staple: the installation screen.


In the late afternoon of last Friday, Pete Bratach penned a post called “Streamlining Application Authorization” that went virtually unnoticed by the press at the time, even by Facebook-focused blogs. And when they did finally cover Bratach’s post, they chose to focus on less important matters concerning user metrics.


Was Facebook trying to pull a fast one on us? That wouldn’t be surprising, given the potential of fundamental platform changes to upset a large number of developers. And what it did announce consisted of quite a fundamental change.


Starting July 15 (and perhaps coinciding with the rollout of Facebook’s new site design), users will no longer see an installation screen (see below) when they access an application for the first time. Rather, they will see a new “login” screen that simply asks them whether they want to permit the application access to their information. This simply grants the application temporary access to your data so it can operate, without establishing any real footprint on your Facebook experience.



The new screen has been designed to make application adoption less intimidating for users. They will no longer have to worry about installing (and later uninstalling) applications - and their associated profile boxes, left-hand nav buttons, profile links, and email lists - just to try them out.


But the change should also slow viral growth patterns - especially for newer, smaller apps. Gone is the ability to put profile boxes (which give apps considerable visibility) on users’ pages upon first access. To add a box (on a special apps tab no less), users must later decide that they like it well enough to click on a special canvas page button.


The same goes for email notifications and news feed items larger than one line; users must opt into these through the canvas page as an afterthought. The new design will also forgo app links in the left-hand column (the column is going away in its present form), as well as app links under profile pictures. All in all, these changes mean that applications will struggle to obtain the same visibility and user access that they can instantly achieve now upon installation.


The move to get rid of installations is the latest in a line of decisions meant to clamp down on spammy apps by implementing sweeping changes to the platform, rather than coming down hard on particular wrongdoers. We first covered this trend in August when Facebook moved to stop developers from generating deceptive profile boxes and messages, and then changed the way all applications are measured.


Later, Facebook outed misleading notifications and mini-feed stories, reined in cross-application notifications, and put restrictions on feed stories. More importantly, Facebook began regulating the number of notifications, requests, and emails that apps could send to new users, based on response rates. And this year the company also released a formal platform policy that implemented rule-based limitations in addition to technical ones.


Platform changes meant to reduce spam are great for users but not so great for developers, even the non-spammy ones. After all, platforms by definition are meant to be stable; shake them up and things start to fall apart. Furthermore, just how Facebook has chosen to evolve its platform should give innocent developers pause. As Michael remarked in August, Facebook tends to avoid punishing mischievous developers in any meaningful way. This policy leads to further bad behavior since developers know they won’t be held individually accountable; Facebook will just change the entire platform on them.


And these pending changes only serve to continue that tradition. When installation screens go away mid-July, existing applications will see their access to users grandfathered in. Their profile boxes will be moved over to the new tab, their email lists will retain members, and they will still be able to generate news feed items just as before. Thus, the popular apps that achieved success through spammy means won’t suffer nearly as much as nascent apps that have yet to gain a foothold.


The question going forward, therefore, is this: will Facebook continue perfecting the platform with the goal of preventing all bad behavior with technological measures but no meaningful deterrents? Or will it concede that overly selfish behavior on the part of developers is unstoppable to some extent, and that it’s important to implement a reliable and effective system of punishment?








Sunday, June 1, 2008

Java TreeMap and HashMap Incompatibilities


I will often use a TreeMap in place of a HashMap because, while TreeMap is theoretically slower (O(log N) vs. O(1)), it is often much more memory efficient, especially for larger maps.


The problem is that you can’t just swap your map for TreeMap without modifying code because you can’t store null keys in TreeMap.


This code:


Map<String, String> map = new HashMap<String, String>();
map.put( null, "bar" );
System.out.printf( "%s\n", map.get( null ) );


works fine and prints ‘bar’.


This code:


Map<String, String> map = new TreeMap<String, String>();
map.put( null, "bar" );
System.out.printf( "%s\n", map.get( null ) );


… will throw a NullPointerException.


This happens because TreeMap has to compare keys to find their place, and the natural ordering it uses by default (compareTo) can’t handle a null key. Why not?


I can pass in my own Comparator but now I need to remember this for every instance of TreeMap.
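
For example, a nulls-first Comparator makes the TreeMap snippet above behave like the HashMap one. (Nothing like this shipped with the JDK of the day; java.util.Comparator only gained a built-in nullsFirst helper in later releases, so you write it by hand.)

Comparator<String> nullsFirst = new Comparator<String>() {
    public int compare(String a, String b) {
        if (a == null) return (b == null) ? 0 : -1;   // the null key sorts before everything
        if (b == null) return 1;
        return a.compareTo(b);
    }
};

Map<String, String> map = new TreeMap<String, String>(nullsFirst);
map.put( null, "bar" );
System.out.printf( "%s\n", map.get( null ) );   // prints 'bar', no exception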


Come to think of it, I’ve often seen Map implementations slowing down otherwise decent code. HashMap’s rehashing, which doubles the table to 2N buckets whenever it fills up, can blow your memory footprint out of the water and crash your JVM.


If one were to use Guice to inject the right Map implementation, third-party libraries would be able to swap their Map implementations out at runtime.
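
A hedged sketch of that idea with Guice: one module (the class names here are made up) decides which Map implementation gets injected, so swapping HashMap for TreeMap is a one-line change that never touches the consuming code.

import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;
import com.google.inject.TypeLiteral;
import java.util.Map;
import java.util.TreeMap;

public class MapModule extends AbstractModule {
    @Override
    protected void configure() {
        // Change the target type here to switch Map implementations application-wide.
        bind(new TypeLiteral<Map<String, String>>() {})
            .to(new TypeLiteral<TreeMap<String, String>>() {});
    }

    static class Index {
        final Map<String, String> entries;

        @Inject
        Index(Map<String, String> entries) {
            this.entries = entries;
        }
    }

    public static void main(String[] args) {
        Injector injector = Guice.createInjector(new MapModule());
        Index index = injector.getInstance(Index.class);
        System.out.println(index.entries.getClass());   // class java.util.TreeMap
    }
}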


Further, if memory is your biggest problem, I think you might want to punt on using the Java internal Map implementations and use FlatMap (even though it has some limitations).