Monday, December 31, 2007
Users of the command line are familiar with the idea of building pipelines: a chain of simple commands strung together so that the output of one becomes the input of the next. Using pipelines and a basic set of primitives, shell users can accomplish some sophisticated tasks. Here's a basic Unix shell pipeline that reports the ten longest .tip files in the current directory, based on the number of lines in each file:
wc -l *.tip | grep \.tip | sort -n | tail -10
Let's see how to add something similar to Ruby. By the end of this set of two articles, we'll be able to write things like
puts (even_numbers | tripler | incrementer | multiple_of_five).resume
and a palindrome finder using blocks:
words = Pump.new %w{Madam, the civic radar rotator is not level.}
is_palindrome = Filter.new {|word| word == word.reverse}
pipeline = words .| {|word| word.downcase.tr("^a-z", '') } .| is_palindrome
while word = pipeline.resume
  puts word
end
Great code? Nope. But getting there is fun. And, who knows? The techniques might well be useful in your next project.
A Daily Dose of Fiber
Ruby 1.9 adds support for Fibers. At their most basic, Fibers let you create simple generators (much as you could do previously with blocks). Here's a trivial example: a fiber that generates successive Fibonacci numbers:
fib = Fiber.new do
  f1 = f2 = 1
  loop do
    Fiber.yield f1
    f1, f2 = f2, f1 + f2
  end
end
10.times { puts fib.resume }
A fiber is somewhat like a thread, except you have control over when it gets scheduled. Initially, a fiber is suspended. When you resume it, it runs the block until the block finishes, or it hits a Fiber.yield. This is similar to a regular block yield: it suspends the fiber and passes control back to the resume. Any value passed to Fiber.yield becomes the value returned by resume.
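To make that two-way value passing concrete, here's a small sketch (the echo fiber and its strings are my own illustration, not from the article):

```ruby
# Values flow both ways: the argument to resume becomes the block
# parameter (on the first call) or the return value of Fiber.yield
# (on later calls), and the argument to Fiber.yield becomes the
# value returned by resume.
echo = Fiber.new do |greeting|
  reply = Fiber.yield "fiber got: #{greeting}"
  "fiber got: #{reply}"
end

first  = echo.resume("hello")    # => "fiber got: hello"
second = echo.resume("goodbye")  # => "fiber got: goodbye"
puts first, second
```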
By default, a fiber can only yield back to the code that resumed it. However, if you require the "fiber" library, Fibers get extended with a transfer method that allows one fiber to transfer control to another. Fibers then become fully fledged coroutines. That said, we won't be needing all that power today.
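Just to show what transfer looks like, here's a minimal sketch (the worker fiber and its message strings are my invention, not part of the article):

```ruby
# Fiber#transfer lives in the "fiber" library on Ruby 1.9;
# on modern Rubies it is built into core.
require "fiber" if RUBY_VERSION < "3.0"

main = Fiber.current
worker = Fiber.new do |msg|
  # Hand control (and an answer) straight back to the main fiber,
  # rather than yielding to whoever resumed us.
  main.transfer("worker saw: #{msg}")
end

reply = worker.transfer("ping")
puts reply  # => "worker saw: ping"
```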
Instead, let's get back to the idea of creating pipelines of functionality in code, much as you can create pipelines in the shell.
As a starting point, let's write two fibers. One's a generator—it creates a list of even numbers. The second is a consumer. All it does is accept values from the generator and print them. We'll make the consumer stop after printing 10 numbers.
evens = Fiber.new do
  value = 0
  loop do
    Fiber.yield value
    value += 2
  end
end

consumer = Fiber.new do
  10.times do
    next_value = evens.resume
    puts next_value
  end
end
consumer.resume
Note how we had to use resume to kick off the consumer. Technically, the consumer doesn't have to be a Fiber, but, as we'll see in a minute, making it one gives us some flexibility.
As a next step, notice how we've created some coupling in this code. Our consumer fiber has the name of the evens generator coded into it. Let's wrap both fibers in methods, and pass the generator into the consumer method.
def evens
  Fiber.new do
    value = 0
    loop do
      Fiber.yield value
      value += 2
    end
  end
end

def consumer(source)
  Fiber.new do
    10.times do
      next_value = source.resume
      puts next_value
    end
  end
end
consumer(evens).resume
OK. Let's add one more fiber to the weave. We'll create a filter that only passes on numbers that are multiples of three. Again, we'll wrap it in a method.
def evens
  Fiber.new do
    value = 0
    loop do
      Fiber.yield value
      value += 2
    end
  end
end

def multiples_of_three(source)
  Fiber.new do
    loop do
      next_value = source.resume
      Fiber.yield next_value if next_value % 3 == 0
    end
  end
end

def consumer(source)
  Fiber.new do
    10.times do
      next_value = source.resume
      puts next_value
    end
  end
end
consumer(multiples_of_three(evens)).resume
Running this, we get the output
0
6
12
18
. . .
This is getting cool. We write little chunks of code, and then combine them to get work done. Just like a pipeline. Except...
We can do better. First, the composition looks backwards. Because we're passing methods to methods, we write
consumer(multiples_of_three(evens))
Instead, we'd like to write
evens | multiples_of_three | consumer
Also, there's a fair amount of duplication in this code. Each of our little pipeline methods has the same overall structure, and each is coupled to the implementation of fibers. Let's see if we can fix this.
Wrapping Fibers
As is usual when we're refactoring towards a solution, we're about to get really messy. Don't worry, though. It will all wash off, and we'll end up with something a lot neater.
First, let's create a class that represents something that can appear in our pipeline. At its heart is the process method. This reads something from the input side of the pipe, then "handles" that value. The default handling is to write that value to the output side of the pipeline, passing it on to the next element in the chain.
class PipelineElement
  attr_accessor :source

  def initialize
    @fiber_delegate = Fiber.new do
      process
    end
  end

  def resume
    @fiber_delegate.resume
  end

  def process
    while value = input
      handle_value(value)
    end
  end

  def handle_value(value)
    output(value)
  end

  def input
    source.resume
  end

  def output(value)
    Fiber.yield(value)
  end
end
When I first wrote this, I was tempted to make PipelineElement a subclass of Fiber, but that leads to coupling. In the end, the pipeline elements delegate to a separate Fiber object.
The first element of the pipeline doesn't receive any input from prior elements (because there are no prior elements), so we need to override its process method.
class Evens < PipelineElement
  def process
    value = 0
    loop do
      output(value)
      value += 2
    end
  end
end
evens = Evens.new
Just to make things more interesting, we'll create a generic MultiplesOf filter, so we can filter based on any number, and not just 3:
class MultiplesOf < PipelineElement
  def initialize(factor)
    @factor = factor
    super()
  end

  def handle_value(value)
    output(value) if value % @factor == 0
  end
end
multiples_of_three = MultiplesOf.new(3)
multiples_of_seven = MultiplesOf.new(7)
Then we just knit it all together into a pipeline:
multiples_of_three.source = evens
multiples_of_seven.source = multiples_of_three
10.times do
  puts multiples_of_seven.resume
end
We get 0, 42, 84, 126, 168, and so on as output. (Any output stream that contains 42 must be correct, so no need for any unit tests here.)
But we're still a little way from our ideal of being able to pipe these puppies together. It's a good thing that Ruby lets us override the | operator. Up in class PipelineElement, define a new method:
def |(other)
  other.source = self
  other
end
This allows us to write:
10.times do
  puts (evens | multiples_of_three | multiples_of_seven).resume
end
or even:
pipeline = evens | multiples_of_three | multiples_of_seven
10.times do
  puts pipeline.resume
end
Cool, or what?
In The Next Thrilling Installment
The next post will take these basic ideas and tart them up a bit, allowing us to use blocks directly in pipelines. We'll also reveal why the PipelineElement class I just wrote is somewhat more complicated than might seem necessary. In the meantime, here's the full source of the code so far.
class PipelineElement
  attr_accessor :source

  def initialize
    @fiber_delegate = Fiber.new do
      process
    end
  end

  def |(other)
    other.source = self
    other
  end

  def resume
    @fiber_delegate.resume
  end

  def process
    while value = input
      handle_value(value)
    end
  end

  def handle_value(value)
    output(value)
  end

  def input
    source.resume
  end

  def output(value)
    Fiber.yield(value)
  end
end

##
# The classes below are the elements in our pipeline
#
class Evens < PipelineElement
  def process
    value = 0
    loop do
      output(value)
      value += 2
    end
  end
end

class MultiplesOf < PipelineElement
  def initialize(factor)
    @factor = factor
    super()
  end

  def handle_value(value)
    output(value) if value % @factor == 0
  end
end

evens = Evens.new
multiples_of_three = MultiplesOf.new(3)
multiples_of_seven = MultiplesOf.new(7)

pipeline = evens | multiples_of_three | multiples_of_seven

10.times do
  puts pipeline.resume
end
Product: collectd
From http://directory.fsf.org/project/collectd/ :
'collectd' is a small daemon which collects system information every 10 seconds and writes the results to an RRD file. The statistics gathered include CPU and memory usage, system load, network latency (ping), network interface traffic, system temperatures (using lm-sensors), and disk usage. 'collectd' is not a script; it is written in C for performance and portability. It stays in memory, so there is no need to start up a heavy interpreter every time new values should be logged.
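For a sense of what that looks like in practice, here's a minimal collectd.conf sketch covering the plugins mentioned above (the file path, data directory, and ping target are my assumptions, not from the description):

```
# Minimal sketch of /etc/collectd/collectd.conf (paths are assumptions)
Interval 10               # sample every 10 seconds, as described above

LoadPlugin cpu            # CPU usage
LoadPlugin memory         # memory usage
LoadPlugin load           # system load
LoadPlugin ping           # network latency
LoadPlugin interface      # network interface traffic
LoadPlugin sensors        # system temperatures via lm-sensors
LoadPlugin df             # disk usage
LoadPlugin rrdtool        # write the results to RRD files

<Plugin ping>
  Host "example.org"      # hypothetical latency target
</Plugin>

<Plugin rrdtool>
  DataDir "/var/lib/collectd/rrd"
</Plugin>
```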
Friday, December 21, 2007
Packing Down Prototype
In the world of client-side development, page size definitely matters, and John-David Dalton is doing his best to improve this. A while back, he took on the task of determining how best to optimize the 120k+ Prototype and Script.aculo.us libraries and get the best compression out of the files. He's done an absolutely fabulous job in the past and continues to pack the libraries as new versions are released. He's just published a new version of his compressed Prototype collection on Google Code.
This pack contains the following compressed versions of Prototype: 1.4, 1.5, 1.5.1.1, 1.6.0 and Scriptaculous: 1.7.1_beta3, 1.8.0. This release compresses Prototype and Scriptaculous around 10% more than regular gzipping.
Prototype's lowest weighs in at 20.4kb.
Scriptaculous' lowest weighs in at: 19.7kb.
Protoculous' lowest weighs in at: 38.4kb.
Compressed forms of Scriptaculous are segmented to allow for custom builds of Protoculous/Scriptaculous.
He has included a common custom build with the pack, "Prototype+Effects", which weighs in at around 26kb. That's smaller than standard Prototype gzipped!
Protoculous also allows the loading of additional js files like Scriptaculous via: protoculous.js?load=one,two
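In a page, that presumably looks something like the following (the module names here are illustrative, not a documented list):

```html
<!-- Illustrative only: pull extra Scriptaculous modules in through Protoculous -->
<script type="text/javascript" src="protoculous.js?load=effects,dragdrop"></script>
```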
You can download his collection here and included in the download is a readme file providing more details about his build.
Thursday, December 20, 2007
Google Text Ad Subversion
There’s an interesting article over at ZDNet explaining that Google’s text ads are getting subverted by trojans on people’s machines, tricking users into clicking on other people’s ads. It wasn’t clear what those ads were, exactly, but there you have it. I see this kind of thing as a clear path for future monetization - similar to how bad guys are adding extra form fields into forms via malware to gain more information about your identity. Very clever, and easy to do.
This is different from when Google’s ads were spreading malware, but it has the same basic purpose. Getting code on people’s machines is the best way to get control of the machine and ultimately make money off of it via spam, clicks, or whatever else they come up with.
Friday, December 14, 2007
Funny Books [added new books]
English as a Second F*cking Language: How to Swear Effectively, Explained in Detail with Numerous Examples Taken From Everyday Life
Get Your Tongue Out of My Mouth, I'm Kissing You Good-bye!
When Your Phone Doesn't Ring, It'll Be Me
Even God Is Single, So Stop Giving Me A Hard Time
If You Can't Live Without Me, Why Aren't You Dead Yet?!
Babies and Other Hazards of Sex: How to Make a Tiny Person in Only 9 Months, with Tools You Probably Have around the Home
Dave Barry's Stay Fit and Healthy Until You're Dead
Women Are from Venus, Men Are from Hell
Monday, December 3, 2007
Funny Books
The Complete Idiot's Guide to Understanding Intelligent Design
Teaching Kids to Read for Dummies
If Singleness Is a Gift, What's the Return Policy?
If Men Are Like Buses, Then How Do I Catch One?
How to Be Happy Though Married
Old Tractors and the Men Who Love Them: How to Keep Your Tractors Happy and Your Family Running
Lightweight Sandwich Construction
The Art and Craft of Pounding Flowers: No Ink, No Paint, Just a Hammer
Art and Science of Dumpster Diving
The Big Book Of Lesbian Horse Stories
How to Succeed in Business without a Penis
Dining Posture in Ancient Rome
The Joy of Sex, Pocket Edition
Straight Talk About Surgical Penis Enlargement
Bullying and Sexual Harassment: A Practical Guide
Love Your Body, Extended Edition
You're Dead and You Don't Even Know It
People Who Don't Know They're Dead: How They Attach Themselves To Unsuspecting bystanders and what to do about it
- The Great Pantyhose Crafts Book
- Teach Yourself Sex
- Sex After Death
- Making it in leather
- Soil nailing
- The Madam as Entrepreneur: Career Management in House Prostitution
- So your wife came home speaking in tongues?: So did mine!
- Children Are Wet Cement
- Old Age: Its Cause & Prevention
- Correct Mispronunciations of Some South Carolina Names
- Mensa for Dummies
- Greek rural postmen and their cancellation numbers
- Life and laughter 'midst the cannibals
- Throw Me a Bone, What Happens When You Marry an Archaeologist
- Phone Calls from the Dead
- Reusing Old Graves
- Erections On Allotments - George W. Giles And Fred M. Osborn, Central Allotments Committee
- Play With Your Own Marbles - J.J. Wright, S.W Partridge, C.1865
- Proceedings Of The Second International Workshop On Nude Mice (1978)
- Suggestive Thoughts For Busy Workers
- How To Do Cups And Balls - The Vampire Press, 1946
- Premature Burial And How It May Be Prevented
- How To Cook Husbands; 1899
- The Gentle Art Of Cooking Wives; 1900
- The Disadvantages Of Being Dead
- The Brain-Workers' Handbook
- When All Else Fails, It's Time To Try...Frog Raising
Monday, February 26, 2007
Dojo Offline Demo and Release
Brad Neuberg keeps sprinting with his offline work for Dojo. He has just released a demo of an offline aware editor, and a release of the code itself:
We’ve finished the JavaScript layer of Dojo Offline. Dojo Offline consists of two major pieces: a JavaScript API that is included with a web application, and which helps with syncing, on/offline status notification, caching of data and resources, etc.; and a local small web proxy download that is cross-platform and cross-browser and which is web application independent. The JavaScript API is now finished, and can actually be used even though we have not finished the local web proxy yet. This is done by having the JavaScript layer be able to use the browser’s native cache if no offline cache is available. This means you can start playing with Dojo Offline right now, with the download link included in this blog post below. Note that using the browser cache instead of the web proxy is only suitable for prototyping and should not be deployed on production applications; it will work with varying degrees of success on Internet Explorer and Firefox, but not consistently on Safari. Higher levels of reliability will only come when we deliver the local web proxy component.
To start using Dojo Offline now in conjunction with the browser cache, you must have the following in your JavaScript code:
dojo.off.requireOfflineCache = false;
You must also turn on HTTP caching headers on your web server; how to turn on these HTTP headers and which ones to turn on are explained in the Dojo Storage FAQ. See the Moxie code links below for more examples of how to use the API. Note that the Dojo Offline JavaScript API has changed, especially for syncing, since our API blog post about a month ago and has become much simpler — see the Moxie source for details.
The demo of Moxie shown in the screencast can also be played with right in your browser. Please note that both Moxie and Dojo Offline are still alpha; you are literally seeing Dojo Offline being developed in front of your eyes, and glitches remain in both of them. Please debug and provide test cases for errors you find to help development.
Info
- Live demo of the editor at work (alpha)
- Source code: JavaScript and HTML
- Full Release (zip)
- Dojo Storage FAQ
Sunday, February 25, 2007
"We make it easier to work with Yahoo’s services than Yahoo does." That is what Alex believes as he announces that Dojo supports Yahoo! Pipes, which adds to the forward-thinking JSON-P style endpoints.
It’s cool and interesting to be able to call out to Pipes from your server-side web app, but what the mashup-mad world REALLY needs is to be able to do the same thing from a browser. Although it isn’t really in the docs anywhere, Yahoo’s Kent Brewster points out that Pipes supports a JSON-P callback argument. Awesome!
The structure of Pipes URLs is different from every other Yahoo service (much like Flickr’s, ugh), so there’s no Dojo RPC for it yet, but you can easily query a pipe using dojo.io.bind and the ScriptSrcIO transport:
JAVASCRIPT:
// get news results for Cometd
dojo.require("dojo.io.ScriptSrcIO"); // the x-domain magic
dojo.require("dojo.debug.console"); // firebug integration
dojo.io.bind({
  // grab this URL from the pipe you're interested in
  url: "http://pipes.yahoo.com/pipes/fELaGmGz2xGtBTC3qe5lkA/run",
  mimetype: "text/json",
  transport: "ScriptSrcTransport",
  jsonParamName: "callback", // aha!
  content: {
    "render": "json",
    "textinput1": "cometd"
  },
  load: function(type, data, evt){
    // log the response out to the Firebug console
    dojo.require("dojo.json");
    dojo.debug(dojo.json.serialize(arguments));
  }
});