Monday, December 31, 2007
A Bayesian LDA-based Model for Semi-Supervised Part-of-speech Tagging
Users of the command line are familiar with the idea of building pipelines: a chain of simple commands strung together so that the output of one becomes the input of the next. Using pipelines and a basic set of primitives, shell users can accomplish some sophisticated tasks. Here's a basic Unix shell pipeline that reports the ten longest .tip files in the current directory, based on the number of lines in each file:
wc -l *.tip | grep '\.tip' | sort -n | tail -10
Let's see how to add something similar to Ruby. By the end of this set of two articles, we'll be able to write things like
puts (even_numbers | tripler | incrementer | multiple_of_five ).resume
and a palindrome finder using blocks:
words = Pump.new %w{Madam, the civic radar rotator is not level.}
is_palindrome = Filter.new {|word| word == word.reverse}
pipeline = words .| {|word| word.downcase.tr("^a-z", '') } .| is_palindrome
while word = pipeline.resume
  puts word
end
Great code? Nope. But getting there is fun. And, who knows? The techniques might well be useful in your next project.
A Daily Dose of Fiber
Ruby 1.9 adds support for Fibers. At their most basic, they let you create simple generators (much as you could do previously with blocks). Here's a trivial example: a fiber that generates successive Fibonacci numbers:
fib = Fiber.new do
  f1 = f2 = 1
  loop do
    Fiber.yield f1
    f1, f2 = f2, f1 + f2
  end
end
10.times { puts fib.resume }
A fiber is somewhat like a thread, except you have control over when it gets scheduled. Initially, a fiber is suspended. When you resume it, it runs the block until the block finishes, or it hits a Fiber.yield. This is similar to a regular block yield: it suspends the fiber and passes control back to the resume. Any value passed to Fiber.yield becomes the value returned by resume.
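To make the resume/yield handshake concrete, here is a minimal sketch. The fiber name `steps` and the strings it yields are made up for illustration. It shows each resume returning the value handed to Fiber.yield, and the last resume returning the value of the block itself once the block finishes.

```ruby
# Hypothetical example fiber: yields twice, then runs to completion.
steps = Fiber.new do
  Fiber.yield "first"
  Fiber.yield "second"
  "done" # the block's final value is what the last resume returns
end

# Each resume runs the fiber up to its next Fiber.yield (or to the end).
results = [steps.resume, steps.resume, steps.resume]
puts results.inspect # ["first", "second", "done"]

# The fiber has now finished, so resuming it again raises FiberError.
begin
  steps.resume
rescue FiberError => e
  puts "FiberError: #{e.message}"
end
```

The infinite generators in this article never reach that last case, because their loops never finish.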
By default, a fiber can only yield back to the code that resumed it. However, if you require the "fiber" library, Fibers get extended with a transfer method that allows one fiber to transfer control to another. Fibers then become fully fledged coroutines. However, we won't be needing all that power today.
Instead, let's get back to the idea of creating pipelines of functionality in code, much as you can create pipelines in the shell.
As a starting point, let's write two fibers. One's a generator: it creates a list of even numbers. The second is a consumer. All it does is accept values from the generator and print them. We'll make the consumer stop after printing 10 numbers.
evens = Fiber.new do
  value = 0
  loop do
    Fiber.yield value
    value += 2
  end
end
consumer = Fiber.new do
  10.times do
    next_value = evens.resume
    puts next_value
  end
end
consumer.resume
Note how we had to use resume to kick off the consumer. Technically, the consumer doesn't have to be a Fiber, but, as we'll see in a minute, making it one gives us some flexibility.
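For comparison, here is a rough sketch of the same thing with a consumer that is not a fiber. The variable name `first_ten` is invented for this example. Ordinary code simply calls resume on the generator ten times.

```ruby
# Same generator as above: an endless stream of even numbers.
evens = Fiber.new do
  value = 0
  loop do
    Fiber.yield value
    value += 2
  end
end

# The consumer here is plain code rather than a second fiber.
first_ten = Array.new(10) { evens.resume }
puts first_ten
```

This works, but a plain loop can't itself be resumed by something further down the line, which is the flexibility a fiber-based consumer buys us.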
As a next step, notice how we've created some coupling in this code. Our consumer fiber has the name of the evens generator coded into it. Let's wrap both fibers in a method, and pass the name of the generator into the consumer method.
def evens
  Fiber.new do
    value = 0
    loop do
      Fiber.yield value
      value += 2
    end
  end
end
def consumer(source)
  Fiber.new do
    10.times do
      next_value = source.resume
      puts next_value
    end
  end
end
consumer(evens).resume
OK. Let's add one more fiber to the weave. We'll create a filter that only passes on numbers that are multiples of three. Again, we'll wrap it in a method.
def evens
  Fiber.new do
    value = 0
    loop do
      Fiber.yield value
      value += 2
    end
  end
end
def multiples_of_three(source)
  Fiber.new do
    loop do
      next_value = source.resume
      Fiber.yield next_value if next_value % 3 == 0
    end
  end
end
def consumer(source)
  Fiber.new do
    10.times do
      next_value = source.resume
      puts next_value
    end
  end
end
consumer(multiples_of_three(evens)).resume
Running this, we get the output
0
6
12
18
. . .
This is getting cool. We write little chunks of code, and then combine them to get work done. Just like a pipeline. Except...
We can do better. First, the composition looks backwards. Because we're passing methods to methods, we write
consumer(multiples_of_three(evens))
Instead, we'd like to write
evens | multiples_of_three | consumer
Also, there's a fair amount of duplication in this code. Each of our little pipeline methods has the same overall structure, and each is coupled to the implementation of fibers. Let's see if we can fix this.
Wrapping Fibers
As is usual when we're refactoring towards a solution, we're about to get really messy. Don't worry, though. It will all wash off, and we'll end up with something a lot neater.
First, let's create a class that represents something that can appear in our pipeline. At its heart is the process method. This reads something from the input side of the pipe, then "handles" that value. The default handling is to write that value to the output side of the pipeline, passing it on to the next element in the chain.
class PipelineElement
  attr_accessor :source
  def initialize
    @fiber_delegate = Fiber.new do
      process
    end
  end
  def resume
    @fiber_delegate.resume
  end
  def process
    while value = input
      handle_value(value)
    end
  end
  def handle_value(value)
    output(value)
  end
  def input
    source.resume
  end
  def output(value)
    Fiber.yield(value)
  end
end
When I first wrote this, I was tempted to make PipelineElement a subclass of Fiber, but that leads to coupling. In the end, the pipeline elements delegate to a separate Fiber object.
The first element of the pipeline doesn't receive any input from prior elements (because there are no prior elements), so we need to override its process method.
class Evens < PipelineElement
  def process
    value = 0
    loop do
      output(value)
      value += 2
    end
  end
end
evens = Evens.new
Just to make things more interesting, we'll create a generic MultiplesOf filter, so we can filter based on any number, and not just 3:
class MultiplesOf < PipelineElement
  def initialize(factor)
    @factor = factor
    super()
  end
  def handle_value(value)
    output(value) if value % @factor == 0
  end
end
multiples_of_three = MultiplesOf.new(3)
multiples_of_seven = MultiplesOf.new(7)
Then we just knit it all together into a pipeline:
multiples_of_three.source = evens
multiples_of_seven.source = multiples_of_three
10.times do
  puts multiples_of_seven.resume
end
We get 0, 42, 84, 126, 168, and so on as output. (Any output stream that contains 42 must be correct, so no need for any unit tests here.)
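For anyone who does want a sanity check on that output: a number that is even and also a multiple of both 3 and 7 must be a multiple of their least common multiple. The short sketch below works this out with Ruby's built-in Integer#lcm. The variable names `step` and `expected` are invented for this example.

```ruby
# Even numbers that pass both filters are exactly the multiples of lcm(2, 3, 7).
step = 2.lcm(3).lcm(7)
puts step # 42

# The first five values the pipeline should produce.
expected = Array.new(5) { |i| i * step }
puts expected.inspect # [0, 42, 84, 126, 168]
```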
But we're still a little way from our ideal of being able to pipe these puppies together. It's a good thing that Ruby lets us override the "|" operator. Up in class PipelineElement, define a new method:
def |(other)
  other.source = self
  other
end
This allows us to write:
10.times do
  puts (evens | multiples_of_three | multiples_of_seven).resume
end
or even:
pipeline = evens | multiples_of_three | multiples_of_seven
10.times do
  puts pipeline.resume
end
Cool, or what?
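If it isn't obvious why chaining works, the sketch below isolates the operator. `Stage` is a hypothetical stand-in for PipelineElement that keeps only the parts "|" touches, plus a name so we can see who is connected to whom. Because "|" returns its right-hand operand, and Ruby evaluates a | b | c as (a | b) | c, the whole expression evaluates to the last element, with each element's source set to the one before it.

```ruby
# Hypothetical stand-in for PipelineElement: just a source and the | operator.
class Stage
  attr_accessor :source
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Hook self up as the input of the right-hand element, then return that element.
  def |(other)
    other.source = self
    other
  end
end

a = Stage.new("a")
b = Stage.new("b")
c = Stage.new("c")

last = a | b | c # evaluated left to right: (a | b) | c

puts last.name               # c
puts last.source.name        # b
puts last.source.source.name # a
```

That is why resuming the value of the pipe expression pulls data through every stage: we are resuming the final element, and each element asks its source in turn.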
In The Next Thrilling Installment
The next post will take these basic ideas and tart them up a bit, allowing us to use blocks directly in pipelines. We'll also reveal why the PipelineElement class I just wrote is somewhat more complicated than might seem necessary. In the meantime, here's the full source of the code so far.
class PipelineElement
  attr_accessor :source
  def initialize
    @fiber_delegate = Fiber.new do
      process
    end
  end
  def |(other)
    other.source = self
    other
  end
  def resume
    @fiber_delegate.resume
  end
  def process
    while value = input
      handle_value(value)
    end
  end
  def handle_value(value)
    output(value)
  end
  def input
    source.resume
  end
  def output(value)
    Fiber.yield(value)
  end
end
##
# The classes below are the elements in our pipeline
#
class Evens < PipelineElement
  def process
    value = 0
    loop do
      output(value)
      value += 2
    end
  end
end
class MultiplesOf < PipelineElement
  def initialize(factor)
    @factor = factor
    super()
  end
  def handle_value(value)
    output(value) if value % @factor == 0
  end
end
evens = Evens.new
multiples_of_three = MultiplesOf.new(3)
multiples_of_seven = MultiplesOf.new(7)
pipeline = evens | multiples_of_three | multiples_of_seven
10.times do
  puts pipeline.resume
end
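The handle_value hook isn't limited to filtering. As a rough sketch of where this can go, here is an element that transforms each value instead of dropping some of them. The `Tripler` class and the `tripled` variable are hypothetical and not part of the code above. The block repeats PipelineElement and Evens so that it runs on its own.

```ruby
class PipelineElement
  attr_accessor :source

  def initialize
    @fiber_delegate = Fiber.new { process }
  end

  def |(other)
    other.source = self
    other
  end

  def resume
    @fiber_delegate.resume
  end

  def process
    while value = input
      handle_value(value)
    end
  end

  def handle_value(value)
    output(value)
  end

  def input
    source.resume
  end

  def output(value)
    Fiber.yield(value)
  end
end

class Evens < PipelineElement
  def process
    value = 0
    loop do
      output(value)
      value += 2
    end
  end
end

# Hypothetical element: passes every value on, multiplied by three.
class Tripler < PipelineElement
  def handle_value(value)
    output(value * 3)
  end
end

pipeline = Evens.new | Tripler.new
tripled = Array.new(5) { pipeline.resume }
puts tripled.inspect # [0, 6, 12, 18, 24]
```

Only handle_value changes between a filter and a transformer; the reading and writing stay in the base class.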
Product: collectd
From http://directory.fsf.org/project/collectd/ :
'collectd' is a small daemon which collects system information every 10 seconds and writes the results in an RRD-file. The statistics gathered include: CPU and memory usage, system load, network latency (ping), network interface traffic, and system temperatures (using lm-sensors), and disk usage. 'collectd' is not a script; it is written in C for performance and portability. It stays in the memory so there is no need to start up a heavy interpreter every time new values should be logged.
Friday, December 21, 2007
Packing Down Prototype
In the world of client-side development, page size definitely matters, and John-David Dalton is doing his best to improve it. A while back, he took on the task of working out how best to optimize the 120k+ Prototype and Script.aculo.us libraries and get the best compression out of the files. He's done an absolutely fabulous job in the past and continues to pack the libraries as new versions are released. He's just released a new version of his compressed Prototype collection on Google Code.
This pack contains the following compressed versions of Prototype: 1.4, 1.5, 1.5.1.1, 1.6.0 and Scriptaculous: 1.7.1_beta3, 1.8.0. This release compresses Prototype and Scriptaculous around 10% more than regular gzipping.
Prototype's lowest weighs in at 20.4kb.
Scriptaculous' lowest weighs in at 19.7kb.
Protoculous' lowest weighs in at 38.4kb.
Compressed forms of Scriptaculous are segmented to allow for custom builds of Protoculous/Scriptaculous.
He has included a common custom build with the pack, "Prototype+Effects", which weighs in at around 26kb. That's smaller than standard Prototype gzipped!
Protoculous also allows the loading of additional js files like Scriptaculous via: protoculous.js?load=one,two
You can download his collection here; the download includes a readme file providing more details about his build.
Thursday, December 20, 2007
Google Text Ad Subversion
There’s an interesting article over at ZDNet explaining that Google’s text ads are being subverted by trojans on people’s machines to get them to click on other people’s ads. It wasn’t clear what those ads were, exactly, but there you have it. I see this kind of thing as a clear path for future monetization, similar to how bad guys are adding extra form fields into forms via malware to gain more information about your identity. Very clever, and easy to do.
This is different from when Google’s ads were spreading malware, but it has the same basic purpose. Getting code on people’s machines is the best way to get control of the machine and ultimately make money off of it via spam, clicks, or whatever else they come up with.
Friday, December 14, 2007
Funny Books [Added new books]
English as a Second F*cking Language: How to Swear Effectively, Explained in Detail with Numerous Examples Taken From Everyday Life
Get Your Tongue Out of My Mouth, I'm Kissing You Good-bye!
When Your Phone Doesn't Ring, It'll Be Me
Even God Is Single, So Stop Giving Me A Hard Time
If You Can't Live Without Me, Why Aren't You Dead Yet?!
Babies and Other Hazards of Sex: How to Make a Tiny Person in Only 9 Months, with Tools You Probably Have around the Home
Dave Barry's Stay Fit and Healthy Until You're Dead
Women Are from Venus, Men Are from Hell
Monday, December 3, 2007
Funny Books
The Complete Idiot's Guide to Understanding Intelligent Design
Teaching Kids to Read for Dummies
If Singleness Is a Gift, What's the Return Policy?
If Men Are Like Buses, Then How Do I Catch One?
How to Be Happy Though Married
Old Tractors and the Men Who Love Them: How to Keep Your Tractors Happy and Your Family Running
Lightweight Sandwich Construction
The Art and Craft of Pounding Flowers: No Ink, No Paint, Just a Hammer
Art and Science of Dumpster Diving
The Big Book Of Lesbian Horse Stories
How to Succeed in Business without a Penis
Dining Posture in Ancient Rome
The Joy of Sex, Pocket Edition
Straight Talk About Surgical Penis Enlargement
Bullying and Sexual Harassment: A Practical Guide
Love Your Body, Extended Edition
You're Dead and You Don't Even Know It
People Who Don't Know They're Dead: How They Attach Themselves to Unsuspecting Bystanders and What to Do About It
- The Great Pantyhose Crafts Book
- Teach Yourself Sex
- Sex After Death
- Making it in leather
- Soil nailing
- The Madam as Entrepreneur: Career Management in House Prostitution
- So your wife came home speaking in tongues?: So did mine!
- Children Are Wet Cement
- Old Age: Its Cause & Prevention
- Correct Mispronunciations of Some South Carolina Names
- Mensa for Dummies
- Greek rural postmen and their cancellation numbers
- Life and laughter 'midst the cannibals
- Throw Me a Bone, What Happens When You Marry an Archaeologist
- Phone Calls from the Dead
- Reusing Old Graves
- Erections On Allotments - George W. Giles And Fred M. Osborn, Central Allotments Committee
- Play With Your Own Marbles - J.J. Wright, S.W Partridge, C.1865
- Proceedings Of The Second International Workshop On Nude Mice (1978)
- Suggestive Thoughts For Busy Workers
- How To Do Cups And Balls - The Vampire Press, 1946
- Premature Burial And How It May Be Prevented
- How To Cook Husbands; 1899
- The Gentle Art Of Cooking Wives; 1900
- The Disadvantages Of Being Dead
- The Brain-Workers' Handbook
- When All Else Fails, It's Time To Try...Frog Raising