Data streams, part 1
American Scientist has an article in their latest issue that talks about the algorithmic difficulties of processing streams of data. It begins by talking about how search engines identify the most popular people (“The Britney Spears problem”), and then goes on to use simple number streams to explain the subject.
I was making progress in reading the article, when I got sidetracked by this :
Although identifying the most frequent item in a stream is hard in general, there is an ingenious way of doing it in one special case—namely, when the most common item is so popular that it accounts for a majority of the stream entries (more than half the elements).
The algorithm that accomplishes this task requires just two registers, and it runs in a constant amount of time per stream element. (Before reading on you might want to try constructing such an algorithm for yourself.)
This is where I put down the magazine, opened SciTE, and wrote this :
require 'stream_algorithm' reader = StreamReader.new reader.check :total_amount, :biggest_element stream = NumberStream.new( reader ).start
The code for the required ’stream_algorithm’ file is at the end of this post. When I run this, I get output that looks like this :
Starting streamreader...
Stream running!
Reading : 8
Reading : 8
Reading : 6
Reading : 4
Reading : 5
Reading : 8
Reading : 2
Reading : 6
Reading : 2
Reading : 8
Stream stopped.
Total amount : 10
Biggest element : 8
Cool. Now I’ve got a simple way of testing and playing with the concepts in the article. Next I’ll try to make the stream reader keep track of the most frequent number using only two numerical variables.
To be continued.
—
Here’s the code for the NumberStream and StreamReader classes.
#stream_algorithm.rb
def force string
puts string
STDOUT.flush
end
class NumberStream
attr_reader :length, :reader
def initialize reader, length=10
@reader = reader
@length = length
end
def start
force "Stream running!"
1.upto(length) { next_number }; stop
end
def stop
force "Stream stopped."
reader.display_results
end
def next_number
reader.send rand( 10 )
sleep 0.1
end
end
class StreamReader
attr_accessor :to_check
def initialize
force "Starting streamreader..."
end
def send element
force "Reading : #{element}"
process element
end
def check *args
@to_check = args
@to_check.each { |thing| eval "@#{thing} = 0" }
end
def process element
if to_check.include? :total_amount
@total_amount += 1
end
if to_check.include? :biggest_element
@biggest_element = element if @biggest_element < element
end
end
def display_results
to_check.each do |thing|
display_name = thing.to_s.capitalize.gsub( /_/, ' ' )
variable = eval "@#{thing}"
force "#{display_name} : #{variable}"
end
end
end
1 Comment »
Leave a comment
-
Archives
- August 2008 (6)
- July 2008 (13)
- June 2008 (6)
-
Categories
-
RSS
Entries RSS
Comments RSS


[...] streams, part 2 In a previous post I mentioned I was working on a program to test the data stream concepts from this article. I [...]
Pingback by Data streams, part 2 « Natural Code | July 15, 2008 |