Natural Code

Code, science and politics.

Data streams, part 2

In a previous post I mentioned I was working on a program to test the data stream concepts from this article. I finished it and implemented the method for finding the most frequent element in a stream.

It keeps track of the most frequent element in a stream of letters or numbers using only two integer variables. It’s not guaranteed to be accurate, but it’s a constant-space algorithm that can work on a stream of indefinite length, and gives good results as long as the most frequent element makes up more than half of the total elements in the stream.

It’s more a small library than a program, actually. I just use it like this :

require 'stream_algorithm'

Which calls up this file. Next, I make a Stream Reader which I store in a variable…

reader = StreamReader.new.check :biggest_element,
:most_frequent
>Starting StreamReader...

…and tell it to check for the biggest element in the stream (takes only one integer variable to track), as well as the most frequent element in the stream (takes only two integer/string variables but isn’t always reliable).

NumberStream.new( reader, :length => 10,
:skew => {:toward => 2, :percent => 55} ).start

This creates a Number Stream, assigns it the Stream Reader, gives the stream a length of 10 numbers, and skews the odds so that 55% of the time, the stream will contain 2 instead of a random number. The stream is then started.

The output from this part of the program looks like this :

 >NumberStream running!
 >Reading : 3  3 x 1
 >Reading : 1  3 x 0
 >Reading : 2  2 x 1
 >Reading : 7  2 x 0
 >Reading : 2  2 x 1
 >Reading : 7  2 x 0
 >Reading : 2  2 x 1
 >Reading : 2  2 x 2
 >Reading : 2  2 x 3
 >Reading : 2  2 x 4
 >Stream stopped.
 >Biggest element : 7
 >Most frequent : 2
 

I can do the same with letters :

reader.check :total_amount, :most_frequent

stream = LetterStream.new( reader, :length => 10,
:skew => {:toward => 'i', :percent => 45} ).start

The Reader’s check method resets its instance variables, so it’s ready to be used on a new stream. I create a new Letter Stream, and it gives me something like this :

>LetterStream running!
>Reading : h  h x 1
>Reading : i  h x 0
>Reading : i  i x 1
>Reading : z  i x 0
>Reading : i  i x 1
>Reading : i  i x 2
>Reading : i  i x 3
>Reading : i  i x 4
>Reading : t  i x 3
>Reading : c  i x 2
>Stream stopped.
>Total amount : 10
>Most frequent : i

And finally, I can do something like…

10.times { stream.start }

…and it’ll make the stream run to its length ten times while conserving state between iterations.

I thought it was fun, anyways.

July 15, 2008 Posted by naturalcode | Technology | , , , | No Comments Yet

Data streams, part 1

American Scientist has an article in their latest issue that talks about the algorithmic difficulties of processing streams of data. It begins by talking about how search engines identify the most popular people (“The Britney Spears problem”), and then goes on to use simple number streams to explain the subject.

I was making progress in reading the article, when I got sidetracked by this :

Although identifying the most frequent item in a stream is hard in general, there is an ingenious way of doing it in one special case—namely, when the most common item is so popular that it accounts for a majority of the stream entries (more than half the elements).

The algorithm that accomplishes this task requires just two registers, and it runs in a constant amount of time per stream element. (Before reading on you might want to try constructing such an algorithm for yourself.)

This is where I put down the magazine, opened SciTE, and wrote this :

require 'stream_algorithm'

reader = StreamReader.new
reader.check :total_amount, :biggest_element

stream = NumberStream.new( reader ).start

The code for the required ’stream_algorithm’ file is at the end of this post. When I run this, I get output that looks like this :

Starting streamreader...
Stream running!
Reading : 8
Reading : 8
Reading : 6
Reading : 4
Reading : 5
Reading : 8
Reading : 2
Reading : 6
Reading : 2
Reading : 8
Stream stopped.
Total amount : 10
Biggest element : 8

Cool. Now I’ve got a simple way of testing and playing with the concepts in the article. Next I’ll try to make the stream reader keep track of the most frequent number using only two numerical variables.

To be continued.

Here’s the code for the NumberStream and StreamReader classes.

#stream_algorithm.rb
def force string
  puts string
  STDOUT.flush
end

class NumberStream
  attr_reader :length, :reader

  def initialize reader, length=10
    @reader = reader
    @length = length
  end

  def start
    force "Stream running!"

    1.upto(length) { next_number }; stop
  end

  def stop
    force "Stream stopped."
    reader.display_results
  end

  def next_number
    reader.send rand( 10 )
    sleep 0.1
  end
end

class StreamReader
  attr_accessor :to_check

  def initialize
    force "Starting streamreader..."
  end

  def send element
    force "Reading : #{element}"
    process element
  end

  def check *args
    @to_check = args
    @to_check.each { |thing| eval "@#{thing} = 0" }
  end

  def process element
    if to_check.include? :total_amount
      @total_amount += 1
    end

    if to_check.include? :biggest_element
      @biggest_element = element if @biggest_element < element
    end
  end

  def display_results
    to_check.each do |thing|
      display_name = thing.to_s.capitalize.gsub( /_/, ' ' )
      variable = eval "@#{thing}"

      force "#{display_name} : #{variable}"
    end
  end
end

July 6, 2008 Posted by naturalcode | Uncategorized | , , , | 1 Comment