persisting complex (and possibly large) objects

This question may be so obvious that even Marnen won't answer it... :slight_smile:

I need to store a set of relatively complex objects in the db -- in my
case, these are discrete cumulative distribution functions: they take a
long time to construct, but I want the lookup to be fast. Since the
lookup is essentially a binary search, a B-Tree or some variant would be
a sensible data structure.

I have a few choices: (a) store entire YAML'd B-Trees, (b) store a simple
data structure (e.g. arrays of numeric pairs) and spend time
constructing a B-Tree from it, or (c) choose another approach that
someone on this list points out.

Can this forum offer any advice on size / speed tradeoffs for YAML'd
objects? If unpacking a YAML'd object is fast, then (a), storing the
entire B-Tree is probably the best approach. If slow, then perhaps I'm
better off with (b), storing a minimal data structure and reconstructing
the B-Tree when I need it.

Thoughts?

- ff

Fearless Fool wrote in post #965610:

This question may be so obvious that even Marnen won't answer it... :slight_smile:

I can't resist a challenge like that. :smiley:

I need to store a set of relatively complex objects in the db -- in my
case, these are discrete cumulative distribution functions: they take a
long time to construct, but I want the lookup to be fast. Since the
lookup is essentially a binary search, a B-Tree or some variant would be
a sensible data structure.

I have a few choices: (a) store entire YAML'd B-Trees, (b) store a simple
data structure (e.g. arrays of numeric pairs) and spend time
constructing a B-Tree from it, or (c) choose another approach that
someone on this list points out.

How about (c)? Nested sets are generally a great way of storing arbitrary
trees in SQL databases for easy retrieval. The awesome_nested_set
plugin makes this easy to do in Rails.
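Roughly, the shape is something like this (untested sketch; Node is just a
placeholder model name, and the plugin expects parent_id, lft, and rgt
integer columns on the table):

    class Node < ActiveRecord::Base
      acts_as_nested_set   # provided by awesome_nested_set
    end

    # One query pulls back a node plus its entire subtree, in tree order:
    root = Node.roots.first
    root.self_and_descendants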

If you're unfamiliar with nested sets, I recommend
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html as a
good overview.

Or do I misunderstand? What's the nature of the tree you'd like to
store?

Can this forum offer any advice on size / speed tradeoffs for YAML'd
objects? If unpacking a YAML'd object is fast, then (a), storing the
entire B-Tree is probably the best approach. If slow, then perhaps I'm
better off with (b), storing a minimal data structure and reconstructing
the B-Tree when I need it.

Speed may not be your immediate concern with YAML: the big problem with
that approach is that it bypasses the structure of the database and
makes the data very difficult to query. It's a decent last resort, but
it *is* a last resort.
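That said, if you want to settle the speed question empirically before
deciding, it's easy enough to measure. A rough sketch, with `tree` standing
in for one of your constructed B-Tree objects:

    require 'yaml'
    require 'benchmark'

    # How big does one serialized object get, and how long does
    # rehydrating it take?
    blob = YAML.dump(tree)
    puts "serialized size: roughly #{blob.size} bytes"
    puts Benchmark.measure { 100.times { YAML.load(blob) } }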

Thoughts?

- ff

Best,

Marnen

I've an inkling this is out of my comfort zone, but my initial
thoughts (feel free to totally disregard...) are:

a) serialize (rough sketch below)? or is that what you meant by storing a
"YAML'd B-Tree"?
b) if these are complex objects that need to be stored in the DB, can
you not use a DB designed to store complex objects (like Mongo)?
c) Marnen's nested set sounds like a potentially good solution too
(given what I think I understand from your description...)
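For (a), what I had in mind is ActiveRecord's built-in serialization; a
sketch, with a made-up Distribution model and a text column named cdf:

    class Distribution < ActiveRecord::Base
      serialize :cdf   # stored as YAML text, rebuilt into the object on load
    end

    d = Distribution.create!(:cdf => my_btree)  # my_btree: one of your built objects
    Distribution.find(d.id).cdf                 # comes back as the original object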

Fearless Fool wrote in post #965610:

This question may be so obvious that even Marnen won't answer it... :slight_smile:

I need to store a set of relatively complex objects in the db -- in my
case, these are discrete cumulative distribution functions: they take a
long time to construct, but I want the lookup to be fast. Since the
lookup is essentially a binary search, a B-Tree or some variant would be
a sensible data structure.

I have a few choices: (a) store entire YAML'd B-Trees, (b) store a simple
data structure (e.g. arrays of numeric pairs) and spend time
constructing a B-Tree from it, or (c) choose another approach that
someone on this list points out.

- ff

Tough to call... What's your intended usage scenario?

Depending on the resolution required from the CDF, could you just store
built distributions in a table and query against it?

With normalized inputs and the right indices, the query should be fast,
and trading off record count versus accuracy, you could always do a
linear interpolation between any two stored points. Need higher
interpolated accuracy? Then increase the resolution of stored points to
minimize the interpolated gap.
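Something along these lines, completely untested, with a made-up CdfPoint
model holding (table_id, x, y) rows:

    def cdf_lookup(x, table_id)
      points = CdfPoint.where(:table_id => table_id)
      lo = points.where("x <= ?", x).order("x DESC").first
      hi = points.where("x >= ?", x).order("x ASC").first
      return hi.y if lo.nil?                  # below the first stored point
      return lo.y if hi.nil? || hi.x == lo.x  # above the last one, or an exact hit
      lo.y + (x - lo.x) * (hi.y - lo.y) / (hi.x - lo.x)  # interpolate between neighbors
    end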

Ugh... I gave up the practice of sadistics years ago, and now you're
making me remember it... Curse you Red Baron!

This may surprise the forum, but Marnen is right :slight_smile: :slight_smile: -- it makes
sense to simply structure the CDFs as SQL queries.

I know how to write CDF functions in a procedural language but hadn't yet
attempted them in SQL; @railsdog's post reminded me that it shouldn't be
too hard (thank you).

CDF functions are easily implemented as tables of linear segments -- I
can store the vertices and the slopes, so linear interpolation between
the breakpoints will be fast.

And -- doh! -- a google search for "cumulative distribution function
sql" gives me a real jump start. Why didn't I think of that sooner?

Thanks, y'all.

- ff

@marnen, @railsdog: I owe you a big thank you. Once I realized that
looking up the CDF was really just piecewise linear approximation, then
casting it into pure SQL was easy. In fact, it gave me a chance to test
out the new AREL-based query interface.

If anyone is interested, it's surprisingly simple: Assume you store x0,
y0, and slope for each line segment like this:

    ActiveRecord::Schema.define do
      create_table(:line_segments, :force => true) do |t|
        t.integer :table_id   # which CDF this segment belongs to
        t.float   :x0         # left endpoint of the segment
        t.float   :y0         # CDF value at x0
        t.float   :m          # slope of the segment
      end
    end

The AREL-style query (replete with linear interpolation) is simply:

    def lookup(x, table_id)
      seg = LineSegment.where(:table_id => table_id).
                        where("x0 < ?", x).order('x0 DESC').limit(1).first
      seg.y0 + (x - seg.x0) * seg.m   # interpolate along the segment
    end

If you insert "bumpers" at each end of your table (an entry with x0 =
LARGEST_NEGATIVE_FLOAT and an entry with x0 = max X), then you don't
even need additional logic to handle out-of-bounds x values.
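Seeding those bumpers is just two extra rows per table (numbers here are
made up; the second x0 is whatever your largest real breakpoint is):

    # Flat segment covering everything below the first real breakpoint...
    LineSegment.create!(:table_id => 1, :x0 => -Float::MAX, :y0 => 0.0, :m => 0.0)
    # ...and a flat segment at y = 1.0 covering everything above the last one.
    LineSegment.create!(:table_id => 1, :x0 => 42.0, :y0 => 1.0, :m => 0.0)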

Thanks for the nudge. I'm getting more comfortable thinking in SQL.

- ff