persisting complex (and possibly large) objects

This question may be so obvious that even Marnen won't answer it... :slight_smile:

I need to store a set of relatively complex objects in the db -- in my case, these are discrete cumulative distribution functions: they take a long time to construct, but I want the lookup to be fast. Since the lookup is essentially a binary search, a B-Tree or some variant would be a sensible data structure.

I have choices: (a) store entire YAML'd B-Trees, (b) store a simple data structure (e.g. arrays of numeric pairs) and spend time constructing a B-Tree from it, or (c) take another approach that someone on this list points out.

Can this forum offer any advice on size / speed tradeoffs for YAML'd objects? If unpacking a YAML'd object is fast, then (a), storing the entire B-Tree is probably the best approach. If slow, then perhaps I'm better off with (b), storing a minimal data structure and reconstructing the B-Tree when I need it.
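
Just to make option (a) concrete, here's the kind of thing I have in mind, using ActiveRecord's serialize (the Distribution model and cdf_data column are made up purely for illustration):

    # Option (a): let ActiveRecord YAML-serialize the whole structure into a text column.
    class Distribution < ActiveRecord::Base
      serialize :cdf_data   # dumped to YAML on save, rebuilt on load
    end

    d = Distribution.create!(:cdf_data => [[0.0, 0.0], [1.5, 0.4], [3.0, 1.0]])
    Distribution.find(d.id).cdf_data  # => the deserialized array of pairs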

Thoughts?

- ff

Fearless Fool wrote in post #965610:

> This question may be so obvious that even Marnen won't answer it... :slight_smile:

I can't resist a challenge like that. :smiley:

> I need to store a set of relatively complex objects in the db -- in my case, these are discrete cumulative distribution functions: they take a long time to construct, but I want the lookup to be fast. Since the lookup is essentially a binary search, a B-Tree or some variant would be a sensible data structure.
>
> I have choices: (a) store entire YAML'd B-Trees, (b) store a simple data structure (e.g. arrays of numeric pairs) and spend time constructing a B-Tree from it, or (c) take another approach that someone on this list points out.

How about C? Nested sets are generally a great way of storing arbitrary trees in SQL databases for easy retrieval. The awesome_nested_set plugin makes this easy to do in Rails.

If you're unfamiliar with nested sets, I recommend http://dev.mysql.com/tech-resources/articles/hierarchical-data.html as a good overview.
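
In case it helps, a very rough sketch of what awesome_nested_set usage looks like (the Node model and its name column are hypothetical; the plugin expects parent_id, lft, and rgt integer columns on the table):

    class Node < ActiveRecord::Base
      acts_as_nested_set   # provided by awesome_nested_set
    end

    root  = Node.create!(:name => "root")
    child = Node.create!(:name => "child")
    child.move_to_child_of(root)    # maintains the lft/rgt bounds for you

    root.reload.descendants         # fetches the whole subtree in a single query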

Or do I misunderstand? What's the nature of the tree you'd like to store?

> Can this forum offer any advice on size / speed tradeoffs for YAML'd objects? If unpacking a YAML'd object is fast, then (a), storing the entire B-Tree is probably the best approach. If slow, then perhaps I'm better off with (b), storing a minimal data structure and reconstructing the B-Tree when I need it.

Speed may not be your immediate concern with YAML: the big problem with that approach is that it bypasses the structure of the database and makes the data very difficult to query. It's a decent last resort, but it *is* a last resort.

Best,

I've an inkling this is out of my comfort zone, but my initial thoughts (feel free to totally disregard...) are:

a) serialize? Or is that what you meant by storing a "YAML'd B-Tree"?
b) If these are complex objects that need to be stored in the DB, could you use a DB designed to store complex objects (like Mongo)?
c) Marnen's nested set sounds like a potentially good solution too (given what I think I understand from your description...)

Fearless Fool wrote in post #965610:

> This question may be so obvious that even Marnen won't answer it... :slight_smile:
>
> I need to store a set of relatively complex objects in the db -- in my case, these are discrete cumulative distribution functions: they take a long time to construct, but I want the lookup to be fast. Since the lookup is essentially a binary search, a B-Tree or some variant would be a sensible data structure.
>
> I have choices: (a) store entire YAML'd B-Trees, (b) store a simple data structure (e.g. arrays of numeric pairs) and spend time constructing a B-Tree from it, or (c) take another approach that someone on this list points out.

Tough to call... What's your intended usage scenario?

Depending on the resolution required from the CDF, could you just store the built distributions in a table and query against that?

With normalized inputs and the right indices, the query should be fast, and by trading off record count against accuracy, you could always do a linear interpolation between adjacent stored points. Need higher accuracy from the interpolation? Increase the resolution of the stored points to shrink the gap you're interpolating across.
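
Something along these lines is what I'm picturing (purely a sketch: the CdfPoint model, its table_id/x/y columns, and the 0.0/1.0 fallbacks for a CDF are all assumptions):

    class CdfPoint < ActiveRecord::Base; end   # pre-computed (x, y) samples of each CDF

    def cdf(x, table_id)
      lo = CdfPoint.where(:table_id => table_id).where("x <= ?", x).order("x DESC").first
      hi = CdfPoint.where(:table_id => table_id).where("x > ?",  x).order("x ASC").first
      return 0.0 if lo.nil?   # x is below the stored range
      return 1.0 if hi.nil?   # x is above the stored range
      lo.y + (x - lo.x) * (hi.y - lo.y) / (hi.x - lo.x)   # linear interpolation
    end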

Ugh... I gave up the practice of sadistics years ago, and now you're making me remember it... Curse you Red Baron!

This may surprise the forum, but Marnen is right :slight_smile: :slight_smile: -- it makes sense to simply structure the CDFs as SQL queries.

I know how to write CDF lookups in a procedural language but had never attempted them in SQL; @railsdog's post reminded me that it won't be too hard (thank you).

CDF functions are easily implemented as tables of linear segments -- I can store the vertices and the slopes, so linear interpolation between the breakpoints will be fast.
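
Roughly what I mean for the build step (assuming a hypothetical LineSegment model with table_id, x0, y0, and m columns, and points as an x-sorted array of [x, y] pairs):

    # Turn sorted CDF sample points into line segments: each row stores the
    # segment's starting vertex (x0, y0) and its slope m.
    points.each_cons(2) do |(x0, y0), (x1, y1)|
      LineSegment.create!(:table_id => table_id,
                          :x0 => x0, :y0 => y0,
                          :m  => (y1 - y0) / (x1 - x0))
    end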

And -- doh! -- a google search for "cumulative distribution function sql" gives me a real jump start. Why didn't I think of that sooner?

Thanks, y'all.

- ff

@marnen, @railsdog: I owe you a big thank you. Once I realized that looking up the CDF was really just piecewise linear interpolation, casting it into pure SQL was easy. In fact, it gave me a chance to test out the new Arel-based query interface.

If anyone is interested, it's surprisingly simple. Assume you store x0, y0, and slope (m) for each line segment like this:

    ActiveRecord::Schema.define do
      create_table(:line_segments, :force => true) do |t|
        t.integer :table_id
        t.float :x0
        t.float :y0
        t.float :m
      end
    end

The Arel-style query (replete with linear interpolation) is simply:

    def lookup(x, table_id)
      seg = LineSegment.where(:table_id => table_id).where("x0 <= ?", x).order("x0 DESC").first
      seg.y0 + (x - seg.x0) * seg.m
    end

If you insert "bumpers" at each end of your table (a zero-slope entry with x0 set to the most negative float and y0 = 0, and a zero-slope entry at the largest breakpoint with y0 = 1), then you don't even need additional logic to handle out-of-bounds x values.
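
Concretely, the bumper rows might look like this (x_max here stands for the largest breakpoint, and the 0.0/1.0 values assume these are CDFs, so anything below the support maps to 0 and anything above maps to 1):

    # "Bumper" rows: zero-slope segments that catch out-of-range x values.
    LineSegment.create!(:table_id => table_id, :x0 => -Float::MAX, :y0 => 0.0, :m => 0.0)
    LineSegment.create!(:table_id => table_id, :x0 => x_max,       :y0 => 1.0, :m => 0.0)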

Thanks for the nudge. I'm getting more comfortable thinking in SQL.

- ff