Postgres binary_to_string and string_to_binary type casts

I'm using Postgres 8.2.4 with the postgres-0.7.9.2008.01.28 gem in a Rails 2.0.2 app. I have a model with a binary column and for the most part it works just fine. However, I have two test cases that fail: the data I put into the model comes back out truncated at the first null byte. (The older postgres-0.7.1 driver gives a slightly different, but still corrupt, result.)

After much debugging, I've discovered the problem. It lies in ActiveRecord::ConnectionAdapters::PostgreSQLColumn. The heuristic used to determine whether a column's value has been previously encoded by escape_bytea is not reliable.

In binary_to_string, it checks whether it should unescape a value by looking for the tell-tale sequence \nnn. However, in the two real-world test cases below, these sequences appear in the original data. So unescape_bytea gets called on a block of data that was never escaped in the first place, thereby mucking it up pretty badly.
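The heuristic looks roughly like this (a simplified sketch, not the exact adapter source):

    require 'postgres'

    # Simplified sketch of the adapter's heuristic -- not the exact Rails
    # source. A value is presumed to be escape_bytea output if it contains
    # any backslash followed by three digits.
    def binary_to_string(value)
      if value =~ /\\\d{3}/            # looks like "\nnn" octal escapes...
        PGconn.unescape_bytea(value)   # ...so unescape it. Wrong whenever
      else                             # raw, never-escaped data happens to
        value                          # contain a literal "\nnn".
      end
    end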

The two real-world test cases that triggered this bug's discovery can be found here: http://rubycloud.com/pg/base.tsz http://rubycloud.com/pg/1288.1024.jpg

In base.tsz, the sequence \690 appears at byte offset 64,688. In 1288.1024.jpg, the sequence \754 appears at byte offset 27,316.

It seems that we need a more reliable way to mark a data block as previously escaped by escape_bytea. This is a somewhat tricky problem. We could save a marker with the data and then remove it when unescaping, but this would mean that the persisted data would only be usable from within Rails, that the marker could never be changed, and that any other code working with the database would have to know about the trick. I've developed a patch that does this and it fixes my problem, but I haven't submitted it because it's an ugly solution.
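For reference, the shape of that marker trick is roughly as follows; the sentinel and method bodies here are invented for illustration, not the actual patch:

    # Illustrative sketch only -- the sentinel and structure are made up.
    MARKER = "AR_BYTEA_V1:"  # hypothetical sentinel, persisted with the data

    def string_to_binary(value)
      PGconn.escape_bytea(MARKER + value)  # marker is stored in the DB too
    end

    def binary_to_string(value)
      unescaped = PGconn.unescape_bytea(value)
      if unescaped[0, MARKER.length] == MARKER
        unescaped[MARKER.length..-1]   # value round-tripped through the DB
      else
        value                          # freshly assigned, never escaped
      end
    end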

Any other ideas?

Thanks, Matt

> Any other ideas?

I don't quite see why we need to check, but I don't really store blobs so I could be missing something obvious.

When we're storing strings into varchar columns we don't need to decide whether or not to escape the values; we simply escape a few key characters when sticking them into queries and leave the rest untouched?

Postgres has specific requirements for binary string storage. All non-printable characters need to be escaped with a \nnn sequence, where nnn is the octal representation of the character. See the PostgreSQL 8.2 documentation on Binary Data Types for details.
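To illustrate the octal form (a hand-rolled sketch of the idea only; real code should use the driver's escape_bytea):

    # Hand-rolled illustration of bytea "escape" format -- use the driver's
    # escape_bytea in real code. Non-printable bytes and backslashes become
    # \nnn octal sequences.
    def octal_escape(data)
      data.unpack('C*').map { |byte|
        if byte < 32 || byte > 126 || byte == 92  # control, high-bit, or backslash
          "\\%03o" % byte
        else
          byte.chr
        end
      }.join
    end

    octal_escape("a\000b\\c")  # => "a\\000b\\134c"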

So in the type cast code (binary_to_string and string_to_binary) it needs to check whether the value stored in the attributes hash has actually gone through this process, or is just some data that was set but hasn't been persisted yet.

Really, I think this whole problem could be solved if we knew the modified state of each column. If the value has been modified from the persisted value, we would only do the conversion one way (string to binary); otherwise we would only do it the other way (binary to string). Right now this is determined by checking for a \nnn sequence, which is unreliable.
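A sketch of how that could look, assuming a hypothetical @modified collection kept up to date by the attribute writers (AR does not track this today):

    # Sketch only: @modified is a hypothetical set of attribute names
    # assigned since the record was loaded.
    def read_binary(name)
      if @modified.include?(name)
        @attributes[name]   # assigned in Ruby: already raw bytes
      else                  # loaded from the DB: still in escaped form
        ActiveRecord::ConnectionAdapters::PostgreSQLColumn.binary_to_string(@attributes[name])
      end
    end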

Another (better?) approach would be to keep track of whether a particular attribute has gone through the type cast process, caching the casted value, and only calling the type cast code when necessary. This would be preferable anyway for expensive type casts such as this one, or complex geometry data types such as those supported by PostGIS.
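A rough shape for that, with hypothetical names:

    # Rough sketch, hypothetical names: cast lazily on first access and
    # memoize, so an expensive cast (bytea, PostGIS geometry) runs at most
    # once per attribute.
    def casted_attribute(name)
      @casted ||= {}
      return @casted[name] if @casted.has_key?(name)
      column = self.class.columns_hash[name]
      @casted[name] = column.type_cast(@attributes[name])
    end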

--Matt

> I don't quite see why we need to check, but I don't really store blobs so I could be missing something obvious.

PostgreSQL returns all data as strings (unless the entire result set is returned in binary format, which is not really practical in this case). A string is a sequence of characters, which may be different from a sequence of bytes.
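For example (assuming an open connection in conn; result-access details vary by driver version):

    # Every column value arrives as a Ruby String, regardless of its SQL type.
    res = conn.exec("SELECT 1 + 1, now()")
    res.getvalue(0, 0)  # => "2"         (a String, not a Fixnum)
    res.getvalue(0, 1)  # => "2008-..."  (a String, not a Time)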

So we need to do the conversion at some point; the only question is when and where to do it.

This is just like any other type conversion; binary data should not be a special case in AR.

> When we're storing strings into varchar columns we don't need to decide whether or not to escape the values; we simply escape a few key characters when sticking them into queries and leave the rest untouched?

That's not entirely true. If you try to put invalid byte sequences into a text field, you will get an error.
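For example, against a UTF8-encoded database (illustrative; the table name is hypothetical and the exact message varies by version):

    # With 8.2's non-standard-conforming strings, '\377' in the SQL text
    # becomes the byte 0xFF, which is not valid UTF-8.
    conn.exec("INSERT INTO notes (body) VALUES ('\\377')")
    # PGError: invalid byte sequence for encoding "UTF8": 0xff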

If it happens that you are not using a multibyte encoding, then a sequence of characters is (almost) equivalent to a sequence of bytes. But that is not true for all users.

Regards,
    Jeff Davis

I'm inclined to say that if we follow these two rules in AR, all the problems will go away:

1. Any value from the database is converted to the Ruby type that higher layers expect to see, *before* assigning it to the attribute.
2. Any value being written to the database is converted just before it's sent to the underlying database driver.

There's nothing special about binary data; there are many types that need correct conversion to/from the database.
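Sketched at the adapter boundary (method names here are hypothetical, not AR's API):

    # Rule 1: convert as the row comes off the wire, before assignment.
    def attributes_from_row(row)
      row.inject({}) do |attrs, (name, raw)|
        attrs[name] = columns_hash[name].type_cast(raw)  # DB string -> Ruby type
        attrs
      end
    end

    # Rule 2: convert just before the value is handed to the driver.
    def value_for_driver(name, value)
      column = columns_hash[name]
      column.type == :binary ? PGconn.escape_bytea(value) : quote(value, column)
    end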

However, I also like your idea, because it allows lazy conversions (which could be a win on large binary data). But the most important thing to me is correctness.

Regards,
    Jeff Davis