I'm using Postgres 8.2.4 with the postgres-0.7.9.2008.01.28 gem in a
Rails 2.0.2 app. I have a model with a binary column and for the most
part it works just fine. However, I have two test cases that fail. The
data I put into the model comes back out truncated to the first null
byte. (The older postgres-0.7.1 driver gives a slightly different, but
still corrupt result.)
After much debugging, I've discovered the problem. It lies in
ActiveRecord::ConnectionAdapters::PostgreSQLColumn. The heuristic used
to determine whether a column's value has been previously encoded by
escape_bytea is not reliable.
In binary_to_string, it checks whether it should unescape a value by
looking for the tell-tale sequence \nnn. However, in the two real-world
test cases below, these sequences appear in the original data.
So unescape_bytea gets called on a block of data that was not
previously escaped, thereby mucking it up pretty good.
In base.tsz, the sequence \690 appears at byte offset 64,688.
In 1288.1024.jpg, the sequence \754 appears at byte offset 27,316.
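To make the failure mode concrete, here is a rough sketch of what happens
when a value that was never escaped, but happens to contain a \nnn run, is
fed through an octal unescaper. The naive_unescape helper below is made up
for illustration and only mimics the relevant behaviour of unescape_bytea;
the adapter's actual heuristic and unescape code differ in detail.

  # Minimal mimic of bytea unescaping (illustration only, not the driver's code).
  def naive_unescape(value)
    value.gsub(/\\(\\|[0-7]{3})/) do
      $1 == '\\' ? '\\' : $1.oct.chr
    end
  end

  raw = "JPEG bytes...\\101...more bytes"  # never escaped, but contains "\101"
  naive_unescape(raw)                      # => "JPEG bytes...A...more bytes"
                                           # the three-character octal run collapses
                                           # into a single byte; the data is corrupted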
It seems that we need a more reliable way to mark a data block as
previously escaped by escape_bytea. This is a somewhat tricky problem.
We could save a marker with the data and then remove it when
unescaping it, but this would mean that the persisted data would only
be usable from within Rails, and that this marker could never be
changed, and that any other code that works with the database would
have to know about this trick. I've developed a patch that does this
and it fixes my problem, but haven't submitted it because it's an ugly
solution.
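For what it's worth, the marker approach can be sketched roughly like this.
This is not the actual patch: the sentinel string and method bodies are made
up, and it assumes the driver exposes escape_bytea/unescape_bytea the way the
adapter already expects.

  MARKER = "--ar-bytea--"  # made-up sentinel, persisted with the data

  def string_to_binary(data)            # before the value is written to the database
    PGconn.escape_bytea(MARKER + data)
  end

  def binary_to_string(value)           # when the attribute is read
    decoded = PGconn.unescape_bytea(value)
    if decoded[0, MARKER.length] == MARKER
      decoded[MARKER.length..-1]        # previously escaped: strip the marker
    else
      value                             # fresh, unescaped data: leave it alone
    end
  end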
I don't quite see why we need to check, but I don't really store blobs
so I could be missing something obvious.
When we're storing strings into varchar columns we don't need to
decide whether or not to escape the values; we simply escape a few
special characters when sticking them into queries and leave the rest
untouched?
Postgres has specific requirements for binary string storage. All
non-printable characters need to be escaped with a \nnn sequence, where
nnn is the octal representation of the character. See the PostgreSQL 8.2
documentation on binary data types (bytea) for details.
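As a rough illustration of that escaped format (this sketches only the single
level of bytea escaping; the driver's escape_bytea may additionally double
backslashes for use inside SQL string literals):

  # Map each byte to the escaped bytea text form described in the 8.2 docs.
  def to_bytea_escaped(data)
    data.unpack('C*').map do |byte|
      case byte
      when 92              then '\\\\'            # a literal backslash becomes two
      when 0..31, 127..255 then "\\%03o" % byte   # non-printable bytes become \nnn octal
      else byte.chr
      end
    end.join
  end

  to_bytea_escaped("a\000b")  # => "a\\000b" (the null byte is now the text "\000")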
So in the type cast code (binary_to_string and string_to_binary) it
needs to check whether the value stored in the attributes hash has
actually gone through this process, or is just some data that was set
but hasn't been persisted yet.
Really I think this whole problem could be solved if we knew the
modified state of each column. If the value has been modified from the
persisted value, then we would only do the conversion one way (string
to binary), otherwise we would only do it the other way (binary to
string). Right now this is determined by checking for a \nnn sequence,
which is unreliable.
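Something along these lines, for example (purely hypothetical names, not
ActiveRecord's API; the point is that the direction of the cast is driven by
a modified flag rather than by sniffing the data):

  class BinaryAttribute
    def initialize(raw_from_db)
      @value    = raw_from_db   # escaped bytea text, exactly as the database returned it
      @modified = false
    end

    def write(new_value)        # the application assigns fresh, unescaped data
      @value    = new_value
      @modified = true
    end

    def for_database(column)    # only escape values that were modified in memory
      @modified ? column.string_to_binary(@value) : @value
    end

    def for_application(column) # only unescape values that came from the database
      @modified ? @value : column.binary_to_string(@value)
    end
  end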
Another (better?) approach would be to keep track of whether a
particular attribute has gone through the type cast process, caching
the casted value, and only calling the type cast code when necessary.
This would be preferable anyway for expensive type casts such as this
one, or complex geometry data types such as those supported by
PostGIS.
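A sketch of that caching idea, again with made-up method names rather than
real ActiveRecord internals:

  # Cast each binary attribute at most once and remember the result.
  def read_binary_attribute(name, column)
    @casted_attributes ||= {}
    @casted_attributes[name] ||= column.binary_to_string(@attributes[name])
  end

  def write_binary_attribute(name, value)
    @casted_attributes ||= {}
    @casted_attributes[name] = value   # already a plain Ruby string, no cast needed
    @attributes[name] = value
  end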
I don't quite see why we need to check, but I don't really store blobs
so I could be missing something obvious.
PostgreSQL returns all data as strings (unless returning the entire
result set in binary format, which is not really practical in this
case). A string is a sequence of characters, which may be different
from a sequence of bytes.
So, we need to do the conversion at some point, the only question is
when and where to do that conversion.
This is just like any other type conversion; binary conversion should
not be a special case in AR.
When we're storing strings into varchar columns we don't need to
decide whether or not to escape the values; we simply escape a few
special characters when sticking them into queries and leave the rest
untouched?
That's not entirely true. If you try to put invalid byte sequences
into a text field, you will get an error.
If it happens that you are not using an encoding, then a sequence of
characters is (almost) equivalent to a sequence of bytes. But that is
not true for all users.
I'm inclined to say that if we follow these two rules in AR:
1. Any value from the database is converted to the ruby type that
higher layers expect to see, *before* assigning it to the attribute.
2. Any value being written to the database is converted just before
it's sent to the underlying database driver.
If we do, all the problems will go away. There's nothing special about
binary data; there are many types that need correct conversion to/from
the database.
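In other words, roughly (hypothetical helper methods, not the adapter's real
ones; escape_bytea/unescape_bytea stand in for whatever the driver provides):

  def attribute_from_database(column, raw_value)
    # Rule 1: convert to the Ruby type before the attribute is ever visible.
    column.type == :binary ? unescape_bytea(raw_value) : raw_value
  end

  def value_for_database(column, value)
    # Rule 2: convert just before the value is handed to the driver.
    column.type == :binary ? escape_bytea(value) : value
  end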
However, I also like your idea, because it allows lazy conversions
(which could be a win on large binary data). But the most important
thing to me is correctness.