I'm not sure how you get 34000 records from 6000 words. Surely you
should get fewer, not more.
it would run for at least a minute... that's kind of weird...
Are you sure it is just the above code that is taking a minute and not
the bit of code that counts the words?
What hardware is this on exactly?
it is 34000 records because it is actually counting phrases instead of
words... for example, for "a quick brown fox jumps over the lazy dog"
it will actually count phrases such as
"a quick"
"a quick brown"
"a quick brown fox"
etc... so the final result is 34000 hash entries, mapping phrases to
frequencies and word counts.
the hardware should be pretty good... it is an HP TouchSmart IQ804...
with a Core 2 Duo and an ST3320820AS hard drive, which is 7200rpm with
an 8MB buffer.
hm... makes me wonder: if the db is just 4MB, how come the hard drive's
8MB buffer didn't handle it entirely in RAM and make it super fast? did
it actually force a write to the physical disc?
Phrase.transaction do
  frequencies.each do |phrase, freq|
    Phrase.create!(:s => phrase, :frequency => freq)
  end
end
Hash#each passes keys and values to your block and avoids unnecessary
lookups.
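To make the point concrete, here is a small self-contained sketch (with made-up sample data) of the difference between Hash#each, which yields each key/value pair directly, and iterating over the keys and looking every value up again:

```ruby
# Hash#each yields each key/value pair directly -- no second lookup.
frequencies = { "a quick" => 2, "a quick brown" => 1 }  # sample data

pairs = []
frequencies.each do |phrase, freq|   # one pass, no extra hash lookup
  pairs << [phrase, freq]
end

# Compare: iterating over keys forces a second lookup per entry.
pairs_slow = frequencies.keys.map { |k| [k, frequencies[k]] }
```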
My snippet above doesn't take into account your :length =>
lengths[phrase], and that is as it should be. Presumably, lengths[phrase]
== phrase.length. Then

class Phrase < ActiveRecord::Base
  attr_protected :length
  ...
  def s=(value)
    self.length = value.length
    write_attribute(:s, value)  # don't forget to store the phrase itself
  end
end

would be much cleaner code because it puts responsibility for setting
the length attribute where it belongs.
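The same overridden-writer idea can be seen outside ActiveRecord. A plain-Ruby sketch (the class name is made up for illustration) shows how the writer for s becomes the single place that keeps the derived length attribute in sync:

```ruby
# Plain-Ruby sketch of the overridden-setter pattern above: assigning s
# also sets the derived length attribute, in one place.
class PhraseRecord                 # hypothetical stand-in for the AR model
  attr_reader :s, :length

  def s=(value)
    @length = value.length         # derived attribute kept in sync
    @s = value
  end
end

rec = PhraseRecord.new
rec.s = "a quick brown"
```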
I changed it to

time_start = Time.now
Phrase.transaction do
  all_phrases.each do |phrase|
    recordPhrase = Phrase.create!(:s => phrase, :frequency => frequencies[phrase], :length => lengths[phrase])
  end
end
p "took ", Time.now - time_start, " seconds to finish"

but it is still the same: it took 75 seconds...
I wanted the length, which is the count of words, because in the future
I might want to do a query such as "select * where length > 3", and if
I had to count the words by getting "s" first, it would be really slow,
wouldn't it? if length is stored and indexed, then "length > 3" can be
super fast?
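Since length here is the number of words, it can be computed once at insert time with String#split rather than recomputed at query time. A minimal sketch (all_phrases is assumed from the surrounding code, shown here with sample data):

```ruby
# length is the word count of a phrase; split on whitespace and count.
phrase = "a quick brown fox"
word_count = phrase.split.size      # whitespace-separated words

# Building the lengths hash from the phrases themselves:
all_phrases = ["a quick", "a quick brown", "a quick brown fox"]  # sample
lengths = {}
all_phrases.each { |p| lengths[p] = p.split.size }
```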
> time_start = Time.now
> Phrase.transaction do
>   all_phrases.each do |phrase|
>     recordPhrase = Phrase.create!(:s => phrase, :frequency => frequencies[phrase], :length => lengths[phrase])
>   end
> end
> p "took ", Time.now - time_start, " seconds to finish"
>
> but it is still the same: it took 75 seconds...
That's because it's still really the same code!
> I wanted the length, which is the count of words, because in the future
> I might want to do a query such as "select * where length > 3", and if
> I had to count the words by getting "s" first, it would be really slow,
> wouldn't it? if length is stored and indexed, then "length > 3" can be
> super fast?
Just because it has an index on it doesn't automatically make it super fast. It depends largely on the selectivity of the length field. I suspect that most of your phrases are larger than 3 characters and so such a query will result in a full table scan anyway. In that case making the table smaller by leaving out the length may actually make it faster. However, selecting all the phrases where length < 5 would almost certainly use the index and make it faster. Only you know what you are likely to do in general here so you need to decide (and test) whether it is better to have the index and length column or not.
You can always do "select * from phrases where length(s) > 3" or something like that.
Are you sure your overall run time is not limited by CPU rather than IO? How much CPU time is used to run your code?
oh sorry, I should have clarified... the length is actually the count of words... so length > 3 means at least 3 words. or I might do a query such as length > 5 or 7... so will having a column and an index make it fast if I do a query such as length > 7? I intend, for example, to get back only 5% or 3% of all records, so the index might make it fast.
Ah, yes, that makes it different
by the way, the 75 seconds is the time from the moment of the
Phrase.transaction do
to the end of it... so... I think during that time it is mostly IO time...
OK, just need to look at the CPU time used then.
Is this Windows or Linux or something else?
let me actually write up a test case with some dummy data to simulate it so that everyone is on the same ground...
That would be the best of all. I can run your code with your sample data against MySQL and Postgres to give you some ideas. I don't really know much about sqlite, but I must admit I'm curious as to where the time has gone in such an apparently simple situation.
Don't forget to include the database definition and data as well as the code.
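For reference, the table and index being discussed would come from a migration along these lines (a sketch in Rails 2.x style; the column names are taken from the code in this thread, the rest is an assumption):

```ruby
# Sketch of a migration matching the thread's code: a phrases table with
# s, frequency and length, plus an index on length for "length > N" queries.
class CreatePhrases < ActiveRecord::Migration
  def self.up
    create_table :phrases do |t|
      t.string  :s
      t.integer :frequency
      t.integer :length    # word count
    end
    add_index :phrases, :length
  end

  def self.down
    drop_table :phrases
  end
end
```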
ok, the following code with 6000 records inserted takes 13.3 seconds to finish (just for the db portion). if I change 6000 to 30000 then it takes 67 seconds.
OK, I created the table and the index and ran your code for 30000 records, but I wrapped the database part in a Benchmark.measure{}
That's 33 seconds of CPU time with 37 seconds elapsed on a quad core 2.4 GHz with a Seagate Barracuda SATA drive with 32MB cache.
As you can see, it's pretty much all in CPU time!!!!
The result was essentially the same in Postgres and MS SQL server!
So you can forget about the database itself. None of the database engines were unduly taxed by the test.
Just for fun I changed the program to output the data as SQL INSERT statements and then run that (with 30000 inserts wrapped in a transaction) against MySQL.
Imported in 1.2 seconds!!
I don't know if it is the hash lookup code or ActiveRecord that is gobbling up the time, but it certainly isn't the database.
You'll need to tinker with, or better, profile your code to find out what is sucking up the time.
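The stdlib Benchmark module is enough for a first cut at this: Benchmark.measure reports user/system CPU time separately from elapsed (wall-clock) time, which is exactly the CPU-bound vs IO-bound distinction in question. A minimal sketch (the timed block is a stand-in for the insert loop):

```ruby
require 'benchmark'

# Benchmark.measure returns a Benchmark::Tms with CPU time (utime/stime)
# and elapsed time (real) as separate fields.
tms = Benchmark.measure do
  100_000.times { |i| i * i }     # stand-in for the insert loop
end

cpu     = tms.utime + tms.stime   # CPU seconds consumed
elapsed = tms.real                # wall-clock seconds
# If cpu is close to elapsed, the code is CPU-bound, not waiting on IO.
```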
> I don't know if it is the hash lookup code or ActiveRecord that is
> gobbling up the time, but it certainly isn't the database.
by the way... interesting to find out the CPU time is so high... I
thought this code was I/O bound at first...
by the way, I am using Rails 2.3.2, Ruby 1.8.6 patchlevel 287... on
Windows 7.
hm... the Hash lookup time should be really small... for example, if I
just write all the data to a text file, it should be really fast... it
might be ActiveRecord, although I thought ActiveRecord merely translates
the method call into a SQL statement and so shouldn't be so CPU intensive.
> I don't know if it is the hash lookup code or ActiveRecord that is
> gobbling up the time, but it certainly isn't the database.
> by the way... interesting to find out the CPU time is so high... I
> thought this code was I/O bound at first...
> by the way, I am using Rails 2.3.2, Ruby 1.8.6 patchlevel 287... on
> Windows 7.
> hm... the Hash lookup time should be really small... for example, if I
> just write all the data to a text file, it should be really fast... it
> might be ActiveRecord, although I thought ActiveRecord merely
> translates the method call into a SQL statement and so shouldn't be so
> CPU intensive.
Aha.... It looks like ActiveRecord has an enormous overhead in creating/saving records.
If you change the inserts to this....
puts Benchmark.measure {
  Phrase.transaction do
    all_phrases.each do |phrase|
      Phrase.connection.execute("insert into phrases(s,frequency,length) values('#{phrase}',#{frequencies[phrase]},#{lengths[phrase]})")
    end
  end
}
This falls down of course if the phrases contain single quote characters or are from an untrusted source. You will need to at least escape quotes before putting them in the insert statement. This will slow it down a bit, but not as much as before I would think.
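In real code, ActiveRecord's connection.quote handles this escaping (and more) for you. As a self-contained sketch of the bare minimum, standard SQL (including SQLite) escapes an embedded single quote by doubling it:

```ruby
# Minimal manual escaping for the insert above: standard SQL doubles an
# embedded single quote. Phrase.connection.quote(phrase) is the real
# ActiveRecord way to do this.
def sql_quote(str)
  "'" + str.gsub("'", "''") + "'"
end

sql = "insert into phrases(s,frequency,length) values(#{sql_quote("it's a dog")},1,3)"
```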
Note to self: Don't attempt to do lots of record creations with standard AR code!
Gary Doades wrote:
> etc... so the final result is 34000 entry hashes, mapping to
> frequencies and word counts.
> the hardware should be pretty good... it is the HP TouchSmart IQ804...
> with a Core 2 Duo and hard drive ST3320820AS which is 7200rpm, 8MB
> buffer.
> hm... makes me wonder if the db is just 4MB, how come the hard drive
> buffer 8MB didn't totally handled it in its RAM and be super fast. did
> it actually force write to the physical disc?
It's never that simple (and yes under the appropriate circumstances
the drive is supposed to flush to the actual disk). For example
there's a constant amount of overhead with each query: network latency
etc. not a huge amount (probably less than a millisecond), but
multiply that by 34000 and it will add up. If you really need to
manipulate this much data you'd be well advised to do the inserts in
bulk.
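One common way to do the inserts in bulk is to build a single multi-row INSERT statement instead of thousands of single-row ones (as the thread notes later, not every database accepts multi-row VALUES). A sketch of constructing such a statement, with made-up sample data:

```ruby
# Build one multi-row INSERT instead of 34000 single-row statements.
phrases     = ["a quick", "a quick brown"]                      # sample data
frequencies = { "a quick" => 2, "a quick brown" => 1 }
lengths     = { "a quick" => 2, "a quick brown" => 3 }

rows = phrases.map do |p|
  "('#{p.gsub("'", "''")}',#{frequencies[p]},#{lengths[p]})"    # escape quotes
end
sql = "insert into phrases(s,frequency,length) values #{rows.join(',')}"
# Then: Phrase.connection.execute(sql) inside one transaction.
```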
Frederick Cheung wrote:
>> it actually force write to the physical disc?
> It's never that simple (and yes under the appropriate circumstances
> the drive is supposed to flush to the actual disk). For example
> there's a constant amount of overhead with each query: network latency
> etc. not a huge amount (probably less than a millisecond), but
> multiply that by 34000 and it will add up. If you really need to
> manipulate this much data you'd be well advised to do the inserts in
> bulk.
> so how to do inserts in bulk? by using
> Phrase.connection.execute("insert into phrases(s,frequency,length)
> values('#{phrase}',#{frequencies[phrase]},#{lengths[phrase]})")

basically. you can insert more than one tuple like that (although
perhaps not with all databases)

> ? or by ar-extensions? can ActiveRecord have a mode for saving without
> being in a transaction, or does ActiveRecord have some standard method
> of doing bulk inserts?

ar-extensions provides bulk inserts if the db supports it.
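A sketch of what using ar-extensions might look like. The import call is the gem's API as I understand it and should be treated as an assumption, so it is shown in a comment; the runnable part just builds the rows in the shape that call expects:

```ruby
# Shape the data as a columns list plus an array of row arrays, which is
# what ar-extensions' bulk import expects (API assumed, see comment).
columns = [:s, :frequency, :length]

frequencies = { "a quick" => 2, "a quick brown" => 1 }   # sample data
lengths     = { "a quick" => 2, "a quick brown" => 3 }

rows = frequencies.keys.map { |p| [p, frequencies[p], lengths[p]] }

# require 'ar-extensions'
# Phrase.import columns, rows   # one bulk INSERT if the db supports it
```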
You might be using the wrong tool and you might be reinventing existing
solutions. Did you have a look at existing text mining tools/frameworks
in Ruby as well as other languages?