updating the db 6000 times will take a few minutes?

Jian Lin wrote:

Jian Lin wrote:

i was doing it like this:

all_phrases = frequencies.keys

Phrase.transaction do
  all_phrases.each do |phrase|
    recordPhrase = Phrase.new(:s => phrase, :frequency => frequencies[phrase], :length => lengths[phrase])
    recordPhrase.save
  end
end

by the way, the table has indexes on all fields: s, frequency, and length.

That'll slow down inserts/updates, but not by a huge amount on such a small table.

Gary Doades wrote:

I'm not sure how you get 34000 records from 6000 words. Surely you should get fewer, not more.

it would run for at least a minute... that's kind of weird...

Are you sure it is just the above code that is taking a minute and not the bit of code that counts the words?

What hardware is this on exactly?

it is 34000 records because it actually counts up phrases instead of words... for example, for "a quick brown fox jumps over the lazy dog" i will actually count phrases such as

"a quick" "a quick brown" "a quick brown fox"

etc... so the final result is 34000-entry hashes, mapping phrases to frequencies and word counts.

the hardware should be pretty good... it is an HP TouchSmart IQ804... with a Core 2 Duo and a Seagate ST3320820AS hard drive (7200 rpm, 8MB buffer).

hm... makes me wonder: if the db is just 4MB, how come the hard drive's 8MB buffer didn't handle it all in its RAM and make it super fast? did it actually force a write to the physical disc?

Consider

Phrase.transaction do
  frequencies.each do |phrase, freq|
    Phrase.create!(:s => phrase, :frequency => freq)
  end
end

Hash#each passes keys and values to your block and avoids unnecessary lookups.

My snippet above doesn't take into account your :length => lengths[phrase] and that is as it should be. Presumably, lengths[phrase] == phrase.length. Then

class Phrase < ActiveRecord::Base
  attr_protected :length
  ...
  def s=(value)
    self.length = value.length
  end
end

would be much cleaner code because it puts responsibility for setting the length attribute where it belongs.

Michael

Michael Schuerig wrote:


Consider

Phrase.transaction do
  frequencies.each do |phrase, freq|
    Phrase.create!(:s => phrase, :frequency => freq)
  end
end

Hash#each passes keys and values to your block and avoids unnecessary lookups.

My snippet above doesn't take into account your :length => lengths[phrase] and that is as it should be. Presumably, lengths[phrase] == phrase.length. Then

class Phrase < ActiveRecord::Base
  attr_protected :length
  ...
  def s=(value)
    self.length = value.length
  end
end

would be much cleaner code because it puts responsibility for setting the length attribute where it belongs.

i changed it to

time_start = Time.now
Phrase.transaction do
  all_phrases.each do |phrase|
    recordPhrase = Phrase.create!(:s => phrase, :frequency => frequencies[phrase], :length => lengths[phrase])
  end
end
p "took ", Time.now - time_start, " seconds to finish"

but it is still the same: it took 75 seconds...

i wanted the length, which is the word count, because in the future i might want to do a query such as "select * where length > 3". if i have to count the words by reading "s" first, it will be really slow, won't it? if length is stored and indexed, then "length > 3" can be super fast?

Jian Lin wrote:

i changed it to

time_start = Time.now
Phrase.transaction do
  all_phrases.each do |phrase|
    recordPhrase = Phrase.create!(:s => phrase, :frequency => frequencies[phrase], :length => lengths[phrase])
  end
end
p "took ", Time.now - time_start, " seconds to finish"

but it is still the same: it took 75 seconds...

That's because it's still really the same code!

i wanted the length, which is the word count, because in the future i might want to do a query such as "select * where length > 3". if i have to count the words by reading "s" first, it will be really slow, won't it? if length is stored and indexed, then "length > 3" can be super fast?

Just because it has an index on it doesn't automatically make it super fast. It depends largely on the selectivity of the length field. I suspect that most of your phrases are larger than 3 characters and so such a query will result in a full table scan anyway. In that case making the table smaller by leaving out the length may actually make it faster. However, selecting all the phrases where length < 5 would almost certainly use the index and make it faster. Only you know what you are likely to do in general here so you need to decide (and test) whether it is better to have the index and length column or not.

You can always do "select * from phrases where length(s) > 3" or something like that.
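If in doubt, you can always ask the database how it plans to run the query. A rough sketch, assuming SQLite (which uses EXPLAIN QUERY PLAN; MySQL and Postgres just use EXPLAIN):

# Will the planner use the index on the stored length column?
p Phrase.connection.select_all("EXPLAIN QUERY PLAN SELECT * FROM phrases WHERE length > 3")

# Computing the length on the fly can never use that index
p Phrase.connection.select_all("EXPLAIN QUERY PLAN SELECT * FROM phrases WHERE length(s) > 3")

Look for an index scan in the first plan; if it shows a full table scan instead, the planner has decided the condition isn't selective enough to be worth the index.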

Are you sure your overall run time is not limited by CPU rather than IO? How much CPU time is used to run your code?

Cheers, Gary.

Jian Lin wrote:

oh sorry, i should have clarified... the length is actually the count of words... so length > 3 means at least 3 words. or i might do a query with length > 5 or 7... so will having a column and an index make it fast if i do a query such as length > 7? i expect, for example, to get back only 5% or 3% of all records, so the index might make it fast.

Ah, yes, that makes it different :slight_smile:

by the way, the 75 seconds is the time from the moment the

Phrase.transaction do

to the end of it... so... i think during that time it is mostly IO time...

OK, just need to look at the CPU time used then.

Is this Windows or Linux or something else?

let me actually write up a test case with some dummy data to simulate it so that everyone is on the same ground...

That would be the best of all. I can run your code with your sample data against MySQL and Postgres to give you some ideas. I don't really know much about sqlite, but I must admit I'm curious as to where the time has gone in such an apparently simple situation.

Don't forget to include the database definition and data as well as the code.

Cheers, Gary.

Jian Lin wrote:

ok, the following code with a 6000-record insert takes 13.3 seconds to finish (just for the db portion). if i change 6000 to 30000 it takes 67 seconds.

OK, I created the table and the index and ran your code for 30000 records, but I wrapped the database part in a Benchmark.measure{}
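The wrapper was roughly this (a sketch, not the exact harness; in the output below the columns are user, system and total CPU seconds, then the elapsed (real) time in brackets, and the trailing 30000 is presumably the record count printed afterwards):

require 'benchmark'

p "Starting inserting records"
puts Benchmark.measure {
  Phrase.transaction do
    all_phrases.each do |phrase|
      Phrase.create!(:s => phrase,
                     :frequency => frequencies[phrase],
                     :length => lengths[phrase])
    end
  end
}
puts Phrase.count   # presumably where the trailing 30000 comes from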

Using MySQL:

For 30000 inserts with no indexes:

"Starting inserting records"   31.996000 0.639000 32.635000 ( 35.356000) 30000

For 30000 inserts with all indexes:

"Starting inserting records"   32.795000 0.982000 33.777000 ( 37.103000) 30000

That's 33 seconds of CPU time with 37 seconds elapsed on a quad core 2.4 GHz with a Seagate Barracuda SATA drive with 32MB cache.

As you can see, it's pretty much all in CPU time!!!!

The result was essentially the same in Postgres and MS SQL server!

So you can forget about the database itself. None of the database engines were unduly taxed by the test.

Just for fun I changed the program to output the data as SQL INSERT statements and then ran that (with 30000 inserts wrapped in a transaction) against MySQL.

Imported in 1.2 seconds!!
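The dump-and-import was along these lines (a rough sketch, not the exact script; the file name is made up and single quotes are escaped by doubling them):

File.open("phrases.sql", "w") do |f|
  f.puts "BEGIN;"
  frequencies.each do |phrase, freq|
    escaped = phrase.gsub("'", "''")   # double any single quotes for SQL
    f.puts "INSERT INTO phrases(s, frequency, length) VALUES('#{escaped}', #{freq}, #{lengths[phrase]});"
  end
  f.puts "COMMIT;"
end

Then feed the file straight to the command-line client, e.g. mysql mydb < phrases.sql.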

I don't know if it is the hash lookup code or ActiveRecord that is gobbling up the time, but it certainly isn't the database.

You'll need to tinker with, or better, profile your code to find out what is sucking up the time.
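If you want numbers rather than guesses, ruby-prof is one option (just a sketch; the gem has to be installed separately and it slows the run down while measuring):

require 'rubygems'
require 'ruby-prof'

result = RubyProf.profile do
  Phrase.transaction do
    all_phrases.each do |phrase|
      Phrase.create!(:s => phrase,
                     :frequency => frequencies[phrase],
                     :length => lengths[phrase])
    end
  end
end

# Flat report of which methods ate the CPU time
RubyProf::FlatPrinter.new(result).print(STDOUT)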

Cheers, Gary.

Gary Doades wrote:

"Starting inserting records"   31.996000 0.639000 32.635000 ( 35.356000) 30000

For 30000 inserts with all indexes:

"Starting inserting records"   32.795000 0.982000 33.777000 ( 37.103000) 30000

I don't know if it is the hash lookup code or ActiveRecord that is gobbling up the time, but it certainly isn't the database.

by the way... interesting to find out the CPU time is so much... i thought this code was I/O bound at first...

by the way I am using Rails 2.3.2, Ruby 1.8.6 patchlevel 287... on Windows 7.

hm... the Hash lookup time should be really small... for example, if i just write all the data to a text file, it should be really fast... it might be ActiveRecord, although I thought ActiveRecord merely translates the method call into a SQL statement and so shouldn't be so CPU intensive.

Jian Lin wrote:

Gary Doades wrote:

"Starting inserting records"   31.996000 0.639000 32.635000 ( 35.356000) 30000

For 30000 inserts with all indexes:

"Starting inserting records"   32.795000 0.982000 33.777000 ( 37.103000) 30000

I don't know if it is the hash lookup code or ActiveRecord that is gobbling up the time, but it certainly isn't the database.

by the way... interesting to find out the CPU time is so much... i thought this code was I/O bound at first...

by the way I am using Rails 2.3.2, Ruby 1.8.6 patchlevel 287... on Windows 7.

hm... the Hash lookup time should be really small... for example, if i just write all the data to a text file, it should be really fast... it might be ActiveRecord, although I thought ActiveRecord merely translates the method call into a SQL statement and so shouldn't be so CPU intensive.

Aha.... It looks like ActiveRecord has an enormous overhead in creating/saving records.

If you change the inserts to this....

puts Benchmark.measure {
  Phrase.transaction do
    all_phrases.each do |phrase|
      Phrase.connection.execute("insert into phrases(s,frequency,length) values('#{phrase}',#{frequencies[phrase]},#{lengths[phrase]})")
    end
  end
}

you get this:

"Starting inserting records"    1.123000 0.686000 1.809000 ( 5.096000) 30000

Which is exactly what you want I think :slight_smile:

This falls down of course if the phrases contain single quote characters or are from an untrusted source. You will need to at least escape quotes before putting them in the insert statement. This will slow it down a bit, but not as much as before I would think.
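One way to keep most of the speed and still be safe is to let the adapter do the quoting (a sketch; connection.quote returns the value already wrapped in quotes and escaped):

conn = Phrase.connection
Phrase.transaction do
  all_phrases.each do |phrase|
    conn.execute("insert into phrases(s, frequency, length) " +
                 "values(#{conn.quote(phrase)}, #{frequencies[phrase].to_i}, #{lengths[phrase].to_i})")
  end
end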

Note to self: Don't attempt to do lots of record creations with standard AR code!

Cheers, Gary.

Gary Doades wrote:

etc... so the final result is 34000-entry hashes, mapping phrases to frequencies and word counts.

the hardware should be pretty good... it is an HP TouchSmart IQ804... with a Core 2 Duo and a Seagate ST3320820AS hard drive (7200 rpm, 8MB buffer).

hm... makes me wonder: if the db is just 4MB, how come the hard drive's 8MB buffer didn't handle it all in its RAM and make it super fast? did it actually force a write to the physical disc?

It's never that simple (and yes, under the appropriate circumstances the drive is supposed to flush to the actual disk). For example, there's a constant amount of overhead with each query (network latency etc.); it's not a huge amount, probably less than a millisecond, but multiply that by 34000 and it adds up. If you really need to manipulate this much data you'd be well advised to do the inserts in bulk.

Fred

Frederick Cheung wrote:

>> did it actually force a write to the physical disc?

> It's never that simple (and yes, under the appropriate circumstances the drive is supposed to flush to the actual disk). For example, there's a constant amount of overhead with each query (network latency etc.); it's not a huge amount, probably less than a millisecond, but multiply that by 34000 and it adds up. If you really need to manipulate this much data you'd be well advised to do the inserts in bulk.

so how to do inserts in bulk? by using

Phrase.connection.execute("insert into phrases(s,frequency,length) values('#{phrase}',#{frequencies[phrase]},#{lengths[phrase]})")

Basically, yes. You can insert more than one tuple like that (although perhaps not with all databases).
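For example, something like this rough sketch, batching a few hundred rows per statement (MySQL and Postgres accept multi-row VALUES; the SQLite versions current at the time do not):

require 'enumerator'   # for each_slice on Ruby 1.8

conn = Phrase.connection
all_phrases.each_slice(500) do |batch|
  rows = batch.map do |phrase|
    "(#{conn.quote(phrase)}, #{frequencies[phrase].to_i}, #{lengths[phrase].to_i})"
  end
  conn.execute("INSERT INTO phrases(s, frequency, length) VALUES #{rows.join(', ')}")
end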

? or by using ar-extensions? can ActiveRecord save without being in a transaction, or does ActiveRecord have some standard method of doing bulk inserts?

ar-extensions provides bulk inserts if the db supports it.
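With ar-extensions it looks roughly like this (untested sketch; the gem adds an import class method to ActiveRecord models and builds the bulk statements for you):

require 'ar-extensions'   # gem install ar-extensions

columns = [:s, :frequency, :length]
values  = all_phrases.map { |phrase| [phrase, frequencies[phrase], lengths[phrase]] }

# One call instead of 30000 individual INSERTs; skipping validations is faster still
Phrase.import(columns, values, :validate => false)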

Fred

You might be using the wrong tool and you might be reinventing existing solutions. Did you have a look at existing text mining tools/frameworks in Ruby as well as other languages?

Michael