First, I'm a physicist, not a computer scientist. I need to store 65M rows now, with a growth rate of ~21M/year whose entries arrive daily.
The purpose of this DB is to store metadata for environmental observations for the ionosphere. Basically it's a glorified file catalogue. I need the usual: fast enough inserts, reasonable recovery, REST like access, etc. I have access to Ruby AR/AS 2.1.1, Ruby 1.8.5 and eventually MySQL 5.1.x.
I'm trying to trade some storage space and lack of normalization for simplicity of design and ease of maintenance by me. Basically, I want to partition one very large table. My PK is essentially across 3 columns: filename, md5 checksum, destination. I'd like to partition by year which is discernible by the filename and by obs_time if I can but obs_time is not currently part of the PK. If that happens, then I would have up to 40M rows in a single table though most tables would have a few million; perhaps that is too many; perhaps a table for each half year would work.
Currently, I have no foreign keys but I may want them one day.
Any help and suggestions would be really appreciated!
My schema follows. I have created a simple model off this table and I'm using these decorators: validation, and before_save and after_save to build an inventory summary (in another table not shown). I've tried this on a non-partitioned table so far awaiting the installation of MySQL 5.1.x by our IT department.
How do I go about integrating AR with a partitioned table? Can you advise a simple pattern I might follow for such a large but simple relationship?
create table `ionofiles` (
-- TODO optimize variable sizes `id` int(15) NOT NULL auto_increment PRIMARY KEY, `file` varchar(100) NOT NULL, -- Any valid filename without the ".~N~" backup extensions. `md5` varchar(100) NOT NULL, -- MD5 Sum `destination` varchar(20) NOT NULL, -- E.g. MIDS, ADIC, AFWA, SWPC, ...
`version` varchar(20) default NULL, -- Optional version.
`ursi` char(5) default NULL, -- E.g. BC840, WI937. default NULL because we allow any filename.
`time_scope` varchar(20) default NULL, -- E.g. (individual, daily, monthly, yearly, etc). default NULL
-- because we allow non time-scoped files.
`content_type` varchar(20) default NULL, -- E.g. (root, scaled, image, ionogram)
`obs_time1` datetime default NULL, -- Start of measurement. Default NULL because we allow
-- non data files. `obs_time2` datetime default NULL, -- End of measurement, in the case file represents
-- multiple measurements. Default NULL because we allow
-- non data files.
`extension` varchar(5) default NULL, -- file extension (any case). E.g. SAO, MMM, 16C, TXT, png.
`size` int(11) NOT NULL, -- file size in bytes
`backup` int(2) default NULL, -- N of the "~N~" tilda backup file.
`is_ready` boolean NOT NULL default 0, -- Is this file's links ready? 0 = No, 1 = Yes. `is_public` boolean NOT NULL default 1, -- Is this file publicly available? 0 = Public, 1 = Private. `is_removed` boolean NOT NULL default 0, -- Has this file been marked deleted?, -- s.t. we don't ever want to see it again? -- This is useful in the event that we want to reject future submissions of file.
`file_on_disk` varchar(100) NOT NULL, -- Actual file name on disk: E.g. BC840_2009001000000.SAO.~10~ `path` varchar(200) NOT NULL, -- A directory or a full path to a ZIP or TAR or other aggregate file. -- E.g. data/BC840/individual/2009/001/scaled/ -- E.g. data/BC840/individual/2009/BC840_2009001_2009-01-01.tar.gz
`last_verified` datetime default NULL, -- last time we checked that this file actually exists at stated destination
-- TODO: ActiveRecord fields. What should be the data type? Timestamp or Datetime? -- The auto set and update features are currently handled by AR. -- Should we put some ON UPDATE, etc actions here for non AR interaction? `created_on` datetime NOT NULL, -- ActiveRecord auto set. `updated_on` datetime NOT NULL, -- ActiveRecord auto updated.
`note` TEXT default NULL, -- User note
UNIQUE KEY `key_file_md5_destination_public` ( `file`, `md5`, `destination`, `is_public` ),
KEY `key_file` ( `file` ), KEY `key_md5` ( `md5` ), KEY `key_destination` ( `destination` ), KEY `key_version` ( `version` ), KEY `key_ursi` ( `ursi` ), KEY `key_time_scope` ( `time_scope` ), KEY `key_content_type` ( `content_type` ), KEY `key_obs_time1` ( `obs_time1` ), KEY `key_obs_time2` ( `obs_time2` ), KEY `key_extension` ( `extension` ), KEY `key_size` ( `size` ), KEY `key_backup` ( `backup` ), KEY `key_is_ready` ( `is_ready` ), KEY `key_is_public` ( `is_public` ), KEY `key_is_removed` ( `is_removed` ), KEY `key_file_on_disk` ( `file_on_disk` ), KEY `key_last_verified` ( `last_verified` ), KEY `key_created_on` ( `created_on` ), KEY `key_updated_on` ( `updated_on` )
) ENGINE=InnoDB;
R