Data warehousing in MySQL
Some databases are used to store large amount of historical data (such as call detail records, log messages, etc). This data has the following common characteristics:
- every record has timestamp (date and time when event occurs)
- very large amount of records
- such records should be stored for long time
- records are accessed very infrequently
During life of such databases, we got data fragmentation (non-contiguous free space blocks made by deletion of old rows) and problems with long insert time - because of oversized B-Tree indexes. During periodic data maintenance, we should delete rows from such tables - and we got excessive table locking time if table stored as MyISAM and in all cases got excessive IO and CPU usage while executing DELETE statement.
These problems prevents data to grow and now, we will describe how to avoid these problems and make database to eat all your disk almost without performance loss.
Table partitioning can be used to solve these problems (starting from MySQL 5.1 http://dev.mysql.com/doc/refman/5.1/en/partitioning.html). Partitioning feature provides functionality to separate data between tablespaces, where partition is automatically selected in depency of inserted data.
MySQL has support of following partitioning alghorithms: RANGE, HASH, KEY, LIST (http://dev.mysql.com/doc/refman/5.1/en/partitioning-types.html). In this article we will only describe RANGE partitioning based on date, because it is slightly better for historical Data Warehousing. Please note that MySQL is optimized to use to_days() and year() functions as partitioning key.
Partitioned tables require additional administrative tasks for partition management (adding/deleting/splitting partitions), but performance improved over all problems we tell before:
- locked only partition against full table in MyISAM
- no data fragmentation - empty/new partition does not contain old data
- indexed operations improved - lowering count of iterations to search key inside B-Tree indexes
Consider to use this sample table:
CREATE TABLE members (
firstname VARCHAR(25) NOT NULL,
lastname VARCHAR(25) NOT NULL,
username VARCHAR(16) NOT NULL,
email VARCHAR(35),
joined DATE NOT NULL
)
PARTITION BY RANGE( YEAR(joined) ) (
PARTITION p0 VALUES LESS THAN (1960),
PARTITION p1 VALUES LESS THAN (1970),
PARTITION p2 VALUES LESS THAN (1980),
PARTITION p3 VALUES LESS THAN (1990),
PARTITION p4 VALUES LESS THAN MAXVALUE
);
Created table had following files inside database directory for MyISAM:
---
# ls -1 members*
members.frm - table description
members.par - partition description
members#P#p0.MYD, members#P#p0.MYI - partition 0
...
members#P#p4.MYD, members#P#p4.MYI - partition 4
---
For InnoDB (with innodb-file-per-table), following files are created:
---
members.frm - table description
members.par - partition description
members#P#p0.ibd, members#P#p1.ibd, members#P#p2.ibd, members#P#p3.ibd, members#P#p4.ibd - partition tablespaces
---
From this file structure it is clearly seen that each partition has it’s own space.
Please note, that last partition maximum value can be limited with specified value or be unlimited (MAXVALUE).
Partitioned tables requires periodic maintenance - creation of new partitions and deletion of obsolete. If last partition key has been set as sample to ‘01 jan 2011’, this means that data with date below but not including 1 january 2011 but not including this date can be inserted in such partition. If date exceeds key of last partition - insert will fail. ‘MAXVALUE’ can be specified as last partition key - this means that partition has no upper limit. In all cases - periodic partition maintenance with partitions created in advance is the best way.
Partitioned tables can be easily managed:
If we want to append partition for next year in case of last partition has MAXVALUE, then we should split last partition into new ones:
---
alter table members reorganize partition p4 into (partition y1999 values less than (2000), partition y2000 values less than (maxvalue));
---
In case, when last partition has fixed date, then new partitions can be simply added:
---
alter table members add partition (partition y2001 values less than (2002), partition y2002 values less than (2003));
---
Unwanted data can be easily dropped without excessive CPU and DISK load, because of partition deletion is a DDL construction. Lets delete partition that holds members before 1960 year:
---
alter table members drop partition p0;
---
In conclusion: using partitioning greatly help to reduce CPU and disk load during historic data management. Regular operations such as data inserts and select is also have benefits, because of B-Tree indexes has less height among with reduced data fragmentation. Partitioned table can grow almost unlimited.













Request a quote
Start Live chat
Clients area
REMSYS' News
Our blog
Server Management
Cloud Computing
Highload solutions
Infrastructure Management
High Availability Solutions
Data Migration

