Percona Configuration Wizard
Perhaps the most commonly used my.cnf tuning tool available is MySQLTuner, a simple Perl script. If your server is not already set up to run Perl, you will need to install it first. The script can be downloaded directly or, in the case of Ubuntu server, installed using apt-get. Make sure your server has been up and running its typical workload for at least 24 hours before you run the script; this usage data is important to it. Running the program generates a report suggesting my.cnf configuration changes. Use this report with caution. While it will effectively identify some problems and will take care of the capacity estimations you might otherwise calculate by hand, its advice could degrade performance in some cases. Make sure that you understand the changes you are making regardless of whether or not they have been suggested by an optimization tool.
Another interesting tool is the Percona configuration wizard. This online tool generates a fairly optimized my.cnf file for your database based on your answers to a sequence of questions. It could not get any easier than that 🙂
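To give a flavor of what these tools produce, a generated my.cnf tends to center on a handful of InnoDB and cache settings like the following. The values here are purely illustrative, not recommendations; yours will depend on your answers to the wizard and on your hardware:

```ini
# Illustrative my.cnf fragment; all values are hypothetical examples.
[mysqld]
innodb_buffer_pool_size = 4G     # usually the single most important setting
innodb_log_file_size    = 256M
innodb_flush_method     = O_DIRECT
max_connections         = 200
tmp_table_size          = 64M
max_heap_table_size     = 64M
```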
Table specification is done, as you are probably aware, through the create table command. Usually you will want your scripts to specify fields, options, indexes, and foreign keys all in one place; check online for syntax guides. Personally, I usually handle this process through an IDE. Many people prefer to start with a database design tool such as DBVisualizer. With this approach, you draw a picture of your database design with tables connected at foreign keys by lines, and then, when satisfied, you generate a create script through the software. I am not a big fan of this method as it seems slow to me; however, for others who are more visual it seems to work well. My MySQL client tool of choice is Navicat. It is fairly easy to use relative to the other options out there. To create a table, I fill out a form using Navicat's create table option, which in turn generates the create SQL for me. There are many free options out there as well; however, Navicat Essentials offers SSH tunneling (the ability to reach a database that is not directly accessible from the internet), and this is important enough to me to warrant the $40 a Navicat Essentials license costs.
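As a sketch of keeping fields, options, indexes, and foreign keys together in one script, here is a minimal create statement. The table and column names are made up for illustration:

```sql
-- Hypothetical example: columns, an index, a foreign key, and table
-- options declared together in a single create table statement.
create table orders (
    order_no    int unsigned  not null auto_increment,
    customer_no int unsigned  not null,
    order_date  datetime      not null,
    total       decimal(10,2) not null default 0.00,
    primary key (order_no),
    key idx_order_date (order_date),
    constraint fk_orders_customer
        foreign key (customer_no) references customers (customer_no)
) engine=InnoDB;
```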
In the past, specification of foreign keys has been considered a requirement of good database design, and many DBAs still feel that way, especially those who have been around for a while. For most MySQL implementations, I agree that this is a helpful way of ensuring data integrity and wise to use. In the case of really large data sets, however, it is important to understand the cost of using foreign keys; in many cases, it may be better to rely upon the application for this integrity. For example, when an insert happens on a table with a foreign key, MySQL has to check the key to make sure all required parent rows exist in the database according to the constraint. While this is usually more efficient than handling the check yourself (what a mess), it can be costly performance-wise. A happy medium can be achieved by keeping foreign keys to control small inserts coming from the application but disabling them using foreign_key_checks=0 when doing bulk loads. Of course, you as the administrator are much less likely to make mistakes than the application programmers (haha). The performance gain from disabling these checks in a rigorously keyed database can be huge. We will discuss more about this when we address database loading and ETL.
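The bulk-load pattern looks like the following, using the same hypothetical orders table and a made-up CSV path. Note that foreign_key_checks is a session-scoped variable, and re-enabling it does not retroactively validate rows loaded while it was off:

```sql
-- Disable constraint checking for this session only, load, then restore.
set foreign_key_checks = 0;
load data infile '/tmp/orders.csv'
    into table orders
    fields terminated by ','
    lines terminated by '\n';
set foreign_key_checks = 1;  -- rows loaded above are NOT re-checked
```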
Another feature available in newer versions of MySQL is views. Views are stored selections of the data that can be used to offer apparently pre-joined tables to application programmers. While there is some sense in this, as it may help with consistent and efficient caching of queries, the benefit is probably minimal. MySQL resolves views on the fly for the most part, using either the merge algorithm or a temporary table depending upon the terms of the query. When tuning queries for performance, it is usually better to manage these joins per query at the application level, as the execution path taken through the view may not be optimal. Basically, like foreign keys, views are a luxury that should be avoided when optimal performance is desired.
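For reference, a view of this sort is just a stored select; a hypothetical example using the same made-up tables:

```sql
-- A hypothetical pre-joined view; MySQL resolves it at query time
-- using either the merge algorithm or a temporary table.
create view order_summary as
select o.order_no, o.order_date, c.first_name, c.last_name
from orders o
join customers c on c.customer_no = o.customer_no;
```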
A more effective way of providing a true pre-join to application developers is through the use of star schema tables. These are tables that contain required data from two or more standard tables, usually populated directly after the base tables are loaded. A star schema table can be separately indexed and optimized for a particular need, and it can prevent costly joins from being performed over and over at the application level by performing them one time during your ETL process. Your particular application will dictate how effective this technique is in increasing performance. In the past, space requirements would often limit the use of this technique; however, given the low cost of disk space now, this is usually a better choice than a view for speeding up frequently used joins.
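A sketch of the idea, again with hypothetical tables: the join is paid for once during ETL, and the resulting table gets its own indexes for its own access pattern:

```sql
-- Build the pre-joined table once, right after the base tables load.
create table order_customer_fact (
    order_no   int unsigned not null,
    order_date datetime     not null,
    last_name  varchar(40)  not null,
    zip_code   varchar(10)  not null,
    primary key (order_no),
    key idx_zip_date (zip_code, order_date)  -- indexed for this table's own need
) engine=InnoDB;

insert into order_customer_fact
select o.order_no, o.order_date, c.last_name, c.zip_code
from orders o
join customers c on c.customer_no = o.customer_no;
```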
Temporary tables are another technique that may or may not be effective in your situation but can have their uses in large MySQL databases. A temporary table exists only for the life of a connection; more information about using them is available online. Recent releases of MySQL handle temporary tables very effectively. In some cases, especially when a costly join produces a very small set of data needed for part of a query, memory usage can be optimized through a temporary table. When building data transforms, consider temporary tables whenever this situation arises, as ETL performance, like all query performance, quickly becomes unacceptable once disk paging is required.
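A sketch with the same hypothetical tables: materialize the small, expensive result once, then join against it instead of repeating the work:

```sql
-- Exists only for this connection; dropped automatically on disconnect.
create temporary table big_spenders as
select customer_no, sum(total) as lifetime_total
from orders
group by customer_no
having lifetime_total > 10000;

-- Re-use the small result instead of repeating the aggregate join.
select c.first_name, c.last_name, b.lifetime_total
from big_spenders b
join customers c on c.customer_no = b.customer_no;
```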
drop procedure if exists load_prv_destination_table;

delimiter //
create procedure load_prv_destination_table()
begin
    declare row_max int unsigned default 3000000;
    declare row_increment int unsigned default 500000;
    declare row_position int unsigned default 1;
    declare row_count int unsigned default 0;
    #declare start_time, end_time datetime;
    #set start_time = now();
    #set row_max = (select count(*) from raw.source_table);
    set row_max = (select max(provider_no) from raw.source_table);
    while row_position <= row_max do
        insert into prv_destination_table (first_name, last_name, address_1, address_2, zip_code_no, destination_table_type_no, dbs_vshard_no)
        select substring(first_name,1,40), substring(last_name,1,40), substring(address_1,1,50), substring(address_2,1,50), -1, -1, 10000
        from raw.source_table
        where category not in ('HS', 'RX', 'AC', 'CLI', 'AM', 'IL', 'CM', 'unk')
          and provider_no between row_position and row_position + row_increment - 1
        group by first_name, last_name, address_1, address_2
        order by min(provider_no) asc;
        #limit row_position, row_increment;
        set row_position = row_position + row_increment;
    end while;
    #set end_time = now();
    #select start_time, end_time;
end //
delimiter ;
You can un-comment the timing lines to run benchmarks. You will want to optimize total run time by selecting the right number of records to loop through, which is of course most closely related to the amount of RAM available. Watching the process monitor for your particular operating system can also be helpful. A properly sized chunk will nearly max out your innodb_buffer_pool memory; make the loop too large and you are back to swapping to disk again, and performance will really suffer. When looking at a series of queries in an ETL process (for example), it will be fairly clear which queries are ending up in swap space. While this can often be minimized with good query design, sometimes it is unavoidable. In those instances, try the approach above.
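Besides the OS process monitor, you can watch the buffer pool from within MySQL itself using its standard status counters; a quick sketch:

```sql
-- Free pages approaching zero while the loop runs suggests the chunk
-- size has filled the buffer pool.
show global status like 'Innodb_buffer_pool_pages_free';
show global status like 'Innodb_buffer_pool_reads';  -- reads that missed the pool
```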