MySQL Database Access and Population

CLI
Client Tools
Scripting
ODBC/JDBC
ETL
Partitioning Jobs
Transaction Scope
Query Cache

In this section I will discuss different methods of accessing your database to perform maintenance and data tasks. The connection method you choose when accessing your MySQL database has a significant bearing on the performance of the job you are running. In addition, limiting the scope of the transactions you pass through these connections can be a very effective way to achieve acceptable insert performance.

Queries such as selects and inserts can be passed to your database in several different ways, and each has performance implications that need to be considered. For the sake of this discussion, consider a database residing on a remote machine that you are maintaining from a desktop at work or home. For the most part, costly queries and inserts will perform far better when passed directly to the engine at the command line than when passed across the network. Network latency and related overhead can cut the performance of your job nearly in half, and in some cases make a job that runs perfectly well at the command line impossible to run through your client. This does not mean database clients have no use for querying, and of course most application designs require database access over a network. What needs to be understood is that access from a client tool such as Navicat on your desktop uses drivers to enable that access, just as the application you are creating does.

The pattern that has worked best for me is to use the IDE to write SQL and run it against a subset of the data for testing and debugging. Once the SQL is ready for prime time, it can be moved to the server and executed at the command line, or through a script, against the entire data set. In this way you get the convenience of the IDE during design while getting optimal performance when the full job is executed.

While engaging in this process, you will quickly become aware of the query cache, which MySQL uses to store the results of frequently run queries. As you iterate through runs of a particular piece of code, you will notice that performance improves dramatically after the first run; this is the query cache at work. Specifying a larger cache in the my.cnf configuration file via the query_cache_size option will increase its effectiveness with large data sets. (Note that the query cache was deprecated in MySQL 5.7 and removed in 8.0, so this only applies to older versions.) Keep this in mind and take advantage of the feature to save time.
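If you want to confirm that the cache is actually doing this work, its size and hit counters can be inspected from any client session on versions that still include it. A minimal sketch; the 256 MB value is purely illustrative and not a recommendation:

 show variables like 'query_cache%';       # cache size and whether it is enabled
 show status like 'Qcache%';               # Qcache_hits should climb on repeated runs
 set global query_cache_size = 268435456;  # 256 MB at runtime; use my.cnf to make it permanent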

As noted before, the performance of your job can be greatly affected by whether or not it can be executed in memory without having to swap parts of the data out to disk. This can be managed through the use of transactions and by partitioning the job in SQL. Below I have provided an example which does both to speed up query execution. Using explicit transactions (start transaction, begin, and commit in your query syntax) and a loop that processes a small batch of records at a time can turn an impossible job that would otherwise take days to run into something that completes nicely in under 20 minutes. While there are other approaches, this particular pattern has worked amazingly well for me, permitting ETL operations on data sets that would otherwise have been unmanageable due to their size. Here is some code to get you started; just replace the insert with yours.


drop procedure if exists load_prv_destination_table;

delimiter $$
create procedure load_prv_destination_table()
begin

    # chunking parameters: tune row_increment so each chunk fits in the InnoDB buffer pool
    declare row_max int unsigned default 3000000;
    declare row_increment int unsigned default 500000;
    declare row_position int unsigned default 1;
    # declare start_time datetime;   # uncomment these for benchmarking
    # declare end_time datetime;

    # set start_time = now();
    # set row_max = (select count(*) from raw.source_table);
    set row_max = (select max(provider_no) from raw.source_table);

    while row_position <= row_max do

        # each chunk gets its own transaction so its work is committed
        # before the next chunk begins
        start transaction;

        insert into prv_destination_table
            (first_name, last_name, address_1, address_2,
             zip_code_no, destination_table_type_no, dbs_vshard_no)
        select substring(raw.source_table.first_name, 1, 40),
               substring(raw.source_table.last_name, 1, 40),
               substring(raw.source_table.address_1, 1, 50),
               substring(raw.source_table.address_2, 1, 50),
               -1, -1, 10000
        from raw.source_table
        where raw.source_table.category not in ('HS', 'RX', 'AC', 'CLI', 'AM', 'IL', 'CM', 'unk')
          # the -1 keeps consecutive chunks from overlapping on the boundary value
          and provider_no between row_position and row_position + row_increment - 1
        group by raw.source_table.first_name, raw.source_table.last_name,
                 raw.source_table.address_1, raw.source_table.address_2;
        # alternative limit-based chunking; not needed when chunking on provider_no
        # order by raw.source_table.provider_no asc
        # limit row_position, row_increment;

        commit;

        set row_position = row_position + row_increment;

    end while;

    # set end_time = now();
    # select time_to_sec(timediff(end_time, start_time));
    # select start_time, end_time;

end $$
delimiter ;
 
call load_prv_destination_table();
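
Once the procedure has finished, a quick sanity check on the loaded data is worth the few seconds it takes. A minimal sketch based on the example above; since the group by collapses duplicate name/address combinations, the destination count will normally come in somewhat lower than the filtered source count:

 # source rows that survive the category filter
 select count(*) from raw.source_table
 where category not in ('HS', 'RX', 'AC', 'CLI', 'AM', 'IL', 'CM', 'unk');

 # distinct name/address combinations actually loaded
 select count(*) from prv_destination_table;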

You can un-comment the timing code to run benchmarks. You will want to optimize total run time by selecting the right number of records to process per loop; this is mostly a function of the amount of RAM available. Watching the process monitor for your particular operating system can also be helpful. A properly sized chunk will nearly fill the InnoDB buffer pool (innodb_buffer_pool_size); make the loop too large and you are back to swapping to disk, and performance will really suffer. When looking at a series of queries in an ETL process, for example, it will be fairly clear which queries are ending up in swap space. While this can often be minimized with good query design, sometimes it is unavoidable. In those instances, try the approach above.
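
If you would rather watch from inside MySQL than from the operating system's process monitor, the InnoDB status counters give a rough picture of how full the buffer pool is while a chunk runs. A minimal sketch using the standard status variables:

 # total buffer pool size in bytes
 show variables like 'innodb_buffer_pool_size';

 # total / free / dirty pages; if free pages approach zero mid-chunk,
 # row_increment is about as large as it can safely be
 show status like 'Innodb_buffer_pool_pages%';

 # Innodb_buffer_pool_reads counts reads that had to go to disk, while
 # read_requests counts logical reads; a low ratio means the chunk fits in memory
 show status like 'Innodb_buffer_pool_read%';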