From its inception, speed has been a strong point of the MySQL database server. In fact, its developers have long been cautious to add new features at the expense of performance even when faced by withering pressure from detractors. Yet over time MySQL’s features caught up with its blazing speed, and today its used to power some of the highest traffic websites in the world, Yahoo! Finance, Craiglist, and TicketMaster among them. Accordingly, for most applications chances are MySQL is going to perform to your expectations.
However, with Web traffic exploding, and the complexity of applications increasing all the time, you should strive to optimize your database from its very inception, and continue to review and refine its structure and query activity over its lifetime. You can do so by following a process I refer to as NICM, or preferably, “Neglecting to Index Causes Misery” (those of you new to database optimization will soon get the joke). This process consists of five steps, including Normalize, Index, Cache, and Monitor. In this inaugural installment of my new MySQL series for Developer.com, I’ll introduce these five steps, highlighting not only how they will ensure your database is running at full tilt, but also how they will actually help you to more effectively manage your data.
The relational database is aptly named because it promotes the strategy of managing data through well-defined relations. By creating and enforcing relations, its possible to greatly reduce the possibility inconsistencies could creep into the data. This strategy is known as database normalization, of which there are several well-defined states, also known as forms.
To illustrate both the concept and importance of normalization, consider a scenario in which you’re tasked with creating a corporate human resources application capable of managing employee information. This application would allow the HR manager to easily insert, update, view and delete employee data. Each employee record would contain the usual information, including name, phone number, position, salary, and supervisor.
Note that in the opening sentence I stated that relational databases promote the strategy of managing data through well-defined relations—you’re certainly not constrained to do so. Therefore there’s nothing to prevent you from creating a single table that looks like this:
CREATE TABLE employees ( name VARCHAR(50) NOT NULL, telephone CHAR(10) NOT NULL, position VARCHAR(20) NOT NULL, salary DECIMAL(8,2) NOT NULL, supervisor VARCHAR(50) NOT NULL )
Yet several problems with this approach should be evident. Because the data types used to define these columns are all indicative of an input method requiring the manager to manually enter each value. For example, the manager might type the following values into the web form when adding a new employee:
“Jason Gilmore”, “614-999-9999”, “Software Developer”, “50000.00”, “John Thompson”
But what if over time the manager begins to enter the “Software Developer” title as “Sft. Dev.”, and occasionally forgets the periods, using just “Sft Dev”? These inconsistencies eliminate the possibility of using simple SELECT queries to retrieve the roster of software developers. The simple answer to the problem is to normalize the position data, creating a new table intended to contain the master list of all possible positions:
CREATE TABLE positions ( id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, name VARCHAR(20) NOT NULL )
Then you revise the employees table so that it refers to a position’s primary key rather than the name (in the revised table, the _fk postfix is simply a naming convention to remind the database administrator that this column points to a foreign table key):
CREATE TABLE employees ( name VARCHAR(50) NOT NULL, telephone CHAR(10) NOT NULL, position_fk TINYINT UNSIGNED NOT NULL, salary DECIMAL(8,2) NOT NULL, supervisor VARCHAR(50) NOT NULL )
Now you can retrieve the positions from the positions table and use them to populate a drop-down list. This forces the manager to choose from a constrained set rather than haphazardly type in the titles.
A similar normalization improvement can be made to the supervisor column, although this is a case where an additional table isn’t necessary. All you need to do is modify the employees table anew so it looks like this:
CREATE TABLE employees ( id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50) NOT NULL, telephone CHAR(10) NOT NULL, position_fk TINYINT UNSIGNED NOT NULL, salary DECIMAL(8,2) NOT NULL, supervisor_pk SMALLINT UNSIGNED )
Can you figure out the origin of the data used to populate the supervisor_pk column? That’s right, the very same employees table!
This is just a taste of how database normalization can greatly reduce the data inconsistencies that can arise over time. For a great summary of the concept and the various forms I alluded to earlier in the section, see the Wikipedia article on this topic.
I also suggest downloading MySQL’s free MySQL Query Browser application, which can help you to visualize the data and its various relationships in a manner much more efficient than if you were using the command-line client.
Once the tables have been properly designed and normalized, you should next take some time to think about what data will most commonly be queried, and create special data structures known as indexes which will dramatically improve the performance of these query operations. Indexes are important because they organize the indexed data in a way that allows MySQL to retrieve the desired record in the fastest possible fashion. In the previous section you already encountered an index, although it isn’t apparent: the primary key. But the primary key will only speed query operations when you’re searching for a record by using that primary key as the identifier, for example:
SELECT name, telephone FROM employees WHERE id='3'
What if you wanted to search for one or several records based on the employee name? The query might look like this:
SELECT id FROM employees WHERE name='John Smith'
To speed these sorts of queries, you’ll need to add an index to the name column. The revised employees table looks like this:
CREATE TABLE employees ( id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50) NOT NULL, INDEX (name), telephone CHAR(10) NOT NULL, position_fk TINYINT UNSIGNED NOT NULL, salary DECIMAL(8,2) NOT NULL, supervisor_pk SMALLINT UNSIGNED )
You can also create multiple-column indexes that would greatly improve the performance of queries such as this:
SELECT id FROM employees WHERE firstname = 'John' and lastname = 'Smith'
The revised employees table would look like this (I’ve also broken the name into two separate columns as indicated by the query):
CREATE TABLE employees ( id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, firstname VARCHAR(20) NOT NULL, lastname VARCHAR(20) NOT NULL, INDEX name (firstname, lastname), telephone CHAR(10) NOT NULL, position_fk TINYINT UNSIGNED NOT NULL, salary DECIMAL(8,2) NOT NULL, supervisor_pk SMALLINT UNSIGNED )
This new index would actually improve quite a few queries, including the following:
SELECT telephone, position FROM employees WHERE firstname = 'John' SELECT * FROM employees WHERE firstname = 'John' and lastname = 'Smith'
Table indexing can be a rather confusing topic, although the time invested in learning its nuances will be well worth it. I suggest reading the following MySQL manual pages for more information:
MySQL is often used to dynamically generate web pages based on similar queries and rarely changing data. For instance, consider the home page of Developer.com, which likely depends on a single query used to repeatedly retrieve the latest ten or so articles. New articles are typically published early to mid-morning, meaning the relevant data found on this page really only chances perhaps once or twice a day. Why not cache the data returned by these queries, thereby bypassing the need to repeatedly retrieve it from the database? Introduced in version 4, enabling MySQL’s query caching mechanism can result in a huge performance gain over neglecting to do so.
To learn more about how your MySQL installation’s query caching mechanism is configured, run the following command:
mysql>SHOW VARIABLES LIKE '%query_cache%'
The reasoning behind this step should be obvious: how are you going to know what should be optimized if you’re not actively monitoring how MySQL is operating? Thankfully, MySQL’s developers have been particularly proactive in providing developers with the tools for keeping abreast of database performance.
For starters, you should familiarize yourself with MySQL’s EXPLAIN query, which will provide detailed information regarding how MySQL goes about executing a SELECT query. You’ll learn valuable tips regarding how MySQL is executing table joins, and whether additional indexes should be added.
If you’re running a MySQL version earlier than 5.0, check out mytop, a console-based utility for monitoring MySQL performance.
Finally, be sure to have a solid understanding of MySQL’s configuration variables, because they control crucial performance-related matters such as how much memory is allocated to queries, what sort of data is logged, how many simultaneous collections are allowed, and much more. You can review a complete list of these variables by executing:
I hope this introductory article left you thinking about how you can go about making your MySQL database absolutely scream. And moving forward, every time you begin a new database project, don’t forget to approach it the NICM way!
About the Author
W. Jason Gilmore is Apress’ Open Source Editorial Director, and co-founder of IT Enlightenment. He’s the author of several books, including the best-selling “Beginning PHP and MySQL 5: Novice to Professional, Second Edition” (Apress, 2006. 913pp.). Jason loves receiving e-mail; so don’t hesitate to write him at wjATwjgilmore.com.