Sunday, May 24, 2009

The "Chemistry" of MySQL

Searching for MySQL with WolframAlpha, the latest entry in the search engine arena, I had a surprise:


Interpreting "mysql" as "mycil"
Input interpretation:
chlorphenesin
A further search for chlorphenesin explains that it is a drug belonging to the "central muscle relaxants" category.

Saturday, May 23, 2009

Wordcamp in Milan - slides on MySQL 5.x performance


Wordcamp

I am attending WordCamp 2009 in Milan.
I presented MySQL 5.1 and 5.4, with an emphasis on performance.
People are interested, and many questions are flying around, some answerable and some not. The questions about Oracle were swiftly avoided, while the ones about fork comparisons were answered with live examples.
The attendees appreciated it.

Monday, May 18, 2009

Cleaning up Wordpress comment tables


cleaning wp

In Montreal with Dups and Kaj, we were looking at a number of technical problems, and each one of us got something valuable from the meeting.
One of Kaj's problems was a collection of WordPress blogs infested by spam. Kaj had already done something about it, but the situation was critical. Before applying Akismet to his comments, he needed to clean up the majority of the spam in some easy way.

It is not rocket science, really, but it needs some care in the implementation.
The comment tables in a WordPress installation are called wp_X_comments, where X is a progressive integer. This makes it a bit problematic to run a single query against all comments. Additionally, the same problem exists on two different servers, with a different number of blogs, so I needed to come up with a portable remedy.
Here's the starting point.

SELECT
table_name, table_rows
from
information_schema.tables
where
table_schema=schema() and table_name like '%comments';
+----------------+------------+
| table_name     | table_rows |
+----------------+------------+
| wp_10_comments |        385 |
| wp_11_comments |          1 |
| wp_12_comments |          7 |
| wp_13_comments |          0 |
| wp_14_comments |         80 |
| wp_15_comments |          0 |
| wp_16_comments |         24 |
| wp_17_comments |          1 |
| wp_18_comments |          2 |
| wp_19_comments |          3 |
| wp_1_comments  |          6 |
| wp_2_comments  |      34457 |
| wp_3_comments  |        582 |
| wp_4_comments  |        103 |
| wp_5_comments  |          1 |
| wp_6_comments  |          0 |
| wp_7_comments  |          3 |
| wp_8_comments  |          1 |
| wp_9_comments  |         24 |
+----------------+------------+

DESC wp_1_comments;
+----------------------+---------------------+------+-----+
| Field                | Type                | Null | Key |
+----------------------+---------------------+------+-----+
| comment_ID           | bigint(20) unsigned | NO   | PRI |
| comment_post_ID      | int(11)             | NO   | MUL |
| comment_author       | tinytext            | NO   |     |
| comment_author_email | varchar(100)        | NO   |     |
| comment_author_url   | varchar(200)        | NO   |     |
| comment_author_IP    | varchar(100)        | NO   |     |
| comment_date         | datetime            | NO   |     |
| comment_date_gmt     | datetime            | NO   | MUL |
| comment_content      | text                | NO   |     | <--
| comment_karma        | int(11)             | NO   |     |
| comment_approved     | varchar(20)         | NO   | MUL | <--
| comment_agent        | varchar(255)        | NO   |     |
| comment_type         | varchar(20)         | NO   |     |
| comment_parent       | bigint(20)          | NO   |     |
| user_id              | bigint(20)          | NO   |     |
+----------------------+---------------------+------+-----+

To get rid of most of the probable spam, we can search for unapproved comments that contain at least three links. There is a chance that some legitimate comments look like that, but this is the best shot we have.
First, let's see how big the problem is:
set group_concat_max_len = 65000;

set @sum=0;
set @q = (
select group_concat(
concat(
'SELECT "',
table_name,
'" as T, count(*), @sum := @sum + COUNT(*) C FROM ',
table_name,
' where comment_approved=0'
) SEPARATOR ' UNION '
)
from information_schema.tables
where table_schema = schema()
and table_name like 'wp_%_comments');

prepare q from @q;
execute q;
deallocate prepare q;
select @sum;

This code will tell us how many unapproved comments there are in each comment table.
The way it works is by building a single query from the list of comment tables, taken from the information schema, and running it as a prepared statement.
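For illustration, the statement assembled into @q looks roughly like this (reformatted over several lines and truncated to the first two tables; the table names come from your own installation):

SELECT "wp_10_comments" as T, count(*), @sum := @sum + COUNT(*) C
  FROM wp_10_comments where comment_approved=0
UNION
SELECT "wp_11_comments" as T, count(*), @sum := @sum + COUNT(*) C
  FROM wp_11_comments where comment_approved=0
-- ... one UNION branch for each remaining comment table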

+----------------+----------+-------+
| T              | count(*) | C     |
+----------------+----------+-------+
| wp_10_comments |      339 |   339 |
| wp_11_comments |        0 |   339 |
| wp_12_comments |        2 |   341 |
| wp_13_comments |        0 |   341 |
| wp_14_comments |       73 |   414 |
| wp_15_comments |        0 |   414 |
| wp_16_comments |        3 |   417 |
| wp_17_comments |        0 |   417 |
| wp_18_comments |        0 |   417 |
| wp_19_comments |        0 |   417 |
| wp_1_comments  |        0 |   417 |
| wp_2_comments  |    33784 | 34201 |
| wp_3_comments  |      579 | 34780 |
| wp_4_comments  |       90 | 34870 |
| wp_5_comments  |        0 | 34870 |
| wp_6_comments  |        0 | 34870 |
| wp_7_comments  |        1 | 34871 |
| wp_8_comments  |        0 | 34871 |
| wp_9_comments  |       24 | 34895 |
+----------------+----------+-------+
+-------+
| @sum  |
+-------+
| 34895 |
+-------+

Cleaning up the comments manually is thus ruled out.
The next step is to see how many comments contain at least three links. To create the search pattern, we use the REPEAT function.
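As a quick sanity check, you can expand the pattern by hand before using it; REPEAT simply glues three copies of the 'http%' wildcard together:

select concat('%', repeat('http%', 3)) as pattern;
-- returns '%http%http%http%': LIKE will match any comment
-- containing at least three occurrences of "http"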

set @w = concat('%', repeat('http%',3));
set @sum1 =0;
set @q = (
select group_concat(
concat(
'SELECT "',
table_name,
'" as T, count(*), @sum1 := @sum1 + COUNT(*) C FROM ', table_name,
' where comment_content LIKE "', @w, '" and comment_approved = 0'
) SEPARATOR ' UNION '
)
from information_schema.tables
where table_schema = schema()
and table_name like 'wp_%_comments');

prepare q from @q;
execute q;
select @sum, @sum1, @sum1 / @sum * 100 ;

This second statement shows how many comments would be removed by looking at the links alone.

+----------------+----------+-------+
| T              | count(*) | C     |
+----------------+----------+-------+
| wp_10_comments |       51 |    51 |
| wp_11_comments |        0 |    51 |
| wp_12_comments |        0 |    51 |
| wp_13_comments |        0 |    51 |
| wp_14_comments |        0 |    51 |
| wp_15_comments |        0 |    51 |
| wp_16_comments |        0 |    51 |
| wp_17_comments |        0 |    51 |
| wp_18_comments |        0 |    51 |
| wp_19_comments |        0 |    51 |
| wp_1_comments  |        0 |    51 |
| wp_2_comments  |    26340 | 26391 |
| wp_3_comments  |      402 | 26793 |
| wp_4_comments  |       57 | 26850 |
| wp_5_comments  |        0 | 26850 |
| wp_6_comments  |        0 | 26850 |
| wp_7_comments  |        0 | 26850 |
| wp_8_comments  |        0 | 26850 |
| wp_9_comments  |        9 | 26859 |
+----------------+----------+-------+

+-------+-------+--------------------+
| @sum  | @sum1 | @sum1 / @sum * 100 |
+-------+-------+--------------------+
| 34895 | 26859 |            76.9709 |
+-------+-------+--------------------+

This filter will eliminate almost 77% of the unapproved comments.
The last step is producing the DELETE queries.

drop table if exists remove_queries;
create table remove_queries (q varchar(600));

insert into remove_queries
select
concat(
'DELETE FROM ', table_name,
' where comment_content LIKE "', @w, '" and comment_approved = 0;'
)
from information_schema.tables
where table_schema = schema()
and table_name like 'wp_%_comments';

select * from remove_queries;

This last step creates the DELETE queries, the ones that will remove that 77% of spam.
Once they are created, you can either cut-and-paste them or pipe them between client runs:

mysql wordpress -N -e 'select * from remove_queries' | mysql -v -v -v wordpress
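For reference, each row in remove_queries holds a statement of this shape (shown here with @w expanded and a sample table name; your table names will differ):

DELETE FROM wp_1_comments
 where comment_content LIKE "%http%http%http%"
   and comment_approved = 0;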

The biggest bulk is gone, but we can do some more cleanup. The query to generate is this one:

DELETE FROM wp_XXX_comments
where
    comment_approved = "spam"
    or (
        (
            (comment_content regexp "casino|cialis|viagra|sex|loan")
            or length(comment_content) < 10
        )
        and comment_approved = 0
    );
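To run it against every comment table, the same information_schema trick used above applies. Here is a minimal sketch, reusing the remove_queries table from the previous step (my shortcut, not a requirement) and the keyword list from the query above:

truncate remove_queries;

insert into remove_queries
select
    concat(
        'DELETE FROM ', table_name,
        ' where comment_approved="spam"',
        ' or (((comment_content regexp "casino|cialis|viagra|sex|loan")',
        ' or length(comment_content) < 10)',
        ' and comment_approved = 0);'
    )
from information_schema.tables
where table_schema = schema()
  and table_name like 'wp_%_comments';

-- pipe the generated statements through the client as before
select * from remove_queries;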

It will remove another 20% of the spam, leaving the situation light enough that a few runs of the Akismet plugin will take care of the rest.
On the other server, these two scripts removed 90% of the spam, leaving a few hundred comments to review manually.

Wednesday, May 13, 2009

I am sponsoring the Boston MySQL User Group


Sheeri's MySQL Bible is here!

As many of you probably know, Boston is one of my favorite towns. I have been in love with it since my first visit in 1990, and on every subsequent trip I have been enchanted by the town itself and by the kindness its people have shown me.
I am also very fond of the Boston MySQL User Group, and its organizer, Sheeri K. "Super Hero" Cabral.

I want to show my personal appreciation for Boston by sponsoring the MySQL User Group for one year. I am doing that with my own money, without involving my company.
You can also donate to a MySQL user group, if you want, through the Technocation fund.
You can specify that your funds go to a particular user group, or to any user group that needs your help.

Tuesday, May 12, 2009

MySQL 5.4 - performance at your fingertips


MySQL performance

The session at the Boston MySQL User Group was very enjoyable. There was a full room, and the presentation was followed with attention and curiosity.
The slides I used for this presentation include the deck used by Allan Packer and Mikael Ronström at the MySQL Conference 2009, followed by an account of my own experience with 5.4.
Here is the final deck: MySQL 5.4 theory and practice

While preparing the slides, I realized that, for the first time, we have a product that offers a painless performance gain. Unlike the tuning and technological improvements that have been necessary so far, upgrading to MySQL 5.4 is a simple and easy step toward better performance. It deserves at least some testing, because of the huge difference it can make.

On May 21, there will be an online presentation by Allan Packer on MySQL 5.4 Benchmarks In-Depth.

Saturday, May 09, 2009

MySQL Sandbox 3.0 is released - looking at the future


MySQL Sandbox 3.0

It's done. MySQL Sandbox 3.0 is now available.
New features include:
  • Installation as a Perl module;
  • Copying, moving, deleting, and preserving sandboxes;
  • Port checking before installing;
  • A test suite with 200 tests;
  • User defined modular tests;
  • A rich cookbook.

If you get the source code through the Bazaar repository, you will see something new. There is now a directory, script_templates, containing the re-engineered scripts that go inside the Sandbox. These scripts are not in version 3.0.0, but they are the first step toward the changes that will go into version 3.1. The main new feature is modularity. These scripts are now easier to maintain and modify.
The new start script is also more powerful. You will be able to start a sandboxed server without the mysqld_safe container, send the output to a user-defined location, and choose between running it in the background or the foreground.
All in due time. For now, the new features pass all the existing tests, but I didn't want to include them in the current release before designing some rigorous tests for them.
Looking forward to your comments!

Thursday, May 07, 2009

MySQL 5.4 at the MySQL Boston User group


MySQL 5.4

The MySQL Boston User group will meet on May 11 at 7pm.
I will speak about my experience using MySQL 5.4, with some general information on the release.
By the way, you may know that there has been some trouble concerning the MySQL meetup groups. We are looking into the matter, and we will release information as soon as possible. For now, I can tell you that the current agreement between MySQL and Meetup is valid at least until June 10, 2009.

Sunday, May 03, 2009

MySQL Sandbox 3.0 release candidate




MySQL Sandbox is now in release candidate status. If no bugs are reported on the latest version (2.0.99f), I will repackage it as 3.0.
In addition to the list of features previously announced, I managed to implement another feature that has been on the wish list for a long time, i.e. creating a sandbox from existing binaries, such as the ones installed by an .rpm or .deb package.

The new script make_sandbox_from_installed meets that expectation by creating a fake BASEDIR with symbolic links.
Other important additions:
  • The test suite now has 200 tests, making it much more comprehensive than before.
  • The above feature was also achieved thanks to the improved test plugins, which allowed me to write tests much faster.
  • There is now another module, which only contains documentation. MySQL::Sandbox::Recipes is a cookbook that covers most of the cool things you can do with MySQL Sandbox. It is not complete yet, but the stubs for the unfinished recipes are enough to give you a clue.

MySQL Sandbox is a non-trivial project.

part            files    lines   characters
scripts            10    4,390      141,585
libraries           3    1,595       51,304
documentation       2    1,250       50,248
tests              19      686       19,951
total              34    6,671      239,048

If you have some recipes to contribute to the MySQL::Sandbox cookbook, please let me know.