Thursday, December 12, 2013

Quick and dirty concurrent operations from the shell

Let’s say that you want to measure something in your database, and for that you need several operations to happen in parallel. If you have a capable programming language at your disposal (Perl, Python, Ruby, PHP, or Java would fit the bill) you can code a test that sends several transactions in parallel.

But if all you have is the shell and the mysql client, things can be trickier. Today I needed such a parallel result, and I only had mysql and bash to accomplish the task.

In the shell, it’s easy to run a loop:

for N in $(seq 1 10)
do
    mysql -h host1 -e "insert into sometable values($N)" 
done

But this runs the queries sequentially, and each session opens and closes before the next one starts. Therefore there is no concurrency at all.
Then I remembered that the usual method for parallel execution in the shell is to run things in the background, and then collect the results. I just needed to find a way of keeping the first session open while the others were being started.

Here’s what I did: I ran a loop with a countdown, using the seq command, and I included a sleep statement in each query, waiting for a decreasing number of seconds. If I start with 10 seconds, the first query will sleep for 10 seconds, the second one for 9 seconds, and so on. I run each command in the background, so each one eats up its waiting time independently.

#!/bin/bash
mysql -h host1 test -e 'drop table if exists t1'
mysql -h host1 test -e 'create table t1 (i int not null primary key, ts timestamp)'

for N in $(seq 10 -1 1)
do
    query1="set autocommit=0"
    query2="insert into test.t1 (i) values($N)"
    query3="select sleep($N) into @a; commit"
    mysql -h host1 -e "$query1;$query2;$query3" &
done

wait

mysql -h host1 test -e 'select * from t1'

The effect of this small script is that the commits for these 10 commands come at the same time, as you can see from the resulting table:

+----+---------------------+
| i  | ts                  |
+----+---------------------+
|  1 | 2013-12-12 18:08:00 |
|  2 | 2013-12-12 18:08:00 |
|  3 | 2013-12-12 18:08:00 |
|  4 | 2013-12-12 18:08:00 |
|  5 | 2013-12-12 18:08:00 |
|  6 | 2013-12-12 18:08:00 |
|  7 | 2013-12-12 18:08:00 |
|  8 | 2013-12-12 18:08:00 |
|  9 | 2013-12-12 18:08:00 |
| 10 | 2013-12-12 18:08:00 |
+----+---------------------+

This is a very good result, but what happens if I need to run 500 queries simultaneously, instead of 10? I don’t want to wait 500 seconds (8+ minutes). So I made an improvement:

for N in $(seq 5000 -10 1)
do
    echo $N
    query1="set autocommit=0"
    query2="insert into test.t1 (i) values($N)"
    query3="select sleep(concat('0.', lpad($N,4,'0'))) into @a; commit"
    mysql -h host1 -e "$query1;$query2;$query3" &
done

Now each SLEEP command is called with a fractional argument, starting at “0.5000”, and continuing with “0.4990”, and so on. You can try it. All 500 rows are committed at the same time.
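
To verify that the commits really happened together, you can group the rows by timestamp and count them; if everything worked, there is a single group (this assumes the same test.t1 table as above):

mysql -h host1 test -e 'select ts, count(*) as rows_committed from t1 group by ts'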

However, “the same time” is a bit fuzzy. When we use timestamps with second granularity, it’s quite easy to show the same time. But with fractional seconds it’s a different story. Here’s what happens if I use MySQL 5.6 with timestamp columns that store milliseconds (TIMESTAMP(3)):

+----+-------------------------+
| i  | ts                      |
+----+-------------------------+
|  1 | 2013-12-12 18:27:24.070 |
|  2 | 2013-12-12 18:27:24.070 |
|  3 | 2013-12-12 18:27:24.069 |
|  4 | 2013-12-12 18:27:24.068 |
|  5 | 2013-12-12 18:27:24.065 |
|  6 | 2013-12-12 18:27:24.066 |
|  7 | 2013-12-12 18:27:24.062 |
|  8 | 2013-12-12 18:27:24.064 |
|  9 | 2013-12-12 18:27:24.064 |
| 10 | 2013-12-12 18:27:24.064 |
+----+-------------------------+
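
For reference, the output above comes from a table defined with fractional-second precision. Here is a minimal sketch of such a definition (the actual test table may have differed):

create table t1 (
    i  int not null primary key,
    ts timestamp(3) default current_timestamp(3) on update current_timestamp(3)
);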

For the purpose of my tests (the actual queries were different) this is not an issue. Your mileage may vary.

Monday, December 09, 2013

Old and new MySQL verbosity

I was pleased to see Morgan’s announcement about a fix to an old problem of mine. In March 2012 I complained about MySQL verbosity during installation.

In MySQL 5.7.3, this behavior was changed. While the default is still as loud as it can be, you can now add an option (log_error_verbosity) to send only errors to STDERR, which allows you to hide the output of mysql_install_db and still get the errors, if any occur.
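
For instance, here is a minimal sketch of the option in a configuration file (in 5.7, value 1 means errors only, 2 adds warnings, and 3, the default, also includes notes):

[mysqld]
# print only errors to STDERR / the error log
log_error_verbosity = 1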

Well done!

However, the same obnoxious verbosity is also in MariaDB 10.0.x. Since I discussed this specific bug with a few MariaDB developers early in 2012, I was disappointed to see this same output when running mysql_install_db with MariaDB. Here’s the same appeal: MariaDB developers, please fix this usability issue!

And now, for the laughing notes. All versions of MySQL available now, from Oracle, Percona, and MariaDB, print this line when installing with mysql_install_db:

Please report any problems with the './scripts/mysqlbug' script!

There is Bug#29716, reported in 2008, about mysqlbug being unnecessary (by then, it had been obsolete for 2 or 3 years already), with a patch submitted but not committed. So, in 2013, we still see a reference to a tool that stopped working at least 8 years ago. It should not take much to remove this line and replace it with an appropriate link to the bug tracking system.

Sunday, December 08, 2013

Submissions at Percona Live Santa Clara 2014 and Lightning talks

The call for participation at Percona Live MySQL Conference and Expo 2014 is now closed. There have been more than 320 submissions, and this will keep the review committee busy for a while.

One important point: if you have submitted a proposal but haven’t included a bio in your account, do it now. If you don’t, your chances of being taken seriously are greatly reduced. To add a bio, go to your account page and fill in the Biography field. Including a picture is not mandatory, but it will definitely be appreciated.

Although the CfP is closed for tutorials and regular sessions, your chances of becoming a celebrity are not over yet. The CfP is still open for Lightning talks and Birds of a Feather sessions.

If you want to submit a lightning talk, you still have time until the end of January. Don’t forget to read the instructions and remember that lightning talks don’t give you a free pass, but a healthy 20% discount.

So far, I have received 16 proposals. Of these, 6 have been rated highly enough to guarantee acceptance (including mine, for which I have not voted). We still have 6 spots to fill (12 spots in total, 5 minutes each), and I’d rather fill them with talks that appeal to everyone in the committee than scrape the barrel of the mediocre ones. My unofficial goal is to have so many good submissions that I will have to withdraw my own talk. Thus, the potential number of available spots is 7. Please kick my talk off stage, by submitting outstanding proposals!

Friday, November 15, 2013

Parallel replication: off by one

One of the most common errors in development is when a loop or a retrieval by index falls short or long by one unit, usually because of an oversight or a logic error in coding.

Of the following snippets, which one will run 10 times?

/* #1 */    for (N = 0 ; N < 10; N++) printf("%d\n", N);

/* #2 */    for (N = 0 ; N <= 10; N++) printf("%d\n", N); 

/* #3 */    for (N = 1 ; N <= 10; N++) printf("%d\n", N);

/* #4 */    for (N = 1 ; N < 10; N++) printf("%d\n", N);

The question is deceptive, as there are two snippets that will run 10 times (#1 and #3). But they will print different numbers. If you were aiming for numbers from 1 to 10, only #3 is good.

After many years of programming, off-by-one errors are rare in my code, and I have been able to spot them or prevent them at first sight. That’s why I feel uneasy when I look at the way parallel replication is enabled in MySQL 5.6/5.7 and MariaDB 10.0.5. In both implementations, there is a variable that sets the number of replication threads:

set global slave_parallel_workers=5 in MySQL
set global slave_parallel_threads=5 in MariaDB

Yet, for both implementations, you can set the number of threads to 1, and it is not the same as disabling parallel replication.

set global slave_parallel_workers=1 in MySQL
set global slave_parallel_threads=1 in MariaDB

It will run parallel replication with one thread, meaning that you get all the overhead of parallel replication with none of the benefits. Not only that, but replication actually slows down: the extra channel reduces performance by 7% in MariaDB and 10% in MySQL.
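
To actually turn parallel replication off, the value to use is 0, not 1. A minimal sketch (the slave should be stopped before changing these variables):

STOP SLAVE;
-- MySQL 5.6/5.7: 0 means no worker threads, i.e. the traditional single applier thread
SET GLOBAL slave_parallel_workers = 0;
-- MariaDB 10.0: the equivalent is
-- SET GLOBAL slave_parallel_threads = 0;
START SLAVE;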

Now for the punch line. In Tungsten Replicator, to disable parallel replication you set the number of channels to 1 (the intuitive value). If you set it to 0, the setup fails, as it should, since there would be no replication without channels. The reason for this difference is that in Tungsten, parallel replication was designed around the core functionality, while in MySQL and MariaDB it is an added feature that struggles to be integrated.

Wednesday, November 13, 2013

Call for papers (with lightning talks): Percona Live MySQL Conference 2014

The call for participation for Percona Live MySQL Conference 2014 is still open. As part of the review committee, I will be looking at the proposals, and I hope to see many interesting ones.

There is a novelty in the submission form. In addition to tutorials and regular sessions, you can now submit proposals for lightning talks, in which I am particularly interested, as I organized the lightning talks in the past two editions, and I am in charge of continuing the tradition for the next one.

If you want to be a speaker at the conference, here are some tips to get your proposal accepted:

  • Propose a topic that you know well;
  • Take some time to write a well-thought-out and meaningful proposal: nothing sends me to the rejection button faster than statements like “I want to talk about X, I will think of something to say”;
  • Write your proposal with the attendees in mind, i.e. giving information that will make them want to see your presentation;
  • But also write with the committee in mind. There is a space for private messages to the reviewers. Use it wisely if there is something that we need to know.
  • Mind your buzzwords. I am not easily impressed by fashionable topics. Back your proposal with sound reasoning. Don’t assume that I, or anyone else on the committee, see things your way, or the way they are reported in the press.
  • Check your spelling. Another way of getting rejected quickly is to misspell the topic you claim to be an expert in.
  • And check your spelling again. If you miss the difference between “know its shit” and “know it’s shit,” I am less inclined to approve.
  • Write a sensible bio. We need to know who you are and what you do, to see if your story is compatible with your proposal.
  • Write enough to make your proposal clear. A proposal that is shorter than your bio will raise a red flag. But do not write too much: you are writing a proposal, not an article on the matter. If you have written an interesting article on the topic, give us a URL.

Regarding the lightning talks, I have some more recommendations.

  • A lightning talk lasts 5 minutes maximum. Don’t propose a topic that cannot be exhausted in that timeframe.
  • An accepted lightning talk does not give you a free pass (unless you are also accepted as speaker for a regular talk). You will be given a code to register at a 15% discount.
  • You should propose something that is either highly interesting, or surprising, or entertaining, or all of the above: the lightning talks are a show.
  • Be daring in your proposals. While a regular talk might be refused if you propose to sing the InnoDB settings, a LT on this topic could be seen as legitimate (but you must demonstrate that you can do it!)
  • Convince me (specifically, as I will be choosing the talks accepted by the committee) that you want to be on stage and have the abilities for the show.
  • Be prepared to show your slides earlier than usual. As the organizer, I need to make sure that you have something meaningful to show before sending you on stage.
  • Be aware of the rules:
    • All slides will be loaded into a single computer, to minimize delays between talks;
    • Someone (probably me) will nag the speakers until they either surrender their slides or escape to Mexico;
    • All speakers will meet 15 minutes before the start, and be given the presentation order. Missing speakers will be replaced by reserve speakers;
    • The speaker will have 5 minutes to deliver the talk.
    • When one minute is left, there will be a light sound to remind the speaker of the remaining time.
    • When 10 seconds are left, most likely the audience will start chanting the countdown.
    • When the time is finished, the speaker must leave the podium to the next one.

If you have reached this point, you are ready to submit a proposal!

Friday, October 11, 2013

Tungsten Replicator Filters: A trove of golden secrets unveiled

UPDATE: This post is obsolete, for most of the URLs referenced here have been removed by the contents owner.


Since I joined the company in late 2010, I have known that one of the strong points of Tungsten Replicator is its ability to set filters. The amazing capabilities offered by Tungsten filters cannot be fully grasped unless we explain how replication stages work.

There are several default stages in the replication stream. Every stage has an extraction task and an apply task. The extraction task gets data from the previous stage’s repository, and the apply task saves the data to the next repository, which can be either temporary storage (a memory queue, a THL file) or the final destination (the slave database server). Consider that the architecture allows developers to add stages, and you will appreciate its full power. For every stage, we can insert one or more filters between the two tasks. This architecture gives us ample freedom to design our pipelines.


Tungsten Replication stages

What’s more interesting, filters in Tungsten can be of two types:

  • built-in filters, written in Java, which require users to compile and build the software before the filter is available;
  • Javascript filters, which can do almost everything the built-in ones can do, but don’t require recompiling. All you need is to deploy the filter in the appropriate folder, and run a tpm command to install or update the replicator.

The Tungsten team has developed a large number of filters, either for general purpose pipeline control (such as renaming objects, or excluding or including schemas and tables from replication) or for very specific tasks that were needed to meet customer requirements or to help implement heterogeneous pipelines (such as converting ENUM and SET columns to strings, dropping UPDATE and DELETE events, normalizing DDL, etc.). There are 23 built-in and 18 Javascript filters available. They were largely undocumented, or sparsely documented in the code. Now, thanks to my colleague MC Brown, Continuent director of documentation, the treasure is there for the taking.

Tungsten filters are a beautiful feature to explore, but also a powerful tool for users with special needs who want to create their own customized pipeline. A word of warning, though. Filters are powerful and sometimes fun to implement and use, but they don’t come for free. There are two main things to keep in mind:

  • Using more than one filter for the same stage may result in performance loss. The amount of performance loss depends on the stage and on your system’s resources. For example, if your replication’s weakest point is data transfer across the network, adding two or three filters to this stage (remote-to-thl) will make things worse. If you can apply the filter to the previous or next stage, you may achieve your purpose without many side effects. As always, the advice is: “test, benchmark, repeat.”
  • Filters that alter data (such as those that exclude schemas or tables, or drop events) will make the slave unsuitable for promotion. You should always ask yourself if the filter may make the slave a lousy master, and if it does, make sure you have at least another slave that is replicating without filters.

As a final parting thought, be aware that yours truly and the above-mentioned MC Brown will be speaking at Percona Live London 2013, with a full, take-no-prisoners, 6-hour-long complete Tungsten Replicator tutorial, where we will cover filters and give some explosive and sensational examples.

Wednesday, October 09, 2013

Speaking at the MySQL NoSQL & Cloud Conference & Expo in Buenos Aires

I am on my way to Argentina, where I will be speaking at the MySQL NoSQL & Cloud Conference & Expo.

I have two talks: one on my pet project MySQL Sandbox, and one on replication between MySQL and MongoDB (using another project dear to me, Tungsten Replicator).

It’s my first visit to Argentina, and I will try to look around a bit before the conference. I also look forward to seeing many ex-colleagues and well-known speakers there. The lineup includes speakers from Percona, EffectiveMySQL, PalominoDB, MariaDB, SkySQL, Tokutek, and OpenStack.

I am looking forward to this trip. My presentation on MongoDB replication is a first for me, and I am always pleased when I can break new ground. I have the feeling that the networking is going to be more exciting than the sessions.

If you are in Buenos Aires, come to the conference and say hi!

Thursday, September 26, 2013

Forking MySQL: for how long can forks keep up?

  • Fact: MySQL 5.6 was released as GA in February 2013
  • Fact: MySQL 5.6 has been available with its complete set of features since September 2012
  • Fact: On September 21st, Oracle released MySQL 5.7.2, which is the de facto major release after MySQL 5.6 (5.7.1 was just a token “we’re-still-in-business” release).
  • Fact: As of today, there is no GA-ready fork of MySQL 5.6.

Percona Server is still in RC state, while MariaDB, with its runaway version 10, is still in alpha. Of these releases, Percona Server seems to be the one in better shape. Their problem was to adapt Percona Server to the enhanced codebase of 5.6, and the merging problems were bigger than the ones encountered in 5.1 and 5.5. Percona Server is a business-oriented fork. It includes specific features for performance and manageability, things that users are asking for, and things that make Percona professional services easier to carry out.

Much different is the case of MariaDB. Its feature set is wider and harder to categorize globally. It includes some features that enhance performance (group commit, subquery optimization), but also plenty of features that were included only because nobody else wanted to touch them. In my experience, adding features to a project does not make it more stable, as the MySQL team knows well from its disastrous experience with version 6.0, which was aborted after two years of alpha. MariaDB is in a state where it is adding lots of stuff to its code base, while merging selected pieces from MySQL 5.6. It could be a recipe for success or disaster. User experience will tell.

Up to MySQL 5.5, MariaDB kept itself completely compatible with MySQL, with the goal of being a drop-in replacement. With the capriciously numbered version 10, that promise of compatibility has been broken. Most of the features introduced in MySQL 5.6 have not been included in MariaDB 10. Some of them have been rewritten with a completely different approach. As a result, MariaDB 10 lacks some of MySQL 5.6’s features (parallel replication, for example) while showing some that are not in the upstream project (multi-source replication). The design of GTID has been started from scratch, and will surely make interoperability between the two projects really hard (if not downright impossible).

Looking at the status of the projects, it seems that the MySQL team at Oracle has produced so many changes that keeping up with it, even in terms of simple patch merging, is a heavy task. Where does this leave us, the community users? We have the choice between using the latest and greatest from Oracle, or waiting for the next latest and greatest from one of the forks. Percona Server seems to be on the final stretch of a less-than-easy catch-up, but MariaDB is farther away. And even if it releases a GA soon, it will find itself in a situation where merging changes from Oracle is going to get tougher and tougher. This may mean that performance enhancements from mainstream MySQL will take months (or years) to reach MariaDB. Users who want to benefit from the latest improvements will choose between Oracle MySQL and (with some delay) Percona Server. MariaDB will need to either do some sort of cherry-picking from Oracle releases or implement its own improvements. Can they compete with the 200-strong engineering team at Oracle? It’s going to be a tough proposition.

Given the above facts, how is it that we hear of big players like Red Hat and Google announcing that they will adopt MariaDB instead of Oracle’s MySQL? Since I could not find a sensible technical reason, I am forced to conclude that the choice has been political. Both Red Hat and Google have been entangled with Oracle in commercial competition, and sometimes in court. Yet, choosing software for non-technical reasons looks foolish to me.

Summing up: Oracle is releasing a steady stream of improvements for MySQL. This shows a high level of commitment that does not square with what the doomsayers have been announcing for the past 4 years. So far, facts are contradicting that FUD. Forks are lagging behind. Some more than others.

Disclaimer: The above thoughts are my own, and don’t reflect the opinions of my employer or other companies.

Tuesday, September 24, 2013

MySQL 5.7.2 : Good job Oracle! (Well, almost)

On September 21st, during the opening keynote at MySQL Connect 2013, Tomas Ulin disclosed the release of MySQL 5.7.2. This is a milestone release that includes several new features, unlike the previous one (5.7.1), which was just a point of pride, a statement of Oracle’s continuous commitment to releasing new versions of MySQL. In MySQL 5.7.2, we see several new features:

  • First and foremost, performance. The announcement slides say MySQL 5.7.2 is 95% faster than MySQL 5.6 and 172% faster than MySQL 5.5. I don’t know yet in which circumstances these numbers hold true, but I am sure someone at Percona will soon prove or disprove the claim.
  • Performance Schema tables for several aspects:
    • Memory usage
    • Stored programs and Events pre-filtering
    • Replication tables. You can have all the information that you get from SHOW SLAVE STATUS, and then some more, in nicely organized tables. Regrettably, there is no corresponding table or tables for SHOW MASTER STATUS. This is a feature that I asked for a long time ago (7 1/2 years, to be precise) and for which I had great expectations. Now the feature comes, sort of crippled. Without a table for the contents of SHOW MASTER STATUS, no good monitoring system can be implemented.
  • There is now a different method for parallel replication. Instead of splitting the threads by schema, the parallelization can be defined by using a logical clock, which uses information stored by the master, telling the slaves when it is safe to run threads in parallel.
  • More flexibility with Semi-synchronous replication
  • You can now have more than one trigger per table per action. The CREATE TRIGGER syntax now includes a “trigger_order” clause that allows you to define the sequence in which your triggers will run (see the sketch after this list);
  • EXPLAIN FOR CONNECTION will let you run an explain statement for an already running query.
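
For example, here is a minimal sketch of the multiple-triggers feature (table and trigger names are invented for illustration); the trigger_order clause is expressed with the FOLLOWS or PRECEDES keywords:

create table t1 (i int, j int);
create trigger tr1 before insert on t1 for each row
    set new.j = 1;
-- a second BEFORE INSERT trigger on the same table and event,
-- explicitly scheduled to run after tr1
create trigger tr2 before insert on t1 for each row follows tr1
    set new.j = new.j + 10;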

There are more improvements, all listed in the Release notes and in What’s new in MySQL 5.7. Two minor changes did not make the news, and yet they mean a lot for the end user experience:

  • The verbosity of the bootstrap, which filled the screen (and the error log) with useless trivia that made users lose track of the real errors, has been greatly reduced. Now you only see warnings and errors. Thank you, MySQL team!
  • The test suite now uses InnoDB as the default storage engine. This means that a lot of possible errors that could have been overlooked will now be caught by the test suite. The next step will be to use InnoDB for the grant tables, but there does not seem to be a set goal for this.

What else is there? The MySQL Labs now list several experimental features for MySQL 5.7.2. These are features that did not make the cut for the milestone, and are still in a less reliable state. Most notable among them is a build that offers multi-source replication. I need to test it more before I hazard a deeper judgment. For now, I am impressed by the amount of work produced.

Some disappointment comes from the realization that many of the shortcomings found in MySQL 5.6 are still there. Some of them have been reduced or eliminated by the new performance schema tables, but the bulk of what I noticed in MySQL 5.6 is still there, with the same limitations. I will come back to this topic with a deeper analysis of the new features.

Friday, August 30, 2013

Shi...pment happens - Tungsten Replicator 2.1.2

It was just a few days ago that we announced, with celebratory enthusiasm, Tungsten Replicator 2.1.1, and today we are at it again, with Tungsten Replicator 2.1.2.

What happened? In a surfeit of overconfidence, we released Tungsten 2.1.1, with faith in the test suite and its results. The faith was justified, as the test suite was able to catch any known problem and regression. The overconfidence was unjustified because, due to a series of unfortunate events, some sections of the test suite were accidentally disabled, and the regression that was lurking in the dark was not caught.

Therefore, instead of having a quiet post-release week-end, the whole team worked round the clock to plug the holes. There were two bugs that actually broke the tests (once we put the suite in good order, that is).

We came out of the exercise with the embarrassing bugs fixed, a few more found in the process, and a much improved test suite, which now checks itself to make sure that nothing is missing, and everything that should be tested was actually tested. While the bugs were fixed over the week-end, we spent three more days just testing the release with all the firepower we could muster.

The test suite has grown quite a lot during the past years. The first version of the suite, 2+ years ago, ran in about 15 minutes. We have been adding steadily to the tests, and now the full set (when none of it gets accidentally switched off, that is) requires about 12 hours. Now we feel much better about the release. All I wrote about Tungsten Replicator 2.1.1 also holds for Tungsten 2.1.2, so I encourage all to give it a try.

As a further lesson learned, from now on we’re going to publish a release candidate a few weeks before blessing a build as GA. The labels on the downloads page tell you the stage of readiness of each build.

Wednesday, August 21, 2013

Tungsten-Replicator 2.1.1 with better installation and built-in security


UPDATE 2013-08-30: Tungsten 2.1.2 was released.

UPDATE 2013-08-23: We have found a few problems that happen when replicating with RBR and temporal columns. We will have to publish an updated bugfix release quite soon.

Tungsten Replicator 2.1.1 is out. Key features in this release are:

  • A better installer, of which we have already given a preview in tpm, the multi-master composer. The new installer allows faster and more powerful deployments of both single-master and multi-master topologies. And it also enables the next feature:
  • Secured communication layer. Now the replicator data and administrative messages can be encrypted with SSL across nodes. The security layer, once installed, is transparent. All replication features will keep working as before, and the encryption is independent from the database. In fact, heterogeneous replication (e.g. MySQL to MongoDB, Oracle to MySQL, etc) can use it just as easily as MySQL to MySQL replication.
  • Full support for MySQL 5.6 binary log improvements. Now you can have the best of both worlds, running MySQL 5.6 with its enhanced performance and Tungsten’s advanced replication features, without compromises. Due to this improvement, we also have the first change in our transport layer (the Transaction History Logs, or THL) since we released parallel replication. This means that a full cluster upgrade is needed (first slaves, and then masters) if you want to use the new release.

For more information on Tungsten Replicator 2.1.1, see the Release notes.

What does this mean for the common user? Let’s see what you can expect when installing Tungsten Replicator 2.1.1:

$ tar -xzf tungsten-replicator-2.1.1-230.tar.gz
$ cd tungsten-replicator-2.1.1-230
$ export VERBOSE=1
$ ./cookbook/install_master_slave
## -------------------------------------------------------------------------------------
## Installation with deprecated method will resume in 30 seconds - Hit CTRL+C now to abort
## -------------------------------------------------------------------------------------
## WARNING: INSTALLATION WITH tungsten-installer and configure-service IS DEPRECATED
## Future versions of Tungsten Cookbook will only support tpm-based installations
## To install with tpm, please set the variable 'USE_TPM' and start again
## -------------------------------------------------------------------------------------
....5....^C

Installation with tungsten-installer, which has been used until now, is still available, but it is deprecated. We want to encourage everyone to use tpm, as we will stop supporting tungsten-installer from the next release (2.1.2).

The main reason for using tpm instead of tungsten-installer is that you can now install with security. The Tungsten manual has an extensive section on how to create security certificates. If you are not used to this kind of task, you may get discouraged from the very beginning, as you will need to create two key stores, one encrypted password store, and one file with JMX access rules. Tungsten Cookbook to the rescue! It is enough to state your intention to install using tpm, with security enabled, and the cookbook script will generate the needed files for you.

$ export USE_TPM=1
$ export WITH_SECURITY=1
$ ./cookbook/install_master_slave
Certificate stored in file </home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/client.cer>
Certificate was added to keystore
[Storing /home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/truststore.ts]
Using parameters:
-----------------
password_file.location   = /home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/passwords.store
encrypted.password   = true
truststore.location      = /home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/truststore.ts
truststore.password      = cookbookpass
-----------------
Creating non existing file: /home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/passwords.store
User created successfuly: cookbook
Using parameters:
-----------------
password_file.location   = /home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/passwords.store
encrypted.password   = true
truststore.location      = /home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/truststore.ts
truststore.password      = cookbookpass
-----------------
User created successfuly: cookbook
# ---------------------------------------------------------------------
# Options for tpm
\
--thl-ssl=true \
--rmi-ssl=true \
--rmi-authentication=true \
--rmi-user=cookbook \
--java-keystore-password=cookbookpass \
--java-truststore-password=cookbookpass \
--java-truststore-path=/home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/truststore.ts \
--java-keystore-path=/home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/keystore.jks \
--java-jmxremote-access-path=/home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/jmxremote.access \
--java-passwordstore-path=/home/tungsten/tinstall/tungsten-replicator-2.1.1-230/cookbook/passwords.store
# ---------------------------------------------------------------------

Next, you will see the complete installation command using tpm, and the cluster will be built as smoothly as it would be without the security additions.

Notice that the paths that you see on the screen are created dynamically. Once installed, the security files will be deployed in a standard location, which will be easily picked up when you need to upgrade.

What you will notice about the secure deployment amounts to a few small differences. When using the cookbook tools, you will see an ssl label next to each secured node:

$ ./cookbook/show_cluster
--------------------------------------------------------------------------------------
Topology: 'MASTER_SLAVE'
--------------------------------------------------------------------------------------
# node host1 (ssl)
cookbook  [master]  seqno:          0  - latency:   0.681 - ONLINE
# node host2 (ssl)
cookbook  [slave]   seqno:          0  - latency:   1.397 - ONLINE
# node host3 (ssl)
cookbook  [slave]   seqno:          0  - latency:   1.683 - ONLINE
# node host4 (ssl)
cookbook  [slave]   seqno:          0  - latency:   1.684 - ONLINE

When using the traditional tools, you will notice one tiny difference in the master URI:

Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000008:0000000000000427;0
appliedLastSeqno       : 0
appliedLatency         : 0.681
channels               : 1
clusterName            : cookbook
currentEventId         : mysql-bin.000008:0000000000000427
currentTimeMillis      : 1377091602039
dataServerHost         : host1
extensions             :
latestEpochNumber      : 0
masterConnectUri       : thls://localhost:/    
masterListenUri        : thls://host1:2112/    
maximumStoredSeqNo     : 0
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : /var/lib/mysql
relativeLatency        : 656.039
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : cookbook
serviceType            : local
simpleServiceName      : cookbook
siteName               : default
sourceId               : host1
state                  : ONLINE
timeInStateSeconds     : 655.552
transitioningTo        :
uptimeSeconds          : 656.431
version                : Tungsten Replicator 2.1.1 build 230
Finished status command...

Instead of thl:// you see thls://. That’s the indication that the replicators are communicating over an SSL channel.

The same procedure works for multi-master and heterogeneous topologies. In fact, the very same mechanism is used in our commercial product, Continuent Tungsten, where it is installed using the same tools and the same tpm options.

For existing deployments we have a manual page dedicated to Upgrading from tungsten-installer to tpm-based installation. If you are a cookbook user, try

./cookbook/upgrade

There is a live webinar covering many Tungsten-Replicator 2.1.1 features. It is free, on Thursday, August 22nd, at 10am PT.

Wednesday, August 14, 2013

Calculating timezone offsets


Time zones are a tricky feature. You live in a given time zone, and most of the time you won’t think about it at all. Or you may live in a place where you are conscious of time zones, such as the United States, if your business spans the country: you know that New York is three hours ahead of San Francisco, and that Chicago and Dallas share the same time zone. Time zone support in MySQL is a complicated business in itself. Once you have updated your time zone tables, you can set your time zone in a human-readable format:

set global time_zone="America/Los_Angeles";
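
As a reminder, the time zone tables mentioned above are loaded with the standard mysql_tzinfo_to_sql utility (the zoneinfo path may vary by system):

mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql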

This is nice and well. It tells you which time zone your server is working with. However, things get a bit hairy when you need to do practical things with this information. Suppose you are looking at a server that displays unfamiliar time zones:

mysql> select @@system_time_zone, @@time_zone;
+--------------------+-----------------+
| @@system_time_zone | @@time_zone     |
+--------------------+-----------------+
| GST                | Asia/Singapore  |
+--------------------+-----------------+
1 row in set (0.00 sec)

When you run select now() in this environment, you get the time in the “Asia/Singapore” time zone, although the operating system is running on Dubai time.

mysql> select now();
+---------------------+
| now()               |
+---------------------+
| 2013-08-14 12:00:18 |
+---------------------+
1 row in set (0.00 sec)

A look at your watch can tell you the difference. But if you need to put that information into a calculation, then you need a way of converting the string “Asia/Singapore” into a number:

mysql> select timediff( now(), utc_timestamp());
+-----------------------------------+
| timediff( now(), utc_timestamp()) |
+-----------------------------------+
| 08:00:00                          |
+-----------------------------------+
1 row in set (0.00 sec)

This is the time zone offset of this server, relative to Greenwich Mean Time (GMT), or UTC. It means that Singapore is 8 hours ahead of GMT. You could get the same result by asking:

mysql> select timediff( now(), convert_tz(now(), @@time_zone, 'UTC'));
+----------------------------------------------------------+
| timediff( now(), convert_tz(now(), @@time_zone, 'UTC'))  |
+----------------------------------------------------------+
| 08:00:00                                                 |
+----------------------------------------------------------+
1 row in set (0.00 sec)

OK. This is longer, but it can serve as a general-purpose time zone offset calculator:

mysql> select timediff( now(), convert_tz(now(), 'America/Los_Angeles', 'UTC'));
+-------------------------------------------------------------------+
| timediff( now(), convert_tz(now(), 'America/Los_Angeles', 'UTC')) |
+-------------------------------------------------------------------+
| -07:00:00                                                         |
+-------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select timediff( now(), convert_tz(now(), 'America/Los_Angeles', 'Europe/Rome'));
+---------------------------------------------------------------------------+
| timediff( now(), convert_tz(now(), 'America/Los_Angeles', 'Europe/Rome')) |
+---------------------------------------------------------------------------+
| -09:00:00                                                                 |
+---------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select timediff( now(), convert_tz(now(), 'America/Los_Angeles', 'Asia/Singapore'));
+------------------------------------------------------------------------------+
| timediff( now(), convert_tz(now(), 'America/Los_Angeles', 'Asia/Singapore')) |
+------------------------------------------------------------------------------+
| -15:00:00                                                                    |
+------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Thursday, July 18, 2013

tpm, the multi-master composer

Multi master topologies blues

Tungsten Replicator is a powerful replication engine that, in addition to providing the same features as MySQL Replication, can also create several topologies, such as

  • all-masters: every node in the deployment is a master, and all nodes are connected point-to-point, so that there is no single point of failure (SPOF).
  • fan-in: Several masters can replicate into a single slave;
  • star: it’s an all-masters topology where one node acts as a hub, which simplifies the deployment at the price of creating a SPOF.

The real weakness of these topologies is that they don’t come together easily. Installation requires several commands, and running them unassisted is a daunting task. Some time ago, we introduced a set of scripts (the Tungsten Cookbook) that allow you to install multi-master topologies with a single command. Of course, the single command is just a shell script that creates and runs all the commands needed for the deployment. The real downer is the installation time. For an all-masters topology with 4 nodes, you need 17 operations, which require a total of about 8 minutes. Until today, we had operations that were both complex and quite slow.

Meet The TPM

Notice: these examples require a recent nightly build of Tungsten Replicator (e.g. 2.1.1-120), which you can download from http://bit.ly/tr_21_builds

But technology advances. The current tungsten-installer, the tool that installs Tungsten Replicator instances, has evolved into the tool that has long been used to install our flagship product, Continuent Tungsten (formerly known as ‘Tungsten Enterprise’). The ‘tpm’ (Tungsten Package Manager) has outgrown its name, as it does way more than managing packages, and actually provides a first class installation experience. Among other things, it provides hundreds of validation checks, to make sure that the operating system, the network, and the database servers are fit for the installation. Not only that, but it installs all components, on all servers, in parallel.

So users of our commercial solution have been enjoying this more advanced installation method for quite a long time, and tpm itself has improved its features, becoming able to install single Tungsten Replicator instances, in addition to the more complex HA clusters. Looking at the tool a few weeks ago, we realized that tpm is so advanced that it could easily support Tungsten Replicator topologies with minimal additions. And now, we have it!

The latest nightly builds of Tungsten Replicator include the ability to install multi-master topologies using tpm. Not only can you now perform these installation tasks using the cookbook recipes, but the commands are so easy that you can actually run them without help from shell scripts.

Let’s start with the plain master/slave installation (Listing 1). The command looks similar to the one using tungsten-installer. The syntax has been simplified a bit: we say members instead of cluster-hosts, master instead of master-host, replication-user and replication-password instead of datasource-user and datasource-password. Looking at this command alone, it does not seem worth the effort to adopt a new syntax just to save a few keystrokes.

./tools/tpm install alpha \
    --topology=master-slave \
    --home-directory=/opt/continuent/replicator \
    --replication-user=tungsten \
    --replication-password=secret \
    --master=host1 \
    --slaves=host2,host3,host4 \
    --start

Listing 1: master/slave installation.

However, the real bargain starts appearing when we compare the installation time. Even for this fairly simple installation, which ran in less than 2 minutes with tungsten-installer, we get a significant gain. The installation now runs in about 30 seconds.

Tpm master slave
Image 1 - Master/slave deployment

Where we see the most important advantages, though, is when we want to run multiple-master deployments. Remember the all-masters installation lasting 8 minutes, which I mentioned a few paragraphs above? Using tpm, it now runs in 45 seconds, and it is one command only. Let’s have a look.

./tools/tpm install four_musketeers \
    --topology=all-masters \
    --home-directory=/opt/continuent/replicator \
    --replication-user=tungsten \
    --replication-password=secret \
    --masters=host1,host2,host3,host4 \
    --master-services=alpha,bravo,charlie,delta \
    --start

Listing 2: all-masters installation.

It’s worth observing this new compact command line by line:

  • ./tools/tpm install four_musketeers: This command calls tpm in ‘install’ mode, for the entity ‘four_musketeers’. This thing is a data service, which users of other Tungsten products and readers of Robert Hodges’ blog will recognize as a more precise definition of what we commonly refer to as ‘a cluster.’ Anyway, this data service appears in the installation and, so far, does not have much to say within the replicator usage. So just acknowledge that you can name this entity as you wish, and it does not affect much of the following tasks.
  • --topology=all-masters: Some of the inner workings of the installer depend on this directive, which tells tpm what kind of topology to expect. If you remember what we needed to do with tungsten-installer + configure-service, you will have some idea of what this directive tells tpm to do, and what you are spared now.
  • --home-directory=/opt/continuent/replicator: Nothing fancy here. This is the place where we want to install Tungsten.
  • --replication-user=tungsten: It’s the database user that will take care of the replication.
  • --replication-password=secret: The password for the above user;
  • --masters=host1,host2,host3,host4: This is the list of nodes where a master is deployed. In the case of an all-masters topology, there is no need to list the slaves: by definition, every host will have a slave service for each of the remaining masters.
  • --master-services=alpha,bravo,charlie,delta: This is the list of service names that we will use for our topology. We can use any names we want, including the host names or the names of your favorite superheroes.
  • --start: With this option, the replicator will start running immediately after the deployment.

This command produces, in 45 seconds, the same deployment that you get with tungsten-installer in about 8 minutes.

Tpm all masters
Image 2 - all-masters deployment

The command is so simple that you could use it without assistance. However, if you like the idea of Tungsten Cookbook assembling your commands and running them, giving you access to several commodity utilities in the process, you can do it right now. Besides, if you need to customize your installation with ports, custom paths and management tools, you will appreciate the help provided by Tungsten Cookbook.

# (edit ./cookbook/USER_VALUES.sh)
export USE_TPM=1
./cookbook/install_all_masters

Listing 3: invoking tpm installation for all-masters using a cookbook recipe.

When you define USE_TPM, the installation recipe will use tpm instead of tungsten-installer. Regardless of the verbosity that you have chosen, you realize that you are using the tpm because the installation is over very soon.

The above command (either the one done manually or the built-in recipe) will produce a data service with four nodes, all of which are masters, and you can visualize them as:

./cookbook/show_cluster
--------------------------------------------------------------------------------------
Topology: 'ALL_MASTERS'
--------------------------------------------------------------------------------------
# node host1
alpha    [master]   seqno:         15  - latency:   0.058 - ONLINE
bravo    [slave]    seqno:         15  - latency:   0.219 - ONLINE
charlie  [slave]    seqno:         15  - latency:   0.166 - ONLINE
delta    [slave]    seqno:         15  - latency:   1.161 - ONLINE

# node host2
alpha    [slave]    seqno:         15  - latency:   0.100 - ONLINE
bravo    [master]   seqno:         15  - latency:   0.179 - ONLINE
charlie  [slave]    seqno:         15  - latency:   0.179 - ONLINE
delta    [slave]    seqno:         15  - latency:   1.275 - ONLINE

# node host3
alpha    [slave]    seqno:         15  - latency:   0.093 - ONLINE
bravo    [slave]    seqno:         15  - latency:   0.245 - ONLINE
charlie  [master]   seqno:         15  - latency:   0.099 - ONLINE
delta    [slave]    seqno:         15  - latency:   1.198 - ONLINE

# node host4
alpha    [slave]    seqno:         15  - latency:   0.145 - ONLINE
bravo    [slave]    seqno:         15  - latency:   0.256 - ONLINE
charlie  [slave]    seqno:         15  - latency:   0.208 - ONLINE
delta    [master]   seqno:         15  - latency:   0.371 - ONLINE

Listing 4: The cluster overview after an all-masters installation.

More topologies: fan-in

Here is the command that installs three masters in host1, host2, and host3, all fanning in to host4, which will have only 3 slave services, and no master.

./tools/tpm install many_towns \
    --replication-user=tungsten \
    --replication-password=secret \
    --home-directory=/opt/continuent/replication \
    --masters=host1,host2,host3 \
    --slaves=host4 \
    --master-services=alpha,bravo,charlie \
    --topology=fan-in \
    --start

Listing 5: Installing a fan-in topology.

Tpm fan in 1
Image 3 - Fan-in deployment

You will notice that it’s quite similar to the installation of all-masters. The most notable difference is that, in addition to the list of masters, there is also a list of slaves.

    --masters=host1,host2,host3 \
    --slaves=host4 \

Listing 6: How a fan-in topology is defined.

We have three masters, and one slave listed. We could modify the installation command this way, and we would have two fan-in slaves getting data from two masters.

    --masters=host1,host2 \
    --slaves=host3,host4 \
    #
    # The same as:
    #
    --masters=host1,host2 \
    --members=host1,host2,host3,host4 \

Listing 7: Reducing the number of masters increases the slaves in a fan-in topology.

Now we will have two masters in host1 and host2, and two fan-in slaves in host3 and host4.

Tpm fan in 2
Image 4 - Fan-in deployment with two slaves

If we remove another master from the list, we will end up with a simple master/slave topology.

And a star

The most difficult topology is the star, where all nodes are masters and a node acts as a hub between each endpoint and the others.

./tools/tpm install constellation \
    --replication-user=tungsten \
    --replication-password=secret \
    --home-directory=/opt/continuent/replication \
    --masters=host1,host2,host4 \
    --hub=host3 \
    --hub-service=charlie \
    --master-services=alpha,bravo,delta \
    --topology=star \
    --start

Listing 8: Installing a star topology.

Tpm star
Image 5 - star deployment

Now, the only complication with this topology is that it requires two more parameters than all-masters or fan-in: we need to define which node is the hub, and how to name the hub service. But this topology has the same features as the one that you could get by running 11 commands with tungsten-installer + configure-service.

More TPM: building complex clusters

The one-command installation is just one of tpm’s many features. Its real power resides in its ability to compose more complex topologies. The ones shown above are complex, but since they are common, there are one-command recipes that simplify their deployment. There are cases, though, when we want to deploy beyond these well-known topologies, and compose our own cluster. For example, we may want an all-masters topology with two additional simple slaves attached to two of the masters. To compose a custom topology, we can use tpm in stages. We configure the options that are common to the whole deployment, and then we shape up each component of the cluster.

#1
./tools/tpm configure defaults  \
    --reset \
    --replication-user=tungsten \
    --replication-password=secret \
    --home-directory=/home/tungsten/installs/cookbook \
    --start

#2
./tools/tpm configure four_musketeers  \
    --masters=host1,host2,host3,host4 \
    --master-services=alpha,bravo,charlie,delta \
    --topology=all-masters

#3
./tools/tpm configure charlie \
    --hosts=host3,host5 \
    --slaves=host5 \
    --master=host3

#4
./tools/tpm configure delta \
    --hosts=host4,host6 \
    --slaves=host6 \
    --master=host4

#5
./tools/tpm install

Listing 9: A composite tpm command.

In Listing 9, we have 5 tpm commands, all of which constitute a composite deployment order. In segment #1, we tell tpm the options that apply to all the next commands, so we won’t have to repeat them. In segment #2, we define the same four-masters topology that we did in Listing 2. Segments #3 and #4 each create a slave service, on hosts host5 and host6, with the respective masters being host3 and host4. The final segment #5 tells tpm to take all the information created with the previous commands, and finally run the installation. You may be wondering how tpm keeps track of all the commands, and recognizes that they belong to the same deployment. What happens after every command is that tpm adds information to a file named deploy.cfg, containing a JSON record of the configuration we are building. Since we may have previous attempts at deploying from the same place, we add the option --reset to our first command, thus making sure that we start a new topology, rather than adding to a previous one (which indeed we do when we want to update an existing data service).

The result is what you get in the following image:

Tpm all masters with slaves
Image 6 - all-masters deployment with additional slaves

A word of caution about the above topology. The slaves in host5 and host6 will only get the changes originated in their respective masters. Therefore, host5 will only get changes that originated in host3, while host6 will only get changes from host4. If a change comes from host1 or host2, it will be propagated to host1 through host4, because each host has a dedicated communication link to each of the other masters, but the data does not pass through to the simple slaves.

Different is the case when we add slave nodes to a star topology, as in the following example.

./tools/tpm configure defaults  \
    --reset \
    --replication-user=tungsten \
    --replication-password=secret \
    --home-directory=/home/tungsten/installs/cookbook \
    --start

./tools/tpm configure constellation  \
    --masters=host1,host2,host3,host4 \
    --master-services=alpha,bravo,delta \
    --hub=host3 \
    --hub-service=charlie \
    --topology=star

./tools/tpm configure charlie \
    --hosts=host3,host5 \
    --slaves=host5 \
    --master=host3

./tools/tpm configure delta \
    --hosts=host4,host6 \
    --slaves=host6 \
    --master=host4

./tools/tpm install
Tpm star with slaves
Image 7 - star deployment with additional slaves

In a star topology, the hub is a pass-through master. Everything that is applied to this node is saved to binary logs, and put back in circulation. In this extended topology, the slave service in host5 is attached to a spoke of the star. Thus, it will get only changes that were created in its master. Instead, the node in host6, which is attached to the hub master, will get all the changes coming from any node.

Extending clusters

So far, the biggest challenge when working with multi-master topologies has been extending an existing cluster. Starting with two nodes and then expanding it to three is quite a challenging task. (Figure 8)

Using tpm, though, the task becomes quite easy. Let’s revisit the all-masters installation command, similar to what we saw at the start of this article:

./tools/tpm install musketeers \
    --reset \
    --topology=all-masters \
    --home-directory=/opt/continuent/replicator \
    --replication-user=tungsten \
    --replication-password=secret \
    --masters=host1,host2,host3 \
    --master-services=athos,porthos,aramis \
    --start

If we want to add a host 'host4', running a service called 'dartagnan', we just have to modify the above command slightly:

./tools/tpm configure musketeers \
    --reset \
    --topology=all-masters \
    --home-directory=/opt/continuent/replicator \
    --replication-user=tungsten \
    --replication-password=secret \
    --masters=host1,host2,host3,host4 \
    --master-services=athos,porthos,aramis,dartagnan \
    --start

./tools/tpm update

That's all it takes. The update command is almost a repetition of the install command, with the additional components. The same command also restarts the replicators, to get the configuration online.

Tpm all masters extend
Image 8 - Extending an all-masters topology

More is coming

The tpm is such a complex tool that exploring it all in one session may be daunting. In addition to installing, you can update the data service, and thanks to its precise syntax, you can deploy the change exactly in the spot where you want it, without moving from the staging directory. We will look at it with more examples soon.

Thursday, July 11, 2013

Some of my MySQL Forge snippets are resurfacing


Some time ago, MySQL Forge went offline.

As part of the Forge dismissal operations, I got a backup of my snippets from the MySQL community team, and I have been lazily looking for an alternative place to put them.

I found such a place: GitHub Gist.
Gist is a simple way to share snippets and pastes with others. All gists are git repositories, so they are automatically versioned, forkable and usable as a git repository.

Out of my 25 snippets, these are the ones that still look useful (at least, people have been asking me about those).


When I need to publish a snippet, this is the place where it will end up. Eventually, the collection will grow again with more useful stuff.

Sunday, July 07, 2013

RFC - DBdeployer : Bringing MySQL Sandbox to a new level


MySQL Sandbox is growing old

The MySQL Sandbox project has been around for 8 years, and it has gained considerable attention from the community. I have seen it mentioned in books and articles, used in other projects, and widely adopted by testers and bug reporters.
I have used it for more than testing, and it has saved me many hours of labor by allowing me to create database servers in a few seconds.
Yet, I have gathered a long list of grievances about it, both from my own experience and from other users' feedback. Here goes.

  • MySQL Sandbox is not easy to install. For people used to installing Perl modules, it feels natural. For experienced Perl users, installing it in user space without root access is feasible and even easy. But for the vast majority of users, it is often a nightmare. Even when the method is well understood, there are issues that baffle even the experts. Let's see them:
    • You can end up with several Perl versions on your system. How? By just following the instructions to install some other project, which silently installs a competing Perl interpreter on your laptop.
    • Your operating system can upgrade Perl and mix up the paths where the module is installed.
    • In both the above cases, you may end up with MySQL Sandbox not being the latest version, or the help not being available.
  • MySQL Sandbox is not intuitive to use. This is a consequence of the project being extended and stretched over successive releases, starting from a tool that was really arcane and difficult to use. As a result, I made the calls to the sandboxing tools brief, but not really easy. It's a defect that often happens when the designer and the developer are the same person. Now I know better, but there is more.
  • MySQL Sandbox is not easy to extend. This, too, is a consequence of the project having evolved from its initial roots. The first version of the application was a Swiss Army knife with many sharp blades and a tiny handle. With the latest releases I created an easy wrapper around the initial tool (which is now labeled as low-level-make-sandbox), at the price of an awkward syntax. I made common operations really easy to use, and uncommon ones unnecessarily hard. But the worst consequence is that the features I wanted to develop are still on my wish list, because the code underneath is not flexible.

My wish list

Having used the project for almost everything that came my way during my job, I came to appreciate its versatility, but at the same time I wished I could do more to make the tool meet my needs. In the past years I have extended MySQL Sandbox with many tiny improvements, but the core remains the same. Here's what I would like to do:

  • Deploy for production, which includes the ability to deploy safely with root access. You can do that now with MySQL Sandbox, but it was designed on purpose for deployment in user space. If you want to create production-ready deployments and keep them maintainable, there are many tasks that you need to manage: they are taken care of by .rpm or .deb based deployments, but they are not that easy with a custom deployment.
  • Deploy remotely, which you can do now, using a shell script that I have added to the build, but it is kind of a hack: a wrapper on top of other wrappers. While it makes the deployment really easy, it does not have enough checking to guarantee that it will work well in most cases.
  • A GUI. I know MySQL Sandbox intimately. Yet, when I need to do something unusual, I have to look at the help, and sometimes at the docs, to remind myself of what needs to be done. A web-based (or even a text-based) menu would make the application friendlier. The main obstacle is that the internal workflow was not designed for interactivity.
  • Easy installation. This includes the ability to be installed through a package manager (apt-get, yum) or a language-specific manager (Perl CPAN, Ruby Gems, Python PyPI), but also a simple way of using it out of the box without installing it at all. This feature (or lack thereof) is what makes current remote deployment so fragile.
  • Deploy on Windows. I don't like Windows, and I don't want to have anything to do with it, but I realize that for many users it is the only operating system available. So I would make the next application modular, in such a way that someone else can just create descendant classes from my abstract ones, and implement sandboxes in a different setup.
  • Make sandboxes for more than MySQL. Databases are all different. People have asked me to create sandboxes for Postgres, Drizzle, and Oracle, and I have declined to even try. But if the framework is abstract enough, it could allow subclasses that handle different operating systems and different database servers.
  • Create an API that can be used from a programming language, instead of calling the tools directly. This requires some clever redesign, but it is feasible.

Meet DBdeployer

The next incarnation of MySQL Sandbox is named DBdeployer. Its assets so far consist of a development plan on GitHub and a dedicated domain dbdeployer.org (with a twin .com) that for now redirects to the MySQL Sandbox site.
As you can see from the development plan, there is quite a lot to do.


Why GitHub?

During the lifetime of MySQL Sandbox I have changed from Savannah to SourceForge to the current Launchpad.
For the work at my company, I have also used extensively Google code. In each place I found good and bad parts, and I kept looking around for alternatives. The fastest growing project hosting that I have noticed is GitHub, where I find all the main features I need. What I loved in launchpad was the ability of writing blueprints with their dependencies. What I have realized later is that I can write a development plan using Test Driven (TDD) and/or Behavior Driven Development (BDD) for my working documents. If you look at the MySQL sandbox code, you will see that the largest piece of the code base is made of tests, and yet I wrote most of those tests after writing the features, just because of how the application was designed. Given my love for testing, it makes sense that in my next project I privilege testing by embracing a development method that combines tests, design, and development in one harmonious loop. What has this to do with GitHub? It’s because it has all except a blueprint designer graphical interface, and since the design is going to be done with the BDD tests, GitHub, with its vibrant community and the tools that make production easy, is the tool that I have chosen. I should also mention that I have grown tired of Bazaar that comes with Launchpad. While I like its principles, it is really not as widespread and maintained as git. When git was a Linux-only application, it was not a good choice, but now, not only it is universal, but it comes pre-installed in every major operating system (no, Windows is not one of them, but despite having mentioned it in my wish list, I still don’t care).


Why not Perl?

Going with GitHub is a hint that I won't be using Perl for my next project. A few years ago I stated my intention of using Python for a new MySQL Sandbox, but then I changed jobs and focused on other priorities. While I kept up to date with Python, I also rediscovered an old love: Ruby.
Ruby is everything that Perl wanted to be but could not. It has taken most of its core principles from Perl and made them extensible with a true object-oriented implementation, and in its latest releases the language has improved in stability and expressiveness in a way that strikes a chord with my love for technology. Not only that, but there are testing techniques that the Ruby community has embraced and enhanced more than other language communities, and since I am a QA developer at heart, it sounds like a logical choice.
I could still use Perl, and GitHub, and BDD together. But let's be frank: Perl's object-oriented abilities are poor. While it has finally come up with a new OOP framework, it is still an addition that feels foreign to the language. Compared to Ruby's terse OOP implementation (I love the everything-is-an-object paradigm), Perl feels old and awkward when I try to do some innovative design.

What, you may ask, is this fixation with OOP? It's an early infatuation of mine. I started using OOP in the early 1990s, with C++, before it became as useful as it is nowadays thanks to the Standard Template Library. And since C++ compilers were not widely available, I implemented my own OOP flavor in standard C, which allowed me to develop a whole database engine interpreter and to extend it with new features quite easily. Since then, I have always liked OOP. With Perl, I had to bargain: on one side, it saves me thousands of lines of code compared to C, but it does not encourage OOP. As a result, I wrote code faster, but not the kind of reusable code that, with hindsight, I would have liked to produce.


So, where is the code?

Here's the thing: there is no code yet. Although my fingers are itching to get started, I want to avoid getting stuck with a prototype that grows into a non-scalable application, and so I will put together as many BDD tests as I can before writing any code.
The first step was writing the development plan, which is more a wish list than a proper plan. In the same vein, there is a tentative description of the application at Introducing DBdeployer, from which I excerpt a dummy interface.


$ dbdeployer
NAME
dbdeployer - installs single and composite MySQL instances

SYNOPSIS
dbdeployer [global options] command [command options] [arguments...]

VERSION
0.0.1

GLOBAL OPTIONS
--help                - Show this message
--hosts=hostname list - Install on one or more remote hosts (default: localhost)
-s, --[no-]sandbox    - Creates a sandboxed instead of a production-ready instance
--version             - Display the program version

COMMANDS
single      - Deploys a single instance
replication - Deploys a composite instance in replication
multiple    - Deploys a composite instance of isolated servers
check       - Checks if an instance is working correctly
clone       - Clone an instance
move        - Move an instance
remove      - Remove an instance
help        - Shows a list of commands or help for one command

Next: requesting comments

The next step is collecting advice, comments, and wishes from users, and seeing where it goes.
I hope to have version 0.1 ready for MySQL Connect 2013, where I have a talk about running MySQL on your laptop (which obviously covers MySQL Sandbox). If the design goes well, I should be able to get a working application by mid September. It probably won't include all the features, but if I have an extensible framework and the most basic features in place, I will feel satisfied.
If you are interested in the future of MySQL Sandbox, read the development plan, and comment on this article with advice or wishes.


Acknowledgements

Thanks to my current company, Continuent, Inc. (and my previous one, MySQL AB), for allowing me to keep maintaining MySQL Sandbox and its next revolutionary evolution. And thanks to all the people who have contributed bug reports, advice, features, comments, and reviews, and of course who have used it and spread it around.

On contributing to MySQL


Dave Stokes has just written that MySQL is Looking for External Contributions. The first comments on that were negative, saying that forcing developers to sign a contributor agreement is not friendly, and that MySQL developers don't play well with external contributors.

To be fair, it is not Oracle that has an unfriendly policy about contributions. It was already like this with MySQL AB, and the reason is simply that the company wants to maintain ownership of the code, so that it can sell dual licensing agreements.
I may not like it, but licensing was still a substantial part of the business when I left Oracle, and I assume it still is. Since this "feature" helps pay the developers who create open source software, I believe it is a reasonable trade-off.
Besides, MontyProgram asks for the same thing (https://kb.askmonty.org/en/mca/), and so do Apache (http://www.apache.org/licenses/icla.txt) and Canonical (http://www.canonical.com/contributors).
It is a reasonable request. Not only does it allow the main project to retain ownership of the code, which is often a requirement for generating further revenue, but it also gives the project some legal protection when a contributor submits code that was created by others.

All in all, it’s a good thing that Oracle is encouraging contributions!

About the MySQL developers not playing well with external contributors: this is also nothing new. To my chagrin, when I was working with MySQL I realized early on that not all developers have a mindset that suits community participation. That is only to be expected. I know brilliant developers who can deliver exceptional code, but don't understand the concept of coding in an open environment, or don't have the ability to cooperate with developers who are not colleagues. Is that a bad thing? For some, yes, but I don't think so. If the final result is that the community gets great code from the effort of non-social-oriented developers, I still take it and say "thank you!"

As a community manager at MySQL, I tried to improve MySQL's attitude towards external contributors, but I had to acknowledge that we could not force the developers to work against their inclination. There were (and still are!) some developers who play well with external contributors, but forcing everyone to work with contributors outside the company might have endangered productivity. The developers at MySQL were hired for their technical skills, and with few exceptions the choice has proven right, as they are capable of delivering first class software, which is used in mission critical applications by countless businesses.

So, yes, there are proposals that are rejected by the developers. This is a bad sign for the contributor, but a good sign for the users. A rejected proposal means that someone took the time to examine it, and weighed the options between accepting the contribution, writing the same feature from scratch, or rejecting the contribution altogether because it is not worth it.

Let me clarify, because this point is not widely known. This description refers to the process as I knew it three years ago. It might have changed since, but I doubt it has changed for the worse.
For a feature to get into MySQL code, by any developer, meaning either an employee of MySQL or an external contributor, it must pass three obstacles:
  1. There must be tests for the code provided. A feature that does not include tests is rejected.
  2. The code must be reviewed by one or two senior developers. The first thing that the reviewers notice (besides the presence or absence of tests) is whether the code duplicates something that was already done. If it does, they suggest rewriting it in such a way that it uses existing features rather than creating new ones.
  3. The code must pass the test suites on all platforms supported by MySQL. This is more easily said than done. External contributors test on their preferred operating system, and then they think they are done. But MySQL has a testing farm that submits every proposed commit to the full range of operating systems that are actively supported. So chances are that a proposed contribution breaks the tests on one or two of them. At that point, the employees who have been reviewing the code can either ask the initial developer to change the code so that it is multi-platform friendly, or do it themselves. Having supervised many such exchanges between employees and external contributors, I know for a fact that contributing is a hard task for both sides. Since MySQL has a policy of high quality (mandatory tests, code review, tests passing on all platforms), every piece of code submitted to trunk is the result of a long process, which every would-be contributor must be aware of and willing to endure.

One more thing that comes to mind in this context. An additional difficulty in contributing to MySQL is that the MySQL code is quite intricate. It was initially created when resources were scarce, and thus it includes design decisions that made sense 15 years ago but would be made quite differently nowadays. The Drizzle project exposed most of these issues when it stripped down most of the legacy code and implemented the database server in a more agile way, at the same time making it easier for contributors. But it did so at the expense of breaking compatibility with the past. Since MySQL gets most of its popularity from being already widely adopted and available, breaking compatibility is not a path that the project wants to walk now. I know that the developers are refactoring the code to make it more modern and manageable, but they are doing so while trying not to break anything that works well today. It's a slow process, and some may not like it. But it's a sign of continuous progress, for which the community should be grateful. As I am.

Friday, June 14, 2013

Welcome Tungsten Replicator 2.1.0!


Overview


First off, the important news. Tungsten Replicator 2.1.0 was released today.
You can download it and give it a try right now.


Second, I would say that I am quite surprised at how much we have done in this release. The previous release (2.0.7) came out in February, just a few months ago, and yet it looks like ages when I see the list of improvements, new features and bug fixes in the Release Notes. I did not realize it until I ran my last batch of checks to test the upgrade from the previous release, which I hadn't run for quite a long time. It's like watching a son grow in front of your eyes day by day: you don't realize he's grown a full foot until a distant relative comes to visit. The same happened to me here. I looked at the ./cookbook directory in 2.0.7 and saw just a handful of commands (most of them now deprecated), and then at 2.1.0, which has about 30 new commands, all nicely categorized and advertised in the embedded documentation. If you are starting today with Tungsten Replicator 2.1.0, you can run


./cookbook/readme

and

./cookbook/help

Upgrade


If you were using Tungsten Replicator before, you need to know how to upgrade. If, by any unfortunate chance, you were not using the cookbook recipes for the installation, the method for upgrading is the following:

  • unpack the tarball in a staging directory
  • For each node in your deployment:
    • stop the replicator
    • run
./tools/update --release-directory=$PATH_TO_DEPLOYED_TUNGSTEN --host=$NODE
  • If your node has more than one service, restart the replicator
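
Wrapped in a loop, the same procedure might look like the following sketch (host names are hypothetical; TUNGSTEN_HOME must match the --home-directory of your deployment, and the replicator script location is the standard one inside it):

#!/bin/bash
TUNGSTEN_HOME=/opt/continuent/replicator
for NODE in host1 host2 host3
do
    # stop the replicator on the remote node
    ssh $NODE $TUNGSTEN_HOME/tungsten/tungsten-replicator/bin/replicator stop
    # run the update from the unpacked staging directory
    ./tools/update --release-directory=$TUNGSTEN_HOME --host=$NODE
    # if the node runs more than one service, restart the replicator afterwards
done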


If you are using the cookbook, you can run an upgrade using

./cookbook/upgrade

This command will ask for your current topology and then show all the commands that you should run to perform the upgrade, including adapting the cookbook scripts to the new deployment.

So, What's New?

The list of goodies is long. All the gory details are in the Release Notes. Here I would like to mention the ones that have impressed me more.

Oracle Extractor Is Open Source

Up to the previous release, you could extract from MySQL and apply to Oracle, all with open source tools. If you wanted to extract from Oracle, you needed a commercial license. Now the whole replication layer is completely open source. You can replicate from and to Oracle using Tungsten Replicator 2.1.0 under the terms of the GPL v2. However, you will still have to buy database licenses from Oracle!

Installation and Administration

There is a long list of utilities released inside the ./cookbook directory, which will help you install and maintain the cluster with a few keystrokes. See references #2 and #3 below. The thing that you should try right away is:

 # edit ./cookbook/COMMON_NODES.sh
 # edit ./cookbook/USER_VALUES.sh
 ./cookbook/validate_cluster

This will tell you if your servers are ready for deployment, without actually deploying anything.

Documentation!

We have hired a stellar professional writer (my former colleague at MySQL AB, the well known book author MC Brown), and the result is that our well-intentioned but rather unfocused documentation is now shaping up nicely. Among all the things that got explained, Tungsten Replicator has its own getting started section.

Metadata!

Tungsten replication tools now provide information in JSON. Here's a list of commands to try:

trepctl status -json
trepctl services -json -full
trepctl properties | less
thl list -headers -high 100 [-json]

For example:

$ trepctl services -json
[
    {
        "appliedLatency": "0.81",
        "state": "ONLINE",
        "role": "slave",
        "appliedLastSeqno": "1",
        "started": "true",
        "serviceType": "local",
        "serviceName": "cookbook"
    }
]

$ trepctl properties -filter replicator.service.comments
{
    "replicator.service.comments": "false"
}
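
Since this output is valid JSON, it is easy to consume from scripts. For example, with jq (not part of Tungsten; any JSON-aware tool would do) you can extract single fields from the services list:

$ trepctl services -json | jq -r '.[0].state'
ONLINE
$ trepctl services -json | jq -r '.[0].appliedLatency'
0.81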

More Tools

My colleague Linas Virbalas made the team (and several customers) happy by creating two new tools:

  • ddlscan, a Utility to Help Analyze and Migrate Database Schemas
  • the rename filter, a supercharged filter that can rename almost any object in a relational database, from schemas down to columns.

Linas also coded the above-mentioned JSON-based improvements.

MongoDB Installation

It was improved and better tested. It's a pleasure to see data from a relational database flow into a rival NoSQL repository as if it belonged there! See reference #4 below.

More to Come

What’s listed here is what we have tested and documented. But software development is not a linear process. There is much more boiling in the cauldron, ready to be mixed into the soup of release 2.1.1.

We're working hard at making filters better. You will soon see the long-awaited documentation for them, and a simplified interface.

Another thing that I have tested, and that worked surprisingly well, is the creation of Change Data Capture for MySQL. This feature is usually requested by Oracle users, but I tried it with MySQL, and it allowed me to create shadow tables with the audit trail of their changes. I will write about that as soon as we smooth a few rough edges.

Scripting! This is going to be huge. Much of it is already available in the source, but not fully documented or integrated yet. What you will soon see in the open is a series of Ruby libraries (the same ones used by the very sophisticated Tungsten installation tools), exposed for general usage by testers and tool creators. While the main focus of these libraries is the commercial tools, a significant portion of the work needs to end up in the replicator, and as a result its usability will increase.

What else? I may have forgotten something important amid all the excitement. If so, I will amend that in my next articles. Happy hacking!

References

  1. Tungsten Replicator documentation
  2. Installing and Administering Tungsten Replicator - Part 1 - basics
  3. Installing and administering Tungsten Replicator - Part 2 : advanced
  4. Getting started with replication from MySQL to MongoDB

Monday, May 27, 2013

Getting started with replication from MySQL to MongoDB

As you probably know, Tungsten Replicator can replicate data from MySQL to MongoDB. The installation is relatively simple and, once done, replication works very well. There was a bug in the installation procedure recently, and as I was verifying that the breakage had been fixed, I wanted to share the experience of getting started with this replication.

Step 1: install a MySQL server

For this exercise, we will use a MySQL sandbox running MySQL 5.5.31.

We download the binaries from dev.mysql.com and install a sandbox, making sure that it is configured as a master and that it uses row-based replication.

$ mkdir -p $HOME/opt/mysql
$ cd ~/downloads
$ wget http://dev.mysql.com/get/Downloads/MySQL-5.5/mysql-5.5.31-linux2.6-x86_64.tar.gz/from/http://cdn.mysql.com/
$ make_sandbox --export_binaries ~/downloads/mysql-5.5.31-linux2.6-x86_64.tar.gz -- --master
$ echo 'binlog-format=row' >> ~/sandboxes/msb_5_5_31/my.sandbox.cnf

$ ~/sandboxes/msb_5_5_31/use -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.5.31-log MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql [localhost] {root} ((none)) > grant all on *.* to tungsten identified by 'secret' with grant option;
Query OK, 0 rows affected (0.00 sec)


mysql [localhost] {root} ((none)) > set global binlog_format=row;
Query OK, 0 rows affected (0.00 sec)

The above command will install an instance of MySQL 5.5.31 in the directory $HOME/sandboxes/msb_5_5_31. You can use any other MySQL version. In that case, you should change the DEPLOY.sh file below with the coordinates of your server.

Step 2: install a MongoDB server

Get the binaries from MongoDB downloads, and unpack them.

$ cd ~/downloads
$ wget http://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.4.3.tgz
$ cd
$ tar -xzf ~/downloads/mongodb-linux-x86_64-2.4.3.tgz
$ mv mongodb-linux-x86_64-2.4.3 mongodb

Once you have unpacked the binaries, you can run the server, with a command such as this:

#!/bin/bash
MONGODB_HOME=$HOME/mongodb

if [ ! -d $MONGODB_HOME ]
then
    echo "$MONGODB_HOME not found"
    exit 1
fi

cd $MONGODB_HOME
if [ ! -d $MONGODB_HOME/data ]
then
    mkdir $MONGODB_HOME/data
fi

./bin/mongod \
   --logpath=$MONGODB_HOME/mongodb.log \
   --dbpath=$MONGODB_HOME/data \
   --fork \
   --rest

Now MongoDB should be ready to receive commands. If needed, you can see a tutorial on basic MongoDB operations. For now, it's enough to check that your system is running:

$ cd ~/mongodb
$ ./bin/mongo
MongoDB shell version: 2.4.3
connecting to: test
> show dbs
local 0.078125GB
>

Step 3: install a master replicator using MySQL

As I mentioned above, there was a bug in the installation. The bug was fixed yesterday, and thus you should use a recent build from http://bit.ly/tr20_builds. For this exercise, I am using tungsten-replicator-2.1.0-269.tar.gz.

Of course, you should also configure your host according to the system requirements for Tungsten.

We start by creating a few directories that we need for our deployment.

$ cd 
$ mkdir deploy
$ cd deploy
$ mkdir mysql     # here we install the master replicator
$ mkdir mongodb   # here we install the slave replicator
$ wget https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.1.0-269.tar.gz
$ tar -xzf tungsten-replicator-2.1.0-269.tar.gz

We now create a defaults script that will be used by both master and slave installer.

$ cat DEPLOY.sh
MYSQL_PORT=5531
MYSQL_SANDBOX_BASE=$HOME/sandboxes/msb_5_5_31
MYSQL_BASEDIR=$HOME/opt/mysql/5.5.31
MYSQL_CONF=$MYSQL_SANDBOX_BASE/my.sandbox.cnf
MYSQL_BINLOG_DIRECTORY=$MYSQL_SANDBOX_BASE/DATA/
DEPLOY_HOME=$HOME/deploy
MYSQL_DEPLOY=$DEPLOY_HOME/mysql
MONGODB_DEPLOY=$DEPLOY_HOME/mongodb
MONGODB_PORT=27017
TUNGSTEN_BINARIES=$DEPLOY_HOME/tungsten-replicator-2.1.0-269
MASTER_THL_PORT=12500
SLAVE_THL_PORT=12600
MASTER_RMI_PORT=11500
SLAVE_RMI_PORT=11600

And finally the master installation script:

$ cat install_master.sh
. ./DEPLOY.sh

cd $TUNGSTEN_BINARIES

export PATH=$MYSQL_BASEDIR/bin:$PATH

./tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=127.0.0.1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --datasource-mysql-conf=$MYSQL_CONF \
  --datasource-log-directory=$MYSQL_BINLOG_DIRECTORY \
  --datasource-port=$MYSQL_PORT \
  --service-name=mongodb \
  --home-directory=$MYSQL_DEPLOY \
  --cluster-hosts=127.0.0.1 \
  --thl-port=$MASTER_THL_PORT \
  --rmi-port=$MASTER_RMI_PORT \
  --java-file-encoding=UTF8 \
  --mysql-use-bytes-for-string=false \
  --mysql-enable-enumtostring=true \
  --mysql-enable-settostring=true \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=none --start-and-report

We can run the installation command for the master.

$ ./install_master.sh
INFO  >> 127_0_0_1 >> Getting services list
INFO  >> 127_0_0_1 >> ..
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 0
appliedLatency  : 1.218
role            : master
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

Step 4: install a slave replicator using MongoDB

Now we have a master database server with a replicator that can send data across for any slave to pick up. We could install one or more MySQL slaves, but that is not the purpose of this exercise. So we skip any regular slave installation and install a MongoDB slave instead. To do so, we create a third script, which will invoke the installer to run a replicator slave service that applies data to MongoDB.

$ cat install_slave.sh
. ./DEPLOY.sh

cd $TUNGSTEN_BINARIES

export PATH=$MYSQL_BASEDIR/bin:$PATH

tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=127.0.0.1  \
  --service-name=mongodb \
  --home-directory=$MONGODB_DEPLOY \
  --cluster-hosts=127.0.0.1 \
  --datasource-port=$MONGODB_PORT \
  --master-thl-port=$MASTER_THL_PORT \
  --thl-port=$SLAVE_THL_PORT \
  --rmi-port=$SLAVE_RMI_PORT \
  --java-file-encoding=UTF8 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=none --start-and-report

Compared to the MySQL installation, we see that the 'datasource-type' is 'mongodb', and as such it does not need a username and password. Let's run it.

$ ./install_slave.sh
WARN  >> 127.0.0.1 >> Currently unable to check for the THL schema in mongodb
INFO  >> 127_0_0_1 >> Getting services list
INFO  >> 127_0_0_1 >> Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 0
appliedLatency  : 49.444
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

We get a warning, which should not worry us, as it is expected. Notice that the high latency reported by the replicator is due to the delay in installing the slave. The only event in the replication pipeline is the 'master online' broadcast, which was replicated after the installation was completed.

If you got to this point, with both the master and slave replicator in the 'ONLINE' state, then you can continue with checking that replication works.

Step 5: check replication

Replication from MySQL (a relational DBMS) to MongoDB (a key-value store) requires some understanding of what we are replicating and what limitations we can face.

In MySQL, we have tables, which are translated to MongoDB collections. What is very important to understand is that Tungsten Replicator DOES NOT REPLICATE DDL EVENTS. Any CREATE/DROP/ALTER events are ignored by the replicator.
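
You can see this limitation in action with a quick test (the 'ddl_only' table is just an example name, and the paths are the ones used throughout this article):

$ ~/sandboxes/msb_5_5_31/use test -e 'create table ddl_only (id int not null primary key)'
$ ~/mongodb/bin/mongo --quiet test --eval 'db.getCollectionNames()'
# no 'ddl_only' collection yet: the DDL was ignored
$ ~/sandboxes/msb_5_5_31/use test -e 'insert into ddl_only values (1)'
$ ~/mongodb/bin/mongo --quiet test --eval 'db.getCollectionNames()'
# now the collection appears, created by the first row event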

However, the replicator is smart enough to convert every table record into a document within the relevant collection. Every insert will generate a document inside MongoDB. Let's see an example:

mysql [localhost] {msandbox} (test) > create table myfirst (id int not null primary key, name char(30), d date);
Query OK, 0 rows affected (0.00 sec)

mysql [localhost] {msandbox} (test) > insert into myfirst values (1, 'Harry Potter', '1997-06-30');
Query OK, 1 row affected (0.00 sec)

On the MongoDB side, we will see the corresponding object:

> show dbs
local 0.078125GB
test 0.203125GB
tungsten_mongodb 0.203125GB
> use test
switched to db test
> show collections
myfirst
system.indexes
> db.myfirst.find()
{ "_id" : ObjectId("51a2867d3004fd7959a5f5aa"), "id" : "1", "name" : "Harry Potter", "d" : "1997-06-30" }

Not only inserts, but also updates are recognised:

mysql [localhost] {msandbox} (test) > insert into myfirst values (2, 'Harry Potter 2', '1999-06-02');
Query OK, 1 row affected (0.00 sec)

mysql [localhost] {msandbox} (test) > update myfirst set name = 'Harry Potter 1' where id =1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Which will result in:

> db.myfirst.find()
{ "_id" : ObjectId("51a2867d3004fd7959a5f5aa"), "id" : "1", "name" : "Harry Potter 1", "d" : "1997-06-30" }
{ "_id" : ObjectId("51a287633004fd7959a5f5ab"), "id" : "2", "name" : "Harry Potter 2", "d" : "1999-06-02" }

And delete statements:

mysql [localhost] {msandbox} (test) > delete from myfirst  where id =2;
Query OK, 1 row affected (0.00 sec)

# (in the mongo shell)
> db.myfirst.find()
{ "_id" : ObjectId("51a2867d3004fd7959a5f5aa"), "id" : "1", "name" : "Harry Potter 1", "d" : "1997-06-30" }

This is all fine. But what happens when we insert quite a lot of data? Let's find out. We're going to use the test employees database to see if the system complains.

$ cd ~/data/employees/
$  ~/sandboxes/msb_5_5_31/use < employees.sql
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries

To see if replication is OK, we check both master and slave for their status, and then we count the records in MySQL and MongoDB.

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $MASTER_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 182
appliedLatency  : 1.152
role            : master
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $SLAVE_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 106
appliedLatency  : 93.469
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

So, we see that the slave is lagging, as the data is hard to digest (about 4 million records). Let's try again after a few seconds:

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $SLAVE_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 182
appliedLatency  : 89.99
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

Both the master and the slave have reached the same transaction number. The slave shows some delay, due to the size of the transactions (which become even bigger because of row-based replication). When the size of the transactions is small, the latency becomes quite acceptable:

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $MASTER_RMI_PORT heartbeat

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $SLAVE_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 183
appliedLatency  : 0.423
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

As a last operation, let's check the numbers for the biggest tables in the employees database:

$ ~/sandboxes/msb_5_5_31/use  employees
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 34
Server version: 5.5.31-log MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql [localhost] {msandbox} (employees) > show tables;
+---------------------+
| Tables_in_employees |
+---------------------+
| departments         |
| dept_emp            |
| dept_manager        |
| employees           |
| salaries            |
| titles              |
+---------------------+
6 rows in set (0.00 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from employees;
+----------+
| count(*) |
+----------+
|   300024 |
+----------+
1 row in set (0.22 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from dept_emp;
+----------+
| count(*) |
+----------+
|   331603 |
+----------+
1 row in set (0.28 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from titles;
+----------+
| count(*) |
+----------+
|   443308 |
+----------+
1 row in set (0.25 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from salaries;
+----------+
| count(*) |
+----------+
|  2844047 |
+----------+
1 row in set (0.83 sec)

# (in the mongo shell)

> use employees
switched to db employees
> show collections
departments
dept_emp
dept_manager
employees
salaries
system.indexes
titles
> db.employees.count()
300024
> db.dept_emp.count()
331603
> db.titles.count()
443308
> db.salaries.count()
2844047
>

This proves that replication works, and that even large amounts of data are replicated reasonably fast.
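
To make this check repeatable, here is a minimal script that compares the counts on both sides (paths are the ones used throughout this article):

#!/bin/bash
# compare MySQL row counts with MongoDB document counts, table by table
for T in departments dept_emp dept_manager employees salaries titles
do
    MYSQL_COUNT=$(~/sandboxes/msb_5_5_31/use employees -BN -e "select count(*) from $T")
    MONGO_COUNT=$(~/mongodb/bin/mongo --quiet employees --eval "db.$T.count()")
    echo "$T: mysql=$MYSQL_COUNT mongodb=$MONGO_COUNT"
done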

Here's the official documentation about MySQL to MongoDB replication: https://docs.continuent.com/wiki/display/TEDOC/Replicating+from+MySQL+to+MongoDB. There you will find more details on how to install and what to expect. Happy hacking!