Monday, August 24, 2015

MySQL usability issues for replication

In my latest series on advanced replication features, I came across several usability issues, which I would like to recap here. Each section below contains a wish list of requests, preceded by some general comments.

INSTALLATION

As the maintainer of MySQL Sandbox, a tool that aims to facilitate the installation of MySQL for testing, I am particularly sensitive to the annoyances during installation. I have covered the matter extensively in recent weeks and many times before. My point is that the logging of the installation must be useful to the DBA, not to the MySQL developers.
  • Make a clear distinction between [Info] and [Warning] messages. Warnings should only appear when there is some risk. Using a warning to tell the user that the InnoDB log file was created is nonsense.
  • … and then make [Info] messages suppressible. Currently, I can suppress warnings, but then I would also suppress the real warnings along with the fake ones (see the sketch after this list).
  • Make mysqld --initialize use the correct syntax for timestamps, instead of issuing a warning about the obsolete syntax that was used. (See In search of cleanliness : the elusive quiet installation for a full explanation)
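As an illustration of the current all-or-nothing control, here is a minimal sketch, assuming MySQL 5.7, where log_error_verbosity is the variable in charge: lowering the verbosity removes the noise, but it removes the genuine warnings with it.

    -- Sketch, assuming MySQL 5.7: log_error_verbosity decides what reaches the error log.
    --   3 = errors, warnings, and notes (the default)
    --   2 = errors and warnings
    --   1 = errors only: the fake warnings disappear, but so do the real ones
    SET GLOBAL log_error_verbosity = 1;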

MULTI-SOURCE

Here we have two main problems. The first is that the introduction of multi-source channels has opened some new ways of shooting yourself in the foot. By mixing the new explicitly named channels with the hidden one (still used in regular replication!), which is named "" (the empty string), we may start to see a new series of replication blunders.
The second set of problems comes from weak integration between existing features and the new one. Operation modes such as semi-synchronous, delayed, and parallel replication were designed with a single data stream in mind, and they cannot be tuned per channel.
  • Make it impossible to use the unnamed channel and the named ones at the same time. Currently only a very weak protection exists against mixing named and unnamed channels (see the sketch after this list). Also, the syntax for channel '' should result in an error when named channels are defined.
  • Integrate features that were designed for a single source to work better with multi-source: semi-sync replication, parallel applier, delayed replication.
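Here is a minimal sketch of the mix-up, with made-up hosts and credentials: one CHANGE MASTER TO without a FOR CHANNEL clause silently configures the hidden '' channel, and a plain START SLAVE then starts every channel, the hidden one included.

    -- Sketch with made-up hosts: mixing named and unnamed channels.
    CHANGE MASTER TO
        MASTER_HOST='node1', MASTER_PORT=3306,
        MASTER_USER='repl', MASTER_PASSWORD='secret',
        MASTER_AUTO_POSITION=1
        FOR CHANNEL 'node1';

    -- Forgetting FOR CHANNEL configures the hidden channel '' instead.
    CHANGE MASTER TO
        MASTER_HOST='node2', MASTER_PORT=3306,
        MASTER_USER='repl', MASTER_PASSWORD='secret',
        MASTER_AUTO_POSITION=1;

    START SLAVE;   -- starts ALL channels, the hidden '' one included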

GTID

The ugliness of the GTID implementation comes from the assumption that users should value the uniqueness of the server ID more than ease of maintenance. This leads to the choice of UUIDs as server identifiers, and to tying such identifiers to GTIDs. Personally, I don't care if my server has the same ID used in a deployment next door or on another continent. I care about being able to recognize my servers easily and group them according to my needs. Moreover, the server-UUID is only used for a few functions, while replication accuracy still depends on the old server-IDs.
That said, even if GTIDs were easier to read, they still suffer from lack of usability in many ways, as was documented in an earlier article.
  • The server-UUID should be human friendly. It should be pronounceable, easy to read, and easy to tell apart from others. Valid candidates for better server identifiers are simple numbers (like server-id) or alphanumeric text (such as 'alpha', 'mynode101', 'London-42').
  • It should be possible to define the server-UUID (or its human-friendly replacement) without hacking the auto.cnf file.
  • The GTID shown in monitoring commands (show master/slave status, select @@global.gtid_executed, various performance_schema tables) should show the precise transaction number, not the transaction set, i.e. server-uuid:203, not server-uuid:1-203.
  • Enable GTID by default;
  • Remove restrictions on GTID usage;
  • Provide an easy procedure to skip unwanted events, instead of injecting empty transactions (the current workaround is sketched below);
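For reference, this is a sketch of the current workaround, with a made-up GTID: to skip an unwanted event you have to inject an empty transaction by hand.

    -- Sketch with a made-up GTID: skipping one unwanted transaction today
    -- means committing an empty transaction in its place.
    STOP SLAVE;
    SET GTID_NEXT = '3E11FA47-71CA-11E1-9E33-C80AA9429562:23';
    BEGIN;
    COMMIT;
    SET GTID_NEXT = 'AUTOMATIC';
    START SLAVE;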

CHANGE MASTER TO

The shortcomings of GTID are also reflected in the main command for replication setup: CHANGE MASTER TO. Some of the shortcomings are listed below. Perhaps there is something more that escapes me now.
  • Allow CHANGE MASTER TO to use a GTID to set the replication starting point. Currently you can only do it with binlog name + position (see the sketch after this list);
  • Allow CHANGE MASTER TO to use server-uuids (or their human friendly replacement) as arguments for IGNORE_SERVER_IDS=…
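For comparison, here is the syntax that works today, with made-up coordinates, followed by a purely hypothetical GTID-based variant of the kind wished for above (it is not valid syntax in any current version).

    -- What works today: binary log name + position (made-up coordinates).
    CHANGE MASTER TO
        MASTER_HOST='node1',
        MASTER_LOG_FILE='mysql-bin.000042',
        MASTER_LOG_POS=120;

    -- Hypothetical, NOT valid syntax: starting replication from a given GTID.
    -- CHANGE MASTER TO MASTER_HOST='node1',
    --     MASTER_START_GTID='3E11FA47-71CA-11E1-9E33-C80AA9429562:203';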

MASTER STATUS

The usefulness of MASTER STATUS is reduced by the inability to combine its output with anything else. Since the status is only available as a SHOW statement, we can only use its output through an external procedure, not in SQL.
Most damning, though, is the grouping of GTIDs listed by SHOW MASTER STATUS under multi-source replication, where determining what was generated in the current server requires multiple operations (see the sketch after this list).
  • Its output should be available in a table.
  • When using multi-source, there should be a distinction between GTIDs generated in the current server and the ones received through replication.
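A quick sketch of the limitation: the SHOW output cannot be used inside a query, and the closest SQL-accessible piece of information, the executed GTID set, lumps together what was generated locally and what arrived through replication.

    -- The SHOW output cannot be combined with other queries:
    SHOW MASTER STATUS;

    -- The only part reachable from SQL is the GTID set, which does not
    -- distinguish local transactions from replicated ones:
    SELECT @@global.gtid_executed;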

SLAVE STATUS

In Part I of the replication analysis, we have seen that the only place where we have complete information is SHOW SLAVE STATUS, while all the replication tables in mysql and performance_schema together still lack some of the elements that this ancient command provides. There is much that can be done to improve the monitoring tables.
  • Show ALL the information from SHOW SLAVE STATUS in performance_schema tables (the current tables are sketched below).
  • In multi-source, show the executed GTIDs for the channel being listed. Currently, every channel shows all GTIDs.
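As a quick illustration, assuming the MySQL 5.7 table names, these are the kinds of queries we are limited to today; together they still do not cover everything SHOW SLAVE STATUS reports.

    -- Sketch, assuming MySQL 5.7 monitoring tables.
    SELECT channel_name, service_state, received_transaction_set
      FROM performance_schema.replication_connection_status;

    SELECT channel_name, last_error_number, last_error_message
      FROM performance_schema.replication_applier_status_by_worker;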

CRASH-SAFE tables in mysql

This is related to the previous section. Some info that is available in SHOW SLAVE STATUS is still missing from the crash-safe tables. And some more data, which is available to the replication threads, should be logged as well.
  • Show received and executed GTIDs alongside binlog and relay log info (see the sketch after this list).
  • Add schema info (in which schema the event was applied).
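For reference, these are the crash-safe tables in question, assuming the MySQL 5.7 layout; the wished-for GTID and schema columns are simply not there.

    -- Sketch, assuming MySQL 5.7: the crash-safe replication tables.
    SELECT Channel_name, Master_log_name, Master_log_pos
      FROM mysql.slave_master_info;

    SELECT Channel_name, Relay_log_name, Relay_log_pos
      FROM mysql.slave_relay_log_info;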

SLAVE WORKER TABLE

Continuing the grievance about SHOW SLAVE STATUS, the monitoring table dedicated to parallel replication could give more info. When an error occurs during parallel replication, the error message shows the thread identifier, the GTID, and the log where the problem arose. Why can't the same info be in the table as well?
  • Add GTID info
  • Add schema info
  • Remove the Checkpoint_group_bitmap field (which displays as gibberish) or make it displayable (see the sketch after this list).
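A minimal sketch of the display problem, assuming the MySQL 5.7 layout of the table: the bitmap column comes out as binary noise unless it is converted by hand.

    -- Sketch, assuming MySQL 5.7: the raw bitmap is unreadable;
    -- wrapping it in HEX() is the only way to get something displayable.
    SELECT Id, Channel_name, Checkpoint_group_size,
           HEX(Checkpoint_group_bitmap) AS bitmap_hex
      FROM mysql.slave_worker_info;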

performance_schema TABLES

Some of the above wishes apply to the performance_schema tables as well. Information known to the server is not made available for monitoring (a quick way to list the existing tables is sketched below).
  • Show both received and executed GTIDs for all replication_% tables that display progress.
  • Enhance replication tables to include all information related to replication.
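A quick way to see what is available today, as a starting point for the comparison, is to list the replication tables themselves.

    -- List the replication monitoring tables present in this server.
    SELECT table_name
      FROM information_schema.tables
     WHERE table_schema = 'performance_schema'
       AND table_name LIKE 'replication%';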

WISHES FOR FUTURE FEATURES

As MySQL 5.7 approaches GA, we can start worrying about the new features to come. I think that the data dictionary will be one of the features with the biggest impact on usability.
What concerns me is that some of the features of a data dictionary will clash with what is already implemented in information_schema and performance_schema. I'd like to see, as early as possible, an integration path for these features. I don't care if in this version we need to adjust our queries to the new structure, provided that there will be only one source of information and that the change provides real benefits.
A further wish for the future: when introducing changes related to security, make them substantial rather than noisy. Adding warnings to commonly used commands often results in unnecessary work for DBAs and developers.
Finally, and coming back to the main focus of this post, when thinking of replication I invite the MySQL team to innovate outside the box. There are many fields where replication can be enhanced and made more powerful. Here is a simple list:
  1. Make replication open to heterogeneous sources and destinations. Replicating from and to NoSQL entities is becoming more important by the day.
  2. Make replication more agile, by adopting and promoting an open standard for binary logs.
  3. Replication monitoring is now based on a single server. We need internal features that make it easier to create monitoring dashboards that can see the whole cluster at once. Adding the appropriate monitoring tables and giving slaves the ability to communicate their status back to the masters (which would then broadcast it back to the other slaves) would improve the replication process, and the ability to recover from a failure, immensely.

2 comments:

Federico Razzoli said...

Interesting post. You mention MySQL multisource replication. Reading your article, I see that the implementation is different from MariaDB multisource. I'm considering using MariaDB multisource in the future - do you have any comments about it?

Giuseppe Maxia said...

@Federico:
I have written 3 posts on the topic of multi-source, covering both MySQL 5.7 and MariaDB 10.
MySQL replication in action - Part 2 - Fan-in topology
MySQL replication in action - Part 3 - All-masters P2P topology
MySQL replication in action - Part 4 - star and hybrid topologies