Wednesday, December 19, 2007

Pop quiz (with prize): generate 4 billion records

My latest quiz was quite popular, and some interesting ideas were submitted.
There was an interesting development. A colleague called and asked me for advice on how to insert 4 billion rows in a table with a simple structure.
create table t1 (
id tinyint not null
);
Very simple. No primary key, no indexes. It is needed to perform some specific tests.
Actually, not 4 billion, but 2^32 records are needed, i.e. 4,294,967,296.
The classical method used in these cases is doubling the table contents:

insert into t1 values (1),(1),(1),(1),(1),(1),(1),(1);
insert into t1 select * from t1;
insert into t1 select * from t1; # and so on
My solution was similar to the one from my quiz.
CREATE VIEW `v4` AS
select NULL
union all select NULL
union all select NULL
union all select NULL;

CREATE VIEW v1024 AS
select null from v4 a, v4 b, v4 c, v4 d, v4 e;

INSERT INTO t1 select 1 from v1024 a, v1024 b, v1024 c, v4 ;
This one is faster than the doubling method, but it still requires from 40 to 65 minutes, depending on how fast is your server.
So, the challenge, for which we will give away 1 MySQL T-shirt to the winner, is as following:
  • Generate 2^32 records for table t1;
  • no limits to the length of code to use;
  • you can use stored routines, temporary tables, views, events, whatever makes the insertion fast;
  • the method must be portable across operating systems. If it is not portable, a Unix-only method may be accepted if it is exceptionally faster than SQL-only solutions;
  • methods relying on external applications cannot be accepted;
  • If an programming language is needed, for compatibility with the test suite we can only accept Bash shell or Perl scripts;
  • If you devise a fast method to insert data using MySQL Proxy, a Lua script can be accepted.
  • what matters is the final insertion speed and ease of use.
  • If a method uses an external script, its speed must be more than 20% faster than the fastest method using only SQL.
  • The speed will be calculated on my server, using MySQL 5.1.23.

Solutions so far

  • Dpin. 20 minutes for a portable solution is very good.
  • Jedy. Very nice, but not that fast. 32 minutes.
  • Todd. Brilliant solution (10 minutes for 4 billion rows!), but really impractical. We need something that works for any engine. This one is a dirty trick that is fun to use once, but in the long run it won't stand.

Data from nothing - solution to pop quiz

My latest post on generating one million records received many comments, with interesting solutions.
The challenge required the insertion of 1 million records in a simple table, with a few constraints.
create table t1 (
dt datetime not null,
primary key (dt)
);

The solution

The official solution is straightforward:
create view v3 as select null union all select null union all select null;
create view v10 as select null from v3 a, v3 b union all select null;
create view v1000 as select null from v10 a, v10 b, v10 c;
set @n = 0;
insert into t1 select now()-interval @n:=@n+1 second from v1000 a,v1000 b;
You can appreciate the title here. Data from nothing. Generate 1 million valid records from a view made of NULLs!
The principle is easy.
  1. First create a few values in a view, by means of UNION ALL queries.
  2. Then, using Cartesian products, generate a larger view.
  3. And a larger one, containing 1,000 self generating rows.
  4. The final step is a simple Cartesian product in a self join.
This is something that they told you not to do in Database 101 classes, but it can be handy sometimes!

It can be reduced to 4 lines:
create view v3 as select 1 n union all select 1 union all select 1;
create view v as select 1 n from v3 a, v3 b union all select 1;
set @n = 0;
insert t1 select now()-interval @n:=@n+1 second from v a,v b,v c,v d,v e,v;

The other solutions

The solution that comes closer to the intended one comes from Shane Bester.
set @a:=1167602400,@b:=1168602400;
create view v as select 1 union select 2 union select 3 union select 4;
create view x as select 1 from v,v a,v b,v c,v d;
insert t1 select(from_unixtime(@a:=@a+1))from x,x a where @a<@b;
Shane also produced the fastest solution.
create view v as select a.dt from t1,t1 a,t1 b,t1 c,t1 d,t1 e,t1 f,t1 g;
replace t1 values('990101'),('980101'),('970101'),(@a:=1);
replace t1 select (adddate(a.dt,@a:=@a+1)) from v,v a limit 999996;

B.Steinbrink produced the shortest one.
  INSERT t1 VALUES(@a:=72620211),(101),(102),(103),(104),(105),(106),(107);
replace t1 select@a:=adddate(@a,1)from t1,t1 a,t1 b,t1 c,t1 d,t1 e,t1 f;

Roland Bouman produced the most wicked one. And a method that caught me by surprise.
# WARNING! DON'T RUN IF YOUR SERVER DOES NOT HAVE AT LEAST 4 GB RAM !!!
set max_allowed_packet := 1073741824;
set @s=concat('insert into t1 values(@d:=10000101+@i:=1)');
set @s=concat(@s,repeat(',(date_add(@d,interval @i:=@i+1 day))',999999));
prepare s from @s;
execute s;
In 5 lines, he is generating a 37 MB query containing 1 million records!
Not very efficient, but it works, if you have a huge amount of RAM. Otherwise, it can crash your server!
Kai Voigt came up with the same idea, with a faster execution.
set @a=concat(repeat(concat(@s:="select 1 ","union all "),99),@s),@x:=0;
set @c=concat(@s,"from (",@a,")a join(",@a,")b join(",@a,")c"),@i="inser";
set @d=concat(@i,"t into t1 select from_unixtime(@x:=@x+1) from(",@c,")c");
prepare s from @d;
execute s;
Jedy solution was the fastest until Shane's arrived.
insert into t1 values (now()-2),(now()-1),(now()),(now()+(@a:=1));
insert into t1 select adddate(now(),@a:=@a+1) from t1 a,t1 b,t1 c,t1 d,t1;
insert into t1 select adddate(now(),@a:=@a+1) from t1,t1 b limit 998972;
You can see all the other solutions as comments to the previous post.
This post comes earlier than promised, because of an unexpected development. A new quiz is coming. Stay tuned!

Monday, December 17, 2007

Pop quiz: generate 1 million records

This is a quiz that has a aha! solution. Not so trivial, though. It requires some thinking.

Given this table:
create table t1 (
dt datetime not null,
primary key (dt)
);
Task: insert exactly 1 million records in table t1, with the following constraints:
  • Use a maximum of 5 (five) SQL statements;
  • Use only the MySQL interactive command line client;
  • No shell commands;
  • No loading of scripts;
  • No inserts from existing tables in your system. You must assume that t1 is the only table in your entire database;
  • No MySQL Proxy;
  • No stored routines, triggers or events;
  • Each statement must be not longer than 75 characters;
  • UPDATE. No modification of table t1;
  • No LOAD DATA.
Prize: fame and fortune (i.e. your name quoted in these columns).
I will publish the solution at the end of the week.

To make sure that I am not cheating, here is the MD5 signature of the solution file that I will publish this week.
fc6d32faf19b5ac1064093a6d7969f7c  solution.txt
If you are paranoid and believe that I can create an arbitrary file and make its contents match with the above MD5 signature, you have until Friday to get the solution from that. :)

Update: To keep the challenge interesting, I won't publish the comments with the right solutions for a few days. If you have sent a comment, don't worry if it does not show up immediately. I will publish it soon before the final solution.

Solutions so far

  • Shane. Three lines. Less than 6 seconds. And close to the original solution! (Your third solution is almost the same as the intended one)
  • Jedy. Three lines! and less than 6 seconds to execute! on top.
  • Dipin. Your latest solution (3 lines) tops the list.
  • Kai Voigt. one solution with the crowd, and a wicked one like Roland's, but much faster!
  • Roland Bouman. This is the wickedest solution so far. Not the shortest, but it's the one that is totally different from the intended one (and the other solutions). It will crash most weak servers, though.
  • Carsten. In the same league as Kai and Roland for the wicked solution.
  • Dirk1231. The challenge requires not to use other tables. (Also your second solution does). The third attempt put you finally in the list! Your fourth solution is nice, but not enough to climb to the top.
  • Sergey Zhuravlev. Excellent! Can you do it without creating a table? (your second solution is very nice and imaginative)
  • WSchwach. Your solution qualifies as cheating. Altering the given table to insert duplicate records is not a valid solution. Good shot, nonetheless!
  • Morgan Tocker. Very creative! That's very good.
  • Matthias. Good use of the allotted characters.
  • Ephes. Not bad. But you are not assuming that t1 is the only table in your system. Will you try without creating tables?
  • Hubert. Nice try. One of the elements you mentioned will lead you to the right track. Keep trying.
  • Bill Karwin. I did not specify that the numbers should be contiguous, and you took advantage of it! Well done!
  • Domas. Good solution. I expected more from you.
  • Tobias. Your first solution runs forever. The second one is cheating (using a information_schema table)
  • Erik. Nope. No other tables allowed. Not even information_schema tables.
  • William: Nope. No stored routines allowed. And the instruction to create the routine is longer than 75 characters. Your second solution has a command longer than 75 chars, and it's using other tables, and it does not work either!

Some hints

Keep trying, and consider that my intended solution does the following:
  • Inserts contiguous records;
  • Uses all dates in this century;
  • Does not insert from ANY table, not even t1;
  • Does not use LIMIT;
  • The total execution time is below 7 seconds.
UPDATE: Solution online!

Sunday, December 02, 2007

MySQL University - Introducing Lua for Proxy scripting

MySQL University
Mark your calendars!
On Thursday, December 13th at 15:00 CET (14:00 UTC, 09:00 EDT, 06:00 PST) I will host a session of MySQL University, on the topic of Introducing Lua for MySQL Proxy scripting.

For those who missed the previous announcements, MySQL University is a series of online expert lessons that you can join for free and attend from the comfort of your home or office. The slides are provided in either PDF of wiki pages, the audio is an ogg stream, and you can interact with the lecturer via IRC.
If you have heard of MySQL Proxy but haven't got the time of getting involved with it yet, this session is for you. If you were interested but you thought that another scripting language would be too difficult, give this session a chance. Enroll now!