ChannelMeter was officially launched May 1, 2011 but we’ve been harvesting data for a lot longer. As of this writing we harvest data for 8580 YouTube channels and something like 2.3 million videos; some of this data goes back as far as 2009. More channels are added constantly, which adds all of their videos, and we harvest data every hour of every day.
Obviously, this is a lot of data. You don’t store this in an Excel spreadsheet.
Enter MySQL 5.5 Community Version. It’s a fantastic product — it’s fast, it’s web scale, it can store huge amounts of data and still access it in the blink of an eye, it can be tuned to run on just about any size server, it supports replication in every configuration imaginable, it offers multiple table engines so you can choose the one that works for you, etc.
Here begins the difficulty, though: all those cool options involve choices, and configuration, and the occasional ‘gotcha’. DBAs are the people who specialize in knowing all this, and I’m a developer, not a DBA. But, as the ChannelMeter Technical Co-Founder, I’m responsible for managing all our tech. It’s been an interesting learning experience. Here are a few of the things I’ve learned, in no particular order (and not all at ChannelMeter). Standard disclaimer, your own risk, not my fault, blah blah blah.
InnoDB is your friend. Act accordingly.
Declare InnoDB explicitly
When you create your tables, explicitly instruct them to use InnoDB. This is the default as of 5.5 but you never know what server version your schema will be instantiated against (ChannelMeter has been hosted at 3, soon to be 4, separate hosts in its lifetime.) If you don’t know why InnoDB matters, the basic version is: MyISAM (the pre-5.5 default) can occasionally lose or corrupt your data. It also provides no referential integrity — foreign keys are silently ignored, transactions are noops, etc. An out-of-the-box InnoDB will not befoul your data, and will allow you to use FKs, transactions, and all that other ACID goodness.
Configure the server with InnoDB-specific params
After you’ve set all your tables to InnoDB, edit your my.cnf and add the following (hover for explanations):
These are conservative defaults that should work on just about any machine; you should read the docs and set these to something appropriate for your system.
Binary logs are big; gzip them
By default, MySQL puts the error log in the data directory and the binary logs in the data directory. The binary logs are going to build up and will eventually take a lot of space. I am uncomfortable deleting anything in the data directory, so I simply gzip’d them and left them in place. They compress very nicely (generally about 5:1), and I can always nuke them, or gunzip them, later.
Backups are painful. Percona is awesome.
There are multiple ways to back up a MySQL server (there are others than the ones I linked). All of them have problems: they cost a lot, they take too long for large datasets, they risk getting inconsistent data under certain edge cases, etc.
Percona is a company; Percona Server is one of their products. BUT. It’s not a different database — it’s still MySQL, it’s just MySQL++. Percona took the MySQL code, extended it, and created Percona Server as a drop-in replacement for the stock server. Backups were one of the pain-points they specifically set out to fix, and they did.
With Percona Server, backups are now a trivial cronjob that stream all data into a compressed tar file (or a bazillion different other options, if you prefer), while the system is running, with no risk of it hosing your data. They also provide a raft of other stuff that I have not had cause to use yet, but I’m glad it’s there.
Unless you have a paid support contract for MySQL-stock-install, there is no reason not to migrate to Percona Server immediately; you can always go back.
INSERTs, UPDATEs, etc are limited by max_allowed_packet
When dealing with a single large chunk of data that needs to get into or out of the DB, the gating factor is probably the max_allowed_packet variable, not RAM, disk, etc. Here are the relevant quotes from the page I just linked:
The largest possible packet that can be transmitted to or from a MySQL 5.5 server or client is 1GB. … Both the client and the server have their own
max_allowed_packetvariable, so if you want to handle big packets, you must increase this variable both in the client and in the server. … If you are using the mysql client program, its default
max_allowed_packetvariable is 16MB.
The ‘packet’ they are talking about is the entire data you’re sending, not just the largest field within it. So, this issue applies most often when dealing with TEXT or BLOB, but can also crop up (e.g) when inserting to a bunch of large VARCHAR fields.
Finally, note that when you send a too large packet, the INSERT fails. This is exactly what a database should do and is surprising only in that it is counter to MySQL’s normal behavior of truncating data to the max amount that will fit in a field. (A practice that I personally find abhorrent, I might add — it should either fail or succeed at inserting what I gave it, but successfully inserting something else is insane.)
That’s it for the actual advice, but I would like to share the story about how we migrated to Percona Server. When ChannelMeter started, I wasn’t aware of Percona. When it came time to set up a backup system that could handle reliably backing up (at the time) 50+ GB of data and growing, without shutting the server down, fast enough that the backup finished before the next harvest happened…I found this to be non-trivial. Searches were done. IRC channels were queried. Things were tried. Blasphemous oaths were sworn. The wall beside my desk acquired a remarkable negative mold of my forehead. A couple of weeks passed.
Then my friend Rich came over to help me with a migration, and I asked him to see if he could take a shot at the backups. He did the research, discovered Percona, downloaded the Percona Server, built it on the prod server, shut down the stock MySQL server, and brought the Percona Server back up. There was our familiar MySQL environment, serving all our data, happily connecting to the ‘mysql’ CLI and our application. The whole thing, soup to nuts, took about an hour.
Just goes to show –sometimes, the best approach is to ask someone else to do it.
Ok, this has been a long post; thanks for sticking with me, and I hope you enjoy ChannelMeter!
Update 2011-05-25-18:29PST: Fixed the name of the max_allowed_packet variable, which I had mistyped in several places.