Saturday, 29 December 2012

Recovering from Cassandra "Value already present" error

This is a write-up of the process I recently went through with a sysadmin colleague to recover from a crippling Cassandra failure that threatened to make all of our data inaccessible. I am deviating from my usual step-by-step instruction style here because I am not quite confident that this particular method is applicable to all situations. We had a clear mandate about which sets of data had to be absolutely saved and which could be "let go". This might not be the case for you.  Hopefully, reading about our experience could help you to figure out the best course of action for your situation.

Platform: DataStax Cassandra 1.1.7 (community edition) running on a five node cluster hosted on Amazon EC2.

It all started when our sysadmin brought down the cluster for some maintenance work. This was necessitated by the fact that some our applications had started failing with connection timeout errors from the Cassandra nodes. The CPU load on the cluster was quite high and it appeared that there were excessive garbage collection attempts happening in the JVMs. My colleague was hoping that tuning the heap size and memtable size would solve this problem.

Config changes done, the cluster was ready to be brought back into life but that's when things started going awry. The cluster wouldn't start and we saw the following error message in the logs:
Exception encountered during startup
java.lang.IllegalArgumentException: value already present: 687508

A bit of googling revealed that this is a bug in Cassandra that has been fixed in the latest 1.2 branch.( However, nobody had any suggestions about how to recover from it and get an affected cluster back to life. It was apparently due to a race condition that allocated the same ID to Column Families in two different keyspaces.

We enabled debug logging and figured out the keyspaces and column families that were at the root of the issue. We then thought that perhaps if we edited the metadata and changed the ID of one of the offending column families to something else, the cluster might just start back up. The only way we could think of doing this was by using the sstable2json command to look through the system metadata sstables.

Unfortunately, the sstable2json command first tries to load the entire schema into memory using the exact same process that the daemon uses. This obviously fails because of the duplicate CFID. Now we were truly stuck.

I asked my colleague to send over the system/schema_keyspaces and system/schema_columnfamilies directories to see if I could figure out a way through to them. Perhaps running the sstable2json command on a clean environment without the rest of the data files would work. I was half right in this regard. The tool no longer failed at the duplicate CFID error. But it did throw a different exception now.
sstable2json /home/cell/tmp/schema_columnfamilies/system-schema_columnfamilies-hf-712-Data.db

no non-system tables are defined
Exception in thread "main" org.apache.cassandra.config.ConfigurationException: no non-system tables are defined

My next step was to download the source code for Cassandra and try to figure out what was causing this error. It turns out that there's an explicit check in the SSTableExport class to not dump any of the system tables. There was no way to switch this behaviour off, so the only option was to modify the source itself. This was quite easy but the next hurdle I had to jump through was getting Cassandra to build locally with my changes. The Ant build script was throwing errors and there was too much pressure and very little time to spend on actually trying to figure out why this was so.

Fortunately, I had another idea. The SSTableExport class which forms the sstable2json tool is a single Java file that uses the Cassandra core classes to do the bulk of the work. I created a new Java project in my IDE, added all the jar files from the Cassandra binary distribution to the classpath and then created a single source file - which was a copy the SSTableExport file with my code changes. This compiled without any issues and I was able to generate my own jar file containing the modified code. (
java -cp /home/cell/apps/dsc-cassandra-1.1.7/lib/*:/home/cell/apps/dsc-cassandra-1.1.7/config:my-ss.jar me.otiose.SSTableExport /home/cell/tmp/schema_columnfamilies/system-schema_columnfamilies-hf-712-Data.db

The above command now produced a JSON representation of the schema_columnfamilies sstable. However, the ID that was causing the problem was nowhere to be found. So the idea of editing it was no longer possible.

My final desperate attempt to salvage the situation was to try something of a different tack. What if I removed from the metadata, one of the keyspaces that was causing the issue. Will Cassandra then ignore it and proceed with cluster startup?

Using the above command, I dumped the schema_keyspaces sstables to JSON and removed the lines the contained references to the keyspace that we were sacrificing. In order to convert the modified JSON files back to sstables, I tried the json2sstable command but it threw the same exception as the sstable2json command. So it was back to the IDE to edit some more code and I created a new copy of json2sstable without the system table check. (
java -cp /home/cell/apps/dsc-cassandra-1.1.7/lib/*:/home/cell/apps/dsc-cassandra-1.1.7/config:my-ss.jar me.otiose.SSTableImport -K system -c schema_keyspaces /home/cell/tmp/modified_sstable.json /home/cell/tmp/schema_keyspace_fixed/system-schema_keyspaces-he-712-Data.db

The above command produces a bunch of Data, Index, sha1 and other files. I copied them all to the system/schema_keyspaces directory on the node, overwriting the existing files. Then to verify that the problem was fixed, I ran the unmodified sstable2json command. Now it worked without any complaints about the duplicate CFID!

I quickly followed the same procedure on the other nodes. (Each node has a unique set of files in the system/schema_keyspaces directory, so you cannot simply copy a single fixed file to every node). My sysadmin colleague started bringing the nodes up one-by-one and finally, the cluster was back up again albeit in a very flaky state.

Now it was the time for my colleague to do his magic. One or two nodes had missing schemas for some of the column families. He did a nodetool repair on those and then ran a describe on one of the functioning nodes to obtain a schema definition, which he then ran on the unstable nodes to recreate the schema.

Finally, after a long fought battle, our cluster was operational again with no data loss. We were quite lucky in this regard because the keyspace that I manually removed from the metadata was due to be dropped anyway. If it were not the case, I would probably have tried editing the column family metadata itself. We may have experienced some data loss in the latter case but it is far more preferable to losing all of it.

1 comment:

Jacob Albertson said...

This just saved me hard core. Thanks!