Subject: Let's discuss Snapshots Feature Testing
From: Aleksandr Shulman (ale...@cloudera.com)
Date: Jan 14, 2013 10:31:39 am
I'd like to start a thread about Cloudera's testing efforts on the upcoming snapshots feature. This is a new feature and it's important that we explain our testing efforts and get the community's opinion on what we'd all like to see tested. My hope is that from this discussion, we can get more ideas about what needs to be tested and gain confidence in the testing we have in place.
Before I begin, I'd like to introduce myself. I'm Aleks Shulman. I'm a software engineer at Cloudera, working primarily on HBase. Within HBase, I am focusing on the quality side of things. What this means to me is a conversation unto itself, but in brief, I will be writing tests and test frameworks. I will also be an advocate for the user experience, with particular focus on API compatibility and ease of use.
So let's discuss snapshots. There are two main areas that should be tested, and they correspond nicely to what can be done as unit tests and what is better left to a Jenkins job or some other automation: unit testing and non-unit testing. We've been working on this for a bit, so there is already some progress in these areas:
Unit testing - In progress or completed:
1. HBase Snapshots Repeatability and Idempotency Test: This test class verifies proper behavior with regard to performing restore/clone operations on tables that were themselves created as a clone or restored from a snapshot. This is an interesting set of cases because of the way snapshots work: they point to the original HFiles rather than copying them. We can use these tests to verify correctness in the file system and to test closure under deletion of the original table.
2. HBase Snapshots HTable Descriptor Test: This test class verifies proper behavior with regard to changes to the table's own metadata, made before and after snapshotting, in both the 'before' table and the 'after' table.
3. HBase Snapshots HFileLink Test: This test class inspects the correctness of the HFileLink files. It looks into their permissions, their naming convention, and how they respond to events. Events may include an HFile being deleted or moved.
4. HBase Snapshots Table Dimensions Test: This test class inspects operations on tables that are empty, have only one row, have one or two CFs, etc. Basically, it covers any edge scenario in what the table looks like that may affect the way it is snapshotted or restored/cloned.
5. HBase Snapshots Independence Test: This test should verify that all aspects of table independence are guaranteed between the original table and the restored snapshot/clone. This includes things like data mutations, compactions, splits, etc. It also includes metadata changes.
6. HBase Snapshots Aborted or Failed Snapshot Cleanup: Verifies that no cruft is left over after an attempt to snapshot a table fails or is aborted. We should be able to account for every file in the file system before and after.
7. HBase Snapshots HFile Archive Test: This test task is to fill in any gaps in testing of archiving as it relates to snapshots. The snapshots feature relies on the HFileArchiver/LogArchiver with two new cleaners (SnapshotHFile/SnapshotLog Cleaners), so we need to go through and find out what needs to be tested between them.
8. HBase Snapshots Export Test: This test should verify that export of a snapshot to another cluster works properly. Implemented as: mvn clean test -PlocalTests -Dtest=org.apache.hadoop.hbase.snapshot.TestExportSnapshot However, we need to add more tests around chmod, chown, and checksums.
9. HBase Snapshots Concurrent Snapshots Test: This test class will enforce proper behavior in situations where race conditions can occur. For example, if one process attempts to restore a table and another one tries to do so simultaneously, what happens? We need to know how dangerous this could be and whether it is possible for data to be lost. Covered in HBASE-7536.
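To make the file-sharing behavior behind items 1, 5, and 7 concrete, here is a toy in-memory model. It is only a sketch and not HBase code: the class `ToySnapshotStore` and all of its methods are hypothetical names made up for illustration. It assumes a table or snapshot is just a set of immutable file names, that deleting a table moves its files to an archive, and that a cleaner may remove only archived files that no table or snapshot still references.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model (NOT HBase code): tables and snapshots are sets of immutable file names. */
final class ToySnapshotStore {
    private final Map<String, Set<String>> tables = new HashMap<>();
    private final Map<String, Set<String>> snapshots = new HashMap<>();
    private final Set<String> archive = new HashSet<>();
    private int nextFileId = 0;

    /** Creates an empty table. */
    void createTable(String table) {
        tables.put(table, new HashSet<>());
    }

    /** Simulates a flush: the table gains one new immutable file. */
    String flush(String table) {
        String file = "hfile-" + (nextFileId++);
        tables.get(table).add(file);
        return file;
    }

    /** A snapshot records which files the table references; no data is copied. */
    void snapshot(String name, String table) {
        snapshots.put(name, new HashSet<>(tables.get(table)));
    }

    /** A clone is a new table that references the snapshot's files. */
    void cloneSnapshot(String name, String newTable) {
        tables.put(newTable, new HashSet<>(snapshots.get(name)));
    }

    /** Removes the snapshot's references; the files themselves are untouched. */
    void deleteSnapshot(String name) {
        snapshots.remove(name);
    }

    /** Deleting a table archives its files instead of removing them outright. */
    void deleteTable(String table) {
        archive.addAll(tables.remove(table));
    }

    /** Removes archived files that nothing references; returns how many were removed. */
    int runCleaner() {
        Set<String> live = new HashSet<>();
        tables.values().forEach(live::addAll);
        snapshots.values().forEach(live::addAll);
        int before = archive.size();
        archive.removeIf(f -> !live.contains(f));
        return before - archive.size();
    }

    /** Returns a copy of the file names the table currently references. */
    Set<String> filesOf(String table) {
        return new HashSet<>(tables.get(table));
    }

    boolean isArchived(String file) {
        return archive.contains(file);
    }
}
```

In this model, a flush on the original table after the snapshot does not change the clone's file set, and deleting the original table leaves the shared file in the archive until the snapshot and the clone are both gone. Those are the kinds of properties the independence and archive tests are meant to check against the real implementation.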
Unit testing - Lightly tested so far, or tests we are hoping to write soon:
1. HBase Snapshots File System Correctness Tests -
This test class verifies proper behavior with regard to what the file system looks like. What the file system contains should be predictable after certain events, both snapshot-specific and environment-specific. For example, after a snapshot, we should expect there to be files in the /hbase/.snapshot/ folder. Also, after a split occurs on the base table and the underlying HFiles go through flux, we should be able to know beforehand where files move. In particular, this is important to test after repeated deletions and modifications. Also -- we want to make sure no cruft remains after various operations occur.
2. HBase Snapshots (Re)Naming Test [Note: Renaming snapshots is not supported yet!]
These tests should verify valid/invalid names for snapshots. In particular, they should use the rename_snapshot command to attempt to rename a snapshot to the name of a table that already exists, or to the name of a snapshot that already exists (or had existed but was deleted). Things like special characters or semantically meaningful characters are important as well. Another thing that needs to be tested is what happens if a snapshot is created and deleted, the underlying table is modified, and then another snapshot is taken. The new snapshot should contain the most recent data.
3. Snapshots logline test: Verifies that the proper loglines are generated for events. Manual testing for this might include making sure that spurious, misleading, or unnecessary log lines are not present.
4. HBase Snapshots Aborted or Failed Clone or Restore
Verifies that no cruft is left over after an attempt to restore or clone a snapshotted table fails or is aborted and that further snapshots can take place. This may be tricky and could require writing some additional utilities.
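The "no cruft left over" checks in this list and in item 6 above boil down to comparing a file listing taken before an operation with one taken after it. Below is a minimal sketch of that comparison. It assumes a local directory and uses `java.nio.file`; a real test against HBase would presumably go through the Hadoop FileSystem API instead. The class name `FsCruftCheck` and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of HBase): diff two listings of a directory tree. */
final class FsCruftCheck {
    /** Returns the root-relative paths of all regular files under root, sorted. */
    static Set<String> listFiles(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .map(p -> root.relativize(p).toString())
                       .collect(Collectors.toCollection(TreeSet::new));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Files present in 'after' but not in 'before': candidates for leftover cruft. */
    static Set<String> leftovers(Set<String> before, Set<String> after) {
        Set<String> extra = new TreeSet<>(after);
        extra.removeAll(before);
        return extra;
    }
}
```

A test would call `listFiles` before the failed or aborted operation, call it again afterwards, and assert that `leftovers` is empty. Files that are expected to appear legitimately would need to be filtered out first.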
Non-unit testing: This area of testing is less straightforward and more exploratory in nature. It's open-ended but with some direction. In particular, we want to test a lot of "what if this happens when we do something related to snapshots" scenarios. By "this happens", I mean compactions, splits, processes dying, the master failing over to a backup master, etc. By "something related to snapshots", I mean taking a snapshot, restoring a snapshot, or cloning a snapshot, among other things. In addition, we can see what happens as scaling factors (e.g. the number of regions, amount of data per node, duration of test, and frequency of compactions/splits) increase. Finally, we should benchmark the time it takes to take/restore/clone a snapshot and see how it changes with these scale factors.
We are testing some of these combinations internally. When we see something go awry, we fix it and rerun the trial, with the expectation that the feature becomes more stable and reliable.
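One way to keep track of the "this happens" versus "something related to snapshots" combinations is to enumerate them as a matrix. The sketch below only generates the list of scenarios named in the paragraph above; it does not drive a cluster. The class and enum names are hypothetical, and the enum values are simply the examples from the text, not an exhaustive list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: enumerates disruption x snapshot-operation combinations. */
final class SnapshotScenarioMatrix {
    /** Disruptive events mentioned in the text. */
    enum Disruption { COMPACTION, SPLIT, PROCESS_DEATH, MASTER_FAILOVER }

    /** Snapshot operations mentioned in the text. */
    enum Operation { TAKE_SNAPSHOT, RESTORE_SNAPSHOT, CLONE_SNAPSHOT }

    /** One trial: perform the operation while the disruption is happening. */
    static final class Scenario {
        final Disruption disruption;
        final Operation operation;

        Scenario(Disruption disruption, Operation operation) {
            this.disruption = disruption;
            this.operation = operation;
        }

        @Override
        public String toString() {
            return operation + " during " + disruption;
        }
    }

    /** Returns every (disruption, operation) pair, disruptions varying slowest. */
    static List<Scenario> allScenarios() {
        List<Scenario> scenarios = new ArrayList<>();
        for (Disruption d : Disruption.values()) {
            for (Operation o : Operation.values()) {
                scenarios.add(new Scenario(d, o));
            }
        }
        return scenarios;
    }
}
```

With four disruptions and three operations this yields twelve scenarios. Scale factors such as region count or data volume could be added as a further dimension in the same way.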
Some of the things we have tried:
- Long-running tests: Run repeated snapshots while verifying that all is well.
- Meanness tests: 1. Killing the master 2. Performing a compaction 3. Table enable/disable
Feel free to follow up with questions.
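The long-running tests follow a simple loop: perform a step, verify, repeat, and stop at the first failure. A minimal sketch of such a loop is shown below, with per-iteration timing so the same harness can feed a benchmark. This is an illustration only; `LongRunningTrial` is a hypothetical name, and the step is passed in as a plain `IntPredicate` that returns false when verification fails, where a real test would snapshot a table and verify its contents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical harness: repeats a step until it fails or a limit is reached. */
final class LongRunningTrial {
    /** Outcome of a run: iterations that passed and the duration of each attempt. */
    static final class Result {
        final int passed;
        final List<Long> nanosPerIteration;

        Result(int passed, List<Long> nanosPerIteration) {
            this.passed = passed;
            this.nanosPerIteration = nanosPerIteration;
        }
    }

    /**
     * Calls step.test(i) for i = 0 .. maxIterations - 1 and stops at the first
     * iteration that returns false. The failing attempt is timed as well.
     */
    static Result run(IntPredicate step, int maxIterations) {
        List<Long> timings = new ArrayList<>();
        int passed = 0;
        for (int i = 0; i < maxIterations; i++) {
            long start = System.nanoTime();
            boolean ok = step.test(i);
            timings.add(System.nanoTime() - start);
            if (!ok) {
                break;
            }
            passed++;
        }
        return new Result(passed, timings);
    }
}
```

A meanness test could reuse the same loop and inject a disruption (for example, a master kill) inside the step on selected iterations.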
-- Best Regards,
Aleks Shulman 847.814.5804 Cloudera