
Conversation

@cdahlqvist

The performance of read-intensive applications using the Bitcask backend can be enhanced if the operating system's file cache is used efficiently. In order to allow as much data as possible to be cached, one approach is to compress the stored data.

When using Bitcask together with Riak Search and/or Yokozuna, storing compressed data complicates the indexing process. It would instead be desirable to have the backend handle compression internally, as LevelDB does, which is why I have added support for snappy compression to Bitcask.

src/bitcask.erl Outdated
Contributor


I'd prefer these to be named maybe_compress and maybe_decompress to more clearly state intent.
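For the write path that could look roughly like this (a sketch only; should_compress/2, the Opts handling, and the assumption that snappy:compress/1 returns {ok, Binary} are illustrative, not the actual diff):

    %% Sketch: compress on write when the (illustrative) policy check says so,
    %% otherwise store the value unchanged.
    maybe_compress(Value, Opts) ->
        case should_compress(Value, Opts) of
            true ->
                {ok, Compressed} = snappy:compress(Value),
                Compressed;
            false ->
                Value
        end.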

@evanmcc
Contributor

evanmcc commented Sep 23, 2013

This seems like a reasonable feature to add, given the new snappy nif. What kind of performance testing would you suggest for further review? Is there going to be a corresponding riak_kv branch to enable it?

src/bitcask.erl Outdated
Contributor


You may want to change false here to _, as the function can also return unknown, which happens if some feature of the data confuses snappy enough to throw an error.
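In other words, something like this on the read path (a sketch only; is_snappy_compressed/1 stands in for whatever validity check the code uses and is assumed to return true, false, or unknown):

    %% Sketch: only decompress values known to be valid snappy buffers;
    %% the catch-all clause covers both false and unknown.
    maybe_decompress(Value) ->
        case is_snappy_compressed(Value) of      %% illustrative validity check
            true ->
                {ok, Plain} = snappy:decompress(Value),
                Plain;
            _ ->
                Value
        end.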

@cdahlqvist
Author

I have updated according to review comments, but have kept the threshold parameter instead of the suggested size limitation.

Small objects are less likely to benefit much from compression and are therefore more likely not to be stored in compressed form. I have updated the default threshold value to 1.0 instead of the rather arbitrary 0.8. This means that any object whose size is reduced by compression will be stored in compressed form.

There are now three configuration parameters available that should be specified in app.config under the bitcask section (an example entry follows the list below). These are:

  • enable_compression - Enables or disables compression for the backend. Specified as a boolean value; defaults to false.
  • compression_ratio_threshold - The compression ratio used to determine whether an object is stored in compressed form. Specified as a float; the default value is 1.0. A value of 0.8 means that an object will be stored compressed only if its compressed size is 80% of the original object's size or less.
  • compression_size_threshold - Objects smaller than this size threshold, specified in bytes, will never be compressed. The default value is 1024 bytes. Note that this size refers to the serialised form of the Riak object, not the value associated with it.
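For example, an app.config entry with these settings might look like this (the data_root path and the values shown are purely illustrative):

    {bitcask, [
        {data_root, "/var/lib/riak/bitcask"},
        {enable_compression, true},
        {compression_ratio_threshold, 0.8},
        {compression_size_threshold, 1024}
    ]}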

@slfritchie
Contributor

@cdahlqvist and @evanmcc -- Any updates on this ticket?

@ghost ghost assigned evanmcc Dec 13, 2013
@gburd

gburd commented Jan 24, 2014

What is the impact on performance (on reads and writes)? Do we see longer latency times for compressed objects? What happens to the flat latency of Bitcask when compression is enabled? Do we track metrics on compression ratio, the number of objects compressed/not compressed per cask, the number attempted for compression but found not to meet the threshold, the number ignored due to size, etc.? What is the real saving in file size when this is enabled?

With LevelDB the compression is applied across blocks, each of which may contain many values, which can lead to much higher compression rates. This code compresses individual objects, so it won't be as effective as LevelDB. Can we determine the compression ratio for the same data set in LevelDB vs. Bitcask?

@evanmcc
Contributor

evanmcc commented Jan 25, 2014

Most of the reason this has slipped is that I haven't had the time to determine any of those things. This PR just adds the capability of compression; it doesn't add any tracking. All of those changes would require infrastructure tweaks up and down the stack, both to retain the metrics and to report them.

@gburd

gburd commented Jan 25, 2014

I think we need to do more testing of this feature to see if it's actually going to add value or is just something that is perceived to add value. If we do add those stats, we should consider wiring LevelDB such that it produces similar information.

@cdahlqvist
Author

I am starting to think this might actually be the wrong way to go, as it requires checks and processing on both reads and writes.

It might be a better solution to enable the Search extractors to consider content-encoding as well as content-type and decompress the data before indexing it. This would store the data in compressed form and only require extra work on writes. For large documents the compression ratio would probably also be better than what can be achieved with snappy.

The drawback is naturally that this would mean additional processing for the Search extractor when indexing, but only objects with the appropriate content-encoding would be affected, leaving this under the application's control. It could therefore reduce overhead for read-intensive applications and would benefit all backends.
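As a rough illustration of the idea (a hypothetical pre-processing helper, not an existing Riak Search or Yokozuna API; it only assumes the standard zlib module):

    %% Hypothetical step run before extraction: gunzip values whose
    %% content-encoding marks them as gzip-compressed, index others as-is.
    maybe_decode_for_indexing(Value, ContentEncoding) ->
        case ContentEncoding of
            <<"gzip">> -> zlib:gunzip(Value);
            _          -> Value
        end.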

@gburd

gburd commented Jan 27, 2014

Your comments about the search extractors lead me to believe that this feature was requested by someone using Bitcask with Riak 2.0's Search. Is that the case? What was the problem they were trying to address that led you to this prototype implementation of compressing objects stored by Bitcask?

@evanmcc evanmcc modified the milestones: 2.1, 2.0 Mar 19, 2014
@evanmcc evanmcc removed their assignment Jun 26, 2014
@slfritchie
Contributor

Closing due to old age and lack of followup.

@slfritchie slfritchie closed this Nov 20, 2014
@seancribbs seancribbs deleted the cd-snappy branch April 1, 2015 22:36