
Conversation

@cdahlqvist

The performance of read-intensive applications using the Bitcask backend can be enhanced if the operating system's file cache is used efficiently. In order to allow as much data as possible to be cached, one approach is to compress the stored data.

When using Bitcask together with Riak Search and/or Yokozuna, storing compressed data complicates the indexing process. It would instead be desirable to have the backend handle compression internally, as LevelDB does, which is why I have added support for snappy compression to Bitcask.

src/bitcask.erl Outdated
Contributor


I'd prefer these to be named maybe_compress and maybe_decompress to more clearly state intent.
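For the write path that could look roughly like this (a sketch only; should_compress/2, the Opts handling, and the assumption that snappy:compress/1 returns {ok, Binary} are illustrative, not the actual diff):

    %% Sketch: compress on write when the (illustrative) policy check says so,
    %% otherwise store the value unchanged.
    maybe_compress(Value, Opts) ->
        case should_compress(Value, Opts) of
            true ->
                {ok, Compressed} = snappy:compress(Value),
                Compressed;
            false ->
                Value
        end.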

@evanmcc
Contributor

evanmcc commented Sep 23, 2013

This seems like a reasonable feature to add, given the new snappy nif. What kind of performance testing would you suggest for further review? Is there going to be a corresponding riak_kv branch to enable it?

src/bitcask.erl Outdated
Contributor


You may want to change false here to _, as the function can also return unknown, which happens if some feature of the data confuses snappy enough to throw an error.
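In other words, something like this on the read path (a sketch only; is_snappy_compressed/1 stands in for whatever validity check the code uses and is assumed to return true, false, or unknown):

    %% Sketch: only decompress values known to be valid snappy buffers;
    %% the catch-all clause covers both false and unknown.
    maybe_decompress(Value) ->
        case is_snappy_compressed(Value) of      %% illustrative validity check
            true ->
                {ok, Plain} = snappy:decompress(Value),
                Plain;
            _ ->
                Value
        end.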

@cdahlqvist
Author

I have updated according to review comments, but have kept the threshold parameter instead of the suggested size limitation.

Small objects are less likely to benefit much from compression and are therefore more likely not to be stored in compressed form. I have updated the default threshold value to 1.0 instead of the rather arbitrary 0.8. This means that any object whose size is reduced by compression will be stored in compressed form.

There are now three configuration parameters available that should be specified in app.config under the bitcask section (an example entry follows the list below). These are:

  • enable_compression - Enables or disables compression for the backend. Specified as a boolean value; defaults to false.
  • compression_ratio_threshold - The compression ratio used to determine whether an object is stored in compressed form. Specified as a float; the default value is 1.0. A value of 0.8 means that an object will be stored compressed only if its compressed size is 80% of the original object's size or less.
  • compression_size_threshold - Objects smaller than this size threshold, specified in bytes, will never be compressed. The default value is 1024 bytes. Note that this size refers to the serialised form of the Riak object, not the value associated with it.
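For example, an app.config entry with these settings might look like this (the data_root path and the values shown are purely illustrative):

    {bitcask, [
        {data_root, "/var/lib/riak/bitcask"},
        {enable_compression, true},
        {compression_ratio_threshold, 0.8},
        {compression_size_threshold, 1024}
    ]}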

@slfritchie
Contributor

@cdahlqvist and @evanmcc -- Any updates on this ticket?

@ghost ghost assigned evanmcc Dec 13, 2013
@gburd

gburd commented Jan 24, 2014

What is the impact on performance (on reads and writes)? Do we see longer latency times for compressed objects? What happens to the flat latency of Bitcask when compression is enabled? Do we track metrics on compression ratio, the number of objects compressed/not compressed per cask, the number attempted for compression but found not to meet the threshold, the number ignored due to size, etc.? What is the real saving in file size when this is enabled?

With LevelDB the compression is applied across blocks, each of which may contain many values, which can lead to much higher compression rates. This code compresses individual objects, so it won't be as effective as LevelDB. Can we determine the compression ratio for the same data set in LevelDB vs. Bitcask?

@evanmcc
Contributor

evanmcc commented Jan 25, 2014

Most of the reason this has slipped is that I haven't had the time to determine any of those things. This PR just adds the capability of compression; it doesn't add any tracking. All of those changes would require infrastructure tweaks up and down the stack, both to retain the metrics and to report them.

@gburd

gburd commented Jan 25, 2014

I think we need to do more testing of this feature to see if it's actually going to add value or is just something that is perceived to add value. If we do add those stats, we should consider wiring LevelDB such that it produces similar information.

@cdahlqvist
Author

I am starting to think this might actually be the wrong way to go, as it requires checks and processing on both reads and writes.

It might be a better solution to enable the Search extractors to consider content-encoding as well as content-type and decompress the data before indexing it. This would store the data in compressed form and only require extra work on writes. For large documents the compression ratio would probably also be better than what can be achieved with snappy.

The drawback is naturally that this would mean additional processing for the Search extractor when indexing, but only objects with the appropriate content-encoding would be affected, leaving this under the application's control. It could therefore reduce overhead for read-intensive applications and would benefit all backends.
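As a rough illustration of the idea (a hypothetical pre-processing helper, not an existing Riak Search or Yokozuna API; it only assumes the standard zlib module):

    %% Hypothetical step run before extraction: gunzip values whose
    %% content-encoding marks them as gzip-compressed, index others as-is.
    maybe_decode_for_indexing(Value, ContentEncoding) ->
        case ContentEncoding of
            <<"gzip">> -> zlib:gunzip(Value);
            _          -> Value
        end.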

@gburd

gburd commented Jan 27, 2014

Your comments about the search extractors lead me to believe that this feature was requested by someone using Bitcask with Riak 2.0's Search. Is that the case? What was the problem they were trying to address that led you to this prototype implementation of compressing objects stored by Bitcask?

@evanmcc evanmcc modified the milestones: 2.1, 2.0 Mar 19, 2014
@evanmcc evanmcc removed their assignment Jun 26, 2014
@slfritchie
Contributor

Closing due to old age and lack of followup.

@slfritchie slfritchie closed this Nov 20, 2014
@seancribbs seancribbs deleted the cd-snappy branch April 1, 2015 22:36