Disk buffering api implementation #2183
Conversation
🔧 The result from spotlessApply was committed to the PR branch.
It's ready for review. Please take a look, @zeitlinger.
It doesn't look like this is just adding a new implementation for an interface. Can you create a PR that just moves things around? That would make it easier to review.
It depends. I was actually trying to do it in 2 parts, but I'm reusing a lot of logic from the old API, while at the same time I had to change internal tools to adapt them to the new one. So if I don't delete the old API in this PR, the project won't compile, because of the changes to the old internal tools that are needed to work with the new implementation. What parts of the current PR are you referring to moving around?
An option would be to stack PRs on top of each other.
Yes, but that still keeps the issue I mentioned, which is that the project won't compile because the old API will no longer be compatible with the internal tools, which have to be slightly changed to better adapt to the new API. @zeitlinger I know that seeing
If it helps, the real file changes to implement this API are the following:
Those are the important changes. Let me know if that helps make the review smoother; if not, I guess I can create an intermediate state in which both APIs can compile, which might require duplicating internal classes such as Storage and ReadableFile to make them work with the new API without touching the old ones, so that the old API compiles for the first PR. It seemed like a lot of work that wasn't needed, as the old API will get removed anyway, but if you think it's still needed after what I explained above, let me know.
I don't think this new Storage API is as flexible as you suggest. It locks the re-serialization into place and forces extra inefficiencies. It would be better if Storage
dealt with raw byte arrays or IO streams.
    throws IOException {
  return readAndProcess(processing, 1);
}

@Nullable
public ReadableResult<T> readNext(SignalDeserializer<T> deserializer) throws IOException {
I think it's a mistake to move the deserialization deeper into the storage as it prevents what I've done in #2190 to avoid unnecessary re-serialization. Perhaps if this class had an additional access method that returned an InputStream factory against the file (so it could be read twice, first to count the items and second to stream it to a network output stream).
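A minimal sketch of that "InputStream factory" idea, assuming a hypothetical `FileStreamFactory` class (the class and method names are not from the PR): each call to `open()` returns a fresh stream over the same file, so the data can be read once to count the items and again to stream the bytes out.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical sketch: every call to open() returns a fresh stream positioned
// at the start of the file, so the same data can be read twice (first to
// count items, then to stream them to a network output stream).
final class FileStreamFactory {
  private final File file;

  FileStreamFactory(File file) {
    this.file = file;
  }

  InputStream open() throws FileNotFoundException {
    return new FileInputStream(file);
  }
}
```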
I don't think this new Storage API is as flexible as you suggest. It locks the re-serialization into place and forces extra inefficiencies. It would be better if Storage dealt with raw byte arrays or IO streams.
I think there might be a misunderstanding of the new API, so I'll clarify it here. In a nutshell, Storage is not part of the API; it's an internal tool. This is the core of the new API: SignalStorage.

It provides a way to write and read a collection of signals, as it's also the way that upstream exporters receive data.
It locks the re-serialization into place and forces extra inefficiencies
The API isn't aware of its implementations, so it's not aware of the default implementation that uses serialization and deserialization. The reason why the API provides an iterator of a collection of signals as a way to read stored data is not to enforce anything around serialization, but simply because that's also what it receives in its write()
method.
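To make that shape concrete, here is a minimal sketch of the contract described above. The interface and implementation names are illustrative assumptions, not the module's actual code; the point is that read() yields batches in the same shape that write() accepts.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of a SignalStorage-style contract: writes take a
// collection of signals, and reads return an iterable of collections,
// mirroring what write() received. Names are illustrative only.
interface SignalStorageSketch<T> extends Closeable {
  void write(Collection<T> signals) throws IOException;

  // Batches come back in the same shape that write() accepted.
  Iterable<Collection<T>> read() throws IOException;
}

// Minimal in-memory implementation, just to illustrate the contract.
final class InMemorySignalStorage<T> implements SignalStorageSketch<T> {
  private final List<Collection<T>> batches = new ArrayList<>();

  @Override
  public void write(Collection<T> signals) {
    batches.add(new ArrayList<>(signals));
  }

  @Override
  public Iterable<Collection<T>> read() {
    return batches;
  }

  @Override
  public void close() {}
}
```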
I think it's a mistake to move the deserialization deeper into the storage as it prevents what I've done in #2190 to avoid unnecessary re-serialization.
Serialization is just an internal part of what the default implementation needs to do in order to receive and return the same kind of data it's given, and it was never the intention to make it publicly supported in this module, the same way it's not publicly supported in the core Java repo. So it shouldn't be a mistake to move it around, as users of the default impl shouldn't be aware of it.
As I mentioned to you a couple of days ago when you suggested this approach over DM, I wouldn't recommend that you work on top of the old API, as a lot of it is going to be changed or removed.
Perhaps if this class had an additional access method that returned an InputStream factory against the file (so it could be read twice, first to count the items and second to stream it to a network output stream).
Maybe in the future, not this class, because it's not part of the API, but we might be able to somehow extend SignalStorage
to allow consuming the data already serialized, so that it can be forwarded directly to a network exporter. This is definitely something that can help with performance for the specific use case where the storage serializes signals in the same way that they are going to be needed by the network exporter impl that will consume them. The downside of this approach, and the reason why deserializing it is more generic, is that it ties the storage impl to a specific data format and protocol for exporting. So if we wanted to do so in the future, it must be done in a way that custom implementations can provide their own serialized versions of the data, in case the format we use is not what they need, to enable them to work on their own impls.
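That future extension could look roughly like the following sketch. The interface and method names here are assumptions for illustration, not part of the module; the key property is that each call hands back a fresh stream, so the data can be read more than once.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of extending storage so consumers can pull the data in
// its already-serialized form and forward it straight to a network exporter.
// Custom implementations would supply their own serialized representation if
// the default wire format doesn't fit their needs.
interface SerializedSignalSupplier {
  // A fresh stream per call, so the data can be read more than once.
  InputStream openSerializedStream() throws IOException;
}

// Trivial implementation backed by a fixed byte array, for illustration.
final class FixedBytesSupplier implements SerializedSignalSupplier {
  private final byte[] serialized;

  FixedBytesSupplier(byte[] serialized) {
    this.serialized = serialized.clone();
  }

  @Override
  public ByteArrayInputStream openSerializedStream() {
    return new ByteArrayInputStream(serialized);
  }
}
```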
Alternative with the new API
Since the new API is just an interface, it shouldn't prevent you from creating your own implementation that adds the functionality you need. To make it work for your use case, you could wrap the default implementation in your custom implementation for the write() part, and then add a method that provides you with an InputStream from which you manually read the files to provide the serialized data. Alternatively, if you'd like to reuse disk buffering's internal Storage class, which is the one that manages the stored data, you can add a method to it that does something along the lines of what you mentioned here:
Perhaps if this class had an additional access method that returned an InputStream factory against the file (so it could be read twice, first to count the items and second to stream it to a network output stream).
And use it for your custom implementation. Just bear in mind that, since Storage is internal, there's no guarantee it will be maintained as-is forever (though I don't see why it should change, tbh).
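The wrapping approach described above could be sketched like this. All names here (WrappingStorage, Delegate, the serializer function) are hypothetical stand-ins for the default implementation and your own serializer, not the module's real types:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Collection;
import java.util.function.Function;

// Hypothetical sketch of the wrapping approach: writes are delegated to the
// default (disk-based) implementation, while a serialized copy is kept so it
// can be streamed directly to a network exporter. All names are illustrative.
final class WrappingStorage<T> {
  interface Delegate<T> {
    void write(Collection<T> signals) throws IOException;
  }

  private final Delegate<T> delegate;
  private final Function<Collection<T>, byte[]> serializer;
  private byte[] lastSerialized = new byte[0];

  WrappingStorage(Delegate<T> delegate, Function<Collection<T>, byte[]> serializer) {
    this.delegate = delegate;
    this.serializer = serializer;
  }

  void write(Collection<T> signals) throws IOException {
    delegate.write(signals); // the default impl still persists to disk
    lastSerialized = serializer.apply(signals); // keep a streamable copy
  }

  // A fresh stream per call, so the data can be read more than once.
  ByteArrayInputStream openSerializedStream() {
    return new ByteArrayInputStream(lastSerialized);
  }
}
```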
Creation of a default implementation for the new API that was previously defined here.
Note
It looks like there are a lot of changes, but 27 of those "files changed" are deleted files from the old implementation's types and their test types. Also, a couple of the file changes are just files moved around and/or minor changes.
New usage
The same old API behavior is kept for the new API's default implementation.
Writing
Setting up the signal storage objects, as well as their toDisk exporters, to be used in an OpenTelemetry instance.
Reading
More info in the README.md and DESIGN.md files.