This describes some of the improvements, changes and additions to htsjdk to better support Hadoop-BAM and its proposed new incarnation codenamed Squark.
Individual issues or PRs can be created later to implement the changes as needed.
Small changes
- Code to write a BAM header to a stream. Currently resides in htsjdk's BAMFileWriter#writeHeader, which is protected (and the class is package private). Needed by BamSink in Squark.
- More block compressed file pointer utils. BlockCompressedFilePointerUtil#makeFilePointer should be public. It would also be useful to have an overload that takes only a block address and uses an offset of 0. Squark has BgzfVirtualFilePointerUtil.
- Expose CRAMIntervalIterator. Currently private, it's needed in Squark to get reads overlapping intervals, much like BAMFileReader#createIndexIterator.
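
For reference, a minimal sketch of the BGZF virtual file pointer layout that the makeFilePointer request refers to: the compressed block address goes in the upper 48 bits and the offset within the uncompressed block in the lower 16 bits. The class name `VirtualFilePointers` and its exact method shapes are assumptions made for illustration; they are not the htsjdk or Squark API.

```java
// Hypothetical helper, NOT htsjdk API: illustrates the BGZF virtual file pointer layout.
final class VirtualFilePointers {
    private static final int OFFSET_BITS = 16;
    private static final long OFFSET_MASK = 0xFFFFL;
    private static final long MAX_BLOCK_ADDRESS = (1L << 48) - 1;

    private VirtualFilePointers() {}

    // Packs a compressed block address and an offset within the uncompressed block.
    static long makeFilePointer(long blockAddress, int blockOffset) {
        if (blockAddress < 0 || blockAddress > MAX_BLOCK_ADDRESS) {
            throw new IllegalArgumentException("Block address out of range: " + blockAddress);
        }
        if (blockOffset < 0 || blockOffset > OFFSET_MASK) {
            throw new IllegalArgumentException("Block offset out of range: " + blockOffset);
        }
        return (blockAddress << OFFSET_BITS) | blockOffset;
    }

    // The requested convenience overload: points at the start of a block (offset 0).
    static long makeFilePointer(long blockAddress) {
        return makeFilePointer(blockAddress, 0);
    }

    // Recovers the compressed block address from a virtual file pointer.
    static long getBlockAddress(long virtualFilePointer) {
        return virtualFilePointer >>> OFFSET_BITS;
    }

    // Recovers the offset within the uncompressed block.
    static int getBlockOffset(long virtualFilePointer) {
        return (int) (virtualFilePointer & OFFSET_MASK);
    }
}
```

The single-argument overload is the case that comes up when a caller only knows where a block starts, for example when scanning for block boundaries.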
Improvements
- Seeks within SeekableBufferedStream's buffer should not create a new buffer. See Squark's ExtSeekableBufferedStream for the changes.
- An optimized version of CramContainerIterator that only reads the header for each container. See Squark's CramContainerHeaderIterator.
- A way to read a VCFHeader from a stream without knowing if the file is VCF or BCF, or compressed or not. Implementation in Hadoop-BAM's VCFHeaderReader.
- A way to use ReferenceSequenceFileFactory with streams. It should be possible to open a reference sequence by passing an input stream for a FASTA file (and another for its index). This would allow reading from Hadoop filesystems without having to use the NIO file library. (One of the goals of Squark is to let the user choose whether to use NIO or Hadoop filesystems.)
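
To illustrate the VCF/BCF item above, here is a rough sketch of how a stream could be sniffed without knowing its format in advance. It is not the Hadoop-BAM VCFHeaderReader implementation; the class `SniffedVariantStream` and its fields are hypothetical. It assumes that gzip/BGZF data starts with the bytes 0x1f 0x8b, that decompressed BCF data starts with the ASCII bytes "BCF", and, as a simplification, that anything else is VCF text.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch, NOT htsjdk or Hadoop-BAM API.
final class SniffedVariantStream {
    enum Format { VCF, BCF }

    final Format format;
    final boolean compressed;
    final InputStream stream; // decompressed data, positioned at its first byte

    private SniffedVariantStream(Format format, boolean compressed, InputStream stream) {
        this.format = format;
        this.compressed = compressed;
        this.stream = stream;
    }

    static SniffedVariantStream open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);

        // Step 1: gzip (and therefore BGZF) data starts with 0x1f 0x8b.
        byte[] head = peek(in, 2);
        boolean compressed = head.length == 2
                && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b;
        if (compressed) {
            in = new BufferedInputStream(new GZIPInputStream(in));
        }

        // Step 2: BCF starts with "BCF"; everything else is treated as VCF (simplification).
        byte[] magic = peek(in, 3);
        boolean bcf = magic.length == 3
                && magic[0] == 'B' && magic[1] == 'C' && magic[2] == 'F';
        return new SniffedVariantStream(bcf ? Format.BCF : Format.VCF, compressed, in);
    }

    // Reads up to n bytes and rewinds, so the caller still sees the stream from its start.
    private static byte[] peek(BufferedInputStream in, int n) throws IOException {
        in.mark(n);
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }
}
```

A real header reader would go on to hand `stream` to the matching VCF or BCF header codec; that step is omitted here.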
New features
- Splitting-bai. Hadoop-BAM introduced a simple index format to locate read boundaries after arbitrary offsets in a file, which helps read BAMs in parallel. It would be beneficial to have the logic to read and write splitting-bai files in htsjdk, since they are useful for distributed processing in general.
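
As a rough illustration of the lookup such an index enables, the sketch below finds the first known read boundary at or after an arbitrary byte offset. It assumes the index has already been loaded into memory as a sorted array of BGZF virtual file pointers (block address in the upper 48 bits, in-block offset in the lower 16); the on-disk splitting-bai layout is not described in this issue and is not modeled here. The class and method names are hypothetical.

```java
import java.util.OptionalLong;

// Hypothetical sketch, NOT the Hadoop-BAM splitting-bai implementation.
final class SplitBoundaryLookup {
    private SplitBoundaryLookup() {}

    // Returns the first indexed virtual file pointer whose position is at or after
    // the given compressed-file byte offset, or empty if there is none.
    // ASSUMPTION: sortedVirtualOffsets is sorted in ascending (unsigned) order.
    static OptionalLong firstReadAtOrAfter(long[] sortedVirtualOffsets, long fileOffset) {
        // A byte offset in the compressed file corresponds to in-block offset 0.
        long target = fileOffset << 16;

        // Binary search for the lowest index whose entry is >= target.
        int lo = 0;
        int hi = sortedVirtualOffsets.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (Long.compareUnsigned(sortedVirtualOffsets[mid], target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo < sortedVirtualOffsets.length
                ? OptionalLong.of(sortedVirtualOffsets[lo])
                : OptionalLong.empty();
    }
}
```

A worker assigned the byte range [start, end) of a BAM file could call this with `start` to find where to begin decoding and with `end` to find where to stop, so that splits chosen at arbitrary offsets still line up with whole reads.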