Improvements to support Hadoop-BAM/Squark #1112

This issue describes improvements, changes, and additions to htsjdk that would better support Hadoop-BAM and its proposed new incarnation, codenamed [Squark](https://github.com/tomwhite/squark).

Individual issues or PRs can be created later to implement the changes as needed.

Small changes
- Code to write a BAM header to a stream. This currently resides in htsjdk's BAMFileWriter#writeHeader, which is protected (and the class is package-private). It is needed by [BamSink](https://github.com/tomwhite/squark/blob/htsjdk-comments/src/main/java/com/tom_e_white/squark/impl/formats/bam/BamSink.java#L74) in Squark.
- More block-compressed file pointer utilities. BlockCompressedFilePointerUtil#makeFilePointer should be public. It would also be useful to have an overload that takes no offset (i.e. the offset is 0). Squark has [BgzfVirtualFilePointerUtil](https://github.com/tomwhite/squark/blob/htsjdk-comments/src/main/java/com/tom_e_white/squark/impl/formats/bgzf/BgzfVirtualFilePointerUtil.java) for this.
- Expose CRAMIntervalIterator. Currently private, it's needed in [Squark](https://github.com/tomwhite/squark/blob/htsjdk-comments/src/main/java/com/tom_e_white/squark/impl/formats/cram/CramSource.java#L158) to get reads overlapping intervals, much like BAMFileReader#createIndexIterator.
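
For reference, the file pointer arithmetic requested above is small. The following is a minimal sketch, not htsjdk's or Squark's code: the class name `VirtualFilePointers` and the accessor names are hypothetical, and it relies only on the BGZF convention that a virtual file pointer packs the compressed block address into the upper 48 bits and the offset within the uncompressed block into the lower 16 bits.

```java
/** Illustrative only: BGZF virtual file pointer arithmetic with hypothetical names. */
final class VirtualFilePointers {
    private static final int SHIFT = 16;                        // low 16 bits hold the in-block offset
    private static final long OFFSET_MASK = 0xFFFFL;
    private static final long MAX_BLOCK_ADDRESS = (1L << 48) - 1;

    private VirtualFilePointers() {}

    /** Packs a compressed block address and an offset into that block's uncompressed data. */
    static long makeFilePointer(long blockAddress, int blockOffset) {
        if (blockAddress < 0 || blockAddress > MAX_BLOCK_ADDRESS) {
            throw new IllegalArgumentException("Block address out of range: " + blockAddress);
        }
        if (blockOffset < 0 || blockOffset > OFFSET_MASK) {
            throw new IllegalArgumentException("Block offset out of range: " + blockOffset);
        }
        return (blockAddress << SHIFT) | blockOffset;
    }

    /** The overload suggested in this issue: a pointer to the start of a block (offset 0). */
    static long makeFilePointer(long blockAddress) {
        return makeFilePointer(blockAddress, 0);
    }

    /** Extracts the compressed block address from a virtual file pointer. */
    static long getBlockAddress(long virtualFilePointer) {
        return virtualFilePointer >>> SHIFT;
    }

    /** Extracts the offset within the uncompressed block from a virtual file pointer. */
    static int getBlockOffset(long virtualFilePointer) {
        return (int) (virtualFilePointer & OFFSET_MASK);
    }
}
```

With this sketch, `VirtualFilePointers.makeFilePointer(1, 2)` yields `65538`, and the single-argument overload is equivalent to passing an offset of 0.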

Improvements
- Seeks within SeekableBufferedStream's buffer should not create a new buffer. See Squark's [ExtSeekableBufferedStream](https://github.com/tomwhite/squark/blob/htsjdk-comments/src/main/java/htsjdk/samtools/ExtSeekableBufferedStream.java) for the changes.
- An optimized version of CramContainerIterator that only reads the header for each container. See Squark's [CramContainerHeaderIterator](https://github.com/tomwhite/squark/blob/htsjdk-comments/src/main/java/htsjdk/samtools/CramContainerHeaderIterator.java).
- A way to read a VCFHeader from a stream without knowing if the file is VCF or BCF, or compressed or not. Implementation in Hadoop-BAM's [VCFHeaderReader](https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/util/VCFHeaderReader.java).
- A way to use ReferenceSequenceFileFactory with streams. It should be possible to open a reference sequence by passing an input stream for a FASTA file (and one for its index). This would allow reading from Hadoop filesystems without having to use the NIO file library. (One of the goals of Squark is to make it a user-controllable option whether to use NIO or Hadoop filesystems.)
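
To illustrate the VCF header item above, here is a rough sketch of how a stream could be classified without knowing its type in advance. It is not the Hadoop-BAM VCFHeaderReader implementation; the class `VariantFormatSniffer` and its method names are hypothetical. It assumes that gzip/BGZF data starts with the bytes 0x1f 0x8b, that BCF2 data starts with the ASCII bytes `BCF`, and that VCF text starts with `##`.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Illustrative only: classify a variant stream by peeking at its leading bytes. */
final class VariantFormatSniffer {
    enum Format { VCF, BCF, UNKNOWN }

    private static final int[] GZIP_MAGIC = {0x1f, 0x8b};
    private static final int[] BCF_MAGIC = {'B', 'C', 'F'};
    private static final int[] VCF_MAGIC = {'#', '#'};

    private VariantFormatSniffer() {}

    /** Detects the format, transparently unwrapping gzip/BGZF compression if present. */
    static Format sniff(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        if (startsWith(in, GZIP_MAGIC)) {
            // BGZF blocks are valid gzip members, so a plain gzip reader is enough for peeking.
            in = new BufferedInputStream(new GZIPInputStream(in));
        }
        if (startsWith(in, BCF_MAGIC)) {
            return Format.BCF;
        }
        if (startsWith(in, VCF_MAGIC)) {
            return Format.VCF;
        }
        return Format.UNKNOWN;
    }

    /** Peeks at the leading bytes using mark/reset, so nothing is consumed from the stream. */
    private static boolean startsWith(BufferedInputStream in, int[] magic) throws IOException {
        in.mark(magic.length);
        try {
            for (int expected : magic) {
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset();
        }
    }
}
```

A real implementation would presumably hand the wrapped stream on to the appropriate header codec rather than only returning an enum; that part is omitted here.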

New features
- [Splitting-bai](https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/SplittingBAMIndexer.java). Hadoop-BAM introduced a simple index format to locate read boundaries after arbitrary offsets in a file, which helps read BAMs in parallel. It would be beneficial to have the logic to read and write splitting-bai files in htsjdk, since they are useful for distributed processing in general.
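
As a rough illustration of how such an index can be used, the sketch below loads a list of virtual file pointers and looks up the first indexed read boundary at or after an arbitrary byte position. It assumes the index is simply a sequence of big-endian 64-bit virtual file pointers, which is my understanding of the splitting-bai layout and should be checked against SplittingBAMIndexer. The class `SplittingIndexSketch` and its method names are hypothetical and are not Hadoop-BAM's API.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.TreeSet;

/** Illustrative only: look up read boundaries from a list of virtual file pointers. */
final class SplittingIndexSketch {
    private final TreeSet<Long> virtualOffsets = new TreeSet<>();

    /** Reads big-endian longs until end of stream (assumed layout, see the note above). */
    SplittingIndexSketch(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        while (true) {
            try {
                virtualOffsets.add(data.readLong());
            } catch (EOFException e) {
                break; // end of index; a truncated trailing entry is ignored in this sketch
            }
        }
    }

    /**
     * Returns the first indexed virtual file pointer at or after the given compressed
     * file position, or null if the position lies beyond the last indexed entry.
     */
    Long firstBoundaryAtOrAfter(long filePosition) {
        // A raw file position corresponds to the virtual pointer (position, offset 0).
        return virtualOffsets.ceiling(filePosition << 16);
    }

    /** Number of entries loaded from the index. */
    int size() {
        return virtualOffsets.size();
    }
}
```

A split starting at an arbitrary byte offset could call `firstBoundaryAtOrAfter(splitStart)` to find where to begin decoding records, which is the property that makes parallel reads possible.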

