Improvements to support Hadoop-BAM/Squark #1112

@tomwhite

This issue describes improvements, changes and additions to htsjdk that would better support Hadoop-BAM and its proposed new incarnation, codenamed Squark.

Individual issues or PRs can be created later to implement the changes as needed.

Small changes

  • Code to write a BAM header to a stream. Currently resides in htsjdk's BAMFileWriter#writeHeader, which is protected (and the class is package private). Needed by BamSink in Squark.
  • More block compressed file pointer utils. BlockCompressedFilePointerUtil#makeFilePointer should be public. It would also be useful to have an overload that takes no offset (i.e. the offset is 0). Squark has BgzfVirtualFilePointerUtil for this.
  • Expose CRAMIntervalIterator. Currently private, it's needed in Squark to get reads overlapping intervals, much like BAMFileReader#createIndexIterator.
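
For readers unfamiliar with BGZF virtual file pointers, the following is a minimal sketch of what a public makeFilePointer and its zero-offset overload could look like. It assumes the standard BGZF layout (compressed block address in the upper 48 bits, offset within the uncompressed block in the lower 16 bits). The class name VirtualFilePointerSketch and its accessor names are hypothetical; this is not htsjdk's or Squark's actual code.

```java
/** Illustrative sketch only; class and method names are hypothetical. */
final class VirtualFilePointerSketch {
    private static final int SHIFT = 16;                           // low 16 bits hold the in-block offset
    private static final long MAX_BLOCK_ADDRESS = (1L << 48) - 1;  // block address must fit in 48 bits
    private static final int MAX_OFFSET = (1 << SHIFT) - 1;        // offset must fit in 16 bits

    private VirtualFilePointerSketch() {}

    /** Packs a compressed block address and an offset within that block into one long. */
    static long makeFilePointer(long blockAddress, int blockOffset) {
        if (blockAddress < 0 || blockAddress > MAX_BLOCK_ADDRESS) {
            throw new IllegalArgumentException("Block address out of range: " + blockAddress);
        }
        if (blockOffset < 0 || blockOffset > MAX_OFFSET) {
            throw new IllegalArgumentException("Block offset out of range: " + blockOffset);
        }
        return (blockAddress << SHIFT) | blockOffset;
    }

    /** Overload for a pointer to the start of a block (offset 0). */
    static long makeFilePointer(long blockAddress) {
        return makeFilePointer(blockAddress, 0);
    }

    /** Extracts the compressed block address (upper 48 bits). */
    static long getBlockAddress(long virtualFilePointer) {
        return virtualFilePointer >>> SHIFT;
    }

    /** Extracts the offset within the uncompressed block (lower 16 bits). */
    static int getBlockOffset(long virtualFilePointer) {
        return (int) (virtualFilePointer & MAX_OFFSET);
    }

    public static void main(String[] args) {
        long pointer = makeFilePointer(1234L, 56);
        System.out.println(getBlockAddress(pointer) + ":" + getBlockOffset(pointer));
    }
}
```

The one-argument overload simply delegates to the two-argument form, so the range checks live in one place.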

Improvements

  • Seeks within SeekableBufferedStream's buffer should not create a new buffer. See Squark's ExtSeekableBufferedStream for the changes.
  • An optimized version of CramContainerIterator that only reads the header for each container. See Squark's CramContainerHeaderIterator.
  • A way to read a VCFHeader from a stream without knowing if the file is VCF or BCF, or compressed or not. Implementation in Hadoop-BAM's VCFHeaderReader.
  • A way to use ReferenceSequenceFileFactory with streams. It should be possible to open a reference sequence by passing an input stream to a FASTA (and to its index). This would allow reading from Hadoop filesystems without having to use the file NIO library. (One of the goals of Squark is to make it a user-controllable option as to whether to use NIO or Hadoop filesystems.)
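
The first improvement above (seeking within the buffer without discarding it) can be illustrated with a small standalone sketch. It is not Squark's ExtSeekableBufferedStream and does not use htsjdk's SeekableStream API; the class InBufferSeekSketch, its nested Source interface, and the refills counter are all hypothetical names introduced for illustration. The idea shown is that seek only moves the logical position, and the buffer is refilled lazily, only when a read lands outside the buffered range.

```java
/** Illustrative sketch only; all names here are hypothetical. */
final class InBufferSeekSketch {
    /** Hypothetical minimal random-access source: fills dst from position, returns bytes read or -1 at EOF. */
    interface Source {
        int readAt(long position, byte[] dst);
    }

    private final Source source;
    private final byte[] buffer;
    private long bufferStart = 0;  // file offset corresponding to buffer[0]
    private int bufferLength = 0;  // number of valid bytes currently in the buffer
    private long position = 0;     // logical position of the next read
    private int refills = 0;       // how many times the buffer was (re)filled

    InBufferSeekSketch(Source source, int bufferSize) {
        this.source = source;
        this.buffer = new byte[bufferSize];
    }

    /** Moves the logical position; the existing buffer is kept, not reallocated. */
    void seek(long target) {
        if (target < 0) {
            throw new IllegalArgumentException("Negative position: " + target);
        }
        position = target;
    }

    /** Reads one byte, refilling the buffer only if the position is outside the buffered range. */
    int read() {
        if (position < bufferStart || position >= bufferStart + bufferLength) {
            int n = source.readAt(position, buffer);
            bufferLength = Math.max(0, n);
            bufferStart = position;
            refills++;
            if (bufferLength == 0) {
                return -1; // end of data
            }
        }
        return buffer[(int) (position++ - bufferStart)] & 0xFF;
    }

    int refills() {
        return refills;
    }

    /** In-memory source used for demonstration. */
    static Source ofBytes(byte[] data) {
        return (pos, dst) -> {
            if (pos >= data.length) {
                return -1;
            }
            int n = (int) Math.min(dst.length, data.length - pos);
            System.arraycopy(data, (int) pos, dst, 0, n);
            return n;
        };
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        InBufferSeekSketch stream = new InBufferSeekSketch(ofBytes(data), 16);
        stream.read();
        stream.seek(10); // still inside the first 16 buffered bytes
        stream.read();
        System.out.println("refills after in-buffer seek: " + stream.refills());
    }
}
```

A real implementation would also handle bulk reads and I/O errors; the sketch keeps only the part relevant to avoiding a new buffer on short seeks.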

New features

  • Splitting-bai. Hadoop-BAM introduced a simple index format to locate read boundaries after arbitrary offsets in a file, which helps read BAMs in parallel. It would be beneficial to have the logic to read and write splitting-bai files in htsjdk, since they are useful for distributed processing in general.
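
To make the idea of locating a read boundary after an arbitrary offset concrete, here is a rough sketch. It assumes, for illustration only, that the index is a flat, sorted sequence of big-endian 64-bit BGZF virtual offsets marking read starts; the exact splitting-bai layout should be checked against Hadoop-BAM before relying on this. The class SplittingIndexSketch and its methods are hypothetical, and virtual offsets are treated as non-negative signed longs.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Illustrative sketch only; the on-disk layout used here is an assumption. */
final class SplittingIndexSketch {
    private SplittingIndexSketch() {}

    /** Serializes sorted virtual offsets as consecutive big-endian longs (ByteBuffer's default order). */
    static byte[] write(long[] sortedVirtualOffsets) {
        ByteBuffer buf = ByteBuffer.allocate(sortedVirtualOffsets.length * Long.BYTES);
        for (long offset : sortedVirtualOffsets) {
            buf.putLong(offset);
        }
        return buf.array();
    }

    /** Parses the byte layout produced by write back into an array of virtual offsets. */
    static long[] read(byte[] bytes) {
        if (bytes.length % Long.BYTES != 0) {
            throw new IllegalArgumentException("Length is not a multiple of 8: " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] offsets = new long[bytes.length / Long.BYTES];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = buf.getLong();
        }
        return offsets;
    }

    /**
     * Returns the first indexed virtual offset at or after the given compressed
     * file offset (a split boundary), or -1 if no indexed read start follows it.
     */
    static long firstReadStartAtOrAfter(long[] sortedVirtualOffsets, long fileOffset) {
        long key = fileOffset << 16; // virtual offset of the start of a block at fileOffset
        int i = Arrays.binarySearch(sortedVirtualOffsets, key);
        if (i < 0) {
            i = -i - 1; // insertion point: first element greater than key
        }
        return i < sortedVirtualOffsets.length ? sortedVirtualOffsets[i] : -1L;
    }

    public static void main(String[] args) {
        long[] offsets = {100L, 40000L, (65000L << 16) | 12, 130000L << 16};
        long[] roundTripped = read(write(offsets));
        long start = firstReadStartAtOrAfter(roundTripped, 1L);
        System.out.println("block=" + (start >>> 16) + " offset=" + (start & 0xFFFF));
    }
}
```

A worker assigned the byte range starting at some split boundary would begin decoding at the returned virtual offset, which is what lets several workers read one BAM in parallel without scanning from the start.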
