I would like to sort reads in a BAM file by read length. Which tool on the Galaxy website can I use?
Hello,
The bam datatype in Galaxy means something specific with respect to format and content: a coordinate-sorted BAM dataset. Other BAM datatypes are available, but most (if not all) tools restrict available inputs to datasets with the bam datatype assigned. This avoids many usage errors and tool failures due to unexpected sorting.
BAM datatypes are described in the 18.01 release notes; scroll down to the section named "New BAM datatypes": https://docs.galaxyproject.org/en/master/releases/18.01_announce.html
There isn't a tool to sort by read length. Coordinate-sorted BAM datasets are the expected input format for most tools. The few that require queryname-sorted BAMs have options on the tool form to queryname sort the data for processing (the original input is still expected to be coordinate sorted, with the datatype bam assigned). Assigning the bam datatype to data that is not coordinate sorted will result in an error or warning, and if just a warning, expect downstream tools to fail when using that input.
Options:
- BAM data can be sorted by either queryname or start coordinate with the tool: SortSam sort SAM/BAM dataset (Galaxy Version 2.18.2.1).
- BAM data can be filtered by read length with the tool: BAM filter Removes reads from a BAM file based on criteria (Galaxy Version 0.5.9)
- Generate a basic summary of read lengths with the tool: FastQC Read Quality reports (Galaxy Version 0.72). Note this is a sample of the first 200k sequences or so, not the complete dataset.
- If you want to do something else with the data, it can be converted to interval format and manipulated from there. The steps would involve a workflow such as: BAM-to-SAM > Convert SAM to interval > Compute (subtract start from end for read length) > Sort data in ascending or descending order on the new length column.
There are a few ways to get the data into a tabular format and manipulate it. SAM format is essentially a tabular format once the header is removed, and any of the tools that work directly with tabular input could be used (the Text Manipulation tool group includes most, but also see Datamash, Filter and Sort, and Join, Subtract and Group).
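If it helps to see the tabular idea outside of Galaxy, here is a minimal Python sketch. It is not a Galaxy tool: the function name sort_sam_by_read_length and the example records are made up for illustration. It assumes the data is already SAM text (for example, after BAM-to-SAM), drops the header lines, takes the read length from the SEQ column (column 10), and sorts on it. Note that this is the length of the stored read sequence, while end minus start in interval format is the span on the reference, so the two can differ for reads with indels or clipping.

```python
def sort_sam_by_read_length(sam_text, descending=False):
    """Return (length, line) pairs for SAM alignment lines, sorted by SEQ length."""
    records = []
    for line in sam_text.splitlines():
        # Header lines start with "@"; dropping them leaves plain tab-separated rows.
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        seq = fields[9]  # column 10 (SEQ) holds the read sequence
        # SEQ is "*" when the sequence is not stored; treat that as length 0.
        length = 0 if seq == "*" else len(seq)
        records.append((length, line))
    # list.sort is stable, so reads of equal length keep their original order.
    records.sort(key=lambda record: record[0], reverse=descending)
    return records


# Hypothetical three-read SAM snippet, used only to demonstrate the function.
example = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t0\tchr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t30\t60\t6M\t*\t0\t0\tACGTAC\tIIIIII",
])

for length, line in sort_sam_by_read_length(example):
    print(length, line.split("\t")[0])
```

The result is a length-sorted table, not a dataset that should be assigned the bam datatype, so treat it as tabular data for downstream steps.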
Thanks! Jen, Galaxy team