Is it possible to use Galaxy to sample a subset from a fastq file? For example, to get 10 million reads from a 50 million read file.
Hello,
There are a few options. These are the top two for fastq input:
1) To select lines from a dataset (top or bottom), see the tools "select first/select last" in the group Text Manipulation. Fastq data will be accepted as a tabular input datatype, just make sure to select lines in multiples of "4". Fastq format has four lines for each read, so "10 million reads" would be "40 million lines".
2) To randomly select lines (fastq entries), try these tools, in this order. If they produce the output you want, the tools could be placed into a workflow for later reuse, in effect creating your own custom tool.
- Convert Fastq to Tabular
- Select random lines - or optionally, some functions of the tool Datamash
- Convert Tabular to Fastq
You may need to reassign the datatype (fastq/fastqsanger) on the output after either method.
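For anyone who wants to check the logic outside Galaxy, here is a rough Python sketch of both options. It is not what the Galaxy tools run internally. It assumes plain four-line fastq records (no wrapped sequence lines), the function names are made up for illustration, and reservoir sampling is just one reasonable way to pick random reads in a single pass.

```python
import random
from itertools import islice

LINES_PER_READ = 4  # fastq stores each read as four lines


def read_records(lines):
    """Group an iterable of lines into four-line fastq records.

    Records are counted by position, not by a leading '@', because a
    quality line can also start with '@'.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, LINES_PER_READ))
        if not record:
            return
        if len(record) != LINES_PER_READ:
            raise ValueError("line count is not a multiple of 4")
        yield record


def first_reads(lines, n_reads):
    """Option 1: keep the first n_reads reads, i.e. n_reads * 4 lines."""
    return list(islice(lines, n_reads * LINES_PER_READ))


def random_reads(lines, n_reads, seed=None):
    """Option 2: pick n_reads whole records at random (reservoir sampling).

    Only the selected records are held in memory. The output order is
    not guaranteed to match the input order.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_records(lines)):
        if i < n_reads:
            reservoir.append(record)
        else:
            # Replace a kept record with probability n_reads / (i + 1).
            j = rng.randint(0, i)
            if j < n_reads:
                reservoir[j] = record
    return [line for record in reservoir for line in record]
```

With a local file (the file name here is only a placeholder), `random_reads(open("reads.fastq"), 10_000_000, seed=42)` would return the lines of 10 million randomly chosen reads, and `first_reads(open("reads.fastq"), 10_000_000)` would return the first 40 million lines.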
There are other tools to sample from BAM and VCF datasets (not Fastq directly). These could be an option if the data is already mapped or is further along in downstream analysis. Search with the term "sample" in the tool panel at http://usegalaxy.org to review the choices.
Thanks, Jen, Galaxy team