Hello,
I used the Extract Genomic DNA tool in Galaxy, and it produced a FASTA file from my original genome FASTA and my annotation. However, in each entry of the multi-FASTA file, the sequence is duplicated.
For example...
>gene.1
TATATTTATTAATTTACGGGACTATATTTATTAATTTACGGGATC
Notice that "TATATTTATTAATTTACGGGATC" is duplicated within the single entry. This happens for all of the genes in the FASTA file. The sequence lengths do not match the coordinates: because the sequence is duplicated, each entry is twice the size the coordinates imply. For example, coordinates of 700 - 750 should give 50 nucleotides, but my FASTA file has 100.
I have run this tool in Galaxy twice and I still have this problem. I am sure that my annotation has the correct coordinates and that there is nothing wrong with the genome, as it works with all of the other tools I use. Thanks for the help!
If this is on your own local Galaxy with locally installed indexes, would you share a few lines of the input file that produce this kind of output? I'll test at http://usegalaxy.org to see if I can reproduce it. That will narrow down the problem space. Please also include the settings used on the tool form.
If you are already working at Galaxy Main (http://usegalaxy.org), a share link to the history with the data would be even better. You can post it here publicly if you do not mind everyone seeing your history, or generate the history share link and include it in an email to galaxy-bugs@lists.galaxyproject.org (private list). Note the dataset numbers for input/output, and please also include a link to this post so we can associate the two.
Thanks! Jen, Galaxy team
Here is the first gene that appears in the Galaxy-produced FASTA file (GTF annotation/genome FASTA file):
Fasta file is a typical version:
Here is what the Galaxy FASTA output looks like:
The number of bases given is 695 compared to the 340 that there are supposed to be.
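A quick way to list every affected entry is to compare each FASTA record's length with the length implied by its coordinates. Below is a minimal sketch in Python, not part of Galaxy itself. The function names (`read_fasta`, `expected_length`, `find_length_mismatches`) and the `coords` dictionary are hypothetical, and it assumes the FASTA headers match the keys of `coords` exactly and that coordinates are 1-based and inclusive, as in GTF (so 700 - 750 counts as 51 bases; pass `one_based_inclusive=False` for BED-style intervals).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def expected_length(start, end, one_based_inclusive=True):
    """Length implied by a coordinate pair (GTF-style by default)."""
    return end - start + 1 if one_based_inclusive else end - start


def find_length_mismatches(fasta_lines, coords, one_based_inclusive=True):
    """Return (header, expected, observed) for records whose length is off.

    coords is a hypothetical dict mapping a FASTA header to (start, end).
    Records without an entry in coords are skipped.
    """
    mismatches = []
    for header, seq in read_fasta(fasta_lines):
        if header not in coords:
            continue
        start, end = coords[header]
        expected = expected_length(start, end, one_based_inclusive)
        if len(seq) != expected:
            mismatches.append((header, expected, len(seq)))
    return mismatches


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the real files.
    fasta = [">gene.1", "ACGTACGTAC", ">gene.2", "ACGT", "ACGT"]
    coords = {"gene.1": (1, 5), "gene.2": (11, 18)}
    for header, expected, observed in find_length_mismatches(fasta, coords):
        print(f"{header}: expected {expected} bases, found {observed}")
```

With real data, `fasta_lines` can simply be an open file handle on the Galaxy output, and `coords` can be built from the annotation's start and end columns. An observed length that is roughly, but not exactly, double the expected one (as with 695 versus 340) is worth noting in the bug report, since it hints that two slightly different intervals were joined rather than one interval being copied twice.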