I have got TPM data from running Salmon on RNA-seq data. However, this is TPM for each individual transcript (often multiple different ones per gene). I want to collapse multiple transcripts to single genes before running DESeq2. Is there a way to do this in Galaxy? Preferably, a simple way for someone new to this.
Hello,
Salmon can output both transcript and gene level TPM counts. You will need to provide a file of gene-to-transcript mapping. This is the last option on the Salmon tool form and the label starts with "File containing a mapping of transcripts to genes."
The transcript-to-gene mapping is a tabular dataset or a GTF dataset (with gene_id and transcript_id populated) and can also be used with DESeq2. It is a required input for DESeq2 when using TPM counts as input instead of counts from featureCounts or htseq-count.
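To illustrate what "collapsing" transcripts to genes means, here is a minimal Python sketch that sums transcript-level TPM values per gene using a two-column mapping. It is only a conceptual illustration, not the code Salmon or Galaxy runs; the function name `collapse_to_genes` and all identifiers in the example are made up.

```python
from collections import defaultdict


def collapse_to_genes(transcript_tpm, tx2gene):
    """Sum transcript-level TPM values into gene-level totals.

    transcript_tpm: dict of transcript id -> TPM (hypothetical input shape)
    tx2gene:        dict of transcript id -> gene id
    Transcripts missing from the mapping are skipped in this sketch.
    """
    gene_tpm = defaultdict(float)
    for tx, tpm in transcript_tpm.items():
        gene = tx2gene.get(tx)
        if gene is None:
            continue  # no gene assigned; an assumption of this sketch
        gene_tpm[gene] += tpm
    return dict(gene_tpm)


# Made-up example values, not real data.
tpm = {"tx0001": 1.5, "tx0002": 2.5, "tx0003": 4.0, "tx0004": 8.0}
mapping = {"tx0001": "GeneA", "tx0002": "GeneA", "tx0003": "GeneB"}
gene_level = collapse_to_genes(tpm, mapping)
```

In practice the Salmon tool does this for you once the mapping file is supplied, so the sketch is only meant to show why each transcript needs exactly one gene in that file.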
Tutorials: https://galaxyproject.org/learn/
Hope that helps! Jen, Galaxy team
Thank you.
Seems I had a couple of problems there. First, I was using a gene-to-transcript mapping file with too much information (too many columns). I trimmed this down to two columns: the transcript id (#mm10.knownGene.name) and the official gene symbol (mm10.kgXref.geneSymbol), and that partly worked. The additional problem is that a handful of official gene symbols get spread across 2 or more columns (spaces, commas, etc.?). This put some of the rows out of register. Cutting only columns 1 and 2 to a new file worked, and it seems to be fine now. Maybe I didn't start with the best-formatted gene-to-transcript mapping file.
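For anyone doing the same trimming outside Galaxy, here is a rough Python sketch of the idea: split each row strictly on tabs, so spaces and commas inside a field cannot shift the columns, then keep only the first two fields. The function name `trim_to_two_columns` and the example rows are hypothetical; inside Galaxy the Cut tool does the equivalent.

```python
import csv
import io


def trim_to_two_columns(text, tx_col=0, gene_col=1):
    """Return (transcript, gene) pairs from a tab-delimited table.

    Rows are split on tabs only, with quoting disabled, so commas,
    spaces, and quote characters stay inside their original field.
    Header/comment lines starting with '#' and short rows are skipped.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        if len(row) <= max(tx_col, gene_col):
            continue  # row lacks one of the two wanted columns
        pairs.append((row[tx_col].strip(), row[gene_col].strip()))
    return pairs


# Made-up rows for illustration; the third column stands in for the
# extra annotation fields that were trimmed away.
example = (
    "#mm10.knownGene.name\tmm10.kgXref.geneSymbol\tdescription\n"
    "tx0001\tGeneA\tsome description, with commas and spaces\textra\n"
    "tx0002\tGeneB\n"
    "tx0003\n"
)
pairs = trim_to_two_columns(example)
```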
All of this sounds like the correct way to troubleshoot. Input format can really make a difference in how content is interpreted by tools (whether used in Galaxy or elsewhere).
One-to-many transcript-to-gene mapping is present in the UCSC "Known Genes" track when combined with many of the related Xref tables (by design). If you want to try a simpler 1-1 transcript-to-gene mapping instead (and can utilize those identifiers), the UCSC track "RefSeq Genes" is another option. All the data is in the primary table, with the RefSeq gene name in the column "name2". To format, extract the entire table from the Table Browser into Galaxy, then use Cut to isolate just the "transcript -tab- gene" data. Or use the Table Browser's option to output "selected columns from the primary and related tables" and pick only the transcript and gene columns for extraction to Galaxy.
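As a rough illustration of the "transcript -tab- gene" extraction, the Python sketch below picks the two columns by header name instead of by position. It assumes the exported table has a tab-delimited header line starting with "#" and that the transcript identifier is in the column "name" (with the gene in "name2", as above). The function name `refseq_tx2gene`, the abbreviated header, and the rows are made up for the example.

```python
def refseq_tx2gene(table_text, tx_field="name", gene_field="name2"):
    """Extract unique (transcript, gene) pairs from a tab-delimited table.

    The first non-empty line is treated as the header; a leading '#'
    is stripped before looking up the two column names.
    """
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].lstrip("#").split("\t")
    tx_i = header.index(tx_field)
    gene_i = header.index(gene_field)

    pairs, seen = [], set()
    for ln in lines[1:]:
        fields = ln.split("\t")
        pair = (fields[tx_i], fields[gene_i])
        if pair not in seen:  # drop exact duplicate rows
            seen.add(pair)
            pairs.append(pair)
    return pairs


# Abbreviated, made-up table: the real export has more columns.
table = (
    "#bin\tname\tchrom\tstrand\tname2\n"
    "1\tTX_A\tchr1\t+\tGENE_A\n"
    "1\tTX_B\tchr1\t+\tGENE_A\n"
    "2\tTX_A\tchr1\t+\tGENE_A\n"
)
tx2gene = refseq_tx2gene(table)
# Write one "transcript<TAB>gene" line per pair.
tx2gene_text = "\n".join(f"{tx}\t{gene}" for tx, gene in tx2gene)
```

Selecting by column name keeps the result stable if the table is exported with a different set or order of columns.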
Glad this worked out!