Converting transcript data from Salmon to gene level TPM

10 months ago by

dw2p • 0 wrote:

I have got TPM data from running Salmon on RNA-seq data. However, this is TPM for each individual transcript (often multiple different ones per gene). I want to collapse multiple transcripts to single genes before running DESeq2. Is there a way to do this in Galaxy? Preferably, a simple way for someone new to this.

tpm salmon galaxy deseq2 rna-seq • 1.4k views

10 months ago by

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Salmon can output both transcript-level and gene-level TPM values. To get the gene-level output, you will need to provide a transcript-to-gene mapping file. This is the last option on the Salmon tool form, and its label starts with "File containing a mapping of transcripts to genes."

The transcript-to-gene mapping is a tabular dataset or a GTF dataset (with gene_id and transcript_id populated), and it can also be used with DESeq2. It is a required input for DESeq2 when using TPM values as input instead of counts from featureCounts or htseq-count.
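For anyone curious what the collapse amounts to outside Galaxy, here is a minimal Python sketch. It is not Salmon's own code. It assumes the transcript TPMs and the transcript-to-gene mapping are already loaded into dictionaries, and the function name `collapse_to_gene_tpm` and the identifiers in the example are made up for illustration. Gene-level TPM is taken as the sum of the TPMs of the gene's transcripts.

```python
from collections import defaultdict


def collapse_to_gene_tpm(transcript_tpm, tx2gene):
    """Sum transcript-level TPM values per gene.

    transcript_tpm: dict of transcript id -> TPM
    tx2gene:        dict of transcript id -> gene id
    Transcripts missing from tx2gene are kept under their own id
    (a choice made for this sketch).
    """
    gene_tpm = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        gene = tx2gene.get(transcript, transcript)
        gene_tpm[gene] += tpm
    return dict(gene_tpm)


# Hypothetical example values, not real data.
tpm = {"tx1": 10.0, "tx2": 5.0, "tx3": 2.5}
mapping = {"tx1": "GeneA", "tx2": "GeneA", "tx3": "GeneB"}
gene_level = collapse_to_gene_tpm(tpm, mapping)
```

Here `gene_level` holds one summed value per gene, with `tx1` and `tx2` combined under `GeneA`.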

Tutorials: https://galaxyproject.org/learn/

https://galaxyproject.org/tutorials/rb_rnaseq/#transcript-quantification

Hope that helps! Jen, Galaxy team

10 months ago by

dw2p • 0 wrote:

Thank you.

Seems I had a couple of problems there. First off, I was using a gene-to-transcript mapping file with too much information (too many columns). I trimmed this down to two columns, the transcript ID (#mm10.knownGene.name) and the official gene symbol (mm10.kgXref.geneSymbol), and that partly worked. The additional problem was that a handful of official gene symbols get spread across two or more columns (spaces, commas, etc.?), which put some of the rows out of register. Cutting only columns 1 and 2 to a new file worked, and it seems to be fine now. Maybe I didn't start with the best-formatted gene-to-transcript mapping file.
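The register problem described above can be checked for before the mapping goes into Salmon. A minimal Python sketch, assuming a tab-separated mapping whose first line is a header; the function name `trim_mapping` and the example rows are hypothetical. It keeps columns 1 and 2 and reports the line numbers of rows whose column count differs from the header's.

```python
def trim_mapping(lines):
    """Keep columns 1 and 2 of a tab-separated mapping.

    Returns (pairs, suspect_rows), where suspect_rows lists the 1-based
    line numbers whose field count differs from the first line's.
    """
    pairs = []
    suspect_rows = []
    expected = None
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if fields == [""]:
            continue  # skip blank lines
        if expected is None:
            expected = len(fields)  # the first line sets the expected width
        if len(fields) != expected or len(fields) < 2:
            suspect_rows.append(number)  # likely out of register
            continue
        if fields[0].startswith("#"):
            continue  # header line, not data
        pairs.append((fields[0], fields[1]))
    return pairs, suspect_rows


# Hypothetical rows; the third column and all values are made up.
example = [
    "#mm10.knownGene.name\tmm10.kgXref.geneSymbol\textra_column\n",
    "tx1\tGeneA\tsome text\n",
    "tx2\tGeneB\tsplit\tfield\n",
    "tx3\tGeneA\tmore text\n",
]
pairs, suspect_rows = trim_mapping(example)
```

In this example the row for `tx2` has one field too many, so it is reported instead of being written to the trimmed mapping.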

All of this sounds like the correct way to troubleshoot. Input format can really make a difference in how content is interpreted by tools (whether used in Galaxy or elsewhere).

One-to-many transcript-to-gene mapping is present in the UCSC "Known Genes" track when it is combined with many of the related Xref tables (by design). If you want to try a simpler 1-1 transcript-to-gene mapping instead (and can use those identifiers), the UCSC track "RefSeq Genes" is another option. All of the data is in the primary table, with the RefSeq gene name in the column "name2". To format it, extract the entire table from the Table Browser into Galaxy, then use Cut to isolate just the "transcript -tab- gene" data. Alternatively, use the Table Browser's option to output "selected columns from the primary and related tables" and pick only the transcript and gene columns for extraction to Galaxy.
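Outside Galaxy, the same "transcript -tab- gene" extraction can be sketched in a few lines of Python. This assumes the table was exported as tab-separated text with a header line (possibly prefixed with "#") that contains the columns "name" and "name2"; the function name `refseq_tx2gene` and the example rows are hypothetical, and the example uses only a few columns rather than the full table.

```python
def refseq_tx2gene(lines, transcript_col="name", gene_col="name2"):
    """Pull unique (transcript, gene) pairs out of a tab-separated table.

    Columns are located by header name, so their position in the
    export does not matter.
    """
    header = None
    pairs = []
    seen = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if header is None:
            # Assumption: the first line is the header.
            header = [name.lstrip("#") for name in fields]
            tx_index = header.index(transcript_col)
            gene_index = header.index(gene_col)
            continue
        if len(fields) <= max(tx_index, gene_index):
            continue  # skip short or blank rows
        pair = (fields[tx_index], fields[gene_index])
        if pair not in seen:  # drop repeated transcript/gene pairs
            seen.add(pair)
            pairs.append(pair)
    return pairs


# Hypothetical rows with made-up identifiers.
table = [
    "#bin\tname\tchrom\tname2\n",
    "0\tTX_1\tchr1\tGeneA\n",
    "0\tTX_2\tchr1\tGeneA\n",
    "1\tTX_1\tchr1\tGeneA\n",
]
tx2gene = refseq_tx2gene(table)
```

Each pair can then be written out as one tab-separated line to give the two-column mapping file.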

Glad this worked out!

modified 10 months ago • written 10 months ago by Jennifer Hillman Jackson ♦ 25k
