computeMatrix skipping bug is causing a problem with large data sets

When running computeMatrix with a large set of data I keep receiving this error multiple times each run:

Skipping (gene), due to being absent in the computeMatrix output.

This is a problem because I am using the resulting matrix to make heatmaps outside of Galaxy, and it's impossible to line the genes up with their values when several hundred-thousand genes are being left out. Is anyone else getting this error, or does anyone know how to fix it? Even just getting the full list of genes that were left out would be fine, as I could then line up the ones that remain, but currently Galaxy only shows me four at a time in the info box. Thanks in advance for all the help,

-Matt

computematrix deeptools galaxy rna-seq • 214 views

modified 4 months ago by Jennifer Hillman Jackson ♦ 25k • written 4 months ago by Dietz.Matthew • 0

Hi Matt, Could you explain in more detail what steps are leading up to this problem? I'm not sure what "genes being left out" means or how you are preparing the data for this tool.

It would be helpful to include exact tool names (including version) and the copied contents of the Job Details tool run settings (don't post back the job API links, or your data will not remain private). Or you can post back a shared history link (would be public) or send in a bug report from the error dataset (private). If you choose to send in the bug report, be sure to leave all input/output datasets undeleted and include a link to this post so we can link the two.

How-to and troubleshooting FAQs: https://galaxyproject.org/support/

My job ended with an error. What can I do?
Reporting Usage Issues or Software bugs

Thanks, Jen, Galaxy team

ADD REPLY • link written 4 months ago by Jennifer Hillman Jackson ♦ 25k

I'm using the computeMatrix tool in NGS: DeepTools, version 2.5.0.0. By "genes being left out", I mean that the tool is deciding, for one reason or another, to skip some of the genes and leave them out of the computeMatrix output. The settings are as shown:

computeMatrix has two main output options: reference-point
The reference point for the plotting: beginning of region (e.g. TSS)
Discard any values after the region end: False
Distance upstream of the start site of the regions defined in the region file: 1000
Distance downstream of the end site of the given regions: 1000
Show advanced output settings: yes
Save the matrix of values underlying the heatmap: True
Save the regions after skipping zeros or min/max threshold values: False
Show advanced options: yes
Length, in bases, of non-overlapping bins used for averaging the score over the regions length: 50
Sort regions: maintain the same ordering as the input files
Method used for sorting: mean
Define the type of statistic that should be displayed.: mean
Convert missing values to 0?: True
Skip zeros: False
Minimum threshold: Not available.
Maximum threshold: Not available.
Scaling factor: Not available.
Use a metagene model: False
trascript designator: transcript
exon designator: exon
transcriptID key designator: transcript_id
Blacklisted regions in BED/GTF format: (empty)
Job Resource Parameters: no

The full message in the info box is as shown:

Skipping uc009skg.1, due to being absent in the computeMatrix output.
Skipping uc009skh.1, due to being absent in the computeMatrix output.
Skipping uc029xhh.1, due to being absent in the computeMatrix output.
Skipping uc029xhi.1, due to being absent in

Either fixing the skipping issue or even just getting a complete list of the skipped ones would effectively fix the problem I'm having. I'm fairly certain the problem isn't with how the data is prepared, as I have run this through with different sets and with smaller sets and it was fine before. Thanks for looking into this, and feel free to ask if you need any more clarification. When I get time I'll see if I can't figure it out myself.

ADD REPLY • link written 4 months ago by Dietz.Matthew • 0

My initial guess is that the skipped transcripts (genes) map to places not represented in the bigwig score data. The bigWig was created from a BAM that was mapped to the mm10 primary autosomes + chrX, chrY, and chrM (an uploaded BAM, but the BAM headers are a match for mm10, so I don't think there is a genome mismatch problem).

The mm10 UCSC genes track includes transcripts that map to haplotype and unmapped sequences (the full genome), which are not part of that chromosome subset. These are probably what is being skipped.

I'll be checking for that (there are not that many skipped, so this makes sense), but please also continue to troubleshoot on your end with this information in mind.

ADD REPLY • link written 4 months ago by Jennifer Hillman Jackson ♦ 25k

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Ok, that went quicker than expected. The skipped BED transcripts are not represented in the bigWig file. This might be Ok depending on your analysis goals.

An example input region that was skipped because it maps to an unmapped chromosome (chrUn_GL456372) that is not included in the BAM/bigWig:

chrUn_GL456372  6883    13335   uc009skg.1  0   -   6883    6883    0   3   680,151,109,    0,933,6343,

Two choices: Use the data as is (leave out the transcripts that didn't pass through the tool). Or, recreate the BAM used to generate the bigWig using the full mm10 genome build.
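If you go with the first choice, one way to keep the regions and the matrix rows aligned is to remove the affected transcripts from the BED file before running the tool. Below is a minimal Python sketch of that idea, not part of Galaxy or deepTools. The function name `split_bed_by_chroms` is made up for illustration, the chromosome set is an assumption based on the description above (mm10 autosomes plus chrX, chrY, and chrM), and the `chr1` line is a hypothetical example region.

```python
def split_bed_by_chroms(bed_lines, known_chroms):
    """Split BED lines into (kept, dropped) by whether the chromosome is known.

    bed_lines    -- iterable of BED-format text lines
    known_chroms -- set of chromosome names present in the bigWig (assumed input)
    """
    kept, dropped = [], []
    for line in bed_lines:
        # Ignore blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split()[0]
        target = kept if chrom in known_chroms else dropped
        target.append(line.rstrip("\n"))
    return kept, dropped


# Assumed chromosome set: mm10 primary autosomes + chrX, chrY, chrM.
MM10_PRIMARY = {f"chr{i}" for i in range(1, 20)} | {"chrX", "chrY", "chrM"}

example_bed = [
    # Skipped region quoted in this thread.
    "chrUn_GL456372\t6883\t13335\tuc009skg.1\t0\t-\t6883\t6883\t0\t3\t680,151,109,\t0,933,6343,",
    # Hypothetical region on a primary chromosome, for illustration only.
    "chr1\t100\t200\texample_tx\t0\t+",
]

kept, dropped = split_bed_by_chroms(example_bed, MM10_PRIMARY)
print("kept:", [line.split()[3] for line in kept])
print("dropped:", [line.split()[3] for line in dropped])
```

The `dropped` list doubles as a record of which transcripts will not appear in the matrix.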

I only see a small number of skipped transcripts/regions (for this reason) when I review the stderr for the job. Click on the job details icon ("i" icon) to review this content. You can also send in a bug report. I'll know it is yours, so it is fine to do that without expecting more feedback, and you'll get a copy so that you can review the full details yourself.
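Once the full stderr text has been copied out of the job details, the skipped IDs can be pulled from it with a short script. This is only a sketch: it assumes the messages follow the wording quoted earlier in this thread, and `skipped_ids` is a made-up helper name.

```python
import re

# Matches the message wording quoted in this thread. The pattern stops at
# "absent" so that a message cut off part-way through still matches.
SKIP_RE = re.compile(r"Skipping (\S+), due to being absent")


def skipped_ids(stderr_text):
    """Return the skipped region IDs in order of first appearance, without duplicates."""
    return list(dict.fromkeys(SKIP_RE.findall(stderr_text)))


example = (
    "Skipping uc009skg.1, due to being absent in the computeMatrix output. "
    "Skipping uc009skh.1, due to being absent in the computeMatrix output. "
    "Skipping uc029xhh.1, due to being absent in the computeMatrix output. "
    "Skipping uc029xhi.1, due to being absent in"
)

for region_id in skipped_ids(example):
    print(region_id)
```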

Overall, 63,814 regions were input and 63,572 matrix lines were generated. The difference of 242 regions is most likely those transcripts that were not in the bigWig at all (skipped).
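The same comparison can be made per region rather than by count, which gives the complete list of missing transcripts. The Python sketch below rests on an assumption about the file layout: it expects the gzipped computeMatrix matrix file, where header lines start with "@" and each data row begins with BED-like columns (chromosome, start, end, name, ...) followed by the values. The plain tabular "matrix of values" output may not carry region names, so check your file before relying on this. The function names are made up for illustration.

```python
import gzip


def bed_names(bed_path):
    """Return the name column (4th field) of every region in a BED file, in order."""
    names = []
    with open(bed_path) as handle:
        for line in handle:
            if line.startswith(("#", "track", "browser")):
                continue
            fields = line.split()
            if len(fields) >= 4:
                names.append(fields[3])
    return names


def matrix_names(matrix_path):
    """Return the set of region names in a gzipped matrix file.

    Assumes header lines start with '@' and the name is the 4th column.
    """
    names = set()
    with gzip.open(matrix_path, "rt") as handle:
        for line in handle:
            if line.startswith("@"):
                continue
            fields = line.split()
            if len(fields) >= 4:
                names.add(fields[3])
    return names


def missing_regions(bed_path, matrix_path):
    """List BED region names that have no row in the matrix, keeping BED order."""
    present = matrix_names(matrix_path)
    return [name for name in bed_names(bed_path) if name not in present]
```

Calling `missing_regions("regions.bed", "matrix.gz")` (both file names are placeholders) would return the names to drop when lining the remaining genes up with their values.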

The current results appear to be Ok to plot. If you have a failed plot (I couldn't find one active in your history), a bug report for that can also be sent in and reviewed for more details about why it is failing. It could be that some setting for computeMatrix needs to be tuned. For example, you might want to set the parameter "Skip zeros" to "Yes" (exactly what this does is explained on the tool form). In short, transcript regions that have no overlap with any of the bigWig's regions will be removed from the heatmap plot output.

Hope that helps! Jen, Galaxy team

ADD COMMENT • link modified 4 months ago • written 4 months ago by Jennifer Hillman Jackson ♦ 25k
