Hi all,
I have a batch of 170 samples (represented by 170 comma-separated files) and am attempting to have them all as inputs into a single R script that will produce 1 output using data from each sample. When I run this script outside of galaxy, the input is a directory of the files. I feel that the multiple datasets option is what I should be using, but I can't seem to figure out its structure. Any help is appreciated!
Here is an example of what the script would be doing outside of galaxy:
# Get files
arguments <- commandArgs(trailingOnly=TRUE);
directory <- arguments[1]
files <- list.files(directory)
# Read in 1st file to use to make matrix (paste with sep="" assumes directory ends in "/")
temp.file <- paste(directory, files[1], sep="")
temp <- read.csv(temp.file)
# Get rows from file and columns from the number of files in the directory
num.rows <- length(temp[[1]])
num.cols <- length(files)
#Create and populate matrix
matrix.to.populate <- matrix(nrow = num.rows, ncol = num.cols)
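for (i in seq_along(files)) {
  # Take the 5th column of each file as one column of the matrix
  matrix.to.populate[, i] <- read.csv(paste(directory, files[i], sep=""))[[5]]
}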
After the matrix is populated, I do a few calculations using the data.
I originally tried to use a dataset collection, but that wanted to run the script 170 times, once for each sample. I'm currently trying to use the multiple datasets input option (I found this post: "How to select multiples files as inputs to parse them simulaneously ?"), but can't get that to work. My xml file has the following command and input:
<command>R --vanilla --file=my.script.R --args $inputs $output </command>
<param name="inputs" type="data" multiple="True" />
I tried to change my original R script to accommodate this input, but have not been successful. The galaxy version of my R script looks like this:
inputs <- commandArgs(trailingOnly=TRUE)[1]
output <- commandArgs(trailingOnly=TRUE)[2]
# Get rows and columns for matrix
temp <- read.csv(inputs[1])
num.rows <- length(temp[[1]])
num.cols <- length(inputs)
The num.rows works, but the num.cols returns 1 instead of 170. It looks like only the first file in the multiple datasets field is being used. In my galaxy script,
read.csv(inputs[1])
is the same as
temp.file <- paste(directory, files[1], sep="")
temp <- read.csv(temp.file)
Trying to see what inputs looks like:
write.table(inputs, quote=FALSE, row.names=FALSE, col.names=FALSE)
returns
"~/galaxy/database/files/000/dataset_433.da"
I have difficulties following your R code, but a few points:
In the Galaxy code: why "read.csv(inputs[1])" and not "read.csv(inputs)"? 'inputs' is already a single path.
"num.cols <- length(inputs)" will give you the length of the vector 'inputs'. 'inputs' is just one path and therefore a vector of length 1.
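To illustrate the point about length, here is a minimal sketch. The `args` vector is a made-up stand-in for what `commandArgs(trailingOnly=TRUE)` returns, and the file names are hypothetical.

```r
# Simulated stand-in for commandArgs(trailingOnly=TRUE); file names are hypothetical
args <- c("dataset_1.dat", "dataset_2.dat", "output.dat")

# Indexing with [1] keeps only the first element of the argument vector
inputs <- args[1]

n.inputs <- length(inputs)                 # a single path, so the length is 1
same.thing <- identical(inputs, inputs[1]) # inputs and inputs[1] are the same value
```

This would also explain why `read.csv(inputs)` and `read.csv(inputs[1])` behave identically: both see the same single string.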
As far as the R code goes:
set a directory path as the argument
Create a list of all of the files in that directory (csv's with 260 rows and 5 columns)
Read in the first file
Get the number of rows for the new matrix (260) from first file
Get the number of columns for the new matrix (1 column for each file in the directory)
Create an empty matrix with specified rows and columns
Populate the matrix with values from the files
I'm essentially taking the 5th column from each of the csv's and then cbinding them together so that the final result will be 1 column of 260 values for each sample...hope that explained it better.
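The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the three small demo files, their column names, and the temporary directory are invented so the example runs on its own, and every file is assumed to have the same number of rows.

```r
# Hypothetical demo data: three small CSV files in a temporary directory
directory <- tempfile("samples")
dir.create(directory)
for (i in 1:3) {
  demo <- data.frame(a = 1:4, b = 1:4, c = 1:4, d = 1:4, value = (1:4) * i)
  write.csv(demo, file.path(directory, paste0("sample", i, ".csv")), row.names = FALSE)
}

# Full paths of every file in the directory
files <- list.files(directory, full.names = TRUE)

# Read the 5th column of each file; sapply binds the equal-length columns into a matrix
result <- sapply(files, function(f) read.csv(f)[[5]])
colnames(result) <- basename(files)

result.dim <- dim(result)  # rows per file x number of files
```

Using `sapply` avoids pre-allocating the matrix and reading the first file separately, since the dimensions follow from the data.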
As far as the galaxy code, I was just experimenting with how to access all of the files in the multiple inputs. I can only ever access the first file. Both "read.csv(inputs[1])" and "read.csv(inputs)" give me the same result actually, the first file in the multiple dataset argument (5 columns of 260 rows).
I'm trying to find some way to refer to all of the files in the multiple inputs (i.e. if I have 170 samples, there will be 170 columns, if I have 140 samples, there will be 140 columns, etc.)
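One possible way to pick up every selected dataset inside the script is sketched below. It rests on assumptions that should be checked on your own instance: that the output path is the last argument, and that the input paths arrive either as separate arguments or joined into one comma-separated string (which I believe is how a multiple="True" data parameter is rendered on the command line). `parse.args` is a hypothetical helper name, and the file names are made up.

```r
# Hypothetical helper: split the trailing arguments into input paths and an output path.
# Assumes the last argument is the output and everything before it names the inputs,
# either as separate arguments or joined with commas.
parse.args <- function(args) {
  list(
    inputs = unlist(strsplit(head(args, -1), ",", fixed = TRUE)),
    output = tail(args, 1)
  )
}

# In the real script this would be:
#   parsed <- parse.args(commandArgs(trailingOnly = TRUE))
parsed <- parse.args(c("dataset_1.dat,dataset_2.dat,dataset_3.dat", "output.dat"))

num.cols <- length(parsed$inputs)  # one column per selected dataset
```

With this, `parsed$inputs` is a character vector with one element per dataset, so `length()` follows the number of samples selected. Note that splitting on commas would break for paths that themselves contain a comma.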