
Using R to Import and Perform Simple Calculation on Multiple Files

July 25, 2013

My first post about using R back in January focused on creating custom plots and led to some interaction with other R users in the comment thread and this post from R expert Gavin Simpson in which he kindly offered a solution (and code) to what I was trying to do. I ended up using a slightly modified version of that code to make a bunch of plots for a paper I’m working on. (In fact, I should be working on that paper right now … yep.)

Anyway, for this post I want to briefly summarize another aspect of R (or most programming languages for that matter) that is quite powerful: automating data processing and/or computational tasks that need to be done for multiple files. This funny graphic* below was going around some months ago and very nicely illustrates what I’m talking about.

[Figure: ‘Geeks and Repetitive Tasks’ graph, plotting time spent against task size for a manual versus a scripted approach]

Just a bit of background before I get on to the R code. I’m starting some research that will take me and my current/future students at least a few years, perhaps more, to accomplish simply because of the huge number of samples. I have a literal boatload of deep-sea sediment samples from IODP Expedition 342 (see this post for context) that will be used to better understand the history of deep ocean circulation in response to past climate change. Specifically, we’ll be measuring the variability in grain size of the terrigenous (land-derived, non-biogenic) sediment through time. To the naked eye, all this sediment is mud. But, subtle differences in the mean size of the silt fraction over time can tell us about the relative intensity of long-lived abyssal currents that transported the sediment. (If you want to know more about this approach, including all the assumptions, limitations, and caveats, this is a nice review.)

Okay, back to data processing with R. The particle size instrument we use to make the measurement outputs a standard text (ASCII) file with some header information and two columns of data: particle diameter and relative mass %. From these two columns we want to calculate the mean particle diameter over a subset of the entire range. If I had only a handful of files, I would end up doing something like this: import the .txt file into Excel > build formulas to calculate the value we want > copy the resultant value to another spreadsheet, and then repeat for each file. On the graphic above, this would be doing it ‘manually’.
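To make the file structure concrete before getting to the code, here’s roughly what one of these files looks like. The header description and the numbers below are invented placeholders for illustration, not actual instrument output:

(15 lines of header information: instrument settings, sample ID, etc.)
2.10,0.35
2.41,0.42
...
61.20,1.95
62.23,1.87

The first column is particle diameter (microns) and the second is relative mass %, separated by commas.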

But, what if you have (or will generate in coming months/years) thousands of individual text files? Going through and doing manual manipulation in Excel — no matter how quick for one file — on large numbers of files would be enough to drive a person to madness … madness! It’s a waste of our precious time and it could lead to little errors. So, let’s use R to automate this whole thing. (Oh, I should quickly note here, I’m a programming/coding newbie. This post is for people like me who have little to no experience with this stuff. I’d love it for experts to read the whole thing and offer their perspective and correct any mistakes, but no worries if this is all too elementary and boring.)

First thing to do (after setting the working directory) is to define the function we want to apply to the data files. In this case, we need to calculate the mean from relative frequency data. Remember, any text preceded by a ‘#’ in these examples of R code is just a note to remind myself what the code does.

# Define function to calculate the mean (frequency-weighted) from an instrument output data file
meanSS = function(diam, freq){ sum(diam * freq) / sum(freq) }
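As a quick sanity check, the function is just a frequency-weighted mean. With made-up numbers:

# toy example (invented numbers): (10*1 + 20*2 + 40*1) / (1 + 2 + 1) = 22.5
meanSS(c(10, 20, 40), c(1, 2, 1))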

Next step is to import the text files into R. The way it’s written here, the files need to be in the working directory. This will bring in all files whose names end in ‘.TXT’.

# batch import text files (files must be in working directory)
# note: 'pattern' is a regular expression, not a glob, and is case-sensitive
txtfiles = list.files(pattern="\\.TXT$")
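Before looping, it’s worth a quick check that the expected files were actually picked up:

# sanity check: how many files matched, and do the names look right?
length(txtfiles)
head(txtfiles)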

Now, here’s the cool part. The code below is a loop that does a few things. First, we tell R to skip the first 15 rows of the text files. These rows contain header information that is unnecessary for our purposes. Second, it creates a subset of the data. In this case I want to make the calculation only on the 10-63 micron range. Because 63 microns is the max (samples were sieved at 63 microns prior to analysis), we tell R to create a subset of all data that is >9.99 in the first column. Third, we pull the two columns out into vectors (‘diam’ and ‘freq’) that correspond with the function above. I suppose you wouldn’t have to name them, but I like to be able to just look at this code and know what it’s doing. Finally, it applies the function ‘meanSS’ that we defined above. This is all within a ‘for’ loop so that these steps are applied to every text file that was imported.

# Loop that creates a subset (SS, the 10-63 micron range), assigns the columns
# to named vectors, and computes the mean of that subset for every file read in
means = numeric(length(txtfiles))  # pre-allocate a vector to hold one mean per file

for (i in seq_along(txtfiles)){
  tmp = read.table(txtfiles[i], sep=",", skip=15)  # skip the 15 header rows
  SS = subset(tmp, V1 > 9.99)   # keep only rows with diameter > 9.99 microns
  diam = SS[[1]]                # first column: particle diameter
  freq = SS[[2]]                # second column: relative mass %
  means[i] = meanSS(diam, freq)
}

I had been stuck on this loop code for some time. I spent several cumulative hours over a couple weeks on websites like Stack Overflow searching for tips and trying out different things, but couldn’t quite get it to work. Thankfully, I was able to get help from this fantastic service on campus called LISA (Laboratory for Interdisciplinary Statistical Analysis). One of the stats department grad students working at LISA fixed my nearly-working code in about five minutes over email. Awesome!
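As an aside, the same job can also be written without an explicit loop using R’s sapply(), which applies a function to each file name and collects the results into a vector. This is just an alternative sketch, not a correction; the loop above works fine:

# alternative sketch: same calculation using sapply instead of an explicit loop
means = sapply(txtfiles, function(f){
  tmp = read.table(f, sep=",", skip=15)
  SS = subset(tmp, V1 > 9.99)
  meanSS(SS[[1]], SS[[2]])
})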

Next, the results of the above calculation, one mean value per file, are combined with the file names into a data frame. ‘Printing’ the results simply displays the calculated values on the screen.

# combine file names and computed mean sortable silt values into a results table
results = data.frame(txtfiles, means)
print(results)
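And if you’d rather have these values in a spreadsheet than copy them off the screen, something like this exports them (the file name is just an example):

# export the results table to a CSV file (file name is arbitrary)
write.csv(results, "meanSS_results.csv", row.names=FALSE)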

I’ve only generated a few text files so far; we have a lot of sample preparation work to do before we get to generating more. But it was pretty satisfying to open up R, highlight the code, hit run, and see the values from multiple files appear in seconds. Future additions to this code will be to create time series plots of the calculated values by merging them with a table of age data. And, ultimately, we’ll want to create time series plots of these data combined with time series data generated by other scientists from Expedition 342 (e.g., oxygen isotopes).
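As a rough sketch of that future step (the file ‘ages.csv’ and its ‘age’ column are hypothetical placeholders for whatever the age model table ends up looking like):

# hypothetical future step: merge the means with an age model and plot a time series
# 'ages.csv' (with columns 'txtfiles' and 'age') is a made-up placeholder file
ages = read.csv("ages.csv")
series = merge(results, ages, by="txtfiles")
plot(series$age, series$means, type="l", xlab="Age", ylab="Mean sortable silt (microns)")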

To summarize, the processing and calculation steps in this code are not complicated; they are, in fact, rather simple. But the ability to loop those simple steps and automatically apply them to numerous text files with a few keystrokes is powerful. If anyone has any tips to improve this code, I’d love to hear them.

* As far as I can tell, it looks like the ‘Geeks and Repetitive Tasks’ graph came from Bruno Oliveira’s G+ page.

9 Comments
  1. July 25, 2013 6:50 am

    Cool post! I’m in the process of learning R as well, but since I have experience with MATLAB, it’s mainly syntax and structure that I’m trying to familiarize myself with…

    Thanks for the link to the McCave and Hall paper – looks interesting. Are you guys running any oxygen isotopes in your lab?

    • July 25, 2013 7:08 am

      No, not running any isotope measurements here. About half the science party from this expedition are measuring isotopes of one kind or another.

  2. July 29, 2013 10:47 am

    This figure sums up my life as a grad student (on the geek side of it)

  3. bill turnbull permalink
    July 29, 2013 10:56 am

    Some data sets are not amenable to manual analysis; the ONLY practical method is computer-aided analysis. The example shown is a good learning scenario – data set to get your brain around but large enough to make the automated analysis productive.

    • bill turnbull permalink
      July 29, 2013 10:58 am

      “…. data set (small enough) to get your brain around …”

  4. bill turnbull permalink
    July 31, 2013 11:18 am

    I’m guessing that the change suggested by your colleague was to move meanSS out of the input loop and change it to (the equivalent of; I’m not familiar with R):

    meanSS(diam, freq)
    /* pass in arrays diam, freq,
    return normalized mean */

    Sum_Diam-Freq = 0;
    Sum_Freq = 0;

    do (i = 1; i <= sample_size; i++)
    /* Accumulate the sums */
    Sum_Diam-Freq = Sum_Diam-Freq + diam(i) * freq(i);
    Sum_Freq = Sum_Freq / freq (i);
    end do

    return Sum_Diam-Freq / Sum_Freq;

    I wish I were familiar with R; I would love to consult just for the fun of participating in actual research.

    • July 31, 2013 11:34 am

      Yes, the guy who helped me out noted it was important to define the function at the top and outside of the loop. Don’t feel bad about not knowing R … this little bit of R is the *only* programming/coding I know.

  5. bill turnbull permalink
    July 31, 2013 11:21 am

    Grrr, I meant:
    Sum_Freq = Sum_Freq + freq(i).

    My formatting was removed from my original post too; I wish that “extra” spaces, etc., weren’t automatically removed.

  6. March 21, 2014 7:45 am

    Hi Brian

    Thank you for your post! I did some practice following your post. But there is a question I want to ask. Say I have two .txt files, with two rows and two columns of values in each file.

    It seems only the first row went through the function and produced output. What about the second row, if I also want it to go through the function?

    Best regards,
    Ye
