Creating Data Plots with R
First things first, I’m a novice when it comes to coding/programming. So, those of you who are much better at this are free to point and laugh at my lack of skills. I’ve gotten used to revealing my ignorance on a nearly daily basis, which is fine because the more that I do that the more I learn useful stuff.
I’ve recently started playing around with the free statistical/graphical programming language and software R. (Here’s some good intro documentation.) For a sampling of what the graphics look like, check out this image gallery. I use R through the separate RStudio interface, which I like quite a bit. This image is from the RStudio website with my own text annotation explaining what the four panels are for.
For this first foray into R I’m interested in creating a simple line plot with custom annotation for some grain-size data I have. Down the road I’d like to learn more about statistical analyses with R, but for now I just want to make some nice-looking graphics for a paper I’m writing.
I’m not sure how many of you who read this blog use R, but I thought it might be fun and mutually helpful to start sharing some ideas and code. Just about everything I’ve learned so far (which is the tip of the tip of the R iceberg) was found by googling various commands. There are numerous sites, forums, and blogs out there with people helping each other out. It’s amazing. However, even with all those resources it’s still sometimes difficult to find exactly what you need. This is especially true for a beginner (i.e., me) who isn’t familiar with all the commands yet.
The light-gray boxes with green text have bits of code that generated the plot that is shown below the code. Any lines of code that start with a ‘#’ are extra explanation and not information that R will try to read. At this point I’m putting a lot of notes in the code to remind myself what it is these commands do. For this post, I amend the code with very simple changes (additions, subtractions, modifications) and show the corresponding change to the plot.
# load the data file (a .csv file with volume% and particle diameter data in columns 1 and 2, respectively) smb_bed1a <- read.csv("SMB-bed1a.csv", header=TRUE)
This simply loads the data file (first row of the csv file is a text header). The data is now ready to be analyzed or plotted. (Before loading any data files make sure to set your working directory to wherever the file is located.)
Okay, let’s make a simple line plot.
# line plot of vol% on y-axis and diameter on x-axis plot(smb_bed1a,type="l",main="SMB Bed 1a")
There are numerous plot types; this one is type ‘l’, which is just a curved line. The best way to check out the options is to simply try out different ones. The default is to put the first column on the y-axis and second column on the x-axis. This can, of course, be changed. It also automatically put the text from row 1 as the labels for the axes. The only other info I put in the code was to add a title to the plot using the ‘main’ function.
But, grain size is best viewed on a log scale.
# same plot, but changing x-axis to be logarithmic plot(smb_bed1a,type="l",log="x",main="SMB Bed 1a")
Sweet, that was easy. Now I want to change the labels for the axes from their default (the first row in the loaded csv file) to something more accurate. This is done with the ‘xlab’ and ‘ylab’ functions. Note that I also use the ‘expression(paste)’ function to get the Greek symbol ‘mu’ to display for micron units. A little thing like that actually took me some time to figure out; I eventually stumbled across this web page.
# adding better axes labels & horiz label text; including adding Greek symbol 'mu' plot(smb_bed1a,type="l", las=1,log="x",xlab=expression(paste("Particle Diameter (", mu,"m)")), ylab='Volume (%)',main="SMB Bed 1a")
Now that I look at this plot, I want to see the data points superimposed on the line. This is done by changing the ‘type’ and the ‘pch’
function plot parameter relates to the kind of symbol displayed.
# changing to line w/ points plot(smb_bed1a,type="o", las=1, pch=16, log="x",xlab=expression(paste("Particle Diameter (", mu,"m)")), ylab='Volume (%)',main="SMB Bed 1a")
For my purposes, the default x-axis labeling is not very useful. These are data are from sedimentary deposits, so I want to emphasize the particle diameter divisions based on the Wentworth grain-size scale. This will show which part of the distribution is clay, fine/medium silt, coarse silt, very fine sand, and so on. The first thing is to remove the default tick marks and numbering, which is done by setting ‘xaxt’ to ‘n’ (found here). Then, I add my own axis label by specifying what values on the x-axis I want to display. The ‘c’ function is used ubiquitously in R to assign a vector, which is just an ordered collection of numbers. The ‘cex.axis’ function changes the font size of the label.
# removing default x-axis labels; adding labels to denote sedimentological grain-size divisions plot(smb_bed1a,type="o", las=1, pch=16, log="x",xaxt='n',xlab=expression(paste("Particle Diameter (", mu,"m)")), ylab='Volume (%)',main="SMB Bed 1a") axis (1, at=c(4,31,63,125,250,500,1000,2000), cex.axis=0.65)
After looking at this iteration, I’d like to decrease the size of the data points. I’ll do this using the ‘cex’ function in the main plot command. I’d also like to add some vertical reference lines at the labeled grain-size divisions to help the viewer more easily see how those divisions relate to the distribution. I use the ‘abline’ function for vertical (‘v’) lines and make it look how I want with the color (‘col’), line width (‘lwd’), and line type (‘lty’)
functions plot parameters.
# decreasing data point size ('cex'); adding vertical lines ('abline v') to show sed grain size divisions plot(smb_bed1a,type="o", las=1, pch=16, cex=0.65, log="x",xaxt='n',xlab=expression(paste("Particle Diameter (", mu,"m)")), ylab='Volume (%)',main="SMB Bed 1a") axis (1, at=c(4,31,63,125,250,500,1000,2000), cex.axis=0.65) abline(v=c(4,31,63,125,250,500), col="gray40", lwd=0.8, lty=5)
What I’d like to do next I haven’t yet figured out. Maybe this is where some readers could help. I’d like to color-fill under the curve and with different colors for each of the grain-size domains (areas between the custom x-axis tick marks and corresponding dashed lines). Ultimately, I’d like to vertically stack a bunch of these plots so the reader can quickly visualize subtle changes in distribution from one datum to the next. I’ve been wrestling with the ‘polygon’ function quite a bit but haven’t figured it out yet. Sure, I could take what I have here into Illustrator and add the embellishments by hand, but I have >50 of these plots so putting the effort in now to create code I can apply over and over again is worth it.
Now, because this was essentially the first time doing this I’m sure that what I’ve cobbled together here is not the most efficient code. That’s the beauty of R (and any programming for that matter), there are numerous ways to get to the desired result. I’m guessing that there are some functions I don’t know about that could clean this up and make it even simpler.