This summer, while I’ve been interning at Shutterfly, I’ve also been taking a course in introductory statistics. Of course, since I’m majoring in Computer Science, I’m always looking for new ways to tie computer science into whatever I happen to be learning. Fortunately, there is quite a bit of overlap between these fields, and there are many computational tools out there that make it easy to analyze data.

Since my calculator (a TI-92, for those of you who are into calculators!) doesn’t have very many statistical capabilities or built-in programs, my tools of choice for this semester have been the statistics package R and several programs I’ve written for my calculator in BASIC. R is free and open source, and can very likely be found in your package manager of choice. In this article, I hope to cover a few ways to use it to perform some basic analysis on data sets.

Once you have R installed, grab a data set and dive in!

1-Dimensional Data Sets

Let’s say you simply have a list of values, and you want to know some of its statistical properties - the mean, median, mode, and maybe even the quartiles the values fall into. The first order of business is to get your data into a file.

Doing everything from the command line makes things fairly straightforward. If you can copy your data onto the clipboard, you can easily pipe it into a file with xclip. If you don’t already have xclip installed, and you’re on a Debian variant, you can install it using

sudo apt-get install xclip

Since xclip’s command-line options aren’t especially intuitive, it helps to make quick aliases for the copy and paste commands. You can use whatever you’d like - I went with ‘cpaste’ and ‘ccopy’ - and simply echo them into your .bashrc, .zshrc, or external alias file. Make sure to refresh your .rc file after adding them, though!

echo "alias cpaste='xclip -selection clipboard -o'" >> ~/.bashrc
echo "alias ccopy='xclip -selection c'" >> ~/.bashrc
source ~/.bashrc

Once you have this, you can then pipe your clipboard contents into a file.

cpaste > ~/data.csv

Great! So now we have a quick way of getting some data into a file. Next, we need to load it as a data frame in R. Lucky for us, there are already plenty of functions in place to accomplish this, so I’ll just demonstrate one. You can launch R by simply typing “R”, as long as it’s on your path.

# Read in the data
t <- read.csv("~/data.csv", header=TRUE)

t           # Display the contents of t - namely, the data we just read in.

tail(t)     # Display the last few entries of t

nrow(t)     # Display how many entries (rows) t contains; length(t) would count columns instead

summary(t)  # Get a quick overview
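
summary(t) covers the mean, median, and quartiles, but it doesn't report the mode. Here is a minimal sketch of computing each property individually. The data values and the column name "value" are made up for illustration, so substitute the column from your own file. Note that R has no built-in function for the statistical mode (mode() reports a variable's storage type), so the sketch builds one from a frequency table.

```r
# Hypothetical one-column data set; the values and the column name "value"
# are assumptions for illustration, not data from the course.
t <- data.frame(value = c(2, 4, 4, 5, 7, 9, 10))
x <- t$value

mean(x)      # Arithmetic mean
median(x)    # Middle value
quantile(x)  # Minimum, quartiles, and maximum

# Statistical mode: tabulate the values, then keep the most frequent one(s)
counts <- table(x)
mode_values <- as.numeric(names(counts)[counts == max(counts)])
mode_values
```

If several values tie for the highest count, mode_values will contain all of them.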

Scatterplots and Linear Regression

Of course, the whole point of using something like R is that it makes statistical analysis so easy. So let’s dive right in!

plot(t$x, t$y)                  # Produce a scatterplot of x vs. y
cor(t$x, t$y)                   # Display the correlation coefficient r for the linear relationship

fit <- lm(y ~ x, data = t)      # Perform a linear regression of y on x, and store the result
fit                             # Display the slope and y-intercept of the linear regression
summary(fit)                    # Display a more detailed analysis
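
Once a regression is stored, it can be reused rather than just printed. The sketch below shows one way to pull out the coefficients, draw the fitted line over the scatterplot, and predict y for a new x. The numbers in the data frame are invented purely for illustration.

```r
# Made-up data for illustration only; replace with your own data frame
t <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, 3.9, 6.2, 7.8, 10.1))

fit <- lm(y ~ x, data = t)   # Regress y on x

coef(fit)                    # Named vector: "(Intercept)" and "x" (the slope)

plot(t$x, t$y)               # Scatterplot of the raw data
abline(fit)                  # Overlay the fitted regression line

# Predict y at a new x value; the column name must match the model's predictor
predict(fit, newdata = data.frame(x = 6))
```

Passing data = t to lm(), instead of writing t$x and t$y in the formula, is what lets predict() match the new x column to the model.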

Hypothesis Testing with Multiple Populations

To be continued!
