This summer, while interning at Shutterfly, I’ve also been taking a course in introductory statistics. Since I’m majoring in Computer Science, I’m always looking for new ways to tie computer science into whatever I happen to be learning. Fortunately, the two fields overlap quite a bit, and there are many computational tools out there that make it easy to analyze data.

Since my calculator (a TI-92, for those of you who are into calculators!) doesn’t have very many statistical capabilities or built-in programs, my tools of choice for this semester have been the statistics package $R$ and several programs I’ve written for my calculator in BASIC. $R$ is free and open source, and can very likely be found in your package manager of choice. In this article, I hope to cover a few ways to use it to perform some basic analysis on data sets.

Once you have $R$ installed, grab a data set and dive in!

1-Dimensional Data Sets

Let’s say you simply have a list of values, and you want to know some of its statistical properties - the mean, median, mode, and maybe even the quartiles the values fall into. The first order of business is to get your data into a file.

Doing everything from the command line makes things fairly straightforward. If you can copy your data onto the clipboard, you can easily pipe it into a file with xclip. If you don’t already have xclip installed and you’re on a Debian variant, you can install it using “sudo apt-get install xclip”.

Since xclip’s command-line options aren’t especially intuitive, it helps to make quick aliases for the copy and paste commands. You can use whatever you’d like - I went with ‘cpaste’ and ‘ccopy’ - and simply echo them into your .bashrc, .zshrc, or external alias file. Make sure to refresh your .rc file after adding them, though!
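As a minimal sketch, the two aliases might look something like the following. The alias names are the ones mentioned above, but the exact definitions are my assumption: “-selection clipboard” points xclip at the regular copy/paste clipboard rather than the X primary selection, and “-o” prints the clipboard contents instead of reading into it.

```shell
# Copy: read from standard input and place it on the clipboard.
alias ccopy='xclip -selection clipboard'

# Paste: print the current clipboard contents to standard output.
alias cpaste='xclip -selection clipboard -o'
```

To make them permanent, append these two lines to your .bashrc or .zshrc and reload it (for example with “source ~/.bashrc”).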

Once you have this, you can then pipe your clipboard contents into a file.
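One way this could look is a small helper function. Note that clip_to_file is a hypothetical name made up for this sketch, not something xclip provides, and it calls xclip directly so that it works even where aliases are not expanded (such as inside scripts).

```shell
# Hypothetical helper: save whatever is on the clipboard to the file named in $1.
clip_to_file() {
    # -o prints the clipboard contents; the redirect writes them to the file.
    xclip -selection clipboard -o > "$1"
}
```

After copying a column of numbers, running “clip_to_file data.txt” should leave one value per line in data.txt (the file name is just an example). At an interactive prompt with the aliases loaded, “cpaste > data.txt” should do the same thing.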

Great! So now we have a quick way of getting some data into a file. Next, we need to load it as a data frame in $R$. Lucky for us, there are already plenty of functions in place to accomplish this, so I’ll just demonstrate one. You can launch $R$ by simply typing “R”, as long as it’s on your path.
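Here is a rough sketch of the loading step, assuming the data was saved to data.txt with one number per line. The file names and the column name “x” are placeholders I picked for illustration. It uses read.table to build the data frame and summary to print the minimum, quartiles, median, mean, and maximum.

```shell
# Write a short R script to disk (summarize.R is just an example name).
cat > summarize.R <<'EOF'
# Read a single column of numbers, one value per line, into a data frame.
values <- read.table("data.txt", header = FALSE, col.names = "x")

# Minimum, quartiles, median, mean, and maximum in one call.
print(summary(values$x))
EOF

# Run the script non-interactively if R and the data file are both present.
if command -v Rscript >/dev/null 2>&1 && [ -f data.txt ]; then
    Rscript summarize.R
fi
```

The two R lines inside the script can also be typed directly at the interactive $R$ prompt, which is handy while exploring.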

Scatterplots and Linear Regression

Of course, the whole point of using something like $R$ is that it makes statistical analysis so easy. So let’s dive right in!
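As a starting point, here is a hedged sketch of what a scatterplot with a fitted line might look like. It assumes a hypothetical file pairs.txt holding two whitespace-separated columns of numbers; the file names and the column names “x” and “y” are made up for this example. It relies on the standard R functions plot, lm, and abline.

```shell
# Write a short R script to disk (regression.R is just an example name).
cat > regression.R <<'EOF'
# Read two columns of numbers into a data frame with columns x and y.
pairs <- read.table("pairs.txt", header = FALSE, col.names = c("x", "y"))

# Draw the scatterplot into a PNG file instead of an on-screen window.
png("scatter.png")
plot(pairs$x, pairs$y, xlab = "x", ylab = "y")

# Fit the least-squares line y = a + b*x and overlay it on the plot.
fit <- lm(y ~ x, data = pairs)
abline(fit)
dev.off()

# Print the coefficients and fit statistics.
print(summary(fit))
EOF

# Run the script non-interactively if R and the data file are both present.
if command -v Rscript >/dev/null 2>&1 && [ -f pairs.txt ]; then
    Rscript regression.R
fi
```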

To be continued!
