Help! I have quantitative data, now what do I do? (Part 2) | Macarena Quiroga

Basic guide to start working with quantitative data. In this post: mean, median and mode with R

Image made by me with the {aRtsy} package

This post is the second installment of the saga How to start working with quantitative data without it being torture, designed for people who, like me, have a background in social or human sciences and suddenly have to work with numerical data or similar. In the first installment we saw how to set up the RStudio environment, how to import the data and how to inspect the data table. This time, we are going to discuss the three most important measures of central tendency: the mean, the median, and the mode.


Before we begin, what is statistics?

No, sorry, that’s a lie. We are not going to delve into what statistics is, for two reasons: first, because I am not a mathematician, and above all because there is a lot of material on the internet that covers these questions (starting with the Wikipedia page). Beyond that, it does seem important to me to start with a key difference: on the one hand, we have descriptive statistics, whose objective is to describe and summarize the available data; on the other, inferential statistics, which deals with generating models, inferences and predictions from the data. Although today it seems that the coolest thing is to do inferential statistics, the truth is that the first mandatory step is always the description of the data. To explain why the data is the way it is, or to predict how similar data would behave, we first need a clear idea of what our data is like.

Within descriptive statistics we have two major areas: measures of central tendency (the topic of this post) and measures of dispersion or variability (which we will see later). The three most important measures of central tendency are the following:

  • The mean or average: when we refer to the mean, we generally mean the arithmetic mean, which is different from the weighted mean or the geometric mean. It is calculated by adding all the values in the set and dividing by the number of elements.

  • The median: is the value that is in the middle of the data set when ordered from smallest to largest, that is, the value that leaves half of the data below it.

  • The mode: is the most frequent value in the data set.

All of these values convey different information about the data set and have their advantages and disadvantages. Viewing their results simultaneously will help you understand the characteristics of the data.
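
As a minimal sketch of the three definitions above, the following snippet computes each measure by hand on a small made-up vector and compares the results with R’s built-in functions where they exist. The values and the object names (x, manual_mean and so on) are assumptions for illustration only; they are not data from this post.

```r
# Made-up values, only for illustration
x <- c(2, 3, 3, 5, 7)

# Mean: sum of the values divided by how many there are
manual_mean <- sum(x) / length(x)               # 20 / 5 = 4

# Median: the middle value once the data are sorted
# (this shortcut only works for an odd number of values)
sorted_x <- sort(x)
manual_median <- sorted_x[(length(x) + 1) / 2]  # the third value: 3

# Mode: count how often each value appears and keep the most frequent one
counts <- table(x)
manual_mode <- as.numeric(names(counts)[which.max(counts)])  # 3 appears twice

manual_mean == mean(x)      # TRUE
manual_median == median(x)  # TRUE
```

Note that which.max() keeps only the first value when several are tied for most frequent, so this hand-made mode is a simplification.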


Mean, median and mode with R

We are going to work with the palmerpenguins package, but you can use whatever dataset you need. We install the package, load it, and save the penguins dataset to an object, just like we did in the previous post:

install.packages("palmerpenguins")
library(palmerpenguins)
penguins <- penguins

If you already installed the package, you don’t have to install it again. Sometimes we want to leave lines of code written in our script but we don’t want to execute them (for example, to keep them handy or so we don’t forget them): in this case, we can put the # symbol at the beginning of the line. This is called commenting out a line of code, and it’s a way both to prevent commands from being executed and to leave notes in the script. Commenting or documenting your code is good practice so that your future self understands what you were thinking when you wrote each thing. At first it may seem a bit tedious, but believe me, you will appreciate it.

For example, the code block above could be documented as follows:

# install the package
# install.packages("palmerpenguins")

# load the package
library(palmerpenguins)

# create the object
penguins <- penguins

Everything to the right of the # is not executed. You can place a comment at the end of a line of code or on a separate line. That is a matter of personal taste.
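
A small sketch of both comment placements, using a made-up object called total (an assumption for illustration, not part of the penguins example):

```r
# A comment on its own line: R skips this whole line
total <- 1 + 2  # a comment after code: R runs the code and skips the rest

# total <- 100  (this line is commented out, so total keeps its value)
total
```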

Once we have created the object (which will appear in the Environment, the panel in the upper right corner), you can inspect it by clicking on it or using the following line of code: View(penguins). In this table we have three categorical variables (species, island, sex) and five numerical variables (beak length and depth, flipper length, body mass and year of measurement).

Let’s consider the beak length to calculate the mean, median, and mode. To do this, we will use functions, which are keywords that indicate the action to be performed. In fact, in the above code we already used other functions, such as View(), install.packages() or library(). You recognize them because they are always followed by a set of parentheses. Inside those parentheses go the elements (or arguments) on which the action is going to be executed, or the specifications of how we want it to be executed.

To calculate the mean we are going to use the mean() function. But it is not enough to specify the table inside the parentheses: we also have to specify which variable or column we want to use. This is done with the $ operator, something called subsetting (you can read more about subsetting methods here). Ok, let’s run it (to run a command, place the cursor on that line and press Ctrl+Enter or click Run at the top right of the script).

mean(penguins$bill_length_mm)
## [1] NA

Hey, what happened?! When we execute this command, we see that in the console the result is NA, that is, Not Available. In other words, the mean could not be calculated: the result of the mean is a missing value. There are many reasons why an operation can give this result, but luckily in this case the reason is quite simple: if we inspect the table, we see that in row 4 we have NA as the value of the beak length.

If you remember, in the previous post we saw the summary() function, which returned a summary of the descriptive statistics of the entire table, among which was the number of missing elements in each column (it also returns the mean and the median, two of the three measures we will see in this post, but let’s ignore that for now). There we see that there are missing values in most of the columns, so we have to do something.
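
To check a single column for missing values, a sketch like the following can help. It uses a made-up vector called lengths_mm (an assumption for illustration, standing in for a column such as penguins$bill_length_mm) together with the base R functions is.na(), sum() and which().

```r
# Made-up beak lengths with one missing value in position 4
lengths_mm <- c(39.1, 39.5, 40.3, NA, 36.7)

is.na(lengths_mm)                       # TRUE where a value is missing

n_missing <- sum(is.na(lengths_mm))     # how many values are missing: 1
missing_at <- which(is.na(lengths_mm))  # where the missing value is: 4

mean(lengths_mm)                        # NA, because one value is missing
```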

One of the alternatives is to exclude from the calculation of the average those observations (rows, little penguins) that do not have the value that we want to use, that is, to make a custom calculation. Many functions allow you to specify the way they are executed: for example, by default the mean is calculated with all the values in the column, but you can specify that missing values should not be taken into account. To find out if a function (in our case, mean()) can be customized, we can type ?mean in the console: that shows us the official documentation of the function. I have a duty to warn you that the official documentation is not always very friendly and easy to understand, but over time you get used to it.

The important thing is that one of the function’s arguments, na.rm, allows us to specify whether we want to remove the NA values, which is exactly our case. To turn that switch on, we add that argument to the function inside the parentheses, separated by a comma (here the order of the factors can alter the product, that is, the order of the arguments can matter, but we’re not going to dig too deep for now):

mean(penguins$bill_length_mm, na.rm = TRUE)
## [1] 43.92193

Now we do have the value of the average length of the beak: 43.92193. What a horrible number, right? We can round the value with the round() function that takes two arguments: the first is the object to round (in our case, the mean calculation) and the second is the number of decimal places we want.

round(mean(penguins$bill_length_mm, na.rm = TRUE), 2)
## [1] 43.92
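
As a brief sketch of why the order of the arguments can matter, compare what happens when arguments are passed by position and by name. The object names below (value, rounded, swapped, by_name) are assumptions for illustration only.

```r
value <- 43.92193

# By position: the first argument is the number to round, the second the decimals
rounded <- round(value, 2)   # 43.92

# Swapping the positions changes the meaning: now the number 2 is being rounded
swapped <- round(2, value)   # 2

# When arguments are written with their names, the order no longer matters
by_name <- mean(na.rm = TRUE, x = c(1, 2, NA))   # 1.5
```

Writing the argument names, as in na.rm = TRUE, is a little longer but makes the code easier to read and harder to get wrong.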

As you can see, functions can be nested, in the sense that the result of one function can be the argument of another. This can get hard to read very quickly, so always remember that functions in R are executed from the inside out (i.e., in the code above, the mean is calculated first, and the rounding is applied to that result).
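
If the nested version is hard to read, the same calculation can be written step by step by saving the intermediate result in an object. This is a sketch with made-up values; the names values, step_mean and step_rounded are assumptions for illustration.

```r
# Made-up values with one missing observation
values <- c(10.123, 20.456, NA)

# Nested: the inner function (mean) runs first, then round() uses its result
nested <- round(mean(values, na.rm = TRUE), 1)

# Step by step: the same two operations, one per line
step_mean <- mean(values, na.rm = TRUE)   # 15.2895
step_rounded <- round(step_mean, 1)       # 15.3

identical(nested, step_rounded)           # TRUE
```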

Well, then we have that, on average, the beaks of these penguins measure 43.92 mm. The mean is a very useful value, but one of its great disadvantages is that it is very sensitive to extreme values. Think of it this way: if I have a family that earns $100 a month and another that earns $1,000, saying that the families earn an average of $550 a month is misleading. And this distortion can occur even with a larger sample: if we have 4 families that earn $100 a month and one family that earns $1,500, the mean would say that the families earn an average of $380 a month, which is almost four times what the first four families really earn.
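
The family income example can be checked directly in R. In this sketch the object names two_families and five_families are assumptions for illustration; the amounts are the ones from the paragraph above.

```r
# One family earning $100 and another earning $1,000
two_families <- c(100, 1000)
mean(two_families)      # 550

# Four families earning $100 and one earning $1,500
five_families <- c(100, 100, 100, 100, 1500)
mean(five_families)     # 380: pulled up by the single extreme value
median(five_families)   # 100: what a typical family in this group earns
```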

It is to avoid these biases that we can use the other measures of central tendency to complement our analysis. The median orders the values from smallest to largest and looks for the value that is exactly in the middle:

median(penguins$bill_length_mm, na.rm = TRUE)
## [1] 44.45

In this case it is not necessary to round: the median is either one of the observed values or, when the number of values is even (as it is here), the average of the two middle ones, so it rarely has many decimals. We see that the median value (44.45) is slightly higher than the mean value (43.92), but the difference doesn’t seem large enough to attract attention.
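
A short sketch of how the median behaves with an odd and with an even number of values. The two vectors are made up for illustration and do not come from the penguins table.

```r
# Odd number of values: the median is the middle one after sorting
odd_count <- c(41.1, 39.5, 46.8)          # sorted: 39.5, 41.1, 46.8
median(odd_count)                         # 41.1

# Even number of values: the median is the average of the two middle ones
even_count <- c(41.1, 39.5, 46.8, 44.4)   # sorted: 39.5, 41.1, 44.4, 46.8
median(even_count)                        # (41.1 + 44.4) / 2 = 42.75
```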

Now, whoever knows something about sexual dimorphism might wonder if the length of the beak is the same in male penguins and in female penguins. For that, we can calculate the median (or any other function) for the two groups separately. We do this with the aggregate() function, whose arguments are the column we want to calculate, then the column that divides the observations (which should appear inside the list() function for some reasons that don’t matter now), and finally the function that we want to execute.

aggregate(x = penguins$bill_length_mm, by = list(penguins$sex), FUN = median)
##   Group.1    x
## 1  female 42.8
## 2    male 46.8

We then see that the median for females is 42.8, while for males it is 46.8. So we are already beginning to see certain differences that could explain the distance between the mean and the median. Could it be that there is more data in the table for one sex than for the other? If we go back to the table returned by summary(), we see that we have 168 records of male penguins and 165 of females, so the difference is very small.
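
To see how aggregate() groups the data, here is a sketch on a tiny made-up table. The data frame toy and its values are assumptions for illustration; the column names imitate the ones in penguins. It also uses table() as another way of counting how many rows each group has.

```r
# Made-up table: three females, two males and one penguin of unknown sex
toy <- data.frame(
  sex = c("female", "female", "female", "male", "male", NA),
  bill_length_mm = c(40.0, 42.0, 44.0, 45.0, 47.0, 50.0)
)

# Median beak length for each sex
medians <- aggregate(x = toy$bill_length_mm, by = list(toy$sex), FUN = median)
medians          # female: 42, male: 46

# Number of rows per group; rows with NA are left out by default
group_sizes <- table(toy$sex)
group_sizes      # female: 3, male: 2

# Adding useNA = "ifany" also counts the rows with missing sex
table(toy$sex, useNA = "ifany")
```

In this sketch the row with missing sex does not appear in the result of aggregate(), because rows with NA in the grouping column are left out of the result.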

Finally, we are left with the last of the measures of central tendency, the mode, which is the value that is most repeated in the table. Here we are faced with a problem that you probably did not know you had. Until now we have been using functions from what is called base R, that is, the R programming language without any kind of addition. However, as in any programming language, in R there are packages that provide additional functions and objects for different uses (in fact, we used a package to get our penguins dataset, I don’t know if you remember). But here’s the thing: in base R there is no function to calculate the mode (why? I don’t know). So we have two alternatives: either we create our own function that calculates the mode, or we look for a package that has such a function. Creating our own functions is a very useful thing, but it’s definitely outside the scope of these posts (although in the future, who knows).

To search for functions we can go to sites like RDocumentation and use their search engine. We found that there is a statip package that contains the mfv function that allows us to calculate the mode. We install it with the install.packages() function and it is ready to use. In this case, since we’re only interested in using that one function, we don’t need to load the entire package. So, we can use the symbol :: which is used to indicate the origin of the function:

# install.packages("statip")
statip::mfv(penguins$bill_length_mm, na_rm = TRUE)
## [1] 41.1
aggregate(x = penguins$bill_length_mm, by = list(penguins$sex), FUN = statip::mfv)
##   Group.1    x
## 1  female 46.5
## 2    male 41.1
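
As a side note on the :: notation, it works with any installed package, including the ones that come with R. A minimal sketch, using the median() function from the stats package (which R loads by default) and a made-up vector called values:

```r
values <- c(3, 1, 2)

# Saying explicitly which package the function comes from
with_prefix <- stats::median(values)

# Without the prefix: it works because stats is loaded when R starts
without_prefix <- median(values)

identical(with_prefix, without_prefix)   # TRUE
```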

With mfv(), missing values are removed with the na_rm argument (note the underscore, unlike na.rm in mean() and median()); this can be checked in the official documentation (by typing ?statip::mfv in the console). We then see that the most frequent value for beak length is 41.1, lower than both the mean and the median. However, when we calculate it by group, we see that the mode for females is 46.5 and for males it is 41.1. Quite different from what the median gave, right? Remember that in this table there are other groups that we are not taking into account, such as the penguin species and the island where each penguin lives. It is possible that these other grouping variables help to understand these differences. And you know what helps a lot to understand these things? Graphs. But that is the subject of the next post.


Closing

In this post we looked at three measures of central tendency: mean, median, and mode. In addition, we saw how to specify that we do not want to take missing values into account and how to perform the calculations separately by group. Finally, we went over a bit of how to work with packages. In the next post we are going to see a function to start making basic graphs.

As always, remember you can subscribe to my blog to stay updated, and if you have any questions, don’t hesitate to contact me. And if you like what I do, you can buy me a cafecito from Argentina or a kofi. See you next time!

Macarena Quiroga
Linguist/PhD student

I research language acquisition. I’m looking to deepen my knowledge of statistics and data science with R/RStudio. If you like what I do, you can buy me a coffee from Argentina, or a kofi from other countries. Subscribe to my blog here.
