Help! I have quantitative data, what do I do? (part 1) | Macarena Quiroga

Basic guide to start working with quantitative data, part 1. In this post, how to install R and RStudio, import data and inspect a table.

Image made by me with the {aRtsy} package

Did you start a new job and lead them to believe you had an average knowledge of statistics, when you had only seen it once in high school? Are you studying for a degree and facing your first analysis job? Do you come from the social sciences and a project involving quantitative analysis fell into your hands? Do not fear, this is your salvation. This is the first post in a new series that aims to guide people who have to learn to perform basic statistical analysis in R.

It’s not that I think there is no information about this on the internet: believe me, I know there is a lot. But much of that information is intended as statistical study material. And, let’s be honest, sometimes we just need to know that what we’re doing isn’t wrong, without necessarily fully understanding the mathematical models behind it. Should we know? Maybe yes, maybe no; it seems to me that it is open to debate. But beyond that, my intention is to throw in a couple of life preservers and, while we’re at it, reinforce my own knowledge on the subject.

This first post is dedicated to thinking about that first, very first approach to quantitative databases. Basic level: I opened the file and I don’t know what to do. We are going to mention some aspects to take into account when understanding our databases, such as the types of elements that we can have and how they are organized in the table. In the next few posts we are going to advance with some measures of central tendency and dispersion, which are the most important in descriptive analyses. Let us begin!


First step: load the table

Actually, I already started by lying to you: before you can look at the table, you have to have the table available. This involves loading or importing the data into your R environment. I’m going to assume you have R and RStudio installed; otherwise your step zero is going to be to install R and then install RStudio (in that order). When you can open RStudio and it works, you already have everything you need to work.

The first thing you are going to do is create a script, which is going to be the file where you are going to save your analysis, similar to a text document. You do this in RStudio in the top menu: File > New File > R Script. Once you open the script, you will see that the screen is divided into four large sections:

  • The script window in the upper left corner, where you will write your code. This is similar to a text document: you can save it and open it at a later time to rerun your analysis.

  • The console window, in the lower left corner. This is where you will see the results of the code you run. You can also write code directly in the console, but it is not saved. Also, it’s hard to find a previously executed command after you’ve been working for a long time.

  • In the upper right corner you will find the environment [Environment]. Here all the elements you create will be stored, such as your data tables. There are other tabs like History, Connections, etc.: we are not going to use them for now.

  • In the lower right corner you will find a panel where you will be able to browse your files [Files], where the graphs that you are going to make [Plots] will appear and where you will be able to search for information about the packages and functions that you are going to use [Help], among other things.

The arrangement of these four panels may vary, but that is a matter of personal preference. The important thing is that you recognize them and that you know they are there. Now, regarding how we are actually going to use them: we are always going to write the code in the script (not in the console), and to execute it we place the cursor on that line and press Ctrl + Enter (you can also click on the Run button in the upper right corner of the script window, but I find it awkward). The output of the command will appear in the console. You can try typing something simple like 10 + 20.
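As a minimal sketch of that first test, these lines can be pasted into the script and run one at a time. The specific numbers are only illustrative.

```r
# Place the cursor on a line and press Ctrl + Enter to run it
10 + 20

# R follows the usual order of operations: multiplication before addition
10 + 20 * 2

# Parentheses change that order
(10 + 20) * 2
```

Each result should show up in the console, right below the line that was executed.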

Now yes, the table. There are many different ways to import a table into your R environment. The easiest way is to go to the environment panel and click “Import Dataset”. It will open a series of options that depend on the type of file you have: if the extension of your file is .csv, choose “From Text (base)”; if it’s an .xls or .xlsx file, choose “From Excel” instead. Another alternative is to look for the file in the Files section: once you find it, click on it and two options will appear: “View File” and “Import Dataset”. I chose the second.

In any case, it will show you a preview of what the file will look like. After accepting you will see that a new element appears in the environment panel with the name of your file. And voilà! The data is now loaded.
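The same import can also be written as code in the script, which makes it easier to repeat later. What follows is only a sketch: the object name my_data and the tiny example file are made up for illustration, and the <- symbol it uses is explained further down in this post.

```r
# Build a tiny CSV file in a temporary location so this example runs anywhere.
# In practice you would point read.csv() at your own file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("species,island,body_mass_g",
    "Adelie,Torgersen,3750",
    "Adelie,Torgersen,3800",
    "Adelie,Torgersen,3250"),
  csv_path
)

# Import the file into an object called my_data (a hypothetical name)
my_data <- read.csv(csv_path)

# Print the table in the console
my_data

# For an Excel file, the readxl package offers read_excel(), for example:
# my_data <- readxl::read_excel("my_file.xlsx")  # "my_file.xlsx" is a placeholder
```

After running it, my_data should appear in the environment panel, just as it does when using the “Import Dataset” button.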

For this post I am going to use a dataset on penguins from the Palmer Archipelago (Antarctica), which is a well-known dataset for learning R. If you do not have your own data or if you want your first approach to be more controlled, you can follow this post with this same dataset. To install the package and load the data, copy this code into your script and run it (to run it, place the cursor on each line and press Ctrl + Enter or click “Run”):

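install.packages("palmerpenguins")  # download the package (only needed once)
library(palmerpenguins)             # load the package for this session
penguins                            # print the dataset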
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Wow, so much information. Let’s go little by little: in the first line we use the install.packages() function to install the package. What it does is download the package to your computer, so you can use it when you need it. The second line uses the library() function to load the package into your current session. Think about the difference between installing and loading a package in the same way that you install and open a program: you can download Word, but to write a text you first have to open it. In this context, “package” and “library” are used as synonyms. Important: the next time you use this package you don’t need to reinstall it, so don’t run the first line again, only the second one.
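If you are not sure whether a package is already installed, you can ask R before downloading it again. This is a small sketch using the base function requireNamespace(); the object name already_installed is just an illustrative choice.

```r
# requireNamespace() returns TRUE when a package is already installed
already_installed <- requireNamespace("palmerpenguins", quietly = TRUE)
already_installed

# Only download the package if it is missing (needs an internet connection).
# Remove the leading "#" on the next line to use it:
# if (!already_installed) install.packages("palmerpenguins")
```

If the result is TRUE, running library(palmerpenguins) is enough.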

In the third line we print the “penguins” dataset that is inside the package we just loaded. Note that when you execute that line (the single word), a table with a lot of information appears in the console. It works that easily because the package contains a table with exactly that name; if you run another word, like “house”, you will get the following error: Error: object 'house' not found. It is very important to get used to reading the errors that R returns, because that is how we can understand what is not working.

Now, in order to use that table we need to save it with a name; in other words, we need to create an object with that information in order to manipulate it. For that we are going to use the assignment symbol <- as follows:

penguins <- penguins

The word on the right of the arrow (next to the - sign) is the object’s content, while the word on the left (next to the <) is going to be the name of the object. You can also create the object with the elements inverted:

penguins -> ping2

But this form is neither the most common nor the most comfortable. The important thing, though, is to identify the assignment symbol and to see that executing that command makes the object appear in the Environment, the pane in the top right corner. We can inspect the object by clicking on it or by running the following command:

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
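To practice the assignment symbol on something smaller than a whole table, here is a short sketch with plain numbers. The names total and total_reversed are invented for the example.

```r
# Name on the left, content on the right
total <- 10 + 20

# Same idea with the arrow reversed: content on the left, name on the right
10 + 20 -> total_reversed

# Typing the name of an object prints its content
total
total_reversed

# Both objects hold the same value
identical(total, total_reversed)
```

Both objects should show up in the Environment pane as soon as the assignment lines are run.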


Second step: inspect the table

This data table, which we can call a data frame, is made up of 8 columns and 344 rows, each of which corresponds to an observation. In this case, each observation corresponds to a penguin. This is not the only way to organize the information, but it is the most common in this type of analysis and the one that will make your life easier. However, it is very likely that your own data table will not have this format; later we will see what to do in that case.
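You can also ask R directly how many rows and columns a table has. The sketch below uses a three-row toy table (toy_penguins, a made-up name with values copied from the first rows shown above) so that it runs without any package.

```r
# A toy data frame with three observations (rows) and three variables (columns)
toy_penguins <- data.frame(
  species     = c("Adelie", "Adelie", "Adelie"),
  island      = c("Torgersen", "Torgersen", "Torgersen"),
  body_mass_g = c(3750L, 3800L, 3250L)
)

nrow(toy_penguins)   # number of rows (observations)
ncol(toy_penguins)   # number of columns (variables)
dim(toy_penguins)    # both at once: rows first, then columns
names(toy_penguins)  # the column names
```

With the real data, dim(penguins) should report the 344 rows and 8 columns mentioned above.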

The R language has several functions that give you a global idea of the content of the table without going through it case by case. To see the structure, we can use the str() function:

str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

This function gives a rough description of the type of data in each column. We can see, for example, that the first two columns, species and island, and the penultimate one, sex, hold data of the factor type, that is, categorical values that allow the observations to be sorted into groups. Then we have a series of columns with numerical values for bill length and depth, flipper length, body mass, and the year those penguins were observed. However, you will see that some are of type num and some are of type int: both are numbers, but num values can have decimals, while int values are integers (i.e. whole numbers without decimals).
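To see those three types side by side, here is a small sketch with a toy table. The names toy_penguins and col_types are invented for the example, and the values are copied from the first rows of the real data.

```r
# A toy data frame with one column of each type discussed above
toy_penguins <- data.frame(
  species           = factor(c("Adelie", "Adelie", "Adelie")),  # categorical
  bill_length_mm    = c(39.1, 39.5, 40.3),                      # numbers with decimals
  flipper_length_mm = c(181L, 186L, 195L)                       # integers (note the L)
)

# class() names the type of one object; sapply() applies it to every column
col_types <- sapply(toy_penguins, class)
col_types

# levels() lists the categories of a factor column
levels(toy_penguins$species)
```

Note that class() spells the types out as "numeric" and "integer", while str() abbreviates them as num and int.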

Another useful function for these first moments is summary():

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

This function computes a set of descriptive statistics for the variables (columns) in the table. If you are reading this, you may not fully understand these results or know what to do with them: do not worry, they will be the subject of the next post. What matters now is that you know that this function returns a general overview that helps you begin to understand the data a little more.


Closing

In this post we covered the preliminaries of any quantitative analysis: we installed R and RStudio, imported the data and began to look at what is in the table. That’s already a lot! In the next post we are going to start working with descriptive statistics.

As always, remember you can subscribe to my blog to stay updated, and if you have any questions, don’t hesitate to contact me. And if you like what I do, you can buy me a cafecito from Argentina or a kofi.

Macarena Quiroga
Linguist/PhD student

I research language acquisition. I’m looking to deepen my knowledge of statistics and data science with R/RStudio. If you like what I do, you can buy me a coffee from Argentina, or a kofi from other countries. Subscribe to my blog here.
