<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>data science | Macarena Quiroga</title><link>https://macarenaquiroga.com/en/category/data-science/</link><atom:link href="https://macarenaquiroga.com/en/category/data-science/index.xml" rel="self" type="application/rss+xml"/><description>data science</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>@ 2026 Macarena Quiroga</copyright><lastBuildDate>Sun, 02 Jan 2022 00:00:00 +0000</lastBuildDate><image><url>https://macarenaquiroga.com/media/icon_hu663b3345a4e19f2cbd1c27fe4eae8279_20867_512x512_fill_lanczos_center_2.png</url><title>data science</title><link>https://macarenaquiroga.com/en/category/data-science/</link></image><item><title>Analysis of Starbucks Beverages</title><link>https://macarenaquiroga.com/en/post/starbucks-beverages-data-science/</link><pubDate>Sun, 02 Jan 2022 00:00:00 +0000</pubDate><guid>https://macarenaquiroga.com/en/post/starbucks-beverages-data-science/</guid><description>
&lt;p>The end of this year was quite busy, between organizing the &lt;a href="https://ciipmeconicet.wixsite.com/ciipme-flacso-2021">VI Meeting of Researchers in Development, Learning and Education&lt;/a> and writing two (!) papers. And that is without even counting the pandemic.&lt;/p>
&lt;p>I had been wanting to do something for #TidyTuesday, but the latest datasets didn’t motivate me much. Don’t get me wrong: I love the &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-12-14/readme.md">Spice Girls&lt;/a>, and &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-12-07/readme.md">spiders&lt;/a> are nice (although I can’t say I’m very interested in &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-11-30/readme.md">cricket&lt;/a>), but the kind of data they offered wasn’t particularly engaging for me. While the project is mainly about practicing visualization tools, what I need most right now is to improve my statistical skills.&lt;/p>
&lt;p>That’s why I chose this week: data on Starbucks beverages. The dataset has a lot of numerical information that will allow me to reproduce a basic statistical analysis. Let’s start by importing and organizing the data a bit.&lt;/p>
&lt;pre class="r">&lt;code>tuesdata &amp;lt;- tidytuesdayR::tt_load(&amp;#39;2021-12-21&amp;#39;)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## ---- Compiling #TidyTuesday Information for 2021-12-21 ----
## --- There is 1 file available ---
##
##
## ── Downloading files ───────────────────────────────────────────────────────────
##
## 1 of 1: &amp;quot;starbucks.csv&amp;quot;&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>starbucks &amp;lt;- tuesdata$starbucks
starbucks$product_name &amp;lt;- as.factor(starbucks$product_name)
starbucks$size &amp;lt;- as.factor(starbucks$size)
starbucks$milk &amp;lt;- as.factor(starbucks$milk)
starbucks$whip &amp;lt;- as.factor(starbucks$whip)
starbucks$trans_fat_g &amp;lt;- as.numeric(starbucks$trans_fat_g)
starbucks$fiber_g &amp;lt;- as.numeric(starbucks$fiber_g)&lt;/code>&lt;/pre>
&lt;div id="descriptive-analysis" class="section level1">
&lt;h1>Descriptive Analysis&lt;/h1>
&lt;p>To understand a bit more about how these beverages are composed, we will perform a descriptive analysis. We have two size measures (&lt;code>size&lt;/code> and &lt;code>serv_size_m_l&lt;/code>), calories, total fat (plus saturated and trans fat), total carbohydrates, fiber, and sugar in grams, and cholesterol, sodium, and caffeine in milligrams.&lt;/p>
&lt;p>The first thing we’re going to do is see how the beverages are distributed according to the quantity of each component. The problem is that the components have very different absolute values, so they are hard to tell apart when plotted on a shared scale. I will therefore give each component its own panel with free scales and keep the separation by size, working only with the standard sizes (short, tall, grande, venti).&lt;/p>
&lt;pre class="r">&lt;code>library(tidyverse)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.6.0
## ✔ ggplot2 3.5.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (&amp;lt;http://conflicted.r-lib.org/&amp;gt;) to force all conflicts to become errors&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>starbucks_longer &amp;lt;- starbucks %&amp;gt;%
# Convert milligrams to grams (1 g = 1000 mg)
mutate(caffeine_g = caffeine_mg / 1000,
cholesterol_g = cholesterol_mg / 1000,
sodium_g = sodium_mg / 1000) %&amp;gt;%
select(-c(caffeine_mg, cholesterol_mg, sodium_mg)) %&amp;gt;%
pivot_longer(cols = c(5:15),
names_to = &amp;quot;character&amp;quot;,
values_to = &amp;quot;n&amp;quot;)
starbucks_longer$size &amp;lt;- factor(starbucks_longer$size, levels = c(&amp;quot;short&amp;quot;, &amp;quot;tall&amp;quot;, &amp;quot;grande&amp;quot;, &amp;quot;venti&amp;quot;))
starbucks_longer %&amp;gt;%
filter(character %in% c(&amp;quot;calories&amp;quot;, &amp;quot;total_fat_g&amp;quot;, &amp;quot;sodium_g&amp;quot;, &amp;quot;cholesterol_g&amp;quot;, &amp;quot;fiber_g&amp;quot;, &amp;quot;sugar_g&amp;quot;, &amp;quot;caffeine_g&amp;quot;)) %&amp;gt;%
filter(size %in% c(&amp;quot;short&amp;quot;, &amp;quot;tall&amp;quot;, &amp;quot;venti&amp;quot;, &amp;quot;grande&amp;quot;)) %&amp;gt;%
ggplot()+
geom_density(mapping = aes(n, color = size))+
facet_wrap(~character, scales = &amp;quot;free&amp;quot;)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://macarenaquiroga.com/en/post/starbucks-beverages-data-science/index_files/figure-html/unnamed-chunk-2-1.png" width="672" />&lt;/p>
&lt;p>Broadly speaking, the graphs follow a logical pattern: larger sizes have flatter distributions, while smaller sizes have more pronounced peaks. The caffeine graph particularly catches my attention: the “short” and “tall” sizes have very similar peaks (a lot of caffeine for a small size, or little caffeine for a larger one?). It is possible, however, that not all beverages come in all four sizes, and that this biases the graphs. We will see this later.&lt;/p>
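&lt;p>The suspicion that not every beverage is sold in every size can be checked with a quick count. Below is a minimal sketch on a made-up toy table (the rows of &lt;code>toy&lt;/code> and the name &lt;code>per_size&lt;/code> are hypothetical, not taken from the dataset); swapping in &lt;code>starbucks&lt;/code> should give the real counts.&lt;/p>
&lt;pre class="r">&lt;code>library(dplyr)

# Hypothetical stand-in for `starbucks`: three products with uneven size coverage
toy &amp;lt;- tibble::tibble(
  product_name = c(&amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;C&amp;quot;),
  size = c(&amp;quot;short&amp;quot;, &amp;quot;tall&amp;quot;, &amp;quot;grande&amp;quot;, &amp;quot;tall&amp;quot;, &amp;quot;venti&amp;quot;, &amp;quot;short&amp;quot;)
)

# Count how many distinct products are offered in each size
per_size &amp;lt;- toy %&amp;gt;%
  distinct(product_name, size) %&amp;gt;%
  count(size, name = &amp;quot;n_products&amp;quot;)
per_size&lt;/code>&lt;/pre>
&lt;p>If the counts differ a lot between sizes, the density curves are being drawn from different sets of beverages, not just from different volumes of the same ones.&lt;/p>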
&lt;pre class="r">&lt;code>starbucks_longer %&amp;gt;%
filter(character %in% c(&amp;quot;calories&amp;quot;, &amp;quot;total_fat_g&amp;quot;, &amp;quot;sodium_g&amp;quot;, &amp;quot;cholesterol_g&amp;quot;, &amp;quot;fiber_g&amp;quot;, &amp;quot;sugar_g&amp;quot;, &amp;quot;caffeine_g&amp;quot;)) %&amp;gt;%
filter(size %in% c(&amp;quot;short&amp;quot;, &amp;quot;tall&amp;quot;, &amp;quot;venti&amp;quot;, &amp;quot;grande&amp;quot;)) %&amp;gt;%
ggplot()+
geom_boxplot(mapping = aes(x = size, y = n, fill = size))+
facet_wrap(~character, scales = &amp;quot;free&amp;quot;)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://macarenaquiroga.com/en/post/starbucks-beverages-data-science/index_files/figure-html/unnamed-chunk-3-1.png" width="672" />&lt;/p>
&lt;p>The boxplot shows the differences between sizes more clearly, although the first two sizes still look similar in both caffeine and fiber content. A good question for later: is there a statistically significant difference in the amount of caffeine between beverages of the first two sizes?&lt;/p>
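&lt;p>One way to approach that question is a two-sample test. The sketch below uses a Wilcoxon rank-sum test, which is my own choice here because the densities above look skewed; the caffeine values are invented for illustration and the vector names are hypothetical. Note that it treats the two groups as independent, which ignores the fact that many beverages appear in both sizes.&lt;/p>
&lt;pre class="r">&lt;code># Hypothetical caffeine values (mg) for a handful of short and tall drinks
short_caf &amp;lt;- c(75, 90, 130, 155, 190)
tall_caf &amp;lt;- c(85, 150, 195, 235, 260)

# Rank-based comparison of the two groups; exact = FALSE uses the
# normal approximation, which also copes with ties in real data
res &amp;lt;- wilcox.test(short_caf, tall_caf, exact = FALSE)
res$p.value&lt;/code>&lt;/pre>
&lt;p>On the real data, the two vectors would come from filtering &lt;code>starbucks&lt;/code> by size and pulling out &lt;code>caffeine_mg&lt;/code>. If the same beverages are matched across sizes, the paired version (&lt;code>paired = TRUE&lt;/code>) would be the more appropriate comparison.&lt;/p>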
&lt;p>Now, which beverages have the highest amount of each element? To compare, we will stick to the smallest versions (“short”), without considering other characteristics:&lt;/p>
&lt;pre class="r">&lt;code>library(gt)&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>library(gtExtras)
starbucks %&amp;gt;%
filter(size == &amp;quot;short&amp;quot;) %&amp;gt;%
arrange(desc(caffeine_mg)) %&amp;gt;%
select(product_name, caffeine_mg) %&amp;gt;%
unique() %&amp;gt;%
slice_head(n =10) %&amp;gt;%
gt() %&amp;gt;%
tab_options(
column_labels.text_transform = &amp;quot;uppercase&amp;quot;
) %&amp;gt;%
tab_header(title = &amp;quot;Top 10 Products with the Most Caffeine&amp;quot;) %&amp;gt;%
gt_theme_nytimes() %&amp;gt;%
tab_source_note(source_note = &amp;quot;Source: Starbucks Coffee Company | Made by @_msquiroga for #TidyTuesday&amp;quot;) %&amp;gt;%
cols_align(&amp;quot;left&amp;quot;)&lt;/code>&lt;/pre>
&lt;div id="adkllvgfty" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
&lt;style>@import url("https://fonts.googleapis.com/css2?family=Source+Sans+Pro:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap");
@import url("https://fonts.googleapis.com/css2?family=Libre+Franklin:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap");
#adkllvgfty table {
font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
}
#adkllvgfty thead, #adkllvgfty tbody, #adkllvgfty tfoot, #adkllvgfty tr, #adkllvgfty td, #adkllvgfty th {
border-style: none;
}
#adkllvgfty p {
margin: 0;
padding: 0;
}
#adkllvgfty .gt_table {
display: table;
border-collapse: collapse;
line-height: normal;
margin-left: auto;
margin-right: auto;
color: #333333;
font-size: 16px;
font-weight: normal;
font-style: normal;
background-color: #FFFFFF;
width: auto;
border-top-style: none;
border-top-width: 2px;
border-top-color: #A8A8A8;
border-right-style: none;
border-right-width: 2px;
border-right-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #A8A8A8;
border-left-style: none;
border-left-width: 2px;
border-left-color: #D3D3D3;
}
#adkllvgfty .gt_caption {
padding-top: 4px;
padding-bottom: 4px;
}
#adkllvgfty .gt_title {
color: #333333;
font-size: 125%;
font-weight: initial;
padding-top: 4px;
padding-bottom: 4px;
padding-left: 5px;
padding-right: 5px;
border-bottom-color: #FFFFFF;
border-bottom-width: 0;
}
#adkllvgfty .gt_subtitle {
color: #333333;
font-size: 85%;
font-weight: initial;
padding-top: 3px;
padding-bottom: 5px;
padding-left: 5px;
padding-right: 5px;
border-top-color: #FFFFFF;
border-top-width: 0;
}
#adkllvgfty .gt_heading {
background-color: #FFFFFF;
text-align: left;
border-bottom-color: #FFFFFF;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
}
#adkllvgfty .gt_bottom_border {
border-bottom-style: none;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
}
#adkllvgfty .gt_col_headings {
border-top-style: none;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: none;
border-bottom-width: 1px;
border-bottom-color: #334422;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
}
#adkllvgfty .gt_col_heading {
color: #333333;
background-color: #FFFFFF;
font-size: 12px;
font-weight: normal;
text-transform: uppercase;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
vertical-align: bottom;
padding-top: 5px;
padding-bottom: 6px;
padding-left: 5px;
padding-right: 5px;
overflow-x: hidden;
}
#adkllvgfty .gt_column_spanner_outer {
color: #333333;
background-color: #FFFFFF;
font-size: 12px;
font-weight: normal;
text-transform: uppercase;
padding-top: 0;
padding-bottom: 0;
padding-left: 4px;
padding-right: 4px;
}
#adkllvgfty .gt_column_spanner_outer:first-child {
padding-left: 0;
}
#adkllvgfty .gt_column_spanner_outer:last-child {
padding-right: 0;
}
#adkllvgfty .gt_column_spanner {
border-bottom-style: none;
border-bottom-width: 1px;
border-bottom-color: #334422;
vertical-align: bottom;
padding-top: 5px;
padding-bottom: 5px;
overflow-x: hidden;
display: inline-block;
width: 100%;
}
#adkllvgfty .gt_spanner_row {
border-bottom-style: hidden;
}
#adkllvgfty .gt_group_heading {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: initial;
text-transform: inherit;
border-top-style: solid;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
vertical-align: middle;
text-align: left;
}
#adkllvgfty .gt_empty_group_heading {
padding: 0.5px;
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: initial;
border-top-style: solid;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
vertical-align: middle;
}
#adkllvgfty .gt_from_md > :first-child {
margin-top: 0;
}
#adkllvgfty .gt_from_md > :last-child {
margin-bottom: 0;
}
#adkllvgfty .gt_row {
padding-top: 7px;
padding-bottom: 7px;
padding-left: 5px;
padding-right: 5px;
margin: 10px;
border-top-style: solid;
border-top-width: 1px;
border-top-color: #D3D3D3;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
vertical-align: middle;
overflow-x: hidden;
}
#adkllvgfty .gt_stub {
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: initial;
text-transform: inherit;
border-right-style: solid;
border-right-width: 2px;
border-right-color: #D3D3D3;
padding-left: 5px;
padding-right: 5px;
}
#adkllvgfty .gt_stub_row_group {
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: initial;
text-transform: inherit;
border-right-style: solid;
border-right-width: 2px;
border-right-color: #D3D3D3;
padding-left: 5px;
padding-right: 5px;
vertical-align: top;
}
#adkllvgfty .gt_row_group_first td {
border-top-width: 2px;
}
#adkllvgfty .gt_row_group_first th {
border-top-width: 2px;
}
#adkllvgfty .gt_summary_row {
color: #333333;
background-color: #FFFFFF;
text-transform: inherit;
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
}
#adkllvgfty .gt_first_summary_row {
border-top-style: solid;
border-top-color: #D3D3D3;
}
#adkllvgfty .gt_first_summary_row.thick {
border-top-width: 2px;
}
#adkllvgfty .gt_last_summary_row {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
}
#adkllvgfty .gt_grand_summary_row {
color: #333333;
background-color: #FFFFFF;
text-transform: inherit;
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
}
#adkllvgfty .gt_first_grand_summary_row {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
border-top-style: double;
border-top-width: 6px;
border-top-color: #D3D3D3;
}
#adkllvgfty .gt_last_grand_summary_row_top {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
border-bottom-style: double;
border-bottom-width: 6px;
border-bottom-color: #D3D3D3;
}
#adkllvgfty .gt_striped {
background-color: rgba(128, 128, 128, 0.05);
}
#adkllvgfty .gt_table_body {
border-top-style: none;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #FFFFFF;
}
#adkllvgfty .gt_footnotes {
color: #333333;
background-color: #FFFFFF;
border-bottom-style: none;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
border-left-style: none;
border-left-width: 2px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 2px;
border-right-color: #D3D3D3;
}
#adkllvgfty .gt_footnote {
margin: 0px;
font-size: 90%;
padding-top: 4px;
padding-bottom: 4px;
padding-left: 5px;
padding-right: 5px;
}
#adkllvgfty .gt_sourcenotes {
color: #333333;
background-color: #FFFFFF;
border-bottom-style: none;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
border-left-style: none;
border-left-width: 2px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 2px;
border-right-color: #D3D3D3;
}
#adkllvgfty .gt_sourcenote {
font-size: 90%;
padding-top: 4px;
padding-bottom: 4px;
padding-left: 5px;
padding-right: 5px;
}
#adkllvgfty .gt_left {
text-align: left;
}
#adkllvgfty .gt_center {
text-align: center;
}
#adkllvgfty .gt_right {
text-align: right;
font-variant-numeric: tabular-nums;
}
#adkllvgfty .gt_font_normal {
font-weight: normal;
}
#adkllvgfty .gt_font_bold {
font-weight: bold;
}
#adkllvgfty .gt_font_italic {
font-style: italic;
}
#adkllvgfty .gt_super {
font-size: 65%;
}
#adkllvgfty .gt_footnote_marks {
font-size: 75%;
vertical-align: 0.4em;
position: initial;
}
#adkllvgfty .gt_asterisk {
font-size: 100%;
vertical-align: 0;
}
#adkllvgfty .gt_indent_1 {
text-indent: 5px;
}
#adkllvgfty .gt_indent_2 {
text-indent: 10px;
}
#adkllvgfty .gt_indent_3 {
text-indent: 15px;
}
#adkllvgfty .gt_indent_4 {
text-indent: 20px;
}
#adkllvgfty .gt_indent_5 {
text-indent: 25px;
}
#adkllvgfty .katex-display {
display: inline-flex !important;
margin-bottom: 0.75em !important;
}
#adkllvgfty div.Reactable > div.rt-table > div.rt-thead > div.rt-tr.rt-tr-group-header > div.rt-th-group:after {
height: 0px !important;
}
&lt;/style>
&lt;table class="gt_table" data-quarto-disable-processing="false" data-quarto-bootstrap="false">
&lt;thead>
&lt;tr class="gt_heading">
&lt;td colspan="2" class="gt_heading gt_title gt_font_normal gt_bottom_border" style="font-family: 'Libre Franklin'; font-weight: 800;">Top 10 Products with the Most Caffeine&lt;/td>
&lt;/tr>
&lt;tr class="gt_col_headings">
&lt;th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" style="color: #A9A9A9; font-family: 'Source Sans Pro'; text-transform: uppercase;" scope="col" id="product_name">product_name&lt;/th>
&lt;th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" style="color: #A9A9A9; font-family: 'Source Sans Pro'; text-transform: uppercase;" scope="col" id="caffeine_mg">caffeine_mg&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody class="gt_table_body">
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">Clover Brewed Coffee - Dark Roast&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">190&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">brewed coffee - True North Blend Blonde roast&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">180&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">Clover Brewed Coffee - Medium Roast&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">170&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">brewed coffee - medium roast&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">155&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">Clover Brewed Coffee - Light Roast&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">155&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">Latte Macchiato&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">150&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">brewed coffee - dark roast&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">130&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">Flat White&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">130&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">Caffè Mocha&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">90&lt;/td>&lt;/tr>
&lt;tr>&lt;td headers="product_name" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">Caffè Misto&lt;/td>
&lt;td headers="caffeine_mg" class="gt_row gt_left" style="font-family: 'Source Sans Pro'; font-weight: 400;">75&lt;/td>&lt;/tr>
&lt;/tbody>
&lt;tfoot class="gt_sourcenotes">
&lt;tr>
&lt;td class="gt_sourcenote" colspan="2">Source: Starbucks Coffee Company | Made by @_msquiroga for #TidyTuesday&lt;/td>
&lt;/tr>
&lt;/tfoot>
&lt;/table>
&lt;/div>
&lt;p>So, friends: if you need a little help to start your day, you can stop by your nearest Starbucks and order a &lt;em>Clover Brewed Coffee - Dark Roast&lt;/em> (if it exists where you live, of course: I don’t think it does in my country).&lt;/p>
&lt;/div>
&lt;div id="regression" class="section level1">
&lt;h1>Regression&lt;/h1>
&lt;p>I started this post by stating that I wanted to practice my statistical skills, but I’m no longer so sure that this dataset will give me many opportunities for major operations. For now, I’ll stick with this question: which aspects of a beverage’s composition contribute most to its total fat content? Of course, nutrition already suggests an answer: the presence of milk and cream, along with the size of the beverage, should be the three most significant predictors. But let’s see if the statistics back us up.&lt;/p>
&lt;pre class="r">&lt;code>starbucks &amp;lt;- starbucks %&amp;gt;%
filter(!size %in% c(&amp;quot;1 scoop&amp;quot;, &amp;quot;1 shot&amp;quot;))
fat_reg &amp;lt;- lm(total_fat_g ~ size + milk + whip + serv_size_m_l + calories + cholesterol_mg + sodium_mg + fiber_g + sugar_g + caffeine_mg, data = starbucks)
summary(fat_reg)&lt;/code>&lt;/pre>
&lt;pre>&lt;code>##
## Call:
## lm(formula = total_fat_g ~ size + milk + whip + serv_size_m_l +
## calories + cholesterol_mg + sodium_mg + fiber_g + sugar_g +
## caffeine_mg, data = starbucks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5403 -0.5246 0.0487 0.5103 7.6863
##
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)
## (Intercept) -0.2538835 0.4009213 -0.633 0.5267
## sizegrande -1.2414507 0.6557588 -1.893 0.0586 .
## sizequad -0.7248447 0.5507526 -1.316 0.1884
## sizeshort -0.2524751 0.4828904 -0.523 0.6012
## sizesolo 0.4126085 0.5473979 0.754 0.4511
## sizetall -0.5817991 0.5569494 -1.045 0.2964
## sizetrenta -1.5889205 1.0794471 -1.472 0.1413
## sizetriple -0.3613976 0.5473937 -0.660 0.5093
## sizeventi -1.8653208 0.8295612 -2.249 0.0247 *
## milk1 -2.5098877 0.1339820 -18.733 &amp;lt; 2e-16 ***
## milk2 -1.5945736 0.1510997 -10.553 &amp;lt; 2e-16 ***
## milk3 -0.9801896 0.1475690 -6.642 4.81e-11 ***
## milk4 0.9084118 0.1445475 6.285 4.70e-10 ***
## milk5 -0.6896911 0.1582705 -4.358 1.43e-05 ***
## whip1 1.8390692 0.1669839 11.013 &amp;lt; 2e-16 ***
## serv_size_m_l 0.0021254 0.0011338 1.875 0.0611 .
## calories 0.0688260 0.0013809 49.842 &amp;lt; 2e-16 ***
## cholesterol_mg 0.0416803 0.0064425 6.470 1.46e-10 ***
## sodium_mg 0.0047823 0.0007800 6.131 1.20e-09 ***
## fiber_g -0.5514295 0.0284809 -19.361 &amp;lt; 2e-16 ***
## sugar_g -0.2720957 0.0062719 -43.383 &amp;lt; 2e-16 ***
## caffeine_mg -0.0003979 0.0004756 -0.837 0.4029
## ---
## Signif. codes: 0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
##
## Residual standard error: 1.022 on 1122 degrees of freedom
## Multiple R-squared: 0.9712, Adjusted R-squared: 0.9707
## F-statistic: 1801 on 21 and 1122 DF, p-value: &amp;lt; 2.2e-16&lt;/code>&lt;/pre>
&lt;p>The first thing I always check is whether the model as a whole is significant: we can see this in the &lt;em>p-value&lt;/em> of the F-statistic at the end of the summary. Indeed, the model is significant; in other words, the predictors taken together explain more of the variation in total fat than a model with only an intercept would. Furthermore, the &lt;em>R&lt;/em>&lt;sup>2&lt;/sup> is very high (0.97).&lt;/p>
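&lt;p>The overall &lt;em>p-value&lt;/em> is printed by &lt;code>summary()&lt;/code> but not stored directly, so it has to be recomputed from the F-statistic if you want it as a number. Here is a small sketch of how that could be done; it uses R’s built-in &lt;code>mtcars&lt;/code> data as a stand-in for &lt;code>starbucks&lt;/code>, and the names &lt;code>fit&lt;/code> and &lt;code>p_overall&lt;/code> are just illustrative.&lt;/p>
&lt;pre class="r">&lt;code># Stand-in model on a built-in dataset (not the Starbucks model)
fit &amp;lt;- lm(mpg ~ wt + hp, data = mtcars)
s &amp;lt;- summary(fit)

# s$fstatistic holds the F value and its two degrees of freedom
f &amp;lt;- s$fstatistic
p_overall &amp;lt;- pf(f[[&amp;quot;value&amp;quot;]], f[[&amp;quot;numdf&amp;quot;]], f[[&amp;quot;dendf&amp;quot;]],
                lower.tail = FALSE)

# Collect the headline numbers in one named vector
c(r_squared = s$r.squared,
  adj_r_squared = s$adj.r.squared,
  p_overall = p_overall)&lt;/code>&lt;/pre>
&lt;p>The same three lines applied to &lt;code>fat_reg&lt;/code> should reproduce the values shown at the bottom of the summary above.&lt;/p>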
&lt;p>Now, let’s look at the variables. Among the sizes, only “venti”, the largest of the four standard sizes, turned out to be a significant predictor (each size coefficient is a contrast with the baseline size, not with the next size down). This surprised me, because I had predicted that as size increased, the total fat content would also increase, but it’s possible that when controlling for other variables, size no longer plays an important role (perhaps because the beverage composition changes? because everything always has a lot of fat?).&lt;/p>
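&lt;p>A note on reading the size coefficients: the summary has no row for “doppio”, which suggests it is the baseline level, since &lt;code>as.factor()&lt;/code> orders levels alphabetically and &lt;code>lm()&lt;/code> uses the first remaining level as reference. A minimal sketch of how the baseline could be inspected and changed, using a toy factor with hypothetical values:&lt;/p>
&lt;pre class="r">&lt;code># Toy factor with a few of the size labels (values are illustrative)
size &amp;lt;- factor(c(&amp;quot;doppio&amp;quot;, &amp;quot;short&amp;quot;, &amp;quot;tall&amp;quot;, &amp;quot;grande&amp;quot;, &amp;quot;venti&amp;quot;, &amp;quot;short&amp;quot;, &amp;quot;tall&amp;quot;))

# Levels are alphabetical, so the first one acts as the baseline in lm()
levels(size)

# Make the smallest standard size the baseline instead
size_short_ref &amp;lt;- relevel(size, ref = &amp;quot;short&amp;quot;)
levels(size_short_ref)&lt;/code>&lt;/pre>
&lt;p>With “short” as the reference, each size coefficient would read as “difference from a short beverage”, which is closer to the question of whether fat grows with size.&lt;/p>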
&lt;p>The presence of milk and cream was, as I had predicted, a significant predictor in all cases. This is, of course, logical from a nutritional point of view. The same applies to the calorie count: as far as I understand, calories are a measure of the energy provided by fats, proteins, and carbohydrates. Therefore, while it is logical that calories turned out to be a significant predictor, from a theoretical standpoint they should not have been in the model in the first place. Cholesterol was also a significant predictor, which is logical because it is a lipid that tends to accompany fat.&lt;/p>
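&lt;p>To make the circularity concrete, here is a small arithmetic sketch using the general Atwater factors (roughly 9 kcal per gram of fat and 4 kcal per gram of carbohydrate or protein). The beverage below is hypothetical, and real labels may round or use slightly different factors.&lt;/p>
&lt;pre class="r">&lt;code># General Atwater factors, in kcal per gram
kcal_per_g &amp;lt;- c(fat = 9, carbs = 4, protein = 4)

# Hypothetical beverage: grams of each macronutrient
grams &amp;lt;- c(fat = 5, carbs = 20, protein = 3)

# Calories are (approximately) a weighted sum that already includes fat
calories &amp;lt;- sum(kcal_per_g * grams)   # 45 + 80 + 12 = 137

# Share of the calories that comes from fat
fat_share &amp;lt;- unname(kcal_per_g[&amp;quot;fat&amp;quot;] * grams[&amp;quot;fat&amp;quot;] / calories)
fat_share&lt;/code>&lt;/pre>
&lt;p>Because fat is one of the terms that add up to calories, a model that holds calories fixed can only raise sugar by lowering something else. That may help explain why sugar shows up with a negative coefficient in the summary above.&lt;/p>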
&lt;p>What strikes me is that sodium, fiber, and sugar were also statistically significant predictors. Again, my basic nutritional knowledge tells me that these elements should not contribute to total fat content, but it’s possible that the ingredients that provide these elements also contain high levels of fats. It is therefore crucial to thoroughly understand the elements we work with when proposing a statistical model.&lt;/p>
&lt;p>Finally, we can analyze the model’s residuals to get a clearer picture.&lt;/p>
&lt;pre class="r">&lt;code>library(performance)&lt;/code>&lt;/pre>
&lt;pre class="r">&lt;code>performance::check_model(fat_reg)&lt;/code>&lt;/pre>
&lt;p>&lt;img src="https://macarenaquiroga.com/en/post/starbucks-beverages-data-science/index_files/figure-html/unnamed-chunk-6-1.png" width="672" />&lt;/p>
&lt;p>The diagnostic plots show several of the things mentioned earlier, for example the high degree of collinearity among several ingredients. The linearity and homogeneity-of-variance panels clearly have very strange shapes, and I wonder if this has to do with the different sizes, which apply the same recipes to different quantities of ingredients. I believe a more appropriate approach would be to fit a separate model for each size, so that beverages (i.e., recipes) can be compared with one another.&lt;/p>
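&lt;p>Both follow-ups can be sketched in a few lines. The example below again uses the built-in &lt;code>mtcars&lt;/code> data as a stand-in, with &lt;code>cyl&lt;/code> playing the role that &lt;code>size&lt;/code> plays here; the predictors and the names &lt;code>fits&lt;/code> and &lt;code>slopes&lt;/code> are illustrative assumptions, not the Starbucks analysis itself.&lt;/p>
&lt;pre class="r">&lt;code># Pairwise correlations among numeric predictors:
# values close to +1 or -1 flag potential collinearity
round(cor(mtcars[, c(&amp;quot;wt&amp;quot;, &amp;quot;hp&amp;quot;, &amp;quot;disp&amp;quot;)]), 2)

# One model per group instead of one model with the group as a predictor
fits &amp;lt;- lapply(split(mtcars, mtcars$cyl),
               function(d) lm(mpg ~ wt + hp, data = d))

# Compare the same coefficient across the per-group models
slopes &amp;lt;- sapply(fits, function(m) coef(m)[[&amp;quot;wt&amp;quot;]])
slopes&lt;/code>&lt;/pre>
&lt;p>For the beverages, the split would be on &lt;code>starbucks$size&lt;/code> and the formula would drop &lt;code>size&lt;/code>. The &lt;code>performance&lt;/code> package also offers &lt;code>check_collinearity()&lt;/code>, which reports variance inflation factors for a fitted model.&lt;/p>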
&lt;p>In short, although the model summary initially yielded seemingly significant results, the diagnostics reveal issues that the summary overlooks, and this model is not the most appropriate for this scenario. The analysis could be continued, as I mentioned before, by running a separate model for each size and by examining correlations between ingredients to detect problematic collinearity. But this was enough for today. Thanks for reading!&lt;/p>
&lt;/div></description></item></channel></rss>