Load the data set into the R environment. The data set was collected by Walmart to study variables that might influence the Consumer Price Index (CPI).
[DOWNLOAD HERE]
data = read.csv('../data/walmart.csv')
head(data)
A data.frame: 6 × 9

   Store  year   month  day    Temperature  Fuel_Price  CPI       Unemployment  IsHoliday
   <int>  <int>  <int>  <int>  <dbl>        <dbl>       <dbl>     <dbl>         <lgl>
1  1      2010   2      5      42.31        2.572       211.0964  8.106         FALSE
2  1      2010   2      12     38.51        2.548       211.2422  8.106         TRUE
3  1      2010   2      19     39.93        2.514       211.2891  8.106         FALSE
4  1      2010   2      26     46.63        2.561       211.3196  8.106         FALSE
5  1      2010   3      5      46.50        2.625       211.3501  8.106         FALSE
6  1      2010   3      12     57.79        2.667       211.3806  8.106         FALSE
Simple plots
Simple plots are a good first step toward getting to know a data set. Try to make each of the following plots. What can you conclude from them?
A side-by-side boxplot
plot(x = factor(data$year), y = data$CPI, xlab = "Year", ylab = "CPI")
Welch Two Sample t-test
data: cpi2010 and cpi2012
t = -6.4642, df = 4497, p-value = 1.127e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-9.924313 -5.305366
sample estimates:
mean of x mean of y
168.1018 175.7166
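The vectors `cpi2010` and `cpi2012` named in the output above are presumably the CPI values for each year. A minimal sketch of one way to produce a Welch test like this, assuming that construction; the `toy` data frame holds made-up values for illustration and is not the Walmart data:

```r
# Toy data for illustration only (made-up values, not the Walmart data).
toy <- data.frame(
  year = rep(c(2010, 2012), each = 6),
  CPI  = c(210.1, 211.3, 209.8, 212.0, 210.7, 211.5,
           220.4, 221.9, 219.5, 222.3, 221.0, 218.8)
)

# Assumed construction of the two samples: CPI values split by year.
cpi2010 <- toy$CPI[toy$year == 2010]
cpi2012 <- toy$CPI[toy$year == 2012]

# t.test() defaults to var.equal = FALSE, which gives the Welch test.
welch <- t.test(cpi2010, cpi2012)
print(welch)
```

With the real data, the same subsetting would be applied to `data` instead of `toy`.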
Test whether the mean fuel price is the same in 2010 and 2012. This time, assume that the two population variances are equal.
Two Sample t-test
data: fuel2010 and fuel2012
t = -119.71, df = 4498, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.8622102 -0.8344239
sample estimates:
mean of x mean of y
2.823767 3.672084
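Assuming equal variances switches `t.test()` from the Welch test to the pooled two-sample test, whose degrees of freedom are n1 + n2 - 2 (4498 in the output above, so the two samples hold 4500 observations in total). A small sketch with made-up values; the names `fuel2010` and `fuel2012` are taken from the output, and how they were built from `data` is an assumption:

```r
# Made-up fuel prices for illustration only (not the Walmart data).
fuel2010 <- c(2.57, 2.55, 2.51, 2.56, 2.63, 2.67, 2.72)
fuel2012 <- c(3.36, 3.41, 3.52, 3.60, 3.71, 3.66)

# var.equal = TRUE pools the two sample variances.
pooled <- t.test(fuel2010, fuel2012, var.equal = TRUE)

# For the pooled test, df = n1 + n2 - 2.
df_pooled   <- unname(pooled$parameter)
df_expected <- length(fuel2010) + length(fuel2012) - 2
print(c(df_pooled = df_pooled, df_expected = df_expected))
```

The Welch test, by contrast, usually reports a non-integer df that is at most n1 + n2 - 2.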
Correlation coefficient
Calculate the pairwise correlations between year, temperature, and fuel price. Present them both as a scatter plot matrix and as a correlation matrix.
cor(data[, c(2, 5, 6)])
A matrix: 3 × 3 of type dbl

             year         Temperature  Fuel_Price
year          1.00000000  -0.03837331  0.6577771
Temperature  -0.03837331   1.00000000  0.1013542
Fuel_Price    0.65777712   0.10135422  1.0000000
plot(data[, c(2, 5, 6)])
Advanced
Try this section after you finish all previous sections
Fancy plots
Color the points of the scatter plot matrix by year.
pairs(data[, c(2, 5, 6)], pch = 20, col = rainbow(4)[factor(data$year)])
Add density plots, correlation coefficients, and confidence intervals. (HINT: ??pairs.panels)
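The hint points to `pairs.panels`, which lives in the `psych` package. If that package is not available, a rough base-R alternative is to pass custom panel functions to `pairs()`. The sketch below is one guess at what the exercise is after, not the intended solution; `cor_label`, `panel_cor`, and `panel_density` are hypothetical helper names, and the data are random toy values:

```r
# Toy data for illustration only (random values, not the Walmart data).
set.seed(1)
toy <- data.frame(
  year        = rep(2010:2012, each = 10),
  Temperature = rnorm(30, mean = 60, sd = 15)
)
toy$Fuel_Price <- 2.8 + 0.4 * (toy$year - 2010) + rnorm(30, sd = 0.1)

# Text label: Pearson correlation with its 95% confidence interval.
cor_label <- function(x, y) {
  ct <- cor.test(x, y)
  sprintf("r = %.2f\n[%.2f, %.2f]", ct$estimate, ct$conf.int[1], ct$conf.int[2])
}

# Upper panels: print the label in the middle of the panel.
panel_cor <- function(x, y, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.5, cor_label(x, y))
}

# Diagonal panels: kernel density estimate of each variable.
panel_density <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  d <- density(x)
  par(usr = c(usr[1:2], 0, 1.5 * max(d$y)))
  lines(d$x, d$y)
}

pairs(toy, pch = 20, col = rainbow(3)[factor(toy$year)],
      upper.panel = panel_cor, diag.panel = panel_density)
```

The lower panels keep the default scatter plots, so the colors by year remain visible there.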
This section fits a linear model whose independent variables include categorical (discrete) data.
Fit store, year, temperature, fuel price, and unemployment rate into the model. Note that store and year should be treated as categorical data in this case.
This will return a long table. When a categorical variable is used as an independent variable, R encodes it with dummy variables, and each non-reference level gets its own coefficient.
What are the reference points for stores and years?
fit = lm(CPI ~ factor(Store) + factor(year) + Temperature + Fuel_Price + Unemployment, data)
summary(fit)
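On the reference-point question: with R's default treatment contrasts, the first level of each factor serves as the baseline and gets no coefficient of its own, while every other level gets one dummy column. A minimal sketch on made-up data (the values and the `toy` name are illustrative, not from the Walmart file):

```r
# Made-up data for illustration only (not the Walmart data).
toy <- data.frame(
  Store = rep(1:3, times = 4),
  year  = rep(c(2010, 2011, 2012, 2010), each = 3),
  CPI   = c(211.1, 189.4, 126.2, 215.0, 193.1, 129.0,
            220.3, 197.8, 131.5, 211.6, 189.9, 126.7)
)

# The first level of each factor is the reference level.
levels(factor(toy$Store))[1]   # "1"
levels(factor(toy$year))[1]    # "2010"

# model.matrix() shows the dummy columns lm() would build:
# the intercept plus one column per non-reference level.
X <- model.matrix(~ factor(Store) + factor(year), data = toy)
colnames(X)
```

To use a different baseline, `relevel()` can be applied to the factor before fitting.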
Fit a two-way ANOVA model: CPI = Store + year + e.
- Do your degrees of freedom make sense? (If a term has only 1, you may have forgotten to convert it to a factor.)
- What conclusion can you draw from this result?
fit = aov(CPI ~ factor(Store) + factor(year), data)
summary(fit)
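For the degrees-of-freedom check: a factor with k levels uses k - 1 degrees of freedom, whereas a numeric column entered as-is is treated as a single slope and uses 1. A sketch on made-up data (not the Walmart file) showing both cases:

```r
# Made-up data for illustration only (not the Walmart data).
set.seed(42)
toy <- data.frame(
  Store = rep(1:4, times = 6),
  year  = rep(2010:2012, each = 8)
)
toy$CPI <- 200 + 5 * toy$Store + 3 * (toy$year - 2010) + rnorm(24)

# Store and year as factors: df = (number of levels - 1) per term.
fit_factor <- aov(CPI ~ factor(Store) + factor(year), data = toy)
df_factor  <- summary(fit_factor)[[1]][["Df"]]
print(df_factor)    # Store: 3, year: 2, residuals: 18

# Store and year left numeric: each term uses only 1 df.
fit_numeric <- aov(CPI ~ Store + year, data = toy)
df_numeric  <- summary(fit_numeric)[[1]][["Df"]]
print(df_numeric)   # Store: 1, year: 1, residuals: 21
```

If the real model reports 1 degree of freedom for a term that has many levels, the second case is the likely cause.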