R与数据分析(一) 回归模型

本文最后更新于:2023年2月28日 下午

1. 模型的建立

1.1 线性模型(the linear model)

原理

回归方程 (regression equation), the expected value of yy given xx

E(Yixi)=β0+β1xiE(Y_i|x_i)=\beta_0+\beta_1x_i

估计的回归方程 (estimated regression equation), the estimated value of E(Yixi)E(Y_i|x_i):

yi^=β0^+β1^xi\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i

回归模型 (regression model)

the exact value of yy:

y=E(Yx)+ϵ=β0+β1x+ϵy=E(Y|x)+\epsilon=\beta_0+\beta_1x+\epsilon

the estimated value of yy:

yi=yi^+ri=(β0^+β1^xi)+riy_i=\hat{y_i}+r_i=(\hat{\beta_0}+\hat{\beta_1}x_i)+r_i

ϵ\epsilon (error) : 误差项的随机变量,反映除xx 变化引起的 yy 的线性变化之外随机因素对 yy 的影响,满足 ϵiidN(0,σ2)\epsilon \sim^{iid} N(0,\sigma^2)

  • ii : independent —— 样本相互独立 (every observation to be independent of every other observation)
  • ii : identically distributed —— 方差齐性 (the equality of variance assumption, (EOV))
  • dd : distribute_d normally —— 误差项正态分布 (ϵi\epsilon_i must be from a normal distribution, at least approximately)

最小二乘法(least squares)

Residual rir_i (残差):

ri=yiyi^r_i=y_i-\hat{y_i}

1
resid(examtest.fit)[20] # 得到指定行的残差

Residual Sum of Suqares RSSRSS (残差平方和)

RSS=i=1nri2=i=1n(yiyi^)2RSS=\sum_{i=1}^n{r_i}^2=\sum_{i=1}^n(y_i-\hat{y_i})^2

1
sum(resid(examtest.fit)^2) # RSS

Residual Standard Error ss (残差标准差)

s=RSSdf(RSS)s=\sqrt{\frac{RSS}{df(RSS)}}

实现

1
2
examtest.fit = lm(Exam~1+Test, data=Stats20x.df)
examtest.fit = lm(Exam~Test, data=Stats20x.df) # explicit
1
summary(examtest.fit)

1
coef(examtest.fit)

1
confint(examtest.fit) # 默认置信水平0.05的置信区间 

image-20221022155420260

1.3 零模型(the null model)

原理

假设自变量与因变量之间没有关系的模型为零模型。

H0H_0 : TestExam 间没有关系

β1=0y=β0+ϵϵiidN(0,σ2)\beta_1=0 \\y=\beta_0+\epsilon \\ \epsilon \sim^{iid} N(0,\sigma^2)

零模型残差:

ri=yiyi^=yiβ0^=yiyir_i=y_i-\hat{y_i}=y_i-\hat{\beta_0}=y_i-\overline{y_i}

Total Sum of Squares TSSTSS (零模型残差平方和)

TSS=i=1nri2=n(yiyi)2TSS=\sum_{i=1}^n{r_i}^2=^n(y_i-\overline{y_i})^2

Multiple R-squared R2R^2

R2=TSSRSSTSS=1RSSTSSR^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}

We can say that $ R^2 \times 100% $ of the variation in the exammark is explained by using a straight line relationship (i.e., simple linear

model) with test score.

R21,yi^yiR^2 \rightarrow 1, \hat{y_i} \rightarrow y_i 说明拟合效果越好

Explained Sum of Squares ESSESS (回归平方和)

ESS=(yi^yi)2TSS=ESS+RSSdf(TSS)=df(ESS)+df(RSS)ESS=\sum(\hat{y_i}-\overline{y_i})^2 \\ TSS=ESS+RSS \\ df(TSS)=df(ESS)+df(RSS)

Adjusted R-squared 调整R2R^2

AdjR2=1RSSdf(RSS)TSSdf(TSS)Adj-R^2=1-\frac{\frac{RSS}{df(RSS)}}{\frac{TSS}{df(TSS)}}

实现

1
2
3
null.fit = lm(Exam~1, data=Stats20x.df)
summary(null.fit)
confint(null.fit) # 默认置信水平0.05

image-20221022121334265

confint()

1.4 t检验(t-test)

单样本t检验(one-sample t-test)

单样本 tt 检验用于比较样本均值与一个特定数值之间的差异情况。前提条件为样本数据的因变量满足正态分布。

tt 检验的计算公式

TT 值可由样本均值 $ \overline y$、标准差 ss、样本个数 nn 得到,满足自由度为 n1n-1tt 分布:

T值

tt 临界值 (t-multiplier) 取决于自由度和置信水平。置信水平为 0.050.05 (双尾) 时有:

t临界值

置信区间可以表示为:

置信区间

1
2
3
4
5
6
7
8
9
10
11
12
# 手动实现
y=Stats20x.df$Exam
n=length(y)
mu=60
df=n-1
CI=0.05

t.stat=(mean(y)-60)/(sd(y)/sqrt(n))
t.mult = qt(1-CI/2, df=n-1)
confint.lwr= mean(y) - t.mult*sd(y)/sqrt(n)
confint.upr= mean(y) + t.mult*sd(y)/sqrt(n)
confint = mean(y) + c(-1,1)*t.mult*sd(y)/sqrt(n)
  1. β0=μ=0\beta_0=\mu=0

    1. lm(Exam~1) 零模型

      summary(null.fit) 得到 H0:β0=μ=0H_0:\beta_0=\mu=0PP值,PP值越小越拒绝 H0H_0

      confint(null.fit) 得到 β0=μ\beta_0=\mu 置信水平为 0.050.05的置信区间

    2. t.test(Stats20x.df$Exam,mu=0)

  2. β0=μ=yi\beta_0=\mu=\overline{y_i}

    1. t.test(Stats20x.df$Exam,mu=60)

      t.test()

    2. nullmu.fit=lm(I(Exam-60)~1,data=Stats20x.df)

      lm(I()~1)

配对t检验(the paired t-test)

配对 tt 检验用于分析配对定量数据之间的差异对比关系。前提条件为配对数据的差值满足正态分布。配对数据是指相互连接或关联 (不相互独立),样本量相同的数据。

The paired t-test is just an application of the one-sample t-test applied to differences.

以本例中的 ExamTest 为例,检验 if the midterm test marks and exam marks have the same expected value.

1
2
3
4
5
6
7
8
9
10
11
12
13
# 量纲标准化(the same scale)
Stats20x.df$Test2 = 5 * Stats20x.df$Test
# 计算差值Test-Exam(视为相互独立)
Stats20x.df$Diff = Stats20x.df$Test2 - Stats20x.df$Exam
# 差值Diff需满足正态分布
hist(Stats20x.df$Diff)
# H_0:μ_diff=0
# lm实现
diff.fit1 = lm(Diff~1, data=Stats20x.df)
coef(summary(diff.fit))
confint(diff.fit)
# t检测实现
t.test(Stats20x.df$Diff)

由图可知

  • P-value 7×106\approx 7 \times 10^{-6}
  • this difference to be about 2.9 to 7.1 marks

1.5 广义线性模型(generalized linear mode)

二次模型(quadratic model)

The quadratic model:

E(Yixi)=β0+β1xi+β2xi2E(Y_i|x_i)=\beta_0+\beta_1x_i+\beta_2x_i^2

ExamAssign 为例

1
2
3
trendscatter(Exam ~ Assign, data = Stats20x.df) 
examassign.fit=lm(Exam~ Assign,data = Stats20x.df)
plot(examassign.fit,which=1)

trendscatter

residual-fitted

从两图中能得到结论:Sure looks like some curvature, but, at least the scatter looks fairly constant around this curve.

1
2
3
# add a square term
examassign.fit2=lm(Exam~ Assign + I(Assign^2), data = Stats20x.df)
summary(examassign.fit2)

二分类模型 (2-level categorical variable model)

哑变量 (Dummy Variable)

在构建回归模型时,

  • 如果自变量 xx为连续性变量,回归系数 β1\beta_1 (slope) 可以解释为:在其他自变量不变的条件下,xx 每改变一个单位,所引起的因变量yy的平均变化量;
  • 如果自变量xx为二分类变量,例如是否饮酒(1=是,0=否),则回归系数 β1\beta_1 (slope)可以解释为:其他自变量不变的条件下,x=1x=1(饮酒者)与 x=0x=0(不饮酒者)相比,所引起的因变量 yy 的平均变化量。

当自变量 xx多分类变量时,引入哑变量(Dummy Variable),又称为虚拟变量将各个多分类变量均转化为0-1变量。

ExamAttend 为例

The 2-level categorical variable model:

Exam=β0+β1AttendExam=β0,Attend=No(0)Exam=β0+β1,Attend=Yes(1)Exam = \beta_0 + \beta_1 \cdot Attend \\ Exam = \beta_0, Attend = No(0) \\ Exam = \beta_0 + \beta_1, Attend = Yes(1)

  • β0\beta_0 is the mean exam mark for non-attenders
  • β0+β1\beta_0 + \beta_1 is the mean mark for attenders

so β1 is the difference in mean mark for attenders compared to non-attenders.

To test the null hypothesis of no attendance effect, H0:β1=0H_0:\beta_1=0.

1
2
3
4
5
# box plot
boxplot(Exam ~ Attend, data = Stats20x.df)

examattend.fit = lm(Exam ~ Attend, data = Stats20x.df)
summary(examattend.fit)

β0、β1相关情况

p=1.27×106p=1.27 \times 10^{-6} , We have extremely strong evidence that attendance is related to exam score.

1
confint(examattend.fit)

Regular attenders will obtain an increased exam mark of between 9.5 to 21.6 compared to non-attenders.

The expected exam mark of a student who regularly attends class is 9.5 to 21.6 marks higher than that of a non-attendee.

乘法模型(multiplicative model)

应用乘法模型对数据拟合的三个原则

  1. 数据右偏分布

    right skewed, median < mean

  2. 数据指数增长(exponential growth)或指数下降(exponentialdecay)

  3. 常识

    This type of right-skew distribution is very common when it comes to things involving money , resources, growth, salary, age, advantage

    and energy, to name but a few.

  4. 残差图 (residuals vs fitted values) 呈漏斗状

House PriceAge 为例

The multiplicative model:

log(Price)=β0+β1AgePrice=eβ0×(eβ1)Agelog(Price) = \beta_0 + \beta_1 \cdot Age \\ Price=e^{\beta_0} \times(e^{\beta_1})^{Age}

  1. 由于数据经过对数变换后 mean 改变,median 不变。所以 β0\beta_0 视为对 median 的描述。

  2. depreciation (增长率)

    点估计增长率:β1^\hat{\beta_1} 取得点估计值

    1. β1^>0\hat{\beta_1}>0xx 每增加一个单位,yy 的 median 增加了 (exp(β1)1)×100%(exp(\beta_1)-1) \times 100\%
    2. β1^<0\hat{\beta_1}<0xx 每增加一个单位,yy 的 median 减少了 (1exp(β1))×100%(1-exp(\beta_1)) \times 100\%

    区间估计增长率:β1^\hat{\beta_1} 取得区间估计值 [L,R][L,R]

    1. β1^>0\hat{\beta_1}>0xx 每增加一个单位,yy 的 median 增加了 [exp(L)1,exp(R)1]]×100%[exp(L)-1, exp(R)-1]] \times 100\%
    2. β1^<0\hat{\beta_1}<0xx 每增加一个单位,yy 的 median 减少了 [1exp(L),1exp(R)]×100%[1-exp(L),1-exp(R)] \times 100\%
1
2
3
# add a log term
LogPriceAge.fit=lm(log(price)~age, data=Mazda.df)
summary(LogPriceAge.fit)

β0、β1相关情况

1
2
# expotential confint
exp(confint(LogPriceAge.fit))

eβ0、eβ1相关情况

1
2
# depreciation
100*(exp(confint(PriceAge.fit2)) - 1)

分类变量的乘法模型 (Multiplicative model with categorical explanatory variable)

BycatchTrawl 为例

1
boxplot(Bycatch~Trawl,data=Bycatch.df, horizontal = TRUE)

1
2
3
# 2-level categorical variable model
Trawl.lm=lm(Bycatch~Trawl,data=Bycatch.df)
plot(Trawl.lm,which=1)

残差图拟合不佳

右偏分布

添加 loglog

1
2
boxplot(log(Bycatch)~Trawl,data=Bycatch.df, horizontal = TRUE)
Trawl.lmlog=lm(log(Bycatch)~Trawl,data=Bycatch.df)

幂律模型 (power law model)

1.6 多元线性回归模型 ()

1.7 模型检验(checks of model assumptions)

EOV check

检测线性模型是否合适。

1
2
# resuidual-fitted sctter
plot(examtest.fit, which=1)

residual-fitted sctter

绘制 resuidual values - fitted values 散点图

由于 The residuals will be randomly scattered around 0,r-f scatter中红线期望为$ y=0$ 的水平直线。

如果红线呈曲线,则需要考虑广义线性模型中的变式。

normality check

检测样本是否满足正态分布。

1
2
# library(s20x)
normcheck(examtest.fit)

normcheck-qq图/直方图

直方图

根据直方图及去实现可以判断样本是否满足正态分布以及是否 左偏 (lef skewed)、右偏 (right skewed) 等。

qq图

Q-Q图是一种散点图,横坐标为某一样本的分位数,纵坐标为另一样本的分位数。通过Q-Q图可以判断样本与模型是否大致符合以及是否存在离群点。

outlier check

检测离群点。

1
2
# library(s20x), 可画出每个点的Cook距离
cooks20x(examtest.fit)

cooks20x

Cook距离越大,此点对模型的影响越大,异常的可能性越大。an observation is deemed to be influential if:

Removal of the observation changes any estimated parameter value by more than one standard error, or

Its Cook’s distance is greater than 0.4

1.8 模型预测与评估(estimation of fitted values)

模型预测

1
2
3
4
5
6
7
8
9
10
preds.df=data.frame(Test=c(0,10,20)) # 预测 Test=0,10,20 时 Exam 的值

# 1. 点估计
predict(examtest.fit, preds.df, interval="confidence")
# 2 区间估计--置信区间(CI)
# 对均值的区间估计
predict(examtest.fit, preds.df, interval="confidence")
# 3 区间预测--预测区间(PI)
# 对随机变量的区间预测
predict(examtest.fit, preds.df, interval="prediction")

点估计

置信区间

预测区间

1

2. 模型的实现流程

graph LR;
	Question_of_interest/goal_of_the_study ---> Read_in_and_inspect_the_data ---> Fit_model ---> Model_check ---> Method_and_Assumption_Checks ---> Executive_Summary

Question_of_interest/goal_of_the_study

we are interested in a numerical response variable y (e.g.: Exam) and its relationship with a possible explanatory variable x (e.g.:Test).

Read_in_and_inspect_the_data

  1. 数据的读入

    1
    2
    3
    4
    5
    6
    # s20x库中有部分封装好的函数
    library(s20x)

    # 1. 数据读入
    # header=T 首行为字段名
    Stats20x.df = read.table("Data/STATS20x.txt", header=T)

    数据格式

    1. 数据的散点图描述

      根据散点图描述关系种类,提出合适的模型

      1
      2
      3
      4
      5
      6
      7
      8
      # 2. 数据描述
      # 2.1 基本统计量
      # 维度
      dim(Stats20x.df)
      # 统计量合集
      # Min,Max,Mean
      # 1st Qu<-lower quartile,Med<-median,3rd Qu<-upper quartile,
      summary(Stats20x.df$Exam)

      1
      2
      3
      4
      5
      6
      7
      8
      9
      # 2.2 基本图像
      # 散点图
      plot(Exam~Test,data=Stats20x.df)
      # library(s20x),平滑拟合函数趋势
      # blue: locally weighted scatterplot smoothing smoother to the data(局部加权散点图平滑器)
      # red: estimates of the variation about that line(添加固定数值的浮动变量)
      trendscatter(Exam~Test, data = Stats20x.df)
      # 直方图
      hist(Stats20x.df$Exam,breaks=20,main="")

    plot-散点图

    trendscatter-平滑曲线

案例

  1. 简单线性模型

  2. 二次模型

  3. 对数模型

Fit_model

  1. 使用lm 函数实现模型

  2. 模型检验

    • EOV check

      1
      plot(examtest.fit, which = 1),
    • Norm check

      1
      2
      # library(s20x)
      normcheck(examtest.fit)
    • Outlier check

      1
      2
      # library(s20x), 可画出每个点的Cook距离
      cooks20x(examtest.fit)

Method_and_Assumption_Checks

  1. 关联关系与初步建模

    案例 表述
    Assignment1 - Q1 Since we have a linear relationship in the data, we have fitted a simple linear regression model to our data.
    Assignment1 - Q2 We have applied a one sample t-test to it, equivalent to an intercept only linear model (null model).
    Assignment1 - Q3
    第6章案例 The scatter plot of age vs price showed clear nonlinearity and an increase in variability with price.
  2. 模型调整

    根据 Fit_model 中的代码及生成图像描述每个模型、该模型模型检验情况、模型调整方向。

    案例 表述
    Assignment1 - Q1 The initial residuals and Cooks plot showed two distinct outliers USA and China). So we **limited **our countries and therefore could be following a totally different pattern so we limited our patternless scatter with fairly constant variability - so no problems.
    The normaility checks don’t show any major problems (slightly long tails, if anything) and the Cook’s plot doesn’t reveal any further unduly influential points. Overall, all the model assumptions are satisfied.
    Assignment1 - Q2
    Assignment1 - Q3 After fitting the quadratic, the residuals were fine, there were no problems with normality and no unduly influential points. We have independence from taking a random sample.
    第6章案例 Residuals from a simple linear model showed failed the equality of variance and no-trend assumptions, and so the prices were log transformed. A simple linear model fifitted to logged price satisfified all assumptions.
  3. 最终模型方程式

    Our model is: <方程式>

    案例 模型 表述
    Assignment1 - Q1 简单线性模型
    Assignment1 - Q2 零模型
    Assignment1 - Q3 二次模型
    第6章案例 对数模型
  4. 模型评价(以 R2R^2 为指标)

    Our model explains R2×100%R^2 \times 100\% of the total variation in the response variable, and so will be reasonable for prediction.

Executive_Summary

  1. 研究问题

    Question_of_interest/goal_of_the_study

  2. 研究结论及依据

    案例 模型 表述
    Assignment1 - Q1 简单线性模型 We restricted our analysis to countries with GDP less than 6,000 billion dollars.
    Assignment1 - Q2 零模型 We estimate that the mean life expectancy is somewhere between 67.3 and 72…3 years. This is consistent with the mean life expectancy being 68 years old. [Or we do not have any evidence to suggest that the mean life expectancy is different from 68 years.)

    Assignment1 - Q3 二次模型
    第6章案例 对数模型 There was clear evidence the price of the cars was exponentially decreasing as the cars got older (P-value 0)
  3. 结论定量说

    案例 模型 表述
    Assignment1 - Q1 简单线性模型 We have strong evidence suggesting that electricity consumption is positively related to the GDP.
    Assignment1 - Q2 零模型 \
    Assignment1 - Q3 二次模型
    第6章案例 对数模型 We estimate that the median price for new Mazda cars (in 1991) was between A$23,600 to A$30,400 (to the nearest A$100).
  4. 预测值定量说明

    案例 模型 表述
    Assignment1 - Q1 简单线性模型 For each additional 100 billion dollars increase in GDP, the average electricity consumption increased by somewhere between 17 and 21 billion kilowatt- hours.
    Assignment1 - Q2 零模型 \
    Assignment1 - Q3 二次模型
    第6章案例 对数模型 We estimate that each additional year in age results in depreciation of between 13.9% to 16.3%.

参考资料