Outline

  • When to use Pearson Correlation and Gaussian Regression
  • Normal Distribution & Normality Test:
           rnorm(), car::qqPlot(), shapiro.test() 
  • Pearson Correlation:
           cor.test(x1, x2) 
  • Independent Two-sample T-test:
           t.test(y1, y2), t.test(y~x)  
  • Simple Linear Regression:
           summary(lm(y~x)), plot(lm(y~x)), lmtest::bptest(), car::ncvTest()
  • Multivariate Gaussian Regression:
           car::vif(), kappa()  

When to use Pearson Correlation and Gaussian Regression

Goal:

  • to quantify linear association

  • between a continuous outcome and independent variable(s)

Normal Distribution & Normality Test

set.seed(234)
hist(rnorm(1000), breaks=80, prob=TRUE, col="blue", main="Histogram of N(0,1), n=1000")
curve(dnorm(x, mean=0, sd=1), col="red", lwd=2, add=TRUE)

Normal Distribution

[Figure: histogram of the simulated N(0,1) sample with the theoretical density curve overlaid]

Normality Test: QQ-plot with 95% CI

## 100 samples simulated from the standard normal distribution;
## the car:: prefix makes a separate library(car) call unnecessary:
set.seed(234); car::qqPlot(rnorm(100))

Normality Test: Shapiro-Wilk test

shapiro.test(rnorm(100)) # Shapiro-Wilk test of normality
## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(100)
## W = 0.98084, p-value = 0.1544
## Conclusion: 

## Since p-value is around 0.15, there is not enough evidence to reject 
## the null hypothesis that this sample is drawn from a normal 
## distribution (at the significance level of 0.05). 
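For contrast, the same test applied to clearly non-normal data rejects normality. This is an illustrative example (not from the original slides), using a right-skewed exponential sample:

```r
## Shapiro-Wilk test on non-normal (exponential) data:
set.seed(234)
x_exp <- rexp(100)        # right-skewed sample
shapiro.test(x_exp)       # p-value will be very small -> reject normality
```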

Pearson Correlation

  • Pearson Correlation Coefficient:
           r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

  • Hypothesis Test of ρ = 0 (the population correlation):
           t = r√(n − 2) / √(1 − r²)  ~  t(n − 2) under H₀
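As a sanity check (illustrative, with simulated data), the Pearson coefficient computed directly from its definition agrees with R's built-in cor():

```r
## Compute r from the definition and compare with cor():
set.seed(234)
x1 <- rnorm(50)
x2 <- 2 * x1 + rnorm(50)
r_manual <- sum((x1 - mean(x1)) * (x2 - mean(x2))) /
  sqrt(sum((x1 - mean(x1))^2) * sum((x2 - mean(x2))^2))
all.equal(r_manual, cor(x1, x2))  # TRUE
```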

Pearson Correlation

## Example Data:
data("DIGdata", package="asympTest")
attach(DIGdata)
dim(DIGdata)
## [1] 6800   72
names(DIGdata)[c(1:5,24:25)]
## [1] "ID"    "TRTMT" "AGE"   "RACE"  "SEX"   "DIABP" "SYSBP"
length(unique(DIGdata$ID))
## [1] 6800

Pearson Correlation

## Example Data:
## scatter plot of diastolic BP vs. systolic BP
plot(DIABP, SYSBP)  

Pearson Correlation

cor.test(DIABP, SYSBP)
## 
##  Pearson's product-moment correlation
## 
## data:  DIABP and SYSBP
## t = 0.83224, df = 6790, p-value = 0.4053
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01368677  0.03387401
## sample estimates:
##        cor 
## 0.01009933

Pearson Correlation

## Example Conclusion:

## The Pearson correlation between DIABP and SYSBP is estimated
## to be 0.01 with 95% CI (-0.01, 0.03). 

## Since the p-value is around 0.41, there is not enough evidence, 
## at the significance level of 0.05, to conclude that DIABP and 
## SYSBP are significantly associated in linear form.

Pearson Correlation

cor.test(DIABP, SYSBP, "greater")
## 
##  Pearson's product-moment correlation
## 
## data:  DIABP and SYSBP
## t = 0.83224, df = 6790, p-value = 0.2027
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  -0.00986294  1.00000000
## sample estimates:
##        cor 
## 0.01009933

Independent Two-sample T-test (Welch's Version)

  • Test statistic (no equal-variance assumption):
           t = (x̄₁ − x̄₂) / √( s₁²/n₁ + s₂²/n₂ )
    with degrees of freedom given by the Welch-Satterthwaite approximation.
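The Welch statistic can be verified by hand; a minimal sketch on simulated data (the group sizes, means, and SDs here are made up for illustration):

```r
## Welch's t computed from the formula vs. t.test() (illustrative):
set.seed(234)
g1 <- rnorm(40, mean = 126, sd = 20)
g2 <- rnorm(60, mean = 125, sd = 18)
se     <- sqrt(var(g1) / length(g1) + var(g2) / length(g2))
t_hand <- (mean(g1) - mean(g2)) / se
unname(t.test(g1, g2)$statistic)   # identical to t_hand
```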

Independent Two-sample T-test

## Example Data: DIGdata
boxplot(SYSBP ~ SEX, xlab="Gender (1=male or 2=female)", 
        ylab="Systolic Blood Pressure", main="Boxplot of SYSBP by Gender")

Independent Two-sample T-test

  • Two-sided test in R:
t.test(SYSBP~SEX)
## 
##  Welch Two Sample t-test
## 
## data:  SYSBP by SEX
## t = 1.6755, df = 2506.5, p-value = 0.09395
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1634916  2.0833293
## sample estimates:
## mean in group 1 mean in group 2 
##        126.0106        125.0507

Independent Two-sample T-test

## Example Conclusion:

## The estimated sample means of SYSBP in the male and female groups 
## are 126 mmHg and 125 mmHg, respectively. The 95% CI of the 
## difference is (-0.16, 2.08).

## Since the p-value is around 0.09, there is not enough evidence 
## to conclude that SYSBP differs significantly by gender (at the 
## significance level of 0.05). That is, the data do not show a 
## difference in average SYSBP between men and women. 

Simple Linear Regression

plot(x, y); abline(lm(y~x), col="red")  ## lines(lowess(x, y)) adds a smoother

Simple Linear Regression Model

  • Model: y = β₀ + β₁x + ε,  with errors ε ~ N(0, σ²), independent across observations

Regression Diagnostic Plot

 plot(lm(y~x)) 

Regression Diagnostic Test:

  • Tests for checking constant variance assumption:
          lmtest::bptest(lm(y ~ x)) ## Breusch-Pagan Test 
          car::ncvTest(lm(y ~ x)) ## Non-constant Variance Score Test 
    (small p-value means violation of the assumption)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm(y ~ x)
## BP = 2.7024, df = 1, p-value = 0.1002
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.798986    Df = 1     p = 0.09432395
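To see these tests flag an actual violation, the Breusch-Pagan idea can be sketched in base R (illustrative, simulated data): regress the squared residuals on x; n·R² is approximately chi-squared(1) under constant variance, which is what the studentized lmtest::bptest() computes.

```r
## Base-R sketch of the studentized Breusch-Pagan statistic on
## data whose error spread grows with x (constant variance violated):
set.seed(234)
x   <- runif(500)
y   <- 3 + 5 * x + rnorm(500, sd = 2 * x)  # variance increases with x
fit <- lm(y ~ x)
aux <- lm(resid(fit)^2 ~ x)                # auxiliary regression
bp  <- length(x) * summary(aux)$r.squared
pchisq(bp, df = 1, lower.tail = FALSE)     # small p -> heteroscedasticity
```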

Simple Linear Regression

  • Simulation from a simple linear regression model:
    y = 3 + 5*x + error
set.seed(1234)
x <- runif(1000)  ## random numbers from uniform distribution
y <- 3+5*x+rnorm(1000)
summary(lm(y~x))  ## model parameter estimation results
par(mfrow=c(2,2)); plot(lm(y~x));  ## check model assumptions

Simple Linear Regression

  • Fitting summary:
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91933 -0.62956  0.01084  0.63819  2.73178 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.04449    0.06028   50.51   <2e-16 ***
## x            4.88928    0.10306   47.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9486 on 998 degrees of freedom
## Multiple R-squared:  0.6928, Adjusted R-squared:  0.6925 
## F-statistic:  2251 on 1 and 998 DF,  p-value: < 2.2e-16

Simple Linear Regression

  • Uniform regressor:
    [Figure: diagnostic plots from plot(lm(y~x)) for the simulated model]

Function of Diagnostic Plots:

  • Simulate from a model with a quadratic term, but fit a (misspecified) simple linear model:
set.seed(1234)
x <- runif(1000)
y <- 3+5*x+100*x^2+rnorm(1000)
summary(lm(y~x))
par(mfrow=c(2,2)); plot(lm(y~x));

Function of Diagnostic Plots:

  • Fitting summary:
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.037  -6.767  -2.016   6.182  19.554 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.1642     0.4897  -28.92   <2e-16 ***
## x           106.2410     0.8374  126.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.707 on 998 degrees of freedom
## Multiple R-squared:  0.9416, Adjusted R-squared:  0.9416 
## F-statistic: 1.61e+04 on 1 and 998 DF,  p-value: < 2.2e-16

Function of Diagnostic Plots:

[Figure: diagnostic plots showing a curved residual pattern caused by the omitted quadratic term]
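As an illustrative follow-up to the simulation above: refitting with the quadratic term included removes the curved residual pattern and recovers the true coefficients (3, 5, 100).

```r
## Refit the correctly specified model for the quadratic simulation:
set.seed(1234)
x <- runif(1000)
y <- 3 + 5 * x + 100 * x^2 + rnorm(1000)
fit2 <- lm(y ~ x + I(x^2))
coef(fit2)                       # close to 3, 5, 100
par(mfrow = c(2, 2)); plot(fit2) # diagnostics no longer show a pattern
```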

Simple Linear model: estimate group mean

## Apply Simple Linear Regression to compare SYSBP mean by gender:
summary(lm(SYSBP~as.factor(SEX)))
## 
## Call:
## lm(formula = SYSBP ~ as.factor(SEX))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.051 -16.011  -3.011  13.989  94.949 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     126.0106     0.2744 459.242   <2e-16 ***
## as.factor(SEX)2  -0.9599     0.5804  -1.654   0.0982 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.93 on 6795 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.0004024,  Adjusted R-squared:  0.0002553 
## F-statistic: 2.735 on 1 and 6795 DF,  p-value: 0.09821

Simple Linear model: estimate group mean

## Regression Conclusion:

## The estimated sample means of SYSBP in the male and female groups 
## are 126.0106 mmHg and (126.0106-0.9599=125.0507) mmHg, respectively. 

## Since the p-value is around 0.0982, there is not enough evidence 
## to conclude that SYSBP differs significantly by gender (at the 
## significance level of 0.05). That is, the data do not show a 
## difference in average SYSBP between men and women. 
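Note that the regression p-value (0.0982) differs slightly from the Welch p-value (0.094) because lm() assumes equal variances in the two groups; the pooled-variance t-test reproduces the regression result exactly. A minimal sketch on simulated data (group sizes and parameters are made up for illustration):

```r
## lm() with a two-level factor is equivalent to the pooled t-test:
set.seed(234)
grp <- factor(rep(1:2, times = c(40, 60)))
val <- rnorm(100, mean = c(126, 125)[grp], sd = 20)
fit  <- lm(val ~ grp)
p_lm <- summary(fit)$coefficients[2, 4]          # p-value for grp2
p_tt <- t.test(val ~ grp, var.equal = TRUE)$p.value
c(p_lm, p_tt)                                    # identical
```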

Multivariate Regression