I use R for statistical analysis for the following reasons:
1) It is far more powerful and flexible in its input than Excel could ever hope to be.
2) It is much more intuitive than other packages I have used.
3) Its learning curve is much gentler than that of numpy and pandas in Python.
4) It has many versatile packages that are simply imported, compared with messing with the module search path (sys.path or PYTHONPATH) in Python.
5) It is free and easy to install.
6) There are many learning resources and tutorials for novices online.
A simple linear regression has the form y = mx + b, where the slope m indicates how much y changes for a one-unit change in x, and b is the intercept, the predicted value of y when x is zero, which depending on the context either serves as a baseline or may hold other information. The strength of the linear relationship (the fit of the line to the data points) is given by the R² value, the fraction of the variance in y that the line explains. The closer R² is to 1, the better the linear fit. Other useful statistics include the residual standard error (the typical distance of the points from the fitted line, in the units of y) and the t-value. The t-value is the estimated coefficient divided by its standard error, and it is used to test the null hypothesis that the true coefficient is zero; the resulting p-value is the probability of seeing a t-value at least this extreme if there were no relationship between our two variables. If R² is near 1, the residual standard error is small relative to the spread of y, and the p-value from the t-test is below the chosen significance level (commonly 0.05), we have good evidence of a linear relationship. Additionally, the F-statistic tests whether the model as a whole fits better than an intercept-only model; with a single predictor it equals the square of the slope's t-value and gives the same p-value.
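As a rough illustration of where each of these statistics lives in R, here is a minimal sketch that fits a line to synthetic data. The data, the seed, and the variable names (x, y, fit) are made up for this example and have nothing to do with the study discussed below.

```r
# Synthetic data for illustration only (assumed values, not from any study).
set.seed(42)
x <- seq(1, 20, length.out = 50)
y <- 2.5 * x + 1 + rnorm(50, sd = 3)   # true slope m = 2.5, intercept b = 1

fit <- lm(y ~ x)       # least-squares fit of y = m*x + b
s   <- summary(fit)

coefs     <- coef(s)   # matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
slope     <- coefs["x", "Estimate"]
intercept <- coefs["(Intercept)", "Estimate"]
t_slope   <- coefs["x", "t value"]
p_slope   <- coefs["x", "Pr(>|t|)"]

r_squared <- s$r.squared                     # fraction of variance explained
resid_se  <- s$sigma                         # residual standard error
f_stat    <- unname(s$fstatistic["value"])   # overall F-statistic

print(s)   # the familiar summary table
```

With a single predictor, f_stat should equal t_slope^2 up to floating-point error, which is a quick way to see that the F-test and the slope's t-test carry the same information here.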
Here are some example data plots from a landmark study on risk factors in prostate cancer (Ref 1). Figure 1 shows the weak, noisy relationship between age and cancer volume. From our understanding of cancer progression, it makes sense that there is little real relationship between age and cancer volume if the onset of the disease is similar across the groups. Sometimes this understanding of the data going into a problem (referred to as domain knowledge) can be almost as useful as sophisticated knowledge of machine learning and programming.
The summary statistics in R (shown below) largely confirm our intuition: the slope for age is significant at the 0.05 level (p = 0.0188), but the R² of 0.082 means that age explains only about 8% of the variation in cancer volume.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.75703 1.28285 -1.370 0.1755
Age 0.04742 0.01968 2.409 0.0188 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.2 on 65 degrees of freedom
Multiple R-squared: 0.08198, Adjusted R-squared: 0.06786
F-statistic: 5.805 on 1 and 65 DF, p-value: 0.01883
Figure 1: Scatter plot showing only a weak correlation between age and cancer volume in a landmark prostate cancer study. All figures were plotted with R.
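The numbers in the Age summary can be cross-checked by hand. The sketch below copies the rounded values from that output and recomputes the t-value, the two-sided p-value, and the adjusted R-squared. The variable names are my own, and because the inputs are rounded the results should only agree with the printed values to about three significant figures.

```r
# Values copied (rounded) from the summary output for the Age model.
estimate <- 0.04742
std_err  <- 0.01968
r2       <- 0.08198
df_resid <- 65             # residual degrees of freedom
n        <- df_resid + 2   # two estimated parameters: intercept and slope

t_value <- estimate / std_err                    # reported t value: 2.409
p_value <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided; reported: 0.0188
adj_r2  <- 1 - (1 - r2) * (n - 1) / df_resid     # reported: 0.06786

print(round(c(t = t_value, p = p_value, adj_r2 = adj_r2), 4))
```

The adjusted R-squared is slightly lower than the plain R-squared because it penalizes the model for the parameter it spends on the slope.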
Figure 2 is the main result of the study. It shows the correlation between cancer volume and the level of PSA (prostate-specific antigen) in the blood. Here are the summary statistics from a training data sample.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.53623 0.23688 -2.264 0.0269 *
ResLPSA 0.75427 0.08678 8.692 1.73e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8515 on 65 degrees of freedom
Multiple R-squared: 0.5375, Adjusted R-squared: 0.5304
F-statistic: 75.55 on 1 and 65 DF, p-value: 1.733e-12
Figure 2: Plot showing a significant correlation between PSA levels and prostate cancer volumes.
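For a model with one predictor, the F-statistic, the t-value of the slope, and R-squared are all tied together. As a sketch, using the rounded values printed in the ResLPSA summary (variable names are mine), the relationships can be checked like this:

```r
# Values copied (rounded) from the summary output for the ResLPSA model.
t_value  <- 8.692
f_stat   <- 75.55
df_resid <- 65

f_from_t  <- t_value^2                      # reported F-statistic: 75.55
r2_from_f <- f_stat / (f_stat + df_resid)   # reported R-squared: 0.5375
r         <- sqrt(r2_from_f)                # magnitude of the Pearson correlation

print(round(c(F = f_from_t, R2 = r2_from_f, r = r), 4))
```

Since the fitted slope is positive, the correlation itself is positive, so r here is the correlation between the two variables in this sample rather than just its magnitude.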
Ref 1: Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F., Redwine, E.A. and Yang, N. (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. radical prostatectomy treated patients, Journal of Urology 141(5), 1076–1083.

