Computational Chemistry and Data Science: Linear Regressions in Data Science

Linear regression is a simple yet powerful prediction technique. It is easily implemented in Microsoft Excel, or even on a piece of graph paper. The basic idea is to take a set of data points and find the straight line that fits them best: in the usual least-squares formulation, the line that minimizes the sum of the squared vertical distances between itself and the data points.
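The least-squares idea can be sketched in a few lines. This is a minimal Python illustration using only the standard library (the analysis in this post was done in R); the function name `fit_line` and the data points are made up for the example and do not come from the study.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of the x-y cross products
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return slope, intercept

# Made-up illustrative points, not data from the study
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.3f}x + {intercept:.3f}")
```

The closed-form solution works because the squared-error criterion has a single minimum, so no iterative search is needed for a single predictor.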

I use R for statistical analysis for the following reasons:
1) It is far more powerful and flexible in its input than Excel could ever hope to be.
2) It is much more intuitive than other packages I have used.
3) Its learning curve is much gentler than that of using NumPy and pandas with Python.
4) It has many versatile packages that are simply installed and loaded, as compared to messing with sys.path (or PYTHONPATH) in Python.
5) It is free and easy to install.
6) There are many learning resources and tutorials for novices online.

A simple linear regression has the form y = mx + b, where the slope m indicates how much y changes for a one-unit change in x, and the intercept b is the predicted value of y when x is zero; depending on the context, b either acts as a baseline offset or carries meaning of its own. The strength of the fit of the line to the data points is given by the R² value, the fraction of the variance in y that the line explains. The closer R² is to 1, the better the linear fit of the line to the data. Other useful statistics include the residual standard error (roughly, the typical distance of the points from the fitted line) and the t-value. The t-value of a coefficient is its estimate divided by its standard error, and the associated p-value is the probability of seeing a t-value at least that extreme if the true coefficient were zero. If R² is near 1, the residual standard error is small relative to the spread of y, and the p-value from the t-test is below the chosen significance level, the data are consistent with a strong linear relationship between the variables. Additionally, the F-statistic tests whether the model as a whole explains more variance than an intercept-only model; with a single predictor it equals the square of the slope's t-value and therefore gives the same p-value.
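To make these statistics concrete, here is a minimal Python sketch (standard library only; R's `summary()` reports the same quantities for a fitted model). The helper name `regression_summary` and the data are invented for illustration, and the p-values are left out because they need a t-distribution that the standard library does not provide.

```python
import math
from statistics import mean

def regression_summary(xs, ys):
    """Fit y = m*x + b by least squares and return the usual fit statistics."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar

    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                    # total
    df = n - 2                                # residual degrees of freedom

    rse = math.sqrt(ss_res / df)              # residual standard error
    se_slope = rse / math.sqrt(sxx)           # standard error of the slope
    return {
        "slope": m,
        "intercept": b,
        "r_squared": 1 - ss_res / ss_tot,
        "residual_se": rse,
        "t_slope": m / se_slope,
        "f_stat": (ss_tot - ss_res) / (ss_res / df),
    }

# Made-up illustrative data, not from the study
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 2.9, 2.7, 4.8, 4.1, 6.5, 6.2, 8.3]

for name, value in regression_summary(xs, ys).items():
    print(f"{name:12s} {value:.4f}")
```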

Here are some example data plots from a landmark study on risk factors in prostate cancer (Ref 1). Figure 1 shows the weak, noisy relationship between the variables of age and cancer volume. From our understanding of cancer progression, it makes sense that there is little real relationship between age and cancer volume if the onset of the disease is similar for the groups. Sometimes this understanding of the data going into a problem (referred to as domain knowledge) can be almost as useful as having sophisticated knowledge of machine learning and programming.
The summary statistics in R (shown below) largely confirm our intuition: the slope for age is nominally significant at the 0.05 level (p ≈ 0.019), but with an R² of about 0.08, age explains only about 8% of the variance in cancer volume.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.75703    1.28285  -1.370   0.1755
Age          0.04742    0.01968   2.409   0.0188 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.2 on 65 degrees of freedom
Multiple R-squared: 0.08198, Adjusted R-squared: 0.06786
F-statistic: 5.805 on 1 and 65 DF, p-value: 0.01883

Figure 1: Scatter plot showing only a weak correlation (R² ≈ 0.08) between the variables age and cancer volume in a landmark prostate cancer study. All figures plotted with R.
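The numbers in the age summary are tied together by simple identities, which allows a quick consistency check. The following Python sketch uses only values copied from the R output above and assumes a single-predictor model with 67 observations (65 residual degrees of freedom plus two estimated coefficients).

```python
# Values copied from the R summary for the age model
t_age = 2.409        # t value of the Age coefficient
f_stat = 5.805       # F-statistic on 1 and 65 DF
df_resid = 65        # residual degrees of freedom
n = df_resid + 2     # observations: slope and intercept use two degrees of freedom

# With one predictor, F is the square of the slope's t value
t_squared = t_age ** 2

# R^2 can be recovered from F and the residual degrees of freedom
r2 = f_stat / (f_stat + df_resid)

# Adjusted R^2 penalizes R^2 for the number of fitted parameters
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

print(f"t^2          = {t_squared:.3f}  (reported F: {f_stat})")
print(f"R^2          = {r2:.5f}  (reported: 0.08198)")
print(f"adjusted R^2 = {adj_r2:.5f}  (reported: 0.06786)")
```

The recomputed values agree with the reported ones to within the rounding of the printed output. The R² of about 0.08 is the key figure here: the slope can clear the 0.05 significance threshold while age still explains only a small share of the variance.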

Figure 2 is the main result of the study. It shows the correlation between cancer volume and level of PSA (prostate specific antigen) in the blood. Here are the summary statistics from a training data sample.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.53623    0.23688  -2.264   0.0269 *
ResLPSA      0.75427    0.08678   8.692 1.73e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8515 on 65 degrees of freedom
Multiple R-squared: 0.5375, Adjusted R-squared: 0.5304
F-statistic: 75.55 on 1 and 65 DF, p-value: 1.733e-12

Figure 2: Plot showing a significant correlation between PSA levels and prostate cancer volumes
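Once the coefficients are known, the fitted line can be used directly for prediction. Below is a rough Python sketch with the coefficients copied from the summary above. The function name `predict_volume` and the example input of 2.0 are hypothetical, and because the post does not state the units or any transformation of ResLPSA, the input must be on whatever scale the model was fitted with.

```python
# Coefficients copied from the R summary for the PSA model
INTERCEPT = -0.53623
SLOPE = 0.75427
SLOPE_SE = 0.08678   # standard error of the slope

def predict_volume(psa_value):
    """Predicted cancer volume from the fitted line y = m*x + b.

    Both input and output are on the scale used when the model was
    fitted; that scale is not stated in the post (assumption).
    """
    return INTERCEPT + SLOPE * psa_value

# The t value in the summary is simply estimate / standard error
t_ratio = SLOPE / SLOPE_SE

example_input = 2.0  # arbitrary illustrative value, not from the study
print(f"t ratio for the slope: {t_ratio:.3f}")
print(f"predicted volume at x = {example_input}: {predict_volume(example_input):.5f}")
```

A prediction from this line carries real uncertainty: the residual standard error of 0.8515 indicates how far individual patients typically fall from the fitted value.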

The p-value of about 1.7e-12 shows that a relationship this strong is extremely unlikely to be due to chance, and the R² of about 0.54 means that PSA level accounts for roughly half of the variance in cancer volume. In fact, the relationship between PSA, which is available through a simple blood draw, and the size of a prostate tumor is a major breakthrough. This study shows that we can have a complementary diagnostic for prostate tumor presence without the dreaded digital rectal examination. While PSA is not sufficient by itself because of false positives, it does provide a basic screening technique.

Ref 1: Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F., Redwine, E.A. and Yang, N. (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. Radical prostatectomy treated patients, Journal of Urology 141(5), 1076–1083.


Friday, February 1, 2019
