As a computational chemist seeking to broaden my appeal to a world with many employers not impressed with the solvent-effect-free vacuum at 0 K in which I had played for so long, I have begun to study the world of business and finance around me. When I ask for Christmas or birthday presents, they are usually gift cards for purchasing textbooks in these fields (horribly nerdy, but that's how I roll). Quantitative finance has two fundamental axioms, the time value of money and the zero-arbitrage principle, from which the entire field is derived. They are very reminiscent of the first two laws of thermodynamics, and while mulling over their similarities, I thought about how they apply to my search for a full-time career.
1) You can't borrow your way out of debt. The time-value of money principle states that a dollar today is worth more than a dollar tomorrow. Taking out loans to go back to school, or increasing your credit card debt to pay for online classes, does not guarantee that you will complete the program (assuming that the program is legitimate). More importantly, it DOES NOT guarantee employment. Going further into debt to obtain more education that *maybe* you will complete, so that you can *maybe* earn more money at some future point and *maybe* maintain that employment for long enough to get out of debt completely, is tenuous at best.
2) Employers prefer experience to education, everything else being equal. Work experience with trustworthy references and a proven history of excelling at what you do means much more to most employers than having a doctorate or graduating from a top-tier school. Education implies that you have the potential to be successful at a position. Experience shows that you already perform at an acceptable level. Hiring someone who is educated but untested is a huge risk to an employer. There is something to be said for graduating with a doctorate in a STEM field, or having quality publications. These imply that you know how to work hard, that you are not one-dimensional, and that you have the potential to go above the bare minimum stipulated in the job description and provide meaningful contributions. However, it is okay that it may take hundreds of applications and several interviews, because in the end they are taking a huge chance on you.
3) Work is the opposite of play. Meaning and enjoyment in life do not come out of a job title. The major part of being a scientist who looks at data is staring at numbers on a computer all day. Worse than that, you have to hunt all over for those numbers and they are never in the right format. Computers are not as smart as you would believe. They only understand what you give them and are notorious sticklers for detail. While you are complaining about how your office chair hurts your back and how they really need to fix the AC in your area, there are people getting heat exhaustion from doing landscaping work all day who would love to have your problems.
The zero-arbitrage principle is the assumption in financial modeling that there are no unfair advantages between buyers and sellers on the large scale, so nobody can make a risk-free profit for nothing. It is another way of saying that if you want to make the grass greener, you had better be willing to put in some work. However, because nothing comes for free, we all have the ability to change our situations through effort. Keep going, and always remember that everything else looks better to us when we are going through hard times in our current careers.
This blog is an introduction to how abstract concepts in mathematics and science can help understand the disorder around us. I try to emphasize how this knowledge is both useful and profitable.
Sunday, February 17, 2019
Friday, February 1, 2019
Linear Regressions in Data Science
Linear regressions are simple yet powerful prediction techniques. They are easily implemented in Microsoft Excel, or even on a piece of graph paper. The basic idea is to take a set of data points and find the straight line that passes as close as possible to all of them, typically by minimizing the sum of the squared vertical distances between the points and the line (least squares).
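As a minimal sketch of that idea in R (the language used for the analysis later in this post), the snippet below fits a line to simulated data in two ways: with the textbook closed-form least-squares formulas and with R's built-in `lm()`. The data, the variable names and the helper `sse()` are made up for illustration and are not part of the study discussed below.

```r
# Simulated data (hypothetical, not from the study): y is linear in x plus noise.
set.seed(42)
x <- runif(50, min = 0, max = 10)
y <- 2.5 * x + 1 + rnorm(50, sd = 2)

# Closed-form least-squares estimates for the line y = m*x + b.
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- mean(y) - m * mean(x)

# The same fit using R's built-in linear model function.
fit <- lm(y ~ x)

# Sum of squared vertical distances between the points and a candidate line.
sse <- function(slope, intercept) sum((y - (slope * x + intercept))^2)

print(c(intercept = b, slope = m))  # hand-computed coefficients
print(coef(fit))                    # coefficients from lm()

# Nudging the fitted line away from the optimum increases the squared error.
print(sse(m, b) < sse(m + 0.1, b))
print(sse(m, b) < sse(m, b + 0.1))

# Use the fitted line to predict y at a new x value.
print(predict(fit, newdata = data.frame(x = 5)))
```

The two sets of coefficients should agree to within floating-point error, which is a quick way to convince yourself that `lm()` is doing ordinary least squares here.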
I use R for statistical analysis for the following reasons:
1) It is far more powerful and flexible in its input than Excel could ever hope to be.
2) It is much more intuitive than other packages I have used.
3) Its learning curve is much gentler than that of numpy and pandas with Python.
4) It has many versatile packages available that are simply imported, as compared to messing with `PYTHONPATH` in Python.
5) It is free and easy to install.
6) There are many learning resources and tutorials for novices online.
All simple linear regressions have the form y = mx + b, where the slope m indicates how a change in one variable affects the other, and b is an intercept (the predicted value of y when x is zero) that, depending on the context, provides either a standardization or may hold other information. The strength of the linear relationship between the plotted line and the data points (the fit) is given by the R² value, the fraction of the variance in y that the line explains. The closer R² is to 1, the better the linear fit of the line to the data. Other useful statistics include the residual standard error (the typical distance of the points from the fitted line) and the t-value. The t-value is used to test whether the slope differs from zero; its p-value is the probability of seeing a slope at least this extreme if there were no real relationship between our two variables. If R² is near 1, the standard error is small relative to the spread of the data, and the p-value from the t-test is below the chosen significance level, we have good evidence that our variables have a linear relationship. Additionally, the F-statistic tests whether the model as a whole fits better than a model with no predictors; with a single predictor it is simply the square of the slope's t-value.
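The statistics described above can all be pulled out of the object returned by `summary()` on an `lm` fit. The following is a small sketch on simulated data; the data frame `dat` and its columns are hypothetical stand-ins, not the study data.

```r
# Simulated data (hypothetical): a noisy linear relationship between x and y.
set.seed(1)
dat <- data.frame(x = rnorm(60))
dat$y <- 0.8 * dat$x + rnorm(60, sd = 0.7)

fit <- lm(y ~ x, data = dat)
s <- summary(fit)

# Row for the slope: Estimate, Std. Error, t value, Pr(>|t|).
slope_row <- s$coefficients["x", ]

r_squared     <- s$r.squared      # fraction of variance explained
adj_r_squared <- s$adj.r.squared  # R-squared penalized for model size
resid_se      <- s$sigma          # residual standard error
f_stat        <- s$fstatistic     # named vector: value, numdf, dendf

# p-value of the overall F-test, computed from the F distribution.
f_p <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"], lower.tail = FALSE)

print(slope_row)
print(c(r_squared = r_squared, adj_r_squared = adj_r_squared, resid_se = resid_se))

# With one predictor, F equals the squared t-value and the p-values coincide.
print(unname(slope_row["t value"]^2))
print(unname(f_stat["value"]))
```

Working with these components directly is handy when you want to collect results from many fits into a table instead of reading printed summaries one at a time.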
Here are some example data plots from a landmark study on risk factors in prostate cancer (Ref 1). Figure 1 shows the weak relationship between the variables of age and cancer volume. From our understanding of cancer progression, it makes sense that there is little relationship between age and cancer volume if the onset of the disease is similar for the groups. Sometimes this understanding of the data going into a problem (referred to as domain knowledge) can be almost as useful as having sophisticated knowledge of machine learning and programming.
The summary statistics in R (shown below) largely confirm our intuition. Although the slope is nominally significant at the 0.05 level (p = 0.019), age explains only about 8% of the variance in cancer volume (R² = 0.082):
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.75703    1.28285  -1.370   0.1755
Age          0.04742    0.01968   2.409   0.0188 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.2 on 65 degrees of freedom
Multiple R-squared: 0.08198, Adjusted R-squared: 0.06786
F-statistic: 5.805 on 1 and 65 DF, p-value: 0.01883
Figure 2 is the main result of the study. It shows the correlation between cancer volume and the level of PSA (prostate-specific antigen) in the blood. Here are the summary statistics from a training data sample.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.53623    0.23688  -2.264   0.0269 *
ResLPSA      0.75427    0.08678   8.692 1.73e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8515 on 65 degrees of freedom
Multiple R-squared: 0.5375, Adjusted R-squared: 0.5304
F-statistic: 75.55 on 1 and 65 DF, p-value: 1.733e-12
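The numbers in the two summaries are tied together by a few standard identities for a regression with a single predictor: the t-value is the estimate divided by its standard error, the F-statistic is the square of that t-value, and the adjusted R² follows from R² and the degrees of freedom. The sketch below re-derives them from the reported values as a sanity check. The list names and the helper `check()` are made up for illustration, and small mismatches are expected because the inputs are rounded.

```r
# Values copied from the two regression summaries (slope rows and fit statistics).
age <- list(est = 0.04742, se = 0.01968, t = 2.409, f = 5.805,
            r2 = 0.08198, adj = 0.06786, p = 0.0188)
psa <- list(est = 0.75427, se = 0.08678, t = 8.692, f = 75.55,
            r2 = 0.5375, adj = 0.5304, p = 1.73e-12)

df_resid <- 65       # residual degrees of freedom reported by R
n <- df_resid + 2    # one slope and one intercept were estimated

# Recompute each statistic from the others for one model.
check <- function(m) {
  c(t_from_est = m$est / m$se,                             # t = estimate / SE
    f_from_t   = m$t^2,                                    # F = t^2 (one predictor)
    adj_r2     = 1 - (1 - m$r2) * (n - 1) / df_resid,      # adjusted R-squared
    p_from_t   = 2 * pt(-abs(m$t), df = df_resid))         # two-sided p-value
}

res_age <- check(age)
res_psa <- check(psa)
print(res_age)
print(res_psa)
```

If the recomputed values land close to the reported ones, the table was most likely transcribed correctly.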
The statistically significant results, with a p-value of approximately zero, show that our results are very unlikely to be due to chance. In fact, the relationship between PSA, which is available through a simple blood draw, and the size of a prostate tumor is a humongous breakthrough. This study shows that we can have a complementary diagnosis for prostate tumor presence without the dreaded digital rectal examination. While PSA is not sufficient by itself because of false positives, it does provide a basic screening technique.
Figure 1: Scatter plot showing only a weak correlation between the variables age and cancer volume in a landmark prostate cancer study. All figures plotted with R.
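For readers who want to make a similar figure, here is a rough sketch of a scatter plot with a fitted regression line in base R. It uses simulated stand-in data, not the study's measurements, and the variable names `age` and `volume`, the axis labels and the colors are assumptions chosen for illustration.

```r
# Simulated stand-in data (hypothetical): volume is generated independently of age.
set.seed(7)
age <- round(runif(67, min = 40, max = 80))
volume <- rnorm(67, mean = 1.3, sd = 1.2)

fit <- lm(volume ~ age)

# Scatter plot of the raw points.
plot(age, volume,
     xlab = "Age (years)",
     ylab = "Cancer volume (arbitrary units)",
     main = "Simulated example: age vs. cancer volume",
     pch = 19, col = "steelblue")

# Overlay the fitted least-squares line.
abline(fit, col = "red", lwd = 2)
```

Because the simulated response does not depend on age, the fitted line should come out close to flat.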
Figure 2: Plot showing a significant correlation between PSA levels and prostate cancer volumes.
Ref 1: Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F., Redwine, E.A. and Yang, N. (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. radical prostatectomy treated patients, Journal of Urology 141(5), 1076–1083.