Computational Chemistry and Data Science

Tuesday, April 9, 2019

Why should you hire me?

In my efforts to find long-term, stable employment, I have found no shortage of people who seem to make their living through career and brand development tools. This blog and all of my other online efforts have had the single purpose of securing dependable employment where I can use my training. However, not many people hire specifically for PhD-level chemists who have a primary background in theory and computation.

With the number of views that this blog gets (or does not), I would hazard a guess that anyone who is reading this is specifically looking at information about me and considering me for some position. So, I will attempt to plead a case for hiring someone with my skill set.

I spend my time thinking about how things work and how to make them better. Just as when I am doing academic research, work is never done just because it is quitting time. I hate not understanding how something works and feeling like I do not have a solution. Because of this, my mind runs constantly until I find some aspect of the problem that I can solve. It is as if my brain were a computer with background processes and applications constantly churning.

This comes from the results-oriented nature of research training. You get no recognition for the hours you spend working on a problem: the only objective measure is results. I do not mean just the incremental work that pads a CV or resume. What I am referring to is handing someone data garnered through integrity that has become part of you. It may sound strange to think about projecting yourself through exciting insights that you just gained from going through literally thousands of numbers, but when it happens, it is magic. You could just call this software aptitude with Excel or the piece of scientific software you were using, but it is different. Nearly anyone can use a piece of software and mimic a conclusion. That is why there are so many posers out there who make it difficult to gain the trust of employers. Problem -> solution, Google this, get that answer. Most robots could replace them.

The difference between them and someone like me is that they are constrained by a pattern that they have seen before. Like the voice recognition software that cannot understand a different accent, novelty throws them for a proverbial loop. They will say that it cannot be done because no one has done it before. There is nothing that brings me more joy than refuting this way of thinking.

Researchers like me are akin to trailblazers who find solace in the unknown wilderness and wake up excited at what they get to discover that day. A typical employee states that it is above their pay grade and finds ways to reduce their effort while demanding more from you. To them, problem-solving and being in unfamiliar situations is more effort, so they avoid doing it. To me, it is the reason that I go to work with a smile.

So, why me and not someone else? They may claim that they are dependable, honest, reliable, and trustworthy, and promise you the moon. The difference is that if I promise you the moon, then after meeting with me, you will sleep comfortably with the knowledge that you will be owning a natural satellite soon.

Sunday, February 17, 2019

Zero Arbitrage in Career Changes: Greener Grass Takes Work

As a computational chemist seeking to broaden my appeal to a world with many employers not impressed with the solvent-free vacuum at 0 K in which I had played for so long, I have begun to study the world of business and finance around me. When I ask for Christmas or birthday presents, they usually take the form of gift cards that purchase textbooks in these fields (horribly nerdy, but that's how I roll). Quantitative finance has two fundamental axioms, the time value of money and the zero-arbitrage principle, from which the entire field is derived. It is very reminiscent of the first two laws of thermodynamics, and while mulling over their similarities, I thought about how these apply to my search for a full-time career.

1) You can't borrow your way out of debt. The time-value of money principle states that a dollar today is worth more than a dollar tomorrow. Deciding to take out loans to go back to school, or increasing your credit card debt to pay for online classes, does not guarantee that you will complete the program (assuming that the program is legitimate). More importantly, it DOES NOT guarantee employment. Going into further debt to obtain more education that maybe you will complete, so that you can maybe earn more money at some future point and maybe maintain that employment for long enough to get out of debt completely, is tenuous at best.

2) Employers prefer experience to education, everything else being equal. Work experience with trustworthy references and a proven history of excelling at what you do means much more to most employers than having a doctorate or graduating from a top-tier school. Education implies that you have the potential to be successful at a position. Experience shows that you already perform at an acceptable level. Hiring someone who is educated but untested is a huge risk to an employer. There is something to be said for graduating with a doctorate in a STEM field, or having quality publications. These imply that you know how to work hard, that you are not one-dimensional, and that you have the potential to go above the bare minimum stipulated in the job description and provide meaningful contributions. However, it is okay that it may take hundreds of applications and several interviews, because in the end they are taking a huge chance on you.

3) Work is the opposite of play. Meaning and enjoyment in life do not come from a job title. The major part of being a scientist who looks at data is staring at numbers on a computer all day. Worse than that, you have to hunt all over for those numbers and they are never in the right format. Computers are not as smart as you would believe. They only understand what you give them and are notorious sticklers for detail. While you are complaining about how your office chair hurts your back and how they really need to fix the AC in your area, there are people getting heat exhaustion from doing landscaping work all day who would love to have your problems.

The zero-arbitrage principle is the assumption in financial modeling that there are no risk-free profits to be had: neither buyers nor sellers hold an unfair advantage on the large scale. It is another way of saying that if you want to make the grass greener, you had better be willing to put in some work. Nothing comes for free; however, we all have the ability to change our situations through effort. Keep going, and always remember that everything else looks better to us when we are going through hard times in our current careers.

Friday, February 1, 2019

Linear Regressions in Data Science

Linear regressions are simple, yet powerful prediction techniques. They are easily implemented in Microsoft Excel, or even on a piece of graph paper. The basic idea is to take a set of data points and find the straight line that passes as close as possible to all of them, typically by minimizing the sum of the squared vertical distances between the points and the line.

I use R for statistical analysis for the following reasons:
1) It is far more powerful and flexible in its input than Excel could ever hope to be.
2) It is much more intuitive than other packages I have used.
3) Its learning curve is much gentler than that of using NumPy and pandas with Python.
4) It has many packages with versatile uses available that are simply imported, as compared to messing with sys.path or PYTHONPATH in Python.
5) It is free and easy to install.
6) There are many learning resources and tutorials for novices online.

Simple linear regressions take the form y = mx + b, where the slope m indicates how a change in one variable affects the other, and b is an intercept that, depending on the context, provides either a standardization or may hold other information. The strength of the linear relationship of the fitted line to the data points (the fit) is given in terms of an R² value. The closer an R² value is to 1, the better the linear fit of the line to the data. Other useful statistics include the residual standard error (the typical distance of the points from the fitted line) and the t-value. The t-value is the estimated coefficient divided by its standard error, and its p-value is the probability of seeing a relationship at least this strong if the two variables were actually unrelated. If the R² is near 1, the residual standard error is small relative to the spread of the data, and the p-value from our t-test is below the chosen significance level, the data support a linear relationship between our variables. Additionally, the F-statistic tests whether the model as a whole explains more variation than an intercept alone; with a single predictor it equals the square of the slope's t-value.
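To make these statistics concrete, here is a minimal sketch in Python (standard library only, not the R workflow used for the figures in this post). The function name simple_linear_fit and the five data points are made up purely for illustration. It computes the slope, intercept, R², residual standard error, t-value of the slope, and F-statistic directly from their textbook definitions.

```python
import math


def simple_linear_fit(x, y):
    """Ordinary least-squares fit of y = m*x + b with basic fit statistics."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length sequences with at least 3 points")

    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Sums of squares and cross-products about the means
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    m = sxy / sxx              # slope
    b = y_mean - m * x_mean    # intercept

    # Residual and total sums of squares
    ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)

    df = n - 2                              # residual degrees of freedom
    r_squared = 1.0 - ss_res / ss_tot
    resid_se = math.sqrt(ss_res / df)       # residual standard error
    slope_se = resid_se / math.sqrt(sxx)    # standard error of the slope

    # A perfect fit has zero residuals, so t and F are unbounded
    t_value = m / slope_se if slope_se > 0 else math.inf
    f_stat = (ss_tot - ss_res) / (ss_res / df) if ss_res > 0 else math.inf

    return {
        "slope": m,
        "intercept": b,
        "r_squared": r_squared,
        "residual_se": resid_se,
        "t_value": t_value,
        "f_statistic": f_stat,
    }


# Hypothetical data, invented for this example only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = simple_linear_fit(x, y)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

Turning the t-value into a p-value requires the t-distribution with n - 2 degrees of freedom, which is not part of this sketch; R's summary output reports it directly.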

Here are some example data plots from a landmark study on risk factors in prostate cancer (Ref 1). Figure 1 shows the weak relationship between the variables of age and cancer volume. From our understanding of cancer progression, it makes sense that age alone tells us little about cancer volume if the onset of the disease is similar for the groups. Sometimes this understanding of the data going into a problem (referred to as domain knowledge) can be almost as useful as having sophisticated knowledge of machine learning and programming.
The summary statistics in R (shown below) largely support this intuition: the slope for age is nominally significant at the 0.05 level (p = 0.019), but the R² of 0.08 means that age accounts for only about 8% of the variation in cancer volume:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.75703    1.28285  -1.370   0.1755
Age          0.04742    0.01968   2.409   0.0188 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.2 on 65 degrees of freedom
Multiple R-squared: 0.08198, Adjusted R-squared: 0.06786
F-statistic: 5.805 on 1 and 65 DF, p-value: 0.01883

Figure 1: Scatter plot showing only a weak correlation (R² = 0.08) between the variables age and cancer volume in a landmark prostate cancer study. All figures plotted with R.
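The numbers in the R summary above are tied together by a few simple identities, and checking them is a good habit when reading regression output. The sketch below, in Python with hypothetical helper names, uses only values copied from that summary. It assumes a single predictor, so the number of observations is the residual degrees of freedom plus two.

```python
def t_from_estimate(estimate, std_error):
    """t-value of a coefficient: the estimate divided by its standard error."""
    return estimate / std_error


def adjusted_r_squared(r_squared, n_obs, n_predictors=1):
    """Adjusted R-squared, which penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values copied from the age vs. cancer volume summary above
age_estimate = 0.04742
age_std_error = 0.01968
r_squared = 0.08198
residual_df = 65

# One predictor plus an intercept: observations = residual df + 2
n_obs = residual_df + 2

t_value = t_from_estimate(age_estimate, age_std_error)
f_from_t = t_value ** 2                                     # F equals t squared for one predictor
f_from_r2 = r_squared / (1.0 - r_squared) * residual_df     # F expressed through R-squared
adj_r2 = adjusted_r_squared(r_squared, n_obs)

print(f"t value:            {t_value:.3f}")
print(f"F from t squared:   {f_from_t:.3f}")
print(f"F from R-squared:   {f_from_r2:.3f}")
print(f"adjusted R-squared: {adj_r2:.5f}")
```

Small differences from the printed summary come from rounding in the reported values, not from a different calculation.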

Figure 2 is the main result of the study. It shows the correlation between cancer volume and level of PSA (prostate specific antigen) in the blood. Here are the summary statistics from a training data sample.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.53623    0.23688  -2.264   0.0269 *
ResLPSA      0.75427    0.08678   8.692 1.73e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8515 on 65 degrees of freedom
Multiple R-squared: 0.5375, Adjusted R-squared: 0.5304
F-statistic: 75.55 on 1 and 65 DF, p-value: 1.733e-12

Figure 2: Plot showing a significant correlation between PSA levels and prostate cancer volumes

The p-value of about 1.7e-12 shows that a relationship this strong is very unlikely to be due to chance. In fact, the relationship between PSA, which is available through a simple blood draw, and the size of a prostate tumor is a humongous breakthrough. This study shows that we can have a complementary diagnosis for prostate tumor presence without the dreaded digital prostate examination. While PSA is not sufficient by itself because of false positives, it does provide a basic screening technique.

Ref 1: Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F., Redwine, E.A. and Yang, N. (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. Radical prostatectomy treated patients, Journal of Urology 141(5), 1076–1083.

Sunday, January 20, 2019

Finding Order from Chaos: Why we need data scientists

It often seems like I spend more time cleaning my house than living in it, especially with kids. A basic fact of nature is that matter is constantly becoming more disordered. Scientists refer to this as the Second Law of Thermodynamics, and I am sometimes in awe of how a law that explains why stars die and scents spread through a room is seen in something as simple as toys on the living room floor.

This principle can be readily observed in a business setting where PCs and servers become filled with disorganized and often useless files. While our transition to electronic records and communications has certainly reduced our use of paper, the amount of data that we have to confront is no less intimidating than in "snail mail" times. Regardless of the format, sorting through the inevitable disorder that comes from having a business of any size is intimidating and tedious, because we do not want to throw away anything that might be important.

The size and types of data files that confront even a small business today are getting larger and harder to simply classify as "Junk" or "Valuable." Even if you do have the expertise to go through these files, the axiom of "Time is Money" becomes very true. Also, sometimes what seems like random electronic garbage may have more value than you originally thought.

Buried inside those piles of gigabytes (or even terabytes) are fragments that can provide a surprisingly complete assessment of a company. Data science can provide insights into many important business topics including:

1) Increasing your customer base

2) Providing insights into why customers really chose a competitor

3) How to retain more clients in your practice

4) Employee productivity

5) Ensuring compliance with regulations and preparing for audits

6) How to reduce wasteful spending

7) Increasing your margins

The purpose of this blog will be to provide examples of each of the topics listed above. I look forward to demonstrating the need for data scientists like me and justifying our role through concrete and easy-to-read posts.