Automate Office with Python: Logistic Regression

As with the post on linear regression, there are already sites out there that explain the theory behind logistic regression, so I won't go into details. Basically, the idea is to be able to categorise items. Some examples: is this sale fraudulent or not, or is patient A going to develop condition B or not? The often-cited example is determining the type of flower from the sizes of its petals and sepals (the iris data set).

I have a bunch of (anonymous) data from Dupuytren's and Ledderhose patients that I have collected for another blog I run. Dupuytren's is more common than Ledderhose, so for patients who have Dupuytren's I want to predict whether they have also developed Ledderhose. I know for a fact that there is no strong correlation between the data points I have and the patient developing both conditions, but it seemed like a good example.

So my plan is simple: run a basic logistic regression model using scikit-learn. The code for this is below. The results, as expected, aren't great, but the model does give a better than 50:50 prediction, so it does better than I or any doctor could do... I guess this means I need to analyse the data further to see if I can find where the trend lies. I have already spent a significant amount of time cleansing the data and reducing it to the possibly relevant fields before importing it, and did some analysis in seaborn, which I will cover elsewhere.

The results

I won't go into detail on these results. Some useful links for interpreting them are:

Hands-On Machine Learning with Scikit-Learn and TensorFlow - A great book on machine learning

Link 2, Link 3

So basically, for class 0 the predictions were 72% accurate: as the data below shows, 303 were correctly predicted and 115 were incorrectly predicted. For class 1 there is a much smaller sample and a lower success rate, with only 54% being correctly predicted. As I said above, this suggests that from the data used (without further work) we are not able to provide a good prediction of whether a Dupuytren's patient will develop Ledderhose.
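The 72% figure for class 0 can be checked directly from the two counts quoted above. This is only the arithmetic, and the variable names are made up for illustration. Whether the figure is precision or recall depends on whether the 418 cases are a row or a column of the confusion matrix, which the text does not say.

```python
# Counts quoted in the text for class 0.
correct = 303
incorrect = 115

total = correct + incorrect   # all class 0 cases counted
rate = correct / total        # fraction predicted correctly

print(f"{correct} / {total} = {rate:.1%}")   # prints "303 / 418 = 72.5%"
```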

The code:
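A minimal sketch of a basic scikit-learn logistic regression of the kind described above. It is an illustration only. Synthetic data from `make_classification` stands in for the patient data, and the class balance, split size, `max_iter` value and variable names are all assumptions rather than the settings behind the results above, so the numbers it prints will not match them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient data: 2,000 rows, 8 numeric features,
# with class 1 as the minority (roughly 30%). None of this is real data.
X, y = make_classification(
    n_samples=2000,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    weights=[0.7, 0.3],
    random_state=0,
)

# Hold back 30% of the rows for testing, keeping the class balance the same.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Basic logistic regression; max_iter is raised so the solver converges.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix: rows are the true class, columns the predicted class.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1, plus overall accuracy.
print(classification_report(y_test, y_pred))
```

With real data, `X` would be the cleaned feature columns and `y` the 0/1 column recording whether the patient also has Ledderhose. The per-class percentages discussed in the results come from a report like the one `classification_report` prints.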

Monday, 1 January 2018
