As with the post on linear regression, there are already sites out there that explain the theory behind logistic regression, so I won't go into details. Basically, the idea is to categorise items. Some examples: is this sale fraudulent or not, or is patient A going to develop condition B or not? The often-cited example is determining the type of flower from the sizes of its petals (the iris data set).
I have a bunch of (anonymous) data from Dupuytren's and Ledderhose patients that I have collected for another blog I run. Dupuytren's is more common than Ledderhose, so I want to predict whether patients who have Dupuytren's have also developed Ledderhose. I know for a fact that there is no strong correlation between the data points I have and a patient developing both conditions, but it seemed like a good example.
So my plan is simple: run a basic logistic regression model using scikit-learn. The code for this is below. The results, as expected, aren't great, but the model does give better than 50:50 predictions, so it does better than I or any doctor could do... I guess this means I need to analyse the data further to see if I can find where the trend lies. I have already spent a significant amount of time cleansing the data and reducing it to the possibly relevant fields before importing it, and I did some analysis in seaborn, which I will cover elsewhere.
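For anyone who wants to try the same kind of workflow without the patient data, here is a minimal sketch of a basic scikit-learn logistic regression. Everything about the data is an assumption: it uses `make_classification` to generate a synthetic, imbalanced two-class data set as a stand-in, and the sample size, feature count, class weights and split ratio are made up for illustration, so the numbers it prints will not match the results discussed in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient data (NOT the real data set):
# 1000 rows, 6 features, roughly 70/30 class imbalance.
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    weights=[0.7, 0.3],
    random_state=0,
)

# Hold back 30% of the rows for testing, keeping the class ratio the same.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit a plain logistic regression with default regularisation.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows of the confusion matrix are the true classes, columns the predictions.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

The classification report gives precision, recall and f1-score per class, which is the kind of per-class breakdown the results section talks about.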
The results
I won't go into detail on these results; some useful links for this are:
Hands-On Machine Learning with Scikit-Learn and TensorFlow - A great book on machine learning
The code:
Link 2, Link 3
So basically, for class 0 the predictions were 72% accurate: looking at the data below, 303 were correctly predicted and 115 were incorrectly predicted. For class 1 there is a much smaller sample and a lower success rate, with only 54% being correctly predicted. As I said above, this suggests that from the data used (without further work) we are not able to provide a good prediction of whether a Dupuytren's patient will develop Ledderhose.
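As a quick sanity check, the 72% figure for class 0 can be reproduced from the two counts quoted above. This is only the arithmetic, using the 303 and 115 from the text. Whether the number corresponds to precision or recall in the scikit-learn report depends on whether those counts were read from a row or a column of the confusion matrix, which I am not assuming here.

```python
# Counts quoted in the post for class 0.
correct = 303
incorrect = 115

# Share of class 0 cases that were predicted correctly.
rate = correct / (correct + incorrect)
print(f"{rate:.1%}")
```

The equivalent counts for class 1 are not quoted in the text, so the 54% cannot be checked the same way.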
The code: