What Is Naive Bayes?

In machine learning, naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models,[1] but coupled with kernel density estimation they can achieve higher accuracy levels. [Wikipedia]
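To make the independence assumption concrete, here is a minimal sketch of a multinomial naive Bayes classifier in plain Python. It is not the scikit-learn implementation used for the results in this post. The function names `train_nb` and `predict_nb`, the smoothing parameter `alpha`, and the four toy postings are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)  # additive (Laplace) smoothing
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_lik": {
                w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
            },
        }
    return model


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label, params in model.items():
        # Independence assumption: per-word log-likelihoods simply add up.
        # Words never seen in training are skipped for every class.
        scores[label] = params["log_prior"] + sum(
            params["log_lik"][w] for w in tokens if w in params["log_lik"]
        )
    return max(scores, key=scores.get)


# Hypothetical toy postings, not taken from the data set in this post.
docs = [
    "earn money fast from home".split(),
    "wire transfer fee required to apply".split(),
    "software engineer with python experience".split(),
    "nurse position at city hospital".split(),
]
labels = ["fake", "fake", "real", "real"]

model = train_nb(docs, labels)
prediction = predict_nb(model, "earn money from home".split())
print(prediction)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.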
Train Time for All: 2.30 seconds
Class | precision | recall | f1-score | support
---|---|---|---|---
Real | 0.95 | 1.00 | 0.97 | 4254
Fake | 0.00 | 0.00 | 0.00 | 216
Accuracy | | | 0.95 | 4470
Macro Avg | 0.48 | 0.50 | 0.49 | 4470
Weighted Avg | 0.91 | 0.95 | 0.93 | 4470
Findings: We used the same MultinomialNB model with all the data. It ran in 2.3 seconds and had an accuracy of 0.95, but it was unable to predict fake posts: precision and recall for the Fake class were both 0.00. We decided to run ComplementNB next to see what would happen.
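One way to read the 0.95 accuracy is to compare it with a trivial baseline. The short sketch below uses only the class supports from the report above and computes the accuracy of a hypothetical classifier that always predicts the majority class. The variable names are illustrative.

```python
# Class supports taken from the classification report above.
support = {"Real": 4254, "Fake": 216}

total = sum(support.values())
majority_class = max(support, key=support.get)

# A classifier that always answers with the majority class is right
# exactly as often as that class occurs.
baseline_accuracy = support[majority_class] / total

print(f"Always predicting {majority_class!r}: accuracy = {baseline_accuracy:.2f}")
```

Because the always-Real baseline also reaches about 0.95, accuracy alone says little here. The per-class recall for Fake is the more informative number.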
Class | precision | recall | f1-score | support
---|---|---|---|---
Real | 0.95 | 0.99 | 0.97 | 4254
Fake | 0.11 | 0.01 | 0.02 | 216
Accuracy | | | 0.95 | 4470
Macro Avg | 0.53 | 0.50 | 0.50 | 4470
Weighted Avg | 0.91 | 0.95 | 0.93 | 4470
Findings: ComplementNB is an adaptation of the standard multinomial naive Bayes algorithm that is particularly suited to imbalanced data sets. We found that with a sub-dataset containing an equal number of fake and real posts, the algorithm ran in 0.06 s, had an accuracy of 0.84, and had a precision of 0.81 for fake postings. Unfortunately, the accuracy for the fake posts was very poor.
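For readers who want to try a comparison like this, the following is a rough sketch of fitting scikit-learn's MultinomialNB and ComplementNB on the same bag-of-words features and printing per-class metrics. The toy corpus, the use of `CountVectorizer`, and evaluating on the training data are assumptions made to keep the example short. This is not the pipeline or the data behind the tables in this post.

```python
from time import perf_counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Hypothetical, deliberately imbalanced toy corpus (8 real posts, 2 fake).
real_posts = [
    "software engineer python experience required",
    "registered nurse city hospital night shift",
    "accountant with tax filing experience",
    "warehouse associate full time with benefits",
    "high school math teacher position",
    "customer support representative phone and email",
    "civil engineer bridge design projects",
    "marketing analyst with sql experience",
]
fake_posts = [
    "earn money fast from home no experience",
    "pay a small fee to receive your job offer",
]
texts = real_posts + fake_posts
y = [0] * len(real_posts) + [1] * len(fake_posts)  # 0 = Real, 1 = Fake

X = CountVectorizer().fit_transform(texts)

reports = {}
for clf in (MultinomialNB(), ComplementNB()):
    start = perf_counter()
    clf.fit(X, y)
    elapsed = perf_counter() - start

    # Evaluated on the training data only to keep the sketch short;
    # a real experiment would use a held-out test split.
    pred = clf.predict(X)
    name = type(clf).__name__
    reports[name] = classification_report(
        y,
        pred,
        labels=[0, 1],
        target_names=["Real", "Fake"],
        zero_division=0,
        output_dict=True,
    )
    print(f"{name}: train time {elapsed:.4f} s")
    print(classification_report(
        y, pred, labels=[0, 1], target_names=["Real", "Fake"], zero_division=0
    ))
```

Passing `zero_division=0` makes precision report as 0.00 instead of raising a warning when a class is never predicted.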
Class | precision | recall | f1-score | support
---|---|---|---|---
Real | 0.87 | 0.99 | 0.93 | 1082
Fake | 0.84 | 0.26 | 0.40 | 216
Accuracy | | | 0.87 | 1298
Macro Avg | 0.85 | 0.63 | 0.66 | 1298
Weighted Avg | 0.87 | 0.87 | 0.84 | 1298
Findings: Lastly, to compare with other models, we used a subset of the data with a 1:5 ratio of fake to real posts. The model took 0.48 s to build, the accuracy was 0.84, and the precision for the fake class was 0.81. In conclusion, these classifiers are very fast, but they are very poor predictors when applied to our data set.
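A subset with a fixed fake:real ratio can be built by random undersampling of the majority class. The sketch below is one possible way to do this in plain Python and is not necessarily how the subsets in this post were constructed. The helper name `undersample`, its parameters, and the synthetic records are hypothetical.

```python
import random


def undersample(records, labels, minority, ratio=5, seed=0):
    """Keep every minority record plus `ratio` times as many majority records."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]

    # Never ask for more majority records than are available.
    n_keep = min(len(majority_idx), ratio * len(minority_idx))
    keep = minority_idx + rng.sample(majority_idx, n_keep)
    rng.shuffle(keep)
    return [records[i] for i in keep], [labels[i] for i in keep]


# Synthetic stand-in data: 20 fake and 400 real postings.
labels = ["fake"] * 20 + ["real"] * 400
records = [f"post {i}" for i in range(len(labels))]

sub_records, sub_labels = undersample(records, labels, minority="fake", ratio=5)
print(sub_labels.count("fake"), sub_labels.count("real"))  # 20 100
```

Setting `ratio=1` would give a balanced subset with equal numbers of fake and real posts.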