NLP: Naive Bayes


What Is Naive Bayes?

In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are among the simplest Bayesian network models, but when coupled with kernel density estimation they can achieve higher accuracy levels. [Wikipedia]
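To make the idea concrete, here is a minimal sketch of a naive Bayes text classifier using scikit-learn's MultinomialNB (the same class used below); the toy posts and labels are illustrative placeholders, not our data:

# Bag-of-words counts; naive Bayes then treats each word count as
# conditionally independent given the class label (the "naive" assumption).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

posts = [
    "earn money fast work from home",      # fake
    "software engineer full time role",    # real
    "quick cash no experience needed",     # fake
    "registered nurse hospital position",  # real
]
labels = ["Fake", "Real", "Fake", "Real"]

vec = CountVectorizer()
X = vec.fit_transform(posts)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["remote work easy money"])))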


Train time (all data): 2.30 seconds

Multinomial Naive Bayes

Classification Report: Multinomial Naive Bayes (All Data)

              precision    recall  f1-score   support
        Real       0.95      1.00      0.97      4254
        Fake       0.00      0.00      0.00       216
    Accuracy                           0.95      4470
   Macro Avg       0.48      0.50      0.49      4470
Weighted Avg       0.91      0.95      0.93      4470

Findings: We used the same MultinomialNB model with all of the data; it trained in 2.3 seconds and reached an accuracy of 0.95, but it was unable to predict any fake posts (precision and recall of 0.00 for the Fake class). We decided to run ComplementNB next to see what would happen.
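For reference, the run above was likely produced with a pattern like the following. This is a hedged, self-contained sketch: the corpus, vectorizer, and train/test split are stand-in assumptions, not the project's actual preprocessing.

import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Placeholder corpus standing in for the job-postings data.
texts = ["hiring nurse full time", "easy money work from home",
         "software developer onsite role", "no experience quick cash"] * 50
labels = ["Real", "Fake", "Real", "Fake"] * 50

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels)

start = time.time()
clf = MultinomialNB().fit(X_train, y_train)
print(f"Train time: {time.time() - start:.2f} s")
print(classification_report(y_test, clf.predict(X_test)))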


Complement Naive Bayes

Classification Report: Complement Naive Bayes (All Data)

              precision    recall  f1-score   support
        Real       0.95      0.99      0.97      4254
        Fake       0.11      0.01      0.02       216
    Accuracy                           0.95      4470
   Macro Avg       0.53      0.50      0.50      4470
Weighted Avg       0.91      0.95      0.93      4470

Findings: ComplementNB is an adaptation of the standard multinomial naive Bayes algorithm that is particularly suited for imbalanced data sets. With a sub-dataset containing an equal number of fake and real posts, the algorithm ran in 0.06 s, reached an accuracy of 0.84, and had a precision of 0.81 for fake postings. Unfortunately, recall for the fake posts remained very poor.
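In scikit-learn, ComplementNB is a drop-in replacement for MultinomialNB: it fits its per-class weights on the complement of each class's documents, which keeps the estimates for a rare class from being starved of data. A minimal sketch on toy, imbalanced placeholder data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB

# 30 "real" posts to 1 "fake" one, mimicking a heavy class skew.
posts = ["nurse position full benefits", "analyst role onsite team",
         "developer job great salary"] * 10 + ["pay an upfront fee to start"]
labels = ["Real"] * 30 + ["Fake"]

vec = CountVectorizer()
X = vec.fit_transform(posts)
clf = ComplementNB().fit(X, labels)
print(clf.predict(vec.transform(["upfront fee required"])))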


1:5 Naive Bayes

Classification Report: Naive Bayes (1:5 fake:real subset)

              precision    recall  f1-score   support
        Real       0.87      0.99      0.93      1082
        Fake       0.84      0.26      0.40       216
    Accuracy                           0.87      1298
   Macro Avg       0.85      0.63      0.66      1298
Weighted Avg       0.87      0.87      0.84      1298

Findings: Lastly, to compare with the other models, we used a subset of the data with a 1:5 ratio of fake to real posts (see the sketch below for how such a subset can be drawn). The model took 0.48 s to build, the accuracy was 0.87, and the precision for the fake class was 0.84. In conclusion, these classifiers are very fast but very poor predictors when applied to our data set.
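A hedged sketch of drawing such a 1:5 subset with pandas; the "fraudulent" column name and the 1 = fake coding are assumptions about the underlying data, and the frame below is a toy stand-in:

import numpy as np
import pandas as pd

# Toy frame standing in for the postings data (about 5% fake).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "text": [f"posting {i}" for i in range(5000)],
    "fraudulent": rng.choice([0, 1], size=5000, p=[0.95, 0.05]),
})

# Keep every fake post and sample five real posts per fake one.
fake = df[df["fraudulent"] == 1]
real = df[df["fraudulent"] == 0].sample(n=5 * len(fake), random_state=42)
subset = pd.concat([fake, real]).sample(frac=1, random_state=42)  # shuffle rows
print(subset["fraudulent"].value_counts())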

Model Comparison

Feature Set    Model    Precision (Fake)
NLP            k-NN     0.76
NLP            SVM      0.89
NLP            DL       0.92
NLP            NB       0.84
NLP            RF       0.58
Non-Textual    k-NN     0.85
Non-Textual    SVM      0.84
Non-Textual    DL       0.86
Non-Textual    RF       1.00