Non-Textual: K Nearest Neighbors


What Is k-NN? In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. [Wikipedia]
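The definition above can be sketched with a minimal example (toy data, not the project's dataset): the predicted label is simply a majority vote among the k closest training points in feature space.

```python
# Minimal k-NN classification sketch on hypothetical toy data.
from sklearn.neighbors import KNeighborsClassifier

# Two features per point; label 1 marks the positive (e.g. fraudulent) class.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Each query point takes the majority label of its 3 nearest neighbors.
pred = model.predict([[0.5, 0.5], [5.5, 5.5]])
print(pred.tolist())  # → [0, 1]
```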

Train Time: 0.29 seconds

Findings: The K Nearest Neighbors model performed reasonably well at identifying fraudulent postings even before hyperparameter tuning, with a precision of 0.80 on the fraudulent class and an overall accuracy of 0.96. After tuning, precision rose to 0.85 and accuracy edged up to 0.97. It is not the most effective model of the ones we tested, but it certainly holds its own.
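The tuning step described above can be sketched with a grid search over k-NN hyperparameters, scored on precision; the exact grid and data here are illustrative assumptions, not the project's actual configuration.

```python
# Sketch of k-NN hyperparameter tuning on synthetic, imbalanced data
# (fraud-like: roughly 10% positives). The parameter grid is an assumption.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9],
                "weights": ["uniform", "distance"]},
    scoring="precision",  # precision on the positive (fraudulent) class
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 2))
```

Cross-validated precision, rather than plain accuracy, is the natural scoring choice here because the classes are heavily imbalanced.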

Even though the main objective of our project was to explore Natural Language Processing and see how effective these models were at reading and interpreting text, we wanted to see how accurately the same models could predict fraudulent job postings from the other available data in our dataset. The same features were used with each model; they included whether or not a company logo was provided, whether or not a salary and/or benefits were listed, industry, required experience, required education, employment type, and so on. Without the job description itself there did not seem to be much data to rely on, so we were not sure what to expect.
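Before feeding features like these to k-NN, the categorical fields need a numeric encoding. A sketch of one way to do that (column names are illustrative, not the dataset's exact schema): boolean flags pass through unchanged, and categorical fields are one-hot encoded.

```python
# Sketch of preparing non-textual job-posting features for k-NN.
# Column names are hypothetical examples of the feature types described.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

jobs = pd.DataFrame({
    "has_company_logo":    [1, 0, 1],
    "has_salary":          [1, 0, 0],
    "employment_type":     ["Full-time", "Part-time", "Full-time"],
    "required_experience": ["Mid-Senior", "Entry", "Entry"],
})

encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"),
      ["employment_type", "required_experience"])],
    remainder="passthrough",  # keep the boolean flags as-is
)
X = encode.fit_transform(jobs)
print(X.shape)  # 3 rows; 4 one-hot columns + 2 boolean flags
```

Because k-NN is distance-based, it is also common to scale features so that no single column dominates the neighbor computation.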

Model Comparison

Features      Model   Precision (Fake)
NLP           k-NN    0.76
NLP           SVM     0.89
NLP           DL      0.92
NLP           NB      0.84
NLP           RF      0.58
Non-Textual   k-NN    0.85
Non-Textual   SVM     0.84
Non-Textual   DL      0.86
Non-Textual   RF      1.00