Fraudulent Job Posts

Non-Textual: Support Vector Machine

What Is SVM?

In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). [Wikipedia]
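To make the definition above concrete, here is a minimal sketch using scikit-learn. It is not the project's code: the two-cluster data is synthetic, and the use of `CalibratedClassifierCV` with `method="sigmoid"` is our assumption about one way to apply Platt scaling. It shows that a linear SVM outputs a signed score and a hard label, and that probabilities only appear once a calibration step is added.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

# Synthetic two-class data (illustration only, not the job-postings dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

new_points = np.array([[-3.0, -3.0], [3.0, 3.0]])

# A plain linear SVM: non-probabilistic, binary, linear.
svm = SVC(kernel="linear").fit(X, y)
scores = svm.decision_function(new_points)  # signed score relative to the separating hyperplane
labels = svm.predict(new_points)            # hard class labels, no probabilities

# Platt scaling: fit a sigmoid on top of the SVM scores to obtain probabilities.
calibrated = CalibratedClassifierCV(SVC(kernel="linear"), method="sigmoid", cv=5)
calibrated.fit(X, y)
probs = calibrated.predict_proba(new_points)  # one row per point, columns = classes 0 and 1

print(scores, labels)
print(probs)
```

The raw scores are not probabilities; only the calibrated model exposes `predict_proba`.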

Classification Report: Non-Textual SVM

Without Parameter Tuning

	precision	recall	f1-score	support
Real	0.95	1.00	0.98	4251
Fake	0.67	0.02	0.04	219
Accuracy			0.95	4470
Macro Avg	0.81	0.51	0.51	4470
Weighted Avg	0.94	0.95	0.93	4470
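
The averaged rows in the report above can be reproduced from the two per-class rows. The sketch below is plain Python arithmetic on the rounded numbers in the table, so small rounding differences from the original unrounded values are possible. It also shows why accuracy alone is misleading here: fake postings are under 5% of the test set, so labelling every posting as real would already give an accuracy of about 0.95.

```python
# Per-class rows copied from the untuned report: (precision, recall, f1, support)
rows = {
    "Real": (0.95, 1.00, 0.98, 4251),
    "Fake": (0.67, 0.02, 0.04, 219),
}

total = sum(r[3] for r in rows.values())  # 4470 test postings

# Macro average: unweighted mean over the classes.
macro_precision = sum(r[0] for r in rows.values()) / len(rows)
macro_recall = sum(r[1] for r in rows.values()) / len(rows)

# Weighted average: mean over the classes, weighted by support.
weighted_precision = sum(r[0] * r[3] for r in rows.values()) / total

# F1 is the harmonic mean of precision and recall (shown for the Fake class).
p, r = rows["Fake"][0], rows["Fake"][1]
fake_f1 = 2 * p * r / (p + r)

# Share of fake postings, and the accuracy of always predicting "Real".
fake_share = rows["Fake"][3] / total
majority_baseline_accuracy = rows["Real"][3] / total

print(round(macro_precision, 2), round(macro_recall, 2))  # 0.81 0.51
print(round(weighted_precision, 2))                       # 0.94
print(round(fake_f1, 2))                                  # 0.04
print(round(fake_share, 3))                               # 0.049
print(round(majority_baseline_accuracy, 2))               # 0.95
```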

Classification Report: Non-Textual SVM

With Parameter Tuning

	precision	recall	f1-score	support
Real	0.97	1.00	0.98	4251
Fake	0.84	0.45	0.59	219
Accuracy			0.97	4470
Macro Avg	0.91	0.72	0.78	4470
Weighted Avg	0.97	0.97	0.96	4470

Train Time: 5.41 seconds

Findings: Before hyperparameter tuning, the support vector machine model performed poorly. Although its overall accuracy was 0.95, its precision on fake postings was only 0.67, by far the lowest score we’ve seen so far, even among models without parameter tuning, and its recall on fake postings was just 0.02. After calculating and applying the best parameters, however, precision on fake postings increased to 0.84, recall on fake postings to 0.45, and overall accuracy to 0.97. Parameter tuning made this model comparable to the others at classifying fake job postings from the non-textual data, but it was still the least effective of the bunch.
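
The write-up does not say which parameters were searched or how, so the following is only a sketch of one common approach: a cross-validated grid search with scikit-learn's `GridSearchCV`. The synthetic data, the parameter grid, the feature scaling step, and the choice of F1 as the selection metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the non-textual features (about 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale features, then fit an SVM; make_pipeline names the steps
# "standardscaler" and "svc", which is where the "svc__" prefix comes from.
pipe = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid; the values actually searched in the project are unknown.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", 0.1],
}

# Select on F1 for the positive (minority) class rather than on accuracy,
# since accuracy is dominated by the majority class.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(classification_report(
    y_test, search.predict(X_test), target_names=["Real", "Fake"]
))
```

Selecting on a minority-class metric is a design choice: with this much class imbalance, a search scored on accuracy can favor parameters that rarely predict the fake class at all.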

Even though the main objective of our project was to explore Natural Language Processing and see how effectively these models read and interpret text, we also wanted to see how accurately the same models could predict fraudulent job postings from some of the other data available in our dataset. The same features were used with each model: whether a company logo was provided, whether a salary and/or benefits were provided, industry, required experience, required education, employment type, etc. Without the job description itself, there didn’t seem to be much data to rely on, so we weren’t sure what to expect.
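
As an illustration of how features like these can be turned into numeric model inputs, here is a small pandas sketch. The column names and the three example rows are hypothetical, and the encoding (presence flags plus one-hot columns) is an assumption rather than a description of the preprocessing actually used.

```python
import pandas as pd

# Hypothetical rows and column names, for illustration only.
jobs = pd.DataFrame({
    "has_company_logo": [1, 0, 1],
    "salary_range": ["50000-70000", None, None],
    "benefits": ["Health insurance", None, "Remote work"],
    "employment_type": ["Full-time", "Part-time", None],
    "required_experience": ["Entry level", None, "Mid-Senior level"],
})

# Presence flags: was a salary / were benefits provided at all?
flags = pd.DataFrame({
    "has_company_logo": jobs["has_company_logo"],
    "has_salary": jobs["salary_range"].notna().astype(int),
    "has_benefits": jobs["benefits"].notna().astype(int),
})

# One-hot encode the categorical fields; missing values get their own category,
# since leaving a field blank may itself be informative.
categorical = pd.get_dummies(
    jobs[["employment_type", "required_experience"]].fillna("Unspecified"),
    dtype=int,
)

X = pd.concat([flags, categorical], axis=1)
print(X.columns.tolist())
print(X.shape)  # (3, 9)
```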

Model Comparison

Data	Model	Precision (Fake)
NLP	k-NN	0.76
	SVM	0.89
	DL	0.92
	NB	0.84
	RF	0.58
Non-Textual	k-NN	0.85
	SVM	0.84
	DL	0.86
	RF	1.00