Fraudulent Job Posts

Non-Textual: Random Forest

What Is Random Forest?

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. [Wikipedia]
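The aggregation step in that definition can be sketched in a few lines of Python. This is only an illustration, not the code used in this project: the per-tree predictions are made-up values, and `aggregate_classification` and `aggregate_regression` are hypothetical helper names.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the trees' votes (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Return the mean of the trees' numeric predictions."""
    return mean(tree_outputs)


# Hypothetical votes from five trees for a single job posting.
votes = ["Real", "Fake", "Real", "Real", "Fake"]
forest_label = aggregate_classification(votes)

# Hypothetical numeric outputs from four trees for a regression task.
forest_value = aggregate_regression([2.0, 4.0, 3.0, 3.0])

print(forest_label, forest_value)
```

Some implementations, scikit-learn's among them, average the per-tree class probabilities rather than counting hard votes, so treat the majority vote here as the textbook version of the idea.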

Classification Report: Non-Textual Random Forest

Without Parameter Tuning

	precision	recall	f1-score	support
Real	0.97	1.00	0.99	4251
Fake	0.91	0.50	0.65	219
Accuracy			0.97	4470
Macro Avg	0.94	0.75	0.82	4470
Weighted Avg	0.97	0.97	0.97	4470

Classification Report: Non-Textual Random Forest

With Parameter Tuning

	precision	recall	f1-score	support
Real	0.95	1.00	0.98	4251
Fake	1.00	0.03	0.06	219
Accuracy			0.95	4470
Macro Avg	0.98	0.52	0.52	4470
Weighted Avg	0.95	0.95	0.93	4470
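The two average rows in the report can be reproduced from the per-class rows. The sketch below does this for recall in the tuned report. It assumes the usual definitions (macro = unweighted mean over classes, weighted = mean weighted by support) and works from the rounded two-decimal scores, so the results are approximate. The helper names are hypothetical.

```python
# Per-class scores copied from the tuned report: (precision, recall, f1, support).
report = {
    "Real": (0.95, 1.00, 0.98, 4251),
    "Fake": (1.00, 0.03, 0.06, 219),
}


def macro_avg(rows, idx):
    """Unweighted mean of one metric across classes."""
    return sum(r[idx] for r in rows) / len(rows)


def weighted_avg(rows, idx):
    """Mean of one metric across classes, weighted by each class's support."""
    total_support = sum(r[3] for r in rows)
    return sum(r[idx] * r[3] for r in rows) / total_support


rows = list(report.values())
macro_recall = macro_avg(rows, 1)
weighted_recall = weighted_avg(rows, 1)

print(f"macro recall    ~ {macro_recall:.3f}")
print(f"weighted recall ~ {weighted_recall:.3f}")
```

Because Real postings outnumber Fake ones roughly 19 to 1, the weighted average is dominated by the Real class and stays high even when Fake recall is very low. The macro average treats both classes equally, which is why it sits near 0.5 here.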

Train Time: 3.28 seconds

Findings: The random forest model performed well on the headline metrics. Before hyperparameter tuning, it had a precision of 0.91 on the Fake class and an overall accuracy of 0.97, which were among the highest scores we had seen so far, especially without any parameter tuning. After tuning and applying the best parameters, overall accuracy dropped to 0.95 while Fake precision rose to 1.00, meaning every posting the model flagged as fraudulent really was fraudulent. That came at a steep cost in recall, though: Fake recall fell from 0.50 to 0.03, so the tuned model caught only about 3% of the 219 fraudulent postings and labeled the rest as real. It was also by far the fastest model to run.
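To see what the Fake-class scores mean in terms of actual postings, the precision and recall from the two reports can be converted back into approximate counts. This is a rough sketch: the inputs are the rounded two-decimal scores from the reports, so the counts are estimates, and `approx_counts` is a hypothetical helper.

```python
def approx_counts(precision, recall, support):
    """Estimate true positives, false positives and false negatives
    for one class from its precision, recall and support."""
    tp = recall * support                  # recall = TP / support
    fp = tp * (1 - precision) / precision  # precision = TP / (TP + FP)
    fn = support - tp                      # fraudulent postings that were missed
    return tp, fp, fn


FAKE_SUPPORT = 219  # number of fraudulent postings in the test set

untuned_tp, untuned_fp, untuned_fn = approx_counts(0.91, 0.50, FAKE_SUPPORT)
tuned_tp, tuned_fp, tuned_fn = approx_counts(1.00, 0.03, FAKE_SUPPORT)

print(f"untuned: caught ~{untuned_tp:.0f}, false alarms ~{untuned_fp:.0f}, missed ~{untuned_fn:.0f}")
print(f"tuned:   caught ~{tuned_tp:.0f}, false alarms ~{tuned_fp:.0f}, missed ~{tuned_fn:.0f}")
```

Under these estimates the untuned model catches roughly half of the 219 fraudulent postings with a handful of false alarms, while the tuned model raises no false alarms but catches only a few postings in total. A precision of 1.00 on its own therefore says nothing about how many fraudulent postings slip through.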

Even though the main objective of our project was to explore Natural Language Processing and see how effective these models were at reading and interpreting text, we wanted to see how accurately the same models could predict fraudulent job postings based on some of the other available data in our dataset. The same features were used with each model. They included whether or not a company logo was provided, whether or not a salary and/or benefits were provided, industry, required experience, required education, and employment type, among others. Without the job description itself there didn’t seem to be much data to rely on, so we weren’t sure what to expect.
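One way such non-textual fields could be turned into model inputs is shown below. This is a sketch under assumptions, not the preprocessing actually used here: the field names (`has_company_logo`, `salary_range`, `benefits`, `employment_type`, `required_experience`), the category values, and the `encode_posting` helper are all hypothetical.

```python
def encode_posting(posting, categories):
    """Turn one posting (a dict) into a flat list of 0/1 features."""
    # Presence flags: 1 if the field is provided and non-empty, else 0.
    features = [
        1 if posting.get("has_company_logo") else 0,
        1 if posting.get("salary_range") else 0,
        1 if posting.get("benefits") else 0,
    ]
    # One-hot encode each categorical column against its known values.
    for column, values in categories.items():
        features.extend(1 if posting.get(column) == v else 0 for v in values)
    return features


# Hypothetical category vocabularies.
categories = {
    "employment_type": ["Full-time", "Part-time", "Contract"],
    "required_experience": ["Entry level", "Mid-Senior level"],
}

# Hypothetical posting with no logo and no salary listed.
posting = {
    "has_company_logo": False,
    "salary_range": "",
    "benefits": "Health insurance",
    "employment_type": "Part-time",
    "required_experience": "Entry level",
}

vector = encode_posting(posting, categories)
print(vector)
```

Each posting becomes a short fixed-length vector, which illustrates why these models train quickly and why so little signal is available compared with the full job description.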

Model Comparison

Features	Model	Precision (Fake)
NLP	k-NN	0.76
	SVM	0.89
	DL	0.92
	NB	0.84
	RF	0.58
Non-Textual	k-NN	0.85
	SVM	0.84
	DL	0.86
	RF	1.00