Determining Receipt Validity from E-Mail Subject Line Using Feature Extraction and Binary Classifiers

Chanda Amit Hirway (Technological University of the Shannon & The NPD Group, L. P., Ireland); Enda Fallon (Athlone Institute of Technology, Ireland); Kieran Flanagan (The NPD Group, L. P., Ireland); Paul Connolly (The NPD Group, Inc, Ireland)

As the number of structured and unstructured data sources grows, many data quality technologies have become available to manage diverse types of data and their sources. Modern data quality solutions can improve efficiency and reduce risk by repairing incorrect data at several stages before it is stored in the data warehouse. To improve the accuracy of the underlying models, these solutions employ machine learning and natural language processing capabilities. The purpose of this study is to determine the authenticity of a customer invoice based on the subject line of an email using feature extraction and binary classifiers.

To do this, the Bag of Words (BOW) feature extraction method was used to create a vocabulary. Three binary classifiers, namely Bernoulli Naive Bayes (BernoulliNB), Support Vector Machine, and Random Forest, were evaluated for accuracy. The Random Forest classifier was found to be the most effective in terms of accuracy, precision, recall, and F1 score. A limitation of the study is that classifier performance was evaluated using only one feature extraction method. Other classifiers should also be evaluated to further reduce the number of false negatives, which play a significant part in calculating model accuracy.
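The pipeline described above (Bag of Words vocabulary, a binary classifier, then accuracy, precision, recall, and F1) can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation: the subject lines, the labels (1 = valid receipt, 0 = not a receipt), the tokenization rule, and all function names are hypothetical, and only a hand-written Bernoulli Naive Bayes is shown. The Support Vector Machine and Random Forest classifiers from the study are omitted.

```python
import math
import re

def tokenize(subject):
    """Lowercase a subject line and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", subject.lower())

def build_vocabulary(subjects):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for s in subjects for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(subject, vocab):
    """Binary Bag of Words vector: 1 if the token occurs, else 0."""
    vec = [0] * len(vocab)
    for tok in tokenize(subject):
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec

def train_bernoulli_nb(X, y):
    """Fit Bernoulli Naive Bayes with add-one smoothing.

    Assumes both classes (0 and 1) occur in y.
    """
    model = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        log_prior = math.log(len(rows) / len(X))
        # Smoothed estimate of P(token present | label) per column.
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[label] = (log_prior, probs)
    return model

def predict(model, x):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, (log_prior, probs) in model.items():
        score = log_prior
        for present, p in zip(x, probs):
            score += math.log(p if present else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # falls as false negatives rise
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical toy data, invented purely for illustration.
train_subjects = [
    "Your order receipt from Acme Store",
    "Receipt for your payment",
    "Your invoice and order confirmation",
    "Weekly newsletter and special offers",
    "Reset your password",
    "Special offers just for you",
]
train_labels = [1, 1, 1, 0, 0, 0]

vocab = build_vocabulary(train_subjects)
X_train = [vectorize(s, vocab) for s in train_subjects]
model = train_bernoulli_nb(X_train, train_labels)

test_subjects = ["Receipt for your order", "Special offers newsletter"]
test_labels = [1, 0]
predictions = [predict(model, vectorize(s, vocab)) for s in test_subjects]
print(predictions)
print(binary_metrics(test_labels, predictions))
```

Because recall is computed as TP / (TP + FN), each valid receipt that the classifier misses lowers both recall and F1, which is why reducing false negatives matters when comparing classifiers on this task.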

Journal: International Journal of Simulation: Systems, Science and Technology (IJSSST), Vol. 23

Published: no date/time given

DOI: 10.5013/IJSSST.a.23.02.03