This paper introduces a real-time vishing (voice phishing) detection method designed for interactive calls, where early detection is critical to limiting financial losses. The study relies solely on acoustic voice features, not conversation content, and exploits the distinctive phonetic traits exhibited by vishing perpetrators. The experiments show that 1) acoustic voice features effectively detect vishing in conversational contexts, 2) an analysis of input audio length indicates that early detection is feasible, and 3) the models have inference times short enough for real-time vishing prevention. The experimental models in our study reach high accuracy, and some classify every test case correctly. This approach offers a practical means of real-time vishing prevention that can reduce the risk of financial loss.
We collected a vishing dataset, organized as vishing call data and normal call data, from the Korean Financial Supervisory Service and AI Hub. We segmented the data in 0.1-second increments; the code can be found in time_split.py in the preprocess folder. On average, the conversation starts within 0.5 seconds, so data shorter than 0.5 seconds is not meaningful to use. We did not remove the leading silence (the portion before 0.5 seconds), because we wanted to see how short a segment can still be detected in an actual call.
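The segmentation step can be illustrated with a minimal sketch. It assumes that each segment is a leading prefix of the call (0.5 s, 0.6 s, and so on) with the front silence kept; that reading, along with the function names `prefix_durations` and `split_prefixes` and the 16 kHz sample rate, is an assumption for illustration and is not taken from time_split.py.

```python
def prefix_durations(total_s, min_s=0.5, step_s=0.1):
    """List prefix durations min_s, min_s + step_s, ... not exceeding total_s."""
    durations = []
    k = 0
    while True:
        # Round to avoid floating-point drift such as 0.7999999.
        d = round(min_s + k * step_s, 6)
        if d > total_s + 1e-9:
            break
        durations.append(d)
        k += 1
    return durations


def split_prefixes(samples, sample_rate, min_s=0.5, step_s=0.1):
    """Map each prefix duration to the leading samples of that length.

    The leading silence is kept on purpose, as described in the text.
    """
    total_s = len(samples) / sample_rate
    return {
        d: samples[: int(round(d * sample_rate))]
        for d in prefix_durations(total_s, min_s, step_s)
    }


# One second of (silent) audio at an assumed 16 kHz sample rate.
clips = split_prefixes([0] * 16000, sample_rate=16000)
for duration, clip in clips.items():
    print(duration, len(clip))
```

Starting at 0.5 seconds reflects the observation above that shorter inputs rarely contain any speech.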
- Examples of dataset
The image shows an overview of our vishing detection process. For real-time detection we used lightweight models: machine learning models, which take relatively little time to train and evaluate, and basic deep learning models. basic.py, DenseNet.py, and LSTM.py in the Model folder contain the code for the experiments. Here are some examples for training and evaluating the models.
Basic ML
python ml.py --model_name 'SVM' --feature_type 'mfcc' --feature_time '0.5' --wav_path './data/split_wav' --result_root './result' --checkpoints_root './checkpoint' --gpu_id 0
Simple DL
python dl.py --model_name 'DenseNet' --feature_type 'mfcc' --feature_time '0.5' --wav_path './data/split_wav' --result_root './result' --checkpoints_root './checkpoint' --gpu_id 0
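For readers unfamiliar with the flags used in the commands above, the following sketch shows one way such a command line could be parsed with `argparse`. Only the flag names come from the commands; the types, defaults, and the `build_parser` helper are assumptions, and the repository's scripts may define them differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags in the example commands (illustrative)."""
    parser = argparse.ArgumentParser(
        description="Train and evaluate a vishing detector (illustrative sketch)."
    )
    parser.add_argument("--model_name", type=str, required=True)  # e.g. SVM, DenseNet
    parser.add_argument("--feature_type", type=str, default="mfcc")
    # Length of audio, in seconds, used to compute features (assumed to be a float).
    parser.add_argument("--feature_time", type=float, default=0.5)
    parser.add_argument("--wav_path", type=str, default="./data/split_wav")
    parser.add_argument("--result_root", type=str, default="./result")
    parser.add_argument("--checkpoints_root", type=str, default="./checkpoint")
    parser.add_argument("--gpu_id", type=int, default=0)
    return parser


# Parse the same arguments as the "Basic ML" example command.
example_args = build_parser().parse_args(
    ["--model_name", "SVM", "--feature_type", "mfcc", "--feature_time", "0.5",
     "--wav_path", "./data/split_wav", "--gpu_id", "0"]
)
print(example_args.model_name, example_args.feature_time, example_args.gpu_id)
```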
The hyperparameters of each model are as follows:
- Test accuracy for all features and models
The test results are averages over five experiment runs. Most of the results exceed 99% accuracy.
- Test time per case
The table below shows the test time for each time segment and each feature extraction method, for all models. As you can see, every case took less than 0.03 seconds, showing that the models can be used for real-time detection.
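A per-case test time like the one reported here could be measured roughly as in the sketch below. The `dummy_predict` function is a hypothetical stand-in for a trained model, and `mean_inference_seconds` and the 13-value feature vectors are illustrative assumptions; the repository may measure time differently.

```python
import time


def mean_inference_seconds(predict, cases, repeats=5):
    """Average wall-clock seconds per call of predict over all cases."""
    start = time.perf_counter()
    for _ in range(repeats):
        for case in cases:
            predict(case)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(cases))


def dummy_predict(features):
    # Hypothetical stand-in for a trained classifier: 1 = vishing, 0 = normal.
    return int(sum(features) > 0)


# Upper bound on per-case test time reported in the text.
OBSERVED_UPPER_BOUND_S = 0.03

# 100 fake feature vectors of 13 values each (sizes chosen arbitrarily).
cases = [[0.1] * 13 for _ in range(100)]
per_case = mean_inference_seconds(dummy_predict, cases)
print(f"{per_case:.6f} s per case, under bound: {per_case < OBSERVED_UPPER_BOUND_S}")
```

Averaging over several repeats smooths out timer noise, which matters when a single prediction takes only a few milliseconds.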