The aim of this project is to perform audio classification and analysis on the GTZAN dataset and to build a music recommendation system. This involves processing audio signals and extracting key features from them to gain insight into musical structure and improve classification accuracy. The goal is to understand how different music genres exhibit distinct frequency and temporal characteristics, leading to better genre recognition and recommendations.
The Harmonic–Percussive Source Separation (HPSS) technique decomposes an audio signal into two layers:
- Harmonic (Blue) – Sustained, tonal elements (vocals, guitar chords, string sections).
- Percussive (Orange) – Rhythmic, transient elements (drum hits, sharp attacks).
- Classical and jazz tend to have smooth harmonic content, with minimal percussive elements.
- Hip-hop and Rock often exhibit strong percussive spikes due to heavy drum usage.
- Metal music displays both intense percussive elements and sustained harmonic components.
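The decomposition described above can be computed directly with librosa. The sketch below is a minimal example; the file path is illustrative:

```python
import librosa
import numpy as np

# Load a 30-second GTZAN clip (path is illustrative)
y, sr = librosa.load("genres/rock/rock.00000.wav", duration=30)

# Split the signal into its harmonic and percussive layers
y_harmonic, y_percussive = librosa.effects.hpss(y)

# Compare the energy in each layer: percussive-heavy genres (hip-hop, rock, metal)
# tend to show a larger percussive share than classical or jazz
harmonic_energy = np.sum(y_harmonic ** 2)
percussive_energy = np.sum(y_percussive ** 2)
print(f"Percussive share: {percussive_energy / (harmonic_energy + percussive_energy):.2%}")
```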
MFCCs are a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
How do MFCCs work?
- The audio signal is divided into short frames (20-40 ms).
- A Fast Fourier Transform (FFT) converts each frame from the time domain to the frequency domain.
- The frequency spectrum is mapped onto the mel scale, which approximates human pitch perception.
- The logarithm is applied to mimic the human ear's non-linear sensitivity to loudness.
- A Discrete Cosine Transform (DCT) decorrelates the log-mel energies, yielding the MFCC coefficients (a minimal extraction sketch follows this list).
- Each genre has a unique "fingerprint" in its MFCCs, which reflects its musical characteristics (rhythm, harmony, timbre). Rock and metal might have more pronounced high-frequency components, while classical and jazz might have smoother transitions.
- The plots show how the audio features change over time. For instance, a sudden change in color intensity might indicate a transition between sections of the song.
- We can identify similarities (blues and jazz might share some patterns) and differences (classical and metal are likely very distinct).
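A minimal sketch of this pipeline using librosa, which handles the framing, FFT, mel mapping, log compression, and DCT internally (file path illustrative):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("genres/jazz/jazz.00000.wav", duration=30)

# 13 MFCCs per frame; librosa's defaults use 2048-sample windows (~93 ms at 22050 Hz)
# hopped every 512 samples (~23 ms)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)

# Heatmap of the MFCCs over time -- the genre "fingerprint" discussed above
librosa.display.specshow(mfccs, sr=sr, x_axis="time")
plt.colorbar()
plt.title("MFCCs")
plt.tight_layout()
plt.show()
```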
Chroma features represent the harmonic and melodic characteristics of a musical piece. They are useful for analyzing chord progressions, tonality, and musical key detection.
How does it work?
- The audio signal is analyzed to identify the pitch content across the frequency spectrum.
- Frequencies are mapped to their corresponding pitch classes, effectively "folding" all octaves into a single octave (for example, C2, C3, and C4 are all treated as "C").
- The energy (magnitude) within each pitch class is summed, resulting in a 12-dimensional vector of chroma features (see the sketch after this list).
- The transitions between pitch classes over time can reveal chord progressions. For example, a sequence of bright spots moving vertically might indicate a chord change.
- Genres like rock, blues, pop, and hip-hop show dense chroma activity, indicating frequent chord changes.
- Classical and jazz have more structured patterns (jazz shows more harmonic complexity).
- The pitch class with the highest intensity over time often indicates the tonal center (key) of the music.
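A minimal extraction sketch with librosa's chroma_stft (file path illustrative); the tonal-centre estimate at the end is a rough heuristic, not a proper key detector:

```python
import librosa
import numpy as np

y, sr = librosa.load("genres/blues/blues.00000.wav", duration=30)

# 12 pitch classes (C, C#, ..., B) per frame; all octaves are folded together
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
print(chroma.shape)  # (12, number_of_frames)

# Rough tonal-centre estimate: the pitch class with the highest mean energy
pitch_classes = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
print("Likely tonal centre:", pitch_classes[int(np.argmax(chroma.mean(axis=1)))])
```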
A spectrogram represents the frequency content of a signal over time. It helps to visualize how different frequencies evolve throughout the duration of the audio clip.
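A minimal plotting sketch, assuming librosa and matplotlib (file path illustrative):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("genres/metal/metal.00000.wav", duration=30)

# Short-time Fourier transform, converted to decibels for plotting
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log")
plt.colorbar(format="%+2.0f dB")
plt.title("Log-frequency spectrogram")
plt.tight_layout()
plt.show()
```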
Genre | Low Frequencies (Bass/Drums) | Mid Frequencies (Guitar/Piano) | High Frequencies (Vocals/Percussion) | Density & Variation
---|---|---|---|---
Rock | Strong | High | Present, less prominent | Dense |
Country | Moderate | Strong | Present, less percussive | Structured |
Classical | Low | Strong | Strong | Highly structured |
Hip-Hop | Strong | Moderate | Sparse, rhythmic | Repetitive |
Blues | Moderate | Smooth | Moderate, rhythmic | Improvised |
Pop | Balanced | Present, cleaner than rock | Present, bright | Repetitive |
Reggae | Strong | Present, rhythmic stabs | Moderate | Offbeat structure |
Jazz | Variable | High | Present, dynamic changes | Highly varied |
Disco | Moderate | Present, rhythmic | Present, bright | Rhythmic, consistent |
Metal | Strong | Strong, distorted | Harsh, intense | Dense, aggressive |
The audio features are extracted from the audio files and saved as CSV files for further processing.
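A hedged sketch of this extraction step, assuming the GTZAN clips sit in per-genre folders under a hypothetical `genres/` directory; each CSV row holds the per-clip mean of every feature dimension plus the file name and genre label (all names here are illustrative):

```python
import os
import librosa
import pandas as pd

rows = []
for genre in os.listdir("genres"):                            # e.g. genres/rock, genres/jazz, ...
    for fname in os.listdir(os.path.join("genres", genre)):   # assumes every entry is a .wav clip
        y, sr = librosa.load(os.path.join("genres", genre, fname), duration=30)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
        chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
        rows.append([fname] + list(mfcc) + list(chroma) + [float(centroid), genre])

cols = (["file"] + [f"mfcc_{i}" for i in range(13)]
        + [f"chroma_{i}" for i in range(12)] + ["centroid", "genre"])
pd.DataFrame(rows, columns=cols).to_csv("features.csv", index=False)
```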
We use XGBoost and YAMNet for genre classification because they offer complementary strengths in analyzing and classifying audio signals. XGBoost is an ensemble learning method that builds a series of decision trees sequentially, each tree correcting the errors of the previous one, which leads to a highly accurate model. Since our extracted audio features are structured and numeric, XGBoost is well suited to classifying them and outperforms many other machine learning models on this kind of tabular data.
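A minimal training sketch, assuming the features live in the hypothetical features.csv produced above (with a file column and a genre label column):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

df = pd.read_csv("features.csv")
X = df.drop(columns=["file", "genre"])
y = LabelEncoder().fit_transform(df["genre"])   # XGBoost expects integer class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Gradient-boosted trees: each new tree corrects the residual errors of the previous ones
model = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=6, eval_metric="mlogloss")
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```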
YAMNet is a pre-trained deep learning model for audio classification, built on the MobileNet architecture. Instead of training a model from scratch on raw audio, we use YAMNet as a feature extractor, converting audio waveforms into meaningful feature representations (embeddings).
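A minimal feature-extraction sketch with TensorFlow Hub; YAMNet expects mono audio at 16 kHz, and its per-frame 1024-dimensional embeddings can be averaged into a single clip-level vector (the file path is illustrative):

```python
import librosa
import numpy as np
import tensorflow_hub as hub

# Load the pre-trained YAMNet model from TensorFlow Hub
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet expects a mono float waveform sampled at 16 kHz
waveform, _ = librosa.load("genres/disco/disco.00000.wav", sr=16000, mono=True)

# The model returns class scores, per-frame embeddings, and a log-mel spectrogram
scores, embeddings, log_mel = yamnet(waveform)

# Average the frame embeddings into one 1024-dimensional clip vector for downstream classification
clip_embedding = np.mean(embeddings.numpy(), axis=0)
print(clip_embedding.shape)  # (1024,)
```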
XGBoost outperformed YAMNet in accuracy on this dataset, while YAMNet tended to overfit easily, even with fine-tuning.
Genre | YAMNet Precision | YAMNet Recall | YAMNet F1-score | XGBoost Precision | XGBoost Recall | XGBoost F1-score |
---|---|---|---|---|---|---|
Rock | 0.58 | 0.70 | 0.64 | 0.91 | 0.84 | 0.87 |
Country | 0.88 | 0.70 | 0.78 | 0.81 | 0.87 | 0.84 |
Classical | 0.95 | 0.95 | 0.95 | 0.93 | 0.97 | 0.95 |
Hip-Hop | 0.89 | 0.85 | 0.87 | 0.96 | 0.90 | 0.93 |
Metal | 0.95 | 0.95 | 0.95 | 0.94 | 0.96 | 0.95 |
Blues | 0.89 | 0.85 | 0.87 | 0.90 | 0.89 | 0.89 |
Pop | 0.68 | 0.85 | 0.76 | 0.95 | 0.96 | 0.96 |
Reggae | 0.95 | 0.90 | 0.92 | 0.93 | 0.91 | 0.92 |
Jazz | 1.00 | 0.90 | 0.95 | 0.88 | 0.92 | 0.90 |
Disco | 0.80 | 0.80 | 0.80 | 0.89 | 0.90 | 0.90 |
Overall accuracy: YAMNet 0.86, XGBoost 0.91.
The KNN algorithm is used to implement the music recommendation system. It recommends songs based on the similarity between feature vectors extracted from the audio files. Instead of Euclidean distance, we use cosine distance, which better captures relationships between feature vectors in high-dimensional space.
How does it work?
- Extract MFCCs, chroma features, and spectral features from the dataset.
- Normalize and scale the feature vectors using StandardScaler.
- Fit a nearest-neighbors model using cosine distance.
- For a given input song, find the K nearest songs by cosine distance.
- Recommend the most similar songs to the user, ranking them by similarity percentage.
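A minimal recommendation sketch with scikit-learn's NearestNeighbors, again assuming the hypothetical features.csv from the extraction step:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("features.csv")
X = StandardScaler().fit_transform(df.drop(columns=["file", "genre"]))

# Cosine distance = 1 - cosine similarity, so smaller distance means more similar
knn = NearestNeighbors(n_neighbors=6, metric="cosine")
knn.fit(X)

# Recommend the 5 tracks most similar to the first song in the dataset
distances, indices = knn.kneighbors(X[0:1])
for dist, idx in zip(distances[0][1:], indices[0][1:]):   # skip the query track itself
    print(f"{df['file'].iloc[idx]}: {(1 - dist) * 100:.1f}% similar")
```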