Spotify has amazing recommendations. Unfortunately, those amazing recommendations are hidden among plenty of not-so-amazing ones, scattered across many auto-generated playlists that are themselves scattered all across the platform.
Sortify tackles one of these issues: after you’ve settled on a playlist, big or small, Sortify swiftly arranges it to prioritize the tracks you are most likely to enjoy. After taking in a large sample of your likes and dislikes, Sortify can predict whether or not you will like a song, estimate how much you will like it, and sort a playlist accordingly.
Training the Sortify model takes up a significant amount of time compared to running it. Once Sortify processes all of a user’s likes and dislikes, inference and playlist processing becomes a breeze — catered results are returned in seconds.
As with most AI-related projects, Sortify involves data preprocessing steps before the actual TensorFlow model training and inference, all handled with Python scripts. For a number of reasons, the audio tracks in the training data had to have a consistent length. This requirement is driven by the need for a consistent input length for Mel-frequency cepstral coefficient (MFCC) feature extraction, a fixed input shape for the convolutional neural network (CNN) model, and the ability to artificially increase the number of training samples from our small amount of input data.
During testing, Sortify swapped between a "trim" and a "segment" strategy, where trim only retains the first n seconds of the audio track and segment attempts to split each track into audio segments of length n. In both cases, each generated file shorter than length n would later be deleted. The final length of n I settled on was around 30 seconds. The segment strategy is better because it captures a wider variety of nuances in each song, whereas the trim strategy mostly captures song intros.
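To make the splitting step concrete, here is a minimal sketch of the segment strategy, assuming librosa and soundfile for audio I/O; the helper name and output naming scheme are hypothetical, not the actual Sortify script.

```python
# Minimal sketch of the "segment" strategy (assumes librosa + soundfile;
# helper name and output naming are hypothetical).
import librosa
import soundfile as sf

SEGMENT_SECONDS = 30  # the value of n settled on during testing

def segment_track(path, out_prefix, n_seconds=SEGMENT_SECONDS, sr=22050):
    """Split one audio file into consecutive n-second segments.

    Any trailing chunk shorter than n seconds is simply not written,
    mirroring the deletion of too-short files described above.
    The "trim" strategy is the degenerate case: keep only segment 0.
    """
    y, sr = librosa.load(path, sr=sr)        # load and resample to a fixed rate
    samples_per_segment = n_seconds * sr
    for i in range(len(y) // samples_per_segment):
        chunk = y[i * samples_per_segment:(i + 1) * samples_per_segment]
        sf.write(f"{out_prefix}_seg{i}.wav", chunk, sr)
```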
The settings for the MFCC feature extraction are as follows: 40 coefficients, 2,048 samples, a hop length of 512 samples, 15 segments, and a sample rate of 22,050 Hz. In hindsight, a value of 40 for coefficients is ridiculously high and likely caused major overfitting. Also, I should’ve utilized a dynamic segmentation strategy instead of opting for a flat 15 segments, which likely limited the diversity among samples. The trash talk about my settings continues in the verdict.
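For reference, extraction with those settings would look roughly like the sketch below, assuming librosa; the function name, the per-segment loop, and the 30-second duration are illustrative rather than the exact script.

```python
# MFCC extraction with the settings above (librosa assumed; names illustrative).
import librosa
import numpy as np

SR = 22050             # sample rate
N_MFCC = 40            # 40 coefficients (too high, in hindsight)
N_FFT = 2048           # 2,048-sample window
HOP_LENGTH = 512       # hop length in samples
SEGMENTS_PER_TRACK = 15

def extract_mfccs(path, duration=30):
    """Return an array of MFCC matrices, one per segment of the track."""
    y, _ = librosa.load(path, sr=SR, duration=duration)
    samples_per_segment = len(y) // SEGMENTS_PER_TRACK
    features = []
    for s in range(SEGMENTS_PER_TRACK):
        chunk = y[s * samples_per_segment:(s + 1) * samples_per_segment]
        mfcc = librosa.feature.mfcc(y=chunk, sr=SR, n_mfcc=N_MFCC,
                                    n_fft=N_FFT, hop_length=HOP_LENGTH)
        features.append(mfcc.T)   # transpose to (time_frames, n_mfcc)
    return np.array(features)
```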
The MFCC features from the feature extraction step are stored in an array, with the respective labels in a separate array. The features are reshaped into a 4D structure and the labels are converted to a categorical format (a binary class matrix for one-hot encoding). The data are split into a standard 75-25 training/testing split, and the training data are further divided into an 80-20 training/validation split.
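This bookkeeping is standard; a sketch of what it might look like, assuming scikit-learn and Keras utilities (the placeholder arrays stand in for the real extraction output).

```python
# Reshape, one-hot encode, and split (scikit-learn + Keras utilities assumed;
# the placeholder arrays stand in for the real extraction output).
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

features = np.random.rand(200, 87, 40)       # placeholder: (samples, time_frames, n_mfcc)
labels = np.random.randint(0, 2, size=200)   # placeholder: 0 = dislike, 1 = like

X = features[..., np.newaxis]                # 4D: add a channel axis for the CNN
y = to_categorical(labels)                   # binary class matrix (one-hot)

# 75-25 training/testing split, then 80-20 training/validation split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)
```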
I’m a little scared to say this, but the model has TEN layers. Yeah. Ten. You can start to assume what the verdict will be for this project.
Moving on, there were 3 convolutional layers, 3 max pooling layers, 2 dropout layers, a global average pooling layer, and a dense layer. For the actual training, I opted for a batch size of 32 and 50 epochs. Once again, more trash talk about my settings can be found in the verdict.
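A hedged reconstruction of that architecture, reusing the hypothetical arrays from the earlier sketch, might look like the following; the filter counts, kernel sizes, and dropout rates are my guesses here, not the original hyperparameters.

```python
# Ten-layer CNN sketch: 3 conv + 3 max pooling + 2 dropout + GAP + dense.
# Filter counts, kernel sizes, and dropout rates are guesses, not the originals.
from tensorflow.keras import layers, models

def build_model(input_shape, num_classes=2):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.GlobalAveragePooling2D(),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(X_train.shape[1:])
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          batch_size=32, epochs=50)
```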
The model is stored after the training step to avoid retraining each time the script is run for inference. All of the preprocessing steps are repeated on all songs not used in the training data. Inference is run on each split audio segment, and the classification for each segment in a song, along with the confidence, is aggregated and stored. Each song that yielded a classification of "yes" (the user would like the song) is output in descending order of confidence (high to low), followed by an inverse ordering for tracks yielding a "no". This leads to the most enjoyable songs for a user appearing at the top, while the least enjoyable tracks are placed at the bottom.
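As a rough illustration, the aggregation and ordering could be done along these lines; averaging segment probabilities is just one possible aggregation scheme, and the function and data structure here are illustrative rather than the exact script.

```python
# Aggregate per-segment predictions into a per-song score and order the playlist.
# (Averaging segment probabilities is one possible aggregation; names illustrative.)
def rank_playlist(model, playlist):
    """playlist: dict mapping song title -> 4D array of its MFCC segments."""
    results = []
    for title, segments in playlist.items():
        probs = model.predict(segments)              # per-segment class probabilities
        like_confidence = float(probs[:, 1].mean())  # average probability of "yes"
        results.append((title, like_confidence >= 0.5, like_confidence))

    # "Yes" songs first, most confident at the top; "no" songs after, with the
    # most strongly disliked tracks pushed to the very bottom.
    liked = sorted((r for r in results if r[1]), key=lambda r: r[2], reverse=True)
    disliked = sorted((r for r in results if not r[1]), key=lambda r: r[2], reverse=True)
    return liked + disliked
```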
The idea behind Sortify is to allow users to open a playlist and immediately know which songs they are most likely to enjoy. At the time of this writing, Spotify has a kind-of “stories” feature, where a user can open a playlist and scroll through snippets of songs in the playlist, but Sortify takes this one step further by automatically ordering the songs by personalized appeal.
So, does Sortify live up to its standards, ideologies, goals, and motivations?
Due to the improper processing of samples and the tiny sample size, Sortify falls short of its goals (a “flop”, as my generation would call it). It was really cool to see each song in the playlist rated and sorted by how likely I was to enjoy it. Unfortunately, the accuracy of said predictions is abysmal.
Classifying what little training data I had was really difficult. It is obvious what constitutes a song that I “like” — any song that I feel worthy of adding to my “Liked Songs” playlist (right?), but what about the other side of the spectrum, the negative class? Should it be all of the remaining songs on the platform? Should it be songs that I dislike? How strongly should I dislike the songs for them to be candidates for the opposing bucket? How about songs I’m indifferent to? Would songs from genres I don’t listen to be appropriate or should I seek tracks that I dislike from genres that I typically listen to? Plenty of questions and not nearly enough answers.
Sample collection was also incredibly tedious. I love listening to music, but I had to listen to a LOT of music, half of which I didn’t like. By the end of collection, I had hundreds of tracks saved, which might seem like a lot but is a minuscule amount when it comes to training a model.
Another problematic factor is the track length. I had split each track to artificially create more samples, but there were glaring issues with that approach. Liking or disliking a whole song doesn’t accurately reflect feelings about a specific fragment of it, especially if the track is arbitrarily divided by length, potentially cutting through significant auditory features like the chorus/verses or chord strikes.
Despite the limited sample size and other issues with this project, I believe a deeper understanding of the properties of AI models (and as a result, setup parameters) would’ve yielded more success. Factors such as fewer coefficients, dynamic segmentation, and a significant reduction of layers would likely result in a different outcome for this project.
Also, the processing setup takes some time, though that isn’t as glaring an issue as the low accuracy of the core functionality. There were three separate scripts: one to download the songs, another to split them, and a third to either label the features or run inference. A quality-of-life (QOL) change unifying the three scripts would be simple to implement and would significantly improve the project. Throughout the development of Sortify, I never reached the stage that allowed for QOL improvements, as I was busy straightening out the kinks at each stage of the process.
Documenting this project proved challenging; as I kept writing, I discovered more and more reasons to bash the project. Nonetheless, it is important to record and build on the pitfalls; in this case, tremendous and diabolical 😭.