Optimizing a Python application with C++ code

I’ve been working lately in a command line application called Bard which is a music manager for your local music collection. Bard does an acoustic fingerprinting of your songs (using acoustid) and stores all song metadata in a sqlite database. With this, you can do queries and find song duplicates easily even if the songs are not correctly tagged. I’ll talk in another post more about Bard and its features, but here I wanted to talk about the algorithm to find song duplicates and how I optimized it to run around 8000 times faster.

The algorithm

To find out if two songs are similar, you have to compare their acoustic fingerprints. That seems easy (and in fact, it is), but it’s not as straightforward as it seems. A fingerprint (as acoustid calculates it) is not just a number, but an array of numbers, or better said, an array of bits, so you can’t just compare the numbers themselves, but you have to compare the bits in those numbers. If all bits are exactly the same, the songs are considered the same, if 99% of bits are the same then that means there’s a 99% chance it’s the same tune, maybe differing because of encoding issues (like one song being encoded as mp3 with 192 kbits/s and the other with 128 kbits/s).

But there are more things to have in mind when comparing songs. Sometimes they have different silence lengths at the beginning or end, so the bits that we compare are not correctly aligned and they don’t match as they are calculated, but could match if we shifted one of the fingerprints a bit.

This means that to compare two songs, we don’t just have to compare the two fingerprints, but then we have to simulate increasing/decreasing silence lengths at the beginning of a song by shifting its fingerprint and calculate the match level for each shift to see if we improve/worsen the similarity percentage.  Right now, Bard shiftes the array by 100 positions in one direction and then another 100 positions in the other direction, this means that for each song we have to compare the two fingerprints 200 times.

Then, if we want to compare all of the songs in a collection to find duplicates, we need to compare the songs with ID 1 and 2, then we need to compare the song with ID 3 with IDs 1 and 2, and in general, each song with all previous ones. This means for a collection with 1000 songs we need to compare 1000*1001/2 = 500500 songs (or 100100000 fingerprints).

The initial python implementation

Bard is written in python, so my first implementation used a python list to store the array of integers from the fingerprint. For each iteration in which I have to shift the array numbers I prepend a 0 to one fingerprint array and then iterate over them comparing each element. To do this comparison, I xor the two elements and then use a standard good known algorithm to count the bits which are set in a integer:

def count_bits_set(i):
    i = i – ((i >> 1) & 0x55555555)
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
    return (((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) & 0xffffffff) >> 24

Let’s use the speed of this implementation as a reference and call it 1x.

First improvement

As a first improvement I tried replacing that bit counting algorithm with gmpy.popcount which was faster and also improved the algorithm by introducing a canceling threshold. This new algorithm stops comparing two fingerprints as soon as it’s mathematically impossible to have a match over the canceling threshold. So for example, if we are iterating the fingerprints and we calculate that even if all remaining bits would match we wouldn’t get at least a 55% match between songs, we just return “different songs” (but we still need to shift songs and try again, just in case).

With these improvements (commit) the comparisons ran at nearly an exact 2x speed.

Enter C++17

At that time,  I thought that this code wouldn’t scale nicely with a large music collection, so I thought Bard needed a much better implementation. Modifying memory is slow and C/C++ allows for much fine grained low-level optimizations, but I didn’t want to rewrite the application in C++ so I used Boost.Python to implement just this algorithm in C++ and call it from the python application. I should say that I found really easy to integrate methods in C++ with Python and I absolutely recommend Boost.Python .

With the new implementation in C++ (commit) I used STL’s vector to store the fingerprints and I added the maximum offsets in advance so I didn’t need to modify the vector elements during the algorithm and I access elements by simulating the shifting offsets. I also use STL’s map to store all fingerprints indexed on a song ID. Finally, I also added another important optimization which is using the CPU’s instructions to calculate bits set by using gcc’s __builtin_popcount.

The best part of this algorithm is that during the comparison algorithm itself, no fingerprint is copied or modified, which translated in a speed of 126.47x . At this point I started calculating another measure: songs compared per second (remember we’re doing 200 fingerprints comparison for each pair of songs). This algorithm gave an average speed of 580 songs/second. Or put another way, to compare our example collection of 1000 songs, this would take 14 min 22 sec (note that we can calculate the original implementation in Python would take approximately 1 day, 6 hours, 16 minutes and 57 seconds).

First try to parallelize the algorithm

I use an i7 CPU to run Bard and I always thought it was a pity that it only used one CPU. Since the algorithm that compares two songs doesn’t modify their data anymore, I thought it might be interesting to parallelize it to allow it to run in all 8 cores at the same time and just coalesce the result of each independent iterations. So I wondered how to do it and noticed that when I compare each song with all previous ones, this is done using a for loop that iterates over a std::map containing all songs that are already processed. Wouldn’t it be nice to have a for-each loop implementation that runs each iteration on a different thread? Well, there is! std::for_each in C++17 allows to specify an ExecutionPolicy with which you can tell it to run the iterations in different threads. Now the bad news: This part of the standard is not completely supported by gcc yet.

So I searched for a for_each implementation and found this stackoverflow question which included one. The problem is that the question mentions the implementation was copied from the C++ Concurrency in Action book and I’m not sure of the license of that code, so I can’t just copy it into Bard. But I can make some tests with it just for measurements purposes.

This increased the speed to 1897x or ~8700 songs/second (1000 songs would be processed in 57 seconds). Impressive, isn’t it? Well… keep reading 🙂

Second parallelization try

So I needed to find a parallelized for_each version I could use. Fortunately I kept looking and found out that gcc includes an experimental parallel implementation of some algorithms in the C++ standard library which includes __gnu_parallel::for_each (there are more parallelized algorithms documented in that page). You just need to link to the openmp library.

So I ran to change the code to use it and hit a problem: I used __gnu_parallel::for_each but every time I tested, it only ran sequentially! It took me a while to find out what was happening, but after reading the gcc implementation of __gnu_parallel::for_each I noticed it required a random access iterator, but I’m iterating on a std::map and maps have a bidirectional iterator, not a random-access one.

So I changed the code (commit) to first copy the fingerprints from the std::map<int, std::vector<int>> to a std::vector<std::pair<int,std::vector<int>>> and with just that, the same __gnu_parallel::for_each line ran using a pool of 8 threads.

The gcc implementation proved to be faster than the implementation in the stackoverflow question, with a speed of 2442x , ~11200 songs/second and 44 seconds.

The obvious but important improvement I forgot about

While looking at the compiler build Bard I noticed I wasn’t using compiler flags to optimize for speed! So I tested adding -Ofast -march=native -mtune=native -funroll-loops to the compiler (commit). Just that. Guess what happened…

The speed raised to a very nice  6552x, ~30050 songs/second and 16 seconds.

The Tumbleweed improvement I got for free

I’m running openSUSE Tumbleweed in the system I use to develop which as you probably know, is a (very nice) rolling release distribution. One day, while doing these tests, Tumbleweed updated the compiler from gcc 7.3 to using gcc 8.1 by default. So I thought that deserved another measurements.

Just changing the compiler to the newer version increased the speed to 7714x, 35380 songs/second and 14 seconds.

The final optimization

An obvious improvement I didn’t do yet was replacing the map with a vector so I don’t have to convert it before each for_each call. Also, vectors allow to reserve space in advance, and since I know the final size the vector will have at the end of the whole algorithm, I changed to code to use reserve wisely.

This commit gave the last increase of speed, to 7998x, 36680 songs/second and would fully process a music collection of 1000 songs in just 13 seconds.

Conclusion

Just some notes I think are important to keep in mind from this experience:

  • Spend some time thinking how to optimize your code. It’s worth it.
  • If you use C++ and can afford to use a modern compiler, use C++17, it allows you to make MUCH better/nicer code, more efficiently. Lambdas, structured bindings, constexpr, etc. are really worth the time spent reading about them.
  • Allow the compiler to do stuff for you. It can optimize your code without any effort from your side.
  • Copy/Move data as little as possible. It’s slow and many times it can be avoided just by thinking a bit on data structures before starting developing.
  • Use threads whenever possible.
  • And probably the most important note: Measure everything. You can’t improve what you can’t measure (well, technically you can, but you won’t know for sure).

21 thoughts on “Optimizing a Python application with C++ code

  1. (coming from planet kde) The elephant in the room is the O(N^2) number of comparisons. It would be nicer to compute some invariant M stable under shifts and small noise, sort by that invariant, and then make detailed comparisons only if values of M are close. By this strategy it should be possible to scale O(N log N) in the number of items.

    1. Yeah, but how to compute such an invariant under shifts and small noise is the problem here. I’m sure the acoustid people would be quite interested in that too :).

      1. What about building a distribution of bits set per byte or long or whatever per song, inserting the acoustid for that song into a tree indexed by its distribution. To fuzzily match into the tree, index into the tree by part of the distribution, add the index-node and its N closest siblings to a list of nodes to check, and go down a level for all nodes in that list and repeat. The final set of songs in that subtree should be smaller than the full set of songs.

      2. An invariant to shifts and small noise :
        that’s actually a recurring problem in bioinformatics to which multiple solutions have been proposed over time. (to avoid doing pairwise comparison on all sequences, but first do some table/list lookup) Okay, it use a slightly larger alphabet than your case (a bit pattern is 2 symbols, a DNA sequence has 4 bases, and some work with peptide sequence which are an entirely different problem with 26 aminoacids) so not all solutions are directly transferable.
        But have a look at the various hashing solutions used in bioinformatics.

        random example (by guys I know):
        DOI: 10.1093/bioinformatics/btt149
        they count symbols and build a vector.
        most notably, they build counts of pairs (2 bases), including gaps
        (counts of all direct 16 pairs : AT, AG, etc.
        counts of all 16 pairs with whatever in the middle : AxT, AxG, etc.
        counts of all 16 pairs with 2x whatever in the middle : AxxT, AxxG, etc)
        and build a giant 48-wide vector.
        close sequence have close numbers in that vector-space.

        code is roughly invariant to shifts (the distance between pairs is important, not the absolute position in sequence)
        code is proportional to noise (a single change in the sequence moves a few of the vector member 1 count up and down)

        you could treat accoustid the same way, but e.g. count the ocurences of the 6 diffent 2×2 bit patterns they use to normalize (with some gap, or in this case “manhattan distance within the overall pattern”.

        depending on how many vector members you decide using, you might go for some tree-storage, or you could then use a trick commonly used in GPUs :
        encoding the information in Z shape (i.e.: bit twiddling the coordinates to interlace their bits : instead of saving x3 x2 x1 x0 y3 y2 y1 y0 and z3 z2 z1 z0, you save x3 y3 z3 x2 y2 z2 x1 y1 z1 x0 y0 z0. vectors with close manhattan distance have close scalar numbers in that coding). on GPUs that helps tremendously the access to neighbor pixel without killing the cache (closeby pixels are closeby in memory addresses). In you case you can complete drop some of the least significant bits and compare the absolute number so see how much they are differing. And because they are a scalar numbers, simple binary search in the linear space could help pin point where in the vector are potential “neighbors”, on which to run the full blown bit comparison.

        Yet another technique used in sequence alignement:
        DOI: 10.1038/nmeth.3176

        You compute hashes of small subsets of the bit sequence (in the comparison pictures it the accoustid, we can see that here and there there are black regions: points were the song signatures are almost identical)
        (Which makes sense: barring noise, similar bars should give similar bit patterns in accoustid).
        Use a lookup table (with optionally a bloom filter to accelerate), to map which hashes have been seen in which songs.
        On each song, take each hash in sequence and look how many of the past song have the exact same hashes.
        If a song shows up often enough (there are a lot of this subsets in the bit sequence which are shared) (i.e.: which corresponds to enough bars being in common), it might be worthy of doing an extensive check.

        Of course, all of the above are only useful for extremely large collections (in the hundreds thousands to millions range – useful for large database, e.g.: music vendor like iTunes).
        Doing thousands of songs within seconds is good enough for a home user.

  2. The expression :

    36680 songs/second and would fully process a music collection of 1000 songs in just 13 seconds
    is totally misleading

  3. Does the acoustic fingerprint gives the same pattern for silence ? If so, for searching your bit array (acoustic fingerprint), you should trim the beginning and end of each fingerprint of the “silence” bits. Then the comparison will be 1:1 instead of 1:200 (so a 200x additional speedup). Also, you can sort your fingerprint in a TernaryTree, so you can perform a prefix search in O(log(N)) instead of O(N). The item with the longest match will be returned (you can even allow a Levenstein distance here to accept few potential “mutation” in the fingerprint). In that case, the complete algorithm could run in O(log N log M) instead of O(N*M) like you did.

    1. One test I did was finding silence at the beginning/end of the files and trimming them to calculate the fingerprints of two songs with different silence ranges at the beginning/end of the song, and check if they gave a better matching percentage. Unfortunately they didn’t, but I couldn’t find the exact reason why (I just tested with two songs manually, never implemented that in the application itself). I’m not sure how a ternary tree would help here, could you explain a bit more your idea? Reducing the complete algorithm’s O(N*N) would be a huge improvement, indeed.

  4. 1000 songs in 13 seconds means 77 songs per second.

    36680 songs/second is nice, but 13 seconds for 1000 songs is not. So which is it?

    1. Both are right. Please read the post carefully. In order to compare 2 songs, you have to do one comparison between two songs (comparing 200 fingerprint pairs, but that’s not relevant at this moment). If you have 3 songs and want to compare all of them to find which are similar, you have to compare the first with the second and then the first and second with the third, that is, 3 comparisons. To compare 4 songs, you have to compare 1 with 2, then 1,2 with 3, then 1,2,3 with 4 giving a total of 6 comparisons. And to compare N songs, you have to compare (N-1)*N/2 songs. So if you have 1000 songs in your collection, you have to do 999*1000/2 = 499500 song comparisons. At 36680 songs compared/second, that gives: 499500/36680 = 13.62 seconds.

      1. Where you say “songs per second” it’s actually “comparisons per second”. As the number of songs increases, the comparisons per second to resolve the differences increases, hence the increasing discrepancy between comparisons per second and songs per second. So… report it as comparisons per second, because that’s what it is.

        As an aside… it would be nice if your form had a warning that javascript is required to make comments.

  5. Interesting read, thank you!

    Utilizing the cache properly is often more important than O(n) nowadays (within reason, of course), I recommend watching Herb Sutter’s talk on “No free lunch” from a couple of years back. They compare the speed of linked lists and vectors for random input/removal for various sizes and I can tell you that the end result would surprise most people.
    This might be the big reason to why the GCC for_each implementation was faster than the other for_each, simply because vector uses the cache much more efficiently than map. But I guess the only way of knowing is to test and measure 🙂

  6. Two small tips:
    – look at the pybind11 library, it’s header only, so you can get rid of the boost dependency, and I find it a little nicer than boost.python
    – try tbb (threading building blocks), it’s parallel_for is exactly what you need, and it supposed to be faster than openmp

  7. First thing I thought when reading this: “This looks to be cross correlation issue”. Would calculating cross correlation and only doing the comparison for high threshold pairs beneficial ? Or maybe not doing the comparison at all if correlation is in upper 0.9 ?

  8. Google for “Local Sensitive Hashing”. Using this you should be able to get away with the O(n*n) complexity

  9. Thanks for very interesting post. I think there is some space for improvement by introducing some metadata. First I thought about BPM, but what I read in https://oxygene.sk/2011/01/how-does-chromaprint-work/ it does not provide such an information.
    But you could measure percentage of set bits on Nth positon for whole song and use it as an additional indicator. For 32bit intigers you would have 32 percentage values obtained in single pass. If any of them has significant difference, there is no chance that any shifting will fix that, so songs are different.
    Unfortunately, I can’t tell what significant difference really is, but I guess it may be worth experimenting. Leave you with that idea.

Leave a Reply

Your email address will not be published. Required fields are marked *