Saturday, April 18, 2020

Project Part 3: Optimization

We have reached the third and final part of the project in which the goal is to optimize a piece of open source software, is part one we selected and bench marked the software and in part 2 we did some profiling to see what the slow parts of the program were. Building off of these two steps in this step I will look for ways to better optimize FFmpeg.

This task as I have learned is not an easy one, the first step was to decide on which method would best speed up this program. Looking back at my first test I did and seeing the gcc optimization fail to accomplish anything this would not be easy. The next thing I did was check for individual optimization files, and this led to some more promising results. While using the aarm64 machine aarchie, FFmpeg has 147 files in the libavcodec/aarch64 folder, and the only reason the number is this high is because many of the files generated output files when compiled. This is in contrast to the 194 files within the x86 directory, none of which being output files. This means there have been many more optimizations done specifically for x86. Additionally when doing research on the SIMD optimization of FFmpeg I found this page https://github.com/FFmpeg/FFmpeg/blob/master/doc/optimization.txt which explicitly states that the best way to look to optimize for non x86 systems is to look at what has already been done there.  

So I went and checked down all the files looking for .S or .s extensions in the aarch64 directory and looking for .asm in the x86 directory. This gave me two lists which I could use to compare and see what is missing for either. Due to my profiling showing that a function called flac_encode_frame was taking up most of the time I started with the only two files which referenced to flac in the x86 directory, this being flacdsp.asm and flac_dsp_gpl.asm. I was not able to work out exactly what gpl stands for but i believe that dsp stands for display which is not helpful considering I am looking at audio files in my testing. Though this seems like a dead end it still sets me down the right path of checking out these files. 

At this point I go back to my profiling and decide to find where in the code that this function is being called. I find the overarching function in flacenc.c. This file is used for encoding using flac, but file itself does not contain the offending code and so I do some digging. Luckily for me there are some excellent online resources for searching functions and files for FFmpeg namely  https://ffmpeg.org/doxygen/0.11/index.html which although out of date still gave me the ability to go back into the up to date files and find the offending functions. Which I found in golomb.h

golomb.h is an interesting file. After doing some googling I found that Golomb Encoding is a tpye of encoding which is very much optimized for dealing with small input numbers. Apparently there is a subtype of this encoding known as Rice Coding which is popular with audio data compression which is exactly what I am looking for. So looking at the code which is taking up the most time out of any function for my test cases reveals the following two functions.

/**
 * write unsigned golomb rice code (jpegls).
 */

static inline void set_ur_golomb_jpegls(PutBitContext *pb, int i, int k,
                                        int limit, int esc_len)
{
    int e;

    av_assert2(i >= 0);

    e = (i >> k) + 1;
    if (e < limit) {
        while (e > 31) {
            put_bits(pb, 31, 0);
            e -= 31;
        }
        put_bits(pb, e, 1);
        if (k)
            put_sbits(pb, k, i);
    } else {
        while (limit > 31) {
            put_bits(pb, 31, 0);
            limit -= 31;
        }
        put_bits(pb, limit, 1);
        put_bits(pb, esc_len, i - 1);
    }
}

/**
 * write signed golomb rice code (ffv1).
 */
static inline void set_sr_golomb(PutBitContext *pb, int i, int k, int limit,
                                 int esc_len)
{
    int v;

    v  = -2 * i - 1;
    v ^= (v >> 31);

    set_ur_golomb(pb, v, k, limit, esc_len);
}


These functions are used to do the encoding of the audio. As we have learned this type of coding is popular because it is optimized for smaller input values, as such seeing int as the datatype for both v and e to me shows there is room for improvment.

ldr  w19,[x23],#4                                                                
mvn  w19, w19, lsl #1  
eor  w19, w19, w19, asr #31   

 These three lines represent these two lines of code


    v  = -2 * i - 1;
    v ^= (v >> 31);


It is with confidence then That I can now say what my strategy would be for optimizing this code. Firstly I would test to see if I could turn that int v into a int16_t without any issues. If I am able to do this, then it would be possible to manually vecotrize this code, specifically mvn and eor both work with vecotrized data and so this would have potential to cut these to operations which in testing accounted for about 8 percent of the runtime down by a fraction.

This would work by changing the register from full words to quarter words and thus allowing us to run commands on four times the values at the same time. This should in theory quarter the amount of time the program spends on these commands. Which would for our test add up to approximately 6 seconds shaved off of our time an over 6% increase in performance. Though this is assuming two key things which I would have to test for. Firstly It is assuming that I can turn the ints into 16 bit ints. If this failed there may still be ways to only use 16 bit ints sometimes but then the question becomes if doing the check and making the adjustment is slower than the current method. The second assumption is that these lines of code will be faster even after converting between word size, considering these values are then going to be used later on in the program after being encoded. 

In conclusion I believe I have found a viable line of thinking towards improving performance of the audio encoding in FFmpeg on aarch64 systems, though I would need to test the viability of my approaches before being sure they are either possible or more efficient in practice.

Wednesday, April 15, 2020

Project Part 2: Profiling

In this Part of the project out goal was to take out program and profile it, seeing where the program was spending most of its time. Towards this goal we learned to utilize two tools which I used for my own profiling, gprof and perf.

gprof is a profiling tool which inserts itself into the program with the command -pg, which created a gmon.out file allowing us to see how our program runs. This file can then be read though it isn't the easiest to understand and is better when represented graphically. This data can be converted into a graph with:
    gprof ./ffmpeg_g | gprof2dot | dot
This code pipes out gprof output into a program which converts it into a form that can then be piped into a graphical format. I was having issues getting a graph to appear in its own window using the added parameters -T x11, and so I took the output and pasted it onto a web interpreter http://www.webgraphviz.com/. This website took that output and generated the graph seen in figure 5 which shows that flac_encode_frame is where the program ends up spending a lot of its time. This function is most likely the part of the program where the file is re encoded into the new format.


After Finding this I ran perf. This tool allows us to look at which lines of the assembly code are taking up the most time to run on each system and so give us an idea of ways to speed them up.
to run perf first I reconfigured the make to remove -pg then I ran a make clean and then a make. After this I ran the program but with the extra call to perf which looked like:

perf record ./ffmpeg_g -i ../OneHourBenchmark.mp3 OneHourBenchmarkOut.ogg
which gave the following 2 images.
figure 1

figure 2

The first of these images is a screenshot of the highest use functions of code in the program as well as their percentage of time spent there and what file they are a part of. The second image is what happens when you run annotate function and this shows a breakdown of both the source code and the assembly code generated by the source code. This assembly code is then labeled with what percentage of time was spent on each action. This can be a bit misleading however especially with smaller sample sizes as what is actually happening is that perf is interrupting the program every so often to sample which command is being run, so near misses can occur and some lines may be missed entirely dispute clearly being run. My particular test took around 1 minute and so was enough data to give a reasonable assessment of the code. As we can see comparing the results above with the results bellow the program spends its time within this function doing different things depending on the hardware,with ARM being above and x86 being below. Because of this different types of optimization will only work on certain machines and we will have to look into what specifically we will want to change in the next part.


figure 3

figure 4

figure 5

Wednesday, April 8, 2020

Project Part 0: The Before

In part 1 of the project blogs I discussed how I picked FFmpeg for the project but glossed over the process of deciding on that software. In general I don't use linux a lot in my daily life, I have windows on my laptop and desktop and prefer it for my usual use, with using things like winscp and putty to deal with linux tasks for school. Though with this task it is obviously needed to find a software that works in linux.

Luckily there are many tools in windows for running linux applications either with a virtual machine or with some sort of linux faking program and these generally work so that was not the hardest part. The hardest part, for me anyways, was picking a software. Since I didn't use linux much I was not very aware of linux software let alone open source software (though plenty of it is so that's helpful). I ended looking at both the github most popular repositories and GNU open source software such as gzip and after a while of digging which also involved looking at previous year's projects I decided on FFmpeg. This is because tasks such as changing formats for an audio or video file are cpu intensive and because I found it easiest to get test data for such tasks.

Lab 5 Simple Loop Program

For this lab, we were given a very simple task, to loop through some numbers and print them out. In any programming language this is one of the first things you learn to do and is super easy. Though assembly is not so simple and as such we had to work at it.

The first problem was getting the number displayed, the counting was easy, start with zero and loop till you reach a number, but getting that number displayed was a bit tricky. Luckily numbers are offset as characters by a set amount per digit and so as long as we could split the digits we could make the numbers into characters. This was far easier in x86 and Aarm64 vs 6502 as we had division operation we could do to separate the digits and thus get the characters.

After properly getting the characters we had to figure out how to add them to the display. After some trial and error and a bit of googling it was discovered that we could simply make the string for the loop have extra blank characters and replace the memory in those addresses with out characters thus giving us the proper output of a loop, printing out numbers.

Wednesday, April 1, 2020

Project Part 1

For this project we are tasked with selecting an open source software and trying to optimize some part of it. For this I chose FFmpeg as it is a file conversion tool which is open source, I know how to get a large data input for bench marking and it is CPU reliant.

So the first Thing I did was build the source code, this was notably more difficult on windows since it is not make to be run on Windows natively but with a some third party downloads, most importantly MSYS2, I was able to get it working. On the linux machine aarchie I had far fewer problems as the tools required for the build were already installed and ready to go. So I built the software on both X86_64 and aarch64 and tested them out.

In testing the first thing I did was run it with default configurations. After the build was completed I ran it using my test data, a download of a video from youtube which was just over an hour long which had already been converted to an mp3 of 83.3MB, which I was not converting to a ogg file. Here were the results,

X86_64 Ryan desktop benchmark
Test1:
real    0m14.516s
user    0m0.015s
sys     0m0.000s

Test2:
real    0m14.087s
user    0m0.000s
sys     0m0.015s


Test3:
real    0m14.235s
user    0m0.000s
sys     0m0.000s


aarch64 aarchie benchmark

Test1:
real    1m50.783s
user    1m48.931s
sys     0m1.415s

Test2:
real    1m50.571s
user    1m49.255s
sys     0m0.897s

Test3:
real    1m50.683s
user    1m49.034s
sys     0m1.236s





This is with minimal background tasks running on my home windows computer, and then on aarchie. As can be seen my computer is a a bit more powerful for this purpose but the results of the benchmark seem consistent.

Now After this was done I cleaned the make and rebuilt using -O3 to see if tweaking the compiler optimization would speed up the program. Here are those results.


X86_64 Ryan desktop O3
Test1:
real    0m14.047s
user    0m0.000s
sys     0m0.015s

Test2:
real    0m14.014s
user    0m0.015s
sys     0m0.000s

Test3:
real    0m14.044s
user    0m0.000s
sys     0m0.000s


aarch64 aarchie O3

Test1:
real    1m50.490s
user    1m48.994s
sys     0m1.086s

Test2:
real    1m50.401s
user    1m48.777s
sys     0m1.196s

Test3:
real    1m50.386s
user    1m48.742s
sys     0m1.236s

As can be seen the results were negligible, though I am unsure if this is due to an improper use of the optimization settings with the makefile or due to the makefile already optimizing the output. Either way if I am going to find a way to considerably speed up FFmpeg I am going to do more than a compiler optimization.