We have reached the third and final part of the project in which the goal is to optimize a piece of open source software, is part one we selected and bench marked the software and in part 2 we did some profiling to see what the slow parts of the program were. Building off of these two steps in this step I will look for ways to better optimize FFmpeg.
This task as I have learned is not an easy one, the first step was to decide on which method would best speed up this program. Looking back at my first test I did and seeing the gcc optimization fail to accomplish anything this would not be easy. The next thing I did was check for individual optimization files, and this led to some more promising results. While using the aarm64 machine aarchie, FFmpeg has 147 files in the libavcodec/aarch64 folder, and the only reason the number is this high is because many of the files generated output files when compiled. This is in contrast to the 194 files within the x86 directory, none of which being output files. This means there have been many more optimizations done specifically for x86. Additionally when doing research on the SIMD optimization of FFmpeg I found this page https://github.com/FFmpeg/FFmpeg/blob/master/doc/optimization.txt which explicitly states that the best way to look to optimize for non x86 systems is to look at what has already been done there.
So I went and checked down all the files looking for .S or .s extensions in the aarch64 directory and looking for .asm in the x86 directory. This gave me two lists which I could use to compare and see what is missing for either. Due to my profiling showing that a function called flac_encode_frame was taking up most of the time I started with the only two files which referenced to flac in the x86 directory, this being flacdsp.asm and flac_dsp_gpl.asm. I was not able to work out exactly what gpl stands for but i believe that dsp stands for display which is not helpful considering I am looking at audio files in my testing. Though this seems like a dead end it still sets me down the right path of checking out these files.
At this point I go back to my profiling and decide to find where in the code that this function is being called. I find the overarching function in flacenc.c. This file is used for encoding using flac, but file itself does not contain the offending code and so I do some digging. Luckily for me there are some excellent online resources for searching functions and files for FFmpeg namely https://ffmpeg.org/doxygen/0.11/index.html which although out of date still gave me the ability to go back into the up to date files and find the offending functions. Which I found in golomb.h
golomb.h is an interesting file. After doing some googling I found that Golomb Encoding is a tpye of encoding which is very much optimized for dealing with small input numbers. Apparently there is a subtype of this encoding known as Rice Coding which is popular with audio data compression which is exactly what I am looking for. So looking at the code which is taking up the most time out of any function for my test cases reveals the following two functions.
/**
* write unsigned golomb rice code (jpegls).
*/
static inline void set_ur_golomb_jpegls(PutBitContext *pb, int i, int k,
int limit, int esc_len)
{
int e;
av_assert2(i >= 0);
e = (i >> k) + 1;
if (e < limit) {
while (e > 31) {
put_bits(pb, 31, 0);
e -= 31;
}
put_bits(pb, e, 1);
if (k)
put_sbits(pb, k, i);
} else {
while (limit > 31) {
put_bits(pb, 31, 0);
limit -= 31;
}
put_bits(pb, limit, 1);
put_bits(pb, esc_len, i - 1);
}
}
/**
* write signed golomb rice code (ffv1).
*/
static inline void set_sr_golomb(PutBitContext *pb, int i, int k, int limit,
int esc_len)
{
int v;
v = -2 * i - 1;
v ^= (v >> 31);
set_ur_golomb(pb, v, k, limit, esc_len);
}
These functions are used to do the encoding of the audio. As we have learned this type of coding is popular because it is optimized for smaller input values, as such seeing int as the datatype for both v and e to me shows there is room for improvment.
ldr w19,[x23],#4
mvn w19, w19, lsl #1
eor w19, w19, w19, asr #31
These three lines represent these two lines of code
v = -2 * i - 1;
v ^= (v >> 31);
It is with confidence then That I can now say what my strategy would be for optimizing this code. Firstly I would test to see if I could turn that int v into a int16_t without any issues. If I am able to do this, then it would be possible to manually vecotrize this code, specifically mvn and eor both work with vecotrized data and so this would have potential to cut these to operations which in testing accounted for about 8 percent of the runtime down by a fraction.
This would work by changing the register from full words to quarter words and thus allowing us to run commands on four times the values at the same time. This should in theory quarter the amount of time the program spends on these commands. Which would for our test add up to approximately 6 seconds shaved off of our time an over 6% increase in performance. Though this is assuming two key things which I would have to test for. Firstly It is assuming that I can turn the ints into 16 bit ints. If this failed there may still be ways to only use 16 bit ints sometimes but then the question becomes if doing the check and making the adjustment is slower than the current method. The second assumption is that these lines of code will be faster even after converting between word size, considering these values are then going to be used later on in the program after being encoded.
In conclusion I believe I have found a viable line of thinking towards improving performance of the audio encoding in FFmpeg on aarch64 systems, though I would need to test the viability of my approaches before being sure they are either possible or more efficient in practice.
No comments:
Post a Comment