SPO600 Blog

Saturday, April 18, 2020

Project Part 3: Optimization

We have reached the third and final part of the project, in which the goal is to optimize a piece of open source software. In part one we selected and benchmarked the software, and in part two we did some profiling to see which parts of the program were slow. Building on these two steps, in this part I will look for ways to better optimize FFmpeg.

This task, as I have learned, is not an easy one. The first step was to decide which method would best speed up the program. Looking back at my first test, where the gcc optimization level failed to accomplish anything, it was clear this would not be easy. The next thing I did was check for architecture-specific optimization files, and this led to some more promising results. On the aarch64 machine aarchie, FFmpeg has 147 files in the libavcodec/aarch64 folder, and the only reason the number is this high is that many of them are output files generated during compilation. This is in contrast to the 194 files within the x86 directory, none of which are output files. This suggests that many more optimizations have been written specifically for x86. Additionally, when researching the SIMD optimization of FFmpeg I found this page, https://github.com/FFmpeg/FFmpeg/blob/master/doc/optimization.txt, which explicitly states that the best way to optimize for non-x86 systems is to look at what has already been done for x86.

So I went through all the files, looking for .S or .s extensions in the aarch64 directory and for .asm in the x86 directory. This gave me two lists which I could compare to see what is missing from either. Because my profiling showed that a function called flac_encode_frame was taking up most of the time, I started with the only two files in the x86 directory which reference flac, flacdsp.asm and flac_dsp_gpl.asm. I was not able to work out exactly what gpl stands for here, though it most likely marks code under the GNU General Public License. I first read dsp as display, although it almost certainly stands for digital signal processing, which does fit the audio files in my testing. These files did not hand me an answer directly, but they still set me down the right path.

At this point I went back to my profiling and decided to find where in the code this function is called. I found the overarching function in flacenc.c. This file handles encoding with flac, but the file itself does not contain the offending code, so I did some digging. Luckily there are some excellent online resources for searching FFmpeg's functions and files, namely https://ffmpeg.org/doxygen/0.11/index.html, which although out of date still let me work back to the up-to-date files and find the offending functions, which turned out to be in golomb.h.

golomb.h is an interesting file. After some googling I found that Golomb coding is a type of encoding which is optimized for small input numbers. There is a subtype of this encoding known as Rice coding which is popular in audio data compression, which is exactly what I am looking for. Looking at the code which takes up the most time of any function in my test cases reveals the following two functions.

/**
* write unsigned golomb rice code (jpegls).
*/
static inline void set_ur_golomb_jpegls(PutBitContext *pb, int i, int k,
                                        int limit, int esc_len)
{
    int e;

    av_assert2(i >= 0);

    e = (i >> k) + 1;
    if (e < limit) {
        while (e > 31) {
            put_bits(pb, 31, 0);
            e -= 31;
        }
        put_bits(pb, e, 1);
        if (k)
            put_sbits(pb, k, i);
    } else {
        while (limit > 31) {
            put_bits(pb, 31, 0);
            limit -= 31;
        }
        put_bits(pb, limit, 1);
        put_bits(pb, esc_len, i - 1);
    }
}

/**
* write signed golomb rice code (ffv1).
*/
static inline void set_sr_golomb(PutBitContext *pb, int i, int k, int limit,
                                 int esc_len)
{
    int v;

    v = -2 * i - 1;
    v ^= (v >> 31);

    set_ur_golomb(pb, v, k, limit, esc_len);
}

These functions are used to do the encoding of the audio. As we have learned, this type of coding is popular because it is optimized for smaller input values, so seeing int as the datatype for both v and e suggests to me that there is room for improvement.
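To make the idea of Rice coding more concrete, here is a minimal sketch in C of how a single non-negative value is split into a unary quotient and a k-bit remainder, which is the same layout the non-escape branch of set_ur_golomb_jpegls writes. The helper rice_encode_to_string is hypothetical and is not part of FFmpeg. It writes '0' and '1' characters into a buffer instead of using PutBitContext so the result can be read directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of FFmpeg.
 * Writes the Rice code of `value` with parameter `k` as text:
 * (value >> k) zero bits, a single one bit, then the low k bits of value.
 * Returns the number of bits written, or 0 if the arguments are unusable. */
static size_t rice_encode_to_string(uint32_t value, unsigned k,
                                    char *out, size_t out_size)
{
    uint32_t quotient;
    size_t pos;
    unsigned bit;

    assert(out != NULL);
    if (k > 31)
        return 0;

    quotient = value >> k;
    /* Room is needed for the zeros, the terminating one, k remainder bits and the NUL. */
    if (quotient >= out_size || (size_t)quotient + k + 2 > out_size)
        return 0;

    memset(out, '0', quotient);          /* unary part: one zero per unit of quotient */
    pos = quotient;
    out[pos++] = '1';                    /* marks the end of the unary part */
    for (bit = k; bit-- > 0; )           /* remainder, most significant bit first */
        out[pos++] = ((value >> bit) & 1u) ? '1' : '0';
    out[pos] = '\0';
    return pos;
}
```

For example, with k = 2 the value 9 has quotient 2 and remainder 1, so it is written as 00101. Small values produce short codes, which is why this scheme suits audio residuals that are mostly close to zero.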

ldr w19,[x23],#4
mvn w19, w19, lsl #1
eor w19, w19, w19, asr #31

These three instructions from the aarch64 assembly correspond to these two lines of code in set_sr_golomb:

    v = -2 * i - 1;
    v ^= (v >> 31);
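As a sanity check on that correspondence, here is a small C sketch, written under my own assumptions, that puts the original two statements next to a version shaped like the mvn/eor pair and a plain branching reference. The function names are hypothetical. The sketch relies on right-shifting a negative int being an arithmetic shift, which is what gcc does on both x86_64 and aarch64, and it ignores INT32_MIN, where -2 * i overflows.

```c
#include <assert.h>
#include <stdint.h>

/* The two statements exactly as they appear in set_sr_golomb. */
static int32_t fold_signed_shift(int32_t i)
{
    int32_t v = -2 * i - 1;
    v ^= (v >> 31);          /* assumes an arithmetic right shift */
    return v;
}

/* The same computation written the way the assembly does it. */
static int32_t fold_signed_mvn_eor(int32_t i)
{
    int32_t v = (int32_t)~((uint32_t)i << 1);  /* mvn w19, w19, lsl #1 */
    v ^= (v >> 31);                            /* eor w19, w19, w19, asr #31 */
    return v;
}

/* Branching reference: non-negative inputs map to even numbers,
 * negative inputs map to odd numbers. */
static int32_t fold_signed_reference(int32_t i)
{
    assert(i > INT32_MIN / 2 && i < INT32_MAX / 2);
    return i >= 0 ? 2 * i : -2 * i - 1;
}
```

All three agree, so 0, -1, 1, -2 become 0, 1, 2, 3. The point of the mapping is that values close to zero in either direction turn into small non-negative numbers, which is what the unsigned Rice coder wants as input.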

It is with confidence, then, that I can now say what my strategy for optimizing this code would be. First I would test whether I could turn that int v into an int16_t without any issues. If so, it would be possible to manually vectorize this code. Specifically, mvn and eor both have forms that work on vectorized data, so this has the potential to cut the cost of these two operations, which in testing accounted for about 8 percent of the runtime, down to a fraction of that.

This would work by changing the registers from holding one full 32-bit word to holding several 16-bit values, allowing each instruction to operate on four values at once. In theory this should quarter the time the program spends on these instructions, which for our test would add up to approximately 6 seconds shaved off our time, an improvement of about 6%. This rests on two key assumptions which I would have to test. The first is that I can turn the ints into 16-bit ints. If this fails there may still be ways to use 16-bit ints only some of the time, but then the question becomes whether doing the check and making the adjustment is slower than the current method. The second assumption is that these lines of code will still be faster after converting between word sizes, considering that these values are used later on in the program after being encoded.
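The first assumption can at least be stated precisely. The sketch below is hypothetical code of my own, not anything from FFmpeg. It checks whether a block of residuals is small enough for the fold to be done in 16 bits, and then performs the fold in a plain loop over int16_t values. The loop is ordinary scalar C, but it has the shape a vectorized version would work on, with each lane handled independently. The bounds come from requiring -2 * i - 1 to fit in an int16_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when every residual can be folded without leaving 16 bits:
 * -2 * i - 1 must stay within [-32768, 32767], so i must be in [-16384, 16383]. */
static int residuals_fit_int16(const int32_t *res, size_t n)
{
    size_t j;
    for (j = 0; j < n; j++)
        if (res[j] < -16384 || res[j] > 16383)
            return 0;
    return 1;
}

/* The same fold as set_sr_golomb, done on 16-bit values.
 * Assumes an arithmetic right shift, as gcc provides. */
static void fold_signed_int16(const int16_t *in, int16_t *out, size_t n)
{
    size_t j;
    for (j = 0; j < n; j++) {
        int16_t v;
        assert(in[j] >= -16384 && in[j] <= 16383);
        v = (int16_t)(-2 * in[j] - 1);
        v = (int16_t)(v ^ (v >> 15));   /* shift by 15 instead of 31 */
        out[j] = v;
    }
}
```

If residuals_fit_int16 fails for a block, that block would have to fall back to the existing 32-bit path, and the cost of the check itself is exactly the overhead that would need to be measured.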

In conclusion I believe I have found a viable line of thinking towards improving performance of the audio encoding in FFmpeg on aarch64 systems, though I would need to test the viability of my approaches before being sure they are either possible or more efficient in practice.

Wednesday, April 15, 2020

Project Part 2: Profiling

In this part of the project our goal was to take our program and profile it, seeing where the program was spending most of its time. Towards this goal we learned to use two tools, gprof and perf, which I used for my own profiling.

gprof is a profiling tool which relies on instrumentation added by building the program with the -pg compiler flag. Running the instrumented program then creates a gmon.out file which lets us see how our program runs. This file can be read with gprof, though the text output isn't the easiest to understand and is better represented graphically. The data can be converted into a graph with:
gprof ./ffmpeg_g | gprof2dot | dot
This command pipes our gprof output into a program which converts it into a form that can then be piped into a graphical format. I was having issues getting a graph to appear in its own window using the added parameters -T x11, so I took the output and pasted it into a web interpreter, http://www.webgraphviz.com/. This website took that output and generated the graph seen in figure 5, which shows that flac_encode_frame is where the program ends up spending a lot of its time. This function is most likely the part of the program where the file is re-encoded into the new format.

After finding this I ran perf. This tool lets us look at which lines of the assembly code take the most time to run on each system, and so gives us an idea of ways to speed them up.
To run perf, I first reconfigured the build to remove -pg, then ran a make clean followed by a make. After this I ran the program with an extra call to perf, which looked like:

perf record ./ffmpeg_g -i ../OneHourBenchmark.mp3 OneHourBenchmarkOut.ogg
which gave the following 2 images.

figure 1

figure 2

The first of these images is a screenshot of the functions where the program spends the most time, along with the percentage of time spent in each and the file each belongs to. The second image is what you get when you run the annotate function, which shows a breakdown of both the source code and the assembly generated from it. The assembly is labeled with the percentage of time spent on each instruction. This can be a bit misleading, especially with smaller sample sizes, because what perf actually does is interrupt the program every so often to sample which instruction is running. Near misses can occur, and some lines may be missed entirely despite clearly being run. My particular test took around 1 minute, which was enough data to give a reasonable assessment of the code. Comparing the results above with the results below, with ARM above and x86 below, we can see that the program spends its time within this function doing different things depending on the hardware. Because of this, different types of optimization will only work on certain machines, and we will have to look into what specifically we want to change in the next part.

figure 3

figure 4

figure 5

Wednesday, April 8, 2020

Project Part 0: The Before

In part 1 of the project blogs I discussed how I picked FFmpeg for the project but glossed over the process of deciding on that software. In general I don't use Linux much in my daily life. I have Windows on my laptop and desktop and prefer it for my usual use, relying on things like WinSCP and PuTTY to deal with Linux tasks for school. For this task, though, it was obviously necessary to find software that works on Linux.

Luckily there are many tools on Windows for running Linux applications, either with a virtual machine or with some sort of Linux compatibility layer, and these generally work, so that was not the hardest part. The hardest part, for me anyway, was picking the software. Since I didn't use Linux much I was not very aware of Linux software, let alone open source software (though plenty of it is, so that's helpful). I ended up looking at both the most popular GitHub repositories and GNU open source software such as gzip, and after a while of digging, which also involved looking at previous years' projects, I decided on FFmpeg. This is because tasks such as changing the format of an audio or video file are CPU intensive, and because I found it easiest to get test data for such tasks.

Lab 5 Simple Loop Program

For this lab we were given a very simple task: loop through some numbers and print them out. In any programming language this is one of the first things you learn to do and it is super easy. Assembly is not so simple, though, and so we had to work at it.

The first problem was getting the number displayed. The counting was easy, start at zero and loop until you reach a number, but getting that number displayed was a bit tricky. Luckily the digit characters sit at a fixed offset from their numeric values, so as long as we could split a number into its digits we could turn the digits into characters. This was far easier on x86_64 and aarch64 than on the 6502, as we had a division operation we could use to separate the digits and thus get the characters.

After properly getting the characters we had to figure out how to add them to the output. After some trial and error and a bit of googling, we discovered that we could simply give the string for the loop extra blank characters and overwrite the memory at those addresses with our characters, giving us the proper output of a loop printing out numbers.
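The same two ideas, splitting digits with division and patching them into placeholder characters in the message, can be sketched in C. The message text and the names below are hypothetical stand-ins for illustration, not the lab's actual assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical message layout: the two '#' characters are the blank
 * slots that get overwritten with the digits on every iteration. */
#define LOOP_TEMPLATE "Loop: ##\n"
#define DIGIT_OFFSET  6            /* index of the first '#' */

/* Split n into tens and ones with division and remainder, then add the
 * character '0' to turn each digit value into its character. */
static void put_two_digits(char *dst, unsigned n)
{
    assert(n < 100);
    dst[0] = (char)('0' + n / 10);
    dst[1] = (char)('0' + n % 10);
}

/* Copy the template and overwrite the placeholder bytes in place. */
static void build_loop_line(unsigned n, char *line, size_t size)
{
    assert(size >= sizeof LOOP_TEMPLATE);
    memcpy(line, LOOP_TEMPLATE, sizeof LOOP_TEMPLATE);
    put_two_digits(line + DIGIT_OFFSET, n);
}
```

In assembly the division step is a single divide instruction on x86_64 and aarch64, which is why the 6502, with no divide at all, makes this so much more work.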

Wednesday, April 1, 2020

Project Part 1

For this project we are tasked with selecting a piece of open source software and trying to optimize some part of it. I chose FFmpeg because it is an open source file conversion tool, I know how to get a large data input for benchmarking, and it is CPU reliant.

So the first thing I did was build the source code. This was notably more difficult on Windows, since FFmpeg is not made to be built on Windows natively, but with some third party downloads, most importantly MSYS2, I was able to get it working. On the Linux machine aarchie I had far fewer problems, as the tools required for the build were already installed and ready to go. So I built the software on both x86_64 and aarch64 and tested them out.

In testing, the first thing I did was run it with the default configuration. After the build was completed I ran it on my test data, a video from YouTube just over an hour long which had already been converted to an 83.3MB mp3, which I was now converting to an ogg file. Here were the results:

X86_64 Ryan desktop benchmark
Test1:
real    0m14.516s
user    0m0.015s
sys     0m0.000s

Test2:
real    0m14.087s
user    0m0.000s
sys     0m0.015s

Test3:
real    0m14.235s
user    0m0.000s
sys     0m0.000s

aarch64 aarchie benchmark
Test1:
real    1m50.783s
user    1m48.931s
sys     0m1.415s

Test2:
real    1m50.571s
user    1m49.255s
sys     0m0.897s

Test3:
real    1m50.683s
user    1m49.034s
sys     0m1.236s

This is with minimal background tasks running on my home Windows computer, and then on aarchie. As can be seen, my computer is a bit more powerful for this purpose, but the results of each benchmark are consistent.

After this was done I cleaned the build and rebuilt using -O3 to see if tweaking the compiler optimization would speed up the program. Here are those results.

X86_64 Ryan desktop O3
Test1:
real    0m14.047s
user    0m0.000s
sys     0m0.015s

Test2:
real    0m14.014s
user    0m0.015s
sys     0m0.000s

Test3:
real    0m14.044s
user    0m0.000s
sys     0m0.000s

aarch64 aarchie O3
Test1:
real    1m50.490s
user    1m48.994s
sys     0m1.086s

Test2:
real    1m50.401s
user    1m48.777s
sys     0m1.196s

Test3:
real    1m50.386s
user    1m48.742s
sys     0m1.236s

As can be seen the difference was negligible, though I am unsure whether this is due to an improper use of the optimization settings with the makefile or due to the makefile already optimizing the output. Either way, if I am going to find a way to considerably speed up FFmpeg I am going to have to do more than change a compiler optimization level.
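To put a number on "negligible", the real times can be averaged and compared. This is a small arithmetic sketch with hypothetical helper names. The inputs are the real times from the tables above converted to seconds, so 1m50.783s becomes 110.783.

```c
#include <assert.h>

/* Mean of the three runs of one benchmark, in seconds. */
static double mean_of_three(double a, double b, double c)
{
    return (a + b + c) / 3.0;
}

/* Percentage of the original time that was saved. */
static double percent_saved(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Feeding in the real times gives a saving of roughly 0.2% on aarchie (about 110.68 seconds down to about 110.43) and roughly 1.7% on the x86_64 desktop (about 14.28 seconds down to about 14.04). Both are small enough to sit close to the run-to-run variation in the tables.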

Sunday, March 15, 2020

Lab 4 - Pick Two pt.2

In Part 1 we looked at making an adder in 6502 assembly. That was one of the two tasks we chose for this lab; the other was a screen colour selector. I personally found the screen colour selector a lot easier to code than the adder, and it took far less code to implement. So without any delay, here is the whole source code before we dive into how it works.

; ROM routines
define        SCINIT        $ff81 ; initialize/clear screen
define        CHRIN        $ffcf ; input character from keyboard
define        CHROUT        $ffd2 ; output character to screen
define        SCREEN        $ffed ; get screen size
define        PLOT        $fff0 ; get/set cursor coordinates

        jsr SCINIT
        ldy #$00

initColours:
    lda colours,y
        beq doneInit
        jsr CHROUT
        iny
        bne initColours
doneInit:
    ldy #$00
    ldx #$00
    CLC
    jsr PLOT

    SEC
    jsr PLOT
    jsr flipSelect


checkIn:
    SEC
    jsr PLOT
    jsr CHRIN

    cmp #$80
    beq up
    cmp #$82
    bne checkIn
down:
    cpy #$0f
    beq checkIn
    jsr flipSelect
    iny
    jsr flipSelect
    jsr drawScreen
    jmp checkIn

up:
    cpy #$00
    beq checkIn
    jsr flipSelect
    dey
    jsr flipSelect
    jsr drawScreen
    jmp checkIn


flipSelect:
    ldx #$00
    CLC
    jsr PLOT
    SEC
    jsr PLOT

flipLoop:
    cmp #$20
    beq doneFlip
    eor #$80
    jsr CHROUT
    SEC
    jsr PLOT
    clc
    bcc flipLoop

doneFlip:
    rts

drawScreen:
    tya
    pha
    lda #$00     ; set pointer at $10 to $0200
        sta $10
        lda #$02
        sta $11
    pla

        ldx #$06     ; max value for $11

        ldy #$00     ; index

drawLoop:
    sta ($10),y ; store colour
        iny          ; increment index
        bne drawLoop ; branch until page done

        inc $11      ; increment high byte of pointer
        cpx $11      ; compare with max value
        bne drawLoop ; continue if not done

    rts




colours:
dcb "B","L","A","C","K",10
dcb "W","H","I","T","E",10
dcb "R","E","D",10
dcb "C","Y","A","N",10
dcb "P","U","R","P","L","E",10
dcb "G","R","E","E","N",10
dcb "B","L","U","E",10
dcb "Y","E","L","L","O","W",10
dcb "O","R","A","N","G","E",10
dcb "B","R","O","W","N",10
dcb "L","I","G","H","T",95,"R","E","D",10
dcb "D","A","R","K",95,"G","R","E","Y",10
dcb "G","R","E","Y",10
dcb "L","I","G","H","T",95,"G","R","E","E","N",10
dcb "L","I","G","H","T",95,"B","L","U","E",10
dcb "L","I","G","H","T",95,"G","R","E","Y",00

This code, similar to the code in pt. 1, uses a main loop as the body of the program, though this one is a bit different and also isn't named main. That is getting ahead of ourselves, though. The first thing the program does is initialize the screen and then go through the colour names stored in memory, printing them to the screen. This was fairly easy to do, since the ROM routine CHROUT handles newlines properly, allowing the whole list to be one block of memory.

The main loop for this program is called checkIn. This loop checks for an input and, when it receives one, updates the screen accordingly. Both up and down inputs work about the same, so let's just look at up. When the up arrow is pressed, checkIn branches to up. This code then checks whether moving up is valid (i.e. the selection is not already at the top of the screen) and if so removes the current selection, selects the proper line, changes the screen colour, and returns to the checkIn loop.

Now of course up and down both call their own subroutines, which I will explain now. flipSelect is a pretty interesting subroutine. It simply flips the high bit of every character in whatever line is in y, which will be the currently selected line. So the first time it is called it flips the bit off to deselect the line, then the second time it is called it flips the bit on to select the new line. The drawScreen subroutine is one we have used a lot in the course so far. It simply takes what is currently in y, which as established is the currently selected element, and fills the screen with that colour. We made sure to align the colours on the screen with their places in memory so that their y values line up with their colour values. This results in the correct colour being displayed.
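For anyone more comfortable reading C than 6502 assembly, here is a rough model of what those two subroutines do. The function names and the array-based screen are hypothetical stand-ins for the emulator's memory. Like flipLoop, the first function stops at the first space character, and like drawLoop, the second simply writes one colour value to every byte of the display (the four pages from $0200 to $05ff, which is 1024 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of flipSelect: toggle the high bit of each character in a row
 * until a space (0x20) is reached. Calling it twice restores the row.
 * Returns the number of characters that were flipped. */
static size_t flip_highlight(uint8_t *row, size_t width)
{
    size_t col = 0;
    assert(row != NULL);
    while (col < width && row[col] != 0x20) {
        row[col] ^= 0x80;      /* same operation as eor #$80 */
        col++;
    }
    return col;
}

/* Model of drawScreen: store the colour index in every display byte. */
static void fill_display(uint8_t *pixels, size_t count, uint8_t colour)
{
    size_t i;
    for (i = 0; i < count; i++)
        pixels[i] = colour;
}
```

Because the flip stops at a space, a name containing a real space would only be partly highlighted, which would explain why the multi-word colour names in the table use character 95 (an underscore) between words.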

Putting these few simple subroutines together with the ROM routines, we have a very compact and easy to understand bit of code which lets you select a colour and display it on the screen.

Lab 4 - Pick Two pt.1

For this lab we were tasked with creating two programs from a list in 6502 assembly code. These tasks ranged in difficulty, though some things were made far easier by the introduction of ROM routines, which I will explain a bit later. The two tasks which our group decided on were the calculator and the colour selector. We chose these two for a couple of reasons, the largest being that we all felt most confident in our own ability to get these two tasks done over any others. So with our tasks in hand we set out to get an understanding of ROM routines.

ROM routines are basically snippets of code saved in the memory of the chip. To access one you simply jump to a subroutine at the right address, and it will run as if it were code which you wrote. The routines given to us do in one line things which previously required many lines of code, with smart use of the various registers to supply input to these routines.

Before getting to the code for the adder, it is important to note that we were never able to get the blinking cursor to work properly, though the rest of the program works. Here is the full source code, which will be explained below.

; ROM routines
define        SCINIT        $ff81 ; initialize/clear screen
define        CHRIN        $ffcf ; input character from keyboard
define        CHROUT        $ffd2 ; output character to screen
define        SCREEN        $ffed ; get screen size
define        PLOT        $fff0 ; get/set cursor coordinates

define        NUMBERA        $10;
define        NUMBERB        $20;

        jsr SCINIT

mainLoop:
    ldy #$00
    jsr char1
    jsr input
    jsr storeA
    ldy #$00
    jsr char2
    jsr input
    jsr storeB
    ldy #$00
    jsr charR
    jsr printAdd
    jmp mainLoop


input:
    SEC
    jsr PLOT
    ldx #$15
    CLC
    jsr PLOT



inLoop:
    SEC
    jsr PLOT
    jsr CHRIN

charCheck:
    cmp #$00
    beq inLoop

    cmp #$81
    beq right

    cmp #$83
    beq left

    cmp #$0d
    beq next

drawNum:
    cmp #$30
    bcc inLoop

    clc
    cmp #$3a
    bcs inLoop

    jsr CHROUT

    SEC
    jsr PLOT
    cpx #$17
    bne inLoop
    dex
    CLC
    jsr PLOT
    jmp inLoop

left:    cpx #$15
    beq inLoop
    jsr CHROUT
    jmp inLoop

right:    cpx #$16
    beq inLoop
    jsr CHROUT
    jmp inLoop

next:
    SEC
    jsr PLOT
    ldx #$15
    CLC
    jsr PLOT
    SEC
    jsr PLOT

    CLC
    SBC #$2F

    ASL
    ASL
    ASL
    ASL

    PHA


    ldx #$16
    CLC
    jsr PLOT
    SEC
    jsr PLOT

    CLC
    SBC #$2F
    PHA

    ldx #$00
    iny
    CLC
    jsr PLOT
    SEC
    jsr PLOT

    PLA
    TAX
    PLA

    rts

storeA:
    sta NUMBERA
    txa
    eor NUMBERA
    sta NUMBERA
    rts


storeB:
    sta NUMBERB
    txa
    eor NUMBERB
    sta NUMBERB
    rts

printAdd:
    SEC
    jsr PLOT
    ldx #$15
    CLC
    jsr PLOT
    SEC
    jsr PLOT

    SED
    lda NUMBERA
    adc NUMBERB
    CLD
    pha

    bcc outputAddition
    ldx #$14
    CLC
    jsr PLOT
    SEC
    jsr PLOT
    lda #$31
    jsr CHROUT

outputAddition:
    pla
    pha
    LSR
    LSR
    LSR
    LSR
    clc
    adc #$30
    jsr CHROUT

    pla
    and #$0F
    clc
    adc #$30
    jsr CHROUT

    SEC
    jsr PLOT
    ldx #$00
    iny
    CLC
    jsr PLOT

    rts






char1: lda firstDigit,y
        beq charRet
        jsr CHROUT
        iny
        bne char1

char2: lda secondDigit,y
        beq charRet
        jsr CHROUT
        iny
        bne char2

charR: lda result,y
        beq charRet
        jsr CHROUT
        iny
        bne charR

charRet:
    rts

firstDigit:
dcb "E","N","T","E","R",32,"F","I","R","S","T",32,"D","I","G","I","T",":",32,32,32,"0","0"
dcb 00

secondDigit:
dcb "E","N","T","E","R",32,"S","E","C","O","N","D",32,"D","I","G","I","T",":",32,32,"0","0"
dcb 00

result:
dcb "R","E","S","U","L","T",":"
dcb 00

The adder is actually quite a simple program, especially thanks to ROM routines. The code went through a few revisions before settling on the final result, due to poor use of subroutines in previous incarnations as well as unfamiliarity with the ROM routines. Speaking of the ROM routines, let's discuss what they accomplish for us here.

There are three ROM routines which are important to making this code as compact as it is. The first is CHROUT, which puts a character onto the screen and then moves the cursor over one, very simple but extremely useful. It does this by taking the value in the accumulator and putting it at the current cursor location. The next routine is the complement to CHROUT, CHRIN. As the name implies, CHRIN takes a character input. This input is stored in the accumulator, which means it can be used in conjunction with CHROUT to print input to the screen. The last important ROM routine is PLOT. This routine has two functions depending on the state of the carry flag. With the carry set (SEC) it gets the current cursor position and returns the value of the character there, and with the carry clear (CLC) it sets the cursor position based on what is currently in x and y.

Using these tools, the code goes through a main loop which uses a few subroutines to keep it as readable as possible. This main loop is only 12 lines but does all the work of the program via the subroutines it calls. Let's take a closer look at it.

mainLoop:
    ldy #$00
    jsr char1
    jsr input
    jsr storeA
    ldy #$00
    jsr char2
    jsr input
    jsr storeB
    ldy #$00
    jsr charR
    jsr printAdd
    jmp mainLoop

So the first step is to set y to a known value. This allows char1 to work properly, as it prints the instructions for the first input onto the screen starting from whatever index is in y. Next we get the first user input. This is done through quite complex code which I will explain further down. Following this it stores the first number, then it does it all again for the second number. After this it prints the result text, followed by doing the actual addition and printing the result. The whole thing then loops, allowing the program to keep taking inputs.

Now let's break down a couple of the more important components there, namely input, storeA and storeB, and printAdd. input is how we get our user input, and as we are simply making a calculator only certain characters are allowed. We do this by filtering the values returned by CHRIN; if it gets anything else we simply go back and read again. This ensures we are getting either a number, an arrow key, or the enter key. When a character is input, it also checks whether the input field is on the second digit to ensure you can only write two digits. Next are storeA and storeB, which take the numbers provided to them and store them at different addresses, fairly simple. Lastly there is printAdd, which does the actual adding. This subroutine switches to decimal mode before adding together the two values stored by the previous subroutines. It then checks whether there was a carry, and if there was it draws a 1 before the number, before drawing the output of the addition.
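The storing and adding steps are easier to follow with the number format spelled out. Each two-digit number is kept as packed decimal, with the tens digit in the high four bits and the ones digit in the low four bits, and decimal mode makes the 6502's adc carry at 10 instead of 16. The C below is only a model of that behaviour with hypothetical function names. On the 6502 the addition itself is a single adc between sed and cld.

```c
#include <assert.h>
#include <stdint.h>

/* Model of next plus storeA/storeB: turn two digit characters into one
 * packed decimal byte, tens digit in the high nibble. */
static uint8_t pack_bcd(char tens, char ones)
{
    assert(tens >= '0' && tens <= '9' && ones >= '0' && ones <= '9');
    return (uint8_t)(((tens - '0') << 4) | (ones - '0'));
}

/* Model of adc in decimal mode: add digit by digit, carrying at 10.
 * *carry_out is set to 1 when the sum is 100 or more. */
static uint8_t bcd_add(uint8_t a, uint8_t b, int *carry_out)
{
    unsigned lo = (a & 0x0Fu) + (b & 0x0Fu);
    unsigned hi = (unsigned)(a >> 4) + (unsigned)(b >> 4);

    if (lo > 9) { lo -= 10; hi += 1; }
    *carry_out = 0;
    if (hi > 9) { hi -= 10; *carry_out = 1; }
    return (uint8_t)((hi << 4) | lo);
}

/* Model of outputAddition: one character per nibble, plus a leading
 * '1' when there was a carry (a space is used here when there was not). */
static void bcd_to_text(uint8_t sum, int carry, char out[4])
{
    out[0] = carry ? '1' : ' ';
    out[1] = (char)('0' + (sum >> 4));
    out[2] = (char)('0' + (sum & 0x0F));
    out[3] = '\0';
}
```

As an example, 47 and 85 are stored as $47 and $85. The ones digits give 12, so 2 is kept and 1 carries. The tens digits plus the carry give 13, so 3 is kept and the carry flag is set. The result byte is $32 with a carry, which prints as 132.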

All of this put together gives us a working, though not perfect, adder in 6502 assembly.