Saturday, April 18, 2020

Project Part 3: Optimization

We have reached the third and final part of the project, in which the goal is to optimize a piece of open source software. In part 1 we selected and benchmarked the software, and in part 2 we did some profiling to see which parts of the program were slow. Building on those two steps, in this part I will look for ways to better optimize FFmpeg.

This task, as I have learned, is not an easy one. The first step was to decide which method would best speed up the program. Looking back at my first test, where the gcc optimization level failed to accomplish anything, I knew this would not be easy. The next thing I did was check for architecture-specific optimization files, and this led to some more promising results. On the aarch64 machine aarchie, FFmpeg has 147 files in the libavcodec/aarch64 folder, and the only reason the number is this high is that many of the files produced output files when compiled. This is in contrast to the 194 files in the x86 directory, none of which are output files. This suggests that far more optimization work has been done specifically for x86. Additionally, when researching the SIMD optimization of FFmpeg I found this page, https://github.com/FFmpeg/FFmpeg/blob/master/doc/optimization.txt, which explicitly states that the best way to look to optimize for non-x86 systems is to look at what has already been done there.

So I went through all the files, looking for .S or .s extensions in the aarch64 directory and for .asm in the x86 directory. This gave me two lists which I could compare to see what is missing from either. Since my profiling showed that a function called flac_encode_frame was taking up most of the time, I started with the only two files in the x86 directory that reference flac: flacdsp.asm and flac_dsp_gpl.asm. At the time I could not work out what gpl stands for (it most likely refers to the GNU General Public License), and I believed that dsp stood for display, which would not have been helpful since I am testing with audio files. In fact DSP normally stands for digital signal processing, which covers audio. Though this seemed like a dead end, it still set me down the right path of checking out these files.

At this point I went back to my profiling and decided to find where in the code this function is called. I found the overarching function in flacenc.c. This file handles FLAC encoding, but it does not itself contain the offending code, so I did some digging. Luckily there are some excellent online resources for searching FFmpeg's functions and files, namely https://ffmpeg.org/doxygen/0.11/index.html, which, although out of date, still let me go back into the up-to-date files and find the offending functions. I found them in golomb.h.

golomb.h is an interesting file. After some googling I found that Golomb coding is a type of encoding that is heavily optimized for small input numbers. Apparently there is a subtype of this encoding known as Rice coding which is popular for audio data compression, which is exactly what I am looking for. Looking at the code that takes up the most time of any function in my test cases reveals the following two functions.

/**
 * write unsigned golomb rice code (jpegls).
 */

static inline void set_ur_golomb_jpegls(PutBitContext *pb, int i, int k,
                                        int limit, int esc_len)
{
    int e;

    av_assert2(i >= 0);

    e = (i >> k) + 1;
    if (e < limit) {
        while (e > 31) {
            put_bits(pb, 31, 0);
            e -= 31;
        }
        put_bits(pb, e, 1);
        if (k)
            put_sbits(pb, k, i);
    } else {
        while (limit > 31) {
            put_bits(pb, 31, 0);
            limit -= 31;
        }
        put_bits(pb, limit, 1);
        put_bits(pb, esc_len, i - 1);
    }
}

/**
 * write signed golomb rice code (ffv1).
 */
static inline void set_sr_golomb(PutBitContext *pb, int i, int k, int limit,
                                 int esc_len)
{
    int v;

    v  = -2 * i - 1;
    v ^= (v >> 31);

    set_ur_golomb(pb, v, k, limit, esc_len);
}


These functions do the encoding of the audio. As we have learned, this type of coding is popular because it is optimized for smaller input values, so seeing int as the datatype for both v and e suggests to me that there is room for improvement.
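
To make the "optimized for small values" point concrete, here is a minimal sketch of how many bits a plain Rice code spends on a value. The function name rice_code_length is my own and is not part of FFmpeg, and the sketch deliberately ignores the limit/esc_len escape path of the real function.

```c
#include <stdint.h>

/* Number of bits a plain Rice code spends on the non-negative value v
 * with parameter k (k < 32): (v >> k) zero bits, one terminating 1 bit,
 * then the k low bits of v. Hypothetical helper, not FFmpeg code; the
 * escape path (limit / esc_len) is left out of this sketch. */
static unsigned rice_code_length(uint32_t v, unsigned k)
{
    return (v >> k) + 1 + k;
}
```

With k = 2 the value 5 costs 4 bits while 1000 costs 253, so the encoder only pays off when most inputs are small.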

ldr  w19,[x23],#4                                                                
mvn  w19, w19, lsl #1  
eor  w19, w19, w19, asr #31   

These three instructions implement the following two lines of C:


    v  = -2 * i - 1;
    v ^= (v >> 31);
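
As a sketch of why those instructions match the C, the mapping can be rewritten the way the compiler emitted it. The name map_signed is my own, and the comments pairing each line with an instruction are my reading of the listing above rather than something taken from FFmpeg.

```c
#include <stdint.h>

/* Same mapping as the two C lines above, written in the shape of the
 * generated code: in two's complement -2*i - 1 equals ~(2*i), which is
 * the shifted mvn, and the XOR with the sign mask is the eor with asr. */
static uint32_t map_signed(int32_t i)
{
    uint32_t v = ~((uint32_t)i << 1);               /* mvn w19, w19, lsl #1       */
    uint32_t sign = (v >> 31) ? 0xFFFFFFFFu : 0u;   /* all ones when v is negative */
    return v ^ sign;                                /* eor w19, w19, w19, asr #31 */
}
```

The result interleaves the signs, so 0, -1, 1, -2, 2 become 0, 1, 2, 3, 4 and small magnitudes stay small for the Rice coder.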


It is with some confidence, then, that I can now say what my strategy would be for optimizing this code. First I would test whether I could turn that int v into an int16_t without any issues. If so, it would be possible to manually vectorize this code: mvn and eor both have forms that work on vector data, so this has the potential to cut the time spent on these two operations, which in testing accounted for about 8 percent of the runtime, down to a fraction.

This would work by packing the values as 16-bit halfwords in a vector register instead of handling one 32-bit word at a time, allowing a single instruction to operate on four values at once (four 16-bit lanes fit in a 64-bit NEON register, and eight in a 128-bit one). At four values per instruction this should in theory quarter the time the program spends on these instructions, which for our test would add up to approximately 6 seconds shaved off our time, a roughly 6% increase in performance. This assumes two key things which I would have to test. First, it assumes that I can turn the ints into 16-bit ints. If that failed there might still be ways to use 16-bit ints only some of the time, but then the question becomes whether doing the check and making the adjustment is slower than the current method. The second assumption is that these lines of code will still be faster after converting between sizes, considering the values are used later on in the program after being encoded.
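
The two assumptions can be sketched in plain C before touching any assembly. Both function names here are hypothetical and this is not FFmpeg code: the first checks whether a block of values survives the move to 16 bits, and the second is the same mapping written as a branch-free loop over 16-bit lanes, which is the shape a vectorizer or hand-written NEON would work on.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when every value in res[0..n) still fits in an int16_t after
 * the -2*i - 1 step, i.e. when the narrow version below cannot overflow. */
static int fits_in_16_bits(const int32_t *res, size_t n)
{
    for (size_t j = 0; j < n; j++)
        if (res[j] < -16384 || res[j] > 16383)
            return 0;
    return 1;
}

/* The same mapping on 16-bit values. No branches and no dependency
 * between iterations, so several elements can be processed at a time. */
static void map_signed_16(const int16_t *in, uint16_t *out, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        uint16_t v = (uint16_t)~((uint16_t)in[j] << 1);
        uint16_t sign = (v & 0x8000u) ? 0xFFFFu : 0u;
        out[j] = (uint16_t)(v ^ sign);
    }
}
```

Whether the range check is cheap enough to run per block is exactly the open question from the paragraph above; this sketch only shows what would need to be measured.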

In conclusion I believe I have found a viable line of thinking towards improving performance of the audio encoding in FFmpeg on aarch64 systems, though I would need to test the viability of my approaches before being sure they are either possible or more efficient in practice.

Wednesday, April 15, 2020

Project Part 2: Profiling

In this part of the project our goal was to take our program and profile it, seeing where the program was spending most of its time. Towards this goal we learned to use two tools, which I used for my own profiling: gprof and perf.

gprof is a profiling tool that is built into the program by compiling with the -pg option; running the program then creates a gmon.out file which lets us see how the program runs. This file can be read, though it isn't the easiest to understand and is better represented graphically. The data can be converted into a graph with:
    gprof ./ffmpeg_g | gprof2dot | dot
This command pipes the gprof output into a program that converts it into a form which can then be piped into a graph renderer. I was having issues getting the graph to appear in its own window using the added parameter -T x11, so I took the output and pasted it into a web interpreter, http://www.webgraphviz.com/. This website took that output and generated the graph seen in figure 5, which shows that flac_encode_frame is where the program ends up spending a lot of its time. This function is most likely the part of the program where the file is re-encoded into the new format.


After finding this I ran perf. This tool lets us look at which lines of the assembly code take the most time to run on each system, and so gives us an idea of ways to speed them up.
To run perf I first reconfigured the build to remove -pg, then ran a make clean followed by a make. After this I ran the program with an extra call to perf, which looked like:

perf record ./ffmpeg_g -i ../OneHourBenchmark.mp3 OneHourBenchmarkOut.ogg
which gave the following 2 images.
figure 1

figure 2

The first of these images is a screenshot of the functions where the program spends the most time, along with the percentage of time spent in each and the file each belongs to. The second image shows what happens when you run perf's annotate function: a breakdown of both the source code and the assembly generated from it, with each assembly instruction labelled with the percentage of time spent on it. This can be a bit misleading, however, especially with smaller sample sizes. What is actually happening is that perf interrupts the program every so often to sample which instruction is being run, so near misses can occur and some lines may be missed entirely despite clearly being run. My particular test took around 1 minute, which was enough data to give a reasonable assessment of the code. Comparing the results above with the results below, we can see that the program spends its time within this function doing different things depending on the hardware, with ARM above and x86 below. Because of this, different types of optimization will only work on certain machines, and we will have to look into what specifically we want to change in the next part.


figure 3

figure 4

figure 5

Wednesday, April 8, 2020

Project Part 0: The Before

In part 1 of the project blogs I discussed how I picked FFmpeg for the project but glossed over the process of deciding on that software. In general I don't use Linux a lot in my daily life. I have Windows on my laptop and desktop and prefer it for my usual use, relying on things like WinSCP and PuTTY to deal with Linux tasks for school. For this task, though, I obviously needed to find software that works on Linux.

Luckily there are many tools on Windows for running Linux applications, either in a virtual machine or with some sort of Linux compatibility layer, and these generally work, so that was not the hardest part. The hardest part, for me anyway, was picking the software. Since I didn't use Linux much I was not very aware of Linux software, let alone open source software (though plenty of it is, so that's helpful). I ended up looking at both the most popular repositories on GitHub and GNU open source software such as gzip, and after a while of digging, which also involved looking at previous years' projects, I decided on FFmpeg. This is because tasks such as changing the format of an audio or video file are CPU intensive, and because I found it easiest to get test data for such tasks.

Lab 5 Simple Loop Program

For this lab we were given a very simple task: loop through some numbers and print them out. In any high-level programming language this is one of the first things you learn to do and is super easy. Assembly is not so simple, and as such we had to work at it.

The first problem was getting the number displayed. The counting was easy (start at zero and loop until you reach a number), but getting that number displayed was a bit tricky. Luckily the digit characters are offset from their numeric values by a fixed amount, so as long as we could split the number into digits we could turn the digits into characters. This was far easier on x86 and AArch64 than on the 6502, as we had a division operation we could use to separate the digits and thus get the characters.

After properly getting the characters we had to figure out how to add them to the output. After some trial and error and a bit of googling, we discovered that we could simply give the loop's string extra blank characters and overwrite the memory at those addresses with our characters, giving us the proper output of a loop printing out numbers.
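
The same idea can be sketched in C rather than assembly. The function name patch_two_digits and the "Loop:" message in the usage note are made up for illustration and are not taken from the lab code.

```c
#include <stddef.h>

/* Overwrites two placeholder characters at buf[pos] and buf[pos + 1]
 * with the decimal digits of n (0..99). In ASCII the digit characters
 * sit 0x30 ('0') above their numeric values, so adding '0' turns a
 * digit into its character. Hypothetical helper, not the lab code. */
static void patch_two_digits(char *buf, size_t pos, unsigned n)
{
    buf[pos]     = (char)('0' + (n / 10) % 10);  /* tens digit: the quotient   */
    buf[pos + 1] = (char)('0' + n % 10);         /* ones digit: the remainder  */
}
```

Calling patch_two_digits(msg, 6, 27) on a buffer holding "Loop:" followed by three spaces and a newline turns the last two spaces into "27", leaving the rest of the string untouched.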

Wednesday, April 1, 2020

Project Part 1

For this project we are tasked with selecting a piece of open source software and trying to optimize some part of it. I chose FFmpeg because it is an open source file conversion tool, I know how to get a large data input for benchmarking, and it is CPU bound.

So the first thing I did was build the source code. This was notably more difficult on Windows, since it is not made to be built on Windows natively, but with some third-party downloads, most importantly MSYS2, I was able to get it working. On the Linux machine aarchie I had far fewer problems, as the tools required for the build were already installed and ready to go. So I built the software on both x86_64 and aarch64 and tested them out.

In testing, the first thing I did was run it with the default configuration. After the build was completed I ran it using my test data, a download of a video from YouTube just over an hour long which had already been converted to an 83.3MB mp3, and which I was now converting to an ogg file. Here were the results:

X86_64 Ryan desktop benchmark
Test1:
real    0m14.516s
user    0m0.015s
sys     0m0.000s

Test2:
real    0m14.087s
user    0m0.000s
sys     0m0.015s


Test3:
real    0m14.235s
user    0m0.000s
sys     0m0.000s


aarch64 aarchie benchmark

Test1:
real    1m50.783s
user    1m48.931s
sys     0m1.415s

Test2:
real    1m50.571s
user    1m49.255s
sys     0m0.897s

Test3:
real    1m50.683s
user    1m49.034s
sys     0m1.236s





This is with minimal background tasks running on my home Windows computer, and then on aarchie. As can be seen, my computer is a bit more powerful for this purpose, but the results of the benchmark seem consistent.

After this was done I cleaned the build and rebuilt using -O3 to see if tweaking the compiler optimization would speed up the program. Here are those results.


X86_64 Ryan desktop O3
Test1:
real    0m14.047s
user    0m0.000s
sys     0m0.015s

Test2:
real    0m14.014s
user    0m0.015s
sys     0m0.000s

Test3:
real    0m14.044s
user    0m0.000s
sys     0m0.000s


aarch64 aarchie O3

Test1:
real    1m50.490s
user    1m48.994s
sys     0m1.086s

Test2:
real    1m50.401s
user    1m48.777s
sys     0m1.196s

Test3:
real    1m50.386s
user    1m48.742s
sys     0m1.236s

As can be seen, the difference was negligible, though I am unsure whether this is due to improper use of the optimization settings with the makefile or due to the makefile already optimizing the output. Either way, if I am going to find a way to considerably speed up FFmpeg I am going to need to do more than change a compiler optimization level.

Sunday, March 15, 2020

Lab 4 - Pick Two pt.2

In Part 1 we looked at making an adder in 6502 assembly. This was one of the two tasks we chose for this lab; the other was a screen colour selector. I personally found the screen colour selector a lot easier to code than the adder, and it took far less code to implement. So without any delay, here is the whole source code before we dive into how it works.

; ROM routines
define        SCINIT        $ff81 ; initialize/clear screen
define        CHRIN        $ffcf ; input character from keyboard
define        CHROUT        $ffd2 ; output character to screen
define        SCREEN        $ffed ; get screen size
define        PLOT        $fff0 ; get/set cursor coordinates
  
        jsr SCINIT
        ldy #$00

initColours:
    lda colours,y
        beq doneInit
        jsr CHROUT
        iny
        bne initColours
doneInit:
    ldy #$00
    ldx #$00
    CLC
    jsr PLOT

    SEC
    jsr PLOT
    jsr flipSelect
   


checkIn:
    SEC
    jsr PLOT
    jsr CHRIN

    cmp #$80
    beq up
    cmp #$82
    bne checkIn
down:
    cpy #$0f
    beq checkIn
    jsr flipSelect
    iny
    jsr flipSelect
    jsr drawScreen
    jmp checkIn

up:
    cpy #$00
    beq checkIn
    jsr flipSelect
    dey
    jsr flipSelect
    jsr drawScreen
    jmp checkIn
   

flipSelect:
    ldx #$00
    CLC
    jsr PLOT
    SEC
    jsr PLOT
   
flipLoop:
    cmp #$20
    beq doneFlip
    eor #$80
    jsr CHROUT
    SEC
    jsr PLOT
    clc
    bcc flipLoop
   
doneFlip:
    rts

drawScreen:
    tya
    pha
    lda #$00     ; set pointer at $10 to $0200
        sta $10
        lda #$02
        sta $11
    pla
     
        ldx #$06     ; max value for $11
     
        ldy #$00     ; index

drawLoop:
    sta ($10),y  ; store colour
        iny          ; increment index
        bne drawLoop ; branch until page done
     
        inc $11      ; increment high byte of pointer
        cpx $11      ; compare with max value
        bne drawLoop ; continue if not done

    rts
   
    
   

colours:
dcb "B","L","A","C","K",10
dcb "W","H","I","T","E",10
dcb "R","E","D",10
dcb "C","Y","A","N",10
dcb "P","U","R","P","L","E",10
dcb "G","R","E","E","N",10
dcb "B","L","U","E",10
dcb "Y","E","L","L","O","W",10
dcb "O","R","A","N","G","E",10
dcb "B","R","O","W","N",10
dcb "L","I","G","H","T",95,"R","E","D",10
dcb "D","A","R","K",95,"G","R","E","Y",10
dcb "G","R","E","Y",10
dcb "L","I","G","H","T",95,"G","R","E","E","N",10
dcb "L","I","G","H","T",95,"B","L","U","E",10
dcb "L","I","G","H","T",95,"G","R","E","Y",00


This code, similar to the code in pt. 1, uses a main loop as the body of the program, though this one is a bit different and also isn't named main. That is getting ahead of ourselves, though. The first thing the program does is initialize the screen by going through the colour names stored in memory and printing them to the screen, which was fairly easy to do considering the ROM routine CHROUT handles newlines properly, allowing the whole list to be one block of memory.

The main loop for this program is called checkIn. This loop checks for an input and, when it receives one, updates the screen accordingly. The up and down inputs work about the same, so let's just look at up. When the up arrow is pressed, checkIn branches to the label up. This code checks whether moving up is valid (i.e. the selection is not already at the top of the screen) and, if so, removes the selection, selects the proper line, changes the screen colour, and then returns to the checkIn loop.

Of course up and down both call their own subroutines, which I will explain now. flipSelect is a pretty interesting subroutine: it simply flips the high bit of every character in whatever line is in y, which will be the currently selected line. The first time it is called it flips the bit off to deselect the line, and the second time it flips the bit on, selecting the new line. The drawScreen subroutine is one we have used a lot in the course so far. It simply takes what is currently in y, which as established is the currently selected element, and fills the screen with that colour. We made sure to order the colour names on the screen so that their y values line up with their colour values, which results in the correct colour being displayed.
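
For readers who find the assembly hard to follow, here is a simplified C model of the two subroutines. The function names mirror the labels but are my own, the row is treated as a plain byte array instead of going through PLOT and CHROUT, and the 1024-byte screen size comes from the $0200 to $05ff range used in drawScreen.

```c
#include <stddef.h>
#include <stdint.h>

#define SCREEN_BYTES 1024  /* $0200..$05ff, the bitmapped display */

/* Model of flipSelect: toggle the high bit of each character in a row,
 * stopping at the first space ($20), as the assembly loop does. */
static void flip_select(uint8_t *row, size_t width)
{
    for (size_t x = 0; x < width && row[x] != 0x20; x++)
        row[x] ^= 0x80;
}

/* Model of drawScreen: fill every pixel with the selected colour index. */
static void draw_screen(uint8_t *screen, uint8_t colour)
{
    for (size_t p = 0; p < SCREEN_BYTES; p++)
        screen[p] = colour;
}
```

Because the flip is an XOR, calling flip_select twice on the same row restores it, which is why one routine can both select and deselect a line.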

Putting these few simple subroutines together with the ROM routines, we have a very compact and easy to understand bit of code which lets you select a colour and display it on the screen.

Lab 4 - Pick Two pt.1

For this lab we were tasked with creating two programs from a list in 6502 assembly code. These tasks ranged in difficulty, though some things were made far easier by the introduction of ROM routines, which I will explain a bit later. The two tasks our group decided on were the calculator and the colour selector. We chose these two for a couple of reasons, the largest being that we all felt most confident in our own ability to get these two tasks done over any others. So with our tasks in hand we set out to get an understanding of ROM routines.

ROM routines are basically snippets of code saved in the memory of the chip. To access them you simply jump to a subroutine at the right address, and it will run as if it were code which you wrote. The routines given to us do in a single call things which previously required many lines of code, with smart use of the various registers to supply input to these routines.

Before getting to the code for the adder, it is important to note that we were never able to get the blinking cursor to work properly, though the rest of the program works. Here is the full source code, which will be explained below.

; ROM routines
define        SCINIT        $ff81 ; initialize/clear screen
define        CHRIN        $ffcf ; input character from keyboard
define        CHROUT        $ffd2 ; output character to screen
define        SCREEN        $ffed ; get screen size
define        PLOT        $fff0 ; get/set cursor coordinates

define        NUMBERA        $10;
define        NUMBERB        $20;

        jsr SCINIT
   
mainLoop:
    ldy #$00
    jsr char1
    jsr input
    jsr storeA
    ldy #$00
    jsr char2
    jsr input
    jsr storeB
    ldy #$00
    jsr charR
    jsr printAdd
    jmp mainLoop
   

input: 
    SEC
    jsr PLOT
    ldx #$15
    CLC
    jsr PLOT

   

inLoop:
    SEC
    jsr PLOT
    jsr CHRIN


charCheck:   
    cmp #$00
    beq inLoop

    cmp #$81
    beq right
   
    cmp #$83
    beq left

    cmp #$0d
    beq next

drawNum:
    cmp #$30
    bcc inLoop
   
    clc
    cmp #$3a
    bcs inLoop
   
    jsr CHROUT

    SEC
    jsr PLOT
    cpx #$17
    bne inLoop
    dex
    CLC
    jsr PLOT
    jmp inLoop


left:    cpx #$15
    beq inLoop
    jsr CHROUT
    jmp inLoop

right:    cpx #$16
    beq inLoop
    jsr CHROUT
    jmp inLoop

next:
    SEC
    jsr PLOT
    ldx #$15
    CLC
    jsr PLOT
    SEC
    jsr PLOT


    CLC
    SBC #$2F

    ASL
    ASL
    ASL
    ASL

    PHA
   

    ldx #$16
    CLC
    jsr PLOT
    SEC
    jsr PLOT

    CLC
    SBC #$2F
    PHA

    ldx #$00
    iny
    CLC
    jsr PLOT
    SEC
    jsr PLOT

    PLA
    TAX
    PLA


    rts

storeA:
    sta NUMBERA
    txa
    eor NUMBERA
    sta NUMBERA
    rts
   

storeB:
    sta NUMBERB
    txa
    eor NUMBERB
    sta NUMBERB
    rts

printAdd:
    SEC
    jsr PLOT
    ldx #$15
    CLC
    jsr PLOT
    SEC
    jsr PLOT
   
    SED
    lda NUMBERA
    adc NUMBERB
    CLD
    pha

    bcc outputAddition
    ldx #$14
    CLC
    jsr PLOT
    SEC
    jsr PLOT
    lda #$31
    jsr CHROUT
   
outputAddition:
    pla
    pha
    LSR
    LSR
    LSR
    LSR
    clc
    adc #$30
    jsr CHROUT

    pla
    and #$0F
    clc
    adc #$30
    jsr CHROUT

    SEC
    jsr PLOT
    ldx #$00
    iny
    CLC
    jsr PLOT
   
    rts

   

   
   

char1:  lda firstDigit,y
        beq charRet
        jsr CHROUT
        iny
        bne char1

char2:  lda secondDigit,y
        beq charRet
        jsr CHROUT
        iny
        bne char2

charR:  lda result,y
        beq charRet
        jsr CHROUT
        iny
        bne charR

charRet:
    rts



firstDigit:
dcb "E","N","T","E","R",32,"F","I","R","S","T",32,"D","I","G","I","T",":",32,32,32,"0","0"
dcb 00


secondDigit:
dcb "E","N","T","E","R",32,"S","E","C","O","N","D",32,"D","I","G","I","T",":",32,32,"0","0"
dcb 00

result:
dcb "R","E","S","U","L","T",":"
dcb 00


The adder is actually quite a simple program, especially thanks to ROM routines. The code went through a few revisions before settling on the final result, due to poor use of subroutines in previous incarnations as well as unfamiliarity with the ROM routines. Speaking of the ROM routines, let's discuss what they are accomplishing for us here.

There are three ROM routines that are important to making this code as compact as it is. The first is CHROUT, which writes a character to the screen and then moves the cursor over one: very simple but extremely useful. It does this by taking the value in the accumulator and putting it at the current cursor location. The next routine is the complement to CHROUT, CHRIN. As the name implies, CHRIN takes a character input; this input is stored in the accumulator, which means it can be used in conjunction with CHROUT to print input to the screen. The last important ROM routine is PLOT. This routine has two functions depending on the state of the carry flag: either it gets the current cursor position and returns the value of the character there, or it sets the cursor position based on what is currently in x and y.

Using these tools, the code goes through a main loop which uses a few subroutines in order to keep it as readable as possible. This main loop is only 12 lines but does all the work of the program via the subroutines it calls. Let's take a closer look at it.

mainLoop:
    ldy #$00
    jsr char1
    jsr input
    jsr storeA
    ldy #$00
    jsr char2
    jsr input
    jsr storeB
    ldy #$00
    jsr charR
    jsr printAdd
    jmp mainLoop

The first step is to set y to a known value. This allows char1 to work properly, as it prints the prompt for the first input onto the screen starting from whatever index is in y. Next we get the first user input. This is done through quite complex code which I will explain further down. Following this it stores the first number, then does it all again for the second number. After this it prints the result text, followed by doing the actual addition and printing those results. The whole thing then loops, allowing the program to keep taking inputs.

Now let's break down a couple of the more important components there, namely input, storeA/storeB and printAdd. input is how we get our user input. As we are simply making a calculator, only certain characters are allowed; we enforce this by checking the values read by CHRIN, and if it gets anything else we simply go back and read again. This ensures we are getting either a digit, an arrow key, or the enter key. When a character is input, it also checks whether the input field is on the second digit, to ensure you can only write two digits. Next are storeA and storeB, which take the numbers provided to them and store them at different addresses, fairly simple. Lastly there is printAdd, which does the actual adding. This subroutine switches to decimal mode before adding together the two values stored by the previous subroutines. It then checks whether there was a carry and, if there was, draws a 1 before the number, before drawing the output of the addition.
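
Decimal mode is easier to see in a higher-level form. The following is a minimal C model, with function names of my own choosing, of how two typed digits are packed into one byte and how a decimal-mode add treats each nibble as a digit. It assumes valid digits and a clear incoming carry, so it is an illustration rather than an exact emulation of the 6502.

```c
#include <stdint.h>

/* Packs two ASCII digit characters into one BCD byte, e.g. '4','2' -> 0x42.
 * This mirrors the subtract-$30, shift-left-four and combine steps. */
static uint8_t pack_bcd(char tens, char ones)
{
    return (uint8_t)(((tens - '0') << 4) | (ones - '0'));
}

/* Adds two packed BCD bytes digit by digit. *carry is set to 1 when the
 * sum exceeds 99, which is when the program draws the leading 1. */
static uint8_t bcd_add(uint8_t a, uint8_t b, int *carry)
{
    unsigned lo = (a & 0x0Fu) + (b & 0x0Fu);
    unsigned hi = (a >> 4) + (b >> 4);
    if (lo > 9) { lo -= 10; hi += 1; }   /* ones digit overflowed into tens */
    *carry = hi > 9;
    if (hi > 9) hi -= 10;                /* tens digit overflowed into carry */
    return (uint8_t)((hi << 4) | lo);
}
```

For example, 75 + 50 gives the byte 0x25 with the carry set, which the program prints as a 1 followed by 25.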

Put all of this together and we have a working, though not perfect, adder in 6502 assembly.

Friday, January 31, 2020

Lab 3 - Pong pt.2

After the initial dive into the Pong problem, and with a new understanding from the experiments done, a tough choice had to be made: I started over. My first goal with this new start was just to get a bouncing ball.

The code for the bouncing ball iteration no longer exists, but it moved the ball in a two-step process: the first step decided whether the ball should move, and the second decided which way it should move. This was accomplished by first splitting the movement into X and Y and then either incrementing or decrementing depending on which surface the ball last bounced off. The first version of this was quite basic: it did not clear the old ball position and only worked at a 45 degree angle, resulting in a diamond being drawn on the screen, but it was a start.
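
Since that first version is gone, here is a C sketch of the two-step idea modelled on the variables of the later listing in this post (ROW/COL, DELTAX/DELTAY, VELX/VELY, BOUNCEX/BOUNCEY). The struct and function names are my own, and the original code may well have differed.

```c
#include <stdint.h>

/* One axis of the ball. vel is added to the 8-bit accumulator delta on
 * every frame; the ball only steps when that addition overflows (the
 * carry flag on the 6502). bounced picks the direction of the step. */
struct axis {
    uint8_t pos;      /* ROW or COL */
    uint8_t delta;    /* DELTAX or DELTAY */
    uint8_t vel;      /* VELX or VELY */
    uint8_t bounced;  /* BOUNCEX or BOUNCEY: non-zero means step backwards */
};

static void step_axis(struct axis *a)
{
    unsigned sum = (unsigned)a->delta + a->vel;
    a->delta = (uint8_t)sum;
    if (sum > 0xFF) {          /* step 1: should the ball move this frame? */
        if (a->bounced)        /* step 2: which way should it move?        */
            a->pos--;
        else
            a->pos++;
    }
}
```

With a velocity of $80 the ball moves every second frame, and giving the two axes different velocities is what allows angles other than 45 degrees.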

The screen not being cleared was beginning to bother me, so I implemented a very simple solution. First I copied the screen clear code from the etch-a-sketch program and had it run on every game loop.

 clear:    lda table_low    ; clear the screen
     sta POINTER
     lda table_high
     sta POINTER_H

     ldy #$00
     tya

 c_loop:    sta (POINTER),y
     iny
     bne c_loop

     inc POINTER_H
     ldx POINTER_H
     cpx #$06
     bne c_loop 

 
Second, I added a delay so that the ball was drawn for longer than it was not, making it less likely to blink out of existence. This had a few issues, but I finally had a single ball bouncing around the screen.

Finally the game was coming together and only needed a few more elements to make it functional. The first and most important was the paddle. For this I added a new collision procedure which checks whether the ball is on a pixel that counts as being on the paddle and, if so, makes it bounce. I also made it into more of a game by changing the bottom collision to game over and by randomizing the x and y velocity whenever the paddle is hit. Here is the code for that iteration.

; zero-page variable locations
define ROW        $20    ; current row
define COL        $21    ; current column
define DELTAX        $30    ; current Delta X
define DELTAY        $31    ; current Delta Y
define BOUNCEX        $35    ; checks if X has bounced
define BOUNCEY        $36    ; checks if Y has bounced
define VELX        $38
define VELY        $39   
define    POINTER        $10    ; ptr start of row
define    POINTER_H    $11
define PADDLEL        $40
define PADDLER        $41   

; constants
define    DOT        $01    ; dot colour
define    PADDLE        $07    ; paddle colour (yellow)


    ldy #$00    ; put help text on screen
print:    lda help,y
    beq setup
    sta $f000,y
    iny
    bne print

setup:    lda #$0f    ; set initial ROW,COL
     sta ROW
    lda #$00
     sta COL
    lda #$20
    sta VELX
    lda #$20
    sta VELY
    lda #$0C
    sta PADDLEL
    lda #$14
    sta PADDLER
   


draw:    lda ROW        ; ensure ROW is in range 0-31
     and #$1f
     sta ROW

     lda COL        ; ensure COL is in range 0-31
     and #$1f
     sta COL

     ldy ROW        ; load POINTER with start-of-row
     lda table_low,y
     sta POINTER
     lda table_high,y
     sta POINTER_H

     ldy COL        ; store CURSOR at POINTER plus COL
     lda #DOT
     sta (POINTER),y

   
drawPaddle:
     ldy #$1f    ; load POINTER with start-of-row
     lda table_low,y
     sta POINTER
     lda table_high,y
     sta POINTER_H

     ldy PADDLEL    ; store CURSOR at POINTER plus COL
     lda #PADDLE
   
paddleLoop:
     sta (POINTER),y
    iny
    cpy PADDLER
    bne paddleLoop


colidR:    lda COL
    cmp #$1F
    bne colidL
    sta BOUNCEY

colidL:    lda COL
    cmp #$00
    bne colidD
    sta BOUNCEY
   
colidD:    lda ROW
    cmp #$1F
    bne colidU
    CLC
    jmp gameover

colidU:    lda ROW
    cmp #$00
    bne colidP
    sta BOUNCEX

colidP:    CLC
    lda ROW
    cmp #$1E
    bne ballX
   
    lda COL
    cmp PADDLEL

    bcc ballX
   
    cmp PADDLER
    bcs ballX

    sta BOUNCEX
   
    lda $fe        ;randomize vel when hitting paddle
    cmp #$80    ;ensure vel isn't too high
    bcc velx
    adc #$81
velx:
    sta VELX
    lda $fe
    sta VELY
   
   


ballX:    lda VELX
    adc DELTAX
    sta DELTAX
    bcc ballY
    CLC

    lda BOUNCEX
    cmp #$00
    bne decROW
incROW:    inc ROW
    CLC
    bcc ballY

decROW:    dec ROW

ballY:   
    lda VELY
    adc DELTAY
    sta DELTAY
   
    bcc getkey
    CLC

    lda BOUNCEY
    cmp #$00
    bne decCOL

incCOL:    inc COL
    CLC
    bcc getkey

decCOL:    dec COL

   
getkey:    lda $ff        ; get a keystroke

     ldx #$00    ; clear out the key buffer
     stx $ff

     cmp #$83    ; check key == LEFT
     bne checkR

    ldy PADDLEL
    cpy #$00
    beq checkR

     dec PADDLEL
    dec PADDLER
     jmp delaya

checkR:    cmp #$81    ; check key == RIGHT
     bne delaya

    ldy PADDLER
    cpy #$20
    beq delaya

     inc PADDLEL
    inc PADDLER
   


delaya:    ldy #$00     ; Delay processor so that ball doesn't flash at top of screen
    ldx #$00
delay:    iny
    cpy #$FF
    bne delay

    ldy #$00
    inx
    cpx #$06
    bne delay

 clear:    lda table_low    ; clear the screen
     sta POINTER
     lda table_high
     sta POINTER_H

     ldy #$00
     tya

 c_loop:    sta (POINTER),y
     iny
     bne c_loop

     inc POINTER_H
     ldx POINTER_H
     cpx #$06
     bne c_loop
   


done:    clc        ; repeat
     jmp draw

gameover:
    brk
; these two tables contain the high and low bytes
; of the addresses of the start of each row

table_high:
dcb $02,$02,$02,$02,$02,$02,$02,$02
dcb $03,$03,$03,$03,$03,$03,$03,$03
dcb $04,$04,$04,$04,$04,$04,$04,$04
dcb $05,$05,$05,$05,$05,$05,$05,$05

table_low:
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0

; help message on character screen

 help:
 dcb "A","r","r","o","w",32,"k","e","y","s"
 dcb 32,"d","r","a","w",32,"/",32,"'","C","'"
 dcb 32,"k","e","y",32,"c","l","e","a","r","s"
 dcb 00


As you can probably tell, there are still some leftover fragments from the etch-a-sketch code which need to be removed, but all in all this code makes a working game of Pong on the 6502. It draws and moves a ball, receives keyboard input and moves the paddle. This is not where I decided to stop with this program, though.

; zero-page variable locations
define ROW        $20    ; current row
define COL        $21    ; current column
define DELTAX        $30    ; current Delta X
define DELTAY        $31    ; current Delta Y
define BOUNCEX        $35    ; checks if X has bounced
define BOUNCEY        $36    ; checks if Y has bounced
define VELX        $38
define VELY        $39   
define    POINTER        $10    ; ptr start of row
define    POINTER_H    $11
define PADDLEL        $40
define PADDLER         $41   
define SCORE        $24
define HIT        $23

; constants
define    DOT        $01    ; dot colour
define    PADDLE        $07    ; paddle colour (yellow)


    ldy #$00    ; put help text on screen
print:    lda help,y
    beq setup
    sta $f000,y
    iny
    bne print

setup:    lda #$0f    ; set initial ROW,COL
     sta ROW
    lda #$00
     sta COL
    lda #$20
    sta VELX
    lda #$20
    sta VELY
    lda #$0B
    sta PADDLEL
    lda #$15
    sta PADDLER
    lda #$00
    sta SCORE
   


draw:    lda ROW        ; ensure ROW is in range 0:31
     and #$1f
     sta ROW

     lda COL        ; ensure COL is in range 0:31
     and #$1f
     sta COL

     ldy ROW        ; load POINTER with start-of-row
     lda table_low,y
     sta POINTER
     lda table_high,y
     sta POINTER_H

     ldy COL        ; store CURSOR at POINTER plus COL
     lda #DOT
     sta (POINTER),y

   
drawPaddle:
     ldy #$1f    ; load POINTER with start-of-row
     lda table_low,y
     sta POINTER
     lda table_high,y
     sta POINTER_H

     ldy PADDLEL    ; store CURSOR at POINTER plus COL
     lda #PADDLE
   
paddleLoop:
     sta (POINTER),y
    iny
    cpy PADDLER
    bne paddleLoop


colidR:    lda COL
    cmp #$1F
    bne colidL
    sta BOUNCEY

colidL:    lda COL
    cmp #$00
    bne colidD
    sta BOUNCEY
   
colidD:    lda ROW
    cmp #$1F
    bne colidU
    CLC
    jmp gameover

colidU:    lda ROW
    cmp #$00
    bne colidP
    sta BOUNCEX

colidP:    CLC
    lda ROW
    cmp #$1E
    bne incScore
   
    lda COL
    cmp PADDLEL

    bcc incScore
   
    cmp PADDLER
    bcs incScore

    sta BOUNCEX
    inc HIT
   
    lda $fe        ;randomize vel when hitting paddle
    cmp #$80    ;ensure vel isn't too high
    bcc velx
    adc #$81
   

velx:
    sta VELX
    lda $fe
    sta VELY

incScore:
    CLC
    lda ROW
    cmp #$1D
    bne delaya
    lda HIT
    cmp #$00
    beq delaya
    lda #$00
    sta HIT
    SED
    CLC
    lda SCORE
    adc #$01
    sta SCORE
    CLD     

delaya:    ldy #$00     ; Delay processor to slow down game
    ldx #$00
delay:    iny
    cpy #$FF
    bne delay

    ldy #$00
    inx
    cpx #$08
    bne delay


ballX:    lda VELX
    adc DELTAX
    sta DELTAX
    bcc ballY

    ldy ROW        ; load POINTER with start-of-row
     lda table_low,y
     sta POINTER
     lda table_high,y
     sta POINTER_H

     ldy COL        ; store CURSOR at POINTER plus COL
     lda #00
     sta (POINTER),y

    CLC

    lda BOUNCEX
    cmp #$00
    bne decROW


incROW:    inc ROW
    CLC
    bcc ballY

decROW:    dec ROW

ballY:   
    lda VELY
    adc DELTAY
    sta DELTAY
   
    bcc getkey

    ldy ROW        ; load POINTER with start-of-row
     lda table_low,y
     sta POINTER
     lda table_high,y
     sta POINTER_H

     ldy COL        ; store CURSOR at POINTER plus COL
     lda #00
     sta (POINTER),y

    CLC

    lda BOUNCEY
    cmp #$00
    bne decCOL

incCOL:    inc COL
    CLC
    bcc getkey

decCOL:    dec COL

   
getkey:    lda $ff        ; get a keystroke

     ldx #$00    ; clear out the key buffer
     stx $ff

     cmp #$83    ; check key == LEFT
     bne checkR

    ldy PADDLEL
    cpy #$00
    beq checkR

    ldy #$1f    ; load POINTER with start-of-row
     lda table_low,y
     sta POINTER
     lda table_high,y
     sta POINTER_H

     ldy PADDLER    ; store CURSOR at POINTER plus COL
    dey
     lda #00
     sta (POINTER),y

     dec PADDLEL
    dec PADDLER
     jmp done

checkR:    cmp #$81    ; check key == RIGHT
     bne done

    ldy PADDLER
    cpy #$20
    beq done

    ldy #$1f    ; load POINTER with start-of-row
     lda table_low,y
     sta POINTER
     lda table_high,y
     sta POINTER_H

     ldy PADDLEL    ; store CURSOR at POINTER plus COL
     lda #00
     sta (POINTER),y

     inc PADDLEL
    inc PADDLER




   


done:   
    ldy #$0
scorePrint:
    lda score,y
    beq scoreNum
    sta $f0F0,y
    iny
    bne scorePrint

scoreNum:
    lda SCORE
    and #$F0
    LSR
    LSR
    LSR
    LSR
    TAY
    lda number,y
    sta $f0f8

    lda SCORE
    and #$0F
    TAY
    lda number,y
    sta $f0f9
   
    lda SCORE

    clc        ; repeat
     jmp draw

gameover:

    brk

 clear:    lda table_low    ; clear the screen
     sta POINTER
     lda table_high
     sta POINTER_H

     ldy #$00
     tya

 c_loop:    sta (POINTER),y
     iny
     bne c_loop

     inc POINTER_H
     ldx POINTER_H
     cpx #$06
     bne c_loop

    jmp setup


; these two tables contain the high and low bytes
; of the addresses of the start of each row

table_high:
dcb $02,$02,$02,$02,$02,$02,$02,$02
dcb $03,$03,$03,$03,$03,$03,$03,$03
dcb $04,$04,$04,$04,$04,$04,$04,$04
dcb $05,$05,$05,$05,$05,$05,$05,$05

table_low:
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0

; help message on character screen

 help:
 dcb "A","r","r","o","w",32,"k","e","y","s"
 dcb 32,"C","o","n","t","r","o","l",32,"p","a","d"
 dcb "d","l","e"
 dcb 00

score:
dcb "S","C","O","R","E",":",32
dcb 00

number:
dcb "0","1","2","3","4","5","6","7"
dcb "8","9","A","B","C","D","E","F"
dcb 00

 


This is where I decided to stop coding for this task, after quite a few tweaks and improvements. The first big change is the removal of the flicker. I did this by erasing the ball only when it moves and erasing only one pixel of the paddle when it moves, rather than clearing the whole screen every frame. Next I added a score feature, which was an interesting task to tackle. First, I had to find a way to increment the score. This was accomplished by checking whether the paddle had been hit and then whether the ball had moved off the paddle; when both are true the score is incremented. Second, I chose to make the score decimal instead of hex, since that is the number system most people are used to. Luckily, through some research I found out about decimal mode on the 6502, which allows a byte to hold two decimal digits using 4 bits per digit, thus allowing the score to be properly relayed to the player. Altogether this made the Pong app both easier on the eyes and more enjoyable, since progress is tracked.
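To make the decimal-mode step concrete, here is a small Python model of what the SED / CLC / ADC #$01 sequence and the scoreNum lookup do to the SCORE byte. It is only a sketch: the function names bcd_increment and score_digits are made up for illustration, and it assumes SCORE always holds a valid packed decimal value.

```python
def bcd_increment(score):
    """Add 1 to a packed-BCD byte, as ADC #$01 does with the decimal flag set."""
    low = (score & 0x0F) + 1      # ones digit lives in the low nibble
    high = (score >> 4) & 0x0F    # tens digit lives in the high nibble
    if low > 9:                   # decimal carry from ones into tens
        low = 0
        high += 1
    if high > 9:                  # 99 + 1 wraps back to 00
        high = 0
    return (high << 4) | low


def score_digits(score):
    """Split the byte into two characters, like the AND/LSR/TAY lookups in scoreNum."""
    number = "0123456789ABCDEF"   # same contents as the number table
    return number[(score & 0xF0) >> 4] + number[score & 0x0F]


score = 0x09
score = bcd_increment(score)
print(score_digits(score))
```

Starting from $09, one increment gives $10, which is displayed as "10". In plain binary the same addition would give $0A and the display would read "0A", which is why decimal mode is the simpler route here.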

The process of building this app has furthered my understanding of assembly programming quite a bit. I feel that, in general, many of the concepts just have to be played with in order to get a grasp of them. Some of the odd quirks, and how the computer actually works with the bits, are difficult to learn without experiencing them, and I feel this task accomplished that.

Thursday, January 30, 2020

Lab 3 - Pong pt.1

For this lab we have begun to build on our knowledge of 6502 assembly in order to make a more robust program. We were given five tasks to choose from and, with very little extra help, had to figure out how to achieve an effective result. The five options were: to create a bouncing graphic (think of the DVD logo); to create a numeric display showing two digits; to create the game Pong; to create a kaleidoscope where one quadrant is mirrored in the other three; and lastly, and most challenging, to draw a line between two points that can be moved around in real time.

Our group chose to work on Pong since it seemed like an enjoyable app to create and at least a couple of us had a bit of a grasp on how we wanted to tackle the problem. We started off by looking at some example code that was provided for a fairly unrelated program, but it gave us some good ideas for how to create our code (Link to Example Code). This code specifically helped us with three things:
  1. How to turn a screen made of pages into coordinates
  2. How to use those coordinates to draw on the screen
  3. How to take keyboard input
Using this code we began to experiment and attempt to get a ball moving across the screen. This resulted in the following code.

 ; zero-page variable locations
 define DOTROW        $20    ; current row
 define    DOTCOL        $21    ; current column
 define DOTDELTAX    $30    ; current Delta X
 define DOTDELTAY    $31    ; current Delta Y
 define    POINTER        $10    ; ptr: start of row
 define    POINTER_H    $11

 ; constants
 define    DOT        $01    ; dot colour
 define    CURSOR        $04    ; cursor colour (purple)


     ldy #$00    ; put help text on screen
 print:    lda help,y
     beq setup
     sta $f000,y
     iny
     bne print

 setup:    lda #$0f    ; set initial ROW,COL
     sta DOTROW
    lda #$02
     sta DOTCOL
    lda #$20    ;set angle to 45
    sta DOTDELTAX
    sta DOTDELTAY
   


 game:    lda DOTROW        ; ensure ROW is in range 0:31
     and #$1f
     sta DOTROW

     lda DOTCOL        ; ensure COL is in range 0:31
     and #$1f
     sta DOTCOL

     ldy DOTROW        ; load POINTER with start-of-row
     lda table_low,y
     sta POINTER
     lda table_high,y
     sta POINTER_H


    ldy DOTCOL    ; index by column so the dot lands at (DOTROW, DOTCOL)

     lda #DOT    ; set current position to DOT
     sta (POINTER),y

DotMovA:lda DOTCOL
    inc DOTCOL
    lda DOTCOL

DotMovB:lda DOTROW
    inc DOTROW
    lda DOTROW



 done:    clc        ; repeat
     bcc game


 ; these two tables contain the high and low bytes
 ; of the addresses of the start of each row

 table_high:
 dcb $02,$02,$02,$02,$02,$02,$02,$02
 dcb $03,$03,$03,$03,$03,$03,$03,$03
 dcb $04,$04,$04,$04,$04,$04,$04,$04
 dcb $05,$05,$05,$05,$05,$05,$05,$05

 table_low:
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0

 ; help message on character screen

 help:
 dcb "A","r","r","o","w",32,"k","e","y","s"
 dcb 32,"d","r","a","w",32,"/",32,"'","C","'"
 dcb 32,"k","e","y",32,"c","l","e","a","r","s"
 dcb 00  


The code above is largely unaltered from the etch-a-sketch code. All it does is use that code to draw a line across the screen continuously, but it allowed us to learn quite a lot about how to get a ball moving across the screen: it is practically the same thing, just without the previous position being erased, so a line is drawn. From that base it is quite simple to begin working out how a ball will move properly, as in Pong, which will be covered in the next blog post.
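The way the two tables turn a row and column into a screen address can be sketched in Python. This is only an illustrative model: the name pixel_address is made up, and it assumes the 32x32 display starts at $0200 as the tables imply.

```python
# Same contents as table_high / table_low in the listing above.
TABLE_HIGH = [0x02] * 8 + [0x03] * 8 + [0x04] * 8 + [0x05] * 8
TABLE_LOW = [0x00, 0x20, 0x40, 0x60, 0x80, 0xA0, 0xC0, 0xE0] * 4


def pixel_address(row, col):
    """Address written by `sta (POINTER),y` when POINTER holds the row start and Y holds col."""
    row &= 0x1F                                        # and #$1f keeps the row in 0:31
    col &= 0x1F                                        # and #$1f keeps the column in 0:31
    pointer = (TABLE_HIGH[row] << 8) | TABLE_LOW[row]  # POINTER_H and POINTER combined
    return pointer + col


print(hex(pixel_address(15, 2)))
```

Each row is 32 bytes long and each page holds 8 rows, so the lookup is equivalent to $0200 + 32 * row + col. The tables simply save the 6502 from doing that multiplication at run time.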

Sunday, January 26, 2020

Lab 2 - 6502 Experiments

The next few entries on this blog are going to be detailing our look into learning assembly code on the 6502 processor.

We were provided the following code:


 lda #$00 ; set a pointer at $40 to point to $0200
 sta $40
 lda #$02
 sta $41

 lda #$07 ; colour

 ldy #$00 ; set index to 0

loop: sta ($40),y ; set pixel

 iny  ; increment index
 bne loop ; continue until done the page

 inc $41  ; increment the page
 ldx $41  ; get the page
 cpx #$06 ; compare with 6
 bne loop ; continue until done all pages
 
 
This code fills the screen with the colour yellow by looping through each address of a
page and setting it to yellow, and then incrementing the page until the screen is filled.

When we insert the command tya into the code at the start of the loop, the screen fills with
stripes of colour. This is because that command transfers the value of Y into A on every pass,
so the colour follows the index. The colour repeats every 16 values since there are only 16
colours, and the screen is 32 pixels wide, so the pattern lines up perfectly from row to row.
 
Adding the lsr command after tya shifts the bits of the colour value to the right and as such
removes the least significant bit. This is effectively a division by 2, so the stripes
appear twice as thick. Adding more lsr commands results in further division and further thickening.
 
Using asl instead multiplies the value by two. This reduces the number of unique values the
colours can take, but the stripes remain 1 pixel thick.
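The three experiments above can be checked with a short Python model of the top row of the screen. This is a sketch, not the simulator itself: the helper name first_row is made up, and it assumes the display only looks at the low four bits of the byte stored in each pixel.

```python
def first_row(transform):
    """Colours of the top 32 pixels when A is derived from Y before each store."""
    return [transform(y) & 0x0F for y in range(32)]


tya = first_row(lambda y: y)                 # tya
lsr = first_row(lambda y: y >> 1)            # tya, then lsr
asl = first_row(lambda y: (y << 1) & 0xFF)   # tya, then asl

print(tya[:18])
print(lsr[:6])
print(asl[:10])
```

With tya alone the row runs through all 16 colours twice. With lsr each colour covers two neighbouring pixels. With asl only the 8 even colour values ever appear, but every stripe is still one pixel wide.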

Next we will see what happens when we add more iny commands. This results in an interesting
change in which the Y value skips ahead by 5 each loop. Y steps over zero (the escape value) when
it overflows and keeps wrapping around until it finally lands on zero, by which point the page has
been filled in an interesting grainy order.
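The wrap-around behaviour is easy to see in a small Python model of the index register. It is a sketch under the assumption that Y advances by a fixed step on every pass of the loop (5 here); the name visit_order is made up for illustration.

```python
def visit_order(step):
    """Offsets written on one page when Y advances by `step` per pass (bne exits at Y == 0)."""
    order = []
    y = 0
    while True:
        order.append(y)           # sta ($40),y
        y = (y + step) & 0xFF     # Y is 8 bits wide, so it wraps past 255
        if y == 0:                # bne loop falls through only when Y is zero
            break
    return order


order = visit_order(5)
print(len(order), len(set(order)))
print(order[50:54])
```

With a step of 5 every one of the 256 offsets is eventually written, just out of order, because 5 and 256 share no common factor. An even step behaves differently: a step of 4 would return to zero after only 64 writes and leave the rest of the page untouched.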

The final experiment in this lab was to see if we could draw 4 lines
along the edges of the screen:

 lda #$00 ; set a pointer at $40 to point to $0200
 sta $40
 lda #$02
 sta $41

 lda #$05 ; colour

 ldy #$00 ; set index to 0

 
loopa: 
 sta ($40),y ; set pixel

 iny  ; increment index

 cpy #$20 ; compare with 32
 bne loopa ; continue until the top row is done

 ldy #$00

loops: 
 CLC
 lda #$07 
 sta ($40),y ; set pixel
 TYA
 adc #$1f
 TAY

 lda #$04
 sta ($40),y ; set pixel
 iny

 cpy #$00
 bne loops

 inc $41  ; increment the page
 ldx $41  ; get the page
 cpx #$06 ; compare with 6
 bne loops ; continue until done all pages
 

 ldy #$E0
 lda #$0e
 dec $41
 
loopb: 
 sta ($40),y ; set pixel

 iny  ; increment index

 cpy #$00 ; has the index wrapped to 0 (end of page)?
 bne loopb ; continue until the bottom row is done
 
This code draws 4 lines along the 4 edges of the screen. It does so with 3 loops.
 
The first loop runs across the addresses at the top of the screen, inserting the colour into
those addresses.
 
The second loop, which is also the most involved, inserts a pixel at the first address of a
line and then adds 31 to the index, which brings it to the last pixel of the line. It draws a
different colour there and then adds one more to start at the beginning of the next line. Once
it has gotten to the end of a page it increments to the next page, resulting in two vertical lines.
 
Lastly, after stepping back to the last page, we can draw the final line at the bottom by
starting at the first pixel of the last line and looping through to the end of the line.
 
These three loops result in the four lines being successfully drawn.
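As a cross-check of the description above, the three loops can be replayed in Python on a model of the screen. This is only a sketch: draw_border is a made-up name, and the screen is assumed to be four 256-byte pages starting at $0200, with 32 pixels per row.

```python
def draw_border():
    """Replay the three loops on a 32x32 screen stored as four 256-byte pages."""
    pages = [[0] * 256 for _ in range(4)]   # pages $02, $03, $04, $05

    for y in range(0x20):                   # loopa: top row
        pages[0][y] = 0x05

    for page in pages:                      # loops: left and right edges
        y = 0
        while True:
            page[y] = 0x07                  # first pixel of the line
            y = (y + 0x1F) & 0xFF           # clc / adc #$1f moves to the last pixel
            page[y] = 0x04                  # last pixel of the line
            y = (y + 1) & 0xFF              # iny moves to the start of the next line
            if y == 0:                      # cpy #$00 / bne loops
                break

    for y in range(0xE0, 0x100):            # loopb: bottom row of the last page
        pages[3][y] = 0x0E

    flat = [value for page in pages for value in page]
    return [flat[r * 32:(r + 1) * 32] for r in range(32)]


rows = draw_border()
print(rows[0][:3], rows[16][:3], rows[31][:3])
```

One detail the model makes visible is the corners: the second loop overwrites the two ends of the top row with the edge colours, and the third loop overwrites the two ends of the bottom row with its own colour.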

Friday, January 17, 2020

Lab 1 - Open Source Research

In my search for open source software I decided to look into two programs which I have used in the past and continue to use to this day: Firefox and GIMP.

Firefox, the software I'm currently using to display this page, has come a long way, in large part thanks to its open source community. There are a vast number of people bug hunting and bug fixing, as well as a fairly understandable code review process. The example which I looked at can be seen here:
https://phabricator.services.mozilla.com/D48202
This is a simple bug fix which was approved back in October of 2019. It was written by a single contributor and reviewed by a single reviewer, being either the module owner or a designated peer, before being accepted onto the main branch.

The other piece of software, an image editing tool known as GIMP, also takes the open source approach. To get a contribution added to GIMP you must, as with Firefox, make a fork and then open a merge request. The request is then reviewed by a GIMP developer, and any tweaks that need to be made are made alongside community feedback. After the code has been finalized, or approved in its current state by the developer, it is merged.
https://gitlab.gnome.org/GNOME/gimp/merge_requests/195

These two ways of merging contributions are similar but show the difference in scale between the two projects. Firefox most likely receives far more merge requests than GIMP, which forces it to spread commit privileges out across the community, whereas GIMP can afford to let only its core developers merge.