gprof is a profiling tool that is built into the program by compiling and linking with the -pg flag. Running the instrumented program then writes a gmon.out file that records how the program ran. gprof can read this file directly, but the raw output isn't the easiest to understand and is better represented graphically. The data can be converted into a graph with:
gprof ./ffmpeg_g | gprof2dot | dot
This pipes the gprof output into gprof2dot, which converts it into a Graphviz graph description for dot to lay out. I had trouble getting the graph to appear in its own window with the -T x11 option, so I took the output and pasted it into a web interpreter, http://www.webgraphviz.com/. The site generated the graph seen in figure 5, which shows that the program spends a large share of its time in flac_encode_frame. This function is most likely where the audio is re-encoded into the new format.
After finding this, I ran perf. This tool shows which lines of the assembly code take up the most time on each system, which gives us an idea of where to look for speedups.
To run perf, I first reconfigured the build to remove -pg, then ran make clean followed by make. After this I ran the program again, this time prefixed with perf:
perf record ./ffmpeg_g -i ../OneHourBenchmark.mp3 OneHourBenchmarkOut.ogg
This produced the following two images.
Figure 1
Figure 2
The first image lists the functions where the program spent the most time, along with the percentage of samples that landed in each and the file each one belongs to. The second image comes from annotating one of those functions: it shows the source code interleaved with the assembly generated from it, with each instruction labelled with the percentage of samples that hit it. These percentages can be somewhat misleading, especially with small sample sizes, because perf does not time every instruction. Instead it interrupts the program every so often and records which instruction is executing at that moment, so near misses can occur and some lines may be missed entirely despite clearly being run. My test ran for around a minute, which gave enough samples for a reasonable assessment of the code.
Comparing the results above (ARM) with those below (x86) shows that what the program spends its time on within this function differs by hardware. Because of this, some optimizations will only help on certain machines, and in the next part we will have to look into what specifically we want to change.
Figure 3
Figure 4
Figure 5
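The claim that a one-minute run gives enough data can be made roughly quantitative. As a simplification, assume perf takes n samples in total and that each sample independently lands on a given instruction with probability p, that instruction's share of the run time. Real samples are periodic and can land near the responsible instruction rather than on it, so this is only an approximation. The sampling rate of 1000 samples per second in the worked numbers is hypothetical; perf's actual rate depends on its -F and -c settings.

```latex
\begin{aligned}
\mathbb{E}[\text{samples on the instruction}] &= np, \\
\operatorname{SE}(\hat{p}) &= \sqrt{\frac{p(1-p)}{n}}, \\
\Pr[\text{instruction receives no samples}] &= (1-p)^{n} \approx e^{-np}. \\[6pt]
\text{60 s run: } n = 60\,000,\ p = 0.001:\quad
  np &= 60, \qquad
  \operatorname{SE} = \sqrt{\tfrac{0.001 \times 0.999}{60\,000}} \approx 1.3 \times 10^{-4}, \qquad
  e^{-60} \approx 0, \\
\text{0.1 s run: } n = 100,\ p = 0.001:\quad
  np &= 0.1, \qquad
  \Pr[\text{no samples}] \approx e^{-0.1} \approx 0.90.
\end{aligned}
```

Under these assumptions, an instruction taking 0.1% of a one-minute run is measured to within about 0.013 percentage points and is essentially never missed. In a run of a tenth of a second, the same instruction gets no samples about 90% of the time, which is how a line can be missed entirely despite clearly being run.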