
Coming from a mostly Python background, I am now learning both C and x86-64 assembly. I have used C indirectly via Cython before, but I am now learning C proper in addition to assembly.

My basic question is what mindset I should adopt when it comes to optimising compilers. Should I just let the compiler do its job but, once I am sufficiently proficient in assembly, start to check and confirm the assembly output? Is that what responsible C programmers who want to write high-performance code do?

The question was triggered because I wanted to check what gcc 7.5.0 would optimise the code below to. In particular, I ran objdump to find out how accessing an array twice at the same index would be compiled at the various optimisation levels.

  • On -O3 there were some instructions I have not learnt yet, e.g. movaps XMMWORD PTR [rsp+0x10],xmm0
  • Levels -O2 and -O1 were somewhat clearer, but I still did not understand them fully
  • On level -O0 I believe I could see a rather straightforward translation of the code where I think messages[idx] was indeed accessed twice

My question is not about when these levels should be used. I am simply asking more experienced programmers whether this is what you do: compile with high optimisation levels and check the assembly output to make sure everything is as expected? Is that the natural workflow for people who want to truly know what machine code a compiler produces?

I understand that the example below is a trivial optimisation opportunity, but have you simply learnt that certain optimisations occur for sure and stopped thinking about them? There is not a lot of information about what kinds of transformations and optimisations can take place, and compilers leave no notes or messages for programmers to explain what was optimised and why, so I cannot imagine any other way than learning it all in practice. Thanks.

#include <stddef.h>
#include <stdio.h>

int main(int argc, char ** argv)
{
    size_t len_messages = 9;
    int messages[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};

    for(size_t idx=0; idx < len_messages; idx++) {
        printf("Accessing here %d and there %d\n", messages[idx], messages[idx]);
    }

    return 0;
}
  • I think this kind-of depends on the environment you work in. I don't think about optimization at all -- I just say -O3 and let the compiler do its thing -- unless there seems to be a problem. And there rarely is, in my domain -- the compiler usually generates pretty good code. In many domains I suspect you'll have to be a lot more proactive. To be honest, I suspect you'll get opinions on this, but no knock-down answers. Commented Sep 18, 2020 at 13:03
  • I rarely look at the assembly code, even though I care about optimization a lot. It is far from easy to look at two lumps of assembly code and say: this is faster than that -- modern processors are very complex. Moreover, a key determinant of performance is how nicely the program plays with the memory system (all those caches!) and this is often easier to see from a higher level view. For me, and I suspect many others, the time in optimisation gets spent in looking at profiler outputs and experimenting with higher level 'algorithms'. Commented Sep 18, 2020 at 13:24
  • Rather than manually checking the assembly, you should first run your code through a profiler. Look for hot spots, then focus on algorithmic complexity, cache coherency, etc. in those areas first. Only after you are confident that your design is optimal should you look at the assembly (if it is even still necessary at that point). Commented Sep 18, 2020 at 13:27
  • Only insane (or unfortunate) people look at the assembly for all of the high level language code they write. Commented Sep 18, 2020 at 13:28
  • I will read the PDF @BasileStarynkevitch, thank you for that, but in the meantime, I can already humbly suggest that you try a more toned down approach to typography - the document uses several fonts, font sizes, colours and text effects on any given page and it is difficult to read it - I am sure that the content is great though and I will familiarise myself with it. Commented Sep 18, 2020 at 17:05

2 Answers


My basic question is in what sort of a mindset should I put myself when it comes to optimising compilers. Should I just let a compiler do its job but, once I am sufficiently proficient in assembly, start to check and confirm the assembly output?

Mostly no.

Different pieces of code influence performance by different amounts - a piece of code that's only used once during initialization won't influence performance much, while a piece of code in the middle of a frequently executed loop may have an extreme impact on performance. Optimizing in assembly costs developer time and portability; and often those extra costs can't be justified by negligible performance improvements in code that isn't executed often.

For this reason the main tactic is to use a profiler to determine where the most important (for performance) pieces of code are; and investigate performance improvements for those pieces only.

However "investigate performance improvements" still doesn't necessarily mean going directly to assembly. You think about improving the algorithm, improving data structures and cache locality, improving parallelism ("more threads!"), etc.

After all of that you might look at the assembly the compiler generates and see if you can find a way to improve/optimize it by hand. You also might not.

The reason you still might not use assembly language is that different CPUs are different. You can optimize for one CPU (whatever your computer has) and make the software significantly slower on other CPUs (whatever the end users who run your software have); or you can rely on features (e.g. AVX-512) that may not exist. Of course this also means that the results you got from profiling aren't as useful as you might think (good enough for a crude estimate and never usable as an accurate representation applicable to all CPUs).

To get around that you might need multiple different versions in assembly language for different CPUs - one for "64-bit Intel with AVX-512", one for "64-bit Intel with AVX2", one for "64-bit Intel without any AVX", 2 more versions for AMD because you found out that a few instructions take longer on AMD and a few other instructions are faster on AMD; then another collection of different versions for 64-bit ARM, then PowerPC, then ...

Basically; it's rare to optimize in assembly. For a "heavily pounded" library (e.g. MPEG decoder, big number library, ...) it can make a lot of sense, and for a few performance critical parts of a large program it might be justified; but apart from that it's likely that you have far more important things to do with your time.


4 Comments

This is an interesting answer and I agree with you, but the question was more along the lines of "how do you know that the compiler optimises things as you would expect, given that it happens silently" rather than "when to optimise in assembly". I think you started to answer from the former perspective but then finished with the latter anyway :-) If you could just add a note to the effect that, in your experience, unless one's low-level code is math-heavy or perhaps unless one writes a compiler him- or herself, it is rare to check what a modern compiler does?
@Terry: That's not how any of it works. If you enable optimizations ("-O3") you know the compiler tried its best, you know that the compiler's best may be "worse than ideal", and you simply don't care much (and know that the compiler's best may be better or worse than whatever your expectation happened to be at the time). If you don't enable optimizations then you know the compiler didn't try (and can expect the result to be awful).
@Terry: Note that this could be considered "delegation" - you're delegating the responsibility of optimization to the compiler (and the compiler developers) so that you can say "LOL, not my problem anymore!".
I wish I could accept two answers - in the end I accepted the one from @rurban because he introduced me to a new tool along the way. Thanks again Brendan, your answer was very helpful too.

I rarely look at the disassembly alone. Mostly I decompile the function with Ghidra to see what the optimizer is doing. You get a much bigger and better picture that way: the output is in a more familiar C-like language, and you can still see the generated assembly alongside it.

