注意，这两个技术并不是避免时机不成熟的优化。并不是把冒泡排序变成快速排序（算法优化）。也不是语言或是编译器的优化。也不是把 i*4写成i<<2 的优化。
我们知道，程序运行时的90%的时间是用在了10%的代码上。我发现这并不准确。一次又一次地，我发现，几乎所有的程序会在1%的代码上花了99% 的运行时间。但是，是哪个1%？一个好的Profiler可以告诉你这个答案。就算我们需要使用100个小时在这1%的代码上进行优化，也比使用100个 小时在其它99%的代码上优化产生的效益要高得多得多。
几年前，我有一个同事，Mary Bailey，她在华盛顿大学教矫正代数（remedial algebra），有一次，她在黑板上写下：
x + 3 = 5
__ + 3 = 5
ESI += x;
下一个例子，一个D用户张贴了一个 benchmark 来显示 dmd (Digital Mars D 编译器)在整型算法上的很糟糕，而ldc (LLVM D 编译器) 就好很多了。对于这样的结果，其相当的有意见。我迅速地看了一下汇编，发现两个编译器编译出来相当的一致，并没有什么明显的东西要对2：1这么大的不同而 负责。但是我们看到有一个对long型整数的除法，这个除法调用了运行库。而这个库成为消耗时间的杀手，其它所有的加减法都没有速度上的影响。出乎意料 地，benchmark 和算法代码生成一点关系也没有，完全就是long型整数的除法的问题。这暴露了在dmd的运行库中的long型除法的实现很差。修正后就可以提高速度。所 以，这和编译器没有什么关系，但是如果不看汇编，你将无法发现这一切。
常规的做法是制胜法宝是挑选一个最佳的算法而不是进行微优化。虽然这种做法是无可异议的，但是有两件事情是学校没有教给你而需要你重点注意的。第一 个也是最重要的，如果你优化的算法没没有参与到你程序性能中的算法，那么你优化他只是在浪费时间和精力，并且还转移了你的注意力让你错过了应该要去优化的部分。第二点，算法的性能总和处理的数据密切相关的，就算是冒泡排序有那么多的笑柄，但是如果其处理的数据基本是排好序的，只有其中几个数据是未排序的， 那么冒泡排序也是所有排序算法里性能最好的。所以，担心没有使用好的算法而不去测量，只会浪费时间，无论是你的还是计算机的。
Overlooked Essentials For Optimizing Code
Sep 10, 2010
I've been programming for 35 years now, and I've done a lot of work optimizing programs for speed (an example), and watching others optimize. Two essential techniques are consistently ignored.
Nope, it isn't avoiding premature optimization. It isn't replacing bubble sort with quicksort (i.e. algorithmic improvements). It's not what language used, nor is it how good the compiler is. It isn't writing i<<2 instead of i*4.
- Using a profiler
- Looking at the assembly code being executed
The people who do those are successful in writing fast code, the ones who do not are not. Let me explain.
Using A Profiler
The old programming saw is that a program spends 90% of its time in 10% of the code. I've found that to not be true. Over and over, I've found that programs spends 99% of its time in 1% of the code. But which 1%? A profiler will tell you. Spending 100 hours of dev time on that 1% will yield real benefits, while 100 hours on the other 99% will not produce much of anything worthwhile.
What's the problem? Don't people use profilers? Nope. One place I worked at had a fancy expensive profiler that was still in its shrink wrap 3 years after purchase. Why don't people use profilers? I don't really know. I once got into a heated exchange with a colleague who insisted he knew where the bottlenecks were; after all, he was an experienced professional. I finally ran the profiler myself on his project, and of course the bottleneck was in a completely unexpected place.
Consider auto racing. The team that wins has sensors and logging on just about everything they can stick a sensor on. You can race using seat-of-the-pants tuning and have a jolly good time on the track, but you won't win and you won't even be competitive. You won't know if your poor speeds are caused by the engine, the exhaust, the aerodynamics, the tire pressure, or the driver. Why should programming be any different? You can't improve what you can't measure.
There are lots of profilers available. You can get ones that look at the hierarchy of function calls, function times, times broken down for each statement, and even at the instruction level. I've seen too many programmers eschew profilers, preferring instead to whittle away their time with useless and misdirected "optimizations" and getting trounced by their competitors.
Looking At The Assembly Code
Years ago, I had a colleague, Mary Bailey, who taught remedial algebra at the University of Washington. She told me once that when she wrote on the board:
x + 3 = 5
and asked her students to "solve for x", they couldn't answer. But, if she wrote:
__ + 3 = 5
and asked the students to "fill in the blank" all of them could do it. It seems that the magic word "x" seemed to cause them to reflexively think "x means algebra, I don't understand algebra, I can't do this."
Assembler is the algebra of the programming world. If someone asks me "was my function inlined by the compiler" or "if I write i*4, will the compiler optimize it to a left shift" I'll suggest they look at the asm output of the compiler. The reaction is how rude and unhelpful could I be? The person will follow up by saying he doesn't know assembler. Even C++ experts will say this.
Assembler is the simplest language (especially compared with C++!). For example,
is (expressed in C style):
ESI += x;
Details vary among CPUs, but that's how it works. It's not even really necessary to know that. Just looking at the assembler output and comparing it to the source code will tell a LOT.
How does this help optimization? For example, I knew a programmer years ago who thought he'd discovered a new, faster algorithm to do X. I'm being deliberately vague to protect him. He had the benchmarks to prove it, and wrote a nice article about it. But then someone looked at the assembler output of the regular way, and his new fast way. It turns out that the way he'd written his improved version had allowed the compiler to replace two DIV instructions with one. This had really nothing to do with his algorithm. But DIV is an expensive instruction, and this was in the inner loop, and so his algorithm appeared to be faster. The regular implementation could also be recoded slightly to use only one DIV, too, and it would perform just as fast as the new algorithm. He had discovered nothing.
For my next example, a D user posted a benchmark showing that dmd (Digital Mars D compiler) was lousy at integer arithmetic, while ldc (LLVM D compiler) was much better. Being very concerned about such a result, I promptly looked at the assembler output. It was pretty much equivalent, nothing stood out as being accountable for a 2:1 difference. But there was a long divide in there, done with a call to a runtime library function. That function call completely dominated the timing results, all the adds and subtracts in the benchmark had no significant impact on the speed. Unexpectedly, the benchmark wasn't about arithmetic code generation at all, it was about long division only. It turns out that dmd's runtime library function had a crummy implementation of long division in it. Fixing that brought the speed up to par. It wasn't the code generation at fault at all, but this was not discoverable without looking at the assembler.
Looking at the assembler often gives unexpected insight into why a program performs as it does. Unexpected function calls, unanticipated bloat, things that shouldn't be there, etc., all are exposed when looking at it. It isn't necessary to be an assembler crackerjack to be able to pick that up.
If you feel the need for speed, the way to get it is to use a profiler and be willing to examine the assembler for the bottlenecks. Only then is it time to think about better algorithms, faster languages, etc.
Conventional wisdom has it that choosing the best algorithm trumps any micro-optimizations. Though that is undeniably true, there are two caveats that don't get taught in schools. First and most importantly, choosing the best algorithm for a part of the program that has no participation to the performance profile has a negative effect on optimization because it wastes your time that could be better invested in making actual progress, and diverts attention from the parts that matter. Second, algorithms' performance always varies with the statistics of the data they operate on. Even bubble sort, the butt of all jokes, is still the best on almost-sorted data that has only a few unordered items. So worrying about using good algorithms without measuring where they matter is a waste of time - your's and computer's.
Just like ordering speed parts from an auto racing catalog isn't going to put you anywhere near the winner's circle (even if you get them installed right), without profiling, you won't know where the problems are without a profiler. Without looking at the assembler, you may know where the problem is, but often won't know why.
Thanks to Bartosz Milewski, David Held, and Andrei Alexandrescu for their helpful comments on a draft of this.