Saturday, July 25, 2009

Debugging Method Names

What's in a name? More specifically, a program identifier? Proper identifiers make the program easier to understand and maintain; and it is no secret that maintenance costs often make the lion's share of software development costs. Thus, as every beginning programmer should know, carefully choosing method names is an important part of the software engineer's work.

In Debugging Method Names (appeared in ECOOP '09), the authors -- Einar W. Høst and Bjarte M. Østvold -- apply statistical analysis to method names, finding correlation between certain verbs that appear as prefixes in method names (find, is, get, set, add, remove, etc.) and programmatic attributes of methods (has loops, creates objects of the containing type, returns void, etc.). In a way, they apply Wittgenstein Sprachspiel ("language game") to the language of programs: a word means whatever it is used for, rather than having an intrinsic meaning. And thus, one can easily find that methods called setSomething are highly correlated with a void return type, accepting one or more parameters, and updating a field. While this is hardly surprising, deeper connections also appear when the authors analyze a very large codebase, consisting of over one million methods.

Once the statistical correlations are located, outliers can be detected. And indeed, the authors show how the method was used to detect actual "naming bugs" in the analyzed programs. For example, consider the following method, from AspectJ's source code:

/**
* @return field object with given name, or null
*/
public Field name(String name) {
for (Iterator e = this.field_vec.iterator(); e.hasNext();) {
Field f = (Field) e.next();
if (f.getName().equals(name))
return f;
}
return null;
}


The authors' software, after learning about method names and with no previous assumptions about names, claims that it expects this method to be findSomething, whereas it is actually named containsField. The analysis is right on spot -- a more proper name for this method would have been findField.

Other detected problems include an equals method that creates objects of the compared type (this is an actual bug, not just a naming problem), setters named isBooleanProperty (where Java programmers would expect setBooleanProperty), and more. The latter example is actually an indication of the system's strength: if the analyzed codebase had been Objective-C, this wouldn't have been detected as a problem, since in that language, boolean setters are indeed named with an initial is.

Are the outliers always bugs? Hardly. Manual inspection of a subset (an admittedly small one) shows that about 30% of the purported "bugs" aren't really bugs. The authors state that overly complex getter methods, as well as methods which include logging code, tend to confuse the system. Still, this seems to be like the first step towards applying statistical NLP to software engineering, with very promising results.

Saturday, July 11, 2009

Producing Wrong Data Without Doing Anything Obviously Wrong

You'd expect code compiled with the gcc option -O3 to be faster than -O2 code, right? After all, the compiler spends more time on optimizing your code, and these are well-known optimizations that were deeply studied. So, would it surprise you if the code ended up being, say, 7% slower?

You test the code a day later, and suddenly, the very same code is 10% faster with -O3 as compared to -O2. What's going on here?

Let's see. Perhaps you've added or removed some totally unrelated environment variable?

Sounds odd, right? But this is exactly the behavior documented by the authors of Producing Wrong Data Without Doing Anything Obviously Wrong! (appeared in ASPLOS '09). They explain that a change in the size of the shell's environment can change the binary's alignment, causing noticeable differences in performance. They also show, although it is less surprising, that changes in the link order of the binary can have a measurable effect on performance.

The paper's claim is obviously not "play around with the environment size to find the performance sweet-spot". It is that measurement bias, a phenomenon well-known in experimental sciences, also exists in computer science experiments. They show that it exists regardless of the CPU, the benchmark used, or the compiler employed. Measurement bias in computer science is real, significant, commonplace, and unpredictable.

And thus, when developers try to measure the effect of an optimization, or the effectiveness of a new algorithm (when compared to an existing one), they should never be content with measurements -- even multiple measurements -- in a single settings. Change configurations; change platforms; play around with anything you can; randomize your setup; see if the performance difference that you suggest is real, or an artifact of the conincidental configuration you were using.

Will it work? Will the authors' "call for action" (section 8 in their paper) yield any results? I must admit I'm pessimistic. I think the only chance for this to have any effect is if conference reviewers will start demanding that experiment reports in papers include details about the measures taken to overcome measurement bias.

Until this happens, I cannot help but wonder how many times skillful craftsmen had devised great optimizations and applied them to their code, only to find out the code ends up being slower. The programmer shrug, not seeing what went wrong, and revert to the original, "more efficient", implementation. If only they had deleted one environment variable...