In Debugging Method Names (which appeared at ECOOP '09), the authors -- Einar W. Høst and Bjarte M. Østvold -- apply statistical analysis to method names, finding correlations between certain verbs that appear as prefixes in method names (find, is, get, set, add, remove, etc.) and programmatic attributes of methods (has loops, creates objects of the containing type, returns void, etc.). In a way, they apply Wittgenstein's Sprachspiel ("language game") to the language of programs: a word means whatever it is used for, rather than having an intrinsic meaning. Thus, one can easily find that methods called setSomething are highly correlated with a void return type, accepting one or more parameters, and updating a field. While this is hardly surprising, deeper connections also appear when the authors analyze a very large codebase, consisting of over one million methods.
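To make the idea of "a verb means whatever it is used for" concrete, here is a minimal sketch of how such verb-to-attribute frequencies could be tabulated. It is not the authors' implementation: the class VerbProfiles, the MethodInfo summary type, the attribute labels (returnsVoid, writesField, and so on) and the toy corpus are all assumptions made up for illustration, and a real tool would extract the attributes from bytecode or source rather than take them as strings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class VerbProfiles {
    /** Hypothetical, simplified summary of a method: its name plus attribute labels. */
    static final class MethodInfo {
        final String name;
        final Set<String> attributes;

        MethodInfo(String name, String... attributes) {
            this.name = name;
            this.attributes = new HashSet<>(Arrays.asList(attributes));
        }
    }

    /** Leading lowercase run of a camelCase name, e.g. "setName" -> "set". */
    static String verbOf(String methodName) {
        int i = 0;
        while (i < methodName.length() && Character.isLowerCase(methodName.charAt(i))) {
            i++;
        }
        return methodName.substring(0, i);
    }

    /**
     * For each verb, the fraction of its methods that carry each attribute.
     * Attributes never seen with a verb are simply absent (an implicit 0).
     */
    static Map<String, Map<String, Double>> profiles(List<MethodInfo> corpus) {
        Map<String, Integer> verbCounts = new HashMap<>();
        Map<String, Map<String, Integer>> attrCounts = new HashMap<>();
        for (MethodInfo m : corpus) {
            String verb = verbOf(m.name);
            verbCounts.merge(verb, 1, Integer::sum);
            Map<String, Integer> counts = attrCounts.computeIfAbsent(verb, v -> new HashMap<>());
            for (String attribute : m.attributes) {
                counts.merge(attribute, 1, Integer::sum);
            }
        }
        Map<String, Map<String, Double>> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : attrCounts.entrySet()) {
            double total = verbCounts.get(e.getKey());
            Map<String, Double> freq = new HashMap<>();
            e.getValue().forEach((attribute, n) -> freq.put(attribute, n / total));
            result.put(e.getKey(), freq);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy corpus, invented for this example only.
        List<MethodInfo> corpus = Arrays.asList(
            new MethodInfo("setName", "returnsVoid", "hasParameters", "writesField"),
            new MethodInfo("setVisible", "returnsVoid", "hasParameters", "writesField"),
            new MethodInfo("setUp", "returnsVoid"),
            new MethodInfo("isEmpty", "returnsBoolean"),
            new MethodInfo("findUser", "hasLoop", "hasParameters"));
        profiles(corpus).forEach((verb, freq) -> System.out.println(verb + " -> " + freq));
    }
}
```

With a corpus of realistic size, the profile for set would presumably show the pattern described above: nearly always void, nearly always taking parameters, nearly always writing a field.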
Once the statistical correlations are located, outliers can be detected. And indeed, the authors show how the technique was used to detect actual "naming bugs" in the analyzed programs. For example, consider the following method, from AspectJ's source code:
    /**
     * @return field object with given name, or null
     */
    public Field containsField(String name) {
        for (Iterator e = this.field_vec.iterator(); e.hasNext();) {
            Field f = (Field) e.next();
            if (f.getName().equals(name))
                return f;
        }
        return null;
    }
The authors' software, after learning about method names and with no previous assumptions about names, claims that it expects this method to be findSomething, whereas it is actually named containsField. The analysis is spot on -- a more fitting name for this method would have been findField.
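As a rough illustration of how a mismatch like this could be flagged, the sketch below scores a method's observed attributes against per-verb profiles and suggests the best-fitting verb. This is a simple independence-based likelihood score swapped in for illustration, not the analysis used in the paper. The class NameSuggester, the attribute labels and, in particular, the profile numbers are made-up assumptions, not figures from the paper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class NameSuggester {
    /** Keep probabilities away from 0 and 1 so that log() stays finite. */
    static double clamp(double p) {
        return Math.min(0.99, Math.max(0.01, p));
    }

    /** Log-likelihood of an attribute set under one verb's profile, treating attributes as independent. */
    static double score(Set<String> attributes, Map<String, Double> profile) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : profile.entrySet()) {
            double p = clamp(e.getValue());
            sum += Math.log(attributes.contains(e.getKey()) ? p : 1.0 - p);
        }
        return sum;
    }

    /** Verb whose profile best explains the observed attributes. */
    static String bestVerb(Set<String> attributes, Map<String, Map<String, Double>> profiles) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double s = score(attributes, e.getValue());
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Helper: one verb's profile over three hypothetical attributes. */
    static Map<String, Double> profile(double hasLoop, double returnsBoolean, double returnsObject) {
        Map<String, Double> p = new HashMap<>();
        p.put("hasLoop", hasLoop);
        p.put("returnsBoolean", returnsBoolean);
        p.put("returnsObject", returnsObject);
        return p;
    }

    /** Made-up illustrative frequencies; NOT measurements from the paper. */
    static Map<String, Map<String, Double>> toyProfiles() {
        Map<String, Map<String, Double>> m = new HashMap<>();
        m.put("find", profile(0.80, 0.05, 0.90));
        m.put("contains", profile(0.60, 0.95, 0.05));
        m.put("get", profile(0.10, 0.10, 0.80));
        return m;
    }

    public static void main(String[] args) {
        // Attributes of the method shown above: it loops and returns an object, not a boolean.
        Set<String> observed = new HashSet<>(Arrays.asList("hasLoop", "returnsObject"));
        String declared = "contains";
        String suggested = bestVerb(observed, toyProfiles());
        if (!suggested.equals(declared)) {
            System.out.println("Possible naming bug: declared '" + declared
                + "', attributes look more like '" + suggested + "'");
        }
    }
}
```

Under these invented profiles, a looping method that returns an object fits find far better than contains, which mirrors the report for containsField.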
Other detected problems include an equals method that creates objects of the compared type (this is an actual bug, not just a naming problem), setters named isBooleanProperty (where Java programmers would expect setBooleanProperty), and more. The latter example is actually an indication of the system's strength: if the analyzed codebase had been Objective-C, this wouldn't have been detected as a problem, since in that language, boolean setters are indeed named with an initial is.
Are the outliers always bugs? Hardly. Manual inspection of a subset (an admittedly small one) shows that about 30% of the purported "bugs" aren't really bugs. The authors state that overly complex getter methods, as well as methods which include logging code, tend to confuse the system. Still, this seems like a first step towards applying statistical NLP to software engineering, with very promising results.