Understanding Code Changes With AST Differencing

On Wednesday 12 December 2012, 9:00-9:30 INRIA Lille room B31 (new building), Jean-Rémy Falleri (Associate Professor @ ENSEIRB-MATMECA) will give a talk on “Understanding Code Changes With AST Differencing”

 

Abstract:
Finding the differences between two versions of a source code file is
a very common operation in software development. It is used in various
tasks such as performing a code review or understanding a bug fix.
Many automated tools have been introduced over the years to assist the
developers in this tedious task. For instance, \texttt{diff} is very
famous among the developers. This tool consider source code files as a
sequence of text lines and find a minimal cost edition sequence that
transform a sequence into another. The actions are: insert a line,
delete a line and update a line.

diff is very useful: it works without any adaptation on any
textual programming language. Nevertheless, its limitation is very
obvious: a source code is not really a sequence of lines, it always
has a tree structure, usually called abstract syntax tree (AST). Many
approaches have adapted the idea of textual differencing to trees.
These approaches find an minimal cost edition sequence between two
trees. The editing actions on a tree are: insert a node, delete a node
and update the label of a node.

Unfortunately, the best known algorithms to find such a sequence have
a $O(n^3)$ time complexity (n beeing the number of nodes), which is
expensive for real world ASTs that contain more than tens of thousands
nodes. More importantly, these approaches do not take into account a
very frequent edit action used in software development: moving a node.
If this action is taken into account, the problem of finding an
minimal cost edition sequence is NP-hard. AST differencing including
moves requires therefore good heuristics to be used in practice.

AST differencing including move actions has not been investigated a
lot. The main approaches in the literature are either language
specifics and/or does not work on raw ASTs and/or have not been
empirically validated. We propose a novel algorithm that computes an
edition sequence between two ASTs including move actions. This
algorithm is very fast and works on raw ASTs with almost no
configuration.