Clone Digger

 discovers duplicate code in Python and Java

Documentation

See my slides on EuroPython 2008.

Mail list

You can ask any questions concerning Clone Digger in the SF.net mail list and in the Google Groups. The second option is preferred because SF.net mailing lists sometimes fail.

Integration with Hudson

See a short manual by Pavol Zibrita to learn how to get Clone Digger and Hudson (continuous integration tool) to work together.

Eclipse Plugin

Eclipse Plugin for Clone Digger has been implemented during Google Summer of Code 2008. See the manual to learn how install it and how to work with it.

Algorithm

The algorithm behind Clone Digger is described in the paper "Duplicate code detection using anti-unification".

We will give the general idea below. The algorithm consists of the following phases:

  • Construction of abstract syntax tree (AST). Every source file is transformed into AST representation. AST for Python code are built using standard CPython module Compiler. ASTs for Java are built using ANTLR. The support of other programming languages can be added feasibly: internal compiler information or parser generator (ANTLR, JavaCC, etc) can be used to produce XML of AST, the rest is simple.
  • Clusterization of statements. The concept of anti-unification is the key point of our algorithm. We use it to define the distance function between statements and to compute the middle value of two statements. This middle value is the most concrete abstraction of statements. Having these two operations (distance and middle value) we apply simple clustering algorithm. Similar statements are put into the same cluster.
  • Finding pairs of sequences of statements, which have similar statements in corresponding positions. After the previous phase each statement is marked with its cluster ID. Therefore a set of sequences of statements can be seen as a set of strings of their marks. We search for all common substrings using suffix trees.
  • The final check. The matching pairs of sequences, which have similar statements in corresponding positions, are now globally checked for similarity. Found clones are reported in the HTML. Each pair is reported statement by statement with a highlighting of differences (see examples page).

Supported clone types

Strictly speaking, Clone Digger reports two sequences of statements as clone if one of them can be obtained from the other by replacing some AST substrees with another subtrees (the maximum total size of subtrees is given as a threshold parameter). Clone Digger:

  • ignores whitespaces and comments (if they are not presented in the AST),
  • supports clone parametrisation,
  • allows changes of constants and variable and function names,
  • supports more complex clones,

Clone Digger will not find:

  • gapped clones,
  • clones with inserted and/or deleted statements.

It should be noted that due to the fuzziness of the clusterization phase it isn't guaranteed that all such clones will be found.

Criticism

How about the Occam's razor? Why the new complex algorithm is better than existing solutions? The answer is the following: our algorithm takes all advantages of the AST representation and handles complex clones. For example, the simplest solutions support only renaming of the variables and changes in whitespaces and comments. However the presented algorithm in only one of the spectrum of clone detection algorithms and it can't cover all the possible cases.

Similar approaches

Papers and tools related to clone detection are listed on the Clone Detection Literature page at the University of Alabama at Birmingham.

In particular you can take a look at CloneDR, a proprietary clone detection tool, which is based on ASTs and supports automatic refactoring. Also you may be interested in the approach used in Asta, a prototype of Clone Detection system developed in Microsoft Research.

Preparation of source files

Clone Digger can't determine which source files are essential and which are not. Therefore you have to clean up the source directory and to remove unimportant files by yourself. I suggest removing:

  • Automatically generated sources. They can't be refactored and therefore the clones found in them are useless. All the more so automatically generated sources tend to have long sequences of statements, and this leads to the O(n^2) blowup in the suffix tree based algorithm.
  • Tests.
  • Third party libraries.

Running Clone Digger

Clone Digger is written in Python and thus platform-independent. You will need Java virtual machine in order to handle Java source files. Clone Digger produces a HTML file with a list of clone candidates.

To run Clone Digger type:

python clonedigger.py [PATH TO SOURCE TREE]

Use --help option to get the list of available options.