Duplicate Detection and Consolidation overview

December 20th, 2011

This post is an overview of a blog series on the Duplicate Detection and Consolidation (DDC) feature shipped in the DevExpress CodeRush Pro productivity tool for the Visual Studio IDE.

Background

Duplicate code, sometimes referred to as a clone, is a fragment of program source code that is very similar to another code fragment. A code clone may occur more than twice, either within a single program or across different programs owned or maintained by the same group of developers. Code duplication is considered an expensive practice that should be avoided because it complicates the maintenance and evolution of the software.

There are many reasons why duplicate code can appear in source code:

  • Clipboard Inheritance. Copying and pasting code (and then modifying it as needed) is faster and easier than writing similar code from scratch.
  • Coding Guidelines. Sometimes the application or the team’s coding guidelines may call for a frequently-needed code fragment to appear throughout an application, such as error handling, logging, or wiring up user interface displays. The fragment will intentionally appear throughout the code to maintain the style.
  • Performance Enhancements. Some code clones may exist for perceived or real performance gains. Systems with incredibly tight time constraints are often hand-optimized by replicating frequent computations.
  • Big Tasks, Little Time. It is generally difficult for a single developer to understand all the code in a large software system. Pressure to ship features in a short period of time can incentivize the use of example-oriented programming, where a developer copies and adapts existing code.
  • Design Patterns. The repeated use of design patterns can lead to similar functionality spread across a software system. For example, applying the composite design pattern more than once can lead to multiple distinct classes, each implementing the Leaf and Composite parts. If these implementations share similar properties to support navigation or search (such as access to the Parent or Children), this can lead to similar code located in distinct classes, all designed to implement navigation/search/validation functionality in a similar way, each a specialist for working with one of the different Leaf/Composite pairs.
  • The Cost of Crossing Project Boundaries. It’s more expensive to consolidate duplicate functionality across project boundaries. When functionality from one high-level project is needed in another, the cost of consolidating in that moment (which could include an architectural change that needs approval) acts as an incentive to duplicate the code and ensure its survival.
  • Generated Code. Code generators can create significant quantities of code in a short period of time, and sometimes that code includes repeated patterns of functionality.
  • Unintentionally. Over time, the spec may call for two functionally similar blocks of code (for example, finding the best and worst performers in a list), or two developers may have been involved in implementing similar logic, or perhaps one developer worked on the code at two distinct times. Regardless, it is possible to independently and unintentionally create functionally duplicate code.

Why do I need DDC?

Research in software maintenance has shown that many programs contain a significant amount of duplicate code, estimated to be somewhere between 5% and 20% [1]. Code duplication can significantly increase software maintenance costs because it:

  • Increases the time it takes for developers to get up to speed due to the additional code bulk.
  • Makes the code harder to read by obscuring the purpose of each duplicate.
  • Forms an effective barrier to enhancing the software, since similar changes must be made to all the clones or the clones must be consolidated into a single block of code before introducing change.
  • Negatively impacts your company’s reputation due to update anomalies (e.g., a bug is fixed in an update only to be rediscovered later by customers). Inconsistent updates (e.g., fixing only one of several cloned bugs) can easily turn into unexpected program behavior, where the software only works some of the time.
  • Increases the number of test cases needed, which can lead to more code duplication (in the test cases).

So, we might be interested in finding and consolidating all duplicate code for the following reasons:

  • Decreasing software maintenance costs. If one is sure that the code segment where a bug is found occurs only once in the system, one can be confident that the bug has been eradicated from the system.
  • Repairing design flaws. Code duplication may indicate design problems, such as a missed opportunity to use a specific programming technique like inheritance.
  • Reducing code size. Refactoring duplicated code into a single code block reduces the overall size of the code base and can result in a faster compile.

Detection

Different algorithms have been proposed to detect duplicate code. One of the primary challenges for any detection algorithm is that it is not known beforehand which code fragments occur multiple times. For example, in a long method, only a subset of that method may be duplicated. Comparing every possible subset of every method against every other requires a huge number of comparisons, and that number grows rapidly as the solution gets larger.
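To get a feel for that growth, here is a rough back-of-the-envelope sketch in Python. It is not how DDC counts anything. It simply assumes that a candidate fragment is any contiguous run of statements inside one method, and that a naive detector compares every fragment with every other fragment. The function names and the 20-statements-per-method figure are made up for illustration.

```python
def fragment_count(statements: int) -> int:
    """Number of contiguous, non-empty runs of statements in one method body."""
    return statements * (statements + 1) // 2


def pairwise_comparisons(method_sizes: list[int]) -> int:
    """Fragment-to-fragment comparisons a naive detector would perform."""
    fragments = sum(fragment_count(n) for n in method_sizes)
    return fragments * (fragments - 1) // 2


if __name__ == "__main__":
    for methods in (10, 100, 1000):
        sizes = [20] * methods  # assumption: every method has 20 statements
        print(f"{methods:>5} methods -> {pairwise_comparisons(sizes):,} comparisons")
```

Under these assumptions, ten methods already need about 2.2 million comparisons, and every tenfold increase in the number of methods multiplies the work by roughly one hundred. This is why practical detectors avoid brute-force pairwise comparison.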

DDC uses an algorithm that is conceptually independent of the programming language of the source code being analyzed, because it works at the level of abstract syntax trees (ASTs). This means it can find functionally duplicate code in source written in different languages (e.g., C# and Visual Basic). It scans through the code of the entire solution and reports all duplicates found.
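The idea of comparing trees rather than text can be sketched in a few lines of Python. This is only an illustration of the concept, not CodeRush’s internal representation: the Node class, the node kinds, and both hand-built trees are hypothetical stand-ins for what a C# parser and a Visual Basic parser might produce for the same statement.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    kind: str                            # language-neutral node kind, e.g. "If"
    children: Tuple["Node", ...] = ()


def shape(node: Node) -> tuple:
    """Nested tuple of node kinds; equal shapes mean equal tree structure."""
    return (node.kind, tuple(shape(child) for child in node.children))


# Hypothetical tree for the C# statement:  if (total > limit) return limit;
from_csharp = Node("If", (
    Node("GreaterThan", (Node("Name"), Node("Name"))),
    Node("Return", (Node("Name"),)),
))

# Hypothetical tree for the Visual Basic statement:
#   If total > limit Then Return limit
from_vb = Node("If", (
    Node("GreaterThan", (Node("Name"), Node("Name"))),
    Node("Return", (Node("Name"),)),
))

print(shape(from_csharp) == shape(from_vb))
```

Once both languages are mapped onto the same kinds of nodes, the surface syntax (parentheses, semicolons, Then) no longer matters, and the two statements compare as equal.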

Duplicate detection ignores comments, whitespace, curly braces, variable and parameter names, etc., as you would expect from a modern tool. However, CodeRush’s duplicate detection algorithm goes beyond this to also find code that is functionally similar. The level of similarity defines the size of duplicate code blocks found, whether it is an entire method or a code block consisting of only a few lines.
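The first part of that paragraph (ignoring comments, layout, and names) can be demonstrated with Python’s standard ast module. The sketch below is a simplified stand-in and not the DDC algorithm: it parses Python rather than C# or Visual Basic, it only catches clones that are identical after renaming, and the fingerprint helper and the two sample functions are invented for this example.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Rename every variable and parameter to a positional placeholder."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, name):
        # The first distinct name becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = "f"  # the function's own name is irrelevant to its logic
        self.generic_visit(node)
        return node


def fingerprint(source: str) -> str:
    """Structural fingerprint; parsing already discards comments and layout."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)


a = """
def total_price(items):
    # sum up the prices
    result = 0
    for item in items:
        result += item
    return result
"""

b = """
def sum_scores(scores):
    acc = 0
    for s in scores:

        acc += s
    return acc
"""

print(fingerprint(a) == fingerprint(b))
```

The two functions differ in names, comments, and blank lines, yet they produce the same fingerprint. Finding code that is merely similar, rather than identical after renaming, takes more than this (for example, tolerating small differences between subtrees).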

Consolidation

A primary purpose of duplicate code detection is to remove duplication from the system through refactoring, improving the code quality of the software. The process of eliminating duplicate code is called code consolidation: all the duplicates are replaced with a single block of code. With consolidation, you can decrease complexity, reduce potential sources of errors emerging from duplicated code, increase readability, and increase system flexibility.
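As a small hand-written illustration of consolidation (not output generated by DDC), consider the "best and worst performers" case mentioned earlier. The sketch is in Python, and every name in it is hypothetical. The two near-identical functions differ only in the comparison they use, so they can be replaced by one function that takes the comparison as a parameter.

```python
import operator


# Before: two clones that differ only in the comparison operator.
def best_performer_dup(employees):
    chosen = employees[0]
    for e in employees[1:]:
        if e["score"] > chosen["score"]:
            chosen = e
    return chosen


def worst_performer_dup(employees):
    chosen = employees[0]
    for e in employees[1:]:
        if e["score"] < chosen["score"]:
            chosen = e
    return chosen


# After: a single block of code; the difference becomes a parameter.
def pick_performer(employees, is_better):
    chosen = employees[0]
    for e in employees[1:]:
        if is_better(e["score"], chosen["score"]):
            chosen = e
    return chosen


def best_performer(employees):
    return pick_performer(employees, operator.gt)


def worst_performer(employees):
    return pick_performer(employees, operator.lt)


team = [
    {"name": "Ann", "score": 72},
    {"name": "Bob", "score": 91},
    {"name": "Cy", "score": 58},
]
print(best_performer(team)["name"], worst_performer(team)["name"])
```

After consolidation, a fix to the selection loop (say, handling an empty list) only has to be made in one place.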

DDC allows you to consolidate most of the duplicate code found automatically, in a single click.

Want to learn more?

[1] A Survey on Software Clone Detection Research. Chanchal Kumar Roy and James R. Cordy. September, 2007.

See also Mark Miller’s blog: Duplicate Detection and Consolidation in CodeRush for Visual Studio.

-----
Products: CodeRush Pro
Versions: 11.2 and up
VS IDEs: any
Updated: Dec/20/2011
ID: C147