
Tutorial of ccfx

Jacek Banaszczyk edited this page Dec 30, 2018 · 9 revisions

Author: Toshihiro Kamiya
Created: 2008/Nov/6
Contact: info@ccfinder.net
Copyright: 2005-2010 © Toshihiro Kamiya. All rights reserved.

TOC

  1. Outline of command line of ccfx
  2. Detection and printing of code clones
  3. Metrics and filtering
    1. Calculation of metrics
    2. Filtering by metric values
  4. File list
    1. Generating a file list
  5. File group
    1. Code clone detection between versions with file group
  6. Appendix
    1. Execution modes of ccfx
    2. Preprocess Script Names
    3. Options of execution-mode d
    4. Options of execution-mode p

Outline of command line of ccfx

ccfx is a command-line tool, that is, it is meant to be invoked by the user from the command line. Its functions include detecting code clones and filtering them by clone-set metrics or file metrics.

ccfx also works as the back end of the GUI tool GemX: GemX internally calls ccfx to detect code clones and analyze them.

Invoking ccfx directly from the command line gives you access to all of its options, some of which are unavailable via GemX. You can also invoke ccfx from a batch file in order to combine it with other command-line tools.

The first argument of a ccfx command line is an execution-mode specifier. The available execution modes are d, p, m, s, and f. For details of each mode, see Execution modes of ccfx. You can list the execution modes with the following command line:

ccfx -h

You can also obtain the help message of each execution mode with command lines such as:

ccfx d -h
ccfx m -h

The following sections present each execution mode, along with a typical scenario in clone detection and analysis.

Detection and printing of code clones

This section shows how to detect code clones in the source files stored under a given directory.

In this example, the source files are written in the Java programming language, and the target directory is c:\target\src.

To detect code clones with ccfx, use execution-mode d (d stands for detection). The following command line makes ccfx detect code clones in the source files and store the result in a file a.ccfxd in the current directory:

ccfx d java -dn c:\target\src

The argument java means that the target source files are written in Java. With this argument, ccfx searches the target directories (including their sub-directories) for files with the extension ".java" and applies the preprocessing for the Java programming language. (The preprocessing is a kind of normalization that depends on the syntax of each programming language and improves the correctness of clone detection. The detection algorithm itself is independent of the programming language; a single detection algorithm is used for source code written in any language.) The appendix section Preprocess Script Names contains a table of preprocess scripts, programming languages, and source-file extensions.

The option -dn specifies, roughly speaking, the directory that stores the target source files. By default, the detected code clones are stored in a file a.ccfxd. To specify the name of the output file, use option -o file. The options are described in detail in Options of execution-mode d.

The clone-data file (the output file) is a binary file. To print its contents as text, use execution-mode p of ccfx (p stands for pretty printing):

ccfx p a.ccfxd

Execution-mode p has some options that extract part of the information from a clone-data file. For example, each target source file has a file ID (a kind of serial number), and the following command line prints a table of each file path and its file ID:

ccfx p -ln a.ccfxd

The details of options are shown in Options of execution-mode p.
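
The detection and printing steps above can also be driven from a script. The following Python sketch only assembles the argument lists shown in this section. The helper names (detect_command, print_command, run) are hypothetical, placing -o after -dn is an assumption about argument order, and run assumes that ccfx is available on the PATH.

```python
import subprocess


def detect_command(language, directory, output=None):
    """Build the argument list for `ccfx d` (clone detection)."""
    cmd = ["ccfx", "d", language, "-dn", directory]
    if output is not None:
        # Assumption: -o may follow -dn; the default output is a.ccfxd.
        cmd += ["-o", output]
    return cmd


def print_command(clone_data, with_file_ids=False):
    """Build the argument list for `ccfx p` (pretty printing)."""
    cmd = ["ccfx", "p"]
    if with_file_ids:
        cmd.append("-ln")  # table of file paths and file IDs
    cmd.append(clone_data)
    return cmd


def run(cmd):
    """Run a command and return its standard output as text."""
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return completed.stdout


if __name__ == "__main__":
    # Only prints the command lines; call run(...) to execute them.
    print(detect_command("java", r"c:\target\src"))
    print(print_command("a.ccfxd", with_file_ids=True))
```

Building the command as a list, rather than as one string, avoids quoting problems when a directory name contains spaces.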

Metrics and filtering

ccfx can calculate several metrics for code clones and for source files. You can use these metrics to filter code clones or source files.

Calculation of metrics

As in the preceding section, the source files are written in Java and the target directory is c:\target\src. In addition, code-clone detection has already been done, and the detected clones are stored in a clone-data file a.ccfxd.

To calculate metrics, use execution-mode m. There are two categories of metrics: clone metrics and file metrics. The command line below extracts clone metrics:

ccfx m a.ccfxd -c

The command line below extracts file metrics:

ccfx m a.ccfxd -f

In execution-mode m command lines, the clone-data file must come before option -c or -f.

Output files are specified with option -o file. You can give a separate option -o for each of the options -c and -f, so the command line below extracts both clone metrics and file metrics:

ccfx m a.ccfxd -c -o clonemetrics.tsv -f -o filemetrics.tsv

Both clone metrics and file metrics are written as tab-separated text files, so you can inspect the values by opening the files in a spreadsheet application.

In the file-metrics output, each input source file is denoted by its file ID, that is, the serial number of the target file (as described previously, use the command line ccfx p -ln a.ccfxd to check the IDs).
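
Because the metrics files are tab-separated with the column names in the first line, they are also easy to read from a script. The following Python sketch is a minimal example. The sample rows are invented for illustration (a real metrics file contains more columns), and read_metrics is a hypothetical helper name.

```python
import csv
import io


def read_metrics(stream):
    """Read a tab-separated metrics table whose first line holds column names."""
    return list(csv.DictReader(stream, delimiter="\t"))


# Hypothetical sample standing in for the content of filemetrics.tsv.
SAMPLE = "FID\tCVR\tRNR\n1\t0.25\t0.80\n2\t0.00\t0.95\n"

rows = read_metrics(io.StringIO(SAMPLE))

# File IDs of the files that contain at least one cloned fragment.
covered = [row["FID"] for row in rows if float(row["CVR"]) > 0.0]
print(covered)
```

To read a real file, pass open("filemetrics.tsv", newline="") instead of the io.StringIO object.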

Filtering by metric values

Filtering a clone-data file by metrics requires two steps. First, make a list of the file IDs (or clone IDs) that should remain in the data. Second, modify the clone-data file using the list.

Step 1

As an example, consider removing the source files that are not related to any code clone. Assume that the file metrics have already been calculated and saved as a file filemetrics.tsv. The predicate that identifies the files to keep is "does the file include a code fragment of a code clone?", that is, CVR > 0.0 as an expression using file metrics.

The command line below extracts the set of file IDs that satisfy this predicate:

picosel from filemetrics.tsv select FID where CVR .gt. 0.0 > remainingfiles.txt

In this command line, FID is the name of a column in the file metrics (all column names are printed in the first line of the metrics file). CVR is the metric that shows the ratio of tokens covered by any code clone. The operator .gt. means "greater than".

The expression after where is the condition that the remaining files must satisfy. The following operators can be used in the expression.

Operator  Symbol  Meaning
.eq.      ==      equal to
.ge.      ≥       greater than or equal to
.gt.      >       greater than
.le.      ≤       less than or equal to
.lt.      <       less than

The keyword and combines conditions. For example, the following command line selects the source files that consist largely of repeated sections (RNR < 0.1) and that share a large amount of cloned code with other files (RSA > 0.9):

picosel from filemetrics.tsv select FID where RNR .lt. 0.1 and RSA .gt. 0.9 > remainingfiles.txt

This example filters source files by file metrics. To filter code clones by clone metrics, replace FID with CID in the above command line, and replace the condition with one that uses clone metrics.

Once the file remainingfiles.txt has been generated, the first step is done. In the second step, this file is used to determine which source files (or code clones) remain.
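
To make the meaning of such a condition concrete, the following Python sketch evaluates the same kind of expression against a tab-separated metrics table. It is an illustrative model of the selection logic, not picosel's implementation: it assumes that operands and operators are separated by spaces and that all compared values are numeric, and the sample rows are invented.

```python
import csv
import io
import operator

# The comparison operators listed in the table above.
OPERATORS = {
    ".eq.": operator.eq,
    ".ge.": operator.ge,
    ".gt.": operator.gt,
    ".le.": operator.le,
    ".lt.": operator.lt,
}


def parse_condition(text):
    """Parse e.g. 'RNR .lt. 0.1 and RSA .gt. 0.9' into (column, op, value) triples."""
    terms = []
    for clause in text.split(" and "):
        column, op, value = clause.split()
        terms.append((column, OPERATORS[op], float(value)))
    return terms


def select(stream, column, condition):
    """Return the values of `column` for the rows that satisfy every term."""
    terms = parse_condition(condition)
    selected = []
    for row in csv.DictReader(stream, delimiter="\t"):
        if all(op(float(row[name]), value) for name, op, value in terms):
            selected.append(row[column])
    return selected


# Hypothetical file metrics: only file 1 satisfies both terms.
SAMPLE = "FID\tRNR\tRSA\n1\t0.05\t0.95\n2\t0.50\t0.95\n3\t0.05\t0.10\n"
print(select(io.StringIO(SAMPLE), "FID", "RNR .lt. 0.1 and RSA .gt. 0.9"))
```

Each selected ID corresponds to one line of the ID list that is passed to the second step.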

Step 2

Execution-mode s filters a clone-data file using the list of file IDs (or clone IDs) that should remain (s stands for subset or scope).

The following command line filters by file ID and saves the result to a file filtered.ccfxd:

ccfx s a.ccfxd -o filtered.ccfxd -fi remainingfiles.txt

Here, the option -fi file keeps the source files whose file IDs appear in remainingfiles.txt and removes all other source files from the clone-data file.

To filter by clone ID, use option -ci file in place of option -fi file.

Summary of command lines for filtering by file metrics

Filter the input clone-data file a.ccfxd and save the result to a clone-data file filtered.ccfxd:

ccfx m a.ccfxd -f -o filemetrics.tsv
picosel -o remainfiles.txt from filemetrics.tsv select FID where "CONDITION"
ccfx s a.ccfxd -o filtered.ccfxd -fi remainfiles.txt

Summary of command lines for filtering by clone metrics

Filter the input clone-data file a.ccfxd and save the result to a clone-data file filtered.ccfxd:

ccfx m a.ccfxd -c -o clonemetrics.tsv
picosel -o remainclones.txt from clonemetrics.tsv select CID where "CONDITION"
ccfx s a.ccfxd -o filtered.ccfxd -ci remainclones.txt
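
The two summaries differ only in the metrics option, the ID column, and the subset option, so a script can generate either one. The following Python sketch builds the three command lines as argument lists. The function name filtering_pipeline and its parameters are hypothetical, and the intermediate file names are the ones used in the summaries above.

```python
def filtering_pipeline(clone_data, condition, by="file", output="filtered.ccfxd"):
    """Return the three command lines (as argument lists) for metric filtering."""
    if by == "file":
        metrics_opt, id_column, subset_opt = "-f", "FID", "-fi"
        metrics_file, id_file = "filemetrics.tsv", "remainfiles.txt"
    elif by == "clone":
        metrics_opt, id_column, subset_opt = "-c", "CID", "-ci"
        metrics_file, id_file = "clonemetrics.tsv", "remainclones.txt"
    else:
        raise ValueError("by must be 'file' or 'clone'")
    return [
        # 1. Calculate the metrics.
        ["ccfx", "m", clone_data, metrics_opt, "-o", metrics_file],
        # 2. Select the IDs to keep; the condition is passed as one argument,
        #    like the quoted "CONDITION" in the summaries.
        ["picosel", "-o", id_file, "from", metrics_file,
         "select", id_column, "where", condition],
        # 3. Keep only the selected files or clones.
        ["ccfx", "s", clone_data, "-o", output, subset_opt, id_file],
    ]


if __name__ == "__main__":
    for command in filtering_pipeline("a.ccfxd", "CVR .gt. 0.0"):
        print(" ".join(command))
```

To execute the steps, each list could be passed in order to subprocess.run(command, check=True), assuming ccfx and picosel are on the PATH.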

File list

This section shows how to generate a file list (that is, a list of the input source files for code-clone detection) and how to use a file list in the detection and analysis of code clones.

A file list specifies the paths of the target source files explicitly, one by one. Such an explicit specification is useful in the following cases.

  • Excluding some source files from the list of input source files
    CCFinderX (ccfx) has no capability to identify tool-generated source files, because there is no standard method for marking or identifying such files. (I am looking forward to javax.annotation.Generated in the Java programming language, or similar programming-language-level solutions.)
  • Including some source files that have a non-standard extension
    By default, a source file with a non-standard extension is not regarded as a target (by option -d of execution-mode d, or by the file search of execution-mode f). If you use such extensions (for example, .inl in VC++) and want to include those files in the target of clone detection, use a file list to specify them explicitly.
  • Modifying the order of the source files
    By default, the source files are ordered in a kind of lexical order, obtained by comparing the paths of the source files encoded in UTF-8. For example, when you want to place two directories near each other in a clone scatter plot, you can edit the file list.

Generating a file list

As in the preceding section, the source files are written in Java and the target directory is c:\target\src.

To find the Java source files in the target directory and save the file list as a file filelist.txt, type the following command line:

ccfx f java -a -l n c:\target\src -o filelist.txt

Here, option -a specifies that each file path is stored as an absolute path in the resulting file list. Option -l n adds a preprocessed-directory option to the file list; that is, a line describing an option -n is inserted as the first line of the file list. In a subsequent clone detection, the -n line in the file list works as if it were a command-line option -n of execution-mode d.

Clone detection itself works without these options (-a and -l n). However, they are recommended as preparation for displaying the clone-data file with GemX afterwards, and to prevent the preprocessed files from being placed in the same directories as the target source files.

The file list is a text file, so you can edit it with a text editor and freely add or delete names of source files. Naturally, any text file in the same format can be used as a file list, even if it was not generated by execution-mode f.

When a file list is ready, use option -i of execution-mode d, as in the following command line, to detect code clones in the source files listed in the file list:

ccfx d java -i filelist.txt

When you specify multiple file lists in the command line, ccfx works as if it had been given a single file list that is the concatenation of them:

ccfx d java -i filelist1.txt -i filelist2.txt

In addition to paths of source files and option -n, a file list can also contain option -is. When a file list includes a line consisting only of -is, the source files before that line and the source files after it belong to distinct file groups. For file groups, see the next section.
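
Since a file list is plain text, excluding files such as tool-generated sources can be scripted. The following Python sketch drops unwanted paths while keeping option lines in place. It assumes that every option line starts with "-" (as the -is line does); the helper name filter_file_list, the sample paths, and the "gen" directory convention are hypothetical.

```python
def filter_file_list(lines, exclude):
    """Drop source paths for which exclude(path) is true; keep option lines."""
    kept = []
    for line in lines:
        entry = line.rstrip("\n")
        # Assumption: option lines such as -is start with "-" and are kept.
        if entry.startswith("-") or not exclude(entry):
            kept.append(entry)
    return kept


# Hypothetical file list with two file groups separated by -is.
file_list = [
    r"c:\oldsrc\Main.java",
    r"c:\oldsrc\gen\Parser.java",
    "-is",
    r"c:\newsrc\Main.java",
    r"c:\newsrc\gen\Parser.java",
]


def is_generated(path):
    """Hypothetical convention: generated sources live under a 'gen' directory."""
    return "\\gen\\" in path


for entry in filter_file_list(file_list, is_generated):
    print(entry)
```

Writing the kept entries back to a text file, one per line, yields a file list that can be passed to option -i of execution-mode d.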

File group

File groups are used to separate the target source files into groups and to detect code clones only between the groups.

Execution-mode d has two options related to file groups. To separate source files into groups, use option -is. To detect code clones only between the groups, and not between two files in the same group or within a single file, use option -w.

Code clone detection between versions with file group

This subsection presents an example in which the target source code consists of two versions of a product, and code clones are detected between the versions (and not inside each version).

The source files of the older version are stored in a directory c:\oldsrc, and those of the newer version in c:\newsrc. The following command line detects code clones only between the versions:

ccfx d java -dn c:\oldsrc -is -dn c:\newsrc -w f-w-g+

Here, each argument means:

  • the first d means execution-mode d.
  • the next java means the target source file is written in Java.
  • the next -dn c:\oldsrc tells ccfx to search for source files under that directory, in this case the source files of the older version.
  • the next -is is a group separator, that is, the source files before this option and the source files after it belong to distinct groups.
  • the next -dn c:\newsrc tells ccfx to search for source files under that directory, in this case the source files of the newer version.
  • the last -w f-w-g+ means "do not detect code clones between files in the same file group" (f-), "do not detect code clones within a file" (w-), and "detect code clones between files from distinct file groups" (g+).

By comparing two versions in terms of code clones, you can analyze them from the viewpoint of similarity rather than difference. For example, you can observe the case where a code fragment was copied and pasted many times and has spread over the product, or the case where duplicated code in the older version has been cleaned up in the newer version.
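
The effect of the range specifier can be modeled in a few lines. The following Python sketch interprets a specifier such as f-w-g+ and decides whether a clone pair would be reported, given the file ID and group ID of each fragment. It is an illustrative model of the rules described above, not ccfx's implementation; the function names are hypothetical, and treating g+ as the default is an assumption (the documented default is f+w+).

```python
def parse_range_specifier(spec):
    """Parse e.g. 'f-w-g+' into {'w': False, 'f': False, 'g': True}."""
    # Assumption: all three ranges are enabled unless switched off.
    flags = {"w": True, "f": True, "g": True}
    if len(spec) % 2 != 0:
        raise ValueError("specifier must be a sequence of letter/sign pairs")
    for i in range(0, len(spec), 2):
        kind, sign = spec[i], spec[i + 1]
        if kind not in flags or sign not in ("+", "-"):
            raise ValueError("invalid range specifier: " + spec)
        flags[kind] = sign == "+"
    return flags


def is_reported(flags, left, right):
    """left and right are (file_id, group_id) pairs of the two code fragments."""
    if left[0] == right[0]:
        return flags["w"]  # both fragments in the same file
    if left[1] == right[1]:
        return flags["f"]  # different files in the same file group
    return flags["g"]      # files from distinct file groups


flags = parse_range_specifier("f-w-g+")
old_a, old_b, new_a = (1, 0), (2, 0), (3, 1)
print(is_reported(flags, old_a, old_a))  # within one file
print(is_reported(flags, old_a, old_b))  # same group, different files
print(is_reported(flags, old_a, new_a))  # between the two versions
```

With f-w-g+, only the last pair is reported, which is the behavior used for the comparison between versions.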

Appendix

Execution modes of ccfx

Mode  Name                  Short Description
d     Detection             Input: paths of target source files. Output: a clone-data file.
p     Pretty printing       Prints out contents of a clone-data file in a text format.
m     Metrics               Calculates and prints out metrics about each code clone or metrics about each source file, from a given clone-data file.
s     Filtering             Input: A clone-data file and a list of file IDs (or a list of code-clone IDs). Output: A clone-data file, which is filtered with the condition.
f     File-list generation  Searches source files from the specified directories.

Preprocess Script Names

Execution-mode d (detection of code clones) requires the name of a preprocess script as its first argument. The name of the preprocess script is also stored in the clone-data file, so you can look up the preprocess script of a given clone-data file with execution-mode p. The applicable names are:

Name of Preprocess Script  Programming Language  Extensions of Source Files
cobol                      Cobol                 .cbl, .cob, .cobol
cpp                        C/C++                 .h, .hh, .hpp, .hxx, .c, .cc, .cpp, .cxx
csharp                     C#                    .cs
java                       Java                  .java
visualbasic                Visual Basic          .vb, .bas, .frm
plaintext                  Text file             .txt

Options of execution-mode d

Execution-mode d has the following options, which change the conditions of code-clone detection. The list below contains the commonly used options.

  • -b number
    The minimum length of the detected code clones. The length is measured in tokens (i.e., metric LEN). The default is -b 50. Code fragments whose LEN is smaller than this value are not detected as code clones.
  • -t number
    The minimum number of kinds of tokens in code fragments (metric TKS). The default is -t 12. For example, the code fragment "A = 1; B = 1 + 2; C = 1 + 2 + 3; D = 1 + 2 + 3 + 4;" consists of tokens that each fall into one of only five kinds: identifier, "=", integer literal, "+", ";". As a result, this code fragment is not detected as a code clone with the default option -t 12.
  • -w range_specifier
    This option specifies whether to detect inner-file clones and/or inter-file clones. An inter-file clone is one whose two code fragments appear in two distinct source files. An inner-file clone is one whose two code fragments appear in the same source file. With argument f- (that is, -w f-), inter-file clones are not detected. With argument w- (that is, -w w-), inner-file clones are not detected. You can also give argument f+ or w+ explicitly to request detection of inter-file clones or inner-file clones, respectively. The default is -w f+w+, that is, detect both inter-file clones and inner-file clones. Option -w has one more parameter, g+ or g-, whose usage is shown in File group.
  • -dn directory
    This option is equivalent to specifying both option -d and option -n with the same directory.
  • -d directory
    This option specifies a target directory, that is, the target source files are searched for under the directory. When the directory is specified by an absolute path, each source file name in the clone data (output file) is stored as an absolute path. When the directory is specified by a relative path, each source file name is stored as a relative path. Note that GemX requires absolute paths to show the clones in a clone-data file.
  • -n directory
    This option specifies a directory for intermediate data files (preprocess-result files, or simply preprocessed files). When ccfx runs without option -n, it generates a preprocessed file (.ccfxprep) for each input source file and puts it in the same directory as the corresponding source file. The argument of option -n should be the directory of the source files or a parent directory of it. Otherwise, the option is ignored.
  • -i file_list
    When this option is specified, ccfx reads the paths of source files from the file list. The file list is a text file that contains the path of one source file on each line. Option -i is useful when you have to detect clones in only part of the source files in a directory, or in source files with non-standard extensions. See section File list for details.
  • --threads=number
    This option specifies the number of worker threads used in code-clone detection. On a multi-core CPU, this option can shorten the detection time.
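
The thresholds -b and -t can be illustrated on the example fragment from the -t entry. The following Python sketch uses a deliberately simplified tokenizer, which is not ccfx's preprocessing: it classifies tokens as identifier, integer literal, or one of the symbols "=", "+", ";", and then checks the fragment against the default thresholds. The function names are hypothetical. Under this simplified classification the fragment has 28 tokens of 5 kinds.

```python
import re

# Simplified token classes: identifiers, integer literals, and three symbols.
TOKEN_PATTERN = re.compile(
    r"\s*(?:(?P<identifier>[A-Za-z_]\w*)|(?P<integer>\d+)|(?P<symbol>[=+;]))"
)


def token_kinds(code):
    """Return the kind of each token; each symbol counts as its own kind."""
    kinds = []
    code = code.strip()
    pos = 0
    while pos < len(code):
        match = TOKEN_PATTERN.match(code, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        if match.lastgroup == "symbol":
            kinds.append(match.group("symbol"))
        else:
            kinds.append(match.lastgroup)
        pos = match.end()
    return kinds


def passes_thresholds(kinds, min_len=50, min_kinds=12):
    """Model of -b (minimum LEN) and -t (minimum TKS) with their defaults."""
    return len(kinds) >= min_len and len(set(kinds)) >= min_kinds


fragment = "A = 1; B = 1 + 2; C = 1 + 2 + 3; D = 1 + 2 + 3 + 4;"
kinds = token_kinds(fragment)
print(len(kinds), len(set(kinds)), passes_thresholds(kinds))
```

The fragment fails both default thresholds, which matches the statement that it is not detected as a code clone.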

Options of execution-mode p

Execution-mode p has some options, though not as many as mode d. This subsection presents some that are useful in scripting, for example when a script extracts statistical data from a clone-data file.

  • -l
    Extracts a list of the paths of the source files from a clone-data file.
  • -ln
    Extracts a list of the path and file ID of each source file from a clone-data file.
  • -a
    Reverse pretty printing. That is, generates a (binary) clone-data file from a text file that was produced by pretty printing.