Tutorial of ccfx
Author: Toshihiro Kamiya
Created: 2008/Nov/6
Contact: info@ccfinder.net
Copyright: 2005-2010 © Toshihiro Kamiya. All rights reserved.
- Outline of command line of ccfx
- Detection and printing of code clones
- Metrics and filtering
- File list
- File group
- Appendix
Outline of command line of ccfx
ccfx is a command-line (CLI) tool, that is, it is meant to be invoked by the user at the command line. Its functions include detecting code clones and filtering them with clone-set metrics or file metrics.
ccfx also works as the back end of the GUI tool GemX: GemX internally calls ccfx to detect code clones and analyze them.
Invoking ccfx directly at the command line gives you access to all of its options, some of which are unavailable via GemX. You can also invoke ccfx from a batch file in order to combine it with other command-line tools.
The first argument of the ccfx command line is an execution-mode specifier. The available execution modes are d, p, m, s, and f. For the details of each mode, see Execution modes of ccfx. You can obtain the list of execution modes with the following command line:
ccfx -h
You can also obtain the help message of each execution mode with command lines such as:
ccfx d -h
ccfx m -h
The following sections present each execution mode, along with a typical scenario in clone detection and analysis.
Detection and printing of code clones
This section shows how to detect code clones in the source files stored under a given directory.
In this example, the source files are written in the Java programming language, and the target directory is c:\target\src.
To detect code clones with ccfx, use execution-mode d (d stands for detection). The following command line makes ccfx detect code clones in the source files and store the result in a file a.ccfxd in the current directory:
ccfx d java -dn c:\target\src
The argument java means that the target source files are written in Java. That is, with this argument, ccfx searches the target directories (including their sub-directories) for files with the extension ".java" and applies the preprocessing for the Java programming language. (The preprocessing is a kind of normalization that depends on the syntax of each programming language and improves the accuracy of clone detection. The detection algorithm itself is independent of the programming language; a single detection algorithm is used for source code written in any language.) The Appendix lists the preprocess scripts, their programming languages, and the extensions of source files in the section Preprocess Script Names.
The option -dn specifies, roughly speaking, the directory that stores the target source files. By default, the detected code clones are stored in a file a.ccfxd. To specify the name of the output file, use option -o file. The details of the options are shown in Options of execution-mode d.
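As a rough illustration, the detection command described above could be assembled and run from a Python script as follows. This is only a sketch: the helper names build_detect_command and detect are hypothetical, and placing -o after -dn is an assumption. Only the options themselves come from this tutorial.

```python
import subprocess


def build_detect_command(script, directory, output=None):
    """Return the argv list for a ccfx detection run (execution-mode d)."""
    argv = ["ccfx", "d", script, "-dn", directory]
    if output is not None:
        # -o names the clone-data file; without it ccfx writes a.ccfxd.
        # Putting -o last is an assumption of this sketch.
        argv += ["-o", output]
    return argv


def detect(script, directory, output=None):
    """Run the detection. Requires ccfx to be installed and on PATH."""
    subprocess.run(build_detect_command(script, directory, output), check=True)


# Same command line as in the text, expressed as an argument list.
example = build_detect_command("java", r"c:\target\src")
print(" ".join(example))
```

Building the command as a list rather than a single string avoids quoting problems when the target directory contains spaces.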
The clone-data file (output file) is a binary file. To print its contents as text, use execution-mode p of ccfx (p stands for pretty printing):
ccfx p a.ccfxd
Execution-mode p has some options that extract part of the information from a clone-data file. For example, each target source file has a file ID (a kind of serial number), and the following command line prints a table of each file path and its file ID:
ccfx p -ln a.ccfxd
The details of the options are shown in Options of execution-mode p.
Metrics and filtering
ccfx can calculate several kinds of metrics for code clones and for source files. You can use these metrics to filter code clones or source files.
As in the preceding section, the source files are written in Java and the target directory is c:\target\src. In addition, code-clone detection has already been done, and the detected clones are stored in a clone-data file a.ccfxd.
To calculate metrics, use execution-mode m. There are two categories of metrics: clone metrics and file metrics. The command line below extracts clone metrics:
ccfx m a.ccfxd -c
The command line below extracts file metrics:
ccfx m a.ccfxd -f
In execution-mode m command lines, you have to type the clone-data file before option -c or -f.
The output files are specified with option -o file. You can specify option -o for each of the options -c and -f, so the command line below extracts both clone metrics and file metrics:
ccfx m a.ccfxd -c -o clonemetrics.tsv -f -o filemetrics.tsv
Both clone metrics and file metrics are printed as tab-separated text, so you can view the values by opening the files in a spreadsheet application.
In the output of file metrics, each input source file is denoted by its file ID, that is, the serial number of the target file (as described previously, use the command line ccfx p -ln a.ccfxd to check the IDs).
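Besides a spreadsheet, the tab-separated output can be read by a script. The sketch below uses Python's standard csv module and relies on the column names being printed in the first line of the metrics file. The function name read_metrics is hypothetical. FID and CVR are column names this tutorial uses for file metrics, but the sample values are made up for illustration.

```python
import csv
import io


def read_metrics(text):
    """Parse tab-separated metrics text into a list of {column: value} rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Made-up sample in the shape described above: a header line with the
# column names, then one row per source file. Values are illustrative only.
SAMPLE = "FID\tCVR\n1\t0.25\n2\t0.0\n"

rows = read_metrics(SAMPLE)
# All values arrive as strings, so convert before comparing numerically.
covered = [row["FID"] for row in rows if float(row["CVR"]) > 0.0]
print(covered)
```

To read a real metrics file, pass the contents of the file (for example filemetrics.tsv) to read_metrics instead of the sample string.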
Filtering a clone-data file with the metrics takes two steps. First, make a list of the file IDs (or clone IDs) that should remain in the data. Second, modify the clone-data file using the list.
As an example, consider removing the source files that are not related to any code clone. Assume that the file metrics have already been calculated and saved as a file filemetrics.tsv. The predicate that identifies the files to keep is "does the file include a code fragment of a code clone?", that is, CVR > 0.0 as an expression using the metrics.
The command line below extracts the set of file IDs that satisfy this predicate:
picosel from filemetrics.tsv select FID where CVR .gt. 0.0 > remainingfiles.txt
In this command line, FID is the name of a column in the file metrics (all column names are printed in the first line of the metrics file). CVR is the metric that shows the ratio of tokens covered by any code clone. The .gt. is the operator "greater than".
The expression after where in the command line is the condition for the files to keep. The following operators can be used in the expression.
Operator | Meaning |
---|---|
.eq. | == equal to |
.ge. | ≥ greater than or equal to |
.gt. | > greater than |
.le. | ≤ less than or equal to |
.lt. | < less than |
Also, and is used to combine conditions. For example, the following command line selects the source files that consist heavily of repeated sections (RNR < 0.1) and share a large amount of code clones with other files (RSA > 0.9):
picosel from filemetrics.tsv select FID where RNR .lt. 0.1 and RSA .gt. 0.9 > remainingfiles.txt
This example filters source files with file metrics. To filter code clones with clone metrics, replace FID with CID in the above command line and use a condition written with clone metrics.
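To make the semantics of the where expression concrete, here is a small Python sketch that applies the same operators to rows of metrics. It is not how picosel is implemented. The names OPERATORS and select are hypothetical and the metric values are made up; the sketch only mirrors the operators listed in the table above, with the conditions combined by "and".

```python
import operator

# The comparison operators from the table above, mapped to Python functions.
OPERATORS = {
    ".eq.": operator.eq,
    ".ge.": operator.ge,
    ".gt.": operator.gt,
    ".le.": operator.le,
    ".lt.": operator.lt,
}


def select(rows, column, conditions):
    """Return row[column] for every row that satisfies all conditions.

    Each condition is a (metric, operator, threshold) triple; the triples
    are combined with "and".
    """
    selected = []
    for row in rows:
        if all(OPERATORS[op](float(row[metric]), value)
               for metric, op, value in conditions):
            selected.append(row[column])
    return selected


# Made-up metric values, for illustration only.
rows = [
    {"FID": "1", "RNR": "0.05", "RSA": "0.95"},
    {"FID": "2", "RNR": "0.50", "RSA": "0.95"},
    {"FID": "3", "RNR": "0.05", "RSA": "0.10"},
]
# Equivalent of: select FID where RNR .lt. 0.1 and RSA .gt. 0.9
print(select(rows, "FID", [("RNR", ".lt.", 0.1), ("RSA", ".gt.", 0.9)]))
```

With the sample rows, only the file with ID 1 satisfies both conditions.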
When the file remainingfiles.txt has been generated, the first step is done. In the second step, this file is used to determine which source files (or code clones) remain.
Execution-mode s performs the filtering with the list of file IDs (or clone IDs) to keep (s stands for subset or scope).
The following command line filters by file ID and saves the result to a file filtered.ccfxd:
ccfx s a.ccfxd -o filtered.ccfxd -fi remainingfiles.txt
Here, the option -fi file keeps the source files whose file IDs appear in the given file (remainingfiles.txt) and removes the other source files from the clone-data file.
To filter by clone ID, use option -ci file in place of option -fi file.
Filtering by file metrics: filter the input clone-data file a.ccfxd and save the result to a clone-data file filtered.ccfxd:
ccfx m a.ccfxd -f -o filemetrics.tsv
picosel -o remainfiles.txt from filemetrics.tsv select FID where "CONDITION"
ccfx s a.ccfxd -o filtered.ccfxd -fi remainfiles.txt
Filtering by clone metrics: filter the input clone-data file a.ccfxd and save the result to a clone-data file filtered.ccfxd:
ccfx m a.ccfxd -c -o clonemetrics.tsv
picosel -o remainclones.txt from clonemetrics.tsv select CID where "CONDITION"
ccfx s a.ccfxd -o filtered.ccfxd -ci remainclones.txt
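The two three-step recipes above differ only in a few names, which the following Python sketch makes explicit by generating the command lines for either case. The function name filtering_commands is hypothetical, and passing the condition to picosel as separate arguments is an assumption based on the unquoted examples earlier in this section.

```python
def filtering_commands(clone_data, by, condition, output="filtered.ccfxd"):
    """Return the three command lines of a metrics-based filtering run.

    by is "file" (file metrics, FID, -fi) or "clone" (clone metrics, CID, -ci).
    condition is a picosel expression such as "CVR .gt. 0.0".
    """
    if by == "file":
        metric_opt, id_column, subset_opt = "-f", "FID", "-fi"
    elif by == "clone":
        metric_opt, id_column, subset_opt = "-c", "CID", "-ci"
    else:
        raise ValueError("by must be 'file' or 'clone'")
    metrics_file = by + "metrics.tsv"   # filemetrics.tsv or clonemetrics.tsv
    id_list = "remain" + by + "s.txt"   # remainfiles.txt or remainclones.txt
    return [
        # Step 1: calculate the metrics.
        ["ccfx", "m", clone_data, metric_opt, "-o", metrics_file],
        # Step 2: select the IDs to keep (condition split into arguments,
        # which is an assumption of this sketch).
        ["picosel", "-o", id_list, "from", metrics_file,
         "select", id_column, "where"] + condition.split(),
        # Step 3: keep only the selected files or clones.
        ["ccfx", "s", clone_data, "-o", output, subset_opt, id_list],
    ]


for argv in filtering_commands("a.ccfxd", "file", "CVR .gt. 0.0"):
    print(" ".join(argv))
```

Each returned list could be passed to subprocess.run to execute the step, provided ccfx and picosel are installed.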
File list
This section shows how to generate a file list (that is, a list of the input source files of code-clone detection) and how to use it in the detection and analysis of code clones.
A file list specifies the paths of the target source files explicitly, one by one. Such an explicit specification is useful in the following cases.
- Excluding some source files from the list of input source files
  ccfx (CCFinderX) cannot identify tool-generated source files, because there is no standard method of marking or identifying such files. (I am looking forward to javax.annotation.Generated in the Java programming language, or similar language-level solutions.)
- Including source files that have a non-standard extension
  By default, a source file with a non-standard extension is not regarded as a target (by option -d of execution-mode d, or by the file search of execution-mode f). If you use such extensions (for example, .inl in VC++) and want to include those files in clone detection, use a file list to specify them explicitly.
- Changing the order of source files
  By default, source files are ordered in a kind of lexical order, obtained by comparing their paths encoded in UTF-8. For example, when you want to place two directories close to each other in a clone scatter plot, you can edit the file list.
As in the preceding section, the source files are written in Java and the target directory is c:\target\src.
To find the Java source files in the target directory and save the file list as a file filelist.txt, type the following command line:
ccfx f java -a -l n c:\target\src -o filelist.txt
Here, the option -a stores each file path as an absolute path in the resulting file list. The option -l n adds a preprocessed-directory option to the file list, that is, a line describing an option -n is inserted as the first line of the file list. In a later clone detection, the -n line in a file list works as if it were the command-line option -n of execution-mode d.
Clone detection itself works without these options (-a and -l n). However, they are recommended as a preparation for displaying the clone-data file with GemX afterwards, and in order to keep the preprocessed files out of the directories of the target source files.
The file list is a text file, so you can edit it with a text editor and freely add or delete names of source files. Naturally, any text file in the same format can be used as a file list, even if it was not generated by execution-mode f.
When a file list is ready, use option -i of execution-mode d, as in the following command line, to detect code clones in the source files listed in the file list:
ccfx d java -i filelist.txt
When you specify multiple file lists in the command line, ccfx works as if it were given a single file list that is their concatenation:
ccfx d java -i filelist1.txt -i filelist2.txt
Besides paths of source files and the option -n, a file list can also contain the option -is. When a file list includes a line that consists only of -is, the source files before the line and the source files after the line belong to distinct file groups. For file groups, see the next section.
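Because a file list is plain text with one path per line, it can also be generated by a script. The Python sketch below writes the paths of several groups and separates the groups with a line containing only -is. The function name file_list_text and the example paths are hypothetical, and the sketch leaves out the optional -n line because its exact format is not shown in this tutorial.

```python
def file_list_text(groups):
    """Return file-list text: one path per line, groups separated by "-is"."""
    lines = []
    for index, paths in enumerate(groups):
        if index > 0:
            # A line consisting only of -is starts a new file group.
            lines.append("-is")
        lines.extend(paths)
    return "".join(line + "\n" for line in lines)


# Hypothetical paths, for illustration only.
old_version = [r"c:\oldsrc\Main.java"]
new_version = [r"c:\newsrc\Main.java", r"c:\newsrc\Util.java"]
text = file_list_text([old_version, new_version])
print(text, end="")
```

The resulting text can be saved, for example as filelist.txt, and passed to execution-mode d with option -i. Saving it in UTF-8 seems a reasonable choice given that ccfx compares paths encoded in UTF-8, but this is an assumption.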
File group
File groups are used to separate the target source files into groups and to detect code clones only between the groups.
Execution-mode d has two options related to file groups. To separate source files into groups, use option -is. To detect code clones only between the groups, and not between two files in the same group or within a file, use option -w.
This subsection presents an example in which the target source code consists of two versions of a product, and code clones are detected between the versions (and not inside each version).
The source files of the older version are stored in a directory c:\oldsrc, and those of the newer version in c:\newsrc. The following command line detects code clones only between the versions:
ccfx d java -dn c:\oldsrc -is -dn c:\newsrc -w f-w-g+
Here, each argument means:
- the first d means execution-mode d.
- the next java means that the target source files are written in Java.
- the next -dn c:\oldsrc specifies the directory in which to search for source files; in this case, the older version of the source files.
- the next -is is a group separator, that is, the source files before this option and the source files after this option belong to distinct groups.
- the next -dn c:\newsrc specifies the directory in which to search for source files; in this case, the newer version of the source files.
- the last -w f-w-g+ means "do not detect code clones between files in the same file group" (f-), "do not detect code clones within a file" (w-), and "detect code clones between files from distinct file groups" (g+).
By comparing two versions with code clones, you can analyze them from the viewpoint of similarity rather than difference. For example, you can observe the case where a code fragment was copied and pasted many times and has spread over the product, or the case where duplicated code in the older version has been cleaned up in the newer version.
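The argument of option -w is a sequence of letter and sign pairs. The Python sketch below splits such a range specifier into its flags, which can help when generating or checking command lines in a script. The function name parse_range_specifier is hypothetical, and treating the order of the pairs as irrelevant is an assumption of this sketch rather than documented ccfx behavior.

```python
def parse_range_specifier(spec):
    """Parse a -w argument such as "f-w-g+" into {letter: enabled}.

    w: clones within a file, f: clones between files,
    g: clones between files of distinct file groups.
    """
    if len(spec) % 2 != 0:
        raise ValueError("expected letter/sign pairs: %r" % spec)
    flags = {}
    for i in range(0, len(spec), 2):
        letter, sign = spec[i], spec[i + 1]
        if letter not in "fwg" or sign not in "+-":
            raise ValueError("unknown pair %r in %r" % (letter + sign, spec))
        flags[letter] = (sign == "+")
    return flags


print(parse_range_specifier("f-w-g+"))
```

For the specifier used in the example above, the result marks f and w as disabled and g as enabled.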
Appendix
Execution modes of ccfx
Execution Mode | Short | Description |
---|---|---|
d | Detection | Input: paths of target source files. Output: a clone-data file. |
p | Pretty printing | Prints out the contents of a clone-data file in a text format. |
m | Metrics | Calculates and prints out metrics about each code clone or about each source file, from a given clone-data file. |
s | Filtering | Input: a clone-data file and a list of file IDs (or a list of code-clone IDs). Output: a clone-data file filtered with the condition. |
f | File-list generation | Searches for source files in the specified directories. |
Preprocess Script Names
Execution-mode d (detection of code clones) requires the name of a preprocess script as its first argument. The name of the preprocess script is also stored in the clone-data file, so you can check which preprocess script was used for a given clone-data file with execution-mode p. The applicable names are:
Name of Preprocess Script | Programming Language | Extensions of Source Files |
---|---|---|
cobol | Cobol | .cbl, .cob, .cobol |
cpp | C/C++ | .h, .hh, .hpp, .hxx, .c, .cc, .cpp, .cxx |
csharp | C# | .cs |
java | Java | .java |
visualbasic | Visual Basic | .vb, .bas, .frm |
plaintext | Text file | .txt |
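When a script has to decide which preprocess script applies to a given source file, the table above can be turned into a lookup. The following Python sketch does this. The names PREPROCESS_SCRIPTS and script_for are hypothetical, and comparing extensions case-insensitively is an assumption of the sketch, not documented ccfx behavior.

```python
import os

# Preprocess script names and extensions, copied from the table above.
PREPROCESS_SCRIPTS = {
    "cobol": [".cbl", ".cob", ".cobol"],
    "cpp": [".h", ".hh", ".hpp", ".hxx", ".c", ".cc", ".cpp", ".cxx"],
    "csharp": [".cs"],
    "java": [".java"],
    "visualbasic": [".vb", ".bas", ".frm"],
    "plaintext": [".txt"],
}


def script_for(path):
    """Return the preprocess script whose extensions match path, or None."""
    # Lower-casing the extension is an assumption of this sketch.
    extension = os.path.splitext(path)[1].lower()
    for script, extensions in PREPROCESS_SCRIPTS.items():
        if extension in extensions:
            return script
    return None


print(script_for("Main.java"))
```

A file with an extension that is not in the table, such as .inl, yields None; such files are the ones that have to be given to ccfx through a file list.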
Options of execution-mode d
Execution-mode d has the following options to change the conditions of code-clone detection. The list below contains the commonly used options.
- -b number
  The minimum length of the detected code clones. The unit of length is the token (i.e., metric LEN). The default is -b 50. Code fragments whose LEN is smaller than this value are not detected as code clones.
- -t number
  The minimum number of kinds of tokens in a code fragment (metric TKS). The default is -t 12. For example, the code fragment "A = 1; B = 1 + 2; C = 1 + 2 + 3; D = 1 + 2 + 3 + 4;" contains only a few kinds of tokens (identifier, "=", "+", integer literal, ";"). As a result, this code fragment is not detected as a code clone with the default option -t 12.
- -w range_specifier
  This option specifies whether to detect inner-file clones and/or inter-file clones. An inter-file clone is a code clone whose two code fragments appear in two distinct source files. An inner-file clone is a code clone whose two code fragments appear in the same source file. With the argument f- (that is, -w f-), inter-file clones are not detected. With the argument w- (that is, -w w-), inner-file clones are not detected. You can also give the argument f+ or w+ explicitly in order to detect inter-file clones or inner-file clones, respectively. The default is -w f+w+, that is, detect both inter-file clones and inner-file clones. Option -w has yet another parameter, g+ or g-, whose usage is shown in File group.
- -dn directory
  This option specifies both option -d and option -n with the same directory.
- -d directory
  This option specifies a target directory, that is, the target source files are searched for under the directory. When the directory is given as an absolute path, the name of each source file in the clone data (output file) is stored as an absolute path. When the directory is given as a relative path, each name is stored as a relative path. Note that GemX requires absolute paths to show the clones in a clone-data file.
- -n directory
  This option specifies a directory for intermediate data files (preprocess-result files, or simply preprocessed files). When ccfx runs without option -n, it generates a preprocessed file (.ccfxprep) for each input source file and puts it in the same directory as the corresponding source file. The argument of option -n should be the directory of the source files or its parent directory. Otherwise, the option is ignored.
- -i file_list
  When this option is specified, ccfx reads the paths of source files from the file list. The file list is a text file that contains the path of one source file in each line. Option -i is useful when you have to detect clones in only part of the source files in a directory, or in source files that have special extensions. See section File list for the details.
- --threads=number
  This option specifies the number of worker threads in code-clone detection. On a multi-core CPU, this option can shorten the detection time.
Options of execution-mode p
Execution-mode p has some options, though not as many as mode d. This sub-section presents some of them that are useful in scripting, for example when a script extracts some kind of statistical data from a clone-data file.
- -l
  Extracts a list of the paths of the source files from a clone-data file.
- -ln
  Extracts a list of the path and file ID of each source file from a clone-data file.
- -a
  Reverse pretty printing. That is, generates a (binary) clone-data file from a text file that was generated by pretty printing.
© 2009-2010 AIST
© 2018 Jacek Banaszczyk