The R source code comparison is based on similarity coefficients for the names used in R programs or expressions. Use cases would be the detection of
In the first case, detection of similar code sequences can lead to better code quality if similar code is embedded in a function rather than repeatedly in different places. In the second case, cheating is looked for.
The goal, however, is not perfect detection of similar code sequences, but rather to give clues as to where similar code sequences might be.
We have several steps to take:
A thanks to Masatoshi Nishimura to his blog post The Best Document Similarity Algorithm: A Beginner’s Guide which is very inspiring.
files <- ... # get file names from somehere, e.g. list.files()
prgs <- sourcecode(files, title=basename(files))
docs <- documents(prgs, type="names")
sim <- similarities(docs) # you may use alternatively tfidf()
dfsim <- matrix2dataframe(sim)
head(dfsim, n=25)
browse(prgs, dfsim, n=6) # creates and opens a HTML file
files <- ... # get file names from somehere, e.g. list.files()
# load all expressions with at least `minlines` lines
prgs <- sourcecode(files, title=basename(files), minlines=0)
docs <- documents(prgs, type="names")
sim <- similarities(docs) # you may use alternatively tfidf()
sim <- same_file(sim) # do not compare expressions within one file
dfsim <- matrix2dataframe(sim)
head(dfsim, n=25)
browse(prgs, dfsim, n=6) # creates and opens a HTML file
The makers of the package SimilaR (R Source Code Similarity Evaluation) have provided some sample files for testing:
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files))
#>
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/aa.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/aa1.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/bucketSort1.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/bucketSort1_addLines.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/bucketSort1_variables.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/isPrime2.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/isPrime2_addLines.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/isPrime2_callReverse.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/kendall4.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/kendall4_variables.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/kombinuj1.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/kombinuj1_variables.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/kwantyle1.R
#> /tmp/Rtmpfvwsak/Rinst9db96d3fd94c/rscc/examples/kwantyle1_variables.R
names(prgs)
#> [1] "aa.R" "aa1.R"
#> [3] "bucketSort1.R" "bucketSort1_addLines.R"
#> [5] "bucketSort1_variables.R" "isPrime2.R"
#> [7] "isPrime2_addLines.R" "isPrime2_callReverse.R"
#> [9] "kendall4.R" "kendall4_variables.R"
#> [11] "kombinuj1.R" "kombinuj1_variables.R"
#> [13] "kwantyle1.R" "kwantyle1_variables.R"
The parameter title
sets another title for the program instead of files
.
The parameter silent=TRUE
suppresses the output of the parsed files. If an error occurs during parsing, the file will not be loaded and will not be included in the following steps.
If you want to consider expressions and not the whole R file, you have to set the parameter minlines
. sourcecode
checks whether an expression in the source file has more than minlines
lines. If so, the expression is kept for further analysis. The name of the list items in prgs
is then title[number]
. For example, you could access the expression named prgs[["aa.R[1]"]]
.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), minlines=3, silent=TRUE)
names(prgs)
#> [1] "aa.R[1]" "aa1.R[1]"
#> [3] "bucketSort1.R[1]" "bucketSort1_addLines.R[1]"
#> [5] "bucketSort1_variables.R[1]" "isPrime2.R[1]"
#> [7] "isPrime2_addLines.R[1]" "isPrime2_callReverse.R[1]"
#> [9] "kendall4.R[1]" "kendall4_variables.R[1]"
#> [11] "kombinuj1.R[1]" "kombinuj1_variables.R[1]"
#> [13] "kwantyle1.R[1]" "kwantyle1_variables.R[1]"
The next step is to create documents, based on the names of variables, functions and operators from the parsed source codes.
docs <- documents(prgs)
# create term document frequency table
freq_table(docs)[1:8,1:8]
#> vword
#> vdoc asd asd1 asd2 asd3 asd4 asd5 asd6 asd7
#> aa.R[1] 1 0 0 0 0 0 0 0
#> aa1.R[1] 1 0 0 0 0 0 0 0
#> bucketSort1.R[1] 0 0 0 0 0 0 0 0
#> bucketSort1_addLines.R[1] 0 0 0 0 0 0 0 0
#> bucketSort1_variables.R[1] 0 0 0 0 0 0 0 0
#> isPrime2.R[1] 0 0 0 0 0 0 0 0
#> isPrime2_addLines.R[1] 0 0 0 0 0 0 0 0
#> isPrime2_callReverse.R[1] 0 1 1 1 1 1 1 1
With the type
parameter you can distinguish between different types of names and with ignore.case
the names are either treated case-insensitive or not.
type
With the type
parameter you can distinguish between different types of names:
cat(as.character(prgs[[1]])) # source code
#> asd <- function(x) {
#> for (i in x) {
#> cat(i)
#> x[i] <- 3
#> }
#> }
all.vars(prgs[[1]]) # type="v", default
#> [1] "asd" "i" "x"
all.names(prgs[[1]]) # type="n"
#> [1] "<-" "asd" "function" "{" "for" "i"
#> [7] "x" "{" "cat" "i" "<-" "["
#> [13] "x" "i"
setdiff(all.names(prgs[[1]]), all.vars(prgs[[1]])) # type="f"
#> [1] "<-" "function" "{" "for" "cat" "["
minlen
and ignore.case
With the parameter minlen
you can exclude names that are shorter than minlen
. The default is minlen=2
because the name of an index variable in loops often consists of only one letter, for example for (i in 1:n)
. ignore.case
is either TRUE
or FALSE
. If TRUE
(default), then names will case-insensitive, otherwise case-sensitive.
In cluster analysis has been developed several similarity coefficients. They use words and basically calculate the intersection of words in documents over the union of words.
To calculate similarity coefficients between all source text segments based on the names used:
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs)
similarities(docs)[1:8,1:8]
#> aa.R aa1.R bucketSort1.R bucketSort1_addLines.R
#> aa.R 1 1 0.0000000 0.0000000
#> aa1.R 1 1 0.0000000 0.0000000
#> bucketSort1.R 0 0 1.0000000 1.0000000
#> bucketSort1_addLines.R 0 0 1.0000000 1.0000000
#> bucketSort1_variables.R 0 0 0.1428571 0.1428571
#> isPrime2.R 0 0 0.0000000 0.0000000
#> isPrime2_addLines.R 0 0 0.0000000 0.0000000
#> isPrime2_callReverse.R 0 0 0.0000000 0.0000000
#> bucketSort1_variables.R isPrime2.R isPrime2_addLines.R
#> aa.R 0.0000000 0.0000000 0.0000000
#> aa1.R 0.0000000 0.0000000 0.0000000
#> bucketSort1.R 0.1428571 0.0000000 0.0000000
#> bucketSort1_addLines.R 0.1428571 0.0000000 0.0000000
#> bucketSort1_variables.R 1.0000000 0.0000000 0.0000000
#> isPrime2.R 0.0000000 1.0000000 1.0000000
#> isPrime2_addLines.R 0.0000000 1.0000000 1.0000000
#> isPrime2_callReverse.R 0.0000000 0.1111111 0.1111111
#> isPrime2_callReverse.R
#> aa.R 0.0000000
#> aa1.R 0.0000000
#> bucketSort1.R 0.0000000
#> bucketSort1_addLines.R 0.0000000
#> bucketSort1_variables.R 0.0000000
#> isPrime2.R 0.1111111
#> isPrime2_addLines.R 0.1111111
#> isPrime2_callReverse.R 1.0000000
This calculates a matrix of Jaccard coefficients based on the variable names.
If the Jaccard coefficient is used then each entry can be interpreted as the percentage:
\[\frac{\text{number of words used in both documents}}{\text{number of all words used in both documents}}.\] The interpretation will be different if another similarity coefficient is used! But in any case, a higher similarity coefficient corresponds to a larger proportion of variable names in both files (or expressions).
coeff
(similarity)With the parameter coeff
a certain similarity coefficient can be calculated (default: jaccard
).
If you specify two sets with unique names set1
, set2
and one set setfull
with predefined names, four numbers will be calculated (default: setfull <- unique(c(set1,set2))
):
inset1 <- setfull %in% unique(set1)
inset2 <- setfull %in% unique(set2)
p <- length(setfull)
n11 <- sum(inset1 & inset2)
n10 <- sum(inset1 & !inset2)
n01 <- sum(!inset1 & inset2)
n00 <- sum(!inset1 & !inset2)
The following coefficients can be calculated:
braun = n11/max(n01+n11, n10+n11)
,dice = 2*n11/(n01+n10+2*n11)
,jaccard = n11/(n01+n10+n11)
(default),kappa = 1/(1+p/2*(n01+n10)/(n00*n11-n01*n10))
,kulczynski = n11/(n01+n10)
,matching = (n00+n11)/p
,ochiai = n11/sqrt((n11+n10)*(n11+n10))
,phi = (n11*n00-n10*n01)/sqrt((n11+n10)*(n11+n10)*(n00+n10)*(n00+n10))
,russelrao = n11/p
,simpson = n11/min(n01+n11, n10+n11)
,sneath = n11/(n11+2*n01+2*n10)
,tanimoto = (n11+n00)/(n11+2*n01+2*n10+n00)
, andyule = (n11*n00-n01*n10)/(n11*n00-n01*n10)
.If a coefficient name is not found or a NaN
is generated then a zero is returned.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs)
similarities(docs, coeff="m")[1:8,1:8]
#> aa.R aa1.R bucketSort1.R bucketSort1_addLines.R
#> aa.R 1 1 0.0000000 0.0000000
#> aa1.R 1 1 0.0000000 0.0000000
#> bucketSort1.R 0 0 1.0000000 1.0000000
#> bucketSort1_addLines.R 0 0 1.0000000 1.0000000
#> bucketSort1_variables.R 0 0 0.1428571 0.1428571
#> isPrime2.R 0 0 0.0000000 0.0000000
#> isPrime2_addLines.R 0 0 0.0000000 0.0000000
#> isPrime2_callReverse.R 0 0 0.0000000 0.0000000
#> bucketSort1_variables.R isPrime2.R isPrime2_addLines.R
#> aa.R 0.0000000 0.0000000 0.0000000
#> aa1.R 0.0000000 0.0000000 0.0000000
#> bucketSort1.R 0.1428571 0.0000000 0.0000000
#> bucketSort1_addLines.R 0.1428571 0.0000000 0.0000000
#> bucketSort1_variables.R 1.0000000 0.0000000 0.0000000
#> isPrime2.R 0.0000000 1.0000000 1.0000000
#> isPrime2_addLines.R 0.0000000 1.0000000 1.0000000
#> isPrime2_callReverse.R 0.0000000 0.1111111 0.1111111
#> isPrime2_callReverse.R
#> aa.R 0.0000000
#> aa1.R 0.0000000
#> bucketSort1.R 0.0000000
#> bucketSort1_addLines.R 0.0000000
#> bucketSort1_variables.R 0.0000000
#> isPrime2.R 0.1111111
#> isPrime2_addLines.R 0.1111111
#> isPrime2_callReverse.R 1.0000000
Another possibility to find similar source codes as cosine of angles of term frequency–inverse document frequency:
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs)
tfidf(docs)[1:8,1:8]
#>
#> aa.R aa1.R bucketSort1.R bucketSort1_addLines.R
#> aa.R 1 1 0.0000000 0.0000000
#> aa1.R 1 1 0.0000000 0.0000000
#> bucketSort1.R 0 0 1.0000000 1.0000000
#> bucketSort1_addLines.R 0 0 1.0000000 1.0000000
#> bucketSort1_variables.R 0 0 0.3128966 0.3128966
#> isPrime2.R 0 0 0.0000000 0.0000000
#> isPrime2_addLines.R 0 0 0.0000000 0.0000000
#> isPrime2_callReverse.R 0 0 0.0000000 0.0000000
#>
#> bucketSort1_variables.R isPrime2.R
#> aa.R 0.0000000 0.0000000
#> aa1.R 0.0000000 0.0000000
#> bucketSort1.R 0.3128966 0.0000000
#> bucketSort1_addLines.R 0.3128966 0.0000000
#> bucketSort1_variables.R 1.0000000 0.0000000
#> isPrime2.R 0.0000000 1.0000000
#> isPrime2_addLines.R 0.0000000 1.0000000
#> isPrime2_callReverse.R 0.0000000 0.2021137
#>
#> isPrime2_addLines.R isPrime2_callReverse.R
#> aa.R 0.0000000 0.0000000
#> aa1.R 0.0000000 0.0000000
#> bucketSort1.R 0.0000000 0.0000000
#> bucketSort1_addLines.R 0.0000000 0.0000000
#> bucketSort1_variables.R 0.0000000 0.0000000
#> isPrime2.R 1.0000000 0.2021137
#> isPrime2_addLines.R 1.0000000 0.2021137
#> isPrime2_callReverse.R 0.2021137 1.0000000
The attribute tfidf
of the result matrix contains the term frequency–inverse document frequency matrix.
Since the matrix of coefficients can not easily be overviewed we have must transform them for visualisation.
matrix2dataframe
With matrix2dataframe
you transform the matrix of coefficients to a data frame where the first column is the row index, the second column the column index and the third column the the matrix entry.
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs)
simm <- similarities(docs, coeff="m")
simdf <- matrix2dataframe(simm)
head(simdf, 10)
#> row col matching
#> 1 aa.R aa1.R 1.0000000
#> 2 bucketSort1.R bucketSort1_addLines.R 1.0000000
#> 3 isPrime2.R isPrime2_addLines.R 1.0000000
#> 4 kombinuj1.R kombinuj1_variables.R 0.3000000
#> 5 kwantyle1.R kwantyle1_variables.R 0.2000000
#> 6 bucketSort1.R bucketSort1_variables.R 0.1428571
#> 7 bucketSort1_addLines.R bucketSort1_variables.R 0.1428571
#> 8 kendall4.R kendall4_variables.R 0.1428571
#> 9 isPrime2.R isPrime2_callReverse.R 0.1111111
#> 10 isPrime2_addLines.R isPrime2_callReverse.R 0.1111111
As you can see the lines are ordered by decreasing values. The parameter decreasing=FALSE
will order it by ascending order.
The output can be interpreted line by line:
aa.R
and aa1.R
are identical. Each variable name used in aa.R
is also used in aa1.R
and vice versa.bucketSort1_addLines.R
and bucketSort1.R
are identical.isPrime2_addLines.R
and isPrime2.R
are identical.kombinuj1_variables.R
and kombinuj1.R
are not identical. In both files, the variable names overlap by 30%.In case of a symmetric similarity matrix only the upper triangle is considered otherwise the whole matrix.
You may use a stripchart
to see all similarity coefficients:
stripchart(simdf[,3], "jitter", pch=19, xlab=names(simdf)[3])
as_igraph
In a second step you can plot the coefficients in a diagram, where thicker edges correspond to higher similarity coefficients.
library("igraph")
#>
#> Attache Paket: 'igraph'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#>
#> decompose, spectrum
#> Das folgende Objekt ist maskiert 'package:base':
#>
#> union
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files), silent=TRUE)
docs <- documents(prgs, type="n", minlen=3)
simm <- similarities(docs)
graph <- as_igraph(simm, diag=FALSE)
# color all edges wit a large similarity coefficients in red
E(graph)$color <- ifelse(E(graph)$weight>0.4, "red", "grey")
plot(graph, edge.width=1+3*E(graph)$weight)
box()