Duplicates (size: 250MB)

Download link (20MB xz-compressed).
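Once downloaded, the archive has to be decompressed before it can be opened as a database. A minimal sketch using only the Python standard library follows; the file names `duplicates.db.xz` and `duplicates.db` are assumptions for illustration, so substitute whatever name the downloaded file actually has.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src: Path, dst: Path) -> Path:
    """Stream-decompress an .xz archive to dst without loading it all into memory."""
    with lzma.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dst


# Example call (hypothetical file names, adjust to your download):
# decompress_xz(Path("duplicates.db.xz"), Path("duplicates.db"))
```

Streaming through `shutil.copyfileobj` keeps memory use flat, which matters because the decompressed database is far larger than the archive.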

The Duplicates dataset consists of two parts:

  1. 1989 labeled pairs of Java files.
  2. 633 labeled pairs of Java functions.

The pairs were labeled by several source{d} employees as "identical", "similar" or "different" in February 2018. We used the src-d/code-annotation web application to perform the labeling. The dataset was created to tune the hyperparameters of src-d/apollo, which was the proof of concept for src-d/gemini.

Code similarity is quite subjective, and human labelers may contradict each other in some cases. We've set 3 categories instead of 2 to make the choice easier.
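Because labelers can disagree on the same pair, a consumer of the labels may want a single label per pair. One common option, shown here only as a sketch and not as the method used by the authors, is a majority vote that leaves ties unresolved. The input shape, a sequence of `(pair_id, label)` tuples, is an assumption for illustration and is not the database schema.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def majority_labels(votes: Iterable[Tuple[int, str]]) -> Dict[int, Optional[str]]:
    """Return the most frequent label per pair, or None when the top labels tie."""
    by_pair = defaultdict(list)
    for pair_id, label in votes:
        by_pair[pair_id].append(label)

    result: Dict[int, Optional[str]] = {}
    for pair_id, labels in by_pair.items():
        ranked = Counter(labels).most_common(2)
        # A tie exists when the two most frequent labels have equal counts.
        tie = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        result[pair_id] = None if tie else ranked[0][0]
    return result
```

Returning `None` on a tie makes the contradictory pairs easy to filter out or inspect by hand instead of silently picking one side.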

Format

The dataset is a SQLite 3 database; the schema is shown below.

[Figure: db schema]

There are 4 tables:

  1. experiments - the labeling sessions. There are only two: files and functions.
  2. users - the people who labeled the pairs of files and functions.
  3. pairs - the data for each pair, including the code strings and UASTv1-s.
  4. assignments - the labels per person per experiment.
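The tables can also be inspected directly with Python's built-in `sqlite3` module. The sketch below makes no assumption about column names: it reads the table list from `sqlite_master` and the columns from `PRAGMA table_info`. The database path in the commented usage line is hypothetical.

```python
import sqlite3
from typing import Dict, List


def describe_tables(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Map each table name in the database to the list of its column names."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema: Dict[str, List[str]] = {}
    for (name,) in tables:
        # Quote the identifier, since PRAGMA arguments cannot be bound as parameters.
        quoted = '"' + name.replace('"', '""') + '"'
        # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[name] = [column[1] for column in columns]
    return schema


# Example call (hypothetical path to the decompressed database):
# print(describe_tables(sqlite3.connect("duplicates.db")))
```

Note that `sqlite3.connect` creates an empty database if the path does not exist, so an empty result usually means the path is wrong.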

Sample code

You need Python 3 with the dependencies installed via pip3 install -r requirements.txt.

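from duplicates import DuplicatesDataset

# Point this at your local copy of the decompressed database.
ds = DuplicatesDataset("/Users/sourced/Desktop/duplicates.db")
print(ds.experiments)  # the labeling sessions
print(ds.users)  # the people who labeled the pairs
print(len(ds.assignments))  # number of assignments
print(len(ds.pairs))  # number of pairs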

Origin

The procedure for choosing the files was developed in the included notebooks.

Limitations

About 4 active human reviewers did the labeling. They were from the same company and talked to each other, so the labels may be biased. Code duplication is subjective anyway.

License

Code: MIT. Labels: Open Data Commons Open Database License (ODbL). Actual file contents © their authors.