Awesome Empirical Software Engineering – Massive Collection of Resources

A curated repository of data sets and tools that can be used for conducting evidence-based, data-driven research on software systems.
This research approach is often termed experimental or empirical software engineering.
Many of the data sets can also be useful in research using search-based software engineering methods.
The repository is named after the Mining Software Repositories (MSR) conference series.
For examples of such work see the MSR conference’s Hall of Fame.



Data Sets

  • AndroidTimeMachine – Graph-based dataset of commit history of 8,431 real-world Android apps.
  • AndroZoo – Collection of Android Applications.
  • Bug Prediction Dataset – Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories.
  • Code Reviews – Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse.
  • CoREBench – Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils.
  • Cryptocurrency GitHub Activity and Market Cap Dataset – Activity such as commits, stars, prices, and market cap of over 200 cryptocurrency projects on GitHub over time. Raw, historic data is also available.
  • Defects4J – Collection of 395 reproducible bugs collected with the goal of advancing software testing research.
  • Eclipse AERI stacktraces – Collection of stacktraces of Exceptions encountered by users of the Eclipse IDE, as retrieved by the AERI reporting system.
  • Enron Spreadsheets and Emails – All the spreadsheets and emails used in the paper ‘Enron’s Spreadsheets and Related Emails: A Dataset and Analysis’.
  • Findbugs-maven – Set of FindBugs reports for the Java projects of the Maven repository.
  • GHTorrent – Scalable, queriable, offline mirror of data offered through the GitHub REST API.
  • GitHub Bug Dataset – Bug Dataset of 15 Java open-source projects characterized by static source code metrics.
  • GitHub on Google BigQuery – GitHub data accessible through Google’s BigQuery platform.
  • Grammar Zoo – Collection of grammars of DSLs and GPLs, some extracted from metamodels and document schemata.
  • KaVE – Developer tool interaction data.
  • Linux Kernel 4.21 Call Graphs – The Linux Kernel 4.21 Call Graphs produced using CScout.
  • Maven metrics – Collection of software complexity & sizing metrics for the Maven Repository.
  • Maven Dependency Graph – Snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database.
  • mzdata – Multi-extract and multi-level dataset of Mozilla issue tracking history.
  • npm-miner – Analysis results of five open-source software quality tools (eslint, escomplex, nsp, jsinspect, and sonarjs) for 2,000 popular npm packages, ranked by stars and downloads.
  • OCL Expressions on GitHub – Data set of 9188 OCL expressions originating from 504 EMF meta-models in 245 systematically selected GitHub repositories.
  • RepoReapers Data Set – Data set containing a collection of engineered software projects from GHTorrent.
  • Software Heritage Graph Dataset – Graph of the development history and file metadata of >80 million software projects from various forges (GitHub, GitLab, Debian, PyPI, Google Code, etc.) in a deduplicated and unified representation.
  • STAMINA – (STAte Machine INference Approaches) data are used to benchmark techniques for learning deterministic finite state machines (FSMs).
  • Stack Exchange – Anonymized dump of all user-contributed content on the Stack Exchange network.
  • TravisTorrent – Provides free and easy-to-use Travis CI build analyses.
  • Ultimate Debian Database (UDD) – Data about various aspects of Debian (e.g. packages, bugs, maintainers) in a single SQL database.
  • Unified Bug Dataset – Static source-code-based datasets, including the Bugcatchers Bug Dataset, the Bug Prediction Dataset, the Eclipse Bug Dataset, the GitHub Bug Dataset, and some datasets from the PROMISE repository.
  • Unix history – Git repository with 46 years of Unix history evolution.

Tools


  • astminer – Library and tool for mining of path-based representations of code and other data derived from ASTs.
  • Boa – Domain-specific language and infrastructure that eases mining software repositories.
  • buckwheat – Multi-language tokenizer for extracting identifiers from source code.
  • ckjm – Chidamber and Kemerer Java Metrics.
  • Coming – A Java framework for analyzing code changes and mining instances of change patterns from Git repositories.
  • CryptOSS – Mine GitHub activity and market cap data for cryptocurrency projects.
  • DbDeo – Extract embedded SQL statements and detect database schema smells.
  • Designite – Compute source code metrics and detect a variety of implementation, design, and architecture smells for C#.
  • DesigniteJava – Compute source code metrics and detect a variety of implementation and design smells for Java.
  • Diggit – Agile Ruby Tool to analyze Git repositories.
  • GrimoireLab – Free/Libre/Open Source tools for Software Development Analytics.
  • MetricMiner – Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories.
  • Maven-miner – Java tools and infrastructure to resolve the whole Maven dependency graph, hosted in Maven Central, in the form of a Neo4j Graph.
  • Perceval – Fetch repository data from tens of back-ends.
  • Puppeteer – Detect configuration smells in Puppet code.
  • PyDriller – Python Framework to analyse Git repositories.
  • qmcalc – Calculate quality metrics from C source code.
  • reaper – Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is engineered.
  • RefactoringMiner – Library/API for detection of refactorings in changes of Java code.
  • VulData7 – Java framework enabling the automated collection of commits fixing vulnerabilities that are reported in NVD (links NVD with Git).
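
As a rough illustration of the kind of commit-level mining that tools such as PyDriller or MetricMiner automate, the sketch below computes per-file churn and per-author commit counts from plain `git log --numstat` text, using only the Python standard library. It does not use the API of any tool listed above; the `@@` marker, the function names, and the sample log are assumptions made for this example.

```python
from collections import Counter

# Hypothetical marker that distinguishes commit header lines from numstat lines.
COMMIT_MARK = "@@"

# Assumed invocation that produces the text parsed below:
#   git log --numstat --pretty=format:@@%H%x09%an
# Each commit then appears as "@@<hash>\t<author>", followed by one
# "<added>\t<deleted>\t<path>" line per changed file.


def churn_per_file(log_text):
    """Sum added + deleted lines per file path."""
    churn = Counter()
    for line in log_text.splitlines():
        if not line.strip() or line.startswith(COMMIT_MARK):
            continue  # skip blank separators and commit headers
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a numstat line
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of line counts
        churn[path] += int(added) + int(deleted)
    return churn


def commits_per_author(log_text):
    """Count commits per author name."""
    authors = Counter()
    for line in log_text.splitlines():
        if line.startswith(COMMIT_MARK):
            _commit_hash, _, author = line[len(COMMIT_MARK):].partition("\t")
            authors[author] += 1
    return authors


# Made-up sample in the assumed format, so the sketch runs without a repository.
SAMPLE_LOG = "\n".join([
    "@@a1\tAlice",
    "3\t1\tsrc/main.c",
    "-\t-\tlogo.png",
    "",
    "@@b2\tBob",
    "10\t0\tsrc/main.c",
    "2\t2\tREADME",
])

if __name__ == "__main__":
    print(churn_per_file(SAMPLE_LOG).most_common())
    print(commits_per_author(SAMPLE_LOG).most_common())
```

To run this on a real repository, the same functions can be fed the output of the `git log` command shown in the comment, for example captured with `subprocess.run(..., capture_output=True, text=True)`. The dedicated tools above handle cases this sketch ignores, such as renames, merges, and diff content.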

Research Outlets
