This research approach is often termed experimental or empirical software engineering.
Many of the data sets can also be useful in research using search-based software engineering methods.
The repository is named after the Mining Software Repositories (MSR) conference series.
For examples of such work, see the MSR conference’s Hall of Fame.
- SIR – Software-artifact infrastructure repository; Java, C, C++, and C# software together with test suites and fault data.
- PROMISE – About 20 datasets related to software engineering research.
- FLOSSmole – Collaborative collection and analysis of free/libre/open source project data.
- Zenodo – Software data collections in CERN’s open-access repository.
- AndroidTimeMachine – Graph-based dataset of commit history of 8,431 real-world Android apps.
- AndroZoo – Collection of Android Applications.
- Bug Prediction Dataset – Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories.
- Code Reviews – Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse.
- CoREBench – Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils.
- Cryptocurrency GitHub Activity and Market Cap Dataset – Activity such as commits, stars, prices, and market cap of over 200 cryptocurrency projects on GitHub over time. Raw, historic data is also available.
- Defects4J – Collection of 395 reproducible bugs collected with the goal of advancing software testing research.
- Eclipse AERI stacktraces – Collection of stacktraces of Exceptions encountered by users of the Eclipse IDE, as retrieved by the AERI reporting system.
- Enron Spreadsheets and Emails – All the spreadsheets and emails used in the paper ‘Enron’s Spreadsheets and Related Emails: A Dataset and Analysis’.
- Findbugs-maven – Set of FindBugs reports for the Java projects of the Maven repository.
- GHTorrent – Scalable, queriable, offline mirror of data offered through the GitHub REST API.
- GitHub Bug Dataset – Bug Dataset of 15 Java open-source projects characterized by static source code metrics.
- GitHub on Google BigQuery – GitHub data accessible through Google’s BigQuery platform.
- Grammar Zoo – Collection of grammars of DSLs and GPLs, some extracted from metamodels and document schemata.
- KaVE – Developer tool interaction data.
- Linux Kernel 4.21 Call Graphs – The Linux Kernel 4.21 Call Graphs produced using CScout.
- Maven metrics – Collection of software complexity & sizing metrics for the Maven Repository.
- Maven Dependency Graph – Snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database.
- mzdata – Multi-extract and multi-level dataset of Mozilla issue tracking history.
- npm-miner – Analysis results of 5 open source software quality tools (eslint, escomplex, nsp, jsinspect, and sonarjs) for 2000 popular (in terms of stars and downloads) npm packages.
- OCL Expressions on GitHub – Data set of 9188 OCL expressions originating from 504 EMF meta-models in 245 systematically selected GitHub repositories.
- RepoReapers Data Set – Data set containing a collection of engineered software projects from GHTorrent.
- Software Heritage Graph Dataset – Graph of the development history and file metadata of >80 million software projects from various forges (GitHub, Gitlab, Debian, PyPI, Google Code, etc.) in a deduplicated and unified representation.
- STAMINA (STAte Machine INference Approaches) – Data used to benchmark techniques for learning deterministic finite state machines (FSMs).
- Stack Exchange – Anonymized dump of all user-contributed content on the Stack Exchange network.
- TravisTorrent – Provides free and easy-to-use Travis CI build analyses.
- Ultimate Debian Database (UDD) – Data about various aspects of Debian (e.g. packages, bugs, maintainers) in the same SQL database.
- Unified Bug Dataset – Static source code based datasets, including the Bugcatchers Bug Dataset, the Bug Prediction Dataset, the Eclipse Bug Dataset, the GitHub Bug Dataset, and some datasets from the PROMISE repository.
- Unix history – Git repository with 46 years of Unix history.
- astminer – Library and tool for mining of path-based representations of code and other data derived from ASTs.
- Boa – Domain-specific language and infrastructure that eases mining software repositories.
- buckwheat – Multi-language tokenizer for extracting identifiers from source code.
- ckjm – Chidamber and Kemerer Java Metrics.
- Coming – A Java framework for analyzing code changes and mining instances of change patterns from Git repositories.
- CryptOSS – Mine GitHub activity and market cap data for cryptocurrency projects.
- DbDeo – Extract embedded SQL statements and detect database schema smells.
- Designite – Compute source code metrics and detect a variety of implementation, design, and architecture smells for C#.
- DesigniteJava – Compute source code metrics and detect a variety of implementation and design smells for Java.
- Diggit – Agile Ruby Tool to analyze Git repositories.
- GrimoireLab – Free/Libre/Open Source tools for Software Development Analytics.
- MetricMiner – Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories.
- Maven-miner – Java tools and infrastructure to resolve the whole Maven dependency graph, hosted in Maven Central, in the form of a Neo4j Graph.
- Perceval – Fetch repository data from tens of back-ends.
- Puppeteer – Detect configuration smells in Puppet code.
- PyDriller – Python Framework to analyse Git repositories.
- qmcalc – Calculate quality metrics from C source code.
- reaper – Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is engineered.
- RefactoringMiner – Library/API for detection of refactorings in changes of Java code.
- VulData7 – Java framework enabling the automated collection of commits fixing vulnerabilities that are reported in NVD (links NVD with Git).
- Outlets exclusively devoted to empirical software engineering research
- Outlets that publish empirical software engineering research
- ACM Transactions on Software Engineering and Methodology (TOSEM)
- ESEC/FSE: ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
- ICSE: International Conference on Software Engineering
- IEEE Software magazine
- IEEE Transactions on Software Engineering
- Journal of Systems and Software
- SANER: IEEE International Conference on Software Analysis, Evolution and Reengineering