The Helix Data Set is a compilation of release histories of a number of non-trivial Java Open Source Software Systems. It has been developed to assist researchers in the field of empirical software engineering with a focus on software evolution.

If you use the data set, please email us about any resulting publications, and we will add them to our publications section.

If you use our data in a publication, please see the Citation Information.

Comments, Feedback or Help? -- email Rajesh Vasa (

What's New -- 24 Jan 2011

  • Metrics data re-generated with new releases.
  • Staging Area introduced to capture systems that are in the early stages of evolution, or systems for which we do not yet have many releases.

In a nutshell

  • 40+ Open Source Java software systems, 1000+ releases with over 65,000 classes
  • All systems have 100+ classes (most are far larger -- i.e., non-trivial)
  • All systems have a minimum of 15 releases with over 18 months of release history
  • Evolution history available as a ZIP file, with consistent meta-data (including License information and a classification of software type).
  • Over 50 different metrics extracted for each release -- available in a simple CSV file format.
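The selection thresholds listed above (100+ classes, at least 15 releases, more than 18 months of history) can be checked mechanically once per-release data is in CSV form. The following is only a minimal sketch: the column names (`system`, `release_date`, `class_count`), the sample rows, and the helper functions are hypothetical and do not reflect the actual layout of the Helix CSV files. It also assumes that the class-count threshold applies to every release and approximates the history length in whole calendar months.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Hypothetical layout: one row per release. These column names and rows are
# illustrative assumptions, not the data set's actual schema or contents.
SAMPLE_CSV = """system,release_date,class_count
alpha,2008-01-15,120
alpha,2008-09-01,150
alpha,2009-10-20,210
beta,2010-03-01,80
beta,2010-06-01,95
"""


def load_releases(text):
    """Group (release_date, class_count) pairs by system name."""
    releases = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        releases[row["system"]].append(
            (date.fromisoformat(row["release_date"]), int(row["class_count"]))
        )
    return releases


def meets_criteria(history, min_releases=15, min_classes=100, min_months=18):
    """Check one system's release history against the selection thresholds."""
    dates = [released for released, _ in history]
    first, last = min(dates), max(dates)
    # Approximate history length in whole calendar months.
    months = (last.year - first.year) * 12 + (last.month - first.month)
    return (
        len(history) >= min_releases
        and all(count >= min_classes for _, count in history)
        and months >= min_months
    )


releases = load_releases(SAMPLE_CSV)
for name, history in sorted(releases.items()):
    # The sample is tiny, so the release threshold is lowered for illustration.
    print(name, meets_criteria(history, min_releases=3))
```

With the default thresholds, neither sample system would qualify, since both have far fewer than 15 releases; the lowered `min_releases=3` is used only to make the small sample informative.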

The data selection criteria and the metric extraction approach are described in greater detail in Chapters 3 and 4 of R. Vasa's PhD thesis (in review).

Collaboration and Contributions

We are keen to build strong and ongoing relationships to ensure that the data set can be kept updated regularly, and to expand the size of the data set.