MSR Mining Challenge 2009

May 16-17, 2009
Vancouver,
Canada

Special track within MSR 2009,
6th IEEE Working Conference on Mining Software Repositories
http://2009.msrconf.org

Co-located with ICSE 2009,
IEEE International Conference on Software Engineering
http://www.cs.uoregon.edu/events/icse09/home/

Organizers

Christian Bird (chair)
Univ. of California, Davis, USA
Katsuro Inoue
University of Osaka, Japan
Michael W. Godfrey
University of Waterloo, Canada
Jim Whitehead
Univ. of California, Santa Cruz, USA

Jury

Israel Herraiz
(Universidad Rey Juan Carlos, Spain)
Emily Hill
(University of Delaware, USA)
Abram Hindle
(University of Waterloo, Canada)
Reid Holmes
(University of Calgary, Canada)
Rahul Premraj
(Saarland University, Germany)
Peter Rigby
(University of Victoria, Canada)

Location

Co-located with ICSE 2009,
Vancouver, Canada

NOTE: New, later submission dates have been announced!

Overview

Since 2006 the IEEE Working Conference on Mining Software Repositories (MSR) has hosted a mining challenge. The MSR Mining Challenge brings together researchers and practitioners who are interested in applying, comparing, and challenging their mining tools and approaches on software repositories for open source projects. Unlike previous years, which examined a single project or multiple projects in isolation, this year's MSR challenge examines the GNOME Desktop Suite of projects. The emphasis this year is on how the projects are related and how they interact.

There will be two challenge tracks: #1: general and #2: prediction. The winner of each track will be given the MSR 2009 Challenge Award.

Challenge #1: General

In this category you can demonstrate the usefulness of your mining tools. The main task is to find interesting insights by analyzing the software repositories of the projects within the GNOME Desktop Suite. GNOME is very mature, is composed of a number of individual projects (nautilus, epiphany, evolution, etc.), and provides plenty of input for mining tools. The idea of this track is that tools can be used to examine a family of projects that are related and similar in nature. It is recommended (though not required) that tools examine multiple projects within the GNOME ecosystem, for instance by examining API usage across all projects, training a predictive model on one project and assessing its accuracy on another, or examining how developers' activity spans multiple projects.

Participation is straightforward:

Select your mining area (one of bug analysis, change analysis, architecture and design, process analysis, team structure, etc.).
Get project data for multiple GNOME projects.
Formulate your mining questions.
Use your mining tool(s) to answer them.
Write up and submit your 4-page challenge report.

The challenge report should describe the results of your work and cover the following aspects: questions addressed, input data, approach and tools used, derived results and interpretation of them, and conclusions. Keep in mind that the report will be evaluated by a jury. Reports must be at most 4 pages long and in the ICSE format.

The submission will be via Easychair (http://www.easychair.org/conferences/?conf=msrchallenge2009). Each report will undergo a thorough review, and accepted challenge reports will be published as part of the MSR 2009 proceedings. Authors of selected papers will be invited to give a presentation at the MSR conference in the MSR Challenge track.

Data

Feel free to use any data source for the Mining Challenge. For your convenience, we provide repository logs, mirrored repositories, bugzilla database dumps, and various other forms of data at msrchallengedata.html.

Challenge #2: Predict

This year, the MSR Mining Challenge prediction will involve predicting the code growth (in terms of raw source code lines) of each project that will occur between February 1st and April 30th, 2009 (both days included). Your job is predicting the change in size of code in terms of lines in source code files using all possible resources.

Project      Lines of source added from 2009/2/1 to 2009/4/30
epiphany     2,023
nautilus     3,112
evolution    720
...          ...

Participation is as follows:

Pick a team name, e.g., WICKED WARTHOGS.
Come up with predictions for code growth based on some criteria or prediction model. A very simple model, for instance, would be the amount of growth in the past three months.
Annotate the corresponding files with your predictions:
- Predict the code growth of the projects listed in projects.txt.
- Write a paragraph (max 200 words) that describes how you computed your predictions.
- Submit everything by email to msr2009predictions@gmail.com before the predictions deadline listed under Important Dates (Apia time).

The prediction is on a per project basis. Thus for each project in projects.txt, you need to predict the growth in number of source code lines. Your submission should be a text file with each line containing a project name followed by the change in number of source lines as in challengeexample.txt.
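As an illustration of the "very simple model" mentioned above, the following minimal sketch predicts each project's growth as its growth over the previous three months and formats the result as one line per project. The function names, the line counts, and the single-space separator are assumptions made for this example; challengeexample.txt remains the authoritative reference for the file format.

```python
def naive_baseline(sizes_three_months_ago, sizes_now):
    """Predict next-period growth as the growth seen over the previous three months."""
    return {
        project: sizes_now[project] - sizes_three_months_ago[project]
        for project in sizes_now
    }


def format_submission(predictions):
    """One line per project: name, a space (assumed separator), predicted change in lines."""
    lines = [f"{name} {growth}" for name, growth in sorted(predictions.items())]
    return "\n".join(lines) + "\n"


# Hypothetical line counts, for illustration only (not real GNOME data).
earlier = {"epiphany": 90000, "nautilus": 150000}
current = {"epiphany": 91500, "nautilus": 152000}

predictions = naive_baseline(earlier, current)
submission_text = format_submission(predictions)
print(submission_text, end="")
```

A real submission would cover every project in projects.txt and would take its line counts from the repositories rather than from hard-coded values.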

Each submission will be scored in the following way. For each project, the difference between the submitted growth and the actual growth will be calculated and then normalized by the size of the project as of February 1st. Thus, if zenity is 2000 lines on February 1st and 2500 lines on April 30th and your prediction is 300 lines, then the value would be (300 - 500) / 2000, or -0.1. The score for each submission is the sum of the squares of these values across all of the projects. A perfect prediction submission would have a score of 0. Lower scores indicate better predictions than higher scores. Using sums of squares rather than simple sums rewards predictions that are more consistent in their accuracy.
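The scoring rule above can be sketched in a few lines. This is only an illustration of the description, not the official scoring tool; the function and variable names are assumptions, and the sketch assumes a prediction is present for every project.

```python
def normalized_error(predicted_growth, size_start, size_end):
    """(predicted growth - actual growth), normalized by the size on February 1st."""
    actual_growth = size_end - size_start
    return (predicted_growth - actual_growth) / size_start


def submission_score(predictions, sizes_start, sizes_end):
    """Sum of squared normalized errors over all projects; lower is better, 0 is perfect."""
    return sum(
        normalized_error(predictions[p], sizes_start[p], sizes_end[p]) ** 2
        for p in sizes_start
    )


# The zenity example from the text: 2000 lines on February 1st,
# 2500 lines on April 30th, and a predicted growth of 300 lines.
zenity_error = normalized_error(300, 2000, 2500)
print(zenity_error)  # -0.1
```

Because the errors are squared, one badly mispredicted project costs more than several slightly mispredicted ones, which is why consistent predictions score better.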

Obviously, the team with the best predictions will win. However, to increase the competition, we will organize a set of "benchmark" predictions.

Code Growth Prediction

The predictions for code growth should be made at the project level. We will only examine source code (not makefiles, READMEs, documentation, etc.) contained in each project repository as it exists at 12:01 a.m. on February 1st and 11:59 p.m. on April 30th, 2009. Source code files are determined by extension (as defined below) and all lines in a source file will be counted regardless of their content. For the challenge we will consider selected projects within the core GNOME desktop suite. A complete list of the projects is in the file projects.txt. We will provide the tool for officially counting raw source lines in the near future.

To calculate source code lines for a project, only include the source files that reside in the trunk of the repository (some .c files, for example, may be generated during the configure or make stages and we do not include those). Do not include files from the branches or tags directories of the repository. We define source code files as those with the extensions c, cc, cpp, cs, glade, h, java, pl, py, and tcl. In addition, any file that has one of those extensions followed by .in or .template is also considered a source file. A simple way to calculate the number of source lines is to execute the following command at the root of the tree:

find . -regextype posix-extended -type f -regex ".*\.(c|cc|cpp|cs|glade|h|java|pl|py|tcl)(\.template|\.in)?$" -print0 | xargs -0 cat | wc -l
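For readers without GNU find, the same counting rule can be sketched in Python. This is an illustrative equivalent under the definition above, not the official counting tool; the names SOURCE_FILE and count_source_lines are assumptions. Like wc -l, it counts newline characters, and it skips symbolic links as find's -type f does.

```python
import os
import re

# Extensions from the definition above, optionally followed by .template or .in.
SOURCE_FILE = re.compile(
    r".*\.(c|cc|cpp|cs|glade|h|java|pl|py|tcl)(\.template|\.in)?", re.DOTALL
)


def count_source_lines(root):
    """Total number of newline characters in all source files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Regular files only; symbolic links are skipped.
            if not SOURCE_FILE.fullmatch(name) or os.path.islink(path):
                continue
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                total += handle.read().count(b"\n")
    return total


if __name__ == "__main__":
    # Run from the root of a checked-out trunk.
    print(count_source_lines("."))
```

Reading files in binary mode avoids decoding errors on source files with mixed or unknown encodings.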

Frequently Asked Questions

Do I need to give a presentation at the MSR conference? For challenge #1, the jury will select finalists that are expected to give a short presentation at the conference. Then the audience will select a winner. For challenge #2, there is no presentation at the conference. The winners will be determined with statistical methods (correlation analysis) and announced at the conference.
Does the challenge report have to be four pages? No, of course you can submit fewer than four pages. The page limit was set to ease the presentation of space-intensive results such as visualizations.
Wow, the data set is soooo big! My tool won't finish in time. What can I do? Just run your tool on a subset of the projects. For instance, you could examine only the nautilus file manager and the epiphany web browser. Especially when you are doing visualizations, it is almost impossible to show everything.
Predicting code growth? But, I have no clue how to build prediction models. That's the fun thing about this category: you don't need to build sophisticated models. Of course, some people will, but others will just build simple predictors. In the end, we will see (a) whether we can predict future development events and (b) who does it best.
My cat is a visionary... can I submit its predictions, or is challenge #2 only for tools? Of course, go ahead and submit its predictions as a benchmark. However, your cat will not be in the running for the award: only predictions generated by tools or by humans in a systematic way are eligible to win challenge #2.
For challenge #2 (prediction), is it acceptable if our team submits more than one prediction file? No. Only one submission per team (or person) is allowed.

Important Dates

Submission of predictions: February 7th, 2009 (Apia time)

Submission of reports: March 16th, 2009 (Apia time)

Camera-ready deadline: April 22nd, 2009

Conference date: May 16th - 17th, 2009