Analyzing Julia's issue counts over time | Iain Dunning

First posted: Aug 11, 2015

In this post we'll be analyzing the number of issues open on the Julia language's issue tracker. We'll be counting both issues (bug reports, ideas, plans) and pull requests (PRs, code that has been submitted for review before merging it into the language). What I'm mainly interested in is how the number of open issues/PRs varies over time, and how that relates to the total number of issues/PRs.

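For this job we'll need three Julia packages, all made by community members: GitHub.jl (to query the GitHub API), JLD.jl (to save and load Julia data), and Gadfly.jl (for plotting).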

All of these can be installed with the Julia package manager, e.g. Pkg.add("GitHub"); Pkg.add("JLD") and so on. If you haven't used Julia in a while, you might want to run Pkg.update() first so you get the freshest versions of these packages.

First step, load the packages:

using GitHub, JLD, Gadfly
using Dates  # Only needed on Julia 0.3.x

We use the issues function of the GitHub.jl package to download every open and closed issue or pull request (PR) for the julia repository - this takes a while, as it needs to download a fair bit of data. You'll want to get an auth token, so that GitHub won't bounce our requests as a spam attack of some sort. You can get one by signing up for a GitHub account, if you don't already have one, and going to your settings page.

# Replace with your token
TOKEN = "yourauthtokenhere"
# Authenticate with GitHub, so they know we're legit
my_auth = authenticate(TOKEN)
# Pull all open issues...
open_issues = issues(my_auth,"JuliaLang","julia",state="open")
# ... and all closed issues (10x as many of these)
closed_issues = issues(my_auth,"JuliaLang","julia",state="closed")
# Combine them into one vector of issues
all_issues = vcat(open_issues,closed_issues)

We'll create a little type that just keeps the creation and close dates. If an issue is open, it doesn't have a close date, so we'll just use a time far in the future (Jan 1, 2099!) for now. The DateTime function creates a DateTime object from a string (or from manually spelling out a date).

# Define our reduced issue type
type SimpleIssue
    created_at::DateTime
    closed_at::DateTime
end
# Provide a constructor that takes in
# cr   creation date
# cl   close date - might be `nothing` = open
SimpleIssue(cr::String,cl) = SimpleIssue(
    DateTime(cr), 
    cl == nothing ? DateTime(2099,1,1) : DateTime(cl) )
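
To see how the constructor treats open and closed issues, here is a small self-contained sketch. It is written for Julia 1.x, where `type` is spelled `struct` and `String` in the signature becomes `AbstractString`. The constant name `FAR_FUTURE` and both timestamps are made up for illustration; they are not real issue data.

```julia
using Dates

# Same idea as SimpleIssue above, in Julia 1.x syntax
struct SimpleIssue
    created_at::DateTime
    closed_at::DateTime
end

# Sentinel close date for issues that are still open
# (hypothetical name, same value as in the post)
const FAR_FUTURE = DateTime(2099, 1, 1)

# cr   creation timestamp as a string
# cl   close timestamp as a string, or `nothing` if still open
SimpleIssue(cr::AbstractString, cl) = SimpleIssue(
    DateTime(cr),
    cl === nothing ? FAR_FUTURE : DateTime(cl))

# Made-up timestamps, for illustration only
closed_issue = SimpleIssue("2015-01-05T10:00:00", "2015-01-09T18:30:00")
open_issue   = SimpleIssue("2015-03-01T08:15:00", nothing)

println(closed_issue.closed_at)  # the parsed close timestamp
println(open_issue.closed_at)    # the far-future sentinel
```

The open issue ends up with the sentinel close date, so later comparisons against real dates treat it as still open.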

We now use the JLD.jl package to serialize this data to a file in case we want to come back and analyze it later. JLD.jl can save pretty much any Julia thing, even types you define. Read the README for caveats!

save("all_issues.jld","all_issues",
    [SimpleIssue(i.created_at, i.closed_at) for i in all_issues])

We'll pretend we're revisiting this some time in the future. Loading data is just the reverse of saving it with JLD:

all_issues = load("all_issues.jld", "all_issues");

Now for some actual work. We collect a vector of every date seen - this is basically every day something happened on the issue tracker, which is probably almost every day since the announcement of Julia.

all_create_dts = [Date(i.created_at) for i in all_issues]
all_close_dts = [Date(i.closed_at) for i in all_issues]
all_dates = unique(sort(vcat(all_create_dts,all_close_dts)))
length(all_dates)

1457

Now for the actual counting. We'll use a not-particularly-efficient method, but one quick enough for the data at hand. For each issue/PR, simply increment a count for each date on which the issue/PR was open (the dates between its opening and closing). We'll also keep a count of the total opened ever versus date and, for every date, the ages of all issues open on that date.

open_at_count  = Dict{Date,Int}()
total_at_count = Dict{Date,Int}()
days_open_at   = Dict{Date,Vector{Int}}()
for d in all_dates
    open_at_count[d]  = 0
    total_at_count[d] = 0
    days_open_at[d]   = Int[]
end
# For each issue/PR...
for iss in all_issues
    create_dt = iss.created_at
    close_dt  = iss.closed_at
    # For every date...
    for d in all_dates
        # If the issue was made before...
        if create_dt <= d
            # Then it existed on this date
            total_at_count[d] += 1
            # If it was closed after this...
            if d <= close_dt
                # Then it is open on this date
                open_at_count[d] += 1
                # It's been open this long
                push!(days_open_at[d], Int(d - Date(create_dt)))
            end
        end
    end
end
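
The nested loop above does work proportional to the number of issues times the number of dates. As a possible alternative, not the method used in this post, the open count can be built from a running sum: add one on the day an issue is created and subtract one on the day after it closes. The sketch below assumes Julia 1.x and uses made-up dates; `open_counts` is a hypothetical helper name. It works on whole `Date`s, so edge cases may be counted slightly differently than in the loop above, which compares against `DateTime` values.

```julia
using Dates

# Count open issues per day with a running sum of +1/-1 events.
# `created` and `closed` are parallel vectors; `nothing` means still open.
function open_counts(created::Vector{Date}, closed::Vector{Union{Date,Nothing}})
    delta = Dict{Date,Int}()
    for (cr, cl) in zip(created, closed)
        delta[cr] = get(delta, cr, 0) + 1      # issue opens on this day
        if cl !== nothing
            nxt = cl + Day(1)                  # treated as open on its close day
            delta[nxt] = get(delta, nxt, 0) - 1
        end
    end
    days = sort(collect(keys(delta)))          # only days with an event
    counts = cumsum([delta[d] for d in days])  # running number of open issues
    return days, counts
end

# Made-up sample data: three issues, one still open
created = [Date(2015, 1, 1), Date(2015, 1, 2), Date(2015, 1, 4)]
closed  = Union{Date,Nothing}[Date(2015, 1, 3), nothing, Date(2015, 1, 4)]

days, counts = open_counts(created, closed)
for (d, c) in zip(days, counts)
    println(d, " => ", c)
end
```

Only days with at least one open or close event appear in `days`; on any day in between, the count is the value from the most recent listed day. Each issue is touched once, followed by a single sort over the event days.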

To finish, let's plot these quantities versus time using Gadfly - just simple line plots will do.

# Collect results into vectors
open_vec  = [open_at_count[d]  for d in all_dates]
total_vec = [total_at_count[d] for d in all_dates]
# Correct for special last day (currently open)
plot_dates = vcat(all_dates[1:end-1], all_dates[end-1]+Day(1))
# Draw the results as a PNG (default is SVG)
draw(PNG(8inch,4inch),
plot(x=plot_dates,y=total_vec,Geom.line,
        Guide.Title("Total Issues/PR"),
        Guide.xlabel("Date"), Guide.ylabel("Count"))
)

draw(PNG(8inch,4inch),
plot(x=plot_dates,y=open_vec,Geom.line,
        Guide.Title("Open Issues/PR"),
        Guide.xlabel("Date"), Guide.ylabel("Count"))
)

We'll now look at what fraction of the issues/PRs are open at any one time. As you can see, it seems to have converged to about 10% - I wonder why? One explanation is that whenever it gets much over 10%, people get the urge to review older issues and fix or close them, and when it drops below 10%, people don't care too much. Another explanation is that there is a core of things in the "too hard" pile at any one time, and the number of those "too hard" things is going up, but at no greater a rate than the overall number of issues.

draw(PNG(8inch,4inch),
plot(x=plot_dates,y=open_vec./total_vec,Geom.line,
        Guide.Title("Open:Total Issues/PR"),
        Guide.xlabel("Date"), Guide.ylabel("Fraction"))
)
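
The ratio plotted above comes from the elementwise division `open_vec./total_vec`, which returns floating-point values even though both inputs hold integers. A tiny sketch with made-up counts (not the real Julia data) shows the behavior:

```julia
# Made-up counts for four consecutive dates, for illustration only
open_vec  = [1, 3, 4, 5]
total_vec = [1, 10, 40, 50]

# `./` divides element by element; Int ./ Int gives Float64
fraction_open = open_vec ./ total_vec
println(fraction_open)
```

Because every date in `all_dates` has at least one issue created on or before it in the real data, the denominator is normally positive; that is an observation about this data set rather than something the code enforces.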

For a different perspective, we can also analyze the distribution of the ages of the open issues/PRs. I would have guessed this was increasing, and sure enough it seems to be.

p25_age_vec = vcat(0.0,[quantile(days_open_at[d],0.25) for d in all_dates[2:end-1]])
p50_age_vec = vcat(0.0,[quantile(days_open_at[d],0.50) for d in all_dates[2:end-1]])
p75_age_vec = vcat(0.0,[quantile(days_open_at[d],0.75) for d in all_dates[2:end-1]])

draw(PNG(8inch,4inch), plot(
layer(  x=plot_dates[1:end-1],y=p25_age_vec,
        color=fill("25th percentile",length(p25_age_vec)),
        Geom.line),
layer(  x=plot_dates[1:end-1],y=p50_age_vec,
        color=fill("Median",length(p50_age_vec)),
        Geom.line),
layer(  x=plot_dates[1:end-1],y=p75_age_vec,
        color=fill("75th percentile",length(p75_age_vec)),
        Geom.line),
Guide.Title("Age of Open Issues/PR"),
Guide.xlabel("Date"), Guide.ylabel("Age (days)")))
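
For reference, here is what `quantile` does to a single day's list of ages. This sketch assumes Julia 1.x, where `quantile` lives in the `Statistics` standard library rather than in Base; the ages are made up.

```julia
using Statistics

# Made-up ages (in days) of the issues open on one particular date
ages = [0, 10, 20, 30, 40]

p25 = quantile(ages, 0.25)  # a quarter of the open issues are at most this old
p50 = quantile(ages, 0.50)  # the median age
p75 = quantile(ages, 0.75)  # three quarters are at most this old

println((p25, p50, p75))
```

Each of the three curves in the plot is this calculation repeated for every date, one percentile per curve.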
