Estimating confidence intervals, part 3

Another episode in our continuing series on effort estimation in the small with 90% confidence intervals. I recently finished implementing another feature after making effort estimates each day. Here’s the plot:


Effort estimation, 90% confidence intervals

Once again, I underestimated the effort even at the 90% level, although not as badly as last time. Here’s a plot of the error.

Error plot

I also find it takes real mental energy to do these daily effort estimates.


Estimating confidence intervals, part 2

Here is another data point from my attempt to estimate 90% confidence intervals. This plot shows my daily estimates for completing a feature I was working on.

90% confidence intervals


The dashed line is the “truth”: it’s what my estimate would have been if I had estimated perfectly each day. The shaded region represents my 90% confidence estimate: I was 90% confident that the amount of time left fell into that region. The solid line is the traditional pointwise effort estimate: it was my best guess as to how many days I had left before the feature would be complete.

For this feature, I significantly underestimated the effort required to complete it. For the first four days, my estimates were so far off that my 90% confidence interval didn’t include the true completion time; over the whole feature, the interval was correct only 60% of the time.
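
To make the calibration check concrete, here’s a minimal sketch in Python of how the 60% figure is computed; the interval bounds and “truth” values below are hypothetical placeholders chosen to mimic the pattern above (misses on the first four days), not my actual estimates.

# Count how often the daily 90% interval contained the true days remaining.
# Hypothetical placeholder data: the first four intervals miss, the rest hit.
lower = [3, 3, 3, 2, 3, 2, 2, 1, 0, 0]   # lower bound of each day's interval
upper = [6, 6, 6, 5, 6, 5, 4, 3, 2, 1]   # upper bound of each day's interval
truth = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]   # actual days remaining on each day

hits = sum(lo <= t <= hi for lo, hi, t in zip(lower, upper, truth))
print("interval contained the truth on {:.0%} of days".format(hits / len(truth)))
# prints: interval contained the truth on 60% of days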

This plot shows the error in my estimates for each day:


Error in effort estimate

Apparently, I’m not yet a well-calibrated estimator. Hopefully, that will improve with further estimates.

Results from estimating confidence intervals

A few weeks ago, I decided to estimate 90% confidence intervals for each day that I worked on developing a feature.

Here are the results over the 10 days from when I started estimating until the feature was deployed into production.

Effort estimates

The dashed line is the “truth”: it’s what my estimate would have been if I had estimated perfectly each day. The shaded region represents my 90% confidence estimate: I was 90% confident that the amount of time left fell into that region. The solid line is the traditional pointwise effort estimate: it was my best guess as to how many days I had left before the feature would be complete.

If we subtract out the “truth” from the other lines, we can see the error in my estimate for each day:

Error in estimate

Some observations:

  • The 90% confidence interval always included the true value, which gives me hope that this is an effective estimation approach.
  • My pointwise estimate underestimated the true time remaining for 9 out of 10 days.
  • My first pointwise estimate was off by a factor of two (an estimate of 5 days versus an actual 10 days), and my estimates got steadily better over time.

I generated these plots using IPython and the ggplot library. You can see my IPython notebook on my website with details on how these plots were made.
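
If you just want the shape of the code without opening the notebook, here’s a minimal sketch of the same kind of plot using matplotlib rather than ggplot; all of the numbers are made-up placeholders, not my actual estimates.

import matplotlib.pyplot as plt

# Made-up placeholder data: one value per day on which I made an estimate.
days = list(range(1, 11))
truth = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]    # actual days remaining (the "truth")
point = [5, 5, 6, 6, 5, 4, 4, 3, 2, 1]     # pointwise best-guess days remaining
lower = [4, 4, 5, 5, 4, 3, 2, 2, 1, 0]     # lower bound of the 90% interval
upper = [11, 10, 9, 8, 7, 6, 5, 4, 3, 1]   # upper bound of the 90% interval

# The error plot is just the estimate with the truth subtracted out.
errors = [p - t for p, t in zip(point, truth)]
print("errors:", errors)

plt.fill_between(days, lower, upper, alpha=0.3, label="90% confidence interval")
plt.plot(days, point, label="pointwise estimate")
plt.plot(days, truth, linestyle="--", label="truth")
plt.xlabel("day")
plt.ylabel("estimated days remaining")
plt.legend()
plt.show()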

Reading academic papers on a Kindle Paperwhite

I recently discovered a great little tool called K2pdfopt for reformatting PDFs such as two-column academic papers so that they are easy to read on a Kindle. (On a related note, it’s ESEM paper review season).

To format for my Kindle Paperwhite, I invoke it like this:

k2pdfopt filename.pdf -dev kpw


Then I just copy the reformatted PDF file to my Kindle. Works great.

Edit: The author of k2pdfopt, William Menninger, informed me that the "-fc" flag is on by default (no need to specify it), and that you can set "-dev kpw" in the K2PDFOPT environment variable so it doesn’t have to be set on the command line each time.
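
For example, assuming a bash-style shell, that would look something like this:

export K2PDFOPT="-dev kpw"
k2pdfopt filename.pdf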

No Country for IT

Matt Welsh suggests that systems researchers should work on an escape from configuration hell. I’ve felt some of the pain he describes while managing a small handful of servers.

Coming from a software engineering background, I would have instinctively classified the problem of misconfiguration as a software engineering problem instead of a systems one. But really, it’s an IT problem more than anything else. And therein lies the tragedy: IT is not considered a respectable area of research in the academic computer science community.

ESEM 2013 Industry Track CFP

The Call for Papers for the Industry Track of the International Symposium on Empirical Software Engineering and Measurement (ESEM 2013) is out. I’m serving as chair of the industry track this year.

If you’re reading this and you work in the software development world (and especially if you’re in the Baltimore/DC area), I encourage you to submit a paper that you think would be of interest to software engineering researchers or other developers.

I have a strong suspicion that the software engineering research community doesn’t have a good sense of the kinds of problems that software developers really face. What I’d really like to do with the industry track is bring professional developers and software engineering researchers together to talk about these sorts of problems.

Also, if you’re reading this and you live in the software world, I encourage you to check out what ESEM is about, even if you’re not interested in publishing a paper. This is a conference that’s focused on empirical study and measurement. If you ask me, every software engineering conference should be focused on empirical study. Because, you know, science.

Training is a dirty word

Two posts caught my eye this week. The first was Anil Dash’s The Blue Collar Coder, and the second was Greg Wilson’s Dark Matter, Public Health, and Scientific Computing. Anil wrote about high school students and Greg spoke about scientists, but ultimately they’re both about teaching computer skills to people without a formal background in computing. In other words, training.

In the hierarchy of academia, training is pretty firmly at the bottom. Education at least gets some lip service, being the primary mission of the university and all. But training is a base, vulgar activity. And it’s a real shame, because the problems that Anil and Greg are trying to address are important ones that need solving. Help will need to come from somewhere else.

Relative confidence in scientific theories

One of the challenges of dealing with climate change is that it’s difficult to communicate to the public how much confidence the scientific community has in a particular theory. Here’s a hypothesis: people have a better intuitive grasp of relative comparisons (A is bigger than B) than of absolutes (we are 90% confident that A is big).

Assuming this hypothesis is true, we could do a broad survey of scientists and use their responses to rank-order confidence in various scientific theories that the general public is familiar with. Possible examples of theories:

  • Plate tectonics
  • Childhood vaccinations cause autism
  • Germ theory of disease
  • Theory of relativity
  • Cigarette smoking causes lung cancer
  • Diets rich in saturated fats cause heart disease
  • AIDS is caused by HIV
  • The death penalty reduces violent crime
  • Evolution by natural selection
  • Exposure to electromagnetic radiation from high-voltage power lines causes cancer
  • Intelligence is inherited biologically
  • Government stimulus spending reduces unemployment in a recession

Assuming the survey produced a (relatively) stable rank-ordering across these theories, the end goal would be to communicate confidence in a scientific theory by saying: “Scientists are more confident in theory X than they are in theories Y and Z, but not as confident as they are in theories P and Q.”