Understanding Medians (and Averages)
Many of Lex Machina’s new Analytics pages, public reports, and services engagement reports use median figures as a way to help summarize data, showing, for example, the time-to-trial for a particular judge as a median number.
This post explains what a median is, how a median relates to an average, and why we use medians at Lex Machina. More importantly, I will try to show you why median calculations enable you to make better decisions about legal data.
Fig. 1: Sample data
First, what is a median? Imagine a list of numbers, each representing the time-to-trial for a particular case. If we put the list in order, and chose the middle-most number, that number would be the median.* For example, a group of 5 cases may have time-to-trial values of 16 months, 17 months, 18 months, 19 months, and 50 months; the median is 18 months.
* If the list has an even number of times-to-trial, we define median as the arithmetic mean of the middle-most two numbers. More on that in a moment…
Fig. 2: The median is the middle-most value
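To make the definition concrete, here is a minimal Python sketch of the median calculation described above. The function name median_of and the four-case list used for the even-length example are illustrative assumptions, not anything taken from Lex Machina's software.

```python
def median_of(values):
    """Return the median of a list of numbers (hypothetical helper)."""
    ordered = sorted(values)      # put the list in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: take the middle-most one.
        return ordered[mid]
    # Even number of values: arithmetic mean of the two middle-most ones.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Times-to-trial (in months) from the example in the text.
times_to_trial = [16, 17, 18, 19, 50]
odd_median = median_of(times_to_trial)

# An assumed even-length list, to illustrate the footnote's rule.
even_median = median_of([16, 17, 18, 19])

print(odd_median, even_median)
```

For the five sample cases this gives 18 months; for the assumed four-case list, the two middle values (17 and 18) are averaged. Python's standard library offers the same calculation as statistics.median.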
Before discussing the relative merits of medians and averages, let’s take a moment to clarify what exactly an average is. Strictly speaking, “average” can refer to a median, but it most often means what is more precisely called an “arithmetic mean”: the sum of the values divided by the number of values. In our example above, the mean would be (16 + 17 + 18 + 19 + 50) / 5 = 24 months.
Fig. 3: The mean (average)
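The same arithmetic can be checked in a couple of lines of Python. This is only a sketch using the sample values from the text; the variable names are assumptions chosen for illustration.

```python
from statistics import mean

# Times-to-trial (in months) from the example in the text.
times_to_trial = [16, 17, 18, 19, 50]

# Arithmetic mean: add up the values and divide by how many there are.
mean_by_hand = sum(times_to_trial) / len(times_to_trial)

# The standard library computes the same figure.
mean_from_library = mean(times_to_trial)

print(mean_by_hand, mean_from_library)
```

Both calculations come out to 24 months, matching the figure above.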
Why use median calculations instead of just averaging? Why have this post?
As the example above shows, medians are less influenced by extreme, or “outlier,” data points, such as the case that took 50 months to reach trial. Because the median effectively ignores that case, it produces a more useful summary of the data: for four of the five cases (those that took 16, 17, 18, and 19 months), the median is within 2 months of the actual time-to-trial. The mean (24 months), on the other hand, would be off by at least 5 months for each of those cases.
Fig. 4: Error of median, error of average
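The comparison of errors can be reproduced with a short Python sketch. It uses only the five sample values from the text; the variable names are illustrative assumptions.

```python
from statistics import mean, median

# Times-to-trial (in months) from the example in the text.
times_to_trial = [16, 17, 18, 19, 50]

med = median(times_to_trial)   # 18
avg = mean(times_to_trial)     # 24

# How far each summary figure is from each case's actual time-to-trial.
median_errors = [abs(t - med) for t in times_to_trial]
mean_errors = [abs(t - avg) for t in times_to_trial]

# Count the cases for which the median is within 2 months.
cases_within_2_months = sum(1 for e in median_errors if e <= 2)

# Smallest error of the mean among the four non-outlier cases.
smallest_mean_error = min(mean_errors[:4])

print(median_errors)   # [2, 1, 0, 1, 32]
print(mean_errors)     # [8, 7, 6, 5, 26]
print(cases_within_2_months, smallest_mean_error)
```

The median misses badly only on the 50-month outlier, while the mean is at least 5 months off for every one of the other four cases.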
Median measurements are an accepted convention among those who frequently use statistics. Medians are used by the Social Security Administration, the Census, the Federal Judicial Center, and many others who need to extract meaning from large data sets.
If you have a list of data values and you’re trying to figure out what value to expect for a similar case, the median gives you a more accurate measurement, more often.
Coming up soon, I’ll tackle our boxplots: what they mean, why we use them, and how they can help you to better understand your data.
By Brian C. Howard
Legal Data Scientist, Director of Analytics Services