In today’s post, I’m going to show the Deflategate data from a new perspective. Rather than arguing about whether the Patriots used the Logo gauge, I’ve assumed, for the sake of argument, the NFL’s conclusion that the Non-Logo gauge was used, but gone further (as they ought to have done). I’ve “guessed” the amount of deflation that would be required to yield the observations. And, instead of only considering the overall average, I plotted each data point and how the “guessed” deflation would reconcile each data point.
Some very surprising results emerged, one of which raises the question in the title: did McNally inflate one football in the washroom? If the question doesn’t seem to make sense, read on.
Rather than one guess being applicable to all measurements, I ended up needing four different groups, each with a different guessed deflation. A “good” guess (i.e. one that “worked”) for the majority of balls (7) was 0.38 psi – an interesting number that I’ll discuss in the post. A good guess for two balls was zero deflation. But for ball #7, it was necessary to assume that it had been inflated by approximately 0.5 psi in the washroom. One ball read lower than the others (an implied deflation of 0.76 psi) and remains hard to explain. The Wells Report reasonably drew attention to variability, but did not examine the details of the actual variability beyond arm-waving, and did not show that erratic washroom deflation was a plausible explanation for the observed variability.
While the approach in today’s post doesn’t appear conceptual, statistical algorithms, including linear regression, typically solve inverse problems. The spirit of today’s post is approaching Deflategate as an inverse problem. In doing so, I am aware (as Carrick has forcefully observed) that the underlying physical conditions were poorly defined, but people still need to make decisions using the available information as best they can. I think that the approach in today’s post provides a much more plausible and satisfying explanation of the variation in Patriot pressures than those presented by either Exponent or Snyder or, for that matter, my own previous commentary.
Bear with the explanation of context, as the results are interesting.
Context
The Wells Report reported that the standard deviation of Patriot pressure measurements was 0.40 psi (11 balls), as compared to 0.144 psi (4 balls) for the Colt balls. Using a standard F-test (and Levene test), this difference is not “statistically significant”. Snyder’s second criticism was that this finding should have been an end to this particular line of argument. However, this was not what happened. One of Exponent’s most important technical findings – indeed, it is the finding that directly precedes their conclusions – was that the higher variability of Patriot footballs relative to Colt footballs was most plausibly explained by the footballs not starting the game “at or near the same pressure”:
Specifically, the fluctuations in the halftime pressures of Patriots footballs exceed in magnitude the fluctuations that can be attributed to the combined effects of the various physical, usage, and environmental factors we examined. Therefore, subject to discovery of an as yet unidentified and unexamined factor, the most plausible explanation for the variability in the Patriots halftime measurements is that the 11 Patriots footballs measured by the officials at halftime did not all start the game at or near the same pressure.
This finding, quoted verbatim, was one of the most critical findings of the Wells Report.
I agree with their conclusion in the very narrow language in which it is expressed – language in which each word matters:
the 11 Patriots footballs measured by the officials at halftime did not all start the game at or near the same pressure
On May 6, the same date that the Wells Report was released, Exponent submitted a report on simulations in which three untrained employees attempted to deflate 12 balls in 1 minute 40 seconds with a standard needle. The subjects, after one try, achieved remarkably consistent results, both between subjects and between balls. The average deflation was 0.76 psi with a standard deviation of 0.11 psi. This finding was not discussed in the Wells Report, but it poses a conundrum: the observed variability in Exponent’s washroom simulations is much too low to explain observed Patriot variability.
Indeed, the specific pattern of Patriot pressure variability, when examined in detail, in my opinion, argues against washroom deflation and towards a quite different explanation of the quite unusual actual pattern of variability.
The “Inverse Problem”
The diagram below summarizes my estimates of how much deflation would be required to yield the observations – all measurements are based on the Non-Logo gauge, which was calibrated as being relatively close to the Master Gauge. There’s a lot of information in the diagram (and I’ll post a script deriving the diagrams as documentation). In each panel, I’ve converted pressures to ball temperature using the Ideal Gas Law. This facilitates direct comparison of information from Colt balls to Patriot balls. I’ve also shown a negative exponential “corridor” from the upper and lower Colt measurements to halftime temperatures for dry and wet footballs. The corridor is presented here as a rough-and-ready but plausible guide.
In the left panel, I’ve shown the implied temperatures (pressures) using the half-time Non-Logo measurements. In the right panel, I’ve shown the implied temperatures (pressures) after applying the hypothesized deflation (inflation) shown for each ellipse in the left panel. In each case, I’ve “guessed” the amount of deflation/inflation to apply. Algorithms for many inverse problems begin with a guess, so this is not quite as ad hoc as it looks.
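The pressure-to-temperature conversion behind both panels can be sketched roughly as follows. This is a minimal illustration, not the script used for the figure. It assumes constant ball volume, a standard atmosphere of 14.7 psi, a pregame room temperature of 71 deg F, and that a guessed deflation lowers the starting pressure from the 12.5 psig that Anderson recalled. The atmospheric pressure, the room temperature, the function name and the way the guess enters the formula are all assumptions made for the example.

```python
ATM_PSI = 14.7            # assumed atmospheric pressure (standard atmosphere)
RANKINE_OFFSET = 459.67   # converts deg F to absolute deg Rankine


def implied_temp_f(halftime_psig, pregame_psig=12.5, pregame_temp_f=71.0,
                   deflation_psi=0.0):
    """Ball temperature (deg F) implied by a halftime gauge reading.

    Ideal Gas Law at constant volume: absolute pressure is proportional
    to absolute temperature. deflation_psi is the guessed amount of air
    released after pregame gauging (a negative value means inflation).
    """
    start_abs = pregame_psig - deflation_psi + ATM_PSI   # absolute psi at the start
    half_abs = halftime_psig + ATM_PSI                   # absolute psi at halftime
    start_temp_abs = pregame_temp_f + RANKINE_OFFSET
    return start_temp_abs * half_abs / start_abs - RANKINE_OFFSET


# Example: the same (illustrative) reading with and without a 0.38 psi guess.
reading = 11.1
no_guess = implied_temp_f(reading)
with_guess = implied_temp_f(reading, deflation_psi=0.38)
print(f"no deflation assumed: {no_guess:.1f} deg F")
print(f"0.38 psi deflation assumed: {with_guess:.1f} deg F")
```

A positive guess always raises the implied temperature for a given reading, which is why any deflation guess pushes the already “warm” balls further above the corridor.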
For the largest group of balls (seven), I “guessed” that the deflation would be 0.38 psi. When the implied temperatures were re-calculated using this guess, they fit nicely within the corridor (red + signs). However, if this deflation were applied to the three balls in the two upper ellipses, they would end up well above the corridor, and the same deflation is not enough to bring the one remaining low ball up into the corridor.
Figure 1. Left panel – Nonlogo ball pressures expressed in deg F using the Ideal Gas Law. The eleven Patriot measurements are divided into three groups indicated by the ellipses.
Next consider the two balls in the ellipse on the upper border of the corridor on the left panel. If these balls had also been deflated by 0.38 psi, their implied temperature in the right panel would be much too high. Indeed, any deflation for these two balls would move them too high in the right panel. For these two balls, I “guessed” that there was zero deflation and, using this assumption, they fit nicely in the right panel.
Now for the two outlier balls – one of which is too “warm” and one of which is too “cold”. The only way of reconciling ball #7 (upper ellipse) is to assume that it had been inflated. I tried a “guess” of 0.5 psi inflation and, under this assumption, the implied temperature (pressure) fit nicely in the corridor in the right panel. For the cold outlier, I guessed 0.76 psi deflation and, under this assumption, it too fit into the corridor.
Summarizing the reverse engineering: one can arrive at observed pressures under Non-Logo pregame initialization if one assumes that two balls were left alone, seven balls were deflated by 0.38 psi, one ball deflated by 0.76 psi and one ball inflated by 0.5 psi. These are not the only solutions to the inverse problem, but they give a yardstick.
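A rough way to see what the four guesses do to the spread is to add each guess back to the corresponding halftime reading. The readings below are the halftime Non-Logo values as I understand them from the Wells Report; they are not listed in this post, so treat them as an assumption to be checked against the report. The assignment of ball #10 (the lowest reading) as the cold outlier is likewise an assumption. Adding the guess directly to the reading ignores the small temperature scaling of the released air, so this is only a sketch.

```python
import statistics

# ASSUMED halftime Non-Logo readings (psig), keyed by ball number.
HALFTIME = {1: 11.50, 2: 10.85, 3: 11.15, 4: 10.70, 5: 11.10, 6: 11.60,
            7: 11.85, 8: 11.10, 9: 10.95, 10: 10.50, 11: 10.90}

# Guessed deflation per ball (psi); a negative value means inflation.
GUESS = {ball: 0.38 for ball in HALFTIME}          # default: the largest group
GUESS.update({1: 0.0, 6: 0.0, 7: -0.5, 10: 0.76})  # the other three groups

# Reading each ball would have shown had the guessed air not been released.
adjusted = {ball: round(HALFTIME[ball] + GUESS[ball], 2) for ball in HALFTIME}


def spread(values):
    """Range (max minus min) of a collection of readings, in psi."""
    values = list(values)
    return round(max(values) - min(values), 2)


raw_sd = statistics.stdev(HALFTIME.values())
adj_sd = statistics.stdev(adjusted.values())
print(f"raw:      range {spread(HALFTIME.values())} psi, sd {raw_sd:.2f} psi")
print(f"adjusted: range {spread(adjusted.values())} psi, sd {adj_sd:.2f} psi")
```

Under these assumptions the adjusted readings are much more tightly grouped than the raw ones, which is the sense in which the guesses “work”.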
Exponent’s Deflation Simulations
As described in the Context section, Exponent’s May 6 report covered simulations in which three untrained employees attempted to deflate 12 balls in 1 minute 40 seconds with a standard needle. After one try, the results were remarkably consistent, both between subjects and between balls: the average deflation was 0.76 psi with a standard deviation of 0.11 psi.
Discussion
The -0.38 psi Group
First, the largest group of balls implies deflation of 0.38 psi using the Non-Logo gauge – an amount that, by “coincidence”, is exactly equal to the bias between Anderson’s Logo Gauge and Non-Logo gauge. An alternative explanation for these particular observations is that these balls were measured pregame using the Logo gauge. Exponent (and the NFL) rejected the possibility of Logo gauge initialization of Patriot balls, because Anderson’s pregame measurements were more or less consistent with pregame measurements by the two teams themselves. This topic has been discussed before. I’ll return to it after discussing the other groups.
The 0 psi Group
Secondly, notwithstanding my past advocacy of Logo gauge initialization of Patriot balls, it is implausible that the two balls in the 0 psi group were initialized pregame with the Logo Gauge, since their implied temperature would be too warm. (They would move well above the corridor in the right panel.) If these two balls were initialized with the Non-Logo gauge and not deflated, they fit nicely. So is there any evidence or record of two Patriot balls being treated differently? How about this:
Anderson recalls that most of the Patriots footballs measured 12.5 psi, though there may have been one or two that measured 12.6 psi. No air was added to or released from these balls because they were within the permissible range. According to Anderson, two of the game balls provided by the Patriots measured below the 12.5 psi threshold. Yette used the air pump provided by the Patriots to inflate those footballs, explaining that he “purposefully overshot” the range (because it is hard to be precise when adding air), and then gave the footballs back to Anderson, who used the air release valve on his gauge to reduce the pressure down to 12.5 psi.
We already know that none of the officials at half-time paid any attention to which gauge they were using and inattentively switched gauges between Patriot and Colt measurements. Once Anderson put his gauge back into his pocket, it would be random which gauge was used next. Suppose that Anderson measured eleven Patriot balls using the Logo gauge, finding two underinflated even with this gauge, and then put his gauge back into his pocket, making a fresh draw when it was time to re-gauge the two Patriot balls. But this time, he drew the Non-Logo gauge and deflated the two Patriot balls to 12.5 psi (Non-Logo gauge), yielding the two balls in this second group.
The Outliers
Ball #7 in the above solution to the inverse problem has an estimated inflation of 0.5 psi. The reverse engineered (inverse) inflation of 0.5 psi is, by another “coincidence”, the difference in pregame pressure between Patriot and Colt balls. Indeed, this prompted the guess. Could it be possible that one of the Colt balls ended up among the Patriot balls? When I looked back, it turned out that there was contemporary evidence of such a possibility. A news story shortly after the game reports an interview with D’Qwell Jackson, the Colt linebacker who intercepted a Brady pass in the second quarter. Jackson said that the Patriots were using a Colt football “late in the first half”:
Jackson does, however, recall one interesting moment during the first half that has something to do with the latest controversy. He recalls, during a television timeout, there was an especially long delay that prompted him to approach an official.
The game official mentioned something about their efforts to locate a usable football. Shortly after, Jackson noticed that the Patriots were using the Colts’ footballs late in the first half. Jackson said it was odd to him that New England couldn’t find a football to use, especially in the AFC Championship Game.
It seems crazy that one of the “Patriot” balls was actually a Colt ball, but the extra pressure in ball #7 requires an explanation. The idea of McNally inflating ball #7 in the washroom seems even crazier.
This leaves the single “cold” outlier. Its implied deflation (Non-Logo initialization) is about 0.76 psi – a typical amount observed in Exponent’s deflation simulation. Ironically this illustrates a major problem in Exponent’s deflation simulations: if the Patriot balls had been deflated according to the Exponent simulations, they would all be in this range. Instead, ten of the eleven measured balls are way above this level. On the other hand, the pressure (temperature) of this ball is too low to be explained simply by the use of the Logo gauge.
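The mismatch with Exponent’s simulations can be put in rough numbers by expressing each guessed deflation as a distance from the simulated mean (0.76 psi) in units of the simulated standard deviation (0.11 psi). This is only a yardstick: it assumes that the inverse-problem guesses can be compared directly with the simulated deflations, and the group labels are names chosen for the example.

```python
SIM_MEAN = 0.76   # psi, average deflation in Exponent's simulations
SIM_SD = 0.11     # psi, standard deviation in Exponent's simulations

# Guessed deflation for each group (psi); a negative value means inflation.
GUESSES = {
    "seven-ball group": 0.38,
    "two-ball group": 0.0,
    "ball #7": -0.5,
    "cold outlier": 0.76,
}


def z_score(deflation_psi):
    """Distance of a guessed deflation from the simulated mean, in SDs."""
    return (deflation_psi - SIM_MEAN) / SIM_SD


z_scores = {name: z_score(psi) for name, psi in GUESSES.items()}
for name, z in z_scores.items():
    print(f"{name}: {z:+.1f} standard deviations from the simulated mean")
```

Only the cold outlier sits at the simulated mean. Every other group is more than three simulated standard deviations away from it.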
Variability in the Wells Report
With the above background on nuances of variability, I’ll now consider the analysis in the Wells Report and Snyder’s criticism.
The Wells Report reported that the standard deviation of Patriot pressure measurements was 0.40 psi (11 balls), as compared to 0.144 psi (4 balls) for the Colt balls, but Exponent was unable to find that this difference was “statistically significant” because the small number of Colt balls meant that standard tests had little power.
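The variance comparison can be sketched with a plain F ratio built from the two reported standard deviations. This is an illustration under stated assumptions, not Exponent’s calculation: it uses a one-sided test with the larger variance in the numerator, and it gets the tail probability by numerically integrating the F density with Simpson’s rule so that only the standard library is needed. The function names are made up for the example.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom.

    Returning 0 at x <= 0 is valid here because d1 > 2.
    """
    if x <= 0:
        return 0.0
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = ((d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log1p(d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)


def f_cdf(x, d1, d2, steps=20000):
    """P(F <= x) by Simpson's rule on [0, x]; steps must be even."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3


sd_patriots, n_patriots = 0.40, 11
sd_colts, n_colts = 0.144, 4

f_ratio = (sd_patriots / sd_colts) ** 2          # ratio of sample variances
p_one_sided = 1.0 - f_cdf(f_ratio, n_patriots - 1, n_colts - 1)
print(f"F = {f_ratio:.2f}, one-sided p = {p_one_sided:.3f}")
```

With only three degrees of freedom in the denominator, even a variance ratio this large falls short of the 5% threshold, which is the lack of power referred to above.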
In their simulations, Exponent observed that the difference in measurements between extremes of dry and wet balls was relatively small as follows:
“the maximum differential observed between the dry and wet footballs tested under the same conditions was only approximately 0.3 psig”.
This text is somewhat inconsistent with Figure 27 which shows a differential of ~0.5 psi, but that’s a different story. Either way, the observed pairwise differences between Patriot measurements (“fluctuations” in Exponent’s terminology) were much higher than those observed in their simulations – an issue that Snyder did not confront. Exponent presented the table of pairwise differences shown below:
Table 1. Exponent’s table showing pairwise differences in Patriot measurements. Exponent commented as follows: “There are seven pairs of measurements (highlighted in orange and red) in which the drop in pressure between the earlier ball tested and the later ball tested is greater than or equal to 0.75 psig, and there are three pairs of measurements (highlighted in red) in which the drop in pressure between the earlier ball tested and later ball tested is greater than or equal to 1.0 psig.”
From this, they concluded that variations in wetness could not account for the very large variations in Patriot ball pressures and that the Patriot balls measured by officials did not “all start the game at or near the same pressure” (though they didn’t define “near”):
Specifically, the fluctuations in the halftime pressures of Patriots footballs exceed in magnitude the fluctuations that can be attributed to the combined effects of the various physical, usage, and environmental factors we examined. Therefore, subject to discovery of an as yet unidentified and unexamined factor, the most plausible explanation for the variability in the Patriots halftime measurements is that the 11 Patriots footballs measured by the officials at halftime did not all start the game at or near the same pressure.
However, these findings need to be interpreted in light of the analysis in Figure 1. The orange and red cells all occur in rows 1, 6 and 7, rows that correspond to the three balls in the two upper ellipses in Figure 1. In the analysis presented above, these three balls did not start the game at the same pressure as the other eight balls because of different gauging and selection, not because of washroom deflation.
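The concentration of large drops in rows 1, 6 and 7 can be checked with a short sketch of the pairwise comparison. The readings are the halftime Non-Logo values as I understand them from the Wells Report; they are not listed in this post, so treat them as an assumption to be checked against the report. The function name is made up for the example.

```python
# ASSUMED halftime Non-Logo readings (psig), in ball order 1 to 11.
HALFTIME = [11.50, 10.85, 11.15, 10.70, 11.10, 11.60,
            11.85, 11.10, 10.95, 10.50, 10.90]


def large_drops(readings, threshold):
    """Pairs (earlier ball, later ball, drop) whose drop is >= threshold psi."""
    pairs = []
    for i, earlier in enumerate(readings):
        for j in range(i + 1, len(readings)):
            # Round to hundredths so 0.75 is not missed through float error.
            drop = round(earlier - readings[j], 2)
            if drop >= threshold:
                pairs.append((i + 1, j + 1, drop))
    return pairs


drops_075 = large_drops(HALFTIME, 0.75)
drops_100 = large_drops(HALFTIME, 1.0)
rows_involved = sorted({earlier for earlier, _, _ in drops_075})

print(f"pairs with drop >= 0.75 psi: {len(drops_075)}")
print(f"pairs with drop >= 1.00 psi: {len(drops_100)}")
print(f"earlier balls involved: {rows_involved}")
```

The counts can be compared against the seven and three pairs that Exponent reported, and the earlier balls involved against the rows named above.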
Snyder
Brady’s expert witness, Edward Snyder, argued that it was “improper” to proceed with further comparison of variations, once the initial comparison had not yielded a “statistically significant” result:
Secondly, Exponent looked at the variation and the measurements between the Patriots’ balls and the Colts’ balls at halftime. They compared the variances. And despite conceding that there was no statistically significant difference between the two, they went ahead and drew conclusions, but those conclusions are improper.
Later, Snyder expounded further that it was not “sound practice” to draw conclusions from an analysis that did not result in a finding of “statistical significance”:
Q. Let’s go to the next slide. And what did Exponent conclude as a statistical matter about variability?
A. No statistical — no statistically significant difference.
Q. Did they stop there?
A. No. They continued, which is striking, because, whereas in the difference in difference analysis, they adopted the standard five percent as the benchmark, here, they said, no, we will just continue on and reach conclusions. And it’s right here at the bottom.
So without having found anything that’s nevertheless they have a statement that begins in their report, “therefore.”
Q. And in your experience, as a statistical matter, is it a sound practice to draw conclusions from an analysis which doesn’t reach statistical significance?
A. No.
Exponent rejected Snyder’s criticism on this (and other points) on two grounds. First, their analysis was not limited to the comparison of variability between Patriot and Colt balls, but also included comparison to their simulations.
Then we went and did all that physical testing. We saw the effect of all those other parameters, the effect or no effect of those parameters. We looked at that and then we went back and looked at the variability of the data comparing, at the same time looking at the variation of the balls, individual balls. And could we account in the difference in pressures based on other physical factors. And the ranges and variability of factors were not predicted by the effect of, say, ball wetness and ball dryness that we saw. So we went back and said, you know, there is variability in 2 here. The statistical analysis you can’t conclude, but based on a review of the fluctuations in the data and looking at the physical experiments that we did, we concluded that there is a difference there and that difference is most likely the differences in starting pressure of the footballs, two different analyses. The statistical analysis did not preclude us from going back and looking at the physical realities that we measured. And that’s what we did to come to that conclusion.
Second, they also argued that the p-value of the F-test entitled them to take notice, even if it was greater than 5%:
And similarly, if you take Case 2 which is making a larger adjustment, the reported p-value’s in the neighborhood of .2, a little bit above .2. Again, if you do the analysis without imposing an equal variances assumption, you get a p-value that’s below ten percent. So it’s statistically significant at the ten percent level, not at the five percent level. The other important point in thinking about statistical significance is that it’s not a black or white line at .05. And there’s no direct way that you can connect .05 certainly to a legal standard for preponderance of evidence. So it’s not that if you are .04, it’s more likely than not, and if you are .06, it’s less likely than not. We have to be clear about that. So for all of those reasons, I think that first finding is without foundation.
Because Exponent had also purported to justify their statistical analyses without consideration of timing as a sort of “preliminary” gatekeeping, Kessler scored some points as to why they didn’t do the same thing with their analysis of differences in variability. In my opinion, this was a better rhetorical than analytic point. I don’t have any issue with Exponent analysing variability; only that they didn’t do enough analysis or that their analysis wasn’t insightful enough.
Snyder also speculated that the greater variability among Patriot balls arose from timing, pointing out that the Colt balls, being measured later, were closer to the asymptote, but did not quantify the impact of this observation.
Q. Even putting aside the fact that Exponent’s results were not statistically significant, are you aware of any explanation for greater variability among Patriots’ balls compared to Colts’ balls?
A. I’m not here to offer scientific insights. I don’t know if the first-half conditions could lead to more variance. I’m just going to focus on the scientific guidance provided by Exponent. And recognizing that the Colts’ balls were measured some time in here (indicating). They are measured at a relatively flat part of the curve (indicating). And if you sample from a relatively flat part of the curve, you get less variance. And this was not considered by Exponent when they made this comparison and reached the “therefore” conclusion.
While timing has some effect on the variability, in my opinion, it is a secondary issue, and does not explain the actual variability.
Conclusion
The variability in Patriot pressures is larger than the variability in Colt pressures, but the form of variability is very odd and, in my opinion, is more indicative of inconsistent gauges and even mistaken inclusion of a Colt ball, than of erratic washroom deflation.
Exponent dismissed the idea that Patriot balls might have been initialized using the Logo gauge on the grounds that Anderson’s pregame measurements were more or less consistent with each team’s own measurements prior to tendering the balls. I’ve argued (as had AEI and MacKinnon) that it was entirely possible that Anderson switched gauges between Patriot and Colt pregame measurements (as did NFL officials at halftime even under its heightened scrutiny). This scenario removes the similarity of Colt team pregame measurements as an issue, since it adopts the assumption that Colt balls were initialized with the Non-Logo gauge, and only requires similarity between Patriot pregame measurements and Anderson’s Logo gauge.
There are two ways that this could have happened, both of which were alluded to in Kessler’s examination. One possibility is that the Patriot gauge, like Anderson’s gauge, was older and had gone slightly off calibration. A second possibility is that the Patriots had done their pregame measurements while the balls were still warm from gloving, rather than waiting for them to cool down. While Kessler raised these issues, he didn’t close off either one, getting lost on the relevance of gloving after setting up the issue.
The other alternative – and one not squarely addressed in the hearing – is that the Patriots had, for some inexplicable reason, deflated their balls by the implausibly small amount of 0.38 psi, exactly the bias between Anderson’s two gauges. That coincidence, in my humble opinion, is wildly more improbable than the Patriots using an old and slightly off-calibration gauge or doing their pregame measurements when the balls were still slightly warm from gloving.
However, as discussed above, simply assuming Logo gauge initialization doesn’t solve the problem of variability and, in particular, it creates real interpretation problems for three balls that end up being too “warm”, for which I’ve proposed other alternatives. I recognize that reasonable people may regard these alternative explanations as themselves implausible. But, when the details are examined, erratic washroom deflation is not nearly as plausible an explanation for the observed variation as Brady critics assume.