The Nima Gluten Sensor has launched in the UK to mixed response. I’ve written a piece for Foods Matter providing an overview of the product and looking at the research into its performance and limitations. But in this post I specifically want to explore the often-quoted 96.9% accuracy figure associated with the Nima, to look at how it was derived and what it actually represents.
The precise wording of the claim is this: “Nima detects gluten at 20ppm [20 parts per million] and above at 96.9% accuracy” — and that all-important figure comes from this study, published online in mid-2018, and conducted by the Nima team.
They used both gluten-spiked and unspiked gluten-free food samples as well as various foods from catering outlets for their research.
The first thing to point out is that quite a few of the food samples, as you can see from Figure 5a here, fall below the '20ppm and above' range, and yet appear to have contributed towards the calculation of the 96.9% figure (which I'll come to). Why Nima claim the figure applies only to samples of 20ppm and above, I can't answer.
Either way, '20ppm and above' covers the enormous range of 20ppm to 1,000,000ppm, and it would be foolish to assume (supposing an 'accuracy' figure can be confidently defined and measured at all) that the figure applies equally across every point on that spectrum. It's more sensible to expect accuracy to be lower at the lower end of the range and higher at the upper end.
But anyway, there is a more serious problem.
It concerns the inclusion of samples within the 2–20ppm range — in other words, samples which contain “detectable gluten” but also fall under a “gluten free” definition.
The crux of much of what I have to say about Nima hinges on this, and the Nimalites out there have failed to see it.
The researchers ran Nima on 447 samples whose gluten content had already been established by other testing methods.
Of those 447, 31 gave an ‘error’ reading on Nima, leaving 416 successful results.
Nima gives one of two results — either ‘gluten found’, or a smiley face (gluten free).
The results are collated into the table shown in 5b, just below the one referred to above. Take a close look at it here.
The top half represents the results. I'll break the 416 down into the four outcomes:
1/ Nima returned 'gluten found' for samples containing gluten at 2ppm or above 284 times. (This was deemed a 'true' result: a true positive.)
2/ Nima returned 'gluten found' for samples containing less than 2ppm (ie either zero gluten or undetectable gluten) 10 times. (This, a false result: a false positive.)
3/ Nima returned a 'smile' for samples containing 20ppm or more gluten 3 times. (A false result: a false negative.)
4/ Nima returned a 'smile' for samples containing less than 20ppm gluten 119 times. (A true result: a true negative.)
The bottom half of the table holds the calculations, and this is where the 96.9% figure appears. It is calculated by summing the true results (ie 284 + 119 = 403) and dividing by the total of true and false results (416). That gives 0.969, or 96.9%.
Note first that if the 31 discounted errors are instead counted against the device, the accuracy drops to 90.2% (403 out of 447).
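The arithmetic behind both figures can be reproduced in a few lines. Here's a sketch using the counts from table 5b (the variable names are mine, not the study's):

```python
# Counts taken from table 5b of the Nima study.
true_positives  = 284  # 'gluten found' on samples of 2ppm or above
true_negatives  = 119  # smile on samples below 20ppm
false_positives = 10   # 'gluten found' on samples below 2ppm
false_negatives = 3    # smile on samples of 20ppm or above
errors          = 31   # readings discarded as device errors

successful = true_positives + true_negatives + false_positives + false_negatives
accuracy = (true_positives + true_negatives) / successful
print(successful)       # 416
print(accuracy)         # 0.96875 -- the quoted 96.9%

# Counting the 31 discarded errors against the device instead:
total_tested = successful + errors  # 447
print(round((true_positives + true_negatives) / total_tested, 3))  # 0.902
```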
But further undermining the validity of the purported accuracy figure is the following information, given to me by Nima when I queried the data. It concerns foods in the 2–20ppm range, of which there are a number in the study; their example uses a 15ppm sample:
“For a 15ppm sample, such as the carrot currant donut, the donut is placed in gluten or gluten free condition depending on the Gluten Found or Smile result of the sensor. If the 15ppm donut is gluten found, then it will be >2ppm condition. But if the 15ppm donut is a Nima smile, then it will be <20ppm condition.”
Take a moment to absorb what is being admitted to here.
ALL samples in that critical 2–20ppm range were counted either as a true positive or a true negative, depending on what result the Nima gave.
And ALL true results counted positively towards the accuracy figure in the calculation shown above.
There is no way for samples in that 2–20ppm range to give a false result, by this standard. The Nima will ALWAYS be correct, if you set these parameters for ‘accuracy’, because both possible results are ‘accurate’. Had researchers simply selected samples within that range, the accuracy result according to their formula would have been 100%. The Nima would have no way of failing.
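To make that concrete, here is a minimal sketch of the scoring rule as I understand it from the quoted explanation (the function is my own reconstruction, not Nima's code). For any sample in the 2–20ppm band, both possible readings are scored as 'true', so even a coin flip scores 100% on that band:

```python
import random

def scored_correct(true_ppm, device_result):
    """Score one reading under the study's rule as described to me:
    'gluten found' is correct for any sample of 2ppm or above,
    a smile is correct for any sample below 20ppm."""
    if device_result == "gluten found":
        return true_ppm >= 2
    else:  # smile
        return true_ppm < 20

# Samples entirely within the 2-20ppm band, 'read' by a coin flip:
random.seed(0)
samples = [2, 5, 10, 15, 19.9]
results = [random.choice(["gluten found", "smile"]) for _ in samples]
accuracy = sum(scored_correct(p, r) for p, r in zip(samples, results)) / len(samples)
print(accuracy)  # 1.0 -- in this band, no reading can ever be scored wrong
```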
It's not unlike asking a die how good it is at providing you a whole number greater than 0 and less than 7, and awarding it a point for accuracy every time you roll it.
If you’re struggling with this — and I did for some time — then imagine a proposed blood test for people under 20 (bear with me) to reveal whether the donor is a child or adult.
And then imagine it gives one of two results — not ‘child’ or ‘adult’, but instead ‘child’ or ‘teenager’.
A ‘child’ result is unambiguous, but a ‘teenager’ result doesn’t tell you whether it’s an adult teenager (ie 18 or 19) or a child teenager.
Now consider Nima. This is a test ostensibly designed to determine whether a food is safe or unsafe for coeliacs. For that to be the case, it needs to distinguish between 'gluten free' and 'not gluten free'.
But Nima doesn’t do that. It instead distinguishes between ‘gluten free’ and ‘gluten detected’.
But in the same way that you don’t know whether a ‘teenager’ result belongs to an adult or a child, you also don’t know whether a ‘gluten detected’ result belongs to a ‘gluten free’ or ‘not gluten free’ sample either.
And if you only tested teenagers aged 13 to 17, the fictional blood test would never be wrong: either possible result, 'child' or 'teenager', would be correct, but might fail to tell you what you need to know.
That's the Nima for you, people. At the critical data range of 2–20ppm, where it really matters, it will never be wrong by its own defined standards and binary result, but it may fail to tell you what you need to know: namely, whether the food is within or outside safe levels.
You may be satisfied with this, but consider that you could put ‘gluten detected’ and ‘gluten free’ on opposite sides of a coin, toss it, and get 100% accuracy, according to the standards described in that research paper, when applied to this range of foods.
If you want to pay for that, it's your prerogative.
It’s all about input and output.
What questions you ask of a test, and how you judge the answers you receive.
If you’re fairly lenient about the acceptability of the answers you receive, and choose questions to suit the outcome beneficial to you, then you will score well.
But this is bias, should it need spelling out.
Accepting either of the only two possible answers, and including samples which will give either of those two acceptable answers, is not a stringent test in my view.
Suppose you take errors into account; remove outliers such as 0ppm or very high-ppm foods, which are statistically unlikely to give a mistaken result given the chemistry involved; concentrate instead on borderline foods in the, say, 5–40ppm range; and set the standard that 'gluten detected' means 'not gluten free' (which is how many users might interpret that result). Then, I imagine, the 'accuracy' figures would be modest. If pushed, I'd guess around 70%. Remember that a binary test answered at random will be correct about 50% of the time by chance alone.
I quite understand why Nima did not do this. And as it turns out, 96.9% gives Nima an excellent but imperfect, and hence more believable, figure, which is being obligingly trotted out by unquestioning bloggers paid to promote the product and sing the praises of its 'accuracy'.
If you ask me whether or not it’s raining outside and I tell you it’s 18 degrees, that may well be true, but it does not tell you whether you need an umbrella.
Would that make me an ‘accurate’ weatherman? According to Nimalite logic, yes.
But you might conclude otherwise.