```r
# Table 3 of Coll & Li (2018): monthly counts comparing MODIS snow cover
# against SNOTEL observations, plus the reported summary metrics.
#   A = true positive, B = false negative, C = false positive,
#   D = true negative, E = null positive,  F = null negative
#   AC   = accuracy over valid pixels, (A + D) / (A + B + C + D)
#   AA   = accuracy over all pixels,   (A + D) / (A + B + C + D + E + F)
#   MCC  = Matthews correlation coefficient
#   A_AB = A / (A + B);  D_DC = D / (D + C)
snow_acc <- data.frame(
  Month = 1:12,
  A    = c(99134, 80492, 83064, 63886, 30313, 5538, 171, 8, 1237, 18951, 64085, 79714),
  B    = c(5473, 3779, 6284, 13287, 21423, 11733, 2268, 140, 1033, 11432, 16762, 6627),
  C    = c(397, 330, 502, 1004, 1937, 1441, 790, 1151, 2136, 5737, 3620, 985),
  D    = c(1355, 1655, 4407, 16302, 59356, 130360, 179941, 174369, 166161, 108457, 25111, 3292),
  E    = c(163031, 159109, 172667, 152530, 98668, 25731, 1586, 172, 3415, 44574, 125803, 175798),
  F    = c(1839, 1827, 4366, 15548, 59702, 88136, 87754, 97594, 91506, 81879, 27115, 4859),
  AC   = c(0.9448, 0.9524, 0.9280, 0.8487, 0.7933, 0.9116, 0.9833, 0.9927, 0.9814, 0.8812, 0.8140, 0.9160),
  AA   = c(0.3705, 0.3323, 0.3224, 0.3054, 0.3304, 0.5168, 0.6609, 0.6377, 0.6305, 0.4701, 0.3398, 0.3060),
  MCC  = c(0.3745, 0.4869, 0.5797, 0.6420, 0.6116, 0.4693, 0.1043, 0.0170, 0.4381, 0.6209, 0.6035, 0.4707),
  A_AB = c(0.9477, 0.9552, 0.9297, 0.8278, 0.5859, 0.3207, 0.0701, 0.0541, 0.5449, 0.6237, 0.7927, 0.9232),
  D_DC = c(0.7734, 0.8338, 0.8977, 0.9420, 0.9684, 0.9891, 0.9956, 0.9934, 0.9873, 0.9498, 0.8740, 0.7697)
)
```
Accuracy matrix and metrics
Accuracy
When communicating the results of an analysis, one of the first questions anyone will ask is whether you calibrated/validated your model. This is often a loaded question for a number of reasons, but one of the most addressable is to provide the accuracy metric of your choice. While not always the case, this generally involves comparing two rasters, a prediction and a reference (the "ground truth"), to create what is known as a "confusion matrix". That combination of values can be mathematically reduced to a single number, and that singular number is usually all that's reported.
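For instance, here is a minimal sketch of that raster-to-matrix step with the terra package, using two made-up binary layers (1 = snow, 0 = no snow) rather than any real product:

```r
# Build a confusion matrix from two categorical rasters: a hypothetical
# prediction layer and a reference ("ground truth") layer.
library(terra)
set.seed(1)
pred <- rast(nrows = 10, ncols = 10)
ref  <- rast(nrows = 10, ncols = 10)
values(pred) <- sample(0:1, 100, replace = TRUE)
values(ref)  <- sample(0:1, 100, replace = TRUE)
names(pred) <- "prediction"
names(ref)  <- "reference"
crosstab(c(pred, ref))  # 2x2 cell-count table: prediction vs reference
```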
This can, of course, get the box checked at the end of the day, but it leaves quite a lot of room for improvement. One of the "easiest" ways to improve on this baseline is to repeat the accuracy assessment at regular intervals. A single marker in that time series may not mean much in isolation, but when you can show steady or incremental improvement in that metric over time, it lends confidence to the underlying process representations you are applying in your model.
The other game to play here is, of course, "which metric do you choose?" Much like an extremely accessible form of p-hacking, a team can take catchment-aggregated percent bias as their room-temperature measurement: if each water model version improves on that number, the iteration is read as a sign that the effort is working. That is misleading, not only because the outcome hinges on a single metric with no spatial significance, but also because the calibration and validation data are treated as a monolithic, stable set of observations rather than a living collection of data points with seasonality and even event-based structure.
| | Prediction Positive | Prediction Negative | Prediction Null |
|---|---|---|---|
| Truth Positive | True Positive (A) | False Negative (B) | Null Positive (E) |
| Truth Negative | False Positive (C) | True Negative (D) | Null Negative (F) |
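As a sketch of how those single numbers fall out of the table, here is the reduction applied to the `snow_acc` data frame defined at the top of this post; the AC and AA definitions are inferred from the reported values, which these formulas reproduce.

```r
# Reduce the monthly confusion-matrix cells to single-number summaries.
# These should reproduce the AC, AA, and MCC columns of the data frame.
with(snow_acc, {
  acc_valid <- (A + D) / (A + B + C + D)          # AC: nulls excluded
  acc_all   <- (A + D) / (A + B + C + D + E + F)  # AA: nulls included
  mcc <- (A * D - B * C) /
    sqrt((A + B) * (A + C) * (D + B) * (D + C))   # Matthews correlation coef.
  round(data.frame(Month, acc_valid, acc_all, mcc), 4)
})
```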
Common metrics
The hydrologic and hydraulic communities have a select set of accuracy metrics and traditional workflows that they reach for when trying to quantify the skill or performance of a model.
| Name | Context | Reflects |
|---|---|---|
| Bias | Continuous | Long-term average error (dry/wet); also reflects the forcings |
| RMSE / RSR / Nash-Sutcliffe | Continuous | A set of standard variance estimators, sometimes applied to log(Q) to de-emphasize extremes |
| Correlation | Continuous | Linear correlation coefficient (Pearson's r) |
| Kling-Gupta efficiency | Continuous | Rational decomposition of error components (linear correlation, variability bias, and mean bias) |
| Peak discharge | Event | Ability of the model to predict Qp for events |
| Stormflow error | Event | Ability of the model to predict Rp for events |
| Conditional | Event | Occurrence of an event (TP, FP, TN, FN) |
| Peak timing | Event | Ability of the model to predict Tp for events |
| Hydrologic signature | Mixed | Other information content in the streamflow signal |
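To make the continuous rows of that table concrete, here is a minimal sketch of three of them as plain R functions, where `sim` and `obs` stand in for hypothetical simulated and observed flow series of equal length:

```r
pbias <- function(sim, obs) 100 * sum(sim - obs) / sum(obs)  # percent bias

nse <- function(sim, obs) {                                  # Nash-Sutcliffe
  1 - sum((sim - obs)^2) / sum((obs - mean(obs))^2)
}

kge <- function(sim, obs) {                                  # Kling-Gupta
  r     <- cor(sim, obs)        # linear correlation
  alpha <- sd(sim) / sd(obs)    # variability bias
  beta  <- mean(sim) / mean(obs)  # mean bias
  1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)
}

# Applying NSE to log-transformed flows de-emphasizes the extremes:
# nse(log(sim_q), log(obs_q))
```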
Hydrologists lean on several common metrics, but one of the most common is the Nash-Sutcliffe efficiency.
Carson plots provide a rich overview of comparisons by tracking three of the four confusion matrix values alongside an aggregate metric of the designer's choice.
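As a hedged sketch of that idea (the exact layout here is my own guess), a base-R version built on the Table 3 data frame might track A, B, and C as monthly fractions with MCC overlaid:

```r
# Three confusion-matrix cells as monthly fractions of valid pixels,
# with an aggregate metric (MCC) drawn over the top.
valid <- with(snow_acc, A + B + C + D)
matplot(snow_acc$Month, cbind(snow_acc$A, snow_acc$B, snow_acc$C) / valid,
        type = "l", lty = 1, col = c("forestgreen", "orange", "red"),
        xlab = "Month", ylab = "Fraction of valid pixels")
lines(snow_acc$Month, snow_acc$MCC, lwd = 2)  # aggregate metric, in black
legend("topleft", c("A (TP)", "B (FN)", "C (FP)", "MCC"),
       col = c("forestgreen", "orange", "red", "black"), lty = 1)
```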
SNOTEL is a network of stations, set up primarily along the Rockies, equipped with a series of advanced sensors to measure precipitation. The star of the show is the snow pillow, which measures the weight of the snow sitting on top of it and can be used to calculate the snow water equivalent.
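The conversion itself is just hydrostatics; as a rough sketch with a hypothetical reading (not SNOTEL's published processing), a 1 kPa pressure corresponds to roughly 102 mm of snow water equivalent:

```r
# Hypothetical conversion from a snow-pillow pressure reading to SWE:
# h = P / (rho * g), expressed in millimeters of water.
pillow_kpa <- 2.5                                   # example reading (kPa)
swe_mm <- pillow_kpa * 1000 / (1000 * 9.81) * 1000  # Pa / (kg/m^3 * m/s^2) -> mm
swe_mm
#> [1] 254.842
```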
Metrics matter
I don't have many pet peeves, but one of them is listening to a conversation that poorly deploys or interprets accuracy metrics. Much of that frustration comes because those conversations tend to wield these otherwise powerful aggregate statistics in ways that are inappropriate, or because the "blame" for poor performance is pinned on a misunderstanding of how variations in the confusion matrix do or do not impact the resulting calculation. One of the simplest ways I can illustrate this is with a paper on snow cover accuracy I wrote many moons ago, in which we compared the accuracy of two measurements: a SNOTEL snow pillow and a MODIS snow cover pixel. Taking the data from Table 3 of [@collComprehensiveAccuracyAssessment2018] (loaded into the `snow_acc` data frame at the top of this post), the seasonal swing in the metrics makes the point on its own.
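```r
# In late summer, overall accuracy (AC) peaks while the MCC collapses,
# because there is almost no snow left for the product to find.
snow_acc[snow_acc$Month %in% c(7, 8), c("Month", "AC", "MCC")]
#>   Month     AC    MCC
#> 7     7 0.9833 0.1043
#> 8     8 0.9927 0.0170
```

A model can be "right" nearly everywhere in July and August simply because almost nothing is snow-covered, so the headline accuracy is at its annual best exactly when the number carries the least skill on the positive class.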
Accuracy in the context of snow cover presence
See the full paper, "Comprehensive accuracy assessment of MODIS daily snow cover products and gap filling methods", in the ISPRS Journal of Photogrammetry and Remote Sensing.
One of my earliest forays into accuracy reporting was comparing how well the fractional snow cover value from MODIS agreed with SNOTEL sites. As a new student to the discipline, I had quite a few questions about how best to report model behavior, and I was interested in describing the way we study accuracy.
This study was quite novel for its time, due to the sweeping scale that Google Earth Engine provided and the cutting-edge nature of the fractional snow cover band contained in the then-new MODIS V6 product. When I have the time and direction to push back into this space, I feel a site-specific exploration of these properties would be insightful.