QC Tests

This page describes the QC tests applied to the Coastal Monitoring Program Water Quality data, and the general methods for selecting the most appropriate thresholds for each test.

Thresholds

Where possible, the thresholds for each QC test and variable were determined from historical data, which provides a baseline of “normal” and “outlying” conditions. The historical data used here was the Coastal Monitoring Program Water Quality datasets submitted to the Nova Scotia Open Data Portal in December 2022. Preliminary quality control measures (e.g., removal of obvious outliers and suspected biofouling) were applied to these datasets before submission. Additional QC was applied where required throughout the thresholds analysis. For example, freshwater and other outlier stations were excluded to provide a better representation of “normal” coastal ocean conditions.

The historical data was reviewed carefully prior to calculating thresholds. Depending on the number of observations and the spatial and temporal resolution of observations, data was pooled together or separated into different groups (e.g., county, sensor type).

The distribution of observations was then reviewed to determine which statistics to use to quantify outlying conditions. The mean plus/minus 3 standard deviations was used for relatively normally distributed variables (OOI 2022), while upper percentiles were used for skewed distributions.

These thresholds may be re-evaluated in several years, when more data is available.

QC Tests

Three QARTOD tests, two CMAR-developed tests, and a human in the loop test were applied to the CMAR Water Quality data.

Automated QC tests were applied to each sensor string deployment using the CMAR-developed R package qaqcmar, which is available to view and install from GitHub. The human in the loop test was applied during data review using the qc_tests_water_quality R project, which is also available on GitHub.

Gross Range Test

Following QARTOD, the Gross Range Test aims to identify observations that fall outside of the sensor measurement range (flagged Fail) and observations that are statistical outliers (flagged Suspect/Of Interest).

Thresholds for failed observations are named \(sensor_{min}\) and \(sensor_{max}\), and are determined by the sensor specifications. CMAR assigned these thresholds for each variable and sensor based on information in the associated manuals.

Thresholds for suspect/of interest observations are named \(user_{min}\) and \(user_{max}\). CMAR assigned these thresholds based on historical Coastal Monitoring Program data.

Following the OOI Biogeochemical Sensor Data: Best Practices & User Guide, these thresholds were calculated from historical data as the mean +/- three standard deviations (Equation 1, Equation 2):

\[ user_{min} = avg_{var} - 3 * stdev_{var} \tag{1}\]

\[ user_{max} = avg_{var} + 3 * stdev_{var} \tag{2}\]

where \(avg_{var}\) is the average of the variable of interest, and \(stdev_{var}\) is the standard deviation of the variable of interest.
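As a minimal sketch of this calculation (not the qaqcmar implementation), the user and sensor thresholds could be derived and applied in R as follows. The observation values, sensor limits, and numeric flag scheme (1 = Pass, 3 = Suspect/Of Interest, 4 = Fail, following the usual QARTOD convention) are illustrative assumptions:

```r
# Hypothetical historical observations for one variable
hist_obs <- c(4.2, 5.1, 6.8, 7.3, 5.9, 6.1, 22.4, 5.5)

avg_var   <- mean(hist_obs, na.rm = TRUE)  # average of the variable
stdev_var <- sd(hist_obs, na.rm = TRUE)    # standard deviation of the variable

user_min <- avg_var - 3 * stdev_var        # Equation 1
user_max <- avg_var + 3 * stdev_var        # Equation 2

# Sensor measurement range from the manufacturer specifications (example values only)
sensor_min <- -5
sensor_max <- 35

# Apply the test to a new deployment: Fail (4) outside the sensor range,
# Suspect/Of Interest (3) outside the user range, Pass (1) otherwise
obs <- c(6.3, 5.8, 28.0, 36.1)
gross_range_flag <- ifelse(
  obs < sensor_min | obs > sensor_max, 4,
  ifelse(obs < user_min | obs > user_max, 3, 1)
)
```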

Climatological Test

The Climatological Test is a variation of the Gross Range Test that accounts for seasonal variability. Under QARTOD, there is no Fail flag associated with this test for temperature, salinity, or dissolved oxygen due to the dynamic nature of these variables (IOOS 2020, 2018). Following this guidance, CMAR chose to assign the flag Suspect/Of Interest to seasonal outliers for all variables.

The Climatological thresholds are named \(season_{min}\) and \(season_{max}\). Following the OOI Biogeochemical Sensor Data: Best Practices & User Guide, seasons were defined based on the calendar month, and the thresholds were based on historical data. The monthly thresholds were defined similar to the Gross Range Test:

\[ season_{min} = avg_{season} - 3 * stdev_{season} \tag{3}\]

\[ season_{max} = avg_{season} + 3 * stdev_{season} \tag{4}\]

The \(avg_{season}\) was calculated as the average of all observations for a given month, and \(stdev_{season}\) was the associated standard deviation.
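A minimal sketch of the monthly threshold calculation in base R, assuming a hypothetical data frame obs_hist with a timestamp column (Date or POSIXct) and a value column; this is illustrative only, not the qaqcmar implementation:

```r
# Group historical observations by calendar month
obs_hist$month <- as.integer(format(obs_hist$timestamp, "%m"))

avg_season   <- tapply(obs_hist$value, obs_hist$month, mean, na.rm = TRUE)
stdev_season <- tapply(obs_hist$value, obs_hist$month, sd, na.rm = TRUE)

# Monthly thresholds (Equation 3, Equation 4)
season_min <- avg_season - 3 * stdev_season
season_max <- avg_season + 3 * stdev_season
```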

Note that OOI used a more complex method (harmonic analysis) to estimate \(avg_{season}\) to account for spurious values. This was beyond the current scope of the CMAR Coastal Monitoring Program, but could be applied in future iterations of this threshold analysis.

Spike Test

The Spike Test identifies single observations that are unexpectedly high (or low) based on the previous and following observations.

For each observation, a \(spike_{value}\) is calculated based on a spike reference (\(spike_{ref}\)). The \(spike_{value}\) is compared to the Spike Test thresholds and the appropriate flag is assigned.

\[spike_{value} = abs(value - spike_{ref})\]

\[spike_{ref} = (lag_{value} + lead_{value}) / 2\]

where \(lag_{value}\) is the previous observation and \(lead_{value}\) is the following observation.

Due to the dependence on \(lead_{value}\) and \(lag_{value}\), the first and last observations in each sensor deployment will be flagged as Not Evaluated because the \(spike_{ref}\) cannot be calculated.
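As an illustrative sketch (not the qaqcmar code), the spike value can be computed in base R by shifting a vector of observations ordered in time; the values below are hypothetical:

```r
value <- c(5.0, 5.2, 5.4, 9.8, 5.8, 6.0)  # hypothetical observations in time order
n <- length(value)

lag_value  <- c(NA, value[-n])   # previous observation
lead_value <- c(value[-1], NA)   # following observation

spike_ref   <- (lag_value + lead_value) / 2
spike_value <- abs(value - spike_ref)

# spike_ref is undefined for the first and last observations,
# so those spike values are NA (flagged Not Evaluated)
```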

As a simple example, consider several observations that increase linearly over time (Example 1: Figure). Here, the \(spike_{ref}\) is always equal to the observed value, and so the \(spike_{value}\) is zero, indicating no spike detected (Example 1: Table).

Now consider that the value of one of these observations lies above or below the linear pattern (Example 2). This value will have a relatively high \(spike_{value}\), and may be flagged, depending on the threshold values. Note that the observations on either side of the spike may also be flagged; however, spikes that span multiple consecutive observations may not be flagged (Example 2, Example 3).

CMAR uses two Spike Test thresholds: \(spike_{low}\) and \(spike_{high}\). Observations greater than \(spike_{low}\) but less than or equal to \(spike_{high}\) are assigned a flag of Suspect/Of Interest. Values greater than \(spike_{high}\) are assigned a flag of Fail.

Values for \(spike_{low}\) were selected based on the 99.7th percentile of the \(spike_{value}\) for each variable. A percentile was used instead of the mean and standard deviation because the distribution of \(spike_{value}\) is right-skewed for each variable. The value for \(spike_{high}\) was set to 3 * \(spike_{low}\).
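Continuing the sketch above, the threshold selection and flag assignment could look like the following; the historical spike values are hypothetical, and the numeric flag scheme (1 = Pass, 2 = Not Evaluated, 3 = Suspect/Of Interest, 4 = Fail) is assumed here:

```r
# Hypothetical historical spike values for one variable
spike_value_hist <- c(0.02, 0.05, 0.01, 0.10, 0.03, 0.40, 0.08, 1.20)

spike_low  <- quantile(spike_value_hist, probs = 0.997, na.rm = TRUE)
spike_high <- 3 * spike_low

# Flag the spike values calculated for a deployment (spike_value from the sketch above)
spike_flag <- ifelse(
  spike_value > spike_high, 4,            # Fail
  ifelse(spike_value > spike_low, 3, 1)   # Suspect/Of Interest, else Pass
)
spike_flag[is.na(spike_value)] <- 2       # Not Evaluated (first and last observations)
```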

Rolling Standard Deviation

The Rolling Standard Deviation test was developed by CMAR to identify suspected biofouling in the dissolved oxygen data. The test assumes that there is a 24-hour oxygen cycle, with net oxygen production during the day, and net oxygen consumption during the night. Biofouling is suspected when the amplitude of this cycle, as measured by the standard deviation, increases above a threshold (Figure 1).

The rolling standard deviation, \(sd_{roll}\), was calculated from a 24-hour centered rolling window of observations, i.e., \(T_{n-m/2}\), … \(T_{n-1}\), \(T_{n}\), \(T_{n+1}\), … \(T_{n+m/2}\).1 The number of observations in each window depends on the sample interval, which is typically 10 minutes.
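A minimal sketch of the centered rolling window calculation in base R, using simulated 10-minute observations (so a 24-hour window spans roughly 144 observations); this is illustrative only, not the qaqcmar implementation:

```r
# Simulated dissolved oxygen observations recorded every 10 minutes,
# with a 24-hour cycle (illustration only)
set.seed(1)
hours <- seq(0, by = 10 / 60, length.out = 30 * 24 * 6)
value <- 95 + 3 * sin(2 * pi * hours / 24) + rnorm(length(hours), sd = 0.5)

obs_per_day <- 24 * 6            # observations in a 24-hour window
half_window <- obs_per_day / 2   # 12 hours on each side

sd_roll <- rep(NA_real_, length(value))
for (i in seq_along(value)) {
  if (i > half_window && i <= length(value) - half_window) {
    sd_roll[i] <- sd(value[(i - half_window):(i + half_window)])
  }
}
# Observations within 12 hours of the start or end of the deployment
# remain NA and are flagged Not Evaluated
```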

Although this test was designed to identify suspected biofouling, it was also applied to the other Water Quality variables as a general test of the standard deviation. In particular, it is expected to flag rapid changes in temperature due to fall storms and upwelling.

The Rolling Standard Deviation Test threshold is called \(rolling\_sd\_max\). Observations with \(sd_{roll}\) greater than this threshold are flagged as Suspect/Of Interest. This test does not flag any observations as Fail because of the high natural variability in the Water Quality variables. Observations at the beginning and end of the deployment for which the rolling standard deviation cannot be calculated (i.e., observations less than 12 hours from the start or end of the deployment) are flagged Not Evaluated.

Values for \(rolling\_sd\_max\) were selected based on the mean and standard deviation or an upper percentile of \(sd_{roll}\), depending on the distribution of the observations for each variable.
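For illustration only, the two options could look like the sketch below, continuing from the rolling window example above; the 99.7th percentile is an assumed example, not the value used for any particular variable:

```r
# Treat the simulated rolling standard deviations as the historical distribution
sd_roll_hist <- sd_roll[!is.na(sd_roll)]

# Roughly normal distribution: mean plus three standard deviations
rolling_sd_max <- mean(sd_roll_hist) + 3 * sd(sd_roll_hist)

# Right-skewed distribution: an upper percentile (99.7th shown as an example only)
rolling_sd_max <- quantile(sd_roll_hist, probs = 0.997)
```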

Figure 1: Simulated dissolved oxygen data and associated flags from the rolling standard deviation test.

Depth Crosscheck

The Depth Crosscheck Test was developed by CMAR to flag deployments where the measured sensor depth does not align with the estimated sensor depth in the sensor_depth_at_low_tide_m column.

For this test, the difference between the minimum value of the measured depth and the estimated depth is calculated. If the absolute difference (\(abs_{diff}\)) between the two is greater than the threshold \(depth\_diff\_max\), the test results in a flag of Suspect/Of Interest.

\[abs_{diff} = abs(sensor\_depth\_at\_low\_tide\_m - min\_measured\_depth\_m) \]

\(depth\_diff\_max\) was determined based on the 95th percentile of the \(abs_{diff}\) from all deployments with measured depth data.
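A minimal sketch of the check for a single deployment; the estimated depth, measured depth observations, and threshold value below are hypothetical examples, not CMAR's actual values:

```r
# Hypothetical deployment: estimated depth and measured depth observations
sensor_depth_at_low_tide_m <- 5
measured_depth_m <- c(6.8, 7.2, 9.1, 8.4, 7.0)
depth_diff_max <- 1.5   # example threshold (95th percentile of historical abs_diff)

abs_diff <- abs(sensor_depth_at_low_tide_m - min(measured_depth_m, na.rm = TRUE))

# Suspect/Of Interest (3) if the threshold is exceeded, Pass (1) otherwise;
# every observation in the deployment receives this flag
depth_crosscheck_flag <- if (abs_diff > depth_diff_max) 3 else 1
```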

Note that all observations from a deployment will have the same depth crosscheck flag. If there is more than one sensor on the string that measures depth, the worst (highest) flag will be assigned to the deployment.

This is because a Suspect/Of Interest flag for the Depth Crosscheck test is an indication that the sensor string was moored in an area deeper (or shallower) than expected. For example, if the string was moored in an area 10 m deeper than anticipated, all sensors will likely be 10 m deeper than recorded in the sensor_depth_at_low_tide_m column.

Human in the Loop

Human experts reviewed the results of the automated QC tests to identify poor quality observations that were not adequately flagged. Results of the automated tests were not changed, but an additional human in the loop flag of Fail was added to identify these observations.

The Suspect/Of Interest flag was not used for this test because the goal was to flag observations that human experts were confident were poor quality.

Situations where observations were flagged as Fail by human experts include:

  • Spikes with multiple observations (e.g., Spike Test Example 3 above).
  • Known issues with the deployment or sensor (e.g., surface buoy cut during deployment, sensor malfunctioned for most of the deployment).
  • Temperature and measured depth observations that were flagged as Suspect/Of Interest by one or more automated tests and considered poor quality after review (e.g., evidence that the sensor was exposed to air at low tide). Note that most Suspect/Of Interest dissolved oxygen and salinity observations were already treated as poor quality.

References

IOOS. 2018. “QARTOD Manual for Real-Time Quality Control of Dissolved Oxygen Observations.” https://ioos.noaa.gov/ioos-in-action/manual-real-time-quality-control-dissolved-oxygen-observations/.
———. 2020. “QARTOD Manual for Real-Time Quality Control of in-Situ Temperature and Salinity Data: A Guide to Quality Control and Quality Assurance for in-Situ Temperature and Salinity Observations.” https://ioos.noaa.gov/ioos-in-action/temperature-salinity/.
OOI. 2022. “OOI Biogeochemical Sensor Data: Best Practices & User Guide.” https://repository.oceanbestpractices.org/bitstream/handle/11329/2112/OOI%20Biogeochemical%20Sensor%20Data%20Best%20Practices%20and%20User%20Guide.pdf?sequence=1&isAllowed=y.

Footnotes

  1. Note that a centered window is possible because the data is being post-processed after collection. Real-time data would likely use a left-aligned window, i.e., observations \(T_{n}\), \(T_{n-1}\), \(T_{n-2}\), … \(T_{n-m}\).