FAQ

What is a hotspot analysis?

A hotspot analysis is based on the local-\(G\) statistics (\(G_i\) or \(G_i^*\)) developed by Arthur Getis and J. Keith Ord in the 1990s (1, 2). It tests whether the values near a location are disproportionately high (a hotspot) or low (a coldspot), given the number of connections that location has. It can be thought of as a form of cluster detection, although it does not formally aggregate spatial units into groups.

References:

1. Getis A, Ord JK (1992) The analysis of spatial association by use of distance statistics. Geographical Analysis 24:189-206.
2. Ord JK, Getis A (1995) Local spatial autocorrelation statistics: distributional issues and application. Geographical Analysis 27:286-306.

Why does the USDSS use \(G_i^*\) and not \(G_i\) as the hotspot statistic?

\(G_i^*\) (pronounced G-eye-star) is more commonly used than \(G_i\). A hotspot analysis based on \(G_i\) might be preferred if you were only interested in testing whether the values surrounding a location were disproportionately high or low, whereas \(G_i^*\) examines whether the values at and around some location are disproportionately high or low.
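
For reference, a commonly cited standardized form of \(G_i^*\), following Ord and Getis (1995), is sketched below. The notation is an assumption made for illustration (\(x_j\) is the value at location \(j\), \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\), and \(n\) is the number of locations), and the exact computation used by the USDSS may differ in detail:

\[
G_i^* = \frac{\sum_{j=1}^{n} w_{ij} x_j - \bar{x} \sum_{j=1}^{n} w_{ij}}{s \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^2 - \left(\sum_{j=1}^{n} w_{ij}\right)^2}{n-1}}},
\qquad
\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j,
\qquad
s = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{x}^2}
\]

The sums include \(j = i\) (with \(w_{ii} \neq 0\)), which is what distinguishes \(G_i^*\) from \(G_i\); \(G_i\) excludes location \(i\) itself.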

What is statistical significance?

When sampling at random from a population, unlikely things can happen, especially if your sample size is small. Imagine that you have a state with 10 counties and each county has a value ranging from 1 to 10. If you were to randomly shuffle the values among counties, it is possible that counties with high values (or, conversely, counties with low values) could form hotspots (or coldspots) due to chance alone. If a test statistic (such as \(G_i^*\)) is statistically significant, a value at least as extreme as the one observed would have a low probability of arising by chance alone.
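
The shuffling thought experiment can be tried directly. The following is a minimal Python sketch, not part of the USDSS, that assumes (purely for illustration) the 10 counties sit in a single row, so that "nearby" simply means adjacent in the list. It estimates how often the three highest values (8, 9, and 10) end up side by side by chance alone. The function names are hypothetical.

```python
import random


def longest_high_run(values, threshold):
    """Return the length of the longest run of consecutive values >= threshold."""
    best = run = 0
    for v in values:
        run = run + 1 if v >= threshold else 0
        best = max(best, run)
    return best


def chance_cluster_rate(n_shuffles=10_000, seed=0):
    """Estimate how often the values 8, 9 and 10 are adjacent after a random shuffle."""
    rng = random.Random(seed)
    values = list(range(1, 11))  # one value (1..10) per hypothetical county
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(values)
        # Only three values are >= 8, so a run of length 3 means all are adjacent.
        if longest_high_run(values, threshold=8) == 3:
            hits += 1
    return hits / n_shuffles
```

Under these assumptions the three highest values land next to each other in roughly 1 of every 15 shuffles, so an apparent hotspot arising by chance is uncommon but far from impossible.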

Does USDSS provide information about the statistical significance of a hotspot or coldspot?

Information about the statistical significance of \(G_i^*\) is not currently provided in the USDSS Analysis Module. Although the standard (\(z\)) score that is output by the hotspot analysis can be used to infer significance, there are caveats. For example, if a county has a limited number of connections, the normal approximation underlying the \(z\)-score may be inaccurate. In such cases, the preferred procedure is to infer significance based on a spatial randomization test, but this approach is too computationally intensive for a web application. In addition to issues with the normal approximation, local spatial statistics such as \(G_i^*\) raise the problem of correcting for multiple tests, because one test is performed per spatial unit. There are many methods and philosophies related to correction for multiple testing, which can make implementations intractable. For these reasons, we encourage you to use the hotspot tool in the Analysis Module for exploratory spatial data analysis and not necessarily for statistical inference.
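
To make these caveats concrete, the sketch below converts \(z\)-scores to two-sided p-values using the normal approximation and then applies a Bonferroni correction, which is only one of the many multiple-testing methods available. It is an illustration in Python, not a feature of the USDSS, and the function names are hypothetical. Keep in mind that the normal approximation itself may be inaccurate for counties with few connections.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_flags(z_scores, alpha=0.05):
    """Flag z-scores whose p-value stays below alpha divided by the number of tests."""
    m = len(z_scores)
    return [two_sided_p(z) < alpha / m for z in z_scores]
```

With the four illustrative \(z\)-scores 1.96, 2.2, 4.16, and -3.76, all four p-values fall below 0.05 before correction, but only the last two remain below the Bonferroni threshold of 0.05 / 4 = 0.0125.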

What are the main differences between a local spatial statistic and a global spatial statistic?

The \(G_i^*\) hotspot statistic is an example of a local spatial statistic. A local spatial statistic is calculated for each spatial unit within a study area. For example, a hotspot analysis in the USDSS Analysis Module conducted at the level of the state of Georgia produces a \(G_i^*\) statistic for each county. The global analogue of the local Getis-Ord statistics is called the General \(G\) statistic. While a global spatial statistic such as General \(G\) also uses spatial weights to identify nearby locations, it is calculated over all pairs of locations and produces a single statistic for the entire sample. For example, if you were to calculate the General \(G\) statistic for the state of Georgia, the result would be a single number that indicates whether high values (or low values) tend to be concentrated among counties that are near each other.
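
As an illustration of "a single statistic for the entire sample", the following Python sketch computes General \(G\) in the form given by Getis and Ord (1992): the weighted sum of cross-products \(x_i x_j\) over all pairs with \(i \neq j\), divided by the unweighted sum. The four units in a row, their values, and the function name are hypothetical, and binary contiguity weights with a zero diagonal are assumed.

```python
def general_g(values, weights):
    """General G: weighted cross-products over all cross-products, excluding i == j."""
    n = len(values)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                num += weights[i][j] * values[i] * values[j]
                den += values[i] * values[j]
    return num / den


# Hypothetical example: four units in a row, each connected to its immediate neighbors.
line_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = general_g([10, 9, 1, 2], line_weights)  # high values adjacent
dispersed = general_g([10, 1, 9, 2], line_weights)  # high values separated
```

For the values 10, 9, 1, 2, where the two high values are adjacent, General \(G\) is about 0.68; rearranging them as 10, 1, 9, 2 gives about 0.25. For this weights matrix, the value expected when there is no spatial association is 0.5. In both cases the whole study area is summarized by one number, unlike \(G_i^*\), which yields one value per unit.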

Local vs. global patterns

Spatial patterns are often the result of several additive processes plus random “noise”. For example, diabetes prevalence could vary at the national level due to underlying differences in racial demographics and socio-economic status among subregions within the United States. On top of such national level patterns there could be localized factors, such as access to healthy foods or local variation in socio-economic factors and demographics; there could also be random variation that is not necessarily dependent on some underlying factor. Local spatial statistics such as \(G_i^*\) are intended to reveal localized patterns, and such localized patterns could be missed in the presence of global patterns. There are techniques for “detrending” data sets to remove global patterns before calculating local spatial statistics, but such techniques are not currently available in the Analysis Module. We therefore encourage you to consider how processes acting at different spatial scales could be driving the results of your hotspot analysis.

What is an “areal unit”? What is a “grid cell”?

There are various forms of spatial data. In spatial analysis an “areal unit” refers to a bounded space (usually represented as some form of polygon) in which some value has been aggregated. In human geography, administrative boundaries (such as a state, county, or census tract) are often used as areal units. In other cases, areal units may be imposed by overlaying something like a square or hexagonal grid over some space and then aggregating values within each “grid cell”. A grid cell is a form of areal unit where all of the units have the same size and shape. Some authors refer to this sort of regularly spaced areal-unit data as contiguous-unit data. In the USDSS Analysis Module, the spatial units are counties, which usually have irregular shapes (an example with grid cells is provided in the help files).

What is a neighbor (in the context of a hotspot analysis)?

A local spatial statistic such as \(G_i^*\) is calculated for each spatial unit (a county in the USDSS) and requires some criterion for defining connections to other spatial units. Spatial units that are located in close proximity to another spatial unit may be informally referred to as “neighbors”. However, beware that in spatial statistics the term “neighbor” may denote a particular type of connection scheme called “k-nearest neighbors” (abbreviated as knn), where the k refers to the number of nearest neighbors or the order of nearest neighbors. Note that the USDSS Analysis Module uses contiguity (counties that border a county) as its criterion for defining connections. Connections based on contiguity in this manner could also be referred to as “1st-order nearest neighbors”.

What is a matrix?

A matrix is a data table with row-column structure, where all of the values are of the same type (such as a discrete integer or continuous numeric). If the number of rows equals the number of columns, the matrix is considered to be “square”; otherwise, it is “rectangular”.

What are spatial weights? What is a spatial weights matrix?

Information about connections between spatial units (counties) is stored in a spatial weights matrix. Most weight matrices use 1s and 0s to represent connections, where a 1 indicates that two units are connected and a 0 indicates that they are not connected. Spatial weight matrices represent a special type of square matrix called a pairwise matrix. In a pairwise matrix, each sample unit is represented in both the rows and columns and the diagonal represents a comparison between a unit and itself. For example, in the USDSS Analysis Module, the first row of a pairwise matrix would represent connections between a particular county and all other counties, and the first cell (row 1, column 1) would represent the comparison between the first county and itself. When connections are symmetric, as they are with 1s and 0s based on contiguity, the upper and lower triangles (i.e. the sections of the matrix above and below the diagonal) are redundant; for example, connections between the first county and all other counties are represented in both the first row and the first column.
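
A small example may help. The Python sketch below builds a binary contiguity matrix from a list of bordering units; the four counties "A" to "D", their borders, and the function name are hypothetical and are not taken from the USDSS.

```python
def contiguity_matrix(neighbors):
    """Build a square 0/1 matrix from a dict mapping each unit to the units it borders."""
    units = sorted(neighbors)
    index = {unit: k for k, unit in enumerate(units)}
    n = len(units)
    w = [[0] * n for _ in range(n)]
    for unit, bordering in neighbors.items():
        for other in bordering:
            # Sharing a border is mutual, so fill both triangles of the matrix.
            w[index[unit]][index[other]] = 1
            w[index[other]][index[unit]] = 1
    return units, w


# Hypothetical layout: A borders B; B borders A, C and D; C and D border each other.
units, w = contiguity_matrix({"A": ["B"], "B": ["C", "D"], "C": ["D"], "D": []})
# units -> ["A", "B", "C", "D"]
# w     -> [[0, 1, 0, 0],
#           [1, 0, 1, 1],
#           [0, 1, 0, 1],
#           [0, 1, 1, 0]]
```

Row 2 (county B) and column 2 carry the same information, which is the redundancy between the upper and lower triangles. The diagonal is left at 0 in this sketch; for \(G_i^*\), which includes the location itself, the diagonal entries would be nonzero.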

What is row-standardization? Why use it?

In the USDSS Analysis Module, spatial weights are row-standardized prior to calculating the \(G_i^*\) statistic. This means that the values in each row of the weights matrix (i.e. the 1s and 0s) are divided by the row sum. Row standardization is useful when the number of connections varies across spatial units. For example, if one county has six neighbors and another county has three, row standardization ensures that the \(G_i^*\) statistic is not biased upwards when a location has more connections.
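
Row standardization can be sketched in a few lines of Python. The function name and the example matrix are hypothetical; the guard for a row that sums to zero (a unit with no connections) is a choice made here for illustration and is not necessarily how the USDSS handles that case.

```python
def row_standardize(weights):
    """Divide every entry in each row by that row's sum, so connected rows sum to 1."""
    standardized = []
    for row in weights:
        total = sum(row)
        # A unit with no connections keeps an all-zero row (assumption for this sketch).
        standardized.append([value / total if total else 0.0 for value in row])
    return standardized


# Hypothetical example: the first unit has three connections, the second has one.
example = row_standardize([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
])
```

Here each of the first unit's three connections receives a weight of 1/3, while the second unit's single connection receives a weight of 1. In the same way, a county with six neighbors would give each a weight of 1/6, so every row sums to 1 regardless of the number of connections.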

How is error associated with estimates of diabetes surveillance indicators at the county level incorporated into the hotspot analysis?

Estimation error associated with diabetes surveillance indicators is NOT accounted for in the hotspot analysis in the USDSS Analysis Module. There is no direct way to account for error in most local spatial statistics, including \(G_i^*\). To account for such error, one would have to conduct thousands of Monte Carlo simulations; each time, one value within the 95% confidence interval (CI) would be randomly selected for each county and then the standard (\(z\)) scores would be recalculated. Given enough simulations, this would produce a range of standard scores for each county. While revealing, such simulations would be extremely computationally intensive and impractical, especially for a web application.
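
For readers who want to try this offline on a small data set, the simulation described above could be sketched in Python as follows. Several assumptions are made for illustration and none of them reflects the USDSS implementation: values are drawn uniformly within each CI, the \(z\)-scores use the standardized \(G_i^*\) form of Ord and Getis (1995), each unit is connected to itself, and the five units, their CIs, and the function names are hypothetical.

```python
import math
import random


def gi_star_z(values, weights):
    """Standardized G_i^* for each unit; weights[i][i] is nonzero so unit i is included."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for row in weights:
        w_sum = sum(row)
        w_sq = sum(w * w for w in row)
        numerator = sum(w * x for w, x in zip(row, values)) - mean * w_sum
        denominator = s * math.sqrt((n * w_sq - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


def simulate_z_ranges(lower, upper, weights, n_sims=1000, seed=0):
    """Draw one value per unit from its CI, recompute z-scores, and track min and max."""
    rng = random.Random(seed)
    n = len(lower)
    z_min = [math.inf] * n
    z_max = [-math.inf] * n
    for _ in range(n_sims):
        draw = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]  # uniform: assumption
        for i, z in enumerate(gi_star_z(draw, weights)):
            z_min[i] = min(z_min[i], z)
            z_max[i] = max(z_max[i], z)
    return list(zip(z_min, z_max))


# Hypothetical example: five units in a row, each connected to itself and its neighbors.
binary = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
weights = [[v / sum(row) for v in row] for row in binary]  # row-standardized
lower = [9.0, 8.0, 4.0, 1.5, 0.5]   # lower CI bounds (made up)
upper = [11.0, 10.0, 6.0, 2.5, 1.5]  # upper CI bounds (made up)
z_ranges = simulate_z_ranges(lower, upper, weights)
```

Each entry of z_ranges is the smallest and largest \(z\)-score seen for that unit across the simulations. A unit whose range stays entirely positive (or entirely negative) is one whose hotspot (or coldspot) status is not sensitive to the estimation error under these assumptions.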

A more pragmatic approach might be to consider the average width of the confidence intervals relative to the range of the indicator. If the ratio of the width to range is generally low, then hotspot results are less likely to be affected by estimation error.
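
This ratio is straightforward to compute. The Python sketch below is one possible reading of the approach, taking the mean CI width divided by the range of the point estimates; the function name and the example numbers are hypothetical.

```python
def ci_width_to_range(estimates, lower, upper):
    """Mean confidence-interval width divided by the range of the point estimates."""
    widths = [hi - lo for lo, hi in zip(lower, upper)]
    mean_width = sum(widths) / len(widths)
    value_range = max(estimates) - min(estimates)
    return mean_width / value_range


# Hypothetical indicator values (percent) for four counties with their 95% CI bounds.
ratio = ci_width_to_range(
    estimates=[8.0, 10.0, 12.0, 14.0],
    lower=[7.0, 9.0, 11.0, 13.0],
    upper=[9.0, 11.0, 13.0, 15.0],
)
```

In this made-up example every interval is 2 percentage points wide and the estimates span 6 points, giving a ratio of about 0.33. What counts as "low" is a judgment call that this sketch does not settle.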

How is the scale of the color ribbon chosen for the hotspot analysis?

The scaling of the color ribbon used in the USDSS Analysis Module is chosen based on the maximum of the absolute value of the \(z\)-scores. For example, if the most positive \(z\)-score were 4.16 and the most negative \(z\)-score were -3.76, the scaling of the color ribbon would be set between -4.16 and 4.16. While this makes the scale symmetrical, it does not necessarily represent the range of the observed values (e.g., there would be no values more “blue” than -3.76).
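
The rule described above amounts to the following Python sketch. The function name is hypothetical and the code is an illustration, not the USDSS source.

```python
def symmetric_limits(z_scores):
    """Return (-b, b), where b is the largest absolute z-score."""
    bound = max(abs(z) for z in z_scores)
    return -bound, bound


limits = symmetric_limits([4.16, -3.76, 0.5])  # -> (-4.16, 4.16)
```

Centering the scale on zero keeps the neutral color at \(z = 0\), at the cost of leaving part of one end of the ribbon unused when the extremes are not equally far from zero.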