FAQ: Euclidean Distance

What is Euclidean distance?

Euclidean distance is named after the Greek mathematician Euclid and is the distance measure of Euclidean geometry. It is the “straight-line” distance that most people learned about in geometry class. While Euclidean distance is often used to calculate “geographic distance”, it can also be applied to data (“data distance”). In the latter context, it measures how near or far two sample units are from each other in terms of their measured variables. Technically speaking, it measures the distance between the sample units' (row) vectors in Euclidean space.
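For two sample units with row vectors \(a\) and \(b\) measured on \(n\) variables, the standard formula is \(d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}\). The following is a minimal sketch of that calculation in Python. The function name and the county values are hypothetical and chosen only for illustration; this is not the Analysis Module's own code.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of variables")
    # Sum the squared differences per variable, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical values for two sample units measured on three variables.
county_a = [0.2, 0.5, 0.9]
county_b = [0.6, 0.5, 0.6]

# sqrt(0.16 + 0.00 + 0.09) = sqrt(0.25), i.e. approximately 0.5
distance = euclidean_distance(county_a, county_b)
```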

What is the range of Euclidean distance?

Euclidean distance is never negative: its minimum is zero, which occurs when two vectors are identical. In theory, the maximum Euclidean distance is unbounded (\(= \infty\)), but in practice its range depends on the distribution of the input values. If every input value is bounded between zero and one, the theoretical maximum is equal to the square root of the number of variables (\(n\)) being compared, that is, \(\sqrt{n}\).
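As a quick check of the bounded case, the sketch below compares the two opposite corners of the unit hypercube, which are the farthest-apart points when every variable lies between zero and one. The helper name and the choice of four variables are illustrative assumptions.

```python
import math

def max_distance_unit_range(n_variables):
    """Largest possible Euclidean distance when every variable lies in [0, 1]."""
    return math.sqrt(n_variables)

# Opposite corners of the unit hypercube differ by 1 on every variable.
n = 4
corner_low = [0.0] * n
corner_high = [1.0] * n

# math.dist (Python 3.8+) returns the Euclidean distance between two points.
corner_distance = math.dist(corner_low, corner_high)  # sqrt(4) = 2
```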

Rescaling and Euclidean distance

Euclidean distance is very sensitive to measurement scale. Imagine comparing two US counties where most of the diabetes variables have a measurement scale from 0 to 1, but one of the variables has a measurement scale from 0 to 10. In this situation, the Euclidean distance will be dominated by variation in the variable with the larger measurement scale. The solution is to “rescale” (or “normalize”) the variables to the same range before calculating Euclidean distance. In the USDSS this is accomplished via “minmax” rescaling. In minmax rescaling, you subtract the minimum value in the series from each value, and then divide the result by the difference between the maximum value and the minimum value:

\[ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} \]

After minmax rescaling, all of the values for a variable will range between 0 and 1, where the minimum value in the original series will equal zero and the maximum value will equal one. A useful property of minmax rescaling is that it will yield the same values whether a variable is expressed as a count, proportion, or crude rate.
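A minimal sketch of minmax rescaling is shown here, assuming the series is a plain list of numbers with at least two distinct values. The function name and sample values are hypothetical. The example also illustrates that a change of units by a constant factor (here, percentages versus proportions) leaves the rescaled values unchanged.

```python
def minmax_rescale(values):
    """Rescale a series to [0, 1]: (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the denominator would be zero.
        raise ValueError("cannot rescale a series with no variation")
    return [(v - lo) / (hi - lo) for v in values]

# The same hypothetical indicator expressed as percentages and as proportions.
as_percent = [5.0, 10.0, 25.0]
as_proportion = [0.05, 0.10, 0.25]

scaled_percent = minmax_rescale(as_percent)        # [0.0, 0.25, 1.0]
scaled_proportion = minmax_rescale(as_proportion)  # approximately the same
```

The guard for a series with no variation is a design choice of this sketch; how the USDSS handles that case is not stated here.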

How many indicators should I select?

Euclidean distance requires at least one indicator. There is not an upper limit on the number of indicators that can be selected. Nevertheless, results for Euclidean distance could become nebulous if too many indicators are chosen, a problem that is commonly referred to as “the curse of dimensionality”. For instance, Euclidean distance between uniformly distributed points naturally increases with increasing dimensionality and the ratio between the nearest and farthest distances eventually converges to one, making the concept of proximity less useful in high-dimensional spaces. However, such problems might be diminished in the USDSS Analysis Module by the fact that many of the indicators have restricted ranges (or are being rescaled) and some may be positively correlated, limiting the volume of space where the data are found. Moreover, the number of diabetes indicators that are currently available for analysis is presumably limited enough to avoid strange phenomena that occur with distance measures in high-dimensional spaces.
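The distance-concentration effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions: points are drawn independently and uniformly from the unit hypercube, and the point counts, dimensions, and seed are arbitrary. It says nothing about how real indicator data behave.

```python
import math
import random

def nearest_farthest_ratio(n_points, n_dims, seed=0):
    """Ratio of nearest to farthest distance from one query point.

    Points are sampled uniformly from the unit hypercube. A ratio close
    to 1 means all other points are almost equally far from the query.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = points[0]
    distances = [math.dist(query, p) for p in points[1:]]
    return min(distances) / max(distances)

ratio_low_dim = nearest_farthest_ratio(200, 2)     # few variables
ratio_high_dim = nearest_farthest_ratio(200, 500)  # many variables
```

With these settings the ratio in 500 dimensions is expected to be much closer to one than the ratio in 2 dimensions, which is the sense in which “nearest” loses meaning as dimensionality grows.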

Correlation and Euclidean distance

While correlation between variables may restrict volume in high-dimensional spaces, it distorts (elongates) distance along particular axes where variables are correlated. Hence, users should be cautious about choosing indicators that are highly correlated. Future releases of the Analysis Module will likely include distance metrics that are more robust to correlation, such as Mahalanobis distance.
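One way to see the effect is the extreme case of perfect correlation, where an indicator is effectively counted twice. The sketch below uses made-up values for two counties and appends a duplicate of the first indicator; the variable names are hypothetical.

```python
import math

# Two hypothetical counties measured on two rescaled indicators.
county_a = [0.2, 0.8]
county_b = [0.6, 0.5]
base_distance = math.dist(county_a, county_b)  # sqrt(0.16 + 0.09), about 0.5

# Add a third indicator that duplicates the first one (perfect correlation).
county_a_dup = county_a + [county_a[0]]
county_b_dup = county_b + [county_b[0]]

# The difference on the first indicator now contributes twice:
# sqrt(0.16 + 0.09 + 0.16), about 0.64
inflated_distance = math.dist(county_a_dup, county_b_dup)
```

The duplicated indicator adds no new information, yet the distance grows because the same difference is counted twice. Highly (but not perfectly) correlated indicators have a similar, weaker effect.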

Why do some results for Euclidean distance show more decimals than others?

The Analysis Module uses standard rules for handling significant figures in calculations. Hence, the number of decimal places in the output Euclidean distance calculation is determined by the minimum precision of an input variable. For example, many of the variables are input as percentages with one decimal place (such as 13.7%), representing a proportion with three decimals (0.137); therefore, Euclidean distance measurements that include these numbers would be rounded to three decimal places. To keep the display tractable, the Analysis Module does not display more than six decimal places.

In many cases, such rounding rules may result in tied Euclidean distances between county pairs that have slightly different vector components. However, such minor differences would not generally be observable given the thematic resolution used in the choropleth maps.
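The rounding rule described above can be sketched as follows. This is an illustrative assumption about how such a rule might be coded, not the Analysis Module's actual implementation: the function names are hypothetical, and each input's precision is passed in as a count of decimal places.

```python
def display_decimals(input_decimals, cap=6):
    """Decimal places to show: the least precise input, capped at `cap`."""
    return min(min(input_decimals), cap)

def format_distance(distance, input_decimals, cap=6):
    """Format a distance using the least precise input's decimal places."""
    places = display_decimals(input_decimals, cap)
    return f"{distance:.{places}f}"

# One input has three decimals (e.g. 13.7% stored as 0.137), another has five.
shown_a = format_distance(0.41231, [3, 5])  # "0.412"
shown_b = format_distance(0.41249, [3, 5])  # "0.412", a tie after rounding

# Very precise inputs are still capped at six displayed decimals.
shown_c = format_distance(0.4123456789, [8, 9])  # "0.412346"
```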

Why isn’t the county or indicator I want to select visible?

At the present time, counties that are missing data for one or more indicators will not be available for analysis when using Euclidean distance.
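The availability rule can be sketched as a simple filter. The data layout, county names, and indicator names below are hypothetical and used only to illustrate the idea that a county is dropped when any selected indicator is missing.

```python
def counties_available(records, indicators):
    """Return the counties that have a value for every selected indicator.

    `records` maps a county name to a dict of indicator values; a missing
    key or a value of None both count as missing data.
    """
    return [
        county
        for county, values in records.items()
        if all(values.get(ind) is not None for ind in indicators)
    ]

# Hypothetical records: County B and County C each lack one indicator.
records = {
    "County A": {"obesity": 0.31, "inactivity": 0.25},
    "County B": {"obesity": 0.28, "inactivity": None},
    "County C": {"obesity": 0.35},
}

both = counties_available(records, ["obesity", "inactivity"])  # ["County A"]
one = counties_available(records, ["obesity"])  # all three counties
```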