Module 1 · Foundations · Free & interactive
Spatial Autocorrelation & Moran's I in Python
Understand how spatial relationships create patterns. Written by Eric Vaz. Read it here, or open the interactive version to write and run the Python yourself, in your browser, on real open data.
Open the interactive version (write & run the code) →Your browser is the computer
This is a hands-on course. Every dark panel is a live Python session running inside your browser tab (NumPy, pandas, SciPy and Matplotlib are loaded the first time you Run). The session is shared down the page, so a variable you make in one cell is still there in the next.
How the lessons work: grey Example cells are worked code, run them to see the pattern. Green Your turn cells have a blank you fill in: write the line, press Run, and it checks your answer. Stuck? Use Hint or Show solution.
import sys, numpy as np, pandas as pd, scipy
print("Python", sys.version.split()[0])
print("NumPy", np.__version__, "| pandas", pd.__version__, "| SciPy", scipy.__version__)
print()
print("This ran inside your browser. Nothing was sent to a server.")print("...").msg = "ready to learn spatial statistics"
print("You said:", msg)
In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.
Spatial data, fetched from the cloud
Spatial analysis starts with one of two data models. Point data records a value at a coordinate (an elevation sounding, an air-quality sensor). Areal data attaches a value to a region (a census tract, a state). The tools differ between them.
Pyodide ships pyodide.http.open_url, which fetches a text resource over the network as a file-like object: perfect for pandas.read_csv or json.load. We pull from jsDelivr, a CDN that mirrors public GitHub repos with open CORS headers.
from pyodide.http import open_url
import pandas as pd
# 52 elevation soundings on a 6x6 field (MASS::topo): the classic teaching set.
TOPO = "https://cdn.jsdelivr.net/gh/vincentarelbundock/Rdatasets@master/csv/MASS/topo.csv"
topo = pd.read_csv(open_url(TOPO))[["x", "y", "z"]]
print("POINTS topo:", topo.shape, "-> columns x, y, z(elevation)")
print(topo.head(3).to_string(index=False))pd.read_csv(open_url(url)). A GeoJSON file is loaded the same way but with json.load. Fill the blank to fetch the US states polygons.import json
STATES = "https://cdn.jsdelivr.net/gh/PublicaMundi/MappingAPI@master/data/geojson/us-states.json"
gj = json.load(open_url(STATES))
print("AREAS states:", len(gj["features"]), "polygons")
print("first feature props:", gj["features"][0]["properties"])
In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.
Look before you model
Before any statistic, look at the data in space. For the topo points, colour each location by its value. This map and the choropleth below are two of the most common figures in the field.
c= argument with the column that holds the value.import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4.8, 3.8))
sc = ax.scatter(topo.x, topo.y, c=topo.z, cmap="viridis", s=80, edgecolor="k", linewidth=.4)
ax.set(title="Elevation soundings", xlabel="x", ylabel="y"); ax.set_aspect("equal")
plt.colorbar(sc, ax=ax, label="elevation")
In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.
For polygons we need two helpers: one to pull coordinate rings out of a GeoJSON feature (it may be a Polygon or a MultiPolygon), and one for each polygon's centroid (we'll need centroids next lesson). Population density is wildly skewed (New Jersey vs Wyoming), so we analyse its logarithm, and we drop Alaska, Hawaii and Puerto Rico so the neighbour graph isn't distorted by far-flung exclaves.
import numpy as np
def feature_polys(f):
g = f["geometry"]
return g["coordinates"] if g["type"] == "MultiPolygon" else [g["coordinates"]]
def state_centroid(f):
best, best_area = None, -1
for poly in feature_polys(f):
r = np.array(poly[0])
a = abs(0.5*np.sum(r[:-1,0]*r[1:,1] - r[1:,0]*r[:-1,1])) # shoelace
if a > best_area:
best_area, best = a, (r[:,0].mean(), r[:,1].mean())
return best
drop = {"Alaska", "Hawaii", "Puerto Rico"}
feats = [f for f in gj["features"] if f["properties"]["name"] not in drop]
dens = np.array([f["properties"]["density"] for f in feats])
print(len(feats), "contiguous states + DC; raw density spans %.0fx" % (dens.max()/dens.min()))y as the natural log of dens (NumPy's np.log). This is the variable we'll analyse from here on.y = np.log(dens)
cent = np.array([state_centroid(f) for f in feats])
print("y ranges", round(float(y.min()),2), "to", round(float(y.max()),2))
print("centroids:", cent.shape)
In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.
import matplotlib as mpl
from matplotlib.path import Path
from matplotlib.patches import PathPatch
from matplotlib.collections import PatchCollection
def choropleth(feats, values, title, cmap="viridis"):
norm = mpl.colors.Normalize(values.min(), values.max()); cm = plt.get_cmap(cmap)
fig, ax = plt.subplots(figsize=(7.2, 4.3)); patches, colors = [], []
for f, v in zip(feats, values):
for poly in feature_polys(f):
patches.append(PathPatch(Path(np.array(poly[0])))); colors.append(cm(norm(v)))
ax.add_collection(PatchCollection(patches, facecolor=colors, edgecolor="white", linewidth=.4))
ax.autoscale(); ax.set_aspect(1.3); ax.axis("off"); ax.set_title(title)
fig.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=cm), ax=ax, shrink=.7)
choropleth(feats, y, "log population density by state")Who is next to whom
To measure whether nearby places resemble each other, you first have to define nearby. That is the spatial weights matrix W: W[i, j] > 0 when j is a neighbour of i. Common families: contiguity (share a border), distance band (within radius d), and k-nearest (the k closest, so no unit is ever isolated).
We'll build k-nearest weights from the centroids, then row-standardise so each row sums to one, turning a weighted sum over neighbours into a neighbour average.
knn_W by row-standardising: divide W by each row's sum so every row totals 1. Fill the blank.from scipy.spatial import cKDTree
import numpy as np
def knn_W(coords, k=6, row_standardize=True):
n = len(coords); tree = cKDTree(coords)
_, idx = tree.query(coords, k=k+1) # k+1: the nearest point is itself
W = np.zeros((n, n))
for i in range(n):
for j in idx[i, 1:]: # skip self (column 0)
W[i, j] = 1.0
if row_standardize:
W = W / W.sum(1, keepdims=True)
return W
W = knn_W(cent, k=6)
print("W is", W.shape, "| every row sums to 1:", np.allclose(W.sum(1), 1))
In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.
y is each unit's neighbour-average: the matrix product of W and y. Compute it with the @ operator.ylag = W @ y
print("corr(value, neighbour average):", round(float(np.corrcoef(y, ylag)[0, 1]), 3))
In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.
Spatial autocorrelation
“Everything is related to everything else, but near things are more related than distant things.” Moran's I turns that into a number: a correlation between each value and the average of its neighbours, running from about −1 (a checkerboard) through 0 (no pattern) to +1 (strong clustering).
z is the mean-centred value and S0 the sum of all weights. Fill the numerator (sum of W * np.outer(z, z)) and denominator (sum of z**2).import numpy as np
def morans_I(y, W):
z = y - y.mean()
S0 = W.sum()
return (len(y)/S0) * np.sum(W * np.outer(z, z)) / np.sum(z**2)
I = morans_I(y, W)
print("Moran's I (log density) =", round(float(I), 3))
In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.
There's no clean theoretical null for most real maps, so we test by permutation: shuffle the values across locations many times, recompute I each time, and see where the real I falls in that reference distribution.
y. Fill the blank with a shuffled copy of y (use rng.permutation).rng = np.random.default_rng(42)
ref = np.array([morans_I(rng.permutation(y), W) for _ in range(999)])
p = (np.sum(ref >= I) + 1) / (999 + 1)
print(f"pseudo p-value = {p:.3f}")
print("expected if random =", round(-1/(len(y)-1), 3))
In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.
The Moran scatterplot makes it visual: standardised value on the x-axis, its spatial lag on the y-axis. The slope of the fitted line is Moran's I.
zy.import matplotlib.pyplot as plt
zy = (y - y.mean())/y.std()
zlag = W @ zy
fig, ax = plt.subplots(figsize=(4.8, 4.8))
ax.axhline(0, color="#aaa", lw=.8); ax.axvline(0, color="#aaa", lw=.8)
ax.scatter(zy, zlag, s=32, color="#14233D")
slope = np.polyfit(zy, zlag, 1)[0]
xs = np.linspace(zy.min(), zy.max(), 2)
ax.plot(xs, slope*xs, color="#D6552F", lw=2, label=f"slope = Moran's I = {slope:.3f}")
ax.set(xlabel="value (standardised)", ylabel="spatial lag", title="Moran scatterplot"); ax.legend()
In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.
Spatial autocorrelation in applied research
- Eric Vaz — Google Scholar profile — land-use change, regional science and GeoAI applications of these methods
- Selected publications on ericvaz.com