Module 1 · Foundations · Free & interactive

Spatial Autocorrelation & Moran's I in Python

Understand how spatial relationships create patterns. Written by Eric Vaz. Read it here, or open the interactive version to write and run the Python yourself, in your browser, on real open data.

Open the interactive version (write & run the code) →

Your browser is the computer

This is a hands-on course. Every dark panel is a live Python session running inside your browser tab (NumPy, pandas, SciPy and Matplotlib are loaded the first time you Run). The session is shared down the page, so a variable you make in one cell is still there in the next.

How the lessons work: grey Example cells are worked code, run them to see the pattern. Green Your turn cells have a blank you fill in: write the line, press Run, and it checks your answer. Stuck? Use Hint or Show solution.

ExampleCheck the runtime

import sys, numpy as np, pandas as pd, scipy
print("Python", sys.version.split()[0])
print("NumPy", np.__version__, "| pandas", pd.__version__, "| SciPy", scipy.__version__)
print()
print("This ran inside your browser. Nothing was sent to a server.")

Try it yourselfSay hello

Your first line of code. Replace the blank so it prints a message of your choice with print("...").

msg = "ready to learn spatial statistics"
print("You said:", msg)

In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.

Spatial data, fetched from the cloud

Spatial analysis starts with one of two data models. Point data records a value at a coordinate (an elevation sounding, an air-quality sensor). Areal data attaches a value to a region (a census tract, a state). The tools differ between them.

Pyodide ships pyodide.http.open_url, which fetches a text resource over the network as a file-like object: perfect for pandas.read_csv or json.load. We pull from jsDelivr, a CDN that mirrors public GitHub repos with open CORS headers.

ExampleLoad a point dataset (run me)

from pyodide.http import open_url
import pandas as pd

# 52 elevation soundings on a 6x6 field (MASS::topo): the classic teaching set.
TOPO = "https://cdn.jsdelivr.net/gh/vincentarelbundock/Rdatasets@master/csv/MASS/topo.csv"
topo = pd.read_csv(open_url(TOPO))[["x", "y", "z"]]
print("POINTS  topo:", topo.shape, "-> columns x, y, z(elevation)")
print(topo.head(3).to_string(index=False))

Try it yourselfLoad the areal dataset

The points used pd.read_csv(open_url(url)). A GeoJSON file is loaded the same way but with json.load. Fill the blank to fetch the US states polygons.

import json
STATES = "https://cdn.jsdelivr.net/gh/PublicaMundi/MappingAPI@master/data/geojson/us-states.json"
gj = json.load(open_url(STATES))
print("AREAS  states:", len(gj["features"]), "polygons")
print("first feature props:", gj["features"][0]["properties"])

In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.

What changed since the 2010s. The classic workflow was R with sp and gstat, reading a local file off your desktop. The 2026 version is Python, the data lives in the cloud under version control, and it all runs in a browser tab. Same statistics, far less friction.

Look before you model

Before any statistic, look at the data in space. For the topo points, colour each location by its value. This map and the choropleth below are two of the most common figures in the field.

Try it yourselfColour the points by elevation

Make a scatter of the topo points coloured by elevation. Fill the c= argument with the column that holds the value.

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4.8, 3.8))
sc = ax.scatter(topo.x, topo.y, c=topo.z, cmap="viridis", s=80, edgecolor="k", linewidth=.4)
ax.set(title="Elevation soundings", xlabel="x", ylabel="y"); ax.set_aspect("equal")
plt.colorbar(sc, ax=ax, label="elevation")

In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.

For polygons we need two helpers: one to pull coordinate rings out of a GeoJSON feature (it may be a Polygon or a MultiPolygon), and one for each polygon's centroid (we'll need centroids next lesson). Population density is wildly skewed (New Jersey vs Wyoming), so we analyse its logarithm, and we drop Alaska, Hawaii and Puerto Rico so the neighbour graph isn't distorted by far-flung exclaves.

ExampleHelpers + drop the islands (run me)

import numpy as np

def feature_polys(f):
    g = f["geometry"]
    return g["coordinates"] if g["type"] == "MultiPolygon" else [g["coordinates"]]

def state_centroid(f):
    best, best_area = None, -1
    for poly in feature_polys(f):
        r = np.array(poly[0])
        a = abs(0.5*np.sum(r[:-1,0]*r[1:,1] - r[1:,0]*r[:-1,1]))   # shoelace
        if a > best_area:
            best_area, best = a, (r[:,0].mean(), r[:,1].mean())
    return best

drop  = {"Alaska", "Hawaii", "Puerto Rico"}
feats = [f for f in gj["features"] if f["properties"]["name"] not in drop]
dens  = np.array([f["properties"]["density"] for f in feats])
print(len(feats), "contiguous states + DC; raw density spans %.0fx" % (dens.max()/dens.min()))

Try it yourselfTame the skew

Density spans thousands of times over, so a raw choropleth is unreadable. Define y as the natural log of dens (NumPy's np.log). This is the variable we'll analyse from here on.

y    = np.log(dens)
cent = np.array([state_centroid(f) for f in feats])
print("y ranges", round(float(y.min()),2), "to", round(float(y.max()),2))
print("centroids:", cent.shape)

In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.

ExampleA choropleth, the honest way (run me)

import matplotlib as mpl
from matplotlib.path import Path
from matplotlib.patches import PathPatch
from matplotlib.collections import PatchCollection

def choropleth(feats, values, title, cmap="viridis"):
    norm = mpl.colors.Normalize(values.min(), values.max()); cm = plt.get_cmap(cmap)
    fig, ax = plt.subplots(figsize=(7.2, 4.3)); patches, colors = [], []
    for f, v in zip(feats, values):
        for poly in feature_polys(f):
            patches.append(PathPatch(Path(np.array(poly[0])))); colors.append(cm(norm(v)))
    ax.add_collection(PatchCollection(patches, facecolor=colors, edgecolor="white", linewidth=.4))
    ax.autoscale(); ax.set_aspect(1.3); ax.axis("off"); ax.set_title(title)
    fig.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=cm), ax=ax, shrink=.7)

choropleth(feats, y, "log population density by state")

Who is next to whom

To measure whether nearby places resemble each other, you first have to define nearby. That is the spatial weights matrix W: W[i, j] > 0 when j is a neighbour of i. Common families: contiguity (share a border), distance band (within radius d), and k-nearest (the k closest, so no unit is ever isolated).

We'll build k-nearest weights from the centroids, then row-standardise so each row sums to one, turning a weighted sum over neighbours into a neighbour average.

Try it yourselfBuild k-nearest weights

The neighbour lookup is done. Finish knn_W by row-standardising: divide W by each row's sum so every row totals 1. Fill the blank.

from scipy.spatial import cKDTree
import numpy as np

def knn_W(coords, k=6, row_standardize=True):
    n = len(coords); tree = cKDTree(coords)
    _, idx = tree.query(coords, k=k+1)        # k+1: the nearest point is itself
    W = np.zeros((n, n))
    for i in range(n):
        for j in idx[i, 1:]:                  # skip self (column 0)
            W[i, j] = 1.0
    if row_standardize:
        W = W / W.sum(1, keepdims=True)
    return W

W = knn_W(cent, k=6)
print("W is", W.shape, "| every row sums to 1:", np.allclose(W.sum(1), 1))

In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.

Try it yourselfForm the spatial lag

The spatial lag of y is each unit's neighbour-average: the matrix product of W and y. Compute it with the @ operator.

ylag = W @ y
print("corr(value, neighbour average):", round(float(np.corrcoef(y, ylag)[0, 1]), 3))

In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.

The lag is the whole game. Almost every areal statistic (Moran's I, spatial lag regression, LISA) is built from W @ y. Get the weights right and the rest follows; get them wrong and every downstream number inherits the mistake.

Spatial autocorrelation

“Everything is related to everything else, but near things are more related than distant things.” Moran's I turns that into a number: a correlation between each value and the average of its neighbours, running from about −1 (a checkerboard) through 0 (no pattern) to +1 (strong clustering).

Try it yourselfCode Moran's I

Moran's I = (n / S0) · Σ W·(z⃗z⃗) / Σ z², where z is the mean-centred value and S0 the sum of all weights. Fill the numerator (sum of W * np.outer(z, z)) and denominator (sum of z**2).

import numpy as np
def morans_I(y, W):
    z  = y - y.mean()
    S0 = W.sum()
    return (len(y)/S0) * np.sum(W * np.outer(z, z)) / np.sum(z**2)

I = morans_I(y, W)
print("Moran's I (log density) =", round(float(I), 3))

In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.

There's no clean theoretical null for most real maps, so we test by permutation: shuffle the values across locations many times, recompute I each time, and see where the real I falls in that reference distribution.

Try it yourselfTest it by permutation

Build the reference distribution: recompute Moran's I on 999 reshuffles of y. Fill the blank with a shuffled copy of y (use rng.permutation).

rng = np.random.default_rng(42)
ref = np.array([morans_I(rng.permutation(y), W) for _ in range(999)])
p   = (np.sum(ref >= I) + 1) / (999 + 1)
print(f"pseudo p-value = {p:.3f}")
print("expected if random =", round(-1/(len(y)-1), 3))

In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.

The Moran scatterplot makes it visual: standardised value on the x-axis, its spatial lag on the y-axis. The slope of the fitted line is Moran's I.

Try it yourselfThe Moran scatterplot

Plot the standardised value against its spatial lag. Fill the blank with the spatial lag of the standardised values zy.

import matplotlib.pyplot as plt
zy   = (y - y.mean())/y.std()
zlag = W @ zy
fig, ax = plt.subplots(figsize=(4.8, 4.8))
ax.axhline(0, color="#aaa", lw=.8); ax.axvline(0, color="#aaa", lw=.8)
ax.scatter(zy, zlag, s=32, color="#14233D")
slope = np.polyfit(zy, zlag, 1)[0]
xs = np.linspace(zy.min(), zy.max(), 2)
ax.plot(xs, slope*xs, color="#D6552F", lw=2, label=f"slope = Moran's I = {slope:.3f}")
ax.set(xlabel="value (standardised)", ylabel="spatial lag", title="Moran scatterplot"); ax.legend()

In the interactive version this line is blanked out and you write it yourself, with a hint, a solution, and an automatic check.

Go deeper / cite

Spatial autocorrelation in applied research

Eric Vaz — Google Scholar profile — land-use change, regional science and GeoAI applications of these methods
Selected publications on ericvaz.com

Found this useful?It's free and always will be. A coffee keeps new modules coming.

☕ Buy me a coffee

More free modules

Interpolation & Kriging — Predict values at unsampled locations.Hot Spot Analysis — Identify statistically significant hot and cold spots.All modules & the interactive course →