Obfuscating Data with Clustering
beautifully clustered lights and colors

Presented by


Christopher Moravec


Clustering, Obfuscation and Data, Oh my!
The Problem
  • Lots of point data to cluster
  • Can't use the raw points in the browser
Clustering
Raw Points
all points on the map at the same time
Raw Points by type
all points on the map colored by type
Clustered by type
points clustered by type
Clustering - Server vs Client
  • Client needs all points within the map extent to make clusters
  • Server can pre-process by using a geohash
  • Produces an "obfuscated" data set
Obfuscation
Raw Points in Portland, OR
raw points in portland, or
Raw Points Counted by Grid Cell
raw points counted by grid cell
Grid Centroids
Grid Centroids
But uniform points yield...uniform clusters!
uniform points yield uniform clusters
Obfuscation Grid Size Selection
  • Large enough to group points together
  • Large enough to avoid single points in dense areas
  • Small enough to avoid the uniform grid problem
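The count-by-grid-cell and grid-centroid steps can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the function name `obfuscateToGrid`, the `{ lon, lat }` point shape, the cell size in degrees, and the sample coordinates are all assumptions made for the example.

```javascript
// Hypothetical sketch: snap raw points to a regular grid, count the points
// in each cell, and emit one centroid per occupied cell.
// Assumption: cellSize is expressed in degrees of longitude/latitude.
function obfuscateToGrid(points, cellSize) {
  const cells = new Map();
  for (const { lon, lat } of points) {
    // Column/row index of the grid cell containing this point.
    const col = Math.floor(lon / cellSize);
    const row = Math.floor(lat / cellSize);
    const key = `${col}:${row}`;
    const cell = cells.get(key) || { col, row, count: 0 };
    cell.count += 1;
    cells.set(key, cell);
  }
  // Replace every cell's raw points with the cell center plus a count.
  return [...cells.values()].map(({ col, row, count }) => ({
    lon: (col + 0.5) * cellSize,
    lat: (row + 0.5) * cellSize,
    count,
  }));
}

// Made-up sample coordinates near Portland, OR.
const rawPoints = [
  { lon: -122.676, lat: 45.523 },
  { lon: -122.679, lat: 45.521 },
  { lon: -122.601, lat: 45.562 },
];
const obfuscatedPoints = obfuscateToGrid(rawPoints, 0.01);
console.log(obfuscatedPoints);
```

The first two sample points share a cell and collapse into one centroid with a count of 2. A smaller `cellSize` leaves more single-point cells, while a larger one pushes the output toward the uniform grid problem.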
Data Details
Green Building Registry
gbr logo
  • A database of Home Energy Scores
  • All scores are verified by certified assessors
  • Allows home buyers to accurately compare the energy costs of homes
  • Currently around 50,000 scores
Our Solution
gbr logo
How does clustering work?
(some) Types of Clustering?
  • k-means
  • Buffer/Distance
  • Geohash
k-means
k-means Process
k-means process for finding data clusters
From Wikipedia
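For reference, the iterative process in the figure (often called Lloyd's algorithm) can be sketched as below. This is a simplified illustration, not code from the talk: the function name `kMeans`, the `[x, y]` point format, the caller-supplied starting centroids, and the fixed iteration count are assumptions.

```javascript
// Hypothetical sketch of the standard k-means loop on 2D points.
// Assumption: the caller supplies the initial centroids, so the result is
// deterministic; real implementations usually pick them at random.
function kMeans(points, initialCentroids, maxIterations = 10) {
  let centroids = initialCentroids.map((c) => [...c]);
  let assignments = [];
  for (let iter = 0; iter < maxIterations; iter++) {
    // Assignment step: attach each point to its nearest centroid.
    assignments = points.map(([x, y]) => {
      let best = 0;
      let bestDist = Infinity;
      centroids.forEach(([cx, cy], k) => {
        const d = (x - cx) ** 2 + (y - cy) ** 2;
        if (d < bestDist) {
          bestDist = d;
          best = k;
        }
      });
      return best;
    });
    // Update step: move each centroid to the mean of its members.
    centroids = centroids.map((old, k) => {
      const members = points.filter((_, i) => assignments[i] === k);
      if (members.length === 0) return old; // keep empty clusters in place
      const sx = members.reduce((s, p) => s + p[0], 0);
      const sy = members.reduce((s, p) => s + p[1], 0);
      return [sx / members.length, sy / members.length];
    });
  }
  return { centroids, assignments };
}

const kMeansResult = kMeans(
  [[0, 0], [0, 2], [10, 10], [10, 12]],
  [[0, 0], [10, 10]]
);
console.log(kMeansResult.centroids, kMeansResult.assignments);
```

Every iteration touches every point, which is one reason this approach is a poor fit for interactive map clustering.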
Buffer/Distance
  • Simple to complete
  • Computationally easier than k-means, which is NP-hard in general
  • Normally done client side
  • Requires access to all (visible) data points
Buffer/Distance Process

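// Excerpt from the cluster() method; `features`, `extent`, and `mapDistance`
// are defined earlier in that method.

// Tracks the UID of every feature already assigned to a cluster.
const clustered = {};

for (let i = 0, ii = features.length; i < ii; i++) {
    const feature = features[i];
    // Skip features that an earlier cluster has already absorbed.
    if (!(getUid(feature) in clustered)) {
        const geometry = this.geometryFunction(feature);
        if (geometry) {
            // Build a search extent: the feature's point buffered by the cluster distance.
            const coordinates = geometry.getCoordinates();
            createOrUpdateFromCoordinate(coordinates, extent);
            buffer(extent, mapDistance, extent);

            // Claim every unclustered feature inside that extent for this cluster.
            let neighbors = this.source.getFeaturesInExtent(extent);
            neighbors = neighbors.filter(function(neighbor) {
                const uid = getUid(neighbor);
                if (!(uid in clustered)) {
                    clustered[uid] = true;
                    return true;
                } else {
                    return false;
                }
            });
            this.features.push(this.createCluster(neighbors));
        }
    }
}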
                        
From OpenLayers Cluster Source
Buffer/Distance Process 1
cluster step 1
Buffer/Distance Process 2
points clustered by type
Buffer/Distance Process 3
points clustered by type
Buffer/Distance Process 4
points clustered by type
Buffer/Distance Process 5
points clustered by type
Buffer/Distance Process 6
points clustered by type
Buffer/Distance Process 7
points clustered by type
Buffer/Distance Process 8
points clustered by type
Buffer/Distance Process 9
points clustered by type
Buffer/Distance Process 10
points clustered by type
Buffer/Distance Process 11
points clustered by type
Buffer/Distance Process 12
points clustered by type
Geohash
  • Uses Geohash grids to represent distance
  • Enhanced by pre-grouping data into a grid
  • Neighbor queries are fast since Geohash values are predictable
  • No spatial operations like buffer or intersect
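Because nearby points usually share a geohash prefix, grouping can be reduced to string truncation. The sketch below assumes each pre-gridded record already carries a `geohash` string and a `count`; the function name `clusterByGeohashPrefix` and the sample hashes are hypothetical.

```javascript
// Hypothetical sketch: cluster pre-hashed records by truncating each
// geohash to a shorter prefix. A shorter prefix means a larger grid cell,
// so no buffer or intersect operation is needed.
function clusterByGeohashPrefix(records, precision) {
  const clusters = new Map();
  for (const { geohash, count = 1 } of records) {
    const prefix = geohash.slice(0, precision);
    clusters.set(prefix, (clusters.get(prefix) || 0) + count);
  }
  return clusters;
}

// Made-up sample records, not real data.
const hashedRecords = [
  { geohash: 'c20fbx', count: 4 },
  { geohash: 'c20fby', count: 1 },
  { geohash: 'c20g12', count: 7 },
];
const clustersByPrefix = clusterByGeohashPrefix(hashedRecords, 4);
console.log([...clustersByPrefix.entries()]);
```

Picking `precision` from the current zoom level is one plausible way to make clusters grow and shrink with the map.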
Geohash Cluster Process 1
cluster step 1
Geohash Cluster Process 2
points clustered by type
Geohash Cluster Process 3
points clustered by type
Geohash Cluster Process 4
points clustered by type
Geohash Cluster Process 5
points clustered by type
Geohash Cluster Process 6
points clustered by type
Geohash Cluster Process 7
points clustered by type
Geohash Cluster Process 8
points clustered by type
Geohash Cluster Process 9
points clustered by type
Geohash Cluster Process 10
points clustered by type
Geohash Cluster Process 11
points clustered by type
How does a geohash work?
  • Base 32 encoded
  • Repeatedly halves the longitude and latitude ranges, reducing error with each bit
  • Great explanation on Wikipedia
  • No spatial operations like buffer or intersect
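The halving process can be sketched in a few lines. This follows the commonly documented geohash scheme (alternating longitude and latitude bits, five bits per character from a 32-character alphabet); the function name `encodeGeohash` and its default precision are choices made for this example.

```javascript
// Standard geohash alphabet: digits plus lowercase letters without a, i, l, o.
const GEOHASH_BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz';

// Sketch of geohash encoding: alternately halve the longitude and latitude
// ranges, emit one bit per halving, and pack every 5 bits into a character.
function encodeGeohash(lat, lon, precision = 9) {
  const latRange = [-90, 90];
  const lonRange = [-180, 180];
  let hash = '';
  let bits = 0;
  let value = 0;
  let useLon = true; // the first bit always comes from longitude
  while (hash.length < precision) {
    const range = useLon ? lonRange : latRange;
    const coord = useLon ? lon : lat;
    const mid = (range[0] + range[1]) / 2;
    if (coord >= mid) {
      value = value * 2 + 1; // upper half: bit 1
      range[0] = mid;
    } else {
      value = value * 2; // lower half: bit 0
      range[1] = mid;
    }
    useLon = !useLon;
    bits += 1;
    if (bits === 5) {
      hash += GEOHASH_BASE32[value];
      bits = 0;
      value = 0;
    }
  }
  return hash;
}

console.log(encodeGeohash(57.64911, 10.40744, 3)); // "u4p"
```

Each added character narrows the cell by five more halvings, which is why longer hashes pinpoint smaller areas.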
Geohash Grid on Glitch
Source on GitHub too
High level geohash grid
Combining obfuscation and clustering
  • Leverage ArcGIS Online/Esri JS API clustering
  • Obfuscate point data
  • Aggregate math (sums and averages) needed to stay correct
Leveraging ArcGIS Online/Esri JS API Clustering
  • Turns out Esri uses Geohashes for Clustering!
  • graphic._aggregationInfo.geohashes
  • Relatively easy to figure out what geohashes are in a cluster and query
  • If you can query the data, you can keep sums and averages correct
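Why the math needs care: once an obfuscated point stands for several homes, averaging the per-point averages gives the wrong answer. A minimal sketch, assuming each obfuscated record carries a `count` and a `sum` (hypothetical field names, not the Esri API):

```javascript
// Hypothetical sketch: merge obfuscated records into one cluster statistic.
// Carrying count and sum (rather than a per-record average) keeps the
// cluster-level average exact.
function mergeStats(records) {
  let count = 0;
  let sum = 0;
  for (const record of records) {
    count += record.count;
    sum += record.sum;
  }
  return { count, sum, average: count > 0 ? sum / count : null };
}

// One record for a single home scoring 2, one for three homes totalling 24.
const mergedStats = mergeStats([
  { count: 1, sum: 2 },
  { count: 3, sum: 24 },
]);
console.log(mergedStats.average); // 6.5, whereas averaging the averages gives (2 + 8) / 2 = 5
```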
Obfuscate point data
points clustered by type
Obfuscate point data layer
points clustered by type
Lessons Learned
Set out to allow config via ArcGIS Online
  • Ended up overriding config
  • Overcomplicated the application
  • Next Time: OpenLayers or Leaflet with a custom layer
Clusters not weighted
  • The current scheme will break down over time as the data becomes evenly distributed
  • An obfuscated point may represent multiple points
  • The counts/averages are correct but the cluster is not spatially weighted
  • Next Time: Create custom cluster layer to weight obfuscated points by count
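One way such weighting might look is sketched below: place the cluster at the count-weighted mean of its obfuscated points. This is an assumption about a possible design, not something the current application does; the function name `weightedCentroid` and the record shape are hypothetical.

```javascript
// Hypothetical sketch: position a cluster at the count-weighted centroid of
// its obfuscated points, so a point standing for many homes pulls harder.
function weightedCentroid(points) {
  let total = 0;
  let lonSum = 0;
  let latSum = 0;
  for (const { lon, lat, count } of points) {
    total += count;
    lonSum += lon * count;
    latSum += lat * count;
  }
  if (total === 0) return null; // nothing to place
  return { lon: lonSum / total, lat: latSum / total, count: total };
}

const weightedCluster = weightedCentroid([
  { lon: 0, lat: 0, count: 1 },
  { lon: 4, lat: 8, count: 3 },
]);
console.log(weightedCluster); // { lon: 3, lat: 6, count: 4 }
```

An unweighted centroid of the same two points would sit at (2, 4), halfway between them, regardless of how many homes each represents.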
Geohash limitations
  • Edge cases - nearby locations on either side of the equator (or the prime meridian) will not share a common prefix
  • Non-linear - We made squares on a round earth...
  • Next Time: Investigate alternative hashes (Open Location Code)
  • Next Time: Investigate use via projection to cartesian coordinate systems
Thank you.
Happy Obfuscating!
presenter christopher in front of the national museum of art
christopher@dymaptic.com
Tech Diagram
points clustered by type