Algorithms
Clustering

Clustering Algorithms Overview

Kōji offers two clustering algorithms, the fastest is a Rust translation of the FastCover-PP algorithm by ghoshanirban (opens in a new tab), the other is a Greedy algorithm that leverages the r-tree implentation from the rstar (opens in a new tab) crate. The latter algorithm offers different levels of complexity that vary in time, resource usage, and final performance.

Goal

The primary goal of Kōji's clustering algorithms is to cover the most number of points with the fewest number of clusters of a given radius. However, many users of this app have slight variations of this goal in mind, which is why Kōji offers many different inputs to customize the final result.

Inputs

Data Points

The points to be clustered.

  • All algorithms accepts Vec<[f64;2]>
  • [Latitude, Longitude] pairs

Algorithm

The algorithm to use for clustering.

  • Fastest: UDC, fastest and scales the best, but produces the worst result
  • Fast: Basic implementation of the Greedy algorithm, produces a significantly better result than Fastest
  • Balanced: A great balance of speed, resource usage, and result quality
  • Better: This algorithm can use a large number of resources but produces the best result

Radius

The radius of each cluster to encase the points within

  • Unit is in meters

Min Points

The minimum number of points that a cluster must contain to be considered valid

Max Clusters

The maximum number of clusters that should be generated

  • Unavailable when running the Fastest algorithm
  • Generally based on external factors for the user, such as the number of devices available to scan the route
  • Since the Greedy algorithm works down from the best clusters to the worst, when the desired max is hit, the algorithm will stop and return the best result possible for the set limit

Cluster Split Level

This groups points based on their S2 cell level before clustering them. The groups are then ran on separate threads in order to help with parallelizing workloads.

This was more relevant with the legacy clustering algorithms that were single threaded but can still be useful in some cases

Metrics

In order to compare the performance of multiple algorithms we calculate the mygod_score, aptly named after a colleague who came up with it. This score is calculated by multiplying the number of clusters by the min_points input added to the difference of the total input_points and the number of points that were covered by the algorithm. This incentives the algorithms to cover the maximum number of points possible but only if the clusters cover a number of unique points that is greater than or equal to the set min_points. The lower the score, the better the result.

pub fn get_score(&self, min_points: usize) -> usize {
      self.total_clusters * min_points + (self.total_points - self.points_covered)
}

Runtime is not a factor in the mygod_score because it is not the primary goal of Kōji's userbase. Though it does make development time less painful and is a small factor for some external integrations that can't be waiting around for more than several minutes for a result. In the future, if the algorithm ever gets to a point where it's final results are near perfect, then time may be added to the mygod_score calculation for further refinement.

Algorithm Details

Unit Disc Cover (UDC)

Source (opens in a new tab)

The UDC algorithm is mostly a classic implementation of the UDC problem and is not particularly written for Kōji. While it is much faster, it does not produce a desirable result. It often leads to a lot of overlap between clusters and does not take into account the density of points in a given area. This algorithm is best used when the user is looking for a quick result and does not care about the quality of the result. Since it is less specialized for the task at hand, the inputs are mostly taken into account during pre and post processing functions. On its own, this algorithm does not take into account the spherical nature of the Earth, so as part of the pre-process function, the input points must be projected onto a flat plane that takes the input radius into account. The post-process function then takes the projected points and converts them back to their original spherical coordinates. You can read more about UDC here (opens in a new tab).

Greedy

Source (opens in a new tab)

The Greedy algorithm on the other hand is much more specialized for the task at hand. Compared to its predecessors, it leverages an r-tree algorithm that allows it to run faster than the O(n^2) time complexity that the legacy algorithms could not achieve. This algorithm has three different levels of complexity that user's can select from. Fast, Balanced, and Better. The only difference between these options is the number of potential clusters that are generated based on the input points.

Step 1 - Generate Potential Clusters

Source (opens in a new tab)

Here is where the algorithm complexity input is utilized.

let potential_clusters: Vec<Cluster> = match cluster_mode {
      ClusterMode::Better | ClusterMode::Best => get_s2_clusters(points),
      ClusterMode::Fast => gen_estimated_clusters(&point_tree),
      _ => {
            let neighbor_tree: RTree<Point> = rtree::spawn(radius * 2., points);
            gen_estimated_clusters(&neighbor_tree)
      }
}
  • Fast: Generates possible clusters at the provided points and at 8 equal segments between a given point and its neighbors that are within the given radius.
  • Balanced: Similar logic to Fast but generates the same 8 equal segments between the given point and its neighbors that are within the radius * 2. Additionally, it adds some "wiggle" points at each of the segments.
  • Better: This option leverages the Rust S2 library (opens in a new tab) and generates a potential cluster at every level 22 S2 cell that is within the BBOX of the input points. This is the most resource intensive option but produces the best result. This process has been optimized by starting at level 16 and continueing to split cells down to level 22 in parallel using Rayon (opens in a new tab).

Step 2 - Associate Potential Clusters

Source (opens in a new tab)

Now that we have our potential clusters, we want to associate them with the input points. This is where the primary use of the r-tree is implemented and has saved us the most time compared to the legacy algorithms. Even though this step utilizes Rayon to parallelize the workload, it is still the most expensive part of the algorithm and has the most room for improvement. The r-tree is used to find all points that are within the radius of a given potential cluster. This is done in parallel for all potential clusters. We also have to check to see if the cluster exists in the r-tree, in the case that the best cluster is indeed a point itself. If the cluster has less points than the given min_points input, we remove this cluster from the list of potential clusters, allowing us to release as much memory as possible and save some time in the clustering step.

let clusters_with_data: Vec<Cluster> = potential_clusters
      .into_par_iter()
      .filter_map(|cluster| {
            // Get the associated points
            let mut points: Vec<&Point> = point_tree
                  .locate_all_at_point(&cluster.center)
                  .collect::<Vec<&Point>>();
            // Check if the cluster is a point
            if let Some(point) = point_tree.locate_at_point(&cluster.center) {
                  points.push(point);
            }
            if points.len() < min_points {
                  // Skip it if we're never going to be interested in it anyway
                  None
            } else {
                  Some(Cluster::new(cluster, points, vec![]))
            }
      })
      .collect();

Step 3 - Cluster the Clusters

Source (opens in a new tab)

In this step we are grouping our clusters by the number of points that they cover. This helps us save some time during the clustering step.

// Finds the max number of points that a cluster covers
let max = clusters_with_data
      .par_iter()
      .map(|cluster| cluster.all.len())
      .max()
      // For the users of Kōji there will likely never be anything higher than 100 points in a cluster
      .unwrap_or(100);
 
// Creates a new Vec, filled with Vecs with the size set to the max number of points that a cluster covers
// plus one, because we want to skip the 0 index for convenience
let mut clustered_clusters = vec![vec![]; max + 1];
for cluster in clusters_with_data.into_iter() {
      clustered_clusters[cluster.all.len()].push(cluster);
}

Step 4 - Clustering

Source (opens in a new tab)

This section is broken down into sub steps and will only contain summarized code snippets to avoid copying and pasting the entire function, but if you're interested I would encourage you to check out the source above.

Step 4a - Setup

Source (opens in a new tab)

We start by setting up our mutable variables that help us keep track of where things are during the while loop runs.

  • new_clusters: A HashSet that will hold our picked clusters. The Cluster struct implement the Hash trait and are hashed based on the S2 Cell ID of the center point. Despite the Cluster struct being a complex type, it was unnecessary to use a HashMap because we aren't accessing the data by key, only making sure we aren't inserting duplicates.
  • blocked_points: an additional HashSet that is keeping track of points that have already been clustered. This is key because we do not want points associated with each cluster to count towards our min_points input comparisons if they have already been clustered.
  • current: a var that starts at the max number of points that a cluster covers and is decremented until it reaches our min_points input.
  • The rest of the variables are for tracking how much time is spent on each step and for estimating our % complete in the logs.

The rest of the steps occur in the while loop.

Step 4b - Get Interested Clusters

Source (opens in a new tab)

This step is where we get the clusters that have the number of points that we are currently interested in. Since this algorithm is greedy, we start at the highest and work our way down to the min_points input. This is why we grouped our clusters in step 3. This is a very simple loop that checks whether the index of each item is greater than or equal to our current var. If it is, we add it to our clusters_of_interest Vec. The reason we want greater than or equal to is because we will be further filtering the clusters later on. Just because a cluster didn't end up being viable when current is equal to 42, is still may be the best when current is equal to 30.

for (index, clusters) in clusters_with_data.iter().enumerate() {
      if index < current {
            continue;
      }
      clusters_of_interest.extend(clusters);
}

Step 4c - Filtering Clusters

Source (opens in a new tab)

Now we filter out clusters that are not viable. Iterating through clusters_of_interest in parallel, we check how many points for each one have already been clustered, if the total is less than current, we discard it. During this step we are ensuring two important Vecs for each cluster, all of the points that it covers and the unique number of points that it could potentially be responsible for.

If the determined local_clusters is empty, we subtract one from current and start the next iteration. If not, we then sort the clusters in parallel by the number of points that they cover. If the total points between any two given clusters is equal, then we compare the unique number of points that they cover. This is important for the next step because we will be iterating through the clusters in serial and we want to make sure we get the best before we start discarding other clusters.

Step 4d - Looping and Pushing to our new_clusters HashSet

Source (opens in a new tab)

We iterate through our local_clusters in serial and push the best cluster to our new_clusters HashSet.

  • At the start of the loop check to see if the number of clusters we have already saved is greater than or equal to our max_clusters input and immediately break the entire while loop if so.
  • Next we check every unique point that the cluster is responsible for to see if it has already been clustered, if so, we skip it. This is why sorting them before this step is important.
  • If the cluster passes, then we insert all of the unique points into the blocked_points HashSet and the cluster into our new_clusters HashSet.
if cluster.points.len() >= current {
      for point in cluster.points.iter() {
            if blocked_points.contains(point) {
                  continue 'cluster;
            }
      }
      for point in cluster.points.iter() {
            blocked_points.insert(point);
      }
      new_clusters.insert(cluster);
}

Step 4e - Decrement current and Repeat

Lastly we subtract 1 from our current var and continue to run the next iteration of loop while current is greater than or equal to our min_points input and the length of new_clusters is less than our max_clusters input.

Step 5 - Unique Point Coverage Check

Source (opens in a new tab)

Now that we have our full list of preliminary clusters, we need to make sure that each cluster is responsible for a unique number of points that is greater than or equal to our min_points input.

  1. Create a new r-tree with our potential clusters
  2. Mut iterate through our clusters in parallel
  3. Via the update_unique trait, we then iterate through all of the points that the cluster covers and determine if a point is unique to that cluster by how many clusters are found at the point's location in the r-tree. If the number of clusters is equal to one, we know that the point is unique to that cluster.
  4. Filter out the clusters that do not have a unique number of points that is greater than or equal to our min_points input.
pub fn update_unique(&mut self, tree: &RTree<Point>) {
      let mut points: Vec<_> = self
            .all
            .par_iter()
            .filter_map(|p| {
                  if tree.locate_all_at_point(&p.center).count() == 1 {
                        Some(*p)
                  } else {
                        None
                  }
            })
            .collect();
      points.sort_dedupe();
      self.points = points;
}

Step 6 - Check for Missing

Source (opens in a new tab)

If the min_points input is equal to one, we want to make absolutely sure that no point has been missed as this is most often used in requests that require 100% accuracy. This step is often unnecessary, rarely adds additional clusters, and is generally considered to be a hack that bandaids an unknown issue in the clustering function.

  1. First we reduce all points covered by our clusters into a single HashSet
  2. Next if the length of the HashSet does not equal the length of the input points, we know that at least one point has been missed.
  3. We then iterate through the input points, checking to see which ones exist in the HashSet, saving the ones that do not.
  4. We then extend the current clusters with the missing ones.

Balanced (Legacy)

Source (opens in a new tab)

This algorithm was not the first one written for Kōji but it was the first that was committed to the repo. It operated in at least O(n^2) time, was single threaded, and was total spaghetti. The core of it is based on what Kōji calls "Bootstrapping", which generates circles of a given radius inside of a given Polygon or MultiPolygon to cover the entire area, allowing for routes that pick up all Spawnpoints and Forts.

Concept wise, it was very similar to how Greedy operates now and it ran decently well for creating Fort routes but did not have the logic required to work well with Spawnpoints. I attempted to take what I learned from translating the UDC algorithm and apply it to this algorithm, particularly when it came to combining clusters, since a honeycomb base allowed me to predict which neighboring clusters existed, similar to UDC. However, merging clusters tends to be very tedious work and was prone to errors.

⚠️

The code has now been removed from the repo as it does not provide any benefit but the source is still viewable in the link above.

Brute Force (Legacy)

Source (opens in a new tab)

Shortly before starting work on this algorithm, I had completed the integration with OR-Tools, which utilizes a distance matrix in the TSP wrapper I wrote. I attempted to apply that same logic here as a sort of lookup table for checking which points are within the given radius of neighboring points, and since the values weren't reliant on each other, this calculation could be parallelized with Rayon. The core clustering algorithm is very recognizable as it was the base of the Greedy algorithm. However, it was still slower than what I had hoped for and my attempt to write another merge function wasn't exactly successful.

⚠️

The code has now been removed from the repo as it does not provide any benefit but the source is still viewable in the link above.

Result Comparisons

Now the info you're actually looking for, the results!

Notes:

  • Distance stats have been excluded from each result as the points were unsorted and it is not relevant for directly comparing the clustering algorithms.
  • All algorithms were run on a MacBook Pro M1 with 16GB of RAM and 8 cores.
  • The following inputs were used:
    • min_points: 3
    • radius: 70
    • cluster_split_level: 1

Small Fence

Info

  • Points: 11,064
  • Area: 84,432 km²

Results

# Fastest
      [STATS] =================================================================
      || [AREA] Aalst | Fastest | Radius                                     ||
      || [POINTS] Total: 11064 | Covered: 10861                              ||
      || [CLUSTERS] Total: 1361 | Avg Points: 7                              ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 35                          ||
      || [TIMES] Clustering: 0.02s | Routing: 0.00s | Stats: 0.07s           ||
      || [MYGOD_SCORE] 4286                                                  ||
      =========================================================================
# Balanced (Legacy)
      [STATS] =================================================================
      || [AREA] Aalst | Balanced | Radius                                    ||
      || [POINTS] Total: 11064 | Covered: 10995                              ||
      || [CLUSTERS] Total: 1196 | Avg Points: 9                              ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 32                          ||
      || [TIMES] Clustering: 4.11s | Routing: 0.00s | Stats: 0.06s           ||
      || [MYGOD_SCORE] 3657                                                  ||
      =========================================================================
# Brute Force (Legacy)
      [STATS] =================================================================
      || [AREA] Aalst | BruteForce | Radius                                  ||
      || [POINTS] Total: 11064 | Covered: 10442                              ||
      || [CLUSTERS] Total: 931 | Avg Points: 11                              ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 43                          ||
      || [TIMES] Clustering: 8.23s | Routing: 0.00s | Stats: 0.06s           ||
      || [MYGOD_SCORE] 3415                                                  ||
      =========================================================================
# Fast
      [STATS] =================================================================
      || [AREA] Aalst | Fast | Radius                                        ||
      || [POINTS] Total: 11064 | Covered: 10629                              ||
      || [CLUSTERS] Total: 856 | Avg Points: 12                              ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 45                          ||
      || [TIMES] Clustering: 0.41s | Routing: 0.00s | Stats: 0.04s           ||
      || [MYGOD_SCORE] 3003                                                  ||
      =========================================================================
# Balanced
      [STATS] =================================================================
      || [AREA] Aalst | Balanced | Radius                                    ||
      || [POINTS] Total: 11064 | Covered: 10712                              ||
      || [CLUSTERS] Total: 854 | Avg Points: 12                              ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 44                          ||
      || [TIMES] Clustering: 1.13s | Routing: 0.00s | Stats: 0.05s           ||
      || [MYGOD_SCORE] 2914                                                  ||
      =========================================================================
# Better
      [STATS] =================================================================
      || [AREA] Aalst | Better | Radius                                      ||
      || [POINTS] Total: 11064 | Covered: 10715                              ||
      || [CLUSTERS] Total: 828 | Avg Points: 12                              ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 45                          ||
      || [TIMES] Clustering: 8.18s | Routing: 0.00s | Stats: 0.05s           ||
      || [MYGOD_SCORE] 2833                                                  ||
      =========================================================================

Medium Fence

Info

  • Points: 76,692
  • Area: 276,255 km²

Results

# Fastest
      [STATS] =================================================================
      || [AREA] Amsterdam | Fastest | Radius                                 ||
      || [POINTS] Total: 76692 | Covered: 76189                              ||
      || [CLUSTERS] Total: 9540 | Avg Points: 7                              ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 42                          ||
      || [TIMES] Clustering: 0.10s | Routing: 0.00s | Stats: 0.40s           ||
      || [MYGOD_SCORE] 29123                                                 ||
      =========================================================================
# Balanced (Legacy)
      [STATS] =================================================================
      || [AREA] Amsterdam                                                    ||
      || [POINTS] Total: 76692 | Covered: 76576                              ||
      || [CLUSTERS] Total: 8314 | Avg Points: 9                              ||
      || [BEST_CLUSTER] Amount: 2 | Point Count: 39                          ||
      || [DISTANCE] Total: 1731728m | Longest: 17861m | Avg: 208m            ||
      || [TIMES] Clustering: 206.22s | Routing: 0.00s | Stats: 0.50s         ||
      || [MYGOD_SCORE] 25058                                                 ||
      =========================================================================
# Brute Force (Legacy)
      [STATS] =================================================================
      || [AREA] Amsterdam | BruteForce | Radius                              ||
      || [POINTS] Total: 76692 | Covered: 72854                              ||
      || [CLUSTERS] Total: 6563 | Avg Points: 11                             ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 52                          ||
      || [TIMES] Clustering: 489.92s | Routing: 0.00s | Stats: 0.49s         ||
      || [MYGOD_SCORE] 23527                                                 ||
      =========================================================================
# Fast
      [STATS] =================================================================
      || [AREA] Amsterdam | Fast | Radius                                    ||
      || [POINTS] Total: 76692 | Covered: 73880                              ||
      || [CLUSTERS] Total: 5836 | Avg Points: 12                             ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 53                          ||
      || [TIMES] Clustering: 2.42s | Routing: 0.00s | Stats: 0.40s           ||
      || [MYGOD_SCORE] 20320                                                 ||
      =========================================================================
# Balanced
      [STATS] =================================================================
      || [AREA] Amsterdam | Balanced | Radius                                ||
      || [POINTS] Total: 76692 | Covered: 74551                              ||
      || [CLUSTERS] Total: 5821 | Avg Points: 12                             ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 53                          ||
      || [TIMES] Clustering: 8.20s | Routing: 0.00s | Stats: 0.40s           ||
      || [MYGOD_SCORE] 19604                                                 ||
      =========================================================================
# Better
      [STATS] =================================================================
      || [AREA] Amsterdam | Better | Radius                                  ||
      || [POINTS] Total: 76692 | Covered: 74641                              ||
      || [CLUSTERS] Total: 5690 | Avg Points: 13                             ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 54                          ||
      || [TIMES] Clustering: 41.46s | Routing: 0.00s | Stats: 0.40s          ||
      || [MYGOD_SCORE] 19121                                                 ||
      =========================================================================

Large Fence

Info

  • Points: 169,038
  • Area: 754,594 km²
  • Both of the legacy algorithms have been excluded from this example due to the amount of time it would take to run them.

Results

# Fastest
      [STATS] =================================================================
      || [AREA] Munich | Fastest | Radius                                    ||
      || [POINTS] Total: 169038 | Covered: 167788                            ||
      || [CLUSTERS] Total: 21931 | Avg Points: 7                             ||
      || [BEST_CLUSTER] Amount: 3 | Point Count: 39                          ||
      || [TIMES] Clustering: 0.20s | Routing: 0.00s | Stats: 0.95s           ||
      || [MYGOD_SCORE] 67043                                                 ||
      =========================================================================
# Fast
      [STATS] =================================================================
      || [AREA] Munich | Fast | Radius                                       ||
      || [POINTS] Total: 169038 | Covered: 162747                            ||
      || [CLUSTERS] Total: 13554 | Avg Points: 12                            ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 48                          ||
      || [TIMES] Clustering: 6.48s | Routing: 0.00s | Stats: 0.92s           ||
      || [MYGOD_SCORE] 46953                                                 ||
      =========================================================================
# Balanced
      [STATS] =================================================================
      || [AREA] Munich | Balanced | Radius                                   ||
      || [POINTS] Total: 169038 | Covered: 164028                            ||
      || [CLUSTERS] Total: 13358 | Avg Points: 12                            ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 47                          ||
      || [TIMES] Clustering: 21.16s | Routing: 0.00s | Stats: 0.94s          ||
      || [MYGOD_SCORE] 45084                                                 ||
      =========================================================================
# Better
      [STATS] =================================================================
      || [AREA] Munich | Better | Radius                                     ||
      || [POINTS] Total: 169038 | Covered: 164297                            ||
      || [CLUSTERS] Total: 13048 | Avg Points: 12                            ||
      || [BEST_CLUSTER] Amount: 1 | Point Count: 49                          ||
      || [TIMES] Clustering: 125.90s | Routing: 0.00s | Stats: 0.94s         ||
      || [MYGOD_SCORE] 43885                                                 ||
      =========================================================================

Conclusion

|-------|-----------|------|--------|

FenceAlgorithmTimeScore
SmallFastest0.024286
SmallBalanced (L)4.113657
SmallBrute Force (L)8.233415
SmallFast0.413003
SmallBalanced1.132914
SmallBetter8.182833
MediumFastest0.1029123
MediumBalanced (L)206.2225058
MediumBrute Force (L)489.9223527
MediumFast2.4220320
MediumBalanced8.2019604
MediumBetter41.4619121
LargeFastest0.2067043
LargeFast6.4846953
LargeBalanced21.1645084
LargeBetter125.9043885

While the different variations of the Greedy algorithm don't scale as well as the UDC algorithm, the results are definitely worth it, especially compared to both of the legacy algorithms, which start to become unweildly on anything but smaller sizes fences.

  • The Fastest algorithm is great for quick and dirty calculations that want to cover everything and you need the results immediately.
  • The Fast algorithm is if you want to take advantage of the Greedy algorithm but still need the results very quickly and can't wait for Balanced or Better.
  • The Balanced algorithm is a great balance of speed and result quality. It is the default algorithm that Kōji uses and is what I would recommend for most users.
  • The Better algorithm is for when you want the best result possible and are willing to wait for it. It is the most resource intensive and can take a long time to run on larger fences. I would not recommend running it if your system does not have much free memory.

Future work

The two biggest improvements that can be made to the Greedy algorithm are:

Optimize Potential Cluster Generating

Currently this process takes up the bulk of the algorithm time, which feels unnecessary. It also is a huge resource hog that actually makes it impossible to run Better on areas that are too big if the machine can't handle it.

Additional Post processing

Implementing a Genetic Algorithm (opens in a new tab) approach that further optimizes the intial solution.

  • This could be accomplished by first associating clusters with each other by utilizing the r-tree nearest neighbor methods and all of their respective points.
  • Once the subgroups have been created, try different combination of clusters and see if the mygod_score can be improved at all by trying different combinations of clusters.

Example

If you had a group of 14 points with a min_points of 3, but you have one cluster covering the middle 10, then 2 on either side of that cluster. In this situation, replacing the single centric cluster with two clusters to cover all 14 points would result in an improved mygod_score.