COMMON CLUSTERING ALGORITHMS (PART 3)
Jun 02, 2021 Misc
- Randomly pick up a data point as centroid.
- Calculate the Euclidean distance between the centroid and other data points.
- If the distance is less than the bandwidth, add the data point to the cluster c1.
- Calculate the average of all data points within the cluster c1, and shift the cluster to the average. Update the new centroid.
- Repeat until converge.
- If c1 is overlapped with c2, another cluster, only the cluster with more points is preserved.
- Repeat until all data points are assigned to a cluster.
- Single parameter (bandwidth)
- Does not assume any prior shape on data clusters
- Robust to outliers
- Output depends on the bandwidth
- Time complexity of O(n2)
Similar to DBSCAN, OPTICS has following parameters:
- MinPts: The minimum number of points to form a dense region;
- Eps (Optional): The maximum distance between two data points to be considered as a cluster.
Besides, there are more definitions:
- Core point: A point p is a core point if at least MinPts points are found within its Eps-neighborhood;
- Core distance: The minimum Eps to make a distinct point a core point;
- Reachability distance: The reachability distance of a core point p with regards to another data point o equals to max(coreDist(p), dist(o, p))
- Initialize a ordered list for storing the final sorted points;
- Choose an unprocessed point from the dataset, mark it as processed and put it in the ordered list:
- If it’s a core point, then initiate another priority queue and iterate through all its neighbors:
- For each unprocessed neighbor point, calculate its reachability distance from the core point:
- If it’s not in the priority queue, insert the neighbor point with its reachability distance;
- If it’s already in the priority queue and the new reachability distance is smaller, update the reachability distance of the neighbor point;
- After iterating all unprocessed neighbor points, iterate through the priority queue, mark each point as processed, put it in the ordered list and update its neighbor points’ reachability distance
- For each point p in the ordered list, denote its reachability distance as
r_pand its core distance as
Below is a pseudocode of step 2 from Wikipedia:
OPTICS vs DBSCAN
- DBSCAN is sensitive on radius since Eps is fixed, which means it could be difficult to handle unbalanced density;
- OPTICS costs more on memory due to its usage of priority queue.