COMMON CLUSTERING ALGORITHMS (PART 3)
Mean Shift
Steps
- Randomly pick up a data point as centroid.
- Calculate the Euclidean distance between the centroid and other data points.
- If the distance is less than the bandwidth, add the data point to the cluster c1.
- Calculate the average of all data points within the cluster c1, and shift the cluster to the average. Update the new centroid.
- Repeat until converge.
- If c1 is overlapped with c2, another cluster, only the cluster with more points is preserved.
- Repeat until all data points are assigned to a cluster.
Pros
- Single parameter (bandwidth)
- Does not assume any prior shape on data clusters
- Robust to outliers
Cons
- Output depends on the bandwidth
- Time complexity of O(n2)
OPTICS
Similar to DBSCAN, OPTICS has following parameters:
- MinPts: The minimum number of points to form a dense region;
- Eps (Optional): The maximum distance between two data points to be considered as a cluster.
Besides, there are more definitions:
- Core point: A point p is a core point if at least MinPts points are found within its Eps-neighborhood;
- Core distance: The minimum Eps to make a distinct point a core point;
- Reachability distance: The reachability distance of a core point p with regards to another data point o equals to max(coreDist(p), dist(o, p))
Steps
- Initialize a ordered list for storing the final sorted points;
- Choose an unprocessed point from the dataset, mark it as processed and put it in the ordered list:
- If it’s a core point, then initiate another priority queue and iterate through all its neighbors:
- For each unprocessed neighbor point, calculate its reachability distance from the core point:
- If it’s not in the priority queue, insert the neighbor point with its reachability distance;
- If it’s already in the priority queue and the new reachability distance is smaller, update the reachability distance of the neighbor point;
- After iterating all unprocessed neighbor points, iterate through the priority queue, mark each point as processed, put it in the ordered list and update its neighbor points’ reachability distance
- For each point p in the ordered list, denote its reachability distance as
r_p
and its core distance asc_p
:
If r_p > Eps or not r_p:
if c_p <= Eps and c_p:
// p is a new cluster
else:
// p is an outlier
else:
// p belongs to current cluster
Below is a pseudocode of step 2 from Wikipedia:
function OPTICS(DB, eps, MinPts) is
for each point p of DB do
p.reachability-distance = UNDEFINED
for each unprocessed point p of DB do
N = getNeighbors(p, eps)
mark p as processed
output p to the ordered list
if core-distance(p, eps, MinPts) != UNDEFINED then
Seeds = empty priority queue
update(N, p, Seeds, eps, MinPts)
for each next q in Seeds do
N' = getNeighbors(q, eps)
mark q as processed
output q to the ordered list
if core-distance(q, eps, MinPts) != UNDEFINED do
update(N', q, Seeds, eps, MinPts)
function update(N, p, Seeds, eps, MinPts) is
coredist = core-distance(p, eps, MinPts)
for each o in N
if o is not processed then
new-reach-dist = max(coredist, dist(p,o))
if o.reachability-distance == UNDEFINED then // o is not in Seeds
o.reachability-distance = new-reach-dist
Seeds.insert(o, new-reach-dist)
else // o in Seeds, check for improvement
if new-reach-dist < o.reachability-distance then
o.reachability-distance = new-reach-dist
Seeds.move-up(o, new-reach-dist)
OPTICS vs DBSCAN
- DBSCAN is sensitive on radius since Eps is fixed, which means it could be difficult to handle unbalanced density;
- OPTICS costs more on memory due to its usage of priority queue.
References
- Mean shift Clustering algorithm from scratch
- 机器学习(十)Mean Shift 聚类算法
- Lecture 13: k-means and mean-shift clustering
- Mean Shift
- Clustering Using OPTICS
- OPTICS (Ordering points to identify the clustering structure)
- OPTICS聚类算法原理
- 聚类算法初探(六)OPTICS
- OPTICS algorithm