Clustering: A Powerful Tool for Data Mining and Machine Learning
Clustering is a vital technique in data mining and unsupervised machine learning. Its purpose is to group similar items together. There are two primary forms of clustering: metric clustering and graph clustering. Metric clustering groups data points based on their separation in a given metric space, while graph clustering connects similar data points through edges and groups them accordingly. These clustering methods are especially useful for large collections of data without predefined class labels, like digital text collections on the internet. They have applications in organizing and searching documents, identifying patterns in text, and recommending relevant content to users.
The Dilemma of Text Clustering: Embedding Models vs. Cross-Attention Models
To perform text clustering, researchers often face a dilemma: whether to use embedding models or cross-attention (CA) models. Embedding models like BERT or RoBERTa define a metric clustering problem. On the other hand, CA models like PaLM or GPT define a graph clustering problem. CA models can provide highly accurate similarity scores, but constructing the input graph requires a large number of inference calls to the model. Embedding models, while efficient, produce similarity distances of lower quality compared to CA models.
Introducing KwikBucks: A Novel Clustering Algorithm
To overcome this dilemma, we present KwikBucks, a new clustering algorithm that combines the scalability of embedding models and the quality of CA models. This algorithm has access to both models but sets a budget on the number of queries made to the CA model. It strategically uses the CA model to answer edge queries and relies on the embedding model for unlimited access to similarity scores. We demonstrate how KwikBucks produces high-quality clusters with a relatively low number of query calls to the CA model. We have also made the data used in our experiments open-source.
How Does KwikBucks Work?
KwikBucks builds upon the popular KwikCluster algorithm (Pivot algorithm). It begins by selecting a set of documents as centers, with no similarity edges between them. Clusters are formed around these centers. The combo similarity oracle mechanism, a novel approach we introduce, uses the embedding model to guide queries sent to the CA model. This mechanism ranks centers based on their embedding similarity to a target document and queries the CA model accordingly. Post-processing steps are then performed to merge clusters with strong connections.
Results and Performance
We evaluated KwikBucks on different datasets using various models. Compared to the best baselines, KwikBucks showed a 45% relative improvement in performance across all datasets. It consistently outperformed other baselines at different query budgets. The F1-score, a measure of clustering quality, was used to assess the obtained solutions from our experiments.
Clustering is a powerful tool in data mining and machine learning. We presented KwikBucks, a clustering algorithm that offers the scalability of embedding models and the quality of CA models. It can be applied to various clustering problems with different similarity oracles. Our extensive experiments validated its effectiveness. For more details, please refer to our research paper.
We would like to thank Sandeep Silwal for initiating this project during his summer internship at Google in 2022. We also express our gratitude to our co-authors and everyone who contributed to this work. Special thanks to Ravi Kumar and John Guilyard for their assistance with this blog post.