Package org.carrot2.clustering.kmeans
Class BisectingKMeansClusteringAlgorithm
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm
- All Implemented Interfaces:
AcceptingVisitor, ClusteringAlgorithm
public class BisectingKMeansClusteringAlgorithm
extends AttrComposite
implements ClusteringAlgorithm
A very simple implementation of bisecting k-means clustering. Unlike other algorithms in Carrot2,
this one produces a hard clustering (each document belongs to exactly one cluster). On the other hand,
the clusters are labeled only with individual words, which may not fully correspond to all
documents in the cluster.
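The idea of bisecting k-means and of a hard clustering can be illustrated with a minimal, self-contained sketch. This is not Carrot2's implementation: it works on plain numeric vectors instead of a term-document matrix, it always splits the largest cluster into exactly two parts, and the class and method names (`BisectingKMeansSketch`, `cluster`, `twoMeans`) are hypothetical. The seeding strategy is likewise an assumption, chosen only to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BisectingKMeansSketch {

  /** Splits points into at most clusterCount groups; each point ends up in exactly one group. */
  static List<List<double[]>> cluster(List<double[]> points, int clusterCount, int maxIterations) {
    List<List<double[]>> clusters = new ArrayList<>();
    clusters.add(new ArrayList<>(points));
    while (clusters.size() < clusterCount) {
      // Choose the largest cluster as the next one to bisect (an assumption of this sketch).
      int largest = 0;
      for (int i = 1; i < clusters.size(); i++) {
        if (clusters.get(i).size() > clusters.get(largest).size()) largest = i;
      }
      List<double[]> target = clusters.get(largest);
      if (target.size() < 2) break;
      List<List<double[]>> halves = twoMeans(target, maxIterations);
      // Stop when the cluster cannot be split (for example, all points are identical).
      if (halves.get(0).isEmpty() || halves.get(1).isEmpty()) break;
      clusters.remove(largest);
      clusters.addAll(halves);
    }
    return clusters;
  }

  /** Plain 2-means on one cluster, limited to maxIterations assignment rounds. */
  private static List<List<double[]>> twoMeans(List<double[]> points, int maxIterations) {
    // Deterministic seeding: the first point and the point farthest from it.
    double[] c0 = points.get(0);
    double[] c1 = c0;
    for (double[] p : points) {
      if (dist2(p, c0) > dist2(c1, c0)) c1 = p;
    }
    int[] side = new int[points.size()];
    for (int iter = 0; iter < maxIterations; iter++) {
      boolean changed = (iter == 0);
      for (int i = 0; i < points.size(); i++) {
        double[] p = points.get(i);
        int s = dist2(p, c0) <= dist2(p, c1) ? 0 : 1;
        if (s != side[i]) changed = true;
        side[i] = s;
      }
      if (!changed) break; // assignments are stable
      c0 = mean(points, side, 0, c0);
      c1 = mean(points, side, 1, c1);
    }
    List<double[]> a = new ArrayList<>();
    List<double[]> b = new ArrayList<>();
    for (int i = 0; i < points.size(); i++) (side[i] == 0 ? a : b).add(points.get(i));
    List<List<double[]>> halves = new ArrayList<>();
    halves.add(a);
    halves.add(b);
    return halves;
  }

  /** Mean of the points assigned to one side; keeps the previous centroid if the side is empty. */
  private static double[] mean(List<double[]> points, int[] side, int which, double[] fallback) {
    double[] sum = new double[fallback.length];
    int n = 0;
    for (int i = 0; i < points.size(); i++) {
      if (side[i] != which) continue;
      for (int d = 0; d < sum.length; d++) sum[d] += points.get(i)[d];
      n++;
    }
    if (n == 0) return fallback;
    for (int d = 0; d < sum.length; d++) sum[d] /= n;
    return sum;
  }

  /** Squared Euclidean distance. */
  private static double dist2(double[] x, double[] y) {
    double s = 0;
    for (int d = 0; d < x.length; d++) s += (x[d] - y[d]) * (x[d] - y[d]);
    return s;
  }

  public static void main(String[] args) {
    List<double[]> points = Arrays.asList(
        new double[] {0, 0}, new double[] {0, 1}, new double[] {1, 0},
        new double[] {10, 10}, new double[] {10, 11}, new double[] {11, 10},
        new double[] {50, 50}, new double[] {51, 50});
    for (List<double[]> c : cluster(points, 3, 10)) {
      System.out.println("cluster of size " + c.size());
    }
  }
}
```

Because every point is assigned to exactly one side of each split, the result is a partition of the input. The loop may also stop early, so fewer clusters than requested can come back.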
-
Field Summary
Fields (modifier and type, field, description):
final AttrInteger clusterCount: Number of clusters to create.
dictionaries: Per-request overrides of language components (dictionaries).
final AttrInteger labelCount: Minimum number of labels to return for each cluster.
matrixBuilder: Configuration of the size and contents of the term-document matrix.
matrixReducer: Configuration of the matrix decomposition method to use for clustering.
final AttrInteger maxIterations: Maximum number of k-means iterations to perform.
static final String NAME
final AttrInteger partitionCount: Number of partitions to create at each k-means clustering iteration.
preprocessing: Configuration of the text preprocessing stage.
final AttrString queryHint: Query terms used to retrieve documents.
final AttrBoolean useDimensionalityReduction: If enabled, k-means will be applied on the dimensionality-reduced term-document matrix.
Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
-
Constructor Summary
Constructors
BisectingKMeansClusteringAlgorithm()
-
Method Summary
Methods (modifier and type, method, description):
<T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents): Cluster a set of documents.
Methods inherited from class org.carrot2.attrs.AttrComposite
accept
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.carrot2.attrs.AcceptingVisitor
accept
Methods inherited from interface org.carrot2.clustering.ClusteringAlgorithm
optionalLanguageComponents, supports
-
Field Details
-
NAME
-
clusterCount
Number of clusters to create. The algorithm will create at most the specified number of clusters.
maxIterations
Maximum number of k-means iterations to perform.
partitionCount
Number of partitions to create at each k-means clustering iteration.
labelCount
Minimum number of labels to return for each cluster.
queryHint
Query terms used to retrieve documents. The query is used as a hint to avoid trivial clusters.
useDimensionalityReduction
If enabled, k-means is applied to the dimensionality-reduced term-document matrix. The number of dimensions equals twice the number of requested clusters. If the number of input documents is lower than the number of dimensions, reduction is not performed. If disabled, k-means is performed directly on the original term-document matrix.
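The rule above can be written down as a small decision function. This is a sketch under a stated assumption, not code taken from Carrot2: the rule is read here as "reduce only when there are more input documents than target dimensions", since a matrix cannot usefully be reduced to more dimensions than it has documents. The class and method names are hypothetical.

```java
final class ReductionRuleSketch {

  /** Target dimensionality: twice the number of requested clusters, as the description states. */
  static int targetDimensions(int clusterCount) {
    return 2 * clusterCount;
  }

  /**
   * Assumed reading of the rule: reduce only when the option is enabled and
   * the target dimensionality is lower than the number of input documents.
   */
  static boolean shouldReduce(boolean useDimensionalityReduction, int clusterCount, int documentCount) {
    return useDimensionalityReduction && targetDimensions(clusterCount) < documentCount;
  }

  public static void main(String[] args) {
    System.out.println(shouldReduce(true, 25, 100)); // 50 target dimensions, 100 documents
    System.out.println(shouldReduce(true, 25, 40));  // 50 target dimensions, 40 documents
  }
}
```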
matrixBuilder
Configuration of the size and contents of the term-document matrix.
matrixReducer
Configuration of the matrix decomposition method to use for clustering.
preprocessing
Configuration of the text preprocessing stage.
dictionaries
Per-request overrides of language components (dictionaries).
- Since:
- 4.1.0
-
-
Constructor Details
-
BisectingKMeansClusteringAlgorithm
public BisectingKMeansClusteringAlgorithm()
-
-
Method Details
-
requiredLanguageComponents
- Specified by:
requiredLanguageComponents in interface ClusteringAlgorithm
- Returns:
- A set of classes required to be present in the LanguageComponents instance provided for clustering.
-
cluster
public <T extends Document> List<Cluster<T>> cluster(Stream<? extends T> docStream, LanguageComponents languageComponents)
Description copied from interface: ClusteringAlgorithm
Cluster a set of documents.
- Specified by:
cluster in interface ClusteringAlgorithm
- Type Parameters:
T - Any subclass of Document. Clusters of objects of the same type are returned.
- Parameters:
docStream - A stream of documents for clustering.
languageComponents - LanguageComponents with a set of suppliers for the required language-specific components.
- Returns:
- A list of top-level clusters (clusters can form a hierarchy via Cluster.getClusters()).
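Since the returned list holds only top-level clusters, callers that want every cluster need to walk the hierarchy. The following is a minimal sketch of such a walk, written generically so it runs without Carrot2 on the classpath. `ClusterTreeSketch` and `outline` are hypothetical names, and the `children` function stands in for `Cluster.getClusters()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class ClusterTreeSketch {

  /** Returns one indented line per node, depth-first, with parents before their children. */
  static <C> List<String> outline(
      List<C> topLevel, Function<C, List<C>> children, Function<C, String> label) {
    List<String> lines = new ArrayList<>();
    collect(topLevel, children, label, 0, lines);
    return lines;
  }

  private static <C> void collect(
      List<C> nodes, Function<C, List<C>> children, Function<C, String> label,
      int depth, List<String> lines) {
    for (C node : nodes) {
      StringBuilder line = new StringBuilder();
      for (int i = 0; i < depth; i++) line.append("  "); // two spaces per nesting level
      lines.add(line.append(label.apply(node)).toString());
      collect(children.apply(node), children, label, depth + 1, lines);
    }
  }
}
```

With Carrot2 types, the `children` argument would presumably be `Cluster::getClusters`. The label function depends on what the caller wants to print and is left open here.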
-