Package org.carrot2.text.vsm
Class TermDocumentMatrixBuilder
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.vsm.TermDocumentMatrixBuilder
All Implemented Interfaces: AcceptingVisitor
Builds a term-document matrix based on the provided PreprocessingContext.
Field Summary
final AttrDouble boostedFieldWeight
The extra weight to apply to words that appeared in boosted fields.
boostFields
A list of fields for which to apply extra weight.
final AttrInteger maximumMatrixSize
Maximum number of elements the term-document matrix can have.
final AttrDouble maxWordDf
Maximum document frequency allowed for words as a fraction of all documents.
termWeighting
Method for calculating weights of words in the term-document matrices.
Fields inherited from class org.carrot2.attrs.AttrComposite: attributes
Constructor Summary
TermDocumentMatrixBuilder()
Method Summary
void buildTermDocumentMatrix(VectorSpaceModelContext vsmContext)
Builds a term-document matrix from the data provided in the context and stores the result there.
void buildTermPhraseMatrix
Builds a term-phrase matrix in the same space as the main term-document matrix.
static final com.carrotsearch.hppc.IntIntHashMap contantOrderIntIntHashMap(int seed)
Methods inherited from class org.carrot2.attrs.AttrComposite: accept
Field Details
boostedFieldWeight
The extra weight to apply to words that appeared in boosted fields. The larger the value, the stronger the boost.
boostFields
A list of fields for which to apply extra weight. Content of the fields listed in this parameter can be given more weight during clustering. You may want to boost, for example, the title field on the assumption that it accurately summarizes the content of the whole document.
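
As a rough illustration of how boostFields and boostedFieldWeight could interact, the sketch below multiplies a word's base weight by the boost when the word occurs in at least one boosted field. The class FieldBoostSketch and the method boostedWeight are hypothetical names invented for this example; this is an assumption about the general idea, not the library's actual computation.

```java
import java.util.Collections;
import java.util.Set;

public class FieldBoostSketch {
    /**
     * Hypothetical helper (not part of Carrot2): returns the base weight
     * multiplied by boostedFieldWeight if the word occurs in any boosted field,
     * and the unchanged base weight otherwise.
     */
    static double boostedWeight(
            double baseWeight,
            Set<String> fieldsContainingWord,
            Set<String> boostFields,
            double boostedFieldWeight) {
        // disjoint() is true when the two sets share no field name.
        boolean inBoostedField = !Collections.disjoint(fieldsContainingWord, boostFields);
        return inBoostedField ? baseWeight * boostedFieldWeight : baseWeight;
    }

    public static void main(String[] args) {
        Set<String> boostFields = Set.of("title");
        // The word occurs in the title, so its weight is boosted.
        System.out.println(boostedWeight(1.5, Set.of("title", "body"), boostFields, 2.0));
        // The word occurs only in the body, so its weight is unchanged.
        System.out.println(boostedWeight(1.5, Set.of("body"), boostFields, 2.0));
    }
}
```

With a boost of 2.0, a word found in the title ends up with twice the weight of an otherwise identical word found only in the body, which is the "stronger boost" effect the field descriptions refer to.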
maximumMatrixSize
Maximum number of elements the term-document matrix can have. A larger allowed matrix size makes clustering more accurate, but also more time- and memory-consuming.
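
To get a feel for what an element limit means in practice, the sketch below assumes the matrix has one row per term and one column per document, so the number of terms that fit is the element limit divided by the document count. Both that layout and the helper name maxTermRows are assumptions made for illustration; the builder's actual term selection may work differently.

```java
public class MatrixSizeSketch {
    /**
     * Hypothetical helper (not part of Carrot2): the number of term rows that fit
     * within maximumMatrixSize elements, assuming one column per document.
     */
    static int maxTermRows(int maximumMatrixSize, int documentCount) {
        if (documentCount <= 0) {
            throw new IllegalArgumentException("documentCount must be positive");
        }
        // Integer division: a partially filled row does not count.
        return maximumMatrixSize / documentCount;
    }

    public static void main(String[] args) {
        // 37,500 elements spread over 250 documents leave room for 150 terms.
        System.out.println(maxTermRows(37_500, 250));
        // Doubling the document count halves the number of terms that fit.
        System.out.println(maxTermRows(37_500, 500));
    }
}
```

Under this assumption, the term budget shrinks as the document set grows, which is why a larger limit can improve accuracy on big inputs.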
maxWordDf
Maximum document frequency allowed for words as a fraction of all documents. Words with document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter in how many documents they appear.
This parameter may be useful when certain words appear in most of the input documents (e.g. a company name from a header or footer) and such words dominate the cluster labels. In such a case, setting it to a value lower than 1.0 (e.g. 0.9) may improve the clusters.
Another useful application of this parameter is when there is a need to generate only very specific clusters, that is, clusters containing small numbers of documents. This can be achieved by setting maxWordDf to very low values, such as 0.1 or 0.05.
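
The filtering rule described for maxWordDf can be sketched in a few lines. The class MaxWordDfSketch and the method retainedWords are hypothetical names used only for this illustration, and the code is a minimal model of the stated rule (drop words whose document fraction exceeds the threshold), not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxWordDfSketch {
    /**
     * Hypothetical helper (not part of Carrot2): returns, in alphabetical order,
     * the words whose document frequency as a fraction of all documents does not
     * exceed maxWordDf.
     */
    static List<String> retainedWords(
            Map<String, Integer> documentFrequency, int documentCount, double maxWordDf) {
        List<String> kept = new ArrayList<>();
        // TreeMap gives a deterministic, sorted iteration order.
        for (Map.Entry<String, Integer> e : new TreeMap<>(documentFrequency).entrySet()) {
            double fraction = (double) e.getValue() / documentCount;
            if (fraction <= maxWordDf) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Document frequencies in a made-up collection of 10 documents.
        Map<String, Integer> df = Map.of("acme", 10, "cluster", 3, "matrix", 2);
        // "acme" appears in every document and is dropped at a 0.4 threshold.
        System.out.println(retainedWords(df, 10, 0.4));
        // A threshold of 1.0 keeps every word.
        System.out.println(retainedWords(df, 10, 1.0));
    }
}
```

Here "acme" plays the role of the company name from a header or footer: it occurs in every document, so any threshold below 1.0 removes it from consideration.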
termWeighting
Method for calculating weights of words in the term-document matrices.
Constructor Details
TermDocumentMatrixBuilder
public TermDocumentMatrixBuilder()
Method Details
buildTermDocumentMatrix
Builds a term-document matrix from the data provided in the context and stores the result there.
contantOrderIntIntHashMap
public static final com.carrotsearch.hppc.IntIntHashMap contantOrderIntIntHashMap(int seed)
buildTermPhraseMatrix
Builds a term-phrase matrix in the same space as the main term-document matrix. If the processing context contains no phrases, VectorSpaceModelContext.termPhraseMatrix will remain null.