Package org.carrot2.text.preprocessing
Class PreprocessingContext
java.lang.Object
org.carrot2.text.preprocessing.PreprocessingContext
- All Implemented Interfaces:
Closeable, AutoCloseable
Document preprocessing context provides low-level (usually integer-coded) data structures useful
for further processing.
-
Nested Class Summary
Nested Classes
static class: Information about all fields processed for the input documents.
class: Information about words and phrases that might be good cluster label candidates.
class: Information about all frequently appearing sequences of words found in the input documents.
class: Information about all unique stems found in the input documents.
class: Information about all tokens of the input documents.
class: Information about all unique words found in the input documents.
-
Field Summary
Fields
allFields: Information about all fields processed for the input documents.
allLabels: Information about words and phrases that might be good cluster label candidates.
allPhrases: Information about all frequently appearing sequences of words found in the input documents.
allStems: Information about all unique stems found in the input documents.
allTokens: Information about all tokens of the input documents.
allWords: Information about all unique words found in the input documents.
int documentCount: Count of documents processed by the tokenizer.
final LanguageComponents languageComponents: Language model to be used.
-
Constructor Summary
Constructors
PreprocessingContext(LanguageComponents languageComponents): Creates a preprocessing context for the provided documents and with the provided languageModel.
-
Method Summary
void close(): This method should be invoked after all preprocessing contributors have been executed to release temporary data structures.
format(LabelFormatter formatter, int featureIndex): Applies label formatter to a given word or phrase (depending on the feature index provided).
boolean hasLabels(): Returns true if this context contains any label candidates.
boolean hasWords(): Returns true if this context contains any words.
char[] intern(MutableCharArray chs): Return a unique char buffer representing a given character sequence.
static int[] toFieldIndexes(byte b): Convert the selected bits in a byte to an array of indexes.
toString()
-
Field Details
-
languageComponents
Language model to be used.
-
documentCount
public int documentCount
Count of documents processed by the tokenizer.
-
allTokens
Information about all tokens of the input documents.
-
allFields
Information about all fields processed for the input documents.
-
allWords
Information about all unique words found in the input documents.
-
allStems
Information about all unique stems found in the input documents.
-
allPhrases
Information about all frequently appearing sequences of words found in the input documents.
-
allLabels
Information about words and phrases that might be good cluster label candidates.
-
-
Constructor Details
-
PreprocessingContext
Creates a preprocessing context for the provided documents and with the provided languageModel.
-
-
Method Details
-
hasWords
public boolean hasWords()
Returns true if this context contains any words.
-
hasLabels
public boolean hasLabels()
Returns true if this context contains any label candidates.
-
format
Applies label formatter to a given word or phrase (depending on the feature index provided).
-
toString
-
toFieldIndexes
public static int[] toFieldIndexes(byte b)
Convert the selected bits in a byte to an array of indexes.
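The bit-to-index conversion described for toFieldIndexes can be illustrated with a small stand-alone sketch. It is not the Carrot2 implementation: the class name FieldIndexSketch is hypothetical, and the mapping of bit i to index i, lowest bit first, is an assumption made for illustration only.

```java
import java.util.Arrays;

/** Illustrative stand-in; not the Carrot2 implementation. */
final class FieldIndexSketch {
  private FieldIndexSketch() {}

  /**
   * Returns the zero-based positions of the bits set in b, lowest bit first.
   * Assumption: bit i corresponds to index i.
   */
  static int[] toFieldIndexes(byte b) {
    int bits = b & 0xFF; // treat the byte as an unsigned 8-bit mask
    int[] result = new int[Integer.bitCount(bits)];
    int next = 0;
    for (int bit = 0; bit < 8; bit++) {
      if ((bits & (1 << bit)) != 0) {
        result[next++] = bit;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Bits 0 and 2 are set in 0b0000_0101.
    System.out.println(Arrays.toString(toFieldIndexes((byte) 0b0000_0101))); // prints [0, 2]
  }
}
```

Under these assumptions, a byte with all eight bits set yields the indexes 0 through 7, and a zero byte yields an empty array.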
-
close
public void close()
This method should be invoked after all preprocessing contributors have been executed to release temporary data structures.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
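Because the class implements Closeable and AutoCloseable, it can be managed with a try-with-resources statement, so that close() runs once the processing block finishes. The sketch below shows that pattern with a hypothetical stand-in class, ScratchContextSketch, rather than PreprocessingContext itself. The scratch buffer and the released() helper are assumptions for illustration, not part of the documented API.

```java
import java.io.Closeable;

/** Hypothetical stand-in for a context that owns temporary buffers; not a Carrot2 class. */
final class ScratchContextSketch implements Closeable {
  private int[] scratch = new int[1024]; // temporary working storage (assumed)

  /** Reports whether the temporary storage has been released. */
  boolean released() {
    return scratch == null;
  }

  @Override
  public void close() {
    scratch = null; // drop the reference so the buffer can be garbage-collected
  }

  public static void main(String[] args) {
    ScratchContextSketch ctx = new ScratchContextSketch();
    try (ScratchContextSketch c = ctx) {
      // All processing steps that need the temporary storage run here.
      System.out.println("released inside block: " + c.released());
    }
    // close() has been called automatically at this point.
    System.out.println("released after block: " + ctx.released());
  }
}
```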
-
intern
Return a unique char buffer representing a given character sequence.
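The idea of interning, returning one shared buffer for every occurrence of the same character sequence, can be sketched as follows. This is a minimal illustration, not the library's implementation: CharBufferInternerSketch is a hypothetical class, CharSequence stands in for MutableCharArray, and a HashMap keyed by String is an assumed pooling strategy.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative interning pool; not the Carrot2 implementation. */
final class CharBufferInternerSketch {
  // Maps the text of a sequence to the single char[] that represents it.
  private final Map<String, char[]> pool = new HashMap<>();

  /** Returns the same char[] instance for every sequence with equal content. */
  char[] intern(CharSequence chs) {
    return pool.computeIfAbsent(chs.toString(), String::toCharArray);
  }

  public static void main(String[] args) {
    CharBufferInternerSketch interner = new CharBufferInternerSketch();
    char[] first = interner.intern("data");
    char[] second = interner.intern(new StringBuilder("data"));
    System.out.println(first == second); // prints true: both calls share one buffer
  }
}
```

Sharing one buffer per distinct sequence lets callers compare interned buffers by reference and avoids keeping duplicate copies of frequently repeated words.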
-