Package org.carrot2.text.preprocessing
Class BasicPreprocessingPipeline
java.lang.Object
org.carrot2.attrs.AttrComposite
org.carrot2.text.preprocessing.BasicPreprocessingPipeline
- All Implemented Interfaces:
AcceptingVisitor, ContextPreprocessor
Performs basic preprocessing steps on the provided documents. The preprocessing consists of the
following steps:
InputTokenizer
CaseNormalizer
LanguageModelStemmer
StopListMarker
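The four steps above can be pictured with a toy, self-contained sketch. It does not use the Carrot2 classes: the class name PipelineSketch, the stop list, the plural-stripping "stemmer" and the "/stop" suffix are all assumptions made for illustration, and the sketch assumes that stop words are flagged rather than removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration of tokenize -> normalize case -> stem -> mark stop words.
// Every name here is hypothetical; none of this is the Carrot2 implementation.
final class PipelineSketch {
    // Hypothetical stop list used only by this sketch.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and");

    // Step 1: split the text on runs of characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 2: fold every token to lower case.
    static String normalizeCase(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Step 3: a deliberately naive stemmer that strips one trailing "s" from longer words.
    static String stem(String token) {
        boolean plural = token.length() > 3 && token.endsWith("s");
        return plural ? token.substring(0, token.length() - 1) : token;
    }

    // Step 4: check the normalized token against the stop list.
    static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token);
    }

    // Runs the four steps in order; stop words are kept but tagged with "/stop".
    static List<String> preprocess(String text) {
        List<String> result = new ArrayList<>();
        for (String raw : tokenize(text)) {
            String normalized = normalizeCase(raw);
            String stemmed = stem(normalized);
            result.add(isStopWord(normalized) ? stemmed + "/stop" : stemmed);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [the/stop, cluster, of/stop, document]
        System.out.println(preprocess("The Clusters of Documents"));
    }
}
```

Keeping each step as a separate function mirrors the way the pipeline exposes one component per step, so a single step can be swapped or tuned without touching the others.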
-
Field Summary
Fields
Modifier and Type | Field | Description
protected final org.carrot2.text.preprocessing.CaseNormalizer | caseNormalizer | Case normalizer used by the algorithm.
protected final org.carrot2.text.preprocessing.LanguageModelStemmer | stemming | Stemmer used by the algorithm.
protected final org.carrot2.text.preprocessing.StopListMarker | stopListMarker | Stop list marker used by the algorithm, contains modifiable parameters.
protected final org.carrot2.text.preprocessing.InputTokenizer | tokenizer | Tokenizer used by the algorithm.
final AttrInteger | wordDfThreshold | Word Document Frequency threshold.
Fields inherited from class org.carrot2.attrs.AttrComposite
attributes
-
Constructor Summary
Constructors
BasicPreprocessingPipeline()
-
Method Summary
Modifier and Type | Method | Description
PreprocessingContext | preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel) | Performs preprocessing on the provided list of documents.
Methods inherited from class org.carrot2.attrs.AttrComposite
accept
-
Field Details
-
wordDfThreshold
Word Document Frequency threshold. Words appearing in fewer than dfThreshold documents will be ignored.
-
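As a rough illustration of what a document frequency threshold does, the following self-contained sketch counts in how many documents each word occurs and keeps only the words that reach the threshold. The class DfThresholdSketch and its methods are hypothetical and do not reflect how Carrot2 implements the filter internally.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy document-frequency filter; all names are hypothetical, not Carrot2 API.
final class DfThresholdSketch {
    // Counts, for each word, the number of documents that contain it at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set removes repeats so a word counts once per document.
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    // Keeps words whose document frequency is at least dfThreshold;
    // words appearing in fewer documents are ignored.
    static Set<String> retainedWords(List<List<String>> docs, int dfThreshold) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency(docs).entrySet()) {
            if (entry.getValue() >= dfThreshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("data", "mining", "data"),
            List.of("data", "clustering"),
            List.of("clustering", "search"));
        // Prints [clustering, data]
        System.out.println(retainedWords(docs, 2));
    }
}
```

Note that "data" occurs twice in the first document but still counts once there, since the threshold applies to the number of documents, not the number of occurrences.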
caseNormalizer
protected final org.carrot2.text.preprocessing.CaseNormalizer caseNormalizer
Case normalizer used by the algorithm.
-
stemming
protected final org.carrot2.text.preprocessing.LanguageModelStemmer stemming
Stemmer used by the algorithm.
-
stopListMarker
protected final org.carrot2.text.preprocessing.StopListMarker stopListMarker
Stop list marker used by the algorithm, contains modifiable parameters.
-
tokenizer
protected final org.carrot2.text.preprocessing.InputTokenizer tokenizer
Tokenizer used by the algorithm.
-
-
Constructor Details
-
BasicPreprocessingPipeline
public BasicPreprocessingPipeline()
-
-
Method Details
-
preprocess
public PreprocessingContext preprocess(Stream<? extends Document> documents, String query, LanguageComponents langModel)
Performs preprocessing on the provided list of documents. Results can be obtained from the returned PreprocessingContext.
- Specified by:
preprocess in interface ContextPreprocessor
-
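The documents parameter is declared as Stream<? extends Document>, so a caller holding a collection of its own document subtype can pass that collection's stream directly. The sketch below shows this parameter shape with stand-in types only: Doc, Article and collectText are hypothetical and are not Carrot2 classes or methods.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrates a Stream<? extends T> parameter using hypothetical stand-in types.
final class WildcardStreamSketch {
    // Hypothetical stand-in for a document abstraction; not the Carrot2 Document interface.
    interface Doc {
        String text();
    }

    // Hypothetical caller-side document type.
    static final class Article implements Doc {
        private final String title;

        Article(String title) {
            this.title = title;
        }

        @Override
        public String text() {
            return title;
        }
    }

    // Accepts a stream of any Doc subtype, the same shape as the documents parameter above.
    static List<String> collectText(Stream<? extends Doc> documents) {
        return documents.map(Doc::text).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Article> articles = List.of(new Article("Data mining"), new Article("Web search"));
        // A Stream<Article> is accepted without any cast or copy.
        // Prints [Data mining, Web search]
        System.out.println(collectText(articles.stream()));
    }
}
```

A Java stream can be consumed only once, so a caller that needs the same documents again after such a call has to create a fresh stream from the underlying collection.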