Class PreprocessingContext.AllStems

java.lang.Object
org.carrot2.text.preprocessing.PreprocessingContext.AllStems
Enclosing class:
PreprocessingContext

public class PreprocessingContext.AllStems extends Object
Information about all unique stems found in the input documents. Each entry in each array corresponds to one base form to which the Stemmer used during processing can reduce different words. For example, the English words mining and mine are aggregated into one entry in these arrays, whereas they have separate entries in PreprocessingContext.AllWords.

All arrays in this class have the same length, and the values stored at a given index in different arrays describe the same stem.

  • Field Details

    • image

      public char[][] image
      Stem image as produced by the Stemmer. The image may not correspond to any correct word.

      This array is produced by LanguageModelStemmer.

    • mostFrequentOriginalWordIndex

      public int[] mostFrequentOriginalWordIndex
      Index into the PreprocessingContext.AllWords arrays that points to the most frequent original form of the stem. Pointers to the less frequent variants are not available.

      This array is produced by LanguageModelStemmer.
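
      As a minimal sketch of how this pointer can be followed, the example below resolves a stem to its most frequent original word. It does not use the library: the local arrays are hypothetical stand-ins for the fields of this class and for a word image array in PreprocessingContext.AllWords, and all data is made up.

```java
// Illustrative sketch only: plain local arrays stand in for the AllStems
// and AllWords fields, and the sample data is made up.
class StemLabelSketch {

    /** Returns the most frequent original word form for the given stem index. */
    static String labelFor(int stemIndex, int[] mostFrequentOriginalWordIndex, char[][] wordImage) {
        // The stem-level array stores an index into the word-level arrays.
        int wordIndex = mostFrequentOriginalWordIndex[stemIndex];
        return new String(wordImage[wordIndex]);
    }

    public static void main(String[] args) {
        // Hypothetical word-level data (stand-in for a word image array).
        char[][] wordImage = {
            "mining".toCharArray(), "mine".toCharArray(), "data".toCharArray()
        };
        // Hypothetical stem-level data; index i refers to the same stem in every array.
        char[][] stemImage = { "mine".toCharArray(), "data".toCharArray() };
        int[] mostFrequentOriginalWordIndex = { 0, 2 };

        for (int i = 0; i < stemImage.length; i++) {
            String label = labelFor(i, mostFrequentOriginalWordIndex, wordImage);
            System.out.println(new String(stemImage[i]) + " -> " + label);
        }
    }
}
```

      Because the stem image may not be a correct word, a lookup of this kind is one way to obtain a readable form for display.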

    • tf

      public int[] tf
      Term frequency of the stem, i.e. the sum of all PreprocessingContext.AllWords.tf values for which the PreprocessingContext.AllWords.stemIndex points to this stem.

      This array is produced by LanguageModelStemmer.
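
      The relationship described above can be sketched as a simple aggregation. This is not the code used by LanguageModelStemmer: the wordTf and wordStemIndex arrays are hypothetical stand-ins for PreprocessingContext.AllWords.tf and PreprocessingContext.AllWords.stemIndex, and the numbers are made up.

```java
import java.util.Arrays;

// Illustrative sketch only: sums word-level term frequencies per stem.
class StemTfSketch {

    /**
     * Computes stem term frequencies from word term frequencies.
     * wordStemIndex[w] is the index of the stem that word w reduces to.
     */
    static int[] aggregateTf(int[] wordTf, int[] wordStemIndex, int stemCount) {
        int[] stemTf = new int[stemCount];
        for (int w = 0; w < wordTf.length; w++) {
            stemTf[wordStemIndex[w]] += wordTf[w];
        }
        return stemTf;
    }

    public static void main(String[] args) {
        // Hypothetical data: words 0 and 1 share stem 0, word 2 maps to stem 1.
        int[] wordTf = { 5, 3, 7 };
        int[] wordStemIndex = { 0, 0, 1 };

        System.out.println(Arrays.toString(aggregateTf(wordTf, wordStemIndex, 2)));
    }
}
```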

    • tfByDocument

      public int[][] tfByDocument
      Term frequency of the stem for each document. For the encoding of this array, see PreprocessingContext.AllWords.tfByDocument.

      This array is produced by LanguageModelStemmer. The order of documents in this array is not defined.

    • fieldIndices

      public byte[] fieldIndices
      A bit-packed index of all fields in which this stem appears at least once. Indexes (positions) of selected bits are pointers to the PreprocessingContext.AllFields arrays. Fast conversion between the bit-packed representation and a byte[] with index values is done by PreprocessingContext.toFieldIndexes(byte).

      This array is produced by LanguageModelStemmer.
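
      To illustrate the bit-packed representation, the sketch below converts between a packed byte and an array of field indexes. It is not the implementation of PreprocessingContext.toFieldIndexes(byte); the helper names are hypothetical, and it assumes that the least significant bit corresponds to field index 0.

```java
import java.util.Arrays;

// Illustrative sketch only. Assumption: bit i (counting from the least
// significant bit) is set when the stem appears in field i.
class FieldIndexSketch {

    /** Packs field indexes (0..7) into a single byte. */
    static byte pack(int... fieldIndexes) {
        byte packed = 0;
        for (int index : fieldIndexes) {
            packed |= (byte) (1 << index);
        }
        return packed;
    }

    /** Expands a packed byte into the indexes of its set bits, in ascending order. */
    static byte[] toIndexes(byte packed) {
        int bits = packed & 0xFF; // treat the byte as unsigned
        byte[] indexes = new byte[Integer.bitCount(bits)];
        int next = 0;
        for (byte i = 0; i < 8; i++) {
            if ((bits & (1 << i)) != 0) {
                indexes[next++] = i;
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Hypothetical stem that appears in fields 0 and 2.
        byte packed = pack(0, 2);
        System.out.println(Arrays.toString(toIndexes(packed)));
    }
}
```

      A single byte can represent at most eight fields in this scheme, which keeps the per-stem storage to one byte.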

  • Constructor Details

    • AllStems

      public AllStems()
  • Method Details