For text

Note that only English is supported.

Optional preprocessing parameters

Removal of stop words

Stop words, such as “a”, “the”, and “is”, are commonly used words in a language. They can be removed because they are generally considered to carry little useful information.
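As a rough illustration of what stop word removal does, here is a minimal sketch in plain Python. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names used only for this example; the list actually applied by the tool is much longer and is not shown here.

```python
# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "the", "is"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

tokens = remove_stop_words("The cat is on a mat")
print(tokens)  # ['cat', 'on', 'mat']
```

Only the words that remain after this step are counted by the vectorizers described below.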

Required parameters

You must choose one of the following methods to extract the features that will be used by the model.

Vectorizers

Both methods are implemented with scikit-learn; see the scikit-learn documentation for more information.

Bag of words (or n-grams)

Each document is represented as a vector in which each column holds the occurrence frequency of a specific term in that document. Each of these frequencies is treated as a feature.
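The representation can be sketched in a few lines of plain Python. This is not the scikit-learn implementation (which applies its own tokenization rules); the two sample documents and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

# Two toy documents (assumed data, for illustration only).
docs = ["the cat sat", "the cat saw the dog"]

# One column per distinct term, sorted to get a stable column order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return the term counts of `doc`, one entry per vocabulary term."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

matrix = [bag_of_words(doc) for doc in docs]
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column is one term, so "the" appearing twice in the second document gives a value of 2 in the last column.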

Term-frequency inverse document-frequency

To avoid focusing on terms that occur in all documents, and are therefore not necessarily discriminative, the term frequency is weighted by the inverse document frequency of the term, computed as follows:

\[idf(t)=\log\frac{1 + n}{1 + df(t)} + 1\]

where \(n\) represents the number of documents and \(df(t)\) is the number of documents containing the term \(t\). The vector is then normalized with the Euclidean norm:

\[v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}\]
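The two formulas above can be checked by hand with a short plain-Python sketch. It assumes whitespace tokenization, raw counts as the term frequency, and the natural logarithm; the sample documents and the helper names `idf` and `tfidf` are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (assumed data, for illustration only).
docs = ["the cat sat", "the cat saw the dog"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for doc in tokenized for word in doc})
n = len(tokenized)  # number of documents

def idf(term):
    """idf(t) = log((1 + n) / (1 + df(t))) + 1, with the natural log."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf(doc):
    """Weight raw term counts by idf, then divide by the Euclidean norm."""
    counts = Counter(doc)
    v = [counts[term] * idf(term) for term in vocabulary]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vectors = [tfidf(doc) for doc in tokenized]
print(idf("the"))  # 1.0, because "the" occurs in every document
print(idf("dog") > idf("the"))  # True, rarer terms get a larger weight
```

A term present in every document gets the minimum weight of 1, and after normalization every document vector has unit length.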

Common parameters

1. Range of n-gram

Occurrences can also be counted for groups of consecutive words, not only for individual words (called 1-grams here). You can adjust the range of group sizes you want to explore.

For example, an n-gram range of (1,2) means the features will include individual words and bigrams, while (2,2) means only bigrams will be counted.
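To make the effect of the range concrete, here is a minimal plain-Python sketch that generates the n-grams for a given range. The `ngrams` helper is a hypothetical name for this example and is not part of scikit-learn.

```python
def ngrams(tokens, ngram_range):
    """Return all n-grams of `tokens` for every n in the inclusive range."""
    low, high = ngram_range
    result = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

tokens = "the cat sat".split()
words_and_bigrams = ngrams(tokens, (1, 2))
bigrams_only = ngrams(tokens, (2, 2))
print(words_and_bigrams)  # ['the', 'cat', 'sat', 'the cat', 'cat sat']
print(bigrams_only)       # ['the cat', 'cat sat']
```

Widening the range increases the number of features quickly, which is one reason to combine it with a limit on the number of features.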

2. Maximum number of features

Only the top max features are kept, ordered by term frequency across all documents.
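The selection can be sketched as follows in plain Python, assuming whitespace tokenization and a toy corpus chosen so that there are no ties at the cut-off. The variable names are illustrative only.

```python
from collections import Counter

# Toy corpus (assumed data, for illustration only).
docs = ["the cat sat", "the cat saw the dog", "the cat"]
max_features = 2

# Total number of occurrences of each term across all documents.
totals = Counter(word for doc in docs for word in doc.split())

# Keep the `max_features` most frequent terms as the vocabulary.
kept = sorted(term for term, _ in totals.most_common(max_features))
print(kept)  # ['cat', 'the']
```

Terms outside the kept vocabulary ("sat", "saw", and "dog" here) are simply ignored when the document vectors are built.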