Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic used in NLP to show how important a word (term) is to a document in a corpus. It increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus.
Term Frequency (TF)
Measures how frequently a term appears within a specific document. It assumes that words appearing more often in a document are more important to that document. There are multiple ways to calculate it:
- Raw Count: Simplest form: $\mathrm{tf}(t, d) = f_{t,d}$, where $f_{t,d}$ is the raw count of term $t$ in document $d$.
- Boolean Frequency: $\mathrm{tf}(t, d) = 1$ if $t$ is present in $d$, and $0$ otherwise.
- Logarithmic Scaling: $\mathrm{tf}(t, d) = \log(1 + f_{t,d})$.
- Augmented Frequency: Normalizes the raw frequency by the frequency of the most frequent term in the document to prevent bias towards longer documents: $\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \dfrac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$
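The four TF variants above can be sketched in a few lines of plain Python. The helper name `tf_variants` and the toy sentence are illustrative, not from any library:

```python
from math import log

def tf_variants(term, doc_tokens):
    """Compute the four TF variants for `term` in a tokenized document."""
    counts = {}
    for tok in doc_tokens:
        counts[tok] = counts.get(tok, 0) + 1
    f = counts.get(term, 0)           # raw count f(t, d)
    raw = f
    boolean = 1 if f > 0 else 0
    log_scaled = log(1 + f)
    max_f = max(counts.values())      # frequency of the most frequent term in d
    augmented = 0.5 + 0.5 * f / max_f
    return raw, boolean, log_scaled, augmented

doc = "the cat sat on the mat".split()
print(tf_variants("the", doc))  # raw=2, boolean=1, log(1+2)≈1.10, augmented=1.0
```

Note how the augmented variant maps the most frequent term to 1.0 regardless of document length.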
Inverse Document Frequency (IDF)
Measures how much information a term provides across the entire corpus. It estimates the “informativeness” or “rarity” of a term.
The standard formula is:
$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$
Where:
- $N$: Total number of documents in the corpus $D$, i.e. $N = |D|$.
- $\mathrm{df}(t)$: Document Frequency of term $t$; the number of documents in the corpus that contain the term.
Smoothing: To avoid division by zero (if $\mathrm{df}(t) = 0$) and to prevent the IDF score from becoming zero for terms appearing in all documents ($\mathrm{df}(t) = N$), smoothing is typically applied. A common variant is:
$$\mathrm{idf}(t) = \log \frac{N}{1 + \mathrm{df}(t)}$$
Scikit-learn by default (`smooth_idf=True`) adds 1 to both the numerator and the denominator, and then adds 1 to the result so the score never drops to zero:
$$\mathrm{idf}(t) = \ln \frac{1 + N}{1 + \mathrm{df}(t)} + 1$$
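The effect of the smoothed variants is easy to see numerically. A minimal sketch (function names are illustrative), comparing the plain formula, the common smoothed variant, and the scikit-learn default for a term present in all $N$ documents:

```python
from math import log

def idf_plain(N, df):
    # Standard IDF: log(N / df); undefined when df == 0
    return log(N / df)

def idf_smooth(N, df):
    # Smoothed variant: log(N / (1 + df)); safe when df == 0
    return log(N / (1 + df))

def idf_sklearn(N, df):
    # Scikit-learn default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    return log((1 + N) / (1 + df)) + 1

N = 4
print(idf_plain(N, 4))    # term in every document -> 0.0
print(idf_sklearn(N, 4))  # never reaches zero -> 1.0
print(idf_smooth(N, 0))   # df = 0 no longer crashes
```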
TF-IDF Calculation
The final score is the product of the two components:
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$
A high score indicates a term that appears frequently in a given document but rarely across the corpus.
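A small worked example of the TF × IDF product, using raw-count TF and the unsmoothed IDF (the three-document corpus is an illustrative assumption):

```python
from math import log

docs = [
    "the cat sat",
    "the dog barked",
    "the cat and the dog",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tfidf(term, doc_tokens):
    tf = doc_tokens.count(term)                   # raw-count TF
    df = sum(term in toks for toks in tokenized)  # document frequency
    return tf * log(N / df)                       # tf * idf

print(tfidf("cat", tokenized[0]))  # log(3/2) ≈ 0.405: in 2 of 3 docs
print(tfidf("the", tokenized[0]))  # 0.0: appears in every document
```

The word "the" scores exactly zero despite being the most frequent token, which is precisely the stop-word-dampening behavior TF-IDF is designed for.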
Vectorization using TF-IDF
TF-IDF is used to transform a collection of text documents into a numerical feature matrix (document-term matrix). The steps are as follows:
- Vocabulary Creation: Identify all unique terms across the entire corpus.
- Matrix Construction: Create a matrix where rows represent documents and columns represent terms from the vocabulary.
- Filling the Matrix: The value in cell $(i, j)$ is the TF-IDF score of the $j$-th term in the $i$-th document.
- Output: This results in a sparse matrix, where most entries are zero because a single document usually contains only a small subset of the overall vocabulary.
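The vectorization steps above can be sketched from scratch in pure Python (in practice one would use scikit-learn's `TfidfVectorizer`, which additionally smooths and L2-normalizes; this sketch uses raw-count TF and plain IDF on an assumed toy corpus):

```python
from math import log

corpus = [
    "the cat sat",
    "the dog barked",
    "the cat and the dog",
]
docs = [d.split() for d in corpus]
N = len(docs)

# 1. Vocabulary creation: all unique terms across the corpus
vocab = sorted({t for d in docs for t in d})

# 2. Document frequency of each term
df = {t: sum(t in d for d in docs) for t in vocab}

# 3. Matrix construction: rows = documents, columns = vocabulary terms,
#    cell (i, j) = tf(term_j, doc_i) * idf(term_j)
matrix = [[d.count(t) * log(N / df[t]) for t in vocab] for d in docs]

for row in matrix:
    print([round(v, 3) for v in row])
```

Printing the rows shows that most cells are zero, illustrating why real implementations store the result as a sparse matrix.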