Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic used in NLP to reflect how important a word (term) is to a document in a corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which down-weights words that are common everywhere.

Term Frequency (TF)

Measures how frequently a term appears within a specific document. It assumes that words appearing more often in a document are more important to that document. There are multiple ways to calculate it:

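  • Raw Count: Simplest form. $\mathrm{tf}(t, d) = f_{t,d}$, where $f_{t,d}$ is the raw count of term $t$ in document $d$.
  • Boolean Frequency: $\mathrm{tf}(t, d) = 1$ if $t$ is present in $d$, and $0$ otherwise.
  • Logarithmic Scaling: $\mathrm{tf}(t, d) = \log(1 + f_{t,d})$.
  • Augmented Frequency: Normalizes the raw frequency by the frequency of the most frequent term in the document to prevent bias towards longer documents: $\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$.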
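
The four weighting schemes listed above can be compared side by side on a toy document. The following is a minimal sketch, not a reference implementation: the function name `term_frequencies` and whitespace tokenization are assumptions made for illustration, and the logarithmic variant is assumed to be the natural log of one plus the raw count.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Return four TF variants for every term in one tokenized document.

    `term_frequencies` is a hypothetical helper, not a library function.
    """
    counts = Counter(tokens)          # raw counts f_{t,d}
    if not counts:                    # empty document: nothing to score
        return {}
    max_count = max(counts.values())  # count of the most frequent term
    return {
        term: {
            "raw": f,
            "boolean": 1,             # every term in `counts` is present
            "log": math.log(1 + f),   # assumed form: ln(1 + f)
            "augmented": 0.5 + 0.5 * f / max_count,
        }
        for term, f in counts.items()
    }


doc = "the cat sat on the mat".split()
tf = term_frequencies(doc)
print(tf["the"])
print(tf["cat"])
```

Here "the" occurs twice and is the most frequent term, so its augmented frequency is 1.0, while every term that occurs once gets 0.75. Terms absent from the document are simply not in the result, which corresponds to a frequency of 0 under all four schemes.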

Inverse Document Frequency (IDF)

Measures how much information a term provides across the entire corpus. It estimates the “informativeness” or “rarity” of a term.

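The standard formula is:

$$\mathrm{idf}(t, D) = \log \frac{N}{\mathrm{df}(t)}$$

Where:

  • $N$: Total number of documents in the corpus $D$, i.e. $N = |D|$.
  • $\mathrm{df}(t)$: Document Frequency of term $t$. The number of documents in the corpus that contain the term, $|\{d \in D : t \in d\}|$.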

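Smoothing: To avoid division by zero (if $\mathrm{df}(t) = 0$) and to prevent the IDF score from becoming zero for terms appearing in all documents ($\mathrm{df}(t) = N$), smoothing is typically applied. A common variant is:

$$\mathrm{idf}(t, D) = \log \frac{N}{1 + \mathrm{df}(t)} + 1$$

Scikit-learn by default (`smooth_idf=True`) also adds 1 to the numerator:

$$\mathrm{idf}(t) = \ln \frac{1 + N}{1 + \mathrm{df}(t)} + 1$$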
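
To see how the smoothing choices change the score, the sketch below computes three IDF variants on a tiny corpus. It assumes the unsmoothed form is log(N / df), the common smoothed variant is log(N / (1 + df)) + 1, and the scikit-learn default is ln((1 + N) / (1 + df)) + 1, all with natural logarithms. The function names are hypothetical, and the corpus is assumed to be a non-empty list of token lists.

```python
import math


def document_frequency(term, corpus):
    """Number of documents (token lists) that contain `term`."""
    return sum(1 for doc in corpus if term in doc)


def idf_standard(term, corpus):
    # log(N / df); undefined when the term occurs in no document.
    df = document_frequency(term, corpus)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / df)


def idf_smoothed(term, corpus):
    # Assumed common variant: log(N / (1 + df)) + 1.
    df = document_frequency(term, corpus)
    return math.log(len(corpus) / (1 + df)) + 1


def idf_sklearn_style(term, corpus):
    # Assumed scikit-learn default: ln((1 + N) / (1 + df)) + 1.
    df = document_frequency(term, corpus)
    return math.log((1 + len(corpus)) / (1 + df)) + 1


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

for term in ("the", "cat", "dog"):
    print(term, idf_standard(term, corpus),
          idf_smoothed(term, corpus), idf_sklearn_style(term, corpus))
```

The term "the" appears in every document, so its unsmoothed IDF is 0 and it would vanish from any TF-IDF product; both smoothed variants keep it at a small positive weight. An unseen term raises an error in the unsmoothed version but is handled by the other two.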

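TF-IDF Calculation

The TF-IDF score of a term in a document is the product of its term frequency and its inverse document frequency:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$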
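
As a small worked illustration of the score for a single term and document, the sketch below multiplies a raw-count TF by an unsmoothed, natural-log IDF. These two choices, the toy corpus, and the helper name `tf_idf` are assumptions made for the example; other TF or IDF variants plug in the same way.

```python
import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "birds sing at dawn".split(),
]


def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc` using raw-count TF and unsmoothed IDF."""
    tf = doc.count(term)                      # raw count f_{t,d}
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    if df == 0:
        return 0.0                            # term unseen in the corpus
    idf = math.log(len(corpus) / df)          # ln(N / df)
    return tf * idf


first = corpus[0]
print("the:", tf_idf("the", first, corpus))
print("mat:", tf_idf("mat", first, corpus))
```

In the first document "the" occurs twice and "mat" only once, yet "mat" scores higher because it appears in a single document of the corpus while "the" appears in two.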

Vectorization using TF-IDF

TF-IDF is used to transform a collection of text documents into a numerical feature matrix (document-term matrix). The steps are as follows:

  1. Vocabulary Creation: Identify all unique terms across the entire corpus.
  2. Matrix Construction: Create a matrix where rows represent documents and columns represent terms from the vocabulary.
  3. Filling the Matrix: The value in cell $(i, j)$ is the TF-IDF score of the j-th term in the i-th document.
  4. Output: This results in a sparse matrix, where most entries are zero because a single document usually contains only a small subset of the overall vocabulary.
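
The four steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than a production vectorizer: the function name `tfidf_matrix` is hypothetical, tokenization is lowercase whitespace splitting, TF is the raw count, IDF is the smoothed form ln((1 + N) / (1 + df)) + 1, and the sparse matrix is represented as one dictionary per row that stores only non-zero cells.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Build a sparse TF-IDF document-term matrix from raw strings."""
    tokenized = [doc.lower().split() for doc in documents]

    # Step 1: vocabulary of unique terms, sorted for stable column indices.
    vocabulary = sorted({term for doc in tokenized for term in doc})
    column = {term: j for j, term in enumerate(vocabulary)}

    # Document frequency: count each term once per document.
    df = Counter(term for doc in tokenized for term in set(doc))
    n_docs = len(tokenized)

    # Steps 2-3: one row per document, one column per vocabulary term.
    # Each row is a dict {column index: score} holding non-zero cells only.
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append({
            column[t]: f * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, f in counts.items()
        })
    return vocabulary, rows


docs = ["the cat sat", "the dog barked", "the cat purred"]
vocabulary, rows = tfidf_matrix(docs)

# Step 4: measure how sparse the result is.
non_zero = sum(len(row) for row in rows)
total = len(rows) * len(vocabulary)
print(vocabulary)
print(f"{non_zero} of {total} cells are non-zero")
```

Even in this three-document corpus only half of the cells are populated, and the fraction shrinks quickly as the vocabulary grows. Library implementations such as scikit-learn's `TfidfVectorizer` follow the same outline but tokenize differently and, by default, L2-normalize each row, so their values will not match this sketch exactly.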