Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic used in NLP to show how important a word (term) is to a document in a corpus. It increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus.
Term Frequency (TF)
Measures how frequently a term appears within a specific document. It assumes that words appearing more often in a document are more important to that document. There are multiple ways to calculate it:
- Raw Count: Simplest form: $\mathrm{tf}(t, d) = f_{t,d}$, where $f_{t,d}$ is the raw count of term $t$ in document $d$.
- Boolean Frequency: $\mathrm{tf}(t, d) = 1$ if $t$ is present in $d$, and $0$ otherwise.
- Logarithmic Scaling: $\mathrm{tf}(t, d) = \log(1 + f_{t,d})$.
- Augmented Frequency: Normalizes the raw frequency by the frequency of the most frequent term in the document to prevent bias towards longer documents: $\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \dfrac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$
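The four TF variants above can be sketched in a few lines of plain Python. The helper name `tf_variants` and the toy sentence are illustrative, not from any library:

```python
from math import log

def tf_variants(term, doc_tokens):
    """Compute the four TF variants for `term` in a tokenized document."""
    counts = {}
    for tok in doc_tokens:
        counts[tok] = counts.get(tok, 0) + 1
    f = counts.get(term, 0)           # raw count f(t, d)
    raw = f
    boolean = 1 if f > 0 else 0
    log_scaled = log(1 + f)
    max_f = max(counts.values())      # frequency of the most frequent term in d
    augmented = 0.5 + 0.5 * f / max_f
    return raw, boolean, log_scaled, augmented

doc = "the cat sat on the mat".split()
print(tf_variants("the", doc))  # raw=2, boolean=1, log(1+2)≈1.10, augmented=1.0
```

Note how the augmented variant maps the most frequent term to 1.0 regardless of document length.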
Inverse Document Frequency (IDF)
Measures how much information a term provides across the entire corpus. It estimates the “informativeness” or “rarity” of a term.
The standard formula is:
$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$
Where:
- $N$: Total number of documents in the corpus $D$, i.e. $N = |D|$.
- $\mathrm{df}(t)$: Document Frequency of term $t$; the number of documents in the corpus that contain the term.
Smoothing: To avoid division by zero (if $\mathrm{df}(t) = 0$) and to prevent the IDF score from becoming zero for terms appearing in all documents ($\mathrm{df}(t) = N$), smoothing is typically applied. A common variant is:
$$\mathrm{idf}(t) = \log \frac{N}{1 + \mathrm{df}(t)}$$
Scikit-learn by default (`smooth_idf=True`) adds 1 to both the numerator and the denominator, and then adds 1 to the result so the score never drops to zero:
$$\mathrm{idf}(t) = \ln \frac{1 + N}{1 + \mathrm{df}(t)} + 1$$
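The effect of the smoothed variants is easy to see numerically. A minimal sketch (function names are illustrative), comparing the plain formula, the common smoothed variant, and the scikit-learn default for a term present in all $N$ documents:

```python
from math import log

def idf_plain(N, df):
    # Standard IDF: log(N / df); undefined when df == 0
    return log(N / df)

def idf_smooth(N, df):
    # Smoothed variant: log(N / (1 + df)); safe when df == 0
    return log(N / (1 + df))

def idf_sklearn(N, df):
    # Scikit-learn default (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
    return log((1 + N) / (1 + df)) + 1

N = 4
print(idf_plain(N, 4))    # term in every document -> 0.0
print(idf_sklearn(N, 4))  # never reaches zero -> 1.0
print(idf_smooth(N, 0))   # df = 0 no longer crashes
```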
TF-IDF Calculation
The final score is the product of the two components:
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$
A high score indicates a term that appears frequently in a given document but rarely across the corpus.
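A small worked example of the TF × IDF product, using raw-count TF and the unsmoothed IDF (the three-document corpus is an illustrative assumption):

```python
from math import log

docs = [
    "the cat sat",
    "the dog barked",
    "the cat and the dog",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tfidf(term, doc_tokens):
    tf = doc_tokens.count(term)                   # raw-count TF
    df = sum(term in toks for toks in tokenized)  # document frequency
    return tf * log(N / df)                       # tf * idf

print(tfidf("cat", tokenized[0]))  # log(3/2) ≈ 0.405: in 2 of 3 docs
print(tfidf("the", tokenized[0]))  # 0.0: appears in every document
```

The word "the" scores exactly zero despite being the most frequent token, which is precisely the stop-word-dampening behavior TF-IDF is designed for.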
Vectorization using TF-IDF
TF-IDF is used to transform a collection of text documents into a numerical feature matrix (document-term matrix). The steps are as follows:
- Vocabulary Creation: Identify all unique terms across the entire corpus.
- Matrix Construction: Create a matrix where rows represent documents and columns represent terms from the vocabulary.
- Filling the Matrix: The value in cell $(i, j)$ is the TF-IDF score of the $j$-th term in the $i$-th document.
- Output: This results in a sparse matrix, where most entries are zero because a single document usually contains only a small subset of the overall vocabulary.
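The vectorization steps above can be sketched from scratch in pure Python (in practice one would use scikit-learn's `TfidfVectorizer`, which additionally smooths and L2-normalizes; this sketch uses raw-count TF and plain IDF on an assumed toy corpus):

```python
from math import log

corpus = [
    "the cat sat",
    "the dog barked",
    "the cat and the dog",
]
docs = [d.split() for d in corpus]
N = len(docs)

# 1. Vocabulary creation: all unique terms across the corpus
vocab = sorted({t for d in docs for t in d})

# 2. Document frequency of each term
df = {t: sum(t in d for d in docs) for t in vocab}

# 3. Matrix construction: rows = documents, columns = vocabulary terms,
#    cell (i, j) = tf(term_j, doc_i) * idf(term_j)
matrix = [[d.count(t) * log(N / df[t]) for t in vocab] for d in docs]

for row in matrix:
    print([round(v, 3) for v in row])
```

Printing the rows shows that most cells are zero, illustrating why real implementations store the result as a sparse matrix.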