Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic used in natural language processing (NLP) to reflect how important a word (term) is to a document in a corpus. The score increases with the number of times a word appears in the document and is offset by how frequently the word occurs across the corpus.
### Term Frequency (TF)
Measures how frequently a term appears within a specific document. It assumes that words appearing more often in a document are more important to that document. There are multiple ways to calculate it (a small Python sketch follows the list):
* **Raw Count:** Simplest form. $TF(t, d) = f_{t,d}$, where $f_{t,d}$ is the raw count of term *t* in document *d*.
* **Boolean Frequency:** $TF(t, d) = 1$ if *t* is present in *d*, and $0$ otherwise.
* **Logarithmic Scaling:** $TF(t, d) = \log(1 + f_{t,d})$.
* **Augmented Frequency:** Normalizes the raw frequency by the frequency of the most frequent term in the document to prevent bias towards longer documents: $TF(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d}: t' \in d\}}$
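A minimal Python sketch of these variants (the `tf_variants` helper and the toy document are illustrative assumptions, not from any library):

```python
from math import log

def tf_variants(term, doc_tokens):
    # Raw count f_{t,d} of `term` in the tokenized document.
    raw = doc_tokens.count(term)
    # Boolean frequency: 1 if the term occurs at all, else 0.
    boolean = 1 if raw > 0 else 0
    # Logarithmic scaling: log(1 + f_{t,d}).
    log_scaled = log(1 + raw)
    # Augmented frequency: normalize by the document's most frequent term.
    max_count = max(doc_tokens.count(t) for t in set(doc_tokens))
    augmented = 0.5 + 0.5 * raw / max_count
    return {"raw": raw, "boolean": boolean, "log": log_scaled, "augmented": augmented}

print(tf_variants("cat", "the cat sat on the mat the cat".split()))
# {'raw': 2, 'boolean': 1, 'log': 1.0986..., 'augmented': 0.8333...}
```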
### Inverse Document Frequency (IDF)
Measures how much information a term provides across the entire corpus. It estimates the "informativeness" or "rarity" of a term.
The standard formula is:
$IDF(t, D) = \log\frac{N}{df(t)}$
Where:
* $N$: Total number of documents in the corpus $D$. $N = |D|$.
* $df(t)$: Document frequency of term *t*, i.e. the number of documents in the corpus that contain the term.
**Smoothing:** To avoid division by zero (if $df(t) = 0$) and to prevent the IDF score from becoming zero for terms appearing in all documents ($df(t) = N$), smoothing is typically applied. A common variant is:
$IDF(t, D) = \log\frac{N}{df(t) + 1} + 1$
Scikit-learn by default adds 1 to the numerator:
$IDF(t, D) = \log\frac{N + 1}{df(t) + 1} + 1$
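A small sketch of both variants (the `idf` helper and the toy corpus are illustrative assumptions):

```python
from math import log

def idf(term, corpus, smooth=True):
    # Total number of documents N and document frequency df(t).
    n = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    if smooth:
        # Scikit-learn-style smoothing: log((N + 1) / (df + 1)) + 1.
        return log((n + 1) / (df + 1)) + 1
    # Unsmoothed IDF; raises ZeroDivisionError when df == 0.
    return log(n / df)

corpus = [d.split() for d in ["the cat sat", "the dog ran", "the cat and the dog"]]
print(idf("the", corpus))  # 1.0      -> in all 3 documents, least informative
print(idf("cat", corpus))  # ~1.2877  -> in 2 of 3 documents
print(idf("sat", corpus))  # ~1.6931  -> in 1 document, most informative
```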
## TF-IDF Calculation
$\text{TF-IDF}(t, d, D) = TF(t, d) \cdot IDF(t, D)$
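Reusing the `idf` helper and toy `corpus` from the sketch above (and assuming raw-count TF), the per-term scores of a single document fall out directly:

```python
# Score every distinct term of the first document: TF(t, d) * IDF(t, D).
doc = corpus[0]  # ["the", "cat", "sat"]
scores = {t: doc.count(t) * idf(t, corpus) for t in set(doc)}
print(scores)  # {'the': 1.0, 'cat': 1.2877..., 'sat': 1.6931...}
```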
## Vectorization using TF-IDF
TF-IDF is used to transform a collection of text documents into a numerical feature matrix (a document-term matrix). The steps are as follows (a usage sketch with scikit-learn's `TfidfVectorizer` follows the list):
1. **Vocabulary Creation:** Identify all unique terms across the entire corpus.
2. **Matrix Construction:** Create a matrix where rows represent documents and columns represent terms from the vocabulary.
3. **Filling the Matrix:** The value in cell $(i, j)$ is the TF-IDF score of the `j`-th term in the `i`-th document.
4. **Output:** This results in a sparse matrix, where most entries are zero because a single document usually contains only a small subset of the overall vocabulary.
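In practice this whole pipeline is a single call to scikit-learn's `TfidfVectorizer` (linked below); a minimal usage sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog ran", "the cat and the dog"]
vectorizer = TfidfVectorizer()           # defaults: smooth_idf=True, norm="l2"
matrix = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(3))
```

Note that scikit-learn additionally L2-normalizes each row by default, so the resulting values differ from the raw $TF \cdot IDF$ products computed above.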
## Links
* [Wikipedia: tf–idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
* [Scikit-learn: TfidfVectorizer Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)