Natural Language Processing (NLP) is a multidisciplinary field that combines linguistics, computer science, and artificial intelligence to enable computers to understand, interpret, and generate human language.

In its early stages, NLP relied mostly on if-else rules and templates built from rule-based grammars, ontologies, and heuristics. Later, and up until the 2010s, most approaches applied statistical models to hand-crafted features (word frequencies, n-grams, linguistic annotations) extracted from text.
Some examples:

  • checking if a word (or its stem/lemma) is present in a dictionary
  • checking if a part of the text follows a pattern of pre-defined grammatical rules
  • applying Hidden Markov Models to part-of-speech tagging and Conditional Random Fields to NER
  • applying Naïve Bayes and SVM to word counts (including TF-IDF) and n-grams
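
As a rough illustration of the last item, here is a minimal sketch of a multinomial Naïve Bayes classifier over raw word counts, written with the standard library only. The toy documents, the labels, and the function names (`train_naive_bayes`, `predict`) are made up for this example; add-one (Laplace) smoothing and whitespace tokenization are simplifying assumptions, not a claim about how any particular system did it.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (text, label) pairs. Collects the counts the classifier needs."""
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()   # naive whitespace tokenization
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)          # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue                               # ignore unseen words
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, for illustration only.
toy_docs = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot awful acting", "neg"),
]
model = train_naive_bayes(toy_docs)
print(predict(model, "loved the great acting"))
print(predict(model, "awful boring movie"))
```

In practice the word counts would usually be replaced by TF-IDF weights or extended with n-grams, but the overall shape (count features, then a simple probabilistic or linear classifier) stays the same.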

In the 2010s, word embeddings (Word2Vec, GloVe, fastText) were introduced, and recurrent models became more widely used. RNNs and LSTMs had been proposed earlier, but they saw much wider adoption once combined with word embeddings. Later, the GRU variant became popular.

In 2017, the paper “Attention Is All You Need” introduced the Transformer architecture and started a new era. BERT- and GPT-style transformers became the core of most subsequent approaches.

Another pivotal paper appeared in 2018 - “Universal Language Model Fine-tuning for Text Classification” (ULMFiT), which introduced fine-tuning as we know it: general-domain language model pre-training, target-task language model fine-tuning, and target-task classifier fine-tuning.

Some earlier papers explored similar ideas. In “Large Language Models in Machine Translation” (2007), the authors trained n-gram language models (up to 5-grams) on up to 2T tokens, yielding up to 300B n-grams. “Semi-supervised Sequence Learning” (2015) suggested pretraining a sequence autoencoder and then fine-tuning it for classification.

The appearance of Large Language Models (LLMs) was another significant step. The criteria for what counts as an LLM are vague, so even BERT can be called one. But if by LLM we mean a large model that can perform a variety of tasks through prompting, with little or no fine-tuning, then GPT-2 or GPT-3 can be called the first LLMs.

The rest of this note will serve as an index for the other notes related to NLP.

Text processing

Raw texts need to be processed in order to be used in ML models. This processing includes:

  • Stemming and lemmatization - reducing words to a stem by stripping affixes and mapping words to their dictionary form (lemma), respectively
  • Stop word removal - getting rid of the words that don’t add much information
  • Tokenization - splitting text into elements (characters, word fragments, words) and encoding them
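
The steps above can be sketched in a few lines of plain Python. Everything here is deliberately simplified and hypothetical: the stop word list, the suffix list, and the lemma table are tiny hand-made stand-ins, and `crude_stem` is not a real stemming algorithm such as Porter's. The point is only to show how the steps chain together.

```python
import re

# Tiny illustrative resources; real pipelines use much larger, curated ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"mice": "mouse", "cats": "cat", "chasing": "chase"}

def tokenize(text):
    """Word-level tokenization: lowercase, keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def lemmatize(token):
    """Dictionary lookup; falls back to the token itself."""
    return LEMMAS.get(token, token)

def encode(tokens, vocab):
    """Map tokens to integer ids, growing the vocabulary as needed."""
    return [vocab.setdefault(t, len(vocab)) for t in tokens]

def preprocess(text, vocab):
    tokens = remove_stop_words(tokenize(text))
    stems = [crude_stem(t) for t in tokens]
    return stems, encode(stems, vocab)

vocab = {}
stems, ids = preprocess("The cats are chasing the mice", vocab)
print(stems, ids)
print([lemmatize(t) for t in ["cats", "chasing", "mice"]])
```

Note the difference the sketch makes visible: the stemmer turns "chasing" into the non-word "chas" and leaves "mice" untouched, while the lemma lookup returns "chase" and "mouse". Modern transformer pipelines usually skip stemming and stop word removal and rely on subword tokenization instead.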

Architectures

Training and Fine-tuning Techniques

  • Pre-training objectives
  • Fine-tuning strategies
  • Instruction Learning
  • Reinforcement Learning from Human Feedback (RLHF)
  • Direct Preference Optimization (DPO)
  • Tokenization
  • Parameter Efficient Fine-tuning (LoRA, QLoRA, Prefix Tuning)
  • Prompting

Prompt Tuning: Tunes a set of trainable embedding vectors that are concatenated with the input embeddings (generally called “soft prompts”). Initially applied to T5-LM models.

Prefix Tuning: Tunes trainable key/value vectors (soft prefixes) in the KV cache of every layer, and can be casually described as “prompt tuning, but in every layer”, although that is slightly inaccurate. In practice, it uses an auxiliary MLP to generate the soft prefixes, which helps stabilize training. Initially applied to GPT-2 and BART models.

P-Tuning: Uses LSTMs to generate soft prompts (not prefixes). Initially applied to GPT-2 and BERT/RoBERTa/MegatronLM models.

P-Tuning v2: Essentially Prefix Tuning applied to BERT-type models.
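
To make the difference between the two main families concrete, here is a back-of-the-envelope count of trainable parameters. This is a sketch under stated assumptions: a decoder-only model whose key and value vectors have the same width as the hidden size, no auxiliary MLP or LSTM counted, and a hypothetical configuration (`d_model=768`, `n_layers=12`, 20 virtual tokens) chosen only for illustration. The function names are made up.

```python
def prompt_tuning_params(prompt_len, d_model):
    # One trainable vector per virtual token, used only at the input embedding layer.
    return prompt_len * d_model

def prefix_tuning_params(prefix_len, d_model, n_layers):
    # One trainable key vector and one value vector per virtual token, in every layer.
    # The auxiliary MLP used during training is ignored here.
    return n_layers * 2 * prefix_len * d_model

# Hypothetical model configuration, for illustration only.
d_model, n_layers, n_virtual_tokens = 768, 12, 20

prompt = prompt_tuning_params(n_virtual_tokens, d_model)
prefix = prefix_tuning_params(n_virtual_tokens, d_model, n_layers)
print(prompt)           # 20 * 768
print(prefix)           # 12 * 2 * 20 * 768
print(prefix // prompt) # ratio: 2 * n_layers
```

Under these assumptions, prefix tuning trains 2 × n_layers times as many parameters as prompt tuning for the same number of virtual tokens; both remain a small fraction of the full model.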

NLP Tasks

  • Named Entity Recognition
  • Machine Translation
  • Text Classification
    • Topic Modeling - discovering abstract themes in a document collection without supervision
  • Text Generation
  • Chat-bots
  • Retrieval-Augmented Generation (RAG)
  • Information Retrieval and Extraction
    • Search
    • Question answering
    • Summarization

Metrics

  • BLEU
  • ROUGE
  • METEOR
  • Perplexity
  • Human Evaluation Methods
  • Benchmark Datasets
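
As a rough sketch of what some of these metrics compute, the snippet below implements perplexity from per-token probabilities, the clipped n-gram precision that BLEU is built on, and ROUGE-N recall. These are simplified building blocks, not full metric implementations: real BLEU combines precisions for several n-gram orders with a brevity penalty, and the function names and example sentences here are made up for illustration.

```python
import math
from collections import Counter

def perplexity(token_probs):
    """exp of the average negative log-probability the model gave each token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_count(candidate, reference, n):
    """Number of candidate n-grams matched in the reference, clipped by reference counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    return sum(min(count, ref[gram]) for gram, count in cand.items())

def clipped_precision(candidate, reference, n=1):
    """BLEU-style modified precision: matched n-grams / candidate n-grams."""
    total = len(ngrams(candidate, n))
    return overlap_count(candidate, reference, n) / total if total else 0.0

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: matched n-grams / reference n-grams."""
    total = len(ngrams(reference, n))
    return overlap_count(candidate, reference, n) / total if total else 0.0

reference = "the cat is on the mat".split()
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(clipped_precision("the the the the".split(), reference))
print(rouge_n_recall("the cat sat".split(), reference))
```

The clipping step is what stops a degenerate candidate such as "the the the the" from scoring a perfect unigram precision: only as many occurrences of "the" count as appear in the reference.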