Introduction to Large Language Models (LLMs)

Smsstemburu
Dec 26, 2024

Topics covered on this page:

Introduction to NLP and LLMs

  • Basics of NLP
  • Evolution of LLMs
  • Overview of LLMs like GPT, BERT and T5

Transformer Architecture

  • Key components of the transformer model
  • Embeddings: Word/Token representation
  • Self-attention mechanism (query, key and value)
  • Transformer architecture
  • Why transformers replaced RNNs and LSTMs

Basics of Natural Language Processing (NLP)

NLP is the foundational element: the domain of AI that enables machines to understand, interpret and respond to human language. At its core, NLP uses algorithms to process and analyze human language data, turning the chaotic wilderness of our words into structured, understandable information.

Early NLP systems were rule-based, relying on sets of hand-coded rules to interpret language. Modern NLP uses statistical and machine learning techniques, allowing machines to learn language patterns from vast datasets. This learning enables NLP systems to perform tasks like sentiment analysis, language translation and speech recognition.

Evolution of Large Language Models (LLMs)

LLMs are a step above traditional NLP systems. LLMs like GPT and BERT are built upon the foundation laid by NLP. They use deep learning, a subset of machine learning, to process and generate human language on a vast scale. Because they are trained on large datasets, they can generate text that is contextually relevant, handling tasks that range from writing articles to engaging in conversations.

Unlike traditional machine learning, deep learning does not require manual feature extraction: human intervention is not needed to identify and select the most relevant features for a deep learning model.

The general process of creating an LLM includes pretraining and fine-tuning. The “pre” in “pretraining” refers to the initial phase where a model like an LLM is trained on a large, diverse dataset to develop a broad understanding of language. This pretrained model then serves as a foundational resource that can be further refined through fine-tuning, where the model is trained on a narrower dataset that is specific to particular tasks or domains. The two most popular ways to fine-tune LLMs are instruction fine-tuning and classification fine-tuning.
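
To make the two stages concrete, here is a deliberately tiny sketch (my own illustration, not how production LLMs are built): a bigram counter stands in for "learning language statistics", so the pretrain and fine-tune stages are easy to see. All class and method names are hypothetical.

```python
# Toy illustration of the pretrain -> fine-tune workflow (hypothetical names).
# A real LLM trains a neural network; a bigram counter stands in here so the
# two stages are easy to see.
class TinyLM:
    def __init__(self):
        self.counts = {}                     # word -> {next_word: frequency}

    def _learn(self, corpus):
        for sentence in corpus:
            words = sentence.lower().split()
            for prev, nxt in zip(words, words[1:]):
                self.counts.setdefault(prev, {})
                self.counts[prev][nxt] = self.counts[prev].get(nxt, 0) + 1

    def pretrain(self, generic_corpus):
        # Stage 1: broad language patterns from a large, diverse dataset.
        self._learn(generic_corpus)

    def fine_tune(self, task_corpus):
        # Stage 2: same objective, but on a narrower, task-specific dataset.
        self._learn(task_corpus)

    def predict_next(self, word):
        options = self.counts.get(word.lower(), {})
        return max(options, key=options.get) if options else None


lm = TinyLM()
lm.pretrain(["the cat sat on the mat", "the dog sat on the rug"])
lm.fine_tune(["the model ran on the gpu"])   # domain-specific refinement
print(lm.predict_next("sat"))                # "on"
```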

Overview of LLMs like GPT, BERT and T5

GPT : Generative Pre-trained Transformer

Architecture — Decoder only. At the core of GPT’s functionality is the transformer architecture, which utilizes the self-attention mechanism to process sequences of data.

Key feature — GPT uses a unidirectional approach, where each word is predicted based on the previous words in a sequence. This is ideal for text generation and completion tasks; its strength lies in composing human-like text and providing nuanced responses to complex queries.

Training methodology — Language Modeling (predicting the next word in a sequence).

BERT : Bidirectional Encoder Representations from Transformers

Architecture — Encoder only

Key feature — Unlike unidirectional models that process words either left to right or right to left, BERT reads and analyzes text in both directions. This allows the model to grasp the context of a word based on the words surrounding it. Its capabilities in language understanding and question answering, from enhancing search engine results to powering conversational AI, are unmatched.

Training methodology — BERT’s training is unique due to its use of Masked Language Modeling (MLM) & Next Sentence Prediction (NSP).

T5 : Text-to-Text Transfer Transformer

Architecture — Encoder-Decoder

Key feature — T5 frames all NLP tasks as a text-to-text problem, meaning both the input and output are treated as text. Mainly used for translation, summarization and question answering.

Training methodology — Span-based Masked Language Modeling.
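
To see the three families side by side, here is a hedged sketch using the Hugging Face transformers library (my choice of tooling, not the article's). It assumes `transformers` plus a backend such as PyTorch are installed and that the public `gpt2`, `bert-base-uncased` and `t5-small` checkpoints can be downloaded.

```python
# Sketch: the three model families via Hugging Face pipelines.
# Assumes `pip install transformers torch` and internet access for the checkpoints.
from transformers import pipeline

# GPT (decoder-only): generate a continuation, predicting one token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models are", max_new_tokens=20)[0]["generated_text"])

# BERT (encoder-only): fill in a masked word using context from both directions.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Transformers process text using [MASK] mechanisms.")[0]["token_str"])

# T5 (encoder-decoder): every task is text in, text out (here, summarization).
summarizer = pipeline("summarization", model="t5-small")
article = ("The transformer architecture was introduced in 2017 and replaced "
           "recurrent models for most NLP tasks because it processes tokens in "
           "parallel with self-attention.")
print(summarizer(article, max_length=30)[0]["summary_text"])
```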

Transformer Architecture

The transformer architecture was introduced in the 2017 paper “Attention Is All You Need” and replaced sequence-based models like RNNs and LSTMs. The transformer is built on self-attention mechanisms, which allow it to model relationships between all the elements in a sequence in a parallelized manner, overcoming the bottlenecks of sequential processing.

The transformer has two major components -

  1. Encoder — The encoder processes the input text and encodes it into a series of numerical representations, or vectors, that capture the contextual information of the input. It typically has 6 layers, and each layer has two main components -
    - Multi-Head Self-Attention Mechanism
    - Feed-Forward Neural Network
  2. Decoder — The decoder takes the encoded vectors as input and generates the output text.

Each stage of this process relies heavily on the attention mechanism, which helps the model focus on relevant parts of the input during encoding and decoding.

Architecture in detail -

Encoder Architecture -

Tokenization -

Before the input tokens can be embedded, the raw text is split into tokens and each token is converted to a numerical representation (an ID in the vocabulary).
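
A minimal sketch of this step, assuming a toy whitespace tokenizer with a hand-built vocabulary (real LLMs use subword tokenizers such as BPE or WordPiece):

```python
# Toy tokenizer: split the text into tokens, then map each token to an integer ID.
text = "the cat sat on the mat"
tokens = text.split()                                    # ["the", "cat", "sat", ...]

vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

print(vocab)        # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(token_ids)    # [4, 0, 3, 2, 4, 1]
```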

Input Embeddings —

The input embedding layer is essentially a lookup table that maps each token to a dense vector of fixed dimension, which captures semantic meaning.
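
A minimal NumPy sketch of the lookup, with toy sizes and random numbers standing in for the learned embedding weights:

```python
import numpy as np

np.random.seed(0)
vocab_size, embed_dim = 6, 4                 # toy sizes
embedding_table = np.random.randn(vocab_size, embed_dim)   # learned during training

token_ids = [4, 0, 3, 2, 4, 1]               # e.g. output of the tokenizer step
embeddings = embedding_table[token_ids]      # lookup: one row (vector) per token
print(embeddings.shape)                      # (6, 4): sequence_length x embed_dim
```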

Positional Encoding -

Since transformers, unlike RNNs, have no built-in notion of sequence order, positional encodings are added to the input embeddings to inject information about the position of each token in the sequence. This helps the model understand the relative position of tokens.
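
A sketch of the sinusoidal positional encodings used in the original paper (the sizes here are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=4)
print(pe.round(2))   # added element-wise to the token embeddings
```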

Multi-Head Self-Attention Mechanism -

The self-attention mechanism allows the model to look at every word in the input sequence when processing any given word. The process is as follows (a numerical sketch follows this list) -

  • Each word or token in the input is transformed into queries, keys and values.
  • The attention scores are calculated as a scaled dot product of queries and keys. This score determines how much focus each word or token should place on every other word.
  • The attention score is used to weight the values.
  • In the multi-head part, this process is repeated multiple times with different learned projections and the results are combined. This allows the model to capture different types of relationships between words.
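
Here is the promised sketch: scaled dot-product attention in NumPy, with random matrices standing in for the learned query/key/value projections. Multi-head attention repeats this with several independent projections and concatenates the results.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every token with every other token
    weights = softmax(scores)           # each row sums to 1: where this token "looks"
    return weights @ V, weights         # output = attention-weighted sum of values

np.random.seed(0)
seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)   # token representations (embeddings + positions)

# W_q, W_k, W_v are learned in a real model; random matrices stand in here.
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # one row per token: its attention over all tokens
# Multi-head attention runs this h times with different projections and
# concatenates the h outputs before a final linear layer.
```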

Feed-Forward Neural Networks -

After self-attention, the output is passed through a position-wise feed-forward network that applies additional transformations to capture significant features and context of natural language.
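
A sketch of the position-wise feed-forward sublayer (two linear transformations with a ReLU in between, as in the original paper; sizes are illustrative):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Applied independently to every position: expand, ReLU, project back down.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

np.random.seed(0)
seq_len, d_model, d_ff = 4, 8, 32            # d_ff is usually about 4x d_model
x = np.random.randn(seq_len, d_model)        # output of the attention sublayer
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (4, 8): same shape as the input
```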

Add & Norm (Layer Normalization and Residual Connections) -

Each of the self-attention and feed-forward layers has a residual connection (skip connection) and is followed by layer normalization. This helps avoid training issues like vanishing gradients and ensures stable learning.
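
A minimal NumPy sketch of the add-and-norm step (the learned scale and shift parameters of layer normalization are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)   # learned scale/shift omitted for brevity

def add_and_norm(x, sublayer_output):
    # Residual (skip) connection followed by layer normalization.
    return layer_norm(x + sublayer_output)

np.random.seed(0)
x = np.random.randn(4, 8)                    # input to a sublayer
sublayer_out = np.random.randn(4, 8)         # e.g. attention or feed-forward output
print(add_and_norm(x, sublayer_out).shape)   # (4, 8)
```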

Decoder Architecture -

Input Embedding and Positional Encoding -

Just like the encoder, the decoder begins with token embeddings and positional encodings to understand the sequence of words.

Masked Self-Attention -

The decoder also uses self-attention, but with a masking mechanism. This ensures that the decoder only attends to previous tokens in the sequence and not future ones, which would leak information about the tokens it is trying to predict.
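
A sketch of the causal mask in NumPy: positions above the diagonal are set to negative infinity before the softmax, so their attention weights become zero.

```python
import numpy as np

np.random.seed(0)
seq_len = 4
scores = np.random.randn(seq_len, seq_len)   # raw attention scores (Q @ K.T / sqrt(d_k))

# Causal mask: position i may only attend to positions 0..i (no peeking at the future).
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf                     # exp(-inf) = 0 after the softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))                      # lower-triangular attention pattern
```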

Encoder-Decoder Attention -

The decoder has an additional attention layer, known as encoder-decoder attention (or cross-attention). This allows the decoder to attend to the encoder’s output while generating each token, helping the decoder focus on relevant parts of the input sequence during generation.
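
A NumPy sketch of encoder-decoder (cross) attention, with illustrative sizes and random matrices in place of learned projections: queries come from the decoder, keys and values from the encoder output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
d_model = 8
encoder_output = np.random.randn(5, d_model)   # 5 encoded source tokens
decoder_state = np.random.randn(3, d_model)    # 3 target tokens generated so far

# Queries from the decoder; keys and values from the encoder output.
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
Q = decoder_state @ W_q
K, V = encoder_output @ W_k, encoder_output @ W_v

weights = softmax(Q @ K.T / np.sqrt(d_model))  # (3, 5): each target token over the source
context = weights @ V                          # source information pulled into the decoder
print(weights.shape, context.shape)            # (3, 5) (3, 8)
```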

Final Layer and Softmax -

After the decoder layers, the output is passed through a linear layer followed by a Softmax to generate probabilities over the vocabulary, effectively choosing the next token in the sequence.
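
A minimal sketch of this final step, with toy sizes and a greedy (argmax) choice of the next token:

```python
import numpy as np

np.random.seed(0)
d_model, vocab_size = 8, 10
decoder_output = np.random.randn(d_model)     # representation of the last position
W_out = np.random.randn(d_model, vocab_size)  # final linear (projection) layer

logits = decoder_output @ W_out               # one raw score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax: scores -> probabilities

next_token_id = int(np.argmax(probs))         # greedy decoding picks the most likely token
print(next_token_id, round(float(probs[next_token_id]), 3))
```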

Why transformers replaced RNNs and LSTMs?

Before getting into that, what are RNN and LSTM models?

Recurrent Neural Network (RNN)

RNNs are a class of neural networks with internal memory (a short-term memory) that are used for processing sequential data. They can remember the inputs they have received, which helps them predict what comes next. They are the preferred algorithm for sequential data like time series, speech, text and financial data.

RNNs have been used in the software behind Siri and Google Translate.
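
A minimal NumPy sketch of a single RNN cell, showing why processing is inherently sequential: each step needs the hidden state produced by the previous one.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The hidden state is the network's short-term memory: it mixes the
    # current input with a summary of everything seen so far.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

np.random.seed(0)
input_dim, hidden_dim = 3, 5
W_x = np.random.randn(input_dim, hidden_dim)
W_h = np.random.randn(hidden_dim, hidden_dim)
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                    # empty memory at the start
sequence = np.random.randn(4, input_dim)    # 4 time steps
for x_t in sequence:                        # strictly sequential: step t needs step t-1
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h.round(2))
```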

The main challenges of recurrent neural networks are —

  • Complex training process, as they process data sequentially
  • Difficulty with long sequences, which forces RNNs to work harder to remember past information
  • Inefficiency, since processing data sequentially is slow
  • Vanishing gradients, which occur when gradient values become too small; the model stops learning or takes very long to train as a result

Long Short-Term Memory (LSTM)

Long short-term memory networks (LSTMs) are an extension of RNNs that basically extends their memory. LSTMs assign weights to data, which helps the network either let new information in, forget information, or give it enough importance to impact the output.

LSTMs enable RNNs to remember inputs over a long period of time; they can read, write and delete information from their memory. This memory can be seen as a gated cell that decides whether to store or delete information. The weights that decide this importance are also learned by the algorithm. An LSTM cell has input, forget and output gates: the input gate determines whether or not to let new input in, the forget gate discards information that is no longer required, and the output gate decides how much the stored information impacts the output at the current time step.
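
A minimal NumPy sketch of one LSTM step, showing the input, forget and output gates described above (sizes and random weights are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    z = x_t @ W + h_prev @ U + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates in [0, 1]
    g = np.tanh(g)                                # candidate new information
    c = f * c_prev + i * g                        # forget old memory, let new information in
    h = o * np.tanh(c)                            # how much of the memory reaches the output
    return h, c

np.random.seed(0)
input_dim, hidden_dim = 3, 5
W = np.random.randn(input_dim, 4 * hidden_dim)    # one block of weights per gate
U = np.random.randn(hidden_dim, 4 * hidden_dim)
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in np.random.randn(4, input_dim):         # process a 4-step sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.round(2))
```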

This gated design largely mitigates the vanishing gradient issue found in RNNs.

LSTMs still have challenges with -

  • Flexibility and efficiency in attending to different parts of the input sequence: because they rely on gates to control the information flow, they are not flexible enough
  • Scalability: they do not scale well to handle large amounts of data and parameters

Summary of why transformers replaced RNNs and LSTMs -

  • Parallelization -
    Transformers process all tokens simultaneously using the self-attention mechanism, allowing much faster training and better use of modern hardware like GPUs/TPUs
  • Handling long-range dependencies
  • Self-attention mechanism -
    RNNs and LSTMs rely on hidden states and gates to process information, which is limiting, whereas transformers use the self-attention mechanism to weigh the importance of each token in the sequence, providing more flexibility
  • Scalability -
    Transformers scale well, allowing the creation of large models like GPT-3 and BERT
  • Efficiency -
    Transformers handle computations in parallel and can process the entire sequence at once, leading to more efficient training.

Check out the other important topics on LLMs that you need to know here — prompt engineering and much more
