Building a Simple Transformer Model with OpenAI’s ChatGPT

Introduction:

For quite some time, I have wanted to read and implement the Transformer paper. After going through several well-illustrated posts such as https://peterbloem.nl/blog/transformers and https://e2eml.school/transformers.html, the concepts still felt abstract to me since I hadn’t run the code. Many people mentioned that a full implementation would be too tedious. However, with the help of ChatGPT, I was able to build a minimum viable product (MVP) that made these concepts more tangible. As I observed the training epoch outputs and the generated text, the previously abstract ideas began to take a more concrete form.

In this blog post, we will explore how to create a simple transformer model using PyTorch with the help of OpenAI’s ChatGPT. We will create a toy dataset, tokenize it, and train a transformer model to predict the next word in a sentence. The transformer model we will build consists of multi-head attention, position-wise feed-forward networks, and positional encoding.

Preparing the dataset:

First, let’s create a small dataset with three sentences to train our transformer model.

sentences = [
    "The quick brown fox jumped over the lazy dog.",
    "Advancements in AI have transformed the way we interact with technology.",
    "Yesterday, the stock market experienced a significant decline due to geopolitical tensions."
]

Tokenizing the dataset:

Now, tokenize the dataset by splitting each sentence into words.

tokenized_sentences = [sentence.split() for sentence in sentences]
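
One side effect of whitespace splitting is that punctuation stays attached to the neighbouring word, so "dog." and "dog" would count as different tokens. That is fine for a toy dataset, but it helps to see it. The sketch below shows the effect on the first sentence, along with an optional regex-based alternative. The helper name `simple_tokenize` is hypothetical and is not part of the original code.

```python
import re

sentence = "The quick brown fox jumped over the lazy dog."

# Whitespace splitting, as used in this post: punctuation stays glued to words.
whitespace_tokens = sentence.split()

def simple_tokenize(text):
    """Hypothetical alternative: split words and punctuation into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens[-1])  # dog.
print(regex_tokens[-2:])      # ['dog', '.']
```

The rest of the post keeps the plain `split()` version, so the vocabulary contains entries such as "dog." and "Yesterday,".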

Creating the vocabulary:

Create a vocabulary from the tokenized sentences, including special tokens for padding, start of sentence, and end of sentence.

word2idx = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
for sentence in tokenized_sentences:
    for token in sentence:
        if token not in word2idx:
            word2idx[token] = len(word2idx)
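
To read model predictions later, it is handy to have the inverse mapping from indices back to words. The following is a minimal sketch on a stand-in vocabulary built the same way as above. The names `idx2word` and `decode` are assumptions introduced for illustration. Note that the vocabulary is case-sensitive, so "The" and "the" receive different indices.

```python
# A small stand-in vocabulary built the same way as word2idx above.
tokens = ["The", "quick", "brown", "fox"]
word2idx = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
for token in tokens:
    if token not in word2idx:
        word2idx[token] = len(word2idx)

# Inverse mapping: index -> word.
idx2word = {idx: word for word, idx in word2idx.items()}

def decode(indices):
    """Hypothetical helper: turn indices back into text, skipping padding."""
    return " ".join(idx2word[i] for i in indices if i != word2idx["[PAD]"])

print(decode([1, 3, 4, 5, 6, 2, 0, 0]))  # [CLS] The quick brown fox [SEP]
```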

Encoding the dataset:

Encode the dataset by converting words into indices using the word2idx dictionary, and pad the sentences to the same length.

# Find max_seq_len, counting the [CLS] and [SEP] tokens added to every sentence
max_seq_len = max(len(sentence) for sentence in tokenized_sentences) + 2

# Encode and pad sentences
encoded_sentences = []
for sentence in tokenized_sentences:
    encoded_sentence = [word2idx["[CLS]"]] + [word2idx[word] for word in sentence] + [word2idx["[SEP]"]]
    encoded_sentence += [word2idx["[PAD]"]] * (max_seq_len - len(encoded_sentence))
    encoded_sentences.append(encoded_sentence)
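
Because [CLS] and [SEP] add two positions to every sentence, the padded length has to include them, otherwise the longest sentences end up longer than the rest and the batch cannot be turned into a rectangular tensor. The sketch below illustrates this on made-up ids. The helper `encode_and_pad` is hypothetical and only meant to make the length bookkeeping explicit.

```python
PAD, CLS, SEP = 0, 1, 2  # same special-token indices as word2idx above

def encode_and_pad(token_ids, target_len):
    """Hypothetical helper: wrap ids in [CLS] ... [SEP] and right-pad to target_len."""
    encoded = [CLS] + list(token_ids) + [SEP]
    if len(encoded) > target_len:
        raise ValueError("target_len is too small for this sentence")
    return encoded + [PAD] * (target_len - len(encoded))

# Toy sentences already converted to ids (values are made up for illustration).
toy = [[3, 4, 5], [6, 7], [8]]
target_len = max(len(s) for s in toy) + 2  # +2 for [CLS] and [SEP]

batch = [encode_and_pad(s, target_len) for s in toy]
for row in batch:
    print(row)
# [1, 3, 4, 5, 2]
# [1, 6, 7, 2, 0]
# [1, 8, 2, 0, 0]
```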

Defining the transformer model:

Define the transformer model using PyTorch, including components such as positional encoding, multi-head attention, feed-forward network, and transformer blocks.

import torch
import torch.nn as nn
import torch.optim as optim

# Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model):
        super(PositionalEncoding, self).__init__()
        self.d_model = d_model

    def forward(self, x):
        seq_len = x.size(1)
        pe = torch.zeros(seq_len, self.d_model)
        
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, self.d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / self.d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0).to(x.device)
        x = x + pe
        return x

# Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super(MultiHeadAttention, self).__init__()
        self.nhead = nhead
        self.head_dim = d_model // nhead
        self.qkv_linear = nn.Linear(d_model, d_model * 3)
        self.fc = nn.Linear(d_model, d_model)
        self.scale = self.head_dim ** -0.5

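    def forward(self, x):
        batch_size, seq_len, _ = x.size()
        # Project to queries, keys and values in one pass, then split per head
        qkv = self.qkv_linear(x).view(batch_size, seq_len, self.nhead, -1).transpose(1, 2)
        q, k, v = qkv.chunk(3, dim=-1)
        attn_scores = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        # Causal mask: each position attends only to itself and earlier positions,
        # so the model cannot peek at the word it is asked to predict
        causal_mask = torch.ones(seq_len, seq_len, device=x.device).triu(diagonal=1).bool()
        attn_scores = attn_scores.masked_fill(causal_mask, float("-inf"))
        attn_weights = torch.softmax(attn_scores, dim=-1)
        attn_output = torch.matmul(attn_weights, v)
        # Merge the heads back into a single d_model-sized vector per position
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
        attn_output = self.fc(attn_output)
        return attn_output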

# Feedforward Network
class FeedForwardNetwork(nn.Module):
    def __init__(self, d_model, dim_feedforward):
        super(FeedForwardNetwork, self).__init__()
        self.fc1 = nn.Linear(d_model, dim_feedforward)
        self.fc2 = nn.Linear(dim_feedforward, d_model)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

    
# Transformer Block
class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward):
        super(TransformerBlock, self).__init__()
        self.mha = MultiHeadAttention(d_model, nhead)
        self.ffn = FeedForwardNetwork(d_model, dim_feedforward)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        attn_output = self.mha(x)
        x = self.norm1(x + self.dropout(attn_output))
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x

# Transformer Model
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, dim_feedforward, max_seq_len):
        super(TransformerModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # max_seq_len is unused here: PositionalEncoding builds its table from the input length
        self.pos_encoding = PositionalEncoding(d_model)
        self.transformer_blocks = nn.ModuleList([TransformerBlock(d_model, nhead, dim_feedforward) for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoding(x)
        for block in self.transformer_blocks:
            x = block(x)
        x = self.fc(x)
        return x
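
To make two of these components more concrete, here is a small sketch in plain Python (no PyTorch) that mirrors the sinusoidal formula used in PositionalEncoding and shows what attention weights look like when a position is only allowed to see itself and earlier positions. Both function names are hypothetical, and the causal part is a general illustration of masked softmax rather than a line-by-line copy of the classes above.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Same formula as PositionalEncoding above, written with plain Python lists."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # exp(i * -ln(10000) / d_model) equals 10000 ** (-i / d_model)
            angle = pos * 10000.0 ** (-i / d_model)
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def causal_attention_weights(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

pe = sinusoidal_encoding(seq_len=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]

w = causal_attention_weights([[0.0, 0.0, 0.0]] * 3)
print(w[0])  # [1.0, 0.0, 0.0]
print(w[1])  # [0.5, 0.5, 0.0]
```

Position 0 always encodes to alternating 0 and 1 because sin(0) = 0 and cos(0) = 1. With equal scores, each row of the causal weights spreads attention evenly over the positions it is allowed to see.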

Training the model:

Instantiate the transformer model and set up the loss function and optimizer. Then prepare the input and target tensors and train the model for 200 epochs, printing the loss every 20 epochs.

vocab_size = len(word2idx)
d_model = 8
nhead = 2
num_layers = 1
dim_feedforward = 16

model = TransformerModel(vocab_size, d_model, nhead, num_layers, dim_feedforward, max_seq_len)
# Ignore padding positions so they do not contribute to the loss
criterion = nn.CrossEntropyLoss(ignore_index=word2idx["[PAD]"])
optimizer = optim.Adam(model.parameters(), lr=0.001)

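# Next-word prediction: the target is the input shifted one token to the left
data = torch.tensor(encoded_sentences, dtype=torch.long)
input_data = data[:, :-1]
target_data = data[:, 1:].contiguous()  # contiguous so that .view(-1) works below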

num_epochs = 200
for epoch in range(num_epochs):
    optimizer.zero_grad()
    output = model(input_data)
    loss = criterion(output.view(-1, vocab_size), target_data.view(-1))
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 20 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")
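
For next-word prediction as described in the introduction, the target at each position is the token one step to the right of the input. The sketch below shows that pairing on a single made-up sequence using plain lists. The ids are invented for illustration, and pairs whose target is padding are dropped, which corresponds to what the `ignore_index` argument of `nn.CrossEntropyLoss` does in PyTorch.

```python
PAD, CLS, SEP = 0, 1, 2  # same special-token indices as word2idx above

# One encoded, padded sentence (ids 3, 4, 5 are made up for illustration).
encoded = [CLS, 3, 4, 5, SEP, PAD]

inputs = encoded[:-1]   # everything except the last position
targets = encoded[1:]   # the same sequence shifted one step to the left

# Keep only the pairs that would contribute to the loss.
pairs = [(x, y) for x, y in zip(inputs, targets) if y != PAD]
print(pairs)  # [(1, 3), (3, 4), (4, 5), (5, 2)]
```

Read each pair as "after seeing this token (and everything before it), predict that token". The last pair teaches the model to emit [SEP] at the end of the sentence.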

Conclusion:

In this blog post, we learned how to create a simple transformer model using PyTorch with the help of OpenAI’s ChatGPT. To see the prompts I used and the resulting code, please visit my GitHub repository: https://github.com/chenmiaomiao/the-art-of-lazying/tree/main/examples/lazy-learning/BuildChachaGPTWithChatGPT.
