This project is a slight modification of a fork of Karpathy's nanoGPT - https://github.com/karpathy/nanoGPT
The dataset was downloaded from - https://github.com/priya-dwivedi/Deep-Learning/tree/master/GPT2-HarryPotter-Training/books
All files were then merged and fed into the model.
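A minimal sketch of this merging step, assuming the individual book files sit in a 'books' folder as plain '.txt' files (folder name and file pattern are assumptions):

import glob

# concatenate the individual book files into one corpus file
paths = sorted(glob.glob('books/*.txt'))
with open('harry_potter_series.txt', 'w', encoding='utf-8') as out:
    for path in paths:
        with open(path, 'r', encoding='utf-8') as f:
            out.write(f.read() + '\n')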
# Reading the file
with open('harry_potter_series.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print("length of dataset in characters: ", len(text))
length of dataset in characters: 6765168
# let's look at the first 1000 characters
print(text[:1000])
/ THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursley s had a small son called Dudley and in their opinion there was no finer boy anywhere. The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn’t think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursl
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)
!"%&'()*,-./0123456789:;>?ABCDEFGHIJKLMNOPQRSTUVWXYZ\]abcdefghijklmnopqrstuvwxyz|~—‘’“”•■□
92
To use this project we had to follow these steps:
This script transforms each character into an integer and creates a map to later decode it; it also generates the train and validation sets, saving them as '.bin' files.
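A minimal sketch of what this preparation step looks like; the 90/10 train/validation split and the 'meta.pkl' file name follow nanoGPT's character-level example and are assumptions here:

import pickle
import numpy as np

with open('harry_potter_series.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# build the character-to-integer map and its inverse for decoding later
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# encode the whole corpus and split it into train and validation sets (90/10 assumed)
data = np.array([stoi[c] for c in text], dtype=np.uint16)
n = int(0.9 * len(data))
data[:n].tofile('train.bin')
data[n:].tofile('val.bin')

# save the vocabulary and the decode map for sampling later
with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': len(chars), 'stoi': stoi, 'itos': itos}, f)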
This is a configuration file, loaded by the 'configurator.py' script when train.py is launched like so: 'python train.py .\config\train_harrypotter_char.py'. In particular, we reduced the number of iterations from 5000 to 2500, increased the learning rate to 1e-2 (0.01), increased the minimum learning rate to 1e-3, and reduced the batch size from the original 64 to 32. This last change was necessary due to hardware constraints, as our GPU quickly ran out of memory with a batch size of 64.
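The relevant overrides might look roughly like this; the variable names are the ones nanoGPT's configurator picks up, and values other than those mentioned above are not shown:

# config/train_harrypotter_char.py -- overrides picked up by configurator.py
out_dir = 'out-harrypotter-char'
eval_interval = 250        # evaluate every 250 iterations
max_iters = 2500           # reduced from 5000
learning_rate = 1e-2       # increased learning rate
min_lr = 1e-3              # increased minimum learning rate
batch_size = 32            # reduced from 64 to fit in GPU memory

After training completes, the best checkpoint can be loaded and inspected: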
import torch
import os
import sys
sys.path.append('..')
from model import GPTConfig, GPT
ckpt_path = os.path.join('../out-harrypotter-char', 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location='cuda')
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
number of parameters: 10.66M
<All keys matched successfully>
As we can see, this 'small' network is already quite complex, with almost 10.7M parameters. It should come as no surprise that our GPU ran at full capacity for over 4 hours to complete this training.
A. GPU usage graphs
Every 250 iterations the model is evaluated, and if the validation loss has decreased the new model is stored in 'out-harrypotter-char'. To extract results from our models we can run 'python sample.py --out_dir=out-harrypotter-char', which uses the saved model to generate text starting from a '\n' character.
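Under the hood, sampling boils down to something like the following condensed sketch, reusing the model loaded above and assuming the decode map was saved to 'meta.pkl' during preparation; the generation length, temperature, and top_k values are assumptions:

import pickle
import torch

# rebuild the character-level encode/decode maps saved during preparation
with open('meta.pkl', 'rb') as f:
    meta = pickle.load(f)
stoi, itos = meta['stoi'], meta['itos']

model.eval()
model.to('cuda')

# start generation from a single newline character, as sample.py does by default
start_ids = [stoi['\n']]
x = torch.tensor(start_ids, dtype=torch.long, device='cuda')[None, ...]

# autoregressively sample 500 new characters and decode them back to text
with torch.no_grad():
    y = model.generate(x, max_new_tokens=500, temperature=0.8, top_k=200)
print(''.join(itos[i] for i in y[0].tolist()))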
Fine-tuning a GPT-2 model is a very different endeavour. We initially planned to finetune all the GPT-2 variants, from the smallest up to the XL model. As we can see, the size of even the smallest model is already considerable.
ckpt_path = os.path.join('../out-harrypotter', 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location='cuda')
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
number of parameters: 123.65M
<All keys matched successfully>
Already at the smallest size there is an order of magnitude difference from the previous model, with over 120 million parameters.
Furthermore, these models use by default a block size of 1024, which is the length of the context the attention operates over. Unfortunately, our available computational power and memory do not allow these models to train or finetune at that size. We tweaked the settings to reduce the block size to 512; the model automatically adjusts to this reduction through the 'crop_block_size' function defined in the GPT class in model.py.
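For example, when initializing from the pretrained weights the context window can be shrunk like so (a minimal sketch using nanoGPT's GPT.from_pretrained and crop_block_size; the dropout override is an assumption):

from model import GPT

# load the pretrained GPT-2 (124M) weights via nanoGPT's helper
model = GPT.from_pretrained('gpt2', dict(dropout=0.0))

# shrink the context window from 1024 to 512 tokens to fit in GPU memory;
# this trims the positional embeddings and the causal attention mask
model.crop_block_size(512)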
Fine-tuning a pretrained model adds one more step to the previously outlined process, namely loading the pretrained weights and biases. The process is then as follows:
This script transforms each sub-word token into an integer and creates a map to later decode it; it also generates the train and validation sets, saving them as '.bin' files.
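A minimal sketch of this token-level preparation, using the GPT-2 BPE tokenizer from tiktoken as nanoGPT does; the 90/10 split and output file names are assumptions:

import numpy as np
import tiktoken

with open('harry_potter_series.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# encode with the GPT-2 BPE tokenizer so the tokens line up with the pretrained embeddings
enc = tiktoken.get_encoding('gpt2')
ids = enc.encode_ordinary(text)

# split into train and validation sets and store them as uint16 binaries
data = np.array(ids, dtype=np.uint16)
n = int(0.9 * len(data))
data[:n].tofile('train.bin')
data[n:].tofile('val.bin')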
This is a configuration file, loaded by the 'configurator.py' script when train.py is launched like so: 'python train.py .\config\finetune_harrypotter.py'. In particular, we reduced the block size from 1024 to 512 and changed the model to gpt2 instead of gpt2-xl. Both changes were necessary due to hardware constraints. We also increased the number of iterations to 100.
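The finetuning overrides might look roughly like this; the variable names follow nanoGPT's finetuning example, and any setting not mentioned above is left at its default:

# config/finetune_harrypotter.py -- overrides picked up by configurator.py
out_dir = 'out-harrypotter'
init_from = 'gpt2'         # load the pretrained GPT-2 (124M) weights instead of gpt2-xl
block_size = 512           # reduced from 1024; train.py crops the model accordingly
max_iters = 100            # increased number of finetuning iterations
eval_interval = 5          # evaluate every 5 iterations (see figure F)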
B. GPU usage graphs
As we can see, the GPU memory was fully utilized from start to finish even with this smaller GPT-2 model, but slightly less time was spent accessing that memory compared to the character-level model previously built.
The following graphs were created on the https://wandb.ai/ platform. We used it to keep track of our system metrics, predictions, and losses.
C. Iteration vs Step
Evaluation was performed every 250 iterations, as is clearly visible from this graph.
D. Train loss vs Step
E. Val loss vs Step
F. Iteration vs Step
Evaluation was performed every 5 iterations, as is clearly visible from this graph.
G. Train loss vs Step
H. Val loss vs Step
As we can see from the above results, the loss for the character-level model is lower than that of the pretrained GPT-2 model. Two main reasons come to mind to explain these results:
One further observation is that after only 100 iterations the GPT-2 loss stopped improving and plateaued, which suggests that it might not be able to reach better results. Furthermore, the character-level model started to overfit: as we can see in figures D and E, the validation loss stopped improving while the training loss kept improving, if only slightly.
Following is the output comparison of GPT-2 after 5 iterations vs 100 iterations:

Following is the output comparison of the character-level GPT after 250 iterations vs 2000 iterations:

Following is the output comparison of the best of the two models, i.e. GPT-2 after 100 iterations vs the character-level GPT after 2000 iterations:
