nanoGPT trained on Harry Potter Dataset¶

Problem Statement¶

  1. Training a character-level GPT from scratch on the books of Harry Potter.
  2. Fine-tuning GPT-2 (via Karpathy's nanoGPT implementation) on the books of Harry Potter.

This project is a slight modification of a fork of Karpathy's nanoGPT - https://github.com/karpathy/nanoGPT

The Dataset¶

The dataset was downloaded from - https://github.com/priya-dwivedi/Deep-Learning/tree/master/GPT2-HarryPotter-Training/books

All files were then merged and fed into the model.

In [ ]:
# Reading the file
with open('harry_potter_series.txt', 'r', encoding='utf-8') as f:
    text = f.read()
In [ ]:
print("length of dataset in characters: ", len(text))
length of dataset in characters:  6765168
In [ ]:
# let's look at the first 1000 characters
print(text[:1000])
/ 




THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, 
were proud to say that they were perfectly normal, 
thank you very much. They were the last people you’d 
expect to be involved in anything strange or 
mysterious, because they just didn’t hold with such 
nonsense. 

Mr. Dursley was the director of a firm called 
Grunnings, which made drills. He was a big, beefy 
man with hardly any neck, although he did have a 
very large mustache. Mrs. Dursley was thin and 
blonde and had nearly twice the usual amount of 
neck, which came in very useful as she spent so 
much of her time craning over garden fences, spying 
on the neighbors. The Dursley s had a small son 
called Dudley and in their opinion there was no finer 
boy anywhere. 

The Dursleys had everything they wanted, but they 
also had a secret, and their greatest fear was that 
somebody would discover it. They didn’t think they 
could bear it if anyone found out about the Potters. 
Mrs. Potter was Mrs. Dursl
In [ ]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)
 !"%&'()*,-./0123456789:;>?ABCDEFGHIJKLMNOPQRSTUVWXYZ\]abcdefghijklmnopqrstuvwxyz|~—‘’“”•■□
92

Methodology¶

Train a character-level GPT model from scratch¶

To train the character-level model from scratch we followed these steps:

  • import the new text containing J.K. Rowling's books,
  • tweak and run the prepare.py script provided by Karpathy in the 'data/shakespeare_char' folder.
    This script maps each character to an integer (and builds the inverse map used later for decoding), then generates train and validation sets and saves them as '.bin' files; see the sketch after this list.
  • run the tweaked script in order to generate the train and validation sets,
  • tweak a copy of 'train_shakespeare_char.py' (renamed 'train_harrypotter_char.py').
    This is a configuration file, loaded by the 'configurator.py' script when train.py is launched like so: 'python train.py .\config\train_harrypotter_char.py'. In particular we reduced the number of iterations from 5000 to 2500, increased the learning rate to 1e-2 (0.01), increased the minimum learning rate to 1e-3, and reduced the batch size from the original 64 to 32. This last change was necessary due to hardware constraints, as our GPU quickly ran out of memory with a batch size of 64.
  • run the 'train.py' script utilizing our configuration file as mentioned above.
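
In essence, the tweaked prepare.py boils down to the following minimal sketch. The structure follows Karpathy's shakespeare_char script; the 90/10 split and the output file names are its defaults, not necessarily our exact tweaks:

import pickle
import numpy as np

# read the merged Harry Potter text (same file used earlier in this notebook)
with open('harry_potter_series.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# build the character-to-integer map and its inverse
chars = sorted(list(set(data)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]  # string -> list of integer ids

# 90/10 train/validation split, stored as uint16 token streams
n = len(data)
train_ids = np.array(encode(data[:int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(data[int(n * 0.9):]), dtype=np.uint16)
train_ids.tofile('train.bin')
val_ids.tofile('val.bin')

# save the vocabulary and both maps so sampling can decode the output later
with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': len(chars), 'itos': itos, 'stoi': stoi}, f)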
In [ ]:
import torch
import os
import sys
sys.path.append('..')
from model import GPTConfig, GPT

# load the best checkpoint saved during training of the character-level model
ckpt_path = os.path.join('../out-harrypotter-char', 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location='cuda')
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
# strip the '_orig_mod.' prefix that torch.compile adds to parameter names
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
number of parameters: 10.66M
Out[ ]:
<All keys matched successfully>

As we can see, this 'small' network is already quite complex, with almost 10.7M parameters. It should come as no surprise that our GPU ran at full capacity for over 4 hours to complete this training.

A. GPU usage graphs

GPU usage

Every 250 iterations the model is evaluated, and if the validation loss has improved the new checkpoint is stored in 'out-harrypotter-char'. To sample from our model we can run 'python sample.py --out_dir=out-harrypotter-char', which uses the saved checkpoint to generate text starting from a '\n' character.
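
The same can be done directly from this notebook with the checkpoint loaded above. The sketch below assumes the character-level dataset folder is 'data/harrypotter_char' (adjust the meta.pkl path to wherever your prepare.py wrote it) and uses the sampling defaults of sample.py (temperature 0.8, top_k 200):

import pickle
import torch

# decoding metadata produced by prepare.py (path assumed, adjust as needed)
with open('../data/harrypotter_char/meta.pkl', 'rb') as f:
    meta = pickle.load(f)
stoi, itos = meta['stoi'], meta['itos']

device = 'cuda'
model.eval()
model.to(device)

# start generation from a single newline character, as sample.py does
x = torch.tensor([stoi['\n']], dtype=torch.long, device=device)[None, ...]

# GPT.generate samples max_new_tokens characters autoregressively
with torch.no_grad():
    y = model.generate(x, max_new_tokens=500, temperature=0.8, top_k=200)
print(''.join(itos[i] for i in y[0].tolist()))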

Finetune a pretrained GPT2 model on Harry Potter books¶

Fine-tuning a GPT-2 model is a very different endeavour. We initially planned to finetune all the GPT-2 variants, from the smallest up to the XL model, but as we can see the size of even the smallest one is already substantial.

In [ ]:
# load the checkpoint of the finetuned GPT-2 model (same procedure as above)
ckpt_path = os.path.join('../out-harrypotter', 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location='cuda')
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
number of parameters: 123.65M
Out[ ]:
<All keys matched successfully>

Already at the smallest size there is an order of magnitude difference with the previous model: over 120 million parameters.
Furthermore, these models use by default a block size of 1024, which is the length of the attention context window. Unfortunately our available computational power and memory do not allow these models to train or finetune at that size, so we tweaked the settings to reduce the block size to 512; the model automatically adjusts to this reduction through the 'crop_block_size' method defined on the GPT class in model.py.
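
Inside train.py this cropping is applied automatically whenever the block_size in the configuration is smaller than the pretrained model's. A minimal manual sketch of the same mechanism (loading the pretrained weights requires the transformers package):

from model import GPT

# load OpenAI's pretrained GPT-2 weights and crop the context window to 512
gpt2 = GPT.from_pretrained('gpt2')
gpt2.crop_block_size(512)
print(gpt2.config.block_size)  # 512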
Fine-tuning a pretrained model adds one more step to the previously outlined process, namely loading the pretrained weights and biases; the process is then as follows:

  • import the new text containing J.K. Rowling's books to the 'data/harrypotter' folder,
  • tweak and run the prepare.py script provided by Karpathy in the 'data/shakespeare' folder.
    This script encodes the text into GPT-2 sub-word (BPE) token ids and generates train and validation sets, saving them as '.bin' files.
  • run the tweaked script in order to generate the train and validation sets,
  • tweak a copy of 'finetune_shakespeare.py' (renamed 'finetune_harrypotter.py').
    This is a configuration file, loaded by the 'configurator.py' script when train.py is launched like so: 'python train.py .\config\finetune_harrypotter.py'. In particular we reduced the block size from 1024 to 512 and changed the model from gpt2-xl to gpt2; both changes were necessary due to hardware constraints. We also increased the number of iterations to 100. A sketch of this configuration file follows the list.
  • run the 'train.py' script utilizing our configuration file as mentioned above.
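
For reference, a plausible sketch of 'config/finetune_harrypotter.py'. Only the values discussed in this notebook (out_dir, dataset, init_from, block_size, max_iters and the 5-iteration evaluation interval) are as described; the remaining settings are illustrative, carried over from Karpathy's finetune_shakespeare.py rather than taken from our actual file:

import time

out_dir = 'out-harrypotter'
eval_interval = 5                  # evaluate (and possibly checkpoint) every 5 iterations
eval_iters = 40
wandb_log = True                   # runs were tracked on wandb.ai
wandb_project = 'harrypotter'      # illustrative project/run names
wandb_run_name = 'ft-' + str(time.time())

dataset = 'harrypotter'
init_from = 'gpt2'                 # smallest GPT-2 instead of 'gpt2-xl'
block_size = 512                   # cropped from 1024 to fit in GPU memory
max_iters = 100                    # increased number of finetuning iterations

always_save_checkpoint = False     # only save when the validation loss improves

# batch settings and constant learning rate carried over from finetune_shakespeare.py
batch_size = 1
gradient_accumulation_steps = 32
learning_rate = 3e-5
decay_lr = False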

B. GPU usage graphs

GPU usage

As we can see, the GPU memory was fully utilized from start to finish even with this smaller GPT-2 model, but slightly less time was spent accessing that memory than with the character-level model built previously.

Results¶

The following graphs were created on the https://wandb.ai/ platform.

We used this platform to keep track of our system metrics, predictions and losses.

Graphs for Character-level GPT:¶

C. Iteration vs Step

Evaluation was performed every 250 iterations, as is clearly visible in this graph.

Iteration vs Step

D. Train loss vs Step

Train loss vs Step

E. Val loss vs Step

Val loss vs Step

Graphs for pre-trained GPT-2:¶

F. Iteration vs Step

Evaluation was performed every 5 iterations, as is clearly visible in this graph.

Iteration vs Step

G. Train loss vs Step

Train loss vs Step

H. Val loss vs Step

Val loss vs Step

Conclusion¶

As we can see from the above results, the loss of the character-level model is lower than that of the pretrained GPT-2 model. Two main reasons come to mind to explain these results:

  • the character-level model was trained for many more iterations (2500 vs 100) due to its leaner nature,
  • the weights of the much larger, pretrained network need many more fine-tuning iterations before they are tweaked enough to generate a more "Harry Potter like" result.

One further observation is that after only 100 iterations the GPT-2 loss stopped improving and plateaued, which seems to indicate that it might not be able to reach better results. Furthermore, the character-level model started to overfit: as we can see in figures D and E, the validation loss stopped improving while the training loss, albeit slightly, kept improving.


A clearer influence of J.K. Rowling's work is also evident in the generated text, as we are now going to see.
  1. Finetuning GPT-2 gave results that were understandable from the start and became more relevant to Harry Potter as the iterations increased.

Following is the output comparison of GPT-2 after 5 iterations vs 100 iterations:

gpt2_5iter gpt2_100iter

  2. The model built from scratch (character-level GPT) was barely readable at the beginning, but its Harry Potter influence was immediately clear.

Following is the output comparison of the character-level GPT after 250 iterations vs 2000 iterations:

char_gpt_250iter char_gpt_2000iter

  3. Finetuning (GPT-2) improved its relevance to the Harry Potter books over the iterations, while the model built from scratch (character-level GPT) became more readable and started to vaguely make sense.

Following is the output comparison of the best of the two models, i.e. GPT-2 after 100 iterations vs the character-level GPT after 2000 iterations:

gpt2_100iter char_gpt_2000iter

References¶

https://github.com/karpathy/nanoGPT

https://github.com/priya-dwivedi/Deep-Learning/tree/master/GPT2-HarryPotter-Training/books