- builtins.object
  - babyGPT
class babyGPT(builtins.object)

    babyGPT(*args, **kwargs)
Methods defined here:
- __init__(self, *args, **kwargs)
- Initialize self. See help(type(self)) for accurate signature.
- run_code_with_buffered_context_for_training_TransformerFG(self, xformer, master_decoder, dataloader, checkpoint_frequency=4000, display_train_loss=False)
- Drawn from the training routines in the Transformers module of DLStudio (see the usage sketch following this list of methods)
- save_checkpoint_decoder(self, decoder, dir_name, iter_index)
- Save the decoder checkpoint
- save_checkpoint_embedding_generator(self, embedding_generator, dir_name, iter_index)
- Save a checkpoint for the embedding_generator
- save_decoder(self, decoder)
- Save the trained decoder to a disk file
- save_embedding_generator(self, embedding_generator)
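Here is a rough sketch, based only on the method signatures listed above, of how a training run might be launched. Constructing the babyGPT, TransformerFG, MasterDecoderWithMasking, and ArticleDatasetWithBufferedContext instances is not shown on this help page; see the script "create_base_model_with_buffered_context.py" in the Examples directory for the actual constructor calls.

    # A rough sketch only -- not the actual invocation used in the Examples scripts.
    def launch_training(baby, xformer, master_decoder, dataloader):
        # baby           : a babyGPT instance
        # xformer        : a TransformerFG instance
        # master_decoder : a MasterDecoderWithMasking instance
        # dataloader     : an ArticleDatasetWithBufferedContext instance
        baby.run_code_with_buffered_context_for_training_TransformerFG(
                xformer, master_decoder, dataloader,
                checkpoint_frequency=4000,        # default value shown in the signature above
                display_train_loss=True)
        baby.save_decoder(master_decoder)         # persist the trained decoder to a disk file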
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
Data and other attributes defined here:
- ArticleDatasetWithBufferedContext = <class 'babyGPT.babyGPT.ArticleDatasetWithBufferedContext'>
- This class supplies the 'foundational' dataloader for training. When using the PyTorch Lightning
module for multi-GPU training, this dataloader is routed through Lightning's LightningDataModule
class, as you will see later in this code file. Lightning requires its dataloaders to be Python
generators.
The parameter 'context_window_size' is the number of fresh tokens that the dataloader must supply in
each training iteration. And the parameter 'context_buffer_size' is the number of trailing tokens
in the previous batch that are prepended to the fresh tokens in the current batch. The number of
tokens that the transformer sees is the sum of these two sizes.
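Here is a minimal, self-contained sketch of the buffered-context idea described above. It is not the actual ArticleDatasetWithBufferedContext code; the default sizes and the all-zeros initial buffer are illustrative assumptions.

    # Sketch only: each yielded batch is the trailing 'context_buffer_size' tokens of the
    # previous batch prepended to 'context_window_size' fresh tokens.
    def buffered_context_batches(token_ids, context_window_size=50, context_buffer_size=10):
        context_buffer = [0] * context_buffer_size          # initial buffer (assumed all zeros)
        for i in range(0, len(token_ids) - context_window_size + 1, context_window_size):
            fresh = token_ids[i : i + context_window_size]  # fresh tokens for this iteration
            yield context_buffer + fresh                    # what the transformer actually sees
            context_buffer = fresh[-context_buffer_size:]   # trailing tokens become next buffer

    # Example: 200 dummy token ids produce batches of 60 tokens each (10 buffered + 50 fresh).
    for batch in buffered_context_batches(list(range(200))):
        assert len(batch) == 60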
- ArticleGatherer = <class 'babyGPT.babyGPT.ArticleGatherer'>
- This script is for collecting data for experimenting with the Transformer-based unsupervised learning
code in baby_gpt.py.
The articles are downloaded from the URLs that are specified by the argument 'urls' in the constructor
shown below. See the script "create_base_model_with_buffered_context.py" in the Examples directory
for how to set the URL strings for this argument. Here are some examples:
urls = ['https://finance.yahoo.com','http://cnn.com',
'https://timesofindia.indiatimes.com',
'https://purdueexponent.org','https://slate.com',
'https://sports.yahoo.com']
urls = ['http://cnn.com']
urls = ['https://slate.com']
urls = ['https://timesofindia.indiatimes.com']
- AttentionHead = <class 'babyGPT.babyGPT.AttentionHead'>
- Borrowed from the Transformers module of DLStudio
- BasicDecoderWithMasking = <class 'babyGPT.babyGPT.BasicDecoderWithMasking'>
- Borrowed from the Transformers module of DLStudio
- EmbeddingGenerator = <class 'babyGPT.babyGPT.EmbeddingGenerator'>
- MasterDecoderWithMasking = <class 'babyGPT.babyGPT.MasterDecoderWithMasking'>
- This class was borrowed initially from the Transformers module of the DLStudio platform. Subsequently, its
definition was significantly expanded to fulfill the constraints imposed by the PyTorch Lightning API.
For information regarding the operation of this class, please visit the website for DLStudio at Purdue.
- PromptResponder = <class 'babyGPT.babyGPT.PromptResponder'>
- Prompting a trained babyGPT model means that you supply a small number of words (as, say, the
beginning of a new thought) as a prompt and the model supplies the rest of the words to complete
the thought. The class comes with two methods, the first for extending your prompt until it
reaches a period, and the second for going beyond the first period encountered.
Any interaction with a trained GPT model has to deal with the following issue: What to do with
the context buffer that is meant to be a continuation of the last part of the previous "sentence"
fed into the transformer.
Ideally, we should be placing in the context buffer words that create a context for the prompt.
But there is no easy way to do that without a more elaborate model. An example of more elaborate
modeling would be to have the input to the transformer consist of, say, an SOS token, a special
context token consisting possibly of integer index values beyond the tokenizer vocab, followed
by a context buffer that would be the last part of the previous sentence, followed, finally,
by the new input tokens.
babyGPT gives you two options regarding what to do with the context buffer for your prompt:
-- all_zeros
-- get_from_prompt
With the first option, all of the integer encoding values in the context buffer are set to
the integer zero. With the second option, as currently implemented, the context buffer contains
a portion or all of the prompt itself: if the tokenized version of the prompt is at least as
long as the context buffer, only its initial context_buffer_size tokens are retained for the
context buffer; if it is shorter, the entire prompt is used.
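Below is a minimal, self-contained sketch of the two context-buffer options described above. The function name and the zero-padding of short prompts are illustrative assumptions, not the actual PromptResponder code.

    # Sketch only: build the context buffer for a prompt under the two options above.
    def make_context_buffer(prompt_token_ids, context_buffer_size, option="all_zeros"):
        if option == "all_zeros":
            # Option 1: every integer encoding value in the context buffer is set to zero.
            return [0] * context_buffer_size
        elif option == "get_from_prompt":
            # Option 2: the context buffer holds (a portion of) the prompt itself.
            buffer = prompt_token_ids[:context_buffer_size]
            # Pad with zeros if the tokenized prompt is shorter than the buffer (assumption).
            return buffer + [0] * (context_buffer_size - len(buffer))
        else:
            raise ValueError("option must be 'all_zeros' or 'get_from_prompt'")

    # Example with a context buffer of 10 tokens and a 6-token prompt:
    prompt = [101, 7592, 2088, 2003, 1037, 3231]
    print(make_context_buffer(prompt, 10, option="all_zeros"))        # ten zeros
    print(make_context_buffer(prompt, 10, option="get_from_prompt"))  # prompt + four zeros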
- SelfAttention = <class 'babyGPT.babyGPT.SelfAttention'>
- Borrowed from the Transformers module of DLStudio
- TrainTokenizer = <class 'babyGPT.babyGPT.TrainTokenizer'>
- Tokenizers play a critical role in language modeling because they create a fixed-sized vocabulary
for the corpus you are working with --- regardless of the size of the corpus itself. Unless your
text corpus is based on a set of documents frozen in time, ordinarily, as the size of a text corpus
goes up, so does the size of the vocabulary --- despite the illusion to the contrary created by
the fixed sizes of the language dictionaries you have seen all your life. How we express ourselves
is a living thing. We are constantly inventing new words and new expressions; these form important
components of what's referred to as the zeitgeist.
Having a fixed-size vocab is important because the loss functions used in the deep-learning networks
for language processing are based on maximum-likelihood prediction of the next token given
the tokens seen previously. That requires estimating the probabilities associated with all
possible tokens at the next position. As you can imagine, it would be impossible to engage in
such probabilistic reasoning if you did not know in advance the size of the vocabulary.
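In case a concrete illustration helps, the sketch below trains a BPE tokenizer with the Hugging Face 'tokenizers' library to obtain a fixed-size vocab from a text corpus. The library choice, the vocab size, the special tokens, and the file names are assumptions made for illustration only; the actual TrainTokenizer class may differ in all of these respects.

    # Sketch of training a fixed-size-vocab BPE tokenizer with the Hugging Face 'tokenizers'
    # library.  Vocab size, special tokens, and file names below are illustrative assumptions.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(vocab_size=50000,                    # vocab size stays fixed, regardless of corpus size
                         special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)    # hypothetical corpus file
    tokenizer.save("my_tokenizer.json")                       # hypothetical output name

    print("vocab size:", tokenizer.get_vocab_size())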
- TransformerFG = <class 'babyGPT.babyGPT.TransformerFG'>
- This I have borrowed from DLStudio's Transformers module. "FG" stands for "First Generation" --- which is
the Transformer as originally proposed by Vaswani et al.