builtins.object
    babyGPT

class babyGPT(builtins.object)

    babyGPT(*args, **kwargs)
Methods defined here:
__init__(self, *args, **kwargs)
    Initialize self. See help(type(self)) for accurate signature.

run_code_with_buffered_context_for_training_TransformerFG(self, xformer, master_decoder, dataloader, checkpoint_frequency=1000, display_train_loss=False)
    Drawn from the training routines in the Transformer module of DLStudio.

save_checkpoint_decoder(self, decoder, dir_name, iter_index)
    Save the decoder checkpoint.

save_checkpoint_embedding_generator(self, embedding_generator, dir_name, iter_index)
    Save a checkpoint for the embedding_generator.

save_decoder(self, decoder)
    Save the trained decoder to a disk file.

save_embedding_generator(self, embedding_generator)
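The save methods above write trained modules out to disk. As a point of reference, here is a
minimal sketch of what such a save typically amounts to, assuming the decoder is a
torch.nn.Module; the function, directory, and file names below are hypothetical and not
necessarily the ones babyGPT uses internally:

    import os
    import torch

    def save_decoder_checkpoint_sketch(decoder, dir_name="saved_models", iter_index=0):
        """Persist the decoder's learnable parameters to disk."""
        os.makedirs(dir_name, exist_ok=True)
        path = os.path.join(dir_name, "decoder_checkpoint_%d.pt" % iter_index)
        torch.save(decoder.state_dict(), path)        # weights only, not the class itself
        return path

    # Reloading later requires first re-creating the decoder object, e.g.:
    #     decoder = MasterDecoderWithMasking(...)
    #     decoder.load_state_dict(torch.load(path))
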
Data descriptors defined here:
__dict__
    dictionary for instance variables (if defined)

__weakref__
    list of weak references to the object (if defined)
Data and other attributes defined here:
ArticleDatasetWithBufferedContext = <class 'babyGPT.babyGPT.ArticleDatasetWithBufferedContext'>
    The parameter 'context_window_size' is related to how many tokens you can feed into the
    transformer at one iteration as the training corpus is being scanned. In my Week 14 lecture
    on Transformers, I used the notation 'max_seq_len' for this parameter.
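To make the role of 'context_window_size' concrete, here is a rough sketch (not the actual
logic of ArticleDatasetWithBufferedContext) of how a tokenized corpus gets scanned in
fixed-size windows, one window per training iteration:

    def chunk_token_stream(token_ids, context_window_size):
        """Yield consecutive fixed-size windows over a list of integer token ids."""
        for start in range(0, len(token_ids) - context_window_size + 1, context_window_size):
            yield token_ids[start : start + context_window_size]

    # Example: with context_window_size = 4, the stream [1,2,3,4,5,6,7,8,9] is served up
    # as [1,2,3,4] and [5,6,7,8]. The buffered context that the dataset class carries
    # between windows is not modeled in this sketch.
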
ArticleGatherer = <class 'babyGPT.babyGPT.ArticleGatherer'>
    This script is for collecting data for experimenting with the Transformer based
    unsupervised learning code in baby_gpt.py.

    The articles are downloaded from the URLs that are specified by the argument 'urls' in the
    constructor shown below. See the script "create_base_model.py" in the Examples directory
    for how to set the URL strings for this argument. Here are some examples:

        urls = ['https://finance.yahoo.com', 'http://cnn.com',
                'https://timesofindia.indiatimes.com',
                'https://purdueexponent.org', 'https://slate.com',
                'https://sports.yahoo.com']
        urls = ['http://cnn.com']
        urls = ['https://slate.com']
        urls = ['https://timesofindia.indiatimes.com']
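ArticleGatherer's own downloading logic is not reproduced here; the following is only a sketch,
using the requests and BeautifulSoup packages, of what harvesting paragraph text from such a
list of URLs can look like (the function name and output file are hypothetical):

    import requests
    from bs4 import BeautifulSoup

    def gather_paragraph_text(urls, out_file="articles.txt"):
        """Fetch each page and append its paragraph text to a corpus file."""
        with open(out_file, "a", encoding="utf-8") as fout:
            for url in urls:
                try:
                    html = requests.get(url, timeout=10).text
                except requests.RequestException:
                    continue                              # skip unreachable sites
                soup = BeautifulSoup(html, "html.parser")
                for para in soup.find_all("p"):
                    fout.write(para.get_text(strip=True) + "\n")

    gather_paragraph_text(['https://slate.com'])
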
AttentionHead = <class 'babyGPT.babyGPT.AttentionHead'>
    Borrowed from the Transformers module of DLStudio.

BasicDecoderWithMasking = <class 'babyGPT.babyGPT.BasicDecoderWithMasking'>
    Borrowed from the Transformers module of DLStudio.

EmbeddingGenerator = <class 'babyGPT.babyGPT.EmbeddingGenerator'>

MasterDecoderWithMasking = <class 'babyGPT.babyGPT.MasterDecoderWithMasking'>
    Borrowed from the Transformers module of DLStudio.
PromptResponder = <class 'babyGPT.babyGPT.PromptResponder'>
    Prompting a trained babyGPT model means that you supply a small number of words (as, say, the
    beginning of a new thought) as a prompt and the model supplies the rest of the words to complete
    the thought. The class comes with two methods, the first for extending your prompt until it
    reaches a period, and the second for going beyond the first period encountered.

    Any interaction with a trained GPT model has to deal with the following issue: What to do with
    the context buffer that is meant to be a continuation of the last part of the previous "sentence"
    fed into the transformer. Ideally, we should be placing in the context buffer words that create
    a context for the prompt. But there is no easy way to do that without a more elaborate model.
    An example of more elaborate modeling would be to have the input to the transformer consist of,
    say, an SOS token, a special context token consisting possibly of integer index values beyond
    the tokenizer vocab, followed by a context buffer that would be the last part of the previous
    sentence, followed, finally, by the new input tokens.

    babyGPT gives you two options regarding what to do with the context buffer for your prompt:

        -- all_zeros
        -- get_from_prompt

    With the first option, all of the integer encoding values in the context buffer are set to
    the integer zero. With the second option, at this time, the context buffer contains a portion
    or all of the prompt itself: if the tokenized version of the prompt is longer than the context
    buffer, just the initial context_buffer_size elements of the prompt are retained; otherwise,
    the entire tokenized prompt goes into the context buffer. (A sketch of this logic follows.)
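Here is a minimal sketch of the two context-buffer options described above; the function name is
hypothetical, and the zero-padding of a short prompt under 'get_from_prompt' is an assumption
made for this sketch:

    def make_context_buffer(prompt_token_ids, context_buffer_size, option="all_zeros"):
        """Return the integer token ids to be used as the context buffer for a prompt."""
        if option == "all_zeros":
            return [0] * context_buffer_size
        elif option == "get_from_prompt":
            buf = prompt_token_ids[:context_buffer_size]         # initial portion of the prompt
            return buf + [0] * (context_buffer_size - len(buf))  # pad if the prompt is short
        else:
            raise ValueError("unknown context buffer option: %s" % option)
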
ScheduledOptim = <class 'babyGPT.babyGPT.ScheduledOptim'>
    As in the Transformers module of DLStudio, for the scheduling of the learning rate
    during the warm-up phase of training TransformerFG, I have borrowed the class shown below
    from the GitHub code made available by Yu-Hsiang Huang at:

        https://github.com/jadore801120/attention-is-all-you-need-pytorch
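The warm-up schedule in that repository follows the one given by Vaswani et al.: the learning
rate rises linearly for the first n_warmup_steps updates and then decays as the inverse square
root of the step count (Huang's ScheduledOptim also applies a constant multiplier, omitted in
this sketch):

    def warmup_lr(step, d_model, n_warmup_steps):
        """Learning rate at update 'step' (step >= 1) under the warm-up schedule."""
        return (d_model ** -0.5) * min(step ** -0.5, step * n_warmup_steps ** -1.5)
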
SelfAttention = <class 'babyGPT.babyGPT.SelfAttention'>
    Borrowed from the Transformers module of DLStudio.
TrainTokenizer = <class 'babyGPT.babyGPT.TrainTokenizer'>
    Tokenizers play a critical role in language modeling because they create a
    fixed-sized vocabulary for the corpus you are working with --- regardless of
    the size of the corpus itself. Unless your text corpus is based on a set of
    documents frozen in time, ordinarily, as the size of a text corpus goes up,
    so does the size of the vocabulary --- despite the illusion to the contrary
    created by the fixed sizes of the language dictionaries you have seen all
    your life. How we express ourselves is a living thing. We are constantly
    inventing new words and new expressions; these form important components of
    what's referred to as the zeitgeist.

    Having a fixed-sized vocab is important because the loss functions used in
    deep-learning networks for language processing are based on
    maximum-likelihood prediction of the next token given the tokens seen
    previously. That requires estimating the probabilities associated with all
    possible tokens at the next position. As you can imagine, it would be
    impossible to engage in such probabilistic reasoning if you did not know in
    advance the size of the vocabulary.
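As one concrete way of obtaining such a fixed-size vocabulary, here is a sketch of training a
BPE tokenizer with the Hugging Face 'tokenizers' package; whether TrainTokenizer follows exactly
this recipe, and the vocab size and file names shown, are assumptions of this sketch:

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=50000, special_tokens=["[UNK]", "[PAD]"])
    tokenizer.train(files=["articles.txt"], trainer=trainer)    # hypothetical corpus file
    tokenizer.save("my_tokenizer.json")
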
TransformerFG = <class 'babyGPT.babyGPT.TransformerFG'>
    I have borrowed this class from DLStudio's Transformers module. "FG" stands for
    "First Generation" --- the Transformer as originally proposed by Vaswani et al.