DLStudio-2.5.8.html

DLStudio

Version 2.5.8, 2026-April-30

A software platform for teaching the Deep Learning class at Purdue University

DLStudio.py
Version:  2.5.8
Author:  Avinash Kak (kak@purdue.edu)
Date:  2026-April-30

Download Version 2.5.8: gztar	Total number of downloads (all versions) from this website: 7183 This count is automatically updated at every rotation of the weblogs (normally once every two to four days) Last updated: Sun May 31 06:08:01 EDT 2026
View the main module code file in your browser View the Adversarial Learning code file in your browser View the Seq2Seq Learning code file in your browser View the Data Prediction code file in your browser View the Transformers code file in your browser View the Metric Learning code file in your browser View the Generative Diffusion code file in your browser Download the image datasets for the main DLStudio module Download the image datasets for adversarial learning and diffusion Download the datasets for text classification Download the dataset for sequence-to-sequence learning Download the dataset for data prediction Download the datasets for transformer based learning

Download Version 2.5.8: gztar

Total number of downloads (all versions) from this website: 7183

            This count is automatically updated at every rotation of
          the weblogs (normally once every two to four days)
          Last updated: Sun May 31 06:08:01 EDT 2026

View the main module code file in your browser
View the Adversarial Learning code file in your browser
View the Seq2Seq Learning code file in your browser
View the Data Prediction code file in your browser
View the Transformers code file in your browser
View the Metric Learning code file in your browser
View the Generative Diffusion code file in your browser

Download the image datasets for the main DLStudio module
Download the image datasets for adversarial learning and diffusion
Download the datasets for text classification
Download the dataset for sequence-to-sequence learning
Download the dataset for data prediction
Download the datasets for transformer based learning

CONTENTS:

CHANGE LOG
INTRODUCTION
    SKIP CONNECTIONS
    OBJECT DETECTION AND LOCALIZATION
    NOISY OBJECT DETECTION AND LOCALIZATION
    IoU REGRESSION FOR OBJECT DETECTION AND LOCALIZATION
    SEMANTIC SEGMENTATION
    TEXT CLASSIFICATION
    DATA MODELING WITH ADVERSARIAL LEARNING
    DATA MODELING WITH DIFFUSION
    SEQUENCE-TO-SEQUENCE LEARNING WITH ATTENTION
    DATA PREDICTION
    TRANSFORMERS
    METRIC LEARNING
INSTALLATION
USAGE
CONSTRUCTOR PARAMETERS
PUBLIC METHODS
THE MAIN INNER CLASSES OF THE DLStudio CLASS
MODULES IN THE DLStudio PLATFORM
Examples DIRECTORY
ExamplesAdversarialLearning DIRECTORY
ExamplesDiffusion DIRECTORY
ExamplesSeq2SeqLearning DIRECTORY
ExamplesDataPrediction DIRECTORY
ExamplesTransformers DIRECTORY
ExamplesMetricLearning DIRECTORY
THE DATASETS INCLUDED
    FOR THE MAIN DLStudio CLASS and its INNER CLASSES
    FOR Seq2Seq LEARNING
    FOR ADVERSARIAL LEARNING AND DIFFUSION
    FOR DATA PREDICTION
    FOR TRANSFORMERS
BUGS
ACKNOWLEDGMENTS
ABOUT THE AUTHOR
COPYRIGHT

CHANGE LOG

  Version 2.5.8:

    DLStudio had an error in the VQGAN training loop that was caused by calling
    detach() inappropriately. The rest of the code remains unchanged.

  Version 2.5.7:

    This version fixes the error caused by incorrect normalization of the Q.K^T
    dot-products when calculating attention.  Previously, the normalization with
    respect to the size of the embedding vector was outside the nn.Softmax
    normalization, but it needed to be inside.

  Version 2.5.6:

    The MetricLearning module of DLStudio now includes code for demonstrating
    unsupervised clustering with the InfoNCE loss function.  In order to show how you
    can use this code, I have included a new example script in the directory named
    ExamplesMetricLearning at the top level of the DLStudio distribution.  The name
    of this script is "example_for_InfoNCE_loss_unsupervised.py".

  Version 2.5.5:

    With the inclusion of VQGAN, this version of DLStudio incorporates all three
    foundational concepts in variational autoencoding: VAE, VQVAE and VQGAN .  The
    last, VQGAN, occupies a very special place in Deep Learning because it
    established unequivocally that, after tokenization, attention based processing
    with transformers can be made to work exactly the same for both languages and
    images.  This is in keeping with the belief held by researchers (psychologists,
    psychophysicists, cognitive scientists, etc.) that after the lowest levels of
    sensory processing, the brain uses the same structures for drawing high level
    inferences from different types of sensory signals. And the fact that many people
    tend to be highly imagistic readers lends further support to this idea of the
    presence of unified structures in the brain for high-level processing of sensory
    information regardless of the sensors involved.

  Version 2.5.4:

    I have significantly redesigned the neural networks for the autoencoding classes
    in DLStudio.  These classes are Autoencoder, VAE, and VQVAE. These networks can
    now be made arbitrarily deep with a single parameter named num_repeats that
    controls the number of repetitions of the SkipBlock instances for which the
    channel dimensionality remains unchanged from the input to the output.

  Version 2.5.3:

    I have fixed the v2.5.2 module packaging error in the new version.  In addition,
    I have also changed the tensor axis used for nn.Softmax normalization of the
    "Q.K^T" dot-products from the word axis in the Q-tensors to the word-axis in the
    K-tensors.  This could potentially lead to superior results when solving problems
    in which the cross-attention plays a significantly larger role than
    self-attention.

  Version 2.5.2:

    In this version, I have simplified the code in the AdversarialLearning module of
    DLStudio.  This I did by dropping what turned out to be a not-really-needed
    container class for the rest of the code in the module.  Another change made in
    this version is the addition of VQ-VAE code in the main DLStudio class file.
    This added code is in the form of a new class called VQVAE. (BTW, VQ stands for
    Vector Quantization.  It is an important concept that is at the heart of what's
    known as Codebook Learning for more efficient discrete representation of images
    using a finite vocabulary of embedding vectors.)

  Version 2.5.1:

    With this version, DLStudio now comes with an implementation for Variational
    Auto-Encoding (VAE) for images.  The VAE Encoder learns the latent distribution
    (as represented by the mean and the variance of a presumed isotropic Gaussian)
    for a training dataset.  Subsequently, the VAE Decoder samples this distribution
    for generating new images that can be useful variants of the input images.  My
    VAE implementation is in the form of two additional inner classes in the main
    DLStudio class: Autoencoder and VAE.  The former serves as the base class for the
    latter.

  Version 2.5.0:

    An encoder/decoder component of a typical Transformer consists of an attention
    layer followed by an FFN (Feed Forward Network) layer.  There was an error in the
    FFN layer that I have fixed in this version of DLStudio.  While the previous
    versions also worked from the standpoint of overall transformer-based learning,
    the error caused the number of learnable parameters to depend strongly on the
    maximum sequence length at the input to the transformer.  You will not see that
    problem in Version 2.5.0.

  Version 2.4.9:

    This version contains fixes for the pathname errors in the Transformers module of
    DLStudio. The errors were related to where the models and the checkpoints were
    supposed to be stored during training.

  Version 2.4.8:

    In this version, I have made two important changes to the Transformers module in
    DLStudio: (1) The Transformers module now includes a new class that I have named
    visTransformer that works like the famous Vision Transformer (ViT) proposed by
    Dosovitskiy et al. And (2) I have made changes to the QKV code for the
    calculation of self and cross attention in all of the Transformer classes in the
    module. The attention calculations should now execute faster, which is a very
    important consideration in any transformer based learning.

  Version 2.4.3:

    The diffusion modeling part of the code should now accept training images of any
    size.  Previously it was limited to images of size 64x64.  Note that this change
    is not as significant as you might think because, under the hood, the actual
    input image size is changed to the size 64x64 for diffusion modeling.  So this
    change is more for your convenience than anything else.  I have also improved the
    image visualization code in the ExamplesDiffusion directory. The new
    implementation of the script VisualizeSamples.py automatically generates a
    collage of the images generated from noise by the script
    GenerateNewImageSamples.py.  Other changes include minor clean-up of the main doc
    page for GenerativeDiffusion module and of a couple of the functions in the
    module.

  Version 2.4.2:

    DLStudio now includes a new module devoted to data modeling with diffusion.  This
    module, named GenerativeDiffusion, is a module in the overall DLStudio platform.
    The GenerativeDiffusion class resides at the same level of software abstraction
    as the main DLStudio class in the platform.  See the README in the
    ExamplesDiffusion directory of the distribution for how to experiment with the
    diffusion based code in DLStudio.

  Version 2.3.6:

    Gets rid of the inadvertently hardcoded value for the batch size in the testing
    part of the code for Semantic Segmentation.

  Version 2.3.5:

    In this version I have improved the quality of the code in the Semantic
    Segmentation inner class of DLStudio.

  Version 2.3.4:

    Contains a cleaner design for the SkipBlock network. That led to improved design
    for some of the larger networks in DLStudio that use the SkipBlock as a
    building-block.

  Version 2.3.3:

    This version fixes a couple of bugs in DLStudio.  The newer versions of PyTorch
    do not like it if, during the forward flow of data through a network, you make
    in-place changes to the data objects that are in the computational graph.
    Examples of in-place operations are those that result from using compound
    assignment operators like '+=', '*=', etc., and those that result from the use of
    slice operators on arrays WHEN THEY ARE ASSIGNED TO.  Such bugs are difficult to
    troubleshoot because the error messages returned by PyTorch are as unhelpful as
    they can be --- they give you no indication at all as to the location of the
    fault. This version of DLStudio was tested with Version 2.0.1 of PyTorch.

  Version 2.3.2:

    This version of DLStudio includes a new Metric Learning module (name of the
    class: MetricLearning). The main idea of metric learning is to learn a mapping
    from the images to their embedding vector representations in such a way that the
    embeddings for what are supposed to be similar images are pulled together and
    those for dissimilar images are pulled as far apart as possible.  After such a
    mapping function is learned, you can take a query image (whose class label is not
    known), run it through the network to find its embedding vector, and,
    subsequently, assign to the query image the class label of the nearest
    training-image neighbor in the embedding space.  As explained in my Metric
    Learning lecture slides in the Deep Learning class at Purdue, this approach to
    classification is likely to work better under data conditions when the more
    traditional neural network classifiers fail.

  Version 2.3.0:

    I have made it a bit simpler for you to use DLStudio's transformer classes in
    your own scripts.  This I did by eliminating 'Transformers' as the parent class
    of TransformerFG and TransformerPreLN.  Now, in your own scripts, you can have
    more direct access to these two classes that you need for transformer based
    learning.  Your best guide to the syntax for calling TransformerFG and
    TransformerPreLN are the example scripts "seq2seq_with_transformerFG.py" and
    "seq2seq_with_transformerPreLN.py" in the ExamplesTransformers directory of the
    distribution.  Additional changes in Version 2.3.0 include general code clean-up
    by getting rid of variables no longer being used, harmonizing the names of the
    constructor options, etc.

  Version 2.2.8:

    In this version I have fixed a couple of errors that had crept into the previous
    version at the time of packaging that distribution.

  Version 2.2.7:

    This version provides you with the tools you need to cope with the frustrations
    of training a transformer based network. Such networks in general are difficult
    to train, in the sense that your per-epoch training time is likely to be much
    longer than what you are accustomed to, and it can take many, many more epochs to
    get the model to converge.  In addition, you have the problem of stability to
    deal with. Stability means that with a wrong choice for the hyperparameters, the
    model that you are currently training could suddenly begin to diverge (which is
    something akin to mode collapse in training a GAN). If you have to wait until the
    end of training to see such failures, that can be very frustrating.  To cope with
    these problems, this version of DLStudio automatically spits out a checkpoint for
    the model every 5 epochs and also gives you the functions for evaluating the
    performance of the checkpoints. The performance check can be as simple as looking
    at the translated sentences vis-a-vis their targets for a random selection of
    sentence pairs from the data.  When real learning is taking place, you will see
    longer and longer fragments of the translated sentences correspond to the target
    sentences. On the other hand, when you have model divergence, the translated
    sentences will appear to be gibberish.  A future version of DLStudio will also
    print out the BLEU score for the checkpoints.

  Version 2.2.5:

    This version contains significantly improved documentation for DCGAN and WGAN in
    the AdversarialLearning class of DLStudio.

  Version 2.2.4:

    I have cleaned up the code in the new DIoULoss class that I added in the previous
    version. The script object_detection_and_localization_iou.py in the Examples
    directory of DLStudio is based on this loss function.

  Version 2.2.3:

    The inner class DetectAndLocalize of DLStudio now contains a custom loss function
    provided by the class DIoULoss that implements the more modern variants of the
    basic IoU (Intersection over Union) loss function.  These IoU variants are
    explained in the slides 37-42 of my Week 7 Lecture on "Object Detection and
    Localization."  Your best entry point to become familiar with these loss
    functions is the script object_detection_and_localization_iou.py in the Examples
    directory of DLStudio.

  Version 2.2.2:

    This version of DLStudio presents my implementation of transformers in deep
    learning. You will find two transformer implementations in the Transformers
    module of the DLStudio platform in the distribution directory: TransformerFG and
    TransformerPreLN.  "FG" in TransformerFG stands for "Transformer First
    Generation"; it is my implementation of the architecture presented originally in
    the seminal paper "Attention is All You Need" by Vaswani et el.  And the second,
    TransformerPreLN ("PreLN" stands for "Pre Layer Norm") is a small but important
    modification of the original idea that is based on the paper "On Layer
    Normalization in the Transformer Architecture" by Xiong et al.  I could have
    easily combined the two implementations with a small number of conditional
    statements to account for the differences, however I have chosen to keep them
    separate in order to make it easier for the two to evolve separately and to be
    used differently for educational purposes.

  Versions 2.1.7 through 2.2.1:

    These version numbers are for the stepping-stones in my journey into the world of
    transformers --- my experiments with how to best implement the different
    components of a transformer for educational purposes.  As things stand, these
    versions contain features that did not make into the public release version 2.2.2
    on account of inadequate testing.  I may include those features in versions of
    DLStudio after 2.2.2.

  Version 2.1.6:

    All the changes are confined to the DataPrediction module in the DLStudio
    platform.  After posting the previous version, I noticed that the quality of the
    code in DataPrediction was not up to par.  The new version presents a cleaned-up
    version of the DataPrediction class.

  Version 2.1.5:

    DLStudio has now been equipped with a new module named DataPrediction whose focus
    is solely on solving data prediction problems for time-series data.  A
    time-series consists of a sequence of observations recorded at regular intervals.
    These could, for example, be the price of a stock share recorded every hour; the
    hourly recordings of electrical load at your local power utility company; the
    mean average temperature recorded on an annual basis; and so on.  We want to use
    the past observations to predict the value of the next one.  While data
    prediction has much in common with other forms of sequence based learning, it
    presents certain unique challenges of its own and those are with respect to (1)
    Data Normalization; (2) Input Data Chunking; and (3) Multi-dimensional encoding
    of the "datetime" associated with each observation in the time-series.

  Version 2.1.3:

    Some users of DLStudio have reported that when they run the WGAN code for
    adversarial learning, the dataloader sometimes hangs in the middle of a training
    run.  (A WGAN training session may involve as many as 500 epochs.)  In trying to
    reproduce this issue, I discovered that the training loops always ran to
    completion if you set the number of workers in the dataloader to 0.  Version
    2.1.3 makes it easier for you to specify the number of workers in your own
    scripts that call on the WGAN functionality in the AdversarialLearning class.

  Version 2.1.2:

    The adversarial learning part of DLStudio now includes a WGAN implementation that
    uses Gradient Penalty for the learning required by the Critic.  All the changes
    made are in the AdversarialLearning class at the top level of the module.

  Version 2.1.1:

    In order to make it easier to navigate through the large code base of the module,
    I am adopting the convention that "Network" in the name of a class be reserved
    for only those cases when a class actually implements a network.  This convention
    requires that the name of an encapsulating class meant for teaching/learning a
    certain aspect of deep learning not contain "Network" in it.  Therefore, in
    Version 2.1.1, I have changed the names of the top-level classes
    AdversarialNetworks and Seq2SeqNetworks to AdversarialLearning and
    Seq2SeqLearning, respectively.

  Version 2.1.0:

    I have reorganized the code base a bit to make it easier for DLStudio to grow in
    the future.  This I did by moving the sequence-to-sequence learning (seq2seq)
    code to a separate module of the DLStudio platform.  The name of the new module
    is Seq2SeqLearning and it resides at the top level of the distribution.

  Version 2.0.9:

    With this version, DLStudio comes with educational material on
    sequence-to-sequence learning (seq2seq). To that end, I have included the
    following two new classes in DLStudio: (1) Seq2SeqWithLearnableEmbeddings for
    seq2seq with learnable embeddings; and (2) Seq2SeqWithPretrainedEmbeddings for
    doing the same with pre-trained embeddings. Although I have used word2vec for the
    case of pre-trained embeddings, you would be able to run the code with the
    Fasttext embeddings also.  Both seq2seq implementations include the attention
    mechanism based on my understanding of the original paper on the subject by
    Bahdanau, Cho, and Bengio. You will find this code in a class named
    Attention_BCB.  For the sake of comparison, I have also included an
    implementation of the the attention mechanism used in the very popular NLP
    tutorial by Sean Robertson.  You will find that code in a class named
    Attention_SR. To switch between these two attention mechanisms, all you have to
    do is to comment-out and uncomment a couple of lines in the DecoderRNN code.

  Version 2.0.8:

    This version pulls into DLStudio a very important idea in text processing and
    language modeling --- word embeddings.  That is, representing words by
    fixed-sized numerical vectors that are learned on the basis of their contextual
    similarities (meaning that if two words occur frequently in each other's context,
    they should have similar numerical representations).  Use of word embeddings is
    demonstrated in DLStudio through an inner class named
    TextClassificationWithEmbeddings.  Using pre-trained word2vec embeddings, this
    new inner class can be used for experimenting with text classification, sentiment
    analysis, etc.

  Version 2.0.7:

    Made incremental improvements to the visualization of intermediate results during
    training.

  Version 2.0.6:

    This is a result of further clean-up of the code base in DLStudio.  The basic
    functionality provided by the module has not changed.

  Version 2.0.5:

    This version has a bug-fix for the training loop used for demonstrating the power
    of skip connections.  I have also cleaned up how the intermediate results
    produced during training are displayed in your terminal window.  In addition, I
    deleted the part of DLStudio that dealt with Autograd customization since that
    material is now in my ComputationalGraphPrimer module.

  Version 2.0.4:

    This version mostly changes the HTML formatting of this documentation page.  The
    code has not changed.

  Version 2.0.3:

    I have been experimenting with how to best incorporate adversarial learning in
    the DLStudio platform. That's what accounts for the jump from the previous public
    release version 1.1.4 to new version 2.0.3.  The latest version comes with a
    separate class named AdversarialLearning for experimenting with different types
    of such networks for learning data models with adversarial learning and,
    subsequently, generating new instances of the data from the learned models. The
    AdversarialLearning class includes two Discriminator-Generator (DG) pairs and one
    Critic-Generator (CG) pair. Of the two DG pairs, the first is based on the logic
    of DCGAN, and the second a small modification of the first.  The CG pair is based
    on the logic of Wasserstein GAN.  This version of the module also comes with a
    new examples directory, ExamplesAdversarialLearning, that contains example
    scripts that show how you can call the different DG and CG pairs in the
    AdversarialLearning class.  Also included is a new dataset I have created,
    PurdueShapes5GAN-20000, that contains 20,000 images of size 64x64 for
    experimenting with the GANs in this module.

  Version 1.1.4:

    This version has a new design for the text classification class TEXTnetOrder2.
    This has entailed new scripts for training and testing when using the new version
    of that class. Also includes a fix for a bug discovered in Version 1.1.3

  Version 1.1.3:

    The only change made in this version is to the class GRUnet that is used for text
    classification.  In the new version, the final output of this network is based on
    the LogSoftmax activation.

  Version 1.1.2:

    This version adds code to the module for experimenting with recurrent neural
    networks (RNN) for classifying variable-length text input. With an RNN, a
    variable-length text input can be characterized with a hidden state vector of a
    fixed size.  The text processing capabilities of the module allow you to compare
    the results that you may obtain with and without using a GRU. For such
    experiments, this version also comes with a text dataset based on an old archive
    of product reviews made available by Amazon.

  Version 1.1.1:

    This version fixes the buggy behavior of the module when using the 'depth'
    parameter to change the size of a network.

  Version 1.1.0:

    The main reason for this version was my observation that when the training data
    is intentionally corrupted with a high level of noise, it is possible for the
    output of regression to be a NaN (Not a Number).  In my testing at noise levels
    of 20%, 50%, and 80%, while you do not see this problem when the noise level is
    20%, it definitely becomes a problem when the noise level is at 50%.  To deal
    with this issue, this version includes the test 'torch.isnan()' in the training
    and testing code for object detection.  This version of the module also provides
    additional datasets with noise corrupted images with different levels of noise.
    However, since the total size of the datasets now exceeds the file-size limit at
    'https://pypi.org', you'll need to download them separately from the link
    provided in the main documentation page.

  Version 1.0.9:

    With this version, you can now use DLStudio for experiments in semantic
    segmentation of images.  The code added to the module is in a new inner class
    that, as you might guess, is named SemanticSegmentation.  The workhorse of this
    inner class is a new implementation of the famous Unet that I have named mUNet
    --- the prefix "m" stands for "multi" for the ability of the network to segment
    out multiple objects simultaneously.  This version of DLStudio also comes with a
    new dataset, PurdueShapes5MultiObject, for experimenting with mUNet.  Each image
    in this dataset contains a random number of selections from five different shapes
    --- rectangle, triangle, disk, oval, and star --- that are randomly scaled,
    oriented, and located in each image.

  Version 1.0.7:

    The main reason for creating this version of DLStudio is to be able to use the
    module for illustrating how to simultaneously carry out classification and
    regression (C&R) with the same convolutional network.  The specific C&R problem
    that is solved in this version is the problem of object detection and
    localization. You want a CNN to categorize the object in an image and, at the
    same time, estimate the bounding-box for the detected object. Estimating the
    bounding-box is referred to as regression.  All of the code related to object
    detection and localization is in the inner class DetectAndLocalize of the main
    module file.  Training a CNN to solve the detection and localization problem
    requires a dataset that, in addition to the class labels for the objects, also
    provides bounding-box annotations for the objects.  Towards that end, this
    version also comes with a new dataset called PurdueShapes5.  Another new inner
    class, CustomDataLoading, that is also included in Version 1.0.7 has the
    dataloader for the PurdueShapes5 dataset.

  Version 1.0.6:

    This version has the bugfix for a bug in SkipBlock that was spotted by a student
    as I was demonstrating in class the concepts related to the use of skip
    connections in deep neural networks.

  Version 1.0.5:

    This version includes an inner class, BMEnet, for experimenting with skip
    connections to improve the performance of a deep network.  The Examples
    subdirectory of the distribution includes a script,
    playing_with_skip_connections.py, that demonstrates how you can experiment with
    skip connections in a network.

  Version 1.0.4:

    I have added one more inner class, AutogradCustomization, to the module that
    illustrates how to extend Autograd if you want to endow it with additional
    functionality. And, most importantly, this version fixes an important bug that
    caused wrong information to be written out to the disk when you tried to save the
    learned model at the end of a training session. I have also cleaned up the
    comment blocks in the implementation code.

  Version 1.0.3:

    This is the first public release version of this module.

INTRODUCTION

    DLStudio is an integrated software platform for teaching (and learning) a wide
    range of basic architectural features of deep-learning neural networks.

    To get the most educational value out of DLStudio, please see the slides for my
    lectures at Purdue's Deep Learning class.  Most of the learning you do with
    DLStudio is through the scripts in the various Example directories in the
    distribution.  To get access to these Example directories, please do NOT do "sudo
    pip install" on this module since that only gives you the main module files.  You
    would need to install it from the tar archive according to the installation
    instructions in this documentation page.

    As to why you may find DLStudio useful for your learning, note that most
    instructors who teach deep learning ask their students to download the so-called
    famous networks from, say, GitHub and become familiar with them by running them
    on the datasets used by the authors of those networks.  This approach is akin to
    teaching automobile engineering by asking the students to take the high-powered
    cars of the day out for a test drive.  In my opinion, this rather commonly used
    approach does not work for instilling in the students a deep understanding of the
    issues related to network architectures.

    On the other hand, DLStudio offers its own implementations for a variety of key
    features of neural network architectures.  These implementations, along with
    their explanations through detailed slide presentations at our Deep Learning
    class website at Purdue, result in an educational framework that is much more
    efficient in what it can deliver within the time constraints of a single
    semester.

    DLStudio facilitates learning through a combination of inner classes of the main
    module class --- called DLStudio naturally --- and several additional modules in
    the overall platform.  These modules deal with Adversarial Learning, Metric
    Learning, Variational Autoencoding, Diffusion, Data Prediction, Text Analytics,
    Transformer based learning, etc.

    For the most part, the common code that you'd need in different scenarios for
    using neural networks has been placed inside the definition of the main DLStudio
    class in a file named DLStudio.py in the distribution.  That makes more compact
    the definition of the other inner classes within DLStudio. And, to a certain
    extent, that also results in a bit more compact code in the different modules in
    the DLStudio platform.

    In what follows, I'll, briefly describe each of thes major topics addressed in the
    DLStudio platform.

   SKIP CONNECTIONS

    You can use DLStudio's inner class BMEnet to experiment with connection skipping
    in a deep network. Connection skipping means to provide shortcuts in a
    computational graph around the commonly used network components like
    convolutional and other types of layers.  In the absence of such shortcuts, deep
    networks suffer from the problem of vanishing gradients that degrades their
    performance.  Vanishing gradients means that the gradients of the loss calculated
    in the early layers of a network become increasingly muted as the network becomes
    deeper.  An important mitigation strategy for addressing this problem consists of
    creating a CNN using blocks with skip connections.

    As shown in the script playing_with_skip_connections.py in the Examples directory
    of the distribution, you can easily create a CNN with arbitrary depth just by
    using the constructor option "depth" for BMEnet. The basic block of the network
    constructed in this manner is called SkipBlock which, very much like the
    BasicBlock in ResNet-18, has a couple of convolutional layers whose output is
    combined with the input to the block.

    Note that the value given to the "depth" constructor option for the BMEnet class
    does NOT translate directly into the actual depth of the CNN. [Again, see the
    script playing_with_skip_connections.py in the Examples directory for how to use
    this option.] The value of "depth" is translated into how many instances of
    SkipBlock to use for constructing the CNN.

    If you want to use DLStudio for learning how to create your own versions of
    SkipBlock-like shortcuts in a CNN, your starting point should be the following
    script in the Examples directory of the distro:

                playing_with_skip_connections.py

    This script illustrates how to use the inner class BMEnet of the module for
    experimenting with skip connections in a CNN. As the script shows, the
    constructor of the BMEnet class comes with two options: skip_connections and
    depth.  By turning the first on and off, you can directly illustrate in a
    classroom setting the improvement you can get with skip connections.  And by
    giving an appropriate value to the "depth" option, you can show results for
    networks of different depths.

   OBJECT DETECTION AND LOCALIZATION

    The code for how to solve the problem of object detection and localization with a
    CNN is in the inner classes DetectAndLocalize and CustomDataLoading.  This code
    was developed for version 1.0.7 of the module.  In general, object detection and
    localization problems are more challenging than pure classification problems
    because solving the localization part requires regression for the coordinates of
    the bounding box that localize the object.  If at all possible, you would want
    the same CNN to provide answers to both the classification and the regression
    questions and do so at the same time.  This calls for a CNN to possess two
    different output layers, one for classification and the other for regression.  A
    deep network that does exactly that is illustrated by the LOADnet classes that
    are defined in the inner class DetectAndLocalize of the DLStudio platform.  [By
    the way, the acronym "LOAD" in "LOADnet" stands for "LOcalization And
    Detection".] Although you will find three versions of the LOADnet class inside
    DetectAndLocalize, for now only pay attention to the LOADnet2 class since that is
    the one I have worked with the most for creating the 1.0.7 distribution.

    As you would expect, training a CNN for object detection and localization
    requires a dataset that, in addition to the class labels for the images, also
    provides bounding-box annotations for the objects in the images. Out of my great
    admiration for the CIFAR-10 dataset as an educational tool for solving
    classification problems, I have created small-image-format training and testing
    datasets for illustrating the code devoted to object detection and localization
    in this module.  The training dataset is named PurdueShapes5-10000-train.gz and
    it consists of 10,000 images, with each image of size 32x32 containing one of
    five possible shapes --- rectangle, triangle, disk, oval, and star. The shape
    objects in the images are randomized with respect to size, orientation, and
    color.  The testing dataset is named PurdueShapes5-1000-test.gz and it contains
    1000 images generated by the same randomization process as used for the training
    dataset.  You will find these datasets in the "data" subdirectory of the
    "Examples" directory in the distribution.

    Providing a new dataset for experiments with detection and localization meant
    that I also needed to supply a custom dataloader for the dataset.  Toward that
    end, Version 1.0.7 also includes another inner class named CustomDataLoading
    where you will my implementation of the custom dataloader for the PurdueShapes5
    dataset.

    If you want to use DLStudio for learning how to write your own PyTorch code for
    object detection and localization, your starting point should be the following
    script in the Examples directory of the distro:

                object_detection_and_localization.py

    Execute the script and understand what functionality of the inner class
    DetectAndLocalize it invokes for object detection and localization.

   NOISY OBJECT DETECTION AND LOCALIZATION

    When the training data is intentionally corrupted with a high level of noise, it
    is possible for the output of regression to be a NaN (Not a Number).  Here is
    what I observed when I tested the LOADnet2 network at noise levels of 20%, 50%,
    and 80%: At 20% noise, both the labeling and the regression accuracies become
    worse compared to the noiseless case, but they would still be usable depending on
    the application.  For example, with two epochs of training, the overall
    classification accuracy decreases from 91% to 83% and the regression error
    increases from under a pixel (on the average) to around 3 pixels.  However, when
    the level of noise is increased to 50%, the regression output is often a NaN (Not
    a Number), as presented by 'numpy.nan' or 'torch.nan'.  To deal with this
    problem, Version 1.1.0 of the DLStudio platform checks the output of the
    bounding-box regression before drawing the rectangles on the images.

    If you wish to experiment with detection and localization in the presence of
    noise, your starting point should be the script

                noisy_object_detection_and_localization.py

    in the Examples directory of the distribution.  Note that you would need to
    download the datasets for such experiments directly from the link provided near
    the top of this documentation page.

   IoU REGRESSION FOR OBJECT DETECTION AND LOCALIZATION

    Starting with version 2.2.3, DLStudio illustrates how you can use modern variants
    of the IoU (Intersection over Union) loss function for the regression needed for
    object localization.  These loss functions are provided by the DIoULoss class
    that is a part of DLStudio's inner class DetectAndLocalize. If you wish to
    experiment with these loss functions, you best entry point would be the script

                object_detection_and_localization_iou.py

    in the Examples directory of the distribution.  This script uses the same
    PurdueShapes5-10000-train.gz and PurdueShapes5-1000-test.gz training and testing
    datasets as the object_detection_and_localization.py script mentioned earlier.

   SEMANTIC SEGMENTATION

    The code for how to carry out semantic segmentation is in the inner class that is
    appropriately named SemanticSegmentation.  At its simplest, the purpose of
    semantic segmentation is to assign correct labels to the different objects in a
    scene, while localizing them at the same time.  At a more sophisticated level, a
    system that carries out semantic segmentation should also output a symbolic
    expression that reflects an understanding of the scene in the image that is based
    on the objects found in the image and their spatial relationships with one
    another.  The code in the new inner class is based on only the simplest possible
    definition of what is meant by semantic segmentation.

    The convolutional network that carries out semantic segmentation DLStudio is
    named mUNet, where the letter "m" is short for "multi", which, in turn, stands
    for the fact that mUNet is capable of segmenting out multiple object
    simultaneously from an image.  The mUNet network is based on the now famous Unet
    network that was first proposed by Ronneberger, Fischer and Brox in the paper
    "U-Net: Convolutional Networks for Biomedical Image Segmentation".  Their UNET
    extracts binary masks for the cell pixel blobs of interest in biomedical images.
    The output of UNET can therefore be treated as a pixel-wise binary classifier at
    each pixel position.  The mUNet class, on the other hand, is intended for
    segmenting out multiple objects simultaneously form an image. [A weaker reason
    for "m" in the name of the class is that it uses skip connections in multiple
    ways --- such connections are used not only across the two arms of the "U", but
    also also along the arms.  The skip connections in the original Unet are only
    between the two arms of the U.

    mUNet works by assigning a separate channel in the output of the network to each
    different object type.  After the network is trained, for a given input image,
    all you have to do is examine the different channels of the output for the
    presence or the absence of the objects corresponding to the channel index.

    This version of DLStudio also comes with a new dataset,
    PurdueShapes5MultiObject, for experimenting with mUNet.  Each image in this
    dataset contains a random number of selections from five different shapes,
    with the shapes being randomly scaled, oriented, and located in each image.
    The five different shapes are: rectangle, triangle, disk, oval, and star.

    Your starting point for learning how to use the mUNet network for segmenting
    images should be the following script in the Examples directory of the distro:

                semantic_segmentation.py

    Execute the script and understand how it uses the functionality packed in the
    inner class SemanticSegmentation for segmenting out the objects in an image.

   VARIATIONAL AUTO-ENCODNG

    Starting with Version 2.5.1, you can experiment with generative autoencoding in
    the DLStudio platform.  The inner class of DLStudio that allows you to do that is
    called VAE (for Variational Auto-Encoding).  Generative data modeling with
    variational autoencoding is based on the assumption that exists a relatively
    simple Latent Space that captures the essence of what's in the images you are
    interested in.  That is, it is possible to map any given input image to a
    low-dimensional (relatively speaking) vector z that can be modeled with a simple
    probabilty distribution (which, ideally speaking, would be a zero-mean,
    unit-covariance Gaussian) that could subseqently be used to generate useful
    variants of the input image.

    A great thing about VAE is that it allows you to carry out what's known as
    disentanglement learning in which the learned latent space captures the essence
    of each image in your training dataset and have the Decoder capture the rest.
    (The Encoder's job is to map the input image to its representation in the Latent
    Space.)

    I have implemented variational autoencoding in DLStudio through a base class
    Autoencoder and a derived class VAE. Both of these are inner classes of the main
    DLStudio class. (That is, both these classea are defined in the file
    "DLStudio.py" in the distribution.) It is the Encoder and the Decoder in the base
    class Autoencoder that does the bulk of computing in VAE. What the VAE class does
    specifically is to feed the output of Autoencoder's Encoder into two nn.Linear
    layers for the learning of the mean and the log-variance of the latent
    distribution for the training dataset.  Regarding decoding, VAE's Decoder invokes
    what's known as the "reparameterization trick" for sampling the latent
    distribution to first construct a sample from the latent space, reshape the
    sample appropriately, and to then feed it into Autoencoder's Decoder.

    If you want to experiment with autoencoding and variational autoencoding in
    DLStudio, your staring points should be the following three scripts in the
    Examples directory of the distribution:

               run_autoencoder.py
               run_vae.py
               run_vae_for_image_generation.py

    The first script, run_autoencoder.py, is for experimenting with just the
    Autoencoder class by itself.  For example, if you wanted to experiment with
    dimensionality reduction with an Autoencoder, all you would need to do would be
    to change the last or the last couple of layers of the Decoder in the Autoencoder
    class and see for yourself the results by running this script.  The second
    script, run_vae.py, is for experimenting with the variational autoencoding code
    in the VAE class. And the last script, run_vae_for_image_generation.py, is for
    experimenting with just the Decoder part of the VAE class.  The VAE Decoder is in
    reality a Generator that samples a Gaussian probability distribution, as
    specified by its mean and covariance, and transforms the sample back into an
    image that, ideally speaking, should belong to the same distribution as the
    images used for training the VAE.

   VAE WITH CODEBOOK LEARNING

    The development of VQVAE brought us what's now known as the Codebook Learning in
    Deep Learning that eventually placed languages and images on an equal footing
    with regard to attention-based processing with transformers.  The prefix "VQ"
    stands for "Vector Quantization" that replaces each vector produced by the
    Encoder (the values in the vector lie along the channel dimension at the output
    of an Encoder) by the closest similar vector in a learned codebook.  This act
    suppresses noise and other irrelevant variations at the Encoder output.

    VQVAE does three things: (1) If the output of the Encoder is of shape NxNxC, it
    is re-imagined as representing the input image with N^2 embedding vectors, each
    of dimensionality C. (2) You declare a Codebook consisting of K vectors, each of
    dimensionality, again, C. And, finally, (3) You replace each embedding vector at
    the output of the Encoder with its closest Codebook vector. Finally, you feed the
    "reconstituted" embedding vectors into the Decoder to recover an image (or its
    desired variants in a conditional implementation).

    VQGAN goes one step further by showing that the Codebook vectors in a VQVAE can
    be treated like the embedding vectors associated with the tokens in transformer-
    based neural architectures for language processing.  The token sequences for such
    transformer based processing would consist of the integer indices associated with
    the Codebook vectors that an input image is mapped to. The original authors of
    VQGAN have demonstrated that such networks can generate virtual image variants of
    training images, with the virtual images being larger in size and of greater
    visual complexity than was possible before.

    To learn these ideas through DLStudio, please note the following scripts in the
    Examples directory of the distribution:

             run_vqvae.py

             run_vqgan.py

             run_vqgan_map_image_to_codebook.py

             run_vqgan_transformer.py

    The script "run_vqvae.py" stands on its own and try to play with it in order to
    learn the basics of codebook learning.

    The following two scripts named above:

             run_vqgan.py

             run_vqgan_transformer.py

    are two parts of what it takes to accomplish the following: autoregressive
    modeling of the images through the tokens that correspond to the vectors in the
    learned Codebook.  You first work with "run_vqgan.py" to train the Encoder, the
    Decoder, and the VectorQuantizer networks.  Subsequently, using the components
    trained by "run_vqgan.py", you work with "run_vqgan_transformer.py" in order to
    train a transformer network for autoregressive modeling of the images.

    On the four scripts named above, that leaves

             run_vqgan_map_image_to_codebook.py

    is specifically for you to acquire deeper intuitions about what exactly in the
    images is represented by the Codebook vectors.  In order to play with this
    script, you MUST first run the script "run_vqgan.py" for training the VQGAN
    network.  Subsequently, playing with "run_vqgan_map_image_to_codebook.py" will
    help you better understand the role played by the Codebook vectors in the output
    produced by the Decoder.  For additional information related to this script, see
    the comment block at the top of the file in the Examples directory of the
    distribution.

   TEXT CLASSIFICATION

    Starting with Version 1.1.2, the module includes an inner class
    TextClassification that allows you to do simple experiments with neural
    networks with feedback (that are also called Recurrent Neural Networks).  With
    an RNN, textual data of arbitrary length can be characterized with a hidden
    state vector of a fixed size.  To facilitate text based experiments, this
    module also comes with text datasets derived from an old Amazon archive of
    product reviews.  Further information regarding the datasets is in the comment
    block associated with the class SentimentAnalysisDataset. If you want to use
    DLStudio for experimenting with text, your starting points should be the
    following three scripts in the Examples directory of the distribution:

                text_classification_with_TEXTnet.py
                text_classification_with_TEXTnetOrder2.py
                text_classification_with_GRU.py

    The first of these is meant to be used with the TEXTnet network that does not
    include any protection against the vanishing gradients problem that a poorly
    designed RNN can suffer from.  The second script mentioned above is based on
    the TEXTnetOrder2 network and it includes rudimentary protection, but not
    enough to suffice for any practical application.  The purpose of TEXTnetOrder2
    is to serve as an educational stepping stone to a GRU (Gated Recurrent Unit)
    network that is used in the third script listed above.

    Starting with Version 2.0.8, the Examples directory of DLStudio also includes
    the following three scripts that use the same learning networks as the
    corresponding scripts mentioned above but with word representations based on
    word2vec embeddings:

                text_classification_with_TEXTnet_word2vec.py
                text_classification_with_TEXTnetOrder2_word2vec.py
                text_classification_with_GRU_word2vec.py

    The pre-trained word2vec embeddings used in these scripts are accessed
    through the popular gensim library.

   DATA MODELING WITH ADVERSARIAL LEARNING

    Starting with version 2.0.3, DLStudio includes a separate module named
    AdversarialLearning for experimenting with different adversarial learning
    approaches for data modeling.  Adversarial Learning consists of simultaneously
    training a Generator and a Discriminator (or, a Generator and a Critic) with
    the goal of getting the Generator to produce from pure noise images that look
    like those in the training dataset.  When Generator-Discriminator pairs are
    used, the Discriminator's job is to become an expert at recognizing the
    training images so it can let us know when the generator produces an image
    that does not look like what is in the training dataset.  The output of the
    Discriminator consists of the probability that the input to the discriminator
    is like one of the training images.

    On the other hand, when a Generator-Critic pair is used, the Critic's job is
    to become adept at estimating the distance between the distribution that
    corresponds to the training dataset and the distribution that has been learned
    by the Generator so far.  If the distance between the distributions is
    differentiable with respect to the weights in the networks, one can backprop
    the distance and update the weights in an iterative training loop.  This is
    roughly the idea of the Wasserstein GAN that is incorporated as a
    Critic-Generator pair CG1 in the AdversarialLearning class.

    The AdversarialLearning class includes two kinds of adversarial networks for
    data modeling: DCGAN and WGAN.

    DCGAN is short for "Deep Convolutional Generative Adversarial Network", owes
    its origins to the paper "Unsupervised Representation Learning with Deep
    Convolutional Generative Adversarial Networks" by Radford et al.  DCGAN was
    the first fully convolutional network for GANs (Generative Adversarial
    Network). CNN's typically have a fully-connected layer (an instance of
    nn.Linear) at the topmost level.  For the topmost layer in the Generator
    network, DCGAN uses another convolution layer that produces the final output
    image.  And for the topmost layer of the Discriminator, DCGAN flattens the
    output and feeds that into a sigmoid function for producing scalar value.
    Additionally, DCGAN also gets rid of max-pooling for downsampling and instead
    uses convolutions with strides.  Yet another feature of a DCGAN is the use of
    batch normalization in all layers, except in the output layer of the Generator
    and the input layer of the Discriminator.  As the authors of DCGAN stated,
    while, in general, batch normalization stabilizes learning by normalizing the
    input to each layer to have zero mean and unit variance, applying BN at the
    output results in sample oscillation and model instability.  I have also
    retained in the DCGAN code the leaky ReLU activation recommended by the
    authors for the Discriminator.

    The other adversarial learning framework incorporated in AdversarialLearning
    is based on WGAN, which stands for Wasserstein GAN.  This GAN was proposed in
    the paper "Wasserstein GAN" by Arjovsky, Chintala, and Bottou.  WGANs is based
    on estimating the Wasserstein distance between the distribution that
    corresponds to the training images and the distribution that has been learned
    so far by the Generator.  The authors of WGAN have shown that minimizing this
    distance in an iterative learning framework also involves solving a minimax
    problem involving a Critic and a Generator. The Critic's job is to become an
    expert at recognizing the training data while, at the same time, distrusting
    the output of the Generator. Unlike the Discriminator of a GAN, the Critic
    merely seeks to estimate the Wasserstein distance between the true
    distribution associated with the training data and the distribution being
    learned by the Generator.  As the Generator parameters are kept fixed, the
    Critic seems to update its parameters that maximize the Wasserstein distance
    between the true and the fake distributions. Subsequently, as the Critic
    parameters are kept fixed, the Generator updates its learnable parameters in
    an attempt to minimize the same distance.

    Estimation of the Wasserstein distance in the above logic requires for the
    Critic to learn a 1-Lipschitz function. DLStudio implements the following two
    strategies for this learning:

        --  Clipping the values of the learnable parameters of the Critic network
            to a user-specified interval;

        --  Penalizing the gradient of the norm of the Critic with respect to its
            input.

    The first of these is implemented in the function "run_gan_code()" in the file
    AdversarialLearning.py and the second in the function
    "run_wgan_with_gp_code()" in the same file.

    If you wish to use the DLStudio platform to learn about data modeling with
    adversarial learning, your entry points should be the following scripts in the
    ExamplesAdversarialLearning directory of the distro:

        1.  dcgan_DG1.py

        2.  dcgan_DG2.py

        3.  wgan_CG1.py

        4.  wgan_with_gp_CG2.py

    The first script demonstrates the DCGAN logic on the PurdueShapes5GAN dataset.
    In order to show the sensitivity of the basic DCGAN logic to any variations in
    the network or the weight initializations, the second script introduces a
    small change in the network.  The third script is a demonstration of using the
    Wasserstein distance for data modeling through adversarial learning. The
    fourth script includes a gradient penalty in the critic logic called on by the
    third script.  The results produced by these scripts (for the constructor
    options shown in the scripts) are included in a subdirectory named
    RVLCloud_based_results.

   DATA MODELING WITH DIFFUSION

    Starting with Version 2.4.2, DLStudio includes a new module named
    GenerativeDiffusion for experimenting with what's known as "Denoising Diffusion".
    The Denoising Diffusion approach to data modeling is based on the interaction
    between two Markov Chains over T timesteps: A forward chain called the q-chain
    and a reverse chain called the p-chain.

    At each timestep in the forward q-chain, the data coursing through the chain is
    subject to a Markov transition that injects a small amount of zero-mean and
    isotropic Gaussian noise into the data. The goal in the q-chain is to inject
    sufficient noise at each timestep so that, at the end of the T timesteps, one
    will end up with pure isotropic noise.

    On the other hand, the goal in the reverse p-chain, is to start with zero-mean
    isotropic noise, subject it to a denoising Markov transition that gets rid of
    a bit of the noise in the input, do so at every timestep, until you have
    recovered a recognizable image at the end of the chain.

    While the amount of noise that is injected into the data at each transition in
    the forward q-chain is set by the user, how much denoising to carry out at the
    corresponding transition in the reverse p-chain is determined by a neural
    network whose job is to estimate the amount of denoising that, in a sense,
    would be "exact" opposite of the extent of diffusion carried at the
    corresponding transition in the forward q-chain.

    The computational scenario described above becomes particularly tractable for
    the case when you use isotropic Gaussian noise for both diffusion and
    denoising. When the transition probability at each timestep is isotropic
    Gaussian in the forward q-chain, it is easy to show that one can combine an
    arbitrary number of timesteps and get to the target timestep in a single hop.
    This leads to a particularly efficient algorithm described below for training
    the denoising neural network whose job is merely to estimate the best
    denoising transitions at each timestep:

    --- At each iteration of training the neural network, randomly choose a timestep
        t from the range that consists of T timesteps.

    --- Apply a single cumulative q-chain transition to the input training image
        that would be equivalent to taking the input image through t consecutive
        transitions in the q-chain.

    --- For each q-chain transition to the timestep t, use the Bayes' Rule to estimate
        the posterior probability q( x_{t-1} | x_t, x_0 ) from the Markov transition
        probability q( x_t | x0, x_{t-1} ).

    --- Use the posterior probabilities mentioned above as the target for training
        the neural network whose job is to estimate the transition probability p(
        x_{t-1} | x_t ) in the reverse p-chain.  The loss function for training
        the neural network could be the KL-Divergence between the posterior q(
        x_{t-1} | x_t, x_0 ) and the predicted p( x_{t-1} | x_t ).

        Another possibility for the loss would be the MSE error between the
        isotropic noise that was injected in the q-chain transition in question
        and the prediction of the same in the p-chain by using the posterior
        estimates for the mean and the variance using the transition probability
        p( x_{t-1} | x_t ) predicted by the neural network.

        Yet another possibility is to directly form an estimate for the input
        image x_0 using the above-mentioned posterior estimates for the mean and
        the variance and then construct an MSE loss based on the difference
        between the estimated x_0 and its true value.

    As should be clear from the above description, the sole goal of training the
    neural network is to make it an expert at the prediction of the denoising
    transition probabilities p( x_{t-1} | x_t ).  Typically, you carry out the
    training in an infinite loop while spiting out the checkpoints every so often.

    When you are ready to see the image generation power of a checkpoint, you
    start with isotropic Gaussian noise as the input and take it through all of
    the T timestep p-chain transitions that should lead to a recognizable image.

    The ExamplesDiffusion directory of DLStudio contains the following files that
    you will find helpful for your experiments with diffusion:

       0.  README

       1.  RunCodeForDiffusion.py

       2.  GenerateNewImageSamples.py

       3.  VisualizeSamples.py

    Any experiment with diffusion will involve all three scripts mentioned above.
    The script RunCodeForDiffusion.py is for training the neural network to become
    adept at learning the p-chain transition probabilities p( x_{t-1} | x_t ).
    The script GenerateNewImageSamples.py is for generating the images using the
    learned model.  This script deposits all the generated images in a numpy
    archive for ndarrays.  The last script, VisualizeSamples.py, is for extracting
    the individual images from that archive.  Please make sure that you have gone
    through the README mentioned above before starting your experiments with the
    diffusion part of DLStudio.

   SEQUENCE-TO-SEQUENCE LEARNING WITH ATTENTION

    Sequence-to-sequence learning (seq2seq) is about predicting an outcome
    sequence from a causation sequence, or, said another way, a target sequence
    from a source sequence.  Automatic machine translation is probably one of the
    most popular applications of seq2seq.  DLStudio uses English-to-Spanish
    translation to illustrate the programming idioms and the PyTorch structures
    you need for seq2seq.  To that end, Version 2.1.0 of DLStudio includes a
    new module named Seq2SeqLearning that consists of the following two
    demonstration classes:

        1.  Seq2SeqWithLearnableEmbeddings

        2.  Seq2SeqWithPretrainedEmbeddings

    As their names imply, the first is for seq2seq with learnable embeddings and
    the second for seq2seq with pre-trained embeddings like word2vec or fasttext.

    As mentioned above, the specific example of seq2seq addressed in my
    implementation code is translation from English to Spanish. (I chose this
    example because learning and keeping up with Spanish is one of my hobbies.)
    In the Seq2SeqWithLearnableEmbeddings class, the learning framework learns the
    best embedding vectors to use for the two languages involved. On the other
    hand, in the Seq2SeqWithPretrainedEmbeddings class, I use the word2vec
    embeddings provided by Google for the source language.  As to why I use the
    pre-training embeddings for just the source language is explained in the main
    comment doc associated with the class Seq2SeqWithPretrainedEmbeddings.

    Any modern attempt at seq2seq must include attention.  This is done by
    incorporating a separate Attention network in the Encoder-Decoder framework
    needed for seq2seq learning.  The goal of the attention network is to modify
    the current hidden state in the decoder using the attention units produced
    previously by the encoder for the source language sentence.  The main
    Attention model I have used is based on my understanding of the attention
    mechanism proposed by Bahdanau, Cho, and Bengio. You will see this attention
    code in a class named Attention_BCB in the seq2seq implementations named
    above. I have also provided another attention class named Attention_SR that is
    my implementation of the attention mechanism in the very popular NLP tutorial
    by Sean Robertson at the PyTorch website.  The URLs to both these attention
    mechanisms are in my Week 14 lecture material on deep learning at Purdue.

    The following two scripts in the ExamplesSeq2SeqLearning directory are your
    main entry points for experimenting with the seq2seq code in DLStudio:

        1.  seq2seq_with_learnable_embeddings.py

        2.  seq2seq_with_pretrained_embeddings.py

    With the first script, the overall network will learn on its own the best
    embeddings to use for representing the words in the two languages.  And, with
    the second script, the pre-trained word2vec embeddings from Google are used
    for the source language while the system learns the embeddings for the target
    language.

   DATA PREDICTION

    Let's say you have a sequence of observations recorded at regular intervals.
    These could, for example, be the price of a stock share recorded every hour;
    the hourly recordings of electrical load at your local power utility company;
    the mean average temperature recorded on an annual basis; and so on.  We want
    to use the past observations to predict the value of the next one.  Solving
    these types of problems is the focus of the DataPrediction module in the
    DLStudio platform.

    As a problem, data prediction has much in common with text analytics and
    seq2seq processing, in the sense that the prediction at the next time instant
    must be based on the previous observations in a manner similar to what we do
    in text analytics where the next word is understood taking into account all
    the previous words in a sentence.  However, there are three significant
    differences between purely numerical data prediction problems and text-based
    problems:

    1) Data Normalization: As you know by this time, neural networks require that
       your input data be normalized to the [0,1] interval, assuming it consists
       of non-negative numbers, or the [-1,1] interval otherwise.  When solving a
       sequential-data problem like text analytics, after you have normalized the
       input data (which is likely to consist of the numeric embeddings for the
       input words), you can forget about it.  You don't have that luxury when
       solving a data prediction problem.  As you would expect, the next value
       predicted by an algorithm must be at the same scale as the original input
       data.  This requires that the output of a neural-network-based prediction
       algorithm must be "inverse normalized".  And that, in turn, requires
       remembering the normalization parameters used in each channel of the input
       data.

    2) Input Data Chunking: The notion of a sentence that is important in text
       analytics does not carry over to the data prediction problem.  In general,
       you would want a prediction to be made using ALL of the past
       observations. When the sequential data available for training a predictor
       is arbitrarily long, as is the case with numerical data in general, you
       would need to decide how to "chunk" the data --- that is, how to extract
       sub-sequences from the data for the purpose of training a neural network.

    3) Datetime Conditioning: Time-series data typically includes a "datetime"
       stamp for each observation.  Representing datetime as a one-dimensional
       ever-increasing time value does not work for data prediction if the
       observations depend on the time of the day, the day of the week, the season
       of the year, and other such temporal effects.  Incorporating such effects
       in a prediction framework requires a multi-dimensional encoding of the
       datetime values.  See the doc page associated with the DataPrediction class
       for a longer explanation of this aspect of data prediction.

    Now that you understand how the data prediction problem differs from, say, the
    problem of text analytics, it is time for me to state my main goal in defining
    the DataPrediction module in the DLStudio platform.  I actually have two
    goals:

    (a) To deepen your understanding of a GRU.  At this point, your understanding
        of a GRU is likely to be based on calling PyTorch's GRU in your own code.
        Using a pre-programmed implementation for a GRU makes life easy and you
        also get a piece of highly optimized code that you can just call in your
        own code.  However, with a pre-programmed GRU, you are unlikely to get
        insights into how such an RNN is actually implemented.

    (b) To demonstrate how you can use a Recurrent Neural Network (RNN) for data
        prediction taking into account the data normalization, chunking, and
        datetime conditioning issues mentioned earlier.

    To address the first goal above, the DataPrediction class presented in this
    file is based on my pmGRU (Poor Man's GRU).  This GRU is my implementation of
    the "Minimal Gated Unit" GRU variant that was first presented by Joel Heck and
    Fathi Salem in their paper "Simplified Minimal Gated Unit Variations for
    Recurrent Neural Networks".  Its hallmark is that it combines the Update and
    the Reset gates of a regular GRU into a single gate called the Forget Gate.
    You could say that pmGRU is a lightweight version of a regular GRU and its use
    may therefore lead to a slight loss of accuracy in the predictions.  You will
    find it educational to compare the performance you get with my pmGRU-based
    implementation with an implementation that uses PyTorch's GRU for the same
    dataset.

    Your main entry point for experimenting with the DataPrediction module is
    the following script in the ExamplesDataPrediction directory of the DLStudio
    distribution:

        power_load_prediction_with_pmGRU.py

    Before you can run this script, you would need to download the training
    dataset used in this example.  See the "For Data Prediction" part of the "The
    Datasets Included" section of the doc page for that.

   TRANSFORMERS

    For Seq2SeqLearning learning, the goal of a Transformer based implementation
    is the same as described earlier in this Introduction except that now you
    completely forgo recurrence. That is, you only use the mechanism of attention
    to translate sentences from a source language into sentences in the target
    language. For such applications, you need two forms of attention:
    self-attention and cross-attention.  Self-attention refers to the
    intra-sentence relationships between the words and cross-attention refers to
    the inter-sentence relationships between the words in a pair of sentences, one
    in the source language and the other in the target language. I have explained
    these concepts in great detail in the doc sections of the inner classes in the
    Transformers class.  In particular, I have explained the concept of the
    "dot-product" attention in which each word puts out three things: a Query
    Vector Q, a Key Vector K, and a Value Vector V. By taking the dot-product of
    the Query Vector Q of a word with the Key Vector K for all the words in a
    sentence, the neural network gets a measure of the extent to which each word
    in a sentence is important to every other word.  These dot-product values are
    then used as weights on the Value Vectors, V, for the individual words.  Cross
    attention works in a similar manner, except that now you take the dot-products
    of the Q vectors in the target-language sentence with the K vectors in the
    corresponding source-language sentence for producing the weight vectors that
    tell us how to weight the source-language Value Vectors vis-a-vis the words in
    the target language.

    In addition to their use in Seq2SeqLearning learning, transformers are now
    also used widely in computer vision applications. As a nod to their adoption
    in the learning required for solving CV problems, I have created a new class
    named visTransformer in the Transformers module of DLStudio.  The transformer
    part of the logic in a visTransformer is identical to what it is in a
    transformer class for Seq2SeqLearning learning.  That logic kicks in after you
    have divided an image into patches and you represent each patch by an
    embedding vector --- in exactly the same as when you represent a word or a
    token in a sentence by an embedding vector.

    You will see three different implementations of the transformer architecture in
    the Transformers module of the DLStudio platform:

          TransformerFG

          TransformerPreLN

          visTransformer

    The "FG" suffix TransformerFG stands for "First Generation"; the "PreLN"
    suffix in TransformerPreLN for "Pre LayerNorm"; and, finally, the name
    visTransformer stands for "Vision Transformer."

    TransformerFG is my implementation of the transformer architecture proposed in
    the famous paper by Vaswani et al.  and TransformerPreLN my implementation of
    the same architecture but with the modification suggested by Xiong et al. for
    more stable learning.  Since, the modification is small from an architectural
    standpoint, I could have combined both transformer types in the same
    implementation with some conditional logic to account for the differences.
    However, I have chosen to keep them separate mostly for educational purposes.
    Further details on these implementations are in the documentation blocks in
    the Transformers module.

    The visTransformer implementation is based on the paper "An Image is Worth
    16x16 Words: Transformers for Image Recognition at Scale'' by Dosovitskiy et
    al.

    If you want to use my code for learning the main ideas related to how to
    create purely attention based networks, your starting point for that should be
    the following scripts in the ExamplesTransformers directory of the DLStudio
    distribution:

        seq2seq_with_transformerFG.py
        seq2seq_with_transformerPreLN.py

    These scripts uses the following English-Spanish sentence-pairs dataset

           en_es_xformer_8_90000.tar.gz

    that contains 90,000 pairs of English-Spanish sentences with the maximum
    number of words in each sentence limited to 8 words.  For processing by the
    attention networks, each sentence is enclosed in <SOS> and <EOS> tokens, with
    the former standing for "Start of Sentence" and the latter for "End of
    Sentence".

    And if you wish to use visTransformer for solving image recognition problems
    with a transformer based implementation, your starting point should be
    the following scripts in the same ExamplesTransformers directory that was
    mentioned above:

          image_recog_with_visTransformer.py
          test_checkpoint_for_visTransformer.py

    Both these script use the CIFAR10 dataset for demonstrating image recognition.

   METRIC LEARNING

    One of the main idea of metric learning is to learn a mapping from images and
    text to their embedding vector representations in such a way that the embeddings
    for what are supposed to be similar data entities are pulled together and those
    for dissimilar entities are pulled as far apart as possible.  After such a
    mapping function is learned, you can take, say, a query image (whose class label
    is not known), run it through the network to find its embedding vector, and,
    subsequently, assign to the query images the class label of the nearest
    training-image neighbor in the embedding space.  As explained in my Metric
    Learning lecture in the Deep Learning class at Purdue, this approach to
    classification is likely to work under data circumstances when the older network
    classifiers fail.

    This approach to learning leads to some pretty impressive models that are trained
    through embedding vectors in a multi-modality (images and text) embedding space.
    Operating in a purely self-supervised (or, unsupervised) manner, now you can try
    to minimize the distances the embedding vectors associated with the images and
    those associated with the captions that accompany the images.

    Your most basic demonstrations of supervised metric learning will consist of
    single-modality supervised learning of the class labels associated with the
    images. These demonstrations are based on the use of one of two loss functions:
    Pairwise Contrastive Loss and Triplet Loss.

    On the other hand, for demonstrating the power of metric learning in unsupervised
    settings, you are likely to use InfoNCE and PatchNCE losses.,

    INTRODUCTION TO SUPERVISED METRIC LEARNING:

    Regarding supervised metric learning, the Pairwise Contrastive Loss is based on
    extracting all the Positive and the Negative Pairs of images form a batch.  For a
    Positive Pair, both the images in the pair must have the same label, and, for a
    Negative Pair, the two labels must be different.  A minimization of the Pairwise
    Contrastive Loss should decrease the distance between the embedding vectors for a
    Positive Pair and increase the distance between the embedding vectors for a
    Negative Pair.  If the two embeddings in a Negative Pair are already well
    separated, there would be no point to have them contribute to the loss
    calculation.  This is accomplished by incorporating the notion of a margin.  The
    idea is that we want to increase the distance between the two embeddings in a
    Negative Pair to the value specified by the Margin and no more. Should it be the
    case that two such embeddings are already separated by a distance greater than
    the Margin, we do not include such pairs in the loss calculation.

    Triplet Loss, on the other hand, starts with the notion of triplets (i,j,k) of
    the indices for triplets of images in a batch, with no two of the indices being
    the same. The image indexed by i is considered to be the Anchor in the triplet,
    the image indexed by j as Positive, and the one by k as the Negative.  We also
    refer to such triplets with the notation (Anchor, Pos, Neg).  We again need the
    notion of a Margin. When we form the (Anchor, Pos, Neg) triplets from a batch, we
    focus on only those Neg images that are further away from the Anchor than the Pos
    image, but no farther than the (Anchor,Pos) distance plus the Margin.  Including
    the Negative images that are closer than the (Anchor, Pos) distance can make the
    learning unstable and including the Negatives that farther than the "(Anchor,Pos)
    plus the Margin" distance is likely to be wasteful.

    Forming set of Positive and Negative Pairs for the Pairwise Contrastive Loss and
    forming sets of Triplets for the Triplet Loss is referred to as Mining a batch.
    This allows us to talk about concepts like "negative-hard mining", "negative
    semi-hard mining", etc., that depend on the relative distances between the images
    in the Negative Pairs and the distance of a negative vis-a-vis those in a
    Positive Pair.

    If you wish to use this module to learn about supervised metric learning, your
    entry points should be the following scripts in the ExamplesMetricLearning
    directory of the DLStudio distro:

        1.  example_for_pairwise_contrastive_loss_supervised.py

        2.  example_for_triplet_loss_supervised.py

    The sort of results you will get with these two scripts are shown in my Week 9
    lecture slides at Purdue's Deep Learning website.

    INTRODUCTION TO UNSUPERVISED METRIC LEARNING:

    Considering that unsupervised learning (also called self-supervised learning) is
    the Holy Grail of machine learning, the degree of attention it is getting in deep
    learning should not be surprising. What's adding to the excitement is the huge
    success that self-supervised learning has had in the construction of LLMs (Large
    Language Models). LLMs start out as the models built with pure self-learning that
    are subsequently further fine-tuned with supervised and reinforcement learning.

    In the image domain, people have shown noteworthy results with unsupervised
    learning when the loss functions are based on what has come to be known as Noise
    Contrastive Estimation (NCE).  In particular, I am talking about the two loss
    functions InfoNCE and PatchNCE.

    In Version 2.5.6 of DLStudio, I enlarged the scope of the MetricLearning module
    by incorporating in it the function named

               run_code_for_unsupervised_training_with_InfoNCE_loss()

    applies InfoNCE loss to the clustering of the CFAR-10 dataset. The example script
    that you can call on to execute the above 'run_code' is:

               example_for_InfoNCE_loss_unsupervised.py

    in the ExamplesMetricLearning directory of DLStudio.  For the kinds of
    unsupervised clustering results you can expect to see with the above script, see
    my Lecture 9 slides at Purdue's Deep Learning website.

    PROGRAMMING CHALLENGES:

    To calculate the Pairwise Contrastive Loss, you must be first extract Positive
    and Negative Pairs from a batch.  A Positive Pair means that both the embeddings
    in the pair carry the same class label and a Negative Pair means that the two
    embeddings in the pair have dissimilar labels.  From a programming standpoint,
    the challenge is how to form these pairs without scanning through a batch with
    'for' loops --- since such loops are an anathema to any GPU based processing of
    data. What comes to our rescue are a combination of the broadcast properties of
    tensors (inherited from numpy) and tensor-based Boolean logic. For example, by
    comparing a column tensor of the sample labels in a batch with a row tensor of
    the same and testing for the equality of the sample labels, you instantly have a
    2D array whose (i,j) element is True if the i-th and the j-th batch samples carry
    the same class label.

    Even after you have constructed the Positive and the Negative Pairs from a batch,
    your next mini-challenge is to reformat the batch sample indices in the pairs in
    order to conform to the input requirements of PyTorch's loss function
    torch.nn.CosineEmbeddingLoss.  The input consists of three tensors, the first two
    of which are of shape (N,M), where N is the total number of pairs extracted from
    the batch and M the size of the embedding vectors. The first such NxM tensor
    corresponds to the fist batch sample index in each pair. And the second such NxM
    tensor corresponds to the second batch sample index in each pair. The last tensor
    in the input args to the CosineEmbeddingLoss loss function is of shape Nx1, in
    which the individual values are either +1.0 or -1.0, depending on whether the
    pair formed by the first two embeddings is a Positive Pair or a Negative Pair.

    The programming challenge for calculating the Triplet Loss is similar to what it
    is for the Pairwise Contrastive Loss: How to extract all the triplets from a
    batch without using 'for' loops.  The first step is to form array index triplets
    (i,j,k) in which two indices are the same.  If B is the batch size, this is
    easily done by first forming a BxB array that is the logical negation of a
    Boolean array of the same size whose True values are only on the diagonal.  We
    can reshape this BxB Boolean array into three BxBxB shaped Boolean arrays, the
    first in which the True values exist only where i and j values are not the same,
    the second in which the True values occur only when i and k index values are not
    the same, and the third that has True values only when the j and k index values
    are not the same.  By taking a logical AND of all three BxBxB Boolean arrays, we
    get the result we want.  Next, we construct a BxBxB Boolean tensor in which the
    True values occur only where the first two index values imply that their
    corresponding labels are identical and where the last index corresponds to a
    label that does not agree with that for the first two index values.

    Even after you have formed the triplets, your next mini-challenge is to reformat
    the triplets into what you need to feed into the PyTorch loss function
    torch.nn.TripletMarginLoss. The loss function takes three arguments, each of
    shape (N,M) where N is the total number of triplets extracted from the batch and
    M the size of the embedding vectors.  The first such NxM tensor is the Anchor
    embedding vectors, the second for the Positive embedding vectors, the last for
    the Negative embedding vectors.

INSTALLATION

    The DLStudio class was packaged using setuptools.  For installation, execute the
    following command in the source directory (this is the directory that contains
    the setup.py file after you have downloaded and uncompressed the package):

            sudo python3 setup.py install

    On Linux distributions, this will install the module file at a location that
    looks like

             /usr/local/lib/python3.10/dist-packages/

    If you do not have root access, you have the option of working directly off
    the directory in which you downloaded the software by simply placing the
    following statements at the top of your scripts that use the DLStudio class:

            import sys
            sys.path.append( "pathname_to_DLStudio_directory" )

    To uninstall DLStudio, simply delete the source code directory, locate where
    DLStudio was installed with "locate DLStudio" and delete those files.  As
    mentioned above, the full pathname to the installed version is likely to look
    like /usr/local/lib/python3.10/dist-packages/DLStudio*

    If you want to carry out a non-standard install of DLStudio, look up the
    on-line information on Disutils by pointing your browser to

              http://docs.python.org/dist/dist.html

USAGE

    If you want to specify a network with just a configuration string, your usage
    of the module is going to look like:

        from DLStudio import *

        convo_layers_config = "1x[128,3,3,1]-MaxPool(2) 1x[16,5,5,1]-MaxPool(2)"
        fc_layers_config = [-1,1024,10]

        dls = DLStudio(   dataroot = "/home/kak/ImageDatasets/CIFAR-10/",
                          image_size = [32,32],
                          convo_layers_config = convo_layers_config,
                          fc_layers_config = fc_layers_config,
                          path_saved_model = "./saved_model",
                          momentum = 0.9,
                          learning_rate = 1e-3,
                          epochs = 2,
                          batch_size = 4,
                          classes = ('plane','car','bird','cat','deer',
                                     'dog','frog','horse','ship','truck'),
                          use_gpu = True,
                          debug_train = 0,
                          debug_test = 1,
                      )

        configs_for_all_convo_layers = dls.parse_config_string_for_convo_layers()
        convo_layers = dls.build_convo_layers2( configs_for_all_convo_layers )
        fc_layers = dls.build_fc_layers()
        model = dls.Net(convo_layers, fc_layers)
        dls.show_network_summary(model)
        dls.load_cifar_10_dataset()
        dls.run_code_for_training(model)
        dls.run_code_for_testing(model)


    or, if you would rather experiment with a drop-in network, your usage of the
    module is going to look something like:

        dls = DLStudio(   dataroot = "/home/kak/ImageDatasets/CIFAR-10/",
                          image_size = [32,32],
                          path_saved_model = "./saved_model",
                          momentum = 0.9,
                          learning_rate = 1e-3,
                          epochs = 2,
                          batch_size = 4,
                          classes = ('plane','car','bird','cat','deer',
                                     'dog','frog','horse','ship','truck'),
                          use_gpu = True,
                          debug_train = 0,
                          debug_test = 1,
                      )

        exp_seq = DLStudio.ExperimentsWithSequential( dl_studio = dls )   ## for your drop-in network
        exp_seq.load_cifar_10_dataset_with_augmentation()
        model = exp_seq.Net()
        dls.show_network_summary(model)
        exp_seq.run_code_for_training(model)
        exp_seq.run_code_for_testing(model)


    This assumes that you copy-and-pasted the network you want to
    experiment with in a class like ExperimentsWithSequential that is
    included in the module.

CONSTRUCTOR PARAMETERS

    batch_size:  Carries the usual meaning in the neural network context.

    classes:  A list of the symbolic names for the classes.

    convo_layers_config: This parameter allows you to specify a convolutional network
                  with a configuration string.  Must be formatted as explained in the
                  comment block associated with the method
                  "parse_config_string_for_convo_layers()"

    dataroot: This points to where your dataset is located.

    debug_test: Setting it allow you to see images being used and their predicted
                 class labels every 2000 batch-based iterations of testing.

    debug_train: Does the same thing during training that debug_test does during
                 testing.

    epochs: Specifies the number of epochs to be used for training the network.

    fc_layers_config: This parameter allows you to specify the final
                 fully-connected portion of the network with just a list of
                 the number of nodes in each layer of this portion.  The
                 first entry in this list must be the number '-1', which
                 stands for the fact that the number of nodes in the first
                 layer will be determined by the final activation volume of
                 the convolutional portion of the network.

    image_size:  The heightxwidth size of the images in your dataset.

    learning_rate:  Again carries the usual meaning.

    momentum:  Carries the usual meaning and needed by the optimizer.

    path_saved_model: The path to where you want the trained model to be
                  saved in your disk so that it can be retrieved later
                  for inference.

    use_gpu: You must set it to True if you want the GPU to be used for training.

PUBLIC METHODS

    (1)  build_convo_layers()

         This method creates the convolutional layers from the parameters in the
         configuration string that was supplied through the constructor option
         'convo_layers_config'.  The output produced by the call to
         'parse_config_string_for_convo_layers()' is supplied as the argument to
         build_convo_layers().

    (2)  build_fc_layers()

         From the list of ints supplied through the constructor option
         'fc_layers_config', this method constructs the fully-connected portion of
         the overall network.

    (3)  check_a_sampling_of_images()

         Displays the first batch_size number of images in your dataset.

    (4)  display_tensor_as_image()

         This method will display any tensor of shape (3,H,W), (1,H,W), or just
         (H,W) as an image. If any further data normalizations is needed for
         constructing a displayable image, the method takes care of that.  It has
         two input parameters: one for the tensor you want displayed as an image
         and the other for a title for the image display.  The latter parameter is
         default initialized to an empty string.

    (5)  load_cifar_10_dataset()

         This is just a convenience method that calls on Torchvision's
         functionality for creating a data loader.

    (6)  load_cifar_10_dataset_with_augmentation()

         This convenience method also creates a data loader but it also includes
         the syntax for data augmentation.

    (7)  parse_config_string_for_convo_layers()

         As mentioned in the Introduction, DLStudio allows you to specify a
         convolutional network with a string provided the string obeys the
         formatting convention described in the comment block of this method.
         This method is for parsing such a string. The string itself is presented
         to the module through the constructor option 'convo_layers_config'.

    (8)  run_code_for_testing()

         This is the method runs the trained model on the test data. Its output is
         a confusion matrix for the classes and the overall accuracy for each
         class.  The method has one input parameter which is set to the network to
         be tested.  This learnable parameters in the network are initialized with
         the disk-stored version of the trained model.

    (9)  run_code_for_training()

         This is the method that does all the training work. If a GPU was detected
         at the time an instance of the module was created, this method takes care
         of making the appropriate calls in order to transfer the tensors involved
         into the GPU memory.

    (10) save_model()

         Writes the model out to the disk at the location specified by the
         constructor option 'path_saved_model'.  Has one input parameter for the
         model that needs to be written out.

    (11) show_network_summary()

         Displays a print representation of your network and calls on the
         torchsummary module to print out the shape of the tensor at the output of
         each layer in the network. The method has one input parameter which is
         set to the network whose summary you want to see.

THE MAIN INNER CLASSES OF THE DLStudio CLASS

    By "inner classes" I mean the classes that are defined within the class file
    DLStudio.py in the DLStudio directory of the distribution.  The DLStudio platform
    also includes several modules that reside at the same level of software
    abstraction as the main DLStudio class defined in the DLStudio.py file.

    The purpose of the following two inner classes is to demonstrate how you can
    create a custom class for your own network and test it within the framework
    provided by DLStudio.

    (1)  class ExperimentsWithSequential

         This class is my demonstration of experimenting with a network that I found
         on GitHub.  I copy-and-pasted it in this class to test its capabilities.
         How to call on such a custom class is shown by the following script in the
         Examples directory:

                     playing_with_sequential.py

    (2)  class ExperimentsWithCIFAR

         This is very similar to the previous inner class, but uses a common example
         of a network for experimenting with the CIFAR-10 dataset. Consisting of
         32x32 images, this is a great dataset for creating classroom demonstrations
         of convolutional networks.  As to how you should use this class is shown in
         the following script

                    playing_with_cifar10.py

         in the Examples directory of the distribution.

    (4)  class BMEnet (for connection skipping experiments)

         This class is for investigating the power of skip connections in deep
         networks.  Skip connections are used to mitigate a serious problem
         associated with deep networks --- the problem of vanishing gradients.  It
         has been argued theoretically and demonstrated empirically that as the depth
         of a neural network increases, the gradients of the loss become more and
         more muted for the early layers in the network.

    (5)  class DetectAndLocalize

         The code in this inner class is for demonstrating how the same convolutional
         network can simultaneously solve the twin problems of object detection and
         localization.  Note that, unlike the previous four inner classes, class
         DetectAndLocalize comes with its own implementations for the training and
         testing methods. The main reason for that is that the training for detection
         and localization must use two different loss functions simultaneously, one
         for classification of the objects and the other for regression. The function
         for testing is also a bit more involved since it must now compute two kinds
         of errors, the classification error and the regression error on the unseen
         data. Although you will find a couple of different choices for the training
         and testing functions for detection and localization inside
         DetectAndLocalize, the ones I have worked with the most are those that are
         used in the following two scripts in the Examples directory:

              run_code_for_training_with_CrossEntropy_and_MSE_Losses()

              run_code_for_testing_detection_and_localization()

    (6)  class CustomDataLoading

         This is a testbed for experimenting with a completely grounds-up attempt at
         designing a custom data loader.  Ordinarily, if the basic format of how the
         dataset is stored is similar to one of the datasets that Torchvision knows
         about, you can go ahead and use that for your own dataset.  At worst, you
         may need to carry out some light customizations depending on the number of
         classes involved, etc.  However, if the underlying dataset is stored in a
         manner that does not look like anything in Torchvision, you have no choice
         but to supply yourself all of the data loading infrastructure.  That is what
         this inner class of the DLStudio module is all about.

    (7)  class SemanticSegmentation

         This inner class is for working with the mUNet convolutional network for
         semantic segmentation of images.  This network allows you to segment out
         multiple objects simultaneously from an image.  Each object type is assigned
         a different channel in the output of the network.  So, for segmenting out
         the objects of a specified type in a given input image, all you have to do
         is examine the corresponding channel in the output.

    (8)  class Autoencoder

         The man reason for the existence of this class in DLStudio is for it to
         serve as the base class for VAE (Variational Auto-Encoder).  That way, the
         VAE class can focus exclusively on the random-sampling logic specific to
         variational encoding while the base class Autoencoder does the convolutional
         and transpose-convolutional heavy lifting associated with the usual
         encoding-decoding of image data.

    (9)  class VAE

         As mentioned above, VAE stands for "Variational Auto-Encoder". This class
         extends the base class Autoencoder with the variational logic for learning
         the latent distribution representation of a training dataset.  As to what
         that means, latent representations are based on the assumption that the
         "essence" of each sample of the input data can be represented by a vector in
         a lower-dimensional space that has a much simpler distribution (ideally an
         isotropic zero-mean and unit-covariance distribution) than what is possessed
         by the original training data samples.  Separating out the "essence" from
         the rest in the input images in this manner is referred to as
         "Disentanglement Learning."  One you have learned the latent distribution,
         you can sample it and embellish it in a Decoder to produce an output that is
         useful to the user.

    (10) class VQVAE

         VQVAE stands for "Vector Quantized Variational Auto Encoder", which is also
         frequently represented by the acronym VQ-VAE.  The concept of VQ-VAE was
         formulated in the 2018 paper "Neural Discrete Representation Learning" by
         van den Oord, Vinyals, and Kavukcuoglu. VQVAE is an important architecture
         in deep learning because it teaches us about what has come to be known as
         "Codebook Learning" for created discrete representations for images. The
         Codebook learned consists of a fixed number of embedding vectors.
         Subsequently, in an overall Encoder-Decoder architectures, you replace each
         pixel at the output of the Encoder with the closest embedding vector in the
         Codebook. The dimensionality you associate with a pixel at the output of the
         Encoder is the number of channels at that point.

    (11) class VQGAN

         VQGAN stands for "Vector Quantized Generative Adversarial Network".  There
         are two main differences between VQVAE and VQGAN: (1) The
         Encoder-VQ-Decoder" network in a VQGAN is trained through adversarial
         learning by encapsulating it in a GAN.  The concept of a GAN requires a
         Discriminator and a Generator, with the Discriminator trained to become an
         expert at recognizing the training images, while at the same time
         disbelieving the output of the Generator as looking like it came from the
         training set of images.  While you can use any run-of-the-mill discriminator
         in such adversarial learning, the Generator is nothing but our Encoder-VQ-
         Decoder network.  And (2) After training the VQGAN network, you train a
         transformer-based network for autoregressive modeling of the codebook
         indices as produced by the VectorQuantizer (VQ).  The codebook indices are
         the integer index values that point to the codebook vectors that are chosen
         as being the closest to the embedding vectors at the output of the Encoder.

    (12) class TextClassification

         The purpose of this inner class is to be able to use DLStudio for simple
         experiments in text classification.  Consider, for example, the problem of
         automatic classification of variable-length user feedback: you want to
         create a neural network that can label an uploaded product review of
         arbitrary length as positive or negative.  One way to solve this problem is
         with a Recurrent Neural Network in which you use a hidden state for
         characterizing a variable-length product review with a fixed-length state
         vector.

    (13) class TextClassificationWithEmbeddings

         This class has the same functionality as the previous text processing class
         except that now we use embeddings for representing the words.  Word
         embeddings are fixed-sized numerical vectors that are learned on the basis
         of the contextual similarity of the words. The implementation of this inner
         class uses the pre-trained 300-element word2vec embeddings as made available
         by Google for 3 million words and phrases drawn from the Google News
         dataset. In DLStudio, we access these embeddings through the popular gensim
         library.

MODULES IN THE DLStudio PLATFORM

    As stated at the beginning of the previous section, a module resides at the same
    level of software abstraction in the distribution directory as the main DLStudio
    class in the platform. Each module is defined in a separate subdirectory at the
    top level of the distribution directory.  While the main DLStudio class is
    defined in a subdirectory of the same name, the other subdirectories that contain
    the definitions for the modules are named AdversarialLearning, Seq2SeqLearning,
    DataPrediction, Transformers, GenerativeDiffusion, and MetricLearning.  What
    follows in this section are additional details regarding these co-classes:

    AdversarialLearning:
    ===============

    As I mentioned in the Introduction, the purpose of the AdversarialLearning class
    is to demonstrate probabilistic data modeling using Generative Adversarial
    Networks (GAN).  GANs use Discriminator-Generator or Critic-Generator pairs to
    learn probabilistic data models that can subsequently be used to create new image
    instances that look surprisingly similar to those in the training set.  At the
    moment, you will find the following three such pairs inside the
    AdversarialLearning class:

        1.  Discriminator-Generator DG1      ---  implements the DCGAN logic

        2.  Discriminator-Generator DG2      ---  a slight modification of the previous

        3.  Critic-Generator CG1                   ---  implements the Wasserstein GAN logic

        4.  Critic-Generator CG2                   ---  adds the Gradient Penalty to the
                                                                      Wasserstein GAN logic.

    In the ExamplesAdversarialLearning directory of the distro you will see the
    following scripts that demonstrate adversarial learning as incorporated in the
    above networks:

        1.  dcgan_DG1.py                     ---  demonstrates the DCGAN DG1

        2.  dcgan_DG2.py                     ---  demonstrates the DCGAN DG2

        3.  wgan_CG1.py                      ---  demonstrates the Wasserstein GAN CG1

        4.  wgan_with_gp_CG2.py        ---  demonstrates the Wasserstein GAN CG2

    All of these scripts use the training dataset PurdueShapes5GAN that consists of
    20,000 images containing randomly shaped, randomly colored, and randomly
    positioned objects in 64x64 arrays.  The dataset comes in the form of a gzipped
    archive named "datasets_for_AdversarialLearning.tar.gz" that is provided under
    the link "Download the image dataset for AdversarialLearning" at the top of the
    HTML version of this doc page.  See the README in the ExamplesAdversarialLearning
    directory for how to unpack the archive.

    GenerativeDiffusion
    ===============

    During the last couple of years, Denoising Diffusion has emerged as a strong
    alternative to generative data modeling.  As mentioned previously in the
    Introduction section on this webpage, learning a data model through diffusion
    involves two Markov chains, one that incrementally diffuses a training image
    until it turns into pure noise, and the other that incrementally denoises pure
    noise until what you see is something like an image in your training dataset.
    The former is called the q-chain and the latter the p-chain.  The incremental
    diffusion in the q-chain is with known amount of Gaussian isotropic noise.  In
    the p-chain, on the other hand, the goal is for a neural network to learn from
    the diffusion carried out in the q-chain how to carry out a denoising operation
    that would amount to a reversal of that diffusion.

    All of the key elements of the code that I have presented in the
    GenerativeDiffusion module are extracted from OpenAI's "Improved Diffusion"
    project at GitHub that presents a PyTorch implementation of the work authored by
    Nichol and Dhariwal in their very famous paper "Improved Denoising Diffusion
    Probabilistic Models". See the beginning part of the doc page for the
    GenerativeDiffusion module for URLs to the GitHub code and their publication.

    If you want to play with the code in GenerativeDiffusion, your starting point
    should be the README in the ExamplesDiffusion directory of DLStudio distribution.
    The script RunCodeForDiffusion.py in that directory is what you will need to use
    to train the model for your own dataset.  As mentioned earlier, the goal of
    training is to make the neural network adept at estimating the p-chain transition
    probabilities p( x_{t-1} | x_t ) at all timesteps.  Once you have finished
    training, you would need to execute the script GenerateNewImageSamples.py for
    generating new images.

    Seq2SeqLearning:
    ===========

    As mentioned earlier in the Introduction, sequence-to-sequence learning (seq2seq)
    is about predicting an outcome sequence from a causation sequence, or, said
    another way, a target sequence from a source sequence.  Automatic machine
    translation is probably one of the most popular applications of seq2seq.
    DLStudio uses English-to-Spanish translation to illustrate the programming idioms
    and the PyTorch structures you would need for writing your own code for seq2seq.

    Any attempt at seq2seq for machine translation must answer the following question
    at the outset: How to represent the words of a language for neural-network based
    processing? In general, you have two options: (1) Have your overall network learn
    on its own what are known as vector embeddings for the words; or (2) Use
    pre-trained embeddings as provided by word2vec or Fasttext.

    After you have resolved the issue of word representation, your next challenge is
    how to implement the attention mechanism that you're going to need for aligning
    the similar grammatical units in the two languages. The seq2seq code demonstrated
    in this module uses the attention model proposed by Bahdanau, Cho, and Bengio in
    the form of a separate Attention class.  The name of this attention class is
    Attention_BCB.  In a separate attention class named Attention_SR, I have also
    included the attention mechanism used by Sean Robertson in his very popular NLP
    tutorial at the main PyTorch website.

    Seq2SeqLearning contains the following two inner classes for illustrating
    seq2seq:

        1.  Seq2SeqWithLearnableEmbeddings

        2.  Seq2SeqWithPretrainedEmbeddings

    In the first of these, Seq2SeqWithLearnableEmbeddings, the words embeddings are
    learned automatically by using the nn.Embeddings layer. On the other hand, in
    Seq2SeqWithPretrainedEmbeddings, I have used the word2vec embeddings for the
    source language English and allowed the system to learn the embeddings for the
    target language Spanish.

    In order to become familiar with these classes, your best entry points would be
    the following scripts in the ExamplesSeq2SeqLearning directory:

                seq2seq_with_learnable_embeddings.py

                seq2seq_with_pretrained_embeddings.py

    DataPrediction
    ==========

    As mentioned earlier in the Introduction, time-series data prediction differs
    from the more symbolic sequence-based learning frameworks with regard to the
    following: (1) Data normalization; (2) Data chunking; and (3) Datetime
    conditioning. The reason I mention data normalization is that now you have to
    remember the scaling parameters used for data normalization since you are going
    to need to inverse-normalize the predicted values. You would want to your
    predicted values to be at the same scale as the time-series observations.  The
    second issue, data chunking, refers to the fact that the notion of a "sentence"
    does not exist in time-series data.  What that implies that the user has to
    decide how to extract sequences from arbitrary long time-series data for training
    a prediction framework.  Finally, the the third issue, datetime conditioning,
    refers to creating a multi-dimensional encoding for the datetime stamp associated
    with each observation to account for the diurnal, weekly, seasonal, and other
    such temporal effects.

    The data prediction framework in the DataPrediction part of DLStudio is based on
    the following inner class:

        pmGRU

    for "Poor Man's GRU".  This GRU is my implementation of the "Minimal Gated Unit"
    GRU variant that was first presented by Joel Heck and Fathi Salem in their paper
    "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks" and it
    combines the Update and the Reset gates of a regular GRU into a single gate
    called the Forget Gate.

    My reason for using pmGRU is purely educational. While you are likely to use
    PyTorch's GRU for any production work that requires a GRU, using a pre-programmed
    piece of code makes it more difficult to gain insights into how the logic of a
    GRU (especially with regard to the gating action it needs) is actually
    implemented.  The implementation code shown for pmGRU is supposed to help remedy
    that.

    As I mentioned in the Introduction, your main entry point for experimenting with
    data prediction is the following script in the ExamplesDataPrediction directory
    of the DLStudio distribution:

        power_load_prediction_with_pmGRU.py

    However, before you can run this script, you would need to download the training
    dataset used in this example.  See the "For Data Prediction" part of the "The
    Datasets Included" section of the doc page for that.

    Transformers
    ========

    The code in this module of DLStudio consists of three different implementations
    of the transformer architecture: (1) TransformerFG, (2) TransformerPreLN, and (3)
    visTransformer.  The first two of these are meant for seq2seq learning, as in
    language translation, and the last is for solving the problem of image
    recognition.

    TransformerFG is my implementation of the architecture as conceptualized in the
    famous paper "Attention is All You Need" by Vaswani et el.  And TransformerPreLN
    is my implementation of the original idea along with the modifications suggested
    by Xiong et al. in their paper "On Layer Normalization in the Transformer
    Architecture" for more stable learning.  The two versions of transformers differ
    in only one respect: The placement of the LayerNorm in relation to the
    architectural components related to attention and the feedforward network.
    Literally, the difference is small, yet its consequences are significant
    regarding the stability of learning.  Finally, visTransformer is my
    implementation of the Vision Transformer as presented in the famous paper "An
    Image is Worth 16x16$ Words: Transformers for Image Recognition at Scale'' by
    Dosovitskiy et al.

    The fundamentals of how the attention works in all three transformer based
    implementations in the Transformers module are exactly the same.  For
    self-attention, you associate a Query Vector Q_i and a Key Vector K_i with each
    word w_i in a sentence.  For a given w_i, the dot product of its Q_i with the K_j
    vectors for all the other words w_j is a measure of how related w_i is to each
    w_j with regard to what's needed for the translation of a source sentence into
    the target sentence.  One more vector you associate with each word w_i is the
    Value Vector V_i.  The value vectors for the words in a sentence are weighted by
    the output of the activation nn.LogSoftmax applied to the dot-products.

    The self-attention mechanism described above is half of what goes into each base
    encoder of a transformer, the other half is a feedforward network (FFN). The
    overall encoder consists of a cascade of these base encoders.  In my
    implementation, I have referred to the overall encoder as the MasterEncoder.  The
    MasterDecoder also consists of a cascade of base decoders.  A base decoder is
    similar to a base encoder except for there being a layer of cross-attention
    interposed between the self-attention layer and the feedforward network.

    Referring to the attention half of each base encoder or a decoder as one
    half-unit and the FFN as the other half unit, the problem of vanishing gradients
    that would otherwise be caused by the depth of the overall network is mitigated
    by using LayerNorm and residual connections.  In TransformerFG, on the encoder
    side, LayerNorm is applied to the output of the self-attention layer and the
    residual connection wraps around both.  Along the same lines, LayerNorm is
    applied to the output of FFN and the residual connection wraps around both.

    In TransformerPreLN, on the other hand, LayerNorm is applied to the input to the
    self-attention layer and residual connection wraps around both.  Similarly,
    LayerNorm is applied to the input to FFN and the residual connection wraps around
    both.  Similar considerations applied to the decoder side, except we now also
    have a layer of cross-attention interposed between the self-attention and FFN.

    As I mentioned in the Introduction, your main entry point for experimenting with
    the seq2seq based transformer code in DLStudio are the following two scripts in
    the ExamplesTransformers directory of the distribution:

        seq2seq_with_transformerFG.py
        seq2seq_with_transformerPreLN.py

    However, before you can run these scripts, you would need to download the
    training dataset used in these examples.  See the "For Transformers" part of the
    "The Datasets Included" section of this doc page for that.

    And your main entry point for experimenting with image recognition by playing
    with the visTransformer class are the scripts:

          image_recog_with_visTransformer.py
          test_checkpoint_for_visTransformer.py

    Both these script use the CIFAR10 dataset for demonstrating image recognition.

    MetricLearning:
    =============

    You can use this module in DLStudio to play with both supervised and unsupervised
    metric learning.  For supervised metric learning, you use the ground-truth class
    labels for mining a batch for positive and negative pairs and for constructing
    triplets. And for unsupervised metric learning, you can use the InfoNCE loss.
    All of this code in the MetricLearning module of DLStudio.

    I have also supplied the following scripts in the ExamplesMetricLearning
    directory that demonstrate how to call the functionality packaged into the
    MetricLearning module.  The names of these scripts are:

        1.  example_for_pairwise_contrastive_loss_supervised.py

        2.  example_for_triplet_loss_supervised.py

        3.  example_for_InfoNCE_loss_unsupervised.py

Examples DIRECTORY

    The Examples subdirectory in the distribution contains the following scripts:

    (1)  playing_with_reconfig.py

         Shows how you can specify a convolution network with a configuration string.
         The main DLStudio class parses the string constructs the network.

    (2)  playing_with_sequential.py

         Shows you how you can call on a custom inner class of the 'DLStudio' module
         that is meant to experiment with your own network.  The name of the inner
         class in this example script is ExperimentsWithSequential

    (3)  playing_with_cifar10.py

         This is very similar to the previous example script but is based on the
         inner class ExperimentsWithCIFAR which uses more common examples of networks
         for playing with the CIFAR-10 dataset.

    (5)  playing_with_skip_connections.py

         This script illustrates how to use the inner class BMEnet of the module for
         experimenting with skip connections in a CNN. As the script shows, the
         constructor of the BMEnet class comes with two options: skip_connections and
         depth.  By turning the first on and off, you can directly illustrate in a
         classroom setting the improvement you can get with skip connections.  And by
         giving an appropriate value to the "depth" option, you can show results for
         networks of different depths.

    (6)  custom_data_loading.py

         This script shows how to use the custom dataloader in the inner class
         CustomDataLoading of the main DLStudio class.  That custom dataloader is
         meant specifically for the PurdueShapes5 dataset that is used in object
         detection and localization experiments in DLStudio.

    (7)  object_detection_and_localization.py

         This script shows how you can use the functionality provided by the inner
         class DetectAndLocalize of the main DLStudio class for experimenting with
         object detection and localization.  Detecting and localizing (D&L) objects
         in images is a more difficult problem than just classifying the objects.
         D&L requires that your CNN make two different types of inferences
         simultaneously, one for classification and the other for localization.  For
         the localization part, the CNN must carry out what is known as
         regression. What that means is that the CNN must output the numerical values
         for the bounding box that encloses the object that was detected.  Generating
         these two types of inferences requires two DIFFERENT loss functions, one for
         classification and the other for regression.

    (8)  noisy_object_detection_and_localization.py

         This script in the Examples directory is exactly the same as the one
         described above, the only difference is that it calls on the noise-corrupted
         training and testing dataset files.  I thought it would be best to create a
         separate script for studying the effects of noise, just to allow for the
         possibility that the noise-related studies with DLStudio may evolve
         differently in the future.

    (9)  object_detection_and_localization_iou.py

         This script in the Examples directory is for experimenting with the variants
         of the IoU (Intersection over Union) loss functions provided by the class
         DIoULoss class that is a part of DLStudio's inner class DetectAndLocalize.
         This script uses the same datasets as the script mentioned in item 7 above.

    (10) semantic_segmentation.py

         This script should be your starting point if you wish to learn how to use
         the mUNet neural network for semantic segmentation of images.  As mentioned
         elsewhere in this documentation page, mUNet assigns an output channel to
         each different type of object that you wish to segment out from an
         image. So, given a test image at the input to the network, all you have to
         do is to examine each channel at the output for segmenting out the objects
         that correspond to that output channel.

    (11) run_autoencoder.py

         Even though the main purpose of the Autoencoder class in DLStudio is to
         serve as the Base class for the VAE class, this script allows you to
         experiment with just the Autoencoder class by itself.  For example, if you
         wanted to experiment with dimensionality reduction with an Autoencoder, all
         you would need to do would be to change the last or the last couple of
         layers of the Decoder in the Autoencoder class and see for yourself the
         results by running this script.


    (12) run_vae.py

         You can use this script to experiment with the variational autoencoding code
         in the VAE (Variational Auto-Encoder) inner class of DLStudio.  Variational
         autoencoding means mapping an input image to a latent vector that captures
         the "essence" of what's in the image with the assumption that the latent
         vectors form a much distribution (ideally a zero-mean, unit covariance
         Gaussian) than the original input data.  The Decoder part of a VAE samples
         the latent distribution and can be trained to create useful variants of the
         input data.

    (13) run_vae_for_image_generation.py

         This script allows you to experiment with just the Decoder part of the VAE
         (Variational Auto-Encoder) class.  The VAE Decoder is in reality a Generator
         samples a Gaussian probability distribution, as specified by its mean and
         covariance, and transforms the sample thus created into an output image that
         will bear some resemblance with the training images.  As to how close the
         output image will come to looking like your images in the training dataset
         would depend on the size of the dataset, the complexity of the images, the
         dimensionality of the latent space (this dimensionality is 8192) for the VAE
         network as implemented in DLStudio), etc.

    (14) run_vqvae.py

         Run this script to experiment with the VQVAE inner class in the main
         DLStudio class. VQVAE is about what has come to he known as Codebook
         learning for more efficient discrete representation of images with a finite
         vocabulary of embedding vectors.

    (15) run_vqgan.py

         While the overall goal in this script is the same as in the previous script
         --- Codebook learning --- this script does a couple of fancier things.
         First, it encapsulates the Encoder-VQ-Decoder network in a GAN and trains
         the network in an adversarial fashion.  The basic VQGAN network trained in
         this fashion is then used in the next script for transformer-based
         autoregressive modeling of the Codebook indices to which each input image is
         mapped.

    (16) run_vqgan_transformer.py

         This script is paired with the previous script.  That is, you must first run
         the previous script and train well the basic Encoder-VQ-Decoder network of a
         VQGAN.  Only after that, you can run this script for autoregressive modeling
         of the images based on representing the images in the Latest Space with
         sequences of integer indices, with each integer index representing a
         codebook vector.

    (17) run_vqgan_map_image_to_codebook.py

         This script is also meant to be run after you have already trained the basic
         VQGAN network with the script run_vqgan.py.  The goal of this script is to
         make it easy for you play with individual images or a batch of images to see
         the mappings between the images and the codebook vectors. See the commented
         out block at the top of this file to appreciate the reasons for why you
         might want to play with this script

    (18) text_classification_with_TEXTnet.py

         This script is your first introduction in DLStudio to a Recurrent Neural
         Network, meaning a neural-network with feedback.  Such networks are needed
         for solving problems related to variable length input data in applications
         such as text classification, sentiment analysis, machine translation, etc.
         Unfortunately, unless care is taken, the feedback in such networks results
         in long chains of dependencies and thus exacerbates the vanishing gradients
         problem.  The specific goal of this script is neural learning for automatic
         classification of product reviews.

    (19) text_classification_with_TEXTnet_word2vec.py

         This script uses the same learning network as in the previous script, but
         there is a big difference between the two.  The previous network uses
         one-hot vectors for representing the words. On the other hand, this script
         uses pre-trained word2vec embeddings.  These are fixed-sized numerical
         vectors that are learned on the basis of contextual similarities.

    (20) text_classification_with_TEXTnetOrder2.py

         As mentioned earlier for the script in item 10 above, the vanishing
         gradients problem becomes worse in neural networks with feedback.  One way
         to get around this problem is to use what's known as "gated recurrence".
         This script uses the TEXTnetOrder2 network as a stepping stone to a
         full-blown implementation of gating as provided by the nn.GRU class in item
         14 below.

    (21) text_classification_with_TEXTnetOrder2_word2vec.py

         This script uses the same network as the previous script, but now we use the
         word2vec embeddings for representing the words.

    (22) text_classification_with_GRU.py

         This script demonstrates how one can use a GRU (Gated Recurrent Unit) to
         remediate one of the main problems associated with recurrence -- vanishing
         gradients in the long chains of dependencies created by feedback.

    (23) text_classification_with_GRU_word2vec.py

         While this script uses the same learning network as the previous one, the
         words are now represented by fixed-sized word2vec embeddings.

ExamplesAdversarialLearning DIRECTORY

    The ExamplesAdversarialLearning directory of the distribution contains the
    following scripts for demonstrating adversarial learning for data modeling:

        1.  dcgan_DG1.py

        2.  dcgan_DG2.py

        3.  wgan_CG1.py

        4.  wgan_with_gp_CG2.py

    The first script demonstrates the DCGAN logic on the PurdueShapes5GAN dataset.
    In order to show the sensitivity of the basic DCGAN logic to any variations in
    the network or the weight initializations, the second script introduces a small
    change in the network.  The third script is a demonstration of using the
    Wasserstein distance for data modeling through adversarial learning.  The fourth
    script adds a Gradient Penalty term to the Wasserstein Distance based logic of
    the third script.  The PurdueShapes5GAN dataset consists of 64x64 images with
    randomly shaped, randomly positioned, and randomly colored shapes.

    The results produced by these scripts (for the constructor options shown in the
    scripts) are included in a subdirectory named RVLCloud_based_results.  If you are
    just becoming familiar with the AdversarialLearning class of DLStudio, I'd urge
    you to run the script with the constructor options as shown and to compare your
    results with those that are in the RVLCloud_based_results directory.

ExamplesDiffusion DIRECTORY

    The ExamplesDiffusion directory of DLStudio contains the following files that you
    will find helpful for your experiments with diffusion:

       0.  README

       1.  RunCodeForDiffusion.py

       2.  GenerateNewImageSamples.py

       3.  VisualizeSamples.py

    Any experiment with diffusion will involve all three scripts mentioned above.
    The script RunCodeForDiffusion.py is for training the neural network to become
    adept at learning the p-chain transition probabilities p( x_{t-1} | x_t ).  The
    script GenerateNewImageSamples.py is for generating the images using the learned
    model.  This script deposits all the generated images in a numpy archive for
    ndarrays.  The last script, VisualizeSamples.py, is for extracting the individual
    images from that archive.  Please make sure that you have gone through the README
    mentioned above before starting your experiments with the diffusion part of
    DLStudio.

ExamplesSeq2SeqLearning DIRECTORY

    The ExamplesSeq2SeqLearning directory of the distribution contains the following
    scripts for demonstrating sequence-to-sequence learning:

    (1) seq2seq_with_learnable_embeddings.py

         This script demonstrates the basic PyTorch structures and idioms to use for
         seq2seq learning.  The application example addressed in the script is
         English-to-Spanish translation.  And the attention mechanism used for
         seq2seq is the one proposed by Bahdanau, Cho, and Bengio.  This network used
         in this example calls on the nn.Embeddings layer in the encoder to learn the
         embeddings for the words in the source language and a similar layer in the
         decoder to learn the embeddings to use for the target language.

    (2) seq2seq_with_pretrained_embeddings.py

         This script, also for seq2seq learning, differs from the previous one in
         only one respect: it uses Google's word2vec embeddings for representing the
         words in the source-language sentences (English).  As to why I have not used
         at this time the pre-trained embeddings for the target language is explained
         in the main comment doc associated with the class
         Seq2SeqWithPretrainedEmbeddings.

ExamplesDataPrediction DIRECTORY

    The ExamplesDataPrediction directory of the distribution contains the following
    script for demonstrating data prediction for time-series data:

        power_load_prediction_with_pmGRU.py

    This script uses a subset of the dataset provided by Kaggle for one of their
    machine learning competitions.  The dataset consists of over 10-years worth of
    hourly electric load recordings made available by several utilities in the east
    and the Midwest of the United States.  You can download this dataset from a link
    at the top of the main DLStudio doc page.

ExamplesTransformers DIRECTORY

    The ExamplesTransformers directory of the distribution contains the following
    four scripts for experimenting with transformers in DLStudio:

        seq2seq_with_transformerFG.py
        seq2seq_with_transformerPreLN.py

        image_recog_with_visTransformer.py
        test_checkpoint_for_visTransformer.py

    The first two scripts deal with English-to-Spanish translation in a manner
    similar to what's demonstrated by the code in the Seq2SeqLearning module and the
    example scripts associated with that module. The last two relate to my
    demonstration of image recognition with a transformer based implementation.  I
    have used the CFAR10 dataset for image recognition.

ExamplesMetricLearning DIRECTORY

    The ExamplesMetricLearning directory at top level of the distribution contains
    the following scripts:

        1.  example_for_pairwise_contrastive_loss_supervised.py

        2.  example_for_triplet_loss_supervised.py

        3.  example_for_InfoNCE_loss_unsupervised.py

    As the names imply, the first two are for supervised metric learning and the last
    for unsupervised.

THE DATASETS INCLUDED

    [must be downloaded separately]

   FOR THE MAIN DLStudio CLASS and its INNER CLASSES

        Download the dataset archive 'datasets_for_DLStudio.tar.gz' through the link
        "Download the image datasets for the main DLStudio Class" provided at the top
        of this documentation page and store it in the 'Example' directory of the
        distribution.  Subsequently, as you are in the Examples directory, execute the
        following command:

                tar zxvf datasets_for_DLStudio.tar.gz

        This command will create a 'data' subdirectory in the 'Examples' directory
        and deposit the datasets mentioned below in that subdirectory.

         FOR OBJECT DETECTION AND LOCALIZATION

        Training a CNN for object detection and localization requires training and
        testing datasets that come with bounding-box annotations. This module comes
        with the PurdueShapes5 dataset for that purpose.  I created this
        small-image-format dataset out of my admiration for the CIFAR-10 dataset as
        an educational tool for demonstrating classification networks in a classroom
        setting. You will find the following dataset archive files in the "data"
        subdirectory of the "Examples" directory of the distro:

            (1)  PurdueShapes5-10000-train.gz
                 PurdueShapes5-1000-test.gz

            (2)  PurdueShapes5-20-train.gz
                 PurdueShapes5-20-test.gz

        The number that follows the main name string "PurdueShapes5-" is for the
        number of images in the dataset.  You will find the last two datasets, with
        20 images each, useful for debugging your logic for object detection and
        bounding-box regression.

        As to how the image data is stored in the archives, please see the main
        comment block for the inner class CustomLoading in this file.

         FOR DETECTING OBJECTS IN NOISE-CORRUPTED IMAGES

        In terms of how the image data is stored in the dataset files, this dataset
        is no different from the PurdueShapes5 dataset described above.  The only
        difference is that we now add varying degrees of noise to the images to make
        it more challenging for both classification and regression.

        The archive files you will find in the 'data' subdirectory of the 'Examples'
        directory for this dataset are:

            (3)  PurdueShapes5-10000-train-noise-20.gz
                 PurdueShapes5-1000-test-noise-20.gz

            (4)  PurdueShapes5-10000-train-noise-50.gz
                 PurdueShapes5-1000-test-noise-50.gz

            (5)  PurdueShapes5-10000-train-noise-80.gz
                 PurdueShapes5-1000-test-noise-80.gz

        In the names of these six archive files, the numbers 20, 50, and 80 stand for
        the level of noise in the images.  For example, 20 means 20% noise.  The
        percentage level indicates the fraction of the color value range that is
        added as randomly generated noise to the images.  The first integer in the
        name of each archive carries the same meaning as mentioned above for the
        regular PurdueShapes5 dataset: It stands for the number of images in the
        dataset.

         FOR SEMANTIC SEGMENTATION

        Showing interesting results with semantic segmentation requires images that
        contains multiple objects of different types.  A good semantic segmenter
        would then allow for each object type to be segmented out separately from an
        image.  A network that can carry out such segmentation needs training and
        testing datasets in which the images come up with multiple objects of
        different types in them. Towards that end, I have created the following
        dataset:

            (6) PurdueShapes5MultiObject-10000-train.gz
                PurdueShapes5MultiObject-1000-test.gz

            (7) PurdueShapes5MultiObject-20-train.gz
                PurdueShapes5MultiObject-20-test.gz

        The number that follows the main name string "PurdueShapes5MultiObject-" is
        for the number of images in the dataset.  You will find the last two
        datasets, with 20 images each, useful for debugging your logic for semantic
        segmentation.

        As to how the image data is stored in the archive files listed above, please
        see the main comment block for the class

            PurdueShapes5MultiObjectDataset

        As explained there, in addition to the RGB values at the pixels that are
        stored in the form of three separate lists called R, G, and B, the shapes
        themselves are stored in the form an array of masks, each of size 64x64, with
        each mask array representing a particular shape. For illustration, the
        rectangle shape is represented by the first such array. And so on.

         FOR TEXT CLASSIFICATION

        My experiments tell me that, when using gated RNNs, the size of the
        vocabulary can significantly impact the time it takes to train a neural
        network for text modeling and classification.  My goal was to provide curated
        datasets extract from the Amazon user-feedback archive that would lend
        themselves to experimentation on, say, your personal laptop with a
        rudimentary GPU like the Quadro.  Here are the new datasets you can now
        download from the main documentation page for this module:


                 sentiment_dataset_train_200.tar.gz        vocab_size = 43,285
                 sentiment_dataset_test_200.tar.gz

                 sentiment_dataset_train_40.tar.gz         vocab_size = 17,001
                 sentiment_dataset_test_40.tar.gz

                 sentiment_dataset_train_400.tar.gz        vocab_size = 64,350
                 sentiment_dataset_test_400.tar.gz

        As with the other datasets, the integer in the name of each dataset is the
        number of reviews collected from the 'positive.reviews' and the
        'negative.reviews' files for each product category.  Therefore, the dataset
        with 200 in its name has a total of 400 reviews for each product category.
        Also provided are two datasets named "sentiment_dataset_train_3.tar.gz" and
        sentiment_dataset_test_3.tar.gz" just for the purpose of debugging your code.

        The last dataset, the one with 400 in its name, was added in Version 1.1.3 of
        the module.

   FOR Seq2Seq LEARNING

        For sequence-to-sequence learning with DLStudio, you can download an
        English-Spanish translation corpus through the following archive:

            en_es_corpus_for_seq2sq_learning_with_DLStudio.tar.gz

        This data archive is a lightly curated version of the main dataset posted at
        "http://www.manythings.org/anki/" by the folks at "tatoeba.org".  My
        alterations to the original dataset consist mainly of expanding the
        contractions like "it's", "I'm", "don't", "didn't", "you'll", etc., into
        their "it is", "i am", "do not", "did not", "you will", etc. The original
        form of the dataset contains 417 such unique contractions.  Another
        alteration I made to the original data archive is to surround each sentence
        in both English and Spanish by the "SOS" and "EOS" tokens, with the former
        standing for "Start of Sentence" and the latter for "End of Sentence".

        Download the above archive in the ExamplesSeq2Seq2Learning directory and
        execute the following command in that directory:

            tar zxvf en_es_corpus_for_seq2sq_learning_with_DLStudio.tar.gz

        This command will create a 'data' subdirectory in the directory
        ExamplesSeq2Seq2Learning and deposit the following dataset archive in that
        subdirectory:

            en_es_8_98988.tar.gz

        Now execute the following in the 'data' directory:

            tar zxvf en_es_8_98988.tar.gz

        With that, you should be able to execute the Seq2SeqLearning based scripts in
        the 'ExamplesSeq2SeqLearning' directory.

   FOR ADVERSARIAL LEARNING AND DIFFUSION

        Download the dataset archive

            datasets_for_AdversarialLearning.tar.gz

        through the link "Download the image dataset for AdversarialLearning"
        provided at the top of the HTML version of this doc page and store it in the
        'ExamplesAdversarialLearning' directory of the distribution.  Subsequently,
        execute the following command in the directory 'ExamplesAdversarialLearning':

            tar zxvf datasets_for_AdversarialLearning.tar.gz

        This command will create a 'dataGAN' subdirectory and deposit the following
        dataset archive in that subdirectory:

            PurdueShapes5GAN-20000.tar.gz

        Now execute the following in the "dataGAN" directory:

            tar zxvf PurdueShapes5GAN-20000.tar.gz

        With that, you should be able to execute the adversarial learning based
        scripts in the 'ExamplesAdversarialLearning' directory.

        NOTE ADDED IN VERSION 2.5.1: This dataset is also used for the three scripts
        related to autoencoding and variational autoencoding in the Examples
        directory of the distribution.

   FOR DATA PREDICTION

        Download the dataset archive

            dataset_for_DataPrediction.tar.gz

        into the ExamplesDataPrediction directory of the DLStudio distribution.
        Next, execute the following command in that directory:

            tar zxvf dataset_for_DataPrediction.tar.gz

        That will create data directory named "dataPred" in the
        ExamplesDataPrediction directory.  With that you should be able to execute
        the data prediction script in that directory.

   FOR TRANSFORMERS

        For the seq2seq learning part of the Transformers module in DLStudio,
        download the dataset archive

            en_es_corpus_for_learning_with_Transformers.tar.gz

        into the ExamplesTransformers directory of the DLStudio distribution.  Next,
        execute the following command in that directory:

            tar zxvf en_es_corpus_for_learning_with_Transformers.tar.gz

        That will create a 'data' subdirectory in the ExamplesTransformers directory
        and deposit in that subdirectory the following archives

            en_es_xformer_8_10000.tar.gz
            en_es_xformer_8_90000.tar.gz

        These are both derived from the same data source as in the dataset for the
        examples associated with the Seq2SeqLearning module.  The first has only
        10,000 pars of English-Spanish sentences and meant primarily for debugging
        purposes.  The second contains 90000 pairs of such sentences.  The number '8'
        in the dataset names means that no sentence contains more than 8 real words.
        With the "SOS" and "EOS" tokens used as sentence delimiters, the maximum
        number of words in each sentence in either language is 10.

BUGS

    Please notify the author if you encounter any bugs.  When sending email, please
    place the string 'DLStudio' in the subject line to get past the author's spam
    filter.

ACKNOWLEDGMENTS

    Thanks to Praneet Singh and Noureldin Hendy for their comments related to the
    buggy behavior of the module when using the 'depth' parameter to change the size
    of a network. Thanks also go to Christina Eberhardt for reminding me that I
    needed to change the value of the 'dataroot' parameter in my Examples scripts
    prior to packaging a new distribution.  Their feedback led to Version 1.1.1 of
    this module.  Regarding the changes made in Version 1.1.4, one of them is a fix
    for the bug found by Serdar Ozguc in Version 1.1.3. Thanks Serdar.

    Version 2.0.3: I owe thanks to Ankit Manerikar for many wonderful conversations
    related to the rapidly evolving area of generative adversarial networks in deep
    learning.  It is obviously important to read research papers to become familiar
    with the goings-on in an area.  However, if you wish to also develop deep
    intuitions in those concepts, nothing can beat having great conversations with a
    strong researcher like Ankit.  Ankit is finishing his Ph.D. in the Robot Vision
    Lab at Purdue.

    Version 2.2.2: My laboratory's (RVL) journey into the world of transformers began
    with a series of lab seminars by Constantine Roros and Rahul Deshmukh.  Several
    subsequent conversations with them were instrumental in helping me improve the
    understanding I had gained from the seminars.  Additional conversations with
    Rahul about the issue of masking were important to how I eventually implemented
    those ideas in my code.

    Rahul Deshmukh discovered the reason as to why my implementation of the skip
    connection code was not working with the more recent versions of PyTorch.  My
    problem was using in-place operations in the forward() of the networks that
    called for connection skipping. This led to the release of Version 2.3.3 of
    DLStudio.

    The main reason for Version 2.3.4 was my new design for the SkipBlock class and
    also my easier-to-understand code for the BMEnet class that showcases the
    importance of providing shortcut paths in a computational graph using skip
    blocks.  After I broadcast that code to the students in my DL class at Purdue,
    Cheng-Hao Chen reported that when a SkipBlock was asked to change the channels
    from its input to its output but without downsampling the input, that elicited an
    error from the system.  Cheng-Hao also provided a correction for the error.
    Thanks, Cheng-Hao!

    Aditya Chauhan proved to be a great sounding board in my journey into the land of
    diffusion that led to Version 2.4.2.  I particularly appreciated Aditya's help in
    understanding how the attention mechanism worked in the OpenAI code library at
    GitHub.  Aditya is working for his PhD in RVL.  Thanks, Aditya!

    Version 2.5.0 is a result of Rahul Deshmukh insisting that the number of
    learnable parameters in a transformer must not depend on the maximum expected
    length for the input sequence --- which was not the case with the previous
    versions of the transformer code in DLStudio. As it turned out, my implementation
    of the FFN layer in the basic transformer encoder/decoder blocks was not in
    keeping with the design laid out in the original paper by Vaswani et al.  This
    problem is now fixed in Version 2.5.0. Rahul is at the very end of his Ph.D
    program in RVL at Purdue.  Thanks, Rahul!

    The main reason for Version 2.5.3 is Aditya Chauhan's strongly held opinion that
    the nn.Softmax normalization of the "Q.K^T" dot-products for Attention
    calculations must be along the word-axis for the K-tensors and NOT along the
    word-axis for the Q-tensors. His reasoning is compelling because, with the
    normalization along the K-axis, the individual rows of the normalized dot-product
    possess a more natural probability based interpretation: For each word in the
    Query tensor, the numbers in the corresponding row in the normalized dot product
    are the probabilities of the other words being relevant to the query word.  I
    wish to thank Aditya for sharing his insights with me.

    As to how Version 2.5.7 came about, Somosmita Mitra brought to my attention that
    there was an error in the order of the two normalizations that are applied to the
    Q.K^T do-products when calculating attention.  My normalization with respect to
    the size of the embeddings was outside the nn.Softmax normalization.  It needed
    to be inside.  Somosmita is working on her PhD in RVL at Purdue.  Thanks,
    Somosmita!

ABOUT THE AUTHOR

    The author, Avinash Kak, is a professor of Electrical and Computer Engineering
    at Purdue University.  For all issues related to this module, contact the
    author at kak@purdue.edu. If you send email, please place the string
    "DLStudio" in your subject line to get past the author's spam filter.

COPYRIGHT

    Python Software Foundation License

    Copyright 2026 Avinash Kak

@endofdocs

Imported Modules

torch.nn.functional
PIL.Image
PIL.ImageFilter
copy
glob
gzip

logging
math
torch.nn
numpy
numbers
torch.optim

os
pickle
matplotlib.pyplot
pymsgbox
random
re

sys
time
torch
torchmetrics
torchvision
torchvision.transforms

Classes

builtins.object

DLStudio

class DLStudio(builtins.object)

DLStudio(*args, **kwargs)

Methods defined here:

__init__(self, *args, **kwargs): Initialize self. See help(type(self)) for accurate signature.

build_convo_layers(self, configs_for_all_convo_layers)

build_fc_layers(self)

check_a_sampling_of_images(self): Displays the first batch_size number of images in your dataset.

display_tensor_as_image(self, tensor, title=''): This method converts the argument tensor into a photo image that you can display in your terminal screen. It can convert tensors of three different shapes into images: (3,H,W), (1,H,W), and (H,W), where H, for height, stands for the number of pixels in the vertical direction and W, for width, for the same along the horizontal direction. When the first element of the shape is 3, that means that the tensor represents a color image in which each pixel in the (H,W) plane has three values for the three color channels. On the other hand, when the first element is 1, that stands for a tensor that will be shown as a grayscale image. And when the shape is just (H,W), that is automatically taken to be for a grayscale image.

imshow(self, img): called by display_tensor_as_image() for displaying the image

load_cifar_10_dataset(self): In the code shown below, the call to "ToTensor()" converts the usual int range 0-255 for pixel values to 0-1.0 float vals and then the call to "Normalize()" changes the range to -1.0-1.0 float vals. For additional explanation of the call to "tvt.ToTensor()", see Slide 31 of my Week 2 slides at the DL course website. And see Slides 32 and 33 for the syntax "tvt.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))". In this call, the three numbers in the first tuple change the means in the three color channels and the three numbers in the second tuple change the standard deviations according to the formula: image_channel_val = (image_channel_val - mean) / std The end result is that the values in the image tensor will be normalized to fall between -1.0 and +1.0. If needed we can do inverse normalization by image_channel_val = (image_channel_val * std) + mean

load_cifar_10_dataset_with_augmentation(self): In general, we want to do data augmentation for training:

parse_config_string_for_convo_layers(self): Each collection of 'n' otherwise identical layers in a convolutional network is specified by a string that looks like: "nx[a,b,c,d]-MaxPool(k)" where n = num of this type of convo layer a = number of out_channels [in_channels determined by prev layer] b,c = kernel for this layer is of size (b,c) [b along height, c along width] d = stride for convolutions k = maxpooling over kxk patches with stride of k Example: "n1x[a1,b1,c1,d1]-MaxPool(k1) n2x[a2,b2,c2,d2]-MaxPool(k2)"

run_code_for_testing(self, net, display_images=False)

run_code_for_training(self, net, display_images=False)

save_model(self, model): Save the trained model to a disk file

Data descriptors defined here:

__dict__: dictionary for instance variables (if defined)

__weakref__: list of weak references to the object (if defined)

Data and other attributes defined here:

AttentionHead = <class 'DLStudio.DLStudio.AttentionHead'>: Borrowed from the Transformers module of DLStudio

Autoencoder = <class 'DLStudio.DLStudio.Autoencoder'>: The man reason for the existence of this inner class in DLStudio is for it to serve as the base class for VAE (Variational Auto-Encoder). That way, the VAE class can focus exclusively on the random-sampling logic specific to variational encoding while the base class Autoencoder does the convolutional and transpose-convolutional heavy lifting associated with the usual encoding-decoding of image data. Class Path: DLStudio -> Autoencoder

BMEnet = <class 'DLStudio.DLStudio.BMEnet'>: This educational class is meant for illustrating the concepts related to the use of skip connections in neural network. It is now well known that deep networks are difficult to train because of the vanishing gradients problem. What that means is that as the depth of network increases, the loss gradients calculated for the early layers become more and more muted, which suppresses the learning of the parameters in those layers. An important mitigation strategy for addressing this problem consists of creating a CNN using blocks with skip connections. With the code shown in this inner class of the module, you can now experiment with skip connections in a CNN to see how a deep network with this feature might improve the classification results. As you will see in the code shown below, the network that allows you to construct a CNN with skip connections is named BMEnet. As shown in the script playing_with_skip_connections.py in the Examples directory of the distribution, you can easily create a CNN with arbitrary depth just by using the "depth" constructor option for the BMEnet class. The basic block of the network constructed by BMEnet is called SkipBlock which, very much like the BasicBlock in ResNet-18, has a couple of convolutional layers whose output is combined with the input to the block. Note that the value given to the "depth" constructor option for the BMEnet class does NOT translate directly into the actual depth of the CNN. [Again, see the script playing_with_skip_connections.py in the Examples directory for how to use this option.] The value of "depth" is translated into how many "same input and output channels" and the "same input and output sizes" instances of SkipBlock to use between successive instances of downsampling and channel-doubling instances of SkipBlock. Class Path: DLStudio -> BMEnet

BasicDecoderWithMasking = <class 'DLStudio.DLStudio.BasicDecoderWithMasking'>: Borrowed from the Transformers module of DLStudio

CustomDataLoading = <class 'DLStudio.DLStudio.CustomDataLoading'>: This is a testbed for experimenting with a completely grounds-up attempt at designing a custom data loader. Ordinarily, if the basic format of how the dataset is stored is similar to one of the datasets that the Torchvision module knows about, you can go ahead and use that for your own dataset. At worst, you may need to carry out some light customizations depending on the number of classes involved, etc. However, if the underlying dataset is stored in a manner that does not look like anything in Torchvision, you have no choice but to supply yourself all of the data loading infrastructure. That is what this inner class of the main DLStudio class is all about. The custom data loading exercise here is related to a dataset called PurdueShapes5 that contains 32x32 images of binary shapes belonging to the following five classes: 1. rectangle 2. triangle 3. disk 4. oval 5. star The dataset was generated by randomizing the sizes and the orientations of these five patterns. Since the patterns are rotated with a very simple non-interpolating transform, just the act of random rotations can introduce boundary and even interior noise in the patterns. Each 32x32 image is stored in the dataset as the following list: [R, G, B, Bbox, Label] where R : is a 1024 element list of the values for the red component of the color at all the pixels B : the same as above but for the green component of the color G : the same as above but for the blue component of the color Bbox : a list like [x1,y1,x2,y2] that defines the bounding box for the object in the image Label : the shape of the object I serialize the dataset with Python's pickle module and then compress it with the gzip module. You will find the following dataset directories in the "data" subdirectory of Examples in the DLStudio distro: PurdueShapes5-10000-train.gz PurdueShapes5-1000-test.gz PurdueShapes5-20-train.gz PurdueShapes5-20-test.gz The number that follows the main name string "PurdueShapes5-" is for the number of images in the dataset. You will find the last two datasets, with 20 images each, useful for debugging your logic for object detection and bounding-box regression. Class Path: DLStudio -> CustomDataLoading

DetectAndLocalize = <class 'DLStudio.DLStudio.DetectAndLocalize'>: The purpose of this inner class is to focus on object detection in images --- as opposed to image classification. Most people would say that object detection is a more challenging problem than image classification because, in general, the former also requires localization. The simplest interpretation of what is meant by localization is that the code that carries out object detection must also output a bounding-box rectangle for the object that was detected. You will find in this inner class some examples of LOADnet classes meant for solving the object detection and localization problem. The acronym "LOAD" in "LOADnet" stands for "LOcalization And Detection" The different network examples included here are LOADnet1, LOADnet2, and LOADnet3. For now, only pay attention to LOADnet2 since that's the class I have worked with the most for the 1.0.7 distribution. Class Path: DLStudio -> DetectAndLocalize

EmbeddingGenerator = <class 'DLStudio.DLStudio.EmbeddingGenerator'>

ExperimentsWithCIFAR = <class 'DLStudio.DLStudio.ExperimentsWithCIFAR'>: Class Path: DLStudio -> ExperimentsWithCIFAR

ExperimentsWithSequential = <class 'DLStudio.DLStudio.ExperimentsWithSequential'>: Demonstrates how to use the torch.nn.Sequential container class Class Path: DLStudio -> ExperimentsWithSequential

MasterDecoderWithMasking = <class 'DLStudio.DLStudio.MasterDecoderWithMasking'>: Borrowed from the Transformers module of DLStudio

Net = <class 'DLStudio.DLStudio.Net'>

ScheduledOptim = <class 'DLStudio.DLStudio.ScheduledOptim'>: As in the Transformers module of DLStudio, for the scheduling of the learning rate during the warm-up phase of training TransformerFG, I have borrowed the class shown below from the GitHub code made available by Yu-Hsiang Huang at: https://github.com/jadore801120/attention-is-all-you-need-pytorch

SelfAttention = <class 'DLStudio.DLStudio.SelfAttention'>: Borrowed from the Transformers module of DLStudio

SemanticSegmentation = <class 'DLStudio.DLStudio.SemanticSegmentation'>: The purpose of this inner class is to be able to use the DLStudio platform for experiments with semantic segmentation. At its simplest level, the purpose of semantic segmentation is to assign correct labels to the different objects in a scene, while localizing them at the same time. At a more sophisticated level, a system that carries out semantic segmentation should also output a symbolic expression based on the objects found in the image and their spatial relationships with one another. The workhorse of this inner class is the mUNet network that is based on the UNET network that was first proposed by Ronneberger, Fischer and Brox in the paper "U-Net: Convolutional Networks for Biomedical Image Segmentation". Their Unet extracts binary masks for the cell pixel blobs of interest in biomedical images. The output of their Unet can therefore be treated as a pixel-wise binary classifier at each pixel position. The mUnet class, on the other hand, is intended for segmenting out multiple objects simultaneously form an image. [A weaker reason for "Multi" in the name of the class is that it uses skip connections not only across the two arms of the "U", but also also along the arms. The skip connections in the original Unet are only between the two arms of the U. In mUnet, each object type is assigned a separate channel in the output of the network. This version of DLStudio also comes with a new dataset, PurdueShapes5MultiObject, for experimenting with mUnet. Each image in this dataset contains a random number of selections from five different shapes, with the shapes being randomly scaled, oriented, and located in each image. The five different shapes are: rectangle, triangle, disk, oval, and star. Class Path: DLStudio -> SemanticSegmentation

TextClassification = <class 'DLStudio.DLStudio.TextClassification'>: The purpose of this inner class is to be able to use the DLStudio platform for simple experiments in text classification. Consider, for example, the problem of automatic classification of variable-length user feedback: you want to create a neural network that can label an uploaded product review of arbitrary length as positive or negative. One way to solve this problem is with a recurrent neural network in which you use a hidden state for characterizing a variable-length product review with a fixed-length state vector. This inner class allows you to carry out such experiments. Class Path: DLStudio -> TextClassification

TextClassificationWithEmbeddings = <class 'DLStudio.DLStudio.TextClassificationWithEmbeddings'>: The text processing class described previously, TextClassification, was based on using one-hot vectors for representing the words. The main challenge we faced with one-hot vectors was that the larger the size of the training dataset, the larger the size of the vocabulary, and, therefore, the larger the size of the one-hot vectors. The increase in the size of the one-hot vectors led to a model with a significantly larger number of learnable parameters --- and, that, in turn, created a need for a still larger training dataset. Sounds like a classic example of a vicious circle. In this section, I use the idea of word embeddings to break out of this vicious circle. Word embeddings are fixed-sized numerical representations for words that are learned on the basis of the similarity of word contexts. The original and still the most famous of these representations are known as the word2vec embeddings. The embeddings that I use in this section consist of pre-trained 300-element word vectors for 3 million words and phrases as learned from Google News reports. I access these embeddings through the popular Gensim library. Class Path: DLStudio -> TextClassificationWithEmbeddings

TransformerFG = <class 'DLStudio.DLStudio.TransformerFG'>: I have borrowed from the DLStudio's Transformers module. "FG" stands for "First Generation" --- which is the Transformer as originally proposed by Vaswani et al.

VAE = <class 'DLStudio.DLStudio.VAE'>: VAE stands for "Variational Auto Encoder". These days, you are more likely to see it written as "variational autoencoder". I consider VAE as one of the foundational neural architectures in Deep Learning. VAE is based on the new celebrated 2014 paper "Auto-Encoding Variational Bayes" by Kingma and Welling. The idea is for the Encoder part of an Encoder-Decoder pair to learn the probability distribution for the Latent Space Representation of a training dataset. Described loosely, the latent vector z for an input image x would be the "essence" of what x is depicting. Presumably, after the latent distribution has been learned, the Decoder should be able to transform any "noise" vector sampled from the latent distribution and convert it into the sort of output you would see during the training process. In case you are wondering about the dimensionality of the Latent Space, consider the case that the input images are eventually converted into 8x8 pixel arrays, with each pixel represented by a 128-dimensional embedding. In a vectorized representation, this implies an 8192-dimensional space for the Latent Distribution. The mean (mu) and the log-variance values (logvar) values learned by the Encoder would represent vectors in an 8,192 dimensional space. The Decoder's job would be sample this distribution and attempt a reconstruction of what the user wants to see at the output of the Decoder. As you can see, the VAE class is derived from the parent class Autoencoder. Bulk of the computing in VAE is done through the functionality packed into the Autoencoder class. Therefore, in order to fully understand the VAE implementation here, your starting point should be the code for the Autoencoder class. Class Path: DLStudio -> VAE

VQGAN = <class 'DLStudio.DLStudio.VQGAN'>

VQVAE = <class 'DLStudio.DLStudio.VQVAE'>: VQVAE is an important architecture in deep learning because it teaches us about what has come to be known as "Codebook Learning" for more efficient discrete representation of images with a finite vocabulary of embedding vectors. VQVAE stands for "Vector Quantized Variational Auto Encoder", which is also frequently represented by the acronym VQ-VAE. The concept of VQ-VAE was formulated in the 2018 publication "Neural Discrete Representation Learning" by van den Oord, Vinyals, and Kavukcuoglu. For the case of images, VQ-VAE means that we want to represent an input image using a user-specified number of embedding vectors. You could think of the set of embedding vectors as constituting a fixed-size vocabulary for representing the input data. To make the definition of Codebook Learning more specific, say we are using an Encoder-Decoder to create such a fixed-vocabulary based representation for the images. Let's assume that the Encoder converts each input batch of images into a (B,C,H,W) shaped tensor where the height H and the width W are likely to be small numbers, say 8 each, and C is likely to be, say, 128. Let's also say that the batch size is 256. The total number of pixels in all the batch instances at the output of the Encoder will be B*H*W. I'll represent this number of pixels with the notation BHW. For the example numbers used above, BHW will be equal to 256*8*8 = 16384. Taking cognizance of the channel axis, we can say that each of the 16,384 pixels at the output of the Encoder is represented by a 128 element vector along the channel axis. As things stand, each C-dimensional pixel based vector at the output of the Encoder will be a continuous valued vector. The goal of VQ-VAE is define a Codebook of K vectors, each of dimension D, with the idea that each of the C-dimensional BHW vectors at the output of the Encode will be replaced by the closest of the K D-dimensional vectors in the Codebook. For practical reasons, we require D=C. The Decoder's job then is to try its best to recreate the input using the Codebook approximations at the output of the Encoder. The goal of VQ-VAE is to demonstrate that it is possible to learn a Codebook with K elements that can subsequently be used to represent any input. You can think of the learned Codebook vectors as the quantized versions of what the Encoder presents at its output. As you can see, the VQVAE class is derived from the parent class Autoencoder. Bulk of the computing in VQVAE is done through the functionality packed into the Autoencoder class. Therefore, in order to fully understand the VQVAE implementation here, your starting point should be the code for the Autoencoder class. Note that the VQVAE code presented here is still tentative. Most of the heavy lifting at the moment is done by the two Vector Representation classes I have borrowed from "zalandoresearch" at GitHub: https://github.com/zalandoresearch/pytorch-vq-vae Class Path: DLStudio -> VQVAE

Functions
		static_var(varname, value) `## Python does not have a decorator for declaring static vars. But you can use ## the following for achieving the same effect. I believe I saw it at stackoverflow.com:`


		__author__ = 'Avinash Kak (kak@purdue.edu)' __copyright__ = '(C) 2026 Avinash Kak. Python Software Foundation.' __date__ = '2026-April-30' __url__ = 'https://engineering.purdue.edu/kak/distDT/DLStudio-2.5.8.html' __version__ = '2.5.8'

Author
		Avinash Kak (kak@purdue.edu)