GenerativeDiffusionCodeOnly.py

# -*- coding: utf-8 -*-

__version__   = '2.4.8'
__author__    = "Avinash Kak (kak@purdue.edu)"
__date__      = '2024-April-27'
__url__       = 'https://engineering.purdue.edu/kak/distDLS/DLStudio-2.4.8.html'
__copyright__ = "(C) 2024 Avinash Kak. Python Software Foundation."


__doc__ = '''

You are looking at the GenerativeDiffusion co-class file in the DLStudio platform. For the overall documentation on DLStudio, visit:

        https://engineering.purdue.edu/kak/distDLS/

************************

REGARDING THE ORIGIN OF THE CODE IN THIS FILE:

Most of the code you see in this file was drawn from the sources at the GitHub project "Improved Diffusion" at

        https://github.com/openai/improved-diffusion

The source code at the above website is the implementation for the work described in the paper "Improved Denoising Diffusion Probabilistic Models" by Nichol and Dhariwal that you can download from:

        https://arxiv.org/pdf/2102.09672.pdf

That GitHub code library consists of roughly 3900 lines of Python distributed over 10 files in three different directories. The code you'll see here contains around 750 lines of that code, and that too after several changes to simplify the organization of the code and to also simplify the call syntax for many of the functions.

As just one example of my simplification of code organization, the original code library at GitHub uses the GaussianDiffusion class through a subclass named SpacedDiffusion that is defined in a utility file called script_util.py. I got rid of the SpacedDiffusion class and incorporated that functionality in a rather small function inside the GaussianDiffusion class.

Along the same lines, I have also either eliminated some of the functionality associated with the UNetModel class or incorporated it in the definition of UNetModel itself. Another change I made to the UNetModel class was to make the use of "attention" an option in the constructor call for UNetModel.
I did so because attention does not seem to matter if you are using diffusion based modeling for, say, 64x64 images --- even the images from the CelebA dataset as long as you resize them to 64x64.

Obviously, the GitHub code library is much more general and will allow you to do the kinds of experiments (such as conditional diffusion) that my smaller version of the code base will not. In presenting the code you see here, my main goal is to be able to teach efficiently the core ideas of diffusion based modeling of data. My hope is that after the students have understood the core ideas, they can always go to the original code base at GitHub for more sophisticated experiments.

************************

INTRODUCTION TO GENERATIVE DIFFUSION FOR DATA MODELING:

This module in DLStudio addresses the other approach to generative data modeling --- the one based on what's known as "Denoising Diffusion". As explained in my Week 11 slides, the idea is to have two Markov processes, with one that subjects a training image incrementally to diffusing Markov transitions until it becomes pure noise, and the other that starts with Gaussian isotropic noise and denoises it incrementally until what emerges is an image that looks like those in your training data.

We denote the two Markov chains as the p-chain and the q-chain, with the former standing for progressively denoising isotropic Gaussian noise until an image emerges, and the latter for the progressive application of diffusion to an image from the training dataset until it turns into noise. It is important to memorize what the names "p-chain" and "q-chain" stand for since the same notation is used in the code. Again, "p" stands for denoising and "q" stands for diffusion.

Another thing you want to commit to memory is that temporal progression means two different things in the two chains. Let's say you have chosen T for the number of timesteps in the two chains.
If you assume that the timesteps progress from 0 to T in the forward q-chain, then you must also assume that the timesteps progress from T to 0 in the reverse p-chain. What that implies is that a one-step transition in the q-chain means going from timestep t-1 to t. On the other hand, the corresponding timestep transition in the p-chain PROGRESSES from t to t-1.

Let's now talk about the nature of the noise we are going to use in our framework. In principle, the noise that you start with for the purpose of its transformation into a recognizable image can be from any probability distribution. However, in practice, you are likely to use isotropic Gaussian noise because of its unique property that multiple consecutive transitions in a Markov chain can be combined into a single calculation.

The above observation plays a central role in how you create a software implementation for generative data modeling with diffusion. Each training cycle starts with acquiring a fresh image from the training dataset, randomly choosing a timestep t in the forward q-chain, and, through a single calculation, estimating the q-chain transition probability q(x_t | x_0). This calculation is based on the formula given by Eq. (67) of my Week 11 slides. Up to this point there is no learning involved and that's because there are no learnable parameters in the forward q-chain.

While the amount of noise that is injected into the data at each transition in the forward q-chain is set by the user, how much denoising to carry out at the corresponding transition in the reverse p-chain is determined by a neural network whose job is to estimate the amount of denoising that, in a sense, would be the "exact" opposite of the extent of diffusion carried out at the corresponding transition in the forward q-chain.

Here is a summary of how to train the neural network mentioned above:

    --- At each iteration of training the neural network, randomly choose a timestep t from the range that consists of T timesteps.
    --- Apply a single cumulative q-chain transition to the input training image that would be equivalent to taking the input image through t consecutive transitions in the q-chain.

    --- For each q-chain transition to the timestep t, use Bayes' Rule to estimate the posterior probability q( x_{t-1} | x_t, x_0 ) from the Markov transition probability q( x_t | x_0, x_{t-1} ).

    --- Use the posterior probabilities mentioned above as the target for training the neural network whose job is to estimate the transition probability p( x_{t-1} | x_t ) in the reverse p-chain.

The loss function for training the neural network could be the KL-Divergence between the posterior q( x_{t-1} | x_t, x_0 ) and the predicted p( x_{t-1} | x_t ). Another possibility for the loss would be the MSE between the isotropic noise that was injected in the q-chain transition in question and the prediction of the same in the p-chain by using the posterior estimates for the mean and the variance using the transition probability p( x_{t-1} | x_t ) that you get from the neural network. Yet another possibility is to directly form an estimate for the input image x_0 using the above-mentioned posterior estimates for the mean and the variance and then construct an MSE loss based on the difference between the estimated x_0 and its true value.

As should be clear from the above description, the sole goal of training the neural network is to make it an expert at the prediction of the denoising transition probabilities p( x_{t-1} | x_t ). Typically, you carry out the training in an infinite loop while spitting out the checkpoints every so often. When you are ready to see the image generation power of a checkpoint, you start with isotropic Gaussian noise as the input and take it through all of the T timestep p-chain transitions that should lead to a recognizable image.

**************************

The ExamplesDiffusion Directory:

You will find the following scripts in the directory ExamplesDiffusion:

    0.  README
    1.  RunCodeForDiffusion.py
    2.  GenerateNewImageSamples.py
    3.  VisualizeSamples.py

As explained in the README, you will need all three Python scripts listed above to do any experiments with diffusion in DLStudio. Of these, the first, RunCodeForDiffusion.py, is for training the neural network to become an expert in estimating the p-chain transition probabilities

        p( x_{t-1} | x_t )

The timestep subscripts on the data x in this transition probability may look a bit strange to those who are not familiar with the notion of the reverse Markov chain called p-chain in diffusion modeling. In a reverse Markov chain, the time PROGRESSES from some user-specified T to 0. Therefore, the timestep t-1 comes AFTER the timestep t.

To run the training code on your image dataset, you will have to point the "data_dir" constructor option of the GenerativeDiffusion class in the script RunCodeForDiffusion.py to your dataset directory. Depending on your training dataset, you may also have to set some of the other constructor parameters in the script RunCodeForDiffusion.py. For example, if the image size in your dataset is different from 64x64, you will have to supply the image_size parameter for the GenerativeDiffusion class.

After the model has been trained, you must call the script GenerateNewImageSamples.py to generate new image samples from the learned model. These images are placed in a numpy ndarray archive. Finally, you must call the script VisualizeSamples.py to create the individual images from the numpy archive.

Please make sure to go through the README in the ExamplesDiffusion directory before you start playing with the code in this file.
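
As a small illustration of the claim made in the introduction above that consecutive q-chain transitions can be combined into a single calculation, here is a minimal pure-Python sketch. It is not part of the DLStudio code: the name compose_q_steps and the choice of T=1000 with a linear beta schedule from 0.0001 to 0.02 are assumptions made only for this example. It tracks the coefficient on x_0 and the accumulated noise variance through one transition at a time and compares them with the closed-form values obtained from the cumulative product of the alphas:

```python
import math

T = 1000
# Linear beta schedule (assumed here for illustration)
betas = [0.0001 + (0.02 - 0.0001) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]

# Cumulative products of the alphas, one per timestep index
alphas_cumprod = []
running = 1.0
for a in alphas:
    running *= a
    alphas_cumprod.append(running)

def compose_q_steps(t):
    # Apply the single-step rule x_i = sqrt(alpha_i) * x_{i-1} + sqrt(beta_i) * noise
    # for timestep indices 0..t, tracking only the x_0 coefficient and the noise variance.
    coef, var = 1.0, 0.0
    for i in range(t + 1):
        coef = math.sqrt(alphas[i]) * coef
        var = alphas[i] * var + betas[i]
    return coef, var

t = 499
coef, var = compose_q_steps(t)
closed_form_coef = math.sqrt(alphas_cumprod[t])   # coefficient on x_0 in q(x_t | x_0)
closed_form_var = 1.0 - alphas_cumprod[t]         # noise variance in q(x_t | x_0)
print(abs(coef - closed_form_coef), abs(var - closed_form_var))
```

Both differences should be at the level of floating-point round-off, which is why a single draw of noise is enough to jump from x_0 to x_t during training.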
@endofdocs ''' import sys,os,os.path import math import random import matplotlib.pyplot as plt import time import glob import copy import enum import numpy as np from PIL import Image import torch import torch.nn as nn import torch.nn.functional as F from torch.optim import AdamW import torchvision ########################################################################################################################### ############################################# Start of GenerativeDiffusion ############################################# class GenerativeDiffusion(object): def __init__(self, *args, **kwargs ): if args: raise ValueError( '''GenerativeDiffusion constructor can only be called with keyword arguments for the following keywords: epochs, lr, learning_rate, batch_size, momentum, image_size, data_dir, dataroot, path_saved_model, use_gpu, ngpu, dlstudio, num_channels, device, log_interval, save_interval, ema_rate, diffusion, network''') allowed_keys = 'dataroot','data_dir','num_channels', 'lr', 'image_size','path_saved_model','momentum','learning_rate','epochs','batch_size', \ 'classes','use_gpu','ngpu','dlstudio','log_interval','save_interval','ema_rate','diffusion','network','clip_denoised','num_samples',\ 'batch_size_image_generation','gen_new_images' keywords_used = kwargs.keys() for keyword in keywords_used: if keyword not in allowed_keys: raise SyntaxError(keyword + ": Wrong keyword used --- check spelling") learning_rate = epochs = batch_size = convo_layers_config = momentum = gen_new_images = None image_size = fc_layers_config = lr = data_dir = num_channels = lr = dataroot = path_saved_model = classes = use_gpu = None latent_vector_size = ngpu = save_interval = log_interval = clipping_threshold = clip_denoised = num_samples = batch_size_image_generation = None if 'data_dir' in kwargs : data_dir = kwargs.pop('data_dir') if 'num_channels' in kwargs : num_channels = kwargs.pop('num_channels') if 'lr' in kwargs : lr = kwargs.pop('lr') if 'batch_size' in 
kwargs : batch_size = kwargs.pop('batch_size') if 'image_size' in kwargs : image_size = kwargs.pop('image_size') if 'log_interval' in kwargs : log_interval = kwargs.pop('log_interval') if 'save_interval' in kwargs : save_interval = kwargs.pop('save_interval') if 'ema_rate' in kwargs : ema_rate = kwargs.pop('ema_rate') if 'diffusion' in kwargs : diffusion = kwargs.pop('diffusion') if 'network' in kwargs : network = kwargs.pop('network') if 'gen_new_images' in kwargs : gen_new_images = kwargs.pop('gen_new_images') if 'ngpu' in kwargs : ngpu = kwargs.pop('ngpu') if 'path_saved_model'in kwargs : path_saved_model = kwargs.pop('path_saved_model') if 'clip_denoised' in kwargs : clip_denoised = kwargs.pop('clip_denoised') if 'num_samples' in kwargs : num_samples = kwargs.pop('num_samples') if 'batch_size_image_generation' in kwargs : batch_size_image_generation = kwargs.pop('batch_size_image_generation') if gen_new_images is None: gen_new_images = False if data_dir: self.data_dir = data_dir if num_channels: self.num_channels = num_channels if lr: self.lr = lr if batch_size: self.batch_size = batch_size if image_size: self.image_size = image_size if log_interval: self.log_interval = log_interval if save_interval: self.save_interval = save_interval if ema_rate: self.ema_rate = ema_rate if diffusion: self.diffusion = diffusion if network: self.network = network if path_saved_model: self.path_saved_model = path_saved_model if path_saved_model and (gen_new_images is False): if os.path.exists(path_saved_model): files = glob.glob(path_saved_model + "/*") for file in files: if os.path.isfile(file): os.remove(file) else: os.mkdir(path_saved_model) if ngpu: self.ngpu = ngpu if ngpu > 0: if torch.cuda.is_available(): self.device = torch.device("cuda:0") else: self.device = torch.device("cpu") if clip_denoised: self.clip_denoised = clip_denoised if num_samples: self.num_samples = num_samples if batch_size_image_generation: self.batch_size_image_generation = batch_size_image_generation 
        if clipping_threshold:
            self.clipping_threshold = clipping_threshold
        self.training_iteration_counter = 0


    class ImageDataset(torch.utils.data.Dataset):
        """
        Source: https://github.com/openai/improved-diffusion

        Access the image dataset. Its __getitem__ makes available one image at a time as a
        numpy array in the (C,H,W) layout with values in [-1, 1]. The Dataloader to be defined
        later creates a tensor for a batch of images.
        """
        def __init__(self, image_size, image_paths):
            super(GenerativeDiffusion.ImageDataset, self).__init__()
            self.image_size = image_size
            self.image_paths = image_paths

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            path = self.image_paths[idx]
            pil_image = Image.open(path)
            width,height = pil_image.size
            if any( (width != 64, height != 64) ):
                ## Image.LANCZOS is the same filter as the older alias Image.ANTIALIAS,
                ## which is no longer available in Pillow 10 and later
                pil_image = pil_image.resize( (64,64), Image.LANCZOS )
            arr = np.array(pil_image.convert("RGB"))
            arr = arr.astype(np.float32) / 127.5 - 1
            return np.transpose(arr, [2, 0, 1])


    def run_code_for_training(self):
        """
        Somewhat modified version of the code at: https://github.com/openai/improved-diffusion
        """
        quartile_losses_dict = {'mse_loss' : 0.0, 'mse_loss_q3': 0.0, 'mse_loss_q2': 0.0, 'mse_loss_q1': 0.0,
                                'mse_loss_q0': 0.0, 'grad_norm': 0.0, 'iteration' : 0, 'samples' : 0 }

        def clear_quartile_losses_dict():
            for key in quartile_losses_dict:
                if key in ['mse_loss', 'mse_loss_q3', 'mse_loss_q2', 'mse_loss_q1', 'mse_loss_q0', 'grad_norm']:
                    quartile_losses_dict[key] = 0.0

        diffusion = self.diffusion
        diffusion.update_betas_for_diffusion()
        network = self.network
        ## Note that network.parameters() returns a Python generator object:
        network_params = list(network.parameters())
        opt = AdamW(network_params, lr=self.lr, weight_decay=0.0)
        ema_rate = [self.ema_rate]
        ema_params = [ copy.deepcopy(network_params) for _ in range(len(ema_rate)) ]
        network = network.to(self.device)

        def calc_loss_dict(diffusion, ts, losses):
            """
            The loss that is calculated for backpropagation is only with regard to the neural
            network's ability to match in the p-chain the posterior of
            the transition probability as used in the q-chain. That loss is not a direct indicator
            of how the p-chain would perform with regard to generating new images from noise. The
            quartile losses as calculated here are meant to address that problem. Recall that for
            generating a new image, you have to go through all T transitions in the p-chain.
            """
            for key, values in losses.items():
                oldval = quartile_losses_dict[key]
                quartile_losses_dict[key] = oldval * self.training_iteration_counter / (self.training_iteration_counter + 1) + \
                                            values.mean().item() / (self.training_iteration_counter + 1)
                # Log the quantiles (four quartiles, in particular).
                for sub_t, sub_loss in zip(ts.cpu().numpy(), values.detach().cpu().numpy()):
                    quartile = int(4 * sub_t / diffusion.num_diffusion_timesteps)
                    oldval = quartile_losses_dict[f"{key}_q{quartile}"]
                    quartile_losses_dict[f"{key}_q{quartile}"] = oldval * self.training_iteration_counter / (self.training_iteration_counter + 1) + \
                                                                 sub_loss / (self.training_iteration_counter + 1)

        def _log_grad_norm():
            sqsum = 0.0
            for p in network.parameters():
                if p.grad is None:
                    continue
                sqsum += (p.grad ** 2).sum().item()
            oldval = quartile_losses_dict["grad_norm"]
            quartile_losses_dict["grad_norm"] = oldval * self.training_iteration_counter / (self.training_iteration_counter + 1) + \
                                                np.sqrt(sqsum) / (self.training_iteration_counter + 1)

        def _network_params_to_state_dict(network_params):
            state_dict = network.state_dict()
            for i, (name, _value) in enumerate(network.named_parameters()):
                assert name in state_dict
                state_dict[name] = network_params[i]
            return state_dict

        def load_data():
            print("creating data loader...")
            ## Fail with a clear message instead of a NameError on 'all_files' further down
            if not os.path.exists(self.data_dir):
                raise FileNotFoundError("data_dir does not exist: " + str(self.data_dir))
            all_files = glob.glob(self.data_dir + "/*")
            dataset = self.ImageDataset( self.image_size, all_files)
            loader = torch.utils.data.DataLoader(dataset, batch_size=self.batch_size, shuffle=True, num_workers=1, drop_last=True )
            while True:
                yield from loader

        data = load_data()
        print("training...")
        while True:
            opt.zero_grad()
            batch =
next(data) batch = batch.to(self.device) ## Uniformly sample the 1000 integer timestep integers in the range (0, num_diffusion_timesteps) indices_np = np.random.choice(self.diffusion.num_diffusion_timesteps, size=(self.batch_size)) t = torch.from_numpy(indices_np).long().to(self.device) losses = diffusion.training_losses( network, batch, t ) calc_loss_dict( diffusion, t, {k: v for k, v in losses.items()} ) loss = (losses["mse_loss"]).mean() loss.backward() _log_grad_norm() opt.step() for rate, params in zip(ema_rate, ema_params): update_ema(params, network_params, rate=rate) quartile_losses_dict['iteration'] = self.training_iteration_counter quartile_losses_dict['samples'] = self.training_iteration_counter * self.batch_size if self.training_iteration_counter % self.log_interval == 0: d = quartile_losses_dict mse_loss_dict = {} names_deleted = [] print("\n\n---------------------------") for (name, val) in d.items(): if "mse_loss" in name: mse_loss_dict[name] = val print("TRAINING ITERATION " + " : " + str("%d" % d["iteration"])) print() print("training samples used " + " : " + str("%d" % d["samples"])) print("grad_norm" + " : " + str("%.6f" % d['grad_norm'])) print() for (name, val) in sorted(mse_loss_dict.items()): print(name + " : " + str("%.8f" % val)) print("---------------------------\n\n") clear_quartile_losses_dict() if self.training_iteration_counter % self.save_interval == 0: print("\n\nSaving checkpoint at training iteration: %d\n\n" % self.training_iteration_counter) state_dict = _network_params_to_state_dict(network_params) filename = f"/ema_{rate}_{(self.training_iteration_counter):06d}.pt" torch.save(state_dict, self.path_saved_model + filename ) self.training_iteration_counter += 1 # Save the last checkpoint if it wasn't already saved. 
        if (self.training_iteration_counter - 1) % self.save_interval != 0:
            state_dict = _network_params_to_state_dict(network_params)
            filename = f"/ema_{rate}_{(self.training_iteration_counter):06d}.pt"
            torch.save(state_dict, self.path_saved_model + filename )

#########################################  End of GenerativeDiffusion class  ###########################################
###########################################################################################################################


###########################################################################################################################
#######################################  Start Definition of Gaussian Diffusion  ####################################

def _extract_into_tensor(arr, timesteps, broadcast_shape):
    """
    Source: https://github.com/openai/improved-diffusion

    Reformat the values in a 1-D numpy array supplied by the parameter 'arr' for a batch of indices.
    The parameter 'arr' refers to the numpy array of alpha-beta values mentioned on Slide 162 of my
    Week 11 slides. The "batch of indices" refers to the 1D tensor of randomly chosen timesteps
    supplied through the parameter 'timesteps'. For example, if the batch size for model training
    is 8, the value of the parameter 'timesteps' will look like

        tensor([315, 541, 70, 49, 726, 818, 149, 160], device='cuda:0')

    with each timestep chosen randomly from the integers 0 through 999 when T=1000.

    The parameter 'broadcast_shape' is the standard (B,C,H,W) for a data tensor flowing through a
    network. The goal of this function is to first convert 'arr' into a tensor and then to reshape
    the tensor to look like the one with four axes, as in the shape (B,C,H,W). At the same time, we
    want the values in 'arr' to be along the batch axis of the outgoing tensor.
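
    To make the reshaping concrete, here is a minimal NumPy-only sketch of the same idea. It is an
    illustration and not the function defined here: the name extract_into_array, the stand-in
    coefficient array, and the small (B,C,H,W) shape are all assumptions. Each batch sample gets the
    one coefficient that corresponds to its own timestep, replicated over the channel and the
    spatial axes:

```python
import numpy as np

def extract_into_array(arr, timesteps, broadcast_shape):
    # Pick one coefficient per batch sample using its timestep as the index
    res = arr[timesteps].astype(np.float32)
    # Append trailing axes of size 1 until the number of axes matches (B,C,H,W)
    while res.ndim < len(broadcast_shape):
        res = res[..., None]
    # Replicate each per-sample value across the C, H, and W axes
    return np.broadcast_to(res, broadcast_shape)

arr = np.linspace(0.0, 1.0, 1000)            # stand-in for a per-timestep coefficient array
timesteps = np.array([315, 541, 70, 49])     # one randomly chosen timestep per batch sample
out = extract_into_array(arr, timesteps, (4, 3, 8, 8))
print(out.shape)
```

    Multiplying such an array elementwise with a batch of images scales every pixel of the b-th
    image by the coefficient for that image's timestep.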
""" res = torch.from_numpy(arr).to(device=timesteps.device)[timesteps].float() while len(res.shape) < len(broadcast_shape): res = res[..., None] ## Each iteration adds one more axis with no data return res.expand(broadcast_shape) class GaussianDiffusion: """ Source: https://github.com/openai/improved-diffusion This class is for subjecting a training image to Markov transitions that inject small amounts of isotropic Gaussian noise into the image until you end with pure zero-mean isotropic Gaussian noise. For a theoretical description of the process, see how the q-chain works in my Week 11 slides. As for the changes from the GitHub code, I have dropped its subclass that was named SpacedDiffusion. The purpose served by that subclass is now taken care of by the method named "update_betas_for_diffusion()" in the class presented here. Other changes include dropping the associated classes ModelMeanType and ModelVarType and the code related to those classes from the main class shown below., """ def __init__( self, num_diffusion_timesteps ): self.num_diffusion_timesteps = num_diffusion_timesteps beta_start = 0.0001 beta_end = 0.02 betas = np.linspace( beta_start, beta_end, self.num_diffusion_timesteps, dtype=np.float64 ) # Use float64 for accuracy. 
betas = np.array(betas, dtype=np.float64) self.betas = betas assert len(betas.shape) == 1, "betas must be 1-D" assert (betas > 0).all() and (betas <= 1).all() alphas = 1.0 - betas self.alphas_cumprod = np.cumprod(alphas, axis=0) self.alphas_cumprod_prev = np.append(1.0, self.alphas_cumprod[:-1]) self.alphas_cumprod_next = np.append(self.alphas_cumprod[1:], 0.0) assert self.alphas_cumprod_prev.shape == (self.num_diffusion_timesteps,) # Forward Markov Chain: Calculates q(x_t | x_{t-1}) for diffusion: self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod) self.sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - self.alphas_cumprod) self.log_one_minus_alphas_cumprod = np.log(1.0 - self.alphas_cumprod) self.sqrt_recip_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod) self.sqrt_recipm1_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod - 1) # calculations for posterior q(x_{t-1} | x_t, x_0) self.posterior_variance = ( betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod) ) # log calculation clipped because the posterior variance is 0 at the # beginning of the diffusion chain. self.posterior_log_variance_clipped = np.log( np.append(self.posterior_variance[1], self.posterior_variance[1:]) ) self.posterior_mean_coef1 = ( betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod) ) self.posterior_mean_coef2 = ( (1.0 - self.alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - self.alphas_cumprod) ) def update_betas_for_diffusion(self): """ Presumably, this updating of the betas makes the diffusion process more robust when you skip timesteps to take advantage of the property of the Gaussian diffusion process that allows you to combine multiple timesteps into a single invocation of diffusion. 
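
        The following pure-Python sketch, which is not part of the original code, illustrates what
        this recalculation does. The helper name respace_betas and the every-10th-timestep subset
        are assumptions for the example. When all T timesteps are retained, the recomputed betas
        are the same as the original ones. When only a subset is retained, each new beta absorbs
        the noise of the skipped transitions so that the cumulative products at the retained
        timesteps stay unchanged:

```python
T = 1000
# Linear beta schedule with the same end points as in __init__()
betas = [0.0001 + (0.02 - 0.0001) * i / (T - 1) for i in range(T)]

alphas_cumprod = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alphas_cumprod.append(running)

def respace_betas(alphas_cumprod, use_timesteps):
    # Hypothetical helper: rebuild the betas from the cumulative products
    # at the retained timesteps only.
    use_timesteps = set(use_timesteps)
    last_alpha_cumprod = 1.0
    new_betas = []
    for i, alpha_cumprod in enumerate(alphas_cumprod):
        if i in use_timesteps:
            new_betas.append(1 - alpha_cumprod / last_alpha_cumprod)
            last_alpha_cumprod = alpha_cumprod
    return new_betas

# Case 1: keep every timestep, as update_betas_for_diffusion() does
same = respace_betas(alphas_cumprod, range(T))
max_diff = max(abs(a - b) for a, b in zip(betas, same))

# Case 2: keep every 10th timestep (an assumption made for this example)
spaced = respace_betas(alphas_cumprod, range(0, T, 10))
product = 1.0
for b in spaced:
    product *= 1.0 - b
print(max_diff, len(spaced), product, alphas_cumprod[990])
```

        In the second case the product of (1 - beta) over the 100 retained steps should agree with
        the original cumulative product at timestep 990, the last retained index.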
""" use_timesteps = range(self.num_diffusion_timesteps) last_alpha_cumprod = 1.0 new_betas = [] for i, alpha_cumprod in enumerate(self.alphas_cumprod): if i in use_timesteps: new_betas.append(1 - alpha_cumprod / last_alpha_cumprod) last_alpha_cumprod = alpha_cumprod self.betas = np.array(new_betas) def q_sample(self, x_start, t, noise): """ Source: https://github.com/openai/improved-diffusion This is in the forward q-chain. Given the training image x_0, it calculates q( x_t | x_0 ) for a randomly chosen timestep t. The parameter x_start is is the same thing as x_0 in the slides. This is an implementation of Eq. (68) on Slide 154. """ assert noise.shape == x_start.shape return ( _extract_into_tensor( self.sqrt_alphas_cumprod, t, x_start.shape) * x_start + _extract_into_tensor( self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise ) def q_posterior_mean_and_variance(self, x_start, x_t, t): """ Source: https://github.com/openai/improved-diffusion Compute the mean and variance of the diffusion posterior: q(x_{t-1} | x_t, x_0) In the forward direction, the transitions in the q-chain are from timestep t-1 to the timestep t. After having made the forward transition, we now want to use the Bayes' Rule to go "backwards" from t to t-1 in order to figure out the mean and the variance of the posterior probability distribution at the previous timestep t-1 given that we have the value x_t and ALSO the starting value x_0. These posterior estimates can then serve as the targets for the denoising neural network that is learning how best to make the FORWARD transition from timestep t to the timestep t-1 in the reverse p-chain. 
""" assert x_start.shape == x_t.shape posterior_mean = ( _extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start + _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t ) posterior_variance = _extract_into_tensor(self.posterior_variance, t, x_t.shape) posterior_log_variance_clipped = _extract_into_tensor( self.posterior_log_variance_clipped, t, x_t.shape ) assert ( posterior_mean.shape[0] == posterior_variance.shape[0] == posterior_log_variance_clipped.shape[0] == x_start.shape[0] ) return posterior_mean, posterior_variance, posterior_log_variance_clipped def predict_xstart_in_pchain( self, model, x, t, clip_denoised=True ): """ Source: https://github.com/openai/improved-diffusion This function seeks to predict the training image x0 at each timestep in the reverse p-chain. For that it needs the output of the neural network as well as the p-chain variances calculated in the __init__() of the GaussianDiffusion class. The parameter 'model' stands for the neural-netork being used. To understand what exact timestep corresponds to t, note that this function is invoked after the forward q-chain has made the t-1 to t transition. Note that t will be set to a tensor of timesteps, one for each batch sample. When the parameter 'clip_denoised' is set to True, the output of the neural network is clipped to be in the interval [-1, 1]. """ B, C = x.shape[:2] assert t.shape == (B,) model_output = model(x, t) ## See Section 3.1 of Ho et el. 
paper for why the variance of the output ## of the neural network is set to the quantity self.posterior_variance that ## was calculated in the __init__() of GaussianDiffusion class: model_variance, model_log_variance = self.posterior_variance, self.posterior_log_variance_clipped model_variance = _extract_into_tensor(model_variance, t, x.shape) model_log_variance = _extract_into_tensor(model_log_variance, t, x.shape) if clip_denoised: pred_xstart = self._predict_xstart_from_model_output(x_t=x, t=t, model_output=model_output).clamp(-1,1) else: pred_xstart = self._predict_xstart_from_model_output(x_t=x, t=t, model_output=model_output) model_mean, _, _ = self.q_posterior_mean_and_variance( x_start=pred_xstart, x_t=x, t=t ) assert ( model_mean.shape == model_log_variance.shape == pred_xstart.shape == x.shape ) return { "mean": model_mean, "variance": model_variance, "log_variance": model_log_variance, "pred_xstart": pred_xstart, } def training_losses(self, model, x_start, t): """ Source: https://github.com/openai/improved-diffusion Compute training losses for a single timestep. :param model: the model to evaluate loss on. :param x_start: the [N x C x ...] tensor of inputs. :param t: a batch of timestep indices. :param noise: if specified, the specific Gaussian noise to try to remove. :return: a dict with the key "loss" containing a tensor of shape [N]. Some mean or variance settings may also have other keys. """ def mean_flat(tensor): """ Take the mean over all non-batch dimensions. 
""" return tensor.mean(dim=list(range(1, len(tensor.shape)))) noise = torch.randn_like(x_start) x_t = self.q_sample(x_start, t, noise) terms = {} model_output = model(x_t, t) target = noise assert model_output.shape == target.shape == x_start.shape terms["mse_loss"] = mean_flat((target - model_output) ** 2) return terms def _predict_xstart_from_model_output(self, x_t, t, model_output): """ Source: https://github.com/openai/improved-diffusion """ assert x_t.shape == model_output.shape return ( _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t - _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * model_output ) def p_sampler_for_image_generation( self, model, shape, device, clip_denoised=True ): """ Based on the code at: Source: https://github.com/openai/improved-diffusion After the model has been trained, you can use this function to generate a batch of images by deploying the reverse p-chain through all its transitions. """ assert isinstance(shape, (tuple, list)) img = torch.randn(*shape, device=device) indices = list(range(self.num_diffusion_timesteps))[::-1] for i in indices: t = torch.tensor([i] * shape[0], device=device) with torch.no_grad(): out = self.p_sample( model, img, t, clip_denoised=clip_denoised ) img = out["sample"] return img def p_sample( self, model, x, t, clip_denoised=True ): """ Source: https://github.com/openai/improved-diffusion Sample x_{t-1} from the model at the given timestep. Recall that the timestep t-1 comes AFTER the timestep t in the p-chain. :param model: the model to sample from. :param x: the current tensor at x_{t-1}. :param t: the value of t, starting at 0 for the first diffusion step. :param clip_denoised: if True, clip the x_start prediction to [-1, 1]. :return: a dict containing the following keys: - 'sample': a random sample from the model. - 'pred_xstart': a prediction of x_0. 
""" out = self.predict_xstart_in_pchain( model, x, t, clip_denoised=clip_denoised ) noise = torch.randn_like(x) nonzero_mask = ( (t != 0).float().view(-1, *([1] * (len(x.shape) - 1))) ) # no noise when t == 0 sample = out["mean"] + nonzero_mask * torch.exp(0.5 * out["log_variance"]) * noise return {"sample": sample} ############################################# END of Gaussian Diffusion ################################################ ########################################################################################################################### ########################################################################################################################### ############################################### Start of Class UNetModel ################################################## class AttentionBlock(torch.nn.Module): """ Source: https://github.com/openai/improved-diffusion I am assuming that you already have a good understanding of QKV Attention as originally formulated in the Vaswani et al paper on Transformers and how the same attention mechanism was used in the Vision Transformer, ViT. There are significant differences in those formulations of Attention and how the same concept works here for the case of diffusion. If you are not familiar with the original formulation of Attention as mentioned above, you might wish to consult my Week 14 Slides at the Deep Learning website at Purdue. The notion of the embedding vector representation of the basic units of the input data plays a fundamental role in the original formulation of Attention. For example, in seq-2-seq learning for language translation, each word in a sentence will be represented by an embedding vector of size M of floating-point numbers. Single-headed attention consists of associating with each embedding vector a Query vector Q, a Key vector K, and a Value vector V. 
    The QKV vectors for the different input units interact through dot-products for each input unit
    to figure out how it should attend to the other input units.  And that's what's referred to as
    the Attention mechanism.  Multi-headed attention does the same thing but by first segmenting the
    embedding vectors into P segments where P is the number of Attention Heads.  Subsequently, the
    QKV attention is calculated for each segment in exactly the same manner as for Single-headed
    attention.

    The same notion is used in UNetModel for inter-pixel attention at a couple of different levels
    in the UNet.  The data that is input into the UNet is of shape (B,C,H,W).  For calculating the
    inter-pixel attention, for each pixel in the HxW array, we consider the C floating-point values
    along the channel axis as the embedding vector representation of that pixel.  And, then,
    speaking a bit loosely, the rest is the same as before.  More accurately, though, we do two
    things before invoking Attention in this manner:

        (1) We first flatten the HxW array of pixels into a 1-dimensional pixel array --- just to
            make it easier to write the dot-product code later.

        (2) We use a 1-dimensional convolution on the 1-dimensional array of pixels to convert the
            C channels associated with each pixel into 3*C channels.  Since the channel axis is
            used as the embedding vector at each pixel, increasing the number of channels gives us
            more latitude in dividing the channel axis into portions reserved for Q, K, and V.

    For a deeper reason as to why we need this C-channel to 3*C-channel transformation of the
    input, read the end of the next paragraph.

    An interesting difference between the formulation of Attention as in Vaswani et al. and the
    same mechanism for inter-pixel attention as implemented below is the absence of the matrices
    that multiply the embedding vectors for the calculation of Q, K, and V.
    In what follows, those three matrices are incorporated implicitly in the matrix operator used
    for the 1-dimensional convolution carried out by 'self.qkv' that is declared in the constructor
    shown below.  It is this operator that increases the number of output channels from C to 3*C.
    Since, under the hood, a convolution in PyTorch is implemented with a matrix-vector product (as
    explained in my Week 8 slides at the Purdue DL website), we can conceive of the matrix being
    segmented along its row-axis into three different parts, one that outputs the Q vector, the
    second that outputs the K vector, and the third that outputs the V vector.
    """
    def __init__(self, time_embed_dim, num_heads=1, use_checkpoint=False):
        super().__init__()
        self.time_embed_dim = time_embed_dim
        self.num_heads = num_heads
        self.norm = nn.GroupNorm(32, time_embed_dim)
        self.qkv = nn.Conv1d(time_embed_dim, time_embed_dim * 3, 1)
        self.proj_out = zero_module(nn.Conv1d(time_embed_dim, time_embed_dim, 1))

    def forward(self, x):
        ## The variable 'x' starts out being of shape (B,C,H,W).  The following statement stores
        ## away the (H,W) spatial dimensions of the input.  That is, the variable 'spatial' will be
        ## set to (H,W) of the image input x:
        b, im_channels, *spatial = x.shape
        ## The number of channels in the input image needs to be the same as the 'time_embed_dim'
        assert(x.shape[1] == self.time_embed_dim)
        ## From this point on, the shape of the input will change from (B,C,H,W) to (B,C,H*W).
        ## That is, up to this point, the pixels formed a 2D array of shape HxW.  But, from this
        ## point, they will constitute a 1D signal with H*W pixels in it.  So, if H=W=16, we will
        ## basically end up with a 1-dimensional signal or a single vector with 256 elements.
        x = x.reshape(b, self.time_embed_dim, -1)
        ## The 1-dimensional convolution called for by self.qkv below changes the shape of the
        ## input from (B,C,H*W) to (B,3*C,H*W).  So we are tripling the number of channels in the
        ## 1-dimensional representation of the input data.
        ## For single-headed attention, the idea is to use the first set of the C output channels
        ## for Q, the second set of C channels for K, and the third set of C channels for V --- a
        ## la Vaswani transformers.  For multi-headed attention, we would first divide the 3*C
        ## channels amongst the different heads, and then, for each head, divide the channels by 3
        ## for specifying the channels for Q, K, and V.
        ##
        ## Next Consider Multi-Headed Attention:
        ##
        ## Let's assume that you are planning to use N heads.  You will still divide the channels
        ## into three equal parts as for the case of single-headed attention.  However, you will
        ## first divide the 3*C channels amongst the N heads.  That is, each head will get (3*C)/N
        ## channels.  And, for each head, you would divide the (3*C)/N channels amongst Q, K, and
        ## V.  It would amount to assigning C/N channels to each of Q, K, and V, for every
        ## attention head.
        qkv = self.qkv(self.norm(x))
        ## Note that the variable qkv defined above is just the same as the input x with its
        ## pixels in the flattened vector and with its number of channels three times the number
        ## of input channels.
        ## In the following statement, note that qkv.shape[2] equals the total number of pixels in
        ## the input --- although now they are in the form of a flattened vector.  The tensor
        ## reshaping shown below allows each attention head to operate independently on its
        ## section of the channel axis.  It is independent in the same sense as would be the case
        ## for the processing of each batch instance.  Therefore, for applying the QKV attention
        ## formula, we can place each attention head on the same footing as each batch instance.
        qkv = qkv.reshape(b * self.num_heads, -1, qkv.shape[2])
        ## Now we are ready to invoke Vaswani's formula for QKV attention.  The following returns
        ## the number of channels to be devoted to each of Q, K, V in each attention head.
        ch = qkv.shape[1] // 3
        ## This splits the qkv tensor into three parts along the channel axis.
        q, k, v = torch.split(qkv, ch, dim=1)
        ## This calculates the scaling needed for Eq. (3) on Slide 18 of my Week 18 Lecture on
        ## transformers.  Since the factor 1/ch^(1/4) is applied to both q and k below, their
        ## product ends up being divided by sqrt(ch):
        scale = 1 / math.sqrt(math.sqrt(ch))
        ## Calculating the attention means computing softmax(Q.K^T)V.  The statement below
        ## calculates the dot product Q.K^T.  See Eq. (3) on Slide 18:
        weight = torch.einsum("bct,bcs->bts", q * scale, k * scale)
        ## Subsequently, you apply torch.softmax() to the product (Q.K^T):
        weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
        ## And, finally, multiply the above result with V:
        h = torch.einsum("bts,bcs->bct", weight, v)
        h = h.reshape(b, -1, h.shape[-1])
        h = self.proj_out(h)
        return (x + h).reshape(b, im_channels, *spatial)


class UNetModel(nn.Module):
    """
    Source: https://github.com/openai/improved-diffusion

    :param in_channels: channels in the input Tensor.
    :param model_channels: base channel count for the model.
    :param out_channels: channels in the output Tensor.
    :param num_res_blocks: number of residual blocks per downsample.
    :param attention_resolutions: a collection of downsample rates at which attention will take
                                  place.  May be a set, list, or tuple.  For example, if this
                                  contains 4, then at 4x downsampling, attention will be used.
    :param channel_mult: channel multiplier for each level of the UNet.
    :param num_heads: the number of attention heads in each attention layer.
    :param attention: if False, no AttentionBlock layers are placed in the network.
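
    The following plain-Python sketch traces how the constructor lays out the downward path of
    the UNet: the downsample rate, the channel count, and whether attention is attached to each
    residual block.  The function name trace_down_path and the parameter values used in the call
    are hypothetical and serve only as an illustration.  The sketch mirrors the bookkeeping loop
    in the constructor and builds no layers.

```python
def trace_down_path(model_channels, channel_mult, num_res_blocks, attention_resolutions):
    # One (downsample_rate, channels, has_attention) tuple per residual block.
    rows = []
    ds = 1
    for level, mult in enumerate(channel_mult):
        ch = mult * model_channels
        for _ in range(num_res_blocks):
            rows.append((ds, ch, ds in attention_resolutions))
        # Every level except the last one ends with a stride-2 downsampling.
        if level != len(channel_mult) - 1:
            ds *= 2
    return rows

# Hypothetical settings, chosen only for illustration.
rows = trace_down_path(128, (1, 2, 4, 8), 2, (4, 8))
for row in rows:
    print(row)
```

    Note that the channel counts must be divisible by 32 on account of the GroupNorm(32, ...)
    layers used in the residual and attention blocks.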
""" def __init__( self, in_channels, model_channels, out_channels, num_res_blocks, attention_resolutions, channel_mult=(1, 2, 4, 8), num_heads=1, attention = True ): super().__init__() self.in_channels = in_channels self.model_channels = model_channels self.out_channels = out_channels self.num_res_blocks = num_res_blocks self.attention_resolutions = attention_resolutions self.channel_mult = channel_mult self.num_heads = num_heads self.attention = attention time_embed_dim = model_channels * 4 ##------------------------------ Inner utility classes for UNetModel ------------------------ class SiLU(torch.nn.Module): def forward(self, x): return x * torch.sigmoid(x) class TimestepBlock(nn.Sequential, nn.Module): """ A sequential module that passes timestep embeddings to the children that support it as an extra input. """ def forward(self, x, emb): for layer in self: if isinstance(layer, TimestepBlock): x = layer(x, emb) else: x = layer(x) return x class Upsample(nn.Module): def __init__(self, channels ): super().__init__() self.channels = channels self.conv = nn.Conv2d(channels, channels, 3, padding=1) def forward(self, x): assert x.shape[1] == self.channels x = F.interpolate(x, scale_factor=2, mode="nearest") x = self.conv(x) return x class Downsample(nn.Module): def __init__(self, channels): super().__init__() self.channels = channels stride = 2 self.op = nn.Conv2d(channels, channels, 3, stride=stride, padding=1) def forward(self, x): assert x.shape[1] == self.channels return self.op(x) class ResBlock(TimestepBlock): """ A residual block that can OPTIONALLY change the number of channels. :param channels: the number of input channels. :param emb_channels: the number of timestep embedding channels. :param out_channels: if specified, the number of out channels. :param use_conv: if True and out_channels is specified, use a spatial convolution instead of a smaller 1x1 convolution to change the channels in the skip connection. 
""" def __init__(self, channels, emb_channels, out_channels=None): super().__init__() self.channels = channels self.emb_channels = emb_channels ## number of timestep embedding channels self.out_channels = out_channels or channels class SiLU(nn.Module): def forward(self, x): return x * torch.sigmoid(x) self.in_layers = nn.Sequential( nn.GroupNorm(32, channels), SiLU(), nn.Conv2d(channels, self.out_channels, 3, padding=1), ) self.emb_layers = nn.Sequential( SiLU(), nn.Linear(emb_channels, self.out_channels), ) self.out_layers = nn.Sequential( nn.GroupNorm(32, self.out_channels), SiLU(), zero_module( nn.Conv2d(self.out_channels, self.out_channels, 3, padding=1) ), ) if self.out_channels == channels: self.skip_connection = nn.Identity() self.skip_connection = nn.Conv2d(channels, self.out_channels, 1) def forward(self, x, emb): h = self.in_layers(x) emb_out = self.emb_layers(emb).type(h.dtype) while len(emb_out.shape) < len(h.shape): emb_out = emb_out[..., None] h = h + emb_out h = self.out_layers(h) return self.skip_connection(x) + h ##---------------------- END END of Inner utility classes for UNetModel --------------------- self.time_embed = nn.Sequential( nn.Linear(model_channels, time_embed_dim), SiLU(), nn.Linear(time_embed_dim, time_embed_dim), ) self.input_blocks = nn.ModuleList( [ TimestepBlock( nn.Conv2d( in_channels, model_channels, 3, padding=1 ) ) ] ) input_block_chans = [model_channels] ch = model_channels ds = 1 for level, mult in enumerate(channel_mult): for _ in range(num_res_blocks): layers = [ ResBlock( ch, time_embed_dim, out_channels=mult * model_channels ) ] ch = mult * model_channels if self.attention: if ds in attention_resolutions: layers.append( AttentionBlock( ch, num_heads=num_heads ) ) self.input_blocks.append(TimestepBlock(*layers)) input_block_chans.append(ch) if level != len(channel_mult) - 1: self.input_blocks.append( TimestepBlock(Downsample(ch)) ) input_block_chans.append(ch) ds *= 2 if self.attention: self.middle_block = 
TimestepBlock( ResBlock( ch, time_embed_dim ), AttentionBlock(ch, num_heads=num_heads), ResBlock( ch, time_embed_dim ) ) else: self.middle_block = TimestepBlock( ResBlock( ch, time_embed_dim ), ResBlock( ch, time_embed_dim ) ) self.output_blocks = torch.nn.ModuleList([]) for level, mult in list(enumerate(channel_mult))[::-1]: for i in range(num_res_blocks + 1): layers = [ ResBlock( ch + input_block_chans.pop(), time_embed_dim, out_channels=model_channels * mult ) ] ch = model_channels * mult if self.attention: if ds in attention_resolutions: layers.append( AttentionBlock( ch ) ) if level and i == num_res_blocks: layers.append( Upsample(ch) ) ds //= 2 self.output_blocks.append(TimestepBlock(*layers)) self.out = nn.Sequential( nn.GroupNorm(32, ch), SiLU(), zero_module(nn.Conv2d(model_channels, out_channels, 3, padding=1)), ) def forward(self, x, timesteps, y=None): hs = [] emb = self.time_embed(timestep_embedding(timesteps, self.model_channels)) h = x.type(self.inner_dtype) for module in self.input_blocks: h = module(h, emb) hs.append(h) h = self.middle_block(h, emb) for module in self.output_blocks: cat_in = torch.cat([h, hs.pop()], dim=1) h = module(cat_in, emb) h = h.type(x.dtype) return self.out(h) @property def inner_dtype(self): """ Required by the forward() of this class, UNetModel Get the dtype used by the torso of the model. 
""" return next(self.input_blocks.parameters()).dtype ################################################# End of UNetModel Class ################################################ ########################################################################################################################### ########################################################################################################################### ################################################# Start of Class mUNet #################################################### class mUNet(nn.Module): """ I have included this class here as a "remote possibility" replacement for the UNetModel class in the code shown above. To test out that role for this class will require considerable hyperparameter tuning of at least the beta coefficients. I have not yet had time to do that. """ def __init__(self, in_channels, model_channels, out_channels, skip_connections=True, depth=16): super().__init__() self.in_channels = in_channels self.model_channels = model_channels self.out_channels = out_channels self.time_embed_dim = self.model_channels * 4 * 24 self.depth = depth // 2 self.conv_in = nn.Conv2d(3, 64, 3, padding=1) time_embed_dim = self.time_embed_dim ##------------------------------ Inner utility classes for mUNet ---------------------------- class SiLU(torch.nn.Module): def forward(self, x): return x * torch.sigmoid(x) class SkipBlockDN(nn.Module): def __init__(self, in_ch, out_ch, downsample=False, skip_connections=True): super(SkipBlockDN, self).__init__() self.downsample = downsample self.skip_connections = skip_connections self.in_ch = in_ch self.out_ch = out_ch self.convo1 = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1) self.convo2 = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1) self.bn1 = nn.BatchNorm2d(out_ch) self.bn2 = nn.BatchNorm2d(out_ch) if downsample: self.downsampler = nn.Conv2d(in_ch, out_ch, 1, stride=2) def forward(self, x): identity = x out = self.convo1(x) out = self.bn1(out) out 
= nn.functional.relu(out) if self.in_ch == self.out_ch: out = self.convo2(out) out = self.bn2(out) out = nn.functional.relu(out) if self.downsample: out = self.downsampler(out) identity = self.downsampler(identity) if self.skip_connections: if self.in_ch == self.out_ch: out = out + identity else: out = out + torch.cat((identity, identity), dim=1) return out class SkipBlockUP(nn.Module): def __init__(self, in_ch, out_ch, upsample=False, skip_connections=True): super(SkipBlockUP, self).__init__() self.upsample = upsample self.skip_connections = skip_connections self.in_ch = in_ch self.out_ch = out_ch self.convoT1 = nn.ConvTranspose2d(in_ch, out_ch, 3, padding=1) self.convoT2 = nn.ConvTranspose2d(in_ch, out_ch, 3, padding=1) self.bn1 = nn.BatchNorm2d(out_ch) self.bn2 = nn.BatchNorm2d(out_ch) if upsample: self.upsampler = nn.ConvTranspose2d(in_ch, out_ch, 1, stride=2, dilation=2, output_padding=1, padding=0) def forward(self, x): identity = x out = self.convoT1(x) out = self.bn1(out) out = nn.functional.relu(out) out = nn.ReLU(inplace=False)(out) if self.in_ch == self.out_ch: out = self.convoT2(out) out = self.bn2(out) out = nn.functional.relu(out) if self.upsample: out = self.upsampler(out) identity = self.upsampler(identity) if self.skip_connections: if self.in_ch == self.out_ch: out = out + identity else: out = out + identity[:,self.out_ch:,:,:] return out ##---------------------- END END of Inner utility classes for mUNet ------------------------ self.time_embed = torch.nn.Sequential( torch.nn.Linear(self.model_channels, time_embed_dim), SiLU(), torch.nn.Linear(time_embed_dim, time_embed_dim), ) ## For the DN arm of the U: self.bn1DN = nn.BatchNorm2d(64) self.bn2DN = nn.BatchNorm2d(128) self.skip64DN_arr = nn.ModuleList() for i in range(self.depth): self.skip64DN_arr.append(SkipBlockDN(64, 64, skip_connections=skip_connections)) self.skip64dsDN = SkipBlockDN(64, 64, downsample=True, skip_connections=skip_connections) self.skip64to128DN = SkipBlockDN(64, 128, 
skip_connections=skip_connections ) self.skip128DN_arr = nn.ModuleList() for i in range(self.depth): self.skip128DN_arr.append(SkipBlockDN(128, 128, skip_connections=skip_connections)) self.skip128dsDN = SkipBlockDN(128,128, downsample=True, skip_connections=skip_connections) ## For the UP arm of the U: self.bn1UP = nn.BatchNorm2d(128) self.bn2UP = nn.BatchNorm2d(64) self.skip64UP_arr = nn.ModuleList() for i in range(self.depth): self.skip64UP_arr.append(SkipBlockUP(64, 64, skip_connections=skip_connections)) self.skip64usUP = SkipBlockUP(64, 64, upsample=True, skip_connections=skip_connections) self.skip128to64UP = SkipBlockUP(128, 64, skip_connections=skip_connections ) self.skip128UP_arr = nn.ModuleList() for i in range(self.depth): self.skip128UP_arr.append(SkipBlockUP(128, 128, skip_connections=skip_connections)) self.skip128usUP = SkipBlockUP(128,128, upsample=True, skip_connections=skip_connections) self.conv_out = nn.ConvTranspose2d(64, 3, 3, stride=2,dilation=2,output_padding=1,padding=2) def forward(self, x, timesteps): emb = self.time_embed(timestep_embedding(timesteps, self.model_channels)) x += emb.view(x.shape) ## Going down to the bottom of the U: x = nn.MaxPool2d(2,2)(nn.functional.relu(self.conv_in(x))) for i,skip64 in enumerate(self.skip64DN_arr[:self.depth//4]): x = skip64(x) num_channels_to_save1 = x.shape[1] // 2 save_for_upside_1 = x[:,:num_channels_to_save1,:,:].clone() x = self.skip64dsDN(x) for i,skip64 in enumerate(self.skip64DN_arr[self.depth//4:]): x = skip64(x) x = self.bn1DN(x) num_channels_to_save2 = x.shape[1] // 2 save_for_upside_2 = x[:,:num_channels_to_save2,:,:].clone() x = self.skip64to128DN(x) for i,skip128 in enumerate(self.skip128DN_arr[:self.depth//4]): x = skip128(x) x = self.bn2DN(x) num_channels_to_save3 = x.shape[1] // 2 save_for_upside_3 = x[:,:num_channels_to_save3,:,:].clone() for i,skip128 in enumerate(self.skip128DN_arr[self.depth//4:]): x = skip128(x) x = self.skip128dsDN(x) ## Coming up from the bottom of U on the 
        ## other side:
        x = self.skip128usUP(x)
        for i,skip128 in enumerate(self.skip128UP_arr[:self.depth//4]):
            x = skip128(x)
        x[:,:num_channels_to_save3,:,:] = save_for_upside_3
        x = self.bn1UP(x)
        for i,skip128 in enumerate(self.skip128UP_arr[:self.depth//4]):
            x = skip128(x)
        x = self.skip128to64UP(x)
        for i,skip64 in enumerate(self.skip64UP_arr[self.depth//4:]):
            x = skip64(x)
        x[:,:num_channels_to_save2,:,:] = save_for_upside_2
        x = self.bn2UP(x)
        x = self.skip64usUP(x)
        for i,skip64 in enumerate(self.skip64UP_arr[:self.depth//4]):
            x = skip64(x)
        x[:,:num_channels_to_save1,:,:] = save_for_upside_1
        x = self.conv_out(x)
        return x
##################################################  End of mUNet Class  ##################################################
###########################################################################################################################


###########################################################################################################################
##############################################  Start of utility functions  ##############################################

def timestep_embedding(timesteps, dim, max_period=10000):
    """
    Source: https://github.com/openai/improved-diffusion

    Create sinusoidal timestep embeddings.

    :param timesteps: a 1-D Tensor of N indices, one per batch element.  These may be fractional.
    :param dim: the dimension of the output.
    :param max_period: controls the minimum frequency of the embeddings.
    :return: an [N x dim] Tensor of positional embeddings.
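
    As an illustration of the formula, here is a plain-Python sketch that computes the embedding
    for a single timestep with the standard math module.  The function name sinusoidal_embedding
    is hypothetical.  The function defined here does the same arithmetic on a batch of timesteps
    with PyTorch tensors.

```python
import math

def sinusoidal_embedding(timestep, dim, max_period=10000):
    half = dim // 2
    # The frequencies fall geometrically from 1 toward 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [timestep * f for f in freqs]
    # Cosines first, then sines, in the same order as the tensor version.
    embedding = [math.cos(a) for a in args] + [math.sin(a) for a in args]
    if dim % 2:
        embedding.append(0.0)    # zero-pad when the requested dimension is odd
    return embedding

emb = sinusoidal_embedding(timestep=0, dim=8)
print(emb)    # at timestep 0 every cosine is 1.0 and every sine is 0.0
```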
""" half = dim // 2 freqs = torch.exp( -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half ).to(device=timesteps.device) args = timesteps[:, None].float() * freqs[None] embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1) if dim % 2: embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1) return embedding def update_ema(target_params, source_params, rate=0.99): """ Source: https://github.com/openai/improved-diffusion Update target parameters to be closer to those of source parameters using an exponential moving average. :param target_params: the target parameter sequence. :param source_params: the source parameter sequence. :param rate: the EMA rate (closer to 1 means slower). """ for targ, src in zip(target_params, source_params): targ.detach().mul_(rate).add_(src.cpu(), alpha=1 - rate) def zero_module(module): """ Source: https://github.com/openai/improved-diffusion Zero out the parameters of a module and return it. """ for p in module.parameters(): p.detach().zero_() return module ############################################### End of utilty functions ################################################### ########################################################################################################################### ##______________________________ Test code follows _________________________________ if __name__ == '__main__': pass