The motivation for this venture was the blog post “Simple Neural Network on MNIST Handwritten Digit Datasetꜛ” by Muhammad Ardi. In his post, Muhammad provides a simple implementation of a neural network using keras. I thought it would be interesting to see how we can achieve the same results using only NumPy. Let’s see how we can do that.
Before we start, we need to make sure that we have the necessary packages installed. We will just use NumPy, Matplotlib, and keras for loading the MNIST dataset. Let’s create a new conda environment and install the required packages.
conda create -n numpy_ann python=3.11
conda activate numpy_ann
conda install -y mamba
mamba install -y numpy matplotlib keras ipykernel
If you are using an Apple Silicon chip, follow these instructions to install TensorFlow (TensorFlow is required by keras).
First, we need to initialize the parameters of the neural network. We will use a simple feedforward neural network with one hidden layer. The input layer will have 784 neurons (28x28 pixels), the hidden layer will have 5 neurons, and the output layer will have 10 neurons (one for each digit class):
import numpy as np

def initialize_network(input_size, hidden_size, output_size):
"""
This function initializes the parameters of the neural network.
W1 and W2 are weight matrices for the input layer and the hidden layer, respectively.
b1 and b2 are bias vectors for the hidden layer and the output layer, respectively.
"""
np.random.seed(42) # for reproducibility
network = {
'W1': np.random.randn(input_size, hidden_size) * 0.01,
'b1': np.zeros((1, hidden_size)),
'W2': np.random.randn(hidden_size, output_size) * 0.01,
'b2': np.zeros((1, output_size))
}
return network
input_size = 28*28 # MNIST images have 28x28 pixels
hidden_size = 5 # number of neurons in the hidden layer
output_size = 10 # Number of classes (the digits 0-9)
network = initialize_network(input_size, hidden_size, output_size)
In the function defined above, `W1` and `W2` are weight matrices for the input layer and the hidden layer, respectively. `b1` and `b2` are bias vectors for the hidden layer and the output layer, respectively.
Initializing the network parameters is a crucial step. We want to initialize the weights and biases such that the network can learn effectively. We will use small random values for the weights and zeros for the biases. This is a common practice in neural network initialization. The small random values help to break the symmetry of the weights and prevent the network from getting stuck during training.
The next step is to implement the forward propagation algorithm. This is the process of computing the output of the network given an input. We will use the sigmoid function,
\[\sigma(z) = \frac{1}{1 + e^{-z}},\]as the activation function for the hidden layer and the softmax function,
\[\text{softmax}(z) = \frac{e^{z}}{\sum_{i=1}^{n} e^{z_i}},\]for the output layer:
def sigmoid(z):
return 1 / (1 + np.exp(-z))
def softmax(z):
    # subtract the row-wise maximum for numerical stability (avoids overflow in np.exp):
    exp_scores = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
The feedforward algorithm is implemented as follows:
def feedforward(network, X):
z1 = X.dot(network['W1']) + network['b1']
a1 = sigmoid(z1)
z2 = a1.dot(network['W2']) + network['b2']
a2 = softmax(z2)
activations = {
'a1': a1,
'a2': a2
}
return activations
The `feedforward` function takes the input `X` and propagates it through the network to compute the output. It returns the activations of the hidden layer (`a1`) and the output layer (`a2`).
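As a quick sanity check (a minimal sketch on random data, not part of the original post; the functions below mirror the definitions above), we can verify that the forward pass produces one probability distribution per input row — thanks to the softmax, each row of `a2` sums to 1:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    exp_scores = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def feedforward(network, X):
    a1 = sigmoid(X.dot(network['W1']) + network['b1'])
    a2 = softmax(a1.dot(network['W2']) + network['b2'])
    return {'a1': a1, 'a2': a2}

np.random.seed(0)
network = {'W1': np.random.randn(784, 5) * 0.01, 'b1': np.zeros((1, 5)),
           'W2': np.random.randn(5, 10) * 0.01, 'b2': np.zeros((1, 10))}
X = np.random.rand(4, 784)  # 4 fake "images" with pixel values in [0, 1]
a2 = feedforward(network, X)['a2']
print(a2.shape)                           # (4, 10)
print(np.allclose(a2.sum(axis=1), 1.0))   # True
```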
After defining the forward propagation algorithm, we need to define the loss function. The loss function measures how well the network is performing. We will use the cross-entropy loss. Let $a^{[2]}$ be the output of the neural network and $Y$ the actual labels of the dataset, where $m$ is the number of training examples. The cost function $J$ can be formulated as follows:
\[J(a^{[2]}, Y) = -\frac{1}{m} \sum_{i=1}^{m} \left[ Y^{(i)} \log(a^{[2](i)}) + (1 - Y^{(i)}) \log(1 - a^{[2](i)}) \right]\]The cross-entropy loss is commonly used for classification problems. It measures the “distance” between the predicted probabilities $a^{[2]}$ and the actual labels $Y$, with the aim of minimizing this distance:
def compute_cost(a2, Y):
m = Y.shape[0] # number of examples
log_probs = np.multiply(np.log(a2), Y) + np.multiply((1 - Y), np.log(1 - a2))
cost = -np.sum(log_probs) / m
return cost
The `compute_cost` function takes the output of the network (`a2`) and the true labels (`Y`) and computes the cross-entropy loss.
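To make the formula concrete, here is a small worked example (numbers chosen purely for illustration): for a single example with predicted probabilities $[0.7, 0.2, 0.1]$ and true one-hot label $[1, 0, 0]$, the cost is $-(\ln 0.7 + \ln 0.8 + \ln 0.9) \approx 0.685$:

```python
import numpy as np

def compute_cost(a2, Y):
    m = Y.shape[0]  # number of examples
    log_probs = np.multiply(np.log(a2), Y) + np.multiply((1 - Y), np.log(1 - a2))
    return -np.sum(log_probs) / m

a2 = np.array([[0.7, 0.2, 0.1]])
Y = np.array([[1.0, 0.0, 0.0]])
cost = compute_cost(a2, Y)
print(round(cost, 5))  # 0.68518
```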
Next, we need to implement the backpropagation algorithm. This is the process of computing the gradients of the loss function with respect to the network parameters. These gradients are used to update the parameters during training.
The backpropagation process calculates the gradient of the cost function with respect to each parameter in the network. The steps are as follows:
1. Calculate the error in the output layer ($dz^{[2]}$):
\[dz^{[2]} = a^{[2]} - Y\]
2. Calculate the gradients for the output layer (Layer 2) weights ($dW^{[2]}$) and biases ($db^{[2]}$):
\[dW^{[2]} = \frac{1}{m} a^{[1]T} dz^{[2]}\] \[db^{[2]} = \frac{1}{m} \sum dz^{[2]}\]
3. Calculate the error in the hidden layer ($dz^{[1]}$):
\[dz^{[1]} = dz^{[2]} W^{[2]T} * a^{[1]} * (1 - a^{[1]})\]
where $*$ denotes element-wise multiplication. Here, the operation $a^{[1]} * (1 - a^{[1]})$ applies the derivative of the sigmoid function (since sigmoid was used as the activation function in the hidden layer), which is necessary for the chain rule in backpropagation.
4. Calculate the gradients for the hidden layer (Layer 1) weights ($dW^{[1]}$) and biases ($db^{[1]}$):
\[dW^{[1]} = \frac{1}{m} X^T dz^{[1]}\] \[db^{[1]} = \frac{1}{m} \sum dz^{[1]}\]
The gradients ($dW^{[1]}$, $db^{[1]}$, $dW^{[2]}$, $db^{[2]}$) are then used to update the network’s parameters, aiming to minimize the cost function through gradient descent.
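The identity $dz^{[2]} = a^{[2]} - Y$ in step 1 is the well-known gradient of the softmax combined with the categorical cross-entropy term $-\sum Y \log a^{[2]}$. As a sketch (a tiny hand-made example, not code from the original post), we can confirm it numerically with central differences:

```python
import numpy as np

def softmax(z):
    exp_scores = np.exp(z - np.max(z))
    return exp_scores / np.sum(exp_scores)

z = np.array([0.2, -0.5, 1.0])   # pre-activation of the output layer
Y = np.array([0.0, 1.0, 0.0])    # one-hot true label

def cost(z):
    # categorical cross-entropy term of the loss
    return -np.sum(Y * np.log(softmax(z)))

# numerical gradient via central differences:
eps = 1e-6
num_grad = np.array([
    (cost(z + eps * np.eye(3)[i]) - cost(z - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])

analytic = softmax(z) - Y  # the dz2 used in backpropagation
print(np.max(np.abs(num_grad - analytic)) < 1e-6)  # True
```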
This algorithm can be implemented in Python as follows:
def backpropagate(network, activations, X, Y):
m = X.shape[0]
# output from the feedforward:
a1, a2 = activations['a1'], activations['a2']
# error during output:
dz2 = a2 - Y
dW2 = (1 / m) * np.dot(a1.T, dz2)
db2 = (1 / m) * np.sum(dz2, axis=0, keepdims=True)
# error in the hidden layer:
dz1 = np.dot(dz2, network['W2'].T) * a1 * (1 - a1)
dW1 = (1 / m) * np.dot(X.T, dz1)
db1 = (1 / m) * np.sum(dz1, axis=0, keepdims=True)
# gradients:
gradients = {
'dW1': dW1,
'db1': db1,
'dW2': dW2,
'db2': db2
}
return gradients
The `backpropagate` function takes the input `X`, the true labels `Y`, and the activations from the feedforward step. As described above, it computes the gradients of the loss with respect to the parameters of the network and returns them.
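As a quick sketch (random data and deliberately small layer sizes, chosen here only for illustration; the functions mirror those defined above), we can check that each gradient returned by `backpropagate` has the same shape as the parameter it will update:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    exp_scores = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def feedforward(network, X):
    a1 = sigmoid(X.dot(network['W1']) + network['b1'])
    a2 = softmax(a1.dot(network['W2']) + network['b2'])
    return {'a1': a1, 'a2': a2}

def backpropagate(network, activations, X, Y):
    m = X.shape[0]
    a1, a2 = activations['a1'], activations['a2']
    dz2 = a2 - Y
    dW2 = (1 / m) * np.dot(a1.T, dz2)
    db2 = (1 / m) * np.sum(dz2, axis=0, keepdims=True)
    dz1 = np.dot(dz2, network['W2'].T) * a1 * (1 - a1)
    dW1 = (1 / m) * np.dot(X.T, dz1)
    db1 = (1 / m) * np.sum(dz1, axis=0, keepdims=True)
    return {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}

np.random.seed(0)
network = {'W1': np.random.randn(6, 4) * 0.01, 'b1': np.zeros((1, 4)),
           'W2': np.random.randn(4, 3) * 0.01, 'b2': np.zeros((1, 3))}
X = np.random.rand(5, 6)
Y = np.eye(3)[np.random.randint(0, 3, 5)]  # random one-hot labels
grads = backpropagate(network, feedforward(network, X), X, Y)
for p in ['W1', 'b1', 'W2', 'b2']:
    assert grads['d' + p].shape == network[p].shape
print("all gradient shapes match")
```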
The last step is to define the main training loop. This loop iterates over the training data and updates the parameters of the network using the gradients computed by the backpropagation algorithm:
def update_parameters(network, gradients, learning_rate):
network['W1'] -= learning_rate * gradients['dW1']
network['b1'] -= learning_rate * gradients['db1']
network['W2'] -= learning_rate * gradients['dW2']
network['b2'] -= learning_rate * gradients['db2']
def train_network(network, X_train, Y_train, X_val, Y_val, num_iterations=1000, learning_rate=0.1):
train_costs = []
val_costs = []
for i in range(num_iterations):
# training:
train_activations = feedforward(network, X_train)
train_cost = compute_cost(train_activations['a2'], Y_train)
train_costs.append(train_cost)
gradients = backpropagate(network, train_activations, X_train, Y_train)
update_parameters(network, gradients, learning_rate)
# validation:
val_activations = feedforward(network, X_val)
val_cost = compute_cost(val_activations['a2'], Y_val)
val_costs.append(val_cost)
if i % 100 == 0:
print(f"Costs after iteration {i}: Training {train_cost}, Validation {val_cost}")
return train_costs, val_costs
The `train_network` function takes the network parameters, the training data, the validation data, and the hyperparameters of the training process. It iterates over the training data, computes the loss, and updates the parameters of the network using the gradients. It also computes the loss on the validation data and stores the training and validation costs for later analysis.
We can also define a function to make predictions using the trained network. This will help us later to further evaluate the performance of the network on the test data:
def predict(network, X):
activations = feedforward(network, X)
predictions = np.argmax(activations['a2'], axis=1)
return predictions
Now that we have defined the neural network and the training process, we can load the MNIST dataset and train the network. We will use the `keras` library to load the dataset (the only time we use `keras` in this example):
from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()
Here are some images along with their true labels:
We need to flatten the input data and normalize it, as the network expects the input to be a vector of pixel values ranging between 0 and 1:
X_train_flat = X_train.reshape(X_train.shape[0], -1) / 255.
X_test_flat = X_test.reshape(X_test.shape[0], -1) / 255.
We also need to convert the labels to “one-hot” format, as the output layer of the network requires a binary vector for each label. The “one-hot” conversion converts, e.g., the label 3 to [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]:
def convert_to_one_hot(Y, C):
Y = np.eye(C)[Y.reshape(-1)]
return Y
num_classes = 10 # number of classes (digits 0-9) in the MNIST dataset
y_train_one_hot = convert_to_one_hot(y_train, num_classes)
y_test_one_hot = convert_to_one_hot(y_test, num_classes)
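For example, here is a minimal usage sketch of the conversion; taking the `argmax` of each one-hot row recovers the original labels:

```python
import numpy as np

def convert_to_one_hot(Y, C):
    return np.eye(C)[Y.reshape(-1)]

labels = np.array([3, 0, 9])
one_hot = convert_to_one_hot(labels, 10)
print(one_hot[0])                  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
print(np.argmax(one_hot, axis=1))  # [3 0 9] -- argmax recovers the labels
```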
Now we can train the network using the training data. As hyperparameters, we will use 1000 iterations and a learning rate of 0.5. You can experiment with different hyperparameters to see how they affect the performance of the network:
# set the hyperparameters:
epochs = 1000 # here, epochs equals the number of iterations since we use the entire dataset for each iteration (no mini-batch)
learning_rate = 0.5
# train the network:
train_costs, val_costs = train_network(network, X_train_flat,
y_train_one_hot, X_test_flat, y_test_one_hot,
num_iterations=epochs, learning_rate=learning_rate)
Training the network will take some time, depending on the number of iterations, the learning rate, and the machine you are using. Running the training loop will give you an output similar to the following:
Costs after iteration 0: Training 3.250864945250842, Validation 3.250411860389282
Costs after iteration 100: Training 2.7007775438283583, Validation 2.6767576337216497
Costs after iteration 200: Training 1.8507350583935092, Validation 1.8343369824764788
Costs after iteration 300: Training 1.487676493296171, Validation 1.472138991701572
Costs after iteration 400: Training 1.2611294493021457, Validation 1.2442551904888854
Costs after iteration 500: Training 1.1039540285064955, Validation 1.0883506980077577
Costs after iteration 600: Training 1.0002200773938172, Validation 0.98698917788099
Costs after iteration 700: Training 0.9303917677855934, Validation 0.9192967613186466
Costs after iteration 800: Training 0.8812264423101561, Validation 0.8718480425948801
Costs after iteration 900: Training 0.8447855877156378, Validation 0.8368302438646806
After training, you can plot the training and validation costs to see how the network’s performance changes over time:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))
plt.plot(train_costs, label='Training Loss')
plt.plot(val_costs, label='Validation Loss')
plt.xlabel('iteration')
plt.ylabel('loss')
plt.title('Training and validation loss curves')
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.legend()
plt.tight_layout()
plt.show()
The plot shows you how the training and validation costs change over the course of the training process. You can use this information to tune the hyperparameters of the network and improve its performance. Since the loss curves are decreasing and the validation loss is close to the training loss, we can assume that the network is learning effectively for the chosen hyperparameters.
In order to further validate the performance of the network, we can make predictions on the test data and compute the accuracy of the network:
# predict the test data:
predictions = predict(network, X_test_flat)
# calculate the accuracy:
actual_labels = y_test
accuracy = np.mean(predictions == actual_labels)
print(f'Accuracy of the network on the test data: {accuracy * 100:.2f}%')
Accuracy of the network on the test data: 86.69%
The accuracy of 86.69% is a good result for our simple feedforward neural network. Feel free to play around with different hyperparameters to see if you can improve the performance further.
To see how our model predicts the test data, we can plot some of the test images along with their predicted labels:
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
ax.imshow(X_test[i], cmap='gray')
ax.set_title(f'predicted label: {predictions[i]}')
ax.axis('off')
plt.show()
As we can see, the network is able to correctly classify most of the test images. Some digits are more difficult to classify than others, but overall the network is doing a good job.
I was actually surprised how easy it is to build a simple feedforward neural network from scratch just using NumPy. It also ran quite fast on my machine. The network was able to achieve a good accuracy on the MNIST dataset, which is a good starting point for further experimentation. You can experiment with different network architectures, hyperparameters, and training strategies to improve the performance of the network. You can also try training the network on other datasets to see how it performs on different types of data. Introducing mini-batch training, regularization, and other techniques to improve the performance of the network would also be a good next step.
I hope this post has given a good starting point for a better intuition about how neural networks work and what is happening under the hood when using libraries like keras or TensorFlow. If you have any comments or ideas for other experiments, feel free to share them in the comments below. I would be happy to hear from you.
The complete code used in this post can be found in this GitHub repositoryꜛ.
Mastodonꜛ, an open-source and decentralized social network, provides an excellent platform for constructive and meaningful conversations. Traditional comment systems can sometimes limit the depth and breadth of interactions. By contrast, Mastodon’s nature as a social media network inherently encourages more dynamic communication, facilitating enriching dialogue among like-minded people. And since anyone can join the network and log in from both desktop and mobile devices, it also becomes much easier to engage with the blog and its content.
And there are further invaluable benefits for the readers. Since Mastodon prioritizes user privacy and control, the users have more control over their posts and interactions. User data isn’t shared with third parties, and users have full control over their data, including the ability to export and delete it at any time. Traditional comment systems often fall short in this. Furthermore, users don’t have to create a new account on another external platform just to post a comment on the blog, provided they already have a Mastodon account. I believe the less friction there is, the more likely people are to engage.
To engage with the new system, you will need a Mastodon account. The setup processꜛ is straightforward, and you can join any public server (i.e., a Mastodon community) or even start your own. I have created a list of useful links to help you get started.
The new comment system will work as follows. Each blog post will be associated with a corresponding post on my Mastodon account. To comment on a blog post, you simply reply to the corresponding Mastodon post. The link to the Mastodon post will be displayed at the bottom of each blog post. Your reply will then appear both as a comment below the blog post and in your Mastodon feed, further extending your reach and connectivity.
On Mastodon, you will have the option to edit your post afterwards or even delete it. You will also receive a notification when someone replies to your comment. This way, you can stay up to date with the latest posts and discussions.
I actually took inspiration from the blogs of Carl Schwanꜛ, Julian Fietkauꜛ, David Revoyꜛ, and Cassidy James Blaedeꜛ, each of whom developed their own individual solution for including Mastodon comments on their static website. Each solution provides a slightly different approach to integrating the Mastodon comment system; I encourage you to check them out.
I decided to go for the solution of Cassidy Jamesꜛ. His solution avoids clicking on a button to load the comments, thus further reduces friction for the readers. You can read a full description of his solution on his blogꜛ.
In the following, I describe how I implemented his solution on my static Jekyll website using the Minimal Mistakes theme, just to demonstrate how you could implement it on your own website.
The core of the integration is an HTML file containing Liquid and JavaScript commands. You can find the original source of the code in Cassidy James’ GitHub repositoryꜛ. I slightly modified that code to fit my needs. I additionally added the option to check whether a Mastodon post is associated with the blog post and to display a message if no Mastodon post is found. You can find the modified version hereꜛ. I only made some minor changes to the original script, so it doesn’t matter which version you use. To integrate the code into your Jekyll website using the Minimal Mistakes theme, just place the HTML file into the `_includes` folder of your Jekyll project.
The folder `_layouts` contains the layout files for your website. The file `single.html` is the default layout file for your blog posts. To tell your website to use the new comment system, you have to change or add the following code to that file:
<div class="page__mycomments" id="comments">
{% if page.comments %}
<hr>
{% include mastodon-comments.html url=page.url %}
{% endif %}
</div>
Change `mastodon-comments.html` to the name of the HTML file you placed in the `_includes` folder in the previous step.
Repeat this step for all other layout files you want to use the new comment system with.
The `_config.yml` file contains all the configuration variables for your website. To use the new comment system, you have to change or add the following variables therein:
comments:
provider: false # turn off Minimal Mistakes' default comment system
# your Mastodon host (the server where you have the Mastodon account):
host: sigmoid.social
# vanity domain (optional); host will be used if omitted:
domain: sigmoid.social
# your Mastodon username:
username: pixeltracker
# API token to fetch more than 60 replies to any given blog post (optional):
token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# verified users (optional):
verified:
- XY@XY.XY
# default id (serves as a fallback if no id is specified in front-matter):
id: 110645586336088202
With the first setting, we turn off the default comment system of the Minimal Mistakes theme by setting `provider` to `false`.
`host` and `username` are the minimum required variables. `host` is the server where you have your Mastodon account. `username` is your Mastodon username. The HTML script will use these variables to fetch the corresponding Mastodon post.
`domain` is optional and can be used if you have a vanity domain. If you don’t have a vanity domain, just omit this variable.
`token` is optional and can be used to fetch more than 60 replies to any given blog post. Cassidy James found out that the Mastodon API limits the number of replies to 60. If you want to fetch more than 60 replies, you have to create an application or a bot account with the `read:statuses` scope. You can find a description of how to do this in the Mastodon documentationꜛ. If you don’t want to fetch more than 60 replies, just omit this variable. The application/bot-account bypass is necessary to avoid giving access to your private statuses to anyone who visits your website. You can read more about this issue in this Mastodon postꜛ by Cassidy James. I have not yet tried this, so I can’t provide any further information on this point.
`verified` is optional and can be used to add additional verified usernames (e.g., other authors or known friends).
`id` is optional and serves as a fallback if no id is specified in the YAML front matter of a blog post. This is another addition I made to the original HTML script. How to find the ID of a Mastodon post is described in the next section.
In each blog post for which you want to enable comments, you have to add the following variables to the YAML front matter:
comments:
id: 110645586336088202
#host: sigmoid.social
#user: pixeltracker
`id` is the ID of the corresponding Mastodon post. You can find the ID in the URL of the Mastodon post. For example, the ID of the Mastodon post
https://sigmoid.social/@pixeltracker/110645586336088202
is `110645586336088202`.
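If you want to automate this step, the ID is simply the last path segment of the post URL. A minimal sketch in Python (the helper below is hypothetical, not part of the comment system):

```python
def mastodon_post_id(url: str) -> str:
    """Extract the post ID (the last path segment) from a Mastodon post URL."""
    return url.rstrip("/").split("/")[-1]

print(mastodon_post_id("https://sigmoid.social/@pixeltracker/110645586336088202"))
# 110645586336088202
```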
You can overwrite the variables `host` and `user` previously defined in the `_config.yml` file for each blog post. This is useful if you want to use a different Mastodon account for a specific blog post or if your blog has more than one author.
Last but not least, we need to style the new comment section. Cassidy James’ SCSS file is an excellent starting point! You can find the original source in his GitHub repositoryꜛ. I put my slightly modified version hereꜛ. To use the provided SCSS code in your Minimal Mistakes theme, just replace the `.comment` part in the `_sass/minimal-mistakes/_page.scss` file with the content of the SCSS file.
And that’s it! Following these five steps, you should be able to integrate the new comment system into your Jekyll website using the Minimal Mistakes theme.
And also a huge thanks to Cassidy James for providing this solution!
Since I haven’t posted about all my blog posts on Mastodon in the past, I will add the missing posts in the next couple of weeks. During the transition period, I will temporarily link the comments of blog posts that do not yet have a linked counterpart on Mastodon to a dummy post. This way, you can already test the new comment system.
I’m curious about your feedback. You can try the new comment system already by replying to this post in the comment section below. Feel free to share your thoughts and suggestions.
Traditional GANs lack control over the types of images they generate. In contrast, conditional Generative Adversarial Networks (cGANs) enable the control over the output of the generator. cGANs are an extension of the original GAN framework where both the generator and discriminator are conditioned on some additional information $y$. This information can be any kind of auxiliary information, such as class labels or data from other modalities. By conditioning the model on additional information, it can be directed to generate data with specific attributes.
The cGAN framework comprises two key components: the generator $G$ and the discriminator $D$. The generator is tasked with creating synthetic data, while the discriminator works as a classifier to distinguish between real and synthetic data. The generator $G$ takes in a latent vector $z$ and the condition $y$, and generates data $G(z, y)$. The discriminator $D$ receives either real data $x$ and the condition $y$, or synthetic data and the condition $y$, and outputs a score $D(x, y)$ representing the authenticity of the received data.
The training process of a cGAN is similar to that of a regular GAN, except for the inclusion of the conditional vector at both the generator and discriminator levels. The process consists of a two-player minimax game where the generator attempts to fool the discriminator by generating synthetic data as close as possible to the real data, while the discriminator tries to distinguish between the real and synthetic data. The objective function of a cGAN is defined as follows:
\[\min_G \max_D V(D, G) = E_{x\sim p_{data}(x)}[\log D(x|y)] + E_{z\sim p_z(z)}[\log(1 - D(G(z|y)))]\]Here, $E$ represents the expectation, $p_{data}(x)$ is the true data distribution, and $p_z(z)$ is the input noise distribution. $x\sim p_{data}(x)$ means that $x$ is drawn from the true data distribution $p_{data}(x)$, and $z\sim p_z(z)$ means that $z$ is drawn from the input noise distribution $p_z(z)$.
During the training process, the weights of the generator and discriminator are updated alternately. First, the discriminator’s weights are updated while keeping the generator’s weights fixed, and then the generator’s weights are updated while keeping the discriminator’s weights fixed.
The code base of the following example comes from this Keras tutorialꜛ.
We start with the imports and defining the constants and hyperparameters. `num_channels` refers to the number of color channels, which is 1 for grayscale images like MNIST. `num_classes` is the number of distinct classes (in our case 10, for the 10 digits), and `latent_dim` represents the size of the random noise vector used for generating images:
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import imageio
# %% Constants and hyperparameters
batch_size = 64
num_channels = 1
num_classes = 10
image_size = 28
latent_dim = 128
Next, we load the MNIST dataset and preprocess it. The images are normalized to the range $[0, 1]$, reshaped to ensure they have a channel dimension, and their labels are one-hot encodedꜛ. The dataset is then shuffled and batched:
# use all the available examples from both the training and test sets:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
all_digits = np.concatenate([x_train, x_test])
all_labels = np.concatenate([y_train, y_test])
# scale the pixel values to [0, 1] range, add a channel dimension to
# the images, and one-hot encode the labels:
all_digits = all_digits.astype("float32") / 255.0
all_digits = np.reshape(all_digits, (-1, 28, 28, 1))
all_labels = keras.utils.to_categorical(all_labels, 10)
# create tf.data.Dataset:
dataset = tf.data.Dataset.from_tensor_slices((all_digits, all_labels))
dataset = dataset.shuffle(buffer_size=1024).batch(batch_size)
print(f"Shape of training images: {all_digits.shape}")
print(f"Shape of training labels: {all_labels.shape}")
Then, we define the generator and discriminator models. The generator takes a noise vector and a class label as input, merges them, and generates an image. The discriminator takes an image and a class label as input, merges them, and classifies whether the image is real or fake:
# calculating the number of input channels for the generator and discriminator:
generator_in_channels = latent_dim + num_classes
discriminator_in_channels = num_channels + num_classes
print(generator_in_channels, discriminator_in_channels)
# create the discriminator:
discriminator = keras.Sequential(
[keras.layers.InputLayer((28, 28, discriminator_in_channels)),
layers.Conv2D(64, (3, 3), strides=(2, 2), padding="same"),
layers.LeakyReLU(alpha=0.2),
layers.Conv2D(128, (3, 3), strides=(2, 2), padding="same"),
layers.LeakyReLU(alpha=0.2),
layers.GlobalMaxPooling2D(),
layers.Dense(1)],
name="discriminator")
# create the generator:
generator = keras.Sequential(
[keras.layers.InputLayer((generator_in_channels,)),
# we want to generate 128 + num_classes coefficients to reshape into a
# 7x7x(128 + num_classes) map:
layers.Dense(7 * 7 * generator_in_channels),
layers.LeakyReLU(alpha=0.2),
layers.Reshape((7, 7, generator_in_channels)),
layers.Conv2DTranspose(128, (4, 4), strides=(2, 2), padding="same"),
layers.LeakyReLU(alpha=0.2),
layers.Conv2DTranspose(128, (4, 4), strides=(2, 2), padding="same"),
layers.LeakyReLU(alpha=0.2),
layers.Conv2D(1, (7, 7), padding="same", activation="sigmoid")],
name="generator")
Now we are ready to define the cGAN model:
class ConditionalGAN(keras.Model):
def __init__(self, discriminator, generator, latent_dim):
super().__init__()
self.discriminator = discriminator
self.generator = generator
self.latent_dim = latent_dim
self.gen_loss_tracker = keras.metrics.Mean(name="generator_loss")
self.disc_loss_tracker = keras.metrics.Mean(name="discriminator_loss")
@property
def metrics(self):
return [self.gen_loss_tracker, self.disc_loss_tracker]
def compile(self, d_optimizer, g_optimizer, loss_fn):
super().compile()
self.d_optimizer = d_optimizer
self.g_optimizer = g_optimizer
self.loss_fn = loss_fn
def train_step(self, data):
# unpack the data:
real_images, one_hot_labels = data
# add dummy dimensions to the labels so that they can be concatenated with
# the images:
# this is for the discriminator:
image_one_hot_labels = one_hot_labels[:, :, None, None]
image_one_hot_labels = tf.repeat(
image_one_hot_labels, repeats=[image_size * image_size])
image_one_hot_labels = tf.reshape(
image_one_hot_labels, (-1, image_size, image_size, num_classes))
# sample random points in the latent space and concatenate the labels:
# this is for the generator:
batch_size = tf.shape(real_images)[0]
random_latent_vectors = tf.random.normal(shape=(batch_size, self.latent_dim))
random_vector_labels = tf.concat(
[random_latent_vectors, one_hot_labels], axis=1)
# decode the noise (guided by labels) to fake images:
generated_images = self.generator(random_vector_labels)
# combine them with real images. Note that we are concatenating the labels
# with these images here:
fake_image_and_labels = tf.concat([generated_images, image_one_hot_labels], -1)
real_image_and_labels = tf.concat([real_images, image_one_hot_labels], -1)
combined_images = tf.concat(
[fake_image_and_labels, real_image_and_labels], axis=0)
# assemble labels discriminating real from fake images:
labels = tf.concat(
[tf.ones((batch_size, 1)), tf.zeros((batch_size, 1))], axis=0)
# train the discriminator:
with tf.GradientTape() as tape:
predictions = self.discriminator(combined_images)
d_loss = self.loss_fn(labels, predictions)
grads = tape.gradient(d_loss, self.discriminator.trainable_weights)
self.d_optimizer.apply_gradients(
zip(grads, self.discriminator.trainable_weights))
# sample random points in the latent space:
random_latent_vectors = tf.random.normal(shape=(batch_size, self.latent_dim))
random_vector_labels = tf.concat(
[random_latent_vectors, one_hot_labels], axis=1)
# assemble labels that say "all real images":
misleading_labels = tf.zeros((batch_size, 1))
# train the generator:
with tf.GradientTape() as tape:
fake_images = self.generator(random_vector_labels)
fake_image_and_labels = tf.concat([fake_images, image_one_hot_labels], -1)
predictions = self.discriminator(fake_image_and_labels)
g_loss = self.loss_fn(misleading_labels, predictions)
grads = tape.gradient(g_loss, self.generator.trainable_weights)
self.g_optimizer.apply_gradients(zip(grads, self.generator.trainable_weights))
# monitor loss:
self.gen_loss_tracker.update_state(g_loss)
self.disc_loss_tracker.update_state(d_loss)
return {
"g_loss": self.gen_loss_tracker.result(),
"d_loss": self.disc_loss_tracker.result()}
Finally, we initiate the cGAN model and train it:
cond_gan = ConditionalGAN(
discriminator=discriminator, generator=generator, latent_dim=latent_dim)
cond_gan.compile(
d_optimizer=keras.optimizers.Adam(learning_rate=0.0003),
g_optimizer=keras.optimizers.Adam(learning_rate=0.0003),
loss_fn=keras.losses.BinaryCrossentropy(from_logits=True))
cond_gan.fit(dataset, epochs=20)
# save the model weights for later use:
cond_gan.save_weights('cGAN_model_weights_MNIST')
At the end, we extract the trained generator from the trained cGAN:
trained_gen = cond_gan.generator
We will generate new images by interpolating between two classes of digits (start_class and end_class). Starting from one class, we gradually shift the class label towards the second class while generating images, which yields a smooth transition from one digit to the other. We do this in order to create an animation of the interpolated images later. The animation gives a better impression of how the GAN gradually transforms images of one digit into another and offers valuable insight into how the GAN captures and manipulates the underlying data distribution.
First, we need to set the number of intermediate images generated during the interpolation and sample some noise for it:
num_interpolation = 50
# sample noise for the interpolation:
interpolation_noise = tf.random.normal(shape=(1, latent_dim))
interpolation_noise = tf.repeat(interpolation_noise, repeats=num_interpolation)
interpolation_noise = tf.reshape(interpolation_noise, (num_interpolation, latent_dim))
Next, we define the interpolation function that generates the images:
def interpolate_class(first_number, second_number):
    # Convert the start and end labels to one-hot encoded vectors.
    first_label = keras.utils.to_categorical([first_number], num_classes)
    second_label = keras.utils.to_categorical([second_number], num_classes)
    first_label = tf.cast(first_label, tf.float32)
    second_label = tf.cast(second_label, tf.float32)
    # Calculate the interpolation vector between the two labels.
    percent_second_label = tf.linspace(0, 1, num_interpolation)[:, None]
    percent_second_label = tf.cast(percent_second_label, tf.float32)
    interpolation_labels = (
        first_label * (1 - percent_second_label) + second_label * percent_second_label)
    # Combine the noise and the labels and run inference with the generator.
    noise_and_labels = tf.concat([interpolation_noise, interpolation_labels], 1)
    fake = trained_gen.predict(noise_and_labels)
    return fake
Now we are ready to generate the interpolated images and save them as a GIF:
start_class = 6
end_class = 1
fake_images = interpolate_class(start_class, end_class)
fake_images *= 255.0
converted_images = fake_images.astype(np.uint8)
converted_images = tf.image.resize(converted_images, (96, 96)).numpy().astype(np.uint8)
imageio.mimsave("animation.gif", converted_images[:,:,:,0])
Here is the resulting GIF:
and two snapshots from the animation, highlighting the quality of the generated start and end image:
The results are already quite impressive, considering the simplicity of the model. To further improve the model performance and stability and to shorten the time to convergence, we could consider converting the cGAN into a conditioned Wasserstein GAN (WGAN). You can find out more about how to implement Wasserstein GANs in this post.
In summary, conditional Generative Adversarial Networks (cGANs) are a powerful class of generative models. They augment the GAN architecture with the capacity to conditionally generate data, thereby adding a controllable aspect to the data generation process. This makes them a highly practical, applicable and also fun tool for generating data.
The code used in this post is available in this GitHub repositoryꜛ.
If you have any questions or suggestions, feel free to leave a comment below.
In the usual setup of a Generative Adversarial Network (GAN), we have two components: a generator and a discriminator. The generator’s task is to generate samples that resemble the true data, while the discriminator’s job is to differentiate between the real and generated data. The generator and discriminator play a kind of game, where the generator is trying to fool the discriminator, and the discriminator is trying not to be fooled.
This dynamic creates a situation where the quality of the generator’s outputs is somewhat dependent on the discriminator’s performance. If the discriminator is too weak, it might not provide meaningful feedback to the generator, resulting in poor quality generated samples. If the discriminator is too strong, it might provide overly harsh feedback, leading to instability in training. This necessitates careful balancing of the training of these two components, often a challenging task.
The Wasserstein GAN (WGAN) improves upon the original GAN by using the Wasserstein distance as the loss function, which provides a smooth, differentiable metric that correlates better with the visual quality of generated samples. This model thus has the advantage of overcoming common GAN issues like mode collapse and vanishing gradients. In the previous post, we explored how to rewrite a default GAN as a Wasserstein GAN by changing the loss functions of the generator and discriminator.
On the other hand, a WGAN that computes the Wasserstein distance directly fundamentally alters this setup. This variation omits the discriminator network entirely, instead computing the Wasserstein distance directly between the distributions of real and generated data. Here, the cost matrix derived from the optimal transport problem provides an exact measure of the distance between the two distributions, serving as a direct performance metric for the generator. This direct computation offers several advantages. Firstly, it gives a more direct and accurate measure of the generator’s performance since it is based on the exact metric (Wasserstein distance) we’re interested in minimizing. Secondly, it simplifies the training process by removing the need for a discriminator, thereby eliminating the challenge of balancing the training of two adversarial components. Lastly, by directly computing and minimizing the Wasserstein distance, the generator learns to model the true data distribution more robustly and stably. Of course, the efficacy of this approach can depend on the specific application and the dimensionality of the data.
Let’s see how we can implement the described alternative approach in Python. The code base of the following example comes from this tutorialꜛ of the Python Optimal Transport (POT) library. It uses minibatches to optimize the Wasserstein distance between the real and generated data distributions at each iteration.
For the sake of simplicity, we will use a cross-like distribution as target distribution. The generator will learn to imitate this distribution by generating samples that follow the same distribution. We will use the ot.emd2()
function from the POT library to compute the Wasserstein distance between the real and generated data distributions. This function implements the Earth Mover’s Distance (EMD) algorithm, which solves the optimal transport problem and returns the optimal transport cost matrix. This cost matrix provides an exact measure of the distance between the two distributions, serving as a direct performance metric for the generator.
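To build intuition for what ot.emd2() computes, here is a small, self-contained sketch (plain NumPy only, not the POT library) of the empirical p-Wasserstein distance in one dimension, where the optimal transport plan reduces to the monotone matching of sorted samples:

```python
import numpy as np

def wasserstein_1d(u, v, p=1):
    """Empirical p-Wasserstein distance between two equal-sized 1-D samples.
    In 1D, sorting both samples and matching them in order is the optimal
    transport plan, so no linear program is needed."""
    u, v = np.sort(u), np.sort(v)
    return np.mean(np.abs(u - v) ** p) ** (1.0 / p)

a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 3.0])
print(wasserstein_1d(a, b))  # every unit of mass moves by 1 -> 1.0
```

In higher dimensions this shortcut no longer applies, which is exactly why ot.emd2() solves the full optimal transport problem on a cost matrix.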
Let’s start by importing the necessary libraries and generating the target distribution:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import torch
from torch import nn
import ot
# generate the target distribution:
torch.manual_seed(1)
sigma = 0.1
n_dims = 2
n_features = 2
def get_data(n_samples):
    # set the thickness of the cross:
    thickness = 0.2
    # half samples from vertical line, half from horizontal:
    x_vert = torch.randn(n_samples // 2, 2) * sigma
    x_vert[:, 0] *= thickness  # shrink the x-spread of the vertical line
    x_horiz = torch.randn(n_samples // 2, 2) * sigma
    x_horiz[:, 1] *= thickness  # shrink the y-spread of the horizontal line
    x = torch.cat((x_vert, x_horiz), 0)
    return x
# plot the distributions
plt.figure(figsize=(5, 5))
x = get_data(500)
plt.figure(1)
plt.scatter(x[:, 0], x[:, 1], label='Data samples from $\mu_d$', alpha=0.5)
plt.title('Data distribution')
plt.legend()
Here is what the target distribution looks like:
Next, we define the generator model. The generator is a simple multilayer perceptron (MLP), consisting of three fully connected layers with ReLU activation functions. It takes a random noise vector and outputs a two-dimensional vector, intended to replicate the data distribution:
# define the MLP model:
class Generator(torch.nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.fc1 = nn.Linear(n_features, 200)
        self.fc2 = nn.Linear(200, 500)
        self.fc3 = nn.Linear(500, n_dims)
        self.relu = torch.nn.ReLU()  # instead of Heaviside step fn

    def forward(self, x):
        output = self.fc1(x)
        output = self.relu(output)  # instead of Heaviside step fn
        output = self.fc2(output)
        output = self.relu(output)
        output = self.fc3(output)
        return output
We train the generator using a gradient descent optimization algorithm, RMSprop. In each training iteration, we sample a batch of noise vectors, pass them through the generator, draw a batch of real data samples, compute the pairwise distance matrix between generated and real samples, and evaluate the loss with ot.emd2(), which calculates the EMD between the two sets of samples.

This approach deviates significantly from the typical training regimen of a GAN, where a discriminator model is trained alongside the generator, and their losses are jointly optimized. Here, the generator’s performance is gauged directly on the Wasserstein distance, providing an exact measure of how close the generator’s distribution is to the target distribution.
G = Generator()
optimizer = torch.optim.RMSprop(G.parameters(), lr=0.00019, eps=1e-5)
# number of iteration and size of the batches:
n_iter = 200 # set to 200 for doc build but 1000 is better ;)
size_batch = 500
# generate static samples to see their trajectory along training:
n_visu = 100
xnvisu = torch.randn(n_visu, n_features)
xvisu = torch.zeros(n_iter, n_visu, n_dims)
ab = torch.ones(size_batch) / size_batch
losses = []
for i in range(n_iter):
    # generate noise samples:
    xn = torch.randn(size_batch, n_features)
    # generate data samples:
    xd = get_data(size_batch)
    # generate samples along iterations:
    xvisu[i, :, :] = G(xnvisu).detach()
    # generate samples and compute the distance matrix:
    xg = G(xn)
    M = ot.dist(xg, xd)
    loss = ot.emd2(ab, ab, M)
    losses.append(float(loss.detach()))
    if i % 10 == 0:
        print("Iter: {:3d}, loss={}".format(i, losses[-1]))
    optimizer.zero_grad()  # reset accumulated gradients before the backward pass
    loss.backward()
    optimizer.step()
    del M
The rest of the code is dedicated to visualizing the results. For the sake of brevity, I will not show these parts here, but you can find the full code in the GitHub repository mentioned at the end of this post.
Here are the results of the generator’s training:
The performance of the generator is evaluated by monitoring the Wasserstein distance along the iterations. As expected, we see this distance decreasing over time, indicating that the generated distribution is progressively converging to the real data distribution. Furthermore, the snapshots of the generated samples show that the generator successfully learns to mimic the cross-like distribution of the real data.
Since I fell in love with the animation of the generator’s training, I also ran the script on two further target distributions: a sinusoidal distribution,
def get_data(n_samples):
    # Generates a 2D dataset of samples forming a sine wave with noise.
    x = torch.linspace(-np.pi, np.pi, n_samples).view(-1, 1)
    y = torch.sin(x) + sigma * torch.randn(n_samples, 1)
    data = torch.cat((x, y), 1)
    data_sample_name = 'sine'
    return data, data_sample_name
and a circular distribution,
def get_data(n_samples):
    # Generates a 2D dataset of samples forming a circle with noise.
    c = torch.rand(size=(n_samples, 1))
    angle = c * 2 * np.pi
    x = torch.cat((torch.cos(angle), torch.sin(angle)), 1)
    x += torch.randn(n_samples, 2) * sigma
    data_sample_name = 'circle'
    return x, data_sample_name
The latter is from the original POT documentation tutorial. Here are the training results:
The advantage of the demonstrated approach lies in its direct computation of the Wasserstein distance using optimal transport methods, resulting in a more straightforward and intuitive understanding of how the generator improves over time. It abstains from the need for a discriminator network and the challenge of balancing its training with the generator’s. Consequently, it results in a stable and robust generative model that directly optimizes the very metric (Wasserstein distance) that WGANs were designed to improve.
In conclusion, I think employing the ot.emd2()
function to compute the Wasserstein distance provides an insightful perspective on the WGAN framework. By focusing on the Wasserstein distance directly, it allows for an intuitive understanding of the generator’s performance and mitigates several challenges associated with traditional GAN training. Despite the evident differences in the approaches, both offer valuable insights into the functioning and benefits of Wasserstein GANs, and their choice is influenced by the problem specifics and computational resources.
The code used in this post is available in this GitHub repositoryꜛ.
If you have any questions or suggestions, feel free to leave a comment below.
Generative Adversarial Networks (GANs), introduced by Goodfellow et al. in 2014ꜛ, are a class of generative models that have gained significant popularity in recent years. GANs are a type of unsupervised learning model that can learn to generate data samples that follow the true data distribution. They are typically used to generate images, videos, and audio samples.
GANs consist of two principal components: the generator and the discriminator. The generator tries to generate data samples that follow the true data distribution, while the discriminator’s role is to differentiate between real and generated data samples.
The loss function in GANs is defined based on the game-theoretic concept of a two-player minimax gameꜛ. The generator aims to minimize the function, whereas the discriminator strives to maximize it. Let $G$ be the generator and $D$ the discriminator. The value function $V(G, D)$ for the minimax game is as follows:
\[\min_G \max_D V(D, G) = E_{x\sim p_{\mathrm{data}}(x)}[\log D(x)] + E_{z\sim p_z(z)}[\log(1 - D(G(z)))]\]

Here, the first term of the value function represents the expected value of the log-probability of the discriminator correctly classifying a real sample. The second term indicates the expected value of the log-probability of the discriminator incorrectly classifying a sample from the generator.
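For a fixed generator $G$, the inner maximization over $D$ has a closed-form solution; this is a standard result from the original GAN paper, sketched here for completeness:

```latex
% Optimal discriminator for a fixed generator G (Goodfellow et al., 2014):
D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}
% Substituting D^* back into the value function shows that the generator
% effectively minimizes
% 2\,\mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g) - \log 4,
% i.e. the Jensen-Shannon divergence between the data and model distributions.
```

This divergence saturates when the two distributions barely overlap, which is one root of the vanishing-gradient problem discussed next.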
Despite the theoretical elegance and practical success of GANs, they are prone to issues such as mode collapse, vanishing gradients, and unstable training caused by the delicate balance between generator and discriminator.
To mitigate the issues related to GANs, Arjovsky et al.ꜛ introduced Wasserstein GANs (WGANs) in 2017. WGANs leverage the concept of the Earth Mover’s (Wasserstein) distance to measure the distance between the real and generated distributions.
The fundamental innovation in WGAN is the replacement of the standard GAN loss function with the Wasserstein loss function. This change offers a more stable training process, primarily because the Wasserstein distance provides meaningful and smooth gradients almost everywhere.
The WGAN value function is defined as follows:
\[\min_G \max_D V(D, G) = E_{x\sim p_{\mathrm{data}}(x)}[D(x)] - E_{z\sim p_z(z)}[D(G(z))]\]

Note that to ensure the Lipschitz continuityꜛ of the discriminator function, WGANs use weight clippingꜛ or gradient penaltyꜛ.
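With the gradient penalty variant (used later in this post), the discriminator loss gains an extra term that softly enforces the Lipschitz constraint; as a sketch:

```latex
% WGAN-GP discriminator loss (Gulrajani et al., 2017), penalty weight \lambda (typically 10):
L_D = E_{z\sim p_z(z)}[D(G(z))] - E_{x\sim p_{\mathrm{data}}(x)}[D(x)]
      + \lambda \, E_{\hat{x}}\!\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right]
% where \hat{x} is sampled uniformly along straight lines between
% real and generated samples.
```

Minimizing $L_D$ corresponds to maximizing the score gap in the value function above while keeping the discriminator's gradient norm close to 1.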
From a performance perspective, WGANs typically generate higher quality samples compared to standard GANs, especially with a lower number of training epochs. Moreover, WGANs have shown better resistance to mode collapseꜛ and provide more stable and reliable training convergence.
However, GANs and WGANs both have their unique strengths and potential use cases. Standard GANs are relatively simple to understand and implement, and they have a wide range of variations and extensions for diverse applications. On the other hand, WGANs, with their theoretical robustness, offer an excellent solution to the typical problems encountered in GANs, making them suitable for applications where model stability is crucial.
Before we dive into the implementation of WGANs, let’s first take a look at a standard GAN. We will use the MNIST datasetꜛ of handwritten digits for this task. The dataset consists of 60,000 training images and 10,000 test images. Each image is a 28x28 grayscale image of a handwritten digit.
We use the code from the TensorFlow tutorial on DCGANꜛ as a starting point. DCGAN stands for Deep Convolutional Generative Adversarial Network, proposed by Radford et al. in 2015ꜛ. It is a type of GAN that uses convolutional layers in the discriminator and generator networks. I just modified the code in such a way, that it also stores and plots the average loss of the generator and discriminator during the training process.
The code starts with importing the necessary libraries and loading the MNIST dataset:
import tensorflow as tf
import glob
import imageio
import matplotlib.pyplot as plt
import numpy as np
import PIL
from tensorflow.keras import layers
import time
from IPython import display
import os
# check whether GAN_images folder is already there, otherwise create it:
if not os.path.exists('GAN_images'):
    os.makedirs('GAN_images')
# %% LOAD DATA AND DEFINE MODEL PARAMETERS
(train_images, train_labels), (_, _) = tf.keras.datasets.mnist.load_data()
train_images = train_images.reshape(train_images.shape[0], 28, 28, 1).astype('float32')
train_images = (train_images - 127.5) / 127.5 # Normalize the images to [-1, 1]
BUFFER_SIZE = 60000
BATCH_SIZE = 256
# batch and shuffle the data:
train_dataset = tf.data.Dataset.from_tensor_slices(train_images).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
The training data is then batched and shuffled. The batch size is set to 256 and the buffer size to 60,000. The buffer size is the number of elements from the dataset from which the new dataset will sample.
Next, we define the generator and discriminator models. The generator uses the tf.keras.layers.Conv2DTranspose
(upsampling) layers to produce an image from a seed (random noise):
def make_generator_model():
    model = tf.keras.Sequential()
    model.add(layers.Dense(7*7*256, use_bias=False, input_shape=(100,)))
    model.add(layers.BatchNormalization())
    model.add(layers.LeakyReLU())
    model.add(layers.Reshape((7, 7, 256)))
    assert model.output_shape == (None, 7, 7, 256)  # Note: None is the batch size
    model.add(layers.Conv2DTranspose(128, (5, 5), strides=(1, 1), padding='same', use_bias=False))
    assert model.output_shape == (None, 7, 7, 128)
    model.add(layers.BatchNormalization())
    model.add(layers.LeakyReLU())
    model.add(layers.Conv2DTranspose(64, (5, 5), strides=(2, 2), padding='same', use_bias=False))
    assert model.output_shape == (None, 14, 14, 64)
    model.add(layers.BatchNormalization())
    model.add(layers.LeakyReLU())
    model.add(layers.Conv2DTranspose(1, (5, 5), strides=(2, 2), padding='same', use_bias=False, activation='tanh'))
    assert model.output_shape == (None, 28, 28, 1)
    return model
generator = make_generator_model()
noise = tf.random.normal([1, 100])
generated_image = generator(noise, training=False)
plt.imshow(generated_image[0, :, :, 0], cmap='gray')
The discriminator is a CNN-based image classifier:
def make_discriminator_model():
    model = tf.keras.Sequential()
    model.add(layers.Conv2D(64, (5, 5), strides=(2, 2), padding='same',
                            input_shape=[28, 28, 1]))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.3))
    model.add(layers.Conv2D(128, (5, 5), strides=(2, 2), padding='same'))
    model.add(layers.LeakyReLU())
    model.add(layers.Dropout(0.3))
    model.add(layers.Flatten())
    model.add(layers.Dense(1))
    return model
discriminator = make_discriminator_model()
decision = discriminator(generated_image)
print(decision)
The loss functions for the generator and discriminator are defined as follows:
# this method returns a helper function to compute cross entropy loss:
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# discriminator loss:
def discriminator_loss(real_output, fake_output):
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    total_loss = real_loss + fake_loss
    return total_loss

# generator loss:
def generator_loss(fake_output):
    return cross_entropy(tf.ones_like(fake_output), fake_output)
We use cross-entropy loss for both the generator and discriminator. The discriminator loss function compares the discriminator’s predictions on real images to an array of 1s, and the discriminator’s predictions on fake (generated) images to an array of 0s. The generator loss function compares the discriminator’s predictions on fake images to an array of 1s.
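As a small illustration (plain NumPy, with hypothetical logits that are not part of the original code): when the discriminator is confident and correct, its own loss is small while the generator's loss is large:

```python
import numpy as np

def bce_from_logits(labels, logits):
    # numerically stable binary cross-entropy on raw logits,
    # mirroring BinaryCrossentropy(from_logits=True)
    return np.mean(np.maximum(logits, 0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))

real_output = np.array([3.0, 2.0])    # hypothetical logits for real images
fake_output = np.array([-2.0, -3.0])  # hypothetical logits for fake images

# discriminator: compare real logits to 1s, fake logits to 0s
d_loss = bce_from_logits(np.ones(2), real_output) + bce_from_logits(np.zeros(2), fake_output)
# generator: wants the fakes classified as real (1s)
g_loss = bce_from_logits(np.ones(2), fake_output)
print(d_loss, g_loss)  # small d_loss, large g_loss
```

As the generator improves, the fake logits rise, shrinking g_loss and growing d_loss until the two settle into a balance.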
As optimizer, we use the Adam optimizer with a learning rate of 0.0001:
generator_optimizer = tf.keras.optimizers.Adam(1e-4)
discriminator_optimizer = tf.keras.optimizers.Adam(1e-4)
# save checkpoints:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(generator_optimizer=generator_optimizer,
discriminator_optimizer=discriminator_optimizer,
generator=generator,
discriminator=discriminator)
The main training loop is defined as follows:
noise_dim = 100
num_examples_to_generate = 16
seed = tf.random.normal([num_examples_to_generate, noise_dim])
gen_losses = []
disc_losses = []
avg_gen_losses_per_epoch = []
avg_disc_losses_per_epoch = []
# This annotation causes the function to be "compiled".
@tf.function
def train_step(images):
    noise = tf.random.normal([BATCH_SIZE, noise_dim])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        generated_images = generator(noise, training=True)
        real_output = discriminator(images, training=True)
        fake_output = discriminator(generated_images, training=True)
        gen_loss = generator_loss(fake_output)
        disc_loss = discriminator_loss(real_output, fake_output)
    gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
    gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.trainable_variables))
    return gen_loss, disc_loss
def train(dataset, epochs):
    for epoch in range(epochs):
        start = time.time()
        for image_batch in dataset:
            gen_loss, disc_loss = train_step(image_batch)
            gen_losses.append(gen_loss)
            disc_losses.append(disc_loss)
        # calculate average generator and discriminator loss for the current epoch:
        avg_gen_loss_this_epoch = np.mean(gen_losses)
        avg_disc_loss_this_epoch = np.mean(disc_losses)
        # append these averages to our new lists:
        avg_gen_losses_per_epoch.append(avg_gen_loss_this_epoch)
        avg_disc_losses_per_epoch.append(avg_disc_loss_this_epoch)
        # clear the lists for the next epoch:
        gen_losses.clear()
        disc_losses.clear()
        # produce images for the GIF as you go:
        display.clear_output(wait=True)
        generate_and_save_images(generator, epoch + 1, seed)
        # save the model every 15 epochs:
        if (epoch + 1) % 15 == 0:
            checkpoint.save(file_prefix=checkpoint_prefix)
        print('Time for epoch {} is {} sec'.format(epoch + 1, time.time() - start))
    # generate after the final epoch:
    display.clear_output(wait=True)
    generate_and_save_images(generator, epochs, seed)
    return avg_gen_losses_per_epoch, avg_disc_losses_per_epoch
We also define a function to generate and save images:
def generate_and_save_images(model, epoch, test_input):
    # Note that `training` is set to False. This is so all layers run in
    # inference mode (batchnorm).
    predictions = model(test_input, training=False)
    fig = plt.figure(figsize=(4, 4))
    for i in range(predictions.shape[0]):
        plt.subplot(4, 4, i + 1)
        plt.imshow(predictions[i, :, :, 0] * 127.5 + 127.5, cmap='gray')
        plt.axis('off')
    # annotate the figure with the epoch number:
    plt.suptitle(f"Epoch: {epoch}", fontsize=16)
    plt.savefig('GAN_images/image_at_epoch_{:04d}.png'.format(epoch))
    plt.show()
Finally, we train the model for 50 epochs:
# define the training parameters:
EPOCHS = 50
avg_gen_losses_per_epoch, avg_disc_losses_per_epoch = train(train_dataset, EPOCHS)
After the training, we create a GIF of the generated images and plot the average generator and discriminator loss as a function of the training epochs:
# restore the latest checkpoint:
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
# display a single image using the epoch number:
def display_image(epoch_no):
    return PIL.Image.open('GAN_images/image_at_epoch_{:04d}.png'.format(epoch_no))

display_image(EPOCHS)
anim_file = 'GAN_images/depp_conv_gan.gif'
with imageio.get_writer(anim_file, mode='I') as writer:
    filenames = glob.glob('GAN_images/image*.png')
    filenames = sorted(filenames)
    for filename in filenames:
        image = imageio.imread(filename)
        writer.append_data(image)
    # append the last frame once more so the GIF lingers on the final epoch:
    image = imageio.imread(filename)
    writer.append_data(image)
# plot losses:
plt.figure(figsize=(10,5))
plt.title("Average Generator and Discriminator Loss During Training")
plt.plot(avg_gen_losses_per_epoch,label="Generator")
plt.plot(avg_disc_losses_per_epoch,label="Discriminator")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
Here is the resulting GIF, showing the generated images during the training process:
And the according loss of the generator and discriminator:
The GAN needs nearly the full range of 50 epochs to converge, producing images that resemble handwritten digits only towards the end of the training process. The difference between the generator and discriminator losses is large at the beginning and diminishes slowly over time. Further improvements could be achieved by increasing the number of training epochs or tuning the hyperparameters. However, for now we will leave it at that and turn to the implementation of the WGAN, just to see how it performs in comparison to the GAN under the given conditions.
To implement the Wasserstein GAN (WGAN), we use the same code as above. We will implement the WGAN with gradient penalty (WGAN-GP). The main modifications are: switching the optimizers to RMSprop, keeping a linear (no sigmoid) output in the discriminator, replacing the cross-entropy losses with the Wasserstein loss, adding a gradient penalty term to the discriminator loss, and running multiple discriminator updates per generator update.
Here are the corresponding changes in the code:
# Changing to RMSprop optimizers with learning rate 0.00005 (typical values for WGANs):
generator_optimizer = tf.keras.optimizers.RMSprop(0.00005)
discriminator_optimizer = tf.keras.optimizers.RMSprop(0.00005)
# Changing the last layer of the discriminator to have a linear activation:
def make_discriminator_model():
    model = tf.keras.Sequential()
    model.add(layers.Conv2D(64, (5, 5), strides=(2, 2), padding='same',
                            input_shape=[28, 28, 1]))
    model.add(layers.LeakyReLU())
    model.add(layers.Conv2D(128, (5, 5), strides=(2, 2), padding='same'))
    model.add(layers.LeakyReLU())
    model.add(layers.Flatten())
    model.add(layers.Dense(1))  # no activation here
    return model
# Changing to Wasserstein loss:
def discriminator_loss(real_output, fake_output):
    real_loss = -tf.reduce_mean(real_output)
    fake_loss = tf.reduce_mean(fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    return -tf.reduce_mean(fake_output)
# Adding gradient penalty (GP) for the discriminator:
def gradient_penalty(real_images, fake_images):
    alpha = tf.random.uniform(shape=[real_images.shape[0], 1, 1, 1], minval=0., maxval=1.)
    diff = fake_images - real_images
    interpolated = real_images + alpha * diff
    with tf.GradientTape() as gp_tape:
        gp_tape.watch(interpolated)
        pred = discriminator(interpolated, training=True)
    grads = gp_tape.gradient(pred, [interpolated])[0]
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]))
    gp = tf.reduce_mean((norm - 1.)**2)
    return gp
# Including the gradient penalty in the training step:
@tf.function
def train_step(real_images):
    noise = tf.random.normal([BATCH_SIZE, noise_dim])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        generated_images = generator(noise, training=True)
        real_output = discriminator(real_images, training=True)
        fake_output = discriminator(generated_images, training=True)
        gen_loss = generator_loss(fake_output)
        disc_loss = discriminator_loss(real_output, fake_output)
        gp = gradient_penalty(real_images, generated_images)
        disc_loss += gp * 10  # the gradient penalty weight is typically set to 10
    gen_gradients = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_gradients = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    discriminator_optimizer.apply_gradients(zip(disc_gradients, discriminator.trainable_variables))
    generator_optimizer.apply_gradients(zip(gen_gradients, generator.trainable_variables))
    return gen_loss, disc_loss
# Modifying the train function to perform 5 discriminator updates per generator update:
def train(dataset, epochs):
    for epoch in range(epochs):
        start = time.time()
        for image_batch in dataset:
            for _ in range(5):  # 5 update steps per batch (note: this simple variant updates both networks each time)
                gen_loss, disc_loss = train_step(image_batch)
            gen_losses.append(gen_loss)
            disc_losses.append(disc_loss)
        # rest of the training loop remains the same...
This is a basic implementation of a WGAN-GP. For full details and variations of this algorithm, please refer to the original paper, “Improved Training of Wasserstein GANs” by Gulrajani et al., 2017ꜛ. The complete code can be found in the GitHub repository linked at the end of this post.
Here is the resulting GIF:
And the according loss of the generator and discriminator:
The WGAN converges much faster than the GAN. Already after 5 to 7 epochs, the WGAN is able to generate images that resemble handwritten digits. However, the behavior of the loss curves in the WGAN is quite interesting and distinct from the standard GAN. The loss of the discriminator, starting at around -2.5, converges to 0 after 7 epochs, while the loss of the generator keeps improving over the 50 epochs from around -11 to -8, i.e., the gap between the discriminator and generator loss continuously diminishes. To properly understand this, let’s take a closer look at the loss functions used in both models.
In a standard GAN, the discriminator loss is the binary cross-entropy loss, which tries to correctly classify real and fake (generated) samples. This loss is always positive, with a value of zero indicating perfect classification. The generator loss, also a binary cross-entropy loss, tries to ‘fool’ the discriminator into misclassifying fake samples as real. This loss is minimized when the generator can fool the discriminator most of the time.
In contrast, a WGAN replaces the cross-entropy loss functions with a Wasserstein loss function. This is defined as the difference between the average scores assigned by the discriminator to the real and fake samples.
So, for the generator loss (which tries to minimize this difference), when the generator starts getting good at generating realistic samples, the discriminator’s scores for the real and fake samples will be close, making the difference (and hence the loss) small.
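A tiny numeric sketch (plain NumPy, with hypothetical critic scores) of how these Wasserstein losses behave:

```python
import numpy as np

# hypothetical critic (discriminator) scores for a batch;
# higher score = "more real"
real_scores = np.array([2.0, 1.5, 2.5])
fake_scores = np.array([-1.0, -0.5, -1.5])

# Wasserstein losses, as defined in the WGAN code above:
d_loss = -np.mean(real_scores) + np.mean(fake_scores)  # critic maximizes the score gap
g_loss = -np.mean(fake_scores)                         # generator pushes fake scores up
print(d_loss, g_loss)  # -3.0 1.0

# as the generator improves, fake scores approach the real scores
# and both losses shrink in magnitude:
better_fakes = np.array([1.8, 1.4, 2.2])
print(-np.mean(real_scores) + np.mean(better_fakes))  # close to 0
```

The negative sign of d_loss here simply reflects the score gap between real and fake batches, which is the point elaborated in the next paragraphs.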
Coming to the aspect of the observed negative losses in WGAN, this is an inherent feature of the Wasserstein loss. The discriminator in a WGAN is trained to maximize the difference between the average scores for the real and fake samples ($\max_D [E[D(x)] - E[D(G(z))]]$). This means the discriminator tries to assign higher scores to real samples compared to the fake ones.
During the early stages of training, when the generator isn’t producing very realistic samples, the discriminator can easily differentiate and assign significantly lower scores to the fake samples, resulting in a larger difference and hence a larger (negative) loss for the generator.
As the generator improves and starts to produce more realistic samples, the discriminator finds it harder to differentiate between the real and fake samples. The scores for real and fake samples get closer, and the difference (and hence the generator’s loss) reduces in magnitude. This is why the generator’s loss improves from around -11 to -8.
In essence, the negative value of the loss in WGAN is not a sign of something wrong but rather a characteristic feature of the Wasserstein loss function.
The Wasserstein distance that WGAN uses for its loss function provides smooth and meaningful gradients almost everywhere. This is a key advantage of the WGAN and makes the training process more stable. As a result, WGANs often converge faster than standard GANs.
Moreover, unlike the original GANs, the training of WGAN doesn’t involve a balancing act between the generator and the discriminator. The two networks are not competing in a zero-sum game, but rather cooperating to minimize a common loss function. This change results in a stable training process, even if the discriminator is temporarily winning or losing. This aspect could explain the quick convergence of the discriminator’s loss to zero in our case.
In conclusion, the behavior of the loss curves in WGAN and the resulting faster convergence underline the practical advantages of using Wasserstein loss and its utility in training stable and efficient generative models.
Wasserstein Generative Adversarial Networks (WGANs) represent a significant advancement in the field of generative models. Their unique features make them an excellent choice for a variety of applications requiring the generation of realistic data.
The most distinctive advantage of WGANs lies in their utilization of the Wasserstein distance in their loss function. This fundamentally changes the training dynamics of generative models, addressing several limitations associated with the traditional GANs.
In summary, WGANs have proven to be a significant advancement in the field of generative models. Their unique features make them an excellent choice for working with generative models. However, it is important to note that WGANs are not a panacea for all the problems associated with GANs. They have their own limitations and are not always the best choice for every application. Nevertheless, they are a powerful tool.
The code used in this post is available in this GitHub repositoryꜛ.
If you have any questions or suggestions, feel free to leave a comment below.
The Wasserstein distance, also known as the Earth Mover’s Distance (EMD), quantifies the minimum ‘cost’ required to transform one probability distribution into another. If we visualize our distributions as heaps of soil spread out over a landscape, the Wasserstein distance gives the minimum amount of work needed to reshape the first heap into the second. In mathematical terms, for two probability measures $\mu$ and $\nu$ on $\mathbb{R}^{d}$, the p-Wasserstein distance is defined as:
\[W_{p}(\mu, \nu) = \left(\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^{d} \times \mathbb{R}^{d}} ||x-y||^{p} d\gamma(x, y)\right)^{\frac{1}{p}}\]where $\Gamma(\mu, \nu)$ denotes the set of all joint distributions on $\mathbb{R}^{d} \times \mathbb{R}^{d}$ whose marginals are respectively $\mu$ and $\nu$ on the first and second factors.
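For $p=1$ and two one-dimensional empirical distributions with equally many samples, the infimum has a simple closed form: the optimal coupling matches sorted samples, so $W_1$ is just the mean absolute difference between the sorted samples. A minimal sketch, with sample values made up for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

a = np.array([0.0, 1.0, 3.0])
b = np.array([1.0, 2.0, 4.0])

# for equal-sized 1D sample sets, W_1 is the mean absolute
# difference between the sorted samples
w1 = np.mean(np.abs(np.sort(a) - np.sort(b)))
print(w1)                          # 1.0
print(wasserstein_distance(a, b))  # 1.0, same result via scipy
```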
The Kullback-Leibler Divergence (KL Divergence), also known as relative entropy, is a measure of how one probability distribution diverges from a second, expected probability distribution. It is widely used in information theory to measure the ‘information loss’ when one distribution is used to approximate another. For two discrete probability distributions $P$ and $Q$, the KL Divergence is defined as:
\[D_{KL}(P||Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}\]In the case of continuous distributions, the summation is replaced by integration:
\[D_{KL}(P || Q) = \int P(x) \log \left(\frac{P(x)}{Q(x)}\right) \, dx\]One key characteristic of the KL Divergence is that it is not symmetric: $D_{KL}(P||Q) \neq D_{KL}(Q||P)$.
When $D_{KL}(P||Q)$ is 0, it indicates that the two distributions are identical. Larger values imply greater dissimilarity between the distributions.
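A direct NumPy computation of the discrete KL Divergence makes the asymmetry visible. The two toy distributions below are made up for illustration:

```python
import numpy as np

P = np.array([0.4, 0.6])
Q = np.array([0.5, 0.5])

# discrete KL Divergence (natural logarithm)
kl_pq = np.sum(P * np.log(P / Q))
kl_qp = np.sum(Q * np.log(Q / P))
print(kl_pq)  # ~0.0201, D_KL(P||Q)
print(kl_qp)  # ~0.0204, D_KL(Q||P) -- note the asymmetry
```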
The KL Divergence derives from the definition of entropy. Using the property of logarithms, we can rewrite the term $\log \frac{P(x_i)}{Q(x_i)}$ as:
\[\log \frac{P(x_i)}{Q(x_i)} = \log P(x_i) - \log Q(x_i)\]Substituting this into the KL Divergence formula:
\[\begin{align*} D_{KL}(P || Q) &= \sum_{i} P(x_i) \log \frac{P(x_i)}{Q(x_i)} \\ & = \sum_{i} P(x_i) (\log P(x_i) - \log Q(x_i)) \\ & = \sum_{i} P(x_i) \log P(x_i) - \sum_{i} P(x_i) \log Q(x_i) \end{align*}\]The first term is the negative entropy $-H(P)$ of $P$, and the second term is the negative cross-entropy $-H(P, Q)$ between $P$ and $Q$ (with $Q$ as the reference distribution). Therefore, we can rewrite the KL Divergence as:
\[D_{KL}(P || Q) = H(P, Q) - H(P)\]The Jensen-Shannon Divergence (JS Divergence) is a method of measuring the similarity between two probability distributions. It is symmetric, unlike the KL Divergence, from which it is derived. The JS Divergence between two discrete distributions $P$ and $Q$ is defined as:
\[JSD(P||Q) = \frac{1}{2} D_{KL}(P||M) + \frac{1}{2} D_{KL}(Q||M)\]where $M$ is the average of $P$ and $Q$, i.e., $M = \frac{1}{2}(P+Q)$. The JS Divergence is bounded between 0 and 1 when the base-2 logarithm is used (between 0 and $\ln 2$ for the natural logarithm). A JS Divergence of 0 indicates that the two distributions are identical, while the upper bound indicates that the two distributions are completely dissimilar.
The Total Variation Distance (TV Distance) provides a simple and intuitive metric of the difference between two probability distributions. It is often used in statistical hypothesis testing and quantum information. For two discrete probability distributions $P$ and $Q$, the TV Distance is given by:
\[D_{TV}(P,Q) = \frac{1}{2} \sum_{i} |P(x_i) - Q(x_i)|\]$P(x_i)$ and $Q(x_i)$ are the probabilities of the random variable $X$ taking the value $x_i$ for distributions $P$ and $Q$, respectively. The TV Distance is bounded between 0 and 1, where 0 indicates identical distributions, and 1 indicates completely dissimilar distributions.
The Bhattacharyya coefficient measures the overlap between two probability distributions. It is commonly used in various fields, including statistics, pattern recognition, and image processing, and has applications in tasks such as image matching, feature extraction, and clustering, where it helps measure the similarity between feature distributions.
For two discrete probability distributions $P(x)$ and $Q(x)$, defined over the same set of events or random variables $x$, the Bhattacharyya coefficient $BC(P, Q)$ is defined as the sum of the square root of the product of the probabilities of corresponding events:
\[BC(P, Q) = \sum_{i} \sqrt{P(x_i) \cdot Q(x_i)}\]For continuous probability distributions, the sum is replaced by an integral:
\[BC(P, Q) = \int \sqrt{P(x) \cdot Q(x)} \, dx\]The Bhattacharyya coefficient ranges from 0 to 1. A value of 0 indicates no overlap between the two distributions (completely dissimilar), while a value of 1 indicates complete overlap (identical distributions).
The Bhattacharyya coefficient is used to compute the Bhattacharyya distance $D_B(P, Q)$, which is obtained by taking the negative logarithm of the coefficient:
\[D_B(P, Q) = -\log(BC(P, Q))\]The Bhattacharyya distance is also commonly used to compare probability distributions and quantify their dissimilarity.
To gain a practical understanding of how these distance metrics behave, we can generate samples from two-dimensional Gaussian distributions and compute these distances as we vary the parameters of the distributions:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import entropy
from scipy.stats import wasserstein_distance
from sklearn.neighbors import KernelDensity
# generate two uniform, random sample sets, with adjustable mean and std:
mean1, std1 = 0, 1 # Distribution 1 parameters (mean and standard deviation)
mean2, std2 = 1, 1 # Distribution 2 parameters (mean and standard deviation)
n = 1000
np.random.seed(0) # Set random seed for reproducibility
sample1 = np.random.normal(mean1, std1, n)
np.random.seed(10) # Set random seed for reproducibility
sample2 = np.random.normal(mean2, std2, n)
# plot the two samples:
plt.figure(figsize=(6,3))
plt.plot(sample1, label='Distribution 1', lw=1.5)
plt.plot(sample2, label='Distribution 2', linestyle="--", lw=1.5)
plt.title('Samples')
plt.legend()
plt.show()
The two sample sets, generated with different random seeds, are not identical, but quite similar. To calculate the distances, we need to estimate the probability distributions from the samples. One way to do this is by using kernel density estimation (KDE):
# calculate KDE for the samples:
x = np.linspace(-5, 7, 1000) # X values for KDE
kde_sample1 = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(sample1[:, None])
pdf_sample1 = np.exp(kde_sample1.score_samples(x[:, None]))
kde_sample2 = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(sample2[:, None])
pdf_sample2 = np.exp(kde_sample2.score_samples(x[:, None]))
# we normalize the distributions to make sure they sum to 1:
pdf_sample1 /= np.sum(pdf_sample1)
pdf_sample2 /= np.sum(pdf_sample2)
# plot the distributions:
plt.figure(figsize=(6,3))
plt.plot(pdf_sample1, label='Distribution 1', lw=2.5)
plt.plot(pdf_sample2, label='Distribution 2', linestyle="--", lw=2.5)
plt.title('Probability Distributions')
plt.legend()
plt.show()
Now, we can compute the distances between the distributions:
# calculate the Wasserstein distance:
wasserstein_dist = wasserstein_distance(x, x, pdf_sample1, pdf_sample2)
print(f"Wasserstein Distance: {wasserstein_dist}")
# calculate the KL divergence between the two distributions:
epsilon = 1e-12
kl_divergence = entropy(pdf_sample1+epsilon, pdf_sample2+epsilon)
print(f"KL Divergence: {kl_divergence}")
# calculate the average distribution M:
pdf_avg = 0.5 * (pdf_sample1 + pdf_sample2)
# calculate the Jensen-Shannon divergence:
kl_divergence_p_m = entropy(pdf_sample1, pdf_avg)
kl_divergence_q_m = entropy(pdf_sample2, pdf_avg)
js_divergence = 0.5 * (kl_divergence_p_m + kl_divergence_q_m)
print(f"Jensen-Shannon Divergence: {js_divergence}")
# calculate the Total Variation distance:
tv_distance = 0.5 * np.sum(np.abs(pdf_sample1 - pdf_sample2))
print(f"Total Variation Distance: {tv_distance}")
# calculate the Bhattacharyya distance:
bhattacharyya_coefficient = np.sum(np.sqrt(pdf_sample1 * pdf_sample2))
bhattacharyya_distance = -np.log(bhattacharyya_coefficient)
print(f"Bhattacharyya Distance: {bhattacharyya_distance}")
Wasserstein Distance: 1.0307000718747261
KL Divergence: 0.6004132992619513
Jensen-Shannon Divergence: 0.12323025153055375
Total Variation Distance: 0.41218145262070593
Bhattacharyya Distance: 0.14151124108902025
To get a better impression of how these metrics behave, we can plot them as a function of the parameters of the distributions. Let’s begin by varying the mean of one of the distributions:
As we shift the two distributions apart from each other, all metrics behave differently:
The Wasserstein distance increases almost linearly within the calculated range, i.e., it responds to the growing shift between the two otherwise identical distributions with a constant rate of increase.
The KL Divergence increases more rapidly, indicating a higher sensitivity to the shift. As long as the second distribution shifts away from the first but still overlaps with it, the KL Divergence shows a steep increase. Once the second distribution moves beyond the first, the KL Divergence would become infinite due to the logarithm; we added a small $\epsilon\ll 1$ to prevent this from happening, so that we have further values to plot.
Both the JS Divergence and the TV Distance increase more slowly and converge to their respective upper bounds, indicating that the two probability distributions are becoming more and more dissimilar as their means diverge.
The Bhattacharyya coefficient starts with a value of 1, indicating that the two distributions are identical. This is not surprising, as at the beginning, the means of both distributions are equal. As the second distribution moves away from the first, the coefficient decreases until it converges to a value close to 0, indicating that the two distributions are completely dissimilar. Accordingly, the Bhattacharyya distance begins with a value of 0 and increases more rapidly as the two distributions move further apart from each other.
Now, let’s have a look at how the metrics behave when we vary the standard deviation of one of the distributions:
Let’s focus on what happens when the standard deviation of the second distribution is larger than one. Again, the metrics show increasing, but different behaviors:
The Wasserstein distance increases the fastest, indicating that it is the most sensitive to the change in the standard deviation, followed by the KL Divergence. Both show that the dissimilarity between the distributions increases as the standard deviation increases.
Both the JS Divergence and the TV Distance increase more slowly. However, this time, they converge to a much lower value, indicating a dissimilarity between the distributions, but not as strong as in the previous case.
The Bhattacharyya distance also increases more slowly and converges to a value $\ll 1$, likewise indicating that the two distributions are dissimilar, but not as strongly as in the previous case.
Each of these metrics responds differently to shifts and variances in the probability distributions, revealing their unique sensitivities and properties. The Wasserstein distance, characterized by a nearly linear increase in response to distributional shifts, shows its particular strength in identifying structural changes in the underlying distributions. This feature makes it favored when working with generative models (e.g., GANs), where the positioning of the distributions in the feature space is crucial.
The KL Divergence, with its steep increase followed by a plateau (avoiding infinity due to a small $\epsilon$ addition), reflects its sensitivity towards divergences and its potential for capturing information loss when one distribution is used to represent another. Its propensity for rapid growth makes it ideal for applications where sensitivity to divergence is vital, such as in Variational Autoencoders (VAEs).
The JS Divergence and TV Distance, converging to their upper bounds as distributions diverge, provide a more tempered and normalized measure of dissimilarity. Their slower increase suggests that these metrics are less sensitive to extreme changes, making them appropriate for scenarios that require stable measurements, such as text classification or natural language processing.
Finally, the Bhattacharyya Distance and coefficient offer a unique perspective by measuring the overlap between two statistical samples. As distribution parameters diverge, the coefficient decreases (and distance increases) at a faster pace, making it robust and valuable in tasks requiring decisive binary decisions, such as pattern recognition tasks.
Notably, all metrics show distinct responses to changes in standard deviation, reinforcing the notion that the choice of metric should be informed by the specificities of the data and the problem at hand. The Wasserstein Distance and KL Divergence proved to be the most responsive to changes in standard deviation, while the JS Divergence, TV Distance, and Bhattacharyya Distance showed slower convergence and less sensitivity to such changes.
To conclude, the appropriate choice of a probabilistic distance metric depends not just on the theoretical properties of these metrics, but more importantly on the practical implications of these properties in the context of the specific machine learning task at hand. From comparing the performance of different models, assessing the similarity of different clusters, to testing statistical hypotheses, these metrics provide a mathematical backbone that supports machine learning. Rigorous examination and understanding of these metrics’ behavior under different conditions is a key aspect of their effective utilization.
The code used in this post is available in this GitHub repositoryꜛ.
If you have any questions or suggestions, feel free to leave a comment below or reach out to me on Mastodonꜛ.
We discussed the Wasserstein distance and its mathematical foundation in some of my recent posts. To recap, the Wasserstein distance is a measure of the distance between two probability distributions that takes the geometry of the data space into account. The distance is computed by finding the optimal transport plan that transforms one distribution into the other with the minimum overall cost. The $p$-Wasserstein distance between two probability distributions $P$ and $Q$ defined on a metric space $(X, d)$ is defined as follows:
\[W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P,Q)} \int_{X \times X} d(x,y)^p d\gamma(x,y) \right)^{1/p}\]where $\Gamma(P,Q)$ is the set of all couplings of $P$ and $Q$, and $d(x,y)$ is the distance between $x$ and $y$ in $X$. $p$ controls the order of the distance metric, with $p=1$ being the 1-Wasserstein distance, $p=2$ the 2-Wasserstein distance, and so on. The 1-Wasserstein distance is also known as the Earth Mover’s distance (EMD), since it can be interpreted as the minimum amount of ‘work’ required to transform one distribution into the other.
The Wasserstein distance provides a powerful and flexible distance measure between distributions. However, it is computationally demanding to calculate, especially for high-dimensional data. See, for example, the discussion in the post on the Sinkhorn algorithm.
To mitigate the computational difficulties associated with the Wasserstein distance, the Sliced Wasserstein Distance (SWD) was introduced. This method uses the simple idea of transforming a complex high-dimensional problem into a collection of easier 1-dimensional problems.
The SWD between two probability distributions $P$ and $Q$ in $\mathbb{R}^d$ is defined as follows:
\[SWD(P, Q) = \int_{S^{d-1}} W_1(P_\theta, Q_\theta) d\theta\]where $S^{d-1}$ is the unit sphere in $\mathbb{R}^d$, $P_\theta$ and $Q_\theta$ are the 1D distributions of $P$ and $Q$ projected onto the direction $\theta$, and $W_1$ is the 1-Wasserstein distance. Essentially, the SWD slices the distributions into multiple 1D distributions and computes the 1-Wasserstein distance for each slice, then averages these distances.
Although the SWD is an approximation of the true Wasserstein distance, it’s more computationally efficient and has been found to be useful in practice, particularly for training generative models.
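To make the idea concrete, here is a minimal Monte Carlo sketch of the SWD: draw random directions on the unit sphere, project both sample sets onto each direction, and average the resulting 1D Wasserstein distances. The function name, sample data, and number of projections are my own choices for illustration; in practice, an optimized library implementation such as POT's ot.sliced_wasserstein_distance is preferable:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_wasserstein(X, Y, n_projections=100, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein distance
    between two d-dimensional sample sets X and Y."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    dists = []
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)  # random direction on S^{d-1}
        # project both sample sets onto the direction and compute W_1
        dists.append(wasserstein_distance(X @ theta, Y @ theta))
    return np.mean(dists)

rng = np.random.default_rng(42)
X = rng.normal(0, 1, size=(200, 2))
Y = rng.normal(2, 1, size=(200, 2))  # target shifted by (2, 2)
print(sliced_wasserstein(X, Y))      # positive: the sets are shifted apart
print(sliced_wasserstein(X, X))      # 0 for identical sample sets
```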
The L2 norm, also known as the Euclidean distance, is a standard measure of distance between two points in a Euclidean space. Given two vectors $x$ and $y$ in $\mathbb{R}^d$, the L2 norm is defined as:
\[\|x - y\|_2 = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}\]While the L2 norm is simpler and faster to compute than the Wasserstein or sliced Wasserstein distances, it doesn’t take into account the geometry of the data space, and so it may not accurately represent the distance between or dissimilarity of two distributions.
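In NumPy, the L2 norm between two vectors is a one-liner (the vectors below are made up for illustration):

```python
import numpy as np

x = np.array([0.0, 3.0])
y = np.array([4.0, 0.0])
# L2 norm: square root of the sum of squared component differences
print(np.sqrt(np.sum((x - y) ** 2)))  # 5.0
print(np.linalg.norm(x - y))          # 5.0, equivalent built-in
```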
To gain a practical understanding of how these distance metrics behave, we can generate samples from two-dimensional Gaussian distributions and compute these distances as we vary the parameters of the distributions.
Let’s first create two identical 2D Gaussian sample sets:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from scipy.linalg import norm
import ot
# generate two 2D gaussian samples sets:
n = 50 # nb samples
m1 = np.array([0, 0])
m2 = np.array([4, 4])
s_1 = 1
s_2 = 1
cov1 = np.array([[s_1, 0], [0, s_1]])
cov2 = np.array([[s_2, 0], [0, s_2]])
np.random.seed(0)
xs = ot.datasets.make_2D_samples_gauss(n, m1, cov1)
np.random.seed(0)
xt = ot.datasets.make_2D_samples_gauss(n, m2, cov2)
# plot the distributions:
fig = plt.figure(figsize=(5, 5))
plt.plot(xs[:, 0], xs[:, 1], '+', label=f'Source (random normal,\n $\mu$={m1}, $\sigma$={s_1})')
plt.plot(xt[:, 0], xt[:, 1], 'x', label=f'Target (random normal,\n $\mu$={m2}, $\sigma$={s_2})')
plt.legend(loc=0, fontsize=10)
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.title(f'Source and target distributions')
plt.tight_layout()
plt.show()
The means and standard deviations are adjustable, so we can vary them to see how the distances change.
Next, we calculate the standard Wasserstein distance between the two samples using the Python Optimal Transport (POT) libraryꜛ, which is based on linear programming:
# uniform weights on the sample points (required by POT):
a, b = np.ones((n,)) / n, np.ones((n,)) / n
# loss matrix:
M = np.sum((xs[:, np.newaxis, :] - xt[np.newaxis, :, :]) ** 2, axis=-1)
M /= M.max()
# transport plan:
G0 = ot.emd(a, b, M)
# Wasserstein distance:
w_dist = np.sum(G0 * M)
The sliced Wasserstein distance is calculated by first projecting the samples onto a random direction and then computing the 1-Wasserstein distance between the projected samples. We use the ot.sliced_wasserstein_distance function from the POT library, where this calculation is already implemented:
# sliced Wasserstein distance:
n_projections = 1000
a, b = np.ones((n,)) / n, np.ones((n,)) / n  # uniform distribution on samples
w_dist_sliced = ot.sliced_wasserstein_distance(xs, xt, a, b, n_projections, seed=0)
a and b represent discrete probability distributions of the source and target, respectively. Both are required when using the POT library. In our case, a and b are uniform distributions, meaning each sample point in the source and target distributions is equally likely. We use uniform distributions because we are dealing with sets of samples where there is no reason to believe any sample is more likely than any other; we assume no additional information about the distribution of the samples. In any other case, where the sample points are not equally likely, a and b would be different and not uniform. Such cases might occur, for example, when dealing with weighted samples from some underlying distribution.
Lastly, we calculate the L2 norm between the two samples:
# calculate the L2 distance (the Frobenius norm of the difference
# already takes the square root of the sum of squares):
L2_dist = norm(xs - xt)
To demonstrate how the distances change as we, e.g., vary the means, let’s compute the three metrics for three different sets of means and standard deviations and plot the corresponding distributions:
As we can see, the metrics increase as the distance between the two sets increases. However, they increase at different rates: the L2 norm increases the fastest, followed by the sliced Wasserstein distance, while the Wasserstein distance increases the slowest. To get a more comprehensive impression of how the metrics change as we vary the means and standard deviations, we can repeat the above procedure for a range of means and standard deviations. The following plots are a saved animation of the results (you can find the full animation code in the GitHub repository mentioned below):
Seeing the behavior of the metrics now on a broader range, we can observe that indeed all metrics evolve differently for increasing means of the target set.
The default Wasserstein distance shows the least steep increase and slowly converges to $\sim$1, while the sliced Wasserstein distance increases almost linearly. The reason for this lies in the differences in how the Wasserstein distance and the sliced Wasserstein distance are computed. For the Wasserstein distance, if we shift one identical distribution along any axis by a constant value, the cost of moving each “pile of earth” from the source distribution to the corresponding “hole” in the target distribution stays the same, because the mass that needs to be moved stays constant, regardless of how much the distribution is shifted. Since we normalize the cost matrix (M /= M.max(), which we do for reasons of numerical stability), the distance also stays bounded (the maximum pairwise cost is 1). The sliced Wasserstein distance, on the other hand, is computed by projecting the distributions onto random lines and computing the Wasserstein distance between these one-dimensional projections. When we shift the distributions apart, these one-dimensional Wasserstein distances will tend to increase, leading to an increase in the overall SWD. The SWD, therefore, can capture absolute location differences more sensitively, whereas the standard Wasserstein distance is more focused on shape differences. This also applies to the increase of the standard deviation of the target distribution while keeping the means fixed, as we can see in the following animation:
The explanation for the observable behavior of the Wasserstein distance and the sliced Wasserstein distance is still the same. The increase of the standard deviation affects the spread and dispersion of the data points. But, as the global structure and mass of the distributions remain the same, the optimal transport plan will again not change much, leading to only a small increase of the Wasserstein distance. The sliced Wasserstein distance, on the other hand, will increase more strongly, as the one-dimensional Wasserstein distances increase with increasing standard deviation.
The L2 norm, on the other hand, increases the fastest, as it is not concerned with the shape of the distributions, but only with the distance between the points. However, when only shifting the target distribution apart from the source distribution, this holds true only until a certain distance, after which the SWD becomes the fastest. The L2 norm measures the straight-line distance between two points in a Euclidean space. When two distributions are close to each other, small changes in their locations or shapes can result in noticeable changes in the L2 distance. Therefore, in the early stages of the animation, as the mean of the target distribution starts to move away from the source distribution, the L2 norm increases rapidly. However, as the target distribution continues to move further away from the source distribution, the L2 norm’s rate of increase tends to slow down. This is because the L2 norm is essentially measuring the straight-line distance from the source to the target, which becomes less sensitive to changes when the two distributions are already far apart.
On the other hand, the sliced Wasserstein distance considers the optimal transportation cost of transforming one distribution into another, not just the straight-line distance between their means. In the early stages, when the two distributions are close to each other, the optimal transport plan can be relatively simple and inexpensive, leading to a smaller SWD. As the target distribution continues to move away from the source, the transportation cost increases, and hence the SWD increases. However, since SWD considers the entire structure of the distributions, not just their means, it can be more sensitive to changes when the distributions are far apart. This can lead to a faster increase in SWD compared to the L2 norm in the later stages of the animation.
Our examination of the Wasserstein distance, sliced Wasserstein distance (SWD), and L2 norm reveal fundamental differences in how these metrics capture variations between distributions. Each metric’s behavior under changing parameters of the distributions provides valuable insight into their optimal applications in machine learning tasks:
The Wasserstein distance, characterized by its insensitivity to translation and focus on shape differences, is best suited to tasks where the overall shape of the distribution is paramount, such as in generative models like Generative Adversarial Networks (GANs). It provides a more robust comparison by considering the minimal cost of transforming one distribution into another, ignoring simple translations.
The sliced Wasserstein distance, with its sensitivity to both location and dispersion changes, can be more beneficial in tasks requiring an understanding of both shape and location changes. For instance, in outlier detection or tasks that demand a nuanced understanding of how distributions can vary.
The L2 norm, due to its high sensitivity to point-to-point distances, is often most applicable in tasks like regression or clustering, where Euclidean distances between data points are of primary interest.
In conclusion, choosing the correct distance metric should be based on the specific requirements of the underlying task, considering the importance of shape, location, dispersion, and point-to-point distances in the data.
The code used in this post is available in this GitHub repositoryꜛ.
If you have any questions or suggestions, feel free to leave a comment below.
Let’s recap: Given two 1D distributions $P$ and $Q$, the first Wasserstein distance is defined as:
\[W_1(P, Q) = \inf_{\gamma \in \Gamma} \sum_{i,j} \gamma_{i,j} \cdot c_{i,j}\]where $\Gamma$ is the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $P$ and $Q$, and $c_{i,j}$ is the cost function, typically the absolute difference between $i$ and $j$.
The approximation method presented here calculates the cumulative distribution function (CDF) of the two distributions and then computes the area between these two CDFs. This area can be interpreted as the total “work” done to transform one distribution into the other, which is the essence of the Wasserstein distance.
Given the CDFs $F_P$ and $F_Q$ of the two distributions $P$ and $Q$, the total work is calculated as:
\[\text{Total work} = \int |F_P(x) - F_Q(x)| dx\]This integral can be approximated by a sum over discrete $x$ values:
\[\text{Total work} \approx \sum_{i} |F_P(x_i) - F_Q(x_i)| \cdot \Delta x_i\]where $\Delta x_i = x_{i+1} - x_i$ is the distance between the $x$ values.
Let’s apply the method to two distributions using Python. First, we’ll generate two discrete normally distributed sample sets. For ease of illustration, the sets are randomly generated but identical (for both sets, np.random.seed() is reset to the same value). However, the target set is shifted by one unit against the source set:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import cumfreq
from scipy.stats import wasserstein_distance
from scipy.stats import norm
from scipy.interpolate import interp1d
import ot
# generate two 1D gaussian samples:
n=1000
x=np.linspace(-10, 10, n)
m1 = 0
m2 = 1
s1 = 1
s2 = 1
np.random.seed(2)
dist1 = norm.rvs(loc=m1, scale=s1, size=n)
np.random.seed(2)
dist2 = norm.rvs(loc=m2, scale=s2, size=n)
# plot the distributions:
plt.figure(figsize=(7, 3))
plt.plot(x, dist1, label=f"source ($\mu$={m1}, $\sigma$={s1})", alpha=1.00)
plt.plot(x, dist2, label=f"target ($\mu$={m2}, $\sigma$={s2})", alpha=0.55)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.tight_layout()
plt.show()
Next, we compute the CDFs of the two sets:
# compute the CDFs:
a = cumfreq(dist1, numbins=100)
b = cumfreq(dist2, numbins=100)
# compute the x-values for the CDFs:
x_a = a.lowerlimit + np.linspace(0, a.binsize*a.cumcount.size, a.cumcount.size)
x_b = b.lowerlimit + np.linspace(0, b.binsize*b.cumcount.size, b.cumcount.size)
We need to interpolate the CDFs to the same $x$ values to be able to calculate the area between them:
# interpolate the CDFs to the same x-values:
f_a = interp1d(x_a, a.cumcount / a.cumcount[-1])
f_b = interp1d(x_b, b.cumcount / b.cumcount[-1])
x_common = np.linspace(max(x_a[0], x_b[0]), min(x_a[-1], x_b[-1]), 1000)
cdf_a_common = f_a(x_common)
cdf_b_common = f_b(x_common)
To get an idea of what the underlying distributions look like, we can calculate and plot the probability density functions (PDFs) of the two distributions:
# approximate the PDFs by differencing the interpolated CDFs:
pdf_a = np.diff(cdf_a_common)
pdf_b = np.diff(cdf_b_common)
# plot the PDFs:
plt.figure(figsize=(7, 3))
plt.plot(pdf_a, label='source PDF')
plt.plot(pdf_b, label='target PDF')
plt.ylabel('probability density')
plt.legend()
plt.tight_layout()
plt.show()
And the according CDFs:
# plot the CDFs:
plt.figure(figsize=(5.5, 5))
plt.plot(x_common, cdf_a_common, label='source CDF')
plt.plot(x_common, cdf_b_common, label='target CDF')
# plot the absolute difference between the CDFs:
plt.fill_between(x_common, cdf_a_common, cdf_b_common, color='gray', alpha=0.5, label='absolute difference')
plt.ylabel('cumulative frequency')
plt.legend()
plt.tight_layout()
plt.show()
The grey shaded area indicates the absolute difference between the two CDFs. It represents the total work needed to transform the source into the target distribution and serves as an approximation of the Wasserstein distance. To quantitatively assess the area, we first need to calculate the absolute difference between the two CDFs at each point and then multiply it by the distance between the points:
# compute the absolute difference between the CDFs at each point:
diff = np.abs(cdf_a_common - cdf_b_common)
# compute the distance between the points:
dx = np.diff(x_common)
# compute the total "work":
total_work = np.sum(diff[:-1] * dx)
print(f"Total work of the transport: {total_work}")
Total work of the transport: 0.9769786231313
For comparison, we calculate the Wasserstein distance using library functions:
print(f"Wasserstein distance (scipy): {wasserstein_distance(dist1, dist2)}")
print(f"Wasserstein distance W_1 (POT): {ot.wasserstein_1d(dist1, dist2, p=1)}")
Wasserstein distance (scipy): 1.0
Wasserstein distance W_1 (POT): 1.0000000000000007
As you can see, the Wasserstein distance calculated with the approximation method is very close to the exact values computed with the scipy and POT libraries. Keep in mind, however, that the two sample sets are identical apart from a shift, so the dissimilarity between them is very low. If we increase the shift, the approximation becomes less accurate:
The same applies to increasing the variance of the target set:
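To see this effect numerically, here is a small sketch that wraps the CDF-area computation from above into a helper function (the name `cdf_area_approx` is my own, not from the post) and compares it with scipy's exact result for increasing shifts:

```python
import numpy as np
from scipy.stats import cumfreq, wasserstein_distance
from scipy.interpolate import interp1d

def cdf_area_approx(dist1, dist2, numbins=100, n_common=1000):
    """Approximate W_1 as the area between the two empirical CDFs."""
    a = cumfreq(dist1, numbins=numbins)
    b = cumfreq(dist2, numbins=numbins)
    x_a = a.lowerlimit + np.linspace(0, a.binsize * a.cumcount.size, a.cumcount.size)
    x_b = b.lowerlimit + np.linspace(0, b.binsize * b.cumcount.size, b.cumcount.size)
    f_a = interp1d(x_a, a.cumcount / a.cumcount[-1])
    f_b = interp1d(x_b, b.cumcount / b.cumcount[-1])
    # the common grid only covers the overlap of the two supports,
    # which is what makes the approximation degrade for large shifts:
    x_common = np.linspace(max(x_a[0], x_b[0]), min(x_a[-1], x_b[-1]), n_common)
    diff = np.abs(f_a(x_common) - f_b(x_common))
    return np.sum(diff[:-1] * np.diff(x_common))

rng = np.random.default_rng(42)
base = rng.normal(0, 1, 10000)
for shift in (1, 3, 5):
    approx = cdf_area_approx(base, base + shift)
    exact = wasserstein_distance(base, base + shift)
    print(f"shift={shift}: approximation={approx:.3f}, exact={exact:.3f}")
```

For a pure shift, the exact distance equals the shift itself, so the growing gap between the two printed values makes the loss of accuracy directly visible.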
In conclusion, the approximation can become less accurate for distributions with significant differences in their shapes or locations. However, the method is computationally efficient, especially for high-dimensional data, as it avoids solving a linear programming problem. As long as the distributions are not too dissimilar, the approximation provides a valuable alternative for estimating the Wasserstein distance. Another factor controlling the accuracy of the approximation is the granularity of the $x$ values: a finer grid yields a more accurate approximation, but also increases the computational cost.
The approximation of the Wasserstein distance by calculating the cumulative distribution function provides an intuitive and computationally efficient method to quantify the ‘distance’ between two distributions. While it may not always provide the exact Wasserstein distance, especially for dissimilar distributions, it offers a good estimate and I think it also helps to understand the underlying concept of the Wasserstein distance.
The code used in this post is available in this GitHub repositoryꜛ.
If you have any questions or suggestions, feel free to leave a comment below or reach out to me on Mastodonꜛ.
Before we head over to the Sinkhorn algorithm, let’s first understand Sinkhorn’s theorem, which is the foundation of the algorithm. The theorem is named after Richard Sinkhorn, who proved it in the context of matrices.
The Sinkhorn theorem states that, given a matrix $A \in \mathbb{R}^{n \times n}$ with positive entries $a_{ij} \gt 0$ for all $i, j$, there exist diagonal matrices $D_1 = \text{diag}(d_1)$ and $D_2 = \text{diag}(d_2)$ with positive diagonal entries such that $D_1 A D_2$ is a doubly stochastic matrix. In other words, all rows and columns of $D_1 A D_2$ sum to one.
The Sinkhorn algorithm, which is used to find these diagonal matrices, alternates between two normalization steps: starting from $A$, rescale every row so that it sums to one, then rescale every column so that it sums to one, and repeat. Each row (column) rescaling corresponds to updating the diagonal entries of $D_1$ ($D_2$), and the alternation converges to the doubly stochastic matrix $D_1 A D_2$.
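As a quick illustration (not from the original post), the alternating normalization can be sketched in a few lines of NumPy: start from a random positive matrix and repeatedly rescale rows and columns until the matrix is doubly stochastic:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 2.0, size=(4, 4))  # positive matrix, a_ij > 0

M = A.copy()
for _ in range(500):
    M /= M.sum(axis=1, keepdims=True)  # rescale rows to sum to 1
    M /= M.sum(axis=0, keepdims=True)  # rescale columns to sum to 1

print(M.sum(axis=0))  # columns sum to 1 exactly, after the last step
print(M.sum(axis=1))  # rows sum to 1 up to numerical tolerance
```

Note that only the side normalized last sums to one exactly; the other side converges to one as the iterations proceed, which is exactly what Sinkhorn's theorem guarantees for positive matrices.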
In the context of optimal transport, the matrix $A$ is typically chosen to be $e^{-C/\epsilon}$, where $C$ is the cost matrix, $\epsilon \gt 0$ is a regularization parameter, and the exponentiation is element-wise. The resulting matrix $P = D_1 A D_2$ is then a near-optimal transport plan, and the value of the regularized optimal transport problem is approximately $\langle P, C \rangle = \text{trace}(P^T C)$.
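The identity $\langle P, C \rangle = \text{trace}(P^T C)$ is easy to verify numerically; a minimal check with arbitrary stand-in matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 4))  # stand-in transport plan
C = rng.random((4, 4))  # stand-in cost matrix
# the element-wise inner product equals the trace of P^T C:
print(np.isclose(np.sum(P * C), np.trace(P.T @ C)))  # True
```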
The Sinkhorn algorithm is efficient because each iteration only involves matrix-vector multiplications and element-wise operations, which can be done in linear time. Furthermore, the algorithm is guaranteed to converge to a unique solution due to the Sinkhorn theorem.
Let’s recap the original optimal transport problem. The goal of the problem is to find a transport plan that minimizes the total cost of transporting mass from one distribution to another:
\[\min_{\gamma \in \Gamma(P, Q)} \langle \gamma, C \rangle\]where $\gamma$ is the transport plan, $C$ is the cost matrix, and $\Gamma(P, Q)$ is the set of all transport plans that move mass from distribution $P$ to distribution $Q$. The Sinkhorn algorithm addresses this problem by adding an entropy regularization term, which transforms the problem into:
\[\min_{\gamma \in \Gamma(P, Q)} \langle \gamma, C \rangle - \epsilon H(\gamma)\]where $H(\gamma)$ is the entropy of the transport plan, and $\epsilon \gt 0$ is the regularization parameter. The entropy of a transport plan is defined as:
\[H(\gamma) = -\sum_{i,j} \gamma_{i,j} \log(\gamma_{i,j})\]The Sinkhorn algorithm solves this regularized problem by iteratively updating the transport plan according to the following rule:
\[\gamma^{(k+1)} = \text{diag}(u) K \text{diag}(v)\]where $K = \exp(-C/\epsilon)$ is the kernel matrix, $u$ and $v$ are vectors that are updated at each iteration to ensure that the transport plan $\gamma$ satisfies the marginal constraints, and $\text{diag}(u)$ denotes a diagonal matrix with the elements of $u$ on its diagonal.
The Sinkhorn algorithm iterates this update rule until convergence, resulting in a transport plan that minimizes the regularized problem. The resulting transport plan is smoother and less scattered than the one obtained from the original problem, which makes the Sinkhorn algorithm a powerful tool for computing the Wasserstein distance in large-scale problems.
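The update rule translates almost line by line into NumPy. The following is a minimal sketch of the iteration (the function name `sinkhorn_plan`, the toy marginals, and the fixed iteration count are my own choices for illustration, not part of any library API):

```python
import numpy as np

def sinkhorn_plan(a, b, C, epsilon=0.1, n_iter=2000):
    """Entropy-regularized OT: iterate u, v and return diag(u) K diag(v)."""
    K = np.exp(-C / epsilon)  # kernel matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)       # enforce the row (source) marginals
        v = b / (K.T @ u)     # enforce the column (target) marginals
    return u[:, None] * K * v[None, :]

# tiny example: move mass one step to the right on a 5-point grid
a = np.array([0.2, 0.5, 0.3, 0.0, 0.0])  # source marginal
b = np.array([0.0, 0.2, 0.5, 0.3, 0.0])  # target marginal
x = np.arange(5, dtype=float)
C = (x[:, None] - x[None, :]) ** 2
C /= C.max()  # normalized squared-distance cost
gamma = sinkhorn_plan(a, b, C)
print(np.sum(gamma * C))  # regularized transport cost <gamma, C>
```

Each iteration is just two matrix-vector products and two element-wise divisions, which is where the efficiency claimed above comes from. Since the $v$-update runs last, the column marginals are matched exactly and the row marginals converge as the iterations proceed.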
While the Sinkhorn algorithm provides a computationally efficient method for approximating the Wasserstein distance, it’s important to note that the results can differ from those obtained using linear programming. The reason for this is that the Sinkhorn algorithm introduces a regularization term to the optimal transport problem, which can lead to a different solution than the unregularized problem solved by linear programming. When the regularization parameter $\epsilon$ is small, the solution of the Sinkhorn algorithm is close to the solution of the unregularized problem, and the Wasserstein distance calculated with the Sinkhorn algorithm is close to the true Wasserstein distance. However, when $\epsilon$ is large, the solution of the Sinkhorn algorithm can be quite different from the solution of the unregularized problem, and the Wasserstein distance calculated with the Sinkhorn algorithm can be quite different from the true Wasserstein distance. Despite these potential differences in results, the Sinkhorn algorithm remains a practical choice for many applications due to its computational efficiency, especially for large problems.
Here is a Python code example that computes the Wasserstein distance between two distributions using the Sinkhorn algorithm. The code is the same as in the previous post, except that we replace the computation of the transport plan G. We again use the POT libraryꜛ, which provides an implementation of the Sinkhorn algorithm:
import numpy as np
import matplotlib.pyplot as plt
import ot.plot
from ot.datasets import make_1D_gauss as gauss
from matplotlib import gridspec
# generate the distributions:
n = 100 # nb bins
x = np.arange(n, dtype=np.float64) # bin positions
a = gauss(n, m=20, s=5) # m= mean, s= std
b = gauss(n, m=60, s=10)
# calculate the cost/loss matrix:
M = ot.dist(x.reshape((n, 1)), x.reshape((n, 1)), metric='sqeuclidean')
M /= M.max()
# solve transport plan problem using Sinkhorn algorithm:
epsilon = 1e-3
G = ot.sinkhorn(a, b, M, epsilon, verbose=False)
# calculate the Wasserstein distance:
w_dist = np.sum(G * M)
print(f"Wasserstein distance W_1: {w_dist}")
# plot distribution:
plt.figure(1, figsize=(6.4, 3))
plt.plot(x, a, c="#0072B2", label='Source distribution', lw=3)
plt.plot(x, b, c="#E69F00", label='Target distribution', lw=3)
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_linewidth(2)
ax.spines['bottom'].set_linewidth(2)
ax.tick_params(axis='x', which='major', width=2)
ax.tick_params(axis='y', which='major', width=2)
ax.tick_params(axis='both', which='major', labelsize=12)
plt.legend()
plt.show()
# plot loss matrix:
plt.figure(2, figsize=(5, 5))
plot1D_mat(a, b, M, 'Cost matrix\nC$_{i,j}$')
plt.show()
# plot optimal transport plan:
plt.figure(3, figsize=(5, 5))
plot1D_mat(a, b, G, 'Optimal transport\nmatrix G$_{i,j}$')
plt.show()
The plot function plot1D_mat, a modified adaptation from the POT library, also remains unchanged:
def plot1D_mat(a, b, M, title=''):
    r"""Plot matrix :math:`\mathbf{M}` with the source and target 1D distributions.

    Creates a subplot with the source distribution :math:`\mathbf{a}` on the left and
    the target distribution :math:`\mathbf{b}` on the top. The matrix :math:`\mathbf{M}`
    is shown in between. Modified function from the POT library.

    Parameters
    ----------
    a : ndarray, shape (na,)
        Source distribution
    b : ndarray, shape (nb,)
        Target distribution
    M : ndarray, shape (na, nb)
        Matrix to plot
    title : str, optional
        Title placed inside the matrix plot
    """
    na, nb = M.shape
    gs = gridspec.GridSpec(3, 3)
    xa = np.arange(na)
    xb = np.arange(nb)
    # target distribution on top:
    ax1 = plt.subplot(gs[0, 1:])
    plt.plot(xb, b, c="#E69F00", label='Target\ndistribution', lw=2)
    ax1.spines['top'].set_visible(False)
    ax1.spines['right'].set_visible(False)
    plt.ylim((0, max(max(a), max(b))))
    # make axes thicker:
    ax1.spines['left'].set_linewidth(1.5)
    ax1.spines['bottom'].set_linewidth(1.5)
    plt.legend(fontsize=8)
    # source distribution on the left:
    ax2 = plt.subplot(gs[1:, 0])
    plt.plot(a, xa, c="#0072B2", label='Source\ndistribution', lw=2)
    plt.xlim((0, max(max(a), max(b))))
    plt.xticks(ax1.get_yticks())
    plt.gca().invert_xaxis()
    plt.gca().invert_yaxis()
    ax2.spines['top'].set_visible(False)
    ax2.spines['right'].set_visible(False)
    ax2.spines['left'].set_linewidth(1.5)
    ax2.spines['bottom'].set_linewidth(1.5)
    plt.legend(fontsize=8)
    # matrix in between:
    plt.subplot(gs[1:, 1:], sharex=ax1, sharey=ax2)
    plt.imshow(M, interpolation='nearest', cmap="plasma")
    plt.axis('off')
    plt.text(xa[-1], 0.5, title, horizontalalignment='right', verticalalignment='top',
             color='white', fontsize=12, fontweight="bold")
    plt.xlim((0, nb))
    plt.tight_layout()
    plt.subplots_adjust(wspace=0., hspace=0.2)
Here is what the two distributions look like:
The cost matrix:
And the resulting transportation plan, compared to the transportation calculated with linear programming:
The corresponding Wasserstein distance is $W_1 \approx 0.1662$ for the Sinkhorn algorithm and $W_1 \approx 0.1658$ for linear programming. The difference is small, but keep in mind that the Sinkhorn algorithm only approximates the optimal transport plan, which can lead to differences in the resulting Wasserstein distance.
The Sinkhorn algorithm offers an efficient solution to the optimal transport problem and the calculation of the Wasserstein distance. By introducing regularization, it makes the problem computationally tractable for large datasets, a task that is often infeasible with traditional linear programming methods. However, the regularization can lead to differences in the results, controlled by the regularization parameter $\epsilon$. Therefore, a careful balance between computational efficiency and result accuracy is crucial.
The code used in this post is available in this GitHub repositoryꜛ.
If you have any questions or suggestions, feel free to leave a comment below or reach out to me on Mastodonꜛ.