3 Random Seeds, Model Stability, and the Complexity–Parsimony Trade-off

3.1 Introduction

In the previous project, we developed graphical representations for model-agnostic XAI methods, primarily focusing on intrinsically interpretable models like tree-based algorithms. However, the rapid growth of artificial intelligence has firmly established deep neural networks as a dominant tool for complex tasks, from logical reasoning to creative endeavors. This shift presents a fundamental challenge: the decision processes of these networks are hidden within intricate combinations of linear and non-linear functions, making traditional XAI methods insufficient and giving rise to new fields like mechanistic interpretability that aim to untangle their internal workings.

At their core, neural networks learn by iterating over batches of data, updating a large number of internal parameters via back-propagation and gradient descent (Goodfellow-et-al-2016?). Their capacity is governed by hyperparameters, such as the number of layers and neurons, which determine the model’s complexity. Because the state-of-the-art tasks neural networks excel at often involve massive, complex datasets, the prevailing trend has been to build ever-larger models, creating a culture where complexity is often the default starting point.

This default, however, is problematic. Highly complex models are not only unnecessary but also detrimental when a simpler solution exists. They are prone to overfitting, memorizing noise instead of learning true patterns, which leads to poor generalization on new data. In these scenarios, simpler, parsimonious models are superior because they are more robust, computationally efficient, and, most importantly, transparent. (breiman2001statistical?) provides an early statistical perspective on algorithmic models that focus on predictive accuracy to the detriment of interpretability.

Therefore, a critical gap exists between the “make it bigger” trend of mainstream deep learning and the foundational principle of model parsimony. This project directly addresses this gap. We investigate how simple, parsimonious neural networks can be discovered for simple datasets. Our focus will be on single-layer, fully connected neural networks, using them as a base to build a pedagogical tool to explore the impact of random seed and model size across various data-generating processes.

In discovering simpler models, we were able to uncover clear, demonstrable insights into how simpler architectures can be found. The key takeaway messages from this project are:

For a given neural architecture, a simple, parsimonious model can exist that performs similarly to a more complex model, but it is not easy to find.
Changing the random seed produces large variation in the accuracy of a simple model fit, but has little effect on the accuracy of a complex model fit.
A complex model has stability but little interpretability. Persisting with random starts can provide a parsimonious, interpretable fitted model.

3.2 Background

3.2.1 Training neural networks

Figure 3.1: An example neural network structure

A neural network can be seen as a sequence of linear transformations combined with nonlinear functions called activations. The weights of these linear transformations are trained through a process of iterative adjustments to reduce the loss between the current prediction and the expected output, guided by an optimization algorithm. This minimization is achieved through repeated cycles of forward and backward propagation (Goodfellow-et-al-2016?).

The training begins with a forward pass. Suppose the input observation is $x \in \mathbb{R}^{1 \times p}$, where $p$ is the number of input features (in the given example, $p = 2$). Input data is fed into the network, and each neuron calculates its output as the weighted sum of its inputs passed through a non-linear activation function (for example, ReLU). In the example given there are $4$ neurons, and each neuron has $p$ weights associated with it, which are multiplied with the input vector of size $p$ and summed, so that the layer produces a vector of size 4. Additionally, the layer has a bias term, also a vector of size 4, which is added to this output. Rather than treating each neuron separately, the action of an entire layer can be summarised as $y = xW^T+b$, where $W$ is a matrix with 4 rows and 2 columns and $b$ is a vector of 4 values.
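As a minimal sketch of this layer operation, the four-neuron example can be written with NumPy as follows. The function name `forward_layer` and all numeric values are illustrative assumptions and are not taken from the figure.

```python
import numpy as np

def forward_layer(x, W, b):
    """Compute ReLU(x W^T + b) for one fully connected layer."""
    z = x @ W.T + b             # weighted sum per neuron, shape (1, 4)
    return np.maximum(z, 0.0)   # ReLU sets negative pre-activations to zero

# Illustrative values only: p = 2 input features, 4 neurons.
x = np.array([[1.0, -2.0]])     # shape (1, 2)
W = np.array([[0.5, 0.0],
              [0.0, 0.5],
              [1.0, 1.0],
              [-1.0, 1.0]])     # shape (4, 2)
b = np.array([0.1, 0.2, 0.3, 0.4])

h = forward_layer(x, W, b)      # one activation per neuron
```

With these values only the first neuron has a positive pre-activation, so the other three outputs are zeroed by the ReLU.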

These outputs then propagate forward through the network until a final prediction is produced. This prediction is compared to the true target value, and the error is calculated through a loss function. Different loss functions are defined based on the task and the data structure of predictions and ground truths. For example a binary cross entropy loss function (bishop_pattern_2006?) is defined for binary classification tasks assuming that the prediction is in the range of $[0,1]$ and the ground truth is in binary.
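For illustration, a binary cross entropy loss of this kind can be sketched in plain Python. The function name and the clipping constant `eps` are assumptions introduced here to keep the logarithm finite; they are not part of the original description.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy for labels in {0, 1} and predictions in [0, 1]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumption
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

loss_confident = binary_cross_entropy([1, 0], [0.9, 0.1])  # correct and confident
loss_unsure = binary_cross_entropy([1, 0], [0.5, 0.5])     # uninformative predictions
```

Confident correct predictions give a lower loss than uninformative ones, which is the behaviour the optimizer exploits.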

The heart of the learning process lies in the backward pass, also known as back-propagation. This step uses the chain rule of calculus to calculate the gradient of the loss function with respect to each weight in the network. Essentially, it determines how much each weight contributes to the overall error. The gradient indicates the direction and magnitude of the change needed to reduce the loss. To update a weight matrix $W$ we need the gradient of the loss with respect to $W$ and the learning rate $\alpha$, which controls how fast and how stably the weights are updated. These elements combine into the basic gradient descent update as follows, where the gradient dictates the direction and intensity with which the weight should “descend” the loss surface towards a lower loss.

\[ W_{new} = W - \alpha \frac{\partial L}{\partial W} \]

An optimization algorithm, such as Stochastic Gradient Descent or Adam (kingma_adam:_2017?), is then employed to update the weights based on these calculated gradients. The optimizer iteratively adjusts the weights in the direction that minimizes the loss. The learning rate controls the size of these adjustments—a smaller learning rate leads to slower but more stable learning, while a larger learning rate can accelerate learning but may also cause instability.
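The effect of the learning rate can be sketched on a toy one-dimensional loss. The loss $L(w) = (w - 3)^2$, the function names, and the step counts below are assumptions chosen for illustration; each step moves against the gradient, scaled by the learning rate.

```python
def gradient_descent(grad_fn, w0, learning_rate, steps):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

def toy_gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, which is minimised at w = 3."""
    return 2.0 * (w - 3.0)

w_small = gradient_descent(toy_gradient, w0=0.0, learning_rate=0.1, steps=50)
w_large = gradient_descent(toy_gradient, w0=0.0, learning_rate=1.1, steps=50)
```

With the smaller learning rate the weight settles close to the minimum, whereas the larger one overshoots on every step and moves further away.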

The entire forward and backward process is repeated over many passes through the training data, referred to as epochs. During each epoch, the network sees the entire training dataset once, one batch at a time, allowing it to progressively refine its weights and improve its accuracy.

3.3 Motivation

The educational implications of seeing more complex models in the wild can be concerning. Students learn to equate model performance with complex neural architectures, thereby developing an intuitive tendency toward complexity. This bias persists into professional practice, where engineers default to complex solutions without considering simpler alternatives. The result is a field that consistently over-engineers solutions to problems that may have elegant, simple answers. It is therefore necessary that educators are made aware that parsimonious models can produce differing fits depending on the randomness in the model. In addition, using complex models where smaller, parsimonious models are sufficient requires additional computational resources and can lead to overfitting on smaller datasets. The motivation for this work is to challenge the assumption that increasingly complex models are always necessary for strong performance. In practice, not all tasks require deep and intricate architectures, and in some cases, a well-initialised smaller model may perform just as well as a larger one.

Several things need to be mentioned here:

Discuss issues that arise when considering more complex models than necessary
Pay attention to these from a teaching perspective

3.4 Methodology

3.4.1 Generating Simulated Data with the Squiggler Tool

In any two-dimensional area, there are numerous ways to define a boundary for a classification task. While complex, disjointed boundaries can exist, a single, continuous line is often preferred when teaching machine learning concepts. As a pedagogical tool, it’s far more intuitive for students and end users to visually separate two regions with a continuous line, making it easier to conceptualize how a model is attempting to learn and generalize.

However, to ensure that the generated boundary provides a suitable challenge for predictive models, the default design is intentionally complex. Instead of a simple line parallel to an axis, the design consists of a mix of oblique and axis-parallel segments. This approach ensures that the downstream classification task is not trivial and compels the model to learn non-linear relationships. Furthermore, this mixed design helps the model developer discern which of the two features is more influential in the model’s final decision-making process.

3.4.1.1 Key points of consideration

When developing this tool, several technical considerations had to be addressed to translate the user input into a usable decision boundary.

A dual coordinate system was implemented to translate between the screen coordinates and the data coordinates. The display coordinates range from 0 to 640 pixels, while the data coordinates are normalized to a -10 to 10 range. This separation is crucial because it allows the visualization to be resolution-independent and makes the data more meaningful for mathematical operations or external processing. The transformation also includes a y-axis flip (subtracting from the maximum value) because SVG coordinates have their origin at the top-left, while most mathematical coordinate systems place the origin at the bottom-left. When implementing this pattern, particular care was taken with the order of operations during the coordinate conversion, especially around the flipped y-axis.
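The conversion can be sketched as below. The tool itself is a Svelte component, so this Python version only illustrates the arithmetic; the function names and the assumption of a square 640 by 640 canvas are ours.

```python
CANVAS_SIZE = 640.0              # display range in pixels (assumed square canvas)
DATA_MIN, DATA_MAX = -10.0, 10.0

def screen_to_data(px, py):
    """Map pixel coordinates (origin top-left) to data coordinates (origin bottom-left)."""
    span = DATA_MAX - DATA_MIN
    x = DATA_MIN + (px / CANVAS_SIZE) * span
    y = DATA_MIN + ((CANVAS_SIZE - py) / CANVAS_SIZE) * span  # flip y, then scale
    return x, y

def data_to_screen(x, y):
    """Inverse mapping: scale to pixels first, then flip the y-axis."""
    span = DATA_MAX - DATA_MIN
    px = (x - DATA_MIN) / span * CANVAS_SIZE
    py = CANVAS_SIZE - (y - DATA_MIN) / span * CANVAS_SIZE
    return px, py
```

Under these assumptions the top-left pixel (0, 0) maps to the data point (-10, 10), and the canvas centre (320, 320) maps to the origin.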

When dragging begins on an existing point, it selects that point for manipulation. However, when dragging starts on empty space, the system dynamically creates a new point at that location and immediately begins dragging it. This creates an intuitive user experience where clicking anywhere adds a point.

The visualization creates two complementary polygons: one above the draggable line and one below it. The upper polygon is constructed by connecting the top corners of the canvas with all the user-defined points, while the lower polygon connects the bottom corners with the same points. Both polygons require the points to be sorted so that edges connect properly, but this sorting is done at render time (i.e., when drawing the polygons on the interface) rather than by modifying the original data structure. The rendering assumes that points are meant to be connected in x-coordinate order, which works well for function-like curves.
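One way to build the two polygons is sketched below in screen coordinates, where y grows downward. The function name, the 640 by 640 canvas size, and the vertex ordering are assumptions; the sketch sorts a copy of the points so that the stored list is left untouched.

```python
def build_region_polygons(points, width=640.0, height=640.0):
    """Return (upper, lower) polygons as vertex lists in screen coordinates."""
    ordered = sorted(points, key=lambda p: p[0])  # sorted copy, by x-coordinate
    right_to_left = ordered[::-1]
    # Walk along the canvas edge left to right, then back along the boundary.
    upper = [(0.0, 0.0), (width, 0.0)] + right_to_left
    lower = [(0.0, height), (width, height)] + right_to_left
    return upper, lower

# Hypothetical knots, deliberately stored out of x-order.
knots = [(400.0, 300.0), (0.0, 320.0), (640.0, 200.0)]
upper, lower = build_region_polygons(knots)
```

Because the boundary is traversed in x-order, the edges never cross as long as the knots describe a function-like curve.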

3.4.2 Parallel Modeling Pipeline

We began by generating a dataset of 10,000 samples, distributed uniformly across a two-dimensional space ranging from -10 to 10 in both dimensions. A consistent seed was used for data generation throughout the entire process, regardless of the seed selected for model training, ensuring replicability. A function was designed to determine whether a given point in this two-dimensional space lay above or below a predetermined decision boundary. This function was then used to generate both the training and testing datasets by sampling points uniformly within the specified range and assigning each point a class label based on its position relative to the boundary. The data were split 50/50 into training and testing sets, resulting in 5,000 samples in the training set.
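A minimal sketch of this data-generating step is given below, using only the Python standard library. The function names, the example knots, the data seed of 0, and the use of linear interpolation between knots are assumptions for illustration rather than the implementation used in the tool; the sketch also assumes knots with strictly increasing x values.

```python
import random
from bisect import bisect_right

def boundary_height(x, knots):
    """Linearly interpolate the boundary's y value at x (knots sorted by x)."""
    xs = [k[0] for k in knots]
    i = min(max(bisect_right(xs, x) - 1, 0), len(knots) - 2)
    (x0, y0), (x1, y1) = knots[i], knots[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def make_dataset(knots, n=10_000, low=-10.0, high=10.0, seed=0):
    """Sample n uniform points and label each 1 if above the boundary, else 0."""
    rng = random.Random(seed)  # fixed data seed, independent of training seeds
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(low, high), rng.uniform(low, high)
        rows.append((x1, x2, int(x2 > boundary_height(x1, knots))))
    half = n // 2
    return rows[:half], rows[half:]  # 50/50 train/test split

# Hypothetical boundary mixing an axis-parallel and an oblique segment.
knots = [(-10.0, 0.0), (0.0, 0.0), (10.0, 5.0)]
train, test = make_dataset(knots)
```

Calling `make_dataset` again with the same seed reproduces the same split, which mirrors the replicability requirement described above.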

Afterwards, based on the number of replicates to test per neuron size, a random selection of seeds, spanning the range [1, 99999], was drawn for each neuron size being tested.

The training process was executed in parallel, taking as inputs the generated training and testing datasets, a random seed, the number of neurons in the network, and the number of models to train for each neuron setting. For each unique seed-neuron combination, a single-layer neural network was initialized with weights determined by the seed and trained on the training dataset for 500 epochs. The batch size was set to the square root of the number of rows in the training dataset (5,000), rounded down, resulting in 71 batches. To maintain consistency, the data order was shuffled before each epoch using the same seed as for model training.
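The seed selection and batching can be sketched as follows. The function names are hypothetical, and dropping the final partial batch is an assumption: it is one reading that is consistent with the 71 batches reported for 5,000 training rows.

```python
import math
import random

def plan_runs(neuron_sizes, n_replicates, rng):
    """Pair every neuron size with n_replicates seeds drawn from [1, 99999]."""
    return [(n, rng.randint(1, 99999))
            for n in neuron_sizes
            for _ in range(n_replicates)]

def epoch_batches(n_rows, rng):
    """Yield shuffled index batches of size floor(sqrt(n_rows)) for one epoch."""
    batch_size = math.isqrt(n_rows)
    order = list(range(n_rows))
    rng.shuffle(order)  # reshuffled on every call, i.e. before each epoch
    # Assumption: the trailing partial batch is dropped.
    for start in range(0, n_rows - batch_size + 1, batch_size):
        yield order[start:start + batch_size]

runs = plan_runs([4, 8], n_replicates=5, rng=random.Random(1))
batches = list(epoch_batches(5000, random.Random(runs[0][1])))
```

Seeding the shuffling generator with the training seed means that two runs with the same seed see the batches in the same order.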

The neural network architecture incorporated two inputs corresponding to the two features, with a ReLU activation function following the hidden layer, and a final layer possessing a single output node. This configuration was necessary for the binary classification task. The final layer utilized a sigmoid activation function to produce a probability score between 0 and 1. The training process leveraged the Adam optimizer (kingma_adam:_2017?), configured with a learning rate of 0.01, the default weight decay of 0, and the default beta values of 0.9 and 0.999. The loss function employed was the Binary Cross Entropy Loss.
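A minimal PyTorch sketch of this architecture and training configuration is shown below. The function names `build_model` and `train_model` are hypothetical, the default batch size of 70 follows from the square-root rule described earlier, and the betas are left at the library defaults; this is an illustration under those assumptions, not the application's actual code.

```python
import torch
from torch import nn

def build_model(n_neurons, seed):
    """Single hidden layer: 2 inputs -> n_neurons (ReLU) -> 1 output (sigmoid)."""
    torch.manual_seed(seed)  # the seed fixes the initial weights
    return nn.Sequential(
        nn.Linear(2, n_neurons),
        nn.ReLU(),
        nn.Linear(n_neurons, 1),
        nn.Sigmoid(),
    )

def train_model(model, X, y, epochs=500, batch_size=70):
    """Train with Adam (lr=0.01, default betas) and binary cross entropy loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        order = torch.randperm(X.shape[0])  # reshuffle before each epoch
        for start in range(0, X.shape[0], batch_size):
            idx = order[start:start + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(model(X[idx]).squeeze(1), y[idx])
            loss.backward()
            optimizer.step()
    return model
```

Here `X` is expected to be a float tensor of shape (n, 2) and `y` a float tensor of 0/1 labels of shape (n,). Building two models with the same seed yields identical initial weights, which is what makes a seed-by-seed comparison of fits possible.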

Following the training phase, the trained model was evaluated on the held-out testing dataset, with both the F1-score and accuracy recorded (hastie_elements_2009?).

In a binary classification setting, we can summarize the performance of a model using the confusion matrix, which consists of the following quantities:

True Positives (TP): correctly predicted positive samples
True Negatives (TN): correctly predicted negative samples
False Positives (FP): negative samples incorrectly predicted as positive
False Negatives (FN): positive samples incorrectly predicted as negative

Accuracy measures the proportion of correctly classified instances among all predictions: \[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

The F1 score is the harmonic mean of precision and recall, providing a balance between the two: \[ \text{Precision} = \frac{TP}{TP + FP}, \]

\[ \text{Recall} = \frac{TP}{TP + FN}, \]

\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

While accuracy reflects the overall correctness of predictions, the F1 score is more informative in cases of class imbalance, as it emphasizes both precision and recall.
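These definitions translate directly into code. The sketch below uses plain Python; the function name and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Imbalanced toy example: 8 negatives and 2 positives, one positive missed.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example the accuracy is 0.9 while the F1 score is only 2/3, showing how a missed positive is penalised more visibly by F1 when the positive class is rare.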

To further understand the model’s decision boundary, we created a grid of 100x100 points spanning the same data range of -10 to 10. The trained model was applied to this grid, and the resulting predictions were stored to map the model’s learned boundary.
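A grid of this kind can be sketched as follows. The function names, the inclusion of both endpoints, and the 0.5 classification threshold are assumptions; `predict_fn` stands in for any trained model that returns a probability for a point.

```python
def make_grid(n=100, low=-10.0, high=10.0):
    """Return n * n evenly spaced (x1, x2) points covering [low, high] squared."""
    step = (high - low) / (n - 1)
    axis = [low + i * step for i in range(n)]
    return [(x1, x2) for x2 in axis for x1 in axis]

def predict_grid(predict_fn, grid, threshold=0.5):
    """Apply a probability-returning model to each point and threshold the result."""
    return [int(predict_fn(x1, x2) >= threshold) for x1, x2 in grid]

def dummy_model(x1, x2):
    """Stand-in for a trained model: predicts class 1 above the line x2 = 0."""
    return 1.0 if x2 > 0 else 0.0

grid = make_grid()
labels = predict_grid(dummy_model, grid)
```

Colouring each grid point by its predicted label gives a picture of the learned boundary in the data space.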

3.4.3 Pedagogical Tool

To illustrate the spread of model fits that occur for a given neural network model, we developed a Shiny application that can be used in a teaching setting to encourage students to compare and evaluate different neural model architectures. The application consists of two layers: first an introductory story to ease the student into the environment, and then the primary user interface where the student can train and evaluate neural networks.

3.4.3.1 Introducing the scenario through a story

Students often relate to a topic more deeply when a compelling story is associated with it, especially one that reflects a real-world possibility they might encounter (ivala_enhancing_2013?). To utilize this potential, the application eases the student into the concept that is about to be presented through a narrative centered on a fresh graduate named Joshua. We follow his story as he takes on his first task at a new data science company, providing a relatable context for the challenges of model development.

In the story, Joshua is given a dataset with a relatively simple, non-linear boundary. His initial attempts to fit a neural network fail, as the model repeatedly produces a simple linear boundary when visualized in the data space. His manager then intervenes and, using the exact same neural architecture, achieves a much better fit. The manager makes a casual, offhand comment that a better fit could always be achieved by simply using more neurons or training for longer. Disheartened but determined, Joshua tries again with his original network size. To his surprise, without changing the training process from his previous failed attempt, the model now converges to a much better solution.

This narrative illustrates a crucial concept that students will face when developing neural networks: the temptation to solve problems by simply making the model more complex. In most cases, when faced with an underperforming model, the general consensus and the industry expectation is to make the model architecture more complex so that it can hold more information. The story highlights the reality that even with an identical dataset and model architecture, there is a wide variety of possible outcomes due to the stochastic nature of training. Users can navigate between the pages of this introductory story and can end it at any time, which then reveals the main tool underneath. By this point, the user should be curious about what would cause a model architecture that failed earlier to work now.

3.4.3.2 Aspects of the user interface

The user interface is logically structured into three main sections: “Design and Fit,” “Model Spread,” and “Individual Models.” This layout is designed to guide the user through the process of generating data, training models, and analyzing the results in a linear fashion while also providing the option to move back and forth between steps.

The “Design and Fit” section is primarily for generating data and fitting models through three distinct components. The first is the “squiggler” tool (as discussed in Section 3.4.1), which is used to define the decision boundary for the data-generating process. It initializes with a predefined decision boundary marked by a set of movable circles (similar to knots in a linear spline). Users can customize this boundary by adding new knots with a simple click on an empty space or by removing existing ones with a right-click.

The second component is where the user configures the neural network training, deciding on the number of neurons and the number of replicates to fit. To ensure reasonable performance, we provide a preset list of neuron counts, preventing users from selecting an architecture that would be unnecessarily slow to train. For each selected neuron count, the application will fit the specified number of replicate models. For example, if the user selects 4 and 8 neurons with 5 replicates, the application will train five separate single-layer networks with 4 neurons and another five networks with 8 neurons.

The third component in this section is a progress card that displays the status of the model fitting queue and the time taken for each model, offering transparency into the process. With current performance optimizations, a typical neural network can be fully trained within 15 seconds on a MacBook Pro M2 CPU.

Visualisation to show the spread of model fits in an interactive display

The “Model Spread” section is dedicated to visualizing the range of performance across all fitted models. This is primarily achieved through an interactive beeswarm plot, which displays the F1 score and accuracy of each model on the test dataset. The plot immediately reveals the variability in outcomes even for models with the same architecture. Users can hover over points for details and select individual model variants from the plot, which then populates the “Individual Models” section with that model’s specific decision boundary and misclassified points. As a second approach to visualizing variance, the application can generate an animation that cycles through all the fitted models, showing how the decision boundary shifts as the performance metric changes.

Visualisation of individual model fits in the data space

To further enhance the user experience, the interface is designed with a colorblind-friendly scheme of orange and purple, ensuring accessibility for vision-impaired users. Additionally, each card includes a dedicated help button that provides a concise summary of its functionality, allowing users to quickly refresh their understanding of what each section does.

3.4.3.3 The architecture

Figure 3.4: The architecture of the pedagogical tool

The web application is built on a foundation of R and Shiny, using the {rhino} package (rhino-rpkg?) as a robust framework for managing the codebase. The user interface components are implemented with Fomantic UI (the community fork of Semantic UI) through the {shiny.semantic} package (shiny.semantic-rpkg?). The {shiny} package (shiny-rpkg?) within R was chosen primarily to develop a proof-of-concept rapidly and to leverage powerful visualization libraries with minimal development overhead.

While the main application is in R, the architecture incorporates other languages for specialized tasks. The interactive “squiggler” tool is a bespoke component built using Svelte, which compiles to lightweight, pure JavaScript and CSS. This standalone component is then embedded into the Shiny application. The model fitting itself is handled by Python (python-book?), which is executed as a background task to ensure that the single-threaded R process does not block the user interface during intensive computations. We use PyTorch (pytorch-pypkg?) as the deep learning framework due to its flexibility in defining custom architectures and its support for inspecting model internals. To manage data transfer between R and Python, intermediate data files are saved in the Parquet format, a columnar storage format that offers better compression and faster reads and writes than traditional CSV files.

The application’s visualizations are powered by a combination of R packages. The interactive beeswarm plot is created by combining the {ggbeeswarm} package (ggbeeswarm-rpkg?) for the plot geometry with the {ggiraph} package to add the layer of interactivity. Animations are generated using the {gifski} package (gifski-rpkg?), which leverages Rust internally to create high-quality GIFs in a minimal amount of time. The general static plots are constructed using the {ggplot2} package (ggplot2_R_pkg?), which allows for the creation of highly customized and publication-quality visualizations with an efficient and declarative syntax.

3.4.4 Usage of the pedagogical tool

The tool is designed primarily for students. To use it in a teaching setting, the teacher first introduces the fundamental concepts of building neural networks along with the training dynamics. Afterwards, the students can be given a simple dataset generated by the squiggler tool as practice for fitting neural networks. At first, the students can be told to freely choose their neural architectures. A hint can be given pointing towards the heuristic of picking the number of neurons based on the number of bends in the decision boundary. Subsequently, the teacher would provide a sample fitting template, deliberately using a seed that produces a sub-optimal result. After students have tried different neural architectures, several will inevitably have difficulty obtaining perfect fits. At this point, the teacher can bring forward the tool and follow the first few pages of the tutorial to illustrate a story that a few students might have experienced themselves. The tool can then be used to draw the decision boundary that the students faced and to fit several single-layer neural architectures with their replicates. Once model fitting is done, we can highlight the significant impact of random numbers on the model’s decision boundary.