Benign or Malignant? Neural Networks Can Help

James Zhang
16 min read · Feb 19, 2021
Can you tell which set of images is benign and which set is malignant?

You’re probably looking at the image above and thinking, “How in the world am I supposed to know which is which? What does a tumor even look like? That’s the job of a pathologist or a doctor, not me.” Well, what if I told you that a pathologist wouldn’t be so sure either?

In fact, a recent study found that pathologists could only detect micrometastases — super small, malignant tumors that have spread to other parts of the body from their original growth site — with 38% accuracy. Additionally, 25% of these examinations would be changed upon a second review. As you can see, detecting cancer in the first place isn’t easy!

After all, they have to try and observe every single cell to see if it follows an abnormal growth cycle. The only way to know for sure is to take a biopsy, which just means taking a sample of the tumor; this will tell them if the tumor is benign or malignant.

  1. Benign tumors, unless in the brain, are usually harmless. They don’t spread like malignant tumors, and once they are removed, they don’t grow back. In fact, 9 out of 10 women experience benign breast tissue changes, so they are quite common.
  2. Malignant tumors, on the other hand, are incredibly harmful. They can metastasize, or spread to other parts of the body, and ultimately invade the space reserved for healthy tissues and organs by stealing oxygen and nutrients, which can kill the patient.

Catching malignant tumors sooner rather than later is crucial for the patient, simply because cancer is much easier to treat when it is detected early. By the time symptoms arise, some cells from the tumor may have already metastasized, making treatment far more complex and troublesome. In addition, metastasis greatly increases the likelihood of the cancer coming back, even well after treatment, thus decreasing the patient’s survival rate.

As the tumor spreads, survival rates rapidly decrease.

If you’re interested in a more in-depth understanding of cancer, I recommend you read this article that I wrote a couple of months back, but you don’t need to read it to understand the rest of this article.

To solve the problem of determining whether a tumor is benign or malignant, I used machine learning to develop a Sequential Neural Network that can do just that. To do it, I used TensorFlow, a Python library for developing Artificial Neural Networks created by Google, and Keras, its high-level API, which is just a set of functions that lets you build and manipulate neural networks. I ran all of my code on Google Colab — personally, I find it simpler and easier to use, but some people prefer to install TensorFlow locally for its improved stability. In my case, though, Colab suffices. Now let’s dive into the details!

Instead of having to install everything on my computer, Colab comes with TensorFlow pre-installed, so all I have to do is import it like so.
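A minimal sketch of that import (assuming the usual tf alias):

```python
# Colab comes with TensorFlow (and Keras) pre-installed, so importing is all it takes.
import tensorflow as tf

print(tf.__version__)  # quick sanity check that the import worked
```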

The Data

The first step to creating a precise neural network is to actually find a good dataset. I found one here, but the variety on the website is actually quite limited. (I usually prefer Kaggle, but the dataset I found sufficed.) After downloading the dataset, I renamed it data.csv, where “csv” stands for “comma-separated values”, and then I opened it up.

The file contains 9 different properties used to determine if a tumor is benign or malignant, and in total, there are 569 tumors. 🔑 tip: The more data and the more variety of data you have, the more accurate the neural network will be. In our case, we will use the 9 properties to teach our neural network and help it determine the severity of the tumor, which is represented by the “class” column: 2 = benign and 4 = malignant.

Some of the properties include clump thickness, uniformity of cell size, bland chromatin, mitosis, etc.

Translating the Data

Our next step is to translate our data from this csv file into Python lists so that TensorFlow can actually understand it. The first thing I did was create an empty list called “data”. Next, I opened my dataset file, and then, using a csv reader, I appended each line of the dataset into the empty data list.
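Here’s a rough sketch of that step, assuming the file sits in the working directory as data.csv with no header row:

```python
import csv

data = []  # will hold one list per tumor: the 9 properties plus the class label

with open("data.csv") as file:
    reader = csv.reader(file)
    for line in reader:
        data.append(line)  # every value is read in as a String at this point

print(data[0])  # one tumor's worth of values, still wrapped in quotation marks
```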

As you can see in the bottom output section, each set of brackets contains the 9 features and the class for each tumor. There’s just one problem; can you figure it out? If you said that every element in each bracket is a String, then you’d be right. After all, the csv reader reads everything in as text, so this makes sense. However, it is ultimately quite problematic because TensorFlow can only understand math. It sees the single quotation marks and goes, uh oh, no bueno.

To solve this problem, I used a nested for loop to individually go through each and every element in the entire data list, and once I had a single element, I changed it from a String to an int.
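A sketch of that nested loop:

```python
# Walk through every line and every value with a nested for loop,
# converting each String into an int so TensorFlow can do math with it.
for i in range(len(data)):
    for j in range(len(data[i])):
        data[i][j] = int(data[i][j])

print(data[0])  # no more quotation marks -- every value is now an int
```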

Now, as shown in the second line of output, there are no single quotation marks, and thus they’re all ints.

Splitting the Data

Once all of our data is transferred from the csv file into our data list, we can move on to splitting the data. There are actually two separate components to splitting our data: first, we need to split it into an input set and an output set; this is essential for actually training our network.

You can sort of think of it as separating the questions and the answers on a test. After separating, you’re left with the test questions and an answer key which you can use to grade a student’s answers. Similarly, our neural network will take in the inputs and generate an output. We compare this generated output to our actual output, make the appropriate changes to our network’s weights and biases, and run it again and again until eventually, our network is pretty accurate.

In other words, we need to create a list with just the 9 properties and then separate the list with the “class” column, which just says if the tumor is benign (represented by a 2, remember) or malignant (represented by a 4).

The first component of splitting the data.

The first step in this task is to create two more empty lists, which I named “X” and “y”. Next, we want to loop through all of the lines in data and append everything except for the last index of each line, which is our output. Still in the for loop, we look at that last index, and here we have to determine whether we append a 0 or a 1.

Our network will use binary classification. You can think of binary classification as determining if an image is a picture of a dog or a cat, whether it will rain today or not, or if a student will pass or fail. It’s either one or the other. In our case, the tumor is either benign or malignant, so it follows that same pattern.

You might be wondering, “Why 0s and 1s? What’s wrong with 2s and 4s? 4 is my favorite number HMPH.” Okay, probably not that last bit, but you know what I mean.

The short answer is that computers find 0s and 1s easier to work with than 2s and 4s, and after applying a Sigmoid Function, the network can easily determine the state of the tumor. That’s definitely a little abstract, and I’ll explain more later on in the article, but take a minute or two to ponder why we use the Sigmoid, 0s, and 1s, and how this improves the network’s overall performance.
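Here’s a sketch of that first component, looping through data and re-encoding the class column as 0s and 1s:

```python
X = []  # inputs: the 9 properties of each tumor
y = []  # outputs: 0 for benign, 1 for malignant

for line in data:
    X.append(line[:-1])  # everything except the last index (the class column)
    if line[-1] == 2:    # 2 means benign in the dataset...
        y.append(0)      # ...which we re-encode as 0
    else:                # ...and 4 means malignant...
        y.append(1)      # ...which becomes 1
```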

The second component requires us to split our recently acquired input and output data using a train-test-split. Essentially, we are dividing our data into two portions; in the training portion, our network repeatedly iterates and improves itself, and then the network tests its accuracy in the testing portion.

This helps prevent overfitting, which is when the network models the training data too closely and thus worsens by failing to generalize when it encounters new data.

Utilizing sklearn.model_selection’s train_test_split function is like a walk in the park. In fact, the only thing simpler than using a train_test_split is leaving this article a clap and connecting with me on other platforms 👀 once you finish reading.

Anywayysss, the train_test_split essentially splits up our X (input) list and our y (output) list into a total of four lists. I used a test size of 0.2, which just means that 20% of our data will be used to test the model, and the other 80% will be used to train it.
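A sketch of that split (I convert the lists to numpy arrays first, since that’s what Keras and sklearn are happiest with):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Convert the Python lists to numpy arrays for Keras and sklearn.
X = np.array(X)
y = np.array(y)

# 20% of the tumors are held out for testing; the other 80% train the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```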

The Neural Network

Before we dive into using TensorFlow and Keras to build our neural network, it’s important to go over some of the key concepts that allow neural networks to model a biological brain and learn in a similar fashion.

Activation Functions

In neural networks, inputs are multiplied by the weights, then added to the bias, and then passed on into an activation function.

Simply put, activation functions are mathematical functions that help neural networks recognize complex patterns by keeping the numbers within a reasonable range. For example, in the pass-or-fail situation mentioned previously, a step activation function could be used.

The step function takes in any real number, and if it is greater than a certain threshold, the number is considered 1. Otherwise, it will be considered 0.

The sigmoid function is similar to the step function, but more often than not, the sigmoid function is preferred to the step function because it provides an element of confidence in the output.

Here’s what I mean by that: say the threshold is 0.5 (remember, we use 0s and 1s in our network, so 0.5 is right in the middle), and the predicted output of the neural network after the sigmoid function is 0.47. It’s swaying towards benign, so the network will make that guess, but at the same time, we wouldn’t necessarily be surprised if the network was wrong and the tumor was actually malignant. On the other hand, if our network returned 0.98, we would be almost certain that the tumor is malignant, and thus the sigmoid function provides a sense of confidence to the returned output as opposed to the step function which just returns a 0 or 1.

Rectified linear unit, or ReLU, is another useful activation function, but it is typically used in the hidden layers of a neural network. Recall that throughout all of the nodes, we’re trying to calculate the probability of the tumor being malignant, but what if the weights and biases cause the output of a node to be a negative number?

Well, what’s negative probability? You can’t be more certain than 0%, so ReLU just takes the maximum between 0 and the output. Essentially, if the number is negative, ReLU returns 0, and if it’s positive, ReLU returns the number as it is. This is ultimately beneficial for the computer because it can spend less time processing data that it really doesn’t need to, and consequently, our neural network will become far more efficient.
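To make the three activation functions concrete, here’s a tiny numpy sketch of step, sigmoid, and ReLU (the threshold value here is just for illustration):

```python
import numpy as np

def step(x, threshold=0.5):
    # Step: everything above the threshold becomes 1, everything else becomes 0.
    return 1 if x > threshold else 0

def sigmoid(x):
    # Sigmoid: squashes any real number into the range (0, 1),
    # which is what lets the output read like a level of confidence.
    return 1 / (1 + np.exp(-x))

def relu(x):
    # ReLU: negative numbers become 0, positive numbers pass through unchanged.
    return max(0, x)

print(step(0.47), sigmoid(0.0), relu(-3.2))  # -> 0 0.5 0
```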

Gradient Descent

Gradient descent is an algorithm that works to minimize loss, which represents the network’s levels of failure. The lower the loss, the better the network’s performance. So how does gradient descent work exactly?

You can think of it like this. Imagine you’re scared of heights, and yet somehow someway, you find yourself standing at the tippity top of Mount Everest. How do you get down? Well, it’s pretty obvious for a human to figure out (obvious does not equate to simple or easy FYI).

You look around you and evaluate which direction would take you down the mountain the quickest. At the same time, however, you don’t want to go too fast, otherwise you might slip, and that won’t end up being great for you. So you’re tip-toeing down the mountain the way you intended, and then eventually, you’ll think to yourself, “Woah, this way could also be pretty quick.”

Hopefully, you don’t find yourself stranded at the top of Mt. Everest…

And so, you choose to slowly head down the mountain that way, and if you keep repeating this process over and over again, hopefully, you reach the bottom of the mountain safe and sound.

Well, neural networks work the exact same way. The network calculates the gradient, the direction of change that will decrease its loss, in a similar way to how you evaluate which steps you need to take to get down that mountain.

The network will then update its weights accordingly, in small increments so it doesn’t overshoot and slip past the minimum, and it repeats the process over and over until the network is pretty accurate.
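To make that concrete, here’s a toy, one-weight example of gradient descent. The loss function here is made up purely for illustration; it’s not what our network actually uses:

```python
# Gradient descent on a single weight w, minimizing the made-up loss (w - 3)**2,
# whose minimum sits at w = 3.
w = 10.0              # start somewhere high up the "mountain"
learning_rate = 0.1   # small steps so we don't "slip" past the bottom

for _ in range(50):
    gradient = 2 * (w - 3)         # slope of the loss with respect to w
    w -= learning_rate * gradient  # take a small step downhill

print(round(w, 3))  # ends up very close to 3
```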

Unfortunately, gradient descent alone only really works for neural networks with two layers, and most problems are complex enough to require some hidden layers. We need some kind of concept that can work across an entire deep neural network. “Yooo, backpropagation! Yes, you, it’s your turn to present, let’s go!”

Backpropagation

Backpropagation is the main algorithm used for training neural networks with hidden layers. We can describe it in the pseudocode below:

  • Calculate the error/loss for the output layer (Is the predicted output correct? Can it be even more correct if some of the weights were adjusted?)
  • Repeat the following steps until the input layer is reached
  • 1) Propagate the error back one layer
  • 2) Calculate the gradient for this layer’s weights
  • 3) Update the weights accordingly using gradient descent

This is a little bit abstract, so I’ll walk you through it. Let’s say we have a benign tumor, and our network happens to predict an output of 0.45. It didn’t necessarily predict wrong; it’s just that it could have been more accurate, in that the number could be closer to 0, because, if you recall the Sigmoid function, our confidence in that output is not very high.

Boom, we just calculated the error. The computer uses a numerical representation, but for us humans, what I said above suffices. The next step is to propagate the error back one layer. Essentially, the output layer just sends the error back to the previous layer for further analysis.

Next, we use gradient descent and calculate the weights of this layer. Since we already know the error of the following layer because we backpropagated, we can find out which weights are responsible for what and how they affect the performance of the network. It’s kind of like if you could rewind time. Imagine you are in a maze and you take the wrong turn and end up in a dead-end, so you rewind time, and now you know that that turn leads you away from your destination. Similarly, the neural network can keep this error in mind, and thus determine which weights are causing the error.

Finally, the network tweaks its own weights and biases to try and minimize that error, and the process repeats again. Backpropagate, figure out what is causing everything to go wrong, and then try and fix it.
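If you want to see the whole loop in one place, here’s a tiny from-scratch sketch of backpropagation on a one-hidden-layer network. The data is a made-up XOR toy problem rather than our tumor data, and in the real network TensorFlow handles all of this for us:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up toy data (an XOR pattern), chosen only because it needs a hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input layer -> hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden layer -> output layer
lr = 0.5

for _ in range(5000):
    # Forward pass
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # 1) Error at the output layer
    output_error = output - y

    # 2) Propagate the error back one layer at a time
    output_delta = output_error * output * (1 - output)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)

    # 3) Update the weights (and biases) with gradient descent
    W2 -= lr * hidden.T @ output_delta
    b2 -= lr * output_delta.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ hidden_delta
    b1 -= lr * hidden_delta.sum(axis=0, keepdims=True)

print(output.round(2))  # predictions should end up near [[0], [1], [1], [0]]
```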

Sweeeett, now that we have the boring data manipulation and the basics of neural networks covered, we can finally get into the fun stuff. Get ready to rumble! It’s time to actually build our network!

Building the Network

I’ve chosen to implement a sequential model for our network, as you can see at the top of the code portion below. Sequential just means that each layer in the model receives its input from the previous layer and then feeds its output into the next one. This is different from other types of networks, such as Convolutional Neural Networks, or CNNs, and Recurrent Neural Networks, or RNNs, but in this article, we don’t need to know about those two.

After initializing the Sequential neural network and naming it “model”, the next thing we have to think about is how many nodes and hidden layers we want our network to have. Straight away, we know the input layer has to have 9 nodes because our dataset has 9 unique features for all of the tumors, but thanks to TensorFlow, there’s no need to include it in the code.

Now we have to decide how many hidden layers our network needs and how many nodes per layer is sufficient. There’s a Catch-22 situation, though: the more hidden layers and nodes per hidden layer, the more complex and accurate our network can be, but consequently, the network has to do more math to reach that high level of precision, and thus it will be slower and far less efficient. For our situation, after tinkering for some time, I found that 2 hidden layers with 8 and 6 nodes respectively were sufficiently accurate.

Furthermore, I’ve chosen both hidden layers to use the ReLU Activation Function to save our network from doing more unnecessary math by eliminating the need to deal with negative probabilities.

Finally, we know that the output layer, also called the output node, will only need 1 node because our network uses binary classification (recall that the output is either one or the other). The output layer will utilize the Sigmoid Activation Function so our network can make predictions based on a reasonable threshold. You can think of it like this.

Without the Sigmoid function, our network could potentially return a predicted output of anything from -283 to 891, or any arbitrary range for that matter. This makes it almost impossible for the network to train and iterate itself because the numbers don’t actually mean anything, and our network can’t decipher if any one of these arbitrary numbers represents a benign or malignant tumor.

The Sigmoid function solves this. Recall that it squashes any number into the range between 0 and 1: large numbers end up close to 1, and very negative numbers end up close to 0. Additionally, the Sigmoid provides each prediction with a sense of confidence depending on how far away it is from the threshold.

This gives our network a reasonable range to deal with because it can assume the threshold is 0.5, and any number greater is considered malignant while any number less is considered benign. “You’re welcome, Network… now you don’t have to deal with that stupid range of -283 to 891! YOU’RE WELCOME!”
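Putting that together, here’s roughly what the model definition looks like in Keras (a sketch with the layer sizes described above):

```python
from tensorflow import keras
from tensorflow.keras.layers import Dense

model = keras.Sequential([
    Dense(8, activation="relu"),     # hidden layer 1: 8 nodes, ReLU
    Dense(6, activation="relu"),     # hidden layer 2: 6 nodes, also ReLU
    Dense(1, activation="sigmoid"),  # output layer: 1 node, Sigmoid
])
# Keras figures out the 9-node input layer from the shape of the training data,
# so it doesn't need to be written out here.
```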

Once we compile the model like so, we can finally begin training our network! “adam” is one of TensorFlow’s built-in optimizers, and it uses gradient descent to update the network’s weights based on the “binary_crossentropy” loss function, which grades how poorly our network is performing by analyzing the times when it predicted the wrong binary output.
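The compile step looks roughly like this:

```python
model.compile(
    optimizer="adam",            # gradient-descent-based optimizer
    loss="binary_crossentropy",  # loss function for binary classification
    metrics=["accuracy"],        # optional, but nice for us humans to read
)
```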

PS. the metrics parameter is optional, but I like to see the overall accuracy of the model because it gives me a better sense of the network’s performance, as opposed to the computer which uses the loss function.

🔑 Reminder: The entire goal of a neural network is to minimize loss, and it does so by updating its weights and biases during the backpropagation algorithm, which utilizes gradient descent.

Training the Network

model.fit() trains the network by adjusting its weights and biases to minimize the loss function, and the epochs parameter just outlines how many cycles through the data the network will go. For instance, 12 epochs just tells the network to go through all of the input data 12 times.

In this case, 12 epochs were sufficient because anything more resulted in the network peaking at about the same loss and accuracy, so there was no need for more than 12.
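The training call itself is roughly one line:

```python
# Train for 12 passes (epochs) over the training data.
model.fit(X_train, y_train, epochs=12)
```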

Shall we see how our newly trained network performs against our never-before-seen testing data?! Drumrolllll, pleaseeee!

The Results
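Evaluating on the held-out testing data is just as short (a sketch; model.evaluate returns the loss plus any metrics set at compile time):

```python
# Evaluate on the 20% of the data the network has never seen before.
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy * 100:.2f}%")
```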

After evaluating on the testing data, our network performed at 96.49% accuracy! Therefore, we know that our network is accurate even on data that it has never encountered before, and thus overfitting is not a problem for our network.

If you aren’t interested in becoming a pathologist or a doctor, fear not, because Artificial Intelligence spreads its wings far beyond healthcare alone. AI can be applied to any industry to improve human lives, as long as there’s enough data for the neural network. Object detection, playing chess, converting a person’s thoughts into speech, and better cybersecurity are all examples of where AI can be applied, and the possibilities are endless. In 2021, AI isn’t the future anymore; it’s the present.

😦 “If there’s data, then there’s a way…” - James Zhang, 2021 😦

Key Takeaways

  • Artificial Intelligence is leaving its impact in almost every field.
  • The entire goal of a neural network is to minimize loss, and it does so by updating its weights and biases during the backpropagation algorithm, which utilizes gradient descent.
  • Data is crucial! To create a neural network, the data must be gathered, manipulated, and split into input/output and train/test.
  • Hidden layers are like a Catch-22. Adding more increases complexity and accuracy, but it also slows down the network because it has to perform more calculations.
  • Training and testing the neural network is easy, thanks to TensorFlow, Keras, and Google Colab.

Thanks for Reading!

If you’re interested in playing with the code, you can view the Github Repository here.

Big shout out to YOU for making it to the end of this article! ❤️ I seriously appreciate it. Now, if you remember, in the middle of the article I said that the only thing easier than using the train_test_split function was to leave this article some claps and connect with me! Have at it!

Medium | LinkedIn | Github
