Breast cancer is a leading cause of death among women, and early, accurate detection is vital for improving survival rates. Traditional diagnostic methods are often costly, time-consuming, and error-prone. Machine learning and neural networks make it possible to build models that support faster, more consistent, and more scalable diagnosis. Such models can strengthen clinical decision-making and support early intervention, which makes AI a useful tool in healthcare applications.
This project focuses on breast cancer prediction using a neural network built in PyTorch within the Spyder IDE environment. It uses a structured dataset containing quantitative features extracted from digitized Fine Needle Aspirate (FNA) test results, such as cell radius, texture, and perimeter. The workflow includes data preprocessing, feature scaling, model design, training, and evaluation. This project demonstrates the power of deep learning in enabling early and accurate diagnosis, contributing to improved clinical outcomes in healthcare.
- Understand Data: Explore the Breast Cancer Wisconsin dataset to identify key features and their distributions.
- Preprocessing: Apply feature standardization to ensure balanced input for neural network training.
- Model Development: Build a PyTorch-based neural network to learn complex patterns for classification.
- Training: Train the model using Binary Cross-Entropy loss and Adam optimizer for optimal learning.
- Evaluation: Assess performance using metrics like accuracy on both training and test datasets.
- Achieve High Accuracy: Demonstrate strong predictive reliability for real-world diagnostic support.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("file-path")
print('Read successfully')
# check for CUDA availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print("Breast cancer data - rows:",df.shape[0]," columns:", df.shape[1])
Data type
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 569 entries, 0 to 568 Data columns (total 33 columns):
Column | Non-Null Count | Dtype |
---|---|---|
id | 569 | int64 |
diagnosis | 569 | object |
radius_mean | 569 | float64 |
texture_mean | 569 | float64 |
perimeter_mean | 569 | float64 |
area_mean | 569 | float64 |
smoothness_mean | 569 | float64 |
compactness_mean | 569 | float64 |
concavity_mean | 569 | float64 |
concave points_mean | 569 | float64 |
symmetry_mean | 569 | float64 |
fractal_dimension_mean | 569 | float64 |
radius_se | 569 | float64 |
texture_se | 569 | float64 |
perimeter_se | 569 | float64 |
area_se | 569 | float64 |
smoothness_se | 569 | float64 |
compactness_se | 569 | float64 |
concavity_se | 569 | float64 |
concave points_se | 569 | float64 |
symmetry_se | 569 | float64 |
fractal_dimension_se | 569 | float64 |
radius_worst | 569 | float64 |
texture_worst | 569 | float64 |
perimeter_worst | 569 | float64 |
area_worst | 569 | float64 |
smoothness_worst | 569 | float64 |
compactness_worst | 569 | float64 |
concavity_worst | 569 | float64 |
concave points_worst | 569 | float64 |
symmetry_worst | 569 | float64 |
fractal_dimension_worst | 569 | float64 |
Unnamed: 32 | 0 | float64 |
dtypes: float64(31), int64(1), object(1)
Drop the empty column (Unnamed: 32 has no non-null values)
df.drop("Unnamed: 32", axis=1, inplace=True)
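As a side note, an entirely empty column such as the one dropped above can also be detected and removed without hard-coding its name. The snippet below is a minimal sketch on a hypothetical toy frame (the values and the `toy` name are illustrative, not taken from the CSV):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the CSV: one column is entirely NaN,
# like "Unnamed: 32" in the real file.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()

# Drop every all-NaN column without naming it explicitly.
cleaned = toy.dropna(axis=1, how="all")

print(empty_cols)
print(cleaned.columns.tolist())
```

This variant keeps working if the export adds a differently named empty column, at the cost of being less explicit about what is removed.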
First few rows of Data
df.head()
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 32 columns
Last few rows of Data
df.tail()
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
926424 | M | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | ... | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
926682 | M | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | ... | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
926954 | M | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | ... | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
927241 | M | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | ... | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |
92751 | B | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 |
5 rows × 32 columns
Checking Null values
df.isnull().sum()
Column | Null Count |
---|---|
id | 0 |
diagnosis | 0 |
radius_mean | 0 |
texture_mean | 0 |
perimeter_mean | 0 |
area_mean | 0 |
smoothness_mean | 0 |
compactness_mean | 0 |
concavity_mean | 0 |
concave points_mean | 0 |
symmetry_mean | 0 |
fractal_dimension_mean | 0 |
radius_se | 0 |
texture_se | 0 |
perimeter_se | 0 |
area_se | 0 |
smoothness_se | 0 |
compactness_se | 0 |
concavity_se | 0 |
concave points_se | 0 |
symmetry_se | 0 |
fractal_dimension_se | 0 |
radius_worst | 0 |
texture_worst | 0 |
perimeter_worst | 0 |
area_worst | 0 |
smoothness_worst | 0 |
compactness_worst | 0 |
concavity_worst | 0 |
concave points_worst | 0 |
symmetry_worst | 0 |
fractal_dimension_worst | 0 |
dtype: int64
Statistical information
df.describe()
Statistic | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 5.69e+02 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
mean | 3.04e+07 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 16.269190 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
std | 1.25e+08 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
min | 8.67e+03 | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 |
25% | 8.69e+05 | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | ... | 13.010000 | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 |
50% | 9.06e+05 | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | ... | 14.970000 | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 |
75% | 8.81e+06 | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | ... | 18.790000 | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 |
max | 9.11e+08 | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | ... | 36.040000 | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 |
8 rows × 31 columns
Load the breast cancer dataset (scikit-learn's built-in copy, used for model training)
data = load_breast_cancer()
X = data.data
y = data.target
print(X)
print(y)
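Because the plots use the CSV-based df while the model is trained on scikit-learn's copy of the data, it helps to check how the labels are encoded. The sketch below inspects the built-in dataset; the small `diagnosis` series and the M/B mapping are hypothetical illustrations of how the CSV column could be aligned with that encoding:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# In scikit-learn's copy, class 0 is "malignant" and class 1 is "benign".
target_names = list(data.target_names)
print(target_names)

# Samples per class: index 0 -> malignant, index 1 -> benign.
class_counts = np.bincount(data.target)
print(class_counts)

# The feature matrix has 30 columns (no id or diagnosis column).
print(data.data.shape)

# Hypothetical mapping that would align the CSV's M/B labels with this encoding.
diagnosis = pd.Series(["M", "B", "M"])
encoded = diagnosis.map({"M": 0, "B": 1})
print(encoded.tolist())
```

Note that with this encoding a prediction of 1 means benign, which matters when reading precision and recall later.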
plt.figure(figsize=(20, 10))
sns.heatmap(df.describe().T, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Statistical Summary of Numerical Columns")
plt.show()
Pairplot
sns.pairplot(df[['radius_mean', 'texture_mean', 'perimeter_mean','area_mean', 'smoothness_mean']])
plt.show()
Histogram
scaler = StandardScaler()
# Fit and transform the 'radius_mean' column
scaled_data = scaler.fit_transform(df[['radius_mean']])
# Create a new DataFrame with the scaled data for 'radius_mean'
scaled_df = pd.DataFrame(scaled_data, columns=['radius_mean'], index=df.index)
plt.figure()
plt.hist(df['radius_mean'], alpha=0.5, label='Raw')
plt.hist(scaled_df['radius_mean'], alpha=0.5, label='Scaled')
plt.legend()
plt.title('Feature Scaling: radius_mean')
plt.show()
Confusion matrix (illustrated with example labels, not model output)
y_true = np.array([0, 1, 0, 1, 1]) # Example ground truth
y_pred = np.array([1, 1, 0, 0, 1]) # Example predictions
cm = confusion_matrix(y_true, y_pred)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
tick_marks = np.arange(2) # Assuming binary classification
plt.xticks(tick_marks, ['Class 0', 'Class 1'], rotation=45)
plt.yticks(tick_marks, ['Class 0', 'Class 1'])
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X.shape)
print(X_train.shape)
print(X_test.shape)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
type(X_train)
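The scaler is fitted on the training split only and then reused for the test split, so no test statistics leak into preprocessing. The following sketch repeats the split with the same parameters and checks the effect; the `_s` suffixed names are introduced here for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_s = scaler.transform(X_test)        # test rows reuse those statistics

# Each training feature ends up with mean ~0 and standard deviation ~1.
train_means = X_train_s.mean(axis=0)
train_stds = X_train_s.std(axis=0)
print(np.allclose(train_means, 0.0, atol=1e-6))
print(np.allclose(train_stds, 1.0, atol=1e-6))

# Test features are only approximately centered, since they were not used for fitting.
print(np.abs(X_test_s.mean(axis=0)).max())
```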
Convert data to PyTorch tensors and move them to the selected device (GPU if available, otherwise CPU)
X_train = torch.tensor(X_train, dtype=torch.float32).to(device)
y_train = torch.tensor(y_train, dtype=torch.float32).to(device)
X_test = torch.tensor(X_test, dtype=torch.float32).to(device)
y_test = torch.tensor(y_test, dtype=torch.float32).to(device)
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out
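To make the architecture concrete, the sketch below builds the same layer stack with nn.Sequential (an equivalent formulation, not the class used in this project) and runs a forward pass on random input. The sizes 30, 64, and 1 are assumed from the 30 input features and the hyperparameters used in this project:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layer stack as NeuralNet, written with nn.Sequential: 30 -> 64 -> 1.
model = nn.Sequential(
    nn.Linear(30, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)

batch = torch.randn(8, 30)  # 8 hypothetical standardized samples
probs = model(batch)        # one probability per sample

print(probs.shape)
print(bool(((probs >= 0) & (probs <= 1)).all()))

# Trainable parameters: 30*64 + 64 weights/biases in layer 1, 64 + 1 in layer 2.
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```

The output has shape (N, 1), which is why the labels are reshaped with view(-1, 1) before the loss is computed.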
input_size = X_train.shape[1]
hidden_size = 64
output_size = 1
learning_rate = 0.001
num_epochs = 100
Initialize the neural network and move it to the GPU (if available)
model = NeuralNet(input_size, hidden_size, output_size).to(device)
Loss and the optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
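As a small illustration of what nn.BCELoss computes, the sketch below compares it with the mean binary cross-entropy written out by hand. The probabilities and labels are made-up values, not outputs of this project's model:

```python
import torch
import torch.nn as nn

probs = torch.tensor([[0.9], [0.2], [0.7]])  # hypothetical sigmoid outputs, shape (3, 1)
labels = torch.tensor([1.0, 0.0, 1.0])       # 1-D labels, stored the same way as y_train

# Reshape labels to (N, 1) so they match the model output shape.
targets = labels.view(-1, 1)

# Mean binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged over samples.
manual = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()
builtin = nn.BCELoss()(probs, targets)

print(f"manual: {manual.item():.4f}, built-in: {builtin.item():.4f}")
```

Both values agree, and the loss shrinks as the predicted probability for the true class approaches 1.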
# train the model (full-batch: every epoch uses the whole training set)
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train.view(-1,1))
    loss.backward()
    optimizer.step()

    # calculate accuracy
    with torch.no_grad():
        predicted = outputs.round()
        correct = (predicted == y_train.view(-1,1)).float().sum()
        accuracy = correct/y_train.size(0)

    if (epoch+1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss : {loss.item():.4f}, Accuracy: {accuracy.item() * 100:.2f}%")
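The loop above performs one full-batch update per epoch, which is fine for 455 training rows. For larger datasets, a common alternative is mini-batch training with a DataLoader. The sketch below shows that variant on synthetic stand-in data; it is not the training procedure used in this project, and the data, batch size, and epoch count are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 256 samples, 30 features, label driven by two features.
X = torch.randn(256, 30)
y = (X[:, 0] + X[:, 1] > 0).float().view(-1, 1)

model = nn.Sequential(nn.Linear(30, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

epoch_losses = []
for epoch in range(20):
    model.train()
    running = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        running += loss.item() * xb.size(0)  # weight by batch size
    epoch_losses.append(running / len(loader.dataset))  # mean loss over the epoch

print(f"first epoch loss: {epoch_losses[0]:.4f}, last epoch loss: {epoch_losses[-1]:.4f}")
```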
Evaluation on training set
model.eval()
with torch.no_grad():
    outputs = model(X_train)
    predicted = outputs.round()
    correct = (predicted == y_train.view(-1,1)).float().sum()
    accuracy = correct/y_train.size(0)
    print(f"Accuracy on training data: {accuracy.item() * 100:.2f}%")
Evaluation on test set
model.eval()
with torch.no_grad():
    outputs = model(X_test)
    predicted = outputs.round()
    correct = (predicted == y_test.view(-1,1)).float().sum()
    accuracy = correct/y_test.size(0)
    print(f"Accuracy on test data: {accuracy.item() * 100:.2f}%")
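Accuracy alone hides which kind of error the model makes, and in a diagnostic setting a missed malignant case is usually the costlier mistake. The sketch below shows one way to turn sigmoid outputs into a confusion matrix, precision, and recall with scikit-learn. The probabilities and labels are hypothetical, and the class encoding (0 = malignant, 1 = benign) is assumed from scikit-learn's copy of the dataset:

```python
import torch
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical sigmoid outputs and true labels for six test samples.
probs = torch.tensor([[0.95], [0.10], [0.80], [0.40], [0.30], [0.70]])
y_true = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])

# Threshold at 0.5, then flatten to 1-D integer arrays on the CPU for scikit-learn.
y_pred = (probs >= 0.5).int().view(-1).cpu().numpy()
y_true_np = y_true.int().cpu().numpy()

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_np, y_pred)
tn, fp, fn, tp = cm.ravel()

precision_benign = precision_score(y_true_np, y_pred)               # positive class = 1
recall_malignant = recall_score(y_true_np, y_pred, pos_label=0)     # positive class = 0

print(cm)
print(f"precision (class 1): {precision_benign:.3f}")
print(f"recall (class 0): {recall_malignant:.3f}")
```

With a trained model, `probs` would be replaced by model(X_test) and `y_true` by y_test.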
- High Predictive Accuracy: The neural network achieved 98.02% accuracy on the training set and 97.37% on the test set, indicating excellent generalization to new data.
- Model Effectiveness: A simple architecture with one hidden layer was sufficient for achieving high classification accuracy.
- Preprocessing Importance: Standardizing features with StandardScaler improved model stability and performance by preventing large-range attributes from dominating the learning process.
- PyTorch Efficiency: PyTorch moves the model and data to a GPU (CUDA) when one is available and falls back to the CPU otherwise, which keeps training over 100 epochs efficient.
- Kaggle – Dataset source
- Spyder IDE – Interactive environment for coding and presenting analysis
- Python – Data analysis, manipulation, and visualization
  - Libraries: numpy, pandas, matplotlib, seaborn
- Machine Learning – Model development and evaluation
  - Scikit-learn: train_test_split, StandardScaler
- Deep Learning – Neural Network
  - PyTorch: torch, torch.nn, torch.optim
This project successfully developed a PyTorch-based neural network for classifying breast cancer tumors. Through a structured workflow of data preprocessing, model training, and evaluation, the model achieved over 97% accuracy on the test dataset. This result highlights the power of neural networks, even simple architectures, in addressing complex medical classification tasks when applied to relevant features.
The project demonstrates how deep learning tools can aid in medical diagnosis, showcasing the potential of reliable predictive models based on quantitative image-derived data. The model’s strong generalization suggests it could be valuable in computer-aided breast cancer screening systems.