Figuring out what we can do with the data available and what we can’t
- Making our Chest X-ray COVID-19 classifier
2.1. Data preparation
- Does it really work?
3.1. Further analysis
3.2. Discussion and takeaways
Today everyone knows about the pandemic. Professionals do their best to help deal with it: preventing rapid spread of infection, developing diagnosis methods, drug discovery, patient care strategies, vaccine development, mortality predictions, modeling implications for the global economy, and many many others.
Data scientists are no exception. There’s no need to convince anyone that AI works great for medical applications. You may have seen publications (even scientific papers) claiming that a model has been developed that can predict whether a patient has COVID-19. Some of them claim 90+% prediction accuracy from applying deep learning to chest X-ray images, which raises a lot of questions.
Here’s why: at Futuremed we work with medical data in close collaboration with doctors, and radiologists say there are few to no COVID-19-specific patterns in chest X-rays.
There are times when AI is capable of finding features that no human doctor can.
Maybe that’s the case with COVID-19 on chest X-rays too? Let’s find out.
This is not a tutorial on “how to train a neural network”, but I’ll add all the information necessary to reproduce the described work.
2.1. Data preparation
For the task of classifying COVID-19 on Chest X-Ray images, our dataset should have at least two classes: “COVID-19” and “Other”.
We are going to use four sources of data for training:
1. The famous GitHub repo with COVID-19 images.
2. Images from the Italian database with COVID-19 cases.
3. Kaggle chest X-ray pneumonia dataset.
4. NIH ChestXRay-14 dataset.
To get as many COVID-19 images as possible, let’s combine the first two sources. Most of the images from the Italian database have already been included in the GitHub repo; the rest we added manually. That way we get all COVID-19 images available at the moment (7 April 2020), plus a handful of images without COVID-19 (other pathologies or “no finding”) that will serve as “Other” class samples. One thing to note here: a patient can have multiple images in this part of the dataset, so n_patients ≤ n_images.
To get more images for the “Other” class the last two sources are used.
- 450 images randomly picked from the chest X-ray pneumonia dataset in a per-class balanced manner. Patient IDs were not taken into account. That way we get:
150 images of 149 patients with no finding,
150 images of 144 patients with viral pneumonia,
150 images of 144 patients with bacterial pneumonia.
- Randomly picked 450 images from NIH ChestXRay-14 dataset:
30 images for each of the 14 pathologies, plus another 30 images with the “no finding” label. Images with one target pathology may contain other pathologies as well, so this part of the dataset is only approximately balanced.
The images were picked so that this sub-dataset contains at most one image per patient. In other words, we get 450 images of 450 unique patients.
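The one-image-per-patient sampling above can be sketched in a few lines of plain Python. This is an illustration, not the project’s actual code: `records` stands for a hypothetical flat index of `(patient_id, image_path)` pairs parsed from the ChestXRay-14 metadata.

```python
import random

def sample_one_image_per_patient(records, n_samples, seed=42):
    """Keep at most one image per patient, then randomly sample n_samples.

    `records` is a list of (patient_id, image_path) tuples -- an assumed
    flat index built from the ChestXRay-14 metadata CSV.
    """
    rng = random.Random(seed)
    by_patient = {}
    for patient_id, image_path in records:
        # Keep only the first image seen for each patient.
        by_patient.setdefault(patient_id, image_path)
    unique = list(by_patient.items())
    rng.shuffle(unique)
    return unique[:n_samples]

# Example: patient "p1" has two images, so only one of them can survive.
records = [("p1", "a.png"), ("p1", "b.png"), ("p2", "c.png"), ("p3", "d.png")]
picked = sample_one_image_per_patient(records, n_samples=2)
```

Whatever two records are picked, they are guaranteed to belong to two distinct patients.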
Next, let’s combine all the data. Here are resulting dataset stats:
All the images were resized to 564×564. Mean and standard deviation were calculated for the images in the dataset.
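Computing the dataset-level mean and standard deviation is straightforward; here is a numpy sketch (in the real pipeline the arrays would be the resized 564×564 grayscale X-rays loaded as floats):

```python
import numpy as np

def dataset_mean_std(images):
    """Pixel mean and std over a whole dataset of 2-D arrays.

    A sketch: `images` stands in for the resized 564x564 X-rays,
    assumed already scaled to [0, 1].
    """
    pixels = np.concatenate([img.ravel() for img in images])
    return float(pixels.mean()), float(pixels.std())

# Toy example: an all-black and an all-white image.
imgs = [np.zeros((4, 4)), np.ones((4, 4))]
mean, std = dataset_mean_std(imgs)  # mean = 0.5, std = 0.5
```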
Let’s use DenseNet-121 as a backbone for the model (it became almost a default choice for processing 2D medical images). And since our COVID-19 dataset is too small to train a model from scratch, let’s train our model on ChestXRay-14 first, and then use a pre-trained model for weight initialization.
When working with medical images it’s crucial to make sure that different images of one patient won’t get into training/validation/test sets. To address this issue and due to the scarcity of COVID-19 images, we decided to use 10-fold cross-validation over patients for training.
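One simple way to guarantee that every image of a patient lands in the same fold is to derive the fold index from a hash of the patient ID. This is just a sketch of the idea; scikit-learn’s `GroupKFold` with patient IDs as groups achieves the same goal.

```python
import hashlib

def patient_fold(patient_id, n_folds=10):
    """Deterministically assign a patient to one of n_folds folds.

    Because the fold depends only on the patient ID, all images of a
    patient end up in the same fold -- no patient leaks between
    training and validation sets.
    """
    digest = hashlib.md5(patient_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds

# Every image of patient "p42" maps to the same fold.
folds = {patient_fold("p42") for _ in range(5)}
```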
The following data augmentations were performed for training:
- Random rotate (<15°),
- Random crop (512×512),
- Random intensity shift.
For evaluation, we used only center crops (no TTA).
Calculated mean and std were used to standardize images after augmentation.
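Two of the augmentations above (random crop and intensity shift), followed by the standardization step, might look like this in numpy. It is an illustrative sketch: rotation is omitted because it needs an image library, and all function names here are made up for the example.

```python
import numpy as np

def augment(img, rng, crop=512, max_shift=0.1):
    """Random 512x512 crop plus a random additive intensity shift.

    `img` is assumed to be a 564x564 float array; `max_shift` bounds
    the intensity shift (an assumed value, not the project's).
    """
    h, w = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    return patch + rng.uniform(-max_shift, max_shift)

def standardize(img, mean, std):
    """Apply the dataset-level mean/std computed beforehand."""
    return (img - mean) / std

rng = np.random.default_rng(0)
out = standardize(augment(np.zeros((564, 564)), rng), mean=0.5, std=0.25)
```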
The network was modified to produce two logits, one per class (“COVID-19” and “Other”). The data was unbalanced, so we chose weighted binary cross-entropy as the loss function. Soft labeling was also used: one-hot encoded labels were smoothed by 0.05. Since we cross-validate over patients, the number of images in each class changes from fold to fold, so per-class weights are computed for every fold on the fly.
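The label smoothing and weighted loss can be sketched in numpy (a simplified stand-in for the actual framework loss; the weight values below are illustrative, and in practice they would be recomputed per fold from the class counts, as described above):

```python
import numpy as np

def smooth(one_hot, eps=0.05):
    """Label smoothing: for two classes, 1 -> 0.975 and 0 -> 0.025."""
    return one_hot * (1.0 - eps) + eps / one_hot.shape[-1]

def weighted_bce(probs, targets, class_weights):
    """Per-class weighted binary cross-entropy (a numpy sketch)."""
    tiny = 1e-7
    probs = np.clip(probs, tiny, 1.0 - tiny)
    loss = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return float((loss * class_weights).mean())

y = smooth(np.array([[1.0, 0.0]]))   # soft targets: [[0.975, 0.025]]
w = np.array([2.0, 1.0])             # illustrative: minority class weighted up
loss = weighted_bce(np.array([[0.9, 0.1]]), y, w)
```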
The network was trained with the Adam optimizer with AMSGrad. Other hyperparameters and the code can be found in the project repo. For each fold, the model with the best validation ROC AUC was saved.
The resulting models form an ensemble that is used for further analysis. Metrics averaged over all validation folds:
ROC AUC: 0.99387,
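The ensemble itself can be as simple as averaging the per-fold probabilities (a common choice, and an assumption here rather than a detail stated in the text):

```python
import numpy as np

def ensemble_predict(fold_probs):
    """Mean ensemble over fold models.

    `fold_probs` has shape (n_folds, n_samples): each row is one fold
    model's predicted COVID-19 probability for every test image.
    """
    return np.asarray(fold_probs).mean(axis=0)

# Two fold models, two test images.
probs = ensemble_predict([[0.9, 0.2], [0.7, 0.4]])  # -> [0.8, 0.3]
```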
For testing, we used new frontal (PA or AP views) X-ray images from the GitHub repo (the ones that were added from 7 to 22 April 2020).
To balance the “COVID-19” and “Other” classes, the required number of images was added from ChestXRay-14, randomly picked from patients not used in training (since they were picked at random, statistically most of them carry the “no finding” label).
All these images with corresponding labels form a test set.
Per label stats on the test set:
And common metrics are:
Not bad! It seems like we’ve got a solid Chest X-ray COVID-19 classifier.
The results may look convincing to some readers, but others might have a “there’s something wrong here” feeling.
Let’s use more data for evaluating our classifier’s performance.
3.1. Further analysis
First, let’s look at the classifier’s performance stats on the rest of the ChestXRay-14 dataset (the part unused in training):
Since there are no COVID-19 cases in that dataset, the only thing we can claim is that our classifier has pretty good specificity (0.99235) on it.
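As a reminder, specificity is just the true-negative rate, so on a dataset with no positives it is the only metric the confusion matrix can give us (the counts below are illustrative, not the actual ones):

```python
def specificity(tn, fp):
    """Specificity = TN / (TN + FP): the fraction of actual negatives
    that the classifier correctly calls negative."""
    return tn / (tn + fp)

# Illustrative counts: 99 correct rejections, 1 false alarm.
spec = specificity(tn=99, fp=1)  # -> 0.99
```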
Also, you can see that there’s no spike of false positives in classes such as “Pneumonia” and “Infiltration”, the ones whose X-ray picture might resemble COVID-19. Does this mean an AI algorithm can distinguish COVID-19 from other similar-looking pathologies?
Second, let’s have a look at the classifier’s performance on unseen proprietary data. The used dataset has only “normal” and “abnormal” labels. There are no COVID-19 positive patient images in this dataset.
Specificity drops significantly (to 0.69333). The classifier doesn’t seem so good now. What happened?
Third, let’s look at the detailed results on new images (7–22 Apr 2020) from GitHub repo with COVID-19 cases.
Summarizing all the statistics above, we get:
After recalculating the metrics we can already see something:
It’s not “not bad” anymore. We can see the truth now: our classifier is rubbish. Let’s think this through.
3.2. Discussion and takeaways
As mentioned, the resulting precision shows that the classifier isn’t able to pick out COVID-19-specific patterns in the images (remember, radiologists say there are few patterns specific to COVID-19 in chest X-rays). But what has the classifier learned then, and why does it perform well on the GitHub repo and ChestXRay-14 data?
The classifier learned what images from the datasets picked for the “Other” class look like. It also learned that any pathological pattern means “COVID-19”, provided the image doesn’t look like it came from the “Other” datasets.
So, in general, the classifier learned to flag anything that looks pathological and doesn’t look like the “Other” images.
That’s why it marked almost every third image of our proprietary dataset (whose images don’t look similar to the “Other” images) as “COVID-19”.
The classifier does know some differences between normal and abnormal images, though: it marked every 3rd abnormal and every 5th normal image as “COVID-19”. At least not all of our effort was in vain 🙂
Despite strong data augmentation while training, careful patient-wise k-fold cross-validation, and weighted loss function, the classifier failed to perform well on real-world data.
We encourage anyone interested to reproduce our experiment.
Actually, you don’t even need proprietary data: just exclude one dataset from the “Other” class and use it as “unseen” data.
I’d like to point out the two main takeaways:
Any neural network will always try to find the easiest way to solve the task.
Look closely at the data your model’s performance is validated on, not just the bare numbers.
At this point, I’m not stating that it’s absolutely impossible to find COVID-19-specific patterns on chest X-rays. There’s a chance that something specific does exist and that AI could capture it. But it’s definitely not possible when:
- The neural network is trained on a small amount of data
- Images for some specific class have significant differences from the rest of the dataset
All the source code is available at our project’s GitHub repo. Feel free to contact me if you have any questions.