
AI vs COVID-19. Does it really work?

Figuring out what we can do with the data available and what we can’t

Mikhail Padalko
Photo by Alissa Eckert, MS, and Dan Higgins, MAMS, on PHIL


  2. Making our Chest X-ray COVID-19 classifier
    2.1. Data preparation
    2.2. Training
    2.3. Results
  3. Does it really work?
    3.1. Further analysis
    3.2. Discussion and takeaways

Data scientists are no exception. There’s no need to convince anyone that AI works great for medical applications. You may have seen publications (even scientific papers) claiming that a model can predict whether a patient has COVID-19 or not. Some claim 90+% prediction accuracy when applying deep learning to chest X-ray images, which raises a lot of questions.

Here’s why: at Futuremed we work with medical data in close collaboration with doctors, and radiologists say that there are few to no COVID-19-specific patterns in chest X-rays.
Still, there are times when AI is capable of finding features that no human doctor can.

Gender classification using retinal photos.

Maybe that’s also the case with COVID-19 on chest X-rays? Let’s find out.

2.1. Data preparation

To get as many COVID-19 images as possible, let’s combine the first two sources. Most of the images from the Italian database had already been included in the GitHub repo; the few that weren’t we added manually. That way we get all the COVID-19 images available at the time (7 April 2020), plus a couple of images without it (with other pathology or “no finding”; these are used as “Other” class samples). One thing to note here: each patient can have multiple images in that part of the dataset, so n_patients ≤ n_images.

To get more images for the “Other” class the last two sources are used.

  • 450 images randomly picked from the chest X-ray pneumonia dataset, balanced per class. Patient IDs were not taken into account. That way we get:
    150 images of 149 patients with no finding,
    150 images of 144 patients with viral pneumonia,
    150 images of 144 patients with bacterial pneumonia.
  • 450 images randomly picked from the NIH ChestXRay-14 dataset:
    30 images for each of the 14 pathologies and another 30 images with the “no finding” label. Images with one target pathology may contain other pathologies as well, so this part of the dataset is only approximately balanced.
    The images were picked so that this sub-dataset contains at most one image per patient. In other words, we get 450 images of 450 unique patients.
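The "one image per patient" sampling described above can be sketched roughly as follows. This is a minimal illustration, not the code from the project repo: the record layout `(image_id, patient_id, labels)` and the function name `sample_unique_patients` are assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_unique_patients(records, per_label, seed=0):
    """Pick up to `per_label` images for each label, using each patient at most once.

    `records` is a list of (image_id, patient_id, labels) tuples, where
    `labels` is a set of label strings attached to that image (hypothetical layout).
    """
    rng = random.Random(seed)

    # Group the records by every label they carry.
    by_label = defaultdict(list)
    for record in records:
        for label in record[2]:
            by_label[label].append(record)

    used_patients = set()
    picked = []
    for label in sorted(by_label):
        candidates = by_label[label][:]
        rng.shuffle(candidates)
        count = 0
        for image_id, patient_id, _labels in candidates:
            if count == per_label:
                break
            if patient_id in used_patients:
                continue  # this patient already contributed an image
            used_patients.add(patient_id)
            picked.append((image_id, patient_id, label))
            count += 1
    return picked
```

An image picked for one target label keeps its other labels, which is why such a subset ends up only approximately balanced.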

Next, let’s combine all the data. Here are the resulting dataset stats:

All the images were resized to 564×564. Mean and standard deviation were calculated for the images in the dataset.
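For illustration, here is one way to compute a dataset-wide mean and standard deviation in a single pass. It is only a sketch: it assumes grayscale images given as 2D lists of pixel values and a single global mean and std, which may differ from how the repo actually does it.

```python
import math


def dataset_mean_std(images):
    """Return (mean, std) over every pixel of every image.

    `images` is an iterable of 2D lists of grayscale pixel values
    (an assumed layout for this example).
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for image in images:
        for row in image:
            for value in row:
                n += 1
                total += value
                total_sq += value * value
    mean = total / n
    # Var[x] = E[x^2] - E[x]^2; clamp at 0 to guard against rounding error.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

Accumulating sums this way means the whole dataset never has to sit in memory at once.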

2.2. Training

  • Random rotate (<15°),
  • Random crop (512×512),
  • Random intensity shift.

For evaluation, we used only center crops (no TTA).
Calculated mean and std were used to standardize images after augmentation.
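The difference between training-time and evaluation-time cropping can be sketched like this. Rotation and intensity shift are left out, and the helper names and the plain 2D-list image layout are assumptions for the example rather than the project's actual pipeline.

```python
import random

CROP = 512  # crop size from the text; inputs are assumed to be 564x564


def crop(image, top, left, size=CROP):
    """Cut a size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def random_crop(image, rng, size=CROP):
    """Training: pick the window position uniformly at random."""
    max_top = len(image) - size
    max_left = len(image[0]) - size
    return crop(image, rng.randint(0, max_top), rng.randint(0, max_left), size)


def center_crop(image, size=CROP):
    """Evaluation: always take the central window (no test-time augmentation)."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return crop(image, top, left, size)


def standardize(image, mean, std):
    """Apply the dataset-wide mean and std after augmentation."""
    return [[(value - mean) / std for value in row] for row in image]
```

With 564×564 inputs, the center crop starts 26 pixels from the top and left edges, while a random crop can start anywhere from 0 to 52.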

The network was modified to produce two logits, one per class (“COVID-19” and “Other”). The data was unbalanced, so we chose weighted binary cross-entropy as the loss function. Soft labeling was also used: the one-hot encoded labels were smoothed by 0.05. As we cross-validate over patients, the number of images in each class changes from one fold to another, so we calculate per-class weights for every fold on the fly.
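To make the loss description more concrete, here is a small sketch of per-fold class weights combined with label smoothing. Several details are assumptions, not confirmed by the text: inverse-frequency weights, smoothed targets of 0.95 and 0.05, class index 0 for “Other” and 1 for “COVID-19”, and weighting each sample by the weight of its true class. The exact formulation in the repo may differ.

```python
import math


def class_weights(labels, n_classes=2):
    """Inverse-frequency weights: n_samples / (n_classes * n_samples_in_class).

    Assumes every class appears at least once in the fold.
    """
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    return [len(labels) / (n_classes * c) for c in counts]


def smooth_one_hot(y, n_classes=2, eps=0.05):
    """One-hot target with the hot entry at 1 - eps and the others at eps."""
    return [1.0 - eps if k == y else eps for k in range(n_classes)]


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def weighted_smoothed_bce(logits, y, weights, eps=0.05):
    """Loss for one sample: mean BCE over the logits, scaled by the true class weight."""
    targets = smooth_one_hot(y, len(logits), eps)
    per_class = [bce_with_logits(z, t) for z, t in zip(logits, targets)]
    return weights[y] * sum(per_class) / len(per_class)
```

Calling `class_weights` on the training labels of each fold gives the “on the fly” behaviour: the rarer class in that fold gets the larger weight.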

The network was trained using the Adam optimizer with AMSGrad. Other hyperparameters and the code can be found in the project repo here. For each fold, the model with the best validation ROC AUC was saved.
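Selecting a checkpoint by validation ROC AUC can be illustrated with a dependency-free sketch. The pairwise formulation below is just one simple way to compute ROC AUC, and `best_epoch` is a hypothetical helper, not a function from the project.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. The double loop is O(n_pos * n_neg),
    which is fine for small validation folds.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_epoch(auc_per_epoch):
    """Index of the epoch with the highest validation AUC (the first one on ties)."""
    return max(range(len(auc_per_epoch)), key=lambda i: auc_per_epoch[i])
```

In a training loop, one would compute `roc_auc` on the validation fold after every epoch and keep the weights from the epoch returned by `best_epoch`.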

2.3. Results

For testing, we used new frontal (PA or AP views) X-ray images from the GitHub repo (the ones that were added from 7 to 22 April 2020).
To balance the two classes (“COVID-19” and “Other”), we added the required number of images randomly picked from ChestXRay-14 patients not used in training (as they were picked randomly, statistically most of them had the “no finding” label).
All these images with corresponding labels form a test set.

Per label stats on the test set:

And common metrics are:

Not bad! It seems like we’ve got a solid Chest X-ray COVID-19 classifier.

The results may look convincing to some readers, but others might have a “there’s something wrong here” feeling.

3.1. Further analysis

Since there are no COVID-19 cases in that dataset, the only thing we can claim is that our classifier has pretty good specificity (0.99235) on it.
Also, you can see that there’s no peak of false positives on classes such as “Pneumonia” and “Infiltration”, the ones that might look similar to COVID-19 on an X-ray. Does it mean that COVID-19 can be distinguished from other similar-looking pathologies by an AI algorithm?
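This kind of check on a dataset with no COVID-19 cases can be sketched as follows. The record layout `(labels, predicted_covid)` is an assumption for the example; on such a dataset every “COVID-19” prediction is a false positive, so specificity is simply the share of images that were not flagged.

```python
from collections import defaultdict


def specificity(records):
    """Share of COVID-free images that the classifier did NOT flag as COVID-19.

    `records` is a list of (labels, predicted_covid) pairs for images that are
    all known to be COVID-19 negative (hypothetical layout).
    """
    kept = sum(1 for _labels, predicted_covid in records if not predicted_covid)
    return kept / len(records)


def false_positive_rate_by_label(records):
    """For each pathology label, the share of its images flagged as COVID-19."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for labels, predicted_covid in records:
        for label in labels:
            total[label] += 1
            if predicted_covid:
                flagged[label] += 1
    return {label: flagged[label] / total[label] for label in total}
```

A peak for a label like “Pneumonia” in the second function's output would suggest the classifier confuses that pathology with COVID-19.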

Second, let’s have a look at the classifier’s performance on unseen proprietary data. This dataset has only “normal” and “abnormal” labels, and it contains no images of COVID-19 positive patients.

Specificity drops significantly (to 0.69333). The classifier doesn’t seem to be so good now. What happened?

Third, let’s look at the detailed results on the new images (7–22 Apr 2020) with COVID-19 cases from the GitHub repo.

Summarizing all the statistics above, we get:

After recalculating the metrics we can already see something:
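Recalculating metrics over several evaluation sets amounts to summing their confusion counts first and computing the ratios afterwards. Here is a rough sketch; the dictionary keys and helper names are assumptions, and any numbers you feed in are your own counts, not the ones from this article.

```python
def pool(*confusions):
    """Sum confusion counts (tp, fp, tn, fn) from several evaluation sets."""
    total = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for confusion in confusions:
        for key in total:
            total[key] += confusion.get(key, 0)
    return total


def metrics(c):
    """Common metrics from pooled counts; None where a metric is undefined."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(c["tp"] + c["tn"], c["tp"] + c["tn"] + c["fp"] + c["fn"]),
        "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),
        "specificity": ratio(c["tn"], c["tn"] + c["fp"]),
        "precision": ratio(c["tp"], c["tp"] + c["fp"]),
    }
```

Adding a COVID-free dataset leaves sensitivity untouched, but every extra false positive lowers specificity and precision, which is why the pooled picture can look much worse than the balanced test set alone.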

It’s not “not bad” anymore. We can see the truth now: our classifier is rubbish. Let’s think this through.

3.2. Discussion and takeaways

The classifier learned what images from the datasets picked for the “Other” class look like. It also learned that any pathological pattern means “COVID-19”, provided the image doesn’t look like it came from the “Other” datasets.

So, generally, the classifier learned to detect anything pathological that doesn’t look like the “Other” images.

Original image of the patient with classifier’s prediction of COVID-19 (left), corresponding GradCAM (right).

That’s why it marked almost every third image as “COVID-19” on our proprietary dataset (containing images that don’t look similar to “Other” images).
The classifier knows some differences between normal and abnormal images though. It marked as “COVID-19” every 3rd abnormal and every 5th normal image. At least not all of our effort was in vain 🙂

Despite strong data augmentation during training, careful patient-wise k-fold cross-validation, and a weighted loss function, the classifier failed to perform well on real-world data.

We encourage anyone interested to reproduce our experiment.
Actually, you don’t need any proprietary data: you can just exclude one dataset from the “Other” class and use it as “unseen”.
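A leave-one-source-out split of this kind could look roughly like the sketch below. The sample layout (dicts with `image`, `label` and `source` keys) and the source names in the usage note are hypothetical.

```python
def hold_out_source(samples, unseen_source):
    """Split samples into (train, unseen) by keeping one whole source out of training.

    `samples` is a list of dicts with "image", "label" and "source" keys
    (hypothetical layout). The held-out source must contain only "Other"
    images, otherwise COVID-19 training data would be lost.
    """
    train = [s for s in samples if s["source"] != unseen_source]
    unseen = [s for s in samples if s["source"] == unseen_source]
    if any(s["label"] != "Other" for s in unseen):
        raise ValueError("the held-out source should contain only 'Other' images")
    return train, unseen
```

For example, `hold_out_source(samples, "nih")` would train on everything except the images tagged with source `"nih"` and then measure specificity on those held-out images, which the classifier has never seen in any form.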

I’d like to point out the two main takeaways:

Any neural network will always try to find the easiest way to solve the task.

Look closely at the data on which your model’s performance is validated, not just at the bare numbers.

At this point, I’m not stating that it’s absolutely impossible to find any COVID-19 specific patterns on Chest X-rays at all. There’s a chance that something specific exists indeed, which AI would be able to capture. But it’s definitely not possible when:

  1. The neural network is trained on a small amount of data
  2. Images for some specific class differ significantly from the rest of the dataset

All the source code is available at our project’s GitHub repo. Feel free to contact me if you have any questions.