In the matter of confirmed COVID-19 cases, some areas have been affected much more than others. Take the United States for example. As of writing, New York has around 60,000 confirmed cases while most states are less than 1,000. This wild of an outlier makes the entire population difficult to visualize because the scale is extremely large compared to the average value. When we map the confirmed cases by state, we end up with a map that basically only has two colors (left) while we’d much rather a map with a diverse color saturation (right).

You can see that New York is solid red indicating it is the highest, but almost every other state is purple indicating the lowest possible value. The issue is that our data has such a large scale due to the single outlier that it reduces the effect of the coloring on the other states. In this blog post, I’m going to show you a few ways to quickly fix this problem.

Option #1: Setting a Center Value

We can quickly rescale the map if we set a center value that indicates what number of cases our middle color (yellow) should be. Note that I am using a diverging color scale, with purple as the lowest, yellow in the middle, and red as the highest. Setting the center value to 3500 makes the map look much nicer.

The problem with this method is that it’s not dynamic. This map will grow less perfect as our cases rise over the coming weeks. This is a decent solution for a static dataset, but not one that is expected to change in the near term.

Option #2: Log Scale

We can quickly rescale our data by changing from linear scale (default) to log scale. You can do this by creating a calculated column or measure and using the LOG function to convert to log base 10. Note that you can choose base 2, 5, 10, etc. The base that you choose will not affect the visualization.

Next, throw the new Log Scale column into the map and note the difference.

This looks better than the default map, but it seems to focus too much on the center values. If you’re looking for a quick solution that remains dynamic as your data grows, this might work for you.

Option #3 (Best Solution): Scaling Down Outliers using Standard Deviation

In my testing, the best solution was to take a slight statistical approach. Simply put, we can calculate the standard deviation of the dataset to determine how spread out our data is from the average. Using this value, we can set bounds for the upper and lower limits of the values.

First, we need to calculate the mean. I’m going to perform these calculations as calculated columns to keep it simple to follow, but you can convert this to a single measure to keep your data model light. To calculate the mean, we simply need to take the AVERAGE of the Confirmed Cases column. We return the same value of 2775 for each row, which is expected given that we’re not using the CALCULATE function.

Next, calculate the standard deviation which will give us an idea of how much our data varies around the mean. Since New York is such an outlier, we can expected that the standard deviation will be high. In fact, our standard deviation calculation comes out to 8319 which is much higher than the mean. Note that we are using the STDEV.P function here and simply providing the Confirmed Cases column as the only parameter.

Finally, we’ll call our output column “Scaled Cases” and give it some logic. We basically want to scale down our outliers that are over one standard deviation away from the mean. Since our mean is 2775 and our standard deviation is 8319, totaling 11094, we want to convert values larger than this to simply be 11094. This ensures that large values will continue to be within the normal range of our dataset.

You can see that New York and Jew Jersey have been scaled down, while the normal values remain as they were. Using our new Scaled Cases column in the map visual, we get a much nicer representation of the data 😊

Here’s the final result (map on the right) compared to the original plot (left).

And with that, we’re done! One last note: you can change the Scaled Cases calculation to use two standard deviations away from the mean if you don’t want to scale down your data quite as much. This will highlight the difference between New York and New Jersey as they won’t have the same value, but you’ll lose a bit of coloring for states with lower cases.

Also, if you’re looking for top Power BI training for a fraction of the price, make sure to check out the BI Elite Training portal! There is a ton of great info available that you’re sure to enjoy.

Parker Stevens

Parker is the creator of the BI Elite YouTube channel, a community of over 20,000 students learning Power BI, DAX, and Power Query. He is a Microsoft Data Platform MVP for his work with Power BI and continues to provide elite Power Platform training courses to help data analysts, BI developers, and citizen dashboard creators reach their goals. Parker is committed to producing high-quality training content that is also extremely cost-effective, to ensure that the largest amount of users can benefit from the content. Training courses located at https://training.bielite.com/