Exploratory Data Analysis with PowerBI
EDA is exactly what it sounds like. It is the process of investigating and gleaning valuable information from data. The insights that are found in the data can be something obvious, or it could be an interesting comparison that was not apparent from the start. EDA uses what’s called datasets, which could be found online, or you could create one yourself. The dataset contains all the information that you are trying to get insights on. The form in which these datasets can vary, but the most common will be CSV, JSON, or just plaintext.
For this blog, we will be using a dataset found online, which can be found here. Since the downloaded dataset had a few extra lines in the beginning of the file, we recommend using the CSV provided in the GitLab repository.
A Quick Introduction to PowerBI
If you are already familiar with PowerBI, you can skip this section and move onto the next. For those that may not be familiar, it might be good to have some additional knowledge on the platform before diving into the more technical aspects of this article. PowerBI is a Microsoft application that allows the user to create visuals and manipulate data all in one application. Most of this visual creation and data manipulation can be done through a drag-and-drop/point-and-click methodology. Though, you can get much more complex in PowerBI with custom queries, columns, relationships, measures, etc. To keep this article simple, we will just stick to the basics.
What are the Benefits of Using PowerBI for EDA?
PowerBI really enables those who may not be as adept at coding to still perform EDA to great extents. To use PowerBI, you do not have to have a coding background, but some exposure to Excel might be helpful as it uses a similar language for some of its data transformation. PowerBI also comes default with several visuals that can be used to produce meaningful graphs with little effort. You can also find plentiful other visuals online created be the community that can be downloaded and imported into PowerBI Desktop. There is a lot of built-in functionality that truly enables anyone to explore their data and create insightful analytics for a multitude of use cases.
To do any work with PowerBI, you need to have a Windows device that you can install the PowerBI Desktop application on. The PowerBI Desktop application can be downloaded here. Once PowerBI Desktop is installed, you are ready to start with your EDA.
Importing the Data
To get started with PowerBI, you need to first load in the data that you plan to work with. To do this, you will want to click the Get Data drop-down in the data section of the ribbon. You can see in this drop-down that there are a ton of options to get data from in PowerBI. There are even more options not listed in the drop-down. Since we are working with a CSV file, you’ll want to click the Test/CSV option.
This will bring up Windows Explorer where you can navigate to where you saved the CSV file. Once selecting the file, you will get a window like the one below. From there, click Transform Data… this will bring us to the next section in this article.
Transforming the Data
With PowerBI, there are only a few steps that we need to go through to get our data to the state that we need to start producing visuals. The first of which is to make the first row our column headers. When PowerBI loads the CSV, it makes the column headers the first row. Luckily for us, PowerBI makes it easy for us to convert the first row into our headers with the ‘Use First Row as Headers’ option in the transform section of the ribbon.
Now that our column names are looking better, we will move on to unpivoting our year columns. This dataset comes with a column for each year, but instead, we need a row for each country or continent for each year with the population for that year. To do this, PowerBI also makes this easy with the ‘Unpivot Columns’ button. Before doing this action, you will want to highlight all the Year columns, so 1960 – 2020 (you can easily do this by holding the Shift key when clicking). Select the Transform tab in the ribbon, click the drop-down for ‘Unpivot Columns’ in the Any Column section in the ribbon and click ‘Unpivot Only Selected Columns’.
This will transform all those columns into two new columns. One for the attribute (Year) and another for the value (population). This process exponentially increases the number of rows depending on the number of columns you unpivot.
The column names that it creates aren’t very useful, however. So, you will want to double-click the column names to rename them accordingly.
If you click the drop-down next to the Indicator Name and Indicator Code columns, you'll notice that there is only one value in each of those columns. Making those columns for this dataset useless. To get rid of those columns, we can highlight both columns, right-click, and select Remove Columns.
The last thing that we will do with this data is clean up the Country Name column. There are several entries that are not just countries or continents. In order to get rid of these entries, we will apply a filter to the Country Name column. To get started, click the drop down on the column name, then select Text Filters > Does Not Contain.
There are a number of filters to apply here. The easiest way to do this would be to apply just a single filter such as ‘&’ and then click Ok. You can then edit the query and paste in the rest of the filter found below. However, if you wanted to manually create all the conditions, they are as follows:
Table.SelectRows(#"Removed Columns", each not
Text.Contains([Country Name], "&") and not
Text.Contains([Country Name], "and") and not
Text.Contains([Country Name], "IDA") and not
Text.Contains([Country Name], "-") and not
Text.Contains([Country Name], ":") and not
Text.Contains([Country Name], "(") and not
Text.Contains([Country Name], "income") and not
Text.Contains([Country Name], "IBRD") and not
Text.Contains([Country Name], "World") and not
Text.Contains([Country Name], "OECD") and not
Text.Contains([Country Name], "area"))
To apply all your changes, in the top left, click Close and Apply.
Visualizing the Data
Now that the data is properly transformed, we are ready to start making visuals with our data. As mentioned in the beginning, PowerBI uses a very simple drag-and-drop methodology for creating visuals. We will not be going too in-depth on the visuals, but we encourage you to explore the different options available after going through this following process.
To start, we will create a line chart. Select the line chart option under the visualizations section on the right.
This will give you a blank visual as there is no data added to it yet. From the Fields section on the far right, drag and drop the following fields to the respective sections for the visual. If you don’t see these fields under the visualizations section, be sure to click the visual in the left panel.
Doing this will create a graph that looks like this below… which isn’t very useful at this stage because there is too much going on.
So what we’ll do is filter this visual so that we get fewer data results in a better looking visual. To begin, expand the Country Name heading in the Filters column and change the filter type to Top N.
From there, input 10 in the text field and then drag the population field into the By Value section. Click Apply Filter once done.
You should be able to see a drastic change in the visual. However, there are still improvements that can be made. The second filter that we will apply is to the Year. Expand the Year filter and then change the filter type to Advanced Filtering.
From there, change the ‘contains’ drop-down to ‘does not start with’ and input 196 in the text field. This will exclude all years in the 1960’s, which will make the X-Axis not look so squished.
Click Apply Filter when ready and see how the graph changes. Your final graph should look like the one below. In PowerBI, if you wanted to get more information on data points in the graph, all you need to do is hover over the graph in a certain section, as seen below.
We are going to create one more visual for this dataset. So, go ahead and create a new tab in your PowerBI desktop by clicking the big yellow plus sign in the bottom left. From there, click the Filled map visual from the visualizations section on the right.
Next, add in the Country Name for Location data and Population for Tooltips. Make sure to have the visual selected. Otherwise, those fields will not show up under the visualizations section.
The graph as it stands right now looks decent, but it is a little hard to tell which countries have the higher population count. To fix this, we will select the Format tab under the visualizations section and expand the data colors section. From there, click the "fx" option next to the default color.
This will bring up a screen that will allow us to change the colors that appear on the visual. The first thing you need to change Is the ‘What field should we base this on?’, this should be changed to ‘Sum of Population’. Then, select whatever colors you want for the minimum and maximum and click Ok. The final visual should look something like below.
You can hover over a certain section to get additional information. As well as zoom in and out on different sections of the map to get a closer view.
Today we looked at an overview of EDA and how you can conduct EDA with PowerBI. While this article may not cover all EDA processes or scenarios, it should give you additional insight or enough information to get you started on analyzing your own dataset. If you are someone that enjoys Python or is looking to get into Python, you’re in luck! We have another article on how to do EDA with Python and Jupyter Notebooks if that is more your speed.
Thank you for taking the time to review this article. Reach out to 5.15 if you'd like a free EDA of your environment.