Data Cleaning

Overview


“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.”
— Hadley Wickham, Chief Scientist at RStudio

This article shows how we can link the structure of a dataset to its meaning and make sure the data shows what we want it to show.


Introduction

In my experience, a data analyst spends most of his or her time on data cleaning and data validation. For quite a few companies, data cleaning means a series of manual tasks that take up a lot of time and require a lot of focus on the analyst's part. I am sure most of my colleagues can confirm this, especially those who work in an organisation that is slow to change its ways, or who work with people so accustomed to their routines that they are reluctant to adopt new tools and methods.

Although, as the quote at the beginning of this article suggests, every messy dataset is messy in its own way, and common sense might suggest that the best way to clean it is to do so manually, I disagree. There are ways to automate the data cleaning process, and even partial automation is a huge win.

I do not agree with a manual approach to repetitive tasks for several reasons. Here are the ones that I consider the most important:

  • The quality of the work can vary between individuals; some people are more organised than others, which lowers the chances of overlooking something
  • People get bored doing the same thing over and over again
  • Tasks that take a lot of time and require little creativity, as data cleaning can, lower motivation, and analysts will stop trying to improve the datasets or find useful insights in the data

I think it is in everyone's interest for analysts to have free time to think of ways to use the data to improve the company or the environment around them, not just to make sure the data is there and correct, checking record by record. For me, that used to be the most frustrating part of the job, and it took a lot of energy to find the motivation to start the work.

In the section above, I discussed why data cleaning can be problematic when done manually. In the next section, I go into a bit more detail about what data cleaning is and how we can automate some of the processes a data analyst performs, so they can focus on analysis rather than data manipulation.

What are data cleaning and data wrangling?

Data cleaning

We will start with data cleaning, as it is the more general of the two terms and refers to the process through which data quality is ensured. This can include, but is not limited to:

  • ensuring the data format is correct (e.g. the dates have a standard format in the data source so the analysis software can detect it)
  • dealing with missing data
  • dealing with outliers

When dealing with these cases, analysts don't necessarily spend time analysing each observation individually; rather, they follow a predetermined set of rules, usually one general rule per variable. Some examples:

  • converting dates to a suitable format
  • replacing missing records with the mean or median, or removing them altogether
  • eliminating outliers from the analysis, or separating them into a different dataset so they can be studied in detail

No matter which of these rules we apply day to day, there is no reason they cannot be automated: we already have the logic, so we can program a computer to apply it for us.
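To make this concrete, here is a minimal sketch of how such rules might be encoded with the tidyverse (introduced properly later in this article). The sales data frame, its column names and the outlier threshold are made up for illustration:

library(tidyverse)

# Hypothetical raw data: dates stored as text, one missing amount, one extreme value
sales <- tibble(
  order_date = c("2023-01-05", "2023-01-12", "2023-02-10"),
  amount     = c(100, NA, 10000)
)

sales_clean <- sales %>%
  mutate(
    order_date = as.Date(order_date),                          # standardise the date format
    amount = replace_na(amount, median(amount, na.rm = TRUE))  # impute missing values with the median
  ) %>%
  filter(amount < 1000)                                        # drop outliers above a per-variable threshold

Once rules like these are written down, the same script can be rerun on every new batch of data.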

Data wrangling

Data wrangling is the process through which we manipulate data to transform it into a format more suitable for our purposes. Data comes in a plethora of formats; however, for tabular data, which is the focus of this article, the format I have found most useful and easiest to read and manipulate in most data software is the tidy format referenced in the quote at the beginning of this article.

We will see some interesting tools to use for data wrangling in the examples that follow.

Data cleaning in practice

In this section, we will go through a series of exercises that allow us to clean and manipulate the data in a data source. For this we will use the tidyverse package for R. If you do not have it installed, you can do so by running install.packages("tidyverse") in your console.

Now, for the actual analysis, we will start by loading it into the environment.

library(tidyverse)

To play around with a dataset, we will use the mtcars dataset that comes built into R. We can see the first rows of the dataset using the code below:

mtcars %>%      # Our Dataset
      head()    # First six results
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If you are new to R and not familiar with the %>% (pipe) symbol used earlier, a good way to think of it is that it chains the actions applied to a dataset. It reads as "take the dataset, then (%>%) do the next step". In this case: "take mtcars, then show the first six records".
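For example, the two calls below are equivalent; the pipe simply passes the left-hand side as the first argument of the next function:

head(mtcars)        # classic, nested style
mtcars %>% head()   # piped style: "take mtcars, then show the first six rows"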

Now that we have the dataset and the knowledge on how to read the code, let’s start with actual data cleaning and wrangling.

Mutate

As we can see, the row names in this dataset are the car names. However, this is not very useful if we want to use the names in our analysis or group by name. So we will use a function that is very useful in data cleaning: mutate().

mutate() allows us to either create a new column at the end of a dataset, which is quite useful, or change an existing column. We will use it to create a new column containing the names of the cars.

mtcars %>%
  mutate(car_names = row.names(mtcars)) %>%
  head()
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#>                           car_names
#> Mazda RX4                 Mazda RX4
#> Mazda RX4 Wag         Mazda RX4 Wag
#> Datsun 710               Datsun 710
#> Hornet 4 Drive       Hornet 4 Drive
#> Hornet Sportabout Hornet Sportabout
#> Valiant                     Valiant
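As a side note, mutate() can just as easily overwrite an existing column. Here is a small sketch that recodes the am indicator (documented in mtcars as 0 = automatic, 1 = manual) into text labels; the label names are my own choice:

mtcars %>%
  mutate(am = if_else(am == 1, "manual", "automatic")) %>%  # overwrite the existing am column
  head()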

So far, we have added a new column to the dataset. Now, what if we need to filter something out of the dataset?

Luckily the tidyverse is quite intuitive when it comes to naming functions and we can use the filter() function.

Filter

Let's filter the cars with four cylinders (cyl) out of the dataset.

mtcars %>%
  filter(cyl != 4)
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

As we can see, the Datsun 710, which had 4 cylinders, has disappeared from the results.

Of course, this function can also take multiple conditions on specific variables. Let's keep just the cars that have 4 cylinders and more than 90 horsepower (hp).

mtcars %>%
  filter(cyl == 4, hp > 90)
#>                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Datsun 710    22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Merc 230      22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Volvo 142E    21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Please note that this function will not work if you do not use a double equals sign (==) when comparing values. A single = is used for assignment and for naming arguments in R, which is why filter() cannot accept it as a comparison.
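For example, the commented-out line below would stop with an error (the exact message depends on your dplyr version), while the second line works as intended:

# mtcars %>% filter(cyl = 4)    # error: `=` is read as a named argument, not a comparison
mtcars %>% filter(cyl == 4)     # correct: compares cyl to 4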

If you want to handle missing data in a column, you can use the following (a short example on a dummy tibble follows the list):

  • filter(is.na(column_name)) keeps the rows that have a missing value in the specified column
  • filter(!is.na(column_name)) keeps the rows that DO NOT have a missing value in the specified column
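Since mtcars has no missing values, here is a small sketch on a made-up tibble:

df <- tibble(id = 1:4, score = c(10, NA, 7, NA))   # dummy data for illustration

df %>% filter(is.na(score))     # keeps only the rows where score is missing
df %>% filter(!is.na(score))    # keeps only the rows where score is present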

Selecting columns

In some situations, you may not need the full dataset you have at your disposal.

In that case, you can use a select() statement, which allows you to work with just the columns you need. Let's select just two columns, mpg and cyl.

mtcars %>%
  select(mpg, cyl) %>%
  head()
#>                    mpg cyl
#> Mazda RX4         21.0   6
#> Mazda RX4 Wag     21.0   6
#> Datsun 710        22.8   4
#> Hornet 4 Drive    21.4   6
#> Hornet Sportabout 18.7   8
#> Valiant           18.1   6

Now we can work with just the data we need, without exhausting our computing resources on unnecessary columns.

What if we need to use multiple datasets and want to have them all in one place?

For that we can use one of the join functions.

Join

Joining two or more datasets is quite a straightforward process; however, I would like to take a moment to explain it for those who have no experience with it and to point out some aspects to watch out for.

Joining two datasets requires a reference column with common values in both tables. The join looks at the reference column in the first set, searches for each row's value in the other set, and brings all the values associated with that row into the first set. Let's see a simple example below; we will create two dummy datasets to demonstrate this.

letters <- cbind(ID = c(1, 2, 3, 3, 1, 2),
                 Value1 = c("A", "B", "C", "D", "E", "F")) %>%
           as_tibble()

numbers <- cbind(ID = c(1, 2, 3, 4, 5, 6),
                 Value2 = c("one", "two", "three", "four", "five", "six")) %>%
           as_tibble()
letters
#> # A tibble: 6 x 2
#>   ID    Value1
#>   <chr> <chr> 
#> 1 1     A     
#> 2 2     B     
#> 3 3     C     
#> 4 3     D     
#> 5 1     E     
#> 6 2     F
numbers
#> # A tibble: 6 x 2
#>   ID    Value2
#>   <chr> <chr> 
#> 1 1     one   
#> 2 2     two   
#> 3 3     three 
#> 4 4     four  
#> 5 5     five  
#> 6 6     six

Now that we have the two datasets, notice that the ID values in the numbers dataset are unique, each associated with a different value. In the first set they are not, each appearing twice. Now let's try the join. We will use the left_join() function, as it is the most common.

letters %>%
  left_join(numbers, by = "ID")
#> # A tibble: 6 x 3
#>   ID    Value1 Value2
#>   <chr> <chr>  <chr> 
#> 1 1     A      one   
#> 2 2     B      two   
#> 3 3     C      three 
#> 4 3     D      three 
#> 5 1     E      one   
#> 6 2     F      two

As we can see, the values from numbers associated with each ID have been added to the letters table. Note that the numbers values for IDs above three do not appear, because those IDs are not present in the first dataset. Now let's see what happens if we try it the other way around. Remember that we want the table we join in to have unique ID values; here is why.

numbers %>%
  left_join(letters, by = "ID")
#> # A tibble: 9 x 3
#>   ID    Value2 Value1
#>   <chr> <chr>  <chr> 
#> 1 1     one    A     
#> 2 1     one    E     
#> 3 2     two    B     
#> 4 2     two    F     
#> 5 3     three  C     
#> 6 3     three  D     
#> 7 4     four   <NA>  
#> 8 5     five   <NA>  
#> 9 6     six    <NA>

Notice that the function created an additional row each time it encountered a matching ID, because the IDs in the letters table are duplicated.
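If this duplication, or the NA rows above, is not what you need, dplyr offers other join functions; a brief sketch using the same dummy tables:

numbers %>% inner_join(letters, by = "ID")   # keeps only IDs present in both tables
numbers %>% full_join(letters, by = "ID")    # keeps every ID from both tables
numbers %>% anti_join(letters, by = "ID")    # keeps only the rows of numbers with no match (IDs 4, 5, 6)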

Pivoting and Unpivoting Data

Another useful skill for a data analyst is being able to pivot and un-pivot data. There are a couple of ways to do this; the simplest is to use the two functions pivot_longer() and pivot_wider(). The first gathers multiple columns into one (it makes a table longer), and the second creates new columns from the values of an existing column (it makes a table wider). The functionality is similar to that of a pivot table in Excel.

Let’s take a look at it in the following examples:

Pivot wider

mtcars %>%
  pivot_wider(names_from = cyl, values_from = mpg)
#> # A tibble: 32 x 12
#>     disp    hp  drat    wt  qsec    vs    am  gear  carb   `6`   `4`   `8`
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  160    110  3.9   2.62  16.5     0     1     4     4  21    NA    NA  
#>  2  160    110  3.9   2.88  17.0     0     1     4     4  21    NA    NA  
#>  3  108     93  3.85  2.32  18.6     1     1     4     1  NA    22.8  NA  
#>  4  258    110  3.08  3.22  19.4     1     0     3     1  21.4  NA    NA  
#>  5  360    175  3.15  3.44  17.0     0     0     3     2  NA    NA    18.7
#>  6  225    105  2.76  3.46  20.2     1     0     3     1  18.1  NA    NA  
#>  7  360    245  3.21  3.57  15.8     0     0     3     4  NA    NA    14.3
#>  8  147.    62  3.69  3.19  20       1     0     4     2  NA    24.4  NA  
#>  9  141.    95  3.92  3.15  22.9     1     0     4     2  NA    22.8  NA  
#> 10  168.   123  3.92  3.44  18.3     1     0     4     4  19.2  NA    NA  
#> # ... with 22 more rows

The function took the values from the cyl column, created new columns with those values and filled them with the values from the mpg column.

Pivot Longer

Pivot longer does the exact opposite. Let’s see.

mtcars %>%
  pivot_wider(names_from = cyl, values_from = mpg) %>%
  pivot_longer(cols = c("4", "6", "8"), names_to = "cyl", values_to = "mpg")
#> # A tibble: 96 x 11
#>     disp    hp  drat    wt  qsec    vs    am  gear  carb cyl     mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#>  1   160   110  3.9   2.62  16.5     0     1     4     4 4      NA  
#>  2   160   110  3.9   2.62  16.5     0     1     4     4 6      21  
#>  3   160   110  3.9   2.62  16.5     0     1     4     4 8      NA  
#>  4   160   110  3.9   2.88  17.0     0     1     4     4 4      NA  
#>  5   160   110  3.9   2.88  17.0     0     1     4     4 6      21  
#>  6   160   110  3.9   2.88  17.0     0     1     4     4 8      NA  
#>  7   108    93  3.85  2.32  18.6     1     1     4     1 4      22.8
#>  8   108    93  3.85  2.32  18.6     1     1     4     1 6      NA  
#>  9   108    93  3.85  2.32  18.6     1     1     4     1 8      NA  
#> 10   258   110  3.08  3.22  19.4     1     0     3     1 4      NA  
#> # ... with 86 more rows

This created some extra rows, one for each of the columns we previously created. We do not need three rows for each car, considering that two of them are NAs. All we need to do in this situation is filter the data, eliminating the NA values.

mtcars %>%
  pivot_wider(names_from = cyl, values_from = mpg) %>%
  pivot_longer(cols = c("4", "6", "8"), names_to = "cyl", values_to = "mpg") %>%
  filter(!is.na(mpg)) # notice the ! before the function in order to use the negative of the function
#> # A tibble: 32 x 11
#>     disp    hp  drat    wt  qsec    vs    am  gear  carb cyl     mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#>  1  160    110  3.9   2.62  16.5     0     1     4     4 6      21  
#>  2  160    110  3.9   2.88  17.0     0     1     4     4 6      21  
#>  3  108     93  3.85  2.32  18.6     1     1     4     1 4      22.8
#>  4  258    110  3.08  3.22  19.4     1     0     3     1 6      21.4
#>  5  360    175  3.15  3.44  17.0     0     0     3     2 8      18.7
#>  6  225    105  2.76  3.46  20.2     1     0     3     1 6      18.1
#>  7  360    245  3.21  3.57  15.8     0     0     3     4 8      14.3
#>  8  147.    62  3.69  3.19  20       1     0     4     2 4      24.4
#>  9  141.    95  3.92  3.15  22.9     1     0     4     2 4      22.8
#> 10  168.   123  3.92  3.44  18.3     1     0     4     4 6      19.2
#> # ... with 22 more rows

And now we can use select() to rearrange the columns and compare the result with the original dataset.

mtcars %>%
  pivot_wider(names_from = cyl, values_from = mpg) %>%
  pivot_longer(cols = c("4", "6", "8"), names_to = "cyl", values_to = "mpg") %>%
  filter(!is.na(mpg)) %>%
  select(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb) %>%
  head()
#> # A tibble: 6 x 11
#>     mpg cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21   6       160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21   6       160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8 4       108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4 6       258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7 8       360   175  3.15  3.44  17.0     0     0     3     2
#> 6  18.1 6       225   105  2.76  3.46  20.2     1     0     3     1
mtcars %>%
  head()
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Conclusions

As we have seen in this article, R is a tool that offers a lot of flexibility when it comes to data cleaning and data wrangling. I therefore highly recommend that data analysts use this tool, or other automation tools, in their jobs, as it will improve data quality and reduce the time spent on frustrating or boring tasks.

I know it can be a bit of a headache at first, since the learning curve for R can be steep; however, I consider it worth the effort. When I first started using R, I was intimidated because I did not know where to start implementing it in my job. Then I realised that it is enough to automate part of the work, not everything at once, and the results started to show immediately. All I needed to do was think about the steps I had to take to check something or create a report, and recreate those steps in R.

This is a very good first step. You will see that reports or tasks that can take hours or days can be done in a few seconds or minutes if you automate your work and let a script run.

Now that you have a lot more free time, you can let me know how it went or whether you ran into any problems.