CAP 5771 Spring 25


This is the web page for Introduction to Data Science at the University of Florida.

Data Wrangling Demo: Housing Data



1. Data cleaning

1.1 Accessing the data

In this first part of the demo, we will go over Data Cleaning. For this purpose we will import a dataset from Kaggle. We will use kagglehub to import the data into our Data wrangling demo folder.

import kagglehub
kagglehub.dataset_download("anthonypino/melbourne-housing-market")

If we only use this command, the dataset will be cached somewhere on our machine. But where? It would be a good idea to print the return value of this function and see what it says. **Spoiler alert:** it will tell you where the dataset is located.
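For example:

path = kagglehub.dataset_download("anthonypino/melbourne-housing-market")
print(path)  # shows where kagglehub cached the dataset files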

/home/codio/.cache/kagglehub/datasets/anthonypino/melbourne-housing-market/versions/27

Now, if you explore your folders using the ls and cd commands, you will notice that there is no visible .cache folder. This is because it is a hidden folder. Folders starting with a dot (.) are hidden in Unix-like systems to reduce clutter, prevent accidental modifications, and maintain consistency, as these often contain sensitive or system-critical configuration files.

Some examples:

- .env - Commonly used to store environment variables.
- .cache - Used by various applications to store temporary data that can speed up load times but doesn’t need to be backed up.
- .ssh - Holds Secure Shell (SSH) keys and configuration files, which are used for secure access to remote computers.

You can see hidden folders by using the command ls -la in the terminal instead of ls. After navigating to your .cache folder, use the mv command to move the folder called kagglehub from there to your workspace. That is:

mv kagglehub ~/workspace/

Now, you should be able to see a kagglehub folder in your workspace. Inside, there are two datasets: MELBOURNE_HOUSE_PRICES_LESS.csv and Melbourne_housing_FULL.csv.

We will work with the Melbourne_housing_FULL.csv. Go ahead and read the data using pandas.

import kagglehub
import pandas as pd
#print(kagglehub.dataset_download("anthonypino/melbourne-housing-market"))
file_path = "Data/Melbourne_housing_FULL.csv"
df = pd.read_csv(file_path)

1.2 Understanding what needs to be cleaned

Now, in descriptive_utils.py you will have three functions that will help us understand what we need to do in terms of data cleaning. Let’s take a look at the functions one by one.

Now that we understand the functions, let’s call them one by one and see what is needed from us to clean this dataset.

1.2.1 describe_data(df)

When we call describe_data(df), the objective is to understand the general features of our data, including the types of variables and their ranges, so we can better interpret them later.

Also, notice that the only argument of this function is a df. This means that we will be able to use this function with virtually any dataframe that we have at hand.

Now, I really encourage you to create your own small library of reusable functions to prevent redundant work.

Step-by-Step Breakdown

  1. Prints a header to indicate that the data description is starting (personal choice).
  2. Extracts the number of rows and columns using .shape.
  3. Loops through each column in the DataFrame. Retrieves the column’s data and datatype. Prints the column name and its type.
    1. Checks if the column contains numeric data (int or float). Computes and prints: Minimum value (min). Maximum value (max). Mean (average) value, and Median (middle value).
    2. Checks if the column is categorical (category or object type). Counts and prints the number of unique categories. If there are 10 or fewer unique categories, it prints category names and their respective counts.
    3. Checks if the column contains datetime values. Retrieves and prints: Earliest date (min). Latest date (max). Number of unique dates.
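The real implementation is the one provided in descriptive_utils.py; as a rough illustration of the breakdown above, a minimal version could look like this (the exact printing details are my own choices):

import pandas as pd
def describe_data(df):
    # 1. Header
    print("===== Data description =====")
    # 2. Dimensions
    rows, cols = df.shape
    print(f"{rows} rows x {cols} columns")
    # 3. Per-column summary
    for column in df.columns:
        data = df[column]
        print(f"\n{column} ({data.dtype})")
        if pd.api.types.is_numeric_dtype(data):
            print(f"  min={data.min()}, max={data.max()}, mean={data.mean():.2f}, median={data.median():.2f}")
        elif isinstance(data.dtype, pd.CategoricalDtype) or data.dtype == object:
            n_unique = data.nunique()
            print(f"  {n_unique} unique categories")
            if n_unique <= 10:
                print(data.value_counts())
        elif pd.api.types.is_datetime64_any_dtype(data):
            print(f"  earliest={data.min()}, latest={data.max()}, unique dates={data.nunique()}")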

While looking at the summary you might have noticed that we have 21 variables. This is a relatively large dataset. While some of the variable names are self-explanatory, we could use some help understanding others:

The variable Type refers to the property type.

h → House
u → Unit (Smaller residential properties, such as apartments)
t → Townhouse (A distinct category separate from other units)

The Method variable represents the method of sale for the property. The common values are:

S → Property sold (Standard sale)
SP → Property sold prior (Sold before the scheduled auction)
PI → Property passed in (Did not meet the reserve price at auction)
VB → Vendor bid (A bid made by the vendor to influence the auction)
SA → Sold after (Sold after the auction)

The Propertycount variable represents the total number of properties within the suburb where the listed property is located.

In addition, hopefully by looking at the summary you will detect a couple of things:

  1. Date, as you can see when you call the function and execute the code, is considered an object type. Therefore, we will need to convert it to datetime in the format %d/%m/%y.
df["Date"] = pd.to_datetime(df["Date"], format = "%d/%m/%y")

If there are inconsistencies in the date format, we can also use the format = "mixed" option, in which case the format is inferred for each element individually.

df["Date"] = pd.to_datetime(df["Date"], format = "mixed")

  2. We will need to change some variables that have been interpreted as float64 or object into category. While changing from object to category is not critical to the analysis, it matters in terms of efficiency. Remember that little things add up.

These variables are:

df = df.astype({
    'Type': 'category',
    'Method': 'category',
    'Postcode': 'category',
    'YearBuilt': 'category',
    'Regionname': 'category'
})

Now, let's execute the code once more and check whether the summary looks correct now that we have made all the needed changes. You might be tempted to change the float64 columns to int64. While this sounds like a great idea, in pandas float columns can handle missing values while int columns cannot. Therefore, we will keep these variables as they are.

Since you will come back to this summary multiple times, you probably want to have it at hand. Perhaps it would be a good idea to output the summary into a PDF? There are a lot of libraries that you can use to accomplish this. I personally like ReportLab. This can be a nice challenge. Remember that the reward is that you get to keep those utils functions, so you can achieve these tasks in a few minutes in the future. Your future self will thank you for it.
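As a sketch of the idea, assuming you collect the summary lines into a list of strings yourself, ReportLab's canvas can write them out (summary_to_pdf and its arguments are made up for this example):

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
def summary_to_pdf(lines, filename="summary.pdf"):
    # Write each summary line onto a simple PDF, starting a new page when the current one fills up
    c = canvas.Canvas(filename, pagesize=letter)
    _, height = letter
    y = height - 40
    for line in lines:
        c.drawString(40, y, line)
        y -= 14
        if y < 40:
            c.showPage()
            y = height - 40
    c.save()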

1.2.2 count_nulls(df)

We can move now to the count_nulls function. This function analyzes missing values in a DataFrame, providing insights into the distribution of nulls at both the column and row levels. It also visualizes the results and saves the plot for further analysis.

Step-by-Step Breakdown

  1. Prints a message indicating the function is analyzing null values in the dataset.
  2. Count Null Values Per Column
  3. Count Null Values Per Row
  4. Identify Rows with the Most Nulls. Retrieves the row indices where this maximum number of nulls occurs.
  5. Calculate the Percentage of Rows with Nulls
  6. Generate Visualizations
    1. Histogram of Nulls Per Row
    2. Annotate Histogram Bars
    3. Box Plot of Nulls Per Row. Generates a box plot to visualize the spread of missing values per row.
  7. Save the Visualization in the Images folder with a name that will change based on the time.
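The function provided in descriptive_utils.py is the reference; a stripped-down sketch of the same idea (the annotation of the histogram bars is omitted here) could be:

import os
from datetime import datetime
import matplotlib.pyplot as plt
def count_nulls(df):
    print("Analyzing null values...")
    # Nulls per column and per row
    nulls_per_column = df.isnull().sum()
    nulls_per_row = df.isnull().sum(axis=1)
    print(nulls_per_column)
    # Row(s) with the most nulls
    max_nulls = int(nulls_per_row.max())
    worst_rows = nulls_per_row[nulls_per_row == max_nulls].index.tolist()
    print(f"Max nulls in a row: {max_nulls} (rows {worst_rows})")
    # Percentage of rows containing at least one null
    print(f"{(nulls_per_row > 0).mean() * 100:.1f}% of rows contain at least one null")
    # Histogram and box plot of nulls per row, saved with a timestamped name
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.hist(nulls_per_row, bins=range(0, max_nulls + 2))
    ax1.set_title("Nulls per row")
    ax2.boxplot(nulls_per_row)
    ax2.set_title("Spread of nulls per row")
    os.makedirs("Images", exist_ok=True)
    fig.savefig(f"Images/nulls_{datetime.now():%Y%m%d_%H%M%S}.png")
    plt.close(fig)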

From executing this function we notice that there are only three observations that contain more than ten missing values. Most rows in our data have between zero and two.

1.2.3 describe_numeric(df)

This function provides a detailed analysis of numeric variables in a DataFrame. It calculates key statistics, visualizes distributions, and saves the plots for further analysis.

Step-by-Step Breakdown

  1. Prints a message indicating the function is analyzing numeric variables in the dataset.
  2. Select Numeric Columns
  3. Compute Descriptive Statistics. Uses .describe() to compute summary statistics (mean, min, max, quartiles, etc.).
  4. Iterates through each numeric column to generate visualizations. Drops missing values before plotting.
  5. Handle Missing Data. Checks if all values in a column are NaN. If a column has no valid data, prints a message and skips visualization.
  6. Generate Visualizations
    1. Histogram of Numeric Column
    2. Box Plot of Numeric Column
  7. Save the Visualization in the Images/Numeric folder with a name that will change based on the time.
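Again, descriptive_utils.py contains the actual function; a minimal sketch following the steps above might be (it assumes the Images/Numeric directory already exists, which is addressed next):

from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
def describe_numeric(df):
    print("Analyzing numeric variables...")
    numeric_df = df.select_dtypes(include="number")
    # Summary statistics for all numeric columns
    print(numeric_df.describe())
    for column in numeric_df.columns:
        data = numeric_df[column].dropna()
        if data.empty:  # skip columns with no valid data
            print(f"Skipping {column}: all values are NaN")
            continue
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
        sns.histplot(data, ax=ax1)       # distribution
        sns.boxplot(x=data, ax=ax2)      # spread and potential outliers
        fig.suptitle(column)
        # Assumes Images/Numeric exists (see the os.makedirs note below)
        fig.savefig(f"Images/Numeric/{column}_{datetime.now():%Y%m%d_%H%M%S}.png")
        plt.close(fig)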

When calling the function describe_numeric(df), we must ensure that the output directory exists and, if it does not, create it. We can do this by adding:

  os.makedirs("Images/Numeric", exist_ok=True)

Now, the plots do not look great for integer variables. If we want the bins to be calculated differently, we can set the bin width for integer variables to 1. We would need to modify the function to take in the list of integer variable names:

# In the function definition, take the list of integer columns
def describe_numeric2(df, ls_int):
    # ...
    # inside the per-column loop, pick the binning based on the column type
    if column in ls_int:
        print(f"Setting bin width = 1 for: {column}")
        min_val, max_val = int(data.min()), int(data.max())
        bins = range(min_val, max_val + 1, 1)
        sns.histplot(data, ax=ax1, color='blue', alpha=0.7, kde=False, bins=bins, element='bars', stat='count')
        bin_width = 1
    else:
        sns.histplot(data, ax=ax1, color='blue', alpha=0.7, kde=False, binwidth=None, element='bars', stat='count')
        bin_width = ax1.patches[0].get_width() if ax1.patches else 0

# In the main function, pass the integer-valued columns
describe_numeric2(data, ["Rooms", "Car", "Bedroom2", "Bathroom"])

(see describe_numeric2(df, ls_int))

1.3 Additional code cleaning step

In the file cleaning_utils.py there is a function called correcting_datatypes(df, date_cols=None, categorical_cols=None, float_cols=None). This function achieves the datatype conversion in one line, so it does not clutter our main function, and generalizes the functionality for future datasets. You can replace the lines pertaining to data type conversion in main as follows:

From:

df["Date"] = pd.to_datetime(df["Date"], format = "mixed")
df = df.astype({
    'Type': 'category',
    'Method': 'category',
    'postcode': 'category',
    'YearBuilt': 'category',
    'Regionname': 'category'
})

To:

correcting_datatypes(data, ["Date"], ["Type", "Method", "Postcode", "YearBuilt", "Regionname"])

Building this type of function can be very useful, especially for larger datasets with multiple attributes where you want to be in control. However, you can also use default functionality from pandas such as df = df.convert_dtypes(). If you experiment with both you will notice that convert_dtypes() does a pretty good job; however, some columns that we marked as category are marked as string by convert_dtypes().

In general, we choose category for memory efficiency and performance when the data has few unique values. We use string for free-form text or when the number of unique values is high. So, in this case our custom function allows us to get a performance gain.

If you do not have many performance constraints, convert_dtypes() will work well; just make sure to double-check the results at the end. For instance, Date was not converted to datetime by convert_dtypes(), probably because of the mixed date format.
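For reference, the general shape of such a function (the actual correcting_datatypes in cleaning_utils.py is the authoritative version) could be:

import pandas as pd
def correcting_datatypes(df, date_cols=None, categorical_cols=None, float_cols=None):
    # Convert each group of columns to the requested dtype
    for col in date_cols or []:
        df[col] = pd.to_datetime(df[col], format="mixed")
    for col in categorical_cols or []:
        df[col] = df[col].astype("category")
    for col in float_cols or []:
        df[col] = df[col].astype("float64")
    return df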

1.4 Notes from the analysis

From this first analysis we notice that there are multiple values that are missing, and some variables seem to have outliers.

What should we look at first?

Generally, it’s best to handle outliers before imputing missing data to ensure that summary statistics like the mean or median are not skewed by extreme values. This method prevents outliers from overly influencing your dataset during imputation.

However, always assess each dataset on its own merits, as different data may require different approaches. Testing both sequences might be worth it in some cases.

Before we dive in, are there any other aspects we should consider before addressing outliers and imputation?


2. Fixing the data

What is our objective?

Now, even though we can tackle outliers or missing values, we need to have an objective in mind before pursuing any other task. For instance: Do I want to clean every variable in my dataset? Probably not… if a variable is not aligned with my objective, I can probably skip it.

With this in mind, let's assume that our end goal is to build an algorithm that can forecast the price of a house in Melbourne. Even though we might want to rush and find outliers everywhere, we might also want to start by reducing the data dimensionality and understanding which variables we really want to focus on.

2.1 Reducing the data by removing unnecessary information

Now, while some of these decisions we can make by looking at the variables and analyzing whether a theoretical relationship exists, others we can analyze using a correlation analysis.

From the current set we know that we can remove Method, SellerG, Lattitude and Longtitude from our data. These variables do not seem to be valuable to our analysis, and in larger datasets they might slow it down.

To remove these variables we will do:

data = data.drop(columns=['Method', 'SellerG', 'Lattitude', 'Longtitude'])

We can also check if there are duplicates in our data, both at the attribute and observation level.

At the observation level, it is enough to use the following dataframe method; however, we can also call the remove_duplicates(df) function in cleaning_utils if we want to check which rows the duplicates are.

data = data.drop_duplicates()

or

remove_duplicates(data)

At the attribute level, we can check if there are duplicate attributes or overlapping information by reporting the Pearson correlation and chi-square values.

For this we can use the function plot_correlation_and_chi2(data):

This function will use a correlation heatmap and a chi-square heat map to detect if there are numerical attributes that are highly correlated or categorical attributes that might not have significant differences. This function:

  1. Computes and visualizes correlations for numerical columns (Pearson correlation).
  2. Computes and visualizes Chi-Square test p-values for categorical columns.
  3. Saves the results as images in predefined directories (Images/Numeric/ and Images/Nominal/).
plot_correlation_and_chi2(data)
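As an outline of what such a function does (a compact sketch; the course version produces the saved figures discussed below):

import os
from itertools import combinations
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import chi2_contingency
def plot_correlation_and_chi2(df):
    os.makedirs("Images/Numeric", exist_ok=True)
    os.makedirs("Images/Nominal", exist_ok=True)
    # 1. Pearson correlation between numeric attributes
    corr = df.select_dtypes(include="number").corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.savefig("Images/Numeric/correlation_heatmap.png", bbox_inches="tight")
    plt.close()
    # 2. Chi-square p-values between categorical attributes
    cat_cols = df.select_dtypes(include="category").columns
    pvals = pd.DataFrame(1.0, index=cat_cols, columns=cat_cols)
    for a, b in combinations(cat_cols, 2):
        table = pd.crosstab(df[a], df[b])
        table = table.loc[table.sum(axis=1) > 0, table.sum(axis=0) > 0]  # drop empty categories
        _, p, _, _ = chi2_contingency(table)
        pvals.loc[a, b] = pvals.loc[b, a] = p
    plt.figure(figsize=(8, 6))
    sns.heatmap(pvals, annot=True, fmt=".3f", cmap="viridis")
    plt.savefig("Images/Nominal/chi2_pvalues_heatmap.png", bbox_inches="tight")
    plt.close()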

From the results you can see that there are two numerical attributes that are highly correlated (~0.95): Rooms and Bedroom2. Given that our objective is prediction, highly correlated features might hinder our future analysis (multicollinearity). In addition, it makes sense that these two features are highly correlated, because Rooms is the number of rooms in a house while Bedroom2 is the number of bedrooms. We can keep the feature that makes the most sense to us; in this case we will keep Rooms (and drop Bedroom2), using the following rationale: [1] we do not have missing values in the Rooms attribute, which is an asset, and [2] Rooms is the overall number of rooms in a house, which would most likely include the number of bedrooms.

For the chi-square it does not seem like there are duplicated attributes, since we do not see p-values higher than 0.05. However, with the chi-square test we have to be careful, because if we have attributes with a large number of categories the results might not be reliable.

2.2 Outlier detection

There are multiple methods that we can use to detect outliers. Therefore, we will choose a democratic approach. This means that we will use multiple methods (three in our case) and flag outliers when they are flagged by at least two of the three methods. For this purpose, we will use detect_outliers_democratic(df), a function that identifies outliers in a dataset by applying multiple detection methods (IQR, Z-score, and MAD) and selecting values flagged by at least two techniques.

Step-by-Step Breakdown

  1. Creates a copy of the DataFrame
  2. Identifies numerical columns
  3. Initializes an Outlier Summary Dictionary which will store detected outliers for each numerical column.
  4. Iterates over each numerical column
  5. Applies IQR (Interquartile Range). This is: determines the interquartile range (IQR), defines lower and upper bounds for outlier detection, flags values outside these bounds as potential outliers.
  6. Applies Z-score Method. This is: Computes Z-scores for all values, identifies values where the absolute Z-score exceeds 3 as outliers.
  7. Applies MAD (Median Absolute Deviation) Method. This is: calculates the median absolute deviation (MAD), defines a threshold of 3 times MAD for detecting outliers, identifies outliers as values that deviate beyond this threshold.
  8. Prints the results
  9. Returns the outlier summary Dictionary
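A condensed sketch of this democratic scheme (the provided detect_outliers_democratic is the reference implementation):

import pandas as pd
def detect_outliers_democratic(df):
    # Flag a value as an outlier when at least 2 of the 3 methods agree
    outlier_summary = {}
    for column in df.select_dtypes(include="number").columns:
        data = df[column].dropna()
        # IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
        q1, q3 = data.quantile([0.25, 0.75])
        iqr = q3 - q1
        iqr_flag = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
        # Z-score rule: |z| > 3
        z_flag = ((data - data.mean()) / data.std()).abs() > 3
        # MAD rule: deviation from the median beyond 3 * MAD
        mad = (data - data.median()).abs().median()
        mad_flag = (data - data.median()).abs() > 3 * mad if mad > 0 else pd.Series(False, index=data.index)
        # Democratic vote
        votes = iqr_flag.astype(int) + z_flag.astype(int) + mad_flag.astype(int)
        outlier_summary[column] = data.index[votes >= 2]
        print(f"{column}: {len(outlier_summary[column])} outliers flagged")
    return outlier_summary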

Now that we have detected the outliers, what do we do with them? For now, we will create a copy of the data that excludes outliers, so that we can impute the missing data. However, we will need to come back to outliers again. This is because outliers can happen due to errors in the data (in which case we can remove them), or they can be actual large values that we might want to make sure we keep in our data.

2.3 Imputation after outlier detection

We are going to start by imputing the values for Bathroom. We know that this variable has a ~0.61 correlation with Rooms, and we also know from experience that the Type of house can explain the number of bathrooms. Rooms and Type do not have missing data. Finally, this is a discrete variable, and per our observations from the numerical data analysis it has a long right tail (most likely outliers).

All of these characteristics inform our decision on the method for imputation. In this case we will impute these values using the median bathroom count per building type and room count. The implementation is in median_group_impute:

  1. Temporarily remove the outliers detected before from the variable
  2. Perform imputation using median by group.
  3. Return the dataset
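The actual median_group_impute lives in the course utilities; a minimal sketch of the idea, applied to Bathroom and grouped by Type and Rooms, might look like this (the outliers dictionary is assumed to be the one returned by detect_outliers_democratic):

import numpy as np
def median_group_impute(df, target, group_cols, outlier_idx=None):
    # 1. Temporarily mask the previously detected outliers in the target column
    work = df.copy()
    if outlier_idx is not None:
        work.loc[work.index.isin(outlier_idx), target] = np.nan
    # 2. Fill missing values with the median of the target within each group
    group_medians = work.groupby(group_cols, observed=True)[target].transform("median")
    df[target] = df[target].fillna(group_medians)
    # 3. Return the dataset
    return df

# 'outliers' is assumed to be the dict returned by detect_outliers_democratic(data)
data = median_group_impute(data, "Bathroom", ["Type", "Rooms"], outliers.get("Bathroom"))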

It is important to check that the missing values do not correspond to an actual value. For example, Bathroom is a count, so checking that the value 0 exists in the data is important.

The second variable that we want to impute is Car. For this we will check if there are any patterns and, if so, use them to perform the imputation.

There is the concept of “missing at random”, which means that there is no pattern to the missingness. When our observations seem to be missing at random, we can perform an imputation procedure that uses a single value rather than a value per group. However, exploring our data with check_missing_patterns we can see that the missing values in Car seem to be related to Type and Rooms. It looks like missing values are more frequent in non-house types and in residences with fewer rooms. Therefore, we can again use the same strategy to impute the Car variable, with the median_group_impute function.
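A quick way to look for this kind of pattern (a rough stand-in for what check_missing_patterns reports):

# Share of missing Car values by property Type and by number of Rooms
car_missing = data["Car"].isna()
print(car_missing.groupby(data["Type"], observed=True).mean())
print(car_missing.groupby(data["Rooms"]).mean())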

We cannot impute data for our prediction target variable, Price, because doing so would introduce bias and distort the true patterns in the data. In predictive modeling, the target variable is what we are trying to predict, meaning that any imputation method — such as using the mean, median, or regression-based imputation — would inject artificial information into the dataset. This could mislead the model, as it would learn from synthetically generated values rather than real-world relationships.

2.4 Note

We could also use scikit-learn to impute the data. While this module does not have imputation per median group, it does have some other functions that can be helpful to us. One of them is the KNNImputer.

With the following lines of code we would be able to impute a variable using KNN:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer

# n_neighbors and target_variable are placeholders for you to set
knn_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('imputer', KNNImputer(n_neighbors=n_neighbors))
])

df_imputed = data.copy()
# note: the imputed column comes back on the standardized scale produced by the scaler
df_imputed[[target_variable]] = knn_pipeline.fit_transform(df_imputed[[target_variable]])

Important files

