SQL Project: Cleaning Nashville Housing Data
I completed this project simply to build my data-cleaning skills using SQL. You can find the full code for the project on GitHub.
Methods and Tools
Data source: I used Nashville housing data. The data file can be found in my GitHub repository.
SQL tools: For this project, I used PostgreSQL, which you can download from here. I also used pgAdmin, which is a web-based graphical user interface that makes it easier to work with SQL databases in PostgreSQL. You can download that from here.
Creating the database
My first step was to create the table and import the housing data.
CREATE TABLE PUBLIC.HOUSING( UniqueID NUMERIC, ParcelID VARCHAR(50), LandUse VARCHAR(50), PropertyAddress VARCHAR(50), SaleDate DATE, SalePrice VARCHAR(50), LegalReference VARCHAR(50), SoldAsVacant VARCHAR(50), OwnerName VARCHAR(100), OwnerAddress VARCHAR(50), Acreage NUMERIC, TaxDistrict VARCHAR(50), LandValue NUMERIC, BuildingValue NUMERIC, TotalValue NUMERIC, YearBuilt NUMERIC, Bedrooms INT, FullBath INT, HalfBath INT) -- Import the data COPY PUBLIC.HOUSING FROM '/Users/ramsay/Documents/Coding/3. SQL Portfolio Project/Part 3 - SQL Data Cleaning/Nashville Housing Data for Data Cleaning.csv' WITH CSV HEADER; -- Check it SELECT * FROM PUBLIC.HOUSING
Cleaning Task 1: Populating Address Data
Next, I noticed that some of the rows were missing address data.
I also noticed that each address has a unique parcel ID number (ParcelID) and that each parcel ID has the same address. In other words, these pieces of data are connected. That means I can use the parcel ID as a reference to populate the property address when it’s missing.
I used the following query to find parcels of land where there are different rows, but the parcel ID is the same, and where one property address is null.
SELECT a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, COALESCE(a.PropertyAddress, b.PropertyAddress) FROM PUBLIC.HOUSING a JOIN PUBLIC.HOUSING b on a.ParcelID = b.ParcelID AND a.UniqueID <> b.UniqueID WHERE a.PropertyAddress is null
Then I populated the property address column. I looked for properties that had the same parcel ID but where one was missing an address. Then I updated that missing address from one property with the address from the other.
UPDATE PUBLIC.HOUSING a SET PropertyAddress = COALESCE(a.PropertyAddress, b.PropertyAddress) FROM PUBLIC.HOUSING b WHERE a.ParcelID = b.ParcelID AND a.UniqueID <> b.UniqueID AND a.PropertyAddress is NULL
Cleaning task 2: Breaking up addresses
I noticed that the format of the address variables—both the property address and owner address variables—were in forms that weren’t super usable. I split them up into their parts: the street address, the city, and the state.
I did this two ways.
For the property address variable, I used the SUBSTRING and POSITION functions. I made two new columns, one for the street address and one for the city.
Then I used SUBSTRING and POSITION to pull out the street address from the address data and put it in the first column. I used the same strategy to pull out the city name and put it in the second column. In the end, I had two columns: one with the street address and one with the city.
-- Now I'm making two columns... -- One for the street address ALTER TABLE PUBLIC.HOUSING ADD PropertySplitAddress VARCHAR(255); UPDATE PUBLIC.HOUSING SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, POSITION(',' in PropertyAddress) -1) -- One for the city ALTER TABLE PUBLIC.HOUSING ADD PropertySplitCity VARCHAR(255); UPDATE PUBLIC.HOUSING SET PropertySplitCity = SUBSTRING(PropertyAddress, POSITION(',' in PropertyAddress) +1, LENGTH(PropertyAddress))
I also wanted to break up the owner address variable into parts: street address, city, and state. I did this using the SPLIT_PART function, using commas as delineators.
SELECT split_part(owneraddress, ',', 1), split_part(owneraddress, ',', 2), split_part(owneraddress, ',', 3) FROM PUBLIC.HOUSING
Then I created new columns for each part of the address to go into.
ALTER TABLE PUBLIC.HOUSING ADD OwnerSplitaddress VARCHAR(255); UPDATE PUBLIC.HOUSING SET OwnerSplitaddress = split_part(owneraddress, ',', 1) ALTER TABLE PUBLIC.HOUSING ADD OwnerSplitCity VARCHAR(255); UPDATE PUBLIC.HOUSING SET OwnerSplitCity = split_part(owneraddress, ',', 2) ALTER TABLE PUBLIC.HOUSING ADD OwnerSplitState VARCHAR(255); UPDATE PUBLIC.HOUSING SET OwnerSplitState = split_part(owneraddress, ',', 3)
The result was that I had new three columns, one with the street address, one with the city name, and one with the state.
Cleaning task 3: Standardizing “Yes” and “No”
The "sold as vacant” column was populated primarily with “Yes” and “No” responses. But I noticed in some places there was simply “Y” or “N”. I wanted to standardize these.
First I looked to see which were more common. “Yes” and “No” were significantly more common than “Y” and “N”.
SELECT DISTINCT(soldasvacant), COUNT(soldasvacant) FROM PUBLIC.HOUSING GROUP BY soldasvacant ORDER BY 2
So, I changed all the “Y” responses to "Yes" and all the "N" responses to "No".
SELECT soldasvacant , CASE when soldasvacant = 'Y' then 'Yes' When soldasvacant = 'N' then 'No' ELSE soldasvacant END FROM PUBLIC.HOUSING -- Now use that to update the table column UPDATE PUBLIC.HOUSING SET soldasvacant = CASE when soldasvacant = 'Y' then 'Yes' When soldasvacant = 'N' then 'No' ELSE soldasvacant END
Cleaning task 4: Remove duplicates
Normally, it’s not good practice to remove duplicate entries from raw data. Instead, we would create a temporary table, remove the duplicates from that, and then do the analyses with that temp table. But I removed duplicates from this data set just for the sake of demonstrating how to do it.
First, I created a CTE. In the CTE, I created a column called "Row_Num" that increased by 1 every time a row had the same values for each of the following columns: parcel ID, property address, sale price, sale date, and legal reference. That meant that when a row had a “Row_Num” value greater than 1, I knew it was a duplicate.
WITH RowNumCTE AS( SELECT *, ROW_NUMBER() OVER ( PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference ORDER BY UNIQUEID ) row_num FROM PUBLIC.HOUSING )
Next, I looked for all the rows that were duplicates.
SELECT * FROM RowNumCTE WHERE row_num > 1 ORDER BY PropertyAddress
Then I deleted them.
DELETE FROM PUBLIC.HOUSING WHERE UniqueID IN ( SELECT UniqueID FROM ( SELECT UniqueID, ROW_NUMBER() OVER (PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference ) row_num FROM PUBLIC.HOUSING ) s WHERE row_num > 1 )
Cleaning task 5: Remove unused columns
Again, normally we wouldn’t remove columns from raw data. Instead, we would create a temporary table and remove them from that. But for the sake of demonstration, I decided to remove any columns that I wasn't planning to use for analysis. That included the columns for owner address, tax district, property address, and sale date.
ALTER TABLE PUBLIC.HOUSING DROP COLUMN OwnerAddress, DROP COLUMN taxdistrict, DROP COLUMN propertyaddress, DROP COLUMN SaleDate;
Data Cleaning in SQL
In the words of Dean Abbot,
“No data is clean, but most is useful.”
Cleaning data is one of the most important steps in a data analysis project. If you want to learn more about data cleaning, here are some useful resources:
Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work by Q. Ethan McCallum is a great introduction to data cleaning for data analysis.
Data Cleaning by Ihab F. Ilyas, Xu Chu gives helpful instructions on how to use more advanced data cleaning techniques.
SQL for Data Science: Data Cleaning, Wrangling and Analytics with Relational Databases by Antonio Badia is a thorough guide for using SQL for data analysis and data science projects, and includes a comprehensive overview of data cleaning.
Need help preparing your data for analysis? Get in touch and let’s see if I can help.