• Ramsay Lewis

SQL Project: Cleaning Nashville Housing Data



I completed this project simply to build my data-cleaning skills using SQL. You can find the full code for the project on GitHub.


Methods and Tools

Data source: I used Nashville housing data. The data file can be found in my GitHub repository.


SQL tools: For this project, I used PostgreSQL, which you can download from here. I also used pgAdmin, which is a web-based graphical user interface that makes it easier to work with SQL databases in PostgreSQL. You can download that from here.


Creating the database

My first step was to create the table and import the housing data.

CREATE TABLE PUBLIC.HOUSING(
	UniqueID NUMERIC,
	ParcelID VARCHAR(50),
	LandUse VARCHAR(50),
	PropertyAddress VARCHAR(50),
	SaleDate DATE,
	SalePrice VARCHAR(50),
	LegalReference VARCHAR(50),
	SoldAsVacant VARCHAR(50),
	OwnerName VARCHAR(100),
	OwnerAddress VARCHAR(50),
	Acreage NUMERIC,
	TaxDistrict VARCHAR(50),
	LandValue NUMERIC,
	BuildingValue NUMERIC,
	TotalValue NUMERIC,
	YearBuilt NUMERIC,
	Bedrooms INT,
	FullBath INT,
	HalfBath INT)

-- Import the data
COPY PUBLIC.HOUSING FROM '/Users/ramsay/Documents/Coding/3. SQL Portfolio Project/Part 3 - SQL Data Cleaning/Nashville Housing Data for Data Cleaning.csv' WITH CSV HEADER;

-- Check it 
SELECT * FROM PUBLIC.HOUSING

Cleaning Task 1: Populating Address Data

Next, I noticed that some of the rows were missing address data.


I also noticed that each address has a unique parcel ID number (ParcelID) and that each parcel ID has the same address. In other words, these pieces of data are connected. That means I can use the parcel ID as a reference to populate the property address when it’s missing.


I used the following query to find parcels of land where there are different rows, but the parcel ID is the same, and where one property address is null.

SELECT a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, COALESCE(a.PropertyAddress, b.PropertyAddress)
FROM PUBLIC.HOUSING a
JOIN PUBLIC.HOUSING b
	on a.ParcelID = b.ParcelID
	AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress is null

Then I populated the property address column. I looked for properties that had the same parcel ID but where one was missing an address. Then I updated that missing address from one property with the address from the other.

UPDATE PUBLIC.HOUSING a
	SET PropertyAddress = COALESCE(a.PropertyAddress, 	    	 b.PropertyAddress)
	FROM PUBLIC.HOUSING b
	WHERE a.ParcelID = b.ParcelID 
	AND a.UniqueID <> b.UniqueID
	AND a.PropertyAddress is NULL

Cleaning task 2: Breaking up addresses

I noticed that the format of the address variables—both the property address and owner address variables—were in forms that weren’t super usable. I split them up into their parts: the street address, the city, and the state.


I did this two ways.


Using SUBSTRING

For the property address variable, I used the SUBSTRING and POSITION functions. I made two new columns, one for the street address and one for the city.


Then I used SUBSTRING and POSITION to pull out the street address from the address data and put it in the first column. I used the same strategy to pull out the city name and put it in the second column. In the end, I had two columns: one with the street address and one with the city.

-- Now I'm making two columns...

-- One for the street address
ALTER TABLE PUBLIC.HOUSING
 ADD PropertySplitAddress VARCHAR(255);
UPDATE PUBLIC.HOUSING 
 SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, POSITION(',' in PropertyAddress) -1)
 
-- One for the city
ALTER TABLE PUBLIC.HOUSING
 ADD PropertySplitCity VARCHAR(255);
UPDATE PUBLIC.HOUSING 
 SET PropertySplitCity = SUBSTRING(PropertyAddress, POSITION(',' in PropertyAddress) +1, LENGTH(PropertyAddress))

Using SPLIT_PART

I also wanted to break up the owner address variable into parts: street address, city, and state. I did this using the SPLIT_PART function, using commas as delineators.

SELECT 
 split_part(owneraddress, ',', 1),
 split_part(owneraddress, ',', 2),
 split_part(owneraddress, ',', 3)
FROM PUBLIC.HOUSING

Then I created new columns for each part of the address to go into.

ALTER TABLE PUBLIC.HOUSING
 ADD OwnerSplitaddress VARCHAR(255);
UPDATE PUBLIC.HOUSING 
 SET OwnerSplitaddress =  split_part(owneraddress, ',', 1)

ALTER TABLE PUBLIC.HOUSING
 ADD OwnerSplitCity VARCHAR(255);
UPDATE PUBLIC.HOUSING 
 SET OwnerSplitCity = split_part(owneraddress, ',', 2)
 
ALTER TABLE PUBLIC.HOUSING
 ADD OwnerSplitState VARCHAR(255);
UPDATE PUBLIC.HOUSING 
 SET OwnerSplitState = split_part(owneraddress, ',', 3)

The result was that I had new three columns, one with the street address, one with the city name, and one with the state.


Cleaning task 3: Standardizing “Yes” and “No”

The "sold as vacant” column was populated primarily with “Yes” and “No” responses. But I noticed in some places there was simply “Y” or “N”. I wanted to standardize these.


First I looked to see which were more common. “Yes” and “No” were significantly more common than “Y” and “N”.

SELECT DISTINCT(soldasvacant), COUNT(soldasvacant)
 FROM PUBLIC.HOUSING
 GROUP BY soldasvacant
 ORDER BY 2

So, I changed all the “Y” responses to "Yes" and all the "N" responses to "No".

SELECT soldasvacant
 , CASE when soldasvacant = 'Y' then 'Yes'
    	When soldasvacant = 'N' then 'No'
		ELSE soldasvacant
		END
 FROM PUBLIC.HOUSING
 
-- Now use that to update the table column
UPDATE PUBLIC.HOUSING 
 SET soldasvacant = CASE when soldasvacant = 'Y' then 'Yes'
    	When soldasvacant = 'N' then 'No'
		ELSE soldasvacant
		END

Cleaning task 4: Remove duplicates

Normally, it’s not good practice to remove duplicate entries from raw data. Instead, we would create a temporary table, remove the duplicates from that, and then do the analyses with that temp table. But I removed duplicates from this data set just for the sake of demonstrating how to do it.


First, I created a CTE. In the CTE, I created a column called "Row_Num" that increased by 1 every time a row had the same values for each of the following columns: parcel ID, property address, sale price, sale date, and legal reference. That meant that when a row had a “Row_Num” value greater than 1, I knew it was a duplicate.

WITH RowNumCTE AS(
SELECT *,
    ROW_NUMBER() OVER (
	PARTITION BY ParcelID,
				 PropertyAddress,
				 SalePrice,
				 SaleDate,
				 LegalReference
				 ORDER BY
					UNIQUEID
					) row_num
 FROM PUBLIC.HOUSING
 )

Next, I looked for all the rows that were duplicates.

SELECT *
 FROM RowNumCTE
 WHERE row_num > 1
 ORDER BY PropertyAddress

Then I deleted them.

DELETE FROM PUBLIC.HOUSING
WHERE UniqueID IN (
	SELECT
	  	UniqueID
	FROM (
		SELECT
			UniqueID,
			ROW_NUMBER() OVER (PARTITION BY ParcelID,
				 PropertyAddress,
				 SalePrice,
				 SaleDate,
				 LegalReference
					) row_num
		FROM PUBLIC.HOUSING
		) s
	WHERE row_num > 1	
)

Cleaning task 5: Remove unused columns

Again, normally we wouldn’t remove columns from raw data. Instead, we would create a temporary table and remove them from that. But for the sake of demonstration, I decided to remove any columns that I wasn't planning to use for analysis. That included the columns for owner address, tax district, property address, and sale date.

ALTER TABLE PUBLIC.HOUSING
  DROP COLUMN OwnerAddress,
  DROP COLUMN taxdistrict,
  DROP COLUMN propertyaddress,
  DROP COLUMN SaleDate;

Data Cleaning in SQL

In the words of Dean Abbot,

“No data is clean, but most is useful.”

Cleaning data is one of the most important steps in a data analysis project. If you want to learn more about data cleaning, here are some useful resources:


 

Need help preparing your data for analysis? Get in touch and let’s see if I can help.



2 views0 comments

Recent Posts

See All