2021-07-20
Today we published a new web page which provides IRS data by zip code.
The IRS publishes many details of Form 1040 by zip code. For example, the data includes the number of 1040s filed for each zip code and the total adjusted gross income (“AGI”) earned in that zip code. This data allows the average AGI to be computed for each zip code.
Our new web page shows a graph comparing the average AGI for (1) the specific zip code, (2) the relevant state, and (3) the U.S. as a whole. The web page also displays all of the other IRS data provided for each specific zip code.
We also created a web page that lists the 10 zip codes with the highest AGI and the 10 zip codes with the lowest AGI.
How We Created the App
The following is a description of how we created the web page. This description may not be of interest to most tax professionals. Instead, it may be of interest to people just getting started in programming, especially in the Python programming language.
Pandas
The IRS provides the zip code data as a comma separated values (“CSV”) file for each year. The most recent year available for the data is 2018. We downloaded the 2018 CSV file to our computer, and used the “Pandas” module to analyze the data.
import pandas as pd
df = pd.read_csv('18zpallnoagi.csv')
The CSV file has 27,658 rows and 153 columns.
df.shape
Two of the columns were not necessary and so we dropped those columns.
df.drop(columns=['STATEFIPS', 'AGI_STUB'], inplace=True)
Of the remaining 151 columns, 130 had a nice repeating pattern where the column names started with "N" (for number of returns) and "A" (for amounts). I wanted to use the repeating pattern in a “for” loop. Although most of the non-repeating columns were shown first, two of the non-repeating columns were mixed in with the repeating columns. Therefore, I reordered the columns so that the 130 repeating columns were all together at the end.
old_cols = list(df.columns)
desired_order = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 37, 64, 18, 19, 20, 21, . . . ]
new_cols = [old_cols[i] for i in desired_order]
df = df.reindex(columns=new_cols)
Now the last 130 columns all repeated as starting with “N” and then “A”. I used a for loop to insert an extra “N” or “A” at the beginning of these columns, just to distinguish these columns from the 21 non-repeating columns.
repeating_cols = new_cols[21:]
updated_repeating_cols = []
for col in repeating_cols:
if col.startswith('N'):
updated_repeating_cols.append('N' + col)
else:
updated_repeating_cols.append('A' + col)
The repeating columns now looked like: . . . 'NN02650', 'AA02650', 'NN00200', 'AA00200' . . .
I then renamed the columns.
updated_new_cols = new_cols[:21] + updated_repeating_cols
df.columns = updated_new_cols
Cleaning the Data
The IRS documentation guide for the CSV file states: "For all the files, the money amounts are reported in thousands of dollars." For the original IRS data that I downloaded, it appeared that the column for "Child and other dependent credit amount" (column A07225) was in whole dollars, rather than thousands of dollars. I contacted the IRS to notify them of this error. They have now corrected this error, and there is no longer a need to correct for this error.
Zip codes for some states (including Connecticut) begin with a leading zero. For example, the zip code for Essex, CT is 06426. The CSV file has truncated the leading zero for these zip codes. I did not change the data in the CSV file. Instead, my Python code checks to see if the zip code entered by the user starts with a zero. If it does, then the Python code truncates the zero so that it looks up the proper value in the database. The code also converts the input value from a string to an integer.
if zip_code.startswith('0'):
zip_code = int(zip_code[1:])
else:
zip_code = int(zip_code)
Sqlite3 Database
At first I used Pandas on the web server to look up the data for each requested zip code. However, it seemed that Pandas was putting a strain on the web server. Therefore, I decided to put the zip code data into a database, and have the web page query the database, rather than execute a Pandas filter each time. First, I created the database and used Pandas to populate the table in the database.
create_table_string = '''CREATE TABLE zip_codes (STATE text, ZIPCODE integer, N1 integer, MARS1 integer, MARS2 integer . . . '''
from pathlib import Path
Path('zip_data.db').touch()
import sqlite3
conn = sqlite3.connect('zip_data.db')
c = conn.cursor()
c.execute(create_table_string)
df.to_sql('zip_codes', conn, if_exists='append', index = False)
conn.close()
Flask
I typically use PHP to create the web pages on my website. However, I could not run Python on the back end using PHP. At least for now, I did not want to entirely move away from PHP, so I used Flask just for the single zip codes web page.
Using Flask and Python on the web server was a significant learning curve for me. Although I had some experience with Python, I had no experience with Flask, virtual environments, or WSGI files.
I had heard about virtual environments. However, when I just ran Python scripts on my own computer, I had never really understood the need for a virtual environment. Trying to use Flask on the web server, I soon realized that a virtual environment was necessary on the web server.
WSGI files really confused me at first. Once I realized that a WSGI file is like any other Python file, things moved forward much faster.
I still have a lot to learn, but this project has significantly advanced my knowledge of Python, Pandas, Flask, virtual environments, and more.