2022-08-16
I often use Cornell’s Legal Information Institute (“LII”) to access tax regulations. I noticed that the HTML on the LII website wraps regulation examples in <div> tags with a class of “example”. I thought it would be interesting to extract the text of all of the examples in the regulations and analyze them.
Below I describe how I located over 13,000 examples in the tax regulations.
My first step was to identify all of the LII web pages with tax regulations. The following web pages included links to all of the tax regulations that were important to me:
26 CFR Part 26 - GENERATION-SKIPPING TRANSFER TAX
26 CFR Part 31 - EMPLOYMENT TAXES AND COLLECTION OF INCOME TAX AT SOURCE
26 CFR Part 301 - PROCEDURE AND ADMINISTRATION
I created a Python script to:
I ended up with more than four thousand web pages to download. I created another Python script to download all of them, with a delay between downloads so as to be nice to the LII servers.
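A download loop with a pause between requests might look roughly like the sketch below. This is not my original script: the function names, the two-second delay, and the file-naming scheme are all assumptions made for illustration.

```python
import time
import urllib.request
from pathlib import Path


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, out_dir, delay_seconds=2.0, fetch_page=fetch):
    """Save each URL into out_dir, sleeping between real downloads."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Hypothetical naming scheme: last path segment of the URL.
        name = url.rstrip("/").rsplit("/", 1)[-1] + ".html"
        target = out_path / name
        if not target.exists():  # skip pages saved on an earlier run
            target.write_bytes(fetch_page(url))
            time.sleep(delay_seconds)  # be nice to the server
        saved.append(target)
    return saved
```

Skipping files that already exist means the script can be restarted after a failure without downloading the same pages again.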
For each file, I wanted to get:
For each example, I wanted to get:
After downloading all of the files, I identified 11,178 examples in <div> tags with a class of “example”. The titles of the examples were easy to pull out because the titles were in <div> tags with a class of “hed”. Similarly, the texts of the examples were in <div> tags with a class of “pspace”.
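The extraction can be sketched with nothing but Python's standard library. The class names “example”, “hed”, and “pspace” come from the LII markup described above. The nesting (a “hed” div and one or more “pspace” divs inside each “example” div) and the ExampleParser class itself are assumptions for illustration, not my actual script.

```python
from html.parser import HTMLParser


class ExampleParser(HTMLParser):
    """Collect the title and text of each <div class="example"> block."""

    def __init__(self):
        super().__init__()
        self.examples = []    # finished examples
        self._stack = []      # class list of every currently open <div>
        self._current = None  # example being built, if any

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append(classes)
        if "example" in classes and self._current is None:
            self._current = {"title": "", "text": "", "depth": len(self._stack)}

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        current = self._current
        if current is not None and len(self._stack) == current["depth"]:
            # Closing the example div itself: store the cleaned-up result.
            self.examples.append({
                "title": current["title"].strip(),
                "text": " ".join(current["text"].split()),
            })
            self._current = None
        self._stack.pop()

    def handle_data(self, data):
        if self._current is None:
            return
        # Look only at divs opened inside the current example, innermost first.
        inside = self._stack[self._current["depth"] - 1:]
        for classes in reversed(inside):
            if "hed" in classes:
                self._current["title"] += data
                return
            if "pspace" in classes:
                self._current["text"] += data + " "
                return
```

Feeding a downloaded page to the parser with parser.feed(html) leaves a list of dictionaries in parser.examples, one per example, each with a “title” and a “text” key.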
Citations to the Examples – “id” Attributes
The citations to the examples were a bit tricky. LII uses the “id” attribute to track the specific citation. The id attribute allows you to link to a specific portion of a web page. I use id attributes all the time to more quickly access code sections and regulations on LII. For example, I often want to get to section 7701(b). However, section 7701(a) is quite long. If I just go to section 7701, I have to scroll down pages and pages before I get to section 7701(b). But if I just add “#b” at the end of the URL, I am directed straight to subsection (b):
https://www.law.cornell.edu/uscode/text/26/7701#b
You can even go deeper than just the subsection. Adding “#b_3” at the end of the URL will take you directly to section 7701(b)(3):
https://www.law.cornell.edu/uscode/text/26/7701#b_3
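The pattern in these two URLs can be turned into a small helper. The sketch below assumes that deeper levels are always joined with underscores, as in “#b_3”, and that the section number contains only letters and digits. The function name lii_url is made up for illustration.

```python
import re

BASE = "https://www.law.cornell.edu/uscode/text/26/"


def lii_url(citation):
    """Turn a citation such as '7701(b)(3)' into a URL with an id fragment."""
    match = re.fullmatch(r"\s*(\w+)((?:\(\w+\))*)\s*", citation)
    if match is None:
        raise ValueError(f"unrecognized citation: {citation!r}")
    section, rest = match.groups()
    # Pull out each parenthesized level: '(b)(3)' -> ['b', '3'].
    parts = re.findall(r"\((\w+)\)", rest)
    url = BASE + section
    if parts:
        url += "#" + "_".join(parts)
    return url
```

For example, lii_url("7701(b)(3)") rebuilds the second URL shown above.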
In theory, this approach allows me to extract the specific citation of each regulation example using the id attribute. In practice, however, the id attributes in the LII regulations are not consistent. Therefore, I was only able to get a rough approximation of the citations.
Missing Examples (Examples not in <div> tags)
I realized that some of the examples in the regulations were not in <div> tags. LII changed its process in recent years. Examples that have come out in the past few years are now in <p> tags, usually with a class of “psection-2” or “psection-3” or “psection-4”, etc. Unfortunately, there is no tag that includes the entire example. The newer examples are usually spread across multiple <p> tags.
In a hacky/crude Python script, I was able to identify an additional 1,854 examples. The actual number of examples in <p> tags is probably higher than this.
With 11,178 examples in <div> tags and at least 1,854 examples in <p> tags, the total number of examples in the regulations is at least 13,032.
Pandas (Data Analysis)
I then used the Python pandas library to take a closer look at the examples that I had downloaded. I computed the length/number of characters for each example “title” and for each example “text”. One of the titles had a character length of zero, and 565 of the texts had a character length of zero. I corrected the titles and the texts, or deleted the examples that were not fixable.
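The length check is only a few lines in pandas. The sketch below uses a tiny made-up DataFrame in place of the real data, and the column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the real table of examples.
examples = pd.DataFrame({
    "title": ["Example 1.", "", "Example 3."],
    "text": ["A owns all of the stock of B.", "C is a partnership.", ""],
})

# Character counts for each title and each text.
examples["title_len"] = examples["title"].str.len()
examples["text_len"] = examples["text"].str.len()

# How many titles and texts are empty?
empty_titles = int((examples["title_len"] == 0).sum())
empty_texts = int((examples["text_len"] == 0).sum())

# Keep only the rows that still have some text.
usable = examples[examples["text_len"] > 0].reset_index(drop=True)
```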
CountVectorizer (Counting Words & Phrases)
I then used scikit-learn’s CountVectorizer to find the top 20 most common words (after excluding various “stop words”). The 20 most common words in the regulation examples were:
I also used CountVectorizer to find phrases (ngrams). Common phrases in the regulation examples included:
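For anyone who wants to try this, the counting step might look roughly like the sketch below. It uses scikit-learn's built-in English stop word list, which is an assumption and not necessarily the list I used. The helper name top_terms is also made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def top_terms(texts, n=20, ngram_range=(1, 1)):
    """Return the n most frequent terms as (term, count) pairs."""
    vectorizer = CountVectorizer(stop_words="english", ngram_range=ngram_range)
    matrix = vectorizer.fit_transform(texts)
    # Sum each column to get the total count of each term across all texts.
    totals = np.asarray(matrix.sum(axis=0)).ravel()
    counts = {term: int(totals[col]) for term, col in vectorizer.vocabulary_.items()}
    # Sort by count (highest first), then alphabetically to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
```

Calling top_terms(texts) gives the most common single words, while top_terms(texts, ngram_range=(2, 3)) counts two-word and three-word phrases instead.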
Charting Regulation Examples
I have created charts of more than 600 examples in the regulations. Not all examples make good charts. Typically, examples describing ownership structures are good prospects for charts.
I have started to use Python to identify additional examples that may be good for charting. Ownership structures are usually described with the words “own” or “owned”. I have started to use regular expressions to identify where these terms are used, but not in a possessive sense (i.e., I don’t want examples such as “his own”, “her own”, “its own”, etc.).
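One way to write such a regular expression is with negative lookbehinds, as in the sketch below. The list of possessive words and the function name are assumptions. The pattern also expects exactly one space between the possessive and “own”, so it is a starting point and not a finished filter.

```python
import re

# Match "own" or "owned" as whole words, unless the word right before it
# is a possessive such as "his", "her", "its", or "their".
OWNERSHIP = re.compile(
    r"(?<!\bhis )(?<!\bher )(?<!\bits )(?<!\btheir )\b(?:own|owned)\b",
    re.IGNORECASE,
)


def mentions_ownership(text):
    """Return True if the text uses 'own' or 'owned' in a non-possessive sense."""
    return OWNERSHIP.search(text) is not None
```

With this pattern, “B is wholly owned by A.” counts as an ownership example, while “Each partner uses its own funds.” does not.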
My preliminary research suggests that there are many more examples for me to chart. ☺