Andrew Mitchel LLC

International Tax Blog - New and Interesting International Tax Issues


13,000 Regulation Examples

2022-08-16

I often use Cornell’s Legal Information Institute (“LII”) to access tax regulations.  I noticed that the HTML on the LII website included regulation examples in <div> tags with a class of “example”.  I thought it would be interesting to extract the text of all of the examples in the regulations and perform some analysis on those regulation examples.

Below I describe how I located over 13,000 examples in the tax regulations.

My first step was to identify all of the LII web pages with tax regulations.  The following web pages included links to all of the tax regulations that were important to me:

26 CFR Part 1 - INCOME TAXES

26 CFR Part 20 - ESTATE TAX

26 CFR Part 25 - GIFT TAX

26 CFR Part 26 - GENERATION-SKIPPING TRANSFER TAX

26 CFR Part 31 - EMPLOYMENT TAXES AND COLLECTION OF INCOME TAX AT SOURCE

26 CFR Part 301 - PROCEDURE AND ADMINISTRATION

I created a Python script to:

  1. Get all of the links from the above pages,
  2. Remove the links that do not point to title 26 regulations,
  3. Remove duplicate links, and
  4. Save the links to a file.
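The four steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the original script. The URL pattern `/cfr/text/26/` and the names `LinkCollector`, `regulation_links`, and `save_links` are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.law.cornell.edu"
# Assumption: title 26 regulation pages live under a path like this one.
REG_PATH = "/cfr/text/26/"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def regulation_links(html, base_url=BASE_URL):
    """Return the unique title 26 links on a page, in first-seen order."""
    collector = LinkCollector()
    collector.feed(html)
    links, seen = [], set()
    for href in collector.hrefs:
        url = urljoin(base_url, href).split("#")[0]  # drop any #fragment
        if REG_PATH in url and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def save_links(links, path):
    """Write one link per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(links) + "\n")
```

Dropping the fragment before checking for duplicates means that two links to different subsections of the same regulation count as one page to download.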

I ended up with more than four thousand web pages to download, so I created another Python script to download them all.  I included a delay between downloads so as to be nice to the LII servers.
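A polite downloader along these lines might look like the sketch below. The two-second delay, the file naming scheme, and the function names are assumptions for illustration; the post does not say which values were actually used.

```python
import time
import urllib.request
from pathlib import Path


def fetch_page(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, out_dir, delay_seconds=2.0,
                 fetcher=fetch_page, sleep=time.sleep):
    """Save each URL into out_dir, pausing after every request."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Hypothetical naming scheme: last path segment plus ".html".
        target = out_dir / (url.rstrip("/").rsplit("/", 1)[-1] + ".html")
        if target.exists():
            continue  # already downloaded on an earlier run
        target.write_bytes(fetcher(url))
        saved.append(target)
        sleep(delay_seconds)  # be nice to the server
    return saved
```

Skipping files that already exist lets an interrupted run resume without requesting the same pages twice.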

For each file, I wanted to get:

  1. The name of the file,
  2. The number of examples found, and
  3. A list of the examples.

For each example, I wanted to get:

  1. The title of the example,
  2. The text of the example, and
  3. A citation to the example.

After downloading all of the files, I identified 11,178 examples in <div> tags with a class of “example”.  The titles of the examples were easy to pull out because the titles were in <div> tags with a class of “hed”.  Similarly, the texts of the examples were in <div> tags with a class of “pspace”.
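The extraction described above can be sketched with the standard library's HTML parser. This is only an illustration under stated assumptions: it assumes the “hed” and “pspace” <div> tags are nested inside the “example” <div>, that examples do not nest inside each other, and the names `ExampleParser` and `extract_examples` are hypothetical.

```python
from html.parser import HTMLParser


def _squash(parts):
    """Join text fragments and collapse runs of whitespace."""
    return " ".join("".join(parts).split())


class ExampleParser(HTMLParser):
    """Collect the title ("hed") and text ("pspace") of each div.example."""

    def __init__(self):
        super().__init__()
        self.examples = []    # finished examples
        self._stack = []      # class list of every currently open <div>
        self._current = None  # example being built, or None

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append(classes)
        if "example" in classes and self._current is None:
            self._current = {"title": [], "text": []}

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        closed = self._stack.pop()
        if self._current is None:
            return
        if "example" in closed:
            self.examples.append({
                "title": _squash(self._current["title"]),
                "text": _squash(self._current["text"]),
            })
            self._current = None
        else:
            self._current["text"].append(" ")  # keep paragraphs apart

    def handle_data(self, data):
        if self._current is None:
            return
        open_classes = {c for classes in self._stack for c in classes}
        if "hed" in open_classes:
            self._current["title"].append(data)
        elif "pspace" in open_classes:
            self._current["text"].append(data)


def extract_examples(html, filename):
    """Return the file name, the number of examples, and the examples."""
    parser = ExampleParser()
    parser.feed(html)
    parser.close()
    return {"file": filename,
            "count": len(parser.examples),
            "examples": parser.examples}
```

The returned dictionary mirrors the three per-file items listed earlier: the name of the file, the number of examples found, and the list of examples.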

Citations to the Examples – “id” Attributes

The citations to the examples were a bit tricky.  LII uses the “id” attribute to track the specific citation.  The id attribute allows you to link to a specific portion of a web page.  I use id attributes all the time to more quickly access code sections and regulations on LII.  For example, I often want to get to section 7701(b).  However, section 7701(a) is quite long.  If I just go to section 7701, I have to scroll down pages and pages before I get to section 7701(b).  But if I just add “#b” at the end of the URL, I am directed straight to subsection (b):

https://www.law.cornell.edu/uscode/text/26/7701#b

You can even go deeper than just the subsection.  Adding “#b_3” at the end of the URL will take you directly to section 7701(b)(3):

https://www.law.cornell.edu/uscode/text/26/7701#b_3

In theory, this approach allows me to extract the specific citation of each regulation example using the id attribute.  In practice, however, the id attributes in the LII regulations are not consistent.  Therefore, I was only able to get a rough approximation of the citations.
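As a small illustration of the idea, the sketch below converts between an id fragment such as “b_3” and a citation suffix such as “(b)(3)”. It assumes the underscore-separated pattern shown in the section 7701 URLs above. Because the ids in the regulations are not consistent, a conversion like this can only be approximate. Both function names are hypothetical.

```python
import re


def fragment_to_citation(section, fragment):
    """Turn an id such as "b_3" into a citation such as "7701(b)(3)"."""
    parts = [p for p in fragment.lstrip("#").split("_") if p]
    return section + "".join(f"({p})" for p in parts)


def citation_to_fragment(subdivisions):
    """Turn "(b)(3)" into the URL fragment "#b_3"."""
    parts = re.findall(r"\(([^)]+)\)", subdivisions)
    return "#" + "_".join(parts)
```

For instance, `fragment_to_citation("7701", "#b_3")` yields a citation to 7701(b)(3), and `citation_to_fragment("(b)(3)")` yields the fragment to append to the section's URL.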

Missing Examples (Examples not in <div> tags)

I realized that some of the examples in the regulations were not in <div> tags.  LII changed its process in recent years.  Examples that have come out in the past few years are now in <p> tags, usually with a class of “psection-2” or “psection-3” or “psection-4”, etc.  Unfortunately, there is no tag that includes the entire example.  The newer examples are usually spread across multiple <p> tags.

In a hacky/crude Python script, I was able to identify an additional 1,854 examples.  The actual number of examples in <p> tags is probably higher than this.
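One crude heuristic for the newer format is sketched below. It is not the original script, and it rests on two loudly stated assumptions: that an example opens with a paragraph whose text begins with the word “Example” (optionally preceded by an enumerator such as “(1)”), and that the example continues through the following paragraphs with a higher “psection” number. All names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tolerates a missing leading "p" in the class name.
LEVEL_RE = re.compile(r"^p?section-(\d+)$")
# Assumption: "(1) Example 1. ..." starts an example; "Examples." does not.
START_RE = re.compile(r"^(?:\(\w+\)\s*)*Example\b", re.IGNORECASE)


class ParagraphParser(HTMLParser):
    """Collect (level, text) for every <p> with a psection-N class."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._level = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._level, self._parts = None, []
            for cls in (dict(attrs).get("class") or "").split():
                match = LEVEL_RE.match(cls)
                if match:
                    self._level = int(match.group(1))

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._level is not None:
            text = " ".join("".join(self._parts).split())
            self.paragraphs.append((self._level, text))
            self._level = None


def group_examples(paragraphs):
    """Group consecutive paragraphs into examples using the heuristic."""
    examples, current, start_level = [], None, None
    for level, text in paragraphs:
        if START_RE.match(text):
            if current:
                examples.append(current)
            current, start_level = [text], level
        elif current is not None and level > start_level:
            current.append(text)  # deeper paragraph: part of the example
        elif current is not None:
            examples.append(current)  # same or shallower level ends it
            current = None
    if current:
        examples.append(current)
    return [" ".join(parts) for parts in examples]
```

A heuristic like this will miss examples that are formatted differently, which is consistent with the count being a lower bound.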

With 11,178 examples in <div> tags and at least 1,854 examples in <p> tags, the total number of examples in the regulations is at least 13,032.

Pandas (Data Analysis)

I then used the Python pandas library to take a closer look at the examples that I had downloaded.  I computed the length (number of characters) of each example “title” and each example “text”.  One of the titles had a length of zero, and 565 of the texts had a length of zero.  I corrected those titles and texts where I could, and deleted the examples that were not fixable.
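The length check can be sketched in a few lines of pandas. The column names “title” and “text” and the helper name `add_lengths` are assumptions for the example; the sample rows are made up.

```python
import pandas as pd


def add_lengths(df):
    """Add character counts and flag rows with an empty title or text."""
    out = df.copy()
    out["title_len"] = out["title"].fillna("").str.len()
    out["text_len"] = out["text"].fillna("").str.len()
    out["needs_review"] = (out["title_len"] == 0) | (out["text_len"] == 0)
    return out


# Made-up rows standing in for the extracted examples.
sample = pd.DataFrame({
    "title": ["Example 1.", "", "Example 3."],
    "text": ["A owns B.", "C owns D.", None],
})
checked = add_lengths(sample)
print(checked[checked["needs_review"]])
```

Rows flagged by `needs_review` are the ones to fix by hand or drop.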

CountVectorizer (Counting Words & Phrases)

I then used Python’s scikit-learn CountVectorizer to find the top 20 most common words (after excluding various “stop words”).  The 20 most common words in the regulation examples were:

  • year           15750
  • income         11155
  • corporation    10878
  • stock           9102
  • taxable         6650
  • interest        6531
  • property        6381
  • tax             6331
  • amount          5791
  • percent         5779
  • facts           5610
  • plan            4898
  • business        4633
  • basis           4531
  • value           4127
  • partnership     3867
  • foreign         3731
  • example         3711
  • years           3465
  • respect         3425

I also used CountVectorizer to find phrases (ngrams).  Common phrases in the regulation examples included:

  • taxable year           4226
  • united states          2001
  • calendar year          1961
  • fair market value      1688
  • gross income           1633
  • income tax             1539
  • taxable income         1173
  • trade or business      1138
  • foreign corporation    1070
  • adjusted basis          888
  • interest expense        873
  • real property           804
  • earnings and profits    766

Charting Regulation Examples

I have created charts of more than 600 examples in the regulations.  Not all examples make good charts.  Typically, examples describing ownership structures are good prospects for charts.

I have started to use Python to identify additional examples that may be good for charting.  Ownership structures are usually described with the words “own” or “owned”.  I have started to use regular expressions to identify where these terms are used, but not in a possessive sense (i.e., I don’t want examples such as “his own”, “her own”, “its own”, etc.).
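One way to express this with regular expressions is a negative lookbehind for each possessive word. The sketch below is an assumption about how such a filter could look, not the actual pattern in use. It also matches “owns”, which the text above does not mention, and the list of possessives is only a starting point.

```python
import re

# Assumption: these are the possessives worth excluding.
POSSESSIVES = ("my", "your", "his", "her", "its", "our", "their")

# Python lookbehinds must be fixed-width, so each possessive gets its own.
_LOOKBEHINDS = "".join(rf"(?<!\b{word}\s)" for word in POSSESSIVES)
OWNERSHIP_RE = re.compile(_LOOKBEHINDS + r"\b(?:own|owns|owned)\b",
                          re.IGNORECASE)


def ownership_hits(text):
    """Return each non-possessive use of own/owns/owned in the text."""
    return OWNERSHIP_RE.findall(text)


def mentions_ownership(text):
    """True if the text appears to describe an ownership relationship."""
    return OWNERSHIP_RE.search(text) is not None
```

For instance, “B is wholly owned by A” is flagged, while “the taxpayer pays for its own repairs” is not. The word boundaries keep “owner” and “ownership” from matching. A possessive separated from “own” by more than one whitespace character would slip through.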

My preliminary research suggests that there are many more examples for me to chart.

Tags: Charts - Situational Charts, Python