Elon Musk Does Not Understand Data Modeling
Over the past few days, Elon Musk made a bold and likely false claim that the Social Security Administration’s (SSA) incompetence at data management has led to fraudulent spending of government funds. The phrasing of the tweet points to one of four conclusions:
- Elon is lying
- Elon is clueless when it comes to data modeling
- Elon is both lying and clueless
- Elon is correct
There are many valid reasons for having duplicate SSNs in a database. Let’s dive into a few of them.
Reasonable Explanation #1: Denormalization
Databases often contain views (or tables) that combine two or more other tables. These joined tables are easier for humans to read because the related data sits side by side; however, the join repeats values from one table for every matching row in the other. Take a look below at the “Alex Smith Phone Report”, which repeats the Person ID, name, and organization as it combines the “Person” and “Phone Number” tables.
Person Table
Person ID | Person Name | Organization |
---|---|---|
1 | Alex Smith | Logistics Company |
2 | Jamie Carter | Sales Organization |
Phone Number Table
Person ID | Phone Number | Type |
---|---|---|
1 | 867-5309 | Work |
1 | 321-5654 | Home |
Alex Smith Phone Report
Person ID | Person Name | Organization | Phone Number | Type |
---|---|---|---|---|
1 | Alex Smith | Logistics Company | 867-5309 | Work |
1 | Alex Smith | Logistics Company | 321-5654 | Home |
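As a rough sketch, the report above can be reproduced with an in-memory SQLite database from Python. The table and column names (`person`, `phone_number`, and so on) are my own illustrative choices, not anything from a real government system.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id    INTEGER PRIMARY KEY,
        person_name  TEXT,
        organization TEXT
    );
    CREATE TABLE phone_number (
        person_id    INTEGER REFERENCES person(person_id),
        phone_number TEXT,
        type         TEXT
    );
    INSERT INTO person VALUES
        (1, 'Alex Smith', 'Logistics Company'),
        (2, 'Jamie Carter', 'Sales Organization');
    INSERT INTO phone_number VALUES
        (1, '867-5309', 'Work'),
        (1, '321-5654', 'Home');
""")

# Joining the two tables repeats the person columns once per phone number.
report = conn.execute("""
    SELECT p.person_id, p.person_name, p.organization, ph.phone_number, ph.type
    FROM person AS p
    JOIN phone_number AS ph ON ph.person_id = p.person_id
    WHERE p.person_name = 'Alex Smith'
    ORDER BY ph.rowid
""").fetchall()

for row in report:
    print(row)

# The source table still holds exactly one row for Alex Smith.
person_count = conn.execute(
    "SELECT COUNT(*) FROM person WHERE person_id = 1"
).fetchone()[0]
```

Person ID 1 shows up twice in the report even though the `person` table stores Alex Smith only once. Anyone counting "duplicates" in the joined output would be counting phone numbers, not people.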
Reasonable Explanation #2: SSN is not a primary key
Every table in a well-designed database requires a primary key. This is a unique identifier that ensures every row in the table is unique, which maintains data integrity. To the data modeling novice, SSN would seem to be the obvious choice as a primary key since it is supposed to be unique for each citizen, permanent resident, or eligible nonimmigrant worker in the United States. One reason why SSN might not be the primary key is the sensitivity of the information. Because SSNs are personally identifiable information (PII), they must be managed with stricter security requirements than non-PII data. In this situation, an organization might create a new, unique identifier (or primary key) for individuals so the SSN is masked for most users in the organization. This allows the majority of users to proceed with their day-to-day work without having access to the SSN.
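Here is a minimal sketch of that idea, again using SQLite from Python. The schema is hypothetical: `person_id` is a made-up surrogate key, the SSN lives in its own `person_ssn` table that could be locked down separately, and the SSN value is an obviously fake placeholder.

```python
import sqlite3

# Hypothetical schema: person_id is a surrogate key, and the SSN is isolated
# in a separate table that most users would not be granted access to.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id   INTEGER PRIMARY KEY,
        person_name TEXT NOT NULL
    );
    CREATE TABLE person_ssn (
        person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
        ssn       TEXT NOT NULL
    );
    CREATE TABLE payment (
        payment_id   INTEGER PRIMARY KEY,
        person_id    INTEGER NOT NULL REFERENCES person(person_id),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO person VALUES (1, 'Alex Smith');
    INSERT INTO person_ssn VALUES (1, '000-00-0000');  -- placeholder, not a real SSN
    INSERT INTO payment VALUES (10, 1, 5000), (11, 1, 7500);
""")

# Day-to-day queries link records through person_id and never read the SSN.
payments = conn.execute("""
    SELECT p.person_name, pay.amount_cents
    FROM payment AS pay
    JOIN person AS p ON p.person_id = pay.person_id
    ORDER BY pay.payment_id
""").fetchall()

print(payments)
```

In a design like this, the SSN is just another attribute. Other tables reference the person through `person_id`, so the SSN is not what guarantees row uniqueness.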
Reasonable Explanation #3: Data Quality Processes fix the “Issue”
Organizations typically collect data through a myriad of entry points. Many of these data collection methods rely on humans to enter data, which inevitably introduces some errors into any data store. To fix these errors, organizations write scripts that check for data quality issues and run semi-automated processes (with humans involved) that correct the information. One common data quality process is identifying and merging duplicates in a master database. These processes, which would fix any issues before a payment is made, might sit downstream of the data store that had “duplicate SSNs”.
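A duplicate check of this kind can be as simple as a `GROUP BY` query. The sketch below uses a hypothetical `intake_record` table with placeholder SSNs, and the merge rule (keep the lowest `record_id` per SSN) is a simplifying assumption of mine; a real process would likely route each flagged pair to a human reviewer.

```python
import sqlite3

# Hypothetical raw intake table where the same person was keyed in twice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE intake_record (
        record_id   INTEGER PRIMARY KEY,
        ssn         TEXT,
        person_name TEXT
    );
    INSERT INTO intake_record VALUES
        (1, '000-00-0001', 'Alex Smith'),
        (2, '000-00-0001', 'Alexander Smith'),
        (3, '000-00-0002', 'Jamie Carter');
""")

# Step 1: flag every SSN that appears on more than one record.
duplicates = conn.execute("""
    SELECT ssn, COUNT(*) AS record_count
    FROM intake_record
    GROUP BY ssn
    HAVING COUNT(*) > 1
    ORDER BY ssn
""").fetchall()
print(duplicates)

# Step 2 (assumed rule): keep the earliest record for each SSN, drop the rest.
conn.execute("""
    DELETE FROM intake_record
    WHERE record_id NOT IN (
        SELECT MIN(record_id) FROM intake_record GROUP BY ssn
    )
""")
remaining = conn.execute(
    "SELECT record_id, ssn FROM intake_record ORDER BY record_id"
).fetchall()
print(remaining)
```

The raw table really does contain a "duplicate SSN" before the cleanup runs, and none afterward. A snapshot taken upstream of this step says nothing about what the payment system eventually sees.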
Conclusion
After making the “duplicate SSN” claim, Elon went on to make the 100% false claim that the government does not use Structured Query Language (SQL). SQL has been ubiquitous across databases and data warehouses for over 50 years, which makes it practically impossible for any large organization to avoid. Here is one public link where the U.S. Air Force used SQL on an IT modernization effort.
The lack of specificity surrounding the dubious “duplicate SSN” claim coupled with the obviously false claim that the government does not use SQL suggests Elon was lying and/or is clueless when it comes to data modeling. I recommend that the Nazi Sympathizer enroll in a data modeling course at Trump University.
~ The Data Generalist
Data Science Career Advisor