Tuesday, July 07, 2009

Predicting SSN from public data

CMU researchers have published a paper describing statistical methods for predicting Social Security numbers (SSNs) from public data.

Overview of the paper:
The SSN nomenclature:
SSN (9 digits) = AN (3 digits) + GN (2 digits) + SN (4 digits)
AN - Area Number. It is assigned based on the ZIP code of the mailing address given on the SSN application form.
GN - Group Number. Within each SSA area, GNs are assigned in a precise but nonconsecutive order between 01 and 99.
SN - Serial Number. Within each GN, SNs are assigned "consecutively from 0001 through 9999".
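
To make the AN / GN / SN layout concrete, here is a minimal Python sketch that splits a 9-digit string into the three fields. The names split_ssn and SSNParts are my own, and the sample number is an obvious placeholder, not a real SSN.

```python
from typing import NamedTuple


class SSNParts(NamedTuple):
    area: str    # AN, 3 digits
    group: str   # GN, 2 digits
    serial: str  # SN, 4 digits


def split_ssn(ssn: str) -> SSNParts:
    """Split a 9-digit SSN string (dashes optional) into AN, GN and SN."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 9 digits")
    return SSNParts(digits[:3], digits[3:5], digits[5:])


# Placeholder value used purely to show the field boundaries.
parts = split_ssn("123-45-6789")
print(parts.area, parts.group, parts.serial)
```

The fields are kept as strings so that leading zeros (for example GN "01") are preserved.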

Algorithm:
The prediction algorithm exploits the fact that people born in the same area around the same time are likely to have SSNs that are close together.
Step 1: Use the Death Master File (a public file containing the SSNs and the place and date of birth of deceased people) to form clusters of people.
Step 2: Take the person's place and date of birth from a social networking site such as Facebook or Orkut, and identify his or her cluster. This reveals the AN and GN.
Step 3: Use regression to predict the SN.
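
The three steps can be sketched roughly as follows. This is a toy illustration under my own assumptions, not the paper's actual procedure: the records are entirely invented (including the unused area number 000), a "cluster" is simply "same state, birth date within a window", and step 3 is a plain least-squares line. The names RECORDS, predict and window_days are hypothetical.

```python
from collections import Counter
from datetime import date

# Invented stand-in for Death Master File rows:
# (state of birth, date of birth, AN+GN prefix, SN). None of this is real data.
RECORDS = [
    ("XX", date(1990, 1, 1), "00001", 1000),
    ("XX", date(1990, 1, 11), "00001", 2000),
    ("XX", date(1990, 1, 21), "00001", 3000),
    ("XX", date(1990, 1, 31), "00001", 4000),
    ("YY", date(1990, 1, 10), "00003", 500),
]


def predict(state, dob, records=RECORDS, window_days=30):
    """Return a guessed (AN+GN prefix, SN) pair, or None if data is too sparse."""
    # Steps 1-2: assumed clustering rule, same state and nearby birth date.
    cluster = [r for r in records
               if r[0] == state and abs((r[1] - dob).days) <= window_days]
    if len(cluster) < 2:
        return None
    # The most common prefix in the cluster is the guess for the first five digits.
    prefix = Counter(r[2] for r in cluster).most_common(1)[0][0]
    pts = [(r[1].toordinal(), r[3]) for r in cluster if r[2] == prefix]
    # Step 3: ordinary least-squares fit of SN against birth date.
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0:
        return prefix, round(my)
    slope = sum((x - mx) * (y - my) for x, y in pts) / sxx
    sn = round(my + slope * (dob.toordinal() - mx))
    return prefix, max(1, min(9999, sn))  # keep the guess inside 0001-9999


print(predict("XX", date(1990, 1, 16)))
```

In this made-up data the SN grows by exactly 100 per day, so the fit is perfect; real assignment patterns are far noisier, which is why the accuracy figures quoted from the Wired article are expressed in terms of many attempts.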

Conclusion:
The US government is already working on randomizing SSN assignment to defend against statistical attacks, but the SSNs that have already been issued remain predictable with some accuracy, as outlined above.

In the paper, they mention that aliens who got their SSN long after birth are outliers and won't be predicted. I am safe :) but I will nevertheless always remain a skeptic and critic of the privacy of social networking sites.

Excerpt from the Wired article:
"With just two attempts, the researchers correctly guessed the first five digits of SSNs for 60 percent of deceased Americans born between 1989 and 2003. With fewer than 1,000 attempts, they could identify the entire nine digits for 8.5 percent of the group."

1 comment:

Kamesh said...

This is really interesting...