Social Network Analysis of Disclosure

unmask – python profiling tool

In one of the mailing lists I’m on, this cool little python-based profiling tool was posted.  Here is the README.

This is version 1.0 of Unmask – a python script that attempts to unmask anonymous text by matching its statistical properties against someone else’s text with a known identity.

Other uses include determining “area of origin”,gender,age, occupation, sexual orientation, etc from text’s statistical properties. Any decision YOU can make against an unknown author, Unmask will also make. Of course, it may be less or more accurate than your determination.

You should probably fiddle with it as you go, to make it work on whatever sample set you have, before using it in the wild.

You can download it from http://www.immunitysec.com/downloads/unmask1.0.tar.gz.

Looking at textstats.py, it profiles based on words per sentence, types of punctuation statistics, and what types of words are used.

Pretty simple to use.  I collected a bunch of Chuck Norris text from around (you can see it here) and added it to my “chuck norris profile”.

Do this by

unmask.py -s chuck -f /path/chuck1.txt
unmask.py -s chuck -f /fath/chuck2.txt

Now, compare it with the benchmark Chuck Norris text that’s here.  This is also known Chuck Norris text.

./unmask.py -i -f ./unknown.txt
Comparing against chuck.pkl
Compared to store/chuck.pkl with match value of: 92.0
Matched closest with matchfile store/chuck.pkl:
Identification information – file: store/chuck.pkl matchvalue:92.0

not bad.  92 percent.

Now, I compare Chuck Norris with some Jesus quotes, to see if he is likely to be the second coming of Jesus.  I compared it with Mathew5:43-47

Comparing against chuck.pkl
Compared to store/chuck.pkl with match value of: 43.0
Matched closest with matchfile store/chuck.pkl:
Identification information – file: store/chuck.pkl matchvalue:43.0

Hmmm, I guess there are some flaws with the program or at least my small sample size.  I’m sure if we would have had a big enough sample, we could have verified Chuck was the second coming of Christ.  I guess I’ll save that for next time.

To use it, simple “store” text (with -s bob -f file.txt). Then just compare your unknown file to that particular store, or use -i to compare it to all stores. Make up a store of all male and all female text and then compare some random weblog, just for kicks.