unmask – python profiling tool

On one of the mailing lists I’m on, this cool little Python-based profiling tool was posted.  Here is the README.

This is version 1.0 of Unmask – a python script that attempts to unmask anonymous text by matching its statistical properties against someone else’s text with a known identity.

Other uses include determining “area of origin”, gender, age, occupation, sexual orientation, etc. from a text’s statistical properties. Any decision YOU can make against an unknown author, Unmask will also make. Of course, it may be less or more accurate than your determination.

You should probably fiddle with it as you go, to make it work on whatever sample set you have, before using it in the wild.

You can download it from http://www.immunitysec.com/downloads/unmask1.0.tar.gz.

Looking at textstats.py, it profiles based on words per sentence, types of punctuation statistics, and what types of words are used.
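To give a feel for what that kind of profiling looks like, here is a minimal sketch of my own. It is not the code from textstats.py; the function name text_profile and the choice of average word length as a stand-in for "types of words" are assumptions made purely for illustration.

```python
import re
import string
from collections import Counter


def text_profile(text: str) -> dict:
    """Compute a few simple stylometric features (hypothetical helper)."""
    # Rough sentence split on runs of ., ! or ?; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    # Count each punctuation character that appears in the text.
    punctuation = Counter(ch for ch in text if ch in string.punctuation)
    total_punct = sum(punctuation.values())
    return {
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        # Share of each punctuation mark among all punctuation seen.
        "punctuation_freq": {ch: n / total_punct for ch, n in punctuation.items()},
        # Assumption: average word length as a crude proxy for word type.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


if __name__ == "__main__":
    print(text_profile("Chuck Norris counts. He counted to infinity, twice!"))
```

Two profiles built this way could then be compared feature by feature, though how Unmask actually scores a match is not something I have checked.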

Pretty simple to use.  I collected a bunch of Chuck Norris text from around (you can see it here) and added it to my “chuck norris profile”.

Do this by

unmask.py -s chuck -f /path/chuck1.txt
unmask.py -s chuck -f /path/chuck2.txt

Now, compare it with the benchmark Chuck Norris text that’s here.  This is also known Chuck Norris text.

./unmask.py -i -f ./unknown.txt
Comparing against chuck.pkl
Compared to store/chuck.pkl with match value of: 92.0
Matched closest with matchfile store/chuck.pkl:
Identification information – file: store/chuck.pkl matchvalue:92.0

Not bad.  92 percent.

Next, I compared the Chuck Norris profile with some Jesus quotes, to see if he is likely to be the second coming of Jesus.  I used Matthew 5:43-47.

Comparing against chuck.pkl
Compared to store/chuck.pkl with match value of: 43.0
Matched closest with matchfile store/chuck.pkl:
Identification information – file: store/chuck.pkl matchvalue:43.0

Hmmm, I guess there are some flaws with the program, or at least with my small sample size.  I’m sure if we had had a big enough sample, we could have verified Chuck was the second coming of Christ.  I guess I’ll save that for next time.

To use it, simply “store” text (with -s bob -f file.txt). Then just compare your unknown file to that particular store, or use -i to compare it to all stores. Make up a store of all male and all female text and then compare some random weblog, just for kicks.

ssh-keygen -R

This is the correct way to remove old public keys from your known_hosts file.

There are times when public keys change for legitimate reasons, such as when a new operating system gets installed on the host.

In the past, I’ve just edited .ssh/known_hosts and deleted the entries (there are normally at least two, one for the IP and one for the host name).  This can be a bit hard because these entries aren’t really labeled clearly.

A better way is to:

ssh-keygen -R hostname
ssh-keygen -R IP
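
If you do this often, the two calls are easy to wrap. The following is a small sketch of my own, assuming ssh-keygen is on your PATH; the helper names removal_commands and remove_host_keys are made up for illustration. It relies only on the -R option and on -f, which points ssh-keygen at a specific known_hosts file.

```python
import subprocess


def removal_commands(names, known_hosts=None):
    """Build one 'ssh-keygen -R <name>' argv list per host name or IP."""
    commands = []
    for name in names:
        argv = ["ssh-keygen", "-R", name]
        if known_hosts is not None:
            # -f selects a known_hosts file other than the default one.
            argv += ["-f", known_hosts]
        commands.append(argv)
    return commands


def remove_host_keys(names, known_hosts=None):
    """Run each removal command; raises CalledProcessError if one fails."""
    for argv in removal_commands(names, known_hosts):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Only prints the commands; call remove_host_keys() to actually run them.
    for argv in removal_commands(["hostname", "192.0.2.10"]):
        print(" ".join(argv))
```

Pass both the host name and its IP, since each normally has its own entry.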


scanrand is a cool tool for network scanning written by Dan Kaminsky.  The big advantage of this tool as a network scanner is that it can scan very large networks very fast.

It works by splitting into two completely independent processes, one for sending packets and one for receiving them.  The sending process fires off SYN packets and doesn’t try to retain state information.  The receiving process doesn’t retain state either.  In other words, it uses a stateful protocol in a stateless way.

How does this prevent a smart router or something from just sending weird information in response to  a detected scan?

Normally, the ISN (initial sequence number) of a SYN packet is meant to be basically random.  scanrand builds a deterministic ISN by concatenating the source IP, source port, destination IP, and destination port with a secret key and running the result through a one-way hashing function, meaning these “random” ISNs can be recalculated and checked when a reply arrives.  This is called an “inverse SYN cookie”.
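
Here is a rough sketch of the idea in Python. It is not scanrand's code: the function names are mine, and using HMAC-SHA1 truncated to 32 bits as the keyed one-way hash is an assumption for illustration. The one protocol fact it leans on is that a genuine SYN-ACK acknowledges our ISN plus one.

```python
import hashlib
import hmac
import socket
import struct


def inverse_syn_cookie(key, src_ip, src_port, dst_ip, dst_port):
    """Derive a 32-bit ISN from the connection 4-tuple and a secret key."""
    msg = (
        socket.inet_aton(src_ip) + struct.pack("!H", src_port)
        + socket.inet_aton(dst_ip) + struct.pack("!H", dst_port)
    )
    # Assumption: HMAC-SHA1 as the keyed one-way function.
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    # Keep the first 4 bytes so the value fits the TCP sequence field.
    return struct.unpack("!I", digest[:4])[0]


def reply_is_genuine(key, src_ip, src_port, dst_ip, dst_port, ack_number):
    """Check a reply without stored state.

    src_* is the scanner's side and dst_* the target's side, i.e. the
    4-tuple as it appeared in the SYN we sent. A real SYN-ACK carries
    an acknowledgment number equal to our ISN + 1 (mod 2**32).
    """
    expected = (inverse_syn_cookie(key, src_ip, src_port, dst_ip, dst_port) + 1) & 0xFFFFFFFF
    return ack_number == expected


if __name__ == "__main__":
    key = b"example-secret"
    isn = inverse_syn_cookie(key, "10.0.0.5", 40000, "10.0.0.9", 80)
    print(reply_is_genuine(key, "10.0.0.5", 40000, "10.0.0.9", 80, (isn + 1) & 0xFFFFFFFF))
```

Because the receiver can recompute the expected value from the packet headers and the key alone, a device that makes up replies without having seen the matching SYN would have to guess a 32-bit number.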


To reiterate the advantage: a class C network has been known to be scanned in as little as four seconds with this tool.

Here is how I tried it on my local network:

scanrand -d eth1 -b10M

Pretty simple.  The -b10M limits the scan to 10 Mbps.  With scanrand you usually want to throttle the traffic, or else your network could easily become overloaded.  The -d eth1 just specifies my wireless card.  The target argument specifies which IPs and ports to scan.  quick is a shortcut meaning ports

80,443,445,53,20-23,25,135,139,8080,110,111,143,1025,5000,465,993,31337,79,8010,8000,6667,2049,3306

The biggest disadvantage might be how noisy it is.  But it’s not meant to be quiet.

Anyway, this is a pretty innovative “why didn’t I think of that” tool.  Give it a try.