Linkedin Crawler

The following is also source used in the grad project. I’ll post the actual paper at some point. But here is the LinkedIn crawler portion with the applicable source. By its nature, this code is breakable and may not work even at the time of posting. But it did work long enough for me to gather addresses, which was the point.

Usage is/was

LinkedinPageGatherer.py Linkedinusername Linkedinpassword

Following is an excerpt from the ‘paper’.

the HTMLParser libraries are more resilient to changes in source. Both HTMLParser and lxml libraries have different code available to process broken HTML. The HTMLParser libraries were chosen as more appropriate for these problems [lxml][htmlparsing].

There has been an effort to put all HTML specific logic in debuggable places so if the HTML generated changes then it is easy to modify the code parsing to reflect those changes (assuming equivalent information is available). However, changes in source are frequent, and the source code has had to be modified roughly every 3 months to reflect changes in HTML layout.

Unfortunately, although the functionality is simple, this program has grown much more complex due to roadblocks put in place by both LinkedIn and Google.

To search LinkedIn from itself, it is necessary to have a LinkedIn account. With an account, it is possible to search with or without connections, although the searching criteria differ depending on the type of account you have. Because of this, one of the criteria for searching LinkedIn is cookie management, which has to be written to keep track of the HTTP session. In addition, LinkedIn uses a POST parameter nonce at each page that must be retrieved and POSTed for every page submission. Because of the nonce, it is also necessary to login at the login page, save the nonce and the cookie, and proceed to search through the same path an actual user would.
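
The nonce and cookie handling described above can be sketched roughly as follows. This is a minimal illustration written for Python 3 (the listings in this post are Python 2), not the code used in the project. The helper names `hidden_fields`, `build_session`, and `login_body` are made up for this example; the field names `csrfToken`, `session_key`, and `session_password` are the ones that appear in the login code further down.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of every hidden form field found in the page."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_session():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_body(html, username, password):
    """Merge the page's hidden fields (including the nonce) with the credentials."""
    form = hidden_fields(html)
    form.update({"session_key": username, "session_password": password})
    return urllib.parse.urlencode(form).encode("ascii")
```

The idea is that every page is fetched through the same opener, and the hidden fields of the page just received are echoed back in the next POST, so the crawler follows the same path a browser would.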

Once the tool is able to search for companies, there is an additional limitation. With the free account, the search is limited to displaying only 100 connections. This is inconvenient as the desired number of actual connections is often much larger. The tool I’ve written takes various criteria (such as location, title, etc) to perform multiple more specific searches of 100 results each. The extra information is harvested at each search to use for later searches. With more specific searches, the tool inserts unique items into a list of users. When the initial search initiates, LinkedIn reports the total number of results (although it only lets the account view 100 at a time) so the tool uses this total number as one possible stopping condition – when a percentage of that number has been reached or a certain number of failed searches have been tried.

This is easier to illustrate with an example. In the case of FPL, there are over 2000 results. However, it can be asserted that at least one of the results is from a certain Miami address. Using this as a search restriction the total results may be reduced to 500, the first 100 of which can be inserted. It can also be asserted that there is at least one result from the Miami address who is a project manager. Using this restriction, there are only 5 results, which have different criteria to do advanced searches on. Using this iterative approach, it is possible to gather most of the 2000. In the program I have written, this functionality is still experimental and the parameters must be adjusted.
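
The narrowing approach in the example above can be sketched against an in-memory list instead of LinkedIn. This is only an illustration in Python 3 under simplified assumptions: `narrowing_search` and `make_search` are hypothetical names, each record is a plain dict of fields, and `search` stands in for a site query that only shows the first 100 matches.

```python
def make_search(records):
    """Build a stand-in search function over a list of dict records."""
    def search(filters):
        return [r for r in records
                if all(r.get(k) == v for k, v in filters.items())]
    return search


def narrowing_search(search, total, page_limit=100,
                     target_fraction=0.7, max_failed=100):
    """Collect unique records by issuing progressively more specific searches.

    Only the first page_limit results of each search are visible. The loop
    stops once target_fraction of total has been gathered, or after more than
    max_failed searches that added nothing new.
    """
    found = []        # unique records gathered so far
    pending = [{}]    # filter sets still to try; {} is the unrestricted search
    tried = [{}]      # every filter set ever queued, to avoid repeats
    failed = 0
    while pending and failed <= max_failed and len(found) < target_fraction * total:
        filters = pending.pop(0)
        before = len(found)
        for record in search(filters)[:page_limit]:
            if record not in found:
                found.append(record)
            # every field value seen becomes a candidate restriction
            for key, value in record.items():
                narrower = dict(filters, **{key: value})
                if narrower not in tried:
                    tried.append(narrower)
                    pending.append(narrower)
        if len(found) == before:
            failed += 1
    return found
```

With 300 synthetic records spread over 5 locations and 60 titles, the unrestricted search only exposes 100 of them, but the follow-up searches restricted by title or location uncover enough of the rest to reach the 70% target.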

One additional difficulty with LinkedIn is that these results do not display a name, only a job title associated with the company. Obviously, this is not ideal. A name is necessary for even the most basic spear phishing attacks. An email may sound slightly awkward if addressed as “Dear Project Manager in the Cyber Security Group”. The solution I found to retrieve employee names is to use Google. Using specific Google queries built from the LinkedIn results, it is possible to retrieve the name associated with a given job title, company, and location.

Google has a use policy prohibiting automated crawlers. Because of this policy, it does various checks on the queries to verify that the browser is a known real browser. If it is not, Google returns a 403 status stating that the browser is not known. To circumvent this, a packet dump was performed on a valid browser. The code now has a snippet to send information exactly like an actual browser would along with randomized time delays to mimic a person. It should be impossible for Google to tell the difference over the long run – whatever checks they do can be mimicked. The code includes several configurable browsers to masquerade as. Below is the code snippet including the default spoofed browser which is Firefox running on Linux.

def getHeaders(self, browser="ubuntuFF"):
  #ubuntu firefox spoof
  if browser == "ubuntuFF":
    headers = {
      "Host": "www.google.com",
      "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5",
      "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Accept-Language" : "en-us,en;q=0.5",
      "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
      "Keep-Alive" : "300",
      "Proxy-Connection" : "keep-alive"
    }
...
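
The randomized time delays mentioned above are not part of the snippet, so here is a rough sketch of one way to do it in Python 3. The names `human_delay` and `paced` and the 2 to 8 second bounds are assumptions for illustration, not values taken from the project.

```python
import random
import time


def human_delay(minimum=2.0, maximum=8.0, rng=random):
    """Pick a pause length in seconds, uniformly between minimum and maximum."""
    return rng.uniform(minimum, maximum)


def paced(items, minimum=2.0, maximum=8.0, sleep=time.sleep):
    """Yield items one at a time, pausing a random interval between them."""
    for index, item in enumerate(items):
        if index:  # no pause before the very first item
            sleep(human_delay(minimum, maximum))
        yield item
```

Looping over `paced(queries)` instead of `queries` spaces the requests out irregularly. The `sleep` parameter exists only so the pauses can be replaced with a stub when testing.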

Although both Google and LinkedIn make it difficult to automate information mining, their approach will fundamentally fail a motivated adversary. Because these companies want to make information available to users, this information can also be retrieved automatically. Captcha technology has been one traditional solution, though by its nature it suffers from similar flaws in design.

The LinkedIn crawler program demonstrates that an attacker targeting a company can harvest people’s names, which can often be mapped to email addresses as demonstrated in previous sections.

GoogleQueery.py

#! /usr/bin/python

#class to make google queries
#must masquerade as a legitimate browser
#Using this violates Google ToS

import httplib
import urllib
import sys
import HTMLParser
import re

#class is basically fed a google url for linkedin for the
#sole purpose of getting a linkedin link
class GoogleQueery(HTMLParser.HTMLParser):
  def __init__(self, goog_url):
    HTMLParser.HTMLParser.__init__(self)
    self.linkedinurl = []
    query = urllib.urlencode({"q": goog_url})
    conn = httplib.HTTPConnection("www.google.com")
    headers = self.getHeaders()
    conn.request("GET", "/search?hl=en&"+query, headers=headers)
    resp = conn.getresponse()
    data = resp.read()
    self.feed(data)
    self.get_num_results(data)
    conn.close()

  #this is necessary because google wants to be mean and 403 based on... not sure
  #but it seems  I must look like a real browser to get a 200
  def getHeaders(self, browser="chromium"):
    #if browser == "random":
      #TODO randomize choice
    #ubuntu firefox spoof
    if browser == "ubuntuFF":
      headers = {
        "Host": "www.google.com",
        "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5",
        "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language" : "en-us,en;q=0.5",
        "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
        "Keep-Alive" : "300",
        "Proxy-Connection" : "keep-alive"
        }
    elif browser == "chromium":
      headers = {
        "Host": "www.google.com",
        "Proxy-Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.5 Safari/533.2",
        "Referer": "http://www.google.com/",
        "Accept": "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
        "Avail-Dictionary": "FcpNLYBN",
        "Accept-Language": "en-US,en;q=0.8",
        "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3"
      }
    elif browser == "ie":
      headers = {
        "Host": "www.google.com",
        "Proxy-Connection": "keep-alive",
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
        "Referer": "http://www.google.com/",
        "Accept": "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
        "Accept-Language": "en-US,en;q=0.8",
        "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3"
      }
    return headers

  def get_num_results(self, data):
    index = re.search("<b>1</b> - <b>[\d]+</b> of [\w]*[ ]?<b>([\d,]+)", data)
    try:
      self.numResults = int(index.group(1).replace(",", ""))
    except:
      self.numResults = 0
      if not "- did not match any documents. " in data:
        print "Warning: numresults parsing problem"
        print "setting number of results to 0"

  def handle_starttag(self, tag, attrs):
    try:
      if tag == "a" and ((("linkedin.com/pub/" in attrs[0][1])
                    or  ("linkedin.com/in" in attrs[0][1]))
                    and ("http://" in attrs[0][1])
                    and ("search?q=cache" not in attrs[0][1])
                    and ("/dir/" not in attrs[0][1])):
        self.linkedinurl.append(attrs[0][1])
        #print self.linkedinurl
      #perhaps add a google cache option here in the future
    except IndexError:
      pass

#for testing
if __name__ == "__main__":
  url = 'site:linkedin.com "PROJECT ADMINISTRATOR at CAT INL QATAR W.L.L." "Qatar"'
  m = GoogleQueery(url)

LinkedinHTMLParser.py

#! /usr/bin/python

#this should probably be put in LinkedinPageGatherer.py

import HTMLParser

from person_searchobj import person_searchobj

class LinkedinHTMLParser(HTMLParser.HTMLParser):
  """
  subclass of HTMLParser specifically for parsing Linkedin names to person_searchobjs
  requires a call to .feed(data), stored data in the personArray
  """
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.personArray = []
    self.personIndex = -1
    self.inGivenName = False
    self.inFamilyName = False
    self.inTitle = False
    self.inLocation = False

  def handle_starttag(self, tag, attrs):
    try:
      if tag == "li" and attrs[0][0] == "class" and ("vcard" in attrs[0][1]):
        self.personIndex += 1
        self.personArray.append(person_searchobj())
      if attrs[0][1] == "given-name" and self.personIndex >=0:
        self.inGivenName = True
      elif attrs[0][1] == "family-name" and self.personIndex >= 0:
        self.inFamilyName = True
      elif tag == "dd" and attrs[0][1] == "title" and self.personIndex >= 0:
        self.inTitle = True
      elif tag == "span" and attrs[0][1] == "location" and self.personIndex >= 0:
        self.inLocation = True
    except IndexError:
      pass

  def handle_endtag(self, tag):
    if tag == "span":
      self.inGivenName = False
      self.inFamilyName = False
      self.inLocation = False
    elif tag == "dd":
      self.inTitle = False

  def handle_data(self, data):
    if self.inGivenName:
      self.personArray[self.personIndex].givenName = data.strip()
    elif self.inFamilyName:
      self.personArray[self.personIndex].familyName = data.strip()
    elif self.inTitle:
      self.personArray[self.personIndex].title = data.strip()
    elif self.inLocation:
      self.personArray[self.personIndex].location = data.strip()

#for testing - use a file since this is just a parser
if __name__ == "__main__":
  import sys
  file = open ("test.htm")
  df = file.read()
  parser = LinkedinHTMLParser()
  parser.feed(df)
  print "================"
  for person in parser.personArray:
    print person.goog_printstring()
  file.close()

LinkedinPageGatherer.py – this is what should be called directly.

#!/usr/bin/python

import urllib
import urllib2
import sys
import time
import copy
import pickle
import math

from person_searchobj import person_searchobj
from LinkedinHTMLParser import LinkedinHTMLParser
from GoogleQueery import GoogleQueery

#TODO add a test function that tests the website format for easy diagnostics when HTML changes
#TODO use HTMLParser like a sane person
class LinkedinPageGatherer:
  """
  class that generates the initial LinkedIn queries using the company name
  as a search parameter. These search strings will be searched using google
  to obtain additional information (these limited initial search strings usually lack
  vital info like names)
  """
  def __init__(self, companyName, login, password, maxsearch=100,
               totalresultpercent=.7, maxskunk=100):
    """
    login and password are params for a valid linkedin account
    maxsearch is the number of results - linkedin limits unpaid accounts to 100
    totalresultpercent is the fraction of the total results this script will try to find
    maxskunk is the number of searches this class will attempt before giving up
    """
    #list of person_searchobj
    self.people_searchobj = []
    self.companyName = companyName
    self.login = login
    self.password = password
    self.fullurl = ("http://www.linkedin.com/search?search=&company="+companyName+
                    "&currentCompany=currentCompany", "&page_num=", "0")
    self.opener = self.linkedin_login()
    #for the smart_people_adder
    self.searchSpecific = []
    #can only look at 100 people at a time. Parameters used to narrow down queries
    self.total_results = self.get_num_results()
    self.maxsearch = maxsearch
    self.totalresultpercent = totalresultpercent
    #self.extraparameters = {"locationinfo" : [], "titleinfo" : [], "locationtitle" : [] }
    #extraparameters is a simple stack that adds keywords to restrict the search
    self.extraparameters = []
    #TODO can only look at 100 people at a time - like to narrow down queries
    #and auto grab more
    currrespercent = 0.0
    skunked = 0
    currurl = self.fullurl[0] + self.fullurl[1]
    extraparamindex = 0

    while currrespercent < self.totalresultpercent and skunked <= maxskunk:
      numresults = self.get_num_results(currurl)
      save_num = len(self.people_searchobj)

      print "-------"
      print "currurl", currurl
      print "percentage", currrespercent
      print "skunked", skunked
      print "numresults", numresults
      print "save_num", save_num

      for i in range (0, int(min(math.ceil(self.maxsearch/10), math.ceil(numresults/10)))):
        #function adds to self.people_searchobj
        print "currurl" + currurl + str(i)
        self.return_people_links(currurl + str(i))
      currrespercent = float(len(self.people_searchobj))/self.total_results
      if save_num == len(self.people_searchobj):
        skunked += 1
      for i in self.people_searchobj:
        pushTitles = [("title", gName) for gName in i.givenName.split()]
        #TODO this could be improved for more detailed results, etc, but keeping it simple for now
        pushKeywords = [("keywords", gName) for gName in i.givenName.split()]
        pushTotal = pushTitles[:] + pushKeywords[:]
        #append to extraparameters if unique
        self.push_search_parameters(pushTotal)
      print "parameters", self.extraparameters
      #get a new url to search for, if necessary
      #use the extra params in title, "keywords" parameters
      try:
        refineel = self.extraparameters[extraparamindex]
        extraparamindex += 1
        currurl = self.fullurl[0] + "&" + refineel[0] + "=" + refineel[1] + self.fullurl[1]
      except IndexError:
        break

  """
  #TODO: This idea is fine, but we should get names first to better distinguish people
  #also maybe should be moved
  def smart_people_adder(self):
    #we've already done a basic search, must do more
    if "basic" in self.searchSpecific:
  """
  def return_people_links(self, linkedinurl):
    req = urllib2.Request(linkedinurl)
    fd = self.opener.open(req)
    pagedata = ""
    while 1:
      data = fd.read(2056)
      pagedata = pagedata + data
      if not len(data):
        break
    #print pagedata
    self.parse_page(pagedata)

  def parse_page(self, page):
    thesePeople = LinkedinHTMLParser()
    thesePeople.feed(page)
    for newperson in thesePeople.personArray:
      unique = True
      for oldperson in self.people_searchobj:
        #if all these things match but they really are different people, they
        #will likely still be found as unique google results
        if (oldperson.givenName == newperson.givenName and
            oldperson.familyName == newperson.familyName and
            oldperson.title == newperson.title and
            oldperson.location == newperson.location):
              unique = False
              break
      if unique:
        self.people_searchobj.append(newperson)
  """
    print "======================="
    for person in self.people_searchobj:
      print person.goog_printstring()
  """

  #return the number of results, very breakable
  def get_num_results(self, url=None):
    #by default return total in company
    if url == None:
      fd = self.opener.open(self.fullurl[0] + "1")
    else:
      fd = self.opener.open(url)
    data = fd.read()
    fd.close()
    searchstr = '<p class="summary">'
    sindex = data.find(searchstr) + len(searchstr)
    eindex = data.find("</strong>", sindex)
    return(int(data[sindex:eindex].strip().strip("<strong>").replace(",", "").strip()))

  #returns an opener object that contains valid cookies
  def linkedin_login(self):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    urllib2.install_opener(opener)
    #login page
    fd = opener.open("https://www.linkedin.com/secure/login?trk=hb_signin")
    data = fd.read()
    fd.close()
    #csrf 'prevention' login value
    searchstr = """<input type="hidden" name="csrfToken" value="ajax:"""
    sindex = data.find(searchstr) + len(searchstr)
    eindex = data.find('"', sindex)
    params = urllib.urlencode(dict(csrfToken="ajax:-"+data[sindex:eindex],
                              session_key=self.login,
                              session_password=self.password,
                              session_login="Sign+In",
                              session_rikey=""))
    #need the second request to get the csrf stuff, initial cookies
    request = urllib2.Request("https://www.linkedin.com/secure/login")
    request.add_header("Host", "www.linkedin.com")
    request.add_header("Referer", "https://www.linkedin.com/secure/login?trk=hb_signin")
    time.sleep(1.5)
    fd = opener.open(request, params)
    data = fd.read()
    if '<div id="header" class="guest">' in data:
      print "Linkedin authentication failed. Please supply a valid linkedin account"
      sys.exit(1)
    else:
      print "Linkedin authentication Successful"
    fd.close()
    return opener

  def push_search_parameters(self, extraparam):
    uselesswords = [ "for", "the", "and", "at", "in"]
    for pm in extraparam:
      pm = (pm[0], pm[1].strip().lower())
      if (pm not in self.extraparameters) and (pm[1] not in uselesswords) and pm != None:
        self.extraparameters.append(pm)

class LinkedinTotalPageGather(LinkedinPageGatherer):
  """
  Overhead class that generates the person_searchobjs, using GoogleQueery
  """
  def __init__(self, companyName, login, password):
    LinkedinPageGatherer.__init__(self, companyName, login, password)
    extraPeople = []
    for person in self.people_searchobj:
      mgoogqueery = GoogleQueery(person.goog_printstring())
      #making the assumption that each pub url is a unique person
      count = 0
      for url in mgoogqueery.linkedinurl:
        #grab the real name from the url
        begindex = url.find("/pub/") + 5
        endindex = url.find("/", begindex)
        if count == 0:
          person.url = url
          person.name = url[begindex:endindex]
        else:
          extraObj = copy.deepcopy(person)
          extraObj.url = url
          extraObj.name = url[begindex:endindex]
          extraPeople.append(extraObj)
        count += 1
      print person
    print "Extra People"
    for person in extraPeople:
      print person
      self.people_searchobj.append(person)

if __name__ == "__main__":
  #args are email and password for linkedin
  #note: company is not defined in this listing; set it to the target company name
  my = LinkedinTotalPageGather(company, sys.argv[1], sys.argv[2])

person_searchobj.py

#! /usr/bin/python

class person_searchobj():
  """this object is used for the google search and the final person object"""

  def __init__ (self, givenname="", familyname="", title="", organization="", location=""):
    """
    given name could be a title in this case, does not matter in terms of google
    but then may have to change for the final person object
    """
    #"name" is their actual name, unlike givenName and family name which are linkedin names
    self.name = ""
    self.givenName = givenname
    self.familyName = familyname
    self.title = title
    self.organization = organization
    self.location = location

    #this is retrieved by GoogleQueery
    self.url = ""

  def goog_printstring(self):
    """return the google print string used for queries"""
    retrstr = "site:linkedin.com "
    for i in  [self.givenName, self.familyName, self.title, self.organization, self.location]:
      if i != "":
        retrstr += '"' + i +'" '
    return retrstr

  def __repr__(self):
    """Overload __repr__ for easy printing. Mostly for debugging"""
    return (self.name + "\n" +
            "------\n" +
            "GivenName: " + self.givenName + "\n" +
            "familyName:" + self.familyName + "\n" +
            "Title:" + self.title + "\n" +
            "Organization:" + self.organization + "\n" +
            "Location:" + self.location + "\n" +
            "URL:" + self.url + "\n\n")

email_spider

This was a small part of a project that was itself about 1/3 of my graduate project. I used it to collect certain information. Here is the excerpt from the paper.

Website Email Spider Program

In order to automatically process publicly available email addresses, a simple tool was developed, with source code available in Appendix A. An automated tool is able to process web pages in a way that is less error prone than manual methods, and it also makes processing the sheer number of websites possible (or at least less tedious).
This tool begins at a few root pages, which can be comma delimited. From these, it searches for all unique links by keeping track of a queue so that pages are not usually revisited (although revisiting a page is still possible in case the server is case insensitive or equivalent pages are dynamically generated with unique URLs). In addition, the base class is passed a website scope so that pages outside of that scope are not spidered. By default, the scope is simply a regular expression including the top domain name of the organization.
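
The queue and scope logic described above boils down to a breadth-first walk. Here is a minimal Python 3 sketch of that idea, separate from the actual source below. The function `crawl` and its `get_links` callback are hypothetical names, and the callback stands in for fetching and parsing a page.

```python
import re
from collections import deque


def crawl(roots, get_links, scope, max_pages=10000):
    """Breadth-first walk from the root pages, staying inside scope.

    get_links(url) returns the links found on a page; scope is a regular
    expression that a URL must match to be followed.
    """
    in_scope = re.compile(scope)
    queue = deque(roots)
    seen = set(roots)    # every URL ever queued, so none is queued twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen and in_scope.search(link):
                seen.add(link)
                queue.append(link)
    return visited
```

Comma-delimited roots would be passed as `crawl(arg.split(","), ...)`. Note that `seen` compares URLs as exact strings, which is why equivalent pages with different URLs can still be visited more than once.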

Each page requested searches the contents for the following regular expression to identify common email formats:

[\w_.-]{3,}@[\w_.-]{6,}

The 3 and 6 repeaters were necessary because of false positives otherwise obtained due to various encodings. This regular expression will not obtain all email addresses. However, it will obtain the most common addresses with a minimum of false positives. In addition, the obtained email addresses are run against a blacklist of uninteresting generic form addresses (such as help@example.com, info@example.com, or sales@example.com).
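
As a rough illustration of the matching and blacklist steps described above, here is a short Python 3 sketch. The name `extract_emails`, the exact set of generic names, and the trimming of a trailing period are my own choices for the example and are not taken from the tool.

```python
import re

# at least 3 characters before the @ and at least 6 after it
EMAIL_RE = re.compile(r"[\w.-]{3,}@[\w.-]{6,}")
GENERIC_LOCAL_PARTS = {"help", "info", "sales", "webmaster", "contact"}


def extract_emails(text, seen=None):
    """Return addresses in text not already in seen, skipping generic ones."""
    seen = [] if seen is None else seen
    for address in EMAIL_RE.findall(text):
        address = address.rstrip(".")   # drop sentence-ending punctuation
        local_part = address.split("@", 1)[0].lower()
        if local_part not in GENERIC_LOCAL_PARTS and address not in seen:
            seen.append(address)
    return seen
```

Passing the same `seen` list for every page keeps the collected addresses unique across the whole crawl.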

These email addresses are saved in memory and reported when the program completes or is interrupted. Note that because of the dynamic nature of some pages, the spider can potentially run indefinitely and must be interrupted (for example, a calendar application that uses links to go back in time indefinitely). Most emails seemed to be obtained in the first 1,000 pages crawled, so a limit of 10,000 pages was chosen as a reasonable scope. Although this limit was reached several times, the spider uses a breadth-first search, and it was observed that most unique addresses were obtained early in the spidering process; extending the number of pages tended to have a diminishing return. Despite this, websites with more pages also tended to return more email addresses (see analysis section).

Much of the logic in the spidering tool is dedicated to correctly parsing html. By their nature, web pages vary widely with links, with many sites using a mix of directory traversal, absolute URLs, and partial URLs. It is no surprise there are so many security vulnerabilities related to browsers parsing this complex data.
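
For comparison with the hand-written link handling in the source, here is how the same mix of relative, absolute, and protocol-relative links could be resolved with the Python 3 standard library. This is an alternative sketch, not what the tool does; `resolve_link` is a hypothetical name.

```python
from urllib.parse import urldefrag, urljoin


def resolve_link(page_url, href):
    """Turn an href found on page_url into an absolute http(s) URL, or None."""
    href = href.strip()
    if not href or href.startswith("#"):
        return None                      # same-page anchor, nothing to fetch
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    if not absolute.startswith(("http://", "https://")):
        return None                      # mailto:, javascript:, ftp:, ...
    return absolute
```

For a page at `http://example.com/dir/sub/page.html`, the href `../other.html` resolves to `http://example.com/dir/other.html` and `//cdn.example.com/x` resolves to `http://cdn.example.com/x`.
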
There is also an effort made to make the software somewhat more efficient by ignoring superfluous links to objects such as documents, executables, etc. Although an exception handler catches the processing error if such a file is encountered, fetching these files still consumes resources.
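
The extension check can be written more compactly than the long chain of `endswith` calls in the source. The following Python 3 sketch shows one way; `worth_fetching` is a hypothetical name and the set lists only a subset of the extensions the tool skips.

```python
import posixpath
from urllib.parse import urlparse

SKIPPED_EXTENSIONS = {
    ".pdf", ".exe", ".doc", ".docx", ".jpg", ".jpeg", ".png", ".css",
    ".gif", ".mp3", ".mp4", ".mov", ".avi", ".zip", ".gz", ".tar",
    ".xls", ".xlsx",
}


def worth_fetching(url):
    """Return False for links whose path ends in a skipped file extension."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension not in SKIPPED_EXTENSIONS
```

Lowercasing the extension removes the need to list `.JPG` and `.jpg` separately, and parsing the URL first means a query string after the file name does not hide the extension.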

Using this tool is straightforward, but a certain familiarity is expected – it was not developed for an end user but for this specific experiment. For example, a URL is best processed in the format http://example.com/ since in its current state it would use example.com to verify that spidered addresses are within a reasonable scope. It prints debugging messages constantly because every site seemed to have unique parsing quirks. Although other formats and usages may work, there was little effort to make this software easy to use.

Here is the source.
#!/usr/bin/python

import HTMLParser
import urllib2
import re
import sys
import signal
import socket

socket.setdefaulttimeout(20)

#spider is meant for a single url
#proto can be http, https, or any
class PageSpider(HTMLParser.HTMLParser):
  def __init__(self, url, scope, searchList=[], emailList=[], errorDict={}):
    HTMLParser.HTMLParser.__init__(self)
    self.url = url
    self.scope = scope
    self.searchList = searchList
    self.emailList = emailList
    try:
      urlre = re.search(r"(\w+):[/]+([^/]+).*", self.url)
      self.baseurl = urlre.group(2)
      self.proto = urlre.group(1)
    except AttributeError:
      raise Exception("URLFormat", "URL passed is invalid")
    if self.scope == None:
      self.scope = self.baseurl
    try:
      req = urllib2.urlopen(self.url)
      htmlstuff = req.read()
    except KeyboardInterrupt:
      raise
    except urllib2.HTTPError:
      #not able to fetch a url eg 404
      errorDict["link"] += 1
      print "Warning: link error"
      return
    except urllib2.URLError:
      errorDict["link"] += 1
      print "Warning: URLError"
      return
    except ValueError:
      errorDict["link"] += 1
      print "Warning link error"
      return
    except:
      print "Unknown Error", self.url
      errorDict["link"] += 1
      return
    emailre = re.compile(r"[\w_.-]{3,}@[\w_.-]{2,}\.[\w_.-]{2,}")
    nemail = re.findall(emailre, htmlstuff)
    for i in nemail:
      if i not in self.emailList:
        self.emailList.append(i)
    try:
      self.feed(htmlstuff)
    except HTMLParser.HTMLParseError:
      errorDict["parse"] += 1
      print "Warning: HTML Parse Error"
      pass
    except UnicodeDecodeError:
      errorDict["decoding"] += 1
      print "Warning: Unicode Decode Error"
      pass
  def handle_starttag(self, tag, attrs):
    if (tag == "a" or tag =="link") and attrs:
      #process the url formats, make sure the base is in scope
      for k, v in attrs:
        #check it's an href and that it's within scope
        if  (k == "href" and
            ((("http" in v) and (re.search(self.scope, v))) or
            ("http" not in v)) and
            (not (v.endswith(".pdf") or v.endswith(".exe") or
             v.endswith(".doc") or v.endswith(".docx") or
             v.endswith(".jpg") or v.endswith(".jpeg") or
             v.endswith(".png") or v.endswith(".css") or
             v.endswith(".gif") or v.endswith(".GIF") or
             v.endswith(".mp3") or v.endswith(".mp4") or
             v.endswith(".mov") or v.endswith(".MOV") or
             v.endswith(".avi") or v.endswith(".flv") or
             v.endswith(".wmv") or v.endswith(".wav") or
             v.endswith(".ogg") or v.endswith(".odt") or
             v.endswith(".zip") or v.endswith(".gz") or
             v.endswith(".bz") or v.endswith(".tar") or
             v.endswith(".xls") or v.endswith(".xlsx") or
             v.endswith(".qt") or v.endswith(".divx") or
             v.endswith(".JPG") or v.endswith(".JPEG")))):
          #Also todo - modify regex so that >= 3 chars in front >= 7 chars in back
          url = self.urlProcess(v)
          #TODO 10000 is completely arbitrary
          if (url not in self.searchList) and (url != None) and len(self.searchList) < 10000:
            self.searchList.append(url)
  #returns complete url in the form http://stuff/bleh
  #as input handles (./url, http://stuff/bleh/url, //stuff/bleh/url)
  def urlProcess(self, link):
    link = link.strip()
    if "http" in link:
      return (link)
    elif link.startswith("//"):
      return self.proto + "://" + link[2:]
    elif link.startswith("/"):
      return self.proto + "://" + self.baseurl + link
    elif link.startswith("#"):
      return None
    elif ":" not in link and " " not in link:
      while link.startswith("../"):
        link = link[3:]
        #TODO [8:-1] is just a heuristic, but too many misses shouldn't be bad... maybe?
        if self.url.endswith("/") and ("/" in self.url[8:-1]):
          self.url = self.url[:self.url.rfind("/", 0, -1)] + "/"
      dir = self.url[:self.url.rfind("/")] + "/"
      return dir + link
    return None

class SiteSpider:
  def __init__(self, searchList, scope=None, verbocity=True, maxDepth=4):
    #TODO maxDepth logic
    #necessary to add to this list to avoid infinite loops
    self.searchList = searchList
    self.emailList = []
    self.errors = {"decoding":0, "link":0, "parse":0, "connection":0, "unknown":0}
    if scope == None:
      try:
        urlre = re.search(r"(\w+):[/]+([^/]+).*", self.searchList[0])
        self.scope = urlre.group(2)
      except AttributeError:
        raise Exception("URLFormat", "URL passed is invalid")
    else:
      self.scope = scope
    index = 0
    threshhold = 0
    while 1:
      try:
        PageSpider(self.searchList[index], self.scope, self.searchList, self.emailList, self.errors)
        if verbocity:
          print self.searchList[index]
          print " Total Emails:", len(self.emailList)
          print " Pages Processed:", index
          print " Pages Found:", len(self.searchList)
        index += 1
      except IndexError:
        break
      except KeyboardInterrupt:
        break
      except:
        threshhold += 1
        print "Warning: unknown error"
        self.errors["unknown"] += 1
        if threshhold >= 40:
          break
        pass
    garbageEmails =   [ "help",
                        "webmaster",
                        "contact",
                        "sales" ]
    print "REPORT"
    print "----------"
    for email in self.emailList:
      #compare only the part before the @ against the generic names
      if email.split("@")[0].lower() not in garbageEmails:
        print email
    print "\nTotal Emails:", len(self.emailList)
    print "Pages Processed:", index
    print "Errors:", self.errors

if __name__ == "__main__":
  SiteSpider(sys.argv[1].split(","))
