email_spider

This was a small part of a project that was itself about 1/3 of my graduate project. I used it to collect certain information. Here is the relevant excerpt from the paper.

Website Email Spider Program

In order to automatically collect publicly available email addresses, a simple tool was developed, with source code available in Appendix A. An automated tool can process web pages in a way that is less error prone than manual methods, and it also makes processing the sheer number of websites feasible (or at least less tedious).
This tool begins at one or more root pages, which can be supplied as a comma-delimited list. From these, it searches for all unique links, keeping track of a queue so that pages are not usually revisited (although revisiting a page is still possible if the server is case insensitive or if equivalent pages are dynamically generated with unique URLs). In addition, the base class is passed a website scope so that pages outside of that scope are not spidered. By default, the scope is simply a regular expression containing the domain name of the organization.
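
As a minimal sketch of the crawl loop (not the actual implementation, which follows below): the queue doubles as the list of pages already seen, and only unseen, in-scope links are appended. The extract_links() helper here is hypothetical and stands in for the HTML parsing done by the real PageSpider class.

#minimal sketch of the breadth-first crawl; extract_links() is a hypothetical helper
import re

def crawl(roots, scope_regex, page_limit=10000):
  queue = list(roots)      #the queue doubles as the "already seen" list
  index = 0
  while index < len(queue) and index < page_limit:
    for link in extract_links(queue[index]):    #hypothetical link extractor
      if link not in queue and re.search(scope_regex, link):
        queue.append(link)                      #only unseen, in-scope links
    index += 1
  return queue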

For each page requested, the tool searches the contents for the following regular expression to identify common email formats:

[\w_.-]{3,}@[\w_.-]{6,}

The 3 and 6 repetition counts were necessary to avoid false positives otherwise obtained due to various encodings. This regular expression will not obtain all email addresses, but it will obtain the most common formats with a minimum of false positives. In addition, the obtained email addresses are run against a blacklist of uninteresting generic addresses (such as help@example.com, info@example.com, or sales@example.com).
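
As an illustration of the pattern and the blacklist check (this snippet is not part of the tool, and the addresses in it are made up):

#illustration only: made-up sample text run against the email pattern
import re

sample = "Contact jdoe@example.com or sales@example.com for details"
emailre = re.compile(r"[\w_.-]{3,}@[\w_.-]{6,}")
blacklist = ["help", "info", "sales"]
found = []
for match in re.findall(emailre, sample):
  if match not in found and match.split("@")[0].lower() not in blacklist:
    found.append(match)
print found   #['jdoe@example.com']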

These email addresses are saved in memory and reported when the program completes or is interrupted. Note that, because of the dynamic nature of some pages, the spider can potentially run forever and must be interrupted (for example, a calendar application that uses links to go back in time indefinitely). Most email addresses seemed to be obtained in the first 1,000 pages crawled, and a limit of 10,000 pages was chosen as a reasonable scope; this limit was reached several times. The spider uses a breadth-first search, and it was observed that most unique addresses were obtained early in the spidering process, with extending the number of pages giving diminishing returns. Despite this, websites with more pages also tended to return more email addresses (see analysis section).

Much of the logic in the spidering tool is dedicated to correctly parsing HTML. By their nature, web pages vary widely in how they express links, with many sites using a mix of directory traversal, absolute URLs, and partial URLs. It is no surprise there are so many security vulnerabilities related to browsers parsing this complex data.
There is also an effort to make the software somewhat more efficient by ignoring superfluous links to objects such as documents, executables, and so on. Although an exception will catch the processing error if such a file is encountered, fetching these files still consumes resources.
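
The skip list in the source below is a long chain of endswith() checks. The same effect could be expressed more compactly with a set of extensions; this is only a sketch of an alternative, not the code that was used:

#sketch of an alternative to the endswith() chain in handle_starttag()
import os

SKIP_EXTENSIONS = set([
  ".pdf", ".exe", ".doc", ".docx", ".jpg", ".jpeg", ".png", ".css",
  ".gif", ".mp3", ".mp4", ".mov", ".avi", ".flv", ".wmv", ".wav",
  ".ogg", ".odt", ".zip", ".gz", ".bz", ".tar", ".xls", ".xlsx",
  ".qt", ".divx"])

def is_binary_link(href):
  #case-insensitive, so .JPG and .MOV are covered without extra entries
  return os.path.splitext(href)[1].lower() in SKIP_EXTENSIONS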

Using this tool is straightforward, but a certain familiarity is expected: it was not developed for an end user but for this specific experiment. For example, a URL is best passed in the format http://example.com/, since in its current state the tool uses example.com to verify that spidered addresses are within a reasonable scope. It prints debugging messages constantly because every site seemed to have unique parsing quirks. Although other formats and usages may work, there was little effort to make this software easy to use.
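
For example, assuming the source below is saved as email_spider.py (the filename is not fixed anywhere in the code), a run looks like this:

python email_spider.py http://example.com/,http://www.example.com/

The comma-delimited argument is split and handed to SiteSpider, which prints its report when the crawl finishes or is interrupted with Ctrl-C.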

Here is the source.
#!/usr/bin/python

import HTMLParser
import urllib2
import re
import sys
import signal
import socket

socket.setdefaulttimeout(20)

#spider is meant for a single url
#proto can be http, https, or any
class PageSpider(HTMLParser.HTMLParser):
  def __init__(self, url, scope, searchList=[], emailList=[], errorDict={}):
    HTMLParser.HTMLParser.__init__(self)
    self.url = url
    self.scope = scope
    self.searchList = searchList
    self.emailList = emailList
    try:
      urlre = re.search(r"(w+):[/]+([^/]+).*", self.url)
      self.baseurl = urlre.group(2)
      self.proto = urlre.group(1)
    except AttributeError:
      raise Exception("URLFormat", "URL passed is invalid")
    if self.scope == None:
      self.scope = self.baseurl
    try:
      req = urllib2.urlopen(self.url)
      htmlstuff = req.read()
    except KeyboardInterrupt:
      raise
    except urllib2.HTTPError:
      #not able to fetch a url eg 404
      errorDict["link"] += 1
      print "Warning: link error"
      return
    except urllib2.URLError:
      errorDict["link"] += 1
      print "Warning: URLError"
      return
    except ValueError:
      errorDict["link"] += 1
      print "Warning link error"
      return
    except:
      print "Unknown Error", self.url
      errorDict["link"] += 1
      return
    emailre = re.compile(r"[\w_.-]{3,}@[\w_.-]{2,}\.[\w_.-]{2,}")
    nemail = re.findall(emailre, htmlstuff)
    for i in nemail:
      if i not in self.emailList:
        self.emailList.append(i)
    try:
      self.feed(htmlstuff)
    except HTMLParser.HTMLParseError:
      errorDict["parse"] += 1
      print "Warning: HTML Parse Error"
      pass
    except UnicodeDecodeError:
      errorDict["decoding"] += 1
      print "Warning: Unicode Decode Error"
      pass
  def handle_starttag(self, tag, attrs):
    if (tag == "a" or tag =="link") and attrs:
      #process the url formats, make sure the base is in scope
      for k, v in attrs:
        #check it's an href and that it's within scope
        if  (k == "href" and
            ((("http" in v) and (re.search(self.scope, v))) or
            ("http" not in v)) and
            (not (v.endswith(".pdf") or v.endswith(".exe") or
             v.endswith(".doc") or v.endswith(".docx") or
             v.endswith(".jpg") or v.endswith(".jpeg") or
             v.endswith(".png") or v.endswith(".css") or
             v.endswith(".gif") or v.endswith(".GIF") or
             v.endswith(".mp3") or v.endswith(".mp4") or
             v.endswith(".mov") or v.endswith(".MOV") or
             v.endswith(".avi") or v.endswith(".flv") or
             v.endswith(".wmv") or v.endswith(".wav") or
             v.endswith(".ogg") or v.endswith(".odt") or
             v.endswith(".zip") or v.endswith(".gz") or
             v.endswith(".bz") or v.endswith(".tar") or
             v.endswith(".xls") or v.endswith(".xlsx") or
             v.endswith(".qt") or v.endswith(".divx") or
             v.endswith(".JPG") or v.endswith(".JPEG")))):
          #Also todo - modify regex so that >= 3 chars in front >= 7 chars in back
          url = self.urlProcess(v)
          #TODO 10000 is completely arbitrary
          if (url not in self.searchList) and (url != None) and len(self.searchList) < 10000:
            self.searchList.append(url)
  #returns complete url in the form http://stuff/bleh
  #as input handles (./url, http://stuff/bleh/url, //stuff/bleh/url)
  def urlProcess(self, link):
    link = link.strip()
    if "http" in link:
      return (link)
    elif link.startswith("//"):
      return self.proto + "://" + link[2:]
    elif link.startswith("/"):
      return self.proto + "://" + self.baseurl + link
    elif link.startswith("#"):
      return None
    elif ":" not in link and " " not in link:
      while link.startswith("../"):
        link = link[3:]
        #TODO [8:-1] is just a heuristic, but too many misses shouldn't be bad... maybe?
        if self.url.endswith("/") and ("/" in self.url[8:-1]):
          self.url = self.url[:self.url.rfind("/", 0, -1)] + "/"
      dir = self.url[:self.url.rfind("/")] + "/"
      return dir + link
    return None

class SiteSpider:
  def __init__(self, searchList, scope=None, verbosity=True, maxDepth=4):
    #TODO maxDepth logic
    #necessary to add to this list to avoid infinite loops
    self.searchList = searchList
    self.emailList = []
    self.errors = {"decoding":0, "link":0, "parse":0, "connection":0, "unknown":0}
    if scope == None:
      try:
        urlre = re.search(r"(w+):[/]+([^/]+).*", self.searchList[0])
        self.scope = urlre.group(2)
      except AttributeError:
        raise Exception("URLFormat", "URL passed is invalid")
    else:
      self.scope = scope
    index = 0
    threshold = 0
    while 1:
      try:
        PageSpider(self.searchList[index], self.scope, self.searchList, self.emailList, self.errors)
        if verbosity:
          print self.searchList[index]
          print " Total Emails:", len(self.emailList)
          print " Pages Processed:", index
          print " Pages Found:", len(self.searchList)
        index += 1
      except IndexError:
        break
      except KeyboardInterrupt:
        break
      except:
        threshold += 1
        print "Warning: unknown error"
        self.errors["unknown"] += 1
        if threshold >= 40:
          break
        pass
    garbageEmails =   [ "help",
                        "webmaster",
                        "contact",
                        "sales" ]
    print "REPORT"
    print "----------"
    for email in self.emailList:
      #compare the local part of the address against the blacklist
      if email.split("@")[0].lower() not in garbageEmails:
        print email
    print "nTotal Emails:", len(self.emailList)
    print "Pages Processed:", index
    print "Errors:", self.errors

if __name__ == "__main__":
  SiteSpider(sys.argv[1].split(","))
