email_spider
August 13, 2010
This was a small part of a project that itself made up about a third of my graduate project. I used it to collect certain information. Here is the relevant excerpt from the paper.
Website Email Spider Program
In order to automatically process publicly available email addresses, a simple tool was developed, with source code available in Appendix A. An automated tool can process web pages in a way that is less error-prone than manual methods, and it makes processing the sheer number of websites possible (or at least less tedious).
This tool begins at a few root pages, which can be comma-delimited. From these, it searches for all unique links, keeping track of a queue so that pages are not normally revisited (revisiting a page is still possible if the server is case-insensitive or if equivalent pages are generated dynamically under unique URLs). In addition, the base class is passed a website scope so that pages outside of that scope are not spidered. By default, the scope is simply a regular expression containing the top domain name of the organization.
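As a rough sketch of that approach (illustrative only; the helper logic and page limit here are placeholders, not the code from the listing below):

#Minimal sketch of the queue-based, scope-limited crawl described above.
import re
import urllib2

def crawl(roots, scope, max_pages=1000):
    queue = list(roots)   #breadth-first queue of pages to visit
    seen = set(queue)     #everything ever queued, so pages are not revisited
    visited = 0
    while queue and visited < max_pages:
        url = queue.pop(0)
        visited += 1
        try:
            html = urllib2.urlopen(url).read()
        except Exception:
            continue      #skip pages that cannot be fetched
        #naive href extraction; the real tool uses HTMLParser instead
        for link in re.findall(r'href="(http[^"]+)"', html):
            if re.search(scope, link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

#e.g. crawl(["http://example.com/"], scope=r"example\.com")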
For each page requested, the contents are searched with the following regular expression to identify common email formats:
[\w_.-]{3,}@[\w_.-]{6,}
The {3,} and {6,} repetition counts were necessary to avoid false positives that were otherwise obtained due to various encodings. This regular expression will not catch every email address, but it does find the most common formats with a minimum of false positives. In addition, the obtained addresses are run against a blacklist of uninteresting generic form addresses (such as help@example.com, info@example.com, or sales@example.com).
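A condensed sketch of that extraction and filtering step (the blacklist entries here are just the examples above, not the full list used):

#Sketch of address extraction plus the generic-address blacklist.
import re

EMAIL_RE = re.compile(r"[\w_.-]{3,}@[\w_.-]{6,}")
BLACKLIST = ("help", "info", "sales")   #generic form addresses to ignore

def extract_emails(html, found):
    for addr in EMAIL_RE.findall(html):
        if addr.split("@")[0].lower() in BLACKLIST:
            continue          #drop help@, info@, sales@ and the like
        if addr not in found:
            found.append(addr)
    return found

#extract_emails('<a href="mailto:jdoe@example.com">mail</a>', []) -> ['jdoe@example.com']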
These email addresses are saved in memory and reported when the program completes or is interrupted. Note that because of the dynamic nature of some pages, the spider can potentially run forever and must be interrupted (for example, on a calendar application that uses links to go back in time indefinitely). Most email addresses seemed to be obtained within the first 1,000 pages crawled, and a limit of 10,000 pages was chosen as a reasonable scope. Although this limit was reached several times, the spider uses a breadth-first search, and it was observed that most unique addresses were obtained early in the spidering process; extending the number of pages tended to have diminishing returns. Despite this, websites with more pages also tended to correlate with more email addresses returned (see the analysis section).
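The interrupt-then-report behaviour amounts to little more than catching Ctrl-C around the crawl loop, roughly as follows (the loop body is elided here; the real driver is the SiteSpider class in the listing below):

#Sketch: an interrupted crawl still falls through to the report.
searchList = ["http://example.com/"]   #assumed starting point
emailList = []
index = 0
try:
    while index < len(searchList) and index < 10000:
        #...fetch searchList[index], collect addresses, append new links...
        index += 1
except KeyboardInterrupt:
    pass                               #Ctrl-C stops spidering but keeps results
print "Pages Processed:", index
print "Total Emails:", len(emailList)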
Much of the logic in the spidering tool is dedicated to correctly parsing HTML. Web pages vary widely in how they express links, with many sites using a mix of directory traversal, absolute URLs, and partial URLs. It is no surprise that there are so many security vulnerabilities related to browsers parsing this complex data.
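The listing below handles these cases by hand in urlProcess(); the same link forms can also be normalized with the standard library, as this small demonstration shows (the URLs are made up):

#The link formats the spider has to cope with, resolved here with urlparse.
from urlparse import urljoin    #urllib.parse.urljoin on Python 3

base = "http://example.com/dir/page.html"
print urljoin(base, "http://other.example.org/x")  #absolute URL, unchanged
print urljoin(base, "//example.com/a")             #protocol-relative
print urljoin(base, "/top.html")                   #rooted at the site
print urljoin(base, "../up.html")                  #directory traversal
print urljoin(base, "sibling.html")                #partial (relative) URL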
There is also an effort to make the software somewhat more efficient by ignoring superfluous links to objects such as documents, executables, and media files. Although an exception handler catches the processing error if such a file is encountered anyway, downloading these files still consumes resources.
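The listing below does this with a long chain of endswith() checks; the same filter could be expressed more compactly, for example (extension list abbreviated):

#Compact form of the extension filter; endswith() accepts a tuple of suffixes.
SKIP_EXTENSIONS = (".pdf", ".exe", ".doc", ".jpg", ".png", ".zip", ".mp3")

def is_spiderable(link):
    return not link.lower().endswith(SKIP_EXTENSIONS)

#is_spiderable("report.PDF") -> False; is_spiderable("index.html") -> True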
Using this tool is straightforward, but a certain familiarity is expected; it was not developed for an end user but for this specific experiment. For example, a URL is best given in the format http://example.com/, since in its current state the tool uses example.com to verify that spidered pages are within a reasonable scope. It prints debugging messages constantly because every site seemed to have unique parsing quirks. Although other formats and usages may work, there was little effort to make this software easy to use.
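For reference, the program takes a comma-delimited list of root URLs as its first argument, so a typical invocation (assuming the listing below is saved as email_spider.py; the file name is not given here) looks like:

python email_spider.py http://example.com/
python email_spider.py http://example.com/,http://www.example.com/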
#!/usr/bin/python
import HTMLParser
import urllib2
import re
import sys
import signal
import socket

socket.setdefaulttimeout(20)

#spider is meant for a single url
#proto can be http, https, or any
class PageSpider(HTMLParser.HTMLParser):
    def __init__(self, url, scope, searchList=[], emailList=[], errorDict={}):
        HTMLParser.HTMLParser.__init__(self)
        self.url = url
        self.scope = scope
        self.searchList = searchList
        self.emailList = emailList
        try:
            urlre = re.search(r"(\w+):[/]+([^/]+).*", self.url)
            self.baseurl = urlre.group(2)
            self.proto = urlre.group(1)
        except AttributeError:
            raise Exception("URLFormat", "URL passed is invalid")
        if self.scope == None:
            self.scope = self.baseurl
        try:
            req = urllib2.urlopen(self.url)
            htmlstuff = req.read()
        except KeyboardInterrupt:
            raise
        except urllib2.HTTPError:
            #not able to fetch a url eg 404
            errorDict["link"] += 1
            print "Warning: link error"
            return
        except urllib2.URLError:
            errorDict["link"] += 1
            print "Warning: URLError"
            return
        except ValueError:
            errorDict["link"] += 1
            print "Warning: link error"
            return
        except:
            print "Unknown Error", self.url
            errorDict["link"] += 1
            return
        #extract anything that looks like an email address from the raw page
        emailre = re.compile(r"[\w_.-]{3,}@[\w_.-]{2,}\.[\w_.-]{2,}")
        nemail = re.findall(emailre, htmlstuff)
        for i in nemail:
            if i not in self.emailList:
                self.emailList.append(i)
        try:
            self.feed(htmlstuff)
        except HTMLParser.HTMLParseError:
            errorDict["parse"] += 1
            print "Warning: HTML Parse Error"
        except UnicodeDecodeError:
            errorDict["decoding"] += 1
            print "Warning: Unicode Decode Error"

    def handle_starttag(self, tag, attrs):
        if (tag == "a" or tag == "link") and attrs:
            #process the url formats, make sure the base is in scope
            for k, v in attrs:
                #check it's an href and that it's within scope
                if (k == "href" and
                        ((("http" in v) and (re.search(self.scope, v))) or ("http" not in v)) and
                        (not (v.endswith(".pdf") or v.endswith(".exe") or v.endswith(".doc") or
                              v.endswith(".docx") or v.endswith(".jpg") or v.endswith(".jpeg") or
                              v.endswith(".png") or v.endswith(".css") or v.endswith(".gif") or
                              v.endswith(".GIF") or v.endswith(".mp3") or v.endswith(".mp4") or
                              v.endswith(".mov") or v.endswith(".MOV") or v.endswith(".avi") or
                              v.endswith(".flv") or v.endswith(".wmv") or v.endswith(".wav") or
                              v.endswith(".ogg") or v.endswith(".odt") or v.endswith(".zip") or
                              v.endswith(".gz") or v.endswith(".bz") or v.endswith(".tar") or
                              v.endswith(".xls") or v.endswith(".xlsx") or v.endswith(".qt") or
                              v.endswith(".divx") or v.endswith(".JPG") or v.endswith(".JPEG")))):
                    #Also todo - modify regex so that >= 3 chars in front >= 7 chars in back
                    url = self.urlProcess(v)
                    #TODO 10000 is completely arbitrary
                    if (url not in self.searchList) and (url != None) and len(self.searchList) < 10000:
                        self.searchList.append(url)

    #returns complete url in the form http://stuff/bleh
    #as input handles (./url, http://stuff/bleh/url, //stuff/bleh/url)
    def urlProcess(self, link):
        link = link.strip()
        if "http" in link:
            return (link)
        elif link.startswith("//"):
            return self.proto + "://" + link[2:]
        elif link.startswith("/"):
            return self.proto + "://" + self.baseurl + link
        elif link.startswith("#"):
            return None
        elif ":" not in link and " " not in link:
            while link.startswith("../"):
                link = link[3:]
            #TODO [8:-1] is just a heuristic, but too many misses shouldn't be bad... maybe?
            if self.url.endswith("/") and ("/" in self.url[8:-1]):
                self.url = self.url[:self.url.rfind("/", 0, -1)] + "/"
            dir = self.url[:self.url.rfind("/")] + "/"
            return dir + link
        return None

class SiteSpider:
    def __init__(self, searchList, scope=None, verbocity=True, maxDepth=4):
        #TODO maxDepth logic
        #necessary to add to this list to avoid infinite loops
        self.searchList = searchList
        self.emailList = []
        self.errors = {"decoding":0, "link":0, "parse":0, "connection":0, "unknown":0}
        if scope == None:
            try:
                urlre = re.search(r"(\w+):[/]+([^/]+).*", self.searchList[0])
                self.scope = urlre.group(2)
            except AttributeError:
                raise Exception("URLFormat", "URL passed is invalid")
        else:
            self.scope = scope
        index = 0
        threshhold = 0
        while 1:
            try:
                #PageSpider appends newly found links and addresses to the shared lists
                PageSpider(self.searchList[index], self.scope, self.searchList, self.emailList, self.errors)
                if verbocity:
                    print self.searchList[index]
                    print " Total Emails:", len(self.emailList)
                    print " Pages Processed:", index
                    print " Pages Found:", len(self.searchList)
                index += 1
            except IndexError:
                break
            except KeyboardInterrupt:
                break
            except:
                threshhold += 1
                print "Warning: unknown error"
                self.errors["unknown"] += 1
                if threshhold >= 40:
                    break
        garbageEmails = ["help", "webmaster", "contact", "sales"]
        print "REPORT"
        print "----------"
        for email in self.emailList:
            #compare the local part against the blacklist of generic addresses
            if email.split("@")[0].lower() not in garbageEmails:
                print email
        print "\nTotal Emails:", len(self.emailList)
        print "Pages Processed:", index
        print "Errors:", self.errors

if __name__ == "__main__":
    SiteSpider(sys.argv[1].split(","))