Linkedin Crawler

The following is also source used in the grad project. I’ll post the actual paper at some point, but here is the LinkedIn crawler portion with the applicable source. By its nature this code is breakable and may not work even at the time of posting, but it did work long enough for me to gather addresses, which was the point.

Usage is/was

LinkedinPageGatherer.py Linkedinusername Linkedinpassword

Following is an excerpt from the ‘paper’.

The HTMLParser libraries are more resilient to changes in the page source. Both the HTMLParser and lxml libraries provide code for processing broken HTML; the HTMLParser libraries were chosen as the more appropriate for these problems [lxml][htmlparsing].
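To give a feel for this, here is a toy example (the markup is made up, not LinkedIn’s) showing the HTMLParser event model carrying on past tags that are never closed:

#toy example: HTMLParser tolerating unclosed tags (hypothetical markup)
import HTMLParser

class LinkPrinter(HTMLParser.HTMLParser):
  def handle_starttag(self, tag, attrs):
    #attrs arrives as a list of (name, value) pairs
    if tag == "a":
      print dict(attrs).get("href")

LinkPrinter().feed('<ul><li><a href="http://example.com">broken<li>markup')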

There has been an effort to keep all HTML-specific logic in debuggable places, so that if the generated HTML changes it is easy to modify the parsing code to reflect those changes (assuming equivalent information is still available). However, changes in the source are frequent, and the code has had to be modified roughly every three months to keep up with the HTML layout.

Unfortunately, although the functionality is simple, this program has grown to be much more complex due to roadblocks put in place by both LinkedIn and Google.

To search LinkedIn from the site itself, it is necessary to have a LinkedIn account. With an account it is possible to search with or without connections, although the search criteria differ depending on the type of account. Because of this, one requirement for searching LinkedIn is cookie management, which has to be written to keep track of the HTTP session. In addition, LinkedIn uses a nonce POST parameter on each page that must be retrieved and POSTed with every page submission. Because of the nonce, it is also necessary to log in at the login page, save the nonce and the cookie, and proceed to search along the same path an actual user would.
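A minimal sketch of that flow, assuming the field names used in linkedin_login() further down (the real implementation there also handles failure detection and timing):

#sketch only: session cookies plus the hidden token scraped from the login page
#(field names taken from linkedin_login() in LinkedinPageGatherer.py below)
import urllib
import urllib2

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())  #cookie jar tracks the session
page = opener.open("https://www.linkedin.com/secure/login?trk=hb_signin").read()
token = page.split('name="csrfToken" value="')[1].split('"')[0]  #the per-page nonce
params = urllib.urlencode({"csrfToken": token,
                           "session_key": "user@example.com",
                           "session_password": "secret"})
opener.open("https://www.linkedin.com/secure/login", params)  #same opener carries the cookie forward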

Once the tool is able to search for companies, there is an additional limitation: with a free account, a search displays only the first 100 results. This is inconvenient, as the desired number of actual connections is often much larger. The tool I’ve written takes various criteria (such as location, title, etc.) and performs multiple, more specific searches of up to 100 results each, harvesting extra information at each search to use for later searches. With each more specific search, the tool inserts unique items into a list of users. When the initial search runs, LinkedIn reports the total number of results (although it only lets the account view 100 at a time), so the tool uses this total as one stopping condition: it stops when a percentage of that number has been reached or when a certain number of searches have turned up nothing new.

This is easier to illustrate with an example. In the case of FPL, there are over 2000 results. However, it can be asserted that at least one of the results is from a certain Miami address. Using this as a search restriction may reduce the total to 500 results, the first 100 of which can be inserted. It can also be asserted that at least one result from the Miami address is a project manager. Using this restriction, there are only 5 results, each of which supplies new criteria for further advanced searches. With this iterative approach it is possible to gather most of the 2000. In the program I have written this functionality is still experimental, and the parameters must be adjusted by hand.
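In outline, the refinement loop amounts to something like the sketch below, where run_search and refine are stand-ins for the real LinkedIn query and parameter-pushing code in LinkedinPageGatherer.__init__:

#simplified sketch of the narrowing loop; run_search and refine are placeholders
#for the real LinkedIn query and parameter-pushing code further down
def gather(base_search, run_search, refine, total_results, target=0.7, max_skunk=100):
  found, skunked = set(), 0
  queue = [base_search]                     #start with the plain company search
  while queue and len(found) < target * total_results and skunked <= max_skunk:
    results = run_search(queue.pop(0))      #at most 100 results come back per search
    before = len(found)
    found.update(results)
    if len(found) == before:
      skunked += 1                          #this search turned up nothing new
    queue.extend(refine(results))           #e.g. restrict by a location or title seen so far
  return found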

One additional difficulty with LinkedIn is that these results do not display a name, only a job title associated with the company. Obviously, this is not ideal. A name is necessary for even the most basic spear phishing attacks; an email sounds slightly awkward when addressed as “Dear Project Manager in the Cyber Security Group”. The solution I found for retrieving employee names is to use Google. Using specific Google queries built from the LinkedIn fields, it is possible to retrieve the name associated with a job, company, and job title.
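Concretely, the query handed to Google looks like the output of goog_printstring() shown later; for the FPL example it would be roughly the following (the values here are made up for illustration):

#illustrative query in the format goog_printstring() produces below (values made up)
query = 'site:linkedin.com "Project Manager" "FPL" "Miami, Florida"'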

Google has a use policy prohibiting automated crawlers. Because of this policy, it performs various checks on queries to verify that the client is a known, real browser; if it is not, Google returns a 403 status stating that the browser is not recognized. To circumvent this, a packet dump was performed on a valid browser. The code now sends headers exactly like an actual browser would, along with randomized time delays to mimic a person. In the long run it should be impossible for Google to tell the difference, since whatever checks they perform can be mimicked. The code includes several configurable browsers to masquerade as. Below is the code snippet including the default spoofed browser, Firefox running on Linux.

def getHeaders(self, browser="ubuntuFF"):
  #ubuntu firefox spoof
  if browser == "ubuntuFF":
    headers = {
      "Host": "www.google.com",
      "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5",
      "Accept" : "text/html,application/xhtml+xml,application xml;q=0.9,*/*;q=0.8",
      "Accept-Language" : "en-us,en;q=0.5",
      "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
      "Keep-Alive" : "300",
      "Proxy-Connection" : "keep-alive"
    }
...

Although both Google and LinkedIn make it difficult to automate information mining, their approach will fundamentally fail against a motivated adversary. Because these companies want to make the information available to users, that information can also be retrieved automatically. CAPTCHA technology has been one traditional countermeasure, though by its nature it suffers from similar design flaws.

The LinkedIn crawler program demonstrates the possibility of an attacker targeting a company to harvest people’s names, which can often be mapped to email addresses as demonstrated in previous sections.

GoogleQueery.py

#! /usr/bin/python

#class to make google queries
#must masquerade as a legitimate browser
#Using this violates Google ToS

import httplib
import urllib
import sys
import HTMLParser
import re

#class is basically fed a google url for linkedin for the
#sole purpose of getting a linkedin link
class GoogleQueery(HTMLParser.HTMLParser):
  def __init__(self, goog_url):
    HTMLParser.HTMLParser.__init__(self)
    self.linkedinurl = []
    query = urllib.urlencode({"q": goog_url})
    conn = httplib.HTTPConnection("www.google.com")
    headers = self.getHeaders()
    conn.request("GET", "/search?hl=en&"+query, headers=headers)
    resp = conn.getresponse()
    data = resp.read()
    self.feed(data)
    self.get_num_results(data)
    conn.close()

  #this is necessary because google wants to be mean and 403 based on... not sure
  #but it seems  I must look like a real browser to get a 200
  def getHeaders(self, browser="chromium"):
    #if browser == "random":
      #TODO randomize choice
    #ubuntu firefox spoof
    if browser == "ubuntuFF":
      headers = {
        "Host": "www.google.com",
        "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5",
        "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language" : "en-us,en;q=0.5",
        "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
        "Keep-Alive" : "300",
        "Proxy-Connection" : "keep-alive"
        }
    elif browser == "chromium":
      headers = {
        "Host": "www.google.com",
        "Proxy-Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.5 Safari/533.2",
        "Referer": "http://www.google.com/",
        "Accept": "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
        "Avail-Dictionary": "FcpNLYBN",
        "Accept-Language": "en-US,en;q=0.8",
        "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3"
      }
    elif browser == "ie":
      headers = {
        "Host": "www.google.com",
        "Proxy-Connection": "keep-alive",
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
        "Referer": "http://www.google.com/",
        "Accept": "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
        "Accept-Language": "en-US,en;q=0.8",
        "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3"
      }
    return headers

  def get_num_results(self, data):
    #pull the total result count out of Google's "1 - 10 of N" summary line
    index = re.search(r"<b>1</b> - <b>[\d]+</b> of [\w]*[ ]?<b>([\d,]+)", data)
    try:
      self.numResults = int(index.group(1).replace(",", ""))
    except (AttributeError, ValueError):
      self.numResults = 0
      if "- did not match any documents. " not in data:
        print "Warning: numresults parsing problem"
        print "setting number of results to 0"

  def handle_starttag(self, tag, attrs):
    try:
      if tag == "a" and ((("linkedin.com/pub/" in attrs[0][1])
                    or  ("linkedin.com/in" in attrs[0][1]))
                    and ("http://" in attrs[0][1])
                    and ("search?q=cache" not in attrs[0][1])
                    and ("/dir/" not in attrs[0][1])):
        self.linkedinurl.append(attrs[0][1])
        #print self.linkedinurl
      #perhaps add a google cache option here in the future
    except IndexError:
      pass

#for testing
if __name__ == "__main__":
  #url = "site:linkedin.com "PROJECT ADMINISTRATOR at CAT INL QATAR W.L.L." "Qatar""
  m = GoogleQueery(url)

LinkedinHTMLParser.py

#! /usr/bin/python

#this should probably be put in LinkedinPageGatherer.py

import HTMLParser

from person_searchobj import person_searchobj

class LinkedinHTMLParser(HTMLParser.HTMLParser):
  """
  subclass of HTMLParser specifically for parsing Linkedin names to person_searchobjs
  requires a call to .feed(data), stored data in the personArray
  """
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.personArray = []
    self.personIndex = -1
    self.inGivenName = False
    self.inFamilyName = False
    self.inTitle = False
    self.inLocation = False

  def handle_starttag(self, tag, attrs):
    try:
      if tag == "li" and attrs[0][0] == "class" and ("vcard" in attrs[0][1]):
        self.personIndex += 1
        self.personArray.append(person_searchobj())
      if attrs[0][1] == "given-name" and self.personIndex >=0:
        self.inGivenName = True
      elif attrs[0][1] == "family-name" and self.personIndex >= 0:
        self.inFamilyName = True
      elif tag == "dd" and attrs[0][1] == "title" and self.personIndex >= 0:
        self.inTitle = True
      elif tag == "span" and attrs[0][1] == "location" and self.personIndex >= 0:
        self.inLocation = True
    except IndexError:
      pass

  def handle_endtag(self, tag):
    if tag == "span":
      self.inGivenName = False
      self.inFamilyName = False
      self.inLocation = False
    elif tag == "dd":
      self.inTitle = False

  def handle_data(self, data):
    if self.inGivenName:
      self.personArray[self.personIndex].givenName = data.strip()
    elif self.inFamilyName:
      self.personArray[self.personIndex].familyName = data.strip()
    elif self.inTitle:
      self.personArray[self.personIndex].title = data.strip()
    elif self.inLocation:
      self.personArray[self.personIndex].location = data.strip()

#for testing - use a file since this is just a parser
if __name__ == "__main__":
  import sys
  file = open ("test.htm")
  df = file.read()
  parser = LinkedinHTMLParser()
  parser.feed(df)
  print "================"
  for person in parser.personArray:
    print person.goog_printstring()
  file.close()

LinkedinPageGatherer.py – this is what should be called directly.

#!/usr/bin/python

import urllib
import urllib2
import sys
import time
import copy
import pickle
import math

from person_searchobj import person_searchobj
from LinkedinHTMLParser import LinkedinHTMLParser
from GoogleQueery import GoogleQueery

#TODO add a test function that tests the website format for easy diagnostics when HTML changes
#TODO use HTMLParser like a sane person
class LinkedinPageGatherer:
  """
  class that generates the initial linkeding queeries using the company name
  as a search parameter. These search strings will be searched using google
  to obtain additional information (these limited initial search strings usually lack
  vital info like names)
  """
  def __init__(self, companyName, login, password, maxsearch=100,
               totalresultpercent=.7, maxskunk=100):
    """
    login and password are params for a valid linkedin account
    maxsearch is the number of results - linkedin limit unpaid accounts to 100
    totalresultpercent is the number of results this script will try to find
    maxskunk is the number of searches this class will attempt before giving up
    """
    #list of person_searchobj
    self.people_searchobj = []
    self.companyName = companyName
    self.login = login
    self.password = password
    self.fullurl = ("http://www.linkedin.com/search?search=&company="+companyName+
                    "&currentCompany=currentCompany", "&page_num=", "0")
    self.opener = self.linkedin_login()
    #for the smart_people_adder
    self.searchSpecific = []
    #can only look at 100 people at a time. Parameters used to narrow down queries
    self.total_results = self.get_num_results()
    self.maxsearch = maxsearch
    self.totalresultpercent = totalresultpercent
    #self.extraparameters = {"locationinfo" : [], "titleinfo" : [], "locationtitle" : [] }
    #extraparameters is a simple stack that adds keywords to restrict the search
    self.extraparameters = []
    #TODO can only look at 100 people at a time - like to narrow down queries
    #and auto grab more
    currrespercent = 0.0
    skunked = 0
    currurl = self.fullurl[0] + self.fullurl[1]
    extraparamindex = 0

    while currrespercent < self.totalresultpercent and skunked <= maxskunk:
      numresults = self.get_num_results(currurl)
      save_num = len(self.people_searchobj)

      print "-------"
      print "currurl", currurl
      print "percentage", currrespercent
      print "skunked", skunked
      print "numresults", numresults
      print "save_num", save_num

      for i in range (0, int(min(math.ceil(self.maxsearch/10), math.ceil(numresults/10)))):
        #function adds to self.people_searchobj
        print "currurl" + currurl + str(i)
        self.return_people_links(currurl + str(i))
      currrespercent = float(len(self.people_searchobj))/self.total_results
      if save_num == len(self.people_searchobj):
        skunked += 1
      for i in self.people_searchobj:
        pushTitles = [("title", gName) for gName in i.givenName.split()]
        #TODO this could be improved for more detailed results, etc, but keeping it simple for now
        pushKeywords = [("keywords", gName) for gName in i.givenName.split()]
        pushTotal = pushTitles[:] + pushKeywords[:]
        #append to extraparameters if unique
        self.push_search_parameters(pushTotal)
      print "parameters", self.extraparameters
      #get a new url to search for, if necessary
      #use the extra params in title, "keywords" parameters
      try:
        refineel = self.extraparameters[extraparamindex]
        extraparamindex += 1
        currurl = self.fullurl[0] + "&" + refineel[0] + "=" + refineel[1] + self.fullurl[1]
      except IndexError:
        break

  """
  #TODO: This idea is fine, but we should get names first to better distinguish people
  #also maybe should be moved
  def smart_people_adder(self):
    #we've already done a basic search, must do more
    if "basic" in self.searchSpecific:
  """
  def return_people_links(self, linkedinurl):
    req = urllib2.Request(linkedinurl)
    fd = self.opener.open(req)
    pagedata = ""
    while 1:
      data = fd.read(2056)
      pagedata = pagedata + data
      if not len(data):
        break
    #print pagedata
    self.parse_page(pagedata)

  def parse_page(self, page):
    thesePeople = LinkedinHTMLParser()
    thesePeople.feed(page)
    for newperson in thesePeople.personArray:
      unique = True
      for oldperson in self.people_searchobj:
        #if all these things match but they really are different people, they
        #will likely still be found as unique google results
        if (oldperson.givenName == newperson.givenName and
            oldperson.familyName == newperson.familyName and
            oldperson.title == newperson.title and
            oldperson.location == newperson.location):
              unique = False
              break
      if unique:
        self.people_searchobj.append(newperson)
  """
    print "======================="
    for person in self.people_searchobj:
      print person.goog_printstring()
  """

  #return the number of results, very breakable
  def get_num_results(self, url=None):
    #by default return total in company
    if url == None:
      fd = self.opener.open(self.fullurl[0] + "1")
    else:
      fd = self.opener.open(url)
    data = fd.read()
    fd.close()
    searchstr = "<p class="summary">"
    sindex = data.find(searchstr) + len(searchstr)
    eindex = data.find("</strong>", sindex)
    return(int(data[sindex:eindex].strip().strip("<strong>").replace(",", "").strip()))

  #returns an opener object that contains valid cookies
  def linkedin_login(self):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    urllib2.install_opener(opener)
    #login page
    fd = opener.open("https://www.linkedin.com/secure/login?trk=hb_signin")
    data = fd.read()
    fd.close()
    #csrf 'prevention' login value
    searchstr = """<input type="hidden" name="csrfToken" value="ajax:"""
    sindex = data.find(searchstr) + len(searchstr)
    eindex = data.find('"', sindex)
    params = urllib.urlencode(dict(csrfToken="ajax:-"+data[sindex:eindex],
                              session_key=self.login,
                              session_password=self.password,
                              session_login="Sign+In",
                              session_rikey=""))
    #need the second request to get the csrf stuff, initial cookies
    request = urllib2.Request("https://www.linkedin.com/secure/login")
    request.add_header("Host", "www.linkedin.com")
    request.add_header("Referer", "https://www.linkedin.com/secure/login?trk=hb_signin")
    time.sleep(1.5)
    fd = opener.open(request, params)
    data = fd.read()
    if "<div id="header" class="guest">" in data:
      print "Linkedin authentication faild. Please supply a valid linkedin account"
      sys.exit(1)
    else:
      print "Linkedin authentication Successful"
    fd.close()
    return opener

  def push_search_parameters(self, extraparam):
    uselesswords = ["for", "the", "and", "at", "in"]
    for pm in extraparam:
      pm = (pm[0], pm[1].strip().lower())
      if pm not in self.extraparameters and pm[1] not in uselesswords:
        self.extraparameters.append(pm)

class LinkedinTotalPageGather(LinkedinPageGatherer):
  """
  Overhead class that generates the person_searchobjs, using GoogleQueery
  """
  def __init__(self, companyName, login, password):
    LinkedinPageGatherer.__init__(self, companyName, login, password)
    extraPeople = []
    for person in self.people_searchobj:
      mgoogqueery = GoogleQueery(person.goog_printstring())
      #making the assumption that each pub url is a unique person
      count = 0
      for url in mgoogqueery.linkedinurl:
        #grab the real name from the url
        begindex = url.find("/pub/") + 5
        endindex = url.find("/", begindex)
        if count == 0:
          person.url = url
          person.name = url[begindex:endindex]
        else:
          extraObj = copy.deepcopy(person)
          extraObj.url = url
          extraObj.name = url[begindex:endindex]
          extraPeople.append(extraObj)
        count += 1
      print person
    print "Extra People"
    for person in extraPeople:
      print person
      self.people_searchobj.append(person)

if __name__ == "__main__":
  #args are email and password for linkedin
  my = LinkedinTotalPageGather(company, sys.argv[1], sys.argv[2])

person_searchobj.py

#! /usr/bin/python

class person_searchobj():
  """this object is used for the google search and the final person object"""

  def __init__ (self, givenname="", familyname="", title="", organization="", location=""):
    """
    given name could be a title in this case, does not matter in terms of google
    but then may have to change for the final person object
    """
    #"name" is their actual name, unlike givenName and family name which are linkedin names
    self.name = ""
    self.givenName = givenname
    self.familyName = familyname
    self.title = title
    self.organization = organization
    self.location = location

    #this is retrieved by GoogleQueery
    self.url = ""

  def goog_printstring(self):
    """return the google print string used for queries"""
    retrstr = "site:linkedin.com "
    for i in  [self.givenName, self.familyName, self.title, self.organization, self.location]:
      if i != "":
        retrstr += '"' + i +'" '
    return retrstr

  def __repr__(self):
    """Overload __repr__ for easy printing. Mostly for debugging"""
    return (self.name + "\n" +
            "------\n" +
            "GivenName: " + self.givenName + "\n" +
            "FamilyName: " + self.familyName + "\n" +
            "Title: " + self.title + "\n" +
            "Organization: " + self.organization + "\n" +
            "Location: " + self.location + "\n" +
            "URL: " + self.url + "\n\n")

3 Responses to Linkedin Crawler

  1. Lulu says:

    Hi, great article, thank you! Any idea if this is still working today? Do you have an updated version?

    • I’m fairly sure this doesn’t work anymore, since changes in the website would break it, unfortunately. The same concepts should basically work, but it would definitely take some debugging.

      • Lulu says:

        Thanks a lot, I will consider adjusting it ;) I will keep you posted for sure if I find the time to do it.
