Scraping by Example - SelectMinds ATS
27 May 2020

In this post I’ll show how to develop a scraper for the SelectMinds Applicant Tracking System (ATS). For my example, I’ll use the jobs site at https://eygbl.referrals.selectminds.com/
Background
Before developing the code for the scraper, let’s inspect the site and see how the jobs are loaded into the page. Click on the link above to open the jobs site. Then open the Chrome developer tools and switch to the Network tab. Click the Search button to load all jobs.
In the Network tab you’ll see a series of XHR requests being made. Two of those requests (the first and the last) are the ones we’ll need to recreate in order to scrape the jobs from the site.
The first request is a POST to

https://eygbl.referrals.selectminds.com/ajax/jobs/search/create

with a uid parameter in the query string and a keywords field in the form data (leaving keywords empty returns all jobs).
We also need to pay attention to the request headers: the tss-token header is required. Without it we’ll get back a 403 Invalid Access response from the server. When a valid request is made, the response sent back will be a JSON string with the following fields:
{
    "Status": "OK",
    "UserMessage": "",
    "Result": {
        "JobSearch.id": 84067040
    }
}
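Pulling that id out in Python is a one-liner (a quick sketch, assuming resp holds the requests response for this POST):

data = resp.json()
job_search_id = data['Result']['JobSearch.id']  # 84067040 in the sample above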
We’ll need to save the JobSearch.id value from the response because it’s used as a parameter in the second XHR request, a POST to

https://eygbl.referrals.selectminds.com/ajax/content/job_results

with JobSearch.id, page_index, site-name, include_site, and uid as query string parameters. We’ll see where the rest of these parameters come from later in the article when we dig into how this request gets generated. The response for this request contains the HTML for the jobs that we’ll scrape in the Result field of the JSON string returned:
{
    "Status": "OK",
    "UserMessage": "",
    "Result": "<div class=\"results_content..."
}
Implementation
Now that we’ve gotten an overview of the requests, let’s dig into the site’s code to see how the two XHR requests are being generated so that we can duplicate them in our scraper. First, let’s create a base class with some boilerplate code for our SelectMinds scraper. We’ll add methods to it as we go:
import re
import sys
import json
import time
import logging
import requests

from bs4 import BeautifulSoup
from datetime import datetime
from urllib.parse import urljoin

class SelectMindsJobScraper(object):
    def __init__(self, url):
        self.url = url  # base URL of the jobs site, e.g. https://eygbl.referrals.selectminds.com/

        FORMAT = "%(asctime)s [ %(filename)s:%(lineno)s - %(funcName)s() ] %(message)s"
        logging.basicConfig(format=FORMAT, datefmt='%Y-%m-%d %H:%M:%S')

        self.logger = logging.getLogger(__name__)
        self.logger.setLevel(logging.DEBUG)
        self.session = requests.Session()
Now we’ll go back and take a look at that first XHR request again. In the Chrome developer tools, search for the URL used in the first POST request (/ajax/jobs/search/create).
Click on the search result to open up the code in the Sources tab. Inside the Sources tab you’ll see that the first XHR request is implemented by code in job_search_banner.js which gets triggered when the Search button is clicked:
// Main submit binding
j$('#jSearchSubmit', search_banner).click(function() {
    ...
    j$.ajax({
        type: 'POST',
        url: TVAPP.guid('/ajax/jobs/search/create'),
        data: data,
        success: function(result) {
            job_search_id = result.Result['JobSearch.id'];
            j$.log('job_search_id: ' + job_search_id);

            // Load results
            j$(document).trigger('loadSearchResults', {'job_search_id':job_search_id});
        },
        dataType: 'json',
        error: function(xhr, textStatus, error) {
            TVAPP.masterErrorHandler(xhr, textStatus, error, null);
        }
    });
});
The URL is set to the return value of the call to

TVAPP.guid('/ajax/jobs/search/create')

This method is where the uid: 219 parameter is being generated. To find the source code for this method, go into the console of the Chrome developer tools and type in TVAPP.guid. The console will display the beginning of the code for this method. Click on that code and it will take you into the Sources tab, which shows that the guid method is defined in desktop.js:
TVAPP.guid = function(url) {
    var date = new Date
    var uid = date.getMilliseconds();
    var additionType = "?uid=";

    for (var i = 0; i < url.length; i++) {
        if (url.charAt(i) == '?') {
            additionType = "&uid="
        }
    }

    var newURL = url + additionType + uid;
    return newURL;
};
So the uid comes from the milliseconds portion of the current local time, date.getMilliseconds(). Let’s add the corresponding method to our scraper:
def guid(self):
    dt = datetime.now()
    guid = dt.microsecond // 1000  # integer division so we get a whole number, like getMilliseconds()
    return guid
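A quick check in the REPL confirms this yields a value in the 0-999 range, just like getMilliseconds() (the exact number varies from run to run):

>>> from datetime import datetime
>>> datetime.now().microsecond // 1000
219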
Now that we’ve seen how the uid parameter is being generated, let’s see how to get the tss-token value required in the request headers. Search the HTML of the job site’s main page for tss and you’ll find an <input> element containing the token value:
<input type="hidden" name="tsstoken" id="tsstoken" value="BNT9g...">
We can look that up using the tag name and id in BeautifulSoup:
def get_tss_token(self, soup):
    i = soup.find('input', id='tsstoken')
    tss_token = i['value']
    return tss_token
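Here’s the lookup in action against the <input> snippet from above (an illustrative stand-in; the real token value is much longer than the truncated BNT9g...):

from bs4 import BeautifulSoup

html = '<input type="hidden" name="tsstoken" id="tsstoken" value="BNT9g...">'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('input', id='tsstoken')['value'])  # BNT9g...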
At this point, we’ve seen enough to implement the first request in our scraper:
def guid(self):
    dt = datetime.now()
    guid = dt.microsecond // 1000  # integer division so we get a whole number, like getMilliseconds()
    return guid

def get_tss_token(self, soup):
    i = soup.find('input', id='tsstoken')
    tss_token = i['value']
    return tss_token

def get_job_search_id(self, tss_token):
    headers = {
        'X-Requested-With': 'XMLHttpRequest',
        'tss-token': tss_token,
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
    }
    uid = self.guid()
    params = {
        'uid': uid
    }
    data = {
        'keywords': ''  # empty keywords returns all jobs
    }
    url = urljoin(self.url, '/ajax/jobs/search/create')
    resp = self.session.post(url, headers=headers, params=params, data=data)
    data = resp.json()
    return data['Result']['JobSearch.id']
def scrape(self):
    jobs = []

    resp = self.session.get(self.url)
    soup = BeautifulSoup(resp.text, 'lxml')

    tss_token = self.get_tss_token(soup)
    job_search_id = self.get_job_search_id(tss_token)
    ...
Now that we have the job_search_id value from the first request, we’re ready to move on and see how the second XHR request is generated. As shown earlier in the code bound to the Search button, the job_search_id value from the first response is passed as the argument to loadSearchResults:
// Main submit binding
j$('#jSearchSubmit', search_banner).click(function() {
    ...
    j$(document).trigger('loadSearchResults', {'job_search_id':job_search_id});
Let’s examine the code for loadSearchResults, which is defined in job_list.js:
// function to find existing or get new search results
function loadSearchResults(data) {
    ...
    // load new results and display them
    if (cached_results.length && !data.force_refresh) {
        ...
    } else {
        // Load new results
        var context = 'JobSearch.id=' + data.job_search_id;
        if (TVAPP.property.fb) context += '&fb=true';
        if (campaign_id) context += '&Campaign.id=' + campaign_id + '&campaign_page=2';
        if (data.page) context += '&page_index=' + data.page;

        // Sync load is bad news bears. instead, let's async and callback:
        j$.ajax({
            type: 'POST',
            url: TVAPP.guid('/ajax/content/job_results?' + context + '&site-name=' + TVAPP.property.site.short_name + '&include_site=true'),
            dataType: 'json',
            success: function(response) {
                var new_results = j$(response.Result);
                ...
                // display the new results
                displayLoadedResults(new_results, data);
            },
            error: function(xhr, textStatus, error) {
                TVAPP.masterErrorHandler(xhr, textStatus, error, null);
            }
        });
    }
};
Once again we have a call to TVAPP.guid generating the URL for the AJAX request:

TVAPP.guid('/ajax/content/job_results?' + context + '&site-name=' + TVAPP.property.site.short_name + '&include_site=true'),

As you can see, we need the values for the context and site-name parameters. The context variable is defined just above the call to guid(). It’s initialized with JobSearch.id and updated with page_index if a page number is set via page in data:
var context = 'JobSearch.id=' + data.job_search_id;
if (TVAPP.property.fb) context += '&fb=true';
if (campaign_id) context += '&Campaign.id=' + campaign_id + '&campaign_page=2';
if (data.page) context += '&page_index=' + data.page;
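Putting those pieces together, the final URL for the second request ends up looking something like this (JobSearch.id is the example value from the first response, page_index is set when paging, and site-name comes from TVAPP.property.site.short_name, which we’ll track down next; fb and Campaign.id only appear in the Facebook and campaign cases):

/ajax/content/job_results?JobSearch.id=84067040&page_index=2&site-name=default909&include_site=true&uid=219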
We can extract TVAPP.property.site.short_name from the HTML of the main page. View the HTML of the job site’s main search page and search for TVAPP and you’ll find where TVAPP.property.site.short_name is set:
<script type="text/javascript">
    var TVAPP = TVAPP || {}; // There can only be ONE.
    TVAPP.property = {
        ...
        site: {
            id: "2",
            short_name: "default909"
        },
We’ll use a regex to extract the short_name value:
def get_site_short_name(self, soup):
    x = { 'type': 'text/javascript' }
    r = re.compile(r'short_name:\s+"([^"]+)')

    m = None
    for script in soup.find_all('script', attrs=x):
        m = re.search(r, script.text)
        if m:
            break

    short_name = None
    if m:
        short_name = m.group(1)

    return short_name
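As a quick sanity check, here’s the method run against a snippet like the one above (scraper is an instance of our class; the markup is a trimmed stand-in for the real page):

snippet = '''
<script type="text/javascript">
    TVAPP.property = {
        site: {
            id: "2",
            short_name: "default909"
        }
    };
</script>
'''
soup = BeautifulSoup(snippet, 'html.parser')
print(scraper.get_site_short_name(soup))  # default909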
Now we have enough information to replicate the second request with our own version of loadSearchResults:
def load_search_results(self, tss_token, job_search_id, short_name, pageno=1):
    headers = {
        'tss-token': tss_token,
    }
    params = {
        'JobSearch.id': job_search_id,
        'page_index': pageno,
        'site-name': short_name,  # From TVAPP.property.site.short_name, e.g. "default909"
        'include_site': 'true',
        'uid': self.guid()
    }
    url = urljoin(self.url, '/ajax/content/job_results')
    resp = self.session.post(url, headers=headers, params=params)
    data = json.loads(resp.text)
    return data['Result']
Now that we’ve got the code for the two requests, we have everything we need to complete our scrape() method:
def scrape(self):
    jobs = []

    resp = self.session.get(self.url)
    soup = BeautifulSoup(resp.text, 'lxml')

    tss_token = self.get_tss_token(soup)
    short_name = self.get_site_short_name(soup)
    job_search_id = self.get_job_search_id(tss_token)

    pageno = 1

    while True:
        self.logger.info(f'Getting page {pageno}')

        html = self.load_search_results(tss_token, job_search_id, short_name, pageno)
        soup = BeautifulSoup(html, 'html.parser')

        d = soup.find('div', id='job_results_list_hldr')
        x = {'class': 'job_link'}
        y = {'class': 'location'}

        for a in d.find_all('a', attrs=x):
            l = a.find_next('span', attrs=y)
            job = {}
            job['title'] = a.text
            job['url'] = urljoin(self.url, a['href'])
            job['location'] = l.text.strip()
            jobs.append(job)

        self.logger.info(f'{len(jobs)} jobs scraped')

        d = soup.find('div', id='jPaginateNumPages')
        num_pages = int(float(d.text))

        if pageno >= num_pages:
            break

        time.sleep(1)  # Don't hit the server too quickly
        pageno += 1

    return jobs
Let’s go over the parts of the code that are new, starting with the code that parses the HTML returned by load_search_results. If you examine the HTML on the job search results page, you’ll see that the jobs are contained in a div with id job_results_list_hldr. After feeding the HTML into BeautifulSoup, we can find the job links and locations under this div using the classes job_link and location respectively:
html = self.load_search_results(tss_token, job_search_id, short_name, pageno)
soup = BeautifulSoup(html, 'html.parser')

d = soup.find('div', id='job_results_list_hldr')
x = {'class': 'job_link'}
y = {'class': 'location'}

for a in d.find_all('a', attrs=x):
    l = a.find_next('span', attrs=y)
Since pagination is handled via a parameter, we can go to the next page by simply incrementing the pageno variable that gets passed to load_search_results. We do need a way to check when we’ve reached the last page, though.
Examine the HTML for the pagination links at the bottom of the search results and you’ll see that one of the div elements, jPaginateNumPages, contains the total number of pages for the search results. We can compare that total against the page number we just scraped; once the two are equal, we’ve reached the last page:
d = soup.find('div', id='jPaginateNumPages')
num_pages = int(float(d.text))

if pageno >= num_pages:
    break

time.sleep(1)  # Don't hit the server too quickly
pageno += 1
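Finally, here’s a minimal sketch of running the scraper end to end (using the example site from the top of the post):

if __name__ == '__main__':
    scraper = SelectMindsJobScraper('https://eygbl.referrals.selectminds.com/')
    jobs = scraper.scrape()
    print(json.dumps(jobs, indent=2))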
That’s it! If you’d like to see a working version of the code developed in this post, it’s available on GitHub at:
https://gist.github.com/thayton/05748dc48a6d6357b27edfb5363e8c04
Shameless Plug
Have a scraping project you’d like done? I’m available for hire. Contact me with some details about your project and I’ll give you a quote.