Revisiting Taleo with Puppeteer
12 Feb 2019

I’ve demonstrated how to scrape Taleo sites in a couple of my previous posts [1] [2]. In those articles I used CasperJS and Python/Selenium to scrape the Taleo job site at https://l3com.taleo.net. In this post, I’ll show how to scrape that same site again, this time using Puppeteer.
We begin by importing puppeteer and defining the URL we’re going to scrape:
const puppeteer = require('puppeteer');
const url = 'https://l3com.taleo.net/careersection/l3_ext_us/jobsearch.ftl';
The main logic of the scraper is in main:
async function main() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await waitForJobsToLoad(page);
  let pageno = 2;
  while (true) {
    const jobs = await getJobs(page);
    jobs.forEach(j => console.log( JSON.stringify(j, null, 2) ));
    const noMorePages = await gotoNextPage(page, pageno++);
    if (noMorePages) {
      break;
    }
  }
  await browser.close();
}
main().then(() => console.log('Complete!'));
Inside main we first launch Chrome via Puppeteer and then load the Taleo site in a new page:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
Now, before we start scraping the jobs we have to wait until they’ve finished loading. If you inspect the network traffic and HTML you’ll see that the jobs are loaded into the table#jobs element via an AJAX call. We need a way to determine when that loading has completed.
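As an aside, the most obvious check would be to simply wait for a job row to show up. Here’s a rough sketch (it’s not what the rest of this post uses) built on the same row selector we’ll scrape with later:

async function waitForFirstJobRow(page) {
  // Sketch only: resolves once at least one job row exists in the DOM.
  // Assumes rows matching this selector only appear after the AJAX call completes.
  await page.waitForSelector('table#jobs tr[id^="job"]');
}

That would work for the initial load, but on later pages the old rows are still in the DOM when we click the pager link, so waitForSelector would resolve immediately instead of waiting for the new results. We need something that actually detects a change.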
If you open the site in your browser, you’ll see that a spinning progress indicator appears while the jobs are being loaded and then disappears once the jobs are rendered. One approach we could take is to wait for that ‘loading’ progress indicator to appear and then wait for it to go away.
The progress indicator shows up in the div#progressIndicator
element. If we inspect the site’s
code in SearchHandler.js, we can see that the progress indicator has its width set to 0 to make it
‘invisible’ once the jobs have finished loading:
this.hideProgress = function() {
  $('#progressIndicator').width(0);
};
So we can determine when the progress indicator is visible and when it has gone away by checking that element’s offsetWidth:
async function waitForJobsToLoad(page) {
  await page.waitFor(() => document.querySelector('div#progressIndicator').offsetWidth !== 0);
  await page.waitFor(() => document.querySelector('div#progressIndicator').offsetWidth === 0);
}
This approach worked when I tested it, but I don’t think it’s robust since it relies on the timing working out in our favor. If the jobs have already loaded by the time the first waitFor is called, we’ll end up timing out while waiting for the progress indicator to appear.
In this case it’s better to wait for the contents of span#reloadMessage to change. This element is where the message describing how many jobs were loaded appears. Initially there’s no message; once the jobs have loaded it will contain a string like Job Openings 1 - 25 of 1051:
var waitForJobsToLoad = (function () {
  let reloadMessage = '';
  return async function(page) {
    await page.waitForFunction(
      oldText => document.querySelector('span#reloadMessage').innerText !== oldText,
      {}, reloadMessage
    );
    reloadMessage = await page.$eval('span#reloadMessage', e => e.innerText);
  };
})();
I’ve used an IIFE for waitForJobsToLoad so that we can track the string in the span#reloadMessage element using the reloadMessage variable.

When we first open the jobs page, the message in the span#reloadMessage element is empty, so reloadMessage starts out as an empty string. At the end of each call, reloadMessage gets set to the new value of the span#reloadMessage element. So each time waitForJobsToLoad gets called, it waits for the message to change from whatever value it had previously.
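As a side note, the same span#reloadMessage element also tells us the total number of jobs, so if you wanted to sanity-check that every page got scraped you could parse the count out of it. A rough sketch (this helper isn’t part of the scraper, and it assumes the ‘Job Openings 1 - 25 of 1051’ format holds):

// Sketch only: extract the total job count from span#reloadMessage.
// Assumes the message ends with 'of <total>'.
async function getTotalJobCount(page) {
  const text = await page.$eval('span#reloadMessage', e => e.innerText);
  const match = text.match(/of\s+(\d+)/);
  return match ? parseInt(match[1], 10) : null;
}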
Writing waitForJobsToLoad this way means we can reuse it in the next-page handler to determine when a new page of results has finished loading:
/*------------------------------------------------------------------------------
 * Look for the link for pageno in the pager. So if pageno was 6 we'd look for
 * the <a> element whose text is '6':
 *
 * <a href="#" title="Go to page 6" aria-disabled="false">6</a>
 */
async function gotoNextPage(page, pageno) {
  let noMorePages = true;
  let nextPageXp = `//ul[@class='pager']/li[@class='pagerlink']/a[text()='${pageno}']`;
  let nextPage;
  nextPage = await page.$x(nextPageXp);
  if (nextPage.length > 0) {
    await nextPage[0].click();
    await waitForJobsToLoad(page);
    noMorePages = false;
  }
  return noMorePages;
}
Back in main, once the jobs have finished loading we scrape the jobs on the current page and then click through to the next page, continuing until we’ve reached the last page.
let pageno = 2;
while (true) {
  const jobs = await getJobs(page);
  jobs.forEach(j => console.log( JSON.stringify(j, null, 2) ));
  const noMorePages = await gotoNextPage(page, pageno++);
  if (noMorePages) {
    break;
  }
}
The getJobs function is where we scrape the job attributes:
async function getJobs(page) {
  const jobs = await page.evaluate(jobSelector => {
    //debugger;
    var results = [];
    Array.from(document.querySelectorAll(jobSelector)).forEach((tr) => {
      const th = tr.querySelector('th');
      const td = tr.querySelectorAll('td');
      results.push({
        'title': th.innerText.trim(),
        'href': th.querySelector('a').href,
        'location': td[1].innerText.trim(),
        'postingDate': td[2].innerText.trim()
      });
    });
    return results;
  }, 'table#jobs tr[id^="job"]');
  return jobs;
}
The debugger statement is commented out. But if you uncomment it and launch Chrome with headless set to false and devtools set to true, then you can use Chrome’s debugger to step through the code in getJobs once execution reaches the debugger statement:
const browser = await puppeteer.launch({ headless: false, devtools: true });
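Another launch option worth knowing about, although I haven’t used it here, is slowMo. It delays each Puppeteer operation by the given number of milliseconds, which can make it easier to follow what the script is doing in DevTools:

// Sketch: slow every Puppeteer operation down by 250ms while debugging.
const browser = await puppeteer.launch({
  headless: false,
  devtools: true,
  slowMo: 250
});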
And that’s it! The complete code is available as a gist at:
https://gist.github.com/thayton/330951c308bd525fc2abea49793d583c