Scraping with Django Backend
18 Feb 2015I like to use Django as a backend in my scraping scripts. I use it to model the data being scraped with the Django ORM. Then, once the data has been collected, I view the results using the automatic admin interface which is easy to setup in just a few lines of code.
In this post, I’ll show how to set up a project template for scraping with Django.
First let’s set up the environment. Create a directory for the template and start up a virtual environment in that directory.
$ mkdir django_scraping_template && cd django_scraping_template
$ virtualenv venv
$ source venv/bin/activate
Then install the dependencies (Django obviously being one of them) that you want to use for your scraping work:
$ pip install mechanize
$ pip install beautifulsoup4
$ pip install Django
$ pip freeze > requirements.txt
Next create the Django project for the settings and an app for modeling the data:
$ django-admin.py startproject scraper
$ cd scraper
$ python manage.py startapp custom_scraper
In the top level directory, create a file named scraper.py. We’ll edit that it in a second.
Your directory layout at this point should look as follows:
django_scraping_template
├── scraper.py
└── scraper
├── scraper
│ ├── wsgi.py
│ ├── urls.py
│ ├── settings.py
│ └── __init__.py
├── manage.py
└── custom_scraper
├── views.py
├── tests.py
├── models.py
├── migrations
│ └── __init__.py
├── admin.py
└── __init__.py
Edit the scraper/settings.py file, and change the INSTALLED_APPS setting to include the string ‘custom_scraper’. So it’ll look like this:
INSTALLED_APPS = (
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'custom_scraper'
)
Edit custom_scraper/models.py and create your model definitions. Then run migrate to sync the models into the database.
$ python manage.py makemigrations
$ python manage.py migrate
Now in the top scraper directory edit the scraper.py file:
#!/usr/bin/env python
import os
import sys
import django
sys.path.append(os.path.realpath(os.path.join(os.path.dirname(__file__), 'scraper/')))
sys.path.append(os.path.realpath(os.path.join(os.path.dirname(__file__), 'scraper/scraper/')))
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
django.setup()
from django.core.exceptions import ObjectDoesNotExist
from custom_scraper.models import *
class CustomScraper(object):
def __init__(self):
self.url = ""
def scrape(self):
print 'scraping...'
if __name__ == '__main__':
scraper = CustomScraper()
scraper.scrape()
The model(s) you define in custom_scraper/models.py can now be used in this standalone script.
Now just make the script executable and you’re all set.
$ chmod +x scraper.py
$ ./scraper.py
Shameless Plug
Have a scraping project you’d like done? I’m available for hire. Contact me with some details about your project and I’ll give you a quote.