I am using Scrapy and I have two different `Item`s. I want to store the entries for each specific item in its own MongoDB collection. For example, let’s assume this is what I have in the `items.py` file:
```python
import scrapy


class Student(scrapy.Item):
    name = scrapy.Field()
    email = scrapy.Field()
    phone = scrapy.Field()


class Course(scrapy.Item):
    name = scrapy.Field()
    teacher_name = scrapy.Field()
```
I want to store `Student` items in the `student` collection and `Course` items in the `course` collection. How do we do that?
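For context, assume I also have a minimal spider that yields both item types. The spider name, start URL, project name, and field values here are just placeholders:

```python
import scrapy

from myproject.items import Course, Student  # "myproject" is a placeholder


class SchoolSpider(scrapy.Spider):
    name = 'school'
    start_urls = ['http://example.com/school']  # placeholder URL

    def parse(self, response):
        # Both item types can be yielded from the same spider
        yield Student(name='Jane Doe', email='jane@example.com', phone='555-0100')
        yield Course(name='Algebra', teacher_name='John Smith')
```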
If you have used Scrapy before, you already know that we use pipelines for storing data. Here’s our own `MongoPipeline` that stores each item type in its own collection:
```python
from pymongo import MongoClient
from scrapy.conf import settings
import logging


class MongoPipeline(object):
    def __init__(self):
        # Connect to MongoDB and select the database, both taken from settings
        connection = MongoClient(settings['MONGODB_HOST'], settings['MONGODB_PORT'])
        self.db = connection[settings['MONGODB_DATABASE']]

    def process_item(self, item, spider):
        # The lowercased class name of the item doubles as the collection name
        collection = self.db[type(item).__name__.lower()]
        logging.info(collection.insert(dict(item)))
        return item
```
So this is what’s happening:
- We’re using PyMongo as the MongoDB driver.
- I keep the MongoDB related configuration in the project settings. I read those values, construct a MongoDB client, and select the database based on a setting (see the settings snippet after this list).
- In the `process_item` method, we take the type of the item and lowercase its name. This type name serves as the MongoDB collection name for us.
- We insert the item. We call `dict()` on the item to get a dictionary representation, which we can save directly using PyMongo.
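For reference, the custom settings the pipeline reads, plus the `ITEM_PIPELINES` entry that activates it, could look like this in `settings.py`. The project name `myproject`, the database name, and the connection values are placeholders:

```python
# settings.py -- values below are placeholders

MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DATABASE = 'school'

# Register the pipeline so Scrapy actually runs it
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}
```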
That’s it. Now if you run your spiders, items of each type will go to their own collection in MongoDB.
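One caveat: `scrapy.conf` is deprecated (current Scrapy versions pass settings in through a `from_crawler` class method), and PyMongo’s `insert()` has been superseded by `insert_one()`. If the code above fails on a recent setup, a sketch of the same pipeline against the newer APIs might look like this:

```python
from pymongo import MongoClient


class MongoPipeline:
    def __init__(self, host, port, database):
        self.host = host
        self.port = port
        self.database = database

    @classmethod
    def from_crawler(cls, crawler):
        # Read the same custom settings, but through the crawler,
        # which is the supported way in current Scrapy releases
        return cls(
            host=crawler.settings.get('MONGODB_HOST'),
            port=crawler.settings.get('MONGODB_PORT'),
            database=crawler.settings.get('MONGODB_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = MongoClient(self.host, self.port)
        self.db = self.client[self.database]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Same trick: the lowercased item class name is the collection name
        collection = self.db[type(item).__name__.lower()]
        collection.insert_one(dict(item))
        return item
```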