I am using Scrapy and I have two different `Item`s. I want to store the entries for each specific item in its own MongoDB collection. For example, let’s assume this is what I have in the `items.py` file:
```python
import scrapy


class Student(scrapy.Item):
    name = scrapy.Field()
    email = scrapy.Field()
    phone = scrapy.Field()


class Course(scrapy.Item):
    name = scrapy.Field()
    teacher_name = scrapy.Field()
```
I want to store `Student` items in the `student` collection and `Course` items in the `course` collection. How do we do that?
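For context, assume I also have a minimal spider that yields both item types. The spider name, start URL, project name, and field values here are just placeholders:

```python
import scrapy

from myproject.items import Course, Student  # "myproject" is a placeholder


class SchoolSpider(scrapy.Spider):
    name = 'school'
    start_urls = ['http://example.com/school']  # placeholder URL

    def parse(self, response):
        # Both item types can be yielded from the same spider
        yield Student(name='Jane Doe', email='jane@example.com', phone='555-0100')
        yield Course(name='Algebra', teacher_name='John Smith')
```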
If you have used Scrapy before, you already know that we use pipelines for storing data. Here’s our own `MongoPipeline` that stores each item type in its own collection:
```python
from pymongo import MongoClient
from scrapy.conf import settings
import logging


class MongoPipeline(object):
    def __init__(self):
        # Connect to MongoDB and select the database, both taken from settings
        connection = MongoClient(settings['MONGODB_HOST'], settings['MONGODB_PORT'])
        self.db = connection[settings['MONGODB_DATABASE']]

    def process_item(self, item, spider):
        # The lowercased class name of the item doubles as the collection name
        collection = self.db[type(item).__name__.lower()]
        logging.info(collection.insert(dict(item)))
        return item
```
So this is what’s happening:
- We’re using PyMongo as the MongoDB driver.
- I keep the MongoDB related configuration in the project settings. I read those values, construct a MongoDB client, and select the database based on a setting (see the settings snippet after this list).
- In the `process_item` method, we take the type of the item and lowercase its name. This type name serves as the MongoDB collection name for us.
- We insert the item. We call `dict()` on the item to get a dictionary representation, which we can save directly using PyMongo.
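For reference, the custom settings the pipeline reads, plus the `ITEM_PIPELINES` entry that activates it, could look like this in `settings.py`. The project name `myproject`, the database name, and the connection values are placeholders:

```python
# settings.py -- values below are placeholders

MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DATABASE = 'school'

# Register the pipeline so Scrapy actually runs it
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}
```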
That’s it. Now if you run your spiders, items of each type will go to their own collection in MongoDB.
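One caveat: `scrapy.conf` is deprecated (current Scrapy versions pass settings in through a `from_crawler` class method), and PyMongo’s `insert()` has been superseded by `insert_one()`. If the code above fails on a recent setup, a sketch of the same pipeline against the newer APIs might look like this:

```python
from pymongo import MongoClient


class MongoPipeline:
    def __init__(self, host, port, database):
        self.host = host
        self.port = port
        self.database = database

    @classmethod
    def from_crawler(cls, crawler):
        # Read the same custom settings, but through the crawler,
        # which is the supported way in current Scrapy releases
        return cls(
            host=crawler.settings.get('MONGODB_HOST'),
            port=crawler.settings.get('MONGODB_PORT'),
            database=crawler.settings.get('MONGODB_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = MongoClient(self.host, self.port)
        self.db = self.client[self.database]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Same trick: the lowercased item class name is the collection name
        collection = self.db[type(item).__name__.lower()]
        collection.insert_one(dict(item))
        return item
```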