Quickstart

Pomp is fun to use and incredibly easy for basic applications.

A Minimal Application

For a minimal application all you need is to define your crawler by inheriting from base.BaseCrawler:

import re
from pomp.core.base import BaseCrawler
from pomp.contrib.urllibtools import UrllibHttpRequest


python_sentence_re = re.compile(r'[\w\s]*python[\s\w]*', re.I | re.M)


class MyCrawler(BaseCrawler):
    """Extract all sentences with `python` word"""
    ENTRY_REQUESTS = UrllibHttpRequest('http://python.org/news')  # entry point

    def extract_items(self, response):
        for i in python_sentence_re.findall(response.body.decode('utf-8')):
            sentence = i.strip()
            print("Sentence: {}".format(sentence))
            yield sentence


if __name__ == '__main__':
    from pomp.core.engine import Pomp
    from pomp.contrib.urllibtools import UrllibDownloader

    pomp = Pomp(
        downloader=UrllibDownloader(),
    )

    pomp.pump(MyCrawler())

Item pipelines

To process extracted items pomp provides a pipelines mechanism. Define a pipeline by subclassing base.BasePipeline and pass it to the engine.Pomp constructor.

Pipelines are called one by one, in the order in which they were passed to the engine.

Example pipelines that filter out items shorter than 10 characters and print each remaining sentence:

from pomp.core.base import BasePipeline


class FilterPipeline(BasePipeline):
    def process(self, crawler, item):
        # returning None drops the item from further processing
        return None if len(item) < 10 else item

class PrintPipeline(BasePipeline):
    def process(self, crawler, item):
        print('Sentence:', item, ' length:', len(item))
        return item  # pass the item to the next pipeline

pomp = Pomp(
    downloader=UrllibDownloader(),
    pipelines=(FilterPipeline(), PrintPipeline(),)
)
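
The engine is pumped the same way as in the minimal application:

pomp.pump(MyCrawler())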

See Simple pipelines for more examples.
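
Pipelines can also hold resources open for the whole crawl. Below is a minimal sketch of a pipeline that writes every item to a file, assuming the start and stop hooks of base.BasePipeline (the file name is arbitrary):

from pomp.core.base import BasePipeline


class SaveToFilePipeline(BasePipeline):
    def __init__(self, path):
        self.path = path
        self.fd = None

    def start(self, crawler):
        # called once before crawling starts
        self.fd = open(self.path, 'w')

    def process(self, crawler, item):
        self.fd.write(item + '\n')
        return item  # pass the item to the next pipeline

    def stop(self, crawler):
        # called once after the crawl is finished
        self.fd.close()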

Custom downloader

To download data from the target source, the application can define its own downloader to implement special protocols or strategies.

A custom downloader must be a subclass of base.BaseDownloader.

For example, a downloader that fetches data with the requests package:

import requests as requestslib
from pomp.core.base import BaseDownloader, BaseCrawlException
from pomp.core.base import BaseHttpRequest, BaseHttpResponse


class ReqRequest(BaseHttpRequest):
    def __init__(self, url):
        self.url = url


class ReqResponse(BaseHttpResponse):
    def __init__(self, request, response):
        self.request = request

        # keep the body only for successful responses
        if not isinstance(response, Exception):
            self.body = response.text

    def get_request(self):
        return self.request


class RequestsDownloader(BaseDownloader):

    def process(self, crawler, request):
        try:
            return ReqResponse(request, requestslib.get(request.url))
        except Exception as e:
            print('Exception on %s: %s' % (request, e))
            return BaseCrawlException(request, exception=e)


if __name__ == '__main__':
    from pomp.core.base import BaseCrawler
    from pomp.core.engine import Pomp

    class Crawler(BaseCrawler):
        ENTRY_REQUESTS = ReqRequest('http://python.org/news/')

        def extract_items(self, response):
            print(response.body)

        def next_requests(self, response):
            return None  # one page crawler

    pomp = Pomp(
        downloader=RequestsDownloader(),
    )

    pomp.pump(Crawler())
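
The next_requests hook is where a crawler schedules further pages: it may return None, a single request, or an iterable of requests. A minimal sketch of a two-level crawler, assuming the links are collected with a hypothetical regular expression:

import re
from pomp.core.base import BaseCrawler

next_url_re = re.compile(r'href="(/news/[^"]+)"')  # hypothetical pattern


class PagingCrawler(BaseCrawler):
    ENTRY_REQUESTS = ReqRequest('http://python.org/news/')

    def extract_items(self, response):
        yield response.body

    def next_requests(self, response):
        # schedule a request for every link found on the page
        return [
            ReqRequest('http://python.org' + path)
            for path in next_url_re.findall(response.body)
        ]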

Downloader middleware

To hook into a request before it is executed by the downloader, or into a response before it is passed to the crawler, pomp provides a middleware framework.

A middleware must be a subclass of base.BaseMiddleware.

Each request is passed through the middlewares one by one, in order, before it reaches the downloader. Each response or exception is passed back through the middlewares one by one in reverse order.

For example, a statistics middleware that counts requests, responses, and exceptions:

from pomp.core.base import BaseMiddleware

class StatisticMiddleware(BaseMiddleware):
    def __init__(self):
        self.requests = self.responses = self.exceptions = 0

    def process_request(self, request, crawler, downloader):
        self.requests += 1
        return request

    def process_response(self, response, crawler, downloader):
        self.responses += 1
        return response

    def process_exception(self, exception, crawler, downloader):
        self.exceptions += 1
        return exception
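
Middlewares are passed to the engine alongside the downloader and pipelines. A sketch, assuming the middlewares argument of the engine.Pomp constructor and the MyCrawler class from the first example:

from pomp.core.engine import Pomp
from pomp.contrib.urllibtools import UrllibDownloader

statistics = StatisticMiddleware()

pomp = Pomp(
    downloader=UrllibDownloader(),
    middlewares=(statistics,),
)

pomp.pump(MyCrawler())
print('requests={0.requests} responses={0.responses} exceptions={0.exceptions}'
      .format(statistics))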