API

This part of the documentation covers all the public classes and functions in pomp.

Contrib

Urllib

Downloader and middleware implementations.

  • Downloaders: fetch data with the standard urllib.request.urlopen (Python 3.x) or urllib2.urlopen (Python 2.7)

class pomp.contrib.urllibtools.UrllibAdapterMiddleware

Middleware for adapting urllib.Request to pomp.core.base.BaseHttpRequest

class pomp.contrib.urllibtools.UrllibDownloader(timeout=None)

Simplest downloader

Parameters: timeout – request timeout in seconds
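
The kind of fetch this downloader wraps can be sketched with the standard library alone. This is a minimal Python 3 sketch, not pomp's implementation: the fetch helper is hypothetical, and a data: URL is used so that the example runs without network access.

```python
import urllib.request


def fetch(url, timeout=None):
    """Hypothetical helper: fetch a URL body with urllib.request.urlopen."""
    request = urllib.request.Request(url)
    # Passing timeout=None to urlopen would disable the timeout entirely,
    # so the argument is only forwarded when a value was given.
    kwargs = {} if timeout is None else {"timeout": timeout}
    with urllib.request.urlopen(request, **kwargs) as response:
        return response.read()


# A data: URL keeps the sketch offline; a real crawl would use http(s) URLs.
body = fetch("data:text/plain,hello", timeout=5)
```

The timeout here is in seconds, as in the UrllibDownloader parameter described above.
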
class pomp.contrib.urllibtools.UrllibHttpRequest(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Adapter for urllib request to pomp.core.base.BaseHttpRequest

class pomp.contrib.urllibtools.UrllibHttpResponse(request, response)

Adapter for urllib response to pomp.core.base.BaseHttpResponse

Concurrent future

Concurrent downloaders

class pomp.contrib.concurrenttools.ConcurrentCrawler(worker_class, worker_kwargs=None, pool_size=5)

Concurrent ProcessPoolExecutor crawler

Parameters:
  • pool_size – pool size of ProcessPoolExecutor
  • timeout – request timeout in seconds
class pomp.contrib.concurrenttools.ConcurrentDownloader(worker_class, worker_kwargs=None, pool_size=5)

Concurrent ProcessPoolExecutor downloader

Parameters:
  • pool_size – pool size of ProcessPoolExecutor
  • timeout – request timeout in seconds
class pomp.contrib.concurrenttools.ConcurrentUrllibDownloader(pool_size=5, timeout=None)

Concurrent ProcessPoolExecutor downloader that fetches data with urllib; see pomp.contrib.SimpleDownloader

Parameters:
  • pool_size – pool size of ProcessPoolExecutor
  • timeout – request timeout in seconds

Simple pipelines

class pomp.contrib.pipelines.CsvPipeline(output_file, *args, **kwargs)

Save items to CSV format

The *args and **kwargs parameters are passed to the csv.writer constructor.

Parameters: output_file – a filename or a file-like object. If output_file is a file-like object, the file remains open after the pipe is stopped.
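
The open/close rule for output_file can be sketched as follows. SketchCsvPipeline is a hypothetical stand-in written for illustration, not pomp's CsvPipeline; it assumes items are dicts and only shows that a file opened from a filename is closed on stop, while a caller-supplied file-like object is left open.

```python
import csv
import io


class SketchCsvPipeline(object):
    """Hypothetical stand-in, not pomp's CsvPipeline."""

    def __init__(self, output_file, *args, **kwargs):
        self.output_file = output_file
        self._csv_args = args
        self._csv_kwargs = kwargs
        # Only a file opened by the pipe itself is closed by the pipe.
        self._owns_file = isinstance(output_file, str)

    def start(self, crawler):
        if self._owns_file:
            self._file = open(self.output_file, "w", newline="")
        else:
            self._file = self.output_file
        # Extra arguments go straight to csv.writer.
        self._writer = csv.writer(
            self._file, *self._csv_args, **self._csv_kwargs)

    def process(self, crawler, item):
        self._writer.writerow(list(item.values()))
        return item

    def stop(self, crawler):
        if self._owns_file:
            self._file.close()


buffer = io.StringIO()
pipe = SketchCsvPipeline(buffer, delimiter=";")
pipe.start(crawler=None)
pipe.process(None, {"title": "pomp", "stars": 1})
pipe.stop(crawler=None)
```

After stop, buffer is still open and can be read by the caller.
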
class pomp.contrib.pipelines.UnicodeCsvWriter(f, dialect=<class 'csv.excel'>, encoding='utf-8', **kwds)

A CSV writer that writes rows to CSV file f with the given encoding.

Item and Field

class pomp.contrib.item.Item(*args, **kwargs)

OrderedDict subclass

Engine

class pomp.core.engine.Pomp(downloader, middlewares=None, pipelines=None, queue=None, breadth_first=False)

Configuration object

This class glues together all parts of a Pomp instance:

  • Downloader implementation and middleware
  • Item pipelines
  • Crawler
Parameters:
LOCK_FACTORY()

allocate_lock() -> lock object (allocate() is an obsolete synonym)

Create a new lock object. See help(type(threading.Lock())) for information about locks.

pump(crawler)

Start crawling

Parameters: crawler – instance of pomp.core.base.BaseCrawler
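
How the parts fit together can be sketched with a simplified, synchronous loop. This is an illustration of the idea only, not pomp's engine: pump_sketch, FakeDownloader, SketchCrawler and UpperCasePipeline are hypothetical names, requests are plain strings, and responses are plain dicts rather than BaseHttpResponse instances.

```python
from collections import deque


class FakeDownloader(object):
    """Hypothetical downloader: returns canned bodies instead of doing HTTP."""
    PAGES = {"/a": "A links:/b", "/b": "B"}

    def process(self, crawler, request):
        return {"request": request, "body": self.PAGES[request]}


class SketchCrawler(object):
    ENTRY_REQUESTS = ["/a"]

    def extract_items(self, response):
        return response["body"].split(" ")[0]

    def next_requests(self, response):
        _, _, link = response["body"].partition("links:")
        return link or None  # None means nothing to follow


class UpperCasePipeline(object):
    def process(self, crawler, item):
        return item.upper()


def pump_sketch(crawler, downloader, pipelines=()):
    """Simplified glue loop: download, extract, pipe, follow up."""
    queue = deque(crawler.ENTRY_REQUESTS)
    collected = []
    while queue:
        response = downloader.process(crawler, queue.popleft())
        item = crawler.extract_items(response)
        for pipe in pipelines:
            item = pipe.process(crawler, item)
            if item is None:  # a pipe may drop the item
                break
        if item is not None:
            collected.append(item)
        follow_up = crawler.next_requests(response)
        if follow_up is not None:
            queue.append(follow_up)
    return collected


items = pump_sketch(SketchCrawler(), FakeDownloader(), [UpperCasePipeline()])
```

Middlewares, concurrency and awaitable methods are left out to keep the sketch short.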

Interfaces

Base classes

Note

All classes in this package must be subclassed.

exception pomp.core.base.BaseCrawlException(request=None, response=None, exception=None, exc_info=None)

Download exception interface

Parameters:
  • request – the request that raised this exception
  • response – the response that raised this exception
  • exception – original exception
  • exc_info – result of the sys.exc_info() call

class pomp.core.base.BaseCrawler

Crawler interface

The crawler must implement two tasks:

  • Extract data from response
  • Extract urls from response for follow-up processing

Each crawler must have one or more url starting points. To set the entry urls, declare them as class attribute ENTRY_REQUESTS:

class MyGoogleCrawler(BaseCrawler):
    ENTRY_REQUESTS = 'http://google.com/'
    ...

ENTRY_REQUESTS may also be a list of URLs or a list of requests (instances of BaseHttpRequest).

extract_items(response)

Parse page and extract items.

May be awaitable.

Parameters: response – the instance of BaseHttpResponse
Return type: an item or items of any type (for example pomp.contrib.item.Item), or a request or requests (instances of BaseHttpRequest, or strings that will be processed as requests)

next_requests(response)

Returns follow-up requests for processing.

Called after the extract_items method.

May be awaitable.

Note: Subclasses are not required to implement this method. Follow-up requests may instead be returned together with items from the extract_items method.
Parameters: response – the instance of BaseHttpResponse
Return type: None, a request, or requests (instances of BaseHttpRequest or str). None indicates that this page has no URLs for follow-up processing.

on_processing_done(response)

Called when the request/response has been fully processed by the middlewares, this crawler, and the pipelines.

May be awaitable.

Parameters: response – the instance of BaseHttpResponse
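
A crawler covering both tasks might look like the sketch below. It is a minimal example under stated assumptions: FakeResponse and its body attribute are hypothetical (this document does not list the attributes of BaseHttpResponse), and a stand-in base class is used when pomp is not installed so that the example runs on its own.

```python
import re

try:
    from pomp.core.base import BaseCrawler
except ImportError:
    # Stand-in so the sketch runs without pomp installed.
    class BaseCrawler(object):
        pass


class FakeResponse(object):
    """Hypothetical response; `body` is an assumed attribute name."""

    def __init__(self, body):
        self.body = body


class LinkCrawler(BaseCrawler):
    # A list of URLs is one of the documented forms of ENTRY_REQUESTS.
    ENTRY_REQUESTS = ["http://example.com/"]

    def extract_items(self, response):
        # Items may be of any type; plain dicts are used here.
        titles = re.findall(r"<h1>(.*?)</h1>", response.body)
        return [{"title": title} for title in titles]

    def next_requests(self, response):
        # Strings are accepted as follow-up requests.
        # None means the page has no URLs to follow.
        links = re.findall(r'href="(.*?)"', response.body)
        return links or None


crawler = LinkCrawler()
page = FakeResponse('<h1>Hello</h1><a href="http://example.com/next">next</a>')
items = crawler.extract_items(page)
links = crawler.next_requests(page)
```
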
class pomp.core.base.BaseDownloadWorker

Download worker interface

process(request)

Execute request

May be awaitable.

Parameters: request – instance of BaseHttpRequest
Return type: instance of BaseHttpResponse or BaseCrawlException or Planned or asyncio.Future for async behavior

class pomp.core.base.BaseDownloader

Downloader interface

The downloader must implement one task:

  • make http request and fetch response.

get_workers_count()

Return type: count of workers (pool size), by default 0

process(crawler, request)

Execute request

May be awaitable.

Parameters:
  • crawler – crawler that extracts items
  • request – instance of BaseHttpRequest
Return type:

instance of BaseHttpResponse or BaseCrawlException or Planned or asyncio.Future object for async behavior

start(crawler)

Prepare downloader before processing starts.

May be awaitable.

Parameters: crawler – crawler that extracts items

stop(crawler)

Stop downloader.

May be awaitable.

Parameters: crawler – crawler that extracts items
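
The downloader contract can be sketched without any HTTP at all. DictDownloader below is hypothetical: it serves pages from a dict, returns plain strings where a real downloader would return BaseHttpResponse instances, and falls back to stand-in base classes when pomp is not installed.

```python
import sys

try:
    from pomp.core.base import BaseCrawlException, BaseDownloader
except ImportError:
    # Stand-ins so the sketch runs without pomp installed.
    class BaseDownloader(object):
        pass

    class BaseCrawlException(Exception):
        def __init__(self, request=None, response=None,
                     exception=None, exc_info=None):
            self.request = request
            self.response = response
            self.exception = exception
            self.exc_info = exc_info


class DictDownloader(BaseDownloader):
    """Hypothetical downloader serving pages from a dict instead of HTTP."""

    def __init__(self, pages):
        self.pages = pages

    def get_workers_count(self):
        return 0  # no worker pool, matching the documented default

    def process(self, crawler, request):
        try:
            return self.pages[request]
        except KeyError as exc:
            # The documented return type includes BaseCrawlException,
            # so the failure is returned rather than raised.
            return BaseCrawlException(
                request=request, exception=exc, exc_info=sys.exc_info())


downloader = DictDownloader({"http://example.com/": "<html>home</html>"})
ok = downloader.process(None, "http://example.com/")
failed = downloader.process(None, "http://example.com/missing")
```
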
class pomp.core.base.BaseHttpRequest

HTTP request interface

class pomp.core.base.BaseHttpResponse

HTTP response interface

get_request()

Returns the request, an instance of BaseHttpRequest.

class pomp.core.base.BaseMiddleware

Middleware interface

process_exception(exception, crawler, downloader)

Handle exception

May be awaitable.

Parameters:
Return type:

changed response or None to skip processing of this exception

process_request(request, crawler, downloader)

Modify the request before it is executed by the downloader.

May be awaitable.

Parameters:
Return type:

changed request or None to skip execution of this request

process_response(response, crawler, downloader)

Modify response before content is extracted by the crawler.

May be awaitable.

Parameters:
Return type:

changed response or None to skip processing of this response
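
A middleware using the request and response hooks might look like this. SkipPdfMiddleware is a hypothetical example; requests and responses are plain strings here for brevity, and a stand-in base class is used when pomp is not installed.

```python
try:
    from pomp.core.base import BaseMiddleware
except ImportError:
    # Stand-in so the sketch runs without pomp installed.
    class BaseMiddleware(object):
        pass


class SkipPdfMiddleware(BaseMiddleware):
    """Hypothetical middleware working on plain-string requests/responses."""

    def process_request(self, request, crawler, downloader):
        # Returning None skips execution of this request.
        if request.endswith(".pdf"):
            return None
        return request

    def process_response(self, response, crawler, downloader):
        # The returned (possibly changed) response is passed on.
        return response.strip()


mw = SkipPdfMiddleware()
skipped = mw.process_request("http://example.com/a.pdf", None, None)
kept = mw.process_request("http://example.com/a.html", None, None)
cleaned = mw.process_response("  body\n", None, None)
```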

class pomp.core.base.BasePipeline

Pipeline interface

The functions of a pipe are to:

  • filter items
  • change items
  • store items
process(crawler, item)

Process extracted item

May be awaitable.

Parameters:
  • crawler – crawler that extracts items
  • item – extracted item
Return type:

item or None if this item is to be skipped

start(crawler)

Initialize pipe

Open files and database connections, etc.

May be awaitable.

Parameters: crawler – crawler that extracts items

stop(crawler)

Finalize pipe

Close files and database connections, etc.

May be awaitable.

Parameters: crawler – crawler that extracts items
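
The three functions of a pipe (filter, change, store) can be sketched in one class. PriceFilterPipeline is hypothetical and assumes dict items with a price key; an in-memory list stands in for a file or database, and a stand-in base class is used when pomp is not installed.

```python
try:
    from pomp.core.base import BasePipeline
except ImportError:
    # Stand-in so the sketch runs without pomp installed.
    class BasePipeline(object):
        pass


class PriceFilterPipeline(BasePipeline):
    """Hypothetical pipe: filters, changes and stores dict items."""

    def start(self, crawler):
        self.stored = []  # stands in for a file or database connection

    def process(self, crawler, item):
        if item.get("price") is None:
            return None  # filter: skip items without a price
        item["price"] = round(float(item["price"]), 2)  # change
        self.stored.append(item)  # store
        return item

    def stop(self, crawler):
        self.closed = True  # a real pipe would close its resources here


pipe = PriceFilterPipeline()
pipe.start(crawler=None)
kept = pipe.process(None, {"name": "book", "price": "9.5"})
dropped = pipe.process(None, {"name": "pen", "price": None})
pipe.stop(crawler=None)
```
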
class pomp.core.base.BaseQueue

Blocking queue interface

get_requests(count=None)

Get from queue

Note

must block execution until item is available

Parameters: count – number of requests to be processed by the downloader in concurrent mode; None means the downloader has no concurrency (no workers). This parameter can be ignored.
Return type: instance of BaseRequest or Planned or list of them

put_requests(requests)

Put to queue

Parameters: requests – instance of BaseRequest or list of them
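
A blocking queue that satisfies this interface can be built on the standard library. InMemoryQueue is a hypothetical Python 3 sketch: requests are plain strings, count is ignored (which the interface allows), and a stand-in base class is used when pomp is not installed.

```python
import queue

try:
    from pomp.core.base import BaseQueue
except ImportError:
    # Stand-in so the sketch runs without pomp installed.
    class BaseQueue(object):
        pass


class InMemoryQueue(BaseQueue):
    """Hypothetical blocking queue built on queue.Queue."""

    def __init__(self):
        self._queue = queue.Queue()

    def get_requests(self, count=None):
        # Queue.get() blocks until an item is available,
        # which is what the interface requires.
        return self._queue.get()

    def put_requests(self, requests):
        # Accept either a single request or a list of them.
        if not isinstance(requests, (list, tuple)):
            requests = [requests]
        for request in requests:
            self._queue.put(request)


q = InMemoryQueue()
q.put_requests(["http://example.com/a", "http://example.com/b"])
q.put_requests("http://example.com/c")
first = q.get_requests()
```
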
class pomp.core.base.BaseRequest

Request interface

class pomp.core.base.BaseResponse

Response interface

Utils

exception pomp.core.utils.CancelledError

The Planned was cancelled.

exception pomp.core.utils.Error

Base class for all planned-related exceptions.

exception pomp.core.utils.NotDoneYetError

The Planned was not completed.

class pomp.core.utils.Planned

Clone of Future object, but without thread conditions (locks).

Represents the result of an asynchronous computation.

add_done_callback(fn)

Attaches a callable that will be called when the future finishes.

Args:
  fn: A callable that will be called with this future as its only argument when the future completes or is cancelled. If the future has already completed or been cancelled then the callable will be called immediately. These callables are called in the order that they were added.

cancel()

Cancel the future if possible.

Returns True if the future was cancelled, False otherwise. A future cannot be cancelled if it is running or has already completed.

cancelled()

Return True if the future was cancelled.

done()

Return True if the future was cancelled or finished executing.

result()

Return the result of the call that the future represents.

Returns:
  The result of the call that the future represents.
Raises:
  CancelledError: If the future was cancelled.
  Exception: If the call raised then that exception will be raised.

set_result(result)

Sets the return value of work associated with the future.

Should only be used by Executor implementations and unit tests.
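
Since Planned is described as a clone of Future, the method semantics above can be tried out on concurrent.futures.Future from the standard library. The sketch below uses that class as a stand-in; that Planned behaves identically for these calls is an assumption based on the descriptions above, not something this document confirms.

```python
from concurrent.futures import Future

calls = []

future = Future()
future.add_done_callback(lambda f: calls.append("first"))
future.add_done_callback(lambda f: calls.append("second"))

pending = future.done()   # nothing has been set yet
future.set_result(42)     # callbacks run now, in the order they were added
value = future.result()

# A callback added after completion is called immediately.
future.add_done_callback(lambda f: calls.append("late"))

# A finished future can no longer be cancelled.
cancel_attempt = future.cancel()
```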