Link extractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects), which will eventually be followed.
There are two Link Extractors available in Scrapy by default, but you can create your own custom Link Extractors to suit your needs by implementing a simple interface.
The only public method that every LinkExtractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link Extractors are meant to be instantiated once and their extract_links method called several times with different responses, to extract links to follow.
Link extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even if you don’t subclass from CrawlSpider, as their purpose is very simple: to extract links.
All available link extractors classes bundled with Scrapy are provided in the scrapy.contrib.linkextractors module.
If you don’t know which link extractor to choose, just use the default, which is the same as LxmlLinkExtractor (see below):
from scrapy.contrib.linkextractors import LinkExtractor
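For instance, here is a minimal sketch of using a LinkExtractor inside a plain Spider and following every extracted link (the spider name, start URL and callback wiring are placeholders for illustration, not part of Scrapy):

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request
from scrapy.spider import Spider

class FollowLinksSpider(Spider):
    # Hypothetical spider used only to illustrate extract_links();
    # the name and start URL are made-up values.
    name = 'follow_links'
    start_urls = ['http://www.example.com/']

    # Instantiate the extractor once and reuse it for every response.
    link_extractor = LinkExtractor()

    def parse(self, response):
        # extract_links() returns a list of scrapy.link.Link objects,
        # each with .url and .text attributes.
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse)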
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.
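As a hedged example of those filtering options, the sketch below passes the commonly documented allow, deny, allow_domains and restrict_xpaths keyword arguments; the patterns, domain and XPath are made-up values:

from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

# Follow only /category/ links, skip /private/ ones, stay on example.com
# and only look inside the main content area of the page.
extractor = LxmlLinkExtractor(
    allow=(r'/category/\w+', ),
    deny=(r'/private/', ),
    allow_domains=('example.com', ),
    restrict_xpaths=('//div[@id="content"]', ),
)

# links = extractor.extract_links(response)  # list of scrapy.link.Link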
Warning
SGMLParser-based link extractors are unmaintained and their use is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor.
The SgmlLinkExtractor is built upon the base BaseSgmlLinkExtractor and provides additional filters that you can specify to extract links, including regular expression patterns that the links must match to be extracted. All those filters are configured through the constructor parameters (see the sketch below).
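For reference, a minimal sketch of those filters using the commonly documented keyword arguments (the regular expressions, domain, tags and attributes below are placeholder values); the same keyword arguments also work with LxmlLinkExtractor, the recommended replacement:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Deprecated extractor, shown only to illustrate the filter arguments;
# prefer LxmlLinkExtractor, which accepts the same kind of filters.
extractor = SgmlLinkExtractor(
    allow=(r'item\.php\?id=\d+', ),   # regexes the URLs must match
    deny=(r'logout', ),               # regexes that exclude URLs
    deny_domains=('ads.example.com', ),
    tags=('a', 'area'),               # tags scanned for links
    attrs=('href', ),                 # attributes holding the URLs
    unique=True,                      # drop duplicate links
)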
The purpose of BaseSgmlLinkExtractor is only to serve as a base class for SgmlLinkExtractor. You should use that one instead.
Its constructor arguments are illustrated in the sketch below.
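Here is a minimal sketch of the low-level arguments this base class is commonly documented to accept (tag, attr, unique and process_value); the process_value callback, which unwraps a javascript: pseudo-link, is a hypothetical helper written for illustration:

import re

from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor

def unwrap_js_link(value):
    # Hypothetical helper: pull the real URL out of attribute values such
    # as "javascript:goToPage('/page.html')"; returning None skips the link.
    match = re.search(r"javascript:goToPage\('(.*?)'\)", value)
    return match.group(1) if match else None

extractor = BaseSgmlLinkExtractor(
    tag='a',                       # tag to look at
    attr='href',                   # attribute holding the URL
    unique=True,                   # de-duplicate extracted links
    process_value=unwrap_js_link,  # filter/transform each extracted value
)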