pattern.web
The pattern.web module bundles robust tools for online data mining: asynchronous requests, a uniform API for various web services (Google, Bing, Yahoo, Twitter, Wikipedia, Flickr, RSS, Atom), an HTML DOM parser, HTML tag stripping functions, a web crawler, webmail access, caching mechanisms and Unicode support.
It can be used by itself or with other pattern modules: web | db | en | search | vector | graph.
Documentation
- URL downloads
- Asynchronous requests
- Search engine + web services (google, bing, yahoo, twitter, wikipedia, flickr)
- Web sort
- HTML to plaintext
- HTML DOM parser
- PDF parser
- Spider
- Locale
- Cache
URL downloads
The URL object is based on Python's urllib2.Request and offers a robust way of working with online content. It has a URL.download() method that retrieves the content associated with a web address. The method parameter determines how query data is encoded:
- GET: query is encoded in the URL string (typically used for retrieving data).
- POST: query is encoded in the message body (used for posting data).
url = URL(string='', method=GET, query={})
url.string              # u'http://user:pw@domain.com:30/path/page?p=1#anchor'
url.parts               # Dictionary of attributes:
url.protocol            # u'http'
url.username            # u'user'
url.password            # u'pw'
url.domain              # u'domain.com'
url.port                # 30
url.path                # [u'path']
url.page                # u'page'
url.query               # {u'p': 1}
url.querystring         # u'p=1'
url.anchor              # u'anchor'

url.exists              # False if URL.open() raises a HTTP404NotFound.
url.redirect            # Actual URL after redirection, or None.
url.headers             # Dictionary of HTTP response headers.
url.mimetype            # Document MIME-type.

url.open(timeout=10, proxy=None)
url.download(timeout=10, cached=True, throttle=0, proxy=None, unicode=False)
url.copy()
- URL() expects a string starting with a valid protocol (e.g. http://).
- URL.open() returns a connection from which data can be retrieved with connection.read().
- URL.download() will cache the retrieved data locally by default (faster next time).
Raises a URLTimeout if the download takes longer than the given timeout.
Sleeps for throttle seconds after the download is complete.
A proxy server can be given as a (host, protocol)-tuple, e.g., ("proxy.com", "https").
With unicode=True, returns the data as a Unicode string. By default it is False (data can be an image, for example) but unicode=True is advised for HTML.
The following example downloads an image.
The helper function extension() parses the file extension from a file name:
>>> from pattern.web import URL, extension
>>>
>>> url = URL('http://www.clips.ua.ac.be/media/pattern_schema.gif')
>>> f = open('test' + extension(url.page), 'w')  # Save as test.gif.
>>> f.write(url.download())
>>> f.close()
URL exceptions
URL.open() and URL.download() raise a URLError if something goes wrong. This is quite common since a lot of things can fail – no internet connection, server is down, etc. URLError has a number of subclasses that can help you figure out the error:
Exception | Description |
URLError | URL contains errors (e.g. a missing t in htp://) |
URLTimeout | URL takes too long to load. |
HTTPError | URL causes an error on the contacted server. |
HTTP400BadRequest | URL contains an invalid request. |
HTTP401Authentication | URL requires a login and password. |
HTTP403Forbidden | URL is not accessible (check user-agent). |
HTTP404NotFound | URL doesn't exist on the internet. |
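For example, a minimal sketch of handling these exceptions around a download (the URL is illustrative):

>>> from pattern.web import URL, URLError, HTTP404NotFound
>>>
>>> url = URL('http://www.clips.ua.ac.be/pages/no-such-page')
>>> try:
>>>     html = url.download(timeout=10)
>>> except HTTP404NotFound:
>>>     print 'page not found'
>>> except URLError, e:
>>>     print 'download failed:', e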
URL mime-type
URL.mimetype can be used to check the type of content at the given address, which is more reliable than simply looking at the filename extension (which may be missing).
>>> url = URL('http://www.clips.ua.ac.be/media/pattern_schema.gif')
>>> print url.mimetype in MIMETYPE_IMAGE

True
Global | Value |
MIMETYPE_WEBPAGE | ['text/html'] |
MIMETYPE_STYLESHEET | ['text/css'] |
MIMETYPE_PLAINTEXT | ['text/plain'] |
MIMETYPE_PDF | ['application/pdf'] |
MIMETYPE_NEWSFEED | ['application/rss+xml', 'application/atom+xml'] |
MIMETYPE_IMAGE | ['image/gif', 'image/jpeg', 'image/png'] |
MIMETYPE_AUDIO | ['audio/mpeg', 'audio/mp4', 'audio/x-wav'] |
MIMETYPE_VIDEO | ['video/mpeg', 'video/mp4', 'video/quicktime'] |
MIMETYPE_ARCHIVE | ['application/x-tar', 'application/zip'] |
MIMETYPE_SCRIPT | ['application/javascript'] |
User-agent and referrer
URL.open() and URL.download() have two optional parameters, user_agent and referrer, used to identify the application accessing the web. Some websites include code to block any application that is not a browser. By setting a user_agent you can make the application appear as a browser. This is called spoofing and it is not encouraged, though sometimes necessary.
For example, to pose as a Firefox browser:
>>> URL('http://www.clips.ua.ac.be').download(user_agent='Mozilla/5.0')
Find URLs
The find_urls() function can be used to parse URLs from a text string. It works on links starting with http://, https://, www. or on domain names ending in .com, .org or .net. The parser will detect and strip leading punctuation (open parens) and trailing punctuation (period, comma, close parens). Similarly, the find_email() function can be used to parse e-mail addresses from a string.
>>> print find_urls('Visit our website (www.clips.ua.ac.be)', unique=True)

['www.clips.ua.ac.be']
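Along the same lines, a minimal sketch of find_email(); the input string is illustrative, a unique parameter mirroring find_urls() is assumed, and the output shown is indicative:

>>> from pattern.web import find_email
>>>
>>> print find_email('Questions? Mail info@clips.ua.ac.be anytime.', unique=True)

['info@clips.ua.ac.be']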
Asynchronous requests
The asynchronous() function can be used to execute a function in the background. It takes the function, its arguments and optional keyword arguments. It returns an AsynchronousRequest object that stores the given function's return value – once it is done. The main program can continue to run in the meantime.
request = asynchronous(function, *args, **kwargs)
request.done     # True when the function is done.
request.elapsed  # Time running, in seconds.
request.value    # Function return value when done (or None).
request.error    # Function Exception (if any).
request.now() # Waits for function and returns its value.
This is useful for executing a web query without hanging the user interface of your application; you can display a progress bar in the meantime, for example. The example below illustrates this. In a real-world setup you would poll request.done in the application's event loop:
>>> import time
>>> request = asynchronous(Google().search, 'holy grail', timeout=4)
>>> while not request.done:
>>>     time.sleep(0.1)
>>>     print 'busy...'
>>> print request.value
For a number of good reasons, there is no way to interrupt or "kill" a background process (i.e. a Python thread). You are responsible for ensuring that the given function doesn't hang.
Search engine + web services
The SearchEngine object offers a uniform way to address different web services, such as Google and Wikipedia. The SearchEngine.search() method returns a list of Result objects for a given query string – similar to a search field in a browser.
engine = SearchEngine(license=None, throttle=1.0, language=None)
engine.license   # Service license key.
engine.throttle  # Time between requests (being nice to server).
engine.language  # Restriction for Result.language (e.g., 'en').

engine.search(query,
       type = SEARCH,     # SEARCH | IMAGE | NEWS
      start = 1,          # Starting page.
      count = 10,         # Results per page.
       sort = RELEVANCY,  # Result sort order: RELEVANCY | LATEST
       size = None,       # Image size: TINY | SMALL | MEDIUM | LARGE
     cached = True)       # Cache locally?
Note: SearchEngine.search() also takes the same optional parameters as URL.download().
Google | Yahoo | Bing | Twitter | Facebook | Wikipedia | Flickr
SearchEngine is subclassed by Google, Yahoo, Bing, Twitter, Facebook, Wikipedia, Flickr, Newsfeed:
engine = Google(license=None, throttle=0.5, language=None)
engine = Yahoo(license=None, throttle=0.5, language=None)
engine = Bing(license=None, throttle=0.5, language=None)
engine = Twitter(license=None, throttle=0.5, language=None)
engine = Facebook(license=None, throttle=1.0, language='en')
engine = Wikipedia(license=None, throttle=5.0, language=None)
engine = Flickr(license=None, throttle=5.0, language=None)
engine = Newsfeed(license=None, throttle=1.0, language=None)
Each of these has different settings for search(). For example, Twitter.search() returns up to 1500 results for a search term (15 queries with 100 results each, or 150 queries with 10 results each). It has an hourly limit of 150 queries (each call to search() counts as one query).
Engine    | type                  | start        | count | sort             | limit    | throttle |
Google    | SEARCH 1              | 1-100/count  | 1-10  | RELEVANCY        | paid     | 0.5      |
Yahoo     | SEARCH|NEWS|IMAGE 1 2 | 1-1000/count | 1-50  | RELEVANCY        | paid     | 0.5      |
Bing      | SEARCH|NEWS|IMAGE 1 3 | 1-1000/count | 1-50  | RELEVANCY        | paid     | 0.5      |
Twitter   | SEARCH                | 1-1500/count | 1-100 | RELEVANCY        | 150/hour | 0.5      |
Facebook  | SEARCH|NEWS           | 1            | 1-100 | RELEVANCY        | 500/hour | 1.0      |
Wikipedia | SEARCH                | 1            | 1     | RELEVANCY        | -        | 5.0      |
Wikia     | SEARCH                | 1            | 1     | RELEVANCY        | -        | 5.0      |
Flickr    | IMAGE                 | 1+           | 1-500 | RELEVANCY|LATEST | -        | 5.0      |
Newsfeed  | NEWS                  | 1+           | -     | LATEST           | ?        | 1.0      |
1 Google, Yahoo and Bing are paid services – see further how to obtain a license key.
2 Yahoo.search(type=IMAGE) has a count of 1-35.
3 Bing.search(type=NEWS) has a count of 1-15.
Results
SearchEngine.search() returns a list of Result objects. The list has an additional total attribute, an estimate of the total number of results available for the given query. Each Result holds useful information:
result = Result(url)
result.url       # URL of content associated with the given query.
result.title     # Content title.
result.text      # Content summary.
result.language  # Content language.
result.author    # For news items and images.
result.date      # For news items.
result.download(timeout=10, cached=True, proxy=None)
- All attributes are Unicode strings.
- The Result.download() method uses the URL object internally.
It takes the same optional parameters as URL.download().
For example:
>>> engine = Bing(license=None)  # Enter your license key.
>>> for i in range(1, 5):
>>>     for result in engine.search('holy handgrenade', type=SEARCH, start=i):
>>>         print repr(plaintext(result.text))
>>>         print

u"The Holy Hand Grenade of Antioch is a fictional weapon from ..."
u'Once the number three, being the third number, be reached, then ...'
Since SearchEngine.search() takes the same optional parameters as URL.download(), it is easy to disable local caching, set a timeout or a throttle, or use a proxy server if you are behind a firewall:
>>> engine = Google(license=None)  # Enter your license key.
>>> for result in engine.search('tim', cached=False, proxy=('proxy.com', 'https')):
>>>     print result.url
>>>     print result.text
Image search
For Yahoo, Bing and Flickr, image links retrieved with search(type=IMAGE) can be filtered by setting the size parameter to TINY | SMALL | MEDIUM | LARGE (or None for any size). Note that Yahoo image search has a count of 1-35.
For Twitter, each result has a Result.profile property with the URL to the user's profile picture.
The use of downloaded images may be restricted by copyright. Flickr.search() takes an optional copyright=False parameter, which only retrieves results with no copyright restrictions (either under the Creative Commons by-sa license or in the public domain).
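For example, a minimal sketch that saves the first few medium-size image results locally; the query and file names are illustrative and a valid license key is assumed:

>>> from pattern.web import Bing, IMAGE, MEDIUM, URL, extension
>>>
>>> engine = Bing(license=None)  # Enter your license key.
>>> for i, result in enumerate(engine.search('kittens', type=IMAGE, size=MEDIUM, count=5)):
>>>     url = URL(result.url)
>>>     f = open('kitten%i%s' % (i, extension(url.page)), 'w')
>>>     f.write(url.download())
>>>     f.close()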
Service license key
Some services require a license key. They may work without one, but this implies that you share a public license key (and query limit) with all other users of the web module. If you reach the query limit, SearchEngine.search() will raise a SearchEngineLimitError.
Google is a paid service ($5 per 1,000 queries). If you have a license key you get 100 free queries per day. To get a key, follow the link below, activate "Custom Search API" and "Translate API" under "Services" and look up your key under "API Access".
Bing is a paid service ($20 per 10,000 queries). If you have a license key you get 5,000 free queries per month. To get a key, follow the link below.
Yahoo is a paid service ($0.80 per 1,000 queries) that requires an OAuth key + secret, which you can pass as a tuple to: Yahoo(license=(key, secret)).
Obtain a license key for: Google, Bing, Yahoo, Flickr, Facebook.
Service request throttle
A SearchEngine.search() request takes a minimum amount of time to complete, as outlined in the table above. This is intended as etiquette towards the servers providing the service. Be polite and raise the throttle value when you plan to run a lot of queries in batch.
Note that Wikipedia requests are especially intensive. If you plan to mine a lot of data from Wikipedia, download the Wikipedia database instead.
RSS + Atom newsfeeds
The Newsfeed object is a wrapper around Mark Pilgrim's Universal Feed Parser. Newsfeed.search() takes the web address of an RSS or Atom news feed and returns a list of Result objects. For example:
>>> NATURE = 'http://www.nature.com/nature/current_issue/rss/index.html'
>>>
>>> for result in Newsfeed().search(NATURE)[:5]:
>>>     print repr(result.title)

u'Biopiracy rules should not block biological control'
u'Animal behaviour: Same-shaped shoals'
u'Genetics: Fast disease factor'
u'Biomimetics: Material monitors mugginess'
u'Cell biology: Lung lipid hurts breathing'
Newsfeed.search() has an optional parameter tags, which is a list of custom tags to parse:
>>> for result in Newsfeed().search(NATURE, tags=["dc:identifier"]):
>>>     print result.dc_identifier
Google translate
The Google.translate() method returns the translation of a string in the given language.
The Google.identify() method returns a (language code, confidence)-tuple for a given string.
>>> s = "C'est un lapin, lapin de bois. Quoi? Un cadeau." >>> g = Google() >>> print g.translate(s, input='fr', output='en', cached=False) >>> print g.identify(s) u"It's a bunny, rabbit wood. What? A gift." (u'fr', 0.76)
Twitter streams
The Twitter.stream() method returns an endless, live stream of Result objects. A Stream is a Python list that accumulates each time Stream.update() is called:
>>> import time
>>> s = Twitter().stream('#fail')
>>> for i in range(10):
>>>     time.sleep(1)
>>>     s.update(bytes=1024)
>>>     print s[-1].text if s else ""
To clear the accumulated list, call Stream.clear().
Twitter trends
The Twitter.trends() method returns a list of 10 "trending topics":
>>> print Twitter().trends(cached=False)

[u'#neverunderstood', u'Not Top 10', ...]
Wikipedia articles
Wikipedia.search() does not return a list of Result objects. Instead, it returns a single WikipediaArticle for the given (case-sensitive) query – usually the title of an article. Wikipedia.list() returns an iterator over all article titles on Wikipedia. The Wikipedia constructor has an additional language parameter (by default, "en") that determines the language of the returned articles.
article = WikipediaArticle(title='', source='', links=[])
article.source          # Article HTML source.
article.string          # Article plaintext unicode string.

article.title           # Article title.
article.sections        # Article sections.
article.links           # List of titles of linked articles.
article.external        # List of external links.
article.categories      # List of categories.
article.media           # List of linked media (images, sounds, ...)
article.languages       # Dictionary of (language, article)-items.
article.language        # Article language (i.e. 'en').
article.disambiguation  # True if it is a disambiguation page.

article.plaintext(**kwargs)        # See plaintext() for parameter overview.
article.download(media, **kwargs)
WikipediaArticle.plaintext() is similar to the plaintext() function, but with special attention for Wikipedia markup: it removes metadata, the infobox, the table of contents, annotations, thumbnails and disambiguation links.
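For example, a minimal sketch (the article title is illustrative):

>>> from pattern.web import Wikipedia
>>>
>>> article = Wikipedia().search('alice in wonderland')
>>> print article.title
>>> print article.links[:3]
>>> print
>>> print article.plaintext()[:200]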
Wikipedia article sections
WikipediaArticle.sections is a list of WikipediaSection objects. A section has a title and a number of paragraphs that belong together.
section = WikipediaSection(article, title='', start=0, stop=0, level=1)
section.article   # WikipediaArticle parent.
section.parent    # WikipediaSection this section is part of.
section.children  # WikipediaSections belonging to this section.

section.title     # Section title.
section.source    # Section HTML source.
section.string    # Section plaintext unicode string.
section.content   # Section string minus title.
section.level     # Section nested depth.
section.tables    # List of WikipediaTable objects.
The following example downloads a Wikipedia article and prints the title of each section, indented according to the section level:
>>> article = Wikipedia().search('nodebox')
>>> for section in article.sections:
>>>     print repr(' ' * (section.level - 1) + section.title)

u'NodeBox'
u' Features'
u' Supported Primitives'
u' Output'
u'Libraries'
u'NodeBox 2'
u'NodeBox for OpenGL'
u'Applications'
u'See also'
Wikipedia article tables
WikipediaSection.tables is a list of WikipediaTable objects. A table can have a title, column headers, and rows.
table = WikipediaTable(section, title='', headers=[], rows=[], source='')
table.section  # WikipediaSection parent.
table.source   # Table HTML source.
table.title    # Table title.
table.headers  # List of table column headers.
table.rows     # List of table rows, each a list of column values.
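For example, a minimal sketch that prints the title and header row of each table in an article (the article title is illustrative):

>>> from pattern.web import Wikipedia
>>>
>>> article = Wikipedia().search('flamingo')
>>> for section in article.sections:
>>>     for table in section.tables:
>>>         print repr(table.title)
>>>         print table.headers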
Wikia
Wikia is a free hosting service for thousands of wikis. It works in the same way as Wikipedia: both classes inherit from the MediaWiki base class. The Wikia constructor takes the name of a domain on Wikia. Notice the use of Wikia.list(), which returns an iterator over all article titles (Wikipedia has a similar method):
>>> wiki = Wikia(domain='montypython')
>>> for i, title in enumerate(wiki.list(start='a', throttle=1.0, cached=True)):
>>>     if i >= 3:
>>>         break
>>>     article = wiki.search(title)
>>>     print repr(article.title)

u'Albatross'
u'Always Look on the Bright Side of Life'
u'And Now for Something Completely Different'
Facebook posts, comments & likes
Facebook.search(query, type=SEARCH) returns a list of Result objects, where each result is a (publicly available) post matching a given query.
Facebook.search(id, type=NEWS) returns posts from a given user profile, if you use a personal license key. You can get one when you authorize Pattern to search Facebook in your name.
Facebook.search(id, type=COMMENTS) retrieves comments for a given Result.id of a post. You can also pass the id of a post or comment to Facebook.search(id, type=LIKES) to retrieve users that liked it.
>>> fb = Facebook(license='your key')
>>> me = fb.profile(id=None)  # (id, name, date, gender, locale)-tuple
>>>
>>> for post in fb.search(me[0], type=NEWS, count=25):
>>>     print repr(post.id)
>>>     print repr(post.text)
>>>     print repr(post.url)
>>>     if post.comments > 0:
>>>         print '%i comments' % post.comments
>>>         print [(r.text, r.author) for r in fb.search(post.id, type=COMMENTS)]
>>>     if post.likes > 0:
>>>         print '%i likes' % post.likes
>>>         print [r.author for r in fb.search(post.id, type=LIKES)]

u'530415277_10151455896030278'
u'Tom De Smedt likes CLiPS Research Center'
u'http://www.facebook.com/CLiPS.UA'
1 likes
[(u'485942414773810', u'CLiPS Research Center')]
...
Web sort
Interestingly, each list of results returned from SearchEngine.search() has a total property by which we can compare lists. The sort() function sorts the given terms according to search result count.
sort(
    terms = [],       # List of search terms.
  context = '',       # Term used for sorting.
  service = GOOGLE,   # GOOGLE | BING | YAHOO | FLICKR
  license = None,     # Service license key.
   strict = True,     # Wrap query in quotes?
  reverse = False,
   cached = True)
It returns a list of (percentage, term)-tuples for the given list of terms. When a context is defined, sorts according to relevancy to the context: sort(["black", "white"], context="Darth Vader") yields black as the best candidate, because "black Darth Vader" is more common in search results.
Now let's see who is more dangerous:
>>> results = sort(terms=[
>>>     'arnold schwarzenegger',
>>>     'chuck norris',
>>>     'dolph lundgren',
>>>     'steven seagal',
>>>     'sylvester stallone',
>>>     'mickey mouse'], context='dangerous')
>>>
>>> for weight, term in results:
>>>     print "%5.2f" % (weight * 100) + '%', term

43.75% 'dangerous chuck norris'
25.00% 'dangerous arnold schwarzenegger'
12.50% 'dangerous steven seagal'
12.50% 'dangerous mickey mouse'
 6.25% 'dangerous sylvester stallone'
 0.00% 'dangerous dolph lundgren'
HTML to plaintext
Typically, URL.download() is used to retrieve HTML documents. HTML is a markup language that uses tags to define the text formatting. For example: <b>hello</b> displays hello in bold. Usually, we just want the text without the formatting so we can analyze (e.g. parse) it.
The plaintext() function removes HTML formatting from a string.
plaintext(html, keep=[], replace=blocks, linebreaks=2, indentation=False)
It will perform the following steps to clean up the given string:
- Strip javascript: remove all <script> elements.
- Strip CSS: remove all <style> elements.
- Strip comments: remove all <!-- --> elements.
- Strip forms: remove all <form> elements.
- Strip tags: remove all HTML tags.
- Decode entities: replace &lt; with < (for example).
- Collapse spaces: consecutive spaces are replaced with a single space.
- Collapse linebreaks: consecutive linebreaks are replaced with a single linebreak.
- Collapse tabs: consecutive tabs are replaced with a single space; optionally, indentation (tabs at the start of a line) can be preserved.
plaintext tweaks
The keep parameter is a list of tags to keep. By default, element attributes are stripped, e.g. <table border="0"> becomes <table>. To preserve specific attributes, instead of a list a dictionary can be passed: {"a": ["href"]}.
The replace parameter defines how HTML elements are replaced with other characters to improve plain text layout. It is a dictionary of tag → (before, after) items. The default blocks dictionary replaces block elements (<h1>, <h2>, <p>, <div>, <table>, ...) with two linebreaks after, <th> and <tr> with one linebreak after, <td> with one tab after, <li> with an asterisk (*) before and a linebreak after.
The linebreaks parameter defines the maximum amount of consecutive linebreaks to keep.
The indentation parameter defines whether or not to keep tab indentation.
For example, the following script downloads an HTML document and keeps only a minimal amount of formatting (headings, bold, links):
>>> s = URL('http://www.clips.ua.ac.be').download()
>>> s = plaintext(s, keep={'h1': [], 'h2': [], 'strong': [], 'a': ['href']})
>>> print s
plaintext = strip + decode + collapse
The different steps in plaintext() are also available as separate functions:
decode_utf8(string) # Byte string to Unicode string.
encode_utf8(string) # Unicode string to byte string.
strip_tags(html, keep=[], replace=blocks) # Non-trivial, using SGML parser.
strip_between(a, b, string) # Remove anything between (and including) a and b.
strip_javascript(html) # Strips between '<script*>' and '</script>'.
strip_inline_css(html) # Strips between '<style*>' and '</style>'.
strip_comments(html) # Strips between '<!--' and '-->'.
strip_forms(html) # Strips between '<form*>' and '</form>'.
decode_entities(string) # '&lt;' => '<'
encode_entities(string) # '<' => '&lt;'
collapse_spaces(string, indentation=False, replace=' ')
collapse_tabs(string, indentation=False, replace=' ')
collapse_linebreaks(string, threshold=1)
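For example, a minimal sketch that combines a few of these steps by hand instead of calling plaintext(); the HTML string is illustrative and the exact whitespace in the output may differ:

>>> from pattern.web import strip_javascript, strip_tags, decode_entities, collapse_spaces
>>>
>>> html = '<script>alert("x");</script><p>5 &lt; 10   &amp;   10 &gt; 5</p>'
>>> s = strip_javascript(html)  # Remove <script> elements.
>>> s = strip_tags(s)           # Remove remaining tags.
>>> s = decode_entities(s)      # &lt; => <
>>> s = collapse_spaces(s)
>>> print s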
HTML DOM parser
The Document Object Model (DOM) is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. The web module includes an HTML DOM parser (based on Leonard Richardson's BeautifulSoup) that allows you to traverse an HTML document as a tree of linked Python objects. This is useful to extract specific portions from an HTML string retrieved with URL.download().
Node
The DOM consists of a DOM object that contains Text, Comment and Element objects.
All of these are subclasses of Node.
node = Node(html, type=NODE)
node.type      # NODE | TEXT | COMMENT | ELEMENT | DOCUMENT
node.source    # HTML source.
node.parent    # Parent node.
node.children  # List of child nodes.
node.next      # Next child in node.parent (or None).
node.previous  # Previous child in node.parent (or None).
node.traverse(visit=lambda node: None)
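Assuming Node.traverse() calls the given visit function on every node in the subtree, a minimal sketch that collects comment nodes (the URL is illustrative):

>>> from pattern.web import URL, DOM, COMMENT
>>>
>>> dom = DOM(URL('http://www.clips.ua.ac.be').download())
>>> comments = []
>>> dom.traverse(visit=lambda node: comments.append(node) if node.type == COMMENT else None)
>>> print '%i comment nodes' % len(comments)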
Element
Text, Comment and Element are subclasses of Node. For example: 'the <b>cat</b>' is parsed to Text('the') + Element('cat', tag='b'). The Element object has a number of additional properties:
element = Element(html)
element.tag         # Tag name.
element.attributes  # Dictionary of attributes, e.g. {'class':'menu'}.
element.id          # Value for id attribute (or None).

element.source      # HTML source.
element.content     # HTML source minus open and close tag.

element.by_id(str)     # First nested Element with given id.
element.by_tag(str)    # List of nested Elements with given tag name.
element.by_class(str)  # List of nested Elements with given class.
element.by_attribute(**kwargs)
- Element.by_tag() values can include a class (e.g. "div.header") or an id (e.g. "div#content").
A wildcard can be used to match any tag. (e.g. "*.even").
The element is searched recursively (children in children, etc.).
- Element.by_attribute() takes one or more keyword arguments (e.g. name="keywords").
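For example, a minimal sketch that prints a page's description meta tag, if it defines one (the URL is illustrative):

>>> from pattern.web import URL, DOM
>>>
>>> dom = DOM(URL('http://www.clips.ua.ac.be').download())
>>> for meta in dom.by_attribute(name='description'):
>>>     print meta.attributes.get('content', '')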
DOM
The top-level element in the Document Object Model.
dom = DOM(html)
dom.declaration  # <!doctype> TEXT Node.
dom.head         # <head> Element.
dom.body         # <body> Element.
For example, the following script retrieves the last three entries from reddit. The web module does not include a reddit search engine, but we can parse entries directly from the HTML source. This is called screen scraping, and many websites discourage it, since it puts extra pressure on their servers.
>>> url = URL('http://www.reddit.com/top/')
>>> dom = DOM(url.download(cached=True))
>>> for e in dom.by_tag('div.entry')[:3]:  # Top 3 reddit entries.
>>>     for a in e.by_tag('a.title')[:1]:  # First <a class="title">.
>>>         print repr(plaintext(a.content))

u'Invisible Kitty'
u'Naturally, he said yes.'
u"I'd just like to remind everyone that /r/minecraft exists and not everyone wants
 to have 10 Minecraft posts a day on their front page."
Absolute URLs
Links parsed from the DOM may be relative (e.g., starting with "../" instead of "http://").
To get the absolute URL, you can use the abs() function in combination with URL.redirect:
>>> from pattern.web import abs
>>>
>>> url = URL('http://nodebox.net')
>>> dom = DOM(url.download())
>>> for link in dom.by_tag('a'):
>>>     print abs(link.attributes.get('href', ''), base=url.redirect or url.string)
PDF Parser
Portable Document Format (PDF) is a popular open standard for document exchange. The text, fonts, images and layout are contained in a single document that displays the same across systems. However, extracting the source text from a PDF is a non-trivial matter.
The PDF object (based on PDFMiner) parses the source text from a PDF file. This is useful for mining the text, not so much for displaying it (i.e., formatting is lost and some passages may have become garbled).
>>> from pattern.web import URL, PDF
>>>
>>> url = URL('http://www.clips.ua.ac.be/sites/default/files/ctrs-002_0.pdf')
>>> pdf = PDF(url.download())
>>> print pdf.string

CLiPS Technical Report series 002
September 7, 2010
Tom De Smedt, Vincent Van Asch, Walter Daelemans
Computational Linguistics & Psycholinguistics Research Center
...
URLs linking to a PDF document can be identified with: URL.mimetype in MIMETYPE_PDF.
Spider
A web crawler or web spider is used to browse the web in an automated manner. The Spider class is initialized with a list of URLs, which are then visited by the spider. If they lead to a web page (i.e., HTML), the content is parsed for new links, which are added to the list of links scheduled for a visit.
The domains parameter is a list of allowed domain names; an empty list means the spider can visit the entire web. The delay parameter defines the number of seconds to wait before revisiting the same (sub)domain. This is a politeness policy: continually hammering a server with a robot disrupts requests from the website's regular visitors (this is called a denial-of-service attack).
spider = Spider(links=[], domains=[], delay=20.0, parser=HTMLLinkParser().parse)
spider.domains  # Domains allowed to visit (e.g., ['clips.ua.ac.be']).
spider.delay    # Delay between visits to the same (sub)domain.
spider.history  # Dictionary of (domain, time last visited).
spider.visited  # Dictionary of URLs visited.
spider.parse    # Function, returns a list of Links from an HTML string.
spider.sort     # FIFO | LIFO (how new links are queued).
spider.done     # True when all links have been visited.

spider.push(link, priority=1.0, sort=LIFO)
spider.pop(remove=True)
spider.next     # Yields the next scheduled link = Spider.pop(False)

spider.crawl(method=DEPTH)  # DEPTH | BREADTH | None.

spider.priority(link, method=DEPTH)
spider.follow(link)
spider.visit(link, source=None)
spider.fail(link)

spider.normalize(url)
Crawling process
- Spider.crawl() is meant to be called continuously in a loop. It selects a link to visit, and parses its content for new links. The default parser is a robust HTMLLinkParser based on Python's sgmllib.SGMLParser. The method parameter specifies whether the spider prefers to visit internal links (DEPTH) or external links to other domains (BREADTH). If the link is on a domain recently visited (elapsed time < Spider.delay) it is temporarily skipped. If this is undesirable, set an optional throttle parameter of Spider.crawl() to the same value as Spider.delay.
- Spider.priority() is called from Spider.crawl() to determine the priority of a new Link, as a number between 0.0-1.0. Links with higher priority are visited first. Spider.priority() can be overridden in a subclass, for example to demote URLs with a query string (which could just be another sort order). Each URL is passed through Spider.normalize(), which can also be overridden (for example, to strip the query string entirely); see the sketch after this list.
- Spider.follow() is called from Spider.crawl() to determine if it should visit the given Link. By default it yields True, but it can be overridden to disallow selected links.
- Spider.visit() is called from Spider.crawl() once a Link is visited. The source will be an HTML string with the content. By default this method does nothing, but it can be overridden.
- Spider.fail() is called from Spider.crawl() for links whose MIME-type could not be determined or which raised a URLError on download.
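For example, a minimal sketch of the kind of subclass described above; the class name and the 0.5 weighting are illustrative:

>>> from pattern.web import Spider, DEPTH
>>>
>>> class PickySpider(Spider):
>>>     def priority(self, link, method=DEPTH):
>>>         p = Spider.priority(self, link, method)
>>>         return p * 0.5 if '?' in link.url else p  # Demote links with a query string.
>>>     def normalize(self, url):
>>>         return url.split('?')[0]  # Strip the query string entirely.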
The spider uses Link objects internally, which hold additional information besides the URL string:
link = Link(url, text='', relation='')
link.url       # Parsed from <a href=''> attribute.
link.text      # Parsed from <a title=''> attribute.
link.relation  # Parsed from <a rel=''> attribute.
link.referrer  # Parent web page URL.
For example, here is a subclass of Spider that simply prints each link it visits. Since it uses DEPTH for crawling, it will prefer website internal links.
>>> class Spiderling(Spider):
>>>     def visit(self, link, source=None):
>>>         print 'visited:', repr(link.url), 'from:', link.referrer
>>>     def fail(self, link):
>>>         print 'failed:', repr(link.url)
>>>
>>> s = Spiderling(links=['http://www.clips.ua.ac.be/'], delay=5, queue=True)
>>> while not s.done:
>>>     s.crawl(method=DEPTH, cached=False, throttle=5)

visited: u'http://www.clips.ua.ac.be/'
visited: u'http://www.clips.ua.ac.be/#navigation'
visited: u'http://www.clips.ua.ac.be/colloquia'
visited: u'http://www.clips.ua.ac.be/computational-linguistics'
visited: u'http://www.clips.ua.ac.be/contact'
Note: Spider.crawl() takes the same parameters as URL.download(), e.g. cached=False or throttle=10.
E-mail
The Internet Message Access Protocol (IMAP) is a protocol for retrieving e-mail messages from a mail server or webmail service. The Mail object is a wrapper for Python's imaplib. It can be used to search + read e-mail messages from your webmail.
Currently, it supports one service: GMAIL. However, it may work with other services by passing the server address to the service parameter (e.g. service="imap.gmail.com"). Note that you need to enable IMAP from your Gmail account if you want to access it.
With secure=False (no SSL) the default port is 143.
mail = Mail(username, password, service=GMAIL, port=993, secure=True)
mail.folders                             # Dictionary of name => MailFolder.
mail.[folder]                            # For example: Mail.inbox.read(i)
mail.[folder].count                      # Number of messages in folder.

mail.[folder].search(query, field=FROM)  # FROM | SUBJECT | DATE
mail.[folder].read(index, attachments=False, cached=True)
E-mail messages are organized in folders. Mail.folders is a name → MailFolder dictionary. Folders can also be accessed directly by name, as an attribute. Common names include: inbox, spam, trash.
MailFolder.search() returns a list of e-mail indices, latest-first.
MailFolder.read() retrieves the e-mail with given index as a Message:
message = Mail.[folder].read(i)
message.author         # Unicode string, sender name + e-mail address.
message.email_address  # Unicode string, sender e-mail address.
message.date           # Unicode string, date received.
message.subject        # Unicode string, message subject.
message.body           # Unicode string, message body.
message.attachments    # List of (MIME-type, str)-tuples.
For example:
>>> from pattern.web import Mail, GMAIL, SUBJECT
>>>
>>> gmail = Mail(username='me', password='secret', service=GMAIL)
>>> print gmail.folders.keys()

['drafts', 'spam', 'personal', 'work', 'inbox', 'mail', 'starred', 'trash']

>>> i = gmail.spam.search('wish', field=SUBJECT)[0]  # What riches await...
>>> m = gmail.spam.read(i)
>>> print '   From:', m.author
>>> print 'Subject:', m.subject
>>> print 'Message:'
>>> print m.body

   From: Vegas VIP Clib <amllhbmjb@acciongeoda.org>
Subject: Your wish has been granted
Message:
No one has claimed our jackpot! This is your chance to try!
http://www.top-hot-casino.ru
Locale
The helper module pattern.web.locale contains functions for region and language codes, based on the ISO-639 language code (e.g., en), the ISO-3166 region code (e.g., US) and the IETF BCP 47 language-region specification (en-US):
encode_language(name) # 'English' => 'en'
decode_language(code) # 'en' => 'English'
encode_region(name) # 'United States' => 'US'
decode_region(code) # 'US' => 'United States'
languages(region) # 'US' => ['en']
regions(language) # 'en' => ['GB', 'NZ', 'TT', ...]
regionalize(language) # 'en' => ['en-US', 'en-GB', ...]
market(language) # 'en' => 'en-US'
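For example, a minimal sketch; the printed values follow from the mappings listed above:

>>> from pattern.web.locale import encode_language, decode_region, market
>>>
>>> print encode_language('English')  # 'en'
>>> print decode_region('US')         # 'United States'
>>> print market('en')                # 'en-US'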
The geocode() function recognizes a number of world capital cities and returns a tuple (latitude, longitude, ISO-639, region).
geocode(location) # 'Brussels' => (50.83, 4.33, u'nl', u'Belgium')
This is useful in combination with the geo parameter for Twitter.search() to obtain regional tweets:
>>> from pattern.web import Twitter
>>> from pattern.web.locale import geocode
>>>
>>> twitter = Twitter(language='en')
>>> for tweet in twitter.search('restaurant', geo=geocode('Brussels')[:2]):
>>>     print tweet.text

u'Did you know: every McDonalds restaurant has free internet in Belgium...'
Cache
By default, URL.download() and SearchEngine.search() will cache results locally, so that there is no need to connect to the internet once a query has been cached. Over time the cache can grow quite large, filled with whatever was downloaded – from Wikipedia pages to zip archives.
Emptying it is easy (and permanent):
>>> from pattern.web import cache
>>> cache.clear()
See also
- BeautifulSoup (BSD): robust HTML parser for Python.
- Scrapy (BSD): screen scraping and web crawling with Python.