Web Access with Perl's LWP Modules
Sean Burke has published a new article on perl.com about LWP, drawn mostly from his LWP book. That article is up to date and well written. Highly recommended!
- LWP: Lib Web Perl
- Needed Versions
- About the Docs
- LWP::Simple and GETting URLs
- Perspective on LWP::Simple
- LWP::Simple's get($url)
- LWP::Simple's head($url) Function
- head()-based Link Checker
- The Beginning of HTTP Hassles
- A Google Frequency Reporter
- Running a Google Search
- Coding it Up
- ...And How That Looks
- URL-encoding
- URI::Escape
- Using URI::Escape
- LWP::Simple in Conclusion
- HTTP Basics
- HTTP Session: GET, and 200
- HTTP Session: GET, and 404
- HTTP Session: POST, and 200
- HTTP Session: GET, and 301
- Common HTTP Status Codes
- LWP Classes
- OOP Basics
- Basics of Objects in Perl
- OOP Jargon
- Class Details
- An Object's Meaning & Function
- An Object's Meaning & Function (b)
- An Object's Meaning & Function (c)
- Each Object's Attributes
- Each Object's Attributes (b)
- Each Object's Attributes (c)
- OOP in Action
- OOP Details
- Where to Learn More About Perl OOP
- Perl OOP Not-Oddities
- LWP OOP Oddities
- LWP OOP Oddities (b)
- LWP Class Model
- LWP::UserAgent
- HTTP::Response
- Simple ->get
- Simple ->head
- Wait, $browser->cookie_jar({}) ??
- $browser->cookie_jar({}) !
- $browser->cookie_jar({}) !!
- $browser->cookie_jar({}) !!!
- Must-Read LWP Docs
- Behind the Scenes: HTTP::Request
- HTTP::Request
- HTTP::Request (b)
- $browser->request
- $browser->request (b)
- Behind the Scenes: $browser->request
- ->request internals
- What's $browser->simple_request?
- ->simple_request internals
- LWP Access examples
- Bookmark Link Checker
- Matching Links
- Actual Useful Working Code!
- ...And How That Looks
- Noticing Redirection
- ...And How That Looks
- Primitive Remote Link Checker
- And a Checker Procedure
- Checking Just Absolute HTTP URLs
- ...And How That Looks
- The URI Class!
- URI Vitals
- URI Stringification
- Parts of a URI
- Using URI
- Making Our Checker a Bit Smarter
- ...And How That Looks
- Interfacing to Babelfish via POST
- POSTing data
- ->post Syntax
- Capturing the Output
- Add Interface Code...
- Double-Translation
- HTML Processing
- HTML Concepts
- Rudimentary SGML Concepts
- XML Working Concepts
- SGML Basics
- Back to HTML
- On Specificity in Specifications
- A Table in XML
- A Table in HTML
- HTML Hassles
- Overview of HTML::* Modules
- Making Do With No Module
- Getting Away with Regexps!
- Giving up the Regexps
- HTML::Parser
- HTML::TokeParser
- The Token View
- When Tokens are Fine
- Sample HTML::TokeParser Code
- HTML::TokeParser Vitals
- HTML::Tree
- HTML::Tree Features
- HTML::Tree
- HTML::Element Vitals
- HTML::TreeBuilder Vitals
- Lifecycle of an HTML::TreeBuilder object
- HTML::Element Methods
- Relationship Methods
- Dumping Methods
- Detaching/Deletion Methods
- Constructor Methods
- Searching Methods
- More on ->look_down
- Yet More on ->look_down
- Alternative Approach: Positional Selection
- Alternative Approach: Selection by "class" attribute
- look_down Case Study: H1-Matching
- Headline-Matching 1
- Headline-Matching 2
- More look_down Trouble
- Headline-Matching 3
- Headline-Matching 4
- Future Developments
- Future of gopher
- Future of HTTP
- Future of URI/URN/URLs
- Future of evil evil JavaScript
- Future of JPEG, PNG, GIF, Flash
- Future of music formats: MIDI, RealAudio, MP3
- Future of voice formats: RealAudio, MP3, etc.
- (Future of video formats?)
- Future of PDF
- Future of HTML
- (Future of CSS)
- Future of XML
- __END__
#1
Web Access with Perl's LWP Modules
Sean M. Burke
sburke@cpan.org
The Perl Conference, 2001
#2
LWP: Lib Web Perl
A bunch of open-source Perl modules (available in CPAN) for getting
and parsing data from web sites.
- The module LWP::Simple
- The classes LWP::* and HTTP::*
- The class URI
- The classes HTML::*
Yes, they're mostly classes.
LWP is rather OOPy; but users
who are not at home with OOP can get along fine.
#3
Needed Versions
- LWP version 5.5394 or later
- Latest version of HTML::Tree
- Recent version of URI.
- Some version of HTML::Parser (newer is faster!)
#4
About the Docs
Every module that I discuss has documentation embedded as POD,
readable with perldoc (or perlman, etc). E.g.: perldoc HTML::Element
Or you can look at the docs as web pages at
http://search.cpan.org
Aside from a helpful overview document called "lwpcook", most of
the documentation is meant as a reference, not as a tutorial.
#5
LWP::Simple and GETting URLs
LWP::Simple is a module that provides functions for GETting URLs.
Its functions, in brief:
- get($url) -- returns what's at that URL, or undef on
failure.
- getprint($url) -- to STDOUT, prints what's at that URL.
On failure, prints an error message to STDERR. Returns the HTTP
status code (feeding that to is_success($status) gives you true for
success, false for failure).
- getstore($url, $filespec) -- saves what's at that URL to
a file. Returns the HTTP status code.
- mirror($url, $filespec) -- saves what's at that URL to a
file, but avoids a full re-transfer if the local file is up to
date. Returns the HTTP status code.
- head($url) -- makes a HEAD request. In scalar
context, returns true for success, false on failure. In list
context, returns ($content_type, $content_length, $modified_time,
$expires, $server), or () on failure.
#6
Perspective on LWP::Simple
The concepts underlying those functions:
- GET an object (except for the head(...) function)
- what to do with the object's content (save it? return it? print
it to STDOUT?)
- how to signal success or failure (return undef? return false?
return an HTTP status code?)
The most "basic" of those functions is get().
#7
LWP::Simple's get($url)
Basic use:
my $content = get('http://www.guardian.co.uk/');
Example use:
use LWP::Simple;
use strict;
my $content = get('http://www.guardian.co.uk/');
# the main page of a UK newspaper
die "Hm, couldn't get Guardian!" unless defined $content;
foreach my $keyword (qw( GM Intel Ital Canad Mexic)) {
print "$keyword!\n" if $content =~ m/\b\Q$keyword/;
}
print "\n[End at ", scalar( localtime ), "]\n";
exit;
#8
LWP::Simple's head($url) Function
A HEAD request is like a GET request, except that the server's
response omits the actual message body. The server sends back just
the MIME headers.
$whether_successful = head($url);
or:
($content_type, $content_length,
$modified_time, $expires, $server
) = head($url);
Or if you want only parts of the return list, take a list-slice:
($content_length, $mod_time) = ( head($url) )[1,2];
(Note that if head fails and returns an empty list, that
sets both $content_length and $mod_time to undef.)
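To see why the slice is safe even on failure, here is a small sketch that swaps in a stand-in for head(), so it runs without a network. The name fake_head and the values it returns are made up for illustration; only the list-slice behavior is the point.

```perl
use strict;
use warnings;

# Hypothetical stand-in for LWP::Simple's head(): returns a five-element
# list on "success", or an empty list on "failure". Values are invented.
sub fake_head {
    my ($ok) = @_;
    return $ok ? ('text/html', 1234, 993945600, undef, 'NCSA 3.9') : ();
}

# Success: the slice picks elements 1 and 2 of the returned list.
my ($content_length, $mod_time) = ( fake_head(1) )[1,2];
print "length=$content_length, modified=$mod_time\n";

# Failure: slicing an empty list gives an empty list, so both stay undef.
my ($len2, $mod2) = ( fake_head(0) )[1,2];
print "both undef after failure\n"
    if !defined $len2 and !defined $mod2;
```

So after a slice, test the variables with defined() before using them.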
#9
head()-based Link Checker
Simple link checker:
use strict;
use LWP::Simple;
my @url_to_check = @ARGV; # assumption: URLs come from the command line
foreach my $url (@url_to_check) {
print "$url is no good\n"
unless scalar head($url);
}
#10
The Beginning of HTTP Hassles
Although it's rare these days, there are some servers that don't
understand HEAD requests on static objects (files).
There are many more CGIs that don't deal with HEAD requests. You
might get the unsuccessful status code 405 (Method Not Allowed).
Or who knows, maybe you'll get a 500 error (general
server/network error)!
Or the CGI might reply as if to a GET, and the server may or may
not trim the content.
#11
A Google Frequency Reporter
For each search term given, run a Google search on it, find the
bit that says:
"Results 1 - 10 of about 2,760. Search took 0.09 seconds."
and report just the number.
#12
Running a Google Search
Going to Google and running a search on "stuff" gives us a URL
like this:
http://www.google.com/search?q=%22stuff%22&btnG=Google+Search
Since we can paste that URL into the browser and have it work,
it must be a GET URL. As a function of $word, we can
model it with:
$content = get(
'http://www.google.com/search?q=%22'
. $word . '%22&btnG=Google+Search'
);
...which returns HTML source, which should contain either
the string "did not match any documents", or a string
like "of about <b>([0-9,]+)</b>".
#13
Coding it Up
use strict;
use LWP::Simple;
foreach my $q (@ARGV) { report_google_count($q) }
sub report_google_count {
my $word = $_[0];
my $url =
'http://www.google.com/search?q=%22'
. $word . '%22&btnG=Google+Search'
;
my $content = get($url);
if(!defined $content) {
print "$word: NOGO $url\n";
} elsif($content =~ m/did not match any documents/) {
print "$word: 0 matches\n";
} elsif($content =~ m/of about <b>([0-9,]+)<\/b>/) {
print "$word: $1 matches\n"; # like "1,952"
} else {
print "$word: Page not processable, at $url\n";
}
}
#14
...And How That Looks
% perl woogle.pl asafetida asafoetida
asafetida: 2,760 matches
asafoetida: 7,850 matches
#15
URL-encoding
But what if we wanted to do:
% perl woogle.pl "boy toy" boytoy
The first term, boy toy, would make a search URL of:
http://www.google.com/search?q=%22boy toy%22&btnG=Google+Search
But we mustn't ever have spaces in URLs! Instead:
http://www.google.com/search?q=%22boy%20toy%22&btnG=Google+Search
URI::Escape to the rescue...
#16
URI::Escape
URI::Escape is a simple module that provides two functions:
- $encoded = uri_escape($raw);
- Returns a URL-encoded copy of $raw's value.
- $raw = uri_unescape($encoded);
- Returns a URL-decoded copy of $encoded's value.
So uri_escape("boy toy") is "boy%20toy".
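A quick round trip shows both functions at work. This is only a sketch; the sample strings are the ones used on these slides.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_unescape);

my $raw     = 'boy toy';
my $encoded = uri_escape($raw);          # the space becomes %20
my $decoded = uri_unescape($encoded);    # and back again

print "$raw => $encoded => $decoded\n";

# The double quotes around a phrase search get escaped too:
my $quoted = uri_escape('"stuff"');
print "$quoted\n";                       # %22stuff%22
```

That second line is where the %22 in the Google search URL comes from.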
#17
Using URI::Escape
So we replace our line:
my $url =
'http://www.google.com/search?q=%22'
. $word . '%22&btnG=Google+Search'
;
with:
use URI::Escape;
my $url =
'http://www.google.com/search?q=%22'
. uri_escape($word) . '%22&btnG=Google+Search'
;
And then:
% perl woogle.pl "boy toy" boytoy
boy toy: 27,700 matches
boytoy: 6,090 matches
#18
LWP::Simple in Conclusion
LWP::Simple is excellent for short, simple programs.
LWP::Simple is great when all you're doing is GETting what's at
a URL.
What it doesn't do:
- It doesn't POST.
- It doesn't give fine control over the request (like its
headers, including cookies).
- It doesn't let you carefully examine all the headers on the
response.
To do those, you use the full LWP::* / HTTP::* modules, as
described later.
#19
HTTP Basics
HTTP is essentially a simple MIME protocol.
The client opens a connection to the server,
sends a request line, some MIME headers, and then
an optional message body.
The server responds with a status line,
some MIME headers, and then an optional (usually present)
message body.
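To make that shape concrete, here's a minimal sketch (core Perl only, no LWP) that takes a raw response apart into status line, headers, and body. The helper name parse_response and the sample response text are made up for illustration. Real code should leave this job to LWP, which handles the many cases this sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw HTTP/1.0 response into its three parts.
sub parse_response {
    my ($raw) = @_;
    # The headers end at the first blank line; the body is whatever follows.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    my ($status_line, @header_lines) = split /\r?\n/, $head;
    my ($version, $code, $message) = split ' ', $status_line, 3;
    my %headers;
    for my $line (@header_lines) {
        my ($name, $value) = split /:\s*/, $line, 2;
        $headers{ lc $name } = $value;    # header names are case-insensitive
    }
    return {
        code    => $code,
        message => $message,
        headers => \%headers,
        body    => defined $body ? $body : '',
    };
}

my $raw = join "\r\n",
    'HTTP/1.0 200 OK',
    'Content-type: text/html',
    'Content-length: 25',
    '',
    "<html>I like pie.</html>\n";

my $parsed = parse_response($raw);
print "$parsed->{code} $parsed->{message}, ",
      "type $parsed->{headers}{'content-type'}, ",
      length($parsed->{body}), " bytes of body\n";
```

A request has the same layout, except that its first line is a request line (method, path, protocol version) rather than a status line.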
#20
HTTP Session: GET, and 200
Client says to www.secret.gov:
GET /foo/thing.html HTTP/1.0
Host: www.secret.gov
User-Agent: Mozilla/9.6
Referer: http://www.secret.gov/foo/main.html
[empty message-body]
Server:
HTTP/1.0 200 OK
Content-type: text/html
Content-length: 25
Server: NCSA 3.9 (+mod_ada)
<html>I like pie.</html>
#21
HTTP Session: GET, and 404
Client says to www.secret.gov:
GET /foo/thing2.html HTTP/1.0
Host: www.secret.gov
User-Agent: Mozilla/9.6
[empty message-body]
Server:
HTTP/1.0 404 Not Found
Content-type: text/plain
Content-length: 36
Server: NCSA 3.9 (+mod_ada)
No such object as /foo/thing2.html.
#22
HTTP Session: POST, and 200
Client says to www.secret.gov:
POST /foo/drawmap.ada HTTP/1.0
Host: www.secret.gov
Referer: http://www.secret.gov/mapform.shtml
Content-type: application/x-www-form-urlencoded
Content-length: 40
User-Agent: Mozilla/9.6
mlat=35.11721&mlon=-106.62463&msym=cross
Server:
HTTP/1.0 200 OK
Content-type: image/gif
Content-length: 94252
Server: NCSA 3.9 (+mod_ada)
[94,252 bytes of GIF data]
#23
HTTP Session: GET, and 301
Client says to www.secret.gov:
GET /foo/bar.xml HTTP/1.0
Host: www.secret.gov
User-Agent: Mozilla/9.6
[empty message-body]
Server:
HTTP/1.0 301 Moved Permanently
Server: NCSA 3.9 (+mod_ada)
Location: http://bar.secret.gov/xmllib/f1.xml
[empty message-body]
#24
Common HTTP Status Codes
- 200 OK
- 301 Moved Permanently
- 302 Moved Temporarily
- 403 Forbidden
- 404 Not Found
- 500 Internal Server Error
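You rarely need to memorize these. The HTTP::Status module (it comes along with LWP) provides functions that sort a code into its family. A small sketch, where the hash name %kind_of is just for illustration:

```perl
use strict;
use warnings;
use HTTP::Status qw(is_success is_redirect is_error);

my %kind_of;
for my $code (200, 301, 302, 403, 404, 500) {
    $kind_of{$code} = is_success($code)  ? 'success'
                    : is_redirect($code) ? 'redirect'
                    : is_error($code)    ? 'error'
                    :                      'other';
    print "$code: $kind_of{$code}\n";
}
```

This is the same is_success that you can feed a status code from LWP::Simple's getprint, getstore, or mirror.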
#25
LWP Classes
LWP's modules are object-oriented -- which means you get to call
them "classes".
That doesn't mean that programs that use LWP have to be
object-oriented. (Mine typically aren't.)
#26
OOP Basics
- If you're not familiar with OOP in any programming language,
you'll learn now! Consider reading my POD article
HTML::Tree::AboutObjects for starters; and see Damian Conway's book
Object-Oriented Perl.
- If you're familiar with using classes in other languages, see
the chapters on objects in Programming Perl, and then
the Conway OOP book.
- If you're familiar with using classes in Perl, still keep your
wits about you...
#27
Basics of Objects in Perl
An "object" is a reference to a data structure that is special
because:
- It's been tagged with a package name (which
ref($object) will tell you).
- You can change and manipulate the object only thru agreed-upon
ways, thru routines called "methods".
- You call methods using a special syntax:
$object->foo( parameters ... )
That calls the "foo" routine that is provided by $object's
class.
- Another special syntax is how you make most objects:
Classname->bar( parameters ... )
#28
OOP Jargon
- A method that makes new objects is called a "constructor":
Classname->new( parameters ... )
$object->clone( parameters ... )
(A constructor can be called anything, but typically there's one
called "new".)
- Each piece of data inside a particular object is called an
"attribute".
- The methods that read and/or alter attributes are called
"accessors".
#29
Class Details
- What methods you can call on an object depends on its class,
since it's the class that provides them.
- The class's documentation is where you'll find the
description of the attributes that objects of that class have, as
well as the methods available.
- When confronting a new class, the question to ask is: "what
does an object of this class symbolize?"
- And then: "what can I make it do?" and "what's in it?"
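If the jargon is still abstract, here is about the smallest class that shows a constructor, an attribute, and an accessor. The Kettle class is made up purely for illustration; it has nothing to do with LWP.

```perl
use strict;
use warnings;

package Kettle;   # a made-up class, just for illustration

# Constructor: builds the data structure and tags it with the class name.
sub new {
    my ($class, %args) = @_;
    my $self = { temperature => $args{temperature} || 20 };
    return bless $self, $class;
}

# Accessor: reads the "temperature" attribute, or sets it if given a value.
sub temperature {
    my $self = shift;
    $self->{temperature} = shift if @_;
    return $self->{temperature};
}

package main;

my $kettle = Kettle->new( temperature => 65 );
print ref($kettle), "\n";      # the package name the object is tagged with
$kettle->temperature(100);     # change the attribute via its accessor
print $kettle->temperature, "\n";
```

A Kettle object symbolizes a kettle; what's in it is a temperature; what you can make it do is report or change that temperature.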
#30
An Object's Meaning & Function
What does the object mean? Then: what does it do?
- an Imager object is...
- a 2D bitmap
...which you can load from a GIF/JPEG/PNG, draw on, resize, crop,
save, etc.
- an IO::Socket object is...
- a network socket
...which I can read from and/or write to -- and which I probably
had to specify a network address and portnumber for, when I created
it.
#31
An Object's Meaning & Function (b)
- a Net::FTP object is...
- an FTP connection from me to an FTP server; it's like a virtual
WSFTP/Fetch/Anarchie/ftp(1) window.
...with which I can transfer a file at a time.
- a Business::US_Amort object is...
- a simulated loan ($170,000, 20 years, 8% fixed),
...which I can generate an amortization table for, or calculate the
total interest for.
#32
An Object's Meaning & Function (c)
- an LWP::UserAgent object is...
- a browser
...with which I can get things from the Web.
- an HTTP::Response object is...
- a wrapper for data that comes back from a Web server.
...whose MIME type I can look at, whose data I can extract, whose
HTTP status code I can check, etc.
#33
Each Object's Attributes
And, finally, what needs to be in each object?
- an Imager object's attributes are...
- every pixel's color; height and width of the bitmap; palette?
source filename? current "pen" color and size? etc.
- an IO::Socket object's attributes are...
- its timeout setting; whether it's connected; etc.
#34
Each Object's Attributes (b)
- a Net::FTP object's attributes are...
- hostname; and, indirectly, the current remote directory, and
ascii/binary mode
- a Business::US_Amort object's attributes are...
- intended term, principal, rate,
actual term, whether to output a table, whether calculations should
round to the nearest cent; etc.
#35
Each Object's Attributes (c)
- an LWP::UserAgent object's attributes are...
- its user-agent string ("libwww/5.82", "Mozilla/4.76"); its
cookies; its keyring for accessing password-protected URLs; how
long it'll wait for a server to respond; etc.
- an HTTP::Response object's attributes are...
- its HTTP status code and message (404, "Not Found"); its data
("<html><head>..."); and all its header lines, like
content_type (example value: "text/html"); etc.
#36
OOP in Action
use LWP::UserAgent; # load the module
use strict; # always a good idea
my $browser = LWP::UserAgent->new;
print "Given name: ", $browser->agent(), "\n";
# prints: libwww-perl/5.5394
$browser->agent("NCognito/12.4");
print "Code name: ", $browser->agent(), "\n";
# prints: NCognito/12.4
#37
OOP Details
Object-oriented programming is a whole approach
to program-design.
However, for purposes of dealing with LWP,
you can just pretend it's a style of interface.
#38
Where to Learn More About Perl OOP
- perlobj, perlboot, perltoot, etc.
- HTML::Tree::AboutObjects
- Programming Perl
- Object-Oriented Perl
- experience!
#39
Perl OOP Not-Oddities
Relative to other languages...
- Perl tends not to have ornate class hierarchies; i.e., there is
much less inheritance than you find in Java class-groups.
- Perl class-models have fewer Russian-doll objects than you find
in Java.
E.g., Document Object Model vs HTML::Element.
#41
LWP OOP Oddities (b)
- An attribute-value is typically a simple scalar.
But it can be another object!
To use a non-LWP example:
$sax_track = MIDI::Track->new;
... then put sassy sax music into $sax_track ...
$harp_track = MIDI::Track->new;
... then put soothing harp music into $harp_track ...
$opus = MIDI::Opus->new;
$opus->tracks( $sax_track, $harp_track );
# The opus's "tracks" attribute is now a list
# of two track-objects!
#42
LWP Class Model
$resp = $browser->get( $url )
- Basic classes:
- LWP::UserAgent
- HTTP::Response
- Other important classes:
- HTTP::Request
- URI
#43
LWP::UserAgent
$resp = $BROWSER->get( $url )
An LWP::UserAgent object is a browser, which you use for
retrieving documents across the Web.
Notable attributes: agent name, cookie jar, key ring.
Made by:
LWP::UserAgent->new
#44
HTTP::Response
use LWP 5.5394; # new features!
$RESP = $browser->get( $url );
$RESP = $browser->head( $url );
$RESP = $browser->post( $url,
['k1'=>'v1',...] );
...each performs an HTTP request, and returns a new
HTTP::Response object.
An HTTP::Response object contains the document that was
returned.
Notable attributes: content (a big scalar);
last_modified (in seconds since the epoch);
content_type (like "text/html");
code (the status, like 200, 401, 404, 500), and
is_success (based on the code).
If the server wasn't reachable at all, then the response is a
dummy object just to hold the error code, probably 500.
(Examples follow!)
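First, a network-free look at those attributes. Normally $browser->get(...) hands you the response object; in this sketch one is built by hand (with made-up contents) just so the accessors can be tried out.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand; the values here are invented for the demo.
my $resp = HTTP::Response->new(
    404, 'Not Found',
    [ 'Content-Type' => 'text/plain' ],
    "No such object as /foo/thing2.html.\n",
);

print "Code:    ", $resp->code, "\n";
print "Status:  ", $resp->status_line, "\n";
print "Type:    ", $resp->content_type, "\n";
print "Success? ", ( $resp->is_success ? 'yes' : 'no' ), "\n";
print "Content: ", $resp->content;
```

Note that status_line gives the code and the message together, which makes it handy in error messages.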
#45
Simple ->get
Redoing our LWP::Simple example program:
use strict;
use LWP 5.5394; # Loads necessary LWP classes
my $br = LWP::UserAgent->new;
my $resp = $br->get('http://www.guardian.co.uk/');
die "Hm, couldn't get Guardian: ", $resp->status_line
unless $resp->is_success;
die "It's not html, it's ", $resp->content_type
unless $resp->content_type eq 'text/html';
die "Odd! It's stale!"
if $^T - 24*60*60 > ($resp->last_modified || $^T);
my $content = $resp->content;
die "What? Content is short!"
unless length($content) > 15_000;
foreach my $keyword (qw( GM Intel Ital Canad Mexic)) {
print "$keyword!\n" if $content =~ m/\b\Q$keyword/;
}
print "\n[End at ", scalar( localtime ), "]\n";
#46
Simple ->head
What we did with this:
use LWP::Simple;
foreach my $url (@url_to_check) {
print "$url is no good\n"
unless scalar head($url);
}
We can do with this:
my @url_to_check = ( ...some absolute URLs... );
use LWP 5.5394;
my $browser = LWP::UserAgent->new;
$browser->cookie_jar( {} ); # for fun, enable cookies.
foreach my $url (@url_to_check) {
my $response = $browser->head($url);
print "$url is no good: ", $response->message, "\n"
unless $response->is_success;
}
#47
Wait, $browser->cookie_jar({}) ??
The LWP::UserAgent docs say:
$ua->cookie_jar([$cookie_jar_obj])
Get/set the cookie jar object to use. [...] Normally
this will be a HTTP::Cookies object or some subclass.
The default is to have no cookie_jar, i.e. never automatically add
"Cookie" headers to the requests.
Shortcut: If a reference to a plain hash is passed in as the
$cookie_jar_object, then it is replaced with an instance of
HTTP::Cookies that is initialized based on the hash.
This form also automatically loads the HTTP::Cookies module. It
means that:
$ua->cookie_jar({ file => "$ENV{HOME}/.cookies.txt" });
is really just a shortcut for:
require HTTP::Cookies;
$ua->cookie_jar(HTTP::Cookies->new(file => "$ENV{HOME}/.cookies.txt"));
So?
#48
$browser->cookie_jar({}) !
So that's a shortcut for
require HTTP::Cookies;
$browser->cookie_jar(HTTP::Cookies->new());
...which makes it work like a normal cookie-using Web browser, but
one whose cookies sit only in memory.
perldoc HTTP::Cookies explains how to read the cookie jar from
disk, save it back to disk, and even use your Netscape's cookie file.
So?
#49
$browser->cookie_jar({}) !!
Most of the time, all you want to do with a cookie jar (an object of
class HTTP::Cookies) is create it and make it the value of
some $browser's cookie_jar attribute.
So that's why the LWP authors made an idiom for that!
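Here are both forms side by side, as a sketch. The cookie file name is only an example (an assumption for this demo); point it wherever suits you.

```perl
use strict;
use warnings;
use File::Spec;
use HTTP::Cookies;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# In-memory jar, via the shortcut:
$browser->cookie_jar( {} );
print "Jar class: ", ref( $browser->cookie_jar ), "\n";

# File-backed jar: read from the file if it exists, and written back
# when the jar object goes away. (The file name is just an example.)
my $cookie_file = File::Spec->catfile( File::Spec->tmpdir, 'lwp_cookies.txt' );
$browser->cookie_jar(
    HTTP::Cookies->new( file => $cookie_file, autosave => 1 )
);
```

Either way, every request made through $browser from then on sends and collects cookies.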
#50
$browser->cookie_jar({}) !!!
So skim the docs, on paper, with a highlighter in hand.
There's a lot of things in the docs that you don't need
to know. But you won't know which they are until you see them.
#51
Must-Read LWP Docs
There are many modules in the LWP distribution, and
most of them are of nearly no conceivable interest to the
average user.
Example: when you run $browser->get(...),
$browser->post(...), etc., lots of things happen involving a
whole LWP::Protocol::* hierarchy. However, LWP::Protocol::* is of
no interest to the typical programmer. Ditto the modules that
involve parsing HTTP headers, like HTTP::Date.
- Of interest, however, are:
- LWP::UserAgent; lwpcook; HTTP::Response and its superclasses,
HTTP::Message and HTTP::Headers.
- And, for later:
- HTML::TreeBuilder, and the unavoidably quite long
HTML::Element.
#52
Behind the Scenes: HTTP::Request
$resp = $browser->get($url) is a shortcut for:
{
use HTTP::Request::Common;
# exports functions that make HTTP::Request objects
my $request = GET($url);
$resp = $browser->request( $request );
}
and then we actually perform the request.
#53
HTTP::Request
- an HTTP::Request object is...
- a planned HTTP request
- ...which you can actually perform with
$browser->request($that_object);
Its attributes are: method (like 'GET'); uri (an absolute URL
string); content (used for the form data in POST requests); and
headers (like "Accept-Language").
#54
HTTP::Request (b)
How to make and then perform an HTTP::Request object:
- The hard way:
- $req = HTTP::Request->new('GET', $url);
$req->header('Accept-Language', 'en-US, it');
$resp = $browser->request($req);
- The easy way:
- use HTTP::Request::Common; # exports GET,HEAD,POST,PUT
$resp = $browser->request(GET($url, 'Accept-Language', 'en-US, it'));
- The implicit way:
- $resp = $browser->get($url, 'Accept-Language', 'en-US, it');
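Since a request object is just a plan, you can build one and look inside it without touching the network. This sketch uses POST from HTTP::Request::Common to rebuild the form submission from the earlier POST session slide; the URL and form values are that slide's made-up ones.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Plan a form submission. Nothing is sent until $browser->request($req).
my $req = POST(
    'http://www.secret.gov/foo/drawmap.ada',
    [ mlat => '35.11721', mlon => '-106.62463', msym => 'cross' ],
);

print "Method:  ", $req->method, "\n";
print "URI:     ", $req->uri, "\n";
print "Type:    ", $req->content_type, "\n";
print "Length:  ", $req->header('Content-Length'), "\n";
print "Content: ", $req->content, "\n";
```

Passing the form data as an array reference (rather than a hash reference) keeps the fields in the order you wrote them.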
#55
$browser->request
Performs the request, returning a response. Ways to call it:
- $browser->request( $req );
- Just does it! This is the only one that $browser->get(...)
(and post, head, and put) are shortcuts for.
- $browser->request( $req, $filename );
- Response content is saved to $filename, instead of getting
stored in the response object.
- $browser->request( $req, \&callback );
- $browser->request( $req, \&callback, $chunk_size );
- Content is sent (preferably in blocks of given $chunk_size) to
&callback.
#56
$browser->request (b)
So when do you need to use HTTP::Request objects and
$browser->request?
- When you want to make a request with an HTTP method other than
get/head/post/put. (E.g., options.)
- When you want to use the ->request($req, $filename) or
->request($req, \&callback) syntaxes.
- ->request($req, $filename) is especially useful when the
result is a large object that you don't want to bother having in
memory.
->request was the only way to do it before LWP 5.5394 (May
2001), so it's all over existing code.
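Here is a sketch of the callback form. To keep it runnable without a network, it fetches a temporary local file through a file: URL; that choice, the file's contents, and the variable names are assumptions of this demo. With a real http: URL the call looks the same.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTTP::Request;
use LWP::UserAgent;
use URI::file;

# Make a small local file to fetch, so the sketch needs no network.
my ($fh, $path) = tempfile( UNLINK => 1 );
print {$fh} "x" x 10_000;
close $fh;

my $browser = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => URI::file->new_abs($path) );

my ($chunks, $bytes) = (0, 0);
my $resp = $browser->request(
    $req,
    sub {
        # Called once per block of content as it arrives.
        my ($data, $response, $protocol) = @_;
        $chunks++;
        $bytes += length $data;
    },
    4096,    # preferred chunk size; only a hint
);

print "Got $bytes bytes in $chunks chunk(s); ",
      "content kept in the response object: ",
      length( $resp->content ), " bytes\n";
```

The content goes to the callback and is not stored in the response object, which is the whole point when the thing you're fetching is large.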
#57
Behind the Scenes: $browser->request
$browser->request($req, ...) is a wrapper around:
$resp = $browser->simple_request($req, ...);
if $resp's code says "but you need authentication"
and we know a username+password (set by $browser->credentials),
then redo the request with those credentials
if $resp's code is a redirect
and we're not caught in a loop
and the req. method is in the list
$browser->requests_redirectable (normally HEAD,GET)
and redirection isn't to a 'file' URL, # security!
then:
$new_req = $req->clone
$new_req->uri( $new_url )
return $browser->request($new_req, ...)
# ...and the response to $new_req gets $resp as its ->previous
else return $resp
#58
->request internals
So?
- If you need to deal with HTTP authentication, now you know
where it happens. See $browser->credentials in
LWP::UserAgent.
- If you need to have your browser follow POST redirections, now
you know to change $browser->requests_redirectable.
- The URL of what you get back might not be the URL of what you
requested. (Consider $resp->request->url or
$resp->base.)
- Now you know that a request can actually cause a chain of
responses, of which you get the last. (But $prev = $it->previous
reads the one before.) For most applications, only the last is
interesting, but sometimes you want to look at each of them. (E.g.,
do any of these set cookies?)
#59
What's $browser->simple_request?
To perform a request to url "scheme:...":
If $browser has a ->protocols_allowed list and scheme isn't in it,
or if $browser has a ->protocols_forbidden list and scheme is in it,
return an error response (code 500)
If we know how to handle this scheme (via a LWP::Protocol::scheme class),
then use it for performing this request.
Otherwise, make an error object (code 500) with a message
explaining we don't know that scheme.
#60
->simple_request internals
- If you want $browser to handle requests to only some schemes
(like just 'http'), then you can set:
$browser->protocols_allowed(['http']);
- Or you can just exclude specific schemes:
$browser->protocols_forbidden(['https','mailto','ftp','data']);
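You can watch the restriction work without any network access, because a refused scheme never gets as far as a connection. A sketch (the ftp URL is a placeholder):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->protocols_allowed( ['http'] );

# No connection is attempted: the scheme is refused up front.
my $resp = $browser->get('ftp://ftp.example.com/pub/README');

print "ftp supported here? ",
      ( $browser->is_protocol_supported('ftp') ? 'yes' : 'no' ), "\n";
print "Response: ", $resp->status_line, "\n";
```

The refusal comes back as an ordinary (unsuccessful) response object, so the usual is_success check catches it.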
#61
LWP Access examples
Now to more examples.
To process data from the Web:
- Perform the request for the data.
- Extract the data from what you got back.
- Then you can do things with it.
#62
Bookmark Link Checker
Let's check links in my bookmark file!
It starts out:
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
Do Not Edit! -->
<TITLE>Bookmarks for Sean M. Burke</TITLE>
<H1>Bookmarks for Sean M. Burke</H1>
<DL><p>
<DT><H3 ADD_DATE="911669103">Personal Toolbar Folder</H3>
<DL><p>
<DT><A HREF="http://libros.unm.edu/" ...
<DT><A HREF="http://www.melvyl.ucop.edu/" ...
<DT><A HREF="http://www.guardian.co.uk/" ...
<DT><A HREF="http://www.booktv.org/schedule/" ...
<DT><A HREF="http://www.suck.com/" ...
#63
Matching Links
Suppose we want just the HTTP links. We can just match URLs with:
m{<DT><A HREF="(http://[^"]+)"}s
Lines not matching that, we don't care about.
(Note that there are no relative links in a bookmark file.)
#64
Actual Useful Working Code!
use strict;
use LWP;
my $file =
'/program files/netscape/users/sburke/bookmark.htm';
die "$file doesn't exist" unless -e $file;
open(IN, "<$file") or die "Can't read-open $file: $!";
my $browser = LWP::UserAgent->new;
$browser->agent('Checkasaurus/0.01');
$browser->timeout(10); # be impatient
my %seen;
while(<IN>) {
next unless m{<DT><A HREF="(http://[^"]+)"}s;
my $url = $1;
next if $seen{$url}++; # seen it!
my $resp = $browser->head($url);
if($resp->is_success) {
print "## OK: $url\n";
} else {
print $url, "\n => ", $resp->status_line, "\n";
}
sleep 1;
}
print "## Done at ", scalar(localtime), "\n";
#65
...And How That Looks
## OK: http://libros.unm.edu/
## OK: http://www.nandotimes.com/noframes
...
http://www.amazon.com/exec/obidos/ASIN/B00001QGP4
=> 405 Method Not Allowed
## OK: http://www.yhchang.com/
## OK: http://low-vision.org/
http://www.helsinki.fi/~lukka/
=> 403 Forbidden
## OK: http://listserv.activestate.com/mailman/listinfo/perl-xml
## OK: http://www.lib.udel.edu/ud/spec/exhibits/forgery/psalm.htm
## OK: http://www.june29.com/HLP/ [altho I know that moved!]
## OK: http://www.manl.mb.ca/
http://inac.org/IrishPeople/gaelic/
=> 404 Not Found
## OK: http://members.tripod.com/~laoconnection/language1.htm
## OK: http://www.learnkhmer.com/
## OK: http://www.geocities.com/Athens/Academy/9594/tibet.html
...
But suppose we want to catch redirection.
#66
Noticing Redirection
Remember that ->request (as in ->head) can cause
several real request/response cycles. Check ->previous!
To report redirection:
...
if(! $resp->is_success) {
print $url, "\n => ", $resp->status_line, "\n"
} elsif($resp->previous and $resp->previous->is_redirect) {
print "## Moved $url\n## => ", $resp->request->url, "\n";
} else {
print "## OK: $url\n";
}
...
(Doesn't report unsuccessful redirection; doesn't deal
right with multiple redirection.)
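For multiple redirection, one approach is to walk the whole ->previous chain. The sketch below builds a fake three-response chain by hand so it runs offline; response_chain and fake_response are hypothetical helper names, and the example.com-style URLs are placeholders.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# Hypothetical helper: every response in the chain, oldest first.
sub response_chain {
    my ($resp) = @_;
    my @chain;
    while ($resp) {
        unshift @chain, $resp;
        $resp = $resp->previous;
    }
    return @chain;
}

# Hypothetical helper: a hand-made response that remembers its request.
sub fake_response {
    my ($code, $message, $url) = @_;
    my $r = HTTP::Response->new($code, $message);
    $r->request( HTTP::Request->new( GET => $url ) );
    return $r;
}

my $first  = fake_response(301, 'Moved Permanently', 'http://a.example/');
my $second = fake_response(302, 'Found',             'http://b.example/');
my $last   = fake_response(200, 'OK',                'http://c.example/');
$second->previous($first);
$last->previous($second);

my @hops = response_chain($last);
for my $r (@hops) {
    print $r->code, ' ', $r->request->uri, "\n";
}
```

In the link checker, you would pass the response from $browser->head($url) to response_chain and report each hop.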
#67
...And How That Looks
## OK: http://libros.unm.edu/
## Moved http://www.nandotimes.com/noframes
## => http://www.nandotimes.com/noframes/
...
## OK: http://www.lib.udel.edu/ud/spec/exhibits/forgery/psalm.htm
## Moved http://www.june29.com/HLP/
## => http://www.ilovelanguages.com/
## OK: http://www.manl.mb.ca/
...
#68
Primitive Remote Link Checker
We want to get a remote HTML page (by URL)
and check all the links in it.
There are modules that do proper intelligent
link extraction from HTML (like HTML::LinkExtor), but
we'll make do with this:
sub urls_in {
my $url = $_[0];
my $resp = $browser->get($url);
die "Can't get $url: ", $resp->status_line, " "
unless $resp->is_success;
die "Guh? $url is content-type ", $resp->content_type
unless $resp->content_type eq 'text/html';
$Base = $resp->base;
my @urls =
($resp->content =~ m/href="([^"]+)"/ig); # dumb
return @urls;
}
#69
And a Checker Procedure
And we can recycle our link-checker code from before,
as a routine:
sub check_url { # given an absolute URL
my $url = $_[0];
my $resp = $browser->head($url);
if(!$resp->is_success) {
print $url, "\n => ", $resp->status_line, "\n";
} elsif($resp->previous and $resp->previous->is_redirect) {
print "## Moved $url\n## => ", $resp->request->url, "\n";
} else {
print "## OK: $url\n";
}
}
#70
Checking Just Absolute HTTP URLs
use strict;
use LWP;
my $browser = LWP::UserAgent->new;
$browser->agent('Checkasaurus/0.02');
$browser->timeout(10); # be impatient
my $Base;
...and the two subs, here...
my $hp = 'http://www.speech.cs.cmu.edu/~sburke/';
my @urls = urls_in($hp);
die "No urls in $hp?" unless @urls;
my %seen;
foreach my $url (@urls) {
next if $seen{$url}++;
unless($url =~ m{^http://}s) {
print "Skipping <$url>\n";
next;
}
check_url($url);
}
#71
...And How That Looks
Skipping <#work>
Skipping <#reference>
Skipping <mailto:sburke@cpan.org>
Skipping <warning.html>
## OK: http://machaut.uchicago.edu/cgi-bin/WEBSTER.sh?WORD=burke
Skipping <not_dead.html>
http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?cruft
=> 500 Can't connect to foldoc.doc.ic.ac.uk:80 (Timeout)
## Moved http://killallhumans.com/
## => http://killallhumans.com/kah/
...
How to check the relative URLs
"warning.html" and "not_dead.html"?
#72
The URI Class!
(URI = Uniform Resource Identifiers, of which
Uniform Resource Locators and
Uniform Resource Names are (the?) two kinds.)
The URI class provides two useful constructors:
- $url = URI->new_abs($rel_url, $abs_base_url);
$url = URI->new_abs($abs_url, $abs_base_url);
- Create a new URI object relative to a given base.
- $url = URI->new($abs_url);
- Make a URI object from this URL.
#73
URI Vitals
A URI object is:
- a URI! (a URL or URN)
- ...which you can absolutize (during construction).
- ...whose parts you can extract or alter.
- ...which magically stringifies!
#74
URI Stringification
use strict;
use URI;
use LWP::UserAgent;
my $br = LWP::UserAgent->new;
my $url = URI->new('http://www.suck.com/');
print ref($br), "=> $br\n";
# LWP::UserAgent=> LWP::UserAgent=HASH(0x1765188)
print ref($url), "=> $url\n";
# URI::http=> http://www.suck.com/
# But alter it as a string, and it won't be an object:
$url .= '#today';
print ref($url), "=> $url\n";
# => http://www.suck.com/#today
#75
Parts of a URI
http://www.secret.gov/aliens/search3.dll?foo%20bar#baz
[--]   [------------][-----------------] [-------] [-]
 |          host            path           query    |
scheme                                          fragment
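As a rough sketch, the parts in the diagram map onto URI accessor methods like this (the %part hash is only an illustrative name):

```perl
use strict;
use warnings;
use URI;

# Same example URL as in the diagram above.
my $uri = URI->new('http://www.secret.gov/aliens/search3.dll?foo%20bar#baz');

my %part = (
    scheme   => $uri->scheme,
    host     => $uri->host,
    path     => $uri->path,
    query    => $uri->query,      # returned still URL-encoded
    fragment => $uri->fragment,
);

print "$_ = $part{$_}\n" for qw(scheme host path query fragment);
```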
#76
Using URI
use strict;
use URI;
my $url = URI->new_abs(
'../clones/search2.ada',
'http://secret.gov/reno/aliens/search.ada?replicants'
);
print "Hm, the URL's scheme is ", $url->scheme, ".\n";
#-> ...http.
print $url, "\n";
#-> http://secret.gov/reno/clones/search2.ada
$url->query('army "clone babies"');
print $url, "\n";
#-> http://secret.gov/reno/clones/search2.ada?army%20%22clone%20babies%22
#77
Making Our Checker a Bit Smarter
# replace the main loop with this:
use URI;
foreach my $url (@urls) {
next if $seen{$url}++;
if($url =~ m{^#}s) {
print "Skipping fragment <$url>\n"; next;
}
$url = URI->new_abs($url,$Base);
if($url->scheme ne 'http') {
print "Skipping non-http <$url>\n"; next;
}
check_url($url);
}
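To see what that loop decides without touching the network, here is a minimal sketch. The classify_link helper is hypothetical (it is not part of the checker above); it only mirrors the loop's two tests.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: what would the checker loop do with one href?
sub classify_link {
    my ($href, $base) = @_;
    return ('fragment', $href) if $href =~ m{^#};
    my $abs = URI->new_abs($href, $base);       # absolutize against the base
    return ('non-http', "$abs") if ($abs->scheme || '') ne 'http';
    return ('check', "$abs");
}

my $base = 'http://www.speech.cs.cmu.edu/~sburke/';
for my $href ('#work', 'mailto:sburke@cpan.org', 'warning.html') {
    my ($verdict, $url) = classify_link($href, $base);
    print "$verdict: $url\n";
}
```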
#78
...And How That Looks
Skipping fragment <#work>
Skipping fragment <#reference>
Skipping non-http <mailto:sburke@cpan.org>
## OK: http://www.speech.cs.cmu.edu/~sburke/warning.html
## OK: http://machaut.uchicago.edu/cgi-bin/WEBSTER.sh?WORD=burke
## OK: http://www.speech.cs.cmu.edu/~sburke/not_dead.html
http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?cruft
=> 500 Can't connect to foldoc.doc.ic.ac.uk:80 (Timeout)
## Moved http://killallhumans.com/
## => http://killallhumans.com/kah/
...
#79
Interfacing to Babelfish via POST
Babelfish (generally better accessed thru
the WWW::Babelfish module) is a service run by
Altavista that lets you feed bits of text thru
machine-translation programs, via your browser.
But the HTML form you use is a POST form, not
a GET form.
#80
POSTing data
Form data sent by POST is just like data sent by GET --
it's key+value pairs. Looking at the source for the
Babelfish form shows that asking for "I like pie"
to be translated from English to French, produces
these three key+value pairs:
urltext = I like pie
lp = en_fr
enc = utf8
Or, encoded:
urltext=I%20like%20pie&lp=en_fr&enc=utf8
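A minimal sketch of producing that encoded string by hand with URI::Escape (the variable names are illustrative; in practice LWP does this step for you):

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Key+value pairs kept as a flat list so their order is preserved.
my @pairs = (urltext => 'I like pie', lp => 'en_fr', enc => 'utf8');

my @encoded;
while (my ($key, $value) = splice(@pairs, 0, 2)) {
    push @encoded, uri_escape($key) . '=' . uri_escape($value);
}

my $body = join '&', @encoded;
print "$body\n";
```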
#81
->post Syntax
$resp = $browser->post($url, [k1 => v1, ... ] );
$resp = $browser->post($url, \@some_array );
$resp = $browser->post($url, {k1 => v1, ... } );
$resp = $browser->post($url, \%some_hash );
In this case,
$resp = $browser->post($url, [
'urltext' => 'I like pie',
'lp' => 'en_fr',
'enc' => 'utf8',
] );
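To look at what such a POST carries without sending anything, one option is to build the request object with HTTP::Request::Common, which takes the same array-ref form. This is only a sketch, and the URL below is a placeholder, not the Babelfish address.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Placeholder URL; nothing is sent over the network here.
my $req = POST('http://example.invalid/translate', [
    'urltext' => 'I like pie',
    'lp'      => 'en_fr',
    'enc'     => 'utf8',
]);

print $req->method, ' ', $req->uri, "\n";
print 'Content-Type: ', $req->header('Content-Type'), "\n";
print $req->content, "\n";      # the encoded key+value pairs
```

Spaces in the body may show up as "+" rather than "%20"; both are valid encodings for form data.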
#82
Capturing the Output
The return page from a Babelfish request has the translation
as the content of the first <textarea>...</textarea>.
Working that into a tidy function:
sub translate {
my($text, $language_path) = @_;
my $resp = $browser->post(
'http://babelfish.altavista.com/translate.dyn',
[ 'urltext' => $text, 'lp' => $language_path,
'enc' => 'utf8'
]);
die "Error in translation $language_path: ",
$resp->status_line(), "\n" unless $resp->is_success();
if($resp->content() =~ m{<textarea.*?>(.*?)</textarea>}is) {
my $translation = $1;
# Trim whitespace, and return.
$translation =~ s/\s+/ /g;
$translation =~ s/^ //s;
$translation =~ s/ $//s;
return $translation;
} else {
die "Can't find translation in $language_path response";
}
}
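The extraction half of translate can be tried offline. In this sketch, first_textarea is a hypothetical helper and the sample HTML is made up; it is not a real Babelfish response.

```perl
use strict;
use warnings;

# Hypothetical helper: the extraction step of translate(), minus the network.
sub first_textarea {
    my ($html) = @_;
    return unless $html =~ m{<textarea.*?>(.*?)</textarea>}is;
    my $text = $1;
    $text =~ s/\s+/ /g;     # collapse runs of whitespace
    $text =~ s/^ //;        # trim leading space
    $text =~ s/ $//;        # trim trailing space
    return $text;
}

# Made-up response body, only for illustration.
my $sample = qq{<form><textarea rows="3" name="q">\n  J'aime\n  la tarte\n</textarea></form>};
print first_textarea($sample), "\n";
```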
#83
Add Interface Code...
use strict;
use LWP;
my $browser = LWP::UserAgent->new();
$browser->env_proxy; # good if behind a firewall
...and then the translate function here
my $lang;
if(@ARGV and $ARGV[0] =~ m/^-(\w\w)$/s) {
$lang = lc $1; # if lang specified, like "-fr"
shift @ARGV;
} else {
my @languages = qw(it fr de es ja pt);
$lang = $languages[rand @languages];
}
die "What to translate?\n" unless @ARGV;
my $in = join(' ', @ARGV);
print " => via $lang => ", translate(
translate($in, 'en_' . $lang),
$lang . '_en' ), "\n";
#84
Double-Translation
% alienate -de "Pearls before swine!"
=> via de => Beads before pigs!
% alienate "Bond, James Bond"
=> via fr => Link, Link Of James
% alienate "Shaken, not stirred"
=> via pt => Agitated, not agitated
% alienate -it "Shaken, not stirred"
=> via it => Mental patient, not stirred
% alienate -it "Guess what! I'm a computer!"
=> via it => Conjecture that what! They are a calculating!
% alienate 'It was more fun than a barrel of monkeys'
=> via de => It was more fun than a barrel drop hammer
% alienate -ja 'It was more fun than a barrel of monkeys'
=> via ja => That the barrel of monkey at times was many pleasures
#85
HTML Processing
The HTTP access methods I've discussed will get you objects of
any media type.
Most content on the Web these days is in HTML, and so most
data extraction tasks are about pulling data out of HTML.
#86
HTML Concepts
...Wherein your humble tutor presents
HTML in terms of data structures,
not in terms of how to use <blink> for fun and profit.
#87
Rudimentary SGML Concepts
An SGML document represents a tree structure.
Elements (data nodes) contain other elements
and/or text nodes.
Elements can have attributes (key+value pairs)
(This ignores comments, as well as
arcana like PI's, declarations,
marked sections, etc...)
XML is a straightforward subtype of
SGML.
HTML is a messy kind of SGML instance.
#88
XML Working Concepts
Instead of considering the document as representing
a tree, imagine you're dumping a structure as a document:
      foo
     /   \
   bar   baz  <-- hoo=hah
          |
        "quux"
This dumps to XML as:
<foo>
<bar></bar>
<baz hoo="hah">quux</baz>
</foo>
#89
SGML Basics
An element is symbolized by a start-tag
(possibly containing attributes),
content, and then an end-tag:
- <foo>
- A start-tag with the tag-name "foo".
- </foo>
- An end-tag with the tag-name "foo".
- <baz hoo="hah">
- A start-tag with the tag-name "baz",
also expressing the attribute hoo="hah" for this element
#90
Back to HTML
Tim Berners-Lee,
Weaving The Web, p. 41:
"There was a family of markup languages, the standard generalized
markup language (SGML), already preferred by some of the world's
documentation community [...]
I developed HTML to look like a member of that community."
[emphasis mine]
#91
On Specificity in Specifications
The character Chrissy in Jane Wagner's
The Search for Signs of Intelligent Life in the Universe, p.35:
All my life I've always wanted to be somebody.
But I see now I should have been more
specific.
#92
A Table in XML
Suppose a table consists of rows, consisting of cells, consisting of
data.
                     table
                  /         \
             tr                tr
         /    |    \            |    \
     td      td      td        td        td
      |       |       |         |         |
   "Cost"    "$"   "Desc"     "Car"   "$10,000"
As XML:
<table>
<tr>
<td>Cost</td> <td>$</td> <td>Desc</td>
</tr>
<tr>
<td>Car</td> <td>$10,000</td>
</tr>
</table>
#93
A Table in HTML
All the uppercase tags here (bold on the original slide) are omissible:
<table>
<TR>
<TD>Cost</TD> <td>$</TD> <td>Desc</TD>
</TR>
<tr>
<TD>Car</TD> <td>$10,000</TD>
</TR>
</table>
Yes, all you need is:
<table>
Cost <td>$ <td>Desc
<tr> Car <td>$10,000
</table>
So?
#94
HTML Hassles
So, it's hard to write a regexp that matches the content of the second cell
of the table, if it could be any of:
<table>
Cost </td>$ <td>Desc
...
<table>
<td>$Cost <td>$ <td>Desc
...
<table>
<tr>$Cost </td> <td>$ <td>Desc
(...altho whichever of these your favorite browser
happens to deign to render, is another matter altogether.)
#95
Overview of HTML::* Modules
The HTML::* modules are indispensable for data-extraction
tasks.
But first see if you can do without.
#96
Making Do With No Module
Most HTML comes from a template. Take advantage of this!
<html>
<!-- Press Release Template 3 -->
...button bars...
...tables nested eight deep...
...banner ads...
<!-- START -->
...All the things you want!...
<!-- END -->
...more ads, table code, etc...
</html>
#97
Getting Away with Regexps!
If you can just use
m/<!-- START -->(.+?)<!-- END -->/s,
then do so!
(More realistic: one regexp covers 70% of cases, another RE covers
another 10% of cases, another slightly different one covers 12%,
and you identify the remaining 8% to be dealt with
manually.)
"If the funk ain't broke, then don't try to fix it!"
-- Bootsy Collins
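A minimal sketch of the marker-comment approach, run against a made-up page that follows the template from the previous slide:

```perl
use strict;
use warnings;

# Made-up page, only for illustration.
my $page = join "\n",
    '<html>',
    '<!-- Press Release Template 3 -->',
    '...button bars...',
    '<!-- START -->',
    '<h1>Big News</h1>',
    '<p>All the things you want!</p>',
    '<!-- END -->',
    '...more ads...',
    '</html>';

# /s lets "." match newlines, so the capture can span several lines.
my ($wanted) = $page =~ m/<!-- START -->(.+?)<!-- END -->/s;
die "No START/END markers found\n" unless defined $wanted;
print $wanted;
```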
#98
Giving up the Regexps
When to just give up and use the HTML::* modules:
- You've got more than ten regexps in your extractor program.
- Your extractor program is over two hundred lines.
- You start having to work around unsystematically bad HTML.
<a href=...\..\main.html">Stuff</a >
<!---- ---- Hmm -------- >
#99
HTML::Parser
It's a tokenizer: parses HTML source as tokens, not elements.
Very tolerant of bizarre code.
- <p align=center>
- start-tag: tag name and attributes
- </p>
- end-tag: tag name
- stuff
- text: with references like &eacute; decoded.
- <!-- comment -->
- comment: (usually just ignored for most processing)
In spite of its name, HTML::Parser is not too friendly a module for everyday use.
Used widely by other modules, tho.
#100
HTML::TokeParser
A very friendly interface to the token view of HTML;
uses HTML::Parser to do the actual "work".
For many applications, you don't need a real parse
of a document; tokens are fine.
TokeParser lets you step thru the tokens in an
HTML stream.
It's better than just a RE, when given this:
<!-- <a href="foo">woozlewuzzle</a> -->
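A small sketch of the difference: a naive regexp also picks up the commented-out link, while HTML::TokeParser's get_tag passes over it. The second link in the sample string is added here purely for contrast.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<!-- <a href="foo">woozlewuzzle</a> --> <a href="bar">real link</a>';

# A naive regexp sees both hrefs, including the one inside the comment.
my @by_regexp = $html =~ m/href="([^"]+)"/ig;

# The tokenizer treats the whole comment as one token, so get_tag skips it.
my @by_tokens;
my $stream = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
while (my $tag = $stream->get_tag('a')) {
    push @by_tokens, $tag->[1]{'href'} if defined $tag->[1]{'href'};
}

print "regexp: @by_regexp\n";
print "tokens: @by_tokens\n";
```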
#101
The Token View
Example case: Scan a document for
<a href="url">stuff</a> and
<form action="url" method="method">.
Unlike with tables, this task is unlikely to
require knowledge of HTML tag-implication!
#102
When Tokens are Fine
Consider this text:
<p><a name="start1">If</a> you mix one part
<a href="http://www.downy.com/">Downy</a> to
five parts water, and put it in a spray bottle,
you can spray it on the carpet to alleviate
static problems.
<form method="get" action="more_downy_fun.pl">
<input size="30" type="text" name="query">
<input type="submit" value="MORE TIPS?">
</form>
#103
Sample HTML::TokeParser Code
use strict;
use HTML::TokeParser;
my $stream = HTML::TokeParser->new('thang.html');
# or ->new(\$content)
# get_tag gives:
# ['foo', \%attributes, \@attributes, $orig_text]
# or ['/foo', $orig_text]
while(my $tag = $stream->get_tag) {
if($tag->[0] eq 'a') {
my $url = $tag->[1]{'href'} || next;
my $text = $stream->get_trimmed_text("/a");
print "A=>\t$url\t$text\n";
} elsif($tag->[0] eq 'form') {
my $url = $tag->[1]{'action'} || "(nil?!)";
my $action = $tag->[1]{'method'} || "get";
print "FORM=>\t$url\t$action\n";
} else {
print "# Ignoring ", $tag->[0], "\n";
}
}
#104
HTML::TokeParser Vitals
An HTML::TokeParser object is...
- A big long tickertape of HTML tokens
- ...which you can scan thru with methods
like get_token, get_tag, get_text, etc.
(Hidden attributes: what file/string you're parsing;
and your current position in the tape.)
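For a feel of the "tickertape", here is a sketch that steps through a tiny string with get_token and records each token's type code (the sample string is made up):

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<p align=center>stuff</p><!-- comment -->';
my $stream = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

# Each token is an array ref whose first item names the token type:
# "S" start-tag, "E" end-tag, "T" text, "C" comment.
my @types;
while (my $token = $stream->get_token) {
    push @types, $token->[0];
    print join(' | ', $token->[0], $token->[1]), "\n";
}
```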
#105
HTML::Tree
A pair of modules that build a real in-memory parse tree.
I.e., they go from anything like this:
<table>
Cost <td>$ <td>Desc
<tr> Car <td>$10,000
</table>
...to this:
                     table
                  /         \
             tr                tr
         /    |    \            |    \
     td      td      td        td        td
      |       |       |         |         |
   "Cost"    "$"   "Desc"     "Car"   "$10,000"
#106
HTML::Tree Features
- Uses HTML::Parser to do the tokenizing.
- Tolerant of badly expressed HTML.
- Scanning an HTML document means scanning
its doc tree (regardless of representation).
- Uses more memory and more time than just a token view.
- A doc tree isn't a string, but a 2D
data structure.
- But everything you can do with tokens,
you can do with an HTML tree.
#107
HTML::Tree
"If you had to use just one module for all your HTML parsing..."
Just be sure you're using the latest version!
#108
HTML::Element Vitals
HTML::Tree consists of two main classes, HTML::Element and HTML::TreeBuilder.
- An HTML::Element object is...
- an element in an HTML document tree.
- ...thru which you can search for other components in the tree;
or which you can move around, dump to STDOUT, re-emit
as HTML, etc.
- object attributes: what element is its parent;
what elements or text nodes are its children;
its tag name, and its element attributes (like align="center")
#109
HTML::TreeBuilder Vitals
- An HTML::TreeBuilder object is...
- a special kind of HTML::Element object.
- the top element in an (initially blank) HTML document tree.
- ...which can do everything an HTML::Element object does, plus:
can be used in $root->parse_file($filename) or
$root->parse($content),$root->eof
- object attributes: same as HTML::Element attributes,
plus: options for planned parsing, like whether to store
comments.
#110
Lifecycle of an HTML::TreeBuilder object
- You create it:
my $root = HTML::TreeBuilder->new;
- You set any parse options:
$root->store_comments(1);
- You parse a document into it:
$root->parse_file($filename);
or
$root->parse($content), $root->eof;
- You extract information from it.
- You explicitly delete the tree from memory:
$root->delete();
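Put together, the lifecycle might look like the following sketch. The document string is made up, and the extraction step uses look_down, which is covered a few slides on.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up document, only for illustration.
my $content = '<html><head><title>Demo</title></head>'
            . '<body><h1>Hi</h1><!-- note --><p>I like <em>pie</em>!</p></body></html>';

my $root = HTML::TreeBuilder->new;   # create
$root->store_comments(1);            # set a parse option
$root->parse($content);              # parse...
$root->eof;                          # ...and signal end of input

# extract
my $title = $root->look_down('_tag', 'title')->as_text;
my $para  = $root->look_down('_tag', 'p')->as_text;
print "$title / $para\n";

$root->delete;                       # explicitly free the tree
```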
#111
HTML::Element Methods
There's dozens and dozens of methods in HTML::Element.
But here are the basics:
- $ele->parent
- What element (if any) is the parent of this element
- $ele->content_list
- What elements or text strings are children of this element.
- $ele->tag
- This node's tagname string. (E.g., "blockquote")
- $ele->attr("foo")
- Read the value for this node's "foo" attribute,
as from "<tagname foo=bar>".
(This returns undef if no such attribute.)
- $ele->attr("foo", "bar")
- Set the value for this node's "foo" attribute to "bar".
(Set to undef to delete the attribute.)
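A short sketch of these basics on a hand-built element. It uses new_from_lol (described under Constructor Methods) so that no parsing is involved; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', {'align' => 'center'}, 'I like ', ['em', 'pie'], '!']
);

my @kids = $p->content_list;          # text, an <em> element, text
my $em   = $kids[1];

print $p->tag, "\n";                  # this node's tag name
print $p->attr('align'), "\n";        # read an attribute
print $em->parent->tag, "\n";         # the <em>'s parent is the <p>

$p->attr('align', 'left');            # set an attribute
my $new_align = $p->attr('align');
$p->attr('align', undef);             # delete it again
```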
#112
Relationship Methods
- $ele->lineage
- The list of all $ele's ancestors. I.e., $ele->parent,
$ele->parent->parent, $ele->parent->parent->parent, etc.
- $ele->pindex
- Says where $ele appears in $ele->parent->content_list --
i.e., if $ele is at index 2 in that list, returns 2.
- $ele->right
- The node(s) to the right of $ele in the tree.
(Depends on scalar/list calling context.)
- $ele->left
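A sketch of the relationship methods on a hand-built list (built with new_from_lol; names are illustrative). Here left and right are called in scalar context, so each returns just the immediate sibling.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new_from_lol(
    ['ul', ['li', 'one'], ['li', 'two'], ['li', 'three']]
);
my $middle = ($ul->content_list)[1];

my $index     = $middle->pindex;     # position in the parent's content list
my $left      = $middle->left;       # scalar context: immediate left sibling
my $right     = $middle->right;      # scalar context: immediate right sibling
my @ancestors = $middle->lineage;    # parent, grandparent, ...

print "index $index, between '", $left->as_text,
      "' and '", $right->as_text, "'\n";
```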
#113
Dumping Methods
- $string = $ele->as_text;
- Returns a join("",...) of all the text descendants
of this node
- $string = $ele->as_HTML;
- Returns an HTML source representation of
$ele and its descendants.
- $ele->dump;
- Dumps $ele and its descendants as indented tree diagram,
to STDOUT.
#114
Detaching/Deletion Methods
- $ele->detach
- Remove $ele from its parent's content list.
- $ele->delete
- $ele->detach, plus deletes
it (and any descendants) from memory.
- @ex_content = $ele->detach_content
- Detach all of $ele's content nodes, and return them.
- $ele->replace_with( ...node or nodes... )
- Detach $ele and replace it with the nodes given.
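A sketch of replacing and detaching nodes in a hand-built paragraph (the element is made up, and the variable names are illustrative):

```perl
use strict;
use warnings;
use HTML::Element;

my $p = HTML::Element->new_from_lol(
    ['p', 'I like ', ['em', 'pie'], ['span', ' a lot'], '!']
);
my $em   = $p->look_down('_tag', 'em');
my $span = $p->look_down('_tag', 'span');

$em->replace_with('cake');       # <em> leaves the tree; the text takes its place
my $after_replace = $p->as_text;

$span->detach;                   # <span> leaves the tree but still exists
my $after_detach = $p->as_text;

# Free the nodes that are no longer attached to anything.
$em->delete;
$span->delete;

print "$after_replace\n$after_detach\n";
```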
#115
Constructor Methods
- $ele = HTML::Element->new('tagname', attr=>val,...);
- Construct a new element with given attributes.
- $treelet2 = $treelet->clone;
- Deep-copy this element and its children.
- $treelet = HTML::Element->new_from_lol(['p', {'align' => 'center'},
"I like ", ['em', "pie"], '!'])
- Create a new node/treelet based on this list (of lists)*.
- $ele->push_content(...eles or lols or text bits...)
- $ele->unshift_content(...eles or lols or text bits...)
- $ele->splice_content($offset, $length, ...eles or lols or text bits...)
- Alters $ele's content list.
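A sketch that builds a small list with these methods and then clones it. The element and variable names are illustrative; an element object is passed to splice_content here simply to keep the example conservative.

```perl
use strict;
use warnings;
use HTML::Element;

my $ul = HTML::Element->new('ul', 'class' => 'menu');
$ul->push_content(['li', 'two'], ['li', 'four']);     # append (lols are fine)
$ul->unshift_content(['li', 'one']);                  # prepend

# Insert at offset 2, removing nothing.
my $three = HTML::Element->new_from_lol(['li', 'three']);
$ul->splice_content(2, 0, $three);

# A deep copy: changing the clone leaves the original alone.
my $copy = $ul->clone;
$copy->push_content(['li', 'five']);

my @orig_items = map { $_->as_text } $ul->content_list;
my @copy_items = map { $_->as_text } $copy->content_list;
print "original: @orig_items\n";
print "clone:    @copy_items\n";
```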
#116
Searching Methods
- $ele->find_by_tag_name('a', 'area', ...)
- Elements at/under $ele that have
any of the given tag names.
- $ele->look_down( 'foo' => 'bar' )
- Elements whose "foo" attribute
value is "bar".
- $ele->look_down( 'class' => 'whizzbang',
'_tag' => 'br' )
- Elements whose "class" attribute
value is "whizzbang", and whose tagname is "br".
- $ele->look_down( '_tag' => 'td',
'rowspan' => "2",
sub { $_[0]->content_list < 3 } )
- Elements whose tagname is "td", and
whose "rowspan" attribute has value "2", and
who have fewer than three child nodes.
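A sketch of those search calls against a hand-built treelet (the tree and the class name are made up for illustration):

```perl
use strict;
use warnings;
use HTML::Element;

my $body = HTML::Element->new_from_lol(
    ['body',
        ['a', {'href' => '/one', 'class' => 'headlinelink'}, 'One'],
        ['a', {'href' => '/two'}, 'Second'],
        ['p', {'class' => 'headlinelink'}, 'Para'],
    ]
);

my @anchors   = $body->find_by_tag_name('a');
my @headlines = $body->look_down('class' => 'headlinelink');
my @both      = $body->look_down('_tag' => 'a', 'class' => 'headlinelink');

# A code ref is one more criterion: here, link text shorter than four characters.
my @short = $body->look_down('_tag' => 'a',
                             sub { length($_[0]->as_text) < 4 });

# Scalar context returns only the first match.
my $first = $body->look_down('_tag' => 'a');

print scalar(@anchors), ' anchors, ', scalar(@headlines), " headlinelinks\n";
```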
#117
More on ->look_down
->look_down in list context returns all matching items;
in scalar context, returns the first matching element,
or undef if none.
You can nest look_downs!
@big_tables = $tree->look_down(
'_tag' => 'table',
'border' => 3,
sub {
my @imgs = $_[0]->look_down('_tag' => 'img',
sub { ($_[0]->attr('width') || 0) > 150 }
);
@imgs > 20;
}
);
Return all tables containing more than twenty
img elements of width greater than 150.
#118
Yet More on ->look_down
To extract all the headlines from a Yahoo News page
(but not other links there),
it came down to wanting only links from the first
paragraph that had more than two BR's as children:
my $p = $tree->look_down(
'_tag', 'p',
sub {
2 < grep { ref($_) and $_->tag eq 'br' }
$_[0]->content_list
}
);
die "no headlines-p in $url?" unless $p;
@links = $p->look_down('_tag', 'a');
die "no headlines in $url?" unless @links;
#119
Alternative Approach: Positional Selection
Sometimes more direct:
my $table = ( $tree->look_down('_tag','table') )[1];
my $row2 = ( $table->look_down('_tag', 'tr' ) )[1];
my $col3 = ( $row2->look_down('_tag', 'td') )[2];
...then do things with $col3...
I.e., get what's in the third column of the second row
of the second table element in a page.
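A runnable sketch of positional selection on a made-up two-table page. Note that look_down also finds nested tables and rows, so this only works as written when the tables are not nested.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up page, only for illustration.
my $content = join '',
    '<html><body>',
    '<table><tr><td>x</td></tr></table>',
    '<table>',
    '<tr><td>a</td><td>b</td><td>c</td></tr>',
    '<tr><td>d</td><td>e</td><td>f</td></tr>',
    '</table>',
    '</body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($content);
$tree->eof;

my $table = ( $tree->look_down('_tag', 'table') )[1] or die "no second table";
my $row2  = ( $table->look_down('_tag', 'tr') )[1]   or die "no second row";
my $col3  = ( $row2->look_down('_tag', 'td') )[2]    or die "no third cell";

my $cell = $col3->as_text;
print "$cell\n";

$tree->delete;
```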
#120
Alternative Approach: Selection by "class" attribute
Glean semantic information from CSS-tagging:
my @links = $tree->look_down(
'class' => 'headlinelink'
);
#121
look_down Case Study: H1-Matching
Suppose you have a lot of press release files, and your
task is extracting the headline of each.
A typical case:
<h1><center>Visit Our Corporate Partner
<br><a href="/dyna/clickthru"
><img src="/dyna/vend_ad"></a>
</center></h1>
<h1><center>ConGlomCo Announces
New Regional HQ in Ouagadougou
</center></h1>
How to tell the ad from the real headline?
#122
Headline-Matching 1
<h1><center>Visit Our Corporate Partner
<br><a href="/dyna/clickthru"
><img src="/dyna/vend_ad"></a>
</center></h1>
<h1><center>ConGlomCo Announces
New Regional HQ in Ouagadougou
</center></h1>
Code:
# screen for that "Visit our..." caption:
my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub { $_[0]->as_text !~ m/\bvisit/i }
);
...find the first h1 whose descendant-text doesn't
match m/\bvisit/i.
#123
Headline-Matching 2
<h1><center>Visit Our Corporate Partner
<br><a href="/dyna/clickthru"
><img src="/dyna/vend_ad"></a>
</center></h1>
<h1><center>ConGlomCo Announces
New Regional HQ in Ouagadougou
</center></h1>
Code:
# screen for images:
my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub { not $_[0]->look_down('_tag', 'img') }
);
# or: not $_[0]->find_by_tag_name('img')
...find the first h1 that has no image element inside it.
#124
More look_down Trouble
A troublesome case:
<h1><center>Visit Our Corporate Partner
<br><a href="/dyna/clickthru"
><img src="/dyna/vend_ad"></a>
</center></h1>
<h1><center>ConGlomCo President Schreck to
Visit Regional HQ
<br><a href="/photos/Schreck_visit_large.jpg"
><img src="/photos/Schreck_visit.jpg"></a>
</center></h1>
How to tell the ad from the real headline? Both have
image-links in them; both even say "Visit!"
#125
Headline-Matching 3
Find the first heading that contains no
images (unless they're images
with a src path that includes "/photos/"):
my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
my $img = $_[0]->look_down('_tag','img');
return 1 unless $img;
# no image means it's fine
return 1 if $img->attr('src') =~ m{/photos/};
# good if a photo
return 0; # otherwise bad
}
);
#126
Headline-Matching 4
Find the first heading with no links to hrefs
that include "/dyna":
my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
0 == grep { ($_->attr('href') || '') =~ m{/dyna} }
$_[0]->look_down('_tag','a');
}
);
The "right" answer is whatever best fits your data.
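To try Headline-Matching 3 and 4 on the troublesome case, here is a sketch. The tree is built by hand with new_from_lol rather than parsed, so the result does not depend on how a parser treats that markup; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Element;

# The "troublesome case" from above, as a hand-built tree.
my $tree = HTML::Element->new_from_lol(
    ['body',
        ['h1', ['center', 'Visit Our Corporate Partner', ['br'],
            ['a', {'href' => '/dyna/clickthru'},
                ['img', {'src' => '/dyna/vend_ad'}]]]],
        ['h1', ['center', 'ConGlomCo President Schreck to Visit Regional HQ', ['br'],
            ['a', {'href' => '/photos/Schreck_visit_large.jpg'},
                ['img', {'src' => '/photos/Schreck_visit.jpg'}]]]],
    ]
);

# Headline-Matching 3: no image, or an image under /photos/.
my $by_img = $tree->look_down('_tag', 'h1', sub {
    my $img = $_[0]->look_down('_tag', 'img');
    return 1 unless $img;
    return 1 if ($img->attr('src') || '') =~ m{/photos/};
    return 0;
});

# Headline-Matching 4: no links whose href includes "/dyna".
my $by_href = $tree->look_down('_tag', 'h1', sub {
    0 == grep { ($_->attr('href') || '') =~ m{/dyna} }
         $_[0]->look_down('_tag', 'a');
});

print $by_img->as_text, "\n";
print $by_href->as_text, "\n";
```

On this data both tests settle on the same heading, which is the point of the slide: either works as long as it fits what your pages actually contain.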
#127
Future Developments
Currently, the Web is, in no particular order:
- gopher, anon-ftp, HTTP
- URI/URN/URLs
- HTML
- Evil evil JavaScript
- XML
- Graphics formats: Flash, PNG, GIF, JPEG
- PDF
- Music/audio formats: MIDI, RealAudio, MP3
#128
Future of gopher
(This space intentionally left blank.)
#129
Future of HTTP
The protocol itself is unlikely to change significantly.
Development of LWP HTTP modules will probably involve incidental
bugfixes, and tidying up support for HTTP/1.1 features.
(Example: adding/changing interface for asking things about HTTPS
certificates.)
open(WEB, "<http://www.suck.com") in Perl 6?
#130
Future of URI/URN/URLs
Consensus on URNs should come "Real Soon Now".
(Mid-2030s?)
Then adding new schemes to URI.pm is relatively trivial.
#131
Future of evil evil JavaScript
Occasionally people on the libwww-perl list muse that it'd be
nice to build a Perl interface to something like the Mozilla
project's standalone JavaScript engine.
This sounds hard, and messy, and whoever does this will earn a
reputation as an amazing lunatic.
Would it really be useful?
#132
Future of JPEG, PNG, GIF, Flash
Graphics will always be with us, and it'd be neat to
extract meaningful information from them.
Most algorithms that extract semantic information from graphics
are task-specific: OCRing out the text, recognizing faces,
reading expressions, identifying objects.
So don't expect a generally useful module anytime soon.
(How semantic will Scalable Vector Graphics be, in practice?)
#133
Future of music formats:
MIDI, RealAudio, MP3
Music is presumably not something that anyone but musicians or
music librarians would want to pull "semantic" content out of.
#134
Future of voice formats:
RealAudio, MP3, etc.
There will be ever-more audio on the Web.
Speech-to-text programs can, ideally, turn an audio stream into
a text stream, and presumably distinguish different speakers.
Once we have that, it seems simple to build a Perl interface to
whatever text format comes out of a speech-to-text program.
Imagine a program that gets the NPR Newscast and makes a
transcript: you can probably read faster than you can listen.
#135
(Future of video formats?)
Usefully treatable as an unrelated audio stream and an unrelated
moving-picture stream?
There's already TV metainformation databases: tvguide.com,
clicktv.com, etc. !
Extracting text from the closed-captioning / videotext sideband?
Speech-to-texting the descriptive audio sideband?
What other interesting sidebands are in HDTV?
#136
Future of PDF
Generally less semantic than HTML, and more semantic than
GIF/PNG/etc.
A nightmare scenario: PDF becomes more tightly integrated into
browsers, and we get called on to deal with "all-PDF" sites.
#137
Future of HTML
- The W3C would be very pleased if we could wave a magic wand
(Raggett's
tidy?)
and turn all existing HTML into XHTML.
- XHTML is mostly just a different expression of an HTML parse
tree -- so extraction tasks are the same.
- Don't hold your breath for any new additions to HTML
per se.
- The basic HTML modules (HTML::Parser and HTML::TreeBuilder) are
about as smart as they're going to get. Improvements will probably
be mostly obscure bugfixes, or shortcuts (parse_web?), and
task-specific methods (e.g., number_lists).
#138
(Future of CSS)
- CSS is by definition non-semantic, so there's no "meaningful"
information in it -- a stylesheet is just directions to rendering
engines for varying media.
- Notable exception: If part of an HTML page has a CSS style that
says "if you hardcopy this document, leave this bit out". Is that
a semantic hint?
#139
Future of XML
- Generally, XML (other than XHTML) is vastly more semantic than
HTML, so data extraction tasks will be simpler. Simpler!
- But you still need to extract things.
- There's query formalisms you can use: XQL, XPath, etc.
- Or you can use XML::TreeBuilder and use look_down!
(Or, feh, the XML Document Object Model.)
- Some extraction tasks are amenable to XML::Simple,
or other things turning XML into easy-to-access
Perl data structures.
- If all else fails (and sometimes it does), then you can always
traverse the tree, recursively considering what to do for every
node. (See HTML::Element::traverse, but don't try this at home, kids!)