Web Access with Perl's LWP Modules

Sean Burke has published a newer article on perl.com about LWP, drawn mostly from his LWP book. That article is up to date and well written. Highly recommended!

  1. LWP: Lib Web Perl
    1. Needed Versions
    2. About the Docs
  2. LWP::Simple and GETting URLs
    1. Perspective on LWP::Simple
    2. LWP::Simple's get($url)
    3. LWP::Simple's head($url) Function
    4. head()-based Link Checker
    5. The Beginning of HTTP Hassles
    6. A Google Frequency Reporter
      1. Running a Google Search
      2. Coding it Up
      3. ...And How That Looks
      4. URL-encoding
      5. URI::Escape
      6. Using URI::Escape
    7. LWP::Simple in Conclusion
  3. HTTP Basics
    1. HTTP Session: GET, and 200
    2. HTTP Session: GET, and 404
    3. HTTP Session: POST, and 200
    4. HTTP Session: GET, and 301
    5. Common HTTP Status Codes
  4. LWP Classes
    1. OOP Basics
      1. Basics of Objects in Perl
      2. OOP Jargon
      3. Class Details
      4. An Object's Meaning & Function
      5. An Object's Meaning & Function (b)
      6. An Object's Meaning & Function (c)
      7. Each Object's Attributes
      8. Each Object's Attributes (b)
      9. Each Object's Attributes (c)
      10. OOP in Action
    2. OOP Details
      1. Where to Learn More About Perl OOP
      2. Perl OOP Not-Oddities
      3. LWP OOP Oddities
      4. LWP OOP Oddities (b)
  5. LWP Class Model
    1. LWP::UserAgent
    2. HTTP::Response
    3. Simple ->get
    4. Simple ->head
    5. Wait, $browser->cookie_jar({}) ??
      1. $browser->cookie_jar({}) !
      2. $browser->cookie_jar({}) !!
      3. $browser->cookie_jar({}) !!!
      4. Must-Read LWP Docs
    6. Behind the Scenes: HTTP::Request
      1. HTTP::Request
      2. HTTP::Request (b)
      3. $browser->request
      4. $browser->request (b)
    7. Behind the Scenes: $browser->request
      1. ->request internals
      2. What's $browser->simple_request?
      3. ->simple_request internals
  6. LWP Access examples
    1. Bookmark Link Checker
      1. Matching Links
      2. Actual Useful Working Code!
      3. ...And How That Looks
      4. Noticing Redirection
      5. ...And How That Looks
    2. Primitive Remote Link Checker
      1. And a Checker Procedure
      2. Checking Just Absolute HTTP URLs
      3. ...And How That Looks
      4. The URI Class!
        1. URI Vitals
        2. URI Stringification
        3. Parts of a URI
        4. Using URI
      5. Making Our Checker a Bit Smarter
      6. ...And How That Looks
    3. Interfacing to Babelfish via POST
      1. POSTing data
      2. ->post Syntax
      3. Capturing the Output
      4. Add Interface Code...
    4. Double-Translation
  7. HTML Processing
    1. HTML Concepts
      1. Rudimentary SGML Concepts
      2. XML Working Concepts
      3. SGML Basics
      4. Back to HTML
      5. On Specificity in Specifications
      6. A Table in XML
      7. A Table in HTML
      8. HTML Hassles
    2. Overview of HTML::* Modules
      1. Making Do With No Module
        1. Getting Away with Regexps!
        2. Giving up the Regexps
      2. HTML::Parser
      3. HTML::TokeParser
      4. The Token View
        1. When Tokens are Fine
        2. Sample HTML::TokeParser Code
      5. HTML::TokeParser Vitals
      6. HTML::Tree
      7. HTML::Tree Features
  8. HTML::Tree
    1. HTML::Element Vitals
    2. HTML::TreeBuilder Vitals
    3. Lifecycle of an HTML::TreeBuilder object
    4. HTML::Element Methods
      1. Relationship Methods
      2. Dumping Methods
      3. Detaching/Deletion Methods
      4. Constructor Methods
      5. Searching Methods
    5. More on ->look_down
    6. Yet More on ->look_down
      1. Alternative Approach: Positional Selection
      2. Alternative Approach: Selection by "class" attribute
      3. look_down Case Study: H1-Matching
        1. Headline-Matching 1
        2. Headline-Matching 2
        3. More look_down Trouble
        4. Headline-Matching 3
        5. Headline-Matching 4
  9. Future Developments
    1. Future of gopher
    2. Future of HTTP
    3. Future of URI/URN/URLs
    4. Future of evil evil JavaScript
    5. Future of JPEG, PNG, GIF, Flash
    6. Future of music formats: MIDI, RealAudio, MP3
    7. Future of voice formats: RealAudio, MP3, etc.
    8. (Future of video formats?)
    9. Future of PDF
    10. Future of HTML
    11. (Future of CSS)
    12. Future of XML
  10. __END__

#1

Web Access with Perl's LWP Modules

Sean M. Burke
sburke@cpan.org
The Perl Conference, 2001


#2

LWP: Lib Web Perl

A bunch of open-source Perl modules (available on CPAN) for getting and parsing data from web sites. Yes, they're mostly classes. LWP is rather OOPy, but users who are not at home with OOP can get along fine.

#3

Needed Versions


#4

About the Docs

Every module that I discuss has documentation embedded as POD, readable with perldoc (or perlman, etc). E.g.: perldoc HTML::Element
Or you can look at the docs as web pages at http://search.cpan.org

Aside from a helpful overview document called "lwpcook", most of the documentation is meant as a reference, not as a tutorial.


#5

LWP::Simple and GETting URLs

LWP::Simple is a module that provides functions for GETting URLs.

Example comprehensive docs:


#6

Perspective on LWP::Simple

The concepts underlying those functions: The most "basic" of those functions is get().

#7

LWP::Simple's get($url)

Basic use:
  my $content = get('http://www.guardian.co.uk/');

Example use:

  use LWP::Simple;
  use strict;
  my $content = get('http://www.guardian.co.uk/');
   # the main page of a UK newspaper
  die "Hm, couldn't get Guardian!" unless defined $content;
  foreach my $keyword (qw( GM Intel Ital Canad Mexic)) {
    print "$keyword!\n" if $content =~ m/\b\Q$keyword/;
  }
  print "\n[End at ", scalar( localtime ), "]\n";
  exit;

#8

LWP::Simple's head($url) Function

A HEAD request is like a GET request, except that the server's response omits the actual message body. The server sends just the status line and the MIME headers.
  $whether_successful = head($url);
 
or:
  
  ($content_type, $content_length,
   $modified_time, $expires, $server
  ) = head($url);

Or if you want only parts of the return list, take a list-slice:

  ($content_length, $mod_time) = ( head($url) )[1,2];
(Note that if head fails and returns an empty list, that sets both $content_length and $mod_time to undef.)

#9

head()-based Link Checker

Simple link checker:
  use strict;
  use LWP::Simple;
  foreach my $url (@url_to_check) {
    print "$url is no good\n"
      unless scalar head($url);
  }

#10

The Beginning of HTTP Hassles

Although it's rare these days, there's some servers that don't understand HEAD requests on static objects (files).

There's many more CGIs that don't deal with HEAD requests. You might get the unsuccessful status code 405 (Method Not Allowed).

Or who knows, maybe you'll get a 500 error (general server/network error)!

Or the CGI might reply as if to a GET, and the server may or may not trim the content.
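
One way to live with these hassles is to fall back to a GET when a HEAD is refused. The following is only a sketch: it uses the ->head and ->get methods of LWP::UserAgent (covered later in this talk), and the choice of 405 and 501 as the "try GET instead" codes, as well as the names head_needs_get_retry and check_with_fallback, are assumptions of this example, not anything LWP defines.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# True if a HEAD response suggests that the server (or CGI) simply
# doesn't do HEAD, so that a GET is worth trying.
# Assumption: 405 (Method Not Allowed) and 501 (Not Implemented)
# are the codes worth a retry.
sub head_needs_get_retry {
  my $resp = $_[0];
  return( ($resp->code == 405 or $resp->code == 501) ? 1 : 0 );
}

# Try HEAD first; fall back to GET only when HEAD itself was refused.
sub check_with_fallback {
  my($browser, $url) = @_;
  my $resp = $browser->head($url);
  $resp = $browser->get($url) if head_needs_get_retry($resp);
  return $resp;
}
```

Usage would be something like check_with_fallback(LWP::UserAgent->new, $url)->is_success. Note that the fallback costs a full download, so it is best kept for the URLs that actually need it.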


#11

A Google Frequency Reporter

For each search term given, run a Google search on it, find the bit that says:

"Results 1 - 10 of about 2,760. Search took 0.09 seconds."

and report just the number.


#12

Running a Google Search

Going to Google and running a search on "stuff" gives us a URL like this:

http://www.google.com/search?q=%22stuff%22&btnG=Google+Search

Since we can paste that URL into the browser and have it work, it must be a GET URL. As a function of $word, we can model it with:

$content = get(
   'http://www.google.com/search?q=%22'
   . $word . '%22&btnG=Google+Search'
);
...which returns HTML source, which should contain either the string "did not match any documents", or a string like "of about <b>([0-9,]+)</b>".

#13

Coding it Up

use strict;
use LWP::Simple;
foreach my $q (@ARGV) { report_google_count($q) }

sub report_google_count {
  my $word = $_[0];
  my $url = 
   'http://www.google.com/search?q=%22'
   . $word . '%22&btnG=Google+Search'
  ;
  my $content = get($url);
  if(!defined $content) {
    print "$word: NOGO $url\n";
  } elsif($content =~ m/did not match any documents/) {
    print "$word: 0 matches\n";
  } elsif($content =~ m/of about <b>([0-9,]+)<\/b>/) {
    print "$word: $1 matches\n";    # like "1,952"
  } else {
    print "$word: Page not processable, at $url\n";
  }
}
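
The pattern-matching part of that routine can be tried out with no network access at all. Here is a sketch that pulls it into its own function (the name count_from_html is made up for this example) and returns a plain number instead of printing:

```perl
use strict;
use warnings;

# Given the HTML of a results page, return the match count as a plain
# integer, 0 for "no matches", or undef if the page isn't recognizable.
# The two patterns are the same ones used in report_google_count.
sub count_from_html {
  my $html = $_[0];
  return undef unless defined $html;
  return 0 if $html =~ m/did not match any documents/;
  if($html =~ m/of about <b>([0-9,]+)<\/b>/) {
    (my $count = $1) =~ tr/,//d;   # "2,760" becomes "2760"
    return $count + 0;
  }
  return undef;
}

print count_from_html('Results 1 - 10 of about <b>2,760</b>.'), "\n";
```

Keeping the fetching and the parsing apart like this makes it easier to fix the patterns when the page layout changes.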

#14

...And How That Looks

% perl woogle.pl asafetida asafoetida
asafetida: 2,760 matches
asafoetida: 7,850 matches

#15

URL-encoding

But what if we wanted to do:

% perl woogle.pl "boy toy" boytoy

The first term, boy toy, would make a search URL of:

http://www.google.com/search?q=%22boy toy%22&btnG=Google+Search

But we mustn't ever have spaces in URLs! Instead:

http://www.google.com/search?q=%22boy%20toy%22&btnG=Google+Search

URI::Escape to the rescue...


#16

URI::Escape

URI::Escape is a simple module that provides two functions:
$encoded = uri_escape($raw);
Returns a URL-encoded copy of $raw's value.
 
$raw = uri_unescape($encoded);
Returns a URL-decoded copy of $encoded's value.
 
So uri_escape("boy toy") is "boy%20toy".
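
A quick round trip shows both functions at work. The sample strings other than "boy toy" are made up for this example:

```perl
use strict;
use warnings;
use URI::Escape;   # exports uri_escape and uri_unescape

foreach my $raw ('boy toy', '"stuff"', 'fish & chips') {
  my $encoded = uri_escape($raw);
  my $decoded = uri_unescape($encoded);
  print "$raw  =>  $encoded  =>  $decoded\n";
}
```

Since the double-quote gets escaped too (as %22), one could also pass the quotes through uri_escape instead of writing %22 by hand in the URL.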

#17

Using URI::Escape

So we replace our line:
  my $url = 
   'http://www.google.com/search?q=%22'
   . $word . '%22&btnG=Google+Search'
  ;
with:
  use URI::Escape;
  my $url = 
   'http://www.google.com/search?q=%22'
   . uri_escape($word) . '%22&btnG=Google+Search'
  ;
And then:
% perl woogle.pl "boy toy" boytoy
boy toy: 27,700 matches
boytoy: 6,090 matches

#18

LWP::Simple in Conclusion

LWP::Simple is excellent for short, simple programs.

LWP::Simple is great when all you're doing is GETting what's at a URL.

What it doesn't do:

To do those, you use the full LWP::* / HTTP::* modules, as described later.

#19

HTTP Basics

HTTP is essentially a simple MIME protocol.

The client opens a connection to the server, sends a request line, some MIME headers, and then an optional message body.
The server responds with a status line, some MIME headers, and then an optional (usually present) message body.
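
To see that structure from Perl's side, here is a sketch that feeds a raw server reply (the same made-up one used in the sessions that follow) to HTTP::Response->parse. This assumes a reasonably recent libwww-perl, in which HTTP::Response has a parse constructor.

```perl
use strict;
use warnings;
use HTTP::Response;

# A status line, some headers, a blank line, and then the message body.
my $raw = join "\n",
  'HTTP/1.0 200 OK',
  'Content-type: text/html',
  'Content-length: 25',
  'Server: NCSA 3.9 (+mod_ada)',
  '',
  '<html>I like pie.</html>',
  '';

my $resp = HTTP::Response->parse($raw);
print "Status code:  ", $resp->code,                 "\n";
print "Message:      ", $resp->message,              "\n";
print "Content type: ", scalar($resp->content_type), "\n";
print "Server:       ", $resp->header('Server'),     "\n";
print "Body:         ", $resp->content;
```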


#20

HTTP Session: GET, and 200

Client says to www.secret.gov:

GET /foo/thing.html HTTP/1.0
Host: www.secret.gov
User-Agent: Mozilla/9.6
Referer: http://www.secret.gov/foo/main.html

[empty message-body]



Server:

HTTP/1.0 200 OK
Content-type: text/html
Content-length: 25
Server: NCSA 3.9 (+mod_ada)

<html>I like pie.</html>

#21

HTTP Session: GET, and 404

Client says to www.secret.gov:

GET /foo/thing2.html HTTP/1.0
Host: www.secret.gov
User-Agent: Mozilla/9.6

[empty message-body]



Server:

HTTP/1.0 404 Not Found
Content-type: text/plain
Content-length: 36
Server: NCSA 3.9 (+mod_ada)

No such object as /foo/thing2.html.

#22

HTTP Session: POST, and 200

Client says to www.secret.gov:

POST /foo/drawmap.ada HTTP/1.0
Host: www.secret.gov
Referer: http://www.secret.gov/mapform.shtml
Content-type: application/x-www-form-urlencoded
Content-length: 40
User-Agent: Mozilla/9.6

mlat=35.11721&mlon=-106.62463&msym=cross



Server:

HTTP/1.0 200 OK
Content-type: image/gif
Content-length: 94252
Server: NCSA 3.9 (+mod_ada)

[94,252 bytes of GIF data]

#23

HTTP Session: GET, and 301

Client says to www.secret.gov:

GET /foo/bar.xml HTTP/1.0
Host: www.secret.gov
User-Agent: Mozilla/9.6

[empty message-body]



Server:

HTTP/1.0 301 Moved Permanently
Server: NCSA 3.9 (+mod_ada)
Location: http://bar.secret.gov/xmllib/f1.xml

[empty message-body]
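
The kinds of reply seen in these sessions (success, error, redirect) can be told apart with the is_success, is_error, and is_redirect methods of HTTP::Response. A sketch with hand-built stand-ins for the server replies; the function name describe is made up for this example:

```perl
use strict;
use warnings;
use HTTP::Headers;
use HTTP::Response;

# Stand-ins for the server replies in the sessions above.
my @replies = (
  HTTP::Response->new(200, 'OK'),
  HTTP::Response->new(404, 'Not Found'),
  HTTP::Response->new(301, 'Moved Permanently',
    HTTP::Headers->new('Location' => 'http://bar.secret.gov/xmllib/f1.xml')),
);

# Say in words what kind of reply this is.
sub describe {
  my $resp = $_[0];
  return 'success' if $resp->is_success;
  return 'redirect to ' . ($resp->header('Location') || '(nowhere)')
    if $resp->is_redirect;
  return 'error' if $resp->is_error;
  return 'other';
}

print $_->status_line, ' => ', describe($_), "\n" foreach @replies;
```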

#24

Common HTTP Status Codes


#25

LWP Classes

LWP's modules are object-oriented -- which means you get to call them "classes".

That doesn't mean that programs that use LWP have to be object-oriented. (Mine typically aren't.)


#26

OOP Basics


#27

Basics of Objects in Perl

An "object" is a reference to a data structure that is special because:

#28

OOP Jargon


#29

Class Details


#30

An Object's Meaning & Function

What does the object mean? Then: what does it do?
an Imager object is...
a 2D bitmap
...which you can load from a GIF/JPEG/PNG, draw on, resize, crop, save, etc.
 
an IO::Socket object is...
a network socket
...which I can read from and/or write to -- and which I probably had to specify a network address and portnumber for, when I created it.

#31

An Object's Meaning & Function (b)

a Net::FTP object is...
an FTP connection from me to an FTP server; it's like a virtual WSFTP/Fetch/Anarchie/ftp(1) window.
...with which I can transfer a file at a time.
 
a Business::US_Amort object is...
a simulated loan ($170,000, 20 years, 8% fixed),
...which I can generate an amortization table for, or calculate the total interest for.

#32

An Object's Meaning & Function (c)

an LWP::UserAgent object is...
a browser
...with which I can get things from the Web.
 
an HTTP::Response object is...
a wrapper for data that comes back from a Web server.
...whose MIME type I can look at, whose data I can extract, whose HTTP status code I can check, etc.
 

#33

Each Object's Attributes

And, finally, what needs to be in each object?
an Imager object's attributes are...
every pixel's color; height and width of the bitmap; palette? source filename? current "pen" color and size? etc.
 
an IO::Socket object's attributes are...
its timeout setting; whether it's connected; etc.

#34

Each Object's Attributes (b)

a Net::FTP object's attributes are...
hostname; and, indirectly, the current remote directory, and ascii/binary mode
 
a Business::US_Amort object's attributes are...
intended term, principal, rate,
actual term, whether to output a table, whether calculations should round to the nearest cent; etc.

#35

Each Object's Attributes (c)

an LWP::UserAgent object's attributes are...
its user-agent string ("libwww/5.82", "Mozilla/4.76"); its cookies; its keyring for accessing password-protected URLs; how long it'll wait for a server to respond; etc.
 
an HTTP::Response object's attributes are...
its HTTP status code and message (404, "Not Found"); its data ("<html><head>..."); and all its header lines, like content_type (example value: "text/html"); etc.

#36

OOP in Action

   use LWP::UserAgent; # load the module
   
   use strict;    # always a good idea
   
   my $browser = LWP::UserAgent->new;
   
   print "Given name: ", $browser->agent(), "\n";

      # prints: libwww-perl/5.5394

   $browser->agent("NCognito/12.4");
   print "Code name: ", $browser->agent(), "\n";

      # prints: NCognito/12.4

#37

OOP Details

Object-oriented programming is a whole approach to program-design.

However, for purposes of dealing with LWP, you can just pretend it's a style of interface.


#38

Where to Learn More About Perl OOP


#39

Perl OOP Not-Oddities

Relative to other languages...

#40

LWP OOP Oddities


#41

LWP OOP Oddities (b)

To use a non-LWP example:
 $sax_track = MIDI::Track->new;
 ... then put sassy sax music into $sax_track ...
    
 $harp_track = MIDI::Track->new;
 ... then put soothing harp music into $harp_track ...
    
 $opus = MIDI::Opus->new;
 $opus->tracks( $sax_track, $harp_track );
  # The opus's "tracks" attribute is now a list
  #  of two track-objects!

#42

LWP Class Model

$resp = $browser->get( $url )
Basic classes:
LWP::UserAgent
HTTP::Response
 
Other important classes:
HTTP::Request
URI

#43

LWP::UserAgent

 $resp = $BROWSER->get( $url )

An LWP::UserAgent object is a browser, which you use for retrieving documents across the Web.
Notable attributes: agent name, cookie jar, key ring.
Made by:

         LWP::UserAgent->new

#44

HTTP::Response


use LWP 5.5394; # new features!
 $RESP = $browser->get( $url );
 $RESP = $browser->head( $url );
 $RESP = $browser->post( $url, ['k1'=>'v1',...] );

...each performs an HTTP request, and returns a new HTTP::Response object.

An HTTP::Response object contains the document that was returned.
Notable attributes: content (a big scalar); last_modified (in seconds since the epoch); content_type (like "text/html"); code (the status, like 200, 401, 404, 500); and is_success (based on the code).

If the server wasn't reachable at all, then the response is a dummy object just to hold the error code, probably 500.

(Examples follow!)


#45

Simple ->get

Redoing our LWP::Simple example program:
  use strict;
  use LWP 5.5394;  # Loads necessary LWP classes
  my $br = LWP::UserAgent->new;
  my $resp = $br->get('http://www.guardian.co.uk/');
  die "Hm, couldn't get Guardian: ", $resp->status_line
    unless $resp->is_success;

  die "It's not html, it's ", $resp->content_type
    unless $resp->content_type eq 'text/html';

  die "Odd!  It's stale!"
   if $^T - 24*60*60 > ($resp->last_modified || $^T);

  my $content = $resp->content;
  die "What?  Content is short!"
    unless length($content) > 15_000;

  foreach my $keyword (qw( GM Intel Ital Canad Mexic)) {
    print "$keyword!\n" if $content =~ m/\b\Q$keyword/;
  }
  print "\n[End at ", scalar( localtime ), "]\n";

#46

Simple ->head

What we did with this:
  use LWP::Simple;
  foreach my $url (@url_to_check) {
    print "$url is no good\n"
      unless scalar head($url);
  }
We can do with this:
  my @url_to_check = ( ...some absolute URLs... );
  use LWP 5.5394;
  my $browser = LWP::UserAgent->new;  
  $browser->cookie_jar( {} );  # for fun, enable cookies.
  
  foreach my $url (@url_to_check) {
    my $response = $browser->head($url);
    print "$url is no good: ", $response->message, "\n"
      unless $response->is_success;
  }

#47

Wait, $browser->cookie_jar({}) ??

The LWP::UserAgent docs say:

$ua->cookie_jar([$cookie_jar_obj])

Get/set the cookie jar object to use. [...] Normally this will be a HTTP::Cookies object or some subclass.
The default is to have no cookie_jar, i.e. never automatically add "Cookie" headers to the requests.

Shortcut: If a reference to a plain hash is passed in as the $cookie_jar_object, then it is replaced with an instance of HTTP::Cookies that is initialized based on the hash.
This form also automatically loads the HTTP::Cookies module. It means that:

  $ua->cookie_jar({ file => "$ENV{HOME}/.cookies.txt" });
is really just a shortcut for:
  require HTTP::Cookies;
  $ua->cookie_jar(HTTP::Cookies->new(file => "$ENV{HOME}/.cookies.txt"));
So?

#48

$browser->cookie_jar({}) !

So that's a shortcut for
  require HTTP::Cookies;
  $browser->cookie_jar(HTTP::Cookies->new());
...which makes it work like a normal cookie-using Web browser, but one whose cookies sit only in memory.
perldoc HTTP::Cookies explains how to read the cookie jar from disk, save it back to disk, and even use your Netscape's cookie file.
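
To watch the shortcut work without touching the network, one can look at the cookie_jar attribute before and after setting it. A small sketch:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
print "Before: ",
  (defined $browser->cookie_jar ? "has a jar" : "no cookie jar"), "\n";

$browser->cookie_jar( {} );   # the shortcut
print "After:  a jar of class ", ref($browser->cookie_jar), "\n";
```

The second line of output should name the class HTTP::Cookies, even though the program never mentioned that module.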

So?


#49

$browser->cookie_jar({}) !!

Most of the time, all you want to do with a cookie jar (an object of class HTTP::Cookies) is create it and make it the value of some $browser's cookie_jar attribute.

So that's why the LWP authors made an idiom for that!


#50

$browser->cookie_jar({}) !!!

So skim the docs, on paper, with a highlighter in hand.

There's a lot of things in the docs that you don't need to know. But you won't know which they are until you see them.


#51

Must-Read LWP Docs

There are many modules in the LWP distribution, and most of them are of nearly no conceivable interest to the average user.

Example: when you run $browser->get(...), $browser->post(...), etc., lots of things happen involving a whole LWP::Protocol::* hierarchy. However, LWP::Protocol::* is of no interest to the typical programmer. Ditto the modules that involve parsing HTTP headers, like HTTP::Date.

Of interest, however, are:
LWP::UserAgent; lwpcook; HTTP::Response and its superclasses, HTTP::Message and HTTP::Headers.
And, for later:
HTML::TreeBuilder, and the unavoidably quite long HTML::Element.

#52

Behind the Scenes: HTTP::Request

$resp = $browser->get($url) is a shortcut for:

{
  use HTTP::Request::Common;
   # exports functions that make HTTP::Request objects
  my $request = GET($url);
  $resp = $browser->request( $request );
}
That is: first we make an HTTP::Request object, and then we actually perform the request.

#53

HTTP::Request

HTTP::Request object is...
a planned HTTP request
...which you can actually perform with $browser->request($that_object);
Its attributes are: method (like 'GET'); uri (an absolute URL string); content (used for the form data in POST requests); and headers (like "Accept-Language").

#54

HTTP::Request (b)

How to make and then perform an HTTP::Request object:
The hard way:
$req = HTTP::Request->new('GET', $url);
$req->header('Accept-Language', 'en-US, it');
$resp = $browser->request($req);
 
The easy way:
use HTTP::Request::Common; # exports GET,HEAD,POST,PUT
$resp = $browser->request(GET($url, 'Accept-Language', 'en-US, it'));
 
The implicit way:
$resp = $browser->get($url, 'Accept-Language', 'en-US, it');
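
The hard way and the easy way build equivalent request objects, which can be checked without sending anything. A sketch; the URL is just the made-up one from the HTTP sessions earlier:

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Request::Common qw(GET);

my $url = 'http://www.secret.gov/foo/thing.html';

# The hard way...
my $req1 = HTTP::Request->new('GET', $url);
$req1->header('Accept-Language', 'en-US, it');

# ...and the easy way.
my $req2 = GET($url, 'Accept-Language', 'en-US, it');

# Neither request has been performed; we're only inspecting them.
foreach my $req ($req1, $req2) {
  print $req->method, ' ', $req->uri, "\n";
  print '  Accept-Language: ', $req->header('Accept-Language'), "\n";
}
```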
 

#55

$browser->request

Performs the request, returning a response. Ways to call it:
$browser->request( $req );
Just does it! This is the form that $browser->get(...) (and post, head, and put) are shortcuts for.
 
$browser->request( $req, $filename );
Response content is saved to $filename, instead of getting stored in the response object.
 
$browser->request( $req, \&callback );
$browser->request( $req, \&callback, $chunk_size );
Content is sent (preferably in blocks of given $chunk_size) to &callback.
 

#56

$browser->request (b)

So when do you need to use HTTP::Request objects and $browser->request? ->request was the only way to do it before LWP 5.5394 (May 2001), so it's all over existing code.

#57

Behind the Scenes: $browser->request

$browser->request($req, ...) is a wrapper around:
  $resp = $browser->simple_request($req, ...);
    
  if $resp's code says "but you need authentication"
    and we know a username+password (set by $browser->credentials),
      then redo the request with those credentials
    
  if $resp's code is a redirect
    and we're not caught in a loop
    and the req. method is in the list 
      $browser->requests_redirectable (normally HEAD,GET)
    and redirection isn't to a 'file' URL,   # security!
    then:
       $new_req = $req->clone
       $new_req->uri( $new_url )
       return $browser->request($new_req, ...)
         # ...whose response gets its ->previous set to $resp
  
  else return $resp

#58

->request internals

So?

#59

What's $browser->simple_request?

To perform a request to url "scheme:..." :
If $browser has a ->protocols_allowed list and scheme isn't in it,
or if $browser has a ->protocols_forbidden and scheme is in it,
  return an error response (code 500)

If we know how to handle this scheme (via a LWP::Protocol::scheme class),
  then use it for performing this request.

Otherwise, make an error object (code 500) with a message explaining we don't know that scheme.
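
The scheme check can be seen in action without any network traffic. In this sketch the allowed list and the ftp URL are arbitrary choices for illustration:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->protocols_allowed( ['http', 'https'] );

foreach my $scheme (qw(http ftp gopher)) {
  print "$scheme: ",
    ($browser->is_protocol_supported($scheme) ? "allowed" : "refused"),
    "\n";
}

# Asking for a refused scheme gives back an error response object;
# the request never leaves this machine.
my $resp = $browser->get('ftp://ftp.example.com/pub/README');
print $resp->status_line, "\n";
```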


#60

->simple_request internals


#61

LWP Access examples

Now to more examples.

To process data from the Web:


#62

Bookmark Link Checker

Let's check links in my bookmark file!
It starts out:
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
Do Not Edit! -->
<TITLE>Bookmarks for Sean M. Burke</TITLE>
<H1>Bookmarks for Sean M. Burke</H1>

<DL><p>
  <DT><H3 ADD_DATE="911669103">Personal Toolbar Folder</H3>
  <DL><p>
    <DT><A HREF="http://libros.unm.edu/" ...
    <DT><A HREF="http://www.melvyl.ucop.edu/" ...
    <DT><A HREF="http://www.guardian.co.uk/" ...
    <DT><A HREF="http://www.booktv.org/schedule/" ...
    <DT><A HREF="http://www.suck.com/" ...

#63

Matching Links

Suppose we want just the HTTP links. We can just match URLs with:

  m{<DT><A HREF="(http://[^"]+)"}s

Lines not matching that, we don't care about.
(Note that there's no relative links in a bookmark file.)
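
Here is that pattern tried against a few sample lines. The lines are made up in the style of the bookmark file (the ADD_DATE value, link text, and ftp entry are invented), and bookmark_url is just a name chosen for this sketch:

```perl
use strict;
use warnings;

# Return the http URL bookmarked on this line, or undef if the line
# isn't a bookmark entry for an http URL.
sub bookmark_url {
  my $line = $_[0];
  return $1 if $line =~ m{<DT><A HREF="(http://[^"]+)"}s;
  return undef;
}

my @lines = (
  '  <DT><H3 ADD_DATE="911669103">Personal Toolbar Folder</H3>',
  '    <DT><A HREF="http://www.guardian.co.uk/" ADD_DATE="0">Guardian</A>',
  '    <DT><A HREF="ftp://ftp.example.com/pub/">Some FTP site</A>',
);
foreach my $line (@lines) {
  my $url = bookmark_url($line);
  print defined($url) ? "Link: $url\n" : "Skipped a line\n";
}
```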


#64

Actual Useful Working Code!

use strict;
use LWP;
my $file =
 '/program files/netscape/users/sburke/bookmark.htm';
die "$file doesn't exist" unless -e $file;
open(IN, "<$file") or die "Can't read-open $file: $!";

my $browser = LWP::UserAgent->new;
$browser->agent('Checkasaurus/0.01');
$browser->timeout(10); # be impatient
my %seen;

while(<IN>) {
  next unless m{<DT><A HREF="(http://[^"]+)"}s;
  my $url = $1;
  next if $seen{$url}++;  # seen it!
  my $resp = $browser->head($url);
  if($resp->is_success) {
    print "## OK: $url\n";
  } else {
    print $url, "\n => ", $resp->status_line, "\n";
  }
  sleep 1;
}
print "## Done at ", scalar(localtime), "\n";

#65

...And How That Looks

## OK: http://libros.unm.edu/
## OK: http://www.nandotimes.com/noframes
...
http://www.amazon.com/exec/obidos/ASIN/B00001QGP4
 => 405 Method Not Allowed
## OK: http://www.yhchang.com/
## OK: http://low-vision.org/
http://www.helsinki.fi/~lukka/
 => 403 Forbidden
## OK: http://listserv.activestate.com/mailman/listinfo/perl-xml
## OK: http://www.lib.udel.edu/ud/spec/exhibits/forgery/psalm.htm
## OK: http://www.june29.com/HLP/   [altho I know that moved!]
## OK: http://www.manl.mb.ca/
http://inac.org/IrishPeople/gaelic/
 => 404 Not Found
## OK: http://members.tripod.com/~laoconnection/language1.htm
## OK: http://www.learnkhmer.com/
## OK: http://www.geocities.com/Athens/Academy/9594/tibet.html
...
But suppose we want to catch redirection.

#66

Noticing Redirection

Remember that ->request (as in ->head) can cause several real request/response cycles. Check ->previous!

To report redirection:

  ...
  if(! $resp->is_success) {
    print $url, "\n => ", $resp->status_line, "\n"
  } elsif($resp->previous and $resp->previous->is_redirect) {
    print "## Moved $url\n## => ", $resp->request->url, "\n";
  } else {
    print "## OK: $url\n";
  }
  ...
(Doesn't report unsuccessful redirection; doesn't deal right with multiple redirection.)
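
To deal with multiple redirection, one approach is to walk the whole ->previous chain. A sketch, using hand-built response objects so that it runs with no network; response_chain is a name made up for this example:

```perl
use strict;
use warnings;
use HTTP::Response;

# Return every response in the chain that led to $resp, oldest first,
# ending with $resp itself.
sub response_chain {
  my $resp = $_[0];
  my @chain;
  while($resp) {
    unshift @chain, $resp;
    $resp = $resp->previous;   # undef once we reach the first response
  }
  return @chain;
}

# Simulate two redirects followed by a success.
my $first  = HTTP::Response->new(301, 'Moved Permanently');
my $second = HTTP::Response->new(302, 'Found');
my $final  = HTTP::Response->new(200, 'OK');
$second->previous($first);
$final->previous($second);

print join(' -> ', map { $_->code } response_chain($final)), "\n";
```

With a real $resp from ->head or ->get, the same function lists each hop, and the last element is the response you actually got back.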

#67

...And How That Looks

## OK: http://libros.unm.edu/
## Moved http://www.nandotimes.com/noframes
##  => http://www.nandotimes.com/noframes/
...
## OK: http://www.lib.udel.edu/ud/spec/exhibits/forgery/psalm.htm
## Moved http://www.june29.com/HLP/
##  => http://www.ilovelanguages.com/
## OK: http://www.manl.mb.ca/
...

#68

Primitive Remote Link Checker

We want to get a remote HTML page (by URL) and check all the links in it.

There's modules that do proper intelligent link extraction from HTML (like HTML::LinkExtor), but we'll make do with this:

sub urls_in {
  my $url = $_[0];
  my $resp = $browser->get($url);
  die "Can't get $url: ", $resp->status_line, " "
   unless $resp->is_success;
  die "Guh?  $url is content-type ", $resp->content_type
   unless $resp->content_type eq 'text/html';
  $Base = $resp->base;
  my @urls = 
   ($resp->content =~ m/href="([^"]+)"/ig);  # dumb
  return @urls;
}
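
For comparison, here is roughly what the less dumb approach looks like with HTML::LinkExtor. This is a sketch: hrefs_in_html is a made-up name, the sample HTML is invented, and only the href attributes of <a> elements are kept.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Return the href values of all <a> elements in a chunk of HTML.
sub hrefs_in_html {
  my $html = $_[0];
  my $parser = HTML::LinkExtor->new;
  $parser->parse($html);
  $parser->eof;
  my @hrefs;
  foreach my $link ($parser->links) {
    my($tag, %attr) = @$link;   # like ('a', 'href', 'warning.html')
    push @hrefs, $attr{'href'} if $tag eq 'a' and defined $attr{'href'};
  }
  return @hrefs;
}

my $html = q{<a href="warning.html">Warning</a> <img src="me.gif">
 <A HREF='http://www.suck.com/'>Suck</A> and some text: href="not-a-link"};
print "$_\n" foreach hrefs_in_html($html);
```

Unlike the regexp, this copes with single-quoted attribute values and doesn't get fooled by href="..." appearing in plain text.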

#69

And a Checker Procedure

And we can recycle our link-checker code from before, as a routine:
sub check_url {  # given an absolute URL
  my $url = $_[0];
  my $resp = $browser->head($url);
  if(!$resp->is_success) {
    print $url, "\n => ", $resp->status_line, "\n";
  } elsif($resp->previous and $resp->previous->is_redirect) {
    print "## Moved $url\n##  => ", $resp->request->url, "\n";
  } else {
    print "## OK: $url\n";
  }
}

#70

Checking Just Absolute HTTP URLs

use strict;
use LWP;
my $browser = LWP::UserAgent->new;
$browser->agent('Checkasaurus/0.02');
$browser->timeout(10); # be impatient
my $Base;
...and the two subs, here...

my $hp = 'http://www.speech.cs.cmu.edu/~sburke/';
my @urls = urls_in($hp);
die "No urls in $hp?" unless @urls;

my %seen;
foreach my $url (@urls) {
  next if $seen{$url}++;
  unless($url =~ m{^http://}s) {
    print "Skipping <$url>\n";
    next;
  }
  check_url($url);
}

#71

...And How That Looks

Skipping <#work>
Skipping <#reference>
Skipping <mailto:sburke@cpan.org>
Skipping <warning.html>
## OK: http://machaut.uchicago.edu/cgi-bin/WEBSTER.sh?WORD=burke
Skipping <not_dead.html>
http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?cruft
 => 500 Can't connect to foldoc.doc.ic.ac.uk:80 (Timeout)
## Moved http://killallhumans.com/
## => http://killallhumans.com/kah/
...
How to check the relative URLs "warning.html" and "not_dead.html"?

#72

The URI Class!

(URI = Uniform Resource Identifiers, of which Uniform Resource Locators and Uniform Resource Names are (the?) two kinds.)

The URI class provides two useful constructors:

$url = URI->new_abs($rel_url, $abs_base_url);
$url = URI->new_abs($abs_url, $abs_base_url);
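Creates a new absolute URI object, resolving the first argument against the given base URL. (If the first argument is already absolute, the base is ignored.)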
 
$url = URI->new($abs_url);
Makes a URI object from this URL.

#73

URI Vitals

A URI object is:

#74

URI Stringification

use strict;
use URI;
use LWP::UserAgent;
my $br  = LWP::UserAgent->new;
my $url = URI->new('http://www.suck.com/');
print ref($br), "=> $br\n";
  # LWP::UserAgent=> LWP::UserAgent=HASH(0x1765188)
print ref($url), "=> $url\n";
  # URI::http=> http://www.suck.com/

# But alter it as a string, and it won't be an object:
$url .= '#today';
print ref($url), "=> $url\n";
  # => http://www.suck.com/#today

#75

Parts of a URI

 http://www.secret.gov/aliens/search3.dll?foo%20bar#baz
 [--]   [------------][-----------------] [-------] [-]
  |          host            path           query    |
scheme                                            fragment
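
Each of those parts has an accessor method of the same name. A sketch, using the same made-up URL as in the diagram:

```perl
use strict;
use warnings;
use URI;

my $url = URI->new(
  'http://www.secret.gov/aliens/search3.dll?foo%20bar#baz');

print "scheme:   ", $url->scheme,   "\n";   # http
print "host:     ", $url->host,     "\n";   # www.secret.gov
print "path:     ", $url->path,     "\n";   # /aliens/search3.dll
print "query:    ", $url->query,    "\n";   # foo%20bar
print "fragment: ", $url->fragment, "\n";   # baz
```

Note that ->query gives back the query in its URL-encoded form.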

#76

Using URI

use strict;
use URI;
my $url = URI->new_abs(
  '../clones/search2.ada',
  'http://secret.gov/reno/aliens/search.ada?replicants'
);

print "Hm, the URL's scheme is ", $url->scheme, ".\n";
 #-> ...http.

print $url, "\n";
 #-> http://secret.gov/reno/clones/search2.ada

$url->query('army "clone babies"');
print $url, "\n";

 #-> http://secret.gov/reno/clones/search2.ada?army%20%22clone%20babies%22

#77

Making Our Checker a Bit Smarter

# replace the main loop with this:
use URI;
foreach my $url (@urls) {
  next if $seen{$url}++;
  if($url =~ m{^#}s) {
    print "Skipping fragment <$url>\n"; next;
  }
  $url = URI->new_abs($url,$Base);
  if($url->scheme ne 'http') {
    print "Skipping non-http <$url>\n"; next;
  }
  check_url($url);
}

#78

...And How That Looks

Skipping fragment <#work>
Skipping fragment <#reference>
Skipping non-http <mailto:sburke@cpan.org>
## OK: http://www.speech.cs.cmu.edu/~sburke/warning.html
## OK: http://machaut.uchicago.edu/cgi-bin/WEBSTER.sh?WORD=burke
## OK: http://www.speech.cs.cmu.edu/~sburke/not_dead.html
http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?cruft
 => 500 Can't connect to foldoc.doc.ic.ac.uk:80 (Timeout)
## Moved http://killallhumans.com/
## => http://killallhumans.com/kah/
...

#79

Interfacing to Babelfish via POST

Babelfish (generally better accessed thru the WWW::Babelfish module) is a service run by Altavista that lets you feed bits of text thru machine-translation programs, via your browser.

But the HTML form you use is a POST form, not a GET form.


#80

POSTing data

Form data sent by POST is just like data sent by GET -- it's key+value pairs. Looking at the source for the Babelfish form shows that asking for "I like pie" to be translated from English to French, produces these three key+value pairs:
  urltext = I like pie
  lp = en_fr
  enc = utf8
Or, encoded:
  urltext=I%20like%20pie&lp=en_fr&enc=utf8

#81

->post Syntax

$resp = $browser->post($url, [k1 => v1, ... ] );
$resp = $browser->post($url, \@some_array );
$resp = $browser->post($url, {k1 => v1, ... } );
$resp = $browser->post($url, \%some_hash );

In this case,
$resp = $browser->post($url, [
  'urltext' => 'I like pie', 'lp' => 'en_fr', 'enc' => 'utf8',
] );
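
To see what such a call would actually send, one can build the request with POST from HTTP::Request::Common and inspect it without performing it. A sketch; the URL is the Babelfish one used elsewhere in this talk:

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build (but don't send) a POST request carrying the three pairs.
my $req = POST(
  'http://babelfish.altavista.com/translate.dyn',
  [ 'urltext' => 'I like pie', 'lp' => 'en_fr', 'enc' => 'utf8' ],
);

print $req->method, ' ', $req->uri, "\n";
print 'Content-Type: ', $req->header('Content-Type'), "\n";
print $req->content, "\n";
```

The encoded content may show the spaces as "+" rather than "%20"; in form data, both mean a space.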


#82

Capturing the Output

The return page from a Babelfish request has the translation as the content of the first <textarea>...</textarea>. Working that into a tidy function:
sub translate {
  my($text, $language_path) = @_;
  my $resp = $browser->post(
    'http://babelfish.altavista.com/translate.dyn',
    [ 'urltext' => $text, 'lp' => $language_path,
       'enc' => 'utf8'
  ]);
  die "Error in translation $language_path: ",
   $resp->status_line(), "\n" unless $resp->is_success();
    
  if($resp->content() =~ m{<textarea.*?>(.*?)</textarea>}is) {
    my $translation = $1;
    # Trim whitespace, and return.
    $translation =~ s/\s+/ /g;
    $translation =~ s/^ //s;
    $translation =~ s/ $//s;
    return $translation;
  } else {
    die "Can't find translation in $language_path response";
  }
}
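
To try the extraction step without touching the network, here is a small sketch that runs the same regexp and whitespace-trimming over a canned page. The sample HTML and the name extract_translation are invented for this illustration; the real Babelfish page may be laid out differently:

```perl
use strict;
use warnings;

# Same regexp and trimming as in translate(), minus the HTTP request.
sub extract_translation {
  my ($html) = @_;
  return unless $html =~ m{<textarea.*?>(.*?)</textarea>}is;
  my $translation = $1;
  $translation =~ s/\s+/ /g;   # collapse runs of whitespace
  $translation =~ s/^ //s;     # trim leading space
  $translation =~ s/ $//s;     # trim trailing space
  return $translation;
}

# Made-up response page: the first textarea holds the translation.
my $sample_page = <<'END_HTML';
<html><body><form action="translate.dyn" method="post">
<textarea rows="3" cols="40" name="q">
   J'aime
   la tarte
</textarea>
<textarea name="urltext">I like pie</textarea>
</form></body></html>
END_HTML

my $found = extract_translation($sample_page);
print defined $found ? "Translation: $found\n" : "No translation found\n";
# prints: Translation: J'aime la tarte
```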

#83

Add Interface Code...

use strict;
use LWP;
my $browser = LWP::UserAgent->new();
$browser->env_proxy;   # good if behind a firewall
# ...and then the translate function here
my $lang;
if(@ARGV and $ARGV[0] =~ m/^-(\w\w)$/s) {
  $lang = lc $1;   # if lang specified, like "-fr"
  shift @ARGV;
} else {
  my @languages = qw(it fr de es ja pt);
  $lang = $languages[rand @languages];
}
  
die "What to translate?\n" unless @ARGV;
my $in = join(' ', @ARGV);
print "=> via $lang => ", translate(
   translate($in, 'en_' . $lang),
$lang . '_en' ), "\n";

#84

Double-Translation

% alienate -de "Pearls before swine!"
=> via de => Beads before pigs!

% alienate "Bond, James Bond"
=> via fr => Link, Link Of James

% alienate "Shaken, not stirred"
=> via pt => Agitated, not agitated

% alienate -it "Shaken, not stirred"
=> via it => Mental patient, not stirred

% alienate -it "Guess what! I'm a computer!"
=> via it => Conjecture that what! They are a calculating!

% alienate 'It was more fun than a barrel of monkeys'
=> via de => It was more fun than a barrel drop hammer

% alienate -ja 'It was more fun than a barrel of monkeys'
=> via ja => That the barrel of monkey at times was many pleasures


#85

HTML Processing

The HTTP access methods I've discussed will get you objects of any media type.

Most content on the Web these days is in HTML, and so most data extraction tasks are about pulling data out of HTML.


#86

HTML Concepts

...Wherein your humble tutor presents HTML in terms of data structures, not in terms of how to use <blink> for fun and profit.

#87

Rudimentary SGML Concepts

An SGML document represents a tree structure.
Elements (data nodes) contain other elements and/or text nodes.
Elements can have attributes (key+value pairs)

(This ignores comments, as well as arcana like PI's, declarations, marked sections, etc...)

XML is a straightforward subtype of SGML.
HTML is a messy kind of SGML instance.


#88

XML Working Concepts

Instead of considering the document as representing a tree, imagine you're dumping a structure as a document:
         foo
        /   \
       bar  baz <-- hoo=hah
             |
          "quux"
This dumps to XML as:
  <foo>
    <bar></bar>
    <baz hoo="hah">quux</baz>
  </foo>
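
A minimal sketch of that "dumping" idea in Perl, assuming each element is stored as an array ref of [tagname, \%attributes, children...] and each text node as a plain string. That representation and the names dump_xml and xml_escape are choices made for this example only, and the output comes out on one line rather than indented:

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text and attribute values.
sub xml_escape {
  my ($text) = @_;
  $text =~ s/&/&amp;/g;
  $text =~ s/</&lt;/g;
  $text =~ s/>/&gt;/g;
  $text =~ s/"/&quot;/g;
  return $text;
}

# Recursively dump a node: strings are text nodes, array refs are elements.
sub dump_xml {
  my ($node) = @_;
  return xml_escape($node) unless ref $node;
  my ($tag, $attrs, @children) = @$node;
  my $attr_text = join '',
    map { sprintf ' %s="%s"', $_, xml_escape($attrs->{$_}) }
    sort keys %$attrs;
  return "<$tag$attr_text>"
       . join('', map { dump_xml($_) } @children)
       . "</$tag>";
}

my $tree =
  ['foo', {},
    ['bar', {}],
    ['baz', { hoo => 'hah' }, 'quux'],
  ];

print dump_xml($tree), "\n";
# prints: <foo><bar></bar><baz hoo="hah">quux</baz></foo>
```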

#89

SGML Basics

An element is symbolized by a start-tag (possibly containing attributes), content, and then an end-tag:

<foo>
A start-tag with the tag-name "foo".
 
</foo>
An end-tag with the tag-name "foo".
 
<baz hoo="hah">
A start-tag with the tag-name "baz", also expressing the attribute hoo="hah" for this element

#90

Back to HTML

Tim Berners-Lee, Weaving The Web, p. 41:
"There was a family of markup languages, the standard generalized markup language (SGML), already preferred by some of the world's documentation community [...]
I developed HTML to look like a member of that community."
[emphasis mine]

#91

On Specificity in Specifications

The character Chrissy in Jane Wagner's The Search for Signs of Intelligent Life in the Universe, p.35:
All my life I've always wanted to be somebody.
But I see now I should have been more
specific.

#92

A Table in XML

Suppose a table consists of rows, consisting of cells, consisting of data.
           table
         /       \
       tr          tr
    /   |  \        |  \
   td   td  td      td   td
   |    |   |       |     \
"Cost" "$"  "Desc" "Car"  "$10,000"
As XML:
<table>
  <tr>
    <td>Cost</td> <td>$</td> <td>Desc</td>
  </tr>
  <tr>
    <td>Car</td> <td>$10,000</td>
  </tr>
</table>

#93

A Table in HTML

All the uppercase tags here (bold on the original slide) are omissible:
<table>
  <TR>
    <TD>Cost</TD> <td>$</TD> <td>Desc</TD>
  </TR>
  <tr>
    <TD>Car</TD> <td>$10,000</TD>
  </TR>
</table>
Yes, all you need is:
<table>
  Cost <td>$ <td>Desc
  <tr> Car <td>$10,000
</table>
So?

#94

HTML Hassles

So, it's hard to write a regexp that matches the content of the second cell of the table, if it could be any of:
<table>
  Cost </td>$ <td>Desc
...

<table>
  <td>$Cost <td>$ <td>Desc
...

<table>
  <tr>$Cost </td> <td>$ <td>Desc
(...altho whichever of these your favorite browser happens to deign to render, is another matter altogether.)
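
To make the problem concrete, here is a small sketch with a naive "second cell" regexp that assumes every cell is written out as <td>...</td>. The pattern and the name second_cell are invented for this illustration; it works on the tidy form and silently finds nothing in the terse one:

```perl
use strict;
use warnings;

# Naive pattern: assumes every cell has both its start-tag and end-tag.
sub second_cell {
  my ($html) = @_;
  return $html =~ m{<td>.*?</td>\s*<td>(.*?)</td>}is ? $1 : undef;
}

my $tidy  = '<table><tr><td>Cost</td> <td>$</td> <td>Desc</td></tr></table>';
my $terse = '<table> Cost <td>$ <td>Desc </table>';

for my $case ([tidy => $tidy], [terse => $terse]) {
  my ($name, $html) = @$case;
  my $cell = second_cell($html);
  print "$name: ", (defined $cell ? "found <$cell>" : "no match"), "\n";
}
# prints:
#   tidy: found <$>
#   terse: no match
```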

#95

Overview of HTML::* Modules

The HTML::* modules are indispensable for data-extraction tasks.
But first see if you can do without.

#96

Making Do With No Module

Most HTML comes from a template. Take advantage of this!
  <html>
   <!-- Press Release Template 3 -->
   ...button bars...
   ...tables nested eight deep...
   ...banner ads...
  <!-- START -->
   ...All the things you want!...
  <!-- END -->
   ...more ads, table code, etc...
  </html>

#97

Getting Away with Regexps!

If you can just use m/<!-- START -->(.+?)<!-- END -->/, then do so!

(More realistic: one regexp covers 70% of cases, another RE covers another 10% of cases, another slightly different one covers 12%, and you identify the remaining 8% to be dealt with manually.)
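
A sketch of that "several regexps, then give up" approach, assuming the markers shown on the template slide. The second pattern (BEGIN STORY / END STORY) is a made-up stand-in for whatever older template you actually run into. Note the /s modifier, so that "." also matches the newlines inside the body:

```perl
use strict;
use warnings;

# Patterns to try, most common template first.
my @body_patterns = (
  qr/<!-- START -->(.+?)<!-- END -->/s,
  qr/<!-- BEGIN STORY -->(.+?)<!-- END STORY -->/s,  # hypothetical older template
);

# Return the first captured body, or nothing if no pattern matches.
sub extract_body {
  my ($html) = @_;
  for my $pattern (@body_patterns) {
    return $1 if $html =~ $pattern;
  }
  return;   # nothing matched: leave this page for manual handling
}

my @pages = (
  "<html>...ads...\n<!-- START -->\nAll the things you want!\n<!-- END -->\n...more ads...</html>",
  "<html><!-- BEGIN STORY -->Older layout<!-- END STORY --></html>",
  "<html>No markers at all</html>",
);

for my $html (@pages) {
  my $body = extract_body($html);
  if (defined $body) {
    $body =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
    print "Got: $body\n";
  } else {
    print "Needs manual handling\n";
  }
}
```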

"If the funk ain't broke, then don't try to fix it!"
  -- Bootsy Collins


#98

Giving up the Regexps

When to just give up and use the HTML::* modules:

#99

HTML::Parser

It's a tokenizer: parses HTML source as tokens, not elements.
Very tolerant of bizarre code.
<p align=center>
start-tag: tag name and attributes
</p>
end-tag: tag name
stuff
text: with references like &eacute; decoded.
<!-- comment -->
comment: (usually just ignored for most processing)
In spite of its name, HTML::Parser is not too friendly a module for everyday use.
Used widely by other modules, tho.

#100

HTML::TokeParser

A very friendly interface to the token view of HTML; uses HTML::Parser to do the actual "work".

For many applications, you don't need a real parse of a document; tokens are fine.

TokeParser lets you step thru the tokens in an HTML stream.

It's better than just a RE, when given this:

<!-- <a href="foo">woozlewuzzle</a> -->
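
A short sketch of the difference, assuming HTML::TokeParser is installed. The sample HTML and the link-matching regexp are made up for this comparison:

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'END_HTML';
<!-- <a href="foo">woozlewuzzle</a> -->
<p>See <a href="bar">the real link</a>.</p>
END_HTML

# Regexp approach: cannot tell that the first link is commented out.
my @regexp_links = $html =~ m/<a\s+href="([^"]*)"/g;

# Token approach: the comment arrives as one comment token, so the
# link inside it is never reported as a start-tag.
my $stream = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
my @token_links;
while (my $tag = $stream->get_tag('a')) {
  my $url = $tag->[1]{'href'};
  push @token_links, $url if defined $url;
}

print "regexp: @regexp_links\n";   # regexp: foo bar
print "tokens: @token_links\n";    # tokens: bar
```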


#101

The Token View

Example case: Scan a document for <a href="url"> stuff </a> and <form action="url" method="method">.

Unlike with tables, this task is unlikely to require knowledge of HTML tag-implication!


#102

When Tokens are Fine

Consider this text:

<p><a name="start1">If</a> you mix one part
<a href="http://www.downy.com/">Downy</a> to
five parts water, and put it in a spray bottle,
you can spray it on the carpet to alleviate
static problems.

<form method="get" action="more_downy_fun.pl">
 <input size="30" type="text" name="query">
 <input type="submit" value="MORE TIPS?">
</form>

#103

Sample HTML::TokeParser Code

use strict;
use HTML::TokeParser;
my $stream = HTML::TokeParser->new('thang.html');
#  or ->new(\$content)

# get_tag gives:
#    ['foo', \%attributes, \@attribute_order, $orig_text]
# or ['/foo', $orig_text]

while(my $tag = $stream->get_tag) {
  if($tag->[0] eq 'a') {
    my $url = $tag->[1]{'href'} || next;
    my $text = $stream->get_trimmed_text("/a");
    print "A=>\t$url\t$text\n";
    
  } elsif($tag->[0] eq 'form') {
    my $url    = $tag->[1]{'action'} || "(nil?!)";
    my $method = $tag->[1]{'method'} || "get";
    print "FORM=>\t$url\t$method\n";
  } else {
    print "# Ignoring ", $tag->[0], "\n";
  }
}

#104

HTML::TokeParser Vitals

An HTML::TokeParser object is... (Hidden attributes: what file/string you're parsing; and your current position in the tape.)

#105

HTML::Tree

A pair of modules that build a real in-memory parse tree.
I.e., they go from anything like this:
<table>
  Cost <td>$ <td>Desc
  <tr> Car <td>$10,000
</table>
...to this:
           table
         /       \
       tr          tr
    /   |  \        |  \
   td   td  td      td   td
   |    |   |       |     \
"Cost" "$"  "Desc" "Car"  "$10,000"

#106

HTML::Tree Features


#107

HTML::Tree

"If you had to use just one module for all your HTML parsing..."

Just be sure you're using the latest version!


#108

HTML::Element Vitals

HTML::Tree consists of two main classes, HTML::Element and HTML::TreeBuilder.
An HTML::Element object is...
an element in an HTML document tree.
 
...thru which you can search for other components in the tree;
or which you can move around, dump to STDOUT, re-emit as HTML, etc.
 
object attributes: what element is its parent; what elements or text nodes are its children; its tag name, and its element attributes (like align="center")

#109

HTML::TreeBuilder Vitals

An HTML::TreeBuilder object is...
a special kind of HTML::Element object.
 
the top element in an (initially blank) HTML document tree.
 
...which can do everything an HTML::Element object does, plus: can be used in $root->parse_file($filename), or in $root->parse($content) followed by $root->eof
 
object attributes: same as HTML::Element attributes, plus: options for planned parsing, like whether to store comments.

#110

Lifecycle of an HTML::TreeBuilder object


#111

HTML::Element Methods

There are dozens and dozens of methods in HTML::Element. But here are the basics:
$ele->parent
What element (if any) is the parent of this element
 
$ele->content_list
What elements or text strings are children of this element.
 
$ele->tag
This node's tagname string. (E.g., "blockquote")
 
$ele->attr("foo")
Read the value for this node's "foo" attribute, as from "<tagname foo=bar>". (This returns undef if no such attribute.)
 
$ele->attr("foo", "bar")
Set the value for this node's "foo" attribute to "bar". (Set to undef to delete the attribute.)

#112

Relationship Methods

$ele->lineage
The list of all $ele's ancestors. I.e., $ele->parent, $ele->parent->parent, $ele->parent->parent->parent, etc.
 
$ele->pindex
Says where $ele appears in $ele->parent->content_list -- i.e., if $ele is at index 2 in that list, returns 2.
 
$ele->right
The node(s) to the right of $ele in the tree.
(Depends on scalar/list calling context.)
 
$ele->left
The node(s) to the left of $ele in the tree.
(Depends on scalar/list calling context.)

#113

Dumping Methods

$string = $ele->as_text;
Returns a join("",...) of all the text descendants of this node
 
$string = $ele->as_HTML;
Returns an HTML source representation of $ele and its descendants.
 
$ele->dump;
Dumps $ele and its descendants as indented tree diagram, to STDOUT.

#114

Detaching/Deletion Methods

$ele->detach
Remove $ele from its parent's content list.
 
$ele->delete
Like $ele->detach, but also deletes it (and any descendants) from memory.
 
@ex_content = $ele->detach_content
Detach all of $ele's content nodes, and return them.
 
$ele->replace_with( ...node or nodes... )
Detach $ele and replace it with the nodes given.

#115

Constructor Methods

$ele = HTML::Element->new('tagname', attr=>val,...);
Construct a new element with given attributes.
 
$treelet2 = $treelet->clone;
Deep-copy this element and its children.
 
$treelet = HTML::Element->new_from_lol(['p', {'align' => 'center'}, "I like ", ['em', "pie"], '!'])
Create a new node/treelet based on this list (of lists)*.
 
$ele->push_content(...eles or lols or text bits...)
$ele->unshift_content(...eles or lols or text bits...)
$ele->splice_content($offset, $length, ...eles or lols or text bits...)
Alters $ele's content list.
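
To see several of these methods working together, here is a small sketch (assuming HTML::Tree is installed) that builds the "I like pie!" treelet from the list-of-lists shown above and then asks it about itself. The variable names are arbitrary:

```perl
use strict;
use warnings;
use HTML::Element;

# Build a tiny treelet: <p align="center">I like <em>pie</em>!</p>
my $p = HTML::Element->new_from_lol(
  ['p', {'align' => 'center'}, "I like ", ['em', "pie"], '!']
);
my $em = $p->look_down('_tag', 'em');   # scalar context: first match

my $tag      = $p->tag;                    # "p"
my $align    = $p->attr('align');          # "center"
my $children = scalar($p->content_list);   # 3 (text, em, text)
my $text     = $p->as_text;                # "I like pie!"
my $index    = $em->pindex;                # 1
my $left     = scalar($em->left);          # "I like "
my $right    = scalar($em->right);         # "!"

print "tag=$tag align=$align children=$children index=$index\n";
print "text: $text\n";
print "left of em: '$left', right of em: '$right'\n";

$p->delete;   # free the treelet when done
```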

#116

Searching Methods

$ele->find_by_tag_name('a', 'area', ...)
Elements at/under $ele that have any of the given tag names.
 
$ele->look_down( 'foo' => 'bar' )
Elements whose "foo" attribute value is "bar".
 
$ele->look_down( 'class' => 'whizzbang',
   '_tag' => 'br' )
Elements whose "class" attribute value is "whizzbang", and whose tagname is "br".
 
$ele->look_down( '_tag' => 'td',
   'rowspan' => "2",
   sub { $_[0]->content_list < 3 } )
Elements whose tagname is "td", and whose "rowspan" attribute has value "2", and who have fewer than three child nodes.

#117

More on ->look_down

->look_down in list context returns all matching items; in scalar context, returns the first matching element, or undef if none.

You can nest look_downs!

@big_tables = $tree->look_down(
  '_tag' => 'table',
  'border' => 3,
  sub {
    my @imgs = $_[0]->look_down('_tag' => 'img',
      sub { ($_[0]->attr('width') || 0) > 150 }
    );
    @imgs > 20;
  }
);
Return all tables with border="3" that contain more than twenty img elements of width greater than 150.
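
A quick sketch of the list-versus-scalar behavior, assuming HTML::Tree is installed. The sample document and its "whizzbang" class are made up for the demonstration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(<<'END_HTML');
<html><body>
<p class="whizzbang">one</p>
<p>two</p>
<p class="whizzbang">three</p>
</body></html>
END_HTML

# List context: every match, in document order.
my @all = $tree->look_down('_tag' => 'p', 'class' => 'whizzbang');
my @texts = map { $_->as_text } @all;

# Scalar context: only the first match...
my $first = $tree->look_down('_tag' => 'p', 'class' => 'whizzbang');
my $first_text = $first->as_text;

# ...or undef if nothing matches.
my $none = $tree->look_down('_tag' => 'blink');

print "all:   @texts\n";      # all:   one three
print "first: $first_text\n"; # first: one
print "blink: ", (defined $none ? "found" : "none"), "\n";

$tree->delete;   # free the tree when done
```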

#118

Yet More on ->look_down

To extract all the headlines from a Yahoo News page (but not other links there), it came down to wanting only links from the first paragraph that had more than two BR's as children:
 my $p = $tree->look_down(
   '_tag', 'p',
    sub {
      2 < grep { ref($_) and $_->tag eq 'br' }
               $_[0]->content_list
    }
 );
 die "no headlines-p in $url?" unless $p;
 @links = $p->look_down('_tag', 'a');
 die "no headlines in $url?" unless @links;

#119

Alternative Approach: Positional Selection

Sometimes more direct:
      my $table = ( $tree->look_down('_tag','table') )[1];
      my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
      my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
      ...then do things with $col3...
I.e., get what's in the third column of the second row of the second table element in a page.

#120

Alternative Approach: Selection by "class" attribute

Glean semantic information from CSS-tagging:
 my @links = $tree->look_down(
   'class' => 'headlinelink'
 );

#121

look_down Case Study: H1-Matching

Suppose you have a lot of press release files, and your task is extracting the headline of each.

A typical case:

  <h1><center>Visit Our Corporate Partner
   <br><a href="/dyna/clickthru"
     ><img src="/dyna/vend_ad"></a>
  </center></h1>
  <h1><center>ConGlomCo Announces
   New Regional HQ in Ouagadougou
  </center></h1>
How to tell the ad from the real headline?

#122

Headline-Matching 1

  <h1><center>Visit Our Corporate Partner
   <br><a href="/dyna/clickthru"
     ><img src="/dyna/vend_ad"></a>
  </center></h1>
  <h1><center>ConGlomCo Announces
   New Regional HQ in Ouagadougou
  </center></h1>
Code:
  # screen for that "Visit our..." caption:
  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub { $_[0]->as_text !~ m/\bvisit/i }
  );
...find the first h1 whose descendant-text doesn't match m/\bvisit/i.

#123

Headline-Matching 2

  <h1><center>Visit Our Corporate Partner
   <br><a href="/dyna/clickthru"
     ><img src="/dyna/vend_ad"></a>
  </center></h1>
  <h1><center>ConGlomCo Announces
   New Regional HQ in Ouagadougou
  </center></h1>
Code:
  # screen for images:
  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub { not $_[0]->look_down('_tag', 'img') }
  );
  
  #  or:  not $_[0]->find_by_tag_name('img')
...find the first h1 that has no image element inside it.

#124

More look_down Trouble

A troublesome case:

  <h1><center>Visit Our Corporate Partner
   <br><a href="/dyna/clickthru"
     ><img src="/dyna/vend_ad"></a>
  </center></h1>
  <h1><center>ConGlomCo President Schreck to
   Visit Regional HQ
   <br><a href="/photos/Schreck_visit_large.jpg"
     ><img src="/photos/Schreck_visit.jpg"></a>
  </center></h1>
How to tell the ad from the real headline? Both have image-links in them; both even say "Visit!"

#125

Headline-Matching 3

Find the first heading that contains no images (unless they're images with a src path that includes "/photos/"):
  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $img = $_[0]->look_down('_tag','img');
      return 1 unless $img;
        # no image means it's fine
      return 1 if ($img->attr('src') || '') =~ m{/photos/};
        # good if a photo
      return 0; # otherwise bad
    }
  );

#126

Headline-Matching 4

Find the first heading with no links to hrefs that include "/dyna":
  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      0 == grep { ($_->attr('href') || '') =~ m{/dyna} }
            $_[0]->look_down('_tag','a');
    }
  );
The "right" answer is whatever best fits your data.

#127

Future Developments

Currently, the Web is, in no particular order:

#128

Future of gopher


(This space intentionally left blank.)



#129

Future of HTTP

The protocol itself is unlikely to change significantly.

Development of LWP HTTP modules will probably involve incidental bugfixes, and tidying up support for HTTP/1.1 features.
(Example: adding/changing interface for asking things about HTTPS certificates.)

open(WEB, "<http://www.suck.com") in Perl 6?


#130

Future of URI/URN/URLs

Consensus on URNs should come "Real Soon Now".
(Mid-2030s?)

Then adding new schemes to URI.pm is relatively trivial.


#131

Future of evil evil JavaScript

Occasionally people on the libwww-perl list muse that it'd be nice to build a Perl interface to something like the Mozilla project's standalone JavaScript engine.

This sounds hard, and messy, and whoever does this will earn a reputation as an amazing lunatic.

Would it really be useful?


#132

Future of JPEG, PNG, GIF, Flash

Graphics will always be with us, and it'd be neat to extract meaningful information from them.

Most algorithms that extract semantic information from graphics are task-specific: OCRing out the text, recognizing faces, reading expressions, identifying objects.

So don't expect a generally useful module anytime soon.

(How semantic will Scalable Vector Graphics be, in practice?)


#133

Future of music formats:
MIDI, RealAudio, MP3

Music is presumably not something that anyone but musicians or music librarians would want to pull "semantic" content out of.

#134

Future of voice formats:
RealAudio, MP3, etc.

There will be ever-more audio on the Web.

Speech-to-text programs can, ideally, turn an audio stream into a text stream, and presumably distinguish different speakers.

Once we have that, it seems simple to build a Perl interface to whatever text format comes out of a speech-to-text program.

Imagine a program that gets the NPR Newscast and makes a transcript: you can probably read faster than you can listen.


#135

(Future of video formats?)

Usefully treatable as an unrelated audio stream and an unrelated moving-picture stream?

There are already TV metainformation databases: tvguide.com, clicktv.com, etc.!

Extracting text from the closed-captioning / videotext sideband?

Speech-to-texting the descriptive audio sideband?

What other interesting sidebands are in HDTV?


#136

Future of PDF

Generally less semantic than HTML, and more semantic than GIF/PNG/etc.

A nightmare scenario: PDF becomes more tightly integrated into browsers, and we get called on to deal with "all-PDF" sites.


#137

Future of HTML


#138

(Future of CSS)


#139

Future of XML


#140

__END__