
20.2. The LWP Modules

The LWP modules provide the core functionality for web programming in Perl. They contain the foundations for networking applications, protocol implementations, media type definitions, and debugging support.

The modules LWP::Simple and LWP::UserAgent define client applications that implement network connections, send requests, and receive response data from servers. LWP::RobotUA is another client application used to build automated web searchers following a specified set of guidelines.

LWP::UserAgent is the primary module used in applications built with LWP. With it, you can build your own robust web client. It is also the foundation for the Simple and RobotUA modules: LWP::RobotUA is a subclass of it, and LWP::Simple uses an LWP::UserAgent object internally. These two modules provide a specialized set of functions for creating clients.

Additional LWP modules provide the building blocks required for web communications, but you often don't need to use them directly in your applications. LWP::Protocol implements the actual socket connections with the appropriate protocol. The most common protocol is HTTP, but mail protocols (such as SMTP), FTP for file transfers, and others can be used across networks.

LWP::MediaTypes implements the MIME definitions for media type identification and mapping to file extensions. The LWP::Debug module provides functions to help you debug your LWP applications.

The following sections describe the RobotUA, Simple, and UserAgent modules of LWP.

20.2.1. LWP::RobotUA

The Robot User Agent (LWP::RobotUA) is a subclass of LWP::UserAgent and is used to create robot client applications. A robot application requests resources in an automated fashion. Robots perform such activities as searching, mirroring, and surveying. Some robots collect statistics, while others wander the Web and summarize their findings for a search engine.

The LWP::RobotUA module defines methods to help program robot applications and observes the Robot Exclusion Standard, which web server administrators can use (through a robots.txt file on their site) to keep robots away from certain (or all) areas of the site.

The constructor for an LWP::RobotUA object looks like this:

$rob = LWP::RobotUA->new(agent_name, email, [$rules]);

The first parameter, agent_name, is the user agent identifier used for the value of the User-Agent header in the request. The second parameter is the email address of the person using the robot, and the optional third parameter is a reference to a WWW::RobotRules object, which is used to store the robot rules for a server. If you omit the third parameter, the LWP::RobotUA module requests the robots.txt file from every server it contacts and generates its own WWW::RobotRules object.
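As a rough sketch of the constructor in use, the following passes an explicit WWW::RobotRules object as the third argument and then loads a sample robots.txt body into it by hand, so that the effect of the rules can be seen without network access. The agent name ExampleBot/0.1, the email address, and the www.example.com URLs are placeholders chosen for illustration.

```perl
use strict;
use warnings;
use LWP::RobotUA;
use WWW::RobotRules;

# Placeholder identity for the robot (hypothetical values).
my $agent = 'ExampleBot/0.1';
my $email = 'webmaster@example.com';

# Create the rules object ourselves and hand it to the constructor.
my $rules = WWW::RobotRules->new($agent);
my $rob   = LWP::RobotUA->new($agent, $email, $rules);

# Normally the robot fetches robots.txt itself; here a sample body is
# parsed directly so the rules can be inspected offline.
my $robots_txt = "User-agent: *\nDisallow: /private/\n";
$rules->parse('http://www.example.com/robots.txt', $robots_txt);

for my $url ('http://www.example.com/index.html',
             'http://www.example.com/private/data.html') {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'disallowed';
    print "$url is $verdict\n";
}
```

Sharing one rules object between several robots is one reason to pass the third argument; when it is omitted, each robot keeps its own rules.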

Since LWP::RobotUA is a subclass of LWP::UserAgent, the LWP::UserAgent methods are used to perform the basic client activities. The following methods are defined by LWP::RobotUA for robot-related functionality.

as_string

$rob->as_string( )

Returns a human-readable string that describes the robot's status.

delay

$rob->delay ([time])

Sets or returns the specified time (in minutes) to wait between requests. The default value is 1.

host_wait

$rob->host_wait(netloc)

Returns the number of seconds the robot must wait before it can request another resource from the server identified by netloc.

no_visits

$rob->no_visits(netloc)

Returns the number of visits to a given server. netloc is of the form user:password@host:port. The user, password, and port are optional.

rules

$rob->rules([$rules])

Sets or returns the WWW::RobotRules object $rules, which is used when determining if the robot is allowed access to a particular resource.

use_sleep

$rob->use_sleep ([boolean])

Determines whether the user agent should sleep( ) if requests to a server are made too quickly. The default is true. If set to false, an internal SERVICE_UNAVAILABLE response is generated, with a Retry-After header indicating when it is permissible to send another request to this server. With no arguments, returns the current value of this flag.
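The robot-specific methods can be tried without contacting any server. The following is a minimal sketch, assuming a hypothetical robot name, email address, and host; it only sets and reads back the attributes described above.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical robot identity; replace with your own details.
my $rob = LWP::RobotUA->new('ExampleBot/0.1', 'webmaster@example.com');

$rob->delay(0.5);     # wait 0.5 minutes (30 seconds) between requests
$rob->use_sleep(0);   # generate an error response instead of sleeping

my $netloc = 'www.example.com:80';   # placeholder host:port
printf "delay: %s min, use_sleep: %s\n",
    $rob->delay, $rob->use_sleep ? 'on' : 'off';
printf "visits to %s so far: %d\n", $netloc, $rob->no_visits($netloc);
printf "seconds to wait before the next request: %d\n",
    $rob->host_wait($netloc);
```

Turning use_sleep off is mostly useful when the program has other work to do and would rather reschedule a request than block.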

20.2.2. LWP::Simple

LWP::Simple provides an easy-to-use interface for creating a web client, although it is only capable of performing basic retrieving functions. An object constructor is not used for this class; it defines functions for retrieving information from a specified URL and interpreting the status codes from the requests.

This module isn't named Simple for nothing. The following shows how to use it to get a web page and save it to a file:

use LWP::Simple;

# Fetch the page and save it under a local filename.
$homepage = 'oreilly_com.html';
$status = getstore('http://www.oreilly.com/', $homepage);
print("hooray") if is_success($status);

The retrieving functions get and head return the URL's contents and header contents, respectively. The other retrieving functions return the HTTP status code of the request. The status codes are returned as the constants from the HTTP::Status module, which is also where the is_success and is_error functions are obtained. See Section 20.3.4, "HTTP::Status" for a listing of the response codes.

The user agent identifier produced by LWP::Simple is LWP::Simple/n.nn, in which n.nn is the version number of LWP being used.

The following are the functions exported by LWP::Simple.

get

get (url)

Returns the contents of the specified url. Upon failure, get returns undef. Other than returning undef, there is no way of accessing the HTTP status code or headers returned by the server.

getprint

getprint (url)

Prints the contents of url on standard output and returns the HTTP status code given by the server.

getstore

getstore (url, file)

Stores the contents of the specified url into file and returns the HTTP status code given by the server.

head

head (url)

Returns header information about the specified url in the form of: ($content_type, $document_length, $modified_time, $expires, $server). Upon failure, head returns an empty list.

is_error

is_error (code)

Given a status code from getprint, getstore, or mirror, returns true if the request was not successful.

is_success

is_success (code)

Given a status code from getprint, getstore, or mirror, returns true if the request was successful.

mirror

mirror (url, file)

Copies the contents of the specified url into file when the modification time or length of the online version differs from that of the named file, and returns the HTTP status code of the request.
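The functions above can be exercised together in one short script. This is a sketch that assumes a file: URL pointing at a temporary local file (built with URI::file), so it runs without a network connection; with an http: URL the calls are the same. The filenames are placeholders.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;
use URI::file;
use LWP::Simple;

# Create a small local document so the example runs without a network.
my $dir    = tempdir(CLEANUP => 1);
my $source = File::Spec->catfile($dir, 'source.txt');
open my $fh, '>', $source or die "Cannot write $source: $!";
print {$fh} "Hello from LWP::Simple\n";
close $fh;

# Hypothetical URL: a file: URL for the document just written.
my $url = URI::file->new_abs($source)->as_string;

# get returns the content, or undef on failure.
my $content = get($url);
print "get: $content" if defined $content;

# head returns a five-element list in list context.
my ($type, $length, $mtime, $expires, $server) = head($url);
print "head: type=$type length=$length\n" if defined $type;

# getstore and mirror return an HTTP status code.
my $copy   = File::Spec->catfile($dir, 'copy.txt');
my $status = getstore($url, $copy);
print "getstore: $status\n" if is_success($status);

my $mirror_status = mirror($url, $copy);
print "mirror: $mirror_status\n" unless is_error($mirror_status);
```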

20.2.3. LWP::UserAgent

Requests over the network are performed with LWP::UserAgent objects. To create an LWP::UserAgent object, use:

$ua = LWP::UserAgent->new( );

You give the object a request, which it uses to contact the server, and the information you requested is returned. The most often used method in this module is request, which contacts a server and returns the result of your query. Other methods in this module change the way request behaves. You can change the timeout value, customize the value of the User-Agent header, or use a proxy server.

The following methods are supplied by LWP::UserAgent.

new

$ua = LWP::UserAgent->new(%options)

Constructs a new LWP::UserAgent object and returns a reference to it. Key/value arguments may be provided to set up the initial state of the user agent. new accepts several options that correspond to the following attribute methods:

Key                     Default
agent                   "libwww-perl/#.##"
from                    undef
timeout                 180
use_eval                1
parse_head              1
max_size                undef
cookie_jar              undef
conn_cache              undef
protocols_allowed       undef
protocols_forbidden     undef
requests_redirectable   ["GET", "HEAD"]

The following additional options are also accepted:

env_proxy
If set to true, proxy settings are read from environment variables.

keep_alive
A number that will be passed on as the total_capacity for the connection cache. An LWP::ConnCache is set up (see the conn_cache method), and the HTTP/1.1 protocol module is enabled.
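As an illustration of the key/value form, the sketch below sets a few of the options listed above at construction time and reads them back through the matching attribute methods. The agent string, email address, and numeric values are arbitrary examples, not recommendations.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# The option values here are illustrative only.
my $ua = LWP::UserAgent->new(
    agent    => 'ExampleClient/0.1',      # User-Agent header value
    from     => 'webmaster@example.com',  # From header value
    timeout  => 30,                       # seconds, instead of 180
    max_size => 1024 * 1024,              # accept at most 1 MB of content
);

# Each option can be read back (or changed) through its attribute method.
printf "agent=%s from=%s timeout=%d max_size=%d\n",
    $ua->agent, $ua->from, $ua->timeout, $ua->max_size;

$ua->timeout(60);
print "timeout is now ", $ua->timeout, " seconds\n";
```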

agent

$ua->agent([string])

When invoked with no arguments, this method returns the current value of the identifier used in the User-Agent HTTP header. If invoked with an argument, the User-Agent header will use string as its identifier in the future.

_agent

$ua->_agent(  )

Returns the default agent identifier. This is a string of the form "libwww-perl/#.##", in which #.## is substituted with the version number of this library.

clone

$ua->clone(  )

Returns a copy of the LWP::UserAgent object.

conn_cache

$ua->conn_cache([$cache_object])

Defines the LWP::ConnCache object to use. With no arguments, returns the current LWP::ConnCache object.

cookie_jar

$ua->cookie_jar([$cjar])

Specifies the "cookie jar" object to use with the UserAgent object, or returns it if invoked with no argument. $cjar is a reference to an HTTP::Cookies object that contains client cookie data. See the HTTP::Cookies section for more information.

credentials

$ua->credentials(netloc, realm, uname, pass)

Uses the given username and password for authentication at the given network location and realm. This method supplies the values used to answer a WWW-Authenticate or Proxy-Authenticate challenge from a server. The get_basic_credentials method is called by request to retrieve the username and password, if they exist. The arguments are:

netloc
The network location (usually a string of the form host:port) to which the username and password apply.

realm
The name of the server-defined range of URLs that this data applies to.

uname
The username for authentication.

pass
The password for authentication. By default, the password will be transmitted with MIME base-64 encoding.

env_proxy

$ua->env_proxy(  )

Defines a scheme/proxy URL mapping by looking at environment variables. For example, to define the HTTP proxy, one would define the HTTP_PROXY environment variable with the proxy's URL. To define a domain to avoid the proxy, one would define the NO_PROXY environment variable with the domain that doesn't need a proxy.

from

$ua->from([email])

When invoked with no arguments, this method returns the current value of the email address used in the From header. If invoked with an argument, the From header will use that email address in the future. (The From header tells the web server the email address of the person running the client software.)

get

$ua->get($url, [Header => Value])

Shortcut for $ua->request(HTTP::Request::Common::GET( $url, Header => Value,... )).

get_basic_credentials

$ua->get_basic_credentials(realm, url)

Returns the list containing the username and password for the given realm and url when authentication is required by the server. This function is usually called internally by request. This method becomes useful when creating a subclass of LWP::UserAgent with its own version of get_basic_credentials. From there, you can rewrite get_basic_credentials to do more flexible things, such as asking the user for the account information, or referring to authentication information in a file. All you need to do is return a list in which the first element is a username and the second element is a password.
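The two approaches can be compared side by side. In this sketch, the class name TableAuthUA, the realm name, the accounts, and the URL are all hypothetical; the subclass looks up accounts in an in-memory table where a real client might prompt the user or read a protected file. Nothing is sent over the network: the script only calls get_basic_credentials directly, as request would.

```perl
use strict;
use warnings;
use URI;

# Hypothetical subclass that looks up accounts in an in-memory table.
package TableAuthUA;
use parent 'LWP::UserAgent';

my %account_for = (
    'Members Only' => [ 'alice', 'secret' ],   # realm => [user, password]
);

sub get_basic_credentials {
    my ($self, $realm, $url) = @_;
    my $account = $account_for{$realm};
    return $account ? @$account : ();
}

package main;

my $url = URI->new('http://www.example.com/members/index.html');

# Stock behavior: credentials stores the pair and
# get_basic_credentials finds it again.
my $plain = LWP::UserAgent->new;
$plain->credentials('www.example.com:80', 'Members Only', 'bob', 'hunter2');
my ($user, $pass) = $plain->get_basic_credentials('Members Only', $url);
print "stored credentials: $user\n";

# Subclass behavior: the overridden method supplies the pair.
my $custom = TableAuthUA->new;
my ($user2, $pass2) = $custom->get_basic_credentials('Members Only', $url);
print "table credentials: $user2\n";
```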

head

$ua->head($url, [Header => Value])

Shortcut for $ua->request(HTTP::Request::Common::HEAD( $url, Header => Value,...)).

is_protocol_supported

$ua->is_protocol_supported(proto)

Given a scheme, this method returns a true or false (nonzero or zero) value. A true value means that LWP knows how to handle a URL with the specified protocol. If it returns a false value, LWP does not know how to handle the URL.

max_size

$ua->max_size([size])

Sets or returns the maximum size (in bytes) for response content. The default is undef, which means that there is no limit. If the returned content is partial because the size limit was exceeded, then an X-Content-Range header will be added to the response.

mirror

$ua->mirror(url, file)

Given a URL and file path, this method copies the contents of url into the file when the length or modification date headers are different from any previous retrieval. If the file does not exist, it is created. This method returns an HTTP::Response object, in which the response code indicates what happened.

no_proxy

$ua->no_proxy(domains)

Does not use a proxy server for the specified domains.

parse_head

$ua->parse_head([boolean])

Sets or returns a true or false value indicating whether the <head> section of an HTML document is parsed to initialize response headers. The default is true.

post

$ua->post($url, \%formref, [Header => Value])

Shortcut for $ua->request(HTTP::Request::Common::POST( $url, \%formref, Header => Value,... )). The form reference is optional and can be either a hashref or an arrayref.
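To see what the shortcut hands to request, the request object can be built with HTTP::Request::Common::POST and inspected without sending it. This is a sketch only; the URL, form field names, and extra header are hypothetical.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical form handler URL and field names.
my $req = POST 'http://www.example.com/cgi-bin/search',
    [ query => 'perl lwp', max => 10 ],   # form fields (arrayref keeps order)
    'Accept-Language' => 'en';            # extra request header

print $req->method, ' ', $req->uri, "\n";
print 'Content-Type: ', $req->header('Content-Type'), "\n";
print 'Body: ', $req->content, "\n";
```

An arrayref preserves the order of the fields, whereas a hashref does not, which matters for the few servers that care about field order.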

protocols_allowed

$ua->protocols_allowed([\@protocols])

Assigns the list of protocols that $ua->request and $ua->simple_request will exclusively allow. With no arguments, returns a list of the protocols currently allowed. Assigning to a value of undef deletes the list.

protocols_forbidden

$ua->protocols_forbidden([\@protocols])

Assigns the list of protocols that $ua->request and $ua->simple_request will not allow. With no arguments, returns a list of the protocols currently prohibited. Assigning to a value of undef deletes the list.
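A brief sketch of how these restrictions behave, using protocols_allowed together with is_protocol_supported. The file path is a placeholder; because the file scheme is refused, nothing is actually read.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Restrict this agent to HTTP and HTTPS only.
$ua->protocols_allowed([ 'http', 'https' ]);

for my $scheme (qw(http file)) {
    printf "%-5s %s\n", $scheme,
        $ua->is_protocol_supported($scheme) ? 'supported' : 'refused';
}

# A request for a refused scheme fails locally with an error response.
# The path below is a hypothetical placeholder.
my $res = $ua->get('file:///etc/hosts');
print 'file request: ', $res->status_line, "\n";

# Passing undef removes the restriction again.
$ua->protocols_allowed(undef);
```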

proxy

$ua->proxy(prot, proxy_url)

Defines a URL (proxy_url) to use with the specified protocols, prot. The first parameter can be a reference to a list of protocol names or a scalar that contains a single protocol. The second argument defines a proxy URL to use with the protocol.
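The following sketch configures a proxy for two protocols and exempts some domains with no_proxy. The proxy host and domain names are hypothetical, and reading a setting back by calling proxy with only a protocol name is shown as a convenience for checking the configuration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Hypothetical proxy host; one call can cover several protocols.
$ua->proxy([ 'http', 'ftp' ], 'http://proxy.example.com:8080/');

# Hosts in these domains are contacted directly, bypassing the proxy.
$ua->no_proxy('localhost', 'intranet.example.com');

# With a single protocol argument, proxy returns the current setting.
print "http proxy: ", $ua->proxy('http'), "\n";
print "ftp proxy:  ", $ua->proxy('ftp'), "\n";
```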

put

$ua->put($url, [Header => Value])

Shortcut for $ua->request(HTTP::Request::Common::PUT( $url, Header => Value,...)).

redirect_ok

$ua->redirect_ok($this_request)

This method is called by request before it tries to follow a redirection to the request in $this_request. This should return a true value if this redirection is permissible.

request

$ua->request($request, [file | $sub, size])

Performs a request for the resource specified by $request, which is an HTTP::Request object. Returns the information received from the server as an HTTP::Response object. Normally, doing a $ua->request($request) is enough. You can also specify a subroutine to process the data as it comes in or provide a filename in which to store the entity body of the response. The arguments are:

$request
An HTTP::Request object. The object must contain the method and URL of the site to be queried. This object must exist before request is called.

file
Name of the file in which to store the response's entity body. When this option is used on request, the entity body of the returned response object will be empty.

$sub
A reference to a subroutine that will process the data of the response. If you use the optional third argument, size, the subroutine will be called any time that number of bytes is received as response data. The subroutine should expect each chunk of the entity body data as a scalar in the first argument, an HTTP::Response object as the second argument, and an LWP::Protocol object as the third argument.

size
Optional argument specifying the number of bytes of the entity body received before the sub callback is called to process response data.
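The three calling forms of request can be compared in one script. This is a sketch that assumes a file: URL for a temporary local document, so that it runs without a network connection; the filenames and the chunk size of 64 bytes are arbitrary choices.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;
use URI::file;
use HTTP::Request;
use LWP::UserAgent;

# Write a local document and address it with a file: URL so that the
# example does not depend on a network connection.
my $dir    = tempdir(CLEANUP => 1);
my $source = File::Spec->catfile($dir, 'page.txt');
open my $fh, '>', $source or die "Cannot write $source: $!";
print {$fh} "line $_ of the sample document\n" for 1 .. 20;
close $fh;

my $ua      = LWP::UserAgent->new;
my $request = HTTP::Request->new(GET => URI::file->new_abs($source));

# 1. Plain request: the body ends up in the response object.
my $response = $ua->request($request);
print 'plain: ', length($response->content), " bytes\n";

# 2. File form: the body is written to $saved, not kept in the response.
my $saved = File::Spec->catfile($dir, 'saved.txt');
$ua->request($request, $saved);
print 'saved: ', -s $saved, " bytes\n";

# 3. Callback form: $sub receives the body in chunks of about 64 bytes.
my ($body, $calls) = ('', 0);
my $sub = sub {
    my ($chunk, $res, $protocol) = @_;
    $body .= $chunk;
    $calls++;
};
$ua->request($request, $sub, 64);
print "callback: $calls calls, ", length($body), " bytes\n";
```

The callback form is the one to reach for when a response may be too large to hold in memory or when progress should be reported as data arrives.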

requests_redirectable

$ua->requests_redirectable([\@requests])

Assigns the list of request names that $ua->redirect_ok will allow to be redirected. With no arguments, returns the current list of request names. By default, GET and HEAD requests are allowed; to include POST requests, enter:

push @{ $ua->requests_redirectable }, 'POST';

timeout

$ua->timeout([secs])

When invoked with no arguments, timeout returns the timeout value of a request. By default, this value is three minutes. Therefore, if the client software doesn't hear back from the server within three minutes, it will stop the transaction and indicate that a timeout occurred in the HTTP response code. If invoked with an argument, the timeout value is redefined to be that value.

use_alarm

$ua->use_alarm([boolean])

Retrieves or defines the ability to use alarm for timeouts. By default, timeouts with alarm are enabled. If you plan on using alarm for your own purposes, or it isn't supported on your system, it is recommended that you disable alarm by calling this method with a value of 0.




Copyright © 2002 O'Reilly & Associates. All rights reserved.