<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><link rel="canonical" href="https://www.onperl.org/blog/onperl/page/mechanize"><title>On Perl :: Index</title><meta name="description" content="This is a site on Perl."><link rel="shortcut icon" href="/blog/onperl/page/favicon.ico"><script src="/scripts/common.js"></script><style type="text/css" media="all">@import "/styles/clean.css";</style><meta name="robots" content="follow,index,noarchive"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body><div id="boundary"><a href="/"><img loading="lazy" id="mast" src="/images/onperl.png" alt="onperl"></a><div id="sidebar"><div class="navigation"><div class="link"><a href="/blog/onperl/latest/10">recent</a></div><div class="link"><a href="/blog/onperl/archive/10">archive</a></div><div class="link"><a href="/blog/onperl/page/about">about</a></div><div class="link"><a href="/feed/onperl">feed</a></div><div class="link last">etc...</div></div></div><div id="content"><h1 class="post-title">Secure Scraping</h1><div class="post-content"><p>There are a few CPAN modules I consider "must-haves", and as of today WWW::Mechanize is on that list. If you've ever had to do any screen scraping (parsing the HTML of web pages for information), you'll want to try this module too.</p><p>In the example below, I use it to grab the subject of the most recent email in my web-based email account. This would be simple, except that my account requires a web-based login, which happens over an SSL connection and uses cookies. It could be a nightmare, but rest easy: WWW::Mechanize makes it seem almost trivial.</p><p>Before you get started, make sure you have the required modules installed...</p><pre class="code">
$ perl -MCPAN -eshell
cpan&gt; install IO::Socket::SSL
cpan&gt; install WWW::Mechanize
cpan&gt; install LWP::Protocol::https
cpan&gt; q
</pre><p>Now you can write something as simple as this...</p><pre class="code">
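use strict;
use warnings;
use WWW::Mechanize;

my $url = "https://mail.example.com/login";
my $username = "joe_user";
my $password = "secret";

# autocheck is off so the explicit success checks below
# decide how failures are reported
my $mech = WWW::Mechanize-&gt;new(
    agent =&gt; "Linux Mozilla",
    cookie_jar =&gt; {},
    autocheck =&gt; 0,
);

# fetch the login page over SSL
$mech-&gt;get($url);
unless ($mech-&gt;success) {
    die "Can't get login page $url: ",
    $mech-&gt;response-&gt;status_line;
}

# fill in the first form on the page and submit it;
# the session cookie ends up in the cookie jar
$mech-&gt;field(Email =&gt; $username);
$mech-&gt;field(Passwd =&gt; $password);
$mech-&gt;click();
unless ($mech-&gt;success) {
    die "Can't log in: ",
    $mech-&gt;response-&gt;status_line;
}

# scrape it: grab the text of the first table cell
my $content = $mech-&gt;content();
my ($latest) =
    $content =~ m{&lt;td&gt;(.+?)&lt;/td&gt;}i;
defined $latest
    or die "Can't find an email subject\n";

print "Latest email: \"$latest\"\n";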
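</pre><p>The regular expression above only matches a bare &lt;td&gt; tag, so it stops working as soon as the cell gains an attribute. As a rough sketch of a sturdier approach, the same "first table cell" lookup can be done with HTML::TokeParser, which is part of the HTML-Parser distribution and should already be installed alongside WWW::Mechanize. The sample markup and the first_cell_text helper are made up for illustration; in the real script you would pass in $mech-&gt;content() instead, and your mail provider's markup will certainly differ...</p><pre class="code">
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup standing in for $mech-&gt;content();
# a real inbox page will be laid out differently.
my $content = &lt;&lt;'HTML';
&lt;table&gt;
  &lt;tr&gt;&lt;td class="subject"&gt;Lunch on Friday?&lt;/td&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;td class="subject"&gt;Weekly report&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
HTML

# Return the trimmed text of the first &lt;td&gt; in the document,
# or undef if there is none.
sub first_cell_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser-&gt;new(\$html)
        or die "Can't parse HTML";
    # skip ahead to the first opening &lt;td&gt;, attributes or not
    $parser-&gt;get_tag("td") or return;
    # collect the text up to the matching closing tag
    return $parser-&gt;get_trimmed_text("/td");
}

my $latest = first_cell_text($content);
print defined $latest
    ? "Latest email: \"$latest\"\n"
    : "No table cell found\n";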
</pre></div></div><div id="footer">© <a href="mailto:michael@mathews.net">michael mathews</a></div></div></body></html>