
3.5 WEBGRAB: Extract Links from a Page

Sometimes it is necessary to extract links from web pages. Browsers do it, web robots do it, and sometimes even humans do it. Since we have a tool like GETURL at hand, we can solve this problem with some help from the Bourne shell:

# webgrab.awk --- print a GETURL command line for every HTTP link found
BEGIN { RS = "http://[#%&\\+\\-\\./0-9\\:;\\?A-Z_a-z\\~]*" }
# RT holds the text that matched RS, i.e., the link itself.
RT != "" {
   command = ("gawk -v Proxy=MyProxy -f geturl.awk " RT \
               " > doc" NR ".html")
   print command
}
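For a page containing two links, the printed commands look something like the following (the URLs here are only placeholders):

gawk -v Proxy=MyProxy -f geturl.awk http://www.gnu.org/ > doc1.html
gawk -v Proxy=MyProxy -f geturl.awk http://www.suse.de/ > doc2.html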

Notice that the regular expression for URLs is rather crude. A precise regular expression is much more complex, but this one works rather well. One problem is that it is unable to find internal links of an HTML document. Another problem is that links using ‘ftp’, ‘telnet’, ‘news’, ‘mailto’, and other schemes are missing from the regular expression. However, it is straightforward to add them if doing so is necessary for other tasks.
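Adding more schemes is just a matter of extending the separator. The following sketch (not part of webgrab.awk, and still far from a complete URL syntax) shows one way to do it; the ‘@’ is included so that ‘mailto’ addresses can match:

# Illustrative only: also accept ftp, telnet, news, and mailto links.
BEGIN {
  RS = "(http|ftp|telnet|news|mailto):[#%&\\+\\-\\./0-9\\:;\\?A-Z_a-z\\~@]*"
}
RT != "" { print RT }    # print each link that the separator matched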

This program reads an HTML file and prints all the HTTP links that it finds. It relies on gawk's ability to use regular expressions as record separators. With RS set to a regular expression that matches links, the second action is executed each time a non-empty link is found. We can find the matching link itself in RT.
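If this use of RS and RT is unfamiliar, a simplified one-liner makes the mechanism visible; the sample text and the stripped-down separator below are only for illustration:

echo 'see http://www.gnu.org/ and http://www.suse.de/ for more' |
gawk 'BEGIN { RS = "http://[A-Za-z0-9./_~-]*" }
      RT != "" { print NR, RT }'

Each time the separator regexp matches, one record ends and RT contains the matched link, so this prints the two URLs together with their record numbers.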

The action could use the system function to let another GETURL retrieve the page, but here we use a different approach. This simple program prints shell commands that can be piped into sh for execution. This way it is possible to first extract the links, wrap shell commands around them, and pipe all the shell commands into a file. After editing the file, execution of the file retrieves exactly those files that we really need. If we do not want to edit the file, we can retrieve all the pages like this:

gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk | sh

After this, you will find the contents of all referenced documents in files named ‘doc*.html’, even if they do not contain HTML code. The most annoying thing is that we always have to pass the proxy to GETURL. If you would rather not see the headers of the web pages appear on the screen, you can redirect them to ‘/dev/null’. Watching the headers appear can be quite interesting, because it reveals details such as which web server the companies use. Now it is clear how the clever marketing people use web robots to determine the market shares of Microsoft and Netscape in the web server market.
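As a sketch of the edit-first workflow described above (the file name ‘urls.sh’ is only an illustration), the generated commands can be collected in a file, trimmed by hand, and then executed; redirecting standard error discards the header lines, assuming GETURL prints the header to standard error as the version shown earlier does:

gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk > urls.sh
# ... edit urls.sh, keeping only the commands for the pages we really want ...
sh urls.sh 2> /dev/null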

Port 80 of any web server is like a small hole in a repellent firewall. After attaching a browser to port 80, we usually catch a glimpse of the bright side of the server (its home page). With a tool like GETURL at hand, we are able to discover some of the more concealed or even “indecent” services (i.e., lacking conformity to standards of quality). It can be exciting to see the fancy CGI scripts that lie there, revealing the inner workings of the server, ready to be called.

Caution: Although this may sound funny or simply irrelevant, we are talking about severe security holes. Try to explore your own system this way and make sure that none of the above reveals too much information about your system.

