ref: a39ae178fcc4387fe63efea4d15b34d52f4a13de
dir: /man/1/webgrab/
.TH WEBGRAB 1
.SH NAME
webgrab \- fetch web page content as files
.SH SYNOPSIS
.B webgrab
[
.B -r
] [
.B -v
] [
.BI -o " stem"
] [
.BI -p " body"
]
.I url
.SH DESCRIPTION
.I Webgrab
connects to the web server named in the
.IR url .
It fetches the content of the web page also determined by the
.IR url ,
and stores it locally in a file.
If the page is written in HTML,
.I webgrab
reads it to build a list of sub-component pages (eg, frames) and images.
It fetches those, saving the content in separate files.
It adds a comment to the end of each HTML file giving the time and the file's origin.
It automatically follows redirections offered by the server.
.PP
The
.I stem
of the names of the output files is normally derived from a component of the
.IR url .
If the
.I url
contains a path name, the
.I stem
is the final component of that path, less any dot-separated suffix and prefix.
For example, given
.IP
.BR http://www.vitanuova.com/inferno/old.index.html
.PP
the stem would be
.BR index .
If there is no path name, but the
.I url
contains a domain name, the
.I stem
is the penultimate component of the domain name (eg, excluding trailing
.BR .com ,
and initial
.BR www ,
etc).
For example, given
.IP
.B www.innerhost.vitanuova.com
.PP
the stem would be
.BR vitanuova .
If all else fails,
.I webgrab
uses the
.I stem
.BR webgrab .
.PP
Given a
.IR stem ,
the initial page is stored in
.IB stem . suffix
where
.I suffix
is the suffix (eg,
.BR .html )
of the name of the original page.
Subordinate pages are saved in a similar way in files named
.IB stem _1. suffix1,
.IB stem _2. suffix2,
\&... .
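.PP
The stem rules above can be sketched as follows.
This is an illustrative Python model only, not the Limbo code in
.BR /appl/cmd/webgrab.b ;
the function name
.B derive_stem
is hypothetical, and taking the second-to-last dot-separated piece of a file name as the stem is an assumption drawn from the single example given.

```python
from urllib.parse import urlsplit


def derive_stem(url: str) -> str:
    """Guess an output-file stem for url (hypothetical helper, see assumptions)."""
    if "://" not in url:
        url = "http://" + url  # assumption: a bare host name means HTTP
    parts = urlsplit(url)

    # Rule 1: final path component, less dot-separated suffix and prefix.
    name = parts.path.rstrip("/").rsplit("/", 1)[-1]
    if name:
        pieces = [p for p in name.split(".") if p]
        if len(pieces) >= 2:
            return pieces[-2]  # drop the suffix and any leading prefix
        if pieces:
            return pieces[0]   # name has no suffix at all

    # Rule 2: penultimate component of the domain name.
    labels = [lab for lab in (parts.hostname or "").split(".") if lab]
    if len(labels) >= 2:
        return labels[-2]

    # Rule 3: nothing usable, fall back to the fixed stem.
    return "webgrab"
```

.PP
Under these assumptions the two examples in the text give
.B index
and
.B vitanuova
respectively; the real program may treat names with several dots, or unusual host names, differently.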
.PP
The options are:
.TP
.B -r
do not fetch subcomponents (just the `raw' source of
.I url
itself)
.TP
.B -v
print a progress report
.TP
.B -vv
print a chatty progress report
.TP
.BI -o " stem"
use the
.I stem
as given
.TP
.BI -p " body"
use HTTP
.B POST
instead of
.BR GET ,
posting
.I body
as the data
.PP
.I Webgrab
reads the configuration file
.B /services/webget/config
(if it exists), to look for the address of an optional HTTP proxy
(in the
.L httpproxy
entry), and a list of domains for which a proxy should not be used
(in the
.B noproxy
or
.B noproxydoms
entry).
If symbolic network and service names might be involved, the connection server
.B lib/cs
needs to be already running.
.SH FILES
.B /services/webget/config
.SH SOURCE
.B /appl/cmd/webgrab.b
.SH BUGS
It should read the proxy name from the
.IR charon (1)
configuration file and not the
.I webget
configuration file.
.br
It cannot do `secure' transfers
.RB ( https ).
.br
Its HTML parsing is naive, but on the other hand, it is less likely to trip over HTML novelties.
.SH "SEE ALSO"
.IR cs (8)