Wget, HTTPS & ignoring robots.txt

8 Aug 2008

As you are probably aware, Wget has many uses. One of them is an easy way to mirror parts of a site, or even entire sites.

Lately I have been working on a project that required an exact copy of part of a site, including all dependent media files, to be mirrored from another server. Working server side, it is much easier to retrieve the files via Wget directly onto the server than to download them to the local machine and then upload them to the server in question. Wget simply saves me all that extra work.

The man page of Wget is extensive and includes all kinds of options for quickly getting around even the quirkiest tasks.

With this particular task, however, I ran into trouble. I did not get all the files I requested, and different combinations of options and parameters did not accomplish what I was after. After several failed attempts I even considered copying the files over one by one manually…!

As the pages in question were served over HTTPS, I started to suspect that Wget had issues with secure connections. After enabling the debug switch (‘-d’) as suggested, I realised that Wget was actually encountering a robots.txt file that excluded all the directories and files I was trying to retrieve. Wget follows the Robot Exclusion Standard, but this can easily be circumvented by using the ‘-e’ switch:
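A robots.txt that triggers this behaviour looks something like the following (a made-up example with hypothetical paths, similar in shape to what the ‘-d’ output revealed); the Disallow lines list the paths a compliant client such as Wget will refuse to fetch:

```shell
# Hypothetical robots.txt of the kind that blocks a mirror attempt
cat > /tmp/robots.txt <<'EOF'
User-agent: *
Disallow: /media/
Disallow: /docs/
EOF

# Show the paths a robots.txt-respecting client would skip
grep '^Disallow:' /tmp/robots.txt | awk '{print $2}'
```

With a file like this in place, every request under /media/ and /docs/ is silently skipped, which is exactly why the mirror came back incomplete.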

wget -e robots=off url
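For a full mirror including dependent media files, the switch combines with Wget's standard mirroring options; a sketch (the URL is a placeholder, and the extra switches are general Wget options, not something specific to this fix):

```shell
# Mirror a site section while ignoring robots.txt:
#   --mirror           recursion with timestamping, suitable for mirroring
#   --page-requisites  also fetch images, CSS and other embedded media
#   --convert-links    rewrite links so the local copy browses offline
#   --no-parent        do not ascend above the starting directory
wget -e robots=off \
     --mirror \
     --page-requisites \
     --convert-links \
     --no-parent \
     https://example.org/section/
```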

If you wish to turn off robot exclusion permanently, set the robots variable to ‘off’ in your profile’s ‘.wgetrc’ file.
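That is, a one-line entry in the configuration file (shown here as a fragment; the rest of the file is untouched):

```
# ~/.wgetrc — make every wget run ignore the Robot Exclusion Standard
robots = off
```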
