Wget has many uses, as you are probably aware. One of them is providing an easy way to mirror parts of a site, or even an entire site.
Lately I have been working on a project that required an exact copy of another site, including all dependent media files, to be mirrored. Working server side, it is much easier to retrieve the files with Wget directly on the server than to download them to the local machine and then upload them to the server in question. Wget saves me all of that extra work.
The Wget man page is extensive and includes all kinds of options for quickly handling even the quirkiest tasks.
With this particular task, however, I ran into trouble. I did not get all the files I requested, and different combinations of options and parameters did not accomplish what I was after. After several failed attempts I even considered copying the files over one by one manually…!
As the pages in question were served over HTTPS, I started to suspect that Wget was having issues with secure connections. After enabling the debug switch (‘-d’) as suggested, I realised that Wget was actually encountering a robots.txt file that excluded all of the directories and files I was trying to retrieve. Wget follows the Robot Exclusion Standard, but this can easily be circumvented with the ‘-e’ switch:
wget -e robots=off url
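For a full mirror including the dependent media files, something along these lines should work; the URL below is just a placeholder, and you may want to adjust the flags to your needs:

wget -e robots=off --mirror --page-requisites --convert-links --adjust-extension https://example.com/section/

Here ‘--mirror’ turns on recursion and timestamping, ‘--page-requisites’ grabs images, stylesheets and other embedded files, ‘--convert-links’ rewrites the links for local viewing, and ‘--adjust-extension’ saves files with suitable extensions.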
If you wish to turn off robot exclusion permanently, set the ‘robots’ variable to ‘off’ in your profile’s ‘.wgetrc’ file.
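For example, a ‘.wgetrc’ in your home directory containing the lines below would disable the robots check for all of your future Wget runs:

# ~/.wgetrc – ignore the Robot Exclusion Standard
robots = off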