  • Wget, HTTPS & ignoring robots.txt

    Wget has many uses, as you are probably aware. One of them is providing an easy way to mirror parts of a site, or even entire sites.

    I have recently been working on a project that required an exact copy, including all dependent media files, to be mirrored from another site. Working server side, it is much easier to retrieve the files with Wget directly onto the server than to download them to a local machine and then upload them to the server in question. Wget simply saves me all that extra work.

    The Wget man page is extensive and covers all kinds of options for quickly getting around even the quirkiest tasks.

    This particular task gave me trouble, though. I did not get all the files I requested, and different combinations of options and parameters did not accomplish what I was after. After several failed attempts I even considered copying the files over one by one manually…!

    As the pages in question were served over HTTPS, I started suspecting that Wget had issues with secure connections. After enabling the debug switch ('-d') as suggested, I realised that Wget was in fact encountering a robots.txt file that excluded all the directories and files I was trying to retrieve. Wget honours the Robots Exclusion Standard, but this can easily be circumvented with the '-e' switch:

    wget -e robots=off url
    

    If you wish to turn off the robot exclusion permanently, set the robots variable to ‘off‘ in your profile’s ‘.wgetrc‘ file.
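    For example, a minimal '.wgetrc' would contain just:

    robots = off

    And for reference, a typical mirroring invocation looks roughly like the sketch below; the URL is a placeholder and the exact combination of switches depends on the site, so this is not necessarily the command I ran:

    wget --mirror --page-requisites --convert-links --no-parent -e robots=off https://example.com/some/path/

    Here '--mirror' turns on recursion and timestamping, '--page-requisites' pulls in the dependent media files (images, stylesheets and so on), '--convert-links' rewrites links for local viewing, and '--no-parent' keeps the download within the given path.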

  • Apache: DocumentRoot does not exist. Why SELinux?

    Once more, SELinux has been interfering with the normal operation of a box. During the installation and setup of an Apache instance and a few virtual hosts, I simply could not get past the dreaded error message:

    Starting httpd: Warning: DocumentRoot [/home/www/myhost] does not exist
    

    No matter which permissions and owners I gave the directories and files involved, the error still came up and kept the Apache httpd service from starting. The path was obviously correct; I had copied and pasted it to rule out any spelling mistakes.

    Having experienced similar conundrums in the past, I had a slight suspicion that SELinux, which comes enabled by default on Fedora, might be blocking access to the directory somehow.
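    A quick way to confirm such a suspicion, sketched here assuming the audit daemon logs to its usual location:

    # is SELinux currently enforcing?
    getenforce

    # any denials involving httpd?
    grep denied /var/log/audit/audit.log | grep httpd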

    A bit of searching confirmed that SELinux did indeed intervene at this level, blocking Apache's normal operation. I fully understand and agree with the goals of SELinux, but it is simply too big a compromise between security and usability. As Theodore Ts'o pretty much summarises it:

    SELINUX is so horrible to use, that after wasting a large amount of time enabling it and then watching all of my applications die a horrible death since they didn’t have the appropriate hand-crafted security policy, caused me to swear off of it. For me, given my threat model and how much my time is worth, life is too short for SELinux.

    SELinux stays disabled again…
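    For reference, the two ways out look roughly like this; a sketch, assuming a Fedora/RHEL-style setup and the DocumentRoot from the error above:

    # the "proper" fix: label the directory as web content
    # (note: chcon does not survive a full filesystem relabel)
    chcon -R -t httpd_sys_content_t /home/www/myhost

    # my fix: stop enforcing right away...
    setenforce 0
    # ...and keep it off after a reboot by setting this in /etc/selinux/config
    SELINUX=disabled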

  • Thunderbird + Gmail = invalid username and password

    I have been using Gmail accounts inside Thunderbird for some time, and every now and then I received those annoying error messages stating that the username or password was invalid. This was something of a conundrum, as the passwords were saved in the Thunderbird profile and had not changed.

    Looking up the issue today, I came across the problem, the trigger, and the solution all at once. I realised that Gmail does not let you check for mailbox changes more often than every 10 minutes. Thunderbird was set to check for new mail every 6 minutes, which seems to have triggered the account lockout. Only once the CAPTCHA had been passed was the email account available again via IMAP.
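    The fix was simply to raise the polling interval to 10 minutes or more in the account's server settings. In prefs.js terms it boils down to something like the line below; 'server1' is only an example, as the actual server number varies from profile to profile:

    // check this account for new mail no more often than every 10 minutes
    user_pref("mail.server.server1.check_time", 10);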

  • Error: Could not open the local file – FTP madness!

    Moving some automatic backups from one location to another recently became a bit of a struggle. The backups are created automatically and uploaded to a file server on a schedule. I needed some of the backups from this file server and tried to download them to my local machine. This is where the problems started.

    I am a long-time user of the excellent FTP client FileZilla and very seldom experience any issues with it. But this time, during the download, I kept receiving this weird error message:

    Error:    Could not open the local file path/filename
    Error:    Download failed

    I initially suspected a permission problem, such as missing read access, until I realised that the issue was actually on the local side and not on the server. That made little sense: why would it want to open the file when it actually had to create it? I tried downloading the backups to different locations on my hard disks, even to external drives, in case they had somehow become write protected. That did not help either, so I started looking at the directory structure, since downloading the individual files worked just fine.

    So to summarise:

    1. Downloading the directory containing the backups did not work.
    2. Downloading each individual file inside the directory worked as expected.

    Very strange indeed…

    I did not actually crack it until the following morning, when I woke up with the solution. I am not sure what happened, as I would not dream about such a petty issue, but I just woke up and: Eureka!

    In hindsight the problem was really obvious, but at the time it simply did not occur to me.

    Linux and Windows differ dramatically in some respects, and one of them is that file and directory names under Windows are limited to certain characters, while Linux accepts pretty much anything. The directories I was trying to download to a Windows XP PC had a colon (:) in their names. That is not allowed on Windows, and FileZilla was therefore not permitted to create such directories anywhere on the hard disks.

    Never use the following characters in file or directory names if you expect Windows compatibility:

    / \ : * ? " < > |
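
    If the files live on a Linux server and are headed for a Windows machine, the easiest workaround is to rename them before the transfer. A minimal sketch, assuming the offending directories sit in the current folder and that swapping each colon for a dash is acceptable:

    # replace every colon in directory names with a dash
    for d in *:*; do
        [ -e "$d" ] || continue   # nothing matched the pattern
        mv -- "$d" "${d//:/-}"
    done
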
  • Excel sort by length of string

    I received a question today on how to sort a list of strings by length, which was actually rather simple to accomplish.

    The following list shows the original order:

    • caca
    • cocacola
    • ca
    • cacacacacola
    • c
    • cacacacacacacaac
    • cacaa

    After sorting, the list should look like this:

    • c
    • ca
    • caca
    • cacaa
    • cocacola
    • cacacacacola
    • cacacacacacacaac

    What I did first was create a separate column to hold the length of each cell. Excel has a function that easily calculates the length of a string: LEN(). I then had two columns: the first with the unsorted strings and the second with the length of each string in the first. As you have probably realised by now, it was then simply a matter of sorting on the second column instead of the first to sort the strings by length.
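    For illustration, the helper column looks something like this, assuming the strings start in cell A1:

    A                   B
    caca                =LEN(A1)   → 4
    cocacola            =LEN(A2)   → 8
    ca                  =LEN(A3)   → 2

    Sorting both columns on column B in ascending order then produces the requested list.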

    If you are interested in seeing it, I have attached the test sheet for you to play around with:

    Length String Sorting