

This forum is provided to promote discussion amongst students enrolled in Open Source Tools and Scripting.

Please consider offering answers and suggestions to help other students! And if you fix a problem by following a suggestion here, it would be great if other interested students could see a short "Great, fixed it!"  followup message.


 UWA week 16 (1st semester, non-teaching week) ↓

1:11pm Mon 18th Apr, Peter M.

"Ryan Bunney" [email protected] wrote:

"Peter Millitz" [email protected] wrote:

FYI - Not sure if this is important, but after I downloaded the text files from the Assignment 1 link onto my Windows PC, I noticed that all the files contained superfluous white space characters, as in the extract below, including an unrecognisable character string at the very beginning as well as the usual DOS-style line termination character.

==============================================================================================
The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll^M
^M
This eBook is for the use of anyone anywhere in the United States and^M
most other parts of the world at no cost and with almost no restrictions^M
whatsoever. You may copy it, give it away or re-use it under the terms^M
of the Project Gutenberg License included with this eBook or online at^M
If you are not located in the United States, you^M
will have to check the laws of the country where you are located before^M
using this eBook.^M
^M
Title: Alice’s Adventures in Wonderland^M
^M
Author: Lewis Carroll^M
^M
Release Date: January, 1991 [eBook #11]^M
[Most recently updated: October 12, 2020]^M

So far the only side-effect I've seen is that when I create a list of words from any of the text files, the very first line is blank.

Hi Peter,

This is good for students to know! Thank you for notifying others of the situation.

As an FYI, this is not a (your) Windows issue: it will be true for any of the operating systems students will be running. The reason is that the Gutenberg plain text files all include the U+FEFF BOM (byte order mark) at the start of their UTF-8 encoded files (suggesting the Gutenberg people are in the habit of using Windows).

The existence of a BOM can be confirmed by the file command; it will return something along the lines of:

UTF-8 Unicode (with BOM) text, with CRLF line terminators
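
If the wording of the file output differs on your system, the same check can be made by looking at the first three bytes directly, since a UTF-8 BOM is always the byte sequence EF BB BF. A minimal sketch follows; has_bom is a made-up helper name for illustration, not something from the assignment:

```shell
#!/bin/sh
# has_bom is a hypothetical helper name (an assumption, not part of the assignment).
# It succeeds (exit status 0) when the file starts with the bytes EF BB BF.
has_bom() {
    # Dump only the first three bytes as hex, then strip the spaces and newline
    # that od adds, leaving a string such as "efbbbf".
    first3=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
    [ "$first3" = "efbbbf" ]
}
```

It could be used as `has_bom alice.txt && echo "BOM present"`, where alice.txt stands in for whichever text file you downloaded.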

You should be able to remove that with a regex.

Now, as there is no marking key provided that tells you to explicitly do data cleaning, nor have we discussed different file encoding schemes in which the BOM would be mentioned, I see no issue in providing you a link to a useful discussion on Stack Exchange that should help with this issue.

The reason I think it is important to use sed rather than just tail +2 in this example is that it is always possible your solution will be tested on files without that BOM. Now, this would likely be unfair, because there is nothing in the 'marking key' that says we are testing you on your knowledge of encoding schemes, but it is better to be safe than sorry in these assignments.
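
To illustrate the difference, here is a rough sketch of the sed approach. It is not taken from the linked discussion; clean_text is a made-up helper name, and the script assumes the BOM, if present, is the three bytes EF BB BF at the start of line 1:

```shell
#!/bin/sh
# clean_text is a hypothetical helper name (an assumption for illustration).
# It writes a cleaned copy of the file to standard output and leaves the
# input file itself untouched.
clean_text() {
    bom=$(printf '\357\273\277')   # the three UTF-8 bytes of U+FEFF, in octal
    cr=$(printf '\r')              # carriage return left over from CRLF endings
    # Line 1 only: delete a leading BOM if there is one.
    # Every line: delete a trailing carriage return if there is one.
    # LC_ALL=C makes sed treat the input as plain bytes.
    LC_ALL=C sed -e "1s/^$bom//" -e "s/$cr\$//" "$1"
}
```

Because both substitutions only fire when their pattern is actually there, a file without a BOM or without CRLF endings passes through unchanged. Skipping the first line with tail, by contrast, discards that line whether or not it was the problem.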

Hope this helps!

Ryan (lab guy).

Thanks for the reply Ryan. I feel more comfortable having 'cleaned' all my text files (removed BOM and ^Ms).

I'm not entirely clear on the meaning of the last paragraph of your reply, however. Do you mean that it is better to permanently edit the text files rather than just interrogate them with tail (to avoid the BOM), or are you alluding to the use of those tools in the Assignment in general?

The reason I ask is that at the last lecture, Michael suggested that for the assignment we might utilise grep and/or sed -e. I certainly see the usefulness of grep, but I'm struggling to find a reason to use sed, mainly because I'm under the impression that sed is used to change something in a file and so far, having attempted both part 1 and part 2 of the Assignment (only the -w option for the latter), I can't find a reason to have to edit the input text file.
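
For anyone following along with the same question: unless it is asked to edit in place (the -i option in GNU and BSD sed), sed only reads its input and writes the transformed text to standard output, so it can sit in a pipeline much as grep does. A small sketch, where first_words is a made-up helper name and has nothing to do with what the assignment actually asks for:

```shell
#!/bin/sh
# first_words is a hypothetical helper name (an assumption for illustration).
# It prints the first word of each line. The file is only read, never edited.
first_words() {
    # First expression: drop any leading whitespace.
    # Second expression: drop everything from the first remaining whitespace on.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]].*$//' "$1"
}
```

After running it, the input file is byte-for-byte the same as before; only the text on standard output has been changed.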



The University of Western Australia

Computer Science and Software Engineering

CRICOS Code: 00126G