This forum is provided to promote discussion amongst students enrolled in Open Source Tools and Scripting.

Please consider offering answers and suggestions to help other students! And if you fix a problem by following a suggestion here, it would be great if other interested students could see a short "Great, fixed it!"  followup message.



UWA week 16 (1st semester, non-teaching week)

10:55am Mon 18th Apr, Ryan B.
Edited: shortly thereafter

"Peter Millitz" wrote:

FYI - Not sure if this is important, but after I downloaded the text files from the Assignment 1 link onto my Windows PC, I noticed all the files contained superfluous white space characters as in the extract below, including the unrecognisable character string combination at the very beginning as well as the usual DOS style line termination character.

==============================================================================================
The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll^M
^M
This eBook is for the use of anyone anywhere in the United States and^M
most other parts of the world at no cost and with almost no restrictions^M
whatsoever. You may copy it, give it away or re-use it under the terms^M
of the Project Gutenberg License included with this eBook or online at^M
www.gutenberg.org. If you are not located in the United States, you^M
will have to check the laws of the country where you are located before^M
using this eBook.^M
^M
Title: Alice’s Adventures in Wonderland^M
^M
Author: Lewis Carroll^M
^M
Release Date: January, 1991 [eBook #11]^M
[Most recently updated: October 12, 2020]^M

So far the only side-effect I've seen is that when I create a list of words from any of the text files, the very first line is blank.

Hi Peter,

This is good for students to know! Thank you for notifying others of the situation.

As an FYI, this is not a (your) Windows issue - it will be true for any of the operating systems students will be running. The reason is that the Gutenberg plain text files all include the U+FEFF BOM (byte order mark) at the start of their UTF-8 encoded files (suggesting the Gutenberg people are in the habit of using Windows).

The existence of a BOM can be confirmed by the file command; it will return something along the lines of:

UTF-8 Unicode (with BOM) text, with CRLF line terminators
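
If file is not installed, or you would like to see the bytes themselves, the same check can be sketched with od. This is only an illustration: the sample file is made up, and the one fact it relies on is that a UTF-8 BOM is the three bytes EF BB BF.

```shell
#!/bin/sh
# Build a small made-up sample that starts with a UTF-8 BOM and uses CRLF
# line endings, similar in shape to the Gutenberg downloads.
sample=$(mktemp)
printf '\357\273\277Alice\r\nwas\r\n' > "$sample"

# Dump the first three bytes as hex; a UTF-8 BOM shows up as "ef bb bf".
first_bytes=$(od -An -tx1 -N3 "$sample" | tr -d ' \n')

if [ "$first_bytes" = "efbbbf" ]; then
    echo "BOM present"
else
    echo "no BOM"
fi

# If the file utility is available, it reports the same thing in words
# (the exact wording varies between versions).
if command -v file >/dev/null 2>&1; then
    file "$sample"
fi

rm -f "$sample"
```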

You should be able to remove that with a regex.

Now, as there is no marking key provided that tells you to explicitly do data cleaning, nor have we discussed different file encoding schemes in which the BOM would be mentioned, I see no issue in providing you a link to a useful discussion on Stack Exchange that should help with this issue.

The reason I think it important to use sed rather than just tail +2 in this example is that it is always possible your solution will be tested on files without that BOM. Now, this would likely be unfair because there is nothing in the 'marking key' that says we are testing you on your knowledge of encoding schemes, but it is better to be safe than sorry in these assignments.
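
As a rough sketch of the sed idea (not necessarily what the Stack Exchange discussion recommends), the BOM can be deleted only when it is actually there, so the same command is safe on files with and without it. The function name strip_bom_cr and the sample files are made up for illustration. The BOM and carriage return are built with printf so that no sed-specific escape syntax is assumed.

```shell
#!/bin/sh
# Hypothetical helper: strip a leading UTF-8 BOM (bytes EF BB BF) from line 1
# and a trailing carriage return from every line, writing to standard output.
strip_bom_cr() {
    bom=$(printf '\357\273\277')   # the three BOM bytes
    cr=$(printf '\r')              # DOS-style carriage return (shown as ^M)
    LC_ALL=C sed -e "1s/^${bom}//" -e "s/${cr}\$//" "$1"
}

# Made-up sample files: one with BOM + CRLF endings, one already clean.
tmp=$(mktemp -d)
printf '\357\273\277Alice\r\nwas\r\n' > "$tmp/with_bom.txt"
printf 'Alice\nwas\n' > "$tmp/without_bom.txt"

strip_bom_cr "$tmp/with_bom.txt" > "$tmp/with_bom.clean"
strip_bom_cr "$tmp/without_bom.txt" > "$tmp/without_bom.clean"

# After cleaning, the two files should be byte-for-byte identical.
cmp "$tmp/with_bom.clean" "$tmp/without_bom.clean" && echo "identical after cleaning"

rm -rf "$tmp"
```

Because the substitution on line 1 simply does nothing when there is no BOM, a clean file passes through unchanged, which is the property that skipping the first line does not have.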

Hope this helps!

Ryan (lab guy).


1:11pm Mon 18th Apr, Peter M.

Thanks for the reply Ryan. I feel more comfortable having 'cleaned' all my text files (removed BOM and ^Ms).

I'm not entirely clear on the meaning of the last paragraph of your reply, however. Do you mean that it is better to permanently edit the text files rather than just interrogate them with tail (to avoid the BOM), or are you alluding to the use of those tools in the Assignment in general?

The reason I ask is that at the last lecture, Michael suggested that for the assignment we might utilise grep and/or sed -e. I certainly see the usefulness of grep, but I'm struggling to find a reason to use sed, mainly because I'm under the impression that sed is used to change something in a file and so far, having attempted both part 1 and part 2 of the Assignment (only the -w option for the latter), I can't find a reason to have to edit the input text file.

Cheers,

Peter


2:24pm Mon 18th Apr, Ryan B.

Hi Peter,

As we've discussed in the labs, there are multiple ways to 'skin the Unix cat', so it is entirely possible you have completed the assignment conditions without sed.

The reason that I advocate (in this particular instance) for the use of sed over tail can be tested locally at your end:

  • Currently, the presence of the U+FEFF BOM at the beginning of the file leads to your method producing a list of words with the first line being empty (likely the result of the BOM).
  • If you were to be given a file without the BOM (or you use the sed technique above to remove it if it exists, or you create your own test file), does this still produce an empty line at the top of the list of words?
  • If it does not produce that empty line, then the use of tail +2 (à la the labs with the CSV headers) is going to affect your results, as you will be skipping a word rather than an empty line.
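
The experiment in the bullets above can be sketched as follows. The words function is an assumption about how a word list might be built (tr turning every run of non-letters into a single newline), not the method actually used in this thread. The inputs are made up, and tail -n +2 is the portable spelling of tail +2.

```shell
#!/bin/sh
# Hypothetical word-list step: turn every run of non-letters into one newline.
# This is an assumption for illustration, not the approach from the thread.
words() {
    LC_ALL=C tr -cs 'A-Za-z' '\n'
}

# Made-up inputs: the same text with and without a leading UTF-8 BOM.
# With the BOM, the list starts with an empty line; without it, with "Alice".
with_bom=$(printf '\357\273\277Alice was\n' | words)
without_bom=$(printf 'Alice was\n' | words)

# Skipping the first line is only right when that line is the empty one.
printf '%s\n' "$with_bom" | tail -n +2      # prints: Alice, was
printf '%s\n' "$without_bom" | tail -n +2   # prints: was (Alice is lost)

# One alternative that behaves the same on both inputs: drop empty lines.
printf '%s\n' "$with_bom" | sed '/^$/d'
printf '%s\n' "$without_bom" | sed '/^$/d'
```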

I should note that when I mention the use of sed, I am not suggesting permanently editing the files in-place - rather, as is the case in various pre-processing scenarios, I can create an intermediary file (maybe $FILENAME.edit), do my operations on that, and then remove that edited file once the analysis has been done.
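
A minimal sketch of that intermediary-file workflow is below. The file content and the word count standing in for "the analysis" are made up for illustration. The point is that sed writes to standard output unless told otherwise, so the original file is never modified.

```shell
#!/bin/sh
# Stand-in for a downloaded text file (made-up content: BOM + CRLF endings).
FILENAME=$(mktemp)
printf '\357\273\277Alice was\r\nbeginning\r\n' > "$FILENAME"

bom=$(printf '\357\273\277')   # UTF-8 BOM bytes
cr=$(printf '\r')              # carriage return

# sed prints the edited text to standard output; redirecting it creates the
# intermediary copy and leaves "$FILENAME" itself untouched.
LC_ALL=C sed -e "1s/^${bom}//" -e "s/${cr}\$//" "$FILENAME" > "$FILENAME.edit"

# Do the analysis on the cleaned copy (here, simply counting the words).
word_count=$(LC_ALL=C tr -cs 'A-Za-z' '\n' < "$FILENAME.edit" | wc -l)
echo "words: $word_count"

# Remove the intermediary file once the analysis is done.
rm -f "$FILENAME.edit"
```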

I hope that clarifies things.

Warm Regards,

Ryan.


5:51pm Mon 18th Apr, Peter M.

Hi Ryan,

When I remove the BOM, my method no longer produces the empty line. I now see your point about using tail.

Thanks,

Peter.

The University of Western Australia

Computer Science and Software Engineering

CRICOS Code: 00126G
Last modified  1:17AM Sep 14 2022