

This forum is provided to promote discussion amongst students enrolled in Open Source Tools and Scripting.

Please consider offering answers and suggestions to help other students! And if you fix a problem by following a suggestion here, it would be great if other interested students could see a short "Great, fixed it!"  followup message.


10:42pm Wed 11th May, Michael W.

"Adam Holt" wrote:

Hi Michael.

I, and it seems many others, are confused about the word definitions, non-alphabetics and punctuation. Some help and clarification would be much appreciated. :)

Hi Adam. What I'll do is answer each part next to the question, rather than at the end, so there's no need to flip back and forth, okay?

Reasons for Confusion

The specification requires that the double hyphen -- is preprocessed, i.e., removed by transforming to a space. The effect is that world--four (from sample1) becomes world four and therefore counts as 2 words.

Double hyphen receives special mention because you genuinely need to deal with the hyphen, so you can't simply get rid of it as you do, for example, with ":" (colon), #, $ and every digit. Because of this ambiguity I needed to say something about it, and what I've said is that a double hyphen should be replaced with a space (blank).
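As a minimal sketch of that preprocessing step (the function name replace_double_hyphen is hypothetical and not taken from the project specification), the double hyphen can be turned into a space before any other processing, which leaves single hyphens in place:

```python
def replace_double_hyphen(text: str) -> str:
    """Replace each double hyphen with one space; single hyphens are left alone."""
    return text.replace("--", " ")


# The double hyphen splits the text into two whitespace-separated tokens.
print(replace_double_hyphen("world--four").split())    # ['world', 'four']

# A single hyphen is not touched, so the text stays as one token.
print(replace_double_hyphen("sweet-scented").split())  # ['sweet-scented']
```

Doing this replacement first matters: once the double hyphens are gone, any hyphen still in the text is a single one and can be handled by the word definitions.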

The question then is: Is the double hyphen -- the only non-alphabetic that needs to be removed from the input text?

The reasoning behind the question is that there are many other non-alphabetic punctuation characters which would have similar effects on the word count if they were also removed from the input text.

Bottom line: remove everything that is not one of the things we want.

For example, the second line of HuckleberyFinn is:

by Mark Twain (Samuel Clemens)

If the parentheses () are not removed, then this line has 3 words (or is it 4?), i.e., the definition of a simple word: a simple word is a sequence of one or more alphabetic characters followed by a space, or any punctuation other than hyphen or apostrophe, implies that (Samuel is not a word but Clemens) is a word - I'm confused. On the other hand, if the parentheses are removed then this line unambiguously has 5 words.

Parentheses featured in Assignment 1, but are not relevant here, so like all the other extraneous characters they are ignored. So, in this example, there are 5 words.
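One way to picture "remove everything that is not one of the things we want" is a keep-list filter. The sketch below rests on assumptions that are not confirmed by the specification: that the wanted characters are ASCII letters, the hyphen and the apostrophe, and that each unwanted character is turned into a space rather than deleted. The names UNWANTED, clean_line and count_tokens are made up for illustration.

```python
import re

# Assumption: keep ASCII letters, hyphen, apostrophe and whitespace;
# every other character (digits, brackets, colons, #, $, ...) is unwanted.
UNWANTED = re.compile(r"[^A-Za-z'\-\s]")


def clean_line(line: str) -> str:
    """Replace double hyphens and all unwanted characters with spaces."""
    line = line.replace("--", " ")   # double hyphen first, as discussed above
    return UNWANTED.sub(" ", line)   # assumption: substitute a space, do not delete


def count_tokens(line: str) -> int:
    """Count the whitespace-separated tokens left after cleaning."""
    return len(clean_line(line).split())


print(clean_line("by Mark Twain (Samuel Clemens)").split())
# ['by', 'Mark', 'Twain', 'Samuel', 'Clemens']
print(count_tokens("by Mark Twain (Samuel Clemens)"))  # 5
```

This only counts what is left after cleaning. It does not apply the word definitions themselves, so a hyphen or apostrophe standing on its own would still be counted as a token and would need separate handling.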

Another example of this ambiguity is this line:

Release Date: August 20, 2006 [EBook #76

There are 4 words here: Release, Date, August and EBook (digits are ignored).

How many words are counted on this line? Date: is a word, but [EBook is not?

And yet more pathological cases like these lines:

Dey's two gals flyin' 'bout you in yo' life
_Ain't_ you a sweet-scented dandy, though?
A bed; and bedclothes; and a look'n'-glass

Dey's, flyin' and yo' are words but 'bout is not? _Ain't_ is not a word but if underscores are removed then it is a word? What about look'n'-glass?

According to the definitions I set up, "Dey's" would be a possessive. (Yes, I know it's actually a contraction, but to get that you need to be a human or do significant computer-based analysis, which is well beyond this unit.)

Questions & Clarifications

  • So, precisely what non-alphabetic characters need to be removed from the input texts?
  • And exactly what non-alphabetic punctuation characters are allowed in a word and in what positions of the word are they allowed to be in?

As I mentioned above, anything we don't explicitly want we ignore.

Cheers MichaelW

Cheers, Adam.

P.S. Apologies if I have misunderstood anything. Additionally, I think it would be very helpful to have some examples of words and non-words that are representative of common and pathological cases that are likely to be encountered in the corpora (maybe such examples can be added to project specification). In the end, if our programmed word definitions are incorrect then our profile will also be incorrect and will not pass the automated tests. :(

The University of Western Australia

Computer Science and Software Engineering
