This forum is provided to promote discussion amongst students enrolled in Open Source Tools and Scripting.

9:44pm Wed 11th May, Adam H.

Hi Michael.

I, and it seems many others, are confused about the word definitions, non-alphabetics and punctuation. Some help and clarification would be much appreciated. :)

Reasons for Confusion


The specification requires that the double hyphen -- be preprocessed, i.e., replaced with a space. The effect is that world--four (from sample1) becomes world four and therefore counts as 2 words.
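To make the preprocessing step concrete, here is a minimal Python sketch of how I read it. The function name preprocess is my own and does not come from the specification, and splitting on whitespace afterwards is only an assumption about how words are then separated.

```python
def preprocess(text: str) -> str:
    """Replace every double hyphen with a single space (my reading of the spec)."""
    return text.replace("--", " ")


# Hypothetical usage on the fragment from sample1.
tokens = preprocess("world--four").split()
print(tokens)       # ['world', 'four']
print(len(tokens))  # 2
```

A single hyphen is left untouched by this sketch, so a token such as sweet-scented stays in one piece.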

The question then is: Is the double hyphen -- the only non-alphabetic sequence that needs to be removed from the input text?

The reasoning behind the question is that there are many other non-alphabetic punctuation characters which would have similar effects on the word count if they were also removed from the input text.

For example, the second line of HuckleberryFinn is:

by Mark Twain (Samuel Clemens)

If the parentheses () are not removed, then this line has 3 words (or is it 4?). The definition of a simple word is: "a simple word is a sequence of one or more alphabetic characters followed by a space, or any punctuation other than hyphen or apostrophe". Read literally, this implies that (Samuel is not a word but Clemens) is a word. I'm confused. On the other hand, if the parentheses are removed, then this line unambiguously has 5 words.
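The two readings can be compared with a short Python sketch. The regular expression below is only one literal interpretation of the simple-word definition (letters, optionally followed by punctuation other than hyphen or apostrophe), applied to whitespace-separated tokens. The names SIMPLE_WORD, count_literal and count_stripped are hypothetical and not part of the project.

```python
import re

# Assumed literal reading: one or more letters, then zero or more
# characters that are not letters, digits, hyphens or apostrophes.
SIMPLE_WORD = re.compile(r"[A-Za-z]+[^A-Za-z0-9'\-]*")


def count_literal(line: str) -> int:
    """Count whitespace-separated tokens that match the assumed pattern."""
    return sum(1 for tok in line.split() if SIMPLE_WORD.fullmatch(tok))


def count_stripped(line: str, remove: str = "()") -> int:
    """Replace the characters in `remove` with spaces, then count as above."""
    cleaned = line.translate(str.maketrans(remove, " " * len(remove)))
    return count_literal(cleaned)


line = "by Mark Twain (Samuel Clemens)"
print(count_literal(line))   # 4: "(Samuel" is rejected, "Clemens)" is accepted
print(count_stripped(line))  # 5
```

Under this particular reading the count without stripping is 4, but a stricter reading could give 3, which is exactly the ambiguity.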

Another example of this ambiguity is this line:

Release Date: August 20, 2006 [EBook #76]

How many words are counted on this line? Date: is a word, but [EBook is not?

And yet more pathological cases like these lines:

Dey's two gals flyin' 'bout you in yo' life

_Ain't_ you a sweet-scented dandy, though?

A bed; and bedclothes; and a look'n'-glass

Dey's, flyin' and yo' are words but 'bout is not? _Ain't_ is not a word but if underscores are removed then it is a word? What about look'n'-glass?
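The guesses above correspond to one possible rule, sketched here in Python: a word must start with a letter and may then contain letters, hyphens and apostrophes. This rule is an assumption of mine, not something the specification states, and the names CANDIDATE and is_word are hypothetical.

```python
import re

# Assumed rule (not from the spec): first character is a letter,
# the rest are letters, hyphens or apostrophes.
CANDIDATE = re.compile(r"[A-Za-z][A-Za-z'-]*")


def is_word(token: str, strip_underscores: bool = False) -> bool:
    """Test one whitespace-separated token against the assumed rule."""
    if strip_underscores:
        token = token.replace("_", "")
    # Ignore ordinary trailing punctuation such as the "?" in "though?".
    token = token.rstrip(",;:.?!")
    return CANDIDATE.fullmatch(token) is not None


for tok in ["Dey's", "flyin'", "'bout", "yo'", "_Ain't_", "look'n'-glass"]:
    print(tok, is_word(tok), is_word(tok, strip_underscores=True))
```

With this rule 'bout is rejected because of its leading apostrophe, _Ain't_ is accepted only once the underscores are stripped, and look'n'-glass is accepted as a single word. Whether any of that is intended is the open question.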

Questions & Clarifications


  • So, precisely which non-alphabetic characters need to be removed from the input texts?
  • And exactly which non-alphabetic punctuation characters are allowed in a word, and in which positions of the word are they allowed?

Cheers, Adam.

P.S. Apologies if I have misunderstood anything. Additionally, I think it would be very helpful to have some examples of words and non-words that are representative of the common and pathological cases likely to be encountered in the corpora (maybe such examples can be added to the project specification). In the end, if our programmed word definitions are incorrect, then our profile will also be incorrect and will not pass the automated tests. :(

The University of Western Australia

Computer Science and Software Engineering
