10:42pm Wed 11th May, Michael W.

"Adam Holt" [email protected] wrote:

Hi Michael.

I, and it seems many others, are confused about the word definitions, non-alphabetics and punctuation. Some help and clarification would be much appreciated. :)

Hi Adam. What I'll do is answer each part next to the question, rather than at the end, so you don't need to flip back and forth, okay?

Reasons for Confusion

The specification requires that the double hyphen -- is preprocessed, i.e., removed by transforming to a space. The effect is that world--four (from sample1) becomes world four and therefore counts as 2 words.

Double hyphen receives special mention because you genuinely need to deal with the hyphen, so you can't simply get rid of it, as you do, for example, with ":" (colon), #, $ and every digit. Because of this ambiguity I needed to say something about it, and what I've said is that a double hyphen should be replaced with a space (blank).
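A minimal sketch of that preprocessing step, written in Python purely for illustration (the unit's own language may differ, and the function name is made up here):

```python
def replace_double_hyphen(text: str) -> str:
    """Turn every double hyphen into one space; single hyphens are left alone."""
    return text.replace("--", " ")


line = "world--four sweet-scented"
cleaned = replace_double_hyphen(line)
print(cleaned)               # world four sweet-scented
print(len(cleaned.split()))  # 3
```

Note that sweet-scented keeps its single hyphen, which is exactly why the double hyphen has to be handled on its own before anything else is stripped.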

The question then is: Is the double hyphen -- the only non-alphabetic that needs to be removed from the input text?

The reasoning behind the question is that there are many other non-alphabetic punctuation characters which would have similar effects on the word count if they were also removed from the input text.

Bottom line. Remove everything that is not one of the things we want.

For example, the second line of HuckleberyFinn is:

by Mark Twain (Samuel Clemens)

If the parentheses () are not removed, then this line has 3 words (or is it 4?), i.e., the definition of a simple word: a simple word is a sequence of one or more alphabetic characters followed by a space, or any punctuation other than hyphen or apostrophe, implies that (Samuel is not a word but Clemens) is a word - I'm confused. On the other hand, if the parentheses are removed then this line unambiguously has 5 words.

Parentheses featured in Assignment 1, but are not relevant here, so like all the other extraneous characters they are ignored. So, in this example, there are 5 words.

Another example of this ambiguity is this line:

Release Date: August 20, 2006 [EBook #76

There are 4 words here (digits are ignored).
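To make the two counts above concrete, here is a rough Python sketch of the "keep only what we want" idea. It rests on assumptions that are not spelled out in this thread: the wanted characters are taken to be letters, hyphen, apostrophe and whitespace; every other character is swapped for a space rather than deleted; and a token only counts if it contains at least one letter. The names keep_wanted and count_words are invented for the example.

```python
def keep_wanted(text: str) -> str:
    """Replace every unwanted character with a space.

    Assumption: 'wanted' means letters, hyphen, apostrophe and whitespace.
    """
    text = text.replace("--", " ")  # double hyphen becomes a space first
    out = []
    for ch in text:
        if ch.isalpha() or ch in "-'" or ch.isspace():
            out.append(ch)
        else:
            out.append(" ")  # digits, (), [], :, #, $ and so on
    return "".join(out)


def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain at least one letter."""
    return sum(
        1 for tok in keep_wanted(text).split()
        if any(c.isalpha() for c in tok)
    )


print(count_words("by Mark Twain (Samuel Clemens)"))            # 5
print(count_words("Release Date: August 20, 2006 [EBook #76"))  # 4
```

Deleting the unwanted characters instead of replacing them with a space gives the same counts for these two lines, but the two choices differ when punctuation sits between two words with no space around it.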

How many words are counted on this line? Date: is a word, but [EBook is not?

And yet more pathological cases like these lines:

Dey's two gals flyin' 'bout you in yo' life

_Ain't_ you a sweet-scented dandy, though?

A bed; and bedclothes; and a look'n'-glass

Dey's, flyin' and yo' are words but 'bout is not? _Ain't_ is not a word but if underscores are removed then it is a word? What about look'n'-glass?

According to the definitions I set up, "Dey's" would be a possessive. (Yes, I know it's actually a contraction, but to get that you need to be a human or do significant computer-based analysis, which is well beyond this unit.)
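For the pathological lines, it can help to look at what is left once the unwanted characters are gone. The Python sketch below only lists the surviving tokens; it does not decide which of them are simple words, possessives or anything else, since that depends on the definitions in the specification. The pattern (ASCII letters, hyphen, apostrophe) and the function name tokens are assumptions made for this example.

```python
import re

# Assumption: a token is a run of ASCII letters, hyphens and apostrophes.
TOKEN = re.compile(r"[A-Za-z'-]+")


def tokens(line: str) -> list[str]:
    """Return the tokens left after ignoring everything we don't want."""
    line = line.replace("--", " ")  # double hyphen is treated as a space
    # Drop runs made only of hyphens or apostrophes (no letters at all).
    return [t for t in TOKEN.findall(line) if any(c.isalpha() for c in t)]


for line in (
    "Dey's two gals flyin' 'bout you in yo' life",
    "_Ain't_ you a sweet-scented dandy, though?",
    "A bed; and bedclothes; and a look'n'-glass",
):
    print(tokens(line))
```

With the underscores ignored, _Ain't_ comes out as Ain't, and look'n'-glass survives as a single token; whether each token then counts as a word is a separate question for the word definitions.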

Questions & Clarifications

  • So, precisely what non-alphabetic characters need to be removed from the input texts?
  • And exactly what non-alphabetic punctuation characters are allowed in a word and in what positions of the word are they allowed to be in?

As I mentioned above, anything we don't explicitly want we ignore.

Cheers MichaelW

Cheers, Adam.

P.S. Apologies if I have misunderstood anything. Additionally, I think it would be very helpful to have some examples of words and non-words that are representative of common and pathological cases that are likely to be encountered in the corpora (maybe such examples can be added to project specification). In the end, if our programmed word definitions are incorrect then our profile will also be incorrect and will not pass the automated tests. :(

