This forum is provided to promote discussion amongst students enrolled in Open Source Tools and Scripting.

Please consider offering answers and suggestions to help other students! And if you fix a problem by following a suggestion here, it would be great if other interested students could see a short "Great, fixed it!"  followup message.


 UWA week 17 (1st semester, week 8) ↓
2:41pm Thu 28th Apr, ANONYMOUS

Hi Michael, Can you please give some sort of final clarification on the data cleaning and the Python script? You've mentioned on this forum that we should do data cleaning, including for Viet Nam and maybe the United Arab Emirates, but you've not mentioned this in the assignment spec. I use grep to find the country, so if I enter Emirates/emirates it still comes back with the correct answer. So the only real issue I see is Viet Nam? Can you please tell us if you're expecting us to hard-code this or perform some other sort of data cleaning? Also, in terms of the Python script you wrote, I don't get its purpose. Is it just there if we want to use it, or are you expecting us to use it? I would really appreciate some clarification, as I'm sure we all would. Thanks!


10:20pm Thu 28th Apr, Michael W.

Most of the issues will disappear if you use the Python program I created. However, Viet Nam/Vietnam will require special handling, as both are widely used names of that country. Cheers MichaelW


8:51am Fri 29th Apr, ANONYMOUS

Hi Michael, I am a little bit concerned about the edge cases. For example, before this conversation I didn't realise Viet Nam and Vietnam are both widely used, so I wouldn't even have known that it needs to be specifically handled. And of course there are easy cases like USA vs United States. But I am a bit worried about missing an edge case just because I wasn't aware of it. Would you be able to suggest how I could deal with this situation?


9:38am Fri 29th Apr, Michael W.

Hi, I will only use the existing text, not contractions, as using those would be unfair. I also said to please ignore Sudan, as it's genuinely ambiguous, and assume that anything in brackets can be ignored. After using the helper program to deal with capitals (i.e. title-case), the only edge case I can see is Vietnam/Viet Nam, as both are used as the name of the country. Fair enough?
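The cleaning rules described above (ignore anything in brackets, title-case the name, treat Viet Nam and Vietnam as the same country) could be sketched roughly as follows. This is only an illustration, not the helper program mentioned in the thread; the function name `normalise_country` and the `ALIASES` table are hypothetical.

```python
import re

# Hypothetical alias table: holds only the one pair named in this thread.
ALIASES = {"Viet Nam": "Vietnam"}


def normalise_country(name: str) -> str:
    """Return a cleaned-up country name (illustrative sketch only)."""
    # Drop anything in round brackets, e.g. "Iran (Islamic Republic of)" -> "Iran".
    name = re.sub(r"\s*\([^)]*\)", "", name)
    # Collapse runs of whitespace, then title-case every word.
    name = " ".join(name.split()).title()
    # Map known alternative spellings onto one representative form.
    return ALIASES.get(name, name)


print(normalise_country("viet nam"))
print(normalise_country("Iran (Islamic Republic of)"))
```

Note that `str.title()` capitalises every word, so "Republic of Korea" becomes "Republic Of Korea". That is harmless as long as the same function is applied both to the names in the CSV and to the user's input, so that the two sides are compared in the same form.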


9:43am Fri 29th Apr, ANONYMOUS

This is a really good point, as it applies to every country in the list. Simple examples include "Iran" vs "Iran (Islamic Republic of)" and "Korea" vs "South Korea" vs "Republic of Korea". But truthfully, how an individual expresses a country's name is often cultural, or completely arbitrary. Handling each individual edge case is not good design. There are techniques for dealing with this (like how Google implements its search algorithm), but they are far beyond the scope of this course. Would a more realistic approach be to expect the user to input the country in the format used in the CSV (except, perhaps, for case sensitivity)?
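The approach suggested above, accepting the country exactly as it appears in the CSV apart from letter case, might look something like this. It is a minimal sketch under stated assumptions: the column names `Country` and `Value`, the sample rows, and the function `find_country` are all made up for illustration and do not come from the assignment data.

```python
import csv
import io


def find_country(query, rows, column="Country"):
    """Return the rows whose country name equals query, ignoring case."""
    key = query.strip().casefold()
    return [row for row in rows if row[column].strip().casefold() == key]


# Tiny in-memory stand-in for the CSV file (made-up data and column names).
SAMPLE = "Country,Value\nUnited Arab Emirates,1\nViet Nam,2\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

print(find_country("united arab emirates", rows))
print(find_country("Vietnam", rows))  # no match: the spelling differs from the CSV
```

The second lookup finds nothing, which is why an alternative spelling such as Vietnam/Viet Nam still needs separate handling under this approach.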


9:47am Fri 29th Apr, ANONYMOUS

Sorry Michael, I didn't see your reply when I wrote my response below. So we can assume that the user will express a country in the same language/format that will be in the CSV? (I'm assuming we're not being asked to implement an algorithm that handles linguistic semiotics :))


9:53am Fri 29th Apr, Michael W.

Hi, First off, please bear in mind that everything in brackets can be ignored (and Sudan, given the ambiguity about which Sudan we're talking about). The truth is that this is intended as a Shell programming exercise, not real data-science. The teaching point, however, is the recognition that data-science analyses typically involve a great deal of data cleaning/data normalisation, so things that really are identical are identically represented. Cheers MichaelW


10:02am Fri 29th Apr, ANONYMOUS

That's a very good point, Michael, and thank you for clarifying. A good number of students in this course will be coming from a software engineering point of view, where inputs and usage are expected to be thoroughly specified and deterministic. For them, it is difficult to know how all edge cases should be handled in this particular exercise without a comprehensive list of unit tests. But I understand that common sense should prevail.

The University of Western Australia

Computer Science and Software Engineering

Last modified  1:17AM Sep 14 2022