helpOSTS

This forum is provided to promote discussion amongst students enrolled in Open Source Tools and Scripting.

Please consider offering answers and suggestions to help other students! And if you fix a problem by following a suggestion here, it would be great if other interested students could see a short "Great, fixed it!"  followup message.


 UWA week 20 (1st semester, week 11) ↓
Hi Sir,

Good afternoon. I have some questions about the assignment 2 requirements.

In the data cleaning part, do we have to return an error message if the data set does not meet the "clean" condition? For example:

1. Based on the header (i.e. top) line, make sure that the file is a tab-separated format file.
   If not: "The input file is not matching with tab-separated format file"? Or do we have to convert it to a TSV file and continue?

2. Also based on the header line, report any lines that do not have the same number of cells. (Cells are allowed to be empty.)
   If the line of the data doesn't match the condition: return that specific line with an error message.
   If the file has the same number of cells: continue.

3. Remove the column with the header Continent, which is sparsely populated and is not present in one of the files.
   If the file contains Continent, remove it? I am not clear on the requirements.

4. Ignore the rows that do not represent countries (the country code field is empty).

5. Ignore the rows for years outside those for which we have at least some Cantril data as that is what we will be using. In practice, this means only include years from 2011 to 2021, inclusive.

The output file sent to stdout should have rows with the data in the following order (tab separated):

Thanks for your time for helping me.


Hi, I'll respond to each bit below it.

ANONYMOUS wrote:
> Hi Sir
>
> Good afternoon. I have some questions about the assignment 2 requirements
>
> In the data cleaning part, do we have to return an error message if the data set does not meet the "clean" condition?

As always, that depends on whether the problem is local/fixable or not fixable.

> For example:
> 1. Based on the header (i.e. top) line, make sure that the file is a tab-separated format file
>
> if not: "The input file is not matching with tab-separated format file"? Or do we have to convert it to a TSV file and continue?

A .csv file instead of the expected .tsv (or a file using some other separator) is not readily fixable (if done properly), so printing an error message and exiting is reasonable.

> 2. Also based on the header line, report any lines that do not have the same number of cells. (Cells are allowed to be empty.)
>
> If the line of the data doesn't match the condition: return that specific line with an error message.

Yes, report the line on stderr and continue processing (but don't print the line to stdout).

> If the file has the same number of cells: continue

Yes.

> 3. Remove the column with the header Continent, which is sparsely populated and is not present in one of the files.
> If the file contains Continent, remove it? I am not clear on the requirements

Yes, remove the column called Continent, if it is present in the file.

> 4. Ignore the rows that do not represent countries (the country code field is empty)

Yes.

> 5. Ignore the rows for years outside those for which we have at least some Cantril data as that is what we will be using. In practice, this means only include years from 2011 to 2021, inclusive.

Yes.

> The output file sent to stdout should have rows with the data in the following order (tab separated):
>
> Thanks for your time for helping me.
Cheers MichaelW 👨‍🎨
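
For anyone who finds it easier to read the five answers above as code, here is a minimal Python sketch of that cleaning logic. It is not the assignment's reference solution, and the assignment may expect a different tool. Only the column name Continent and the 2011 to 2021 range come from the thread; the header names "Code" and "Year", the function name clean_lines, and the choice to keep the header row in the output are assumptions made for illustration.

```python
import sys

# Hypothetical header names: only "Continent" is named in the thread.
CODE_COL = "Code"   # assumed name of the country-code column
YEAR_COL = "Year"   # assumed name of the year column


def clean_lines(lines, err=sys.stderr):
    """Return the cleaned header and rows (lists of cells) from TSV lines."""
    lines = iter(lines)
    header_line = next(lines, None)
    if header_line is None or "\t" not in header_line:
        # A wrong separator is not readily fixable: report and give up.
        raise ValueError("input does not look like a tab-separated file")

    header = header_line.rstrip("\r\n").split("\t")
    width = len(header)
    # Index of the Continent column, or None if this file does not have one.
    drop = header.index("Continent") if "Continent" in header else None
    code_i = header.index(CODE_COL)   # raises ValueError if the column is absent
    year_i = header.index(YEAR_COL)

    def strip_col(cells):
        # Copy the row without the Continent cell (no-op when drop is None).
        return [c for i, c in enumerate(cells) if i != drop]

    out = [strip_col(header)]
    for n, line in enumerate(lines, start=2):
        text = line.rstrip("\r\n")
        cells = text.split("\t")
        if len(cells) != width:
            # Local problem: report on stderr, skip the line, keep going.
            print(f"line {n}: expected {width} cells, got {len(cells)}: {text}",
                  file=err)
            continue
        if cells[code_i] == "":
            continue  # not a country
        year = cells[year_i]
        if not (year.isdigit() and 2011 <= int(year) <= 2021):
            continue  # outside the years with Cantril data
        out.append(strip_col(cells))
    return out
```

A caller would pass an open file (or sys.stdin) and write each returned row to stdout joined by tabs. The order of the output columns is not covered here, because the thread does not show it.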


11:48am Tue 14th May, Kaichao Z.

Hi Prof. Wise,

I have a further question about this requirement:

> Also based on the header line, report any lines that do not have the same number of cells. (Cells are allowed to be empty.)

Which file should this check be based on: the raw file or the merged file?

Best regards,
Kaichao Zheng


"Kaichao Zheng" wrote:
> Hi Prof. Wise, I have a further question about this requirement:
>
> Also based on the header line, report any lines that do not have the same number of cells. (Cells are allowed to be empty.)
>
> Which file should this check be based on: the raw file or the merged file?
>
> Best regards, Kaichao Zheng

Hi Kaichao,

The check for the correct number of cells per row should be applied to each of the 3 files submitted to the data cleaning program. After cleaning, all the lines in a given file have the same number of cells. Joining two files whose lines each have a consistent number of cells (though perhaps a different number for each file) will result in joined rows that also all have the same number of cells. No?

Cheers MichaelW 👨‍🎨
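
The point about joining can be checked with a small Python sketch. The helper join_rows and the toy rows below are made up for illustration and are not part of the assignment: if every row on the left has m cells and every row on the right has n cells, each joined row has m + n - 1 cells, because the shared key is kept only once.

```python
def join_rows(left, right, key=0):
    """Inner-join two lists of rows on the cell at index `key` (hypothetical helper)."""
    # Group the right-hand rows by their key cell for quick lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left:
        for match in index.get(row[key], []):
            # Append the matching row's cells, leaving out its copy of the key.
            joined.append(row + match[:key] + match[key + 1:])
    return joined


# Toy rows, invented for this example only.
a = [["AUS", "2015", "7.3"], ["TCD", "2021", "4.4"]]   # 3 cells per row
b = [["AUS", "Oceania"], ["TCD", "Africa"]]            # 2 cells per row

joined = join_rows(a, b)
widths = {len(r) for r in joined}   # set of distinct row lengths after the join
print(widths)
```

Here every joined row has 3 + 2 - 1 = 4 cells, so a per-file check before the join is enough.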

The University of Western Australia

Computer Science and Software Engineering

Last modified  8:08AM Aug 25 2024