Hi,
I'll respond to each point inline below.
ANONYMOUS wrote:
> Hi Sir
>
> Good afternoon. I have some questions about the assignment 2 requirements
>
> In the data cleaning part, do we have to return an error message if the data set does not meet the "clean" condition?
As always, that depends on whether the problem is local and fixable, or not fixable.
>
> For example:
> 1. Based on the header (i.e. top) line, make sure that the file is a tab-separated format file
>
> if not: “ The input file is not matching with tab-separated format file”? Or do we have to convert it to a TSV file and continue?
A .csv file (or a file using some other separator) in place of the expected .tsv is not readily fixable, at least not if done properly, so printing an error message and exiting is reasonable.
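To illustrate, here is a minimal sketch of that header check, assuming Python and assuming that "no tab character in the header line" is an adequate test for "not tab-separated". The function name check_tab_separated and the wording of the message are made up for this example, not part of the assignment spec.

```python
import sys


def check_tab_separated(header_line):
    """Return the header cells, or exit with an error if the header has no tabs."""
    header = header_line.rstrip("\n")
    # Assumption: a header without any tab character means the file is not TSV.
    if "\t" not in header:
        print("Error: input does not look like a tab-separated file", file=sys.stderr)
        sys.exit(1)
    return header.split("\t")
```

The error goes to stderr and the script exits with a non-zero status, so nothing half-processed ends up on stdout.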
>
> 2. Also based on the header line, report any lines that do not have the same number of cells. (Cells are allowed to be empty.)
>
> If the line of the data doesn’t match the condition —> return that specific line with an error message.
Yes, report the line on stderr and continue processing, but don't print that line to stdout.
> If the file has the same number of cells —> continue
Yes
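A rough sketch of the cell-count check, again assuming Python. The name rows_with_expected_width and the message format are hypothetical; the only behaviour taken from the requirements is "report on stderr, skip the line, keep going".

```python
import sys


def rows_with_expected_width(lines, expected, err=sys.stderr):
    """Yield rows whose cell count matches; report the others on err and skip them."""
    # Data lines start at line 2 because the header is line 1.
    for number, line in enumerate(lines, start=2):
        text = line.rstrip("\n")
        cells = text.split("\t")  # empty cells are kept, so they still count
        if len(cells) != expected:
            print(f"Line {number} has {len(cells)} cells, expected {expected}: {text}", file=err)
            continue
        yield cells
```

Here expected would be the number of cells in the header line. Because split keeps empty strings, a line with empty cells still has the right count.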
>
> 3. Remove the column with the header Continent, which is sparsely populated and is not present in one of the files.
> If the file contains Continent, remove it? I am not clear on the requirements
Yes, remove the column called Continent if it is present in the file.
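One possible way to do this, sketched in Python under the assumption that the header and rows are already split into lists of cells. The helper name drop_column is invented for the example.

```python
def drop_column(header, rows, name="Continent"):
    """Return (header, rows) without the named column; unchanged if it is absent."""
    if name not in header:
        return header, rows
    index = header.index(name)
    # Remove the cell at the same position from the header and from every row.
    new_header = header[:index] + header[index + 1:]
    new_rows = [row[:index] + row[index + 1:] for row in rows]
    return new_header, new_rows
```

Looking the column up by name rather than by a fixed position means the same code works for the file that lacks the column.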
>
> 4. Ignore the rows that do not represent countries (the country code field is empty)
Yes
> 5. Ignore the rows for years outside those for which we have at least some Cantril data as that is what we will be using. In practice, this means only include years from 2011 to 2021, inclusive.
Yes.
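Points 4 and 5 together amount to a simple row filter. A minimal sketch in Python follows; keep_row is a hypothetical name, and how you pull the country code and year out of each row depends on your column layout. Treating a non-numeric year as out of range is my assumption, not a stated requirement.

```python
def keep_row(code, year, first_year=2011, last_year=2021):
    """True for rows with a non-empty country code and a year within the range."""
    if not code.strip():
        return False  # not a country: the country code field is empty
    try:
        return first_year <= int(year) <= last_year  # inclusive at both ends
    except ValueError:
        return False  # assumption: a non-numeric year is treated as out of range
```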
> The output file sent to stdout should have rows with the data in the following order (tab separated):
>
> Thanks for your time for helping me.
Cheers
MichaelW