Faculty of Engineering and Mathematical Sciences 


This forum is provided to promote discussion amongst students enrolled in Data Warehousing (CITS3401).

There are 98 articles from this person

98 of 259 articles shown, currently no other people reading this forum.

UWA week 23 - 1st semester, study break

photo Re: Super Patterns (all 4)
Sun 7th Jun, 2:55pm, Zeyi W.
According to the definition: "An itemset/pattern is closed if none of its immediate supersets has the same support as the itemset/pattern", you only need to use immediate super-patterns for verification.
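Checking only the immediate supersets is enough because support can never increase when items are added: if some larger superset had the same support, the one-item extension lying between it and the pattern would have it too. A minimal sketch of the check, assuming transactions are stored as Python frozensets (the toy transactions and the function names are hypothetical, not from the unit material):

```python
from itertools import chain

def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset, transactions):
    """True if no immediate superset (one extra item) has the same support."""
    base = support(itemset, transactions)
    all_items = set(chain.from_iterable(transactions))
    for extra in all_items - itemset:
        if support(itemset | {extra}, transactions) == base:
            return False
    return True

# Hypothetical toy transactions for illustration only.
transactions = [
    frozenset("ABD"),
    frozenset("ABCD"),
    frozenset("BC"),
    frozenset("AB"),
]

closed_B = is_closed(frozenset("B"), transactions)  # {B} occurs in all 4 rows
closed_A = is_closed(frozenset("A"), transactions)  # {A} and {A, B} both occur in 3 rows
```

Here {B} is closed because every one-item extension has lower support, while {A} is not closed because {A, B} has the same support.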
photo Re: Regarding extension for the project (both)
Sat 6th Jun, 7:22pm, Zeyi W.
That is okay. I will waive the late submission penalty if you submitted before the extended deadline.
photo Re: Super Patterns (all 4)
Sat 6th Jun, 7:21pm, Zeyi W.
Simple answer is yes. If {A, B, D} and {A, B, C, D} are frequent patterns (i.e. support >= min_sup), they are super-patterns of {B}. People in the community are often more interested in the "immediate super-patterns" (e.g. {B,C}, {B,D} and {A, B}), because...
photo Re: working out for equations in exam (both)
Wed 3rd Jun, 6:41pm, Zeyi W.
The marks are given for both the final answer and also the process of obtaining the answer.
photo Re: Finals (both)
Mon 1st Jun, 8:30pm, Zeyi W.
You don't need to write code to answer the questions.
photo Re: Calculating info gain (all 3)
Mon 1st Jun, 8:27pm, Zeyi W.
Sounds good. I am glad that you persisted with the problem and figured out the solution on your own, which is a crucial skill.

UWA week 22 - 1st semester, week 12

photo Re: PCA using discrete values (both)
Sat 30th May, 2:02pm, Zeyi W.
You can remove the categorical ones, but you need to explain why you want to do that in the report.
photo Re: Data Reduction - compare performance (both)
Fri 29th May, 5:22pm, Zeyi W.
You can compare them based on accuracy, F1, etc., or compare them on data set size, etc. Data reduction usually leads to lower accuracy, but sometimes may lead to higher accuracy. You can use J48 or other models of your choice.
photo Re: Tree view (all 4)
Fri 29th May, 11:31am, Zeyi W.
You can say in the report that the tree is trained with minNumObj=100, and interpret the tree (or analyse something you want).
photo Re: Can we choose whether we want to use confidence or lift? (all 4)
Fri 29th May, 11:29am, Zeyi W.
Not the minsupport, but the support for the itemsets (which generate the rules). You may need to do some searches to find out. You should have only income bracket on the right-hand side.
photo Re: Data Reduction: Sampling and feature reduction (both)
Thu 28th May, 7:49pm, Zeyi W.
The first one. You should perform sampling+feature reduction on the data set.
photo Re: Tree view (all 4)
Thu 28th May, 7:47pm, Zeyi W.
Yes. You can also try REPTree instead of J48, where you can set "max_depth", if REPTree can meet your need.
photo Re: Data Reduction (all 4)
Thu 28th May, 3:10pm, Zeyi W.
Yes. You may treat binary attributes as numeric ones as well. It is up to you... You need to explain in the report.
photo Re: Training model for data reduction (both)
Thu 28th May, 2:58pm, Zeyi W.
You can choose any model you like. Both numerosity and feature reduction should be performed on the same dataset.
photo Re: Can we choose whether we want to use confidence or lift? (all 4)
Thu 28th May, 2:57pm, Zeyi W.
You can use confidence instead of lift. For the second problem, you can (1) set a very small confidence threshold and a very large number of rules (e.g. 1000), and then search for the rules you like; (2) construct a data set only having income>50k and...
photo Re: Comparing decision tree models (both)
Thu 28th May, 12:11pm, Zeyi W.
I would recommend comparing both if you have time. Otherwise, you can compare two tree models: one with attribute selection based on your understanding and the other with attribute selection based on IG.
photo Re: Question regarding the weighting of each assignment (both)
Thu 28th May, 12:08pm, Zeyi W.
The midsem has been reduced to 10% as per the announcement before the midsem test.
photo Re: question on using fnlwgt (both)
Thu 28th May, 12:06pm, Zeyi W.
You can remove fnlwgt if you don't know how to make use of it. However, using fnlwgt may lead to interesting findings. You may need to perform some preprocessing (e.g. normalisation) on fnlwgt before using it.
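One possible form of that preprocessing is min-max normalisation, which rescales a numeric column to the range [0, 1]. This is only a sketch: the post does not say which normalisation to use, and the sample values and function name below are hypothetical.

```python
def min_max_normalise(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical fnlwgt values for illustration only.
fnlwgt = [77516, 83311, 215646, 234721, 338409]
scaled = min_max_normalise(fnlwgt)
```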
photo Re: Can we do different pre-processing for different parts of questions? (all 3)
Thu 28th May, 12:02pm, Zeyi W.
That is okay if you are able to justify why you want to do that. Generally speaking, you should treat the whole project as if you were doing it by yourself only, rather than combining two separate projects into one.
photo Re: features selection (all 4)
Thu 28th May, 11:59am, Zeyi W.
You should have the rules with only "income" on the right-hand side. Please try rules with smaller lift/conf.
photo Re: Comparing Decision Tree Model with and without attribute selection (both)
Thu 28th May, 11:55am, Zeyi W.
Thanks for pointing out. This should be "with and without attribute selection using IG". You can compare the two trees: one with attribute selection based on your understanding, and the other with attribute selection based on IG.
photo Re: Decision Tree accuracy comparison (both)
Thu 28th May, 11:48am, Zeyi W.
Please refer to this link (https://secure.csse.uwa.edu.au/run/help3401?p=np&a=193&all=y3) for better visualisation of the trees. A larger difference is preferred, but a smaller one is also fine. You can explain that in the report.
photo Re: visualizing tree models (all 4)
Thu 28th May, 11:47am, Zeyi W.
It is okay if you are happy with the visualisation. Otherwise, you may need to limit the tree depth or set a large limit for minNumObj, which should lead to a simpler tree.
photo Re: Data Reduction (all 4)
Thu 28th May, 11:42am, Zeyi W.
On the whole data set would be more reasonable. You can treat the data reduction task as an independent one from the previous classification task.
photo Re: Classification (both)
Thu 28th May, 11:39am, Zeyi W.
Either way is okay. You need to explain in the report why you prefer one over the other.
photo Re: Do I need to use the top 5 rules? (both)
Thu 28th May, 11:39am, Zeyi W.
You can use rules with smaller confidence/lift to get a more diverse set of rules.
photo Re: features selection (all 4)
Wed 27th May, 5:10pm, Zeyi W.
1. This is normal. The data may not totally reflect our intuition. The task of attribute selection based on "your understanding" vs. that based on IG is for you to get something like this, so that you would hopefully have a deeper impression of knowledge...
photo Re: visualizing tree models (all 4)
Wed 27th May, 5:05pm, Zeyi W.
Hi Antonio, You can right-click on the tree and choose “Fit to Screen” or “Auto Scale” (cf. screenshot attached).
photo Re: lab 11 question (both)
Wed 27th May, 5:00pm, Zeyi W.
Thanks for the feedback. I missed converting the "income" attribute in the lab sheet. What you have done is correct.
photo Re: Information gain tree (both)
Wed 27th May, 4:52pm, Zeyi W.
You can use either J48 or REPTree to learn the tree. For the attribute selection, you may choose the info gain or gain ratio to rank the attributes and select the top ones.
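To make the ranking step concrete, the following is a small sketch of how information gain can be computed and used to order attributes. In the project this is done inside Weka; the rows, attribute values and function names here are hypothetical and only illustrate the calculation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    labels = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy rows for illustration only.
rows = [
    {"education": "Bachelors", "sex": "Male", "income": ">50K"},
    {"education": "Bachelors", "sex": "Female", "income": ">50K"},
    {"education": "HS-grad", "sex": "Male", "income": "<=50K"},
    {"education": "HS-grad", "sex": "Female", "income": "<=50K"},
]

# Rank candidate attributes from most to least informative.
ranked = sorted(
    ["education", "sex"],
    key=lambda a: info_gain(rows, a, "income"),
    reverse=True,
)
```

In this toy data "education" separates the two income classes perfectly, so it ranks above "sex", which gives no information about the class.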

UWA week 21 - 1st semester, week 11

photo Re: Weka: association rules that predict a specified attribute (all 6)
Sat 23rd May, 11:02pm, Zeyi W.  O.P.
Some students told me that when they are mining the rules for "income > 50k", the confidence is always 100%, because the subset contains only "income > 50k". It is possible to manually calculate the confidence for the rules (with income>50k) you mine (e.g....
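The effect described above can be reproduced with a few lines of code: the confidence of a rule X => Y is the number of rows matching both X and Y divided by the number of rows matching X, so it has to be counted on the full data set rather than on the ">50k only" subset. A minimal sketch, assuming rows are dictionaries (the rows, values and function name are hypothetical):

```python
def confidence(rows, antecedent, consequent):
    """confidence(X => Y) = count(X and Y) / count(X), counted over `rows`."""
    def matches(row, conditions):
        return all(row.get(k) == v for k, v in conditions.items())

    n_x = sum(1 for r in rows if matches(r, antecedent))
    n_xy = sum(
        1 for r in rows if matches(r, antecedent) and matches(r, consequent)
    )
    return n_xy / n_x if n_x else 0.0

# Hypothetical toy rows for illustration only.
rows = [
    {"education": "Masters", "income": ">50K"},
    {"education": "Masters", "income": ">50K"},
    {"education": "Masters", "income": "<=50K"},
    {"education": "HS-grad", "income": "<=50K"},
]

rule_lhs = {"education": "Masters"}
rule_rhs = {"income": ">50K"}

# Confidence on the full data set.
full_conf = confidence(rows, rule_lhs, rule_rhs)

# Confidence on the subset that only contains income > 50k is trivially 1.
high_only = [r for r in rows if r["income"] == ">50K"]
subset_conf = confidence(high_only, rule_lhs, rule_rhs)
```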
photo Re: Pruning decision trees (both)
Fri 22nd May, 5:21pm, Zeyi W.
You can use pruning to simplify the tree. Alternatively, you can limit the tree depth if you are using REPTree; you can set a larger minNumObj, if you are using J48. By doing this, you should be able to obtain a simpler tree.
photo Re: attribute selection (all 4)
Fri 22nd May, 5:17pm, Zeyi W.
You should use your intuition to select k features, and you remove all the other features from your data set. Then, you use this processed data set to train a decision tree. For the other model, you can use Weka to select k features, and remove all the...
photo Re: attribute selection (all 4)
Thu 21st May, 5:55pm, Zeyi W.
You can choose some attributes which you think are important (intuitively) for making a prediction. For example, "education" should be important to income. What else can you think of? You can add them to the list. You don't need to manually calculate IG...
photo Re: including Scripts used (both)
Wed 20th May, 11:56am, Zeyi W.
You can include it in the report, and use any form you like. However, readability is crucial in marking (i.e. if one cannot understand what you are doing, one cannot appreciate it), so you should try to present the script nicely if you can.
photo Re: ranking association (all 4)
Tue 19th May, 2:18pm, Zeyi W.
No problem. Thanks for the additional info. Please feel free to use confidence.
photo Re: ranking association (all 4)
Mon 18th May, 3:26pm, Zeyi W.
I am not sure what you mean by "I can't specify the attribute I want as the target; how can I specify which attribute I want as my target with lift enabled". You may refer to the thread on help3401 "Weka: association rules that predict a specified attribute"....
photo Re: Is it necessary to convert to binary for weka? (both)
Mon 18th May, 3:13pm, Zeyi W.
Hi Paul, Great question. Working directly with the categorical attributes (without conversion) is totally fine, if it meets your need. However, you may need to take extra care with numerical attributes, which WEKA may treat as categorical ones. Regarding...

UWA week 20 - 1st semester, week 10

photo Re: Alteryx for Preprocessing - Project 2 (both)
Sun 17th May, 11:21pm, Zeyi W.
photo Re: Weka: association rules that predict a specified attribute (all 6)
Sat 16th May, 8:05pm, Zeyi W.  O.P.
You can construct a file which only contains income >50k, and perform rule mining from this file.
photo Re: Opening file error (both)
Sat 16th May, 8:04pm, Zeyi W.
Would you revise the first line of the test.csv file? Please replace the first line with: "age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native country,income_bracket"...
photo Re: Dataset for Project - Part 2 (both)
Sat 16th May, 7:34pm, Zeyi W.
We should use the training data. The test set is for model evaluation. You don't have to use it, if you don't know how to use the test set for model evaluation.
photo Re: Weka: association rules that predict a specified attribute (all 6)
Wed 13th May, 1:40pm, Zeyi W.  O.P.
The method presented in this post is potentially a way to help you find the rules.
photo Weka: association rules that predict a specified attribute (all 6)
Wed 13th May, 10:36am, Zeyi W.
Rather than sifting through the association rules to find the ones with the desired Right-Hand-Side variable, Weka Association Rule Mining allows for always producing rules that predict the class attribute (i.e. target variable) by turning on the "car"...
photo Re: project mark (both)
Tue 12th May, 12:52pm, Zeyi W.
Yes. When marking, the quality of a project will be evaluated. If students work in pairs, the contribution of each team member should be clearly stated.

UWA week 19 - 1st semester, week 9

photo Re: Does location of column index matter for association rule mining? (both)
Wed 6th May, 3:50pm, Zeyi W.
No. However, the order matters. E.g. A,C => B is different from B => A,C
photo Re: Data for project 2 (both)
Wed 6th May, 12:04pm, Zeyi W.
Excellent question. I have improved the description on LMS. The test data set can be used to evaluate model quality in the whole project.

UWA week 18 - 1st semester, week 8

photo Re: python library in final project (both)
Fri 1st May, 11:03pm, Zeyi W.
photo Re: Is there any change in Labs...? (both)
Tue 28th Apr, 3:24pm, Zeyi W.
It is the same.
photo Re: Regarding MultiWay Array Aggregation (M.A.A.) (both)
Mon 27th Apr, 7:16pm, Zeyi W.
Hi Lachlan. There are two key points you need to pay extra attention to: 1) all the three 2-D cuboids (i.e. AB, BC and AC planes) are computed **simultaneously**; 2) the 3-D or base cuboid (i.e. ABC) can be read **once** only. The second point is particularly...
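The two points can be illustrated with a simplified sketch: a single pass over the cells of the base cuboid ABC updates the AB, AC and BC aggregates at the same time. The sketch deliberately leaves out the chunking and chunk-ordering that the full multiway array aggregation method uses to limit memory; the cell values and function name are hypothetical.

```python
from collections import defaultdict

def aggregate_2d_cuboids(base_cells):
    """Build the AB, AC and BC cuboids in one pass over the ABC base cells."""
    ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
    for (a, b, c), measure in base_cells:
        # Each base cell is read once and contributes to all three planes.
        ab[(a, b)] += measure
        ac[(a, c)] += measure
        bc[(b, c)] += measure
    return dict(ab), dict(ac), dict(bc)

# Hypothetical base cuboid cells: ((A value, B value, C value), measure).
base = [
    (("a0", "b0", "c0"), 1),
    (("a0", "b0", "c1"), 2),
    (("a0", "b1", "c0"), 3),
    (("a1", "b0", "c0"), 4),
]

# Passing an iterator makes the "read once" constraint explicit:
# it cannot be traversed a second time.
ab, ac, bc = aggregate_2d_cuboids(iter(base))
```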

UWA week 17 - 1st semester, week 7

photo Re: use of fnlwgt (all 4)
Fri 24th Apr, 6:40pm, Zeyi W.
If you don't know how to use final weight, you can just ignore it.
photo Re: A couple questions (both)
Fri 24th Apr, 6:38pm, Zeyi W.
1. The "Dashboard" of Power BI is the place which displays the visualised results. You can go to the menu->Insert->Image to import images to Power BI. 2. The data cube diagram is the diagram which looks like the star/snowflake schema or ER diagram. 3. What results...
photo Re: Submitting 2 sql files (both)
Fri 24th Apr, 6:23pm, Zeyi W.
photo Re: Any labs before hand in? (both)
Thu 23rd Apr, 5:38pm, Zeyi W.
There will be a lab session tomorrow morning from 10 to 12. Please try to attend if you can. You can find the Zoom information in LMS.
photo Re: Pathway error (both)
Thu 23rd Apr, 5:38pm, Zeyi W.
You may copy and paste the path to Windows file explorer, and see if you can open the folder/file. There will be a lab session tomorrow morning from 10 to 12. Please try to attend if you can.
photo Re: Concept Hierarchies (all 8)
Thu 23rd Apr, 5:32pm, Zeyi W.
photo Re: Starnet question (all 3)
Thu 23rd Apr, 5:32pm, Zeyi W.
It sounds a bit weird. To my understanding, at least you need to involve two dimensions (i.e. income and age) for the query. You may need to consider improving your design, or come up with a more complex business query which needs to use more than one...
photo Re: One visual per business question? (both)
Thu 23rd Apr, 5:28pm, Zeyi W.
One or two are okay. Please make sure that we can understand that two images are for one query though.
photo Re: Question about submission regarding project (all 3)
Thu 23rd Apr, 9:41am, Zeyi W.
Like the answer in the second post, you can submit the whole multidimensional project folder which contains the solution (.sln) file. The cube diagram is the one that looks like a star/snowflake schema. I think you are in the right place. You may use the...
photo Re: Dimensions and hierarchy (all 3)
Thu 23rd Apr, 9:35am, Zeyi W.
You don't have to have a concept hierarchy for every dimension. Some dimensions may not have concept hierarchies. You may consider using junk dimension (cf. Lecture 4).
photo Re: Submission (both)
Thu 23rd Apr, 9:30am, Zeyi W.
You can submit them in one zip file.
photo Re: Concept Hierarchies (all 8)
Thu 23rd Apr, 9:29am, Zeyi W.
You possibly need to think a bit harder. I believe more than one dimension has concept hierarchy.
photo Re: Visual Studio cube problems (all 4)
Thu 23rd Apr, 9:27am, Zeyi W.
You can do it that way. Alternatively, you can have two counts (>50k counter and <=50k counter) in the fact table, which would allow you to aggregate both on >50k and <=50k.
photo Re: Sql file submission (all 3)
Wed 22nd Apr, 5:55pm, Zeyi W.
If you create the tables manually, you can paste the diagram into your pdf file. However, you should at least submit the SQL script for populating the database. Alternatively, you can generate SQL scripts from SQL Server (even though you draw the diagrams...
photo Re: Visual Studio cube problems (all 4)
Wed 22nd Apr, 5:04pm, Zeyi W.
A boolean value is not recommended to serve as a measure in the fact table. The reason is that the boolean value is not additive. I would suggest that you use a "count" as the measure instead of true/false.
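A short sketch of why counts work as measures: they can simply be summed when rolling up along a dimension, whereas a true/false flag cannot. The sketch assumes two count measures per fact row; the column names, rows and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical measure names for the two counters.
MEASURES = ("over_50k_count", "at_most_50k_count")

def roll_up(fact_rows, dimension):
    """Sum every count measure for each value of the chosen dimension."""
    totals = defaultdict(lambda: dict.fromkeys(MEASURES, 0))
    for row in fact_rows:
        for measure in MEASURES:
            totals[row[dimension]][measure] += row[measure]
    return dict(totals)

# Hypothetical fact rows: each person contributes 1 to exactly one counter.
fact_rows = [
    {"education": "Bachelors", "over_50k_count": 1, "at_most_50k_count": 0},
    {"education": "Bachelors", "over_50k_count": 0, "at_most_50k_count": 1},
    {"education": "Bachelors", "over_50k_count": 1, "at_most_50k_count": 0},
    {"education": "HS-grad", "over_50k_count": 0, "at_most_50k_count": 1},
]

by_education = roll_up(fact_rows, "education")
```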
photo Re: Concept Hierarchies (all 8)
Wed 22nd Apr, 5:02pm, Zeyi W.
You don't have to have a concept hierarchy for every dimension. You may consider using junk dimension (cf. Lecture 4).
photo Re: Midsem answers (all 3)
Wed 22nd Apr, 12:59pm, Zeyi W.
You should be able to see your answers now. Please check it on LMS and let me know.
photo Re: Business query question (both)
Tue 21st Apr, 10:46pm, Zeyi W.
The short answer is yes, but such a query may be difficult to answer with your data warehouse. If you have enough time, feel free to answer it; otherwise, you may use simpler business queries.
photo Re: Midsem answers (all 3)
Tue 21st Apr, 5:11pm, Zeyi W.
I will ask the UWA LMS technical team tomorrow, and see if there is a way for you to find the submitted answers. If it is urgent, you can send me an email and I will send a screenshot of your answer to you.
photo Re: Description of the ETL process (both)
Mon 20th Apr, 7:25pm, Zeyi W.
In general, anything you have done in data preprocessing is part of ETL. Please refer to the end of Lecture 5. There are a few slides about ETL. You can manually preprocess the csv file, use Excel, use python scripts or whatever tools you prefer. In...

UWA week 16 - 1st semester, mid-semester break

photo Re: Boolean Measure in Fact Table (both)
Sun 19th Apr, 6:19pm, Zeyi W.
I wouldn't suggest using a boolean measure, because you may not be able to aggregate the measure. Instead, you may use "count".
photo Re: Dimension (all 3)
Sun 19th Apr, 5:15pm, Zeyi W.
Thanks Edward for the answer. One extra point I would like to add is that you don't have to have a concept hierarchy for every dimension (e.g. Junk dimension).
photo Re: Fact Tables (all 3)
Sun 19th Apr, 4:58pm, Zeyi W.
I would recommend using Python, using the approach similar to the second post. You first assign each row (of the dimension table) in the csv file an ID. Then, you read a row in your fact table, replace the attribute values in the fact row with an ID to...
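The first two steps described above (assign an ID to each distinct dimension row, then replace the attribute values in each fact row with that ID) could look roughly like the sketch below. It works on rows already loaded as dictionaries, for example via `csv.DictReader`; the column names, sample rows and function names are hypothetical.

```python
def build_dimension(rows, attributes):
    """Assign a sequential ID to each distinct combination of dimension attributes."""
    ids = {}
    for row in rows:
        key = tuple(row[a] for a in attributes)
        if key not in ids:
            ids[key] = len(ids) + 1
    return ids

def to_fact_rows(rows, attributes, ids, id_column):
    """Replace the dimension attributes in each row with the matching ID."""
    facts = []
    for row in rows:
        key = tuple(row[a] for a in attributes)
        fact = {k: v for k, v in row.items() if k not in attributes}
        fact[id_column] = ids[key]
        facts.append(fact)
    return facts

# Hypothetical source rows for illustration only.
rows = [
    {"education": "Bachelors", "education_num": "13", "hours_per_week": "40"},
    {"education": "HS-grad", "education_num": "9", "hours_per_week": "50"},
    {"education": "Bachelors", "education_num": "13", "hours_per_week": "45"},
]

education_attrs = ["education", "education_num"]
education_ids = build_dimension(rows, education_attrs)
facts = to_fact_rows(rows, education_attrs, education_ids, "education_id")
```

The `education_ids` mapping becomes the dimension table, and `facts` holds the fact rows with a foreign key in place of the original attribute values.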
photo Re: SQL script submission (both)
Sun 19th Apr, 4:47pm, Zeyi W.
It would be better to include the instructions in the submission (e.g. pdf file), which is part of the ETL process. However, we should be able to find your default path pretty quickly, so please don't worry too much about the path.
photo Re: Files for submission (both)
Sat 18th Apr, 5:07pm, Zeyi W.
Please submit the .pbix file if you can. We would like to know that you produced those figures using Power BI, a very useful tool which every student in this unit should learn to use. If you have difficulties in generating the .pbix file or you prefer other...
photo Re: ETL Process (both)
Sat 18th Apr, 5:04pm, Zeyi W.
Step 6 is part of ETL, and other data preprocessing work you have done is related to ETL. You can manually preprocess the csv file, use Excel, use python scripts or whatever tools you prefer.
photo Re: Dimension table (both)
Fri 17th Apr, 4:25pm, Zeyi W.
This is a great question. My suggestion is that you don't store those rows in your dimension table if they are not used. If you need to use those rows in the future (e.g. due to arrival of more data), you can insert the needed rows.
photo Re: Junk Dimension (both)
Fri 17th Apr, 10:15am, Zeyi W.
I would recommend using a junk dimension if you can. Just give it a name, e.g. miscellaneous, in your starnet.
photo Re: DWT Resources (all 5)
Thu 16th Apr, 4:32pm, Zeyi W.
You need the other 6 values to generate the original vector. If you have only the first two, the regenerated vector is an approximate one. You can try to do the regeneration, and you will find that the 6 values are used in the regeneration process.
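The regeneration can be tried out with a small sketch. It assumes the Haar wavelet with pairwise averages and differences, applied for two levels to an 8-value vector, so that 2 approximation values and 6 detail values remain; the thread does not state which wavelet or how many levels were used, and the sample vector and function names are hypothetical.

```python
def haar_step(values):
    """One Haar level: pairwise averages and pairwise half-differences."""
    averages = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    details = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return averages, details

def haar_decompose(values, levels):
    """Apply `levels` Haar steps; return the final averages and all details."""
    details_by_level = []
    approx = list(values)
    for _ in range(levels):
        approx, step_details = haar_step(approx)
        details_by_level.append(step_details)
    return approx, details_by_level

def haar_reconstruct(approx, details_by_level):
    """Invert the decomposition: each (average, detail) pair yields two values."""
    values = list(approx)
    for step_details in reversed(details_by_level):
        values = [x for a, d in zip(values, step_details) for x in (a + d, a - d)]
    return values

# Hypothetical 8-value vector for illustration only.
signal = [2, 2, 0, 2, 3, 5, 4, 4]
approx, details = haar_decompose(signal, levels=2)

# Using all 8 coefficients (2 averages + 6 details) recovers the vector exactly.
exact = haar_reconstruct(approx, details)

# Keeping only the 2 averages (details set to zero) gives an approximation.
zeroed = [[0.0] * len(d) for d in details]
approximate = haar_reconstruct(approx, zeroed)
```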
photo Re: Can we add attributes to file? (all 4)
Thu 16th Apr, 11:02am, Zeyi W.
Both are okay.
photo Re: DWT Resources (all 5)
Thu 16th Apr, 10:41am, Zeyi W.
Hi Edward, You can refer to this link (https://mil.ufl.edu/nechyba/www/eel6562/course_materials/t5.wavelets/intro_dwt.pdf) which provides more details of the technique but is relatively easy to understand. Please note that signal processing (which includes...
photo Re: project requirement (all 3)
Wed 15th Apr, 8:13pm, Zeyi W.
1. Either submission of diagram file or SQL script for database creation is acceptable. It is okay to create the tables of the database from diagrams. However, you need to make sure that the diagram file can be opened from other computers (e.g. ask a...
photo Re: starnet design (all 4)
Wed 15th Apr, 4:20pm, Zeyi W.
If you directly use the data, you will not gain the marks in concept hierarchy. You should try to think about constructing concept hierarchies at least for some dimensions.
photo Re: Some questions: (all 3)
Wed 15th Apr, 3:46pm, Zeyi W.
You can logically convert the snowflake schema into a star schema. Then, you should be able to map the schema to starnet.
photo Re: Visio (all 4)
Wed 15th Apr, 3:42pm, Zeyi W.
Feel free to use a pen and paper to draw if using the software tool is too hard. However, please make sure we can understand the handwriting :)
photo Re: fnlwgt meaning (all 7)
Wed 15th Apr, 3:39pm, Zeyi W.
Hi there. fnlwgt: final weight. In other words, this is the number of people the census believes the entry represents. Feel free to ignore it if you don't need to use it.
photo Re: starnet design (all 4)
Wed 15th Apr, 3:36pm, Zeyi W.
You can have dimensions which contain the original values. For such dimensions, the higher level concept of the hierarchy is only the "all".
photo Re: Assignment Part 1: Code for data wrangling (both)
Wed 15th Apr, 12:32pm, Zeyi W.
Yes, you can write programs to process the data. You can submit the code for data preprocessing, or you can add the code of data preprocessing into your report. Let us know how you manipulate the data, which is part of the ETL process. For the other question...
photo Re: Can we add attributes to file? (all 4)
Wed 15th Apr, 12:18pm, Zeyi W.
You can generate the primary keys by yourself. You can also add other necessary information to your csv files or database tables. This is part of the Extract, Transform and Load (ETL) process.
photo Re: Can't see hierarchies when data is imported but fine when a live connection is used (all 4)
Wed 15th Apr, 12:15pm, Zeyi W.
Hi Edward, you can describe the concept hierarchy issues of Power BI in your report. In the report, you can generate the figures using live connection to SQL Server. In the submission of your Power BI file, you can use the "import" option. We will consider...
photo Re: Data cleaning (all 3)
Wed 15th Apr, 12:06pm, Zeyi W.
Edward has a very good point. You may find interesting patterns by keeping the rows with missing values, but it is also fine to remove them if those rows are not useful to answering your business queries.
photo Re: Joining csv file to create ID keys (both)
Wed 15th Apr, 12:02pm, Zeyi W.
I would suggest writing a (python) script to transform your csv files. You can use "B+C" in your first csv file to find the corresponding "ID" in the second csv file.

UWA week 14 - 1st semester, week 6

photo Re: mock Q1 Q6 (both)
Thu 2nd Apr, 5:53pm, Zeyi W.
Please check out the mock test answer, and the midterm test 2019 with solutions. The mock test was built on midterm test 2019. They are available in the Midterm Test page on LMS.
photo Re: Mid sem notes (both)
Thu 2nd Apr, 5:51pm, Zeyi W.
Yes, you can.

UWA week 13 - 1st semester, week 5

photo Re: Mid Sem Practice Test (all 5)
Sun 29th Mar, 8:41pm, Zeyi W.
The questions are similar to the ones in the mock test. I will make an announcement when the mock test answers are released. Please stay tuned.
photo Re: Lecture 4: 60,000 possible dimension table rows (all 3)
Sun 29th Mar, 8:35pm, Zeyi W.
Good answers. They are extreme/simplified examples. Based on your answers, I believe you have got the idea :)
photo Re: Mid Sem Practice Test (all 5)
Fri 27th Mar, 10:58am, Zeyi W.
I will double check. Sample answers will be provided.
photo Re: Flat File Connection Manager (both)
Tue 24th Mar, 8:25pm, Zeyi W.
Did you choose to display all the file types?

Last modified:  8:27am May 24 2020