CS 548 Fall 2019 - Project 1

Computer Science Department

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2019
Project 1: Data Pre-processing

PROF. CAROLINA RUIZ

Due Date: Canvas submission by 3:00 pm on Thursday Sept. 12, 2019.

Instructions

Group Project: This is a group project. Students should work in groups of 2 students. Please do not split the project in a way that each student does only a portion of the work. Instead each student is expected to work on the entire project individually and then meet with the group to clarify doubts, share findings, and combine the project solutions into one group report. Submit just one written report. Help or assistance from other groups, other people, or online resources is NOT allowed.
Questions and Comments: If you have any questions about the project or the test, please post your questions to the Canvas discussion forum for this course. Do NOT email your question to the professor or the TA (unless your question is private and related just to your own situation). That way all students get to participate in and benefit from the discussion.
- To access the discussion forum go to Canvas, select "CS548-BCB503-CS583-F19-191: KDD-Data-Mining" under "My Courses", and then click on "Discussions" on the left hand-side bar.
- I suggest you set up your Canvas account so that you receive email notifications when anyone in the class posts comments on the forum. For this click on the "Subscribe" button.
- High quality participation on the discussion forum (e.g., providing good answers to other students' questions) will count toward your class participation grade.
Mandatory Readings: Read Chapters 1, 2, and Appendix B.1 from your textbook in detail.
Project Submission Instructions: Each group must submit the following two files using the Canvas system (submission name: Project1):
1. one pdf file with your group's project written report, and
2. one .py file containing all the Python code that your group used to perform each of the project parts below. Please organize your Python code neatly in the order of the project parts below with comments that help us understand your work.
Weka and Python: Install Python (required) and the Weka system developer version (optional) as described in the Course Webpage.
Regarding Python: (Required)
- See Prof. Ruiz's miscellaneous notes on Python.
- In this project, you will use Python's scikit-learn library for data preprocessing. It is recommended that you use the latest version of scikit-learn (currently v0-21-3).
- In addition to scikit-learn, you can use other Python libraries that are useful for the project (e.g., Pandas, NumPy, Matplotlib, ...) as long as you use the scikit-learn functions that are explicitly required in this project.
Regarding Weka: (Optional)
- You can find the Weka code in a file called "weka-src.jar", which should be located in the directory where Weka was installed. You need to unzip and/or use jar utilities this file to extract its contents. Inside, you will find the .java files that implement Weka.
- Consult the "README" file, the "documentation" webpage, and the "WekaManual" provided with the Weka system (in the same directory where Weka was downloaded). Browse through the "Package Documentation" to become familiar with it.
- When needed, use the following command to increase the amount of main memory used by Weka. Here, I'm increasing the amount of main memory used by Weka to 768m, but you can specify any other size instead of 768 if more memory is needed/available:
```
java -Xmx768m -jar weka.jar
```

Problem I. Knowledge Discovery in Databases (15 points)

(3 points) Define knowledge discovery in databases.
(9 points) Briefly describe the steps of the knowledge discovery in databases process.
(3 points) Define data mining.

Base your answers on the definitions presented in class, the textbook, and the following paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AAAI Magazine, pp. 37-54. Fall 1996. However, your answers must be written in your own words.

Problem II. Data Preprocessing: Attribute Transformations (140 points)

Consider the Flags Dataset available at the UCI Machine Learning Data Repository. See the link above for a description of this dataset.

Note that although most attribute values in this dataset are represented using numeric values, this dataset contains attributes that are conceptually of different types:

Discrete (or nominal) attributes:

1. name: Name of the country concerned
2. landmass: 1=N.America, 2=S.America, 3=Europe, 4=Africa, 4=Asia, 6=Oceania
3. zone: Geographic quadrant, based on Greenwich and the Equator; 1=NE, 2=SE, 3=SW, 4=NW
6. language: 1=English, 2=Spanish, 3=French, 4=German, 5=Slavic, 6=Other Indo-European, 7=Chinese, 8=Arabic, 9=Japanese/Turkish/Finnish/Magyar, 10=Others
7. religion: 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others 
18. mainhue: predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue, and if that fails the leftmost hue)
29. topleft: colour in the top-left corner (moving right to decide tie-breaks)
30. botright: Colour in the bottom-left corner (moving left to decide tie-breaks)

Boolean attributes:

11. red: 0 if red absent, 1 if red present in the flag
12. green: same for green
13. blue: same for blue
14. gold: same for gold (also yellow)
15. white: same for white
16. black: same for black
17. orange: same for orange (also brown)
24. crescent: 1 if a crescent moon symbol present, else 0
25. triangle: 1 if any triangles present, 0 otherwise
26. icon: 1 if an inanimate image present (e.g., a boat), otherwise 0
27. animate: 1 if an animate image (e.g., an eagle, a tree, a human hand) present, 0 otherwise
28. text: 1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise

Continuous (or "numeric") attributes:

4. area: in thousands of square km
5. population: in round millions
8. bars: Number of vertical bars in the flag
9. stripes: Number of horizontal stripes in the flag
10. colours: Number of different colours in the flag
19. circles: Number of circles in the flag
20. crosses: Number of (upright) crosses
21. saltires: Number of diagonal crosses
22. quarters: Number of quartered sections
23. sunstars: Number of sun or star symbols

In this project, you will apply different scikit-learn preprocessing functions (see this link) to this dataset.

(5 points) Discrete attributes with too many values:
The attribute name (attribute #1) contains many values (one for each data instance). Would you keep this attribute in the dataset when mining for patterns? Why or why not. Explain.
(25 points) Converting discrete attributes to continuous:
1. (10 points) Read scikit-learn data transformation functions section 5.3.4 and use the OneHotEncoder function to encode each of these nominal attributes: mainhue (#18), topleft (#29), and botright (#30). Include the Python code you use for this in your written report and your .py file.
2. (15 points + 1 extra point) Attributes landmass (#2), zone (#3), language (#6) and religion (#7) are discrete ("nominal") even though their values were represented using numbers. For each of these attributes:
  - (2 points/each) Discuss whether or not the numeric encoding used is appropriate.
  - (2 points/each) If you answer above is "no", use the OneHotEncoder function to encode the attribute in a more appropriate way
    - Decide what the best values (the default value or another value) for the function parameters are. Explain your choices.
    - Include the OneHotEncoder function parameter values that you used in your written report and a brief description of your observations. You need to include your Python code in your .py file as well.
(25 points) Handling missing values:
In this part, you need to consider only the area attribute (#4):
- (5 points) Plot a graph that shows the distribution of this attribute. Include this plot in your report.
- Assume just for this part of the project (although this is false) that the value "0" in this attribute represents a missing value. Oscar's hint: An idea that makes it easier to solve this problem is to replace the "0" values in the area attribute with "NaN" so that scikit-learn assume these are missing values.
- Starting with the area attribute with missing values each time, apply each of the following scikit-learn imputation functions (read section 5.4):
  1. (10 points) Univariate feature imputation using the SimpleImputer function, experimenting with different strategies (mean, median, most frequent, and constant).
  2. (10 points) Multivariate feature imputation using the IterativeImputer function with default parameters (except for the "missing_values" parameter, which should be equal to 0 or to NaN, depending on what convention you are using). You should allow this iterative imputation to predict the 0 (or NaN) values of area using all of the other attributes in the dataset.
  - In each case above, plot a graph that shows the distribution of the transformed attribute.
  - Include in your written report the specific Python functions you used and with what parameters (3 out of 10 points), the plot of the transformed attribute (3 out of 10 points) and a brief description of what the function did to this attribute (4 out of 10 points). You need to include your Python code in your .py file as well.
(50 points) Standardization, scaling and normalization of continuous attributes:
In this part, you need to work only with the area attribute (#4):
- Include in your written report the plot that shows the distribution of the original attribute again.
- Starting with the original attribute each time, apply each of the following scikit-learn data transformation functions:
  1. (5 points) standardization using the scale function. Read scikit-learn data transformation functions section 5.3.1.
  2. (20 points) scaling to a range using the MinMaxScaler and the MaxAbsScaler functions (use the latter to see how it scales sparse data - read section 5.3.1.2); as well as robust_scale and RobustScaler, which handle outliers.
  3. (10 points) mapping to a uniform distribution using the QuantileTransformer and the quantile_transform functions. Read section 5.3.2.1
  4. (10 points) mapping to a Gaussian distribution using the PowerTransformer and the QuantileTransformer functions. Read section 5.3.2.2
  5. (5 points) normalization using the normalize function. Read section 5.3.3
  - In each case above, plot a graph that shows the distribution of the transformed attribute.
  - Include in your written report the specific Python functions you used and with what parameters (1 out of 5 points), the plot of the transformed attribute (2 out of 5 points) and a brief description of what the function did to this attribute (2 out of 5 points). You need to include your Python code in your .py file as well.
(25 points) Discretization:
In this part, you need to work only with the population attribute (#5):
- (5 points) Plot a graph that shows the distribution of this attribute. Include this plot in your report.
- Starting with the original attribute each time, apply each of the following scikit-learn data transformation functions (read section 5.3.5):
  1. (10 points) K-bins discretization using the KBinsDiscretizer function, experimenting with different encodings, strategies, n_bins_ and bin_edges_ values.
  2. (10 points) Feature binarization using the Binarizer function, experimenting with different threshold values.
  - In each case above, plot a graph that shows the distribution of the transformed attribute.
  - Include in your written report the specific Python functions you used and with what parameters (2 out of 10 points), the plot of the transformed attribute (2 out of 10 points) and a brief description of what the function did to this attribute (3 points out of 10 points). (The remaining 3 points out of 10 are for sufficient experimentation with different parameter values.) You need to include your Python code in your .py file as well.
(10 points) Custom transformation:
In this part, you need to work only with the area attribute (#4):
- Include in your written report the plot that shows the distribution of the original attribute again.
- (10 points) Use the FunctionTransformer described in scikit-learn data transformation functions (read section 5.3.8) to create a function to transform this attribute from being "thousands of square kms" to "thousands of square miles":
- Include in your written report the specific Python functions you used and with what parameters (5 out of 10 points), a brief description of how the function works (3 out of 10 points) and a plot of the transformed attribute (2 out of 10 points). You need to include your Python code in your .py file as well.

Problem III. Data Preprocessing: Dimensionality Reduction (130 points)

In each of the problems below, use the orginal Flags dataset after converting discrete attributes to continuous (in part II.2) above but without any of the transformations in parts II.3-II.6.
Study Section 2.4.5 of the Tan, Steinbach, Karpatne and Kumar's textbook for the definitions and formulas for correlation and covariance.

(20 points) Correlation and Covariance Analysis:
- Correlation matrix:
  - (5 points) Calculate the correlation matrix of the dataset using Python. For example, you can use the numpy.corrcoef function. Include this matrix in your written report and the code in your .py file.
  - (5 points) Create a visualization of this matrix (e.g., a heatmap) using Python. There are several ways to do this, so look for a good visualization function in a Python library. Include both the visualization and the python code you used in your report, describing what Python library (or libraries) and what functions you used. Also include the code in your .py file.
- Covariance matrix:
  - (5 points) Calculate the covariance matrix of the dataset using Python. For example, you can use the numpy.cov function. Include this matrix in your written report and the code in your .py file.
  - (5 points) Create a visualization of this matrix (e.g., a heatmap) using Python. There are several ways to do this, so look for a good visualization function in a Python library. Include both the visualization and the python code you used in your report, describing what Python library (or libraries) and what functions you used. Also include the code in your .py file.
(15 points) Data Sampling:
Assume that you want to reduce the number of data instances by keeping just 60% of the data instances.
- Look for Python functions (e.g., sklearn.utils.resample) that allow you to apply:
  1. (5 points) (plain) Random sampling without replacement using uniform distribution;
  2. (5 points) (plain) Random sampling with replacement using uniform distribution;
  3. (5 points) Stratified random sampling without replacement using religion (attribute #7) as the target (or class) attribute.
- Include in your report and in your .py file what Python libraries and functions you used in each case and with what parameter values.
(60 points) Feature Selection:
- Manual feature selection:
  - (5 points) If you could keep all but 2 attributes, which 2 attributes would you remove from this dataset based on the correlation and the covariance matrices you constructed above? Explain your answer.
- Automatic feature selection: In this part, you will experiment with the feature selection techiques available in scikit-learn. When needed, use religion (attribute #7) as the classification target (here, use the original religion attribute without one-hot-encoding it); and population (attribute #5) as the regression target;
  - (5 points) VarianceThreshold. Read section 1.13.1 .
  - SelectKBest, experimenting with the following "score_functions":
    - (5 points each) For regression: f_regression, mutual_info_regression
    - (5 points each) For classification: chi2, f_classif, mutual_info_classif
    Read section 1.13.2.
  - (10 points) Recursive Feature Elimination (RFE). Read section 1.13.3.
  - (10 points) Look for an implementation of Correlation-based Feature Selection (CFS) in Python and experiment with it. See Witten's and Frank's textbook slides - Chapter 7 Slides 5-6 and also Mark A.Hall's phd thesis. See Section 2.4.6 of the Tan, Steinbach, Karpatne and Kumar's textbook for the definition and formulas for Mutual Information.
  - In each case above, include in your written report the specific Python functions you used and with what parameter values, the list of selected attributes and a brief description of your observations about the set of selected attributes. IMPORTANT (5 poins): Compare the sets of selected attributes obtained with different functions and/or parameters. You need to include your Python code in your .py file as well.
(35 points) Feature Extraction:
In this part, you will experiment with Principal Components Analysis (PCA). Read also Scikit-learn's Decomposing signals in components (matrix factorization problems)
Section 2.5.1.1 (Reading the whole section 2.5 is recommended though not required.)
- (15 points) First, use default parameter values for the PCA function (use svd_solver='auto' and leave n_components unset). Include in your report how many principal components were obtained, how much of the variance each of them explains (including cumulative variance explained), singular (eigen) value of each component. Also, include in your report the linear combinations that define the first tree new attributes (= components) obtained. Look at the results and elaborate on any interesting observations you can make about the results of the PCA function.
- (10 points) Now, assume that you need to reduce the number of dimensions as much as possible. Looking at the explained_variance_ , explained_variance_ratio_ , singular_values_ , and n_components_ values (if needed, rerun PCA with n_components set to 'mle'), determine a good number of components (= dimensions) to keep. Include this number in your written report and justify why you chose it.
- (10 points) Apply the PCA function again using n_components equal to the number of components you chose and copy equal to True (the default value of this parameter). Include in your written report the 4 first data instances (rows) in the transformed dataset.
- Include all of your Python code in your .py file.

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2019 Project 1: Data Pre-processing