Computer Science 494: Research Methods

Study Guide

Unit 4: Mining, Simulation, Optimization and Modeling

This unit contains the following sections:

  • 4.1 Preface
  • 4.2 Learning Outcomes
  • 4.3 Data Mining
  • 4.4 Simulation
  • 4.5 Optimization: Numerical Analysis, Numerical Methods, Operations Research
  • 4.6 Modeling: Mathematical Modeling and Bayesian Networks
  • 4.7 References

4.1 Preface

This unit introduces four computational techniques that can be useful in certain types of data analysis: data mining, simulation, optimization, and modeling (mathematical modeling and Bayesian networks).

We anticipate that you will need about 11 hours (660 minutes) to complete this unit.

4.2 Learning Outcomes

After completing Unit 4, you should be able to

  • analyse patterns of data using data mining techniques.
  • design simulation experiments.
  • solve optimization problems.
  • model problems in mathematical and causal frameworks.

4.3 Data Mining

We estimate that this segment of Unit 4 should take about 2.5 hours (150 minutes) to complete.

Through the use of automated data-mining techniques, industries and businesses are discovering new trends and patterns of behavior and data that previously went unnoticed. Once they have uncovered this vital intelligence, it can be used for a variety of applications. (Chen, Chen, & Verma, 2010, p. 65)

For example, the Toronto Raptors basketball team uses data-mining techniques to prepare for games against opposing teams; the Bank of Montreal’s business intelligence and knowledge discovery program is used to gain insight into customer behaviour.

Reading Assignment 4.3.R1

This reading provides an introduction to data mining.

Zaïane, O.R. (1999). Chapter 1: Introduction to data mining. Retrieved from the website of the University of Alberta: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/notes/Chapter1/ch1.pdf

Viewing Assignment 4.3.V1

The video clips listed below provide demonstrations of data mining.

creativecommIT (2010a). Introduction to data mining (1/3). Retrieved from http://youtu.be/EtFQv_B7YA8
Note: Do not be concerned by the black screen that you will see for the first several minutes of the video. Images begin at about 3:35.

creativecommIT (2010b). Introduction to data mining (2/3). Retrieved from http://youtu.be/UdxDjQneh0k

creativecommIT (2010c). Introduction to data mining (3/3). Retrieved from http://youtu.be/UvBUPKT6aw8

Tabladillo, M. (2009). Microsoft data mining demo: Fill from example. Retrieved from http://youtu.be/NSyFQPAkkqc

4.4 Simulation

We estimate that this segment of Unit 4 should take about 2 hours (120 minutes) to complete.

In general, simulation is used to model and analyze randomness in a system. Simulation modeling and analysis is especially useful

  • [when] it is impossible or extremely expensive to observe processes in the real world, e.g., next year’s cancer statistics, performance of the next space shuttle, and the effect of Internet advertising on a company’s sales.
  • [for] problems in which mathematical models can be formulated, but analytic solutions are either impossible (e.g., job shop scheduling problem, higher order difference equations), or too complicated (e.g., complex systems like the stock market, and large scale queuing models).
  • [when] it is impossible or extremely expensive to validate the mathematical model describing the system, e.g., due to insufficient data.

    Applications of simulation abound in the areas of government, defense, computer and communication systems, manufacturing, transportation (e.g., air traffic control), health care, ecology and environment, sociological and behavioural studies, biosciences, epidemiology, services (bank teller scheduling), economics and business analysis. (Maria, 1997, p. 11)
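
The bank teller scheduling application mentioned above can be sketched as a small Monte Carlo simulation. The sketch below is only an illustration, not a method taken from the readings: the function name, the single-teller setup, and the exponential arrival and service times are all assumptions chosen to keep the example short.

```python
import random


def simulate_teller_queue(n_customers, arrival_rate, service_rate, seed=0):
    """Estimate the mean time a customer waits in line for a single teller.

    Assumptions (illustrative only): customers arrive at random with
    exponentially distributed gaps, service times are also exponential,
    and customers are served in order of arrival.
    """
    rng = random.Random(seed)   # fixed seed so that runs are repeatable
    arrival_time = 0.0          # clock time of the current arrival
    teller_free_at = 0.0        # clock time at which the teller is next idle
    total_wait = 0.0
    for _ in range(n_customers):
        arrival_time += rng.expovariate(arrival_rate)
        service_start = max(arrival_time, teller_free_at)
        total_wait += service_start - arrival_time
        teller_free_at = service_start + rng.expovariate(service_rate)
    return total_wait / n_customers


# Compare a slower and a faster teller under the same arrival pattern.
slow = simulate_teller_queue(10_000, arrival_rate=0.8, service_rate=1.0)
fast = simulate_teller_queue(10_000, arrival_rate=0.8, service_rate=2.0)
print(f"mean wait, slow teller: {slow:.2f}")
print(f"mean wait, fast teller: {fast:.2f}")
```

Changing the rates or the seed and re-running shows how the estimate varies, which is the kind of experiment that is cheap in simulation but expensive to run in a real bank.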

Reading Assignment 4.4.R1

The first reading below provides a quick introduction to simulation; the second discusses simulation in the context of social research design.

The third listed reading provides a very detailed introduction to simulation. We encourage you to browse it, but be selective about the sections you choose to read in detail.

GoldSim. (2011). Introduction: What is simulation? Retrieved from http://www.goldsim.com/Web/Introduction/Simulation/

Trochim, W. & Davis, S. (1996). Computer simulations for research design. Retrieved from https://www.billtrochim.net/simul/simul.htm

Arsham, H. (n.d.). Systems simulation: The shortest route to applications. Retrieved from http://home.ubalt.edu/ntsbarsh/simulation/sim.htm

Practice Assignment 4.4.P1

Install MITRE Corporation’s Tortuga, run an example from the introduction section, and describe the simulation technique employed in that demo.

4.5 Optimization: Numerical Analysis, Numerical Methods, Operations Research

We estimate that this segment of Unit 4 should take about 2.5 hours (150 minutes) to complete.

Numerical analysis is the study of algorithms used to solve problems in continuous mathematics, as distinguished from discrete mathematics.
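
As one small illustration of such an algorithm, the sketch below uses Newton's method to approximate the square root of 2 by finding a root of f(x) = x² − 2. The function name, tolerance, and starting point are assumptions made for this example; the readings cover this and other methods in more depth.

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Approximate a root of f by Newton's method, starting from x0.

    f is the function, df its derivative. Iteration stops once the
    update step is smaller than tol.
    """
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)   # Newton update: x_new = x - f(x) / f'(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")


# Root of x^2 - 2 = 0, i.e. an approximation of the square root of 2.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```

The result is an approximation whose accuracy depends on the tolerance and on floating-point arithmetic, which is typical of numerical methods for continuous problems.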

Reading Assignment 4.5.R1

The first item identified below is a brief introduction to numerical analysis. The second item is an excellent course page on numerical methods; we encourage you to browse through it, and keep it in mind for future reference, but do be selective about the parts you choose to read.

Numerical analysis. (2014, February 19). Wikipedia. Retrieved March 7, 2014, from http://en.wikipedia.org/w/index.php?title=Numerical_analysis&oldid=596131473

University of South Florida. (n.d.). Holistic numerical methods: Transforming numerical methods education for the STEM undergraduate. Retrieved January 12, 2012, from http://numericalmethods.eng.usf.edu/

Reading Assignment 4.5.R2

Operations research. (2014, January 24). Wikipedia. Retrieved March 7, 2014, from http://en.wikipedia.org/w/index.php?title=Operations_research&oldid=592246175
Note: Pay close attention to the introduction and to the sections titled “Overview” and “Problems addressed with operational research.”

Practice Assignment 4.5.P1

Visit the NEOS server at the website of the Wisconsin Institutes for Discovery, and solve a sample optimization problem.
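
To give a sense of what a small optimization problem looks like before you submit one to a solver, here is a minimal sketch of a two-variable linear program solved by brute force. The problem data are an illustrative, textbook-style example (maximize 3x + 5y subject to x ≤ 4, 2y ≤ 12, 3x + 2y ≤ 18, x ≥ 0, y ≥ 0), and the function name is hypothetical. The approach checks every intersection of two constraint boundaries. It assumes the feasible region is bounded, and it is practical only for very small problems; production solvers use far more efficient methods.

```python
from itertools import combinations


def solve_lp_2d(objective, constraints, eps=1e-9):
    """Maximize objective[0]*x + objective[1]*y subject to a*x + b*y <= c.

    constraints is a list of (a, b, c) tuples. The optimum of a bounded
    linear program lies at a vertex of the feasible region, so every
    intersection of two constraint boundaries is tested.
    """
    best_point, best_value = None, float("-inf")
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundaries do not meet in a single point
        # Cramer's rule for the 2x2 system of boundary equations.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + eps for a, b, c in constraints):
            value = objective[0] * x + objective[1] * y
            if value > best_value:
                best_point, best_value = (x, y), value
    return best_point, best_value


constraints = [
    (1, 0, 4),    # x <= 4
    (0, 2, 12),   # 2y <= 12
    (3, 2, 18),   # 3x + 2y <= 18
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]
point, value = solve_lp_2d((3, 5), constraints)
print(point, value)
```

For this example the best vertex is x = 2, y = 6, with an objective value of 36.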

4.6 Modeling: Mathematical Modeling and Bayesian Networks

We estimate that this segment of Unit 4 should take about 4 hours (240 minutes) to complete.

Reading Assignment 4.6.R1

The first reading below provides a brief overview of mathematical modeling; the tutorial that follows it provides an excellent and extensive introduction to the subject.

Mathematical model. (2014, February 27). Wikipedia. Retrieved March 7, 2014, from http://en.wikipedia.org/w/index.php?title=Mathematical_model&oldid=588574340

McLaughlin, M. P. (1993-1999). “. . . the very game . . .”: A tutorial on mathematical modeling. Retrieved January 13, 2012, from http://www.causascientia.org/math_stat/Tutorial.pdf

Note: This tutorial runs for 50 pages.

Reading Assignment 4.6.R2

Bayesian network. (2014, February 23). Wikipedia. Retrieved March 7, 2014, from http://en.wikipedia.org/w/index.php?title=Bayesian_network&oldid=596734837

Bayesian belief networks (BBNs) provide a way of analyzing data that is incomplete in order to draw conclusions based on probabilities. The quotes below suggest the strength and complexity of the technique. Do not be intimidated by these quotes. The assigned readings provide very comprehensible introductions to the topic, and you will have the opportunity to work with a BBN in the practice assignment for this section.

Belief networks (also known as Bayesian belief networks, Bayes networks, [directed acyclic graphs (DAGs)] and causal probabilistic networks), provide a method to represent relationships between propositions or variables, even if the relationships involve uncertainty, unpredictability or imprecision. (Roylance, Sperring, & Barraclough, 2001, p. 255)

Formally, Bayesian networks are directed acyclic graphs whose nodes represent variables, and whose missing edges encode conditional independencies between the variables. Nodes [can represent any kind of variable: a measured parameter, a latent variable or a hypothesis. They are not restricted to representing random variables]. Efficient algorithms exist that perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g. speech signals or protein sequences) are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. (BeliefNetworks, 2009)

The term “Bayesian networks” was coined by Judea Pearl in 1985 to emphasize three aspects:

  1. The often subjective nature of the input information.
  2. The reliance on Bayes’s conditioning as the basis for updating information.
  3. The distinction between causal and evidential modes of reasoning, which underscores Thomas Bayes’ posthumously published paper of 1763.

Informal variants of such networks were first used by legal scholar John Henry Wigmore, in the form of Wigmore charts, to analyse trial evidence in 1913. Another variant, called path diagrams, was developed by the geneticist Sewall Wright and used in social and behavioral sciences (mostly with linear parametric models).

Bayesian networks are used for modeling knowledge in computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis), medicine, document classification, information retrieval, image processing, data fusion, decision support systems, engineering, gaming and law (“Bayesian Network,” 2014).

A BBN can be a model of any dataset such as a weather dataset, a disease and its symptoms dataset, a military dataset, or a criminal incident dataset. Bayesian belief networks are especially useful when the information about the past and/or the current situation is vague, incomplete, conflicting, and/or uncertain.

Currently, various software packages enable a user to build a Bayesian Belief Network (BBN) for modeling a particular dataset [e.g., Netica, JavaBayes, and so on]. (Riesen & Gursel, 2008)

Bayesian belief networks are powerful tools for modeling causes and effects in a wide variety of domains. . . . They are compact networks of probabilities that capture the probabilistic relationship between variables, as well as historical information about their relationships. (Xu, Zheng & Guo, 2007, p. 101)

Bayesian belief networks make explicit the dependencies between different variables. In general there may be relatively few direct dependencies (modeled by arcs between nodes of the network); this means that many of the variables are conditionally independent. (Wang, 2007, p. 29)
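
The ideas in the quotes above (nodes as variables, arcs as direct dependencies, and updating by Bayes's conditioning) can be made concrete with a very small network. The sketch below is a hypothetical three-node example, not one drawn from the assigned readings: Rain influences Sprinkler, and both influence GrassWet. All probabilities are assumed values chosen for illustration. The code computes the probability of rain given that the grass is observed to be wet by summing over the joint distribution.

```python
from itertools import product

# Assumed (illustrative) probability tables.
P_RAIN = 0.2                                        # P(Rain = true)
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}   # P(Sprinkler = true | Rain)
P_WET_GIVEN = {                                     # P(GrassWet = true | Sprinkler, Rain)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def prob(p_true, value):
    """Probability that a true/false variable takes `value`, given P(true)."""
    return p_true if value else 1.0 - p_true


def joint(rain, sprinkler, wet):
    """Joint probability, factored along the arcs of the network."""
    return (prob(P_RAIN, rain)
            * prob(P_SPRINKLER_GIVEN_RAIN[rain], sprinkler)
            * prob(P_WET_GIVEN[(sprinkler, rain)], wet))


def posterior_rain_given_wet():
    """P(Rain = true | GrassWet = true), by enumerating the hidden variable."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True)
                   for r, s in product((True, False), repeat=2))
    return numerator / evidence


print(round(posterior_rain_given_wet(), 4))
```

With these assumed numbers, observing wet grass raises the probability of rain from the prior of 0.2 to roughly 0.36. Tools such as Netica perform the same kind of calculation on much larger networks, using more efficient inference algorithms than plain enumeration.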

Reading Assignment 4.6.R3

The first two readings below provide very simple introductions to BBNs; the third reading gives a basic introduction to the mathematics on which BBNs depend. The fourth item is a tutorial.

Marcot, B.G. (2005). What are “Bayesian network models”? Retrieved from http://www.plexusowls.com/PDFs/What%20Are%20Bayesian%20Network%20Models.pdf

Fenton, N. (n.d.a). What is a Bayesian network? Retrieved from http://www.eecs.qmul.ac.uk/~norman/BBNs/What_is_a_BBN_.htm

Fenton, N. (n.d.b). Bayes rule. Retrieved from the website of the School of Electronic Engineering and Computer Science of Queen Mary University of London: http://www.eecs.qmul.ac.uk/~norman/BBNs/Bayes_rule.htm

Fenton, N. (n.d.c). BBN tutorial. Retrieved from http://www.eecs.qmul.ac.uk/~norman/BBNs/BBN_Tutorial__About_this_section.htm

Practice Assignment 4.6.P1

Install Netica, a software program that helps you develop and use Bayesian networks. You will wish to consult the Norsys tutorial on using Netica.

  1. Run the demonstration BBN that comes with Netica, and write a short note on the network and how it has been used as a model.
  2. Critically analyze the shortcomings of BBN models.

Assignment 1

Review Assignment 1 on the course home page, and do whatever you can at this point.

4.7 References

Arsham, H. (n.d.). Systems simulation: The shortest route to applications. Retrieved from http://home.ubalt.edu/ntsbarsh/simulation/sim.htm

Bayesian network. (2014, February 23). Wikipedia. Retrieved March 7, 2014, from http://en.wikipedia.org/w/index.php?title=Bayesian_network&oldid=596734837

BeliefNetworks. (2009). About BeliefNetworks.

Chen, P., Chen, I., & Verma, R. (2010). Improving an undergraduate data mining course with real-world projects. Journal of Circuits, Systems, and Computers, 25(4), 62–67. Retrieved from http://dl.acm.org/ft_gateway.cfm?id=1734810&type=pdf
Also available through the AU Library Services journal databases.

creativecommIT (2010a). Introduction to data mining (1/3). Retrieved from http://youtu.be/EtFQv_B7YA8

creativecommIT (2010b). Introduction to data mining (2/3). Retrieved from http://youtu.be/UdxDjQneh0k

creativecommIT (2010c). Introduction to data mining (3/3). Retrieved from http://youtu.be/UvBUPKT6aw8

Fenton, N. (n.d.a). What is a Bayesian network? Retrieved from http://www.eecs.qmul.ac.uk/~norman/BBNs/What_is_a_BBN_.htm

Fenton, N. (n.d.b). Bayes rule. Retrieved from the website of the School of Electronic Engineering and Computer Science of Queen Mary University of London: http://www.eecs.qmul.ac.uk/~norman/BBNs/Bayes_rule.htm

Fenton, N. (n.d.c). BBN tutorial. Retrieved from http://www.eecs.qmul.ac.uk/~norman/BBNs/BBN_Tutorial__About_this_section.htm

GoldSim. (2011). Introduction: What is simulation? Retrieved from http://www.goldsim.com/Web/Introduction/Simulation/

Marcot, B. G. (2005). What are “Bayesian network models”? Retrieved from http://www.plexusowls.com/PDFs/What%20Are%20Bayesian%20Network%20Models.pdf

Maria, A. (1997). Introduction to modeling and simulation. In S. Andradóttir, K. J. Healy, D. H. Withers, & B. L. Nelson (Eds.), Proceedings of the 1997 Winter Simulation Conference (pp. 7–13). Washington: IEEE Computer Society. Retrieved from http://www.inf.utfsm.cl/~hallende/download/Simul-2-2002/Introduction_to_Modeling_and_Simulation.pdf

Mathematical model. (2014, February 27). Wikipedia. Retrieved March 7, 2014, from http://en.wikipedia.org/w/index.php?title=Mathematical_model&oldid=588574340

McLaughlin, M. P. (1999). “. . . the very game . . .”: A tutorial on mathematical modeling. Retrieved from http://www.causascientia.org/math_stat/Tutorial.pdf

Numerical analysis. (2014, February 19). Wikipedia. Retrieved March 7, 2014, from http://en.wikipedia.org/w/index.php?title=Numerical_analysis&oldid=596131473

Operations research. (2014, January 24). Wikipedia. Retrieved March 7, 2014, from http://en.wikipedia.org/w/index.php?title=Operations_research&oldid=592246175

Riesen, M. J., & Gursel, S. (2008). Bayesian belief network query tool: Patent application (US/12/256743). Retrieved from https://www.google.ch/patents/US20090106734

Roylance, B. J., Sperring, T. P., & Barraclough, T. G. (2001). Bench test determination of wear models to classify morphological attributes of wear debris. In G. Totten, L. Wedeven, J. Dickey, & M. Anderson (Eds.), Bench testing of industrial fluid lubrication and wear properties used in machinery applications ASTM STP 1410 (pp. 235–258). West Conshohocken, PA: American Society for Testing and Materials.

Tabladillo, M. (2009). Microsoft data mining demo: Fill from example. Retrieved from http://youtu.be/NSyFQPAkkqc

Trochim, W., & Davis, S. (1996). Computer simulations for research design. Retrieved from https://www.billtrochim.net/simul/simul.htm

University of South Florida. (n.d.). Holistic numerical methods: Transforming numerical methods education for the STEM undergraduate. Retrieved from http://numericalmethods.eng.usf.edu/

Wang, C. (2007). Hybrid causal logic methodology for risk assessment. Ph.D. Dissertation, University of Maryland, College Park, MD. Retrieved from http://drum.lib.umd.edu/bitstream/1903/7729/1/umi-umd-5006.pdf

Xu, Z., Zheng, X., & Guo, P. (2007, June). Empirically validating software metrics for risk prediction based on intelligent methods. Journal of Digital Information Management, 5(3), 99–106.

Zaïane, O.R. (1999). Chapter 1: Introduction to data mining. Retrieved from the website of the University of Alberta: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/notes/Chapter1/ch1.pdf