Table of Contents:
Introduction
Data mining main objectives
Data mining concept
Why data mining?
The process of knowledge discovery
Data mining methods
Data mining algorithms and models
Data mining: a practical case study
Mining the web
Advantages and disadvantages of data mining
Data Mining future
The steady advancement of Information Technology (IT) has produced massive amounts of data stored in operational databases and large data warehouses. This growth increases the need for effective tools that analyze data and extract information and knowledge with speed, accuracy and intelligence. Data Mining (DM), sometimes called knowledge discovery, emerged as an effective technique for finding knowledge in huge amounts of data, i.e. transforming data into useful information and information into knowledge. "Data mining is the analysis step of the knowledge discovery in databases process". Data mining emerged as a result of developments in heterogeneous database systems together with great advances in the computer hardware industry, especially in storage technology. This leads to two questions: what is data mining, and why is it important? There are many definitions of the concept. The simplest defines it as "mining knowledge from large amounts of data", while Jiawei Han and Micheline Kamber (2006 p.7) adopt a broader view: "data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories". Data mining is therefore the exploration of large volumes of data, the discovery of relationships within them, and the summarization of those relationships into useful knowledge that a business can use to increase income and reduce costs. In other words, it answers questions that are more specialized and far broader in scope than those handled by traditional statistical and query tools. Data mining means extracting underlying, unusual concepts that were not previously known.
Data mining differs from traditional techniques for extracting quantitative information, such as decision support systems and statistical methods, in which the information sought is specified before the extraction process. Data mining is instead the discovery of hidden value in the data warehouse in order to generate predictions for future use.
The process of knowledge discovery in databases generally includes a number of stages, from the collection of raw data to the acquisition of new knowledge. This paper discusses data mining concepts and the data processing techniques that cover these stages: data cleaning, data integration, data selection, data transformation, pattern evaluation and, lastly, knowledge presentation. It also studies the principles and algorithms of data mining for knowledge discovery, and presents data mining techniques and models such as classification, clustering, association, description, estimation and prediction, together with a sample design approach using the Oracle Data Miner software.
Data mining main objectives
Data mining has proved itself as one of the successful solutions for analyzing large amounts of data, turning accumulated and incomprehensible data into valuable information that can be exploited as knowledge. The main objectives of using data mining can be classified as follows:
1. To explain an observed phenomenon, for example: why has the proportion of smokers in the country increased in recent years?
2. To verify a theory, for example: validating the theory that large families are more interested in health insurance than small families.
3. To analyze the data for new and unexpected relationships, for example: what spending patterns are associated with fraud across a wide range of credit cards?
Data mining concept
Data mining is a computerized or manual search for knowledge in huge volumes of historical data, without prior assumptions about what may be found. It is an analytical process that explores large databases to extract useful patterns and relationships and to find correlations between their elements. Data mining is a relatively new technology that enables predictive pattern discovery, hypothesis creation and testing, and the generation of insight. Data mining, which is often treated as synonymous with Knowledge Discovery in Databases (KDD), covers any application capable of extracting hidden knowledge, and it is not tied to any specific industry. It is considered one of the top ten information technology developments that will change the world in the coming years. The data mining process blends artificial intelligence, statistics, machine learning and database technology.
Why data mining?
Every day we stand in long lines at big supermarkets, waiting for our turn to pay for the items we have chosen. While waiting, we hear many beeps from the barcode readers. Each beep is a transaction, represented by a purchase record stored in the database, so hundreds of thousands of records can accumulate in the database per day. These records contain important information about the purchase of many items, including which items sell best. But how can we benefit from all of this data and make it useful for our business? Data mining technology is intended to answer this question. Unlike in the past, the problem is no longer a lack of storage space or insufficient data, but a lack of the expertise needed to convert this data into valuable knowledge. "It is a process that helps identify new opportunities by finding fundamental truths in apparently random data" as stated by Sumathi and Sivanandam (2006 p8).
As noted above, data mining is the process of exploring a large volume of data in order to detect the factors that influence a particular behavior, such as causes, conditions and contraindications. It can perform tasks that traditional applications do not or cannot provide. Data mining is not directed at a specific domain; it has a clear impact on all aspects of life. Its importance in the information industry, and the reasons for using it, can be summarized in the following points:
- Extraction of useful patterns has become important in light of the rapid development and widespread dissemination of databases.
- Its use gives institutions and security services in all areas the ability to explore and focus on the useful and effective information in the database.
- Data mining techniques focus on building predictions, exploring behavior and trends, and supporting the right decisions at the right time.
- Data mining techniques are capable of answering complex questions in record time, especially questions that have been difficult to answer using classical statistical techniques.
- Rather than surrendering to the limits imposed by traditional methods such as statistics and numerical analysis, data mining draws the maximum benefit from modern approaches such as artificial intelligence and qualitative analysis.
- Data mining helps to discover new relationships that may lead to new theories and contribute to the development of science.
- Data mining gives enterprises and institutions the ability to focus on the most important information in their databases.
The process of knowledge discovery
The discovery of knowledge in a database is a process that involves the managers and decision makers who act on its results. The KDD process consists of seven stages, summarized as follows:
1. Data cleaning: in this phase the data is refined, and data containing noise, inconsistencies or impurities is removed from the data set.
2. Data integration: multiple data sources are combined, reconciling differing data elements so that they can be held in a common data source.
3. Data selection: the data relevant to the analysis is identified and retrieved from the data set.
4. Data transformation: the selected data is transformed into a form suitable for the mining procedures.
5. Data mining: intelligent techniques and algorithms are applied to extract useful data patterns.
6. Pattern evaluation: the extracted patterns that represent knowledge are evaluated against specific standards and measures.
7. Knowledge presentation: the last stage of knowledge discovery in databases and the one visible to the end user; it uses visualization techniques to help the end user understand and interpret the data mining results.
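The seven stages can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not a definitive implementation: the two store data sets, the function names, the choice of item-pair counting as the mining step and the 0.5 support threshold are all hypothetical and do not come from any particular tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources; None or a negative quantity is noise.
store_a = [
    {"basket": 1, "item": "bread", "qty": 1},
    {"basket": 1, "item": "milk", "qty": 2},
    {"basket": 2, "item": "bread", "qty": None},
    {"basket": 2, "item": "butter", "qty": 1},
]
store_b = [
    {"basket": 3, "item": "bread", "qty": 1},
    {"basket": 3, "item": "milk", "qty": 1},
    {"basket": 4, "item": "milk", "qty": -5},
    {"basket": 4, "item": "bread", "qty": 1},
    {"basket": 4, "item": "bag", "qty": 1},
]

def clean(rows):
    """Stage 1: drop records with missing or inconsistent quantities."""
    return [r for r in rows if r["qty"] is not None and r["qty"] > 0]

def integrate(*sources):
    """Stage 2: combine several sources into one data set."""
    return [row for source in sources for row in source]

def select(rows, ignored_items=("bag",)):
    """Stage 3: keep only the records relevant to the analysis."""
    return [r for r in rows if r["item"] not in ignored_items]

def transform(rows):
    """Stage 4: reshape rows into one set of items per basket."""
    baskets = {}
    for r in rows:
        baskets.setdefault(r["basket"], set()).add(r["item"])
    return list(baskets.values())

def mine(baskets):
    """Stage 5: count how often each pair of items is bought together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Stage 6: keep only pairs whose support meets the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}

def present(patterns):
    """Stage 7: render the surviving patterns for the end user."""
    return [f"{a} + {b}: bought together in {s:.0%} of baskets"
            for (a, b), s in sorted(patterns.items())]

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))
report = present(patterns)
```

Each function maps to one numbered stage, so a stage can be replaced, for example with a different mining algorithm, without touching the others.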
Stages 1 through 4 are data preprocessing procedures, i.e. they prepare accurate data for the mining process. Fig (1) gives a general overview of a data mining system architecture. Note that data mining is only one step in KDD; it consists of complex data mining applications for knowledge extraction, and the resulting knowledge may be stored in a knowledge base. Referring to fig (1), the structure of a data mining system consists of six components:
1. Data warehouse, flat files, database, World Wide Web or any other data repository on which the data preprocessing techniques are performed.
2. Database or data warehouse server, which retrieves the relevant data according to the data mining user's requests.
3. Knowledge base, which stores the extracted knowledge for further evaluation of the resulting patterns.
4. Data mining engine, which consists of application modules that perform the data mining functions such as classification, summarization, clustering, association rules, prediction, time series analysis, regression and sequence discovery.
5. Pattern evaluation module, which interacts with the data mining modules to measure interestingness and focus the search on interesting patterns.
6. User interface, which handles communication between the end user and the data mining modules. It lets the end user submit queries, perform exploratory data mining, browse databases and data warehouses, and evaluate their schemas and data structures.
[Figure not included in this excerpt]
Fig (1) Data Mining System Architecture.
The knowledge produced by KDD can be applied in decision making, query processing, information management and process control. Therefore, according to Jiawei Han and Micheline Kamber, "Data mining is considered one of the most important frontiers in database and information systems and one of the most promising interdisciplinary developments in the information technology" (2006 p10).
Data preprocessing is a critical stage in data mining, because sound exploration of a database must be built on data that supports the flow of knowledge. Databases contain very large amounts of data collected through automated methods that are not completely controlled. They are therefore vulnerable to missing, incorrect, incomplete, inconsistent and noisy data, and this data is the input to the analysis processes and hence to knowledge discovery. Attention must be paid to data quality: data that is not collected and selected carefully may lead to misleading results, particularly in predictive data mining. Data preprocessing methods such as data selection, data cleaning, data integration and data transformation should be applied to the database to correct errors, remove noisy data and gather data from various sources. Descriptive data summarization techniques can be applied first to highlight the properties of the data and to identify noisy, missing, incorrect, outlying and incomplete values. Human data entry contributes to inaccurate and missing data, and it is the function of the data cleaning process to deal with such data entry errors. The data integration process is responsible for a well-designed schema of tables, attributes and constraints, i.e. a schema that contains no redundancies when data from different sources is combined. The data transformation process is concerned with data structure, i.e. the data should be transformed and encapsulated in a form suitable for mining. ETL software, which extracts, transforms and loads data to and from a database or data warehouse, plays an important role in the data transformation process. Finally, data mining requires that the database or data warehouse contain up-to-date data, and as Larose said, "Data mining often deals with data that hasn’t been looked at for years" (2005 p28).
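As a rough illustration of descriptive summarization applied before cleaning, the sketch below counts missing values, computes quartiles and flags outliers. The purchase amounts are made up, and both the interquartile-range rule for outliers and the median fill for missing values are assumed choices for this example, not methods prescribed by the text.

```python
import statistics

# Hypothetical monthly purchase amounts; None marks a value lost at data entry.
purchases = [120.0, 135.0, None, 128.0, 131.0, 990.0, 125.0, None, 133.0]

def summarize(values):
    """Descriptive summary: count missing values and compute quartiles of the rest."""
    known = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(known, n=4)
    return {"missing": len(values) - len(known), "q1": q1, "median": median, "q3": q3}

def flag_outliers(values, summary, k=1.5):
    """Flag values outside [q1 - k*IQR, q3 + k*IQR] (an assumed outlier rule)."""
    iqr = summary["q3"] - summary["q1"]
    low = summary["q1"] - k * iqr
    high = summary["q3"] + k * iqr
    return [v for v in values if v is not None and not (low <= v <= high)]

def clean(values, summary, outliers):
    """Data cleaning: drop flagged outliers, fill missing values with the median."""
    return [summary["median"] if v is None else v
            for v in values if v not in outliers]

summary = summarize(purchases)
outliers = flag_outliers(purchases, summary)
cleaned = clean(purchases, summary, outliers)
```

Whether an extreme value such as 990.0 is an entry error or a genuine large purchase is a domain decision; the summary only highlights it for review.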
Data mining methods
Data mining methods can be grouped into two major types: predictive data mining and descriptive data mining. Descriptive data mining, which includes summarization, clustering, association rules and sequence discovery, deals with the general characteristics of the data in the database. It reorganizes the data and searches its depths to extract models that give a simple description of similar entities, such as similar customers in a sales database; no target is required for such data. Predictive data mining, on the other hand, tries to find the best predictions based on the data, such as the product a specific customer is most likely to prefer. In brief, this type of data mining depends on historical data, i.e. it uses old information to forecast what will happen in the future. Unlike descriptive data mining, predictive data mining has a target to achieve. The predictive data mining tasks can be summarized as classification, regression, time series analysis and prediction. The following paragraphs highlight some of these data mining tasks: classification, clustering, association rules, sequence discovery, regression and time series analysis.
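The contrast between the two types can be made concrete with a small sketch. The first half measures an association rule over baskets, where no target attribute exists; the second half fits a least-squares line to historical sales and forecasts the next month, where sales is the target. All data values and function names are hypothetical, chosen only for illustration.

```python
# Descriptive: no target attribute, only co-occurrence within baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)

# Predictive: the target (sales) is learned from historical (month, sales) pairs.
history = [(1, 10.0), (2, 12.0), (3, 14.0), (4, 16.0)]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(history)
forecast_month_5 = slope * 5 + intercept
```

The descriptive half only describes what the stored data looks like, while the predictive half produces a value for a month that is not yet in the data.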
Classification is widely used to solve many problems, especially business-related tasks. Classification means assigning items to groups based on common properties or characteristics of the group's elements, such as classifying prepaid electricity customers by their monthly purchases or classifying thermal power stations by fuel consumption. Classification is carried out through a series of data analysis steps that produce output in the form of divisions or classes, which can later be used to classify future data. This is the main difference between classification and clustering: clustering groups the data without predefined classes. There are many techniques that can be used for data classification, including statistical algorithms, neural networks, decision trees, genetic algorithms and nearest neighbor algorithms.
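As one example of the techniques listed, the following sketch applies a k-nearest-neighbor classifier to the prepaid electricity customer scenario. The training records, the two features (monthly purchase and number of top-ups), the class labels and k = 3 are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical labelled history: (monthly purchase in kWh, top-ups per month) -> class.
training = [
    ((50.0, 1), "low"),
    ((60.0, 2), "low"),
    ((70.0, 2), "low"),
    ((300.0, 6), "high"),
    ((320.0, 7), "high"),
    ((350.0, 8), "high"),
]

def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(sample, training, k=3):
    """Assign the majority class among the k training records nearest to sample."""
    nearest = sorted(training, key=lambda row: distance(sample, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

new_customer_class = classify((65.0, 2), training)
heavy_customer_class = classify((310.0, 6), training)
```

The classes are fixed in advance by the labelled history, which is what distinguishes this from clustering. In practice the features would usually be scaled first so that the larger-valued one does not dominate the distance.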