The extraordinary volume of data that is nowadays produced in different contexts and by various platforms and devices (such as social networks and multi-sensor systems) has given rise to the Big Data movement and the data economy. Big Data is one of the main pillars of the information society and a cornerstone for the next generation of smart systems, which act automatically and intelligently, optimizing productivity, business processes and managerial decisions. Despite significant advances in our ability to collect, store, manage and process data characterized by the 4Vs (Volume, Velocity, Variety and Veracity), Big Data’s business value still lies in the analytics. Raw data tends to be useless unless it is properly processed and transformed into actual insights for the business. Applications that help with the diagnosis of diseases, the forecasting of electricity demand, the prediction of a machine’s end of life, or the identification of the driving context are all based on processing large volumes of data and deriving knowledge from them.
Big Data Analytics and Knowledge Extraction
There are many different ways and techniques for extracting knowledge from raw Big Data. In most cases, data scientists employ statistics for testing knowledge-related hypotheses and machine learning as a means of building a high-performance software agent that is able to learn from the data. As part of the data mining and knowledge discovery process, scientists combine statistics and machine learning in a way that integrates theory and heuristics. Furthermore, they undertake other prerequisite activities for extracting and using knowledge, such as data cleaning and data transformation, as well as visualization of the extracted knowledge.
A data mining and knowledge extraction process may have different targets depending on the business problem at hand. Some of the most common tasks include:
- Classification, which aims at predicting the class of an item (e.g., automatically predicting whether a loan application will be accepted or rejected).
- Clustering, which automatically groups the items of a dataset into distinct categories (e.g., clustering customers into different market segments).
- Association, which identifies relations between two or more variables in a large dataset (e.g., identifying whether a customer who buys a shirt and a pair of trousers is likely to buy a jacket as well).
- Summarization, which summarizes the properties of a dataset (e.g., automatically summarizing a document containing natural language).
- Deviation Detection, which is about identifying events, observations or items that do not follow a specific pattern.
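As an illustration of the association task above, the following minimal sketch measures how strongly "shirt and trousers" implies "jacket" using the two standard association-rule metrics, support and confidence. The transactions, item names and helper functions are hypothetical and chosen only for this example; real datasets would be far larger and would typically be mined with a dedicated algorithm.

```python
# Hypothetical toy transactions; each set is one customer's basket.
transactions = [
    {"shirt", "trousers", "jacket"},
    {"shirt", "trousers"},
    {"shirt", "trousers", "jacket", "belt"},
    {"shirt", "belt"},
    {"trousers", "jacket"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated share of baskets with the antecedent that also hold the consequent.

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"shirt", "trousers", "jacket"}, transactions)
rule_confidence = confidence({"shirt", "trousers"}, {"jacket"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data, two of the five baskets contain all three items, and two of the three baskets with a shirt and trousers also contain a jacket. Whether such a rule is worth acting on depends on thresholds set by the business problem.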
For each of the above tasks, data mining experts have at their disposal several tools and techniques (e.g., decision trees, Bayesian methods, linear regression, k-means clustering), which need to be appropriately configured and parameterized according to the business requirements at hand. The identification, validation and ultimate deployment of an optimal data mining model involves a series of tasks, which are carried out in an iterative fashion.
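To make the point about configuration and parameterization concrete, here is a minimal sketch of k-means clustering on one-dimensional data, where the number of clusters `k` is the parameter to be chosen. The spend values, the deterministic initialization and the function names are assumptions made for this example; production work would normally rely on an established library rather than a hand-written loop.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain k-means on a list of numbers; returns the sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic initialization (an assumption made for reproducibility):
    # pick k evenly spaced points from the sorted data.
    step = (len(ordered) - 1) / max(k - 1, 1)
    centers = [ordered[round(i * step)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def inertia(values, centers):
    """Sum of squared distances from each value to its nearest center."""
    return sum(min((v - c) ** 2 for c in centers) for v in values)


# Hypothetical monthly spend per customer.
monthly_spend = [12, 15, 14, 80, 85, 82, 300, 310]

for k in (2, 3):
    centers = kmeans_1d(monthly_spend, k)
    print(k, [round(c, 1) for c in centers], round(inertia(monthly_spend, centers), 1))
```

Comparing the inertia for different values of `k` is one simple way to parameterize the model: a lower value means tighter clusters, although the final choice should also reflect how many customer segments the business can actually act on.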
The iterative work of identifying, validating and deploying a model is organized into phases by a data mining process, as described next.
The CRISP-DM Data Mining Process
Data mining processes analyze the datasets and evaluate alternative data mining models as a means of identifying and selecting the most suitable ones for deployment. The most widely used data mining process is the CRISP-DM (Cross Industry Standard Process for Data Mining), which comprises the following six phases:
- Understanding the business question: This initial phase is about understanding the purpose and the scope of the data mining process. It identifies the requirements and the ultimate target of the knowledge process, such as what has to be classified or predicted, with what speed and on the basis of what accuracy. Moreover, this phase creates an initial plan for dealing with the problem given the available datasets, including a list of methods to be explored.
- Data understanding: This phase focuses on the collection and inspection of the available datasets. It aims at identifying data quality problems, while at the same time gaining some insights into the methods that are likely to be effective. These insights are based on the experience of the data scientists, who are in most cases able to identify the main properties of the available datasets by simply reviewing them.
- Data preparation: As part of this phase, the datasets are prepared to be used as input to a data mining model. This preparation process entails different transformation steps, such as filtering out fields that are not useful, homogenizing data formats, and cleaning the data from empty or incomplete attributes.
- Data modeling: This phase focuses on the selection and calibration of the data mining models that will be used for the target problem (e.g., decision trees or linear regression models). This phase is very closely tied to the data preparation activities, given that different models may require different input datasets.
- Data model evaluation: During this phase, the candidate data models are evaluated in terms of their ability to solve the problem at hand. A successful data model should be able to solve the target business problem (e.g., classifying a customer into the proper segment), while at the same time respecting non-functional constraints (e.g., performance). In cases where the business requirements cannot be met, the first phase of CRISP-DM is revisited in order to (re)formulate the business problem.
- Deployment: This is the final phase of the CRISP-DM process. It entails the actual deployment of the successful machine learning models. As part of this phase, the developed system needs to be integrated into its operational environment (e.g., with other business information systems), while the extracted knowledge needs to be appropriately visualized.
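The data preparation, modeling and evaluation phases above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a reference implementation of CRISP-DM: the loan records, the single-threshold model, the function names and the `REQUIRED_ACCURACY` target are all hypothetical, and the model is scored on its own training data only to keep the example short.

```python
# Hypothetical raw loan records: income (in thousands) and the decision taken.
raw_records = [
    {"income": "20", "accepted": "no"},
    {"income": "35", "accepted": "no"},
    {"income": "", "accepted": "yes"},    # incomplete: dropped in preparation
    {"income": "50", "accepted": "yes"},
    {"income": "65", "accepted": "YES"},  # inconsistent format: normalized
    {"income": "80", "accepted": "yes"},
    {"income": "30", "accepted": None},   # incomplete: dropped in preparation
]


def prepare(records):
    """Data preparation: drop incomplete rows and homogenize formats."""
    rows = []
    for record in records:
        if not record.get("income") or not record.get("accepted"):
            continue
        label = record["accepted"].strip().lower() == "yes"
        rows.append((float(record["income"]), label))
    return rows


def accuracy(threshold, rows):
    """Evaluation: share of rows where 'income >= threshold' matches the label."""
    return sum((income >= threshold) == label for income, label in rows) / len(rows)


def fit_threshold(rows):
    """Modeling: pick the income threshold with the best accuracy on `rows`."""
    candidates = sorted({income for income, _ in rows})
    return max(candidates, key=lambda t: accuracy(t, rows))


REQUIRED_ACCURACY = 0.9  # hypothetical business requirement

rows = prepare(raw_records)
threshold = fit_threshold(rows)
score = accuracy(threshold, rows)
decision = "deploy" if score >= REQUIRED_ACCURACY else "revisit the business question"
print(threshold, score, decision)
```

The final line mirrors the iterative nature of the process: if the evaluated model does not meet the requirement, the work returns to the first phase instead of moving on to deployment. A realistic evaluation would score the model on held-out data rather than on the rows used to fit it.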
Other Data Mining Methodologies
CRISP-DM is not the sole data mining methodology in use. Other popular methodologies include KDD (Knowledge Discovery in Databases) and SEMMA (Sample, Explore, Modify, Model, and Assess). These methodologies comprise slightly different phases and activities when compared to CRISP-DM. However, they share two key characteristics with it:
- They are iterative, since their phases can be repeated until results that meet the business requirements are produced.
- They are sector-agnostic, as they can be applied to knowledge extraction regardless of the application domain of the business problem at hand.
Moreover, they comprise similar phases. For example, KDD includes the selection, pre-processing, data transformation, data mining, and interpretation-evaluation phases, while SEMMA comprises the sampling, exploration, modification, modeling and assessment phases. While there is no one-to-one mapping between these phases and those of CRISP-DM, their names indicate a clear correspondence to the structure of the CRISP-DM data mining process.
In the Big Data era, it is very important to employ experts who have a very good understanding of data mining processes, as the business value of Big Data lies mainly in the analytics. Given the proclaimed talent gap in Big Data experts, it is often a good idea to look for reliable and knowledgeable business partners that can help you derive knowledge and maximize the value of your data.