Adaptations of data mining methodologies: a systematic literature review

Associated data.

The following information was supplied regarding data availability:

The SLR protocol (also shared via an online repository) and the corpus with definitions and mappings are provided as a Supplemental File.

The use of end-to-end data mining methodologies such as CRISP-DM, KDD process, and SEMMA has grown substantially over the past decade. However, little is known as to how these methodologies are used in practice. In particular, the question of whether data mining methodologies are used ‘as-is’ or adapted for specific purposes has not been thoroughly investigated. This article addresses this gap via a systematic literature review focused on the context in which data mining methodologies are used and the adaptations they undergo. The literature review covers 207 peer-reviewed and ‘grey’ publications. We find that data mining methodologies are primarily applied ‘as-is’. At the same time, we also identify various adaptations of data mining methodologies and we note that their number is growing rapidly. The dominant adaptation pattern is related to methodology adjustments at a granular level (modifications), followed by extensions of existing methodologies with additional elements. Further, we identify two recurrent purposes for adaptation: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). The study suggests that standard data mining methodologies do not pay sufficient attention to deployment issues, which play a prominent role when turning data mining models into software products that are integrated into the IT architectures and business processes of organizations. We conclude that refinements of existing methodologies aimed at combining data, technological, and organizational aspects could help to mitigate these gaps.

Introduction

The availability of Big Data has stimulated widespread adoption of data mining and data analytics in research and in business settings (Columbus, 2017). Over the years, a number of data mining methodologies have been proposed, and these are being used extensively in practice and in research. However, little is known about which data mining methodologies are applied and how, and the question has been neither widely researched nor discussed. Further, there is no consolidated view on what constitutes quality of the methodological process in data mining and data analytics, how data mining and data analytics are applied in organizational settings, and how application practices relate to each other. This motivates the need for a comprehensive survey of the field.

There have been surveys, quasi-surveys, and summaries conducted in related fields. Notably, there have been two systematic literature reviews (SLRs); the SLR is the most suitable and widely used research method for identifying, evaluating, and interpreting research on a particular research question, topic, or phenomenon (Kitchenham, Budgen & Brereton, 2015). These reviews concerned Big Data Analytics, but not general-purpose data mining methodologies. Adrian et al. (2004) executed an SLR on the implementation of Big Data Analytics (BDA), specifically on the capability components necessary for BDA value discovery and realization. The authors identified BDA implementation studies, determined their main focus areas, and discussed BDA applications and capability components in detail. Saltz & Shamshurin (2016) published an SLR on Big Data team process methodologies. They identified a lack of standards for how Big Data projects are executed, and highlighted growing research in this area and the potential benefits of such a process standard. Additionally, they synthesized a list of the 33 most important success factors for executing Big Data activities. Finally, there are studies that surveyed data mining techniques and applications across domains, yet they focus on data mining process artifacts and outcomes (Madni, Anwar & Shah, 2017; Liao, Chu & Hsiao, 2012), not on end-to-end process methodology.

A number of surveys have been conducted in domain-specific settings such as hospitality, accounting, education, manufacturing, and banking. Mariani et al. (2018) conducted an SLR on Business Intelligence (BI) and Big Data in the hospitality and tourism context. Amani & Fadlalla (2017) explored the application of data mining methods in accounting, while Romero & Ventura (2013) investigated educational data mining. Similarly, Hassani, Huang & Silva (2018) addressed data mining application case studies in banking and explored them along three dimensions: topics, applied techniques, and software. All of these studies were performed by means of systematic literature reviews. Lastly, Bi & Cochran (2014) undertook a standard literature review of Big Data Analytics and its applications in manufacturing.

Apart from domain-specific studies, there have been very few general-purpose surveys that give a comprehensive overview of existing data mining methodologies, classifying and contextualizing them. A valuable synthesis was presented by Kurgan & Musilek (2006) as a comparative study of the state of the art of data mining methodologies. The study was not an SLR, and focused on a comprehensive comparison of the phases, processes, and activities of data mining methodologies; the application aspect was summarized briefly as application statistics by industry and by citations. Three more comparative, non-SLR studies were undertaken by Marban, Mariscal & Segovia (2009), Mariscal, Marbán & Fernández (2010), and, most recently and most closely related, Martínez-Plumed et al. (2017). They followed the same pattern, systematizing existing data mining frameworks based on comparative analysis. There, the purpose and context of consolidation was even more practical: to support the derivation and proposal of a new artifact, that is, a novel data mining methodology. The majority of these general surveys are more than a decade old, and have natural limitations because they are (1) non-SLR studies, and (2) so far restricted to comparing methodologies in terms of phases, activities, and other elements.

The key common characteristic of all these studies is that data mining methodologies are treated as normative and standardized (‘one-size-fits-all’) processes. A complementary perspective, not considered in the above studies, is that data mining methodologies are not normative standardized processes, but frameworks that need to be specialized to different industry domains, organizational contexts, and business objectives. In the last few years, a number of extensions and adaptations of data mining methodologies have emerged, which suggests that existing methodologies are not sufficient to cover the needs of all application domains. In particular, extensions of data mining methodologies have been proposed in the medical domain (Niaksu, 2015), the educational domain (Tavares, Vieira & Pedro, 2017), the industrial engineering domain (Huber et al., 2019; Solarte, 2002), and software engineering (Marbán et al., 2007, 2009). However, little attention has been given to studying how data mining methodologies are applied and used in industry settings; so far, only non-scientific practitioners’ surveys provide such evidence.

Given this research gap, the central objective of this article is to investigate how data mining methodologies are applied by researchers and practitioners, both in their generic (standardized) form and in specialized settings. This is achieved by investigating if data mining methodologies are applied ‘as-is’ or adapted, and for what purposes such adaptations are implemented.

Guided by the Systematic Literature Review method, we first identified a corpus of primary studies covering both peer-reviewed and ‘grey’ literature from 1997 to 2018. An analysis of these studies led us to a taxonomy of uses of data mining methodologies, focusing on the distinction between ‘as-is’ usage and various types of methodology adaptations. By analyzing different types of methodology adaptations, this article identifies potential gaps in standard data mining methodologies at both the technological and the organizational levels.

The rest of the article is organized as follows. The Background section provides an overview of key concepts of data mining and associated methodologies. Next, Research Design describes the research methodology. The Findings and Discussion section presents the study results and their associated interpretation. Finally, threats to validity are addressed in Threats to Validity while the Conclusion summarizes the findings and outlines directions for future work.

Background

This section introduces the main data mining concepts and provides an overview of existing data mining methodologies and their evolution.

Data mining is defined as a set of rules, processes, and algorithms that are designed to generate actionable insights, extract patterns, and identify relationships from large datasets (Morabito, 2016). Data mining incorporates automated data extraction, processing, and modeling by means of a range of methods and techniques. In contrast, data analytics refers to techniques used to analyze and acquire intelligence from data (including ‘big data’) (Gandomi & Haider, 2015) and is positioned as a broader field, encompassing a wider spectrum of methods that includes both statistical and data mining methods (Chen, Chiang & Storey, 2012). A number of algorithms have been developed in the statistics, machine learning, and artificial intelligence domains to support and enable data mining. While statistical approaches precede the others, they inherently come with limitations, the best known being rigid conditions on data distribution. Machine learning techniques gained popularity because they impose fewer restrictions while deriving understandable patterns from data (Bose & Mahapatra, 2001).

Data mining projects commonly follow a structured process or methodology as exemplified by Mariscal, Marbán & Fernández (2010) , Marban, Mariscal & Segovia (2009) . A data mining methodology specifies tasks, inputs, outputs, and provides guidelines and instructions on how the tasks are to be executed ( Mariscal, Marbán & Fernández, 2010 ). Thus, data mining methodology provides a set of guidelines for executing a set of tasks to achieve the objectives of a data mining project ( Mariscal, Marbán & Fernández, 2010 ).

The foundations of structured data mining methodologies were first proposed by Fayyad, Piatetsky-Shapiro & Smyth (1996a, 1996b, 1996c), and were initially related to Knowledge Discovery in Databases (KDD). KDD presents a conceptual process model of computational theories and tools that support the extraction of information (knowledge) from data (Fayyad, Piatetsky-Shapiro & Smyth, 1996a). In KDD, the overall approach to knowledge discovery includes data mining as a specific step. As such, KDD, with its nine main steps (exhibited in Fig. 1), has the advantage of considering data storage and access, algorithm scaling, interpretation and visualization of results, and human-computer interaction (Fayyad, Piatetsky-Shapiro & Smyth, 1996a, 1996c). The introduction of KDD also formalized a clearer distinction between data mining and data analytics, as formulated, for example, in Tsai et al. (2015): “…by the data analytics, we mean the whole KDD process, while by the data analysis, we mean the part of data analytics that is aimed at finding the hidden information in the data, such as data mining”.

Figure 1 (image file: peerj-cs-06-267-g001.jpg)

The main steps of KDD are as follows:

Step 1: Learning application domain: The first step is to develop an understanding of the application domain and relevant prior knowledge, and then to identify the goal of the KDD process from the customer’s viewpoint.
Step 2: Dataset creation: The second step involves selecting a dataset, or focusing on a subset of variables or data samples, on which discovery is to be performed.
Step 3: Data cleaning and processing: In the third step, basic operations to remove noise or outliers are performed. This step also covers collecting the information needed to model or account for noise, deciding on strategies for handling missing data fields, and accounting for data types, schema, and the mapping of missing and unknown values.
Step 4: Data reduction and projection: Here, useful features to represent the data are identified, depending on the goal of the task, and transformation methods are applied to find an optimal feature set for the data.
Step 5: Choosing the function of data mining: In the fifth step, the target outcome (e.g., summarization, classification, regression, clustering) is defined.
Step 6: Choosing data mining algorithm: The sixth step concerns selecting the method(s) to search for patterns in the data, deciding which models and parameters are appropriate, and matching a particular data mining method with the overall criteria of the KDD process.
Step 7: Data mining: In the seventh step, the data is mined, that is, searched for patterns of interest in a particular representational form or a set of such representations: classification rules or trees, regression, clustering.
Step 8: Interpretation: In this step, redundant and irrelevant patterns are filtered out, and relevant patterns are interpreted and visualized in such a way as to make the result understandable to the users.
Step 9: Using discovered knowledge: In the last step, the results are incorporated into the performance system, documented and reported to stakeholders, and used as a basis for decisions.
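
To make the sequential character of the nine steps listed above more concrete, the following minimal Python sketch runs them once, in order, over a toy dataset. It is only an illustration under stated assumptions: the function names (run_kdd, clean, mine), the toy data, and the use of a simple mean as the mined "pattern" are hypothetical and are not part of the KDD specification by Fayyad, Piatetsky-Shapiro & Smyth.

```python
from statistics import mean
from typing import Callable, Dict, List

# Step names as listed in the text above.
KDD_STEPS: List[str] = [
    "Learning application domain",
    "Dataset creation",
    "Data cleaning and processing",
    "Data reduction and projection",
    "Choosing the function of data mining",
    "Choosing data mining algorithm",
    "Data mining",
    "Interpretation",
    "Using discovered knowledge",
]


def run_kdd(state: dict, handlers: Dict[str, Callable[[dict], dict]]) -> dict:
    """Run the nine steps once, strictly in order (no feedback loops).

    Steps without a handler are treated as no-ops in this sketch.
    """
    trace = []
    for number, name in enumerate(KDD_STEPS, start=1):
        handler = handlers.get(name)
        if handler is not None:
            state = handler(state)
        trace.append(f"Step {number}: {name}")
    return {**state, "trace": trace}


def clean(state: dict) -> dict:
    # Step 3 (toy version, an assumption): drop missing values.
    return {**state, "data": [x for x in state["data"] if x is not None]}


def mine(state: dict) -> dict:
    # Step 7 (toy version, an assumption): the "pattern" is just the mean.
    return {**state, "pattern": mean(state["data"])}


result = run_kdd(
    {"data": [4.0, None, 6.0, 8.0]},
    {"Data cleaning and processing": clean, "Data mining": mine},
)
print(result["pattern"])     # mean of the cleaned toy data
print(len(result["trace"]))  # number of steps executed
```

A single forward pass like this reflects only the sequential reading of KDD; later process models discussed in this section add iteration and feedback.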

The KDD process became dominant in industrial and academic domains (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Also, as the timeline-based evolution of data mining methodologies and process models shows (Fig. 2 below), the original KDD data mining model served as the basis for other methodologies and process models, which addressed various gaps and deficiencies of the original KDD process. These approaches extended the initial KDD framework, yet the degree of extension varied, ranging from process restructuring to a complete change in focus. For example, Brachman & Anand (1996) and later Gertosio & Dussauchoy (2004) (in the form of a case study) introduced practical adjustments to the process based on its iterative nature as well as interactivity. In their view, the complete KDD process was enhanced with supplementary tasks and the focus was changed to the user’s point of view (human-centered approach), highlighting decisions that need to be made by the user in the course of the data mining process. In contrast, Cabena et al. (1997) proposed a different number of steps, emphasizing and detailing data processing and discovery tasks. Similarly, in a series of works, Anand & Büchner (1998), Anand et al. (1998), and Buchner et al. (1999) presented additional data mining process steps by concentrating on the adaptation of the data mining process to practical settings. They focused on cross-sales (entire life-cycles of online customers), with further incorporation of an internet data discovery process (web-based mining). Further, the Two Crows data mining process model is a consultancy-originated framework that defines the steps differently, but is still close to the original KDD. Finally, SEMMA (Sample, Explore, Modify, Model and Assess), based on KDD, was developed by SAS Institute in 2005 (SAS Institute Inc., 2017). It is defined as a logical organization of the functional toolset of SAS Enterprise Miner for carrying out the core tasks of data mining. Compared to KDD, SEMMA is a vendor-specific process model, which limits its application in different environments. Also, it skips two steps of the original KDD process (‘Learning Application Domain’ and ‘Using of Discovered Knowledge’), which are regarded as essential for the success of a data mining project (Mariscal, Marbán & Fernández, 2010). In terms of adoption, new KDD-based proposals received limited attention across academia and industry (Kurgan & Musilek, 2006; Marban, Mariscal & Segovia, 2009). Subsequently, most of these methodologies converged into the CRISP-DM methodology.

Figure 2 (image file: peerj-cs-06-267-g002.jpg)

Additionally, only two non-KDD-based approaches have been proposed alongside the extensions to KDD. The first is the 5A’s approach presented by De Pisón Ascacbar (2003) and used by the vendor SPSS. The key contribution of this approach was the addition of an ‘Automate’ step, while its disadvantage was the omission of a ‘Data Understanding’ step. The second approach is 6-Sigma, an industry-originated method to improve quality and customer satisfaction (Pyzdek & Keller, 2003). It has been successfully applied to data mining projects in conjunction with the DMAIC performance improvement model (Define, Measure, Analyze, Improve, Control).

In 2000, in response to common issues and needs (Marban, Mariscal & Segovia, 2009), an industry-driven methodology called the Cross-Industry Standard Process for Data Mining (CRISP-DM) was introduced as an alternative to KDD. It also consolidated the original KDD model and its various extensions. While CRISP-DM builds upon KDD, it consists of six phases that are executed in iterations (Marban, Mariscal & Segovia, 2009). The iterative execution of CRISP-DM is its most distinguishing feature compared to the initial KDD, which assumes a sequential execution of its steps. CRISP-DM, much like KDD, aims at providing practitioners with guidelines to perform data mining on large datasets. However, CRISP-DM, with its six main phases and a total of 24 tasks and outputs, is more refined than KDD. The main phases of CRISP-DM, as depicted in Fig. 3 below, are as follows:

Phase 1: Business understanding: The focus of the first step is to gain an understanding of the project objectives and requirements from a business perspective and then to convert these into a data mining problem definition. A preliminary plan to achieve the objectives is also prepared in this first step.
Phase 2: Data understanding: This step begins with an initial data collection and proceeds with activities in order to get familiar with the data, identify data quality issues, discover first insights into the data, and potentially detect and form hypotheses.
Phase 3: Data preparation: The third step covers activities required to construct the final dataset from the initial raw data. Data preparation tasks are performed repeatedly.
Phase 4: Modeling phase: In this step, various modeling techniques are selected and applied followed by calibrating their parameters. Typically, several techniques are used for the same data mining problem.
Phase 5: Evaluation of the model(s): The fifth step begins with the quality perspective and then, before proceeding to final model deployment, ascertains that the model(s) achieves the business objectives. At the end of this phase, a decision should be reached on how to use data mining results.
Phase 6: Deployment phase: In the final step, the models are deployed to enable end-customers to use the data as basis for decisions, or support in the business process. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized, presented, distributed in a way that the end-user can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process.
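
The iterative execution that distinguishes CRISP-DM from the sequential reading of KDD can be sketched as a loop in which the evaluation phase gates deployment. The sketch below is a deliberate simplification and rests on assumptions: the function run_crisp_dm, the meets_objectives callback, and the max_iterations limit are hypothetical names, and the only feedback path modelled is from evaluation back to business understanding.

```python
from typing import Callable, List

# Phase names as listed in the text above.
CRISP_DM_PHASES: List[str] = [
    "Business understanding",
    "Data understanding",
    "Data preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]


def run_crisp_dm(
    meets_objectives: Callable[[int], bool],
    max_iterations: int = 5,
) -> List[str]:
    """Repeat phases 1-5 until evaluation confirms the business objectives.

    `meets_objectives` is a hypothetical callback standing in for the
    evaluation decision; it receives the iteration number.
    """
    visited: List[str] = []
    for iteration in range(1, max_iterations + 1):
        # Phases 1 to 5 are executed in every iteration.
        for phase in CRISP_DM_PHASES[:-1]:
            visited.append(f"iteration {iteration}: {phase}")
        # Deployment is reached only once evaluation is satisfied.
        if meets_objectives(iteration):
            visited.append(f"iteration {iteration}: {CRISP_DM_PHASES[-1]}")
            return visited
    raise RuntimeError("business objectives not met within max_iterations")


# Toy run: the (assumed) evaluation passes on the second iteration.
history = run_crisp_dm(lambda iteration: iteration >= 2)
for entry in history:
    print(entry)
```

In this toy run the first five phases are visited twice before deployment is reached, which is the behavior a purely sequential process model cannot express.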

Figure 3 (image file: peerj-cs-06-267-g003.jpg)

The development of CRISP-DM was led by an industry consortium. It is designed to be domain-agnostic (Mariscal, Marbán & Fernández, 2010) and, as such, is now widely used by industry and research communities (Marban, Mariscal & Segovia, 2009). These distinctive characteristics have led CRISP-DM to be considered the ‘de-facto’ standard data mining methodology and a reference framework against which other methodologies are benchmarked (Mariscal, Marbán & Fernández, 2010).

Similarly to KDD, a number of refinements and extensions of the CRISP-DM methodology have been proposed, in two main directions: extensions of the process model itself, and adaptations or mergers with process models and methodologies from other domains. Extensions of the process model are exemplified by Cios & Kurgan (2005), who proposed an integrated Data Mining & Knowledge Discovery (DMKD) process model. It contains several explicit feedback mechanisms, modifies the last step to incorporate the application of discovered knowledge and insights, and relies on technologies for deploying results. In the same vein, Moyle & Jorge (2001) and Blockeel & Moyle (2002) proposed the Rapid Collaborative Data Mining System (RAMSYS) framework, which is both a data mining methodology and a system for remote collaborative data mining projects. RAMSYS attempted to combine a problem-solving methodology, knowledge sharing, and ease of communication. It was intended to allow remotely located data miners to work collaboratively in a disciplined manner as regards information flow, while allowing the free flow of ideas for problem solving (Moyle & Jorge, 2001). CRISP-DM modifications and integrations with other specific domains were proposed in Industrial Engineering (Data Mining for Industrial Engineering by Solarte (2002)) and in Software Engineering by Marbán et al. (2007, 2009). Both approaches enhanced CRISP-DM with additional phases, activities, and tasks typical of engineering processes, addressing ongoing support (Solarte, 2002) as well as project management, organizational, and quality assurance tasks (Marbán et al., 2009).

Finally, a limited number of attempts to create independent or semi-dependent data mining frameworks were undertaken after the creation of CRISP-DM. These efforts were driven by industry players and comprised the KDD Roadmap by Debuse et al. (2001) for a proprietary predictive toolkit (Lanner Group), and a recent effort by IBM, the Analytics Solutions Unified Method for Data Mining (ASUM-DM), in 2015 (IBM Corporation, 2016: https://developer.ibm.com/technologies/artificial-intelligence/articles/architectural-thinking-in-the-wild-west-of-data-science/). Both frameworks contributed additional tasks or principles: for example, resourcing in the KDD Roadmap, and a hybrid approach in ASUM-DM that combines agile and traditional implementation principles.

Table 1 summarizes the reviewed data mining process models and methodologies by their origin, basis, and key concepts.

Name	Origin	Basis	Key concept	Year
Human-Centered	Academy	KDD	Iterative process and interactivity (user’s point of view and needed decisions)	1996, 2004
Cabena et al.	Academy	KDD	Focus on data processing and discovery tasks	1997
Anand and Buchner	Academy	KDD	Supplementary steps and integration of web-mining	1998, 1999
Two Crows	Industry	KDD	Modified definitions of steps	1998
SEMMA	Industry	KDD	Tool-specific (SAS Institute), elimination of some steps	2005
5 A’s	Industry	Independent	Supplementary steps	2003
6 Sigmas	Industry	Independent	Six Sigma quality improvement paradigm in conjunction with DMAIC performance improvement model	2003
CRISP-DM	Joint industry and academy	KDD	Iterative execution of steps, significant refinements to tasks and outputs	2000
Cios et al.	Academy	CRISP-DM	Integration of data mining and knowledge discovery, feedback mechanisms, usage of received insights supported by technologies	2005
RAMSYS	Academy	CRISP-DM	Integration of collaborative work aspects	2001–2002
DMIE	Academy	CRISP-DM	Integration and adaptation to Industrial Engineering domain	2001
Marban	Academy	CRISP-DM	Integration and adaptation to Software Engineering domain	2007
KDD roadmap	Joint industry and academy	Independent	Tool-specific, resourcing task	2001
ASUM	Industry	CRISP-DM	Tool-specific, combination of traditional CRISP-DM and agile implementation approach	2015
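
As a small aid for reading Table 1, the rows can be transcribed into records and tallied by basis and by origin. The sketch below only restates what the table contains; the variable names are hypothetical, the key-concept and year columns are omitted for brevity, and the spelling "CRISP-DM" is normalized.

```python
from collections import Counter

# (name, origin, basis) transcribed from Table 1; other columns omitted.
TABLE_1 = [
    ("Human-Centered", "Academy", "KDD"),
    ("Cabena et al.", "Academy", "KDD"),
    ("Anand and Buchner", "Academy", "KDD"),
    ("Two Crows", "Industry", "KDD"),
    ("SEMMA", "Industry", "KDD"),
    ("5 A's", "Industry", "Independent"),
    ("6 Sigmas", "Industry", "Independent"),
    ("CRISP-DM", "Joint industry and academy", "KDD"),
    ("Cios et al.", "Academy", "CRISP-DM"),
    ("RAMSYS", "Academy", "CRISP-DM"),
    ("DMIE", "Academy", "CRISP-DM"),
    ("Marban", "Academy", "CRISP-DM"),
    ("KDD roadmap", "Joint industry and academy", "Independent"),
    ("ASUM", "Industry", "CRISP-DM"),
]

# Count how many process models build on each basis and come from each origin.
by_basis = Counter(basis for _, _, basis in TABLE_1)
by_origin = Counter(origin for _, origin, _ in TABLE_1)

print(dict(by_basis))
print(dict(by_origin))
```

The tallies make the lineage described in the text visible at a glance: most entries derive from either KDD or CRISP-DM, with only a few independent proposals.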

Research Design

The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use the systematic literature review (SLR) as our scientific method, for two reasons. Firstly, a systematic review is based on a trustworthy, rigorous, and auditable methodology. Secondly, an SLR supports structured synthesis of existing evidence and identification of research gaps, and provides a framework in which to position new research activities (Kitchenham, Budgen & Brereton, 2015). For our SLR, we followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015). All SLR details have been documented in a separate, peer-reviewed SLR protocol (available at https://figshare.com/articles/Systematic-Literature-Review-Protocol/10315961 ).

Research questions

As suggested by Kitchenham, Budgen & Brereton (2015), we formulated research questions and motivate them as follows. In the preliminary phase of the research, we discovered a very limited number of studies investigating the application practices of data mining methodologies as such. Further, we discovered a number of surveys conducted in domain-specific settings, and very few general-purpose surveys, but none of them considered application practices either. As a contrasting trend, the recent emergence of a limited number of adaptation studies has clearly pinpointed the research gap in the area of application practices. Given this research gap, an in-depth investigation of this phenomenon led us to ask: “How are data mining methodologies applied (‘as-is’ vs adapted)?” (RQ1). Further, as we intended to investigate the universe of adaptation scenarios in depth, this naturally led us to RQ2: “How have existing data mining methodologies been adapted?” Finally, if adaptations are made, we wish to explore the associated reasons and purposes, which in turn led us to RQ3: “For what purposes are data mining methodologies adapted?”

Thus, for this review, there are three research questions defined:

Research Question 1: How are data mining methodologies applied (‘as-is’ versus adapted)? This question aims to identify application and usage patterns and trends for data mining methodologies.
Research Question 2: How have existing data mining methodologies been adapted? This question aims to identify and classify adaptation patterns and scenarios for data mining methodologies.
Research Question 3: For what purposes have existing data mining methodologies been adapted? This question aims to identify, explain, and classify the reasons for adapting existing data mining methodologies and the benefits achieved by doing so. Specifically, it asks what gaps these adaptations seek to fill and what their benefits have been. Such systematic evidence and insights will be valuable input to a potentially new, refined data mining methodology, and will be of interest to practitioners and researchers.

Data collection strategy

Our data collection and search strategy followed the guidelines proposed by Kitchenham, Budgen & Brereton (2015) . It defined the scope of the search, selection of literature and electronic databases, search terms and strings as well as screening procedures.

Primary search

The primary search aimed to identify an initial set of papers. To this end, the search strings were derived from the research objective and research questions. The term ‘data mining’ was the key term, but we also included ‘data analytics’ to be consistent with observed research practices. The terms ‘methodology’ and ‘framework’ were also included. Thus, the following search strings were developed and validated in accordance with the guidelines suggested by Kitchenham, Budgen & Brereton (2015) :

(‘data mining methodology’) OR (‘data mining framework’) OR (‘data analytics methodology’) OR (‘data analytics framework’)
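
The search string above is the OR-combination of every pairing of a subject term (‘data mining’, ‘data analytics’) with an artifact term (‘methodology’, ‘framework’). A minimal sketch of how such a string could be generated is shown below; the function and variable names are hypothetical, straight quotes replace the typographic quotes used in the text, and each database's own query syntax (field codes, quoting rules) would still need to be applied when running the search.

```python
from itertools import product
from typing import Iterable

SUBJECT_TERMS = ["data mining", "data analytics"]
ARTIFACT_TERMS = ["methodology", "framework"]


def build_search_string(subjects: Iterable[str], artifacts: Iterable[str]) -> str:
    """Join every subject/artifact phrase with OR, one quoted phrase per group."""
    phrases = [f"{subject} {artifact}" for subject, artifact in product(subjects, artifacts)]
    return " OR ".join(f"('{phrase}')" for phrase in phrases)


query = build_search_string(SUBJECT_TERMS, ARTIFACT_TERMS)
print(query)
```

Generating the string from two term lists keeps the phrase combinations complete and makes it easy to see the effect of adding or removing a term during piloting.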

The search strings were applied to the indexed scientific databases Scopus and Web of Science (for peer-reviewed, academic literature) and to the non-indexed Google Scholar (for non-peer-reviewed, so-called ‘grey’ literature). The decision to cover ‘grey’ literature in this research was motivated as follows. As proposed in a number of publications in the information systems and software engineering domains (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019), an SLR as a stand-alone method may not provide sufficient insight into the ‘state of practice’. It has also been found (Garousi, Felderer & Mäntylä, 2016) that ‘grey’ literature can give substantial benefits in certain areas of software engineering, in particular when the topic of research is related to industrial and practical settings. Taking into consideration the research objective, which is to investigate the application practices of data mining methodologies, we opted to include elements of Multivocal Literature Review (MLR) 1 in our study. Also, Kitchenham, Budgen & Brereton (2015) recommend including ‘grey’ literature to minimize publication bias, as positive results and research outcomes are more likely to be published than negative ones. Following MLR practices, we also designed inclusion criteria for types of ‘grey’ literature, reported below.

The selection of databases is motivated as follows. For peer-reviewed literature sources, we aimed to avoid potential omission bias. This bias is discussed in IS research (Levy & Ellis, 2006) as arising when research is concentrated in a limited set of disciplinary data sources. Thus, a broad selection of data sources was evaluated, including multidisciplinary (Scopus, Web of Science, Wiley Online Library) and domain-oriented (ACM Digital Library, IEEE Xplore Digital Library) scientific electronic databases. Multidisciplinary databases were selected because of their wider domain coverage, and it was validated and confirmed that they include publications originating from domain-oriented databases such as ACM and IEEE. Among the multidisciplinary databases, Scopus was selected for its widest possible coverage (it is the world’s largest database, covering approximately 80% of all international peer-reviewed journals), while Web of Science was selected for its longer temporal range. Thus, the two databases complement each other. The selected non-indexed source for ‘grey’ literature is Google Scholar, as it is a comprehensive source of both academic and ‘grey’ literature publications and is referred to as such extensively (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).

Further, Garousi, Felderer & Mäntylä (2019) presented a three-tier categorization framework for types of ‘grey’ literature. In our study, we restricted ourselves to 1st tier ‘grey’ literature publications from a limited number of ‘grey’ literature producers. In particular, from the list of producers (Neto et al., 2019), we adopted and focused on government departments and agencies; non-profit economic and trade organizations (‘think-tanks’) and professional associations; academic and research institutions; and businesses and corporations (consultancy companies and established private companies). The selected 1st tier ‘grey’ literature items include: (1) government, academic, and private sector consultancy reports 2 , (2) theses (not lower than Master level) and PhD dissertations, (3) research reports, (4) working papers, and (5) conference proceedings and preprints. By restricting inclusion to 1st tier ‘grey’ literature, we mitigate the quality assessment challenge that is especially relevant to, and reported for, this type of literature (Garousi, Felderer & Mäntylä, 2019; Neto et al., 2019).

Scope and domains inclusion

As recommended by Kitchenham, Budgen & Brereton (2015), it is necessary to first define the research scope. To clarify the scope, we defined what is out of scope of this research. The following aspects are not included in the scope of our study:

Context of technology and infrastructure for data mining/data analytics tasks and projects.
Application of granular methods in the data mining process itself or for data mining tasks, for example, constructing business queries or applying regression or neural network modeling techniques to solve classification problems. Studies with granular methods are included in the primary texts corpus as long as the method application is part of an overall methodological approach.
Technological aspects in data mining, for example, data engineering, dataflows and workflows.
Traditional statistical methods not associated with data mining directly including statistical control methods.

Similarly to Budgen et al. (2006) and Levy & Ellis (2006), initial piloting revealed that search engines retrieved literature from all major scientific domains, including ones outside the authors’ area of expertise (e.g., medicine). Even though such studies could be retrieved, it would be impossible for us to analyze and correctly interpret literature published outside our area of expertise. The search strategy was therefore adjusted by retaining domains closely associated with Information Systems and Software Engineering research. Thus, for the Scopus database the final set of included domains was limited to nine and included Computer Science, Engineering, Mathematics, Business, Management and Accounting, Decision Science, Economics, Econometrics and Finance, and Multidisciplinary as well as Undefined studies. Excluded domains covered 11.5%, or 106 out of 925 publications; it was confirmed in the validation process that they primarily focused on specific case studies in fundamental sciences and medicine. The included domains from the Scopus database were mapped to Web of Science to ensure a consistent approach across databases, and the correctness of the mapping was validated.

Screening criteria and procedures

Based on SLR practices (as in Kitchenham, Budgen & Brereton (2015) and Brereton et al. (2007)) and the defined SLR scope, we designed multi-step screening procedures (quality and relevancy) with an associated set of Screening Criteria and a Scoring System. The purpose of relevancy screening is to find relevant primary studies in an unbiased way (Vanwersch et al., 2011). Quality screening, on the other hand, aims to assess the relevant primary studies in terms of quality in an unbiased way.

Screening Criteria consisted of two subsets— Exclusion Criteria applied for initial filtering and Relevance Criteria , also known as Inclusion Criteria .

Exclusion Criteria were initial threshold quality controls aimed at eliminating studies with limited or no scientific contribution. The exclusion criteria also address issues of understandability, accessibility and availability. The Exclusion Criteria were as follows:

Quality 1: The publication item is not in English (understandability).
Quality 2: The publication item is a duplicate, which can occur when:
either the same document is retrieved from two or all three databases,
or different versions of the same publication are retrieved (i.e., the same study published in different sources); based on best practices, the decision rule is that the most recent paper is retained as well as the one with the highest score (Kofod-Petersen, 2014).
if a publication is published both as a conference proceeding and as a journal article with the same name and same authors, or as an extended version of a conference paper, the latter is selected.
Quality 3: The publication is shorter than 6 pages; short papers do not have the space to expand and discuss the presented ideas in sufficient depth for us to examine.
Quality 4: The paper is not accessible in full length online through either the university’s database subscriptions or Google Scholar; without the full text we cannot assess and analyze it.

The initially retrieved list of papers was filtered based on Exclusion Criteria . Only papers that passed all criteria were retained in the final studies corpus. Mapping of criteria towards screening steps is exhibited in Fig. 4 .
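The exclusion filtering described above can be illustrated with a minimal sketch. The `Publication` record, its field names, and the duplicate key (normalized title plus authors) are assumptions made for illustration; they are not taken from the SLR protocol, and the duplicate rule is simplified (it ignores quality scores and extended versions published under a different title).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    """Hypothetical record; field names are illustrative only."""
    title: str
    authors: tuple
    year: int
    language: str
    pages: int
    full_text_available: bool
    is_journal_version: bool = False


MIN_PAGES = 6  # Quality 3 threshold stated in the text


def deduplicate(pubs):
    """Quality 2 (simplified): keep one record per (title, authors) key.

    A journal version is preferred over a conference version;
    otherwise the most recent record is kept.
    """
    best = {}
    for pub in pubs:
        key = (pub.title.strip().lower(), pub.authors)
        current = best.get(key)
        rank = (pub.is_journal_version, pub.year)
        if current is None or rank > (current.is_journal_version, current.year):
            best[key] = pub
    return list(best.values())


def passes_exclusion(pub):
    """Quality 1, 3 and 4: language, minimum length, full-text access."""
    return (
        pub.language == "en"
        and pub.pages >= MIN_PAGES
        and pub.full_text_available
    )


def apply_exclusion_criteria(pubs):
    """Return only the publications that pass all exclusion criteria."""
    return [p for p in deduplicate(pubs) if passes_exclusion(p)]
```

A paper survives only if it passes every criterion, which mirrors the rule that only papers passing all criteria were retained.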

Relevance Criteria were designed to identify relevant publications and are presented in Table 2 below while mapping to respective process steps is presented in Fig. 4 . These criteria were applied iteratively.

Relevance criteria	Criteria definition	Criteria justification
Relevance 1	Is the study about data mining or data analytics approach and is within designated list of domains?	Exclude studies conducted outside the designated domain list. Exclude studies not directly describing and/or discussing data mining and data analytics
Relevance 2	Is the study introducing/describing data mining or data analytics methodology/framework or modifying existing approaches?	Exclude texts considering only specific, granular data mining and data analytics techniques, methods or traditional statistical methods. Exclude publications focusing on specific, granular data mining and data analytics process/sub-process aspects. Exclude texts where description and discussion of data mining methodologies or frameworks is manifestly missing

As a final SLR step, the full-text quality assessment was performed with constructed Scoring Metrics (in line with Kitchenham & Charters (2007)). They are presented in Table 3 below.

Score	Criteria definition
3	Data mining methodology or framework is presented in full. All steps described and explained, tests performed, results compared and evaluated. There is clear proposal on usage, application, deployment of solution in organization’s business process(es) and IT/IS system, and/or prototype or full solution implementation is discussed. Success factors described and presented
2	Data mining methodology or framework is presented, some process steps are missing, but they do not impact the holistic view and understanding of the performed work. Data mining process is clearly presented and described, tests performed, results compared and evaluated. There is proposal on usage, application, deployment of solution in organization’s business process(es) and IT/IS system(s)
1	Data mining methodology or framework is not presented in full, some key phases and process steps are missing. Publication focuses on one or some aspects (e.g., method, technique)
0	Data mining methodology or framework not presented as holistic approach, but on fragmented basis, study limited to some aspects (e.g., method or technique discussion, etc.)

Data extraction and screening process

The data extraction and screening process is presented in Fig. 4. In Step 1, initial publication lists were retrieved from the pre-defined databases: Scopus, Web of Science, and Google Scholar. The lists were merged and duplicates eliminated in Step 2. Afterwards, texts shorter than 6 pages were excluded (Step 3). Steps 1–3 were guided by Exclusion Criteria. In the next stage (Step 4), publications were screened by title based on the pre-defined Relevance Criteria. Those that passed were evaluated for availability (Step 5). If a study was available, it was evaluated again by the same Relevance Criteria applied to the Abstract, Conclusion and, if necessary, Introduction (Step 6). Those that passed this threshold formed the primary publications corpus, which was extracted from the databases in full. These primary texts were evaluated again based on the full text (Step 7), applying Relevance Criteria first and then Scoring Metrics.

Results and quantitative analysis

In Step 1, 1,715 publications were extracted from the relevant databases with the following composition: Scopus (819), Web of Science (489), Google Scholar (407). In terms of scientific publication domains, Computer Science (42.4%), Engineering (20.6%), and Mathematics (11.1%) accounted for approximately 74% of Scopus-originated texts. The same applies to the Web of Science harvest. Application of the Exclusion Criteria produced the following results. In Step 2, after eliminating duplicates, 1,186 texts were passed on for minimum length evaluation, and 767 reached assessment by Relevance Criteria.
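As a quick consistency check on the counts reported above, the first three screening stages can be tabulated as a funnel. This is only a sketch: the stage labels and the `retention` helper are illustrative names, while the counts are the ones given in the text.

```python
# Counts per source as reported for Step 1.
retrieved = {"Scopus": 819, "Web of Science": 489, "Google Scholar": 407}

# Stage labels are illustrative; counts come from the text.
funnel = [
    ("Step 1: retrieved", sum(retrieved.values())),
    ("Step 2: after duplicate removal", 1186),
    ("Step 3: after minimum-length check", 767),
]


def retention(stages):
    """Return (label, count, share of the previous stage) per stage."""
    rows = []
    previous = None
    for label, count in stages:
        share = None if previous is None else count / previous
        rows.append((label, count, share))
        previous = count
    return rows


for label, count, share in retention(funnel):
    suffix = "" if share is None else f" ({share:.1%} of previous stage)"
    print(f"{label}: {count}{suffix}")
```

The per-source counts sum to the reported 1,715, and roughly 69% and 65% of texts survive duplicate removal and the length check, respectively.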

As mentioned, Relevance Criteria were applied iteratively (Steps 4–6) and in conjunction with the availability assessment. As a result, only 298 texts were retained for full evaluation, with 241 originating from scientific databases while 57 were ‘grey’. These studies formed the primary texts corpus, which was extracted, read in full and evaluated by Relevance Criteria combined with Scoring Metrics. The decision rule was set as follows: studies that scored “1” or “0” were rejected, while texts scored “3” or “2” were admitted to the final primary studies corpus. As an outcome of this SLR-based, broad, cross-domain collection and screening of publications, we identified 207 relevant publications from peer-reviewed (156 texts) and ‘grey’ literature (51 texts). Figure 5 shows the number of publications per year, starting from 1997, broken down into ‘peer-reviewed’ and ‘grey’ literature.
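The decision rule on full-text scores amounts to a simple threshold. The sketch below assumes each study is represented as a hypothetical `(study_id, source_type, score)` tuple; the toy input is invented purely to show the mechanics and does not reproduce the actual corpus.

```python
from collections import Counter

ADMIT_THRESHOLD = 2  # scores 3 and 2 are admitted; 1 and 0 are rejected


def admit(score):
    """Apply the decision rule to a single Scoring Metrics value (0-3)."""
    if score not in (0, 1, 2, 3):
        raise ValueError(f"score must be between 0 and 3, got {score!r}")
    return score >= ADMIT_THRESHOLD


def final_corpus(scored):
    """Keep (study_id, source_type) for every admitted study."""
    return [(sid, src) for sid, src, score in scored if admit(score)]


# Hypothetical toy input, not the real corpus.
scored = [
    ("s1", "peer-reviewed", 3),
    ("s2", "grey", 1),
    ("s3", "grey", 2),
    ("s4", "peer-reviewed", 0),
]
corpus = final_corpus(scored)
by_source = Counter(src for _, src in corpus)
```

Applied to the 298 fully evaluated texts, a rule of this shape yields the 207 studies reported above (156 peer-reviewed and 51 ‘grey’).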

In terms of composition, the ‘peer-reviewed’ studies corpus is well balanced, with 72 journal articles and 82 conference papers, while book chapters account for only 4 instances. In contrast, in the ‘grey’ literature subset, articles in moderated and non-peer-reviewed journals are dominant (n = 34) compared to the overall number of conference papers (n = 13), followed by a small number of technical reports and preprints (n = 4).

Temporal analysis of the texts corpus (as per Fig. 5) resulted in two observations. Firstly, we note that stable and significant research interest (in terms of numbers) in the application of data mining methodologies started around a decade ago, in 2007. Research efforts made prior to 2007 were relatively limited, with the number of publications below 10. Secondly, we note that research on data mining methodologies has grown substantially since 2007, an observation supported by the constructed 3-year and 10-year mean trendlines. In particular, the number of publications has roughly tripled over the past decade, hitting an all-time high with 24 texts released in 2017.

Further, there are also two distinct spike sub-periods, in the years 2007–2009 and 2014–2017, followed by a stable pattern with an overall higher number of publications released annually. This observation is in line with the trend of increased penetration of data mining methodologies, tools, cross-industry applications and academic research.

Findings and Discussion

In this section, we address the research questions of the paper. Initially, as part of RQ1, we present an overview of data mining methodologies applied ‘as-is’ and of adaptation trends. In addressing RQ2, we further classify the adaptations identified. Then, in the RQ3 subsection, each category identified under RQ2 is analyzed with particular focus on the goals of adaptations.

RQ1: How are data mining methodologies applied (‘as-is’ vs. adapted)?

The first research question examines the extent to which data mining methodologies are used ‘as-is’ versus adapted. Our review based on 207 publications identified two distinct paradigms on how data mining methodologies are applied. The first is ‘as-is’ where the data mining methodologies are applied as stipulated. The second is with ‘adaptations’; that is, methodologies are modified by introducing various changes to the standard process model when applied.

We have aggregated research by decades to differentiate application patterns between two time periods: 1997–2007, with limited data mining application, and 2008–2018, with more intensive application. The chosen cut has been guided not only by the extracted publications corpus but also by earlier surveys. In particular, in pre-2007 research there were ten new methodologies proposed, but since then only two new methodologies have been proposed. Thus, there is a distinct trend over the last decade towards a large number of extensions and adaptations being proposed rather than entirely new methodologies.

We note that during the first decade of our time scope (1997–2007), the share of data mining methodologies applied ‘as-is’ was 40% (as presented in Fig. 6A). However, the same share for the following decade is 32% (Fig. 6B). Thus, in terms of relative shares, we note a clear decrease in using data mining methodologies ‘as-is’ in favor of adapting them to cater to specific needs. The trend is even more pronounced when comparing absolute numbers: adaptations more than tripled (from 30 to 106), while the ‘as-is’ scenario has increased more modestly (from 20 to 51). Given this finding, we continue with analyzing how data mining methodologies have been adapted under RQ2.
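The shares and growth factors quoted above follow directly from the per-decade counts. The short sketch below recomputes them; the dictionary layout and function names are illustrative, and the counts are those given in the text.

```python
# Publication counts per decade, as reported in the text.
counts = {
    "1997-2007": {"as_is": 20, "adapted": 30},
    "2008-2018": {"as_is": 51, "adapted": 106},
}


def as_is_share(period):
    """Share of studies applying a methodology 'as-is' in a period."""
    c = counts[period]
    return c["as_is"] / (c["as_is"] + c["adapted"])


def growth_factor(kind):
    """Ratio of second-decade to first-decade counts for one category."""
    return counts["2008-2018"][kind] / counts["1997-2007"][kind]


total = sum(sum(c.values()) for c in counts.values())

for period in counts:
    print(f"{period}: as-is share {as_is_share(period):.0%}")
print(f"adaptations grew by a factor of {growth_factor('adapted'):.2f}")
print(f"as-is grew by a factor of {growth_factor('as_is'):.2f}")
print(f"total publications: {total}")
```

The four counts add up to the 207 publications in the corpus, and the ‘as-is’ shares come out at 40% and about 32%, matching the figures reported above.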

RQ2: How have existing data mining methodologies been adapted?

We identified that data mining methodologies have been adapted to cater to specific needs. In order to categorize adaptation scenarios, we applied a two-level dichotomy, specifically, the following decision tree:

Level 1 Decision: Has the methodology been combined with another methodology? If yes, the resulting methodology was classified in the ‘integration’ category. Otherwise, we posed the next question.
Level 2 Decision: Are any new elements (phases, tasks, deliverables) added to the methodology? If yes, we designate the resulting methodology as an ‘extension’ of the original one. Otherwise, we classify the resulting methodology as a modification of the original one.

Thus, three distinct types of adaptation scenarios can be distinguished:

Scenario ‘Modification’: introduces specialized sub-tasks and deliverables in order to address specific use cases or business problems. Modifications typically concentrate on granular adjustments to the methodology at the level of sub-phases, tasks or deliverables within the stages of existing reference frameworks (e.g., CRISP-DM or KDD). For example, Chernov et al. (2014), in a study of the mobile network domain, proposed an automated decision-making enhancement in the deployment phase. In addition, the evaluation phase was modified by using both conventional and own-developed performance metrics. Further, in a study performed within the financial services domain, Yang et al. (2016) present feature transformation and feature selection as sub-phases, thereby enhancing the data mining modeling stage.
Scenario ‘Extension’: primarily proposes significant extensions to reference data mining methodologies. Such extensions result in integrated data mining solutions, data mining frameworks serving as a component or tool for automated IS systems, or transformations of the methodologies to fit specialized environments. The main purposes of extensions are to integrate fully scaled data mining solutions into IS/IT systems and business processes and to provide broader context with useful architectures, algorithms, etc. Adaptations in which extensions have been made elicit and explicitly present various artifacts in the form of system and model architectures, process views, workflows, and implementation aspects. A number of soft goals are also achieved, such as providing a holistic perspective on the data mining process and contextualizing it with organizational needs. There are also extensions in this scenario where data mining process methodologies are substantially changed and extended in all key phases to enable execution of the data mining life-cycle with new (Big) Data technologies and tools and in new prototyping and deployment environments (e.g., Hadoop platforms or real-time customer interfaces). For example, Kisilevich, Keim & Rokach (2013) extended traditional CRISP-DM data mining outcomes with a fully fledged Decision Support System (DSS) for the hotel brokerage business. The authors introduced spatial/non-spatial data management (extending data preparation) and analytical and spatial modeling capabilities (extending the modeling phase), and provided spatial display and reporting capabilities (enhancing the deployment phase). In the same work, domain knowledge was introduced in all phases of the data mining process, and usability and ease of use were also addressed.
Scenario ‘Integration’: combines a reference methodology, for example, CRISP-DM, with: (1) methodologies originating from other domains (e.g., software engineering development methodologies), (2) organizational frameworks (Balanced Scorecard, Analytics Canvas, etc.), or (3) adjustments to accommodate Big Data technologies and tools. Also, adaptations in the form of ‘Integration’ typically introduce various types of ontologies and ontology-based tools, domain knowledge, software engineering, and BI-driven framework elements. Fundamental adjustments of the data mining process to new types of data and IS architectures (e.g., real-time data, multi-layer IS) are also presented. Key gaps addressed with such adjustments are the prescriptive nature and low degree of formalization of CRISP-DM, its obsolescence with respect to tools, and its lack of integration with other organizational frameworks. For example, Brisson & Collard (2008) developed the KEOPS data mining methodology (CRISP-DM based), centered on domain knowledge integration. An ontology-driven information system was proposed with integration and enhancements to all steps of the data mining process. Further, integrating expert knowledge in all data mining phases was shown to produce value in the data mining process.
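The two-level decision tree used to assign an adapted methodology to one of the three scenarios can be written down as a small function. This is a sketch of the classification logic only; the boolean inputs stand for the reviewers’ judgments on each study, and the names `Scenario` and `classify_adaptation` are illustrative.

```python
from enum import Enum


class Scenario(Enum):
    INTEGRATION = "Integration"
    EXTENSION = "Extension"
    MODIFICATION = "Modification"


def classify_adaptation(combined_with_other_methodology: bool,
                        adds_new_elements: bool) -> Scenario:
    """Classify an adapted methodology with the two-level decision tree.

    Level 1: combined with another methodology -> Integration.
    Level 2: new phases, tasks or deliverables added -> Extension.
    Otherwise -> Modification.
    """
    if combined_with_other_methodology:
        return Scenario.INTEGRATION
    if adds_new_elements:
        return Scenario.EXTENSION
    return Scenario.MODIFICATION


# The Level 1 answer takes precedence over the Level 2 answer.
print(classify_adaptation(True, True).value)
print(classify_adaptation(False, True).value)
print(classify_adaptation(False, False).value)
```

Because Level 1 is evaluated first, a methodology that is both combined with another methodology and extended with new elements is still classified as ‘Integration’.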

To examine how the application scenario of each data mining methodology has developed over time, we mapped peer-reviewed texts and ‘grey’ literature to the respective adaptation scenarios, aggregated by decades (as presented in Fig. 7 for peer-reviewed and Fig. 8 for ‘grey’ literature).

For peer-reviewed research, this temporal analysis resulted in three observations. Firstly, research efforts in each adaptation scenario have been growing, and the number of publications more than quadrupled (128 vs. 28). Secondly, as noted above, the relative proportion of ‘as-is’ studies is diluted (from 39% to 33%) and primarily replaced by the ‘Extension’ paradigm (from 25% to 30%). In contrast, in relative terms the gains of the ‘Modification’ and ‘Integration’ paradigms are modest. Further, this finding is reinforced by another observation: the most notable gap in terms of the number of publications remains in the ‘Integration’ category where, excluding the 2008–2009 spike, research efforts are limited and the number of texts is just 13. This is in stark contrast with the prolific research in the ‘Extension’ category, though that research is concentrated in recent years. We can hypothesize that existing reference methodologies do not accommodate and support the increasing complexity of data mining projects and IS/IT infrastructure, as well as certain domain specifics, and as such need to be adapted.

In ‘grey’ literature, in contrast to peer-reviewed research, growth in the number of publications is less pronounced: 29 vs. 22 publications, or 32%, comparing across the two decades (as per Fig. 8). The growth is driven solely by application of ‘Integration’ scenarios (13 vs. 4 publications), while both ‘as-is’ and other adaptation scenarios are stagnating or in decline.

RQ3: For what purposes have existing data mining methodologies been adapted?

We address the third research question by analyzing what gaps the data mining methodology adaptations seek to fill and the benefits of such adaptations. We identified three adaptation scenarios, namely ‘Modification’, ‘Extension’, and ‘Integration’. Here, we analyze each of them.

Modification

Modifications of data mining methodologies are present in 30 peer-reviewed and 4 ‘grey’ literature studies. The analysis shows that modifications overwhelmingly consist of specific case studies. However, the major differentiating point compared to ‘as-is’ case studies is the clear presence of specific adjustments to standard data mining process methodologies. Yet, the proposed modifications and their purposes do not go beyond the phases of traditional data mining methodologies. They are granular, specialized and executed at the level of tasks, sub-tasks, and deliverables. With modifications, authors describe potential business applications and deployment scenarios at a conceptual level, but typically do not report or present real implementations in IS/IT systems and business processes.

Further, this research subcategory can best be classified by the domains in which the case studies were performed and the modification scenarios executed. We have identified four distinct domain-driven applications, presented in Fig. 9.

IT, IS domain

The largest number of publications (14, or approximately 40%) concerned IT and IS security, software development, and specific data mining and processing topics. Authors address the intrusion detection problem in Hossain, Bridges & Vaughn (2003), Fan, Ye & Chen (2016), and Lee, Stolfo & Mok (1999); specialized algorithms for processing a variety of data types in Yang & Shi (2010), Chen et al. (2001), Yi, Teng & Xu (2016), and Pouyanfar & Chen (2016); and effective and efficient management of computer and mobile networks in Guan & Fu (2010), Ertek, Chi & Zhang (2017), Zaki & Sobh (2005), Chernov, Petrov & Ristaniemi (2015), and Chernov et al. (2014).

Manufacturing and engineering

The next most popular research area is manufacturing/engineering, with 10 case studies. The central topic here is high-technology manufacturing, for example, the semiconductor-related study of Chien, Diaz & Lan (2014), and various complex prognostics case studies in the rail and aerospace domains (Létourneau et al., 2005; Zaluski et al., 2011) concentrated on failure predictions. These are complemented by studies on equipment fault and failure prediction and maintenance (Kumar, Shankar & Thakur, 2018; Kang et al., 2017; Wang, 2017) as well as a monitoring system (García et al., 2017).

Sales and services, incl. financial industry

The third category is represented by seven business application papers concerning customer service, targeting and advertising (Karimi-Majd & Mahootchi, 2015; Reutterer et al., 2017; Wang, 2017), financial services credit risk assessments (Smith, Willis & Brooks, 2000), supply chain management (Nohuddin et al., 2018), property management (Yu, Fung & Haghighat, 2013), and similar topics.

As a consequence of specialization, these studies concentrate on developing a ‘state-of-the-art’ solution to the respective domain-specific problem.

Extension

The ‘Extension’ scenario was identified in 46 peer-reviewed and 12 ‘grey’ publications. We noted that extensions to existing data mining methodologies were executed with four major purposes:

Purpose 1: To implement a fully scaled, integrated data mining solution and a regular, repeatable knowledge discovery process, addressing model and algorithm deployment and implementation design (including architecture, workflows and corresponding IS integration). A complementary goal is to tackle changes to business processes in order to incorporate data mining into the organization’s activities.
Purpose 2: To implement complex, specifically designed systems and integrated business applications with a data mining model/solution as a component or tool. Typically, this adaptation is also oriented towards Big Data specifics, and is complemented by proposed artifacts such as Big Data architectures, system models, workflows, and data flows.
Purpose 3: To implement data mining as part of integrated/combined specialized infrastructure, data environments and types (e.g., IoT, cloud, mobile networks).
Purpose 4: To incorporate context-awareness aspects.

The specific list of studies mapped to each of the given purposes is presented in the Appendix (Table A1). The main purposes of adaptations, associated gaps and/or benefits, along with observations and artifacts, are documented in Fig. 10.

Main adaptation purpose	Publications
(1) To implement fully scaled, integrated data mining solution
(2) To implement complex systems and integrated business applications with data mining model/solution as component or tool
(3) To implement data mining as part of integrated/combined specialized infrastructure, data environments and types (e.g., IoT, cloud, mobile networks)
(4) To incorporate context-awareness aspects

In the ‘Extension’ category, studies executed with Purpose 1 propose fully scaled, integrated data mining solutions comprising specific data mining models and associated frameworks and processes. The distinctive trait of this research subclass is that it ensures repeatability and reproducibility of the delivered data mining solution in different organizational and industry settings. Both the results of the data mining use case and its deployment and integration into IS/IT systems and associated business process(es) are presented explicitly. Thus, this ‘Extension’ subclass is geared towards specific solution design, tackling a concrete business or industrial problem or addressing specific research gaps, and in this way resembles a comprehensive case study.

This direction is well exemplified by the expert finder system for research social network services proposed by Sun et al. (2015), the data mining solution for functional test content optimization by Wang (2015), and the time-series mining framework for estimating unobservable time series by Hu et al. (2010). Similarly, Du et al. (2017) tackle online log anomaly detection, while automated association rule mining is addressed by Çinicioğlu et al. (2011), software effort estimation by Deng, Purvis & Purvis (2011), and visual discovery of network patterns by Simoff & Galloway (2008). A number of studies address solutions in IS security (Shin & Jeong, 2005), manufacturing (Güder et al., 2014; Chee, Baharudin & Karkonasasi, 2016), materials engineering (Doreswamy, 2008), and business domains (Xu & Qiu, 2008; Ding & Daniel, 2007).

In contrast, ‘Extension’ studies executed for Purpose 2 concentrate on the design of complex, multi-component information systems and architectures. These are holistic, complex systems and integrated business applications with a data mining framework serving as a component or tool. Moreover, the data mining methodology in these studies is extended with systems integration phases.

For example, Mobasher (2007) presents a data mining application in a Web personalization system and the associated process; here, the data mining cycle is extended in all phases with the ultimate goal of leveraging multiple data sources and using discovered models and corresponding algorithms in an automatic personalization system. The author comprehensively addresses data processing, algorithm and design adjustments, and the respective integration into an automated system. Similarly, Haruechaiyasak, Shyu & Chen (2004) tackle the improvement of a Webpage recommender system by presenting an extended data mining methodology including the design and implementation of a data mining model. A holistic view of web mining, with support for all data sources, integration of data warehousing and data mining techniques, and multiple problem-oriented analytical outcomes with rich business application scenarios (personalization, adaptation, profiling, and recommendations) in the e-commerce domain, was proposed and discussed by Büchner & Mulvenna (1998). Further, Singh et al. (2014) tackled a scalable implementation of a Network Threat Intrusion Detection System. In this study, the data mining methodology and resulting model are extended, scaled and deployed as a module of a quasi-real-time system for capturing Peer-to-Peer Botnet attacks. A similarly complex solution was presented in a series of publications by Lee et al. (2000, 2001), who designed a real-time data mining-based Intrusion Detection System (IDS). These works are complemented by the comprehensive study of Barbará et al. (2001), who constructed an experimental testbed for intrusion detection with data mining methods. A detection model combining data fusion and mining, with respective components for Botnet identification, was also developed by Kiayias et al. (2009). A similar approach is presented in Alazab et al. (2011), who proposed and implemented a zero-day malware detection system with an associated machine-learning based framework. Finally, Ahmed, Rafique & Abulaish (2011) presented a multi-layer framework for fuzzy attack in 3G cellular IP networks.

A number of authors have considered data mining methodologies in the context of Decision Support Systems and other systems that generate information for decision-making, across a variety of domains. For example, Kisilevich, Keim & Rokach (2013) significantly extended the data mining methodology by designing and presenting an integrated Decision Support System (DSS) with six components, acting as a supporting tool for the hotel brokerage business to increase deal profitability. A similar approach is taken by Capozzoli et al. (2017), focusing on improving energy management of properties by providing occupancy pattern information and a reconfiguration framework. Kabir (2016) presented a data mining information service providing improved sales forecasting that supported solving the under/over-stocking problem, while Lau, Zhang & Xu (2018) addressed sales forecasting with sentiment analysis on Big Data. Kamrani, Rong & Gonzalez (2001) proposed a GA-based Intelligent Diagnosis system for fault diagnostics in the manufacturing domain. The latter was tackled further in Shahbaz et al. (2010) with a complex, integrated data mining system for diagnosing and solving manufacturing problems in real time.

Lenz, Wuest & Westkämper (2018) propose a framework for capturing data analytics objectives and creating holistic, cross-departmental data mining systems in the manufacturing domain. This work is representative of a cohort of studies that aim at extending data mining methodologies in order to support the design and implementation of enterprise-wide data mining systems. In this same research cohort, we classify Luna, Castro & Romero (2017) , which presents a data mining toolset integrated into the Moodle learning management system, with the aim of supporting university-wide learning analytics.

One study addresses the multi-agent based data mining concept. Khan, Mohamudally & Babajee (2013) developed a unified theoretical framework for data mining by formulating a unified data mining theory. The framework is tested by means of agent programming, proposing integration into a multi-agent system, which is useful due to its scalability, robustness and simplicity.

The subcategory of ‘Extension’ research executed with Purpose 3 is devoted to data mining methodologies and solutions in specialized IT/IS, data and process environments, which emerged recently as a consequence of the development of Big Data technologies and tools. Exemplary studies include research on IoT-associated environments, for example, the Smart City application in IoT presented by Strohbach et al. (2015). In the same domain, Bashir & Gill (2016) addressed IoT-enabled smart buildings with the additional challenge of large amounts of high-speed real-time data and requirements for real-time analytics. The authors proposed an integrated IoT Big Data Analytics framework. This research is complemented by the interdisciplinary study of Zhong et al. (2017), where IoT and wireless technologies are used to create an RFID-enabled environment producing analysis of KPIs to improve logistics.

A significant number of studies address various mobile environments, sometimes complemented by cloud-based environments, or cloud-based environments on their own. Gomes, Phua & Krishnaswamy (2013) addressed mobile data mining executed on the mobile device itself; their framework proposes an innovative approach that extends all aspects of data mining, including contextual data, end-user privacy preservation, data management and scalability. Yuan, Herbert & Emamian (2014) and Yuan & Herbert (2014) introduced a cloud-based mobile data analytics framework with an application case study for a smart-home-based monitoring system. Cuzzocrea, Psaila & Toccu (2016) presented the innovative FollowMe suite, which implements a data mining framework for mobile social media analytics and comprises several tools with their respective architectures and functionalities. An interesting paper was presented by Torres et al. (2017), who addressed a data mining methodology and its implementation for congestion prediction in mobile LTE networks, also tackling the feedback reaction that triggers network reconfigurations.

Further, Biliri et al. (2014) presented a cloud-based Future Internet Enabler, an automated social data analytics solution that also addresses the Social Network Interoperability aspect, supporting enterprises in interconnecting and utilizing social networks for collaboration. Real-time streamed social media data, and the resulting data mining methodology and application, were extensively discussed by Zhang, Lau & Li (2014). The authors proposed the design of a comprehensive ABIGDAD framework with seven main components implementing data mining based deceptive review identification. An interdisciplinary study tackling both these topics was developed by Puthal et al. (2016), who proposed an integrated framework and architecture for a disaster management system based on streamed data in a cloud environment, ensuring end-to-end security. Additionally, key extensions to the data mining framework have been proposed, merging a variety of data sources and types, security verification and data flow access controls. Finally, cloud-based manufacturing was addressed in the context of fault diagnostics by Kumar et al. (2016).

Also, Mahmood et al. (2013) tackled Wireless Sensor Networks and the extensions required to the associated data mining framework. Interesting work was carried out by Nestorov & Jukic (2003), who addressed the rare topic of integrating data mining solutions within traditional data warehouses and actively mining the data repositories themselves.

Supported by a new generation of visualization technologies (including Virtual Reality environments), Wijayasekara, Linda & Manic (2011) proposed and implemented CAVE-SOM (a 3D visual data mining framework), which offers interactive, immersive visual data mining with multiple visualization modes supported by a plethora of methods. An earlier visual data mining framework was developed and presented by Ganesh et al. (1996).

Large-scale social media data is tackled by Lemieux (2016) with a comprehensive framework accompanied by a set of data mining tools and an interface. Real-time data analytics was addressed by Shrivastava & Pal (2017) in the domain of enterprise service ecosystems. Image data was addressed in Huang et al. (2002), who proposed a multimedia data mining framework and its implementation with user relevance feedback integration and instance learning. Further, the explosion in data diversity and the associated need to extend standard data mining is addressed by Singh et al. (2016) in a study devoted to object detection in video surveillance systems supporting real-time video analysis.

Finally, there is also a limited number of studies that address context awareness (Purpose 4) and extend the data mining methodology with context elements and adjustments. In comparison with research in the ‘Integration’ category, these studies are at a lower abstraction level, capturing and presenting lists of adjustments. Singh, Vajirkar & Lee (2003) generate a taxonomy of context factors, develop an extended data mining framework and propose a deployment including a detailed IS architecture. The context-awareness aspect is also addressed in papers reviewed above, for example, Lenz, Wuest & Westkämper (2018), Kisilevich, Keim & Rokach (2013), Sun et al. (2015), and other studies.

Integration

The ‘Integration’ scenario of data mining methodologies was identified in 27 ‘peer-reviewed’ and 17 ‘grey’ studies. Our analysis revealed that this adaptation scenario, which operates at a higher abstraction level, is typically executed with five key purposes:

Purpose 1: to integrate/combine with various ontologies existing in the organization.
Purpose 2: to introduce context-awareness and incorporate domain knowledge.
Purpose 3: to integrate/combine with frameworks, process methodologies and concepts from other research or industry domains.
Purpose 4: to integrate/combine with other well-known organizational governance frameworks, process methodologies and concepts.
Purpose 5: to accommodate and/or leverage newly available Big Data technologies, tools and methods.

The specific list of studies mapped to each of the given purposes is presented in the Appendix (Table A2). The main purposes of adaptations and the associated gaps and/or benefits, along with observations and artifacts, are documented in Fig. 11 below.


Main adaptation purpose	Publications
(1) To integrate/combine with various ontologies existing in organization	, , , , ,
(2) To introduce context-awareness and incorporate domain knowledge	, , , , , ,
(3) To integrate/combine with other research/industry domains frameworks, process methodologies, and concepts	, , , , , , , , , , , , ,
(4) To integrate/combine with other organizational governance frameworks, process methodologies, concepts	, , , , , , , ,
(5) To accommodate or leverage newly available Big Data technologies, tools and methods	, , , , , ,

As mentioned, a number of studies concentrate on proposing ontology-based integrated data mining frameworks accompanied by various types of ontologies (Purpose 1). For example, Sharma & Osei-Bryson (2008) focus on an ontology-based organizational view with Actors, Goals and Objectives, which supports execution of the Business Understanding Phase. Brisson & Collard (2008) propose the KEOPS framework, which is CRISP-DM compliant and integrates a knowledge base and an ontology in order to build an ontology-driven information system (OIS) for the business and data understanding phases, while the knowledge base is used for the post-processing step of model interpretation. Park et al. (2017) propose and design a comprehensive ontology-based data analytics tool, IRIS, with the purpose of aligning analytics and business. IRIS is based on the concept of connecting dots, that is, analytics methods or the transformation of insights into business value, and supports a standardized process for applying ontology to match business problems and solutions.

Further, Ying et al. (2014) propose a domain-specific data mining framework oriented to the business problem of customer demand discovery. They construct an ontology for customer demand and for the customer demand discovery task, which allows structured knowledge extraction in the form of knowledge patterns and rules. Here, the purpose is to facilitate business value realization and support the actionability of extracted knowledge via marketing strategies and tactics. In the same vein, Cannataro & Comito (2003) presented an ontology for the Data Mining domain whose main goal is to simplify the development of distributed knowledge discovery applications. The authors offer domain experts a reference model for the different kinds of data mining tasks, methodologies, and software capable of solving a given business problem, helping them to find the most appropriate solution.

Apart from ontologies, Sharma & Osei-Bryson (2009), in another study, propose an IS-inspired data mining methodology driven by an Input-Output model, which supports the formal implementation of the Business Understanding Phase. This research exemplifies studies executed with Purpose 2. The goal of the paper is to tackle the prescriptive nature of CRISP-DM and address how the entire process can be implemented. The study by Cao, Schurmann & Zhang (2005) is also exemplary in terms of aggregating and introducing several fundamental concepts into the traditional CRISP-DM data mining cycle: context awareness, in-depth pattern mining, human–machine cooperative knowledge discovery (in essence, following the human-centricity paradigm in data mining), and a loop-closed iterative refinement process (similar to Agile-based methodologies in software development). Several further concepts, such as data, domain, interestingness, and rules, are proposed to tackle a number of fundamental constraints identified in CRISP-DM. They have been discussed and further extended by Cao & Zhang (2007, 2008) and Cao (2010) into an integrated domain-driven data mining concept, resulting in the fully fledged D3M (domain-driven) data mining framework. Interestingly, the same concepts are investigated and presented individually by other authors; for example, context-aware data mining methodology is tackled by Xiang (2009a, 2009b) in the context of the financial sector. Pournaras et al. (2016) addressed the crucial topic of privacy preservation in the context of achieving an effective data analytics methodology. The authors introduced metrics and a self-regulatory (reconfigurable) information sharing mechanism that provides customers with controls for information disclosure.

A number of studies have proposed CRISP-DM adjustments based on existing frameworks, process models or concepts originating in other domains (Purpose 3), for example, software engineering ( Marbán et al., 2007 , 2009 ; Marban, Mariscal & Segovia, 2009 ) and industrial engineering ( Solarte, 2002 ; Zhao et al., 2005 ).

Meanwhile, Mariscal, Marbán & Fernández (2010) proposed a new refined data mining process based on a global comparative analysis of existing frameworks while Angelov (2014) outlined a data analytics framework based on statistical concepts. Following a similar approach, some researchers suggest explicit integration with other areas and organizational functions, for example, BI-driven Data Mining by Hang & Fong (2009) . Similarly, Chen, Kazman & Haziyev (2016) developed an architecture-centric agile Big Data analytics methodology, and an architecture-centric agile analytics and DevOps model. Alternatively, several authors tackled data mining methodology adaptations in other domains, for example, educational data mining by Tavares, Vieira & Pedro (2017) , decision support in learning management systems ( Murnion & Helfert, 2011 ), and in accounting systems ( Amani & Fadlalla, 2017 ).

Other studies are concerned with actionability of data mining and closer integration with business processes and organizational management frameworks (Purpose 4). In particular, there is a recurrent focus on embedding data mining solutions into knowledge-based decision making processes in organizations, and supporting fast and effective knowledge discovery ( Bohanec, Robnik-Sikonja & Borstnar, 2017 ).

Examples of adaptations made for this purpose include: (1) integration of CRISP-DM with the Balanced Scorecard framework used for strategic performance management in organizations (Yun, Weihua & Yang, 2014); (2) integration with a strategic decision-making framework for revenue management (Segarra et al., 2016); (3) integration with a strategic analytics methodology (Van Rooyen & Simoff, 2008); and (4) integration with a so-called ‘Analytics Canvas’ for management of portfolios of data analytics projects (Kühn et al., 2018). Finally, Ahangama & Poo (2015) explored methodological attributes important for the adoption of data mining methodology by novice users. This latter study uncovered factors that could help reduce resistance to the use of data mining methodologies. Conversely, Lawler & Joseph (2017) comprehensively evaluated factors that may increase the benefits of Big Data Analytics projects in an organization.

Lastly, a number of studies have proposed adaptations of data mining frameworks (e.g., CRISP-DM) to cater for new technological architectures, new types of datasets and applications (Purpose 5). For example, Lu et al. (2017) proposed a data mining system based on a Service-Oriented Architecture (SOA), Zaghloul, Ali-Eldin & Salem (2013) developed a concept of self-service data analytics, Osman, Elragal & Bergvall-Kåreborn (2017) blended CRISP-DM into a Big Data Analytics framework for Smart Cities, and Niesen et al. (2016) proposed a data-driven risk management framework for Industry 4.0 applications.

Our analysis of RQ3, regarding the purposes of existing data mining methodologies adaptations, revealed the following key findings. Firstly, adaptations of type ‘Modification’ are predominantly targeted at addressing problems that are specific to a given case study. The majority of modifications were made within the domain of IS security, followed by case studies in the domains of manufacturing and financial services. Secondly, and in clear contrast, adaptations of type ‘Extension’ are primarily aimed at customizing the methodology to take into account specialized development environments and deployment infrastructures, and to incorporate context-awareness aspects. Thirdly, a recurrent purpose of adaptations of type ‘Integration’ is to combine a data mining methodology with either existing ontologies in an organization or with other domain frameworks, methodologies, and concepts. ‘Integration’ is also used to instill context-awareness and domain knowledge into a data mining methodology, or to adapt it to specialized methods and tools, such as Big Data. The distinctive outcome and value (gaps filled in) of ‘Integrations’ stems from improved knowledge discovery, better actionability of results, improved combination with key organizational processes and domain-specific methodologies, and improved usage of Big Data technologies.

We discovered that the adaptations of existing data mining methodologies found in the literature can be classified into three categories: modification, extension, or integration.

We also noted that adaptations are executed either to address deficiencies and the lack of important elements or aspects in the reference methodology (chiefly CRISP-DM), or to improve certain phases, deliverables or process outcomes.

In short, adaptations are made to:

improve key phases of reference data mining methodologies; in the case of CRISP-DM, these are primarily the business understanding and deployment phases.
support knowledge discovery and actionability.
introduce context-awareness and a higher degree of formalization.
integrate the data mining solution more closely with key organizational processes and frameworks.
significantly update CRISP-DM with respect to Big Data technologies, tools, environments and infrastructure.
incorporate the broader, explicit context of architectures, algorithms and toolsets as integral deliverables or supporting tools to execute the data mining process.
expand and accommodate a broader, unified perspective for incorporating and implementing data mining solutions in the organization, IT infrastructure and business processes.

Threats to Validity

Systematic literature reviews have inherent limitations that must be acknowledged. These threats to validity include subjective bias (internal validity) and incompleteness of search results (external validity).

The internal validity threat stems from the subjective screening and rating of studies, particularly when assessing the studies with respect to relevance and quality criteria. We have mitigated these effects by documenting the survey protocol (SLR Protocol), strictly adhering to the inclusion criteria, and performing significant validation procedures, as documented in the Protocol.

The external validity threat relates to the extent to which the findings of the SLR reflect the actual state of the art in the field of data mining methodologies, given that the SLR only considers published studies that can be retrieved using specific search strings and databases. We have addressed this threat to validity by conducting trial searches to validate our search strings in terms of their ability to identify relevant papers that we knew about beforehand. Also, the fact that the searches led to 1,700 hits overall suggests that a significant portion of the relevant literature has been covered.

In this study, we have examined the use of data mining methodologies by means of a systematic literature review covering both peer-reviewed and ‘grey’ literature. We have found that the use of data mining methodologies, as reported in the literature, has grown substantially since 2007 (four-fold increase relative to the previous decade). Also, we have observed that data mining methodologies were predominantly applied ‘as-is’ from 1997 to 2007. This trend was reversed from 2008 onward, when the use of adapted data mining methodologies gradually started to replace ‘as-is’ usage.

The most frequent adaptations have been in the ‘Extension’ category. This category refers to adaptations that imply significant changes to key phases of the reference methodology (chiefly CRISP-DM). These adaptations particularly target the business understanding, deployment and implementation phases of CRISP-DM (or other methodologies). Moreover, we have found that the most frequent purposes of adaptations are: (1) adaptations to handle Big Data technologies, tools and environments (technological adaptations); and (2) adaptations for context-awareness and for integrating data mining solutions into business processes and IT systems (organizational adaptations). A key finding is that standard data mining methodologies do not pay sufficient attention to deployment aspects required to scale and transform data mining models into software products integrated into large IT/IS systems and business processes.

Apart from the adaptations in the ‘Extension’ category, we have also identified an increasing number of studies focusing on the ‘Integration’ of data mining methodologies with other domain-specific and organizational methodologies, frameworks, and concepts. These adaptations are aimed at embedding the data mining methodology into broader organizational aspects.

Overall, the findings of the study highlight the need to develop refinements of existing data mining methodologies that would allow them to seamlessly interact with IT development platforms and processes (technological adaptation) and with organizational management frameworks (organizational adaptation). In other words, there is a need to frame existing data mining methodologies as being part of a broader ecosystem of methodologies, as opposed to the traditional view where data mining methodologies are defined in isolation from broader IT systems engineering and organizational management methodologies.

Supplemental Information

Supplemental information 1.

Unfortunately, we were not able to upload any graphs (original png files). We constructed the graph files based on the examples in the PeerJ template hosted on Overleaf, but we were not able to determine why they did not fit; converting them to new formats would change the text flow of the generated PDF file. We therefore submit the graphs in an archive file as part of the supplementary material, and we will do our best to redo the graphs based on further instructions.

Supplemental Information 2

The file starts with a Definitions page, which lists and explains all column definitions as well as the SLR scoring metrics. The second page contains the "Peer reviewed" texts, while the next one contains the "grey" literature corpus.

Funding Statement

The authors received no funding for this work.

Additional Information and Declarations

The authors declare that they have no competing interests.

Veronika Plotnikova conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Marlon Dumas conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Fredrik Milani conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Primary Sources


What Is Data Mining? | Definition & Techniques

Published on July 20, 2023 by Kassiani Nikolopoulou .

Data mining is the process of extracting meaningful information from vast amounts of data. With data mining methods, organizations can discover hidden patterns, relationships, and trends in data, which they can use to solve business problems, make predictions, and increase their profits or efficiency.

The term “data mining” is actually a misnomer because the goal is not to extract the data itself, but rather meaningful information from the data .



Data mining, also known as knowledge discovery in data (KDD) , is a branch of data science that brings together computer software, machine learning (i.e., the process of teaching machines how to learn from data without human intervention), and statistics to extract or mine useful information from massive data sets.

Through our online interactions with companies, government agencies, or educational institutes, we produce a large amount of data. This “big data” consists of data sets so large that it’s not possible for a human to analyze them. Instead, this is done with the assistance of a computer.

Data mining transforms this raw data into practical knowledge that helps organizations answer important questions about their users or consumers. Data mining applications include consumer behavior analysis, sales forecasting, and fraud detection.


Data mining techniques draw from various fields like machine learning (ML) and statistics . Here are a few common data mining techniques:

Classification is the task of assigning new data to known or predefined categories. For example, sorting a data set consisting of emails as “spam” or “not spam.”
Clustering is the process of grouping data that share common characteristics into subgroups or clusters. Unlike classification (where groups are predefined), clustering is a discovery technique that helps us identify patterns. This allows businesses to create customer segments based on loyalty, communication preferences, or any other trait that emerges from the data.
Association rule learning is a technique that looks for relationships between data points. A grocery store chain may use association rule learning to find out which products are frequently bought together and use these insights for promotions.
Regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal is to predict the value of the dependent variable based on the values of the independent variables . For example, using historical data about houses with similar characteristics, we might predict the future value of a house.
Anomaly or outlier detection is the process of identifying unusual data within a data set (i.e., data that doesn’t follow the general pattern). This data may be interesting (e.g., if it signals a spike in the sales of certain products) or may need further investigation (e.g., if it indicates potential instances of fraud).
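
Two of the techniques above, regression and anomaly detection, can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `fit_line` and `zscore_outliers`, the sample figures, and the threshold of two standard deviations are assumptions made for the example, not part of any particular data mining tool.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical data: floor area (square metres) versus sale price (thousands).
areas = [50, 70, 90, 110]
prices = [150, 210, 270, 330]
slope, intercept = fit_line(areas, prices)
predicted_price = slope * 100 + intercept  # estimate for a 100 square metre house

# Hypothetical daily unit sales with one unusual spike.
daily_sales = [10, 11, 9, 10, 12, 10, 50]
unusual_days = zscore_outliers(daily_sales)
```

A z-score rule is one of the simplest ways to flag outliers and is itself distorted by extreme values, so real projects often prefer more robust methods.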

The data mining process involves using statistical methods and machine learning algorithms to identify patterns in data. Thanks to advancements in computer processing power and speed, analyzing data is largely automated.

Although there are different ways to describe the data mining process, a widely used model is the Cross-Industry Standard Process for Data Mining (CRISP-DM) , which includes the following stages:

Business understanding
Data understanding
Data preparation
Data modeling
Evaluation
Deployment

In the business understanding stage, we need to identify the problem we intend to solve through data mining (e.g., how to create a more targeted marketing campaign).

Data scientists and other relevant stakeholders need to define the business problem, which will inform the questions that guide the project. Additional research might be necessary to understand the business context. Determining project goals and success criteria is important for collecting the right data and evaluating the project’s outcomes.

Once the business problem is defined, we need to determine the type of data needed and identify relevant sources. In this step, data scientists collect data from various sources, such as transaction records and customer databases.

However, not every data point may be relevant for the project. For example, a company may only be interested in purchases via credit card. The goal here is to ensure that only the necessary data will be included. By the end of the data understanding stage, the data mining team should have selected the subset of data necessary to address the problem.

Data preparation is the most time-consuming stage and involves several actions to get the data ready for further processing and analysis. This may involve excluding duplicates, missing data , or outliers from the data (i.e., data cleansing ).

Data from multiple sources may be merged, organized, or adjusted in different ways to prepare for the next phase. At the end of this stage, the data mining team has identified the most relevant variables and prepared the final data set.
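
As a rough illustration of the preparation steps described above (removing missing values and duplicates, then merging sources), here is a minimal sketch in plain Python. The field names (`customer_id`, `amount`, `method`, `segment`) and both helper functions are hypothetical and chosen only for this example.

```python
def clean_records(records, required_fields):
    """Drop rows with a missing required value, then drop rows that repeat
    an earlier row on the required fields (the first occurrence is kept)."""
    seen = set()
    cleaned = []
    for row in records:
        if any(row.get(field) is None for field in required_fields):
            continue
        key = tuple(row.get(field) for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def merge_sources(transactions, customers):
    """Attach customer attributes to each transaction via customer_id (inner join)."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**by_id[t["customer_id"]], **t}
        for t in transactions
        if t["customer_id"] in by_id
    ]


transactions = [
    {"customer_id": 1, "amount": 20.0, "method": "card"},
    {"customer_id": 1, "amount": 20.0, "method": "card"},  # duplicate row
    {"customer_id": 2, "amount": None, "method": "card"},  # missing amount
    {"customer_id": 3, "amount": 35.5, "method": "cash"},
]
customers = [
    {"customer_id": 1, "segment": "loyal"},
    {"customer_id": 3, "segment": "new"},
]

cleaned = clean_records(transactions, ["customer_id", "amount", "method"])
prepared = merge_sources(cleaned, customers)
```

In practice this work is usually done with a data-frame library or SQL, but the logic is the same: decide what counts as missing or duplicated, then join on a shared key.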

Data modeling is the process of organizing and understanding data in a structured way. It helps data mining teams find meaningful patterns and insights in the available data.

Data scientists use different models depending on the type of data they have and the problem they’re trying to solve. For example, they might want to identify which products are often purchased together or detect suspicious transactions in banks. To do this, they may use different techniques.

For example, they may apply classification techniques to categorize labeled data or use clustering techniques to group similar data points together. By iterating through this modeling process, data scientists try to reach the best solution.
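
To make the clustering idea concrete, the sketch below implements a basic k-means loop (Lloyd's algorithm) in plain Python. The customer data, the choice of two clusters, and the starting centroids are all assumptions for illustration; production work would normally rely on an established library and a more careful initialization.

```python
from math import dist


def k_means(points, centroids, max_iter=100):
    """Alternate nearest-centroid assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its previous centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical customers: (trips per year, average spend in hundreds).
customers = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
labels, centroids = k_means(customers, centroids=[customers[0], customers[3]])
```

With this toy data the first three customers end up in one cluster and the last three in the other; no labels were supplied in advance, which is what distinguishes clustering from classification.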

They build models that group customers into segments that reflect shared travel interests and characteristics. They find out that their customers mainly consist of three distinct groups: “adventure seekers,” “cultural explorers,” and “family vacationers.”

Note: There are two main types of data: labeled and unlabeled.

Labeled data means that it has been manually annotated with specific information (e.g., emails labeled “spam” or “not spam”). In this case, data scientists can use a supervised machine learning approach , where the model learns from these labeled examples to make predictions on new, unseen data.
On the other hand, if the data is unlabeled , data scientists can use unsupervised machine learning , which helps them discover patterns and relationships within the data without any predefined labels.
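
The supervised case can be illustrated with a tiny spam filter. The sketch below is a simplified multinomial naive Bayes classifier with add-one smoothing, written in plain Python; the four training messages and the function names are made up for the example, and a real system would need far more data and proper text preprocessing.

```python
from collections import Counter
from math import log


def train_naive_bayes(labeled_texts):
    """Count words per label; returns (word_counts, label_counts, vocabulary)."""
    word_counts = {}
    label_counts = Counter()
    vocabulary = set()
    for text, label in labeled_texts:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability under add-one smoothing."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = log(label_counts[label] / total_docs)  # prior for this label
        for word in text.lower().split():
            score += log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labeled examples.
training = [
    ("win a free prize now", "spam"),
    ("free money win big", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("project report due monday", "not spam"),
]
model = train_naive_bayes(training)
first = classify("free prize inside", model)
second = classify("monday project meeting", model)
```

The model only learns from the labels it is given, which is why the quality and coverage of the labeled data matter so much in supervised learning.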

During the evaluation stage , the data mining team begins to assess the model’s effectiveness in answering their initial question. This is a human-driven phase, as the project leader needs to decide if the model answers the original question well or uncovers new and previously unknown patterns.

Unlike the technical assessment in the modeling phase, the evaluation phase involves determining which model best meets the objectives and deciding how to proceed. This involves evaluating the results against success criteria, reviewing the process for any oversights, and summarizing findings.

The team may decide, for example, to move on to the next phase or, if the model does not align with the desired objectives, to explore alternative models or revisit the data.
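
Evaluating results against success criteria can be partly automated. The sketch below computes precision, recall, and accuracy for a hypothetical fraud model and checks them against thresholds; the labels, predictions, and threshold values are assumptions, and in a real project the criteria would come from the business understanding stage.

```python
def evaluate(actual, predicted, positive="fraud"):
    """Compute precision, recall and accuracy for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }


def meets_criteria(metrics, criteria):
    """True when every metric named in `criteria` reaches its minimum value."""
    return all(metrics[name] >= minimum for name, minimum in criteria.items())


# Hypothetical ground truth and model output for eight transactions.
actual = ["fraud", "ok", "ok", "fraud", "ok", "fraud", "ok", "ok"]
predicted = ["fraud", "ok", "fraud", "fraud", "ok", "ok", "ok", "ok"]

metrics = evaluate(actual, predicted)
ready_to_deploy = meets_criteria(metrics, {"recall": 0.8})
```

Here the model misses one of the three fraud cases, so it falls short of a recall requirement of 0.8 and the team would revisit the model or the data, as described above. The numbers inform the decision; the decision itself remains a human one.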

The deployment step is about putting the knowledge and insights gathered from the project into practical use.

Depending on the original question or problem, deployment can be something simple like creating a report or a visual presentation, or something more complex like generating a new sales strategy. Deployment involves integrating the results into the organization’s operations or decision-making process.

Here are some real-world examples of data mining:

Market basket analysis. Retailers use data mining to analyze large data sets and discover consumers’ buying patterns, such as items that are frequently bought together or seasonal trends. They can use this information to better organize their physical stores or websites, predict sales, and promote deals.
Academic research. In the field of literary studies, data mining techniques can be used to analyze texts and understand the emotions expressed by authors or characters. Sentiment analysis (or opinion mining) involves using natural language processing and machine learning algorithms to determine the emotional tone of a text.
Education . Educational data mining (EDM) aims to improve learning by analyzing a variety of educational data, such as students’ interactions with online learning environments or administrative data from schools and universities. This method can help education providers understand what students need and support them better (e.g., through customized lessons or by identifying and engaging with at-risk students before they drop out).
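
The market basket example above can be sketched as a simple form of association rule learning restricted to item pairs. The baskets, the minimum support of 0.5, and the function name `pair_rules` are assumptions for illustration; dedicated algorithms such as Apriori handle larger item sets more efficiently.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support):
    """Return {(a, b): (support, confidence)} for rules a -> b over frequent pairs.

    support    = share of baskets containing both items
    confidence = share of baskets with a that also contain b
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore repeated items within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "eggs"],
]
rules = pair_rules(baskets, min_support=0.5)
```

In this toy data every basket with butter also contains bread, so the rule butter -> bread has a confidence of 1.0, while bread -> butter is weaker because bread is also bought without butter.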


Data mining and data analysis are often used interchangeably. However, they are two distinct processes in the field of data science.

Data mining is the process of uncovering hidden patterns, trends, or relationships in large data sets. It involves various techniques like machine learning and statistics , to find useful information in complex data and support decision-making and planning. This process is also called “knowledge discovery.”
Data analysis , on the other hand, is a broader term that describes the entire process of inspecting, cleaning, and organizing raw data. The goal is to draw conclusions, make inferences, and support decision-making. Data analysis includes various techniques like descriptive statistics , data mining, hypothesis testing , and regression analysis .

In other words, data mining is one of the techniques used for data analysis when there is a need to uncover hidden patterns and relationships in the data that other methods might miss, while data analysis encompasses a wider range of activities.

Data mining is important because it allows us to discover meaningful patterns and relationships in large volumes of data in a relatively quick and efficient way.

Data mining techniques can take advantage of data coming from different sources like social media platforms or customer databases and convert it into useful insights. In turn, these can answer business or research questions , make predictions, and inform decision making.

Data mining and machine learning are related fields, but they have different purposes:

The goal of machine learning is to develop algorithms that allow computers to learn without human intervention. It’s about making machines smarter, so they can carry out tasks related to human intelligence independently.
The goal of data mining is to sift through large data sets and extract useful information like patterns and relationships that can be used to support decision-making. In other words, it’s a tool for humans.

While data mining and machine learning have distinct goals, there is some overlap in their applications. Machine learning can be used as a means to conduct data mining by automatically detecting patterns in data. On the other hand, data gathered from data mining can be used to teach machines and improve their learning capabilities.

In short, data mining and machine learning can complement each other, but they are distinct in their purposes and applications.
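
To make the overlap concrete, here is a minimal sketch of a machine learning algorithm (k-means clustering) used for a data mining goal: finding groups in data without being told what the groups are. It is a simplified one-dimensional version written for illustration; the function name `kmeans_1d`, the starting centers, and the spending figures are assumptions, not taken from any particular library or study.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign the value to the closest current center
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if its cluster is empty
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0, 50])
print(centers)   # [11.0, 100.0]
print(clusters)  # [[10, 12, 11], [95, 100, 105]]
```

The algorithm does the learning; the person reading the two resulting groups (low spenders and high spenders) and acting on them is doing the data mining.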

Sources in this article

We strongly encourage students to use sources in their work. You can cite our article (APA Style) or take a deep dive into the articles below.

Nikolopoulou, K. (2023, July 20). What Is Data Mining? | Definition & Techniques. Scribbr. Retrieved September 16, 2024, from https://www.scribbr.com/ai-tools/data-mining/

Yağcı, M. (2022). Educational data mining: prediction of students’ academic performance using machine learning algorithms. Smart Learning Environments, 9 (1). https://doi.org/10.1186/s40561-022-00192-z


Writing a research paper (4) – the introduction

This is the fourth blog post on how to write research papers. Today, I will discuss how to write the introduction of a scientific research paper, describe some common errors, and give some tips.

Why is the introduction important?

The introduction plays a very important role in a paper. If the introduction is not well-written and convincing, the reader may decide to stop reading. The role of the introduction is to explain the context of the paper, describe the problem that will be addressed, briefly mention why previous studies have limitations with respect to that problem, and then briefly explain the contribution of the paper.

A good introduction will have a clear logical structure, and will convince the reader that the problem addressed in the paper is worthy of being investigated (it is a novel problem that cannot be solved using existing approaches, it is not easy to solve, and solving this problem is useful). Moreover, the introduction will give an overview of the paper.

On the other hand, a bad introduction will be poorly organized and will not convince the reader that the problem addressed in the paper is useful, important or challenging. Thus, after reading the introduction, the reader may lose interest in the paper. Writing a good introduction is thus very important.

What is the typical structure of an introduction?

Generally, the introductions of research papers have more or less the same structure:

PART 1 (context): The first paragraph introduces the broad context of the paper, and then progressively narrows that broad context to a more specific one.
PART 2 (problem): Then, a problem is introduced, along with why it must be solved.
PART 3 (limitations): Then, the introduction briefly mentions that previous studies failed to solve that problem or have limitations. Hence, a new solution is needed (which will be described in the paper).
PART 4 (contributions): The following paragraph briefly mentions the main contributions of the paper and the key features of the proposed solution. This may include one or two sentences about the results and the conclusions drawn from them.
PART 5 (plan of the paper): Then, there will often be a short paragraph explaining how the rest of the paper is organized. For example: “The rest of this paper is organized as follows… Section 2 discusses related work. Section 3 …. Finally, a conclusion is drawn in Section 5.”

Some common errors

Here are a few common errors found in introductions:

English errors: An introduction should be well-written and free of English errors.
Poor structure: Some introductions do not follow the typical structure of an introduction, and are not organized in a logical way. In this case, the reader may feel lost, may become uninterested or may not be convinced that the research presented in the paper is worthy of being investigated. As a result, the reader may stop reading.
A very long introduction, with unnecessary details: Another common mistake is to write a very long introduction that contains too many details. An introduction should generally be no longer than a page. Often, an overly long introduction contains details about related work that are not relevant to its purpose. The introduction should only briefly discuss related work to explain the motivation of the paper. More details about related work can be given in other parts of the paper, such as a dedicated “related work” section.
An introduction that is not convincing: The introduction needs to convince the reader that the research problem studied in the paper is important, useful and not trivial to solve. A mistake made in many papers is to not explain why the studied problem is useful. For example, in data mining research, I have read many papers that proposed some new algorithms and evaluated them with synthetic data, but did not clearly explain or show the real applications of the proposed algorithms.
An introduction that omits some relevant related work: Sometimes, the introduction of a paper will not cite some relevant studies. This happens when the author is not very familiar with the field of research, and sometimes authors will purposely not cite some relevant papers for various reasons. This can cause a paper to be rejected by reviewers.

To write a good introduction:

Before writing, make a plan of the main ideas that you want to discuss and of the structure of your introduction. This will help you organize your ideas and create an introduction that is logically organized.
When planning or writing your introduction, think about your target audience. Choose words and expressions that are appropriate for that audience. In the first paragraph, you can also explain the context of your work in a more general way to try to reach a broader audience.
Reading the introductions of other papers, and studying their structure, can help to write better introductions.
If necessary, ask a native English speaker to proofread your text.
After writing the introduction, read it again, and spend some time thinking about how you can improve it. Generally, taking time to read your text again will help to improve your writing skills.

That is all for this topic! In this blog post, I provided some key ideas about how to write introductions of research papers. If you have any additional comments, please leave them in the comment section below. Hope you have enjoyed this post.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.

A comprehensive survey of data mining

Original Research
Published: 06 February 2020
Volume 12, pages 1243–1257 (2020)

Manoj Kumar Gupta (ORCID: orcid.org/0000-0002-4481-8432) & Pravin Chandra

Data mining plays an important role in various human activities because it extracts previously unknown, useful patterns (or knowledge). Due to its capabilities, data mining has become an essential task in a large number of application domains such as banking, retail, medicine, insurance, and bioinformatics. To take a holistic view of the research trends in the area of data mining, this paper presents a systematic and comprehensive survey of various data mining tasks and techniques. It also presents various real-life applications of data mining, along with the challenges and issues in data mining research.

Fayadd U, Piatesky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery in databases. AAAI Press/The MIT Press, Massachusetts Institute of Technology. ISBN 0–262 56097–6 Fayap

Fayadd U, Piatesky-Shapiro G, Smyth P (1996) Knowledge discovery and data mining: towards a unifying framework. In: Proceedings of the 2nd ACM international conference on knowledge discovery and data mining (KDD), Portland, pp 82–88

Heikki M (1996) Data mining: machine learning, statistics, and databases. In: SSDBM ’96: proceedings of the eighth international conference on scientific and statistical database management, June 1996, pp 2–9

Arora RK, Gupta MK (2017) e-Governance using data warehousing and data mining. Int J Comput Appl 169(8):28–31

Google Scholar

Morik K, Bhaduri K, Kargupta H (2011) Introduction to data mining for sustainability. Data Min Knowl Discov 24(2):311–324

Han J, Kamber M, Pei J (2012) Data mining concepts and techniques, 3rd edn. Elsevier, Netherlands

MATH Google Scholar

Friedman JH (1997) Data mining and statistics: What is the connection? in: Keynote Speech of the 29th Symposium on the Interface: Computing Science and Statistics, Houston, TX, 1997

Turban E, Aronson JE, Liang TP, Sharda R (2007) Decision support and business intelligence systems. 8 th edn, Pearson Education, UK

Gheware SD, Kejkar AS, Tondare SM (2014) Data mining: tasks, tools, techniques and applications. Int J Adv Res Comput Commun Eng 3(10):8095–8098

Kiranmai B, Damodaram A (2014) A review on evaluation measures for data mining tasks. Int J Eng Comput Sci 3(7):7217–7220

Sharma M (2014) Data mining: a literature survey. Int J Emerg Res Manag Technol 3(2):1–4

Venkatadri M, Reddy LC (2011) A review on data mining from past to the future. Int J Comput Appl 15(7):19–22

Chen M, Han J, Yu PS (1996) Data mining: an overview from a database perspective. IEEE Trans Knowl Data Eng 8(6):866–883

Gupta MK, Chandra P (2019) A comparative study of clustering algorithms. In: Proceedings of the 13th INDIACom-2019; IEEE Conference ID: 461816; 6th International Conference on “Computing for Sustainable Global Development”

Ponniah P (2001) Data warehousing fundamentals. Wiley, USA

Chandra P, Gupta MK (2018) Comprehensive survey on data warehousing research. Int J Inform Technol 10(2):217–224

Weiss SH, Indurkhya N (1998) Predictive data mining: a practical guide. Morgan Kaufmann Publishers, San Francisco

Fu Y (1997) Data mining: tasks, techniques, and applications. IEEE Potentials 16(4):18–20

Abuaiadah D (2015) Using bisect k-means clustering technique in the analysis of arabic documents. ACM Trans Asian Low-Resour Lang Inf Process 15(3):1–17

Algergawy A, Mesiti M, Nayak R, Saake G (2011) XML data clustering: an overview. ACM Comput Surv 43(4):1–25

Angiulli F, Fassetti F (2013) Exploiting domain knowledge to detect outliers. Data Min Knowl Discov 28(2):519–568

MathSciNet MATH Google Scholar

Angiulli F, Fassetti F (2016) Toward generalizing the unification with statistical outliers: the gradient outlier factor measure. ACM Trans Knowl Discov Data 10(3):1–26

Bhatnagar V, Ahuja S, Kaur S (2015) Discriminant analysis-based cluster ensemble. Int J Data Min Modell Manag 7(2):83–107

Bouguessa M (2013) Clustering categorical data in projected spaces. Data Min Knowl Discov 29(1):3–38

MathSciNet Google Scholar

Campello RJGB, Moulavi D, Zimek A, Sander J (2015) Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans Knowl Discov Data 10(1):1–51

Carpineto C, Osinski S, Romano G, Weiss D (2009) A survey of web clustering engines. ACM Comput. Surv. 41(3):1–38

Ceglar A, Roddick JF (2006) Association mining. ACM Comput Surv 38(2):1–42

Chen YL, Weng CH (2009) Mining fuzzy association rules from questionnaire data. Knowl Based Syst 22(1):46–56

Fan Chin-Yuan, Fan Pei-Shu, Chan Te-Yi, Chang Shu-Hao (2012) Using hybrid data mining and machine learning clustering analysis to predict the turnover rate for technology professionals. Expert Syst Appl 39:8844–8851

Das R, Kalita J, Bhattacharya (2011) A pattern matching approach for clustering gene expression data. Int J Data Min Model Manag 3(2):130–149

Dincer E (2006) The k-means algorithm in data mining and an application in medicine. Kocaeli Univesity, Kocaeli

Geng L, Hamilton HJ (2006) Interestingness measures for data mining: a survey. ACM Comput Surv 38(3):1–32

Gupta MK, Chandra P (2019) P-k-means: k-means using partition based cluster initialization method. In: Proceedings of the international conference on advancements in computing and management (ICACM 2019), Elsevier SSRN, pp 567–573

Gupta MK, Chandra P (2019) An empirical evaluation of k-means clustering algorithm using different distance/similarity metrics. In: Proceedings of the international conference on emerging trends in information technology (ICETIT-2019), emerging trends in information technology, LNEE 605 pp 884–892 DOI: https://doi.org/10.1007/978-3-030-30577-2_79

Hea Z, Xua X, Huangb JZ, Denga S (2004) Mining class outliers: concepts, algorithms and applications in CRM. Expert Syst Appl 27(4):681e97

Hung LN, Thu TNT, Nguyen GC (2015) An efficient algorithm in mining frequent itemsets with weights over data stream using tree data structure. IJ Intell Syst Appl 12:23–31

Hung LN, Thu TNT (2016) Mining frequent itemsets with weights over data stream using inverted matrix. IJ Inf Technol Comput Sci 10:63–71

Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput. Surv 31(3):1–60

Jin H, Wang S, Zhou Q, Li Y (2014) An improved method for density-based clustering. Int J Data Min Model Manag 6(4):347–368

Khandare A, Alvi AS (2017) Performance analysis of improved clustering algorithm on real and synthetic data. IJ Comput Netw Inf Secur 10:57–65

Koh YS, Ravana SD (2016) Unsupervised rare pattern mining: a survey. ACM Trans Knowl Discov Data 10(4):1–29

Kosina P, Gama J (2015) Very fast decision rules for classification in data streams. Data Min Knowl Discov 29(1):168–202

Kotsiantis SB (2007) Supervised machine learning: a review of classification techniques. Informatica 31:249–268

Kumar D, Bezdek JC, Rajasegarar S, Palaniswami M, Leckie C, Chan J, Gubbi J (2016) Adaptive cluster tendency visualization and anomaly detection for streaming data. ACM Trans Knowl Discov Data 11(2):1–24

Lee G, Yun U (2017) A new efficient approach for mining uncertain frequent patterns using minimum data structure without false positives. Future Gener Comput Syst 68:89–110

Li G, Zaki MJ (2015) Sampling frequent and minimal boolean patterns: theory and application in classification. Data Min Knowl Discov 30(1):181–225. https://doi.org/10.1007/s10618-015-0409-y

Article MathSciNet MATH Google Scholar

Liao TW, Triantaphyllou E (2007) Recent advances in data mining of enterprise data: algorithms and applications. World Scientific Publishing, Singapore, pp 111–145

Mabroukeh NR, Ezeife CI (2010) A taxonomy of sequential pattern mining algorithms. ACM Comput Surv 43:1

Mampaey M, Vreeken J (2011) Summarizing categorical data by clustering attributes. Data Min Knowl Discov 26(1):130–173

Menardi G, Torelli N (2012) Training and assessing classification rules with imbalanced data. Data Min Knowl Discov 28(1):4–28. https://doi.org/10.1007/s10618-012-0295-5

Mukhopadhyay A, Maulik U, Bandyopadhyay S (2015) A survey of multiobjective evolutionary clustering. ACM Comput Surv 47(4):1–46

Pei Y, Fern XZ, Tjahja TV, Rosales R (2016) ‘Comparing clustering with pairwise and relative constraints: a unified framework. ACM Trans Knowl Discov Data 11:2

Rafalak M, Deja M, Wierzbicki A, Nielek R, Kakol M (2016) Web content classification using distributions of subjective quality evaluations. ACM Trans Web 10:4

Reddy D, Jana PK (2014) A new clustering algorithm based on Voronoi diagram. Int J Data Min Model Manag 6(1):49–64

Rustogi S, Sharma M, Morwal S (2017) Improved Parallel Apriori Algorithm for Multi-cores. IJ Inf Technol Comput Sci 4:18–23

Shah-Hosseini H (2013) Improving K-means clustering algorithm with the intelligent water drops (IWD) algorithm. Int J Data Min Model Manag 5(4):301–317

Silva JA, Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv 46(1):1–31

Silva A, Antunes C (2014) Multi-relational pattern mining over data streams. Data Min Knowl Discov 29(6):1783–1814. https://doi.org/10.1007/s10618-014-0394-6

Sim K, Gopalkrishnan V, Zimek A, Cong G (2012) A survey on enhanced subspace clustering. Data Min Knowl Discov 26(2):332–397

Sohrabi MK, Roshani R (2017) Frequent itemset mining using cellular learning automata. Comput Hum Behav 68:244–253

Craw Susan, Wiratunga Nirmalie, Rowe Ray C (2006) Learning adaptation knowledge to improve case-based reasoning. Artif Intell 170:1175–1192

Tan KC, Teoh EJ, Yua Q, Goh KC (2009) A hybrid evolutionary algorithm for attribute selection in data mining. Expert Syst Appl 36(4):8616–8630

Tew C, Giraud-Carrier C, Tanner K, Burton S (2013) Behavior-based clustering and analysis of interestingness measures for association rule mining. Data Min Knowl Discov 28(4):1004–1045

Wang L, Dong M (2015) Exemplar-based low-rank matrix decomposition for data clustering. Data Min Knowl Discov 29:324–357

Wang F, Sun J (2014) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Discov 29:534–564

Wang B, Rahal I, Dong A (2011) Parallel hierarchical clustering using weighted confidence affinity. Int J Data Min Model Manag 3(2):110–129

Zacharis NZ (2018) Classification and regression trees (CART) for predictive modeling in blended learning. IJ Intell Syst Appl 3:1–9

Zhang W, Li R, Feng D, Chernikov A, Chrisochoides N, Osgood C, Ji S (2015) Evolutionary soft co-clustering: formulations, algorithms, and applications. Data Min Knowl Discov 29:765–791

Han J, Fu Y (1996) Exploration of the power of attribute-oriented induction in data mining. Adv Knowl Discov Data Min. AAAI/MIT Press, pp 399-421

Gupta A, Mumick IS (1995) Maintenance of materialized views: problems, techniques, and applications. IEEE Data Eng Bull 18(2):3

Sawant V, Shah K (2013) A review of distributed data mining using agents. Int J Adv Technol Eng Res 3(5):27–33

Gupta MK, Chandra P (2019) An efficient approach for selection of initial cluster centroids for k-means clustering algorithm. In: Proceedings international conference on recent developments in science engineering and technology (REDSET-2019), November 15–16 2019

Gupta MK, Chandra P (2019) MP-K-means: modified partition based cluster initialization method for k-means algorithm. Int J Recent Technol Eng 8(4):1140–1148

Gupta MK, Chandra P (2019) HYBCIM: hypercube based cluster initialization method for k-means. IJ Innov Technol Explor Eng 8(10):3584–3587. https://doi.org/10.35940/ijitee.j9774.0881019

Article Google Scholar

Enke David, Thawornwong Suraphan (2005) The use of data mining and neural networks for forecasting stock market returns. Expert Syst Appl 29:927–940

Mezyk Edward, Unold Olgierd (2011) Machine learning approach to model sport training. Comput Hum Behav 27:1499–1506

Esling P, Agon C (2012) Time-series data mining. ACM Comput Surv 45(1):1–34

Hüllermeier Eyke (2005) Fuzzy methods in machine learning and data mining: status and prospects. Fuzzy Sets Syst 156:387–406

Hullermeier Eyke (2011) Fuzzy sets in machine learning and data mining. Appl Soft Comput 11:1493–1505

Gengshen Du, Ruhe Guenther (2014) Two machine-learning techniques for mining solutions of the ReleasePlanner™ decision support system. Inf Sci 259:474–489

Smith Kate A, Gupta Jatinder ND (2000) Neural networks in business: techniques and applications for the operations researcher. Comput Oper Res 27:1023–1044

Huang Mu-Jung, Tsou Yee-Lin, Lee Show-Chin (2006) Integrating fuzzy data mining and fuzzy artificial neural networks for discovering implicit knowledge. Knowl Based Syst 19:396–403

Padhraic S (2000) Data mining: analysis on grand scale. Stat Method Med Res 9(4):309–327. https://doi.org/10.1191/096228000701555181

Article MATH Google Scholar

Saeed S, Ali M (2012) Privacy-preserving back-propagation and extreme learning machine algorithms. Data Knowl Eng 79–80:40–61

Singh Y, Bhatia PK, Sangwan OP (2007) A review of studies on machine learning techniques. Int J Comput Sci Secur 1(1):70–84

Yahia ME, El-taher ME (2010) A new approach for evaluation of data mining techniques. Int J Comput Sci Issues 7(5):181–186

Jackson J (2002) Data mining: a conceptual overview. Commun Assoc Inf Syst 8:267–296

Heckerman D (1998) A tutorial on learning with Bayesian networks. Learning in graphical models. Springer, Netherlands, pp 301–354

Politano PM, Walton RO (2017) Statistics & research methodol. Lulu. com

Wetherill GB (1987) Regression analysis with application. Chapman & Hall Ltd, UK

Anderberg MR (2014) Cluster analysis for applications: probability and mathematical statistics: a series of monographs and textbooks, vol 19. Academic Press, USA

Mihoci A (2017) Modelling limit order book volume covariance structures. In: Hokimoto T (ed) Advances in statistical methodologies and their application to real problems. IntechOpen, Croatia. https://doi.org/10.5772/66152

Chapter Google Scholar

Thompson B (2004) Exploratory and confirmatory factor analysis: understanding concepts and applications. American Psychological Association, Washington, DC (ISBN:1-59147-093-5)

Kuzey C, Uyar A, Delen (2014) The impact of multinationality on firm value: a comparative analysis of machine learning techniques. Decis Support Syst 59:127–142

Chan Philip K, Salvatore JS (1997) On the accuracy of meta-learning for scalable data mining. J Intell Inf Syst 8:5–28

Tsai Chih-Fong, Hsu Yu-Feng, Lin Chia-Ying, Lin Wei-Yang (2009) Intrusion detection by machine learning: a review. Expert Syst Appl 36:11994–12000

Liao SH, Chu PH, Hsiao PY (2012) Data mining techniques and applications—a decade review from 2000 to 2011. Expert Syst Appl 39:11303–11311

Kanevski M, Parkin R, Pozdnukhov A, Timonin V, Maignan M, Demyanov V, Canu S (2004) Environmental data mining and modelling based on machine learning algorithms and geostatistics. Environ Model Softw 19:845–855

Jain N, Srivastava V (2013) Data mining techniques: a survey paper. Int J Res Eng Technol 2(11):116–119

Baker RSJ (2010) Data mining for education. In: McGaw B, Peterson P, Baker E (eds) International encyclopedia of education, 3rd edn. Elsevier, Oxford, UK

Lew A, Mauch H (2006) Introduction to data mining and its applications. Springer, Berlin

Mukherjee S, Shaw R, Haldar N, Changdar S (2015) A survey of data mining applications and techniques. Int J Comput Sci Inf Technol 6(5):4663–4666

Data mining examples: most common applications of data mining (2019). https://www.softwaretestinghelp.com/data-mining-examples/ . Accessed 27 Dec 2019

Devi SVSG (2013) Applications and trends in data mining. Orient J Comput Sci Technol 6(4):413–419

Data mining—applications & trends. https://www.tutorialspoint.com/data_mining/dm_applications_trends.htm

Keleş MK (2017) An overview: the impact of data mining applications on various sectors. Tech J 11(3):128–132

Top 14 useful applications for data mining. https://bigdata-madesimple.com/14-useful-applications-of-data-mining/ . Accessed 20 Aug 2014

Yang Q, Wu X (2006) 10 challenging problems in data mining research. Int J Inf Technol Decis Making 5(4):597–604

Padhy N, Mishra P, Panigrahi R (2012) A survey of data mining applications and future scope. Int J Comput Sci Eng Inf Technol 2(3):43–58

Gibert K, Sanchez-Marre M, Codina V (2010) Choosing the right data mining technique: classification of methods and intelligent recommendation. In: International Congress on Environment Modelling and Software Modelling for Environment’s Sake, Fifth Biennial Meeting, Ottawa, Canada

Download references

Author information

Authors and affiliations.

University School of Information, Communication and Technology, Guru Gobind Singh Indraprastha University, Sector-16C, Dwarka, Delhi, 110078, India

Manoj Kumar Gupta & Pravin Chandra

Corresponding author

Correspondence to Manoj Kumar Gupta.

About this article

Gupta, M.K., Chandra, P. A comprehensive survey of data mining. Int. j. inf. tecnol. 12, 1243–1257 (2020). https://doi.org/10.1007/s41870-020-00427-7

Received: 29 June 2019

Accepted: 20 January 2020

Published: 06 February 2020

Issue Date: December 2020

DOI: https://doi.org/10.1007/s41870-020-00427-7

Data mining techniques
Data mining tasks
Data mining applications
Classification
TowardsMachineLearning

Process to write Research Papers and Journals for Data Scientists

What is a Research Paper?

A research paper is an expanded essay that presents your own interpretation or evaluation or argument. When you write an essay, you use everything that you personally know and have thought about a subject. When you write a research paper you build upon what you know about the subject and make a deliberate attempt to find out what experts know. A research paper involves surveying a field of knowledge in order to find the best possible information in that field. And that survey can be orderly and focused, if you know how to approach it.

Let’s look at various reasons to publish a research paper

Recognition
Discoverability
Reusability
Accessibility
Peer-review

Writing a research paper – where to start?

Publishing an article in a journal

Think about your goal:

Get cited by other researchers
Add a novel solution or perspective to the literature
Influence practice or policy
Identifying (and reading!) target journals first is important
Journals have different:
Aims and scope
Style guides and formats
Ethical guidelines
Data sharing requirements

Top 20 Journals for Data Scientists

IEEE Access
Springer International Journal of Data Science and Analytics
CODATA Data Science Journal
Computational Statistics & Data Analysis by IASC (International Association for Statistical Computing)
Journal of Big Data
Foundations and Trends in Machine Learning
International Journal of Business Intelligence and Data Mining
SIGKDD Explorations
Knowledge and Information Systems (KAIS)
Data Mining and Knowledge Discovery
Springer Machine Learning
Elsevier Pattern Recognition
Elsevier Neurocomputing
Elsevier Information Sciences
IEEE Transactions on Neural Networks and Learning Systems
IEEE Transactions on Knowledge and Data Engineering
Springer International Journal of Computer Vision
Journal of Machine Learning Research (The MIT Press)

Identifying a Target Journal

Some Terms to Remember

Open Access Publication

Open access (OA) refers to freely available, digital, online information. Open access scholarly literature is free of charge and often carries less restrictive copyright and licensing barriers than traditionally published works, for both the users and the authors.
While OA is a newer form of scholarly publishing, many OA journals comply with well-established peer-review processes and maintain high publishing standards.

Subscription based Publication

Subscription-based journals typically require readers to pay for the content that they read. This has been the model followed traditionally in the academic publishing industry. Since individuals or institutions have to pay subscription charges, only those who have purchased a subscription to the journal, or have purchased the individual article, will have access to your paper.

Article Processing Charges (APC)

An article processing charge (APC), also known as a publication fee, is a fee which is sometimes charged to authors to make a work available open access in either an open access journal or a hybrid journal. This fee may be paid by the author, the author’s institution, or their research funder. Some publishers waive the fee in cases of hardship or geographic location, but this is not a widespread practice.

ORCID (Open Researcher and Contributor ID) is a nonproprietary alphanumeric code that uniquely identifies authors and contributors of scholarly communication. The name also refers to ORCID’s website and services, which can be used to look up authors and their bibliographic output (and other user-supplied pieces of information).
This addresses the problem that a particular author’s contributions to the scientific literature or publications in the humanities can be hard to recognize: most personal names are not unique, they can change (such as with marriage), they differ culturally in name order, they are abbreviated inconsistently, and they are written in different writing systems. ORCID provides a persistent identity for people, similar to a tax ID number and analogous to the identifiers that digital object identifiers (DOIs) create for content-related entities on digital networks.
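
Because an ORCID iD is a structured code, a mistyped one can often be caught before submission. The last of its 16 characters is a check character computed from the first 15 digits with the ISO 7064 MOD 11-2 scheme, and it may be the letter X. The sketch below shows that calculation; the function names `orcid_check_digit` and `is_valid_orcid` are made up for this example, and the code only checks the format and checksum, not whether the iD is actually registered.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if a 16-character ORCID iD (hyphens optional) has a correct check character."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    if not all(ch in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False (last character altered)
```

A check like this only catches typing errors; confirming that an iD belongs to a given author still requires looking it up on the ORCID website.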

Peer Review

You will have to pass through a round of peer review before your paper is published.
Peer review is often very difficult and time-consuming, but essential to refining your work.
Sometimes you may not even make it as far as peer review: an editor may reject your paper (‘desk rejection’) because it does not fit the scope of the journal.
This is why it’s important to read the ‘Aims and Scope’ and ‘Instructions to Authors’ pages.

Embargo Period

When an author decides to self-archive their research outputs in an Open Access repository, they need to observe any embargo period the publisher might have in place. In practice, this means that the author submits the research article to the institutional repository, but only the metadata is made openly accessible during the embargo period. Once the embargo period expires, the full-text article is released to the open web. Embargo periods vary from a few months to a few years.

Cross-reference

A cross-reference is a reference to information located somewhere else in the same document.
In both printed and online dictionaries, cross-references are important because they form a network of relations between different parts of the data, both dictionary-internal and dictionary-external. In an index, a cross-reference is often denoted by “See also”.

Things to Remember…

Carry out your research in a proper way.
Make sure you are submitting to the most suitable journal.
Understand the costs and funding options.
Make sure your manuscript is accurate and readable.
Understand the copyright agreement.
Mention all your sources and the data sets you use.
Don’t just copy and paste. Your work should be original!
If you take help from books or websites, list them in the references section.
Follow the journal’s required order for the table of contents.
An article should be published in only one journal.
Journals follow a thorough peer-review process.
Journals use anti-plagiarism software.
If your article contains images, make sure they are of high quality.
If you have videos related to your research paper, submit them as supplementary material.
Journals set size limits for images and videos.
Check the metrics of the journal to which you are submitting your article.

Table of Contents

Introduction
Related Literature
Preliminaries
Research Issues
Applications
Suggestions for future works
Availability of data and materials
Abbreviations
Acknowledgements
Additional Information

Submission checklist

Manuscript as a PDF in double-column, single-spaced format using one of the required article templates. Templates for formats such as Microsoft Word and LaTeX are provided on the journal’s official site.
Author lists should be carefully considered before submission. (Changes to the author list after acceptance are not allowed.)
Corresponding authors are required to have an ORCID ID associated with their account in ScholarOne Manuscripts.
Short biographies are required for ALL authors. As per the required templates, biographies of all authors should be added to the article below the references.
All authors should be listed on both the manuscript PDF as well as on the author list within ScholarOne Manuscripts.
The article should be thoroughly reviewed for proper grammar before being submitted. Articles with poor grammar will be immediately rejected.
All research works should be carefully referenced.
The article should not be submitted elsewhere at the same time.
Supplemental material (if any)
Manuscript keywords (minimum of 3 and maximum of 10). Please select the keywords carefully, as the journal uses them to choose a relevant Associate Editor to manage the peer review of your article.
Some journals let you name opposed reviewers (if any).
Video and Images (Please look at the size permitted by the Journal)
If the article was previously rejected after peer review with encouragement to update and resubmit, then a complete “list of updates” must be included in a separate document. The list of updates should have the following regarding each comment: 1) reviewers’ concerns, 2) authors’ response to the concerns, 3) actual changes implemented.
Some journals have page limits, while others have no such restriction. Even so, it is strongly recommended to keep your paper within 50 pages.
Manuscripts follow a logical, standard structure in the following order: abstract (no more than 250 words), introduction, materials and methods, results (in case of a research paper), discussion.
You should cite your data source for reference.
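
Two of the checklist items above are numeric limits that can be checked mechanically before submission: the abstract length (no more than 250 words) and the keyword count (3 to 10). The following is a minimal Python sketch of such a pre-submission check. The function name `check_manuscript` and its parameters are hypothetical and not part of any journal's tooling, and the limits are simply the ones quoted in this checklist; a given journal may set different ones.

```python
# Hypothetical pre-submission check; the limits are taken from the checklist above.
MAX_ABSTRACT_WORDS = 250
MIN_KEYWORDS = 3
MAX_KEYWORDS = 10


def check_manuscript(abstract: str, keywords: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # Count whitespace-separated words in the abstract.
    word_count = len(abstract.split())
    if word_count > MAX_ABSTRACT_WORDS:
        problems.append(
            f"abstract has {word_count} words (limit {MAX_ABSTRACT_WORDS})"
        )

    # The checklist asks for between 3 and 10 keywords.
    if not MIN_KEYWORDS <= len(keywords) <= MAX_KEYWORDS:
        problems.append(
            f"{len(keywords)} keywords given (need {MIN_KEYWORDS} to {MAX_KEYWORDS})"
        )

    return problems
```

Calling `check_manuscript(abstract_text, ["data mining", "regression", "prediction"])` returns an empty list when both limits are met. Word counting by whitespace is a simplification; a journal's submission system may count differently.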

Stages of peer review from submission to accept/reject decision

Anti-plagiarism Checking

Upon submission, each article is automatically run through plagiarism software to check for similarity to other published work, including the author’s own work. If any issues are detected, the article will be immediately rejected, and the finding may lead to further investigation and/or corrective action.
A combination of pre-screening and open access is the best possible defence against plagiarism. All articles submitted to journals are automatically screened for plagiarism by the CrossCheck system from CrossRef. This system compares incoming articles to a large database of academic content and alerts editors to any possible issues.

Rigorous Peer Review

The journal ensures that all research output, in both journals and books, is thoroughly peer-reviewed by external reviewers, and offers the option of open peer review if required. Publications of a commentary or opinion nature may not be sent for external peer review but will undergo extensive editorial review and revision. Through membership of the Committee on Publication Ethics (COPE), editors and editorial boards receive ethical guidance and professional networking. All journals adhere to the COPE guidelines for best practice.

Open Licenses

All content is released under open licenses from Creative Commons. They believe that only CC BY meets the requirements for true open access for books and journals, and strongly prefer CC0 for open research data.

Research Data

All journals and books strongly encourage authors to make the research objects associated with their publications openly available. This includes research data, software, bioresources and methodologies. This means that peer reviewers are able to better assess the foundations of claims made, and the research community and wider public are able to similarly validate authors’ work, and are more easily able to extend and build upon it.

All content is indexed with CrossRef and assigned a Digital Object Identifier (DOI). This means that all of their references are made available so that citations can be tracked by the publishing community, and the content is added to the CrossCheck anti-plagiarism database.

Citation of an article using DOI

With Online First publication, the time it takes for critical discoveries to reach the research community is significantly reduced.
Online First lets users access peer reviewed articles well before print publication. These articles are searchable and citable by their DOI (Digital Object Identifier). Formatting depends on the reference style of the journal, but in principle you can simply use the DOI instead of volume and page numbers, as in the following example:

Gamelin FX, Baquet G, Berthoin S, Thevenet D, Nourry C, Nottin S, Bosquet L (2009) Effect of high intensity intermittent training on heart rate variability in prepubescent children. Eur J Appl Physiol. doi: 10.1007/s00421-008-0955-8
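
The citation pattern in the example above (authors, year, title, abbreviated journal name, then the DOI in place of volume and page numbers) can be assembled programmatically. The sketch below is only an illustration of that one pattern: the function names `format_online_first_citation` and `doi_url` are hypothetical, and the exact layout depends on the reference style of the journal. The `https://doi.org/` prefix is the standard DOI resolver address.

```python
def format_online_first_citation(
    authors: list[str], year: int, title: str, journal_abbrev: str, doi: str
) -> str:
    """Build a citation that uses the DOI instead of volume and page numbers."""
    author_part = ", ".join(authors)
    return f"{author_part} ({year}) {title}. {journal_abbrev}. doi: {doi}"


def doi_url(doi: str) -> str:
    """Return the resolver link for a DOI."""
    return f"https://doi.org/{doi}"


# Rebuild the example citation from the text above.
citation = format_online_first_citation(
    authors=[
        "Gamelin FX", "Baquet G", "Berthoin S", "Thevenet D",
        "Nourry C", "Nottin S", "Bosquet L",
    ],
    year=2009,
    title=(
        "Effect of high intensity intermittent training on heart rate "
        "variability in prepubescent children"
    ),
    journal_abbrev="Eur J Appl Physiol",
    doi="10.1007/s00421-008-0955-8",
)
print(citation)
print(doi_url("10.1007/s00421-008-0955-8"))
```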

After you submit your final files, you will automatically be directed to the Electronic Copyright Form. Because many journals are open access, their articles are published under either the Creative Commons Attribution License (CC BY) or the Creative Commons Attribution-NonCommercial-NoDerivatives License (CC BY-NC-ND).
CC BY and CC BY-NC-ND allow authors to retain copyright of their article and permit a very broad range of reuse. Under CC BY, reuse for commercial purposes or to create derivative works is permitted, whereas under CC BY-NC-ND the work cannot be reused for commercial purposes or changed in any way. Under the CC BY licenses, authors are responsible for protecting their content from possible abuses such as infringement and plagiarism.
Authors will be asked to select CC BY or CC BY-NC-ND prior to publication.
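
The difference between the two licenses described above comes down to two yes/no properties: whether commercial reuse is allowed and whether derivative works are allowed. The following Python sketch encodes only those two properties as stated in the text; the names `LICENSE_TERMS` and `reuse_allowed` are hypothetical, and attribution requirements and all other license terms are deliberately left out.

```python
# Only the two properties discussed above; everything else about the licenses is omitted.
LICENSE_TERMS = {
    "CC BY": {"commercial": True, "derivatives": True},
    "CC BY-NC-ND": {"commercial": False, "derivatives": False},
}


def reuse_allowed(license_name: str, commercial: bool, makes_derivative: bool) -> bool:
    """Check an intended reuse against the two properties of the chosen license."""
    terms = LICENSE_TERMS[license_name]
    if commercial and not terms["commercial"]:
        return False
    if makes_derivative and not terms["derivatives"]:
        return False
    return True
```

For instance, `reuse_allowed("CC BY-NC-ND", commercial=True, makes_derivative=False)` evaluates to `False`, while the same call with `"CC BY"` evaluates to `True`.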

Publication

Once you have completed all the steps and submitted your copyright information, open access journals will email you to settle your APC (Article Processing Charge) by credit card, debit card, or any other payment method the journal accepts.
Waivers and discounts vary between journals. Please refer to each journal’s site for details.
Once you have completed the payment and the previous steps, your article is ready to be published.
Different journals have their own metrics for publishing your article.

After publication

Citation Alert

Your article makes an impact. Now they’ll tell you about it!

Article-level metrics

Explore how your article is being cited, used and discussed on the social web!

Marketing to worldwide audiences

Their global marketing approach centers on customer channels, treating each of their audience groups in a way that addresses its unique needs.

Promote your article to achieve a bigger impact for your work

To spread the news about your research and your publications effectively, it is important to become an active and trusted member of online communities. Your visibility will increase, and this can lead to more downloads and citations for your articles.
You can also promote your article via your own email lists, online links, listservs, distribution at conferences and any other innovative techniques you wish to adopt.

Helpful Competition

The CODATA Connect Early Career and Alumni Network, in collaboration with the CODATA Data Science Journal, organized an Essay Competition for Early Career Researchers (ECRs), defined as undergraduate, graduate, or postgraduate students, or researchers within five years of completing their highest qualification.
Topic: “Open data challenges to address Global and Societal issues”
It ran for 3 months.

Some Scenario Based Questions

What should I do if my article is rejected?

First, don’t get demotivated.
Focus on the improvements your editor has asked for.
Address all of the editor’s concerns and resubmit your article.

What if I have my own website, have written articles on it, and want to include that information in my research paper?

Add this information to the Additional Information part, as listed in the Table of Contents.
You can also email the journal editor to let them know about it.

How can I check for plagiarism on my own before submitting my article?

Various anti-plagiarism tools are available online. You can use them to check the percentage of plagiarism before submitting your article to the journal editor for publication.

Which sites should I refer to when selecting a proper journal for my research paper?

I recommend referring to the journal’s official site for all information.
If you refer to any other site, beware of false information.

Review Paper on Data Mining Techniques and Applications

International Journal of Innovative Research in Computer Science & Technology (IJIRCST), Volume-7, Issue-2, March 2019

5 Pages Posted: 2 Mar 2020

GVMGC Sonipat

Date Written: MARCH 31, 2019

Data mining is the process of extracting hidden and useful patterns and information from data. Data mining is a new technology that helps businesses predict future trends and behaviors, allowing them to make proactive, knowledge-driven decisions. The aim of this paper is to show the process of data mining and how it can help decision makers make better decisions. In practice, data mining is useful for any organization that has a huge amount of data. Data mining helps regular databases perform faster. It also helps increase profit, because of the correct decisions made with its help. This paper shows the various steps performed during the process of data mining and how various industries can use it to get better answers from huge amounts of data.

Keywords: Data Mining, Regression, Time Series, Prediction, Association

Anshu (Contact Author)

Data Mining: Recently Published Documents

Distance Based Pattern Driven Mining for Outlier Detection in High Dimensional Big Dataset

Detection of outliers or anomalies is one of the vital issues in pattern-driven data mining. Outlier detection detects the inconsistent behavior of individual objects. It is an important area of the data mining field with several applications, such as detecting credit card fraud, discovering hacking, and uncovering criminal activities. It is necessary to develop tools that uncover the critical information embedded in extensive data. This paper investigates a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying the clusters and outliers in datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, like instant irregular sets of clusters (C) and outliers (O), to boost the results. The results obtained after applying the algorithm to the dataset improved in terms of several parameters. For the comparative analysis, the accurate average value and the recall value are computed. The accurate average value is 74.05% for the existing COID algorithm and 77.21% for the proposed algorithm. The average recall value is 81.19% for the existing algorithm and 89.51% for the proposed algorithm, which shows that the proposed method performs better than the existing COID algorithm.
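
To put the reported figures side by side, the improvement can be expressed both in percentage points and relative to the baseline. The Python sketch below does only that arithmetic on the four numbers quoted in the abstract above; the function names are hypothetical, and nothing here reproduces the paper's method or data.

```python
def percentage_point_gain(baseline: float, proposed: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(proposed - baseline, 2)


def relative_gain(baseline: float, proposed: float) -> float:
    """Improvement expressed as a percentage of the baseline value."""
    return round(100 * (proposed - baseline) / baseline, 2)


# Figures quoted in the abstract: existing COID algorithm vs. proposed algorithm.
reported = {
    "accurate average value": (74.05, 77.21),
    "average recall value": (81.19, 89.51),
}

for name, (existing, proposed) in reported.items():
    print(
        name,
        percentage_point_gain(existing, proposed),
        relative_gain(existing, proposed),
    )
```

The two measures answer different questions: percentage points give the raw gap, while the relative gain shows how large that gap is compared with where the baseline started.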

Implementation of Data Mining Technology in Bonded Warehouse Inbound and Outbound Goods Trade

For the taxed goods, the actual freight is generally determined by multiplying the allocated freight for each KG and actual outgoing weight based on the outgoing order number on the outgoing bill. Considering the conventional logistics is insufficient to cope with the rapid response of e-commerce orders to logistics requirements, this work discussed the implementation of data mining technology in bonded warehouse inbound and outbound goods trade. Specifically, a bonded warehouse decision-making system with data warehouse, conceptual model, online analytical processing system, human-computer interaction module and WEB data sharing platform was developed. The statistical query module can be used to perform statistics and queries on warehousing operations. After the optimization of the whole warehousing business process, it only takes 19.1 hours to get the actual freight, which is nearly one third less than the time before optimization. This study could create a better environment for the development of China's processing trade.

Multi-objective economic load dispatch method based on data mining technology for large coal-fired power plants

User activity classification and domain-wise ranking through social interactions.

Twitter has gained significant prevalence among users across numerous domains, in the majority of countries, and among different age groups. It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes by exposing open APIs, which makes it the most suitable source of data for social media analytics. Applying data mining and machine learning techniques to tweets is gaining more and more interest. The most prominent challenge in social media analytics is to automatically identify and rank influencers. This research aims to detect users' topics of interest in social media and rank users based on specific topics, domains, etc. A few hybrid parameters are also identified in this research, based on the post's content, the post’s metadata, the user’s profile, and the user's network features, to capture different aspects of being influential; these are used in the ranking algorithm. The results show that the proposed approach is effective in both the classification and ranking of individuals in a cluster.

A data mining analysis of COVID-19 cases in states of United States of America

Epidemic diseases can be extremely dangerous because of their hazardous effects. They may have negative effects on economies, businesses, the environment, humans, and the workforce. In this paper, some of the factors that are interrelated with the COVID-19 pandemic are examined using data mining methodologies and approaches. As a result of the analysis, some rules and insights were discovered and the performance of the data mining algorithms was evaluated. According to the analysis results, the JRip algorithm had the highest correct classification rate and the lowest root mean squared error (RMSE). Considering the classification rate and the RMSE measure, JRip can be considered an effective method for understanding the factors related to deaths caused by the coronavirus.
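
The abstract above compares algorithms by correct classification rate and root mean squared error (RMSE) but does not show how either is computed. As a reminder of the standard definitions only, here is a minimal Python sketch; the function names are hypothetical, and this is not the evaluation code used in the paper.

```python
import math


def rmse(actual: list[float], predicted: list[float]) -> float:
    """Root mean squared error: square root of the mean squared difference."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def correct_classification_rate(actual: list[str], predicted: list[str]) -> float:
    """Fraction of instances whose predicted label matches the actual label."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)
```

A lower RMSE and a higher correct classification rate both indicate a better fit, which is why the abstract reports the two together.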

Exploring distributed energy generation for sustainable development: A data mining approach

A comprehensive guideline for Bengali sentiment annotation.

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and an Information Extraction (IE) task that primarily aims to obtain the writer’s feelings expressed in positive or negative by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of a given content as Positive, Negative, or Neutral . Although extensive research has been conducted in this area of computational linguistics, most of the research work has been carried out in the context of English language. However, Bengali sentiment expression has varying degree of sentiment labels, which can be plausibly distinct from English language. Therefore, sentiment assessment of Bengali language is undeniably important to be developed and executed properly. In sentiment analysis, the prediction potential of an automatic modeling is completely dependent on the quality of dataset annotation. Bengali sentiment annotation is a challenging task due to diversified structures (syntax) of the language and its different degrees of innate sentiments (i.e., weakly and strongly positive/negative sentiments). Thus, in this article, we propose a novel and precise guideline for the researchers, linguistic experts, and referees to annotate Bengali sentences immaculately with a view to building effective datasets for automatic sentiment prediction efficiently.

Capturing Dynamics of Information Diffusion in SNS: A Survey of Methodology and Techniques

Studying information diffusion in SNS (Social Networks Service) has remarkable significance in both academia and industry. Theoretically, it boosts the development of other subjects such as statistics, sociology, and data mining. Practically, diffusion modeling provides fundamental support for many downstream applications (e.g., public opinion monitoring, rumor source identification, and viral marketing). Tremendous efforts have been devoted to this area to understand and quantify information diffusion dynamics. This survey investigates and summarizes the emerging distinguished works in diffusion modeling. We first put forward a unified information diffusion concept in terms of three components: information, user decision, and social vectors, followed by a detailed introduction of the methodologies for diffusion modeling. And then, a new taxonomy adopting hybrid philosophy (i.e., granularity and techniques) is proposed, and we made a series of comparative studies on elementary diffusion models under our taxonomy from the aspects of assumptions, methods, and pros and cons. We further summarized representative diffusion modeling in special scenarios and significant downstream tasks based on these elementary models. Finally, open issues in this field following the methodology of diffusion modeling are discussed.

The Influence of E-book Teaching on the Motivation and Effectiveness of Learning Law by Using Data Mining Analysis

This paper studies the motivation of learning law, compares the teaching effectiveness of two different teaching methods, e-book teaching and traditional teaching, and analyses the influence of e-book teaching on the effectiveness of law by using big data analysis. From the perspective of law student psychology, e-book teaching can attract students' attention, stimulate students' interest in learning, deepen knowledge impression while learning, expand knowledge, and ultimately improve the performance of practical assessment. With a small sample size, there may be some deficiencies in the research results' representativeness. To stimulate the learning motivation of law as well as some other theoretical disciplines in colleges and universities has particular referential significance and provides ideas for the reform of teaching mode at colleges and universities. This paper uses a decision tree algorithm in data mining for the analysis and finds out the influencing factors of law students' learning motivation and effectiveness in the learning process from students' perspective.

Intelligent Data Mining based Method for Efficient English Teaching and Cultural Analysis

The emergence of online education helps greatly improve the quality of traditional English teaching. However, it only moves the teaching process from offline to online and does not really change the essence of traditional English teaching. In this work, we study an intelligent English teaching method to further improve the quality of English teaching. Specifically, a random forest is first used to analyze and extract the grammatical and syntactic features of the English text. Then, a decision-tree-based method is proposed to predict grammar or syntax issues in the English text. The evaluation results indicate that the proposed method can effectively improve the accuracy of English grammar or syntax recognition.

82 Data Mining Essay Topic Ideas & Examples

Ethical Implications of Data Mining by Government Institutions
Critics of personal data mining insist that it infringes on the rights of an individual and result to the loss of sensitive information.

Levi’s Company’s Data Mining & Customer Analytics
Levi, the renowned name in jeans is feeling the heat of competition from a number of other brands, which have come upon the scene well after Levi’s but today appear to be approaching Levi’s market […]

Data Mining and Its Major Advantages
Thus, it is possible to conclude that data mining is a convenient and effective way of processing information, which has many advantages.

Data Mining Role in Companies
The increasing adoption of data mining in various sectors illustrates the potential of the technology regarding the analysis of data by entities that seek information crucial to their operations.

Disadvantages of Using Web 2.0 for Data Mining Applications
This data can be confusing to the readers and may not be reliable. Lastly, with the use of Web 2.

The Data Mining Method in Healthcare and Education
Thus, I would use data mining in both cases; however, before that, I would discover a way to improve the algorithms used for it.

Data Mining Tools and Data Mining Myths
The first problem is correlated with keeping the identity of the person evolved in data mining secret. One of the major myths regarding data mining is that it can replace domain knowledge.

Hybrid Data Mining Approach in Healthcare
One of the healthcare projects that will call for the use of data mining is treatment evaluation. In this case, it is essential to realize that the main aim of health data mining is to […]

Terrorism and Data Mining Algorithms
However, this is a necessary evil as the nation’s security has to be prioritized since these attacks lead to harm to a larger population compared to the infringements.

Transforming Coded and Text Data Before Data Mining
However, to complete data mining, it is necessary to transform the data according to the techniques that are to be used in the process.

Data Mining and Machine Learning Algorithms
The shortest distance of string between two instances defines the distance of measure. However, this is also not very clear as to which transformations are summed, and thus it aims to a probability with the […]

Summary of C4.5 Algorithm: Data Mining
5 algorism: Each record from set of data should be associated with one of the offered classes, it means that one of the attributes of the class should be considered as a class mark.

Data Mining in Social Networks: Linkedin.com
One of the ways to achieve the aim is to understand how users view data mining of their data on LinkedIn.

Ethnography and Data Mining in Anthropology
The study of cultures is of great importance under normal circumstances to enhance the understanding of the same. Data mining is the success secret of ethnography.

Issues With Data Mining
It is necessary to note that the usage of data mining helps FBI to have access to the necessary information for terrorism and crime tracking.

Large Volume Data Handling: An Efficient Data Mining Solution
Data mining is the process of sorting huge amount of data and finding out the relevant data. Data mining is widely used for the maintenance of data which helps a lot to an organization in […]

Cryptocurrency Exchange Market Prediction and Analysis Using Data Mining and Artificial Intelligence
This paper aims to review the application of A.I. in the context of blockchain finance by examining scholarly articles to determine whether the A.I. algorithm can be used to analyze this financial market.

“Data Mining and Customer Relationship Marketing in the Banking Industry” by Chye & Gerry
First of all, the article generally elaborates on the notion of customer relationship management, which is defined as “the process of predicting customer behavior and selecting actions to influence that behavior to benefit the company”.

Data Mining Techniques and Applications
The use of data mining to detect disturbances in the ecosystem can help to avert problems that are destructive to the environment and to society.

Ethical Data Mining in the UAE Traffic Department
The research question identified in the assignment two is considered to be the following, namely whether the implementation of the business intelligence into the working process will beneficially influence the work of the Traffic Department […]

Canadian University Dubai and Data Mining
The aim of mining data in the education environment is to enhance the quality of education for the mass through proactive and knowledge-based decision-making approaches.

Data Mining and Customer Relationship Management
As such, CRM not only entails the integration of marketing, sales, customer service, and supply chain capabilities of the firm to attain elevated efficiencies and effectiveness in conveying customer value, but it obliges the organization […]

E-Commerce: Mining Data for Better Business Intelligence
The method allowed the use of Intel and an example to build the study and the literature on data mining for business intelligence to analyze the findings.

Data Warehouse and Data Mining in Business
The circumstances leading to the establishment and development of the concept of data warehousing was attributed to the fact that failure to have a data warehouse led to the need of putting in place large […]

Data Mining: Concepts and Methods
Speed of data mining process is important as it has a role to play in the relevance of the data mined. The accuracy of data is also another factor that can be used to measure […]

Data Mining Technologies
According to Han & Kamber, data mining is the process of discovering correlations, patterns, trends or relationships by searching through a large amount of data that in most circumstances is stored in repositories, business databases […]

Data Mining: A Critical Discussion
In recent times, the relatively new discipline of data mining has been a subject of widely published debate in mainstream forums and academic discourses, not only due to the fact that it forms a critical […]

Commercial Uses of Data Mining
Data mining process entails the use of large relational database to identify the correlation that exists in a given data. The principal role of the applications is to sift the data to identify correlations.

A Discussion on the Acceptability of Data Mining
Today, more than ever before, individuals, organizations and governments have access to seemingly endless amounts of data that has been stored electronically on the World Wide Web and the Internet, and thus it makes much […]

Applying Data Mining Technology for Insurance Rate Making: Automobile Insurance Example
Applebee’s, Travelocity and Others: Data Mining for Business Decisions
Applying Data Mining Procedures to a Customer Relationship
Business Intelligence as Competitive Tool of Data Mining
Overview of Accounting Information System Data Mining
Applying Data Mining Technique to Disassembly Sequence Planning
Approach for Image Data Mining Cultural Studies
Apriori Algorithm for the Data Mining of Global Cyberspace Security Issues
Database Data Mining: The Silent Invasion of Privacy
Data Management: Data Warehousing and Data Mining
Constructive Data Mining: Modeling Consumers’ Expenditure in Venezuela
Data Mining and Its Impact on Healthcare
Innovations and Perspectives in Data Mining and Knowledge Discovery
Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection
Linking Data Mining and Anomaly Detection Techniques
Data Mining and Pattern Recognition Models for Identifying Inherited Diseases
Credit Card Fraud Detection Through Data Mining
Data Mining Approach for Direct Marketing of Banking Products
Constructive Data Mining: Modeling Argentine Broad Money Demand
Data Mining-Based Dispatching System for Solving the Pickup and Delivery Problem
Commercially Available Data Mining Tools Used in the Economic Environment
Data Mining Climate Variability as an Indicator of U.S. Natural Gas
Analysis of Data Mining in the Pharmaceutical Industry
Data Mining-Driven Analysis and Decomposition in Agent Supply Chain Management Networks
Credit Evaluation Model for Banks Using Data Mining
Data Mining for Business Intelligence: Multiple Linear Regression
Cluster Analysis for Diabetic Retinopathy Prediction Using Data Mining Techniques
Data Mining for Fraud Detection Using Invoicing Data
Jaeger Uses Data Mining to Reduce Losses From Crime and Waste
Data Mining for Industrial Engineering and Management
Business Intelligence and Data Mining – Decision Trees
Data Mining for Traffic Prediction and Intelligent Traffic Management System
Building Data Mining Applications for CRM
Data Mining Optimization Algorithms Based on the Swarm Intelligence
Big Data Mining: Challenges, Technologies, Tools, and Applications
Data Mining Solutions for the Business Environment
Overview of Big Data Mining and Business Intelligence Trends
Data Mining Techniques for Customer Relationship Management
Classification-Based Data Mining Approach for Quality Control in Wine Production
Data Mining With Local Model Specification Uncertainty
Employing Data Mining Techniques in Testing the Effectiveness of Modernization Theory
Enhancing Information Management Through Data Mining Analytics
Evaluating Feature Selection Methods for Learning in Data Mining Applications
Extracting Formations From Long Financial Time Series Using Data Mining
Financial and Banking Markets and Data Mining Techniques
Fraudulent Financial Statements and Detection Through Techniques of Data Mining
Harmful Impact Internet and Data Mining Have on Society
Informatics, Data Mining, Econometrics, and Financial Economics: A Connection
Integrating Data Mining Techniques Into Telemedicine Systems
Investigating Tobacco Usage Habits Using Data Mining Approach

Welcome to TOP 10 research articles

Top 10 data mining papers: recommended reading - Data Mining & Knowledge Management Research

Citation Count: 85

Data mining and its applications for knowledge management: a literature review from 2007 to 2012.

Tipawan Silwattananusarn 1 and Kulthida Tuamsuk 2

1 Ph.D. Student in Information Studies Program, Khon Kaen University, Thailand and 2 Head, Information & Communication Management Program, Khon Kaen University, Thailand

Data mining is one of the most important steps of the knowledge discovery in databases process and is considered a significant subfield in knowledge management. Research in data mining is expected to keep growing in business and in learning organizations over the coming decades. This review paper explores the applications of data mining techniques that have been developed to support the knowledge management process. Journal articles indexed in the ScienceDirect database from 2007 to 2012 are analyzed and classified. The discussion of the findings is divided into four topics: (i) knowledge resource; (ii) knowledge types and/or knowledge datasets; (iii) data mining tasks; and (iv) data mining techniques and applications used in knowledge management. The article first briefly defines data mining and describes data mining functionality. Then the knowledge management rationale and the major knowledge management tools integrated into the knowledge management cycle are described. Finally, the applications of data mining techniques in the process of knowledge management are summarized and discussed.

Data mining; Data mining applications; Knowledge management

[1] An, X. & Wang, W. (2010). Knowledge management technologies and applications: A literature review . IEEE, 138-141. doi:10.1109/ICAMS.2010.5553046

[2] Berson, A., Smith, S.J. &Thearling, K. (1999). Building Data Mining Applications for CRM. New York: McGraw-Hill .

[3] Cant�, F.J. & Ceballos, H.G. (2010). A multiagent knowledge and information network approach for managing research assets . Expert Systems with Applications, 37(7), 5272-5284.doi:10.1016/j.eswa.2010.01.012

[4] Cheng, H., Lu, Y. & Sheu, C. (2009). An ontology-based business intelligence application in a financial knowledge management system .Expert Systems with Applications, 36, 3614�3622. Doi:10.1016/j.eswa.2008.02.047

[5] Dalkir, K. (2005). Knowledge Management in Theory and Practice . Boston: Butterworth-Heinemann.

[6] Dawei, J. (2011). The Application of Date Mining in Knowledge Management .2011 International Conference on Management of e-Commerce and e-Government, IEEE Computer Society, 7-9. doi:10.1109/ICMeCG.2011.58

[7] Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. (1996). From Data Mining to Knowledge Discovery in Databases.AI Magazine, 17(3), 37-54.

[8] Gorunescu, F. (2011). Data Mining: Concepts, Models, and Techniques . India: Springer.

[9] Han, J. &Kamber, M. (2012). Data Mining: Concepts and Techniques . 3rd.ed. Boston: Morgan Kaufmann Publishers.

[10] Hwang, H.G., Chang, I.C., Chen, F.J. & Wu, S.Y. (2008). Investigation of the application of KMS for diseases classifications: A study in a Taiwanese hospital . Expert Systems with Applications, 34(1), 725-733. doi:10.1016/j.eswa.2006.10.018

[11] Lavrac, N., Bohanec, M., Pur, A., Cestnik, B., Debeljak, M. &Kobler, A. (2007).Data mining and visualization for decision support and modeling of public health-care resources.Journal of Biomedical Informatics, 40, 438-447. doi:10.1016/j.jbi.2006.10.003

[12] Li, X., Zhu, Z. & Pan, X. (2010). Knowledge cultivating for intelligent decision making in small & middle businesses .Procedia Computer Science, 1(1), 2479-2488. doi:10.1016/j.procs.2010.04.280

[13] Li, Y., Kramer, M.R., Beulens, A.J.M., Van Der Vorst, J.G.A.J. (2010). A framework for early warning and proactive control systems in food supply chain networks. Computers in Industry, 61, 852�862. Doi:101.016/j.compind.2010.07.010

[14] Liao, S.H., Chen, C.M., Wu, C.H. (2008). Mining customer knowledge for product line and brand extension in retailing. Expert Systems with Applications, 34(3), 1763-1776. doi:10.1016/j.eswa.2007.01.036

[15] Liao, S. (2003). Knowledge management technologies and applications-literature review from 1995 to 2002 . Expert Systems with Applications, 25, 155-164. doi:10.1016/S0957-4174(03)00043-5

[16] Liu, D.R. & Lai, C.H. (2011). Mining group-based knowledge flows for sharing task knowledge. Decision Support Systems ,50(2), 370-386. doi:10.1016/j.dss.2010.09.004

[17] Lee, M.R. & Chen, T.T. (2011). Revealing research themes and trends in knowledge management: From 1995 to 2010. Knowledge-Based Systems.doi:10.1016/j.knosys.2011.11.016

[18] McInerney, C.R. & Koenig, M.E. (2011). Knowledge Management (KM) Processes in Organizations: Theoretical Foundations and Practice . USA: Morgan & Claypool Publishers. doi:10.2200/S00323ED1V01Y201012ICR018

[19] McInerney, C. (2002). Knowledge Management and the Dynamic Nature of Knowledge .Journal of the American Society for Information Science and Technology, 53(12), 1009-1018. doi:10.1002/asi.10109

[20] Ngai, E., Xiu, L. &Chau, D. (2009). Application of data mining techniques in customer relationship management: A literature review and classification . Expert Systems with Applications, 36, 2592- 2602. doi:10.1016/j.eswa.2008.02.021

[21] Ruggles, R.L. (ed.). (1997). Knowledge Management Tools. Boston: Butterworth-Heinemann.

[22] Sher, P.J. & Lee, V.C. (2004). Information technology as a facilitator for enhancing dynamic capabilities through knowledge management.Information & Management, 41, 933-945. doi:10.1016/j.im.2003.06.004

[23] Tseng, S.M. (2008). The effects of information technology on knowledge management systems .Expert Systems with Applications, 35, 150-160. doi:10.1016/j.eswa.2007.06.011

[24] Ur-Rahman, N. & Harding, J.A. (2012). Textual data mining for industrial knowledge management and text classification: A business oriented approach . Expert Systems with Applications, 39, 4729-4739. doi:10.1016/j.eswa.2011.09.124

[25] Wang, F. & Fan, H. (2008). Investigation on Technology Systems for Knowledge Management.IEEE, 1-4. doi:10.1109/WiCom.2008.2716

[26] Wang, H. & Wang, S. (2008). A knowledge management approach to data mining process for business intelligence. Industrial Management & Data Systems, 108(5), 622-634.

[27] Wu, W., Lee, Y.T., Tseng, M.L. & Chiang, Y.H. (2010). Data mining for exploring hidden patterns between KM and its performance.Knowledge-Based Systems, 23, 397-401. doi:10.1016/j.knosys.2010.01.014

Citation Count: 83

Analysis of heart diseases dataset using neural network approach.

K. Usha Rani

Dept. of Computer Science, Sri Padmavathi Mahila Visvavidyalayam (Women's University), Tirupati – 517502, Andhra Pradesh, India

Classification is one of the important techniques of data mining. Many real-world problems in fields such as business, science, industry and medicine can be solved using a classification approach. Neural networks have emerged as an important tool for classification, and their advantages help classify the given data efficiently. In this study, a heart disease dataset is analyzed using a neural network approach. To increase the efficiency of the classification process, a parallel approach is also adopted in the training phase.

Data mining, Classification, Neural Networks, Parallelism, Heart Disease
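
To make the idea of neural-network classification concrete, here is a minimal sketch of the simplest possible case: a single neuron (perceptron) trained with the classic perceptron learning rule. It is not the network, the dataset or the parallel training scheme used in the paper, none of which are described in the abstract. The two binary features and the labels are hypothetical and chosen only so the example is easy to follow.

```python
def predict(weights, bias, x):
    """Step activation: output 1 when the weighted sum exceeds zero."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(samples, targets, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: adjust the weights after each misclassified sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical binary indicators: (high blood pressure, high cholesterol).
patients = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Hypothetical labels: 1 = disease present, 0 = disease absent.
diagnoses = [0, 0, 0, 1]

weights, bias = train_perceptron(patients, diagnoses)
predictions = [predict(weights, bias, p) for p in patients]
print(predictions)
```

A single neuron can only separate classes with a straight line (or hyperplane); real medical datasets generally call for multi-layer networks and a held-out test set.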

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.1, No.5, September 2011

Citation Count: 80

Predicting students' performance using ID3 and C4.5 classification algorithms.

Kalpesh Adhatrao, Aditya Gaykar, Amiraj Dhawan, Rohit Jha and Vipul Honrao

Department of Computer Engineering, Fr. C.R.I.T., Navi Mumbai, Maharashtra, India

An educational institution needs approximate prior knowledge of its enrolled students to predict their future academic performance. This helps it identify promising students and gives it an opportunity to support those likely to earn lower grades. As a solution, we have developed a system that predicts students' performance from their previous results using classification techniques from data mining. We analyzed a data set containing information about students, such as gender, marks scored in the board examinations of classes X and XII, marks and rank in entrance examinations, and the first-year results of the previous batch of students. By applying the ID3 (Iterative Dichotomiser 3) and C4.5 classification algorithms to this data, we predicted the general and individual performance of freshly admitted students in future examinations.

Classification, C4.5, Data Mining, Educational Research, ID3, Predicting Performance
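
The main difference between the two algorithms named above lies in how they choose the attribute to split on: ID3 picks the attribute with the highest information gain, while C4.5 uses the gain ratio, which divides the gain by the entropy of the split itself. The sketch below computes both criteria. The student records and attribute names are hypothetical and are not taken from the paper's data set.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def partition(rows, labels, attr):
    """Group the class labels by the value each row takes for attribute `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups


def information_gain(rows, labels, attr):
    """ID3 criterion: reduction in label entropy obtained by splitting on `attr`."""
    groups = partition(rows, labels, attr)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attr):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy([row[attr] for row in rows])
    if split_info == 0:
        return 0.0  # the attribute takes a single value, so it cannot split the data
    return information_gain(rows, labels, attr) / split_info


# Hypothetical toy records, for illustration only.
students = [
    {"gender": "F", "entrance": "high"},
    {"gender": "M", "entrance": "high"},
    {"gender": "F", "entrance": "low"},
    {"gender": "M", "entrance": "low"},
]
results = ["pass", "pass", "fail", "fail"]

for attr in ("gender", "entrance"):
    print(attr, information_gain(students, results, attr), gain_ratio(students, results, attr))
```

In this toy data the entrance attribute separates the classes perfectly, so both criteria prefer it over gender. The two criteria tend to disagree when an attribute has many distinct values, which is the case the gain ratio is designed to penalize.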

Citation Count: 51

Diagnosis of diabetes using classification mining techniques.

Aiswarya Iyer, S. Jeyalatha and Ronak Sumbaly

Department of Computer Science, BITS Pilani Dubai, United Arab Emirates

Diabetes has affected over 246 million people worldwide, a majority of them women. According to the WHO report, this number is expected to rise to over 380 million by 2025. The disease has been named the fifth deadliest disease in the United States, with no imminent cure in sight. With the rise of information technology and its continued spread into the medical and healthcare sector, cases of diabetes and their symptoms are well documented. This paper aims to diagnose the disease by analyzing the patterns found in the data through classification analysis, employing Decision Tree and Naïve Bayes algorithms. The research hopes to propose a quicker and more efficient technique for diagnosing the disease, leading to timely treatment of patients.

Classification, Data Mining, Decision Tree, Diabetes and Naïve Bayes.
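
As an illustration of one of the two classifiers mentioned above, the following is a minimal sketch of a Naïve Bayes classifier for categorical features with add-one (Laplace) smoothing. It is not the authors' implementation or tooling. The feature names, values and labels are hypothetical and far too few for any real diagnosis.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature) -> counts of each value
    feature_values = defaultdict(set)    # feature -> set of values seen in training
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(label, feature)][value] += 1
            feature_values[feature].add(value)
    return class_counts, value_counts, feature_values


def classify(model, row):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for cls, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for feature, value in row.items():
            seen = value_counts[(cls, feature)][value]
            score += math.log((seen + 1) / (count + len(feature_values[feature])))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class


# Hypothetical toy records, for illustration only.
records = [
    {"glucose": "high", "bmi": "high"},
    {"glucose": "high", "bmi": "normal"},
    {"glucose": "normal", "bmi": "high"},
    {"glucose": "normal", "bmi": "normal"},
]
outcomes = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(records, outcomes)
print(classify(model, {"glucose": "high", "bmi": "normal"}))
```

The "naïve" part is the assumption that features are independent given the class, which is why the per-feature log-probabilities can simply be added.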

[1] National Diabetes Information Clearinghouse (NDIC), http://diabetes.niddk.nih.gov/dm/pubs/type1and2/#signs

[2] Global Diabetes Community, http://www.diabetes.co.uk/diabetes_care/blood-sugar-level-ranges.html

[3] Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques�, Morgan Kauffman Publishers, 2001

[4] S. Kumari and A. Singh, � A Data Mining Approach for the Diagnosis of Diabetes Mellitus �, Proceedings of Seventh lnternational Conference on Intelligent Systems and Control, 2013, pp. 373-375

[5] C. M. Velu and K. R. Kashwan, �Visual Data Mining Techniques for Classification of Diabetic Patients�, 3rd IEEE International Advance Computing Conference (IACC), 2013

[6] Sankaranarayanan.S and Dr Pramananda Perumal.T, � Predictive Approach for Diabetes Mellitus Disease through Data Mining Technologies �, World Congress on Computing and Communication Technologies, 2014, pp. 231-233

[7] Mostafa Fathi Ganji and Mohammad Saniee Abadeh, �Using fuzzy Ant Colony Optimization for Diagnosis of Diabetes Disease�, Proceedings of ICEE 2010, May 11-13, 2010

[8] T.Jayalakshmi and Dr.A.Santhakumaran, � A Novel Classification Method for Diagnosis of Diabetes Mellitus Using Artificial Neural Networks �, International Conference on Data Storage and Data Engineering, 2010, pp. 159-163

[9] Sonu Kumari and Archana Singh, �A Data Mining Approach for the Diagnosis of Diabetes Mellitus�, Proceedings of71hlnternational Conference on Intelligent Systems and Control (ISCO 2013)

[10] Neeraj Bhargava, Girja Sharma, Ritu Bhargava and Manish Mathuria, Decision Tree Analysis on J48 Algorithm for Data Mining. Proceedings of International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 6, June 2013.

[11] Michael Feld, Dr. Michael Kipp, Dr. Alassane Ndiaye and Dr. Dominik Heckmann �Weka: Practical machine learning tools and techniques with Java implementations�

[12] White, A.P., Liu, W.Z.: Technical note: Bias in information-based measures in decision tree induction . Machine Learning 15(3), 321�329 (1994)

Citation Count: 42

A new clustering approach for anomaly intrusion detection.

Ravi Ranjan and G. Sahoo

Department of Information Technology, Birla Institute of Technology, Mesra, Ranchi

Recent advances in technology have made our work easier compared to earlier times. Computer networks are growing day by day, but the security of computers and networks has always been a major concern for organizations, from small to large enterprises. Organizations are aware of the possible threats and attacks and prepare for them, but attackers still exploit loopholes to mount attacks. Intrusion detection is a major field of research, and researchers are trying to find new algorithms for detecting intrusions. Clustering techniques from data mining are an area of research interest for detecting possible intrusions and attacks. This paper presents a new clustering approach for anomaly intrusion detection that uses the K-medoids clustering method with certain modifications. The proposed algorithm achieves a high detection rate and overcomes the disadvantages of the K-means algorithm.

Clustering, data mining, intrusion detection, network security
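
For readers unfamiliar with K-medoids, the sketch below shows the plain alternating variant: assign each point to its nearest medoid, then re-pick each cluster's medoid as the member with the smallest total distance to the others. It does not include the paper's modifications, which the abstract does not describe. Taking the first k points as initial medoids and the one-dimensional toy values are assumptions made for illustration.

```python
def k_medoids(points, k, distance, max_iter=100):
    """Plain alternating K-medoids: assign points, then re-pick each cluster's medoid."""
    medoids = list(points[:k])  # assumption: use the first k points as initial medoids
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = [[] for _ in medoids]
        for p in points:
            nearest = min(range(len(medoids)), key=lambda i: distance(p, medoids[i]))
            clusters[nearest].append(p)
        # Update step: the new medoid is the member with the smallest total
        # distance to the other members of its cluster.
        new_medoids = [
            min(cluster, key=lambda c: sum(distance(c, q) for q in cluster))
            if cluster else medoids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break  # converged: medoids did not change
        medoids = new_medoids
    return medoids, clusters


# Hypothetical one-dimensional measurements with one extreme value (100).
values = [1, 2, 3, 10, 11, 12, 100]
medoids, clusters = k_medoids(values, k=2, distance=lambda a, b: abs(a - b))
print(medoids)   # [2, 11]
print(clusters)  # [[1, 2, 3], [10, 11, 12, 100]]
```

The medoid of the second cluster stays at 11, an actual data point, whereas the mean of that cluster would be 33.25 because of the extreme value. This robustness to outliers is the usual argument for preferring medoids over means.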

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.4, No.2, March 2014

Citation Count: 34

Incremental learning: areas and methods - a survey.

Prachi Joshi 1 and Parag Kulkarni 2

1 Assistant Professor, MIT College of Engineering, Pune and 2 Adjunct Professor, College of Engineering, Pune

While the application areas of data mining are growing substantially, it has become necessary for incremental learning methods to move a step ahead. The tremendous growth of unlabeled data has pushed incremental learning to take a big leap. From BI applications to image classification, from analysis to prediction, every domain needs to learn and update. Incremental learning allows new areas to be explored while knowledge is accumulated. In this paper, we discuss the areas and methods of incremental learning currently in use and highlight its potential for decision making. The paper gives an overview of current research that will provide students and research scholars with background on the topic.

Incremental, learning, mining, supervised, unsupervised, decision-making
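
The core idea of incremental learning, updating a model one example at a time without revisiting old data, can be sketched with a nearest-centroid classifier whose class means are updated with the running-mean formula mean_new = mean_old + (x - mean_old) / n. This is a generic illustration, not one of the specific methods covered by the survey. The class name and toy data are hypothetical.

```python
class IncrementalCentroidClassifier:
    """Nearest-centroid classifier that learns one example at a time."""

    def __init__(self):
        self.centroids = {}  # label -> running mean of the feature vectors
        self.counts = {}     # label -> number of examples seen so far

    def learn_one(self, x, label):
        """Update the centroid of `label` without storing past examples."""
        n = self.counts.get(label, 0) + 1
        mean = self.centroids.get(label, [0.0] * len(x))
        self.centroids[label] = [m + (xi - m) / n for m, xi in zip(mean, x)]
        self.counts[label] = n

    def predict(self, x):
        """Return the label whose centroid is closest (squared Euclidean distance)."""
        return min(
            self.centroids,
            key=lambda lbl: sum((xi - ci) ** 2 for xi, ci in zip(x, self.centroids[lbl])),
        )


# Hypothetical stream of labelled examples arriving one by one.
clf = IncrementalCentroidClassifier()
clf.learn_one([0.0, 0.0], "a")
clf.learn_one([2.0, 2.0], "a")
clf.learn_one([10.0, 10.0], "b")
print(clf.predict([9.0, 9.0]))

# A new class can be added later without retraining on earlier data.
clf.learn_one([20.0, 20.0], "c")
print(clf.predict([19.0, 19.0]))
```

Because only the means and counts are kept, memory use stays constant however many examples arrive. The trade-off is that the model cannot revise earlier decisions in ways that require the raw data.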

[1] Y. Lui, J. Cai, J. Yin, A. Fu, Clustering text data streams , Journal of Computer Science and Technology, 2008, pp 112-128.

[2] A. Fahim, G. Saake, A. Salem, F. Torky, M. Ramadan, K-means for spherical clusters with large variance in sizes , Journal of World Academy of Science, Engineering and Technology, 2008.

[3] F. Camastra, A. Verri, A novel kernel method for clustering, IEEE Transactions on Pattern Analysis and Machince Intelligence, Vol. 27, no.5, 2005, pp 801-805.

[4] F. Shen, H. Yu, Y. Kamiya, O. Hasegawa, An Online Incremental Semi-Supervised Learning Method , Journal of advanced Computational Intelligence and Intelligent Informatics, Vol. 14, No.6, 2010.

[5] T. Zhang, R. Ramakrishnan, M. Livny, Birch: An efficient data clustering method for very large databases, Proc. ACM SIGMOD Intl.Conference on Management of Data , 1996, pp.103-114.

[6] S. Deelers, S. Auwantanamongkol, Enhancing k-means algorithm with initial cluster centers derived from data partitioning along the data axis with highest variance , International Journal of Electrical and Computer Science, 2007, pp 247-252.

[7] S. Young, A. Arel, T. Karnowski, D. Rose, A Fast and Stable Incremental Clustering Algorithm , Proc. of International Conference on Information Technology New Generations, 2010, pp 204-209.

[8] M. Charikar, C. Chekuri, T. Feder, R. Motwani, Incremental clustering and dynamic information retrival, Proc. of ACM symposium on Theory of Computeion , 1997, pp 626- 635.

[9] K. Hammouda, Incremental document clustering using Cluster similarity histograms , Proc. of IEEE International Conference on Web Intelligence, 2003, pp 597- 601.

[10] X. Su, Y. Lan,R. Wan, Y. Qin, A fast incremental clustering algorithm , Proc. of International Symposium on Information Processing, 2009, pp 175-178.

[11] T. Li, HIREL: An incremental clustering for relational data sets , Proc. of IEEE International Conference on Data Mining, 2008, pp 887 � 892.

[12] P. Lin, Z. Lin, B. Kuang, P. Huang, A Short Chinese Text Incremental Clustering Algorithm Based on Weighted Semantics and Naive Bayes , Journal of Computational Information Systems, 2012, pp 4257- 4268.

[13] C. Chen, S. Hwang, Y. Oyang, An Incremental hierarchical data clustering method based on gravity theory , Proc. of PAKDD, 2002, pp 237-250.

[14] M. Ester, H. Kriegel, J. Sander, M. Wimmer, X. Xu, Incremental Clustering for Mining in a Data Warehousing Environment , Proc. of Intl. Conference on very large data bases, 1998, pp 323-333.

[15] G. Shaw, Y. Xu, Enhancing an incremental clustering algorithm for web page collections, Proc. of IEEE/ACM/WIC Joint Conference on Web Intelligence and and Intelligent Agent Technology, 2009.

[16] C. Hsu, Y. Huang, Incremental clustering of mixed data based on distance hierarchy , Journal of Expert systems and Applications, 35, 2008, pp 1177 � 1185.

[17] S. Asharaf, M. Murty, S. Shevade, Rough set based incremental clustering of interval data, Pattern Recognition Letters, Vol.27 (9), 2006, pp 515-519.

[18] Z. Li, Incremental Clustering of trajectories , Computer and Information Science, Springer 2010, pp 32-46.

[19] S. Elnekava, M. Last, O. Maimon, I ncremental clustering of mobile objects , Proc. of IEEE International Conference on Data Engineering, 2007, pp 585-592.

[20] S. Furao, A. Sudo, O. Hasegawa, An online incremental learning pattern -based reasoning system, Journal of Neural Networks, Elsevier, Vol. 23,(1), 2010.pp 135-143.

[21] S. Ferilli, M. Biba, T.Basile, F. Esposito, Incremental Machine learning techniques for document layout understanding , Proc. of IEEE Conference on Pattern Recognition, 2008, pp 1-4.

[22] S. Ozawa, S. Pang, N. Kasabov, Incremental Learning of chunk data for online pattern classification systems, IEEE Transactions on Neural Networks, Vo. 19 (6), 2008, pp 1061-1074.

[23] Z. Chen, L. Huang, Y. Murphey, Incremental learning for text document classification , Proc. of IEEE Conference on Neural Networks, 2007, pp 2592-2597. 51

[24] R. Polikar, L. Upda, S. Upda, V. Honavar, Learn ++: An incremental learning algorithm for supervised neural networks , IEEE Transactions on Systems, Man and Cybernatics, Vol.31 (4), 2001, pp 497-508.

[25] H. He, S. Chen, K. Li, X. Xu, Incremental learning from stream data, IEEE Transactions on Neural Networks , Vol.22(12), 2011, pp 1901-1914.

[26] A. Bouchachia, M. Prosseger, H. Duman, Semi supervised incremental learning, Proc. of IEEE International Conference on Fuzzy Systems, 2010 pp 1-7.

[27] R. Zhang, A. Rudnicky, A new data section principle for semi-supervised incremental learning , Computer Science department, paper 1374, 2006, http://repository.cmu.edu/compsci/1373 .

[28] Z. Li, S. Watchsmuch, J. Fritsch, G. Sagerer, Semi-supervised incremental learning of manipulative tasks, Proc. of International Conference on Machine Vision Applications, 2007, pp 73-77.

[29] A. Misra, A. Sowmya, P. Compton, Incremental learning for segmentation in medical images , Proc. of IEEE Conference on Biomedical Imaging, 2006.

[30] P. Kranen, E. Muller, I. Assent, R. Krieder, T. Seidl, Incremental Learning of Medical Data for MultiStep Patient Health Classification, Database technology for life sciences and medicine, 2010.

[31] J. Wu, B. Zhang, X. Hua, J, Zhang, A semi-supervised incremental learning framework for sports video view classification, Proc. of IEEE Conference on Multi-Media Modelling, 2006.

[32] S. Wenzel, W. Forstner, Semi supervised incremental learning of hierarchical appearance models , The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. Vol.37,2008.

[33] S. Ozawa, S. Toh, S. Abe, S. Pang, N. Kasabov, Incremental Learning for online face recognition , Proc. of IEEE Conference on Neural Networks, Vol. 5, 2005 pp 3174-3179.

[34] Z. Erdem, R. Polikar, F. Gurgen, N. Yumusak, Ensemble of SVMs for Incremental Learning , Multiple Classifier Systems, Springer Verlang,, 2005, pp 246-256.

[35] X. Yang, B. Yuan, W. Liu, Dynamic Weighting ensembles for incremental learning , Proc. of IEEE conference in pattern recognition. 2009, pp 1-5.

[36] R. Elwell, R. Polikar, Incremental Learning of Concept drift in nonstationary environments, IEEE Transactions on Neural Networks, Vol.22 (10), 2011 pp 1517- 1531.

[37] W. Khreich, E. Granger, A. Miri, R. Sabourin, A survey of techniques for incremental learning of HMM parameters , Journal of Information Science, Elsevier, 2012.

[38] O. Buffet, A. Duetch, F. Charpillet, Incremental Reinforcement Learning for designing multi-agent systems , Proc. of ACM International Conference on Autonomous Agents, 2001.

[39] E. Demidova, X. Zhou, W. Nejdl, A probabilistic scheme for keyword-based incremental query construction, IEEE Transactions on Knowledge and Data Engineering, 2012, pp 426-439.

[40] R. Roscher, W. Forestner, B. Waske, I2VM: Incremental import vector machines , Journal of Image and Vision Computing, Elsevier, 2012.

Citation Count: 33

A prototype decision support system for optimizing the effectiveness of elearning in educational institutions.

S. Abu-Naser, A. Al-Masri, Y. Abu Sultan and I. Zaqout

Al Azhar University, Gaza, Palestine

In this paper, a prototype Decision Support System (DSS) is proposed that provides the knowledge needed to optimize a newly adopted e-learning strategy in educational institutions. An educational institution that adopts e-learning as a new strategy should undertake a preliminary evaluation to determine the strategy's success rate and areas of weakness. Done manually, this evaluation is difficult and does not reveal all pitfall symptoms. The proposed DSS is based on exploring (mining) knowledge from the large amounts of data generated by the institution's operations. This knowledge can be used to guide and optimize any new business strategy implemented by the institution. The proposed DSS comprises a database engine, a data mining engine, and an artificial intelligence engine, which work together to extract the knowledge necessary to improve the effectiveness of any strategy, including e-learning.

DSS, E-learning, knowledge, Database, Data mining, Artificial Intelligence.

Citation Count: 27

Experimental study of data clustering using k-means and modified algorithms.

M.P.S Bhatia and Deepika Khurana

University of Delhi, New Delhi, India

The k-Means clustering algorithm is an old algorithm that has been intensively researched owing to its simplicity and ease of implementation. Clustering algorithms are broadly appealing and useful in exploratory data analysis. This paper presents the results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using the original k-Means and other modified algorithms implemented in MATLAB R2009b. The results are evaluated on performance measures such as the number of iterations, the number of misclassified points, accuracy, the Silhouette validity index, and execution time.

Data Mining, Clustering Algorithm, k-Means, Silhouette Validity Index.
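
To make two of the measures named in the abstract concrete, the following is a minimal Python sketch of plain Lloyd-style k-means that reports the number of iterations, together with a mean Silhouette score. It is not the authors' MATLAB code and does not reproduce any of the modified algorithms they compare; the function names `kmeans` and `silhouette`, the random initialization, and the toy points are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd-style k-means; returns (labels, centroids, iterations)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = [0] * len(points)
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Converged when no point changed cluster since the previous pass.
        if iteration > 1 and new_labels == labels:
            return labels, centroids, iteration
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids, max_iter


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
labels, centroids, iterations = kmeans(points, k=2)
print("iterations:", iterations)
print("silhouette:", round(silhouette(points, labels), 3))
```

A Silhouette score close to 1 indicates compact, well-separated clusters, which is why it can serve as a validity index when comparing initialization schemes or modified update rules on the same dataset.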

Citation Count: 26

Data, text and web mining for business intelligence: a survey.

Abdul-Aziz Rashid Al-Azmi

Department of Computer Engineering, Kuwait University, Kuwait

The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or on the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations and predicting future events from vast amounts of data. This uncovered knowledge helps in gaining competitive advantages, building better customer relationships, and even detecting fraud. In this survey, we describe how these techniques work and how they are implemented. Furthermore, we discuss how business intelligence is achieved using these mining tools. We then look into some case studies of success stories using mining tools. Finally, we present some of the main challenges to the mining technologies that limit their potential.

Citation Count: 21

Applications of data mining techniques in life insurance.

A. B. Devale 1 and R. V. Kulkarni 2

1 Arts, Commerce, Science College, Palus Dist. Sangli, Maharashtra and 2 Shahu Institute of Business Research, Kolhapur, Maharashtra

Knowledge discovery systems in financial organizations have been built and operated mainly to support decision making, using knowledge as a strategic factor. In this paper, we investigate the use of various data mining techniques for knowledge discovery in the insurance business. Existing software is inefficient at showing such data characteristics. We introduce different exhibits for discovering knowledge in the form of association rules, clustering, classification, and correlation suited to the data characteristics. With the proposed data mining techniques, the decision-maker can define the expansion of insurance activities to empower the different forces in the existing life insurance sector.

Insurance, Association rules, Clustering, Classification, Correlation, Data mining.
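
As a rough illustration of the association-rule idea mentioned in the abstract, the Python sketch below computes support and confidence for rules between pairs of items. It is not taken from the paper: the function names, the thresholds, and the toy policy records are hypothetical, and a practical system would use a full frequent-itemset algorithm rather than this pairwise scan.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        s = support(transactions, {a, b})
        if s < min_support:
            continue  # the pair is too rare to yield a useful rule
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = s / support(transactions, {lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, s, confidence))
    return rules


# Hypothetical records: the set of products held by each customer.
policies = [
    {"term_life", "rider"},
    {"term_life", "rider"},
    {"term_life"},
    {"endowment"},
    {"term_life", "rider", "endowment"},
]
for lhs, rhs, s, c in pair_rules(policies):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

Support measures how often the items occur together, and confidence measures how often the consequent appears when the antecedent does; both thresholds are tuning choices rather than values given in the source.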

COMMENTS

(PDF) Data mining techniques and applications
The paper covers all data mining techniques , algorithms and some organisations which have adopted data mining technology to have better information about business patterns. Read more Article
Adaptations of data mining methodologies: a systematic literature
The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use systematic literature review (SLR) as scientific method for two reasons. Firstly, systematic review is based on trustworthy, rigorous, and auditable methodology.
Data Mining Methods and Obstacles: A Comprehensive Analysis
Data mini ng is a step in the knowledge discovery in databases (KDD) procedure, a data science method. for gathering, processing, and conducting more in depth analyses of data. While they ...
What Is Data Mining?
Data mining is the process of extracting meaningful information from vast amounts of data. With data mining methods, organizations can discover hidden patterns, relationships, and trends in data, which they can use to solve business problems, make predictions, and increase their profits or efficiency. The term "data mining" is actually a ...
Data mining techniques and applications
This paper reviews data mining techniques and its applications such as educational data mining (EDM), finance, commerce, life sciences and medical etc. We group existing approaches to determine how the data mining can be used in different fields. Our categorization specifically focuses on the research that has been published over the period ...
(PDF) Trends in data mining research: A two-decade review using topic
The research direction related to practical Applications of data mining also shows a tendency to grow. The last two topics, Text Mining and Data Streams have attracted steady interest from ...
Writing a research paper (4)
Generally, the introduction of research papers always have more or less the same structure: PART 1 (context): The first paragraph introduces the broad context of the paper, and then progressively goes from that broad context to a more specific context. PART 2 (problem): Then, a problem is mentioned and why it must be solved.
Data Mining Research
Data mining is defined as the process of discovering hidden and potentially useful information from very large databases [4].The progress in data mining research has made it possible to implement several data mining operations efficiently on large databases. While this is surely an important contribution, we should not lose sight of the final goal of data mining - it is to enable database ...
Introduction to Data Mining
In the initial section, we discuss KDD—knowledge discovery in the database with its different phases like data cleaning, data integration, data selection and transformation, representation. In this chapter, we give a brief introduction to data mining. Comparative discussion about classification and clustering helps the end-user to distinguish ...
Data Mining: Concepts and Methods Research Paper
Data mining finds its applications in different industries due to a number of benefits that can be derived from its use. Various methods of data mining include predictive analysis, web mining, and clustering and association discovery (Han, Kamber and Pei, 2011). Get a custom research paper on Data Mining: Concepts and Methods. 192 writers online.
PDF A comprehensive survey of data mining
To take a holistic view of the research trends in the area of data mining, a comprehensive survey is presented in this paper. This paper presents a systematic and comprehensive survey of various data mining tasks and techniques. Further, various real-life applications of data mining are presented in this paper.
Process to write Research Papers and Journals for Data Scientists
Top 20 Journals for Data Scientists. IEEE Access. Springer International Journal of Data Science and Analytics. CODATA Data Science Journal. Computational Statistics & Data Analysis by IASC (International Association for Statistical Computing) Journal of Big Data. Foundations and Trends in Machine Learning.
Data Mining: Data Mining Concepts and Techniques
Data mining is a field of intersection of computer science and statistics used to discover patterns in the information bank. The main aim of the data mining process is to extract the useful information from the dossier of data and mold it into an understandable structure for future use. There are different process and techniques used to carry out data mining successfully.
345193 PDFs
Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATA MINING. Find methods information, sources, references or conduct a literature review on DATA MINING
Review Paper on Data Mining Techniques and Applications
Abstract. Data mining is the process of extracting hidden and useful patterns and information from data. Data mining is a new technology that helps businesses to predict future trends and behaviors, allowing them to make proactive, knowledge driven decisions. The aim of this paper is to show the process of data mining and how it can help ...
A Systematic Literature Review on Applying CRISP-DM Process Model
The results show that several data mining projects since 2017 are based on CRISP-DM. It is seen as the de facto standard in data mining projects, as the results of RQ1 show. This finding also coincides with references in other research disciplines close to data mining, such as big data research. CRISP-DM is still a suitable process model [32].
A sample study on applying data mining research techniques in
The purpose of this research is to present a sample study analyzing data gathered from an educational study using data mining techniques appropriate for processing these data. In order to achieve this aim, a "Computer Self-efficiency Scale" used in educational sciences was selected and this scale was applied in a study group.
data mining Latest Research Papers
Epidemic diseases can be extremely dangerous because of their harmful effects. They may have negative effects on economies, businesses, the environment, humans, and the workforce. In this paper, some of the factors that are interrelated with the COVID-19 pandemic have been examined using data mining methodologies and approaches.
82 Data Mining Essay Topic Ideas & Examples
Commercial Uses of Data Mining. The data mining process entails the use of large relational databases to identify the correlations that exist in a given data set. The principal role of the applications is to sift the data to identify correlations. A Discussion on the Acceptability of Data Mining.
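As a rough illustration of what sifting data for correlations can look like, the sketch below computes the Pearson correlation coefficient for every pair of numeric columns in a small table. It is a minimal sketch, not a method taken from any of the essays listed here: the function names `pearson` and `pairwise_correlations`, the column names, and the toy values are all assumptions invented for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(table):
    """Map each pair of column names to the correlation of their values."""
    return {
        (a, b): pearson(table[a], table[b])
        for a, b in combinations(table, 2)
    }


# Hypothetical toy table (values invented for illustration only).
toy_table = {
    "visits": [1, 2, 3, 4, 5],
    "purchases": [2, 4, 6, 8, 10],
    "returns": [5, 4, 3, 2, 1],
}

if __name__ == "__main__":
    for pair, r in pairwise_correlations(toy_table).items():
        print(pair, round(r, 3))
```

In a real setting the columns would come from a database query rather than a literal dictionary, and a strong correlation would only be a starting point for further analysis, not evidence of a causal link.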
DATA MINING: A BRIEF INTRODUCTION
Data mining should be regarded as a process. The process involves the following steps: (1) Efficient data storage and data processing. (2) Decide on the number of variables to be investigated. (3 ...
Top 10 Data Mining Papers: Recommended Reading
Data Mining and Its Applications for Knowledge Management: A Literature Review from 2007 to 2012. Tipawan Silwattananusarn (1) and Kulthida Tuamsuk (2). (1) Ph.D. Student in Information Studies Program, Khon Kaen University, Thailand; (2) Head, Information & Communication Management Program, Khon Kaen University, Thailand.