Data mining is the systematic application of statistical methods, supported by artificial intelligence, to automatically find patterns, trends, cross-connections or correlations in existing data. Data mining is often, but incorrectly, used as a synonym for "knowledge discovery in databases" (KDD). However, KDD also includes preprocessing and evaluation, so data mining is only one step within the broader KDD process.
Data mining is made necessary and motivated by big data: huge amounts of data that can be collected relatively easily with various tools but can hardly be analysed manually. To keep this knowledge from going unused, data mining is applied across industries and disciplines. In contrast to classical statistical methods, its advantage is that it does not merely test or refute manually prepared hypotheses: it also generates new hypotheses, and decision processes can be adapted and validated on that basis.
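The automatic discovery of cross-connections described above can be illustrated with a minimal frequent-pattern sketch. This is a hypothetical toy example (the basket data and the support threshold are invented for illustration): it counts how often pairs of items occur together in transactions and keeps only the pairs whose relative frequency reaches a threshold, the basic idea behind association-rule mining.

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy data: each set is one shopping basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "apples"},
    {"bread", "butter", "apples"},
    {"bread", "milk"},
]

# Count how often each pair of items occurs together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep pairs whose support (relative co-occurrence frequency)
# reaches the chosen threshold.
min_support = 0.4
frequent_pairs = {
    pair: count / len(transactions)
    for pair, count in pair_counts.items()
    if count / len(transactions) >= min_support
}
print(frequent_pairs)
```

No hypothesis about bread and butter was supplied in advance; the pattern emerges from the data itself, which is the point of the contrast with classical hypothesis testing drawn above.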
Viewed superficially, data mining and machine learning form a contrasting pair when working with large amounts of data: machine learning recognises known patterns in new data sets, whereas data mining discovers and processes previously unknown relationships (what is called "unsupervised learning" within machine learning works in a similar way). The two processes therefore cannot be completely separated, since they share many techniques, and the knowledge, rules and patterns collected with the help of data mining are in turn needed for machine learning.
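The contrast drawn above can be sketched in a few lines. This is a deliberately simplified, hypothetical example (the data, labels and gap threshold are invented): the supervised part classifies a new value against known labelled patterns, while the unsupervised part discovers groupings in unlabelled data on its own.

```python
# Supervised: labelled training data lets us classify a new value
# by the nearest class mean (a minimal nearest-centroid classifier).
labelled = {"low": [1.0, 2.0, 1.5], "high": [8.0, 9.0, 8.5]}
centroids = {label: sum(v) / len(v) for label, v in labelled.items()}

def classify(x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Unsupervised: no labels are given; a single-pass gap-based
# clustering discovers groups in the data by itself.
def cluster(values, max_gap=2.0):
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups

print(classify(1.8))
print(cluster([1.0, 1.5, 8.0, 9.0, 2.0]))
```

The clustering output (two groups the code found without labels) is exactly the kind of result that could then serve as labelled input for the supervised step, matching the statement that patterns collected by data mining feed machine learning.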
Text mining is a procedure similar to data mining. However, it is applied not to big data but to natural-language sources and documents. With the help of statistical and linguistic methods, text mining software extracts structures, patterns, contexts of meaning and core information that help the user grasp the essential content of a text without having to read it completely. These processes are largely automated.
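One common statistical method for surfacing such core information is TF-IDF weighting, which scores a term highly when it is frequent in one document but rare across the corpus. The following sketch uses an invented three-document mini-corpus and naive whitespace tokenisation, so it is an illustration of the idea rather than production text mining:

```python
import math

# Hypothetical mini-corpus; each string stands in for one document.
docs = [
    "data mining finds patterns in large data sets",
    "text mining extracts information from natural language text",
    "machine learning recognises patterns in new data",
]
tokenised = [d.split() for d in docs]

def tf_idf(term, doc_tokens):
    # Term frequency within the document...
    tf = doc_tokens.count(term) / len(doc_tokens)
    # ...weighted down by how many documents contain the term.
    df = sum(term in toks for toks in tokenised)
    return tf * math.log(len(tokenised) / df)

# The highest-weighted term of the second document is its best keyword.
doc = tokenised[1]
keywords = sorted(set(doc), key=lambda t: tf_idf(t, doc), reverse=True)
print(keywords[0])
```

Generic words shared across the corpus ("mining") score low, while terms distinctive to one document rise to the top, which is how such software condenses a text's essential content.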
Subsequently, a data mining procedure is often applied to the data obtained from the texts in order to relate the data back to the underlying texts and to identify correlations and connections. Procedures borrowed from information retrieval (IR) also make it possible to capture the core data and information needed to answer search queries; the relevant individual documents can thus be identified in databases containing a large number of sources.
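The retrieval step mentioned above can be reduced to a very small sketch. This is a hypothetical toy example (the document collection and the scoring scheme are invented): documents are ranked by how many query terms they contain, so the most relevant sources in the collection come back first.

```python
# Hypothetical document collection standing in for a large database.
documents = {
    "doc1": "data mining finds patterns in large data sets",
    "doc2": "text mining extracts information from documents",
    "doc3": "information retrieval answers search queries",
}

def search(query):
    terms = set(query.split())
    # Score each document by its overlap with the query terms.
    scores = {
        name: len(terms & set(text.split()))
        for name, text in documents.items()
    }
    # Return only matching documents, best score first.
    return sorted(
        (name for name, score in scores.items() if score > 0),
        key=lambda name: -scores[name],
    )

print(search("text mining information"))
```

Real IR systems refine this with term weighting and ranking models, but the principle is the same: scoring each source against the query and returning the relevant individual documents.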