Contextualizing Geometric Data Analysis and Related Data Analytics: A Virtual Microscope for Big Data Analytics
Fionn Murtagh, Mohsen Farid
Big Data Lab, Department of Electronics, Computing and Mathematics, University of Derby, Derby, UK

February 16, 2022

Abstract

An objective of this work is to contextualize the analysis of large and multi-faceted data sources. Consider, for example, health research in the context of social characteristics; there may also be social research in the context of health characteristics. Related to this are requirements for contextualizing Big Data analytics. A major challenge in Big Data analytics is the bias due to self-selection. In general, and in practical settings, the aim is to determine the most revealing coupling of mainstream data and context. This is technically processed in Correspondence Analysis through use of the main and the supplementary data elements, i.e. individuals or objects, attributes and modalities.

Keywords: analytical focus; contextualization of data and information; Correspondence Analysis; Multiple Correspondence Analysis; dimensionality reduction; mental health

1 Introduction

Clearly, regression and other supervised analytics, based on predictor and predicted relationships, have their fixed roles in all of data analytics. In unsupervised analytics, information discovery is at issue, from what can be multi-faceted data sourcing. This may involve dynamic (evolving) data, and also data of heterogeneous precision (sensitivity, trustworthiness). Both qualitative and quantitative data may be at issue. Examples of such multi-faceted data sources include the increasingly prevalent secondary data sources in the domains of Big Data and the Internet of Things. Such secondary data sources may well count as contextual data.

Given multi-faceted data, it is, or can be, open to the data scientist to consider different contextualization strategies. Consider, for example, how mobile communications and their monitoring provide locational data on all device bearers. Transport analytics, for example, can then be contextually based on such secondary data and information sources. In this article, we seek to explore general strategies for contextualizing analytics. Technically, the main attributes are aided, for analytical interpretation, i.e. extracting information from data and consolidating that information into knowledge, by the supplementary, contextual attributes.

The general discussion of [17] points to the need for focus if benefits are to be drawn from data analytics. In a general sense, we may regard dimensionality reduction as a form of focused data analytics, since typically (in Correspondence Analysis, Principal Components Analysis (PCA), multidimensional scaling, etc.) the percentage of variance or inertia explained by the reduced-dimensionality space becomes the focus. Similarly, feature selection is often used to focus on the selected variables. Often, the clustering of data is carried out with the objective of retaining the clusters for the analytics. Summarization is therefore associated with focusing the analysis, and focus is a key practical consideration in data analysis. Then, though, Benzécri's principle that what we are analysing should be both homogeneous and exhaustive should also be very central, from a practical perspective.
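By way of illustration only, the following is a minimal sketch, with synthetic data and using scikit-learn's PCA rather than anything from the studies discussed here, of how the share of variance retained by a reduced-dimensionality representation is read off as the analytical focus.

```python
# A minimal sketch, assuming synthetic data and scikit-learn; not code
# from this paper. The proportion of variance retained by the reduced
# space is the quantity treated here as the analytical focus.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 observations, 10 attributes
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]     # introduce some redundancy among attributes

pca = PCA(n_components=3).fit(X)

# The "focus" of the reduced 3-dimensional representation:
# the share of total variance it explains.
print(pca.explained_variance_ratio_.sum())
```

In Correspondence Analysis, the corresponding quantity is the percentage of inertia explained by the retained factors.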
Just in this practical sense, we begin by regarding contextualization as the association of two or more foci, i.e. two or more directions of analytical interest. This association is asymmetric: one focal point is located with reference to the other. This can be generalized to multiple focal directions in our analyses. We begin with such contextualization in general, and then take contextualization further in the context of Big Data. In the former, applications related to health and lifestyle analytics are at issue. In the latter, a central theme is the use of secondary data, whether or not with Big Data characteristics, so as to direct and focus the analytics.

1.1 Secondary and Contextual Data Sources: Complementarity of Supervised and Unsupervised Analytics

There follows a short description of unsupervised analytics, related to dynamics and evolving contexts, and of supervised analytics, related to having a training set and invoking a machine learning or statistical modelling approach to specified, stable data sources.

In [1], the 2nd principle of data analysis is the following: "The model should follow the data, and not the reverse." Unlike statistical testing that justifies a model, "... what we need is to have a rigorous method that extracts structures from data". "A model is, briefly, a system of formulas which allows the calculation, as a function of the unobservable variables, of the observed quantities". "The term, unobservable variable, is, in a certain sense, relative to the state of science. More than a physical measure, firstly conjectured, such as energy or electric charge, it is now recognised to exist, and is measured as space or time. But the human sciences grope around still to establish rigorous laws: and while in astronomy some very simple axioms govern the movement of the most complex systems, a psychologist can rarely boast that the exhaustive study of elementary phenomena allows the precise prediction of the evolution of a complex case: ... the reach of a model rarely goes beyond the determined field of observations for which it was conceived. ... Often, effectively, the model has so many unobservable variables, and the experiments are so difficult (or the observations are so rare), that one has no trouble to explain the latter by the former (observations by experiments). ... The model allows, optimally, to predict, but not to understand. ... our aspiration to understand (not only to predict)".

In [18], relating to activities of daily living (ADLs) of the elderly and those in poor health, use is made of "contextual information from uncertain sensor data". There is this: "... our algorithm learns directly from incomplete data, and inhabitants' behavioural patterns are characterized using the learned probability distribution over various activities. The model is used to infer the activities, and the inhabitants who have carried them out." For example, with the monitoring of a kettle boiling water and perhaps some other sensor, these sensor data are used to model and predict some aspects of "making a tea with milk and sugar", "making coffee with milk", and so on. Further predictive models are at issue in [19], leading to a mobile app that will trigger personalized reminders. The foregoing examples are such that sensor systems provide contextual data, as do movement and other types of activities, with the main attention given to health and mental well-being.

2 Short Introduction to General Contextualization

Correspondence Analysis, CA, is most appropriate for latent semantic analysis, with input data being a cross-tabulation of observations or individuals, and variables or attributes.
The cross-tabulation usually consists of frequency of occurrence data, encompassing presence/absence values. Recoded quantitative data can also be considered, and quantitative data can be used without recoding. CA is therefore very appropriate for mixed qualitative and quantitative data; categorical data is quite central here. In contrast to PCA, where attributes are centred to zero mean and reduced to unit standard deviation (termed standardization when both are carried out), and in contrast to the TF-IDF (term frequency, inverse document frequency) weighting used in Latent Semantic Analysis of document-term cross-tabulation data, CA uses the following. The chi-squared distance is defined both on rows and on columns. This is a weighted Euclidean distance between, respectively, row and column profiles. A profile is defined as the row or column vector divided by, respectively, its row or column total. This is what results from the frequencies in the input cross-tabulation matrix, i.e. each value divided by the grand total. In [10, 14], CA is characterized as "a tale of three metrics", these being the chi-squared metric for the dual clouds of observations and attributes, the Euclidean metric of the factor space into which these clouds are mapped, and the ultrametric of the hierarchical clustering that may be built on the factor coordinates.
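As an illustrative sketch only, and not the authors' own implementation, the following Python code (using numpy, with an invented contingency table) shows the profile and chi-squared-distance definitions above, the inertia explained by each factor, and the projection of a supplementary (contextual) row by the standard transition formula.

```python
# A minimal sketch, assuming numpy and an invented contingency table; it is
# not the authors' code or data. It illustrates row profiles, the chi-squared
# distance as a weighted Euclidean distance, the inertia explained by each
# factor, and the projection of a supplementary (contextual) row.
import numpy as np

N = np.array([[30.0, 10.0,  5.0],
              [12.0, 25.0,  8.0],
              [ 6.0,  9.0, 20.0],
              [14.0,  7.0, 11.0]])   # hypothetical individuals x attributes

P = N / N.sum()                      # correspondence matrix
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
row_profiles = P / r[:, None]        # each row profile sums to 1

# Chi-squared distance between two row profiles: Euclidean distance
# weighted by the inverse column masses 1/c_j.
def chi2_distance(p1, p2, col_masses):
    return np.sqrt(np.sum((p1 - p2) ** 2 / col_masses))

d_01 = chi2_distance(row_profiles[0], row_profiles[1], c)

# Factors: SVD of the standardized residuals of P.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

inertia = sv ** 2                    # inertia (eigenvalue) of each axis
pct_inertia = inertia / inertia.sum()

K = 2                                # retain the non-trivial axes
F = (U[:, :K] * sv[:K]) / np.sqrt(r)[:, None]      # principal row coordinates
G = (Vt.T[:, :K] * sv[:K]) / np.sqrt(c)[:, None]   # principal column coordinates

# A supplementary row takes no part in building the axes; its profile is
# projected into the factor space by the transition formula.
sup_row = np.array([5.0, 3.0, 12.0])
sup_profile = sup_row / sup_row.sum()
f_sup = sup_profile @ G / sv[:K]

print(pct_inertia.round(3), d_01.round(3), F.round(3), f_sup.round(3))
```

The coupling of main and supplementary elements referred to in the abstract is of exactly this kind: the supplementary, contextual elements are positioned in, but do not determine, the factor space built from the main data.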