Data Analytics Dictionary - Glossary of Terms

Can you tell a coefficient from a continuous variable? Can you define where multivariate data analysis ends and regression analysis begins? Here are some common terms used in data analytics.

A

Advised future: A Control Advisor optimized manipulated variable setting that gives the best theoretical outcome of the process.

Alarm rules: Rules on when alarms should be triggered, for instance when 3 points in a set of 10 are outside the limit. Common rule sets are the Western Electric rules and the Nelson rules.
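
A minimal sketch of the "3 outside the limit in a set of 10" kind of rule, assuming fixed lower and upper limits. The function name `rule_k_of_n` and the sample readings are hypothetical, and this is not the Western Electric or Nelson rule set itself.

```python
def rule_k_of_n(values, lower, upper, k=3, n=10):
    """Return indices where at least k of the last n points lie outside [lower, upper]."""
    outside = [not (lower <= v <= upper) for v in values]
    alarms = []
    for i in range(len(values)):
        # Look at the current point and up to n - 1 points before it.
        window = outside[max(0, i - n + 1): i + 1]
        if sum(window) >= k:
            alarms.append(i)
    return alarms


# Hypothetical readings with limits at -2.0 and 2.0.
readings = [0.1, 2.5, -0.3, 0.4, -2.7, 0.2, 0.0, 3.1, 0.5, 0.1, 0.2]
alarm_indices = rule_k_of_n(readings, lower=-2.0, upper=2.0)
print(alarm_indices)  # [7, 8, 9, 10]
```

The alarm first fires at index 7, when the third out-of-limit point enters the window, and stays active while three such points remain within the last ten.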

Algorithm: An unambiguous mathematical specification or statistical process used to perform analysis of data.

Analysis of variance: A statistical technique to separate and estimate different causes of variation.

ANOVA: See Analysis of Variance.

AR: Augmented Reality, where holographic objects are projected onto the real world.

AR model: Autoregressive model. Used in the analysis of time series data.

ARL: Average Run Length.

ARMA model: Autoregressive Moving Average model. Used in the analysis of time series data.

Audit trail: Activity log that tracks all changes in and to the system.

B

Batch conditions: Batch conditions pertain to the whole batch and are therefore used in the batch level model (BLM). Also sub-divided into Initial conditions & Final conditions.

Batch Context Generator: A system that automatically detects and contextualizes batches from triggers on tags in the system.

Batch evolution model (BEM): A regression model of how a batch process evolves over time or maturity.

Batch folding: How batches are realigned to create a summary for the whole batch production (batch level).

Batch level model (BLM): A model that summarizes the whole batch and can be combined with final quality attributes in order to predict the final quality of a batch.

Batch process: A finite duration process.

Batch statistical process control (BSPC): The application of control charting techniques to a batch process. Analogous to MSPC (multivariate statistical process control) and its control charting techniques applied to a continuous process.

Best basis: Best basis is an option used in wavelet transformation for high frequency signals. See also DWT.

Bilinear modeling: Matrices modeled as a product of two low rank matrices, e.g. X=T*P’.

Block-wise variable scaling: Making the total variance equal for each block of similar variables in a dataset.

BSPC: Batch Statistical Process Control.

C

Calibration dataset: See: Reference dataset.

Categorical variable: See: Qualitative variable.

Characteristic vector analysis: See: Principal component analysis.

Chemometrics: The application of mathematical and statistical methods to chemical data.

Class: A subset of similar observations from a dataset.

Classification of observations: The process of identifying to which of a set of categories (sub-populations) a new observation belongs.

Client/server application: An application architecture where calculations are done on a central server and the results can be displayed on one or more clients that connect to the server.

Cluster analysis: Techniques for dividing a set of observations into subgroups or clusters.

Coefficient: A regression coefficient indicates the numerical change in a response (Y-variable) when a factor (X-variable) increases from its midrange value to its maximum value.

Collinearity: A high level of correlation between variables.

Column space: Space spanned by the column vectors of a matrix.

Confidence interval: A specified range of values around an estimate to indicate margin of error, combined with a probability that a value will fall in that range. The confidence interval around a parameter (coefficient, loading, VIP, etc.) indicates the uncertainty of that parameter.

Continued process verification: The need to keep all critical attributes under control after the production is complete.

Continuous process verification: The need to keep all critical attributes and their correlation under control during the production.

Continuous variable: A variable whose value can be any of an infinite number of values, typically within a particular range.

Contingency table: A table which contains counts or frequencies of different events or outcomes.

Contribution plot: A bar chart used in multivariate data analysis to diagnose out-of-control points and show which variables contribute to the distance between the points and sample mean of the data.

Correlation: Measure of association of two variables.

Correspondence analysis: A special double scaled variant of PCA, suitable for some applications, e.g. analysis of contingency tables.

COST (change-one-separate-factor-at-a-time) approach: Also called OVAT (one-variable-at-a-time) or OFAT (one-factor-at-a-time), this is an intuitive method of “eye-balling” data to determine which factors may be influencing each other by calculating their average and standard deviation one at a time (an inefficient and error-prone method).

Covariance: Similar to correlation but not normalized, which makes it dependent on the magnitudes of the variables and therefore hard to interpret.
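
The effect of normalization can be illustrated with a small sketch. The sample covariance (divisor n - 1) is assumed here, and the variable names and values are made up for illustration: changing the unit of one variable rescales the covariance but leaves the correlation unchanged.

```python
import math


def covariance(x, y):
    """Sample covariance of two equally long sequences (divisor n - 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Covariance normalized by the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical data: the same yield expressed in kilograms and in grams.
temperature = [20.0, 22.0, 24.0, 26.0]
yield_kg = [1.0, 1.5, 1.7, 2.4]
yield_g = [v * 1000 for v in yield_kg]

print(covariance(temperature, yield_kg), covariance(temperature, yield_g))
print(correlation(temperature, yield_kg), correlation(temperature, yield_g))
```

The two covariances differ by a factor of 1000, while the two correlations agree.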

Cross-validation: A technique to evaluate the predictive ability of models by partitioning the original sample into training set(s) to train the model, and test set(s) to evaluate it.
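
The partitioning step can be sketched as follows. This is only one possible scheme, assuming interleaved assignment of observations to folds; the function name `k_fold_splits` is hypothetical, and the model fitting and evaluation that would use each split are left out.

```python
def k_fold_splits(n_observations, n_folds):
    """Yield (train_indices, test_indices) pairs; each observation is tested exactly once."""
    indices = list(range(n_observations))
    for fold in range(n_folds):
        # Every n_folds-th observation, starting at `fold`, is held out.
        test = indices[fold::n_folds]
        train = [i for i in indices if i not in test]
        yield train, test


splits = list(k_fold_splits(7, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each round would train a model on the `train` indices and evaluate it on the `test` indices; the evaluations are then combined into one measure of predictive ability.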

CUSUM: CUmulative SUM. A control charting technique used in multivariate statistical process control (MSPC) applications.
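
In its most basic form, a CUSUM accumulates the deviations of each measurement from a target value. The sketch below assumes that basic form; refinements such as slack values and decision limits are not shown, and the function name and data are hypothetical.

```python
def cusum(values, target):
    """Running sum of deviations from the target value."""
    total = 0.0
    sums = []
    for value in values:
        total += value - target
        sums.append(total)
    return sums


print(cusum([10, 11, 12, 9], target=10))  # [0.0, 1.0, 3.0, 2.0]
```

A sustained drift away from the target shows up as a steadily rising or falling cumulative sum, even when each individual deviation is small.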

D

D-optimal design: A computer-generated design for non-standard conditions or when the experimental domain is distorted. The D in D-optimal stands for determinant.

Dataset: A dataset is the base of all multivariate data analysis, often also called a data matrix. It is made up of values of several different variables for a number of observations. The data are collected in a data matrix (data table) of N rows and K columns, often denoted X. The N rows in the table are termed observations. The K columns are termed variables.

Data analytics: The process of examining large data sets to uncover hidden patterns, unknown correlations, trends, customer preferences and other useful business insights.

Data science: A discipline that combines statistics, data visualization, computer programming, data mining and software engineering to extract knowledge and insights from large and complex data sets.

DCrit: The critical limit with confidence interval where the correlation pattern is considered normal for the model in the DModX statistic.

Deep learning: Deep learning is part of a broader family of machine learning methods based on learning data representations.

Dependent variable: Another name for a Y-variable or response variable.

Design of experiments (DOE): A rational and cost-effective approach to practical experimentation that allows the effect of variables to be assessed using only the minimum of resources. A DOE protocol generates maximally informative experiments.

Discrete data: Data that exist sporadically during production, such as laboratory data (IPC, at-line or daily data).

Discrete variable: A variable that can only assume certain settings or levels (as opposed to a continuous variable that can have a value anywhere between two numerical limits).

Discriminant analysis: A statistical analysis technique used to predict class membership from labeled data.

DModX: Distance to model in the X-space. Expresses the row-wise residual standard deviation as a distance measure to the model for that particular observation (row).

D-optimal design: An approach in DOE that is used when the experimental region is very irregular or there is a need to estimate a particular (non-standard) regression model.

Drill down: The procedure of model interpretation through inspection of multivariate parameters, followed by zooming-in on certain parts of the underlying data by double-clicking in plots or charts to open up visualizations of relevant parts of the real measurements. This procedure is used to corroborate that what is seen in model parameters is indeed expressed or encoded in the underlying data.

Duration: The number of points in the batch.

DWT: Discrete wavelet transform option used in the wavelet transformation when the signal is fairly smooth, that is, the information is mainly contained in the low frequencies. See also Best basis.

Dynamic lags: Calculates and aligns delays in the system based on the speed of the system or time.

E

Eigenvalue: The length change when an eigenvector is projected onto itself. This is equivalent to the length of a principal diameter of the data.

Eigenvector: Eigenvectors exist only for square matrices. An eigenvector of a square matrix has the property of being projected onto itself when projected by the matrix. The degree of elongation or diminution is expressed by the eigenvalue.

Eigenvector analysis: See: Principal component analysis.

Electronic signatures: A mandatory sign-off to changes in or to the system that is part of the FDA 21 CFR part 11 guidelines.

Endpoint: The last maturity value for the batch.

Euclidean distance: Geometric distance in a Euclidean space (isomorphic with orthogonal basis vectors).

Empirical model: An empirical model is a model that is based on experimental data.

EWMA model: Exponentially Weighted Moving Average model. The EWMA is usually used as a control charting technique in MSPC. See also CUSUM.
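
A minimal sketch of the usual EWMA recursion, in which each new value is blended with the previous average using a weight between 0 and 1. The parameter names `lam` and `start` and the choice of starting value are assumptions for illustration.

```python
def ewma(values, lam=0.2, start=None):
    """Exponentially weighted moving average: z = lam * x + (1 - lam) * z_previous."""
    # Assumption: if no starting value is given, the first observation is used.
    z = values[0] if start is None else start
    smoothed = []
    for x in values:
        z = lam * x + (1 - lam) * z
        smoothed.append(z)
    return smoothed


print(ewma([10, 20], lam=0.5, start=10))  # [10.0, 15.0]
```

A small weight gives a smooth chart that reacts slowly; a weight close to 1 follows the raw data closely.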

Execution interval: Set for each continuous project or batch phase to indicate how often data should be sampled for that specific part of the production.

Explanatory variable: Variables (x) used to 'explain' the variation in the dependent variables (y). Also often called predictor variables or independent variables.

F

Factor: A term often used in experimental design. It signifies a controlled and varied variable. See: Predictor. Also a term for one model dimension in factor and bilinear models.

Factor analysis: Has an aim similar to PCA, but assumes an underlying model with a specified number of factors which are linear combinations of the original variables. Also see Principal Component Analysis.

Forecast: Gives the best guess of how the future production of the process will be based on the model and the currently existing historical data.

G

Golden batch: The average evolution batch for all produced batches for each vector.

H

Histogram: A column (bar) plot visualizing the distribution of a variable.

Hotelling's T2: A multivariate generalization of Student's t-test. Used to compute a distance measure of how far away an observation point is from the origin of the model in the score space.

Hotelling's T2 crit: The critical limit with significance level, within which we have the normal region of the model. Any observation point inside this limit is well explained by the model.
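
As a sketch of the distance measure, the T2 value of one observation is commonly computed as the sum over model components of the squared score divided by the variance of that score. The function name and the numbers below are hypothetical, and the critical limit (which relies on the F-distribution) is not computed here.

```python
def hotelling_t2(scores, score_variances):
    """Sum of squared scores, each divided by the variance of its component."""
    return sum(t * t / s2 for t, s2 in zip(scores, score_variances))


# Hypothetical observation with scores on two components.
print(hotelling_t2([2.0, -1.0], [4.0, 1.0]))  # 2.0
```

Dividing by the variance puts all components on the same footing, so a score of 2 on a component with variance 4 counts as much as a score of 1 on a component with variance 1.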

I

Identifiers: Labels on variables and observations indicating useful properties, meta-data or external information that can be used to enrich the model interpretation. Variable and observation identifiers are displayed in plots and lists. The Find function searches the identifiers in the Workset dialog. In the Observations page of the Workset dialog the identifiers can be used to set classes.

Independent variable: Often misleading connotation. See: Predictor variable or Explanatory variable.

Inner vector product: The product of two vectors that produces a scalar.

Input variables / output variables: Input variables are the factor (X) values and output variables are the responses (Y) in data analytics.

Interaction: Also interaction coefficient, the strength of the relation between an independent variable and dependent variables, as a function of another independent variable.

J

Jack-knifing: A method for finding the confidence interval of an estimated model parameter, by iteratively keeping out parts of the underlying data, making estimates from the subsets and comparing these estimates.
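
The idea can be sketched with the leave-one-out variant, assuming that each subset omits exactly one observation. The function name `jackknife` is hypothetical, and the sketch stops at the standard error; a confidence interval would additionally need a quantile of the t-distribution.

```python
import math


def jackknife(values, estimator):
    """Leave-one-out jackknife: mean of the subset estimates and their standard error."""
    n = len(values)
    # Re-estimate the parameter n times, each time with one observation left out.
    estimates = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    mean_estimate = sum(estimates) / n
    spread = sum((e - mean_estimate) ** 2 for e in estimates)
    standard_error = math.sqrt((n - 1) / n * spread)
    return mean_estimate, standard_error


def average(xs):
    return sum(xs) / len(xs)


mean_estimate, standard_error = jackknife([1.0, 2.0, 3.0, 4.0, 5.0], average)
print(mean_estimate, standard_error)
```

For the average, the jackknife standard error coincides with the familiar standard deviation divided by the square root of n, which makes this a convenient check of the sketch.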

K

K-dimensional space (K-space): The size of the variable space. K equals the number of variables in the dataset.

K-means clustering: A data mining algorithm to cluster, classify, or group observations based on their attributes or features into a certain number of groups (or clusters).
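
A minimal one-dimensional sketch of the usual iteration: assign each observation to the nearest center, then move each center to the mean of its group. The initial centers are supplied by the caller here, which is an assumption for illustration; the function name and data are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and updating the centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda c: abs(value - centers[c]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous center.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


centers, clusters = k_means_1d([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], centers=[0.0, 6.0])
print(centers)
print(clusters)
```

With these values the two centers settle near 1 and 5, one for each visible group.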

Karhunen-Loève transformation: See: Principal component analysis.

L

Latent variable: Variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).

Least squares estimate: A method to estimate model parameters by minimizing the sum of squares of the differences between the actual response value and the value predicted by the model.
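
For a straight line with one X-variable, the least squares estimates have a closed form. The sketch below assumes that simple case; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Intercept and slope that minimize the sum of squared residuals for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(intercept, slope)  # 1.0 2.0
```

The example data lie exactly on the line y = 1 + 2x, so the residuals are all zero and the estimates recover that line.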

Leverage: Observations in the periphery of a dataset might have great influence on the modeling of the dataset. That influence is termed leverage, based on the Archimedean idea that anything can be lifted out of balance if the lifter has a long enough lever.

Linear regression: A statistical method used to summarize and show relationships between variables.

Loading vector: Direction coefficients of a PC or PLS component axis.

Local centering: A way to realign variables that are drifting.

M

M-space: Measurement space, or multivariate space. Synonym: K-space. See: K-space.

MA model: Moving Average model. Used in the analysis of time series data.

Machine learning: Algorithms such as MVDA that can model a system based on historical data.

Mahalanobis distance: Eigenvalue-weighted Euclidean distance.

Manipulated variable: Variable that can be controlled and steers the system in some way, for instance set points in batch production.

Matrix: A two-way data table where data are arranged as rows and columns.

Mean: The average value.

Mean centering: A preprocessing method used in MVDA. Often combined with scaling to unit variance (UV-scaling).
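
Mean centering combined with scaling to unit variance can be sketched for a single variable (one column of the data matrix) as follows. The sample standard deviation is assumed, and the function name `autoscale` is hypothetical.

```python
import statistics


def autoscale(column):
    """Subtract the mean, then divide by the standard deviation (UV-scaling)."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]


scaled = autoscale([2.0, 4.0, 6.0, 8.0])
print(scaled)
```

After this step the column has mean 0 and standard deviation 1, so variables measured in different units carry equal weight in the model.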

Median: When values are size-sorted, the value in the middle.

Mechanistic models: Models based on a theoretical understanding of the behavior of a system's components.

Megavariate analysis: A term used to describe a method of modeling large quantities of data containing multiple latent variables rather than expressed variables to give results that are multivariate. Increasingly used in life science and biology.

Metabonomics: The study of excreted metabolites of a species or an individual organism, involving measurements of the response to an influence.

Missing value: Element in a data matrix without a defined value. As a rule of thumb, each observation and variable should have more than five defined values per PC. Observations (or variables) with missing values that show up as outliers should be treated with suspicion.

MLR: Multiple Linear Regression.

MOCA, Multiblock Orthogonal Component Analysis: Generalization of OPLS to cover multiple blocks of data and search for their joint and unique variabilities.

Mode: In a set of numbers, the value that occurs most often.

Model: A mathematical expression that describes relationships among variables in a historical data set to estimate or classify the data. In essence, a model draws a "line" through a set of data points that can be used to predict outcomes.

Model management: The method to trace, track, and version models that represent a system.

Model update: A method to automatically or semi-automatically re-calibrate the model after changes have occurred in the process that the model was not previously fitted for.

MSPC: Multivariate Statistical Process Control: The use of multivariate methods to characterize the state of a process with respect to known states. The state is determined from model score plots and distance to model plots. See also: SPC.

Multidimensional scaling: Roughly corresponding to a principal component analysis of a matrix of ‘distances’ between observations.

Multiple linear regression: Used as a means of predictive analysis to explain the relationship between one continuous dependent variable and two or more independent variables.

Multivariate data analysis: A set of statistical techniques used to analyze data sets that contain more than one variable.

MVDA: Multivariate Data Analysis.

N

Neural network: A framework for many different machine learning algorithms to work together and process complex data inputs.

NIPALS: Nonlinear Iterative Partial Least Squares.

Nonlinear Iterative Partial Least Squares: Algorithm for calculating principal components.

Normal distribution: A probability distribution which, when graphed, is a symmetrical bell curve with the mean value at the center.

Notification system: A system that can send a message to a specific or several receivers when something predetermined has happened in the system.

O

Observation space: The space spanned by the observation vectors of a data matrix. Each variable vector is represented as a point in that space. See also: Row space.

OLS: Ordinary Least Squares, equivalent to MLR.

Omics: The study of a group or system of biomolecules.

Ordinal data: A discrete variable is called ordinal if its data can be arranged in some numerical order.

Ordinal number: Showing order or position in a series, e.g. first, second, third.

Orthogonal Projections to Latent Structures (OPLS): Modification of the classical PLS method bringing about simplified model interpretation.

OPLS: Also Orthogonal PLS, a modification of PLS in which systematic variation in independent factors is divided into two parts: one related and one unrelated to the dependent responses.

Outer vector product: Product of two vectors that produces a matrix: M = t * p', where m_ij = t_i * p_j.
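
The element-wise definition translates directly into a short sketch. The function name `outer` and the example vectors are hypothetical.

```python
def outer(t, p):
    """Outer product: a matrix with one row per element of t and one column per element of p."""
    return [[t_i * p_j for p_j in p] for t_i in t]


matrix = outer([1, 2], [3, 4, 5])
print(matrix)  # [[3, 4, 5], [6, 8, 10]]
```

Every row of the result is a multiple of p and every column a multiple of t, which is why a single outer product always yields a matrix of rank one.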

Outliers: Extreme values that might be errors in measurement and recording, or might be accurate reports of rare events.

P

PAT: Process Analytical Technology.

Process Analytical Technology: Systems for analysis and control of manufacturing processes based on timely measurements, during processing, of critical quality parameters and performance attributes of raw and in-process materials and processes to assure acceptable end product quality at the completion of the process.

Partial least squares (PLS) regression: A statistical technique that combines features from principal component analysis and multiple regression, but instead of finding hyperplanes of maximum variance between the dependent and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new lower-dimensioned space.

PCR: Principal Component Regression.

P value: A probability value returned from formal statistical testing of some test statistic, e.g. a t-test or an F-test.

Phase: A part of the process that has a specific chemical or physical interpretation.

Phase conditions: Phase conditions pertain to the whole phase and are therefore used in the batch level model.

Phase iterations: The modelling and monitoring of complex phases that can happen more than once or be split and then merged again.

Phase iteration conditions: Phase iteration conditions pertain to the whole phase iteration and are therefore used in the batch level model.

PLS: Projections to Latent Structures.

PLS-DA: Also PLS Discriminant Analysis, a PLS analysis involving a dummy variable for classification.

Prediction: A statement (usually quantitative) about what will happen under specific conditions, as a logical consequence of scientific theories.

Predictive modeling: The development of statistical models to predict future events.

Power method: An iterative projection method for finding eigenvectors.
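
A minimal sketch of the iteration for a small symmetric matrix: multiply the current vector by the matrix, normalize, and repeat. The fixed iteration count and the starting vector are assumptions for illustration (the start must not be orthogonal to the dominant eigenvector), and the function name is hypothetical.

```python
import math


def multiply(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def power_method(matrix, iterations=100):
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix."""
    vector = [1.0] + [0.0] * (len(matrix) - 1)  # assumed starting vector
    for _ in range(iterations):
        product = multiply(matrix, vector)
        norm = math.sqrt(sum(c * c for c in product))
        vector = [c / norm for c in product]
    # With a unit-length vector, v . (A v) estimates the eigenvalue.
    eigenvalue = sum(v * p for v, p in zip(vector, multiply(matrix, vector)))
    return eigenvalue, vector


eigenvalue, eigenvector = power_method([[2.0, 1.0], [1.0, 2.0]])
print(eigenvalue, eigenvector)
```

For this matrix the iteration converges to an eigenvalue of 3 with an eigenvector along the direction (1, 1).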

Prediction set: A dataset used together with an established model in order to obtain model predictions for each of the observations in the set.

Predictor variable: See: Explanatory variables.

Principal component analysis: A technique used to provide an overview of the information in a dataset.

Principal Component Regression: A regression technique that combines principal component calculations with MLR.

Projection methods: A group of methods that can efficiently extract the information inherent in MVD. They give results that are easy to interpret because they can be presented as pictures. Such methods are efficient for pattern recognition, classification, and predictions. The most commonly used methods are PCA, PLS and OPLS.

Projection to Latent Structures: See Partial Least Squares (PLS) regression.

Q

Qualitative variable: A non-numerical variable describing a property (setting) of an observation. The qualitative settings do not have a natural order and can therefore not be converted into a single numerical (quantitative) variable.

QSAR: Quantitative Structure-Activity Relationship.

Quantitative Structure-Activity Relationship: Estimation of the strength of a mathematical relation between chemical structure and pharmacological activity for a series of compounds.

R

Real-time data processing: Real time data processing involves a continual input, process and output of data and allows an organization to take action right away. Data must be processed in a small time period (or near real time).

Rectangular Experimental Designs for Multi-Unit Platforms: A set of designs, for experiments in 96-well plates using multi-pipettes.

REDMUP: Rectangular Experimental Design for Multi-Unit Platforms.

Reference dataset: This term is used for datasets with known properties and origin, often used to define models. Synonyms: Calibration dataset, training dataset, workset.

Regression: The fitting of a curve to data points, expresses the mathematical relationship between variables.

Regression analysis: A modeling technique used to define the association between variables. It assumes a one-way causal effect from predictor variables (independent variables) to a response of another variable (dependent variable). Regression can be used to explain the past and predict future events.

Regressor variable: See: Explanatory variable.

Residual: Left-over; un-modeled part. The mismatch between the observed and modeled values.

Response variable: See: Dependent variable.

Root-cause analysis: A method of problem solving used for identifying the root causes of faults or problems.

Root Mean Squared Error (RMSE): A measure of the differences between the values predicted by a model or estimator and the values observed. RMSEE denotes the error of estimation (fit to the reference dataset) and RMSEP the error of prediction (on a prediction set).
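
A minimal sketch of the calculation, assuming the plain form that divides by the number of observations; degrees-of-freedom corrections used by specific software are not shown. The function name and data are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Square root of the mean squared difference between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(error)
```

Here only the third prediction is off (by 2), so the result is the square root of 4/3. The RMSE is expressed in the same unit as the response itself.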

Row space: The space spanned by the row vectors of a matrix.

S

Scaling: Scaling is a pre-process step in multivariate data analysis used to align different types of data with a standard set of numerical values. Common methods include scaling to unit variance and Pareto scaling.

Score: Distance from the origin, along a loading vector, to the projection point of an observation in K- or M-space. Or: the coordinates of a point when it is projected on a model hyperplane.

Score space: The space spanned by the score vectors of a model. Each observation is represented as a point in that space. See also: Variable space, K space, and M space.

Score vector: Observation coordinates along a PC or PLS component axis. Scores for all observations for one model dimension (component).

Semiconductor: Material with a conductivity intermediate between that of a conductor and that of an insulator.

SIMCA®: Soft Independent Modeling of Class Analogy.

Singular value decomposition: See: Principal component analysis.

SPC: Statistical Process Control: The behavior of a process is characterized using data when the process is operating well and is in a state of control. In the monitoring phase the new incoming, measured, data are used to detect whether the process is in control or not. See also: MSPC.

Spectral filters: Pretreatment of data per observation specifically aimed at spectral type of data. Can for instance calculate derivatives or remove the average per row.

Standard deviation: The square root of the variance, and a common way to indicate just how different a particular measurement is from the mean.

T

Test dataset: A dataset with unknown properties, often subjected to projections to models.

Time series data: A sequence of measurements taken at different times, and often, but not necessarily at equally spaced intervals.

Time series filters: Pretreatment of data per variable. Can for instance calculate derivatives or wavelets per column.

Time warping: A method to realign batch evolution data according to a process maturity instead of time to compensate for reaction rates that differ between different production runs.

Training dataset: See: Reference dataset.

U

Unit: A production vessel, or reactor, where raw materials are processed.

Unit group: A set of units that are similar enough that the same model can be used for all of them.

V

Validity: Term stemming from logical argument, stating that an argument is valid if, for every model in which all premises are true, the conclusion is also true.

Variable space: The space spanned by the variable vectors of a data matrix. Each observation vector is represented as a point in that space. See also: K space, and M space.

Variability: The variation between samples in the same condition, without systematic error.

Variance: A way to measure how large the differences are in a set of numbers by comparing them to the mean (average) value.
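
As a sketch, the sample variance averages the squared differences from the mean (with divisor n - 1, an assumption here), and the standard deviation is its square root. The function names and data are hypothetical.

```python
import math


def variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def standard_deviation(values):
    """Square root of the variance, in the same unit as the data."""
    return math.sqrt(variance(values))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(data), standard_deviation(data))
```

For this data the mean is 5 and the squared deviations sum to 32, giving a variance of 32/7.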

Variables: A data table can contain observations and variables. The observations are sometimes called objects, samples, cases or items. The variables are the measurements that are made in order to capture the properties of the observations.

Vector: A quantity having both a direction and a magnitude, often represented by an arrow or coordinate on an axis.

W

Wavelets: Small oscillating wave functions that are used for data filtering or data compression.

Web API: An interface based on web technology to read or set data.