Can MVDA Improve Acceptance of Chemometrics as a Forensic Tool?
In crime scene investigations, establishing an accurate timeline is always a critical component of any case, yet it can often be one of the most difficult to prove when it comes to forgery, documentation theft or questions of originality. Chemometrics is one tool that can aid in identifying characteristics of paper and ink without destroying the originals, but there has been little universal acceptance of a process or methodology to accurately date ink age this way. New studies, however, have shown how Multivariate Data Analysis (MVDA) can be used to establish conclusive timeframes for ink aging using spectroscopic data.
This article is posted on our Science Snippets Blog.
In the paper “Noninvasive dating method for hand-written documents based on the dye aging processes,” written as part of doctoral thesis work, Dr. Laura Ortiz Herrero and Dr. Luis Bartolomé Moro of the University of the Basque Country (UPV/EHU) highlight the usefulness of multivariate regression methods in forensic science for establishing a non-destructive dating methodology.
Establishing the Question of Time
Whenever a person commits a criminal offense, there is always a question of time, place and action. However, the question of time is the most difficult to answer. Time has been largely unexplored due to the complexity of applying cutting-edge statistical analysis techniques effectively.
In this study, data analytics techniques were combined with chemometrics, improving the interpretation of high-dimensional data and providing additional information even from incomplete scans. This allows more objective and meaningful results to be obtained quickly and reduces human error in subsequent decision-making.
The methods outlined focus on Partial Least Squares Regression (PLSR) and its extension, Orthogonal Partial Least Squares Regression (OPLSR). PLSR is a quantitative multivariate analysis method: it models the linear relationship between the X matrix, made up of the experimental data, and the Y vector, which represents time, with the aim of creating a model that predicts the age of an unknown sample.
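To make the X-to-Y relationship concrete, here is a minimal sketch of PLSR for a single response (often called PLS1) using the NIPALS algorithm and NumPy. It is not the SIMCA® implementation used in the study and it does not include the orthogonal filtering step of OPLS. The function names and the synthetic "spectra" are assumptions made up for illustration only.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with NIPALS (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean            # mean-centre X and y
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                          # direction in X most covariant with y
        w /= np.linalg.norm(w)
        t = Xk @ w                             # sample scores on this component
        tt = t @ t
        p = Xk.T @ t / tt                      # X loadings
        c = (yk @ t) / tt                      # regression of y on the scores
        Xk = Xk - np.outer(t, p)               # deflate X
        yk = yk - c * t                        # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # regression vector in original X space
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    """Predict the response (here: aging time) for new rows of X."""
    x_mean, y_mean, coef = model
    return (np.asarray(X, dtype=float) - x_mean) @ coef + y_mean

# Synthetic data, invented for this example: one absorption band that
# fades slightly with age, plus a little noise.
rng = np.random.default_rng(0)
ages = np.arange(1, 25, dtype=float)           # aging time in months
wavelengths = np.linspace(400, 700, 50)        # nm
band = np.exp(-((wavelengths - 550) / 40) ** 2)
spectra = band * (1 - 0.02 * ages[:, None]) + rng.normal(0, 0.002, (24, 50))

model = pls1_fit(spectra, ages, n_components=2)
print(pls1_predict(model, spectra[:3]))        # predicted ages of the first three samples
```

In practice the number of components would be chosen by cross-validation rather than fixed in advance.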
Using Spectroscopy for Forensic Dating
The introduction of chemometrics in forensic dating is very recent. That is why the potential applicability of this tool remains poorly understood and underexploited. If we add to this the statistical knowledge required for its optimal application, it’s no wonder that experts are reluctant to use chemometrics and implement it as a standard methodology in day-to-day forensic practice.
With the aim of reversing this situation, this research highlights the usefulness of multivariate regression methods in the forensic field through the development and validation of dating methodologies in which non-destructive and micro-destructive techniques have been applied together with the (O)PLSR method, using spectroscopic techniques.
Document forgery is a daily reality. It infringes on the rights of companies, individuals and political entities and can have serious, far-reaching consequences for them, hence the importance of developing dating methodologies that determine the legitimacy of documents.
These methodologies must:
- Estimate an accurate date
- Be applicable to wide time frames
- Preserve the integrity of the document due to its legal value
- Provide reliable estimates to be defensible and acceptable in court
The fulfillment of these requirements depends on the paper support (type of paper), the writing tool (type of ink) and the preservation conditions of the documents (how they were stored, temperature, humidity, etc.).
Taking this into account, the aim of the study was to develop an innovative dating methodology for hand-written documents that would estimate an accurate date of the ink strokes for the largest number of writing tools in the widest possible timeframe by means of a non-destructive technique that would enable reliable results to be obtained while minimizing the influence of the paper support and preservation conditions.
Creating the Study
For the study, 11 types of writing tools from seven different brands were selected. Ink strokes were made with each writing tool every month between 2017 and 2019 on white paper and left to age under local room conditions. A total of 20 to 30 samples were obtained for each writing tool, reaching a maximum age of 27 calendar months for the oldest one.
Ink strokes for each writing tool were analyzed using visible microspectrophotometry. One of the advantages of this technique is the possibility of analyzing the dyes and pigments that constitute the ink.
In a case of document forgery, the writing tool used to sign the document is unknown, so we do not know which OPLS model is the right one to apply for dating. Pre-classifying and pre-clustering the unknown ink allows it to be assigned to its corresponding OPLS model so that its date can be predicted. Therefore, principal component analysis (PCA) and hierarchical cluster analysis were performed on the spectroscopic data of all the ink samples of each writing tool with the aim of grouping and classifying them.
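The idea of assigning an unknown ink to a group before choosing a model can be sketched as follows. This is a simplified stand-in, not the study's procedure: it projects spectra onto principal components with NumPy and assigns the unknown to the nearest class centroid in score space, whereas the study used PCA together with hierarchical cluster analysis in SIMCA®. All function names, class labels and data are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components=2):
    """Return the column means and PCA loadings of X (via SVD)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T           # loadings: (n_features, n_components)

def pca_scores(X, mean, loadings):
    """Project spectra onto the principal components."""
    return (np.asarray(X, dtype=float) - mean) @ loadings

def class_centroids(scores, labels):
    """Mean score vector of each labelled group of reference inks."""
    labels = np.asarray(labels)
    return {str(name): scores[labels == name].mean(axis=0)
            for name in np.unique(labels)}

def assign_class(score, centroids):
    """Name of the group whose centroid is closest to the given score."""
    return min(centroids, key=lambda name: np.linalg.norm(score - centroids[name]))

# Invented reference spectra for two hypothetical pens.
rng = np.random.default_rng(0)
pen_a = np.array([1.0, 0.5, 0.1, 0.0])
pen_b = np.array([0.0, 0.1, 0.5, 1.0])
reference = np.vstack([pen_a + rng.normal(0, 0.01, (5, 4)),
                       pen_b + rng.normal(0, 0.01, (5, 4))])
labels = ["pen_A"] * 5 + ["pen_B"] * 5

mean, loadings = pca_fit(reference, n_components=2)
centroids = class_centroids(pca_scores(reference, mean, loadings), labels)

unknown = pen_a + 0.005                        # an "unknown" ink close to pen A
print(assign_class(pca_scores(unknown, mean, loadings), centroids))
```

A nearest-centroid rule always returns some class, so a real workflow would also need a way to reject inks that match none of the groups.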
For this work, the formatting steps of data preprocessing and multivariate regression analysis were performed using SIMCA® MVDA software.
Grouping Writing Tools by Clusters
The PCA score plot grouped the writing tools into seven clusters. The Inoxcrom® Gel was strongly differentiated from the others, at the far-right end of the first quadrant (image below). The Uni-Ball® and Paper Mate® Gel brands were grouped at the top end of the SIMCA® plot. The rest of the pen ink brands were classified in the central area of the graph with more or less differentiation between them.
Once pre-classification was completed, an OPLS model per writing tool was built and validated using SIMCA® software.
Building the OPLS Model
First, for each writing tool, the sample set was split into two sets: 80 percent of the samples constituted the training set, while 20 percent of the samples comprised the test set. The Kennard-Stone algorithm was used for this purpose.
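For readers unfamiliar with it, the Kennard-Stone algorithm picks training samples that cover the data space evenly: it starts with the two most distant samples and then repeatedly adds the sample that is farthest from everything already selected. The sketch below is a minimal NumPy version under the assumption of Euclidean distances; the function names are hypothetical and the study's own split was performed in its software workflow, not with this code.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return the indices of n_select rows of X chosen by Kennard-Stone (n_select >= 2)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the sample whose nearest selected neighbour is farthest away.
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

def split_train_test(X, train_fraction=0.8):
    """Split sample indices into training and test sets with Kennard-Stone."""
    n_train = int(round(train_fraction * len(X)))
    train = kennard_stone(X, n_train)
    test = [k for k in range(len(X)) if k not in train]
    return train, test

# Toy one-dimensional "spectra", invented for the example.
toy = [[0.0], [1.0], [2.5], [4.0], [10.0]]
train_idx, test_idx = split_train_test(toy, train_fraction=0.8)
print(train_idx, test_idx)
```

Because the extremes are always selected for training, the test samples fall inside the range covered by the model, which suits an interpolating regression model.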
The training set was used for the construction of the OPLS model. In other words, the linear relationship between the X matrix constituted by the spectroscopic data of the training samples and the Y vector represented by the aging time of samples was modeled. The test set, on the other hand, was used for external validation of the OPLS model, which means that the OPLS model was used in the prediction of the aging time of the test samples based on the spectroscopic data.
The validity of the OPLS model was subsequently evaluated by comparing the real and predicted aging times of the test samples. Once the two sets of samples were obtained, the data were preprocessed. When performing multivariate modeling for dating purposes, we look for information in the data that captures the modifications that ink components undergo throughout the aging process while removing irrelevant, detrimental or unrelated information. For example, the data from BIC France were out of scale and were corrected with the standard normal variate (SNV) filter. In this way, the robustness and performance of the OPLS model were improved.
The performance of the OPLS model built with the training set was then evaluated after applying the selected preprocessing. For this purpose, several statistical parameters were taken into account. The root mean square error of estimation (RMSEE) represents the model fit and the root mean square error of cross-validation (RMSECV) indicates the model accuracy. Both values should be as low as possible.
Likewise, the robustness of the model was evaluated with the test set, for which the root mean square error of prediction (RMSEP) was considered. This parameter expresses the predictive ability of the OPLS model; in other words, it quantifies the prediction error when the model is applied to new, unknown samples.
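The standard normal variate filter mentioned above rescales each spectrum individually: it subtracts the spectrum's own mean and divides by its own standard deviation, which removes offset and scale differences between measurements. A minimal NumPy sketch follows; the function name and the example values are assumptions for illustration.

```python
import numpy as np

def snv(spectra):
    """Apply the standard normal variate transform to each row (spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)           # per-spectrum mean
    std = spectra.std(axis=1, ddof=1, keepdims=True)     # per-spectrum sample std
    return (spectra - mean) / std

# Two invented spectra with the same shape but different offset and scale.
base = np.array([1.0, 2.0, 3.0, 4.0])
raw = np.vstack([base, 2.0 * base + 5.0])
corrected = snv(raw)
print(np.allclose(corrected[0], corrected[1]))           # the two rows now coincide
```

After the transform, spectra that differ only by an additive offset and a multiplicative factor become identical, so the model sees shape changes rather than intensity differences.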
As a supplementary quality parameter, the accuracy error was calculated, in which the real and predicted time of the samples within the set were compared. The value of these two parameters was also intended to be the lowest possible.
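The three root mean square errors described above share one formula and differ only in which predictions they are computed from: the training samples (estimation), the samples left out during cross-validation, or the external test set (prediction). The sketch below shows that shared calculation. The function name, the optional `lost_dof` argument and the numbers are assumptions for illustration; some software divides by fewer than n when reporting the estimation error, and the article does not say which convention was used.

```python
import math

def rmse(actual, predicted, lost_dof=0):
    """Root mean square error between known and predicted aging times.

    lost_dof is an assumed, optional degrees-of-freedom correction;
    with the default of 0 the squared residuals are averaged over n.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / (len(actual) - lost_dof))

# Invented aging times in months, for illustration only.
known_test_ages = [3.0, 10.0, 18.0]
predicted_test_ages = [4.0, 9.0, 19.0]
print(rmse(known_test_ages, predicted_test_ages))   # error on an external test set
```

Applied to training-set predictions the same function gives the estimation error, and applied to cross-validated predictions it gives the cross-validation error.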
Blind Testing to Validate the Methodology
Once the OPLS models were validated, a blind testing exercise was carried out to validate the developed methodology. Thirteen blind ink samples were provided by the Basque police and the workflow (below) was followed in all cases. Starting with the Q7 and F1 documents, the first step was ink clustering and classification.
The second step was to select the OPLS model for each round. The next step was to check whether both replicas fell within the ellipse of the predicted score plot and met the RSD% criteria. When this requirement was met, an exact date for each document could be assured with 95 percent confidence.
This step showed that the OPLS model not only responds to its own ink brand but also to other brands with a similar ink formulation.
This process was done for the rest of the documents. By increasing the number of OPLS models built from the most widely used representative inks, researchers were able to give an age prediction for a larger number of documents.
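The RSD% check in the workflow above can be illustrated with a short calculation. Relative standard deviation is the sample standard deviation expressed as a percentage of the mean. The sketch assumes the criterion is applied to the predicted ages of the replicas of one document, and the threshold value is purely hypothetical; the article specifies neither.

```python
import statistics

def rsd_percent(values):
    """Relative standard deviation (%) of a set of replicate values."""
    mean = statistics.mean(values)
    return 100.0 * statistics.stdev(values) / abs(mean)

def replicas_are_precise(predicted_ages, max_rsd_percent):
    """True if the replicate predictions agree within the given RSD% limit."""
    return rsd_percent(predicted_ages) <= max_rsd_percent

# Invented predicted ages (months) for two replicas of one document.
replica_ages = [9.0, 11.0]
HYPOTHETICAL_LIMIT = 20.0      # placeholder threshold, not taken from the study
print(rsd_percent(replica_ages), replicas_are_precise(replica_ages, HYPOTHETICAL_LIMIT))
```

The ellipse test on the score plot is a separate check and is not covered by this sketch.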
Through the blind testing exercise, researchers concluded that an exact date can be predicted with 95 percent confidence whenever:
- Both replicas fall within the ellipse of the predicted score plot
- The age of the ink is within the time application range of the corresponding OPLS model
- The ink is well classified/clustered into one of the classes/groups of pen brands studied
This means that inks meeting these criteria have the same or a similar formulation, and they have been stored under the same or only slightly different conditions from those of the pen brands studied.
The last six documents were well classified and assigned to the corresponding OPLS model. However, one or both replicas of each document fell outside the ellipse of the model’s predicted score plot and did not meet the precision criteria. Thus, a reliable date for these documents could not be predicted.
A common feature of all six documents was that they were outside the temporal scope of application of the corresponding model. We therefore conclude that the OPLS models offer additional information: they can detect with 95 percent confidence which inks are temporally above or below the application limit, thus providing valuable information about the lower and upper age limit even when an exact date cannot be given.
In addition, chemometrics provided complementary information in this study by identifying the regions of the visible spectrum that were modified throughout the ink aging process. The areas delineated in light blue correspond to young inks, which shift over time to higher wavelengths.
The Pilot, Bic France, Staedtler, Uni-Ball, and Paper Mate ballpoint pen brands and the Uni-Ball Gel Pen brand were characterized over time by spectral modifications that moved toward higher wavelengths, whereas the Paper Mate, Pilot and Inoxcrom Gel brands shifted toward lower wavelengths. The BIC USA and Faber-Castell ballpoint pen brands were characterized over time by modifications in both the upper and lower wavelengths of the visible spectrum. The fact that the ink of each writing tool has characteristic aging hinted at universal models, as demonstrated in the blind testing exercise.
Want to know more?
If you are interested in collaborating in the next steps of this project, please do not hesitate to contact the researchers at luis.bartolome@ehu.eus.
Conclusions
- Multivariate regression methods have demonstrated great versatility in their scope of application, across different forensic fields and in their ease of coupling with the most commonly used analytical techniques.
- The preparation of custom-made synthetic laboratory samples aged under accelerated conditions has made it possible to overcome the unavailability of samples and enabled broad temporal scope methodologies in a rapid manner.
- Multivariate regression models can be easily updated as new samples are added, rendering more robust and reliable models.
- Chemometric-based dating methodologies have been able to respond to complex real-life claims that conventional methodologies fail to cope with. In addition, the new methodologies have overcome the disadvantages of their predecessors and achieved objectives that their predecessors did not reach.
- The statistical basis of chemometrics has made it possible to provide reliable and objective age estimates, indispensable for acceptance and decision-making in court.
- The guideline developed will be a valuable tool for use by the forensic science community in dating new evidence.
- Finally, testing the applicability of the methodologies to real-life scenarios should involve the corresponding experts: scientific police units, artists, forensic experts, anthropologists, and so on.