Customized Data Analytics with Python in SIMCA® and SIMCA®-online
The Python integration in SIMCA® and SIMCA®-online is paving the way for customized, flexible data analytics in a GMP environment. This popular open source scripting language is easy to use, has a large array of libraries and is backed by an active community of data scientists.
This article is posted on our Science Snippets Blog
How Data Analytics Can Unlock Hidden Potential
The use of big data and robust data analytics is crucial in today's biopharma world. From Design of Experiments (DoE) to continuous manufacturing, the advantages of data analytics are realized throughout the product lifecycle. Robust data analytics can help detect deviations in real time and uncover problems before they even occur, leading to superior pharmaceutical formulas, optimized scaling and production processes, and real-time process monitoring and control.
The large and complex datasets generated by processes along the biopharmaceutical lifecycle are a treasure trove of insights and information that can provide valuable time-to-market advantages. In their quest to harness the power of big data, data scientists are getting creative in developing individualized analytics solutions.
The Top 3 Challenges Faced by Data Scientists
Data scientists charged with unlocking the hidden potential of big data in the biopharmaceutical industry are faced with three main obstacles:
- dirty data
- lack of management or financial support
- colleagues with limited data science expertise
The large and complex datasets obtained along the biopharmaceutical lifecycle are often heterogeneous and incomplete. To use such datasets in Multivariate Data Analysis (MVDA), the data must first be cleaned and preprocessed. How to normalize data or deal with missing values depends heavily on the source and properties of a dataset, and often requires customized analytical solutions.
Incorporating open source scripting in the pretreatment process provides flexibility, a large array of data cleaning techniques, and automation solutions otherwise out of reach for many data scientists in the industry.
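As a rough illustration of such pretreatment, the sketch below reports the fraction of missing values per variable and then autoscales the data (mean-centering and unit-variance scaling) while ignoring missing entries. It uses plain NumPy rather than any SIMCA® API; the function names and the small example table are hypothetical and chosen only for illustration.

```python
import numpy as np


def missing_fraction(X):
    """Return the fraction of missing (NaN) entries in each column."""
    return np.isnan(X).mean(axis=0)


def autoscale(X):
    """Mean-center each column and scale it to unit variance, ignoring NaNs."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std


# Hypothetical data: rows are observations, columns are process variables.
X = np.array([
    [7.0, 36.8, 1.2],
    [7.1, np.nan, 1.4],
    [6.9, 37.2, np.nan],
    [7.2, 37.0, 1.9],
])

print("Missing fraction per variable:", missing_fraction(X))
X_scaled = autoscale(X)  # missing entries stay NaN and can be handled later
print(X_scaled)
```

Whether autoscaling is the right normalization, and how the remaining gaps should be filled, depends on the dataset, which is exactly where a custom script is useful.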
Open source scripting also equips data scientists with the tools to customize automation in the workflow, enabling less experienced colleagues to perform preprocessing procedures and data analyses while reducing the risk of error. Automating crucial steps in the data analysis workflow and handing these tasks to less experienced colleagues comes with the added bonus of freeing up valuable time in the data scientist's schedule.
Ultimately, the free libraries and plentiful online resources available for open source data analysis tools empower data scientists to build flexible and highly specialized custom applications at no extra cost.
Why Data Scientists Choose Python
Python is one of the most popular open source scripting languages. It is free, and compared to many other programming languages, code written in Python is lean and short, which reduces both the risk of errors and the resources needed to fix them once they occur. Since Python scripts are operating system independent, integrating them into most environments is fast and simple. As a high-level programming language, Python is easy to read, quick to learn, and adaptable for any data scientist.
The vast number of available Python libraries and online resources, combined with an active online community of scientists and engineers, makes Python the scripting language of choice for data scientists.
The Benefits of Python in the SIMCA® Framework
Open source tools offer great flexibility, but the strict rules and regulations that apply in industry settings can hamper this flexibility. Commercial software, on the other hand, follows the strict data processing guidelines required in industry settings, but is limited by rigidity. Utilizing the core functionality of validated software as a framework and integrating customized applications built with open source tools enable flexibility in a validated environment.
With the Python integration in SIMCA® and SIMCA®-online, Python scripts can be created and customized for specific applications.
The scripts are then validated individually and integrated within the SIMCA® framework, saving time and resources. Taking advantage of validated Python scripts allows data scientists to customize GMP-compliant data analytics and even feed into online applications.
Replacing Missing Values and Automated Model Building With Python in SIMCA®
Building spectroscopy models by hand is a time-consuming and repetitive process. For example, a typical manual workflow for modeling glucose concentration in a bioreactor is to trim the spectrum, select several filters to test, and compare the resulting models using a metric such as Q2. This cycle of selecting filters and comparing outcomes is repeated until a sufficiently good model is found. To speed up the workflow, a Python script can evaluate dozens of models within minutes, taking over the repetitive tasks of selecting filters and comparing results while also reducing the risk of errors.
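The screening loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SIMCA® scripting API: the spectra are synthetic, the trim indices are arbitrary, the candidate filters (SNV and a first derivative) are just examples, and a simple principal component regression stands in for the PLS or OPLS models one would build in SIMCA®. Q2 is computed as 1 - PRESS / SS from k-fold cross-validation.

```python
import numpy as np


def snv(X):
    """Standard normal variate: center and scale each spectrum (row)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)


def first_derivative(X):
    """First derivative along the wavelength axis."""
    return np.gradient(X, axis=1)


def fit_pcr(X, y, n_components):
    """Principal component regression (a simple stand-in for a PLS model)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                      # loadings
    b, *_ = np.linalg.lstsq(Xc @ P, y - y_mean, rcond=None)
    return lambda X_new: (X_new - x_mean) @ P @ b + y_mean


def q2(X, y, n_components=3, n_folds=5):
    """Cross-validated Q2 = 1 - PRESS / SS_total."""
    folds = np.arange(len(y)) % n_folds
    press = 0.0
    for k in range(n_folds):
        test = folds == k
        predict = fit_pcr(X[~test], y[~test], n_components)
        press += np.sum((y[test] - predict(X[test])) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


# Synthetic spectra: one concentration-dependent peak plus baseline effects.
rng = np.random.default_rng(0)
n_samples, n_channels = 40, 200
w = np.linspace(0.0, 1.0, n_channels)
conc = rng.uniform(1.0, 10.0, n_samples)         # hypothetical glucose values
peak = np.exp(-((w - 0.5) / 0.05) ** 2)
spectra = (conc[:, None] * peak
           + rng.normal(0.0, 2.0, (n_samples, 1))        # baseline offset
           + rng.normal(0.0, 2.0, (n_samples, 1)) * w    # baseline slope
           + rng.normal(0.0, 0.05, (n_samples, n_channels)))

trimmed = spectra[:, 20:180]                     # trim the spectrum (arbitrary range)

filters = {"raw": lambda X: X, "snv": snv, "first_derivative": first_derivative}
results = {name: q2(f(trimmed), conc) for name, f in filters.items()}
best = max(results, key=results.get)

for name, value in results.items():
    print(f"{name}: Q2 = {value:.3f}")
print("Best filter:", best)
```

Because each filter here operates on one spectrum at a time, applying it before the cross-validation split does not leak information between folds. A filter that uses statistics across samples would need to be fitted inside each fold.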
For time series data, SIMCA® offers the option of replacing missing values with either a specific value or the latest non-missing value (last observation carried forward, LOCF). However, this approach is not suitable for data without linear correlation. With Python scripts, data scientists can integrate other methods, for example polynomial regression, to replace missing values during data pretreatment, widening the scope of SIMCA®.
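To make the difference concrete, the sketch below fills the same gap in a curved time series once with LOCF and once with a fitted polynomial. It is a standalone NumPy illustration, not the SIMCA® implementation; the function names, the polynomial degree, and the example series are assumptions made for this example.

```python
import numpy as np


def fill_locf(y):
    """Replace NaNs with the latest non-missing value (leading NaNs stay NaN)."""
    y = np.asarray(y, dtype=float)
    idx = np.where(~np.isnan(y), np.arange(len(y)), 0)
    idx = np.maximum.accumulate(idx)  # index of the latest observed value so far
    return y[idx]


def fill_polynomial(t, y, degree=2):
    """Replace NaNs with values from a polynomial fitted to the observed points."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    coeffs = np.polyfit(t[observed], y[observed], degree)
    filled = y.copy()
    filled[~observed] = np.polyval(coeffs, t[~observed])
    return filled


# Hypothetical quadratic trend with two consecutive missing observations.
t = np.arange(8.0)
y_true = 0.5 * t**2 - t + 3.0
y = y_true.copy()
y[[3, 4]] = np.nan

print("True values:    ", y_true)
print("LOCF fill:      ", fill_locf(y))
print("Polynomial fill:", fill_polynomial(t, y))
```

LOCF repeats the last observed value across the gap, so it flattens the curve, while the polynomial fit follows the trend of the surrounding points. The appropriate degree depends on the process and should be checked against the data.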
How to Get Started with Python in SIMCA® and SIMCA®-online
Getting started with Python in SIMCA® and SIMCA®-online is fast and simple, as Python version 3.7.9 comes pre-installed in SIMCA® 17 and SIMCA®-online 17, including NumPy, SciPy, and Pandas. These libraries are fundamental scientific computing tools, making Python particularly useful for MVDA and for preprocessing data for machine learning models.
In addition, any of the most popular Python libraries can be installed in SIMCA® and SIMCA®-online.
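A quick way to confirm what a given Python environment provides is to print the interpreter and library versions. The snippet below is a generic sketch using only the standard library; `library_versions` is a hypothetical helper name, and nothing here is specific to SIMCA®.

```python
import importlib
import sys


def library_versions(names=("numpy", "scipy", "pandas")):
    """Return a mapping of library name to version string, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = "not installed"
    return versions


print("Python", sys.version.split()[0])
for name, version in library_versions().items():
    print(name, version)
```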
Example scripts shared with the SIMCA® installation provide non-validated boilerplate code, allowing data scientists familiar with Python and experienced in SIMCA® and SIMCA®-online to build customized applications with minimal lead time.
The help file in SIMCA® contains useful documentation and examples for data scientists to start exploring the versatile analysis options provided by the Python integration.
Integrating open source scripting solutions in the validated framework of SIMCA® and SIMCA®-online allows data scientists to create tailor-made data analysis solutions that are fully GMP compliant. As a result, the potential of big data can be harnessed at no extra cost. Customized preprocessing applications to handle dirty data can be easily used by less experienced colleagues, while automation reduces the risk of errors.
Python scripting in SIMCA® and SIMCA®-online enhances functionality, from automating routine data analysis to implementing custom analytics functionality, translating to valuable time-to-market advantages.