Introducing R-based open source software and web services into traditional company structures
Creating reliable NIR predictive models is essential for customer satisfaction when using any NIR spectrometer. However, it is extremely challenging to get the right predictive model for each customer individually:
- Optimal training data is expensive and difficult to obtain.
- NIR predictive models require fine-tuning for each customer and product.
- Customers are mostly concerned about the performance of the predictive models.
The time span from instrument installation to customer satisfaction can stretch to six months or more, which can create frustration.
We have developed internal tools that make it possible to obtain a well-performing predictive model within a few minutes. However, in order to deploy the tool into an environment accessible to BÜCHI affiliates, we had to overcome several challenges.
In this post, we’ll talk about what it was like to integrate our R-based web services and libraries into the Software Department at BÜCHI.
About BÜCHI and NIR spectroscopy
Among its portfolio of more than 50 products, BÜCHI is an expert in Near Infrared (NIR) technology.
But what is that?
[PHOTO1][P3]
This is a sample of animal feed.
[PHOTO2][P4]
This is a sensing system setup.
[PHOTO3][P5]
You irradiate the sample with light using a normal light bulb.
[PHOTO4][P6]
The sample absorbs light (or energy), and in this case most of the light that is not absorbed is reflected back.
The way in which the sample absorbs energy is determined by the components of the sample and their amounts. This is the principle that allows us to later quantify those components.
[PHOTO5][P7]
There is a detector that captures the reflected light.
The reflected light is transmitted to a processing unit, where it is recorded as a signal or spectrum (the near-infrared spectrum).
This is all the system measures: the raw signal, which is meaningless to our end users.
[PHOTO6][P8]
The challenge is to translate that signal into something meaningful.
[PHOTO7][P9]
NIR predictive models do exactly that: they translate the NIR signal or spectrum into relevant property values.
In summary, NIR sensing technology comprises:
- Hardware
- Software
- NIR predictive models
In contrast with conventional lab methods, which are expensive and take days or weeks to deliver results, this technology reduces the analysis of a sample to about 15 seconds for all the properties or response variables you need, such as moisture, protein, or fat content. And it is cheap: you only need light.
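To give a flavour of what building such a model involves, here is a minimal sketch in R using partial least squares regression, a common chemometrics technique; the data are simulated and this is not our actual modelling pipeline.

```r
# Minimal sketch of an NIR calibration model (simulated data; real models need
# careful preprocessing, validation and outlier handling).
library(pls)

set.seed(1)
# 100 spectra with 700 wavelength points each, plus a reference moisture value
# measured with a conventional lab method
spectra  <- matrix(rnorm(100 * 700), nrow = 100)
moisture <- rnorm(100, mean = 12, sd = 2)
calib    <- data.frame(moisture = moisture, spectra = I(spectra))

# Partial least squares regression with cross-validation
model <- plsr(moisture ~ spectra, ncomp = 10, data = calib, validation = "CV")

# Predict the property of a new, unseen sample from its spectrum alone
new_sample <- data.frame(spectra = I(matrix(rnorm(700), nrow = 1)))
predict(model, newdata = new_sample, ncomp = 10)
```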
The predictive models are part of the sensing system: they need to be shipped to end users/customers along with the hardware and software required to operate the instrument.
Building reliable NIR predictive models is very important for our business; without them, our instruments cannot do much for our end customers.
Challenges
1st challenge: “Why R? Why don’t you rewrite the code in X language?”
In some cases, the Data Science and Software departments have different tools and use different programming languages. This is the case at BÜCHI. One of the challenges we faced is people questioning why we, the Data Scientists and ML Engineers, use R, followed by a “Why don’t we rewrite the code in X?”.
Rewriting code can be advantageous under very specific circumstances, such as:
- Complicated onboarding: when onboarding new developers onto a project is a hassle and takes a couple of months.
- Legacy system: when there is a legacy system that still works, but no one dares touch it for fear of creating chaos.
- Unreadable code: when no one, not even the most skilled person, can understand the code.
- Non-reusable code: when you can’t add new features without completely rewriting the existing source code or provoking a domino effect.
- Lack of documentation: when the documentation is not good enough or, worse, there is no documentation at all.
- Infinite bugs: when you cannot fix a bug without introducing another one.
- Performance issues: when rewriting the code in another language would considerably improve the performance of the system.
- Unmaintainable: when some of the technologies or dependencies of the system are outdated and no longer maintained.
- Non-scalable: when you plan to increase the number of users and the old system won’t be able to handle it.
- Complicated CI/CD: when you find it hard to set up Continuous Integration or Deployment.
In our case, there was no sound reason to rewrite the R packages we had developed. Our code was very well documented, tested and maintained. If you have ever put R-based systems into production, you may agree that setting up CI/CD pipelines for R code can be challenging, but it is definitely not impossible. So what was behind the “Why don’t you rewrite the code?”?
The DS team and the SWE team had different development and documentation guidelines, and even different methodologies and cultures.
Eventually, we found that the real issue was a lack of trust between the DS and SWE departments. This mistrust manifested in various ways, including resistance to using each other’s tools and code.
Addressing this lack of trust is key to resolving conflicts and improving collaboration between the teams.
2nd challenge: putting R into production, “that language is not suitable for production”, “R is single threaded”
Every time we faced a setback or a hard question, there was always someone telling us “That language is not suitable for production” or “It is single threaded”.
This concern comes from the belief that R systems cannot scale, but the fact that R is single-threaded does not mean that you cannot scale.
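As a rough sketch (not our production code), this is what a single R prediction endpoint can look like with the plumber package; the /predict route and the predict_properties() stub are made up for the example. Scaling comes from running several such processes behind a load balancer, not from threads inside one R session.

```r
# plumber.R -- minimal sketch of exposing an R model as a web service
# (the route and predict_properties() are hypothetical stand-ins)
library(plumber)

# Stand-in for the real model; here just a dummy returning fixed values
predict_properties <- function(spectrum) {
  list(moisture = 11.8, protein = 22.5, fat = 4.1)
}

#* Predict sample properties from a NIR spectrum
#* @param spectrum numeric vector of absorbance values
#* @post /predict
function(spectrum) {
  predict_properties(as.numeric(spectrum))
}

# To serve it (one single-threaded worker per process):
#   pr <- plumber::plumb("plumber.R")
#   pr$run(host = "0.0.0.0", port = 8000)
# Scale horizontally: run several of these processes (e.g. in containers)
# behind a reverse proxy or load balancer.
```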
3rd challenge: Scarce and overly general information
Another challenge we encountered is that, at the time we were developing the libraries and investigating how to put R into production, the information available was scarce and very shallow. Because of this, it was also harder to give the software team material they could use to validate or contrast what we were doing.
Actually, there were more blogs against putting R into production than blogs advocating for it. This is changing, though. You can still find people warning you about putting R into production, but you can also find more resources and tips on how to do it.
Something I would like to add is that we noticed most of these blogs are either biased towards a specific language rather than R, or simply misinformed.
This is an extract from a blog post from last year. Let’s quickly address its concerns:
- There are many ways of handling versions: you can package your tools, configure renv, and even create a Docker container for an extra layer of isolation.
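As a concrete example (a sketch only), the basic renv workflow for pinning package versions looks like this; the resulting lockfile can then be restored inside a Docker image for that extra layer of isolation:

```r
# Reproducible dependency handling with renv (illustrative sketch)
install.packages("renv")

renv::init()      # create a project-local library and a lockfile (renv.lock)
# ... install and use the packages your model code needs ...
renv::snapshot()  # record the exact package versions in renv.lock

# On another machine, or inside a Docker image, restore the same versions:
renv::restore()
```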
We strongly suggest being careful about what one publishes in a blog, because it can really mislead people.
4th challenge: Open source is not safe
One of the misconceptions about open-source code is that it is not safe because it is available to the public and free for anyone to use, modify, or inspect. On the contrary, open source:
- helps build trust in chemometrics solutions
- enables companies to stay on top of new scientific advancements
- accelerates the modernization of the NIR industry
- is more cost-effective than a proprietary solution
- gives enterprises the ability to attract better talent
- enables transparency, accessibility and customizability for end users
Any kind of code (closed source or open source) brings security threats to a project. It is the developers’ responsibility to write secure and reliable code, because security breaches can happen due to a number of mistakes, such as:
- not following security guidelines
- improperly setting up software
- using weak passwords
- lack of data validation processes
- absence of data encryption techniques
A commercial licence doesn’t guarantee security. Unlike proprietary software, open-source projects are transparent about potential vulnerabilities; with paid software you simply have to trust the vendor. With open-source code you can also take part in code reviews, make suggestions, raise issues, collaborate with other people to use or adapt your own version, release your own patch, or even disable certain functionality.
5th Challenge: Software Systems vs ML systems
Deploying Machine Learning systems requires evaluating and monitoring ML performance metrics in addition to traditional operational metrics.
This is the tricky part of Machine Learning systems: you have to take care of both at once, which can represent a huge learning curve for SWE teams with no experience in deploying ML systems.
In traditional software, we mostly care about a system’s operational expectations: whether the system executes its logic within the expected operational metrics such as latency, throughput, and availability. In ML-based software you also encounter things like data collection and preprocessing problems, poor hyperparameters, and data distribution shifts that can cause a model’s performance to deteriorate over time.
Operational expectation violations are somewhat easier to detect, because they usually tell you what is wrong and give you clues to act on: you more or less know what to do with a timeout or an out-of-memory error. ML performance failures, however, are harder to detect, because they require measuring and monitoring the performance of ML models in production.
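To make this concrete, here is a minimal, hypothetical sketch of what such monitoring can look like in R: periodically compare the model’s predictions against values obtained with the reference lab method, compute an error metric (here RMSEP), and raise a warning when it drifts past a threshold. The column names and the threshold are made up for the example.

```r
# Hypothetical sketch of ML performance monitoring in production
check_model_drift <- function(reference_data, threshold_rmsep = 0.5) {
  # reference_data: recent samples with the model prediction and the value
  # obtained with the reference lab method
  rmsep <- sqrt(mean((reference_data$predicted - reference_data$reference)^2))
  if (rmsep > threshold_rmsep) {
    warning(sprintf("RMSEP = %.3f exceeds threshold %.3f: model may be drifting",
                    rmsep, threshold_rmsep))
  }
  rmsep
}

# Example with simulated reference measurements
recent <- data.frame(predicted = rnorm(20, 10, 1),
                     reference = rnorm(20, 10, 1.2))
check_model_drift(recent)
```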
6th challenge: Cross team collaboration
One can successfully develop R packages, convince decision makers about the business value, and provide enough information to follow best practices for interoperability and building a multilingual team of teams, but if not everyone gets on board, it just won’t work.
It can take more time to define the collaboration than to actually collaborate on building the systems, because of differences in work methodologies (like agile-based frameworks), guidelines and conventions.
On top of that, experimentation makes up a huge part of developing ML systems, whereas it plays a much smaller role in traditional SWE work.
We all had to get out of our comfort zone.
Tips
There is no fixed recipe for this, but here are some of the things that helped us build a multilingual team in a traditional software company:
Data Science Team
- Rapid prototyping to present solutions and show how they would look in the future
- Following good software engineering practices
- Providing detailed documentation
- Getting involved with, and learning about, the tools the software department uses
- Proposing options for integration with software tools
Software Engineering Team
- Being open to experiment, fail and iterate.
- Understanding how data science tools can create value.
- Introducing a data culture.
- Switching mindset from a closed software engineering way of work to a more holistic one.
- Expanding from a narrow focus on technical engineering to a more inclusive and collaborative approach that values diverse perspectives and beliefs.
Both teams
- Putting business value first, which can be achieved by developing strong domain knowledge.
- Conducting code reviews and workshop sessions.
- Defining and standardizing communication language and terminology.
- Establishing API contracts.
- Involving external experts in the process to get unbiased feedback.
- Creating an open and safe environment for team bonding.
Conclusions
Projects shouldn’t be language-specific. Focusing on the value generated rather than on the tooling used improves cross-team collaboration. Teams must be willing to experiment, communicate effectively, and adopt a more inclusive and collaborative approach that values diverse perspectives and beliefs.
Our collective efforts, which involved rapid prototyping, adherence to software engineering best practices, and embracing diversity, have illuminated a path forward. The knowledge gained from these experiences enables us to enhance instrument performance, drive innovation, and ultimately contribute to the progress of the NIR ecosystem.
The NIR industry would highly benefit from open-source chemometrics software, novel collaboration methods, and a greater emphasis on data transparency, explainability, and reproducibility. By doing so, we can establish a strong foundation for building reliable NIR predictive models that meet customer needs and contribute to the modernization of the NIR industry.
Embracing open source can go a long way towards making the NIR business ecosystem more flexible and modern.
Practitioners, regardless of their institutional affiliations or resource constraints, can partake in this transformative journey and push the boundaries of NIR’s potential.