SafeOne Machine Learning model to predict industrial incidents in Chemical and Gas Industries

Understanding occupational incidents is one of the necessary measures in a workplace safety strategy. Analyzing trends in occupational incident data helps to identify potential pain points and to reduce loss. Optimizing machine learning algorithms is a comparatively new trend for fitting the prediction model and algorithms in the right place to support human beneficial factors. This research aims to build a prediction model that identifies occupational incidents in chemical and gas industries. This paper describes the design and approach of building and implementing the prediction model to predict the cause of an incident, which may be used as a key index for achieving industrial safety specific to chemical and gas industries. The implementation of the scoring algorithm together with the prediction model should bring unbiased data from which to draw a logical conclusion. The prediction model has been trained against incident data comprising 25700 chemical industrial incidents with accident descriptions for the last decade. Inspection data and incident logs should be processed on top of the trained dataset to verify and validate the implementation. The result of the implementation provides insight into the patterns and classifications and also contributes to an enhanced understanding of quantitative and qualitative analytics. Innovative cloud-based technology opens the gate to ingest continuous in-streaming data, process it, and output the desired result in real time. The primary technology stack used in this architecture is Apache Kafka, Apache Spark, KSQL, data frames, and AWS Lambda functions. Lambda functions are used to implement the scoring algorithm and the prediction algorithm and to write the results back to AWS S3 buckets.
The proof-of-concept implementation of the prediction model helps industries to see through the incidents and lays out the base platform for various safety-related implementations, which benefit the workplace's reputation and growth and reduce attrition in human resources.


Introduction
Every worker who leaves home for work ought to return home safe and sound. Thinking of the situation turning out otherwise always evokes an emotional response. Particularly in the chemical and gas industries, incidents affect not only the individual but also the environment, often severely. The impact can last for years, sometimes decades. Generic machine learning algorithms often demand many parameters and fall short of specific needs, so they do not work for every industry and organization to deliver the expected results within a given timeline. Including the various industry-specific factors in machine learning algorithms can benefit chemical and gas industries through reduced expenses, increased productivity, and improved work practices. Analysis of industrial incident safety measures appears to be the weakest part of the industrial safety management system.
The Categorial Scoring Model and the SafeOne Prediction Model, which is based on Support Vector Machines (SVM) and was developed for the prediction of incidents, need an architecture to carry them through to a suitable implementation. Incoming data should be monitored continuously to determine the precise score and provide the expected output. Developing a proof of concept (POC) helps the organization to see the potential outcome of the solution and also helps to identify its gaps. It also lets stakeholders evaluate the proposed solution internally, which helps to reduce unnecessary risk. Design expectations and a potential timeline can also be determined before full-scale implementation. Applying a defined algorithm is not a simple task. As part of the POC, it is necessary to build a visual interface to examine the best possible results. The approach needs to be quantitative in order to describe the usefulness of the measurement rates in calculating precision and accuracy. The accuracy score focuses on the outcome of the measurement rates to help organizations in decision-making and also paves the way to eliminating occupational incidents.
The remainder of the paper is organized as follows. Section 2 reviews the key literature by researchers and scholars in the field of workplace safety. Section 3 defines the research methodology, and Section 4 explains the development and implementation process of the work. Section 5 discusses the results, compares the performance of the model against other models, and concludes the paper with a summary and directions for future work.

Review of Literature
The internal-external locus of control theory, developed by Rotter in 1966, was one of the first psychological constructs examined as a possible predictor of accident potential. There has been much success in using this construct as a means of predicting involvement in accidents. Christopher A. Janicak, in 1996, published a study on predicting accidents at work with measures of locus of control and job hazards [33]. The study analyzed the accident locus of control scale items, which are very useful for measuring the level of a job hazard. Janicak found that the locus of control score combined with the level of job hazard score produced 89% accuracy on accidents and 70% on non-accidents. Using the level of job hazard alone as a parameter produced 79% accuracy. Similarly, using the locus of control score alone, the model produced 86% accuracy on accidents and 43% on non-accidents. Ronza A. et al., in 2003, developed a methodology to assign a frequency value to the sequence scenarios by multiplying the probability of occurrence by the frequency of the root event. The reliability of this procedure is supported by a wide range of historically documented accidents [56].
Martine Reurings and Theo Janssen, in 2006, developed the project Infrastructure and Road Safety, which aimed to find (mathematical) relations between characteristics of the Dutch road infrastructure and road safety. The research describes the relations between characteristics of the Dutch road infrastructure on the one hand and road safety on the other, using risk and exposure measures. Models are the subject of Work Package 2 of RIPCORD-ISEREST, which started with an overview of the state of the art in accident prediction models and road safety impact assessments [54]. Dipo T. Akomolafe and Akinbola Olutayo, in 2012, explained the use of data mining techniques to predict the cause of accidents and accident-prone locations on highways. Experiments were done using the decision tree algorithms ID3 and FT (Function Tree) to determine the cause. From the detailed accuracy by class and the confusion matrix, ID3 attained an accuracy rate of 0.777 and FT attained an accuracy rate of 0.703 [66]. Jan K. Wachter and Patrick L. Yorio, in 2013, theoretically and empirically developed the ideas around a system of safety management practices. Ten practices were elaborated to test their relationship with objective safety statistics such as accident rates and to explore how these practices work to achieve positive safety results, namely accident prevention through worker engagement.
Results indicated that there is a significant negative relationship between the presence of the ten individual safety management practices, as well as the composite of these practices, and accident rates; there is a significant negative relationship between the level of safety-focused worker emotional and cognitive engagement and accident rates; safety management systems and worker engagement levels can be used individually to predict accident rates; safety management systems can be used to predict worker engagement levels; and worker engagement levels act as mediators between the safety management system and safety performance outcomes [69].
The study of the related papers provided clarity on the work and approaches carried out earlier, not only in the field of computation but also in fields including psychology, mathematics, civil engineering, and human studies. It reveals that there is limited work on extracting classified knowledge of incidents from semi-structured inspection data. Reviewing the literature provides insight into the incidents but did not contribute much, statistically, to determining the prediction model. This research attempts to bridge the gap in using semi-structured, multivariate inspection data by leveraging a vector-based classification model, inspired by the principles of grid search, as a powerful tool to determine the prediction model for the given inspection data. This research intends to contribute to occupational safety by assessing workplace safety in order to predict workplace incidents.

Research Methodology
This paper describes the methodology of predicting workplace incidents through three general processes: data refinement, the Categorial Safeness Score, and the prediction model. The lack of an effective data analytics procedure for historical industrial incident data, inspection/auditing datasets, and risk assessment datasets leads to an unsafe workplace. Data analytics and machine learning algorithms are often too linear or too sparse, which most of the time makes the algorithm overfit or underfit the requirement. To prevent fatal and non-fatal incidents, an efficient implementation of predictive analytics is needed. Predictive analytics includes not only the optimized data mining algorithm but also the identification of the right factors to make a precise prediction.

Data Refinement Model
Problem solutions should provide a viable course of action and form the basis for implementing and achieving the objectives and control measures. Employees and workers who are too close to the sequences may no longer perceive and recognize the hazards, or may judge the incidents as trivial because, to their knowledge, no one has been harmed. The aim of the problem solution should be that everyone tackles the scenario with a fresh pair of eyes and a questioning approach.
Data cleansing is mainly required to determine the cause of those incidents that are recorded with an unknown cause. Text mining is the best candidate solution for finding the best match when predicting the cause of an incident. A multi-factor classification algorithm should be implemented to determine the weightage of each possible cause of the incident, and this weightage can then be used to decide upon the cause. Past occurrences with known factors and a defined cause serve as the base on which the model is trained to determine the result. Cause factors are classified, and a weightage is calculated for each cause. The learning parameter is updated at every calculation of an incident. The cause of the incident is then predicted from the weight calculated using the learning parameter.
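The weighting and update steps described above can be illustrated with a minimal sketch. It assumes a perceptron-style update over the words of the accident description, which the paper does not specify; the class name CauseWeigher, the learning_rate parameter, the cause labels, and the example descriptions are all hypothetical.

```python
from collections import defaultdict


class CauseWeigher:
    """Keeps one weight per (cause, word) pair. Illustrative only."""

    def __init__(self, causes, learning_rate=1.0):
        self.causes = list(causes)
        self.learning_rate = learning_rate  # assumed form of the "learning parameter"
        self.weights = {cause: defaultdict(float) for cause in self.causes}

    def scores(self, description):
        """Sum the word weights of the description for every candidate cause."""
        tokens = description.lower().split()
        return {c: sum(self.weights[c][t] for t in tokens) for c in self.causes}

    def predict(self, description):
        """Return the cause with the highest total weight."""
        totals = self.scores(description)
        return max(self.causes, key=lambda c: totals[c])

    def update(self, description, true_cause):
        """Adjust weights only when the current prediction is wrong."""
        predicted = self.predict(description)
        if predicted != true_cause:
            for token in description.lower().split():
                self.weights[true_cause][token] += self.learning_rate
                self.weights[predicted][token] -= self.learning_rate


# Hypothetical historical incidents with a known cause.
history = [
    ("valve leak released gas", "equipment failure"),
    ("operator opened valve without permit", "human error"),
]

model = CauseWeigher(["human error", "equipment failure"])
for text, cause in history:
    model.update(text, cause)

# An incident logged with an unknown cause is matched against the learned weights.
unknown_cause = model.predict("gas leak released near pump")
```

A mistake-driven update keeps the sketch short; a production text-mining step would normally add tokenization, stop-word handling, and a held-out validation set.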

Categorial Safeness Score
The methodology to determine the safeness score involves the analysis of historical safety incident data. The analysis crawls through the data and determines the score of each category, which is used as a key parameter in the prediction model. An improvised vector-based data mining algorithm inspired by the principles of coordinate descent can be implemented for each category of safety measure to determine the score of that category. The score should be calculated as a percentage with a value between 0 and 100. Scores are calculated from the data collected over the past 30 days, at a specified frequency (once a day, in off-peak hours) determined by the organization according to its requirements and convenience. The Overall Score is a weighted average of every category identified for the industry. The Safe Score can be defined and classified as per the following ranges:
86-100: Overall Safe (Green)
60-85: Situation/Category requires attention (Amber)
Below 59: Potential Safety Issue (Red)
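The weighted average and the colour ranges can be sketched as follows. This is a minimal illustration, not the paper's scoring algorithm: the category names and weights are hypothetical, and because the stated ranges leave the boundaries at 59-60 and 85-86 undefined, the sketch assumes cut-offs of 60 and 86.

```python
def overall_score(category_scores, weights):
    """Weighted average of per-category scores, each between 0 and 100."""
    total_weight = sum(weights[name] for name in category_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(category_scores[name] * weights[name] for name in category_scores)
    return weighted_sum / total_weight


def classify(score):
    """Map a score to a colour band (assumed cut-offs: 86 and 60)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 86:
        return "Green"
    if score >= 60:
        return "Amber"
    return "Red"


# Hypothetical categories and weights, for illustration only.
scores = {"gas detection": 90, "permit to work": 70, "housekeeping": 50}
weights = {"gas detection": 2, "permit to work": 1, "housekeeping": 1}

overall = overall_score(scores, weights)  # (180 + 70 + 50) / 4
band = classify(overall)
```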

SafeOne Prediction Model
The primary objective and ultimate goal of this research is to derive a prediction model that yields the unsafe percentage of a department or an industry, using safety scores that have been transformed from information into knowledge. Analyzing the trends of occupational incident data helps to identify potential pain points and to reduce loss. Optimizing the grid search-based machine learning model fits it in the right place to support human beneficial factors. Implementing new algorithms and new models involves a similar setup process to verify and validate real-time scenarios. Understanding the mathematical calculation of the measurement rates helps to place the model that the industry demands. Incident rate, lost time case rate, severity rate, and lost workday rate are important calculations with which the model must comply. Chemical and gas industries worldwide carry the potential risk of occupational hazards, which lead to incidents. A model that predicts incidents from inspection and incident data helps the industry by eliminating, or at least reducing, incident rates. The SafeOne prediction model has been constructed to predict the potential unsafe percentage value of occupational incidents in the industry. The trained model is verified and validated before it writes the prediction value. Safety scores are the inputs to this model and are weighted in different categories. Impact factors are applied to predict the Unsafe Percentage. The SafeOne prediction model delivers the prediction for the next 3 to 7 days, depending on the industry under implementation, through an expert assessment.
where wp1, wp2, and wp3 represent the weights, which default to 9, 3, and 1 respectively; A represents the data set; and the score is the value that the Scoring algorithm assigns to the data points. The SafeOne model predicts the "Unsafe" percentage with respect to all the factors considered during the machine learning process.
The SafeOne prediction model is continuously trained on data streams from industrial inspection data and observation data from report feeds. The factors considered for the classifications and models are supplied for the manipulation of the data. Based on this data, the SafeOne prediction model has been trained to predict the potential value of occupational incidents in the industry. Once the model is trained, a test set of data is applied to verify and validate the model before it writes the prediction value. The outcome of the prediction model should be validated against known results obtained in the past. The machine learning model has been trained to produce the result from the scores, through the learning parameters, for the specified number of upcoming days.
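The measurement rates named in this section are not defined in the text. As a minimal sketch, the following assumes the widely used OSHA-style normalization to 200,000 hours worked (100 full-time employees over one year) and takes the severity rate as lost workdays per recordable case, a definition that varies between organizations. The function names and example figures are hypothetical.

```python
HOURS_BASE = 200_000  # 100 employees x 40 hours x 50 weeks (OSHA-style base)


def rate_per_base(count, hours_worked):
    """Normalize a count to the 200,000-hour base."""
    if hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    return count * HOURS_BASE / hours_worked


def incident_rate(recordable_cases, hours_worked):
    return rate_per_base(recordable_cases, hours_worked)


def lost_time_case_rate(lost_time_cases, hours_worked):
    return rate_per_base(lost_time_cases, hours_worked)


def lost_workday_rate(lost_workdays, hours_worked):
    return rate_per_base(lost_workdays, hours_worked)


def severity_rate(lost_workdays, recordable_cases):
    """Assumed definition: average lost workdays per recordable case."""
    if recordable_cases <= 0:
        raise ValueError("recordable_cases must be positive")
    return lost_workdays / recordable_cases


# Hypothetical plant-year: 500,000 hours, 5 recordable cases,
# 2 lost time cases, 30 lost workdays.
rates = {
    "incident_rate": incident_rate(5, 500_000),
    "lost_time_case_rate": lost_time_case_rate(2, 500_000),
    "lost_workday_rate": lost_workday_rate(30, 500_000),
    "severity_rate": severity_rate(30, 5),
}
```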

Implementation

Working Model Architecture
The architecture of the working model, shown in Figure 2, details the components involved. The integration between these components is arranged to provide a solution that scales with future data load. From the data stream through to the outcome of the prediction model, cloud infrastructure helps to deliver a reliable solution.
Figure 2. High-level architecture
Inspection data, sensor data from the gas detection instruments, and historical incidents are streamed through Kafka. Data streaming is a method of posting a continuous stream of data that can be processed by the algorithms to obtain structured data. Multiple sources can send data simultaneously to meet the requirements of real-time data analytics. The continuous stream of data is put in a bucket called a topic. A consumer program can subscribe to a Kafka topic to stream its data for processing. Topics are partitioned according to data size and volume, for speed and scalability. Data sources send data to topics, and the subscribed consumer application takes care of relaying it. Each partition is assigned to a Kafka broker for parallel processing. Messages are typically key-value pairs from which the structured data is constructed. The stream is divided into RDDs (Resilient Distributed Datasets), the fundamental data structure of Spark. RDDs are divided into partitions, which consist of tuples. The worker nodes take care of processing the data in Spark. The Kafka-Spark connector allows partitions to be mapped between RDDs and Kafka topics.
Processing of the data takes place through Spark jobs. Spark jobs are small programs that clean up, manipulate, and apply a specific algorithm to the data that is streamed and stored in the data lake. A data lake is a collection of data frames stored in a storage bucket. Spark jobs written in Scala in a notebook execute the Scoring algorithm to refine and restructure the data, which is then used as input for the SafeOne prediction model. Lambda functions execute the logic on the structured data to provide the expected outcome. Incremental algorithms can be used to process both historical data and real-time data. A heatmap representation of the data can be generated from the algorithm to visualize the results. The data dashboard displays the required heatmap and also keeps the data live through push notifications.
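The routing of key-value messages to topic partitions can be illustrated without a running cluster. The sketch below is an assumption-laden stand-in, not Kafka client code: Kafka's default partitioner hashes the message key with murmur2, whereas this example uses zlib.crc32, and the keys, payloads, and partition count are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from the message key (crc32 stands in for murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Group (key, value) messages by the partition their key maps to."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


# Hypothetical messages from two kinds of sources.
messages = [
    ("sensor-17", {"gas_ppm": 12}),
    ("inspection-3", {"category": "permit to work", "passed": False}),
    ("sensor-17", {"gas_ppm": 15}),
    ("sensor-42", {"gas_ppm": 3}),
]

routed = route(messages, num_partitions=3)
```

Because the partition depends only on the key, all readings from one source land in the same partition and keep their order, which is what allows partitions to be processed in parallel.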

Results and Discussions
Graphing methods vary according to the scales of measurement and presentation. Evaluation of the categorized inspection score determines the safe score from the model, and a radar graph, also called a spider web graph, is used to plot the scores of each inspection type. A value of 0 is the safest zone and a value of 100 is the unsafe zone on the radar. "Safe" and "Unsafe" radar graph representations were produced from the processed inspection data and the score values obtained from the SafeOne prediction model.
The data was split into a training set of 70%, which is 754,660 rows, and a test set of 229,000 rows. An overall error has also been calculated from the results as part of the prediction model calculation. These results are significant and offer a positive, forward-looking way to practice safety in organizations and prevent potential incidents. This working model provides the basic idea of visualizing the results in real time by setting up the base platform to walk smoothly through the workflow from raw data to prediction results. The outcome of the prediction model is validated against known results obtained in the past. Historical data of the prediction score is plotted to visualize the Safe and Unsafe values for the organization, as shown in Figure 5. This has been leveraged to extrapolate the trend.
Total      107   87    194
Accuracy   97%   85%   92%
Table 1. Accuracy Score
Summing across a row yields the total number of incidents with an actual positive state, while summing down a column yields the total number of times that the corresponding decision was made. High result values indicate the occurrence of an incident. Because the distributions of result values for UNSAFE and SAFE incidents overlap, increasing the threshold value makes both false positive and true positive predictions less frequent, while true negative and false negative predictions become more frequent.
An ROC curve that lies close to the upper left corner indicates that the model has a high accuracy score. Table 1 shows the accuracy score obtained from the measurements of the prediction model.
A threshold value should be chosen as a compromise among these trade-offs between SAFE and UNSAFE incident results. The results closely match the objective of raising the accuracy with which occupational incidents in the chemical and gas industries are predicted. The ROC curve is shown in Figure 6 and corresponds to the expected accuracy score of 92% determined from the prediction model.
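The threshold trade-off can be made concrete with a small sketch. The scores and labels below are synthetic and are not the paper's data; the function names are hypothetical. The sketch counts the four confusion-matrix cells at several thresholds and derives the points from which an ROC curve is drawn.

```python
def confusion_counts(scores, labels, threshold):
    """Count (tp, fp, tn, fn) when scores >= threshold are predicted UNSAFE."""
    tp = fp = tn = fn = 0
    for score, is_unsafe in zip(scores, labels):
        predicted_unsafe = score >= threshold
        if predicted_unsafe and is_unsafe:
            tp += 1
        elif predicted_unsafe:
            fp += 1
        elif is_unsafe:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def roc_points(scores, labels, thresholds):
    """Return (threshold, false positive rate, true positive rate) triples."""
    points = []
    for threshold in thresholds:
        tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
        if tp + fn == 0 or fp + tn == 0:
            raise ValueError("both UNSAFE and SAFE examples are required")
        points.append((threshold, fp / (fp + tn), tp / (tp + fn)))
    return points


# Synthetic, overlapping score distributions (True means an UNSAFE incident).
scores = [95, 88, 80, 72, 65, 58, 50, 41, 33, 20]
labels = [True, True, True, False, True, False, True, False, False, False]

low = confusion_counts(scores, labels, 40)
mid = confusion_counts(scores, labels, 60)
high = confusion_counts(scores, labels, 80)
curve = roc_points(scores, labels, [40, 60, 80])
```

In this synthetic example, raising the threshold from 40 to 80 lowers both the true positive and false positive counts while the true negative and false negative counts rise, which is the behaviour described in the text.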

Conclusion
The contributions of this research begin with predicting the cause of the incident, which gives industries cleaned-up data for better placement of the score. The scoring algorithm helps to identify the safe score in each category, and the SafeOne prediction model calculates the overall Safe Score from the category scores. A score set for the inspection data has been obtained through the Scoring algorithm and serves as input for the prediction model. Results have been visualized using a heatmap, a scattergram, and a radar graph, along with the prediction data used to determine the safe and unsafe percentage. Inspections and incidents are tightly coupled data, which brings the prediction mechanism into the spotlight. The objective of this study is to eliminate occupational incidents by providing an efficient and robust prediction model with high accuracy. The accuracy score obtained for the SafeOne prediction model shows that the model works well in predicting incidents. A comparison across several models on accuracy score puts SafeOne at the top.
The outcome of the research provides insight into the patterns and classifications and also contributes to an enhanced understanding of quantitative and qualitative analytics. Cutting-edge cloud-based technology opens the gate to ingest continuous in-streaming data, process it, and output the desired result in real time. The research helps industries to see through the incidents and lays out the base platform for various safety-related implementations, which benefit the workplace's reputation and growth and reduce attrition in human resources. The number of auditing/inspection reports is directly proportional to workplace safety. Organizations that engage employees in identifying unsafe workplaces have fewer incidents. This research aims to determine the best-suited algorithm for workplace safety, with the intent of sending every employee home safe at the end of every day. After all, if workplace incidents can be predicted, they can be prevented.