A Proposed Integration Architecture for University Research Data Repository to Support University and University Hospital on Medical Digital Image Management and Analytics using Hadoop

Abstract: Big Data technology has been adopted in universities and hospitals because of its potential for managing large volumes and many types of data. However, a university that also operates a hospital may need to integrate the data repositories of both so that a single site of access simplifies system administration and management. The need for image analytics among both university researchers and university hospital physicians calls for a Big Data platform such as the Hadoop framework. Based on the literature, no paper describes in detail the integration of Big Data for a university that includes its own teaching hospital. Therefore, this paper proposes a research data architecture for a university and its university hospital that supports a shared data repository with image analytics capability using Hadoop technology.


Introduction
In recent years, Big Data technology has been touted as a significant component in improving the management of organizations. One of the sectors that has adopted Big Data technology is higher education. Higher education falls under the education umbrella but differs from the rest of the sector in that it also includes research and innovation. Managing research and innovation poses greater challenges because it involves funds, expertise, research teams, research outputs and research funders, which may include government agencies as well as private agencies. Some universities also have their own hospitals, where research findings are applied directly in practice. This increases the complexity of university management, because it includes the critical services of the university hospital, and in turn increases the complexity of IT services such as data analytics and the handling of large amounts of health data, including digital images.
The inclusion of Big Data capability has improved many aspects of managing this complexity, because data analytics provides a better understanding of organizational processes through data-driven decision-making (Mjoohl et al, 2019). For a university, Big Data is very important in supporting smart campus initiatives by providing a powerful computation infrastructure for Learning Management System analysis, such as finding correlations across multiple data sources, predicting the behavior of an entity, or analyzing social networks (Banica et al., 2014). In the healthcare domain, traditional patient data storage is not scalable enough for the increasing number of patients and applications, so Big Data approaches have taken over this role and have been implemented in hospitals (Belle et al., 2015; Sobhy, El-Sonbaty and AbouElnasr, 2012). Big Data is defined as a massive collection of shareable data originating from any kind of private or public digital source, which represents on its own a source for ongoing discovery, analysis, and Business Intelligence and Forecasting (Banica et al., 2014; Hussain et al., 2020).
Hadoop is a component of Big Data computing that provides a powerful solution for managing and processing large amounts of data coming from multiple data sources and in different formats. It is a distributed platform that comprises a name node and many data nodes (Farhan. Z, et al, 2020). In recent years, Hadoop has frequently been used in the field of health services (i) to develop frameworks, (ii) to develop systems for processing large medical data, and (iii) to analyze large-scale medical images (Erguzen and Erdal, 2018).
However, implementing Big Data in universities and hospitals raises some issues, because it demands technical expertise and costly equipment. A Big Data implementation incurs high costs for acquiring the infrastructure and software as well as for maintaining them. In addition, physicians who work with Big Data need to acquire new knowledge, such as data science, so that they can use the Big Data system to process data and can take part in developing or applying machine learning algorithms to create prediction models and data visualizations. Preparing them to become proficient in data science for healthcare and to work with Big Data applications requires both cost and time.
Where a teaching hospital is part of a university in a developing country such as Malaysia (for example, Universiti Putra Malaysia and Universiti Kebangsaan Malaysia), harnessing Big Data is very important and serves two purposes: to serve the hospital with data analytics that support decision making in patient treatment, and to support the research activity of the university through a research data repository that allows collaborative research through the sharing of research datasets and image analytics. To apply a Big Data approach to both purposes, medical experts need to be equipped with Big Data-related knowledge so that they can build the related data models and interpret the results produced by the Big Data platform.
A number of studies have implemented Hadoop-based approaches to support hospital information systems. IBM InfoSphere Guardium provides database activity monitoring and auditing capabilities that enable users to integrate Hadoop data protection into an existing enterprise data security strategy (Hom, 2014). Users can configure the system and use InfoSphere Guardium security policies and reports for Hadoop environments. It does not address secure communication in wireless sensor networks. HDSM is a Hadoop-based distributed sensor node management system that uses the Hadoop MapReduce framework and distributed file system (Jung, Kim, Han, and Jeong, 2014). Each sensor node imitates a DVR (digital video recorder) for sensing video data, and all sensor nodes are connected to the HDSM manager via gigabit Ethernet. HDSM is therefore not suitable for lightweight sensor nodes and applications. The Cloudwave platform was proposed to access and query large volumes of electrophysiological signal data using the Hadoop Distributed File System (HDFS) storage module. Cloudwave allows users to search for clinical events using ontology and semantic reasoning (Sahoo, 2014). However, it does not address secure communication of biomedical data. Erguzen and Erdal (2018) proposed a Hadoop-based system for healthcare digital imaging. The system improved a medical image compression method that the authors had developed earlier, creating a middle-layer platform that performs data compression and archiving operations. That study produced a scalable platform that uses the MapReduce programming model on Hadoop. Based on the literature, it is evident that Big Data already plays an important role in supporting hospital healthcare services.

Materials and Methods
In this section, the proposed architecture for a university data repository that supports both the university and the university hospital is presented. A number of steps need to be taken in designing the proposed integration architecture for the data repository. To design the integrated architecture, the requirements of the proposed architecture are developed based on the approach by Eri et al. The requirements gathered cover the data types, the network and the business functions of both the university and the university hospital. In terms of data, medical images such as x-ray images play an important role in determining a patient's condition. In such a university, these data are used both for research and for health services.

One of the most common image formats in healthcare is the Digital Imaging and Communications in Medicine (DICOM) format (Genereaux et al., 2018). Many x-ray and MRI (Magnetic Resonance Imaging) systems produce images in DICOM format. Each DICOM instance is usable on its own because it is fully self-describing. Each instance contains a complete set of metadata, including: i. study-level attributes, such as the study unique identifier, the study description, and the patient demographics; ii. series-level attributes, such as the series unique identifier, the modality, and the body part imaged; iii. instance-level attributes, such as the instance unique identifier, the image resolution, and the table position where the image was acquired.
Each attribute of the metadata has a defined data type and cardinality and is identified by a 32-bit tag, usually shown as two sets of four hexadecimal digits. For example, the tag (0010, 0010) identifies the data attribute containing the patient's name, which has the data type of a person name and can occur at most once.
Each DICOM instance is identified by a globally unique identifier, meaning that no two instances ever use the same identifier, worldwide (Genereaux et al., 2018).
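The tag notation described above can be illustrated with a minimal Python sketch. It only shows how a 32-bit tag splits into two 16-bit halves (commonly called the group and the element) and how it is written as two sets of four hexadecimal digits. The function names are hypothetical and are not part of any DICOM library or of the proposed architecture.

```python
from typing import Tuple


def split_tag(tag: int) -> Tuple[int, int]:
    """Split a 32-bit tag into its upper and lower 16-bit halves."""
    if not 0 <= tag <= 0xFFFFFFFF:
        raise ValueError("a DICOM tag must fit in 32 bits")
    group = (tag >> 16) & 0xFFFF   # upper 16 bits
    element = tag & 0xFFFF         # lower 16 bits
    return group, element


def format_tag(tag: int) -> str:
    """Render a tag as two sets of four hexadecimal digits."""
    group, element = split_tag(tag)
    return f"({group:04X},{element:04X})"


def parse_tag(text: str) -> int:
    """Parse text such as '(0010, 0010)' back into a 32-bit integer."""
    group_hex, element_hex = text.strip("() ").split(",")
    return (int(group_hex, 16) << 16) | int(element_hex, 16)


# The patient's name attribute mentioned in the text.
PATIENT_NAME_TAG = 0x00100010
print(format_tag(PATIENT_NAME_TAG))
```

Storing the tag as a single integer keeps comparisons and dictionary lookups cheap, while the formatted form is the one normally shown to readers.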
Universities and university hospitals use DICOM images for analysis as well as for research in the medical domain. These DICOM images are the main focus of the integration, and they pose challenges in terms of both format and volume. Therefore, ample storage needs to be provisioned for these images to support the data repository integration. In terms of the network, several considerations are taken into account, such as network resilience, congestion mitigation, performance, scalability and partitioning. Since the proposed integrated data repository serves both the university and the university hospital, network latency may not affect Big Data processing on the Hadoop platform, but any large decrease in network performance may cause failures in the outcome, since the different jobs of these applications need to be executed in parallel to support accurate analysis.
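The storage requirement can be approximated with simple arithmetic, as in the sketch below. All workload figures are hypothetical placeholders, not measurements from any hospital, and the class and function names are illustrative only. The one fixed value is the replication factor of 3, which is the default of the HDFS dfs.replication setting; a deployment may configure it differently.

```python
from dataclasses import dataclass

# Default value of the HDFS dfs.replication setting.
HDFS_DEFAULT_REPLICATION = 3


@dataclass(frozen=True)
class ImagingWorkload:
    """Hypothetical description of how many images a hospital produces."""
    studies_per_day: int
    images_per_study: int
    bytes_per_image: int


def raw_bytes(workload: ImagingWorkload, days: int) -> int:
    """Bytes of image data produced over the given number of days."""
    per_day = (workload.studies_per_day
               * workload.images_per_study
               * workload.bytes_per_image)
    return per_day * days


def hdfs_bytes(workload: ImagingWorkload, days: int,
               replication: int = HDFS_DEFAULT_REPLICATION) -> int:
    """Disk space consumed once every block is stored `replication` times."""
    return raw_bytes(workload, days) * replication


# Assumed figures for illustration only.
example = ImagingWorkload(studies_per_day=200,
                          images_per_study=100,
                          bytes_per_image=512 * 1024)
tib = 1024 ** 4
print(f"raw per year:        {raw_bytes(example, 365) / tib:.2f} TiB")
print(f"replicated per year: {hdfs_bytes(example, 365) / tib:.2f} TiB")
```

The estimate ignores compression, metadata overhead and growth in imaging volume, so it should be read as a lower bound for planning rather than a sizing result.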

Results and Discussion
In this section, the architecture for integrating the university hospital with the university research data repository is discussed. The research data repository of the university is designed to manage research data among university researchers. Because the university includes a hospital, data from the hospital need to be on the same platform for easier data management and for cost savings from a single point of access and a single data repository site. Figure 3 shows the proposed architecture for integrating the university hospital image data system with the university data repository system, with the support of Hadoop. To integrate both systems, the imaging system of the university hospital will be given access to the data repository system as an external system or external data source through a specific API. Through this integration, validated image data from the university hospital's devices can be stored in the data repository database as research datasets. The integration involves two sources of data from the university hospital: image files from devices in the hospital, and image metadata for the related patients obtained through the university hospital information system. This ensures that the stored images are accompanied by the relevant information so that other researchers can analyse and process them accordingly. An efficient approach to processing skyline queries will be adopted in this architecture (2017; Saad et al., 2014; 2016). Users will be able to connect their analytical application platforms directly to the Big Data platform, which allows analytical processing to be performed on the images. This ensures that Hadoop is used for data analytics operations on the images stored in Hadoop.
Some image data will also be kept in the data repository system through the data repository system modules, using a semantic schema matching technique (Hossain et al.). Figure 1 will be used to store these data on the data repository.
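The pairing of an image file with its metadata, followed by a simple analytic pass, can be sketched as follows. This is an illustrative sketch under stated assumptions and not the implemented system: the field names, the DatasetRecord type and both functions are hypothetical, the validation rule (required fields present, non-empty payload) is only one possible reading of "validated image data", and the counting step merely imitates the map and reduce phases in plain Python rather than running on Hadoop.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical metadata fields expected from the hospital information system.
REQUIRED_FIELDS = ("study_uid", "series_uid", "instance_uid", "modality")


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical repository entry linking an image to its metadata."""
    instance_uid: str
    study_uid: str
    series_uid: str
    modality: str
    sha256: str
    size_bytes: int


def build_record(image_bytes: bytes, metadata: Dict[str, str]) -> DatasetRecord:
    """Validate the two inputs and combine them into one dataset record."""
    missing = [name for name in REQUIRED_FIELDS if not metadata.get(name)]
    if missing:
        raise ValueError(f"metadata is missing required fields: {missing}")
    if not image_bytes:
        raise ValueError("image payload is empty")
    return DatasetRecord(
        instance_uid=metadata["instance_uid"],
        study_uid=metadata["study_uid"],
        series_uid=metadata["series_uid"],
        modality=metadata["modality"],
        # The checksum lets the repository detect corrupted or duplicate files.
        sha256=hashlib.sha256(image_bytes).hexdigest(),
        size_bytes=len(image_bytes),
    )


def count_by_modality(records: Iterable[DatasetRecord]) -> Dict[str, int]:
    """Count records per modality in a map-then-reduce style."""
    # Map phase: emit one (modality, 1) pair per record.
    pairs = ((record.modality, 1) for record in records)
    # Reduce phase: sum the values that share a key.
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

A record built this way carries enough information for another researcher to locate the image by study, series or instance identifier, which is the purpose of supplying the metadata alongside the file.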

Conclusion
In this paper, an architecture for integrating a university research data repository with university hospital image data is proposed. Its purpose is to support research activities on health image data through the university's data repository and to support health image diagnosis and analysis at the university hospital, especially with machine learning capability. The proposed architecture will provide seamless integration of the data repository for health images in terms of data management, with image analytics capability that can be used by both the university and the university hospital. Security is one aspect that needs attention, as access to the data may involve various access levels. The openness of research data, which is supported by a number of platforms, may increase the security risk of a system built on the proposed architecture. Cost is another issue in setting up a Hadoop-based data repository for a university and its own teaching hospital, as a cloud-based platform incurs costs in the long run.