Using Open Remote Sensing Data to build an Agriculture Big Data System

Landsat, MODIS, and Sentinel satellites continuously produce multispectral sensor data with different spatial, temporal, and radiometric resolutions. This raw sensor data is calibrated and processed further, and additional data products are derived, which greatly reduces the preprocessing burden on downstream applications. These petabyte-scale datasets are available to anyone free of charge. Remote sensing plays a key role in modern agriculture. We can extract information about soil, weather, water, and vegetation from these datasets. By processing historical remote sensing data, we can build temporal profiles of the soil, weather, water, and agricultural conditions of the land. Deep learning and spatio-temporal data mining algorithms can be applied to this data to extract hidden information. With access to all this information through an agriculture information system, farmers will understand their land better and will be empowered to make better decisions in their day-to-day activities. Although it looks simple on the surface, collecting, analyzing, and deriving insights from these sensor data and other data products from a multitude of sources is a big data and high-performance computing challenge. In this paper, we discuss the current open datasets and how they can be used to solve various problems in agriculture. We also discuss implementing a cloud-based, scalable agricultural information system that provides actionable insights to farmers.


Introduction
The world population is increasing rapidly and has reached 7 billion. It is estimated that it will reach 9 billion by the year 2050. Feeding everyone on the planet with nutritious food is a major challenge. Climatic conditions and water resources vary across regions and countries, which poses serious threats to agriculture and food safety. Although many factors are involved in increasing agricultural production, using technology in the various phases of agriculture will help ensure food safety and sustainable agriculture.
Governments, policymakers, and farmers need solutions that help them make better decisions in their day-to-day activities. To achieve maximum production with limited resources, technology can be employed in every phase of farming, such as preparing the land, selecting the right crops, seeding at the right time, minimizing fertilizer use, monitoring pests and crop health, utilizing water resources, harvesting, and selling produce. Droughts should be identified early, monitored, and managed effectively.
Past and present crop, land, and weather data are the keys to understanding agricultural capabilities and achieving agricultural sustainability. Manually collecting agricultural data from all agricultural lands is virtually impossible. Collecting these data at a massive scale from each farm and deriving actionable insights from them is a big challenge. Today, many tools and technologies are available to collect agricultural big data, such as the Internet of Things (IoT), crowdsourcing, unmanned aerial vehicles (UAVs), and remote sensing satellites.
The biggest advantage of remote sensing data is that we already have a good historical dataset collected by the Landsat, Terra/Aqua (MODIS), and Sentinel satellites. National Oceanic and Atmospheric Administration (NOAA) satellites monitor the weather and the oceans. These satellites continuously generate multispectral datasets known as Level 1 (L1) datasets [1][4][5]. Additional analysis is performed on L1 datasets, and useful information is extracted and released as Level 2 (L2) datasets [2][4][5]. For Landsat, further analysis is performed on L1 datasets and the results are released as Analysis Ready Datasets (ARD) [3]. These datasets (except ARD) are available globally to anyone for analysis free of charge, which enables any country in the world to improve its agricultural production and monitoring.
The volume of remote sensing data products is increasing every day. The Landsat program alone has produced petabytes of data during its last 40 years of operation [6], and it will continue to do so. Every remote sensing satellite measures different parts of the electromagnetic spectrum and produces a variety of datasets. Terabytes of data are added daily, and in most cases Level 1 datasets are available within hours of acquisition. Apart from the default datasets, we can employ additional algorithms and processes to extract more features, perform analytics, and produce insights using data mining and machine learning techniques.
This raises the need for a big data analytics platform that operates on remote sensing big data originating from a multitude of sources, processes the multispectral, multi-resolution datasets cohesively for the geospatial context, and produces meaningful insights and alerts that can help governments, policymakers, and farmers achieve sustainable agriculture. Challenges in handling remote sensing big data are discussed in [7]. This paper aims to present open remote sensing datasets, explore the components required to build an agricultural information system using open remote sensing datasets, and discuss the functional requirements of that system.
Many systems that leverage this open remote sensing data have already been implemented to support governments and farmers. The China Agricultural Remote Sensing Monitoring System (CHARMS) is used to monitor the growth of wheat, corn, rice, soybean, cotton, canola, and sugar cane [8]. It is a sophisticated system that also performs crop yield estimation, soil moisture monitoring, cultivated area change monitoring, and disaster monitoring, and it provides day-to-day decision support information to policymakers. CropWatch, developed by the Institute of Remote Sensing and Digital Earth (RADI), Chinese Academy of Sciences (CAS), leverages cloud infrastructure to monitor and provide key agricultural information [9]. The ERMES agro-monitoring system combines earth observation datasets, crop models, and user-provided in situ data in a unified system to provide information about the current season at the regional/rice district scale and to recommend best crop practices to farmers [10].

Remote Sensing Datasets
Spaceborne, airborne, and ground-based remote sensing platforms are used to acquire remote sensing data. The spatial, spectral, temporal, and radiometric resolutions of these platforms vary for several reasons, including the vehicles that carry the sensors and how close they can bring the sensor to the object of interest. In this paper, we discuss the active data collection missions that provide open datasets with global coverage.
The Landsat suite of satellites has been producing images of the Earth since the 1970s and has produced more than a petabyte of data to date. Landsat 7 and Landsat 8 are the active satellites. Landsat 7 carries the Enhanced Thematic Mapper Plus (ETM+), and Landsat 8 produces Earth observation data using the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS). OLI measures the visible, near-infrared, and short-wave infrared portions of the electromagnetic spectrum, whereas TIRS measures thermal infrared radiation, from which land surface temperature is derived. Data from earlier Landsat missions are also available for download from the USGS data archives.
The Terra and Aqua satellites carry the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument, which provides 12-bit radiometric resolution in 36 spectral bands. MODIS views the entire surface of the Earth every one to two days, which makes it suitable for agricultural monitoring.
Sentinel satellites are part of the European Union's Copernicus Earth observation program. Sentinel-1 provides all-weather, day-and-night radar imaging. Sentinel-2 provides multispectral high-resolution imaging. Table 1 summarizes the spatial, temporal, and spectral resolutions of the Landsat, MODIS, and Sentinel-2 datasets, which are freely available for download. Table 2 summarizes various data products derived from Level 1 datasets. These datasets can be thematically grouped as Surface Reflectance, Surface Temperature, Vegetation Indices, Leaf Area Index, Evapotranspiration, and Atmospheric Water Content products.
Data from the Landsat program alone has crossed the petabyte scale, and it is estimated that new satellite sensors will generate petabytes of data per year. Moving, storing, and analyzing this massive volume of data is costly in terms of both money and time. These datasets are readily available from various cloud providers, and by leveraging cloud infrastructure we can process them without moving them to on-premises systems. Cloud-based big data platforms are therefore gaining traction in remote sensing research. The datasets are hosted within the cloud provider's infrastructure and can be accessed from multiple availability zones. The computing power required to analyze them can be provisioned and destroyed on demand, which helps to reduce cost significantly.
The Google Earth Engine (GEE) cloud computing platform provides datasets, algorithm libraries, and computing power to analyze remote sensing data collected over a long period of time [11][12]. Using the compute and network bandwidth capacity available in GEE, it is possible to process 40 years of Landsat data, amounting to petabytes, in one day [6]. The AWS Open Data Registry hosts Landsat, Sentinel, MODIS, and other remote sensing datasets in S3 buckets. Applications deployed in AWS can leverage the machine learning capabilities provided by AWS to analyze these data [13]. Azure [14] hosts MODIS and Harmonized Landsat Sentinel-2 data products [47].

Extracting Features from Remote Sensing Data
The scenes produced by remote sensing satellites are more than 100 km in length and width. To provide farmers with information about their land, we have to split each scene into smaller tiles. Processing petabytes of remote sensing data is a high-performance computing problem, so we have to apply optimizations at various levels to improve performance and reduce cost. Splitting the scene into larger tiles will produce less accurate information about the land, whereas splitting it into smaller tiles (1-5 m) will increase the processing and storage requirements. Not all datasets are produced at this resolution, so the data should be resampled to match the tile size [15].
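The tiling and resampling trade-off can be illustrated with a minimal sketch. It assumes a scene band is already loaded in memory as rows of pixel values; the function names `split_into_tiles` and `resample_nearest` are hypothetical, and nearest-neighbour repetition is only one of several possible resampling methods.

```python
from typing import Iterator, List, Tuple

Grid = List[List[float]]  # rows of pixel values for one band (assumed layout)


def split_into_tiles(scene: Grid, tile_size: int) -> Iterator[Tuple[int, int, Grid]]:
    """Yield (row_offset, col_offset, tile) for each tile of the scene.

    Edge tiles are smaller when the scene size is not a multiple of tile_size.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    n_rows = len(scene)
    n_cols = len(scene[0]) if n_rows else 0
    for r in range(0, n_rows, tile_size):
        for c in range(0, n_cols, tile_size):
            tile = [row[c:c + tile_size] for row in scene[r:r + tile_size]]
            yield r, c, tile


def resample_nearest(grid: Grid, factor: int) -> Grid:
    """Upsample a coarse grid by an integer factor (nearest-neighbour)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    out: Grid = []
    for row in grid:
        # Repeat every pixel `factor` times horizontally ...
        wide = [value for value in row for _ in range(factor)]
        # ... and every widened row `factor` times vertically.
        out.extend(list(wide) for _ in range(factor))
    return out
```

Smaller values of `tile_size` give more tiles per scene, which is where the extra processing and storage cost comes from.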
We need past and present data to gain a good understanding of a location, to understand its land type, and to improve agricultural productivity. Once we build the temporal profiles of each location, we can use spatio-temporal data mining techniques to analyze the data. Here we discuss extracting key features such as Land Use/Land Cover, soil properties, weather information, water information, and various indices.
Based on the spectral resolution of the sensors, some of these features are automatically derived from L1 datasets and published as L2 datasets. Challenges in using L2 datasets include the delay in dataset publishing and limited global availability. Much research has been conducted in this area, and many approaches have been proposed in the literature to derive these features directly from L1 datasets and to improve accuracy using alternative processing algorithms. Numerous studies have also validated these datasets by comparing them with other satellite and ground-based datasets.
Generated models must be updated continuously as new data arrives. The accuracy of the models can be improved by combining data from other satellites, ground-based systems, and in situ datasets.
A. Land Use/Land Cover Estimation (LULC)
Land cover describes land features such as forests, wetlands, agriculture, and water bodies. Land use describes how land is used by humans. Modelling historical LULC helps us to understand how land has been used over time and how it is used for agriculture.
MODIS 500 m Land Cover maps [16] are produced every year and can be readily used as a reference. To increase the temporal frequency, Landsat [REF 25], MODIS [17], and Sentinel [18] datasets can be used to derive LULC data. Deriving LULC maps using multiple sources and multiple classification algorithms is presented in [19].
B. Soil Properties
Crops are sensitive to the salt content of the soil (soil salinity). Soil salinity changes due to droughts, surface temperature, fertilizer use, and irrigation methods. Soil salinity information can be obtained from Sentinel [20] and Landsat [21] datasets.
Measuring soil moisture is key to efficient water management. Though ground-based sensors are mostly employed in monitoring soil moisture, it is also possible to extract this information from remote sensing datasets. [22] presents how soil moisture information can be derived from Sentinel datasets.
C. Weather Information
Water content in the atmosphere and land surface temperature play a major role in hydrological cycles, drought, and climate, and they affect agricultural production.
The MODIS Total Precipitable Water product (L2) [23] is essential for understanding hydrological cycles and climate. Other approaches to deriving precipitable water are discussed in [24][25].
Surface temperature datasets are produced from Landsat [26] and MODIS [27]. Deriving surface temperature from Landsat is discussed in [28] and from Sentinel-3 in [29].
D. Water
Water is an essential component for crops. Efficient water management is required for sustainable agriculture. Measuring surface water, water level, and water quality will enable us to estimate and optimize water usage. Various methods are available to map surface water from Landsat [30], Sentinel [31], and MODIS [32].
E. Vegetation Indices
Chlorophyll present in the leaves reflects energy in the green portion of the electromagnetic spectrum, so to the human eye almost all vegetation looks green. The spectral response of plants differs, however, when they are observed in the near-infrared (NIR) portion of the spectrum. This is key to understanding various aspects of plant growth and to classifying vegetation types. The Normalized Difference Vegetation Index (NDVI) is derived from the visible and NIR portions of the electromagnetic spectrum. Various vegetation indices are produced from Landsat, MODIS, and Sentinel data. A comparison of the NDVI products generated from Landsat and MODIS datasets is presented in [33].
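As a minimal sketch of how NDVI is computed, the snippet below applies the standard definition NDVI = (NIR - Red) / (NIR + Red) to reflectance values. The function names and the choice to return `None` for a zero denominator are assumptions made for illustration; they are not taken from any of the data products cited here.

```python
from typing import List, Optional


def ndvi(nir: float, red: float) -> Optional[float]:
    """Return (NIR - Red) / (NIR + Red), or None when NIR + Red is zero."""
    denominator = nir + red
    if denominator == 0:
        return None  # no signal in either band (e.g. a fill pixel)
    return (nir - red) / denominator


def ndvi_grid(nir_band: List[List[float]],
              red_band: List[List[float]]) -> List[List[Optional[float]]]:
    """Apply ndvi() pixel by pixel to two co-registered bands of equal shape."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]
```

Dense, healthy vegetation reflects strongly in the NIR and absorbs red light, so its NDVI approaches 1, while bare soil and water give values near or below 0.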

Fusing Data from Multiple Sources
The spectral, spatial, temporal, and radiometric resolutions of the Landsat, MODIS, and Sentinel satellites vary, and as a result the derived data products (e.g., NDVI) also vary. Although it is possible to select a specific dataset to solve a problem, another trend is to use machine learning techniques to merge the datasets into a single dataset that is easier to use. [34] discusses the spatio-temporal fusion of Landsat and MODIS data via deep learning, and [35] discusses fusing Landsat and Sentinel data. [36] discusses fusing NDVI data from multiple sensors to identify olive trees.

A. Crop Growth & Condition Monitoring
Monitoring crops from sowing to harvesting helps to identify crop health issues early and to take action. NDVI changes across the stages of crop growth, which is a challenge in growth monitoring. A successful growth monitoring system will provide farmers with periodic crop reports that include water requirements and vegetation health. [37] discusses using a crop growth monitoring system to monitor wheat growth in China. [38] discusses monitoring maize growth and condition.
B. Crop Classification
From remote sensing data, it is possible to identify which crop is cultivated using the plant's spectral response. We can retrieve existing datasets for a geolocation and apply various classification and deep learning algorithms to identify which crops have been cultivated on that land historically. This information is also useful for policymakers to understand the agricultural productivity of a country. [40] discusses large-scale crop classification using GEE.
[50] discusses vegetation classification using K-Means, Support Vector Machines and Artificial Neural Networks.
C. Crop Yield Estimation
Knowing the crop types and expected yield is key for policymakers to ensure food safety. [37] discusses winter wheat yield estimation. [39] discusses how yield can be estimated using a deep network and additional weather input.
D. Crop Pest, Disease & Stress Monitoring
A healthy leaf has a strong spectral response in the NIR spectrum. When the plant is under stress, its NIR spectral response is reduced. By monitoring this, we can predict whether a plant is healthy or not. [41] uses Landsat and Sentinel data to detect the habitat of rust and locusts in farmland. [42] proposes new indices to identify diseases in wheat. [43] discusses identifying frostbite in cotton plants.
E. Drought Identification & Monitoring
Climate and water imbalance can cause droughts. Vegetation is affected during a drought, which is reflected in the NDVI. Land surface temperature also changes during droughts. Various drought indices are derived from NDVI and land surface temperature datasets to identify droughts. [44] uses NDVI, the Vegetation Condition Index (VCI), the Temperature Condition Index (TCI), and the Vegetation Health Index (VHI) to identify drought. [45] compares various drought indices. [46] uses Landsat and Sentinel data to identify and monitor droughts.
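The drought indices named above can be sketched as follows. The formulas are the commonly used definitions of VCI, TCI, and VHI, in which the current NDVI and land surface temperature (LST) are scaled against their multi-year minimum and maximum for the same location and period. The exact variants used in [44] may differ, and the equal weighting `alpha = 0.5` is an assumed default.

```python
def vci(ndvi: float, ndvi_min: float, ndvi_max: float) -> float:
    """Vegetation Condition Index: 100 * (NDVI - min) / (max - min)."""
    if ndvi_max == ndvi_min:
        raise ValueError("NDVI history has zero range")
    return 100.0 * (ndvi - ndvi_min) / (ndvi_max - ndvi_min)


def tci(lst: float, lst_min: float, lst_max: float) -> float:
    """Temperature Condition Index: 100 * (max - LST) / (max - min).

    Higher temperatures give lower values, i.e. more thermal stress.
    """
    if lst_max == lst_min:
        raise ValueError("LST history has zero range")
    return 100.0 * (lst_max - lst) / (lst_max - lst_min)


def vhi(vci_value: float, tci_value: float, alpha: float = 0.5) -> float:
    """Vegetation Health Index: weighted mean of VCI and TCI."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vci_value + (1.0 - alpha) * tci_value
```

All three indices range from 0 to 100 when the current observation lies within its historical range, and low values indicate stressed vegetation.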

Proposed Agricultural Big Data System
Various big data systems are presented in the literature. Cloud-native big data platforms built on Kubernetes are discussed in [48][49]. [51] discusses a datacube-based approach to agro-geoinformatics.
In this section, we propose a big data analytical system to provide insights into every stage of farming [Figure 1]. This system leverages cloud infrastructure and the open datasets available in the cloud service provider's infrastructure. Details about the various components of this system are discussed here.
A. Distributed
Containerization technology plays a major role in running modern distributed applications. Applications are developed and distributed as containers. Containers are immutable and can be deployed to production without worrying about configurations [54]. Container orchestration platforms like Kubernetes manage multiple worker virtual machines, schedule application services across multiple nodes, and keep these applications up and running with the required memory and CPU resources.
Distributed task management and workflow systems like Apache Airflow can be used to orchestrate machine learning and data mining workflows. Each algorithm or step can be implemented in a separate container, which enables us to modify one part of the system without impacting any other part.
B. Scalable
The proposed system is scalable to meet the needs of a large number of farmers. Kubernetes can manage thousands of worker nodes and can schedule containers efficiently across them. Horizontal and vertical autoscaling mechanisms are available, which enable us to increase the infrastructure capacity when needed and to scale down resources when they are not used. This also helps to reduce cost.
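To make the horizontal scaling behaviour concrete, the sketch below reproduces the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function name and the `min_replicas`/`max_replicas` defaults are illustrative assumptions, and the real controller also applies a tolerance band and stabilization windows that are not modelled here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Return the replica count suggested by the proportional scaling rule."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    # Clamp to the configured bounds so the workload never scales to zero
    # or beyond the capacity the operator is willing to pay for.
    return max(min_replicas, min(max_replicas, raw))
```

For example, four tile-processing workers running at 90% CPU against a 60% target would be scaled out to six, and scaled back in once the backlog of scenes has been processed.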
C. Building Temporal Profiles
MODIS land cover data is leveraged to identify agricultural lands. For these agricultural lands, we will then use data mining and machine learning classification algorithms to build temporal profiles for soil properties, weather information, water profile, and vegetation indices. The indices will be selected so that they can be used for the applications discussed earlier. Cloud providers also offer machine learning and data mining algorithms for training models, which can be leveraged. Spatio-temporal data mining algorithms can be applied to these temporal data profiles to find hidden associations [52][53].
The temporal profiles can be stored in and accessed from cloud storage, without the additional requirement of a GIS database. For example, Landsat data is stored in an AWS S3 bucket with a naming convention that allows a particular scene to be accessed by its path and row identifiers. A similar mechanism can be used to store the temporal profiles.
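One way to apply such a naming convention to temporal profiles is sketched below. The key layout `profiles/<path>/<row>/<variable>/<year>.json` and both function names are hypothetical and chosen only for illustration; the sketch assumes the Landsat WRS-2 grid, in which paths run from 1 to 233 and rows from 1 to 248.

```python
from typing import Tuple


def profile_key(path: int, row: int, variable: str, year: int) -> str:
    """Build an object-store key for one temporal profile (hypothetical layout)."""
    if not (1 <= path <= 233 and 1 <= row <= 248):
        raise ValueError("path/row outside the WRS-2 grid")
    if not variable or "/" in variable:
        raise ValueError("variable must be a non-empty name without '/'")
    # Zero-padded path and row keep keys sortable and of fixed width.
    return f"profiles/{path:03d}/{row:03d}/{variable}/{year}.json"


def parse_profile_key(key: str) -> Tuple[int, int, str, int]:
    """Inverse of profile_key(): recover (path, row, variable, year)."""
    prefix, path, row, variable, filename = key.split("/")
    if prefix != "profiles" or not filename.endswith(".json"):
        raise ValueError(f"not a profile key: {key}")
    return int(path), int(row), variable, int(filename[:-len(".json")])
```

Because the location is encoded in the key prefix, all profiles for one path and row can be listed with a single prefix query instead of a database lookup.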
D. Data Analytics
A scheduler will periodically poll for new Landsat/MODIS/Sentinel scenes or datasets. When it detects a new dataset, it will automatically pull the scene and perform two actions. First, it will trigger analytics: based on the trained historical models, new information will be extracted from the new scenes and stored in the GIS database (PostgreSQL) in a format suitable for end-user consumption. Second, it will update the existing historical models.
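The polling behaviour can be sketched as a small class that remembers which scenes it has already handled. This is an illustration under stated assumptions, not the proposed system's implementation: `ScenePoller`, `list_scenes`, and `on_new_scene` are hypothetical names, the catalog lookup is injected as a plain callable, and in practice the set of seen scenes would need to be persisted.

```python
from typing import Callable, Iterable, List, Set


class ScenePoller:
    """Detect scene identifiers that have not been processed yet."""

    def __init__(self,
                 list_scenes: Callable[[], Iterable[str]],
                 on_new_scene: Callable[[str], None]) -> None:
        self._list_scenes = list_scenes    # returns currently published scene ids
        self._on_new_scene = on_new_scene  # runs analytics and the model update
        self._seen: Set[str] = set()

    def poll_once(self) -> List[str]:
        """Run one polling cycle and return the newly detected scene ids."""
        new_ids: List[str] = []
        for scene_id in self._list_scenes():
            if scene_id in self._seen:
                continue
            self._seen.add(scene_id)
            self._on_new_scene(scene_id)
            new_ids.append(scene_id)
        return new_ids
```

A workflow scheduler would call `poll_once()` at a fixed interval, and the callback would start the two actions described above for each new scene.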
E. Functionalities Provided to End Users
The main goal of the system is to assist farmers by providing useful information such as historical context about their land, which crops they can grow, how they can manage water efficiently, and how they can improve land productivity.
Users will define their areas of interest by specifying either points or polygons of geolocations. Based on this input, a database search will be performed to retrieve analytical information about the areas of interest.
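The area-of-interest lookup can be illustrated with a minimal point-in-polygon test using the ray casting technique. In a deployed system this search would more likely be delegated to the spatial functions and indexes of the database; the in-memory dictionary of tile centres and the function names below are assumptions made only for this sketch.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray to the east crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def tiles_in_area(tile_centers: Dict[str, Point],
                  polygon: Sequence[Point]) -> List[str]:
    """Return the ids of tiles whose centre lies inside the polygon, sorted."""
    return sorted(
        tile_id
        for tile_id, center in tile_centers.items()
        if point_in_polygon(center, polygon)
    )
```

The returned tile identifiers can then be used to fetch the stored analytical results for the farmer's fields.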

Summary and Future Work
In this paper, we discussed how agricultural big data systems can provide insights that help farmers make better decisions. We highlighted various studies that used data mining and machine learning techniques to extract features from datasets and showed how these algorithms are applied to solve problems in the agricultural field. We proposed an architecture for building a modular and scalable agricultural big data system that leverages cloud infrastructure and open remote sensing datasets.
As an enhancement, the proposed system should also provide options to collect in situ data and sensor data from IoT devices. Mobile applications can be leveraged to collect farmers' logs. Numerous studies have validated remote sensing data by comparing them with data collected from ground-based sensors. This ground truth can also be used as a priori information for the data mining algorithms, which will improve the overall accuracy of the insights.
In our future research, we are working towards building a prototype to generate time series models for LULC, water, weather, soil, and vegetation index data.