Performance Analysis of Big Data with Data Models Using Artificial Intelligence
Abstract
With the proliferation of new information sources such as medical images, financial data, sales data, radio frequency identification, and web tracking data, there is a challenge to decipher trends and make sense of data that is orders of magnitude larger than ever before. One of the technologies most often associated with the era of big data is Hadoop. Although there is much expert information about Hadoop, there is not much information about how to effectively structure data in a Hadoop environment. Though the nature of parallel processing and the MapReduce system provide an optimal environment for processing big data quickly, the structure of the big data itself plays a vital role. This paper explores options available for data modeling in a Hadoop environment. Specifically, the purpose of the experiments described in this paper was to determine the best structure and physical modeling techniques for storing data in a Hadoop cluster using Hive to enable efficient data access. Although other software interacts with Hadoop, the experiments focused on Hive. The Hive infrastructure is best suited for traditional data warehousing-type applications; the experiments do not focus on HBase. This paper explores a data partition strategy and investigates the role that indexing, data types, file types, and other data architecture decisions play in designing data structures in Hive. To test the different data structures, the experiments focused on typical queries used for analyzing web traffic data. These tests included finding the most-referring sites, web analyses such as counts of visitors, and other typical business questions asked of weblog data. The primary measure for selecting the optimal structure of data in Hive is based on the performance of these web analysis queries. For comparison purposes, performance was measured both in Hive and in an RDBMS.
The reason for this comparison is to better understand how the techniques that we are accustomed to using in an RDBMS work in the Hive environment. The experiments also explored techniques that are particular to the Hive architecture, such as storing data as a compressed sequence file. Through these experiments, we endeavored to show that how data is structured (in effect, data modeling) is just as consequential in a big data environment as it is in the traditional database world.
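As an illustration of the kind of physical design choices the abstract describes, the following HiveQL sketch shows a weblog table partitioned by date and stored as a compressed SequenceFile, followed by a typical "most referring sites" query. The table schema, column names, and date value are hypothetical examples, not taken from the paper's experiments:

```sql
-- Ask Hive to compress query output written to HDFS
-- (session-level settings; codec choice is an assumption here).
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Weblog table partitioned by day. Partition pruning lets a query
-- restricted to a date range scan only the matching HDFS directories
-- instead of the full dataset.
CREATE TABLE weblogs (
  visitor_id  STRING,
  url         STRING,
  referrer    STRING,
  status_code INT,
  bytes_sent  BIGINT
)
PARTITIONED BY (log_date STRING)
STORED AS SEQUENCEFILE;

-- Typical web analysis query: top referring sites for a single day.
-- The WHERE clause on the partition column prunes all other partitions.
SELECT referrer, COUNT(*) AS hits
FROM weblogs
WHERE log_date = '2015-06-01'
GROUP BY referrer
ORDER BY hits DESC
LIMIT 10;
```

Partitioning by a column that analysis queries routinely filter on (here, the log date) is the main lever for reducing scanned data in Hive, while the SequenceFile format with compression trades some CPU for smaller storage and I/O.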
Article Details
This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format for any purpose, even commercially.
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
- The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.