In cooperation with Cloudera, Penguin is designing a scalable Hadoop appliance that includes all software functions and services needed to run Hadoop in a production environment, including management tools for monitoring Hadoop, cluster load, and individual servers. To manage the substantial task of Hadoop installation and configuration, Penguin and Cloudera are developing a network topology as well as a deployment and configuration methodology for the setup and configuration of Hadoop clusters.
The ability to store and quickly analyze large sets of data is becoming increasingly important for organizations. Market trends are anticipated through online data; sites track and analyze the behavior of visitors; and social networking sites monitor individual preferences.
Hadoop is a scalable and flexible open source software framework for the distributed processing of large data sets on commodity systems. It is now the leading software solution for analytics on large sets of unstructured data, used by web giants such as Yahoo!, LinkedIn, Facebook, eHarmony and eBay. Typical use cases include:
- Unstructured data needs to be collected in a fault-resilient, scalable data store where it can be quickly indexed and analyzed.
- Large amounts of unstructured or semi-structured data need to be analyzed through batch processes.
- Data stored in a relational database or data warehouse needs to be archived on a comparatively low-cost storage infrastructure to meet data retention policies or compliance requirements.
The foundation of Hadoop’s data processing architecture is the MapReduce programming model. In the ‘map’ step, a master node takes the input, partitions it into smaller tasks, and assigns these tasks to distributed compute nodes. In the ‘reduce’ step, which is typically also distributed across nodes in the cluster, the results of the initial map tasks are aggregated. For data storage, Hadoop includes the Hadoop Distributed File System (HDFS), a distributed, scalable data storage framework with a redundant storage model for overall reliability.
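The map and reduce steps described above can be sketched in plain Python with the classic word-count example. This is a minimal illustration of the programming model only; it does not use Hadoop’s actual APIs, and the function names are ours for clarity:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # 'map' step: emit a (word, 1) pair for every word in the document.
    # In Hadoop, many map tasks like this run in parallel on compute nodes.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # 'reduce' step: aggregate the emitted pairs, summing counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data analytics", "big data storage"]
# Here we simply chain the map outputs together before reducing;
# Hadoop instead shuffles pairs with the same key to the same reducer.
mapped = chain.from_iterable(map_phase(d) for d in documents)
word_counts = reduce_phase(mapped)
# word_counts: {'big': 2, 'data': 2, 'analytics': 1, 'storage': 1}
```

In a real Hadoop job the same split-then-aggregate pattern applies, but the framework handles partitioning the input, scheduling tasks near the data on HDFS, and re-running tasks that fail.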
Hadoop’s scalable architecture, built-in fault tolerance, and comprehensive set of complementary tools make it an excellent solution for fast, cost-effective analytic processing of big data.
Talk to us and find out how we can make Hadoop work for you.