Improving Apache Hadoop - MapR Technologies M7

Big Data is emerging as an important tool to help organizations learn more about their business operations, product performance, and customer purchasing behavior. It is misunderstood by the media, which makes it difficult for organizations to determine whether investing in this tool will pay off by improving efficiency, producing better products and services, or yielding a better understanding of customer requirements. Big Data tools can also be intimidating: it is often necessary to acquire, install and integrate a number of different tools in order to address organizational requirements. MapR Technologies announced M7 as a way to reduce the complexity of Apache Hadoop environments, improve overall performance and increase solution reliability.

Why Big Data?

Traditional transactional systems are designed and implemented to track information whose format and use are known ahead of time. Big Data systems are deployed when the questions to be asked and the data formats to be examined aren't known ahead of time.

The goal of Big Data systems is to allow analysts and decision-makers to sift through massive amounts of data to learn something new, rather than to track the known outputs of operational systems.

The promise of Big Data

Big Data solutions promise to help organizations move decision-making from a seat-of-the-pants exercise to a systematic and repeatable process. They can also make it possible to uncover hidden trends and reduce the chance that the organization will be blindsided by rapidly moving events. Another promise of Big Data is that organizations can learn more about their customers, their requirements and how they make purchasing decisions. Taken together, these promises mean that organizations can reduce costs by better choosing which products and services should be brought to market and which should be abandoned because customers are not interested.

The challenge of Big Data

Several open source communities, made up of people and organizations that need to gather, analyze and report on Big Data repositories, have developed projects that make working with Big Data much simpler.

Many Big Data implementations are based upon tools from the Apache Software Foundation, including the following:

  • Hadoop — Distributed processing framework designed to harness together the power of many computers, each having its own processing and storage, and provide the capability to quickly process large, distributed data sets.
  • Hadoop Distributed File System (HDFS) — A distributed file system designed to support large data sets made up of rapidly changing structured and unstructured data.
  • HBase — A distributed database that makes it possible to deal with HDFS data as if it were a structured set of very large tables.
  • Cassandra — A multi-master database designed for high availability.
  • Other tools including Chukwa, Hive, Mahout, Pig and ZooKeeper.

It is clear that a Hadoop solution has many moving parts, each of which must be properly installed, configured and optimized for the organization's application. This is beyond the capabilities of some organizations that wish to use Hadoop.
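For readers who want a feel for the processing model behind the Hadoop entry in the list above, here is a minimal single-machine sketch of the map, shuffle and reduce pattern. It is an illustration only and does not use Hadoop or any MapR API. The function names (map_phase, shuffle, reduce_phase, run_job) and the sample data are hypothetical, and each inner list stands in for the portion of the data held by one computer in a cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)


def run_job(partitions):
    """Run map over every record in every partition, then shuffle and reduce."""
    pairs = (
        pair
        for partition in partitions
        for record in partition
        for pair in map_phase(record)
    )
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each inner list is a hypothetical slice of data stored on one node.
    sample_partitions = [["big data tools", "big clusters"], ["data sets"]]
    print(run_job(sample_partitions))
```

In a real cluster the map and reduce steps run in parallel on the machines that hold the data, and the framework handles distribution, scheduling and failure recovery. That extra machinery is a large part of why installation and tuning are demanding.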

MapR Technologies M7

MapR Technologies has been a major proponent of Apache Hadoop since its inception. The company has sought to improve the performance, reliability and ease of use of Hadoop. The recently announced M7 takes this several steps further than the previous product, M5.

Here's what MapR has to say about M7:

MapR Technologies, Inc., today announced at O’Reilly Strata Conference + Hadoop World 2012 that it is bringing unprecedented Hadoop and NoSQL capabilities together on an easy, dependable and fast platform. With MapR M7, Big Data operations ranging from batch analytics to real-time database functions can be performed with enterprise-grade reliability and protection.

One of the core benefits of M7 is making HBase enterprise grade with instant recovery from hardware and software failures, disaster recovery and full data protection with snapshots and mirroring. Even with multiple hardware or software outages and errors, applications will continue running without any administrator actions required.

M7 increases the performance of HBase to unprecedented levels. First, by eliminating the need for compactions, M7 provides uniform and consistent performance. Second, by utilizing innovative data structures that minimize the read- and write-amplification factor, inserts and updates are much faster. M7 also supports in-memory columns, providing more options to increase database performance.

HBase scalability has also been improved dramatically. M7 users can create more than a trillion tables. With M7, HBase has more than 20X the number of column families and has increased row and cell sizes to handle large data objects.

M7 greatly simplifies HBase administration by ensuring there are no separate processes to monitor and manage, no manual compactions, no manual region merges, no pre-splitting, no manual database repair operations and no downtime for standard maintenance.

MapR M7 is binary compatible with Apache HBase. Customers do not need to recompile or change code to take advantage of the enterprise-grade features. M7 also supports Apache HBase within the same cluster.

M7 also expands HBase use cases and applications. “The complexity of deploying and optimizing Apache Hadoop has inhibited organizations from integrating it into their business intelligence ecosystems,” said Manoj Goyal, senior director, Converged Application Solutions Engineering, Enterprise Group, HP. “HP solutions for Hadoop are built to enable rapid deployment, and innovations such as the HBase enhancements in MapR M7 further help customers integrate Hadoop into their data centers.”

Snapshot analysis

Big Data is an outgrowth of high performance and technical computing architectures. It offers the promise of helping organizations make better decisions based upon real data. MapR appears to be helping move Hadoop from the realm of computer science projects to that of a useful tool for organizations needing this type of insight.

MapR's M7 addresses some of the challenges that the Hadoop family of software projects imposes on users by increasing levels of integration, improving performance, improving reliability and simplifying Hadoop.

While not the only player pushing Hadoop into the enterprise, MapR has a track record of making the technology better. The competition, which includes suppliers such as EMC Greenplum and IBM, is also moving to make Hadoop more accessible to organizations. MapR needs to demonstrate why its approach is better. I believe if organizations take the time to learn what MapR is doing, they will be impressed.

Note: MapR informed me that EMC is an OEM partner. EMC licenses MapR as its Hadoop distribution platform for Greenplum MR.

Note: MapR Technologies is a Kusnetzky Group client.

Kusnetzky Group LLC © 2006-2014