While relational databases remain well-used analytical tools, open-source databases that can easily handle unstructured data are becoming essential for most businesses.
By Shirley S. Savage, 10/02/2017
Data is critical for business decisions. Today, companies are demanding faster access to analytics that rely on data coming from all kinds of sources containing structured and unstructured data. Databases and database technologies are evolving to better serve these needs. While traditional relational databases remain well-used tools for analysis, open-source databases (OSDBs) that can easily handle unstructured data are becoming essential for most businesses.
Relational databases use the relational model of tables, columns and rows to store data. SQL and stored procedures are used to query the data. Traditional relational databases are developed, maintained and supported by an ISV or a similar software development company. Relational databases continue to be an excellent way to store transactional and structured data. However, enterprises must process growing amounts of diverse data, which requires different sorts of database queries.
The traditional database architecture is hitting limits in performance, causing enterprises to examine alternatives. “Traditional databases were not designed to cope with the scale and agile demands required by modern applications; nor were they created to leverage storage and processing power that are readily available today,” says Beth L. Hoffman, IBM executive IT specialist and big data and analytics ISV solution architect.
OSDBs, which have existed for a while, provide broader capabilities. Some support the traditional relational database model, while others support data using approaches beyond SQL-like query languages. For instance, NoSQL databases have dynamic schemas that provide more flexibility.
The proliferation of data types is driving the move to OSDBs, says Gerrit Huizenga, STSM, Power* open-source ecosystem lead. The standard Oracle or Microsoft* SQL Server database is focused on the relationships between defined data fields, which works well for some workloads. However, analytic workloads, which are a predecessor to cognitive and big data, require databases that operate differently and can yield effective answers or insights quickly. OSDB models include everything from key-value databases to content store databases to multivalue databases that focus on multidimensional or freeform search. “Traditional database development isn’t necessarily as nimble as open source,” he explains.
The Linux Connection
The demand for open source and the Linux* OS is another reason organizations are employing OSDBs. Unlike that of their proprietary counterparts, the source code for OSDBs is publicly available and customizable. The OSDB model relies on community support, which means you may get a better response for less cost, Huizenga says.
Most OSDBs are supported on the Linux OS, which means enterprises that already use it can leverage their existing IT staff skills to maintain databases in Linux environments. Enterprises that have invested in big data environments on the Linux OS can extend their capabilities by running OSDBs on it. As enterprise IT budgets come under more scrutiny, OSDBs can help trim expenses. Traditional database software licenses can be costly, so enterprises are taking open source seriously, Hoffman adds. Corporate cultures are becoming more accepting of open-source technologies.
All industries are looking to harness structured and unstructured data and use it for business decision-making. What company isn’t interested in harvesting and understanding the data available about them online these days? Or using the internet to identify and reach more clients? Web applications that interact with users in real time and then leverage this data are becoming common, Hoffman says. Some data is streamed in and analyzed in real time. Doing so requires capabilities beyond what traditional databases can handle. NoSQL databases are best suited for real-time web applications that need to access data from documents or glean information about relationships from large amounts of data.
Typical workloads that benefit from OSDBs include analytics, better visualization of data and big data environments like Hadoop and Spark. Quite often, a new workload may spawn a new and sometimes experimental database to address its unique needs, Huizenga says. Cloud is a good example of this. Access to Hadoop, Spark and NoSQL solutions provides enterprises with the flexibility to dynamically grow and shrink environments as the data size changes. Hybrid cloud lets enterprises use secure data that resides in-house alongside public data, which is processed in off-premises cloud environments. The results are then brought back in-house for final processing.
Hadoop and Spark, both projects of the Apache open-source community, are helping businesses get an edge on their competitors. Hadoop was invented to handle big data requirements to support improved analytical capabilities and better decision-making. It can process large volumes of unstructured data because it runs multiple jobs in parallel in a distributed environment. Spark provides similar capabilities but adds value through improved performance. While Hadoop's MapReduce engine reads and writes data on disk between processing steps, Spark keeps working data in memory. In both cases, parallel processing is achieved through a cluster of small servers.
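The split-then-combine idea behind this kind of parallel processing can be illustrated in a few lines of Python. This is only a sketch of the general pattern, not Hadoop or Spark code: the sample text, the function names and the use of worker threads in place of cluster servers are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical input: each string stands in for one block of a large file
# that a real cluster would store on a different server.
PARTITIONS = [
    "open source databases handle unstructured data",
    "relational databases handle structured data",
    "open source tools process data in parallel",
]

def map_partition(text: str) -> Counter:
    """Map step: count the words inside a single partition."""
    return Counter(text.split())

def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine the partial counts from two partitions."""
    return left + right

def word_count(partitions: list[str]) -> Counter:
    # Partitions are independent of one another, so the map step can run
    # concurrently. Worker threads stand in here for the servers of a cluster.
    with ThreadPoolExecutor(max_workers=3) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())

if __name__ == "__main__":
    totals = word_count(PARTITIONS)
    print(totals["data"], totals["databases"])
```

Because each partition is counted without reference to the others, adding servers (here, workers) lets the same job cover more data in roughly the same time, which is the property the distributed frameworks exploit.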
Along with Hadoop’s core processing engine, a set of optional components, including Hive and HBase, can extend its capabilities, Hoffman notes. HBase is an open-source NoSQL database that runs on top of Hadoop and the Hadoop Distributed File System. It’s a scalable, distributed database that supports structured data storage for large tables. Hive is a data warehouse that supports SQL-like ad hoc querying and managing of large datasets.
Beyond Hadoop and Spark, NoSQL solution categories include column-based, document-based, graph and key-value-based. Each has benefits for certain workloads. Some commonly used NoSQL databases are Cassandra (column-based); MongoDB (document-based); Neo4j (graph); and Redis (key-value-based). The search engines Elasticsearch and Solr, often used to index and analyze log data, are also in demand. (For more information, see “Enterprise-Ready Open-Source Databases”.)
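To make the four categories concrete, the sketch below models each one with plain Python data structures. It shows only the shape of the data, not how any of the named products stores it internally, and every record, field name and function in it is hypothetical.

```python
# All records below are hypothetical sample data for illustration only.

# Key-value (Redis-style): an opaque value looked up by one key.
key_value = {"session:42": "user=alice;cart=3"}

# Document (MongoDB-style): self-describing records whose fields may differ.
documents = [
    {"_id": 1, "name": "Alice", "services": ["cable", "internet"]},
    {"_id": 2, "name": "Bob", "email": "bob@example.com"},
]

# Column-based (Cassandra-style): a row key maps to a sparse set of columns.
wide_column = {
    "customer:1": {"name": "Alice", "plan": "premium"},
    "customer:2": {"name": "Bob"},
}

# Graph (Neo4j-style): nodes plus typed edges between them.
nodes = {"alice": "customer", "bob": "customer", "cable": "service"}
edges = [
    ("alice", "SUBSCRIBES_TO", "cable"),
    ("bob", "SUBSCRIBES_TO", "cable"),
    ("alice", "REFERRED", "bob"),
]

def documents_with_field(field: str) -> list[dict]:
    """Document-style query: select records that happen to carry a field."""
    return [doc for doc in documents if field in doc]

def neighbors(node: str, relation: str) -> list[str]:
    """Graph-style query: follow edges of one type out of a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

if __name__ == "__main__":
    print(key_value["session:42"])
    print(documents_with_field("email"))
    print(wide_column["customer:1"].get("plan"))
    print(neighbors("alice", "SUBSCRIBES_TO"))
```

Note that the two documents carry different fields and the second row has no "plan" column. That tolerance for records of differing shape is the flexibility the dynamic-schema models trade against the fixed tables of a relational database.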
Although most traditional databases are proprietary, some are open source, such as MySQL and PostgreSQL. These solutions provide enterprises with the structured, controlled SQL environments they need with the added benefit of being open source.
Impact on the Data Center
The emphasis on performance and real-time results is putting pressure on IT infrastructure and staff. For many companies, hardware requirements are changing from enterprise scale-up systems to clusters of scale-out servers. That’s giving rise to auto-sharding, which is supported by NoSQL environments (see “Auto-Sharding Explained”). “This new deployment model requires expertise in managing a number of smaller servers and being able to scale out dynamically and deal with more complex network configurations,” Hoffman explains.
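One common way to spread data across a cluster of scale-out servers is to hash each record's key and use the result to pick a shard. The sketch below assumes that approach purely for illustration; it is not the algorithm of any particular NoSQL product, and the class and method names are hypothetical.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that routes each key to one of several shards."""

    def __init__(self, shard_count: int) -> None:
        # Each dict stands in for a separate server in the cluster.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key: str) -> int:
        # A stable hash guarantees the same key always lands on the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key: str, value: str) -> None:
        self.shards[self._shard_for(key)][key] = value

    def get(self, key: str):
        return self.shards[self._shard_for(key)].get(key)

if __name__ == "__main__":
    store = ShardedStore(shard_count=4)
    for i in range(1000):
        store.put(f"customer:{i}", f"record {i}")
    print([len(shard) for shard in store.shards])  # keys per shard
    print(store.get("customer:7"))
```

The application only calls put and get; the routing is hidden, which is the "auto" part. Simple modulo routing like this remaps most keys when the shard count changes, so systems that must scale out dynamically generally rely on more elaborate schemes such as consistent hashing or range-based partitioning.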
OSDBs require that data center personnel learn the ins and outs of the new software, as each one has its differences. If it’s the same database that’s used elsewhere, it’s a benefit. If your support staff can use the same database for multiple workloads, that’s ideal, Huizenga says.
Data center support staff may have to learn the new models, paradigms and APIs for accessing that database. Performance, performance tuning, methods for configuring your data and your data ingest will be different from traditional databases. Some OSDBs may be simpler than traditional ones because they’re fit-for-purpose, i.e., they have a predefined document that describes what’s needed. While the learning curve probably isn’t as high as it is with pre-existing proprietary databases, OSDBs will be new to many people.
If you’re using new OSDBs, the data center must work with the open-source community. “Many data centers benefit from the responsiveness of the community and collaboration with developers who are trying to solve the same problems you are,” Huizenga says.
Because basic versions of OSDBs are typically free, companies may believe they’re the lowest-cost solution. However, organizations must examine a variety of factors to determine cost. Those factors include incidental costs, support model costs, performance-tuning expenses and the impact on your database administrators or infrastructure personnel. If a proprietary database doesn’t run on your architecture but an open-source one does, that’s a reason to look at open source. It’s important to find the types of databases you need for your workload.
License models run the gamut from free with self-support to an ISV startup that’s going to give you its full attention. “Those both lead to being less expensive than support for traditional databases,” Huizenga says.
Companies must also consider how employing open databases will enable them to meet new business goals that require using big data. “OSDBs enable enterprises to achieve better business results through more effective and efficient use of the data,” Hoffman notes.
Data is the foundational element of machine learning, deep learning, artificial intelligence and cognitive capabilities. Having data readily available is key to taking the next step in the journey from the use of open source to the exploitation of open source in a cognitive, machine learning or deep learning capacity.
Nurturing Open Source
IBM is committed to helping clients achieve improved results, and its long-standing support of open source reflects that goal. The company is an active contributor to OSDB development. Offering Linux on IBM Power Systems* provides additional advantages to the OSDB solutions, especially in the area of runtime performance, Hoffman adds.
The company also works with several ISVs and OSDB communities that provide open-source solutions on Linux on POWER*. IBM is doing this in conjunction with the ISVs that package these community solutions into fee-based enterprise solutions that include product support and value-added enhancements from the ISV. For example, Neo4j announced support of its NoSQL database on Linux on POWER in 2015. MongoDB supports an enterprise version of its database that exploits the architectural advantages of POWER for high levels of concurrency, such as 8-way simultaneous multithreading. IBM also works closely with Hortonworks, EnterpriseDB and Redis Labs.
In May, IBM unveiled its Open Platform for Database as a Service (DBaaS) on Power Systems, supporting highly scalable, private cloud rapid deployment of virtual databases. The turnkey, on-premises hardware and software solution includes OpenPOWER-based compute servers, block and archive storage servers, JBOD disk drawers, OpenStack control plane nodes, network switches and the Open DBaaS Toolkit. This solution allows application developers to provision their choice of popular OSDBs in minutes, including MongoDB, EnterpriseDB, PostgreSQL, MySQL, MariaDB and Redis. Future plans include support for Neo4j, Cassandra and other databases.
In June, IBM announced an expanded partnership between Power Systems and Hortonworks that includes the IBM Analytics team. IBM is adopting Hortonworks Data Platform (HDP) for its Hadoop distribution and fully integrating it with Data Science Experience, machine learning and BigSQL capabilities. As a result, this solution will offer users the rich data security, governance and operations functionality provided by HDP, as well as the advanced analytics and management of the Data Science Experience—all on IBM Power Systems.
Further, IBM is helping to make OSDBs easier to deploy. In conjunction with Redis Labs, IBM created an integrated offering called IBM Data Engine for NoSQL that includes Redis Labs Enterprise software, which is built for easy deployment.
Clients can rely on IBM to assist them with incorporating OSDBs into their mix. “We are working with a long list of enterprise clients to architect OSDBs into their existing environments or create a new solution to tackle a leading-edge goal that requires the innovation and technology provided by open source,” Hoffman says. IBM helps with proof-of-concept or proof-of-technology tests as well as planning deployment details. IBM Benchmark Centers are helping many clients with these new environments and hosting proofs of concept for them.
For example, IBM is working with a national cable, internet and voice service provider to prototype using a NoSQL database as a new approach for getting to know their customers better, Hoffman says. The information gained about clients can then be used to appropriately bill them and make decisions about which additional services to offer. When a client opens an account, that information is stored in MongoDB on Power Systems servers. Additional data is collected and stored as the client uses various services, such as cable TV.
Finding a fit-for-purpose database that meets the client’s needs is the right way to go, according to Huizenga. “IBM wants to find the best solution for the client. When an OSDB is the answer, we’re going to be there to help and support it,” he says.
Support includes making certain the OSDB ports successfully to the platform and making any necessary modifications. If performance issues occur, IBM typically will provide patches and contribute them back to the open-source community.
IBM also runs performance comparisons, which can vary depending on the workload. “We’ll solve significant issues and we’ll strive to make an OSDB run better with that workload,” Huizenga says. “We help make the OSDB meet their purposes as best we can.
“In many cases, based on client requests, we look to optimize a database or other application for POWER8*, often shooting for a goal of 2x performance on comparable POWER8 processors as compared to Intel*.”
Because the POWER8 processor supports little-endian mode, the same byte order that Intel processors use, few porting problems arise. Ninety percent of the OSDBs will build without modification and pass tests on POWER8. The other 10 percent typically require optimization, a configuration change or a minor tweak, Huizenga says. And because IBM offers support for many OSDBs, clients have a one-stop shop for their hardware, software and support needs.
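Byte order matters for porting because software that writes multi-byte numbers to disk or to the network in one order will misread them if it assumes the other. The short Python sketch below illustrates the difference; the sample value and the helper name are hypothetical and have nothing to do with any specific database.

```python
import struct
import sys

VALUE = 0x0A0B0C0D  # hypothetical 32-bit value stored by a database

# The same integer laid out in the two byte orders.
little = struct.pack("<I", VALUE)  # least significant byte first
big = struct.pack(">I", VALUE)     # most significant byte first

def read_u32(raw: bytes, byteorder: str) -> int:
    """Interpret four raw bytes as an unsigned integer in the given order."""
    return int.from_bytes(raw, byteorder)

if __name__ == "__main__":
    print("native byte order on this machine:", sys.byteorder)
    print("little-endian bytes:", little.hex())
    print("big-endian bytes:   ", big.hex())
    # Reading with the matching order recovers the value ...
    print(hex(read_u32(little, "little")))
    # ... while assuming the wrong order silently yields a different number.
    print(hex(read_u32(little, "big")))
```

Code written and tested only on little-endian hardware can carry such assumptions without anyone noticing, so running in the same byte order removes one whole class of porting bugs.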
As more OSDBs are developed, IBM assesses them and provides assistance to clients who want to add them into their IT environments. IBM is working with a number of NoSQL solution providers to enable their solutions on Linux on POWER, Hoffman says. IBM also is building the Hadoop and Spark ecosystem to bring those solutions to the Power Systems platform.
Enterprise-Ready Open-Source Databases
While there are myriad open-source databases, consider these six for their enterprise versions and technical support.
MongoDB
Classification: NoSQL document store
Optimized for: Document model and document stores; semi-structured or unstructured data
Technical support: docs.mongodb.com/manual/support

Classification: Open-source relational database
Optimized for: Transactional SQL-based queries and updates
Technical support: Community support available

Redis
Classification: NoSQL in-memory key-value store
Optimized for: Data queues, strings, lists, counts, caching, statistics, text, session IDs, videos
Technical support: redis.io/support

PostgreSQL
Classification: Open-source object-relational database
Optimized for: Variety of transactional work; relational structured queries to object store and retrieval
Technical support: enterprisedb.com/services/support

Cassandra (cassandra.apache.org; enterprise version available at datastax.com)
Classification: NoSQL wide column store
Optimized for: NoSQL environments with high data volumes that require high performance and scalability
Technical support: datastax.com

Neo4j
Classification: NoSQL graph store
Optimized for: Graph database, data stored as edges, nodes or attributes
Technical support: support.neo4j.com
Information contributed by Rick Murphy, migration solution architect, IBM Lab Services, and Mark Short, lead migration consultant, IBM Lab Services Migration Factory
Shirley S. Savage is a writer and communications strategist. She's fascinated by tech, science, finance, energy and the way innovative people think.