A World of Not-Only SQL
On June 28th StreamScape sponsored an annual conference put on by the great folks at MongoDB, one of the industry's leading NoSQL database vendors. It was a great event with strong attendance by technologists across many disciplines. Our team met with consultants, visual analysis tool providers, Data Architects, Analysts, Application Developers and Operations staff interested in learning about MongoDB, NoSQL technologies and how they fit into the world of Data Integration and Analysis.
As we shared opinions, philosophies and jokes, several recurring themes emerged. With alternative data management solutions becoming mainstream, architects and developer teams are faced with familiar challenges.
Return of the Hierarchical Database
Although we currently refer to non-tabular, non-SQL databases as Document-Oriented or Schema-Free, this description is not entirely accurate. Before relational databases with their tabular schema arrived on the scene, allowing you to define data dependencies dynamically using a declarative language, most data management systems used a hierarchical (tree-structured) schema, organizing data elements into static objects, indexed records or documents.
Pioneered by IBM in the 1960s and widely adopted, hierarchical databases like IMS, IDMS, MUMPS or FOCUS dominated the landscape well into the 1990s. These technologies were considered "schema-less" and non-SQL. They mostly ran on large mainframe-style computers and were accessible via Time-Sharing facilities offered by Service Bureaus that provided computing software as a service. Look at how far we have come. Wink wink.
Today's non-SQL databases have a lot more in common with the older, non-relational technologies than they care to admit. There is unquestionable value in being able to work with data hierarchies expressed as XML or JSON. Schema definition and querying of data hierarchies are quite another story. Transactional control, concurrency and partial document update become a challenge, and working with nested documents gets quite complex, which is why none of the older technologies fully supported these capabilities.
Still, developers welcome any opportunity to work with data in an application-native format, and presenting data to Web Applications as JSON makes a lot of sense. It reduces the amount of data-munging that needs to occur in applications. And it dramatically reduces the number of network trips to the back-end system, making for faster and more responsive apps.
Much like their older cousins, MongoDB and other NoSQL databases eliminate the need for JOINs, trading query flexibility for performance. For simple data aggregation and storage, few technologies come close to the performance and price point of a non-SQL database.
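To make the JOIN trade-off concrete, here is a minimal sketch that stores the same data both ways. It uses Python's built-in sqlite3 module as a stand-in for a relational database and a plain dictionary as a stand-in for a MongoDB-style document. The customers and orders tables, the field names and the sample values are all hypothetical, invented only for illustration.

```python
import json
import sqlite3

# Relational form: two tables that are reassembled with a JOIN at query time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 'keyboard'), (11, 1, 'monitor');
""")
rows = conn.execute("""
    SELECT c.name, o.item
    FROM customers c JOIN orders o ON o.customer_id = c.id
    WHERE c.id = 1
    ORDER BY o.id
""").fetchall()
joined_items = [item for _name, item in rows]

# Document form: the orders are embedded in the customer record, so a single
# lookup returns the whole hierarchy and no JOIN is needed.
customer_doc = {
    "_id": 1,
    "name": "Ada",
    "orders": [{"item": "keyboard"}, {"item": "monitor"}],
}
embedded_items = [order["item"] for order in customer_doc["orders"]]

# The document can be handed to a web application as JSON without reshaping.
payload = json.dumps(customer_doc)
```

Both forms yield the same list of items. The document form gets there in one read, at the cost of making questions that cut across customers (for example, "who ordered a monitor?") harder to ask.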
MongoDB has gone to great lengths to take the complexity out of hierarchical databases, proving that Big Data does not have to mean Big Software. But there are some practical challenges and limitations that make global adoption difficult.
Modeling Schema-Free Data
Marketing aside, all data management systems have schema. Whether it's a table, a tree or a collection of files in a directory, it's a schema. All data must have a schema because if you cannot describe it, you cannot query it. The big deal with Schema-Free is that you can define data structures that can be changed on the fly.
Relational databases make column removal hard, and adding a column is typically only allowed at the end of a row, imposing strict limitations on schema definition. Adding nested elements requires use of an ARRAY type in databases that support it, or the creation of another table with a foreign key. Complex stuff.
MongoDB and similar NoSQL technologies give users the flexibility to define and change schema on a per-document basis. So you can have a collection of similar data elements or just a bunch of unrelated docs.
The flexibility of this approach is also its greatest weakness. Without a common schema, a document collection is like a drawer of mismatched socks. Analysis becomes difficult. As document schemas evolve, indexing, querying and managing missing fields become a Sisyphean task. Creating a unified schema that includes other data sources also becomes impossible.
Common feedback from architects and database administrators is that NoSQL technologies empower application developers while making data integration, maintenance and migration a challenge. As such, there is an increasing need for technologies that can assist with NoSQL data modeling, schema enforcement and other data governance capabilities that will make the NoSQL database a first-class citizen.
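The mismatched-socks problem is easy to reproduce. The sketch below, in plain Python, takes a small collection whose documents have drifted apart, counts how many documents carry each field, and then projects every document onto one unified field list. The sample documents and the helper names field_coverage and normalize are hypothetical, and filling gaps with None is just one possible policy, not a recommendation.

```python
from collections import Counter

# A hypothetical collection whose documents evolved independently.
docs = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Grace", "phone": "555-0100"},
    {"name": "Linus", "email": "linus@example.com", "tags": ["admin"]},
]

def field_coverage(documents):
    """Count how many documents carry each top-level field."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.keys())
    return counts

def normalize(documents, fields, default=None):
    """Project every document onto the same field list, filling gaps."""
    return [{f: doc.get(f, default) for f in fields} for doc in documents]

coverage = field_coverage(docs)
unified_fields = sorted(coverage)   # union of all fields seen so far
rows = normalize(docs, unified_fields)
```

Even in this toy case only one field out of four is present in every document. Every query, index or report over the collection has to decide what a missing email or phone means, and that decision has to be revisited each time a new field appears.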
Common Query Language
Another recurring theme was the polyglot nature of current database technologies. One of the things NoSQL shares with the hierarchical databases of old is a lack of a standardized query language. This limitation was one of the key reasons why relational databases became popular in the first place. Providing standards around hierarchical data query is critical to the long-term success of NoSQL. The industry seems to be adapting by promoting a hybrid model that allows users to query portions of documents (segments) using SQL. But we are a long way off from any standardized query that a non-programmer type can understand.
Of course the current technologies are evolving and I expect that we will see similar debates on the subject as we had in the 1980s about SQL. That should be quite exciting. For now we will have to make do with tools that complement MongoDB and other NoSQL technologies and provide a common query engine on top of various disparate data sources.
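One way to picture the hybrid model is to flatten a single document segment into rows and query those rows with ordinary SQL. The sketch below does this with Python's built-in sqlite3 module. It illustrates the idea only and does not reflect how any particular product implements it. The documents, the order_lines table and the flatten_segment helper are hypothetical.

```python
import sqlite3

# Hypothetical documents, each with a nested "orders" segment.
docs = [
    {"_id": 1, "customer": "Ada",
     "orders": [{"item": "keyboard", "qty": 1}, {"item": "monitor", "qty": 2}]},
    {"_id": 2, "customer": "Grace",
     "orders": [{"item": "cable", "qty": 5}]},
]

def flatten_segment(documents, segment):
    """Yield one flat row per entry in the named nested segment."""
    for doc in documents:
        for entry in doc.get(segment, []):
            yield (doc["_id"], doc["customer"], entry["item"], entry["qty"])

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_lines (doc_id INTEGER, customer TEXT, item TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?, ?)",
    flatten_segment(docs, "orders"),
)

# Once the segment is tabular, a standard SQL aggregate works as usual.
totals = conn.execute(
    "SELECT customer, SUM(qty) FROM order_lines GROUP BY customer ORDER BY customer"
).fetchall()
```

The query itself is something an analyst can read. The hard part is hidden in flatten_segment, which has to know the shape of the segment in advance, and that brings us back to the schema problem discussed earlier.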