Data Mobility Group, LLC - High Definition Analytics and Technology Market Insight

The Structured-Unstructured Information Continuum

If you’ve fallen into the trap of thinking of databases as “structured” information, and files as “unstructured” information, you’re not alone.

That misleading binary categorization—structured or unstructured information—can be attributed to hundreds of presentations, research reports, and magazine articles prepared by people who simply do not understand the complexity of today’s information assets.

Admittedly, even after a decade of experience designing, developing, and using systems that manage complex information assets, I sometimes use the adjectives “structured” and “unstructured” loosely in casual conversation or in situations where brevity is important (as in my recent letter to the editor of InfoStor).

However, I believe it is essential for you to understand that structured and unstructured are not mutually exclusive sets but the endpoints of a continuum, as illustrated in Figure 1 below.

Figure 1: Structured - Unstructured Continuum

Databases are not always 100% structured…

I wouldn’t recommend storing large amounts of very unstructured data in your database, for administration, performance, and backup reasons. Nevertheless, it is doable. Modern databases are perfectly capable of storing semi-structured and unstructured data as binary large objects (BLOBs), large character, or image data types, depending on the database vendor.

A frequently used alternative is to store pointers to assets (that reside elsewhere) rather than the assets themselves. Either way, the assets, taken as a whole with the database, are neither structured nor unstructured, but somewhere in between.
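To make the two options concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and file path are hypothetical, chosen only for illustration; your database vendor's large-object types and syntax will differ.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE assets (
        id       INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,  -- ordinary structured column
        payload  BLOB,           -- the asset itself, stored inline
        location TEXT            -- or a pointer to an asset stored elsewhere
    )
    """
)

# Option 1: store the asset's bytes directly in the database.
conn.execute(
    "INSERT INTO assets (title, payload) VALUES (?, ?)",
    ("Logo", b"placeholder image bytes"),
)

# Option 2: store only a pointer; the asset itself lives on a file server.
# The path below is hypothetical.
conn.execute(
    "INSERT INTO assets (title, location) VALUES (?, ?)",
    ("Training video", "/mnt/media/training.mpg"),
)


def storage_summary(connection):
    """Report, for each asset, whether it is stored inline or by pointer."""
    rows = connection.execute(
        "SELECT title, payload IS NOT NULL, location FROM assets ORDER BY id"
    ).fetchall()
    return [
        (title, "inline" if has_blob else "pointer", location)
        for title, has_blob, location in rows
    ]


print(storage_summary(conn))
```

In both rows the title is fully structured and queryable, while the asset it describes is not, which is exactly the in-between state described above.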

And files are not always 100% unstructured…

XML documents and Excel spreadsheets are two excellent examples of queryable, sortable semi-structured information contained in a file.
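As a rough illustration of what “queryable” and “sortable” mean for an XML file, the sketch below filters and sorts a small document with Python's standard xml.etree.ElementTree module. The document, its element names, and the function name are invented for this example.

```python
import xml.etree.ElementTree as ET

# A made-up XML document standing in for a file on disk.
XML_DOC = """
<orders>
  <order id="1003" region="west"><total>250.00</total></order>
  <order id="1001" region="east"><total>75.50</total></order>
  <order id="1002" region="west"><total>120.00</total></order>
</orders>
"""


def orders_for_region(xml_text, region):
    """Return (id, total) pairs for one region, sorted by total ascending."""
    root = ET.fromstring(xml_text.strip())
    # Query: select only the <order> elements whose region attribute matches.
    matches = root.findall(f"./order[@region='{region}']")
    # Sort: order the selected elements by their numeric <total> value.
    matches.sort(key=lambda order: float(order.findtext("total")))
    return [(order.get("id"), float(order.findtext("total"))) for order in matches]


print(orders_for_region(XML_DOC, "west"))
```

No database is involved, yet the file supports the same kind of selection and ordering you would expect from a table.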

What about other ubiquitous file formats such as Microsoft Word docs? Many people would incorrectly classify Word documents as only unstructured information. How many times have you inserted a few data tables into a Word document? As researchers, we find ourselves doing this quite often. How many writers format Word document text using predefined styles such as title, subtitle, caption, and body? All, arguably, add degrees of structure.

Most users aren’t aware that many applications expose their own proprietary object models. Microsoft Office and Visio are two examples. Though proprietary, these object models enable developers to automate applications, and perform operations (such as searching, aggregation, and analysis) on information contained in native files. Thankfully, newer versions of some applications, such as MS Word 2003, can natively output their own flavor of XML—and that additional layer of structure makes the information much more accessible to other non-native applications.

Needless to say, files are not wholly unstructured.

Asset management systems shrink the continuum

By layering structured metadata over assets of any degree of structure, today’s asset management systems (such as content management and digital asset management systems) bring the endpoints of the continuum closer together.

For example, these systems may store and manage dozens of metadata fields about an individual video file, where the metadata and the file are managed as a single entity. Further, the systems manage relationships between multiple entities to place the content of the file in a larger context. A great example is a video and its corresponding screenplay or storyboards.

Alone, the video file could be classified as highly unstructured. However, as an entity in the context of an asset management system, it does indeed have some structure—imposed on it by its environment.
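One way to picture this layering is the sketch below. It is not modeled on any particular product; the class names, metadata fields, and file paths are all hypothetical. It wraps each file in a record that carries metadata and named relationships to other assets.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A file plus the structured metadata layered over it."""
    asset_id: str
    path: str                                     # the (unstructured) file itself
    metadata: dict = field(default_factory=dict)  # structured descriptive fields
    related: dict = field(default_factory=dict)   # relationship name -> asset ids


class Catalog:
    """A toy stand-in for an asset management system."""

    def __init__(self):
        self._assets = {}

    def add(self, asset):
        self._assets[asset.asset_id] = asset

    def relate(self, source_id, relationship, target_id):
        """Record a named relationship from one asset to another."""
        links = self._assets[source_id].related.setdefault(relationship, [])
        links.append(target_id)

    def find(self, **criteria):
        """Return ids of assets whose metadata matches every criterion."""
        return [
            asset.asset_id
            for asset in self._assets.values()
            if all(asset.metadata.get(k) == v for k, v in criteria.items())
        ]

    def related_paths(self, asset_id, relationship):
        """Return the file paths of assets linked by the given relationship."""
        ids = self._assets[asset_id].related.get(relationship, [])
        return [self._assets[i].path for i in ids]


catalog = Catalog()
catalog.add(Asset("vid-1", "/media/launch.mpg",
                  {"type": "video", "project": "launch", "minutes": 12}))
catalog.add(Asset("doc-1", "/media/launch_screenplay.doc",
                  {"type": "screenplay", "project": "launch"}))
catalog.relate("vid-1", "screenplay", "doc-1")

print(catalog.find(project="launch", type="video"))
print(catalog.related_paths("vid-1", "screenplay"))
```

The video's bytes are never inspected, yet the asset can be found by its attributes and traced to its screenplay. That structure comes entirely from the environment.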

Keep this in mind

Numerous enterprise applications mix and match information assets with varying degrees of structure. So your asset management strategy must address the handling of information across the continuum (there is no one-size-fits-all solution to this problem). For example:

  • Should you leave existing spreadsheets in their native format? Or, should you import and manage the spreadsheets in a database? Or, should you import the spreadsheet data into a database and discard the originals? And given today’s compliance requirements, do you have a choice?
  • Should you save information in XML documents, or as XML in a database? Or should you generate XML on-the-fly from your data source when XML is needed for data sharing?
  • Today you’re combining data from several sources and inserting the results into tables in MS Word. You’re converting those documents to PDF and distributing them to clients. Is that acceptable, or should you also save the result set in a database? What about the original Word document and the PDF? Are you required to save those too?
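To illustrate one of the options in the second question above, generating XML on the fly from a data source, here is a minimal sketch using Python's sqlite3 and xml.etree.ElementTree modules. The function name, table, and tag names are hypothetical, and the sketch assumes every column name is also a valid XML element name.

```python
import sqlite3
import xml.etree.ElementTree as ET


def query_to_xml(connection, query, root_tag="results", row_tag="row"):
    """Run a query and serialize its result set as an XML string.

    Assumes each column name is a valid XML element name.
    """
    cursor = connection.execute(query)
    columns = [description[0] for description in cursor.description]
    root = ET.Element(root_tag)
    for row in cursor:
        row_element = ET.SubElement(root, row_tag)
        for name, value in zip(columns, row):
            # NULL values become empty elements; everything else becomes text.
            ET.SubElement(row_element, name).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical data source: the database stays the system of record,
# and XML is produced only when it is needed for sharing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (name TEXT, balance REAL)")
conn.execute("INSERT INTO clients VALUES (?, ?)", ("Acme", 1200.5))

xml_text = query_to_xml(conn, "SELECT name, balance FROM clients")
print(xml_text)
```

With this approach there is only one copy of the data to manage and protect. Whether you must also retain each generated XML document is a compliance question the code cannot answer.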

If you make the mistake of treating all of your information assets as either 100% structured or 100% unstructured, you are going to severely limit your ability to manage, protect, and eventually tap those resources.

Take it from someone who has been down that road before, and lived to talk about it.

This post was originally published in Data Mobility Group’s first blog, “Perspectives on Storage”, on March 7th, 2004.


© 2002-2009 Data Mobility Group, LLC. All Rights Reserved.