Data Mobility Group, LLC - High Definition Analytics and Technology Market Insight

The Structured-Unstructured Information Continuum

If you believe databases contain “structured” information, and files contain “unstructured” information, you’re not alone.

The idea that information is either structured or unstructured can be attributed to countless presentations, research reports, and magazine articles prepared by people who simply do not understand the characteristics and nuance of information in its many forms.

Admittedly, even after a decade of designing, developing, and deploying systems that manage complex information assets, I still find myself using the adjectives “structured” and “unstructured” loosely in casual conversation with my peers. However, I believe it is essential for others to understand that structured and unstructured are not mutually exclusive sets, but the two ends of a continuum that runs from one type of information to the other, as illustrated in Figure 1 below.

Figure 1: Structured - Unstructured Continuum

Finding chaos in structure

I wouldn’t recommend storing large amounts of very unstructured data in a database, for administration, performance, and data protection reasons, but it is doable. Modern databases are perfectly capable of storing less structured data as binary large objects (BLOBs), large character types, or image datatypes, depending on the database vendor.

A frequently used alternative is to store (within a database) pointers to assets (that reside elsewhere) rather than the assets themselves. Either way the assets, taken as a whole with the database, are neither wholly structured nor unstructured, but somewhere in between.
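The two options above can be sketched side by side. This is only a minimal illustration using Python's built-in sqlite3 module; the table layout, column names, and file path are assumptions made up for the example, not a recommended schema.

```python
import sqlite3


def build_asset_store():
    """Create an in-memory database holding one asset inline and one by pointer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE assets (
            id       INTEGER PRIMARY KEY,
            title    TEXT NOT NULL,  -- ordinary structured column
            content  BLOB,           -- asset bytes kept inside the database
            location TEXT            -- or a pointer to an asset kept elsewhere
        )
        """
    )
    # Option 1: store the asset itself as a BLOB (placeholder bytes here).
    conn.execute(
        "INSERT INTO assets (title, content) VALUES (?, ?)",
        ("logo", b"placeholder image bytes"),
    )
    # Option 2: store only a pointer; the path is hypothetical.
    conn.execute(
        "INSERT INTO assets (title, location) VALUES (?, ?)",
        ("training video", "/mnt/media/training.mov"),
    )
    conn.commit()
    return conn


def describe_assets(conn):
    """Report, per asset, whether it is stored inline or by pointer."""
    rows = conn.execute(
        "SELECT title, content IS NOT NULL, location FROM assets ORDER BY id"
    ).fetchall()
    return [
        (title, "inline" if has_blob else "pointer", location)
        for title, has_blob, location in rows
    ]
```

In both rows the title is fully structured and queryable, while the asset it describes is not, which is the in-between state described above.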

Finding structure in chaos

XML documents and Excel spreadsheets are two excellent examples of queryable, sortable semi-structured information contained in a file.
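As a rough illustration of what “queryable and sortable” means for a file, the sketch below filters and sorts a small XML document with Python's standard xml.etree.ElementTree module. The catalog contents, element names, and function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# A made-up XML document standing in for a file on disk.
CATALOG_XML = """
<catalog>
  <report id="r1"><title>Storage Trends</title><pages>42</pages></report>
  <report id="r2"><title>Archive Strategy</title><pages>17</pages></report>
  <report id="r3"><title>Backup Basics</title><pages>8</pages></report>
</catalog>
"""


def titles_with_min_pages(xml_text, min_pages):
    """Return titles of reports with at least min_pages pages, sorted by title."""
    root = ET.fromstring(xml_text.strip())
    # Query: keep only the reports that meet the page threshold.
    matches = [
        report
        for report in root.findall("report")
        if int(report.findtext("pages")) >= min_pages
    ]
    # Sort: order the surviving reports alphabetically by title.
    matches.sort(key=lambda report: report.findtext("title"))
    return [report.findtext("title") for report in matches]
```

Calling `titles_with_min_pages(CATALOG_XML, 10)` returns the two longer reports in alphabetical order, with no database involved.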

And what about other ubiquitous file formats such as Microsoft Word docs? Many people would incorrectly classify Word documents as only unstructured information. How many times have you inserted a few data tables into a Word document? As researchers we find ourselves doing this quite often. How many writers format Word document text using predefined styles such as title, subtitle, caption, and body? All of the above add degrees of structure.

Most users aren’t aware that many applications expose their own proprietary object models. Microsoft Office and Visio are two examples. Though proprietary, these object models enable developers to automate applications and perform operations (such as searching, aggregation, and analysis) on information contained in native files. Thankfully, newer versions of some applications, such as MS Word 2003, can natively output their own flavor of XML, and that additional layer of structure makes the information much more accessible to other non-native applications.

Needless to say, files are not wholly unstructured.

Asset management systems attempt to shrink the continuum

By layering structured metadata over assets of any degree of structure, today’s asset management systems (e.g., content management and digital asset management systems) bring the endpoints of the continuum closer together.

For example, these systems may store and manage dozens of metadata fields about an individual video file, where the metadata and the file are managed as a single entity. Further, the systems manage relationships between multiple entities to place the content of individual information assets into a larger context. A good example is a video and its corresponding screenplay or storyboards managed as a collection of assets.

Alone, the video file could be classified as highly unstructured. However, as an entity in the context of an asset management system, it does indeed have a greater degree of structure imposed on it by its environment.
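To make the idea concrete, here is a minimal sketch of metadata and file managed as one entity, with related entities grouped into a collection. The class names, metadata keys, and file names are all hypothetical and do not reflect any particular asset management product.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A file plus the structured metadata layered over it."""
    path: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Collection:
    """A set of related assets, keyed by the role each one plays."""
    name: str
    members: dict = field(default_factory=dict)  # role -> Asset

    def add(self, role, asset):
        self.members[role] = asset

    def find(self, key, value):
        """Return the roles of assets whose metadata has key == value."""
        return [
            role
            for role, asset in self.members.items()
            if asset.metadata.get(key) == value
        ]


def build_example_collection():
    """Group a video with its screenplay, as in the example above."""
    launch = Collection("product launch")
    launch.add("video", Asset("launch.mov", {"format": "video", "project": "launch"}))
    launch.add("screenplay", Asset("launch.doc", {"format": "text", "project": "launch"}))
    return launch
```

The video bytes remain as unstructured as ever, but `build_example_collection().find("format", "video")` can now locate the file through its metadata, and the collection ties it to its screenplay.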

One strategy does not fit all

Numerous enterprise applications mix and match information assets with varying degrees of structure. So your asset management strategy must address the handling of information assets across the continuum (there is no one-size-fits-all solution to this problem). Challenges include:

  • Should you leave existing spreadsheets in their native format? Or, should you import and manage the spreadsheets in a database? Or, should you import the spreadsheet data into a database and discard the originals? And given today’s compliance requirements, do you have a choice?
  • Should you save information in XML documents, or as XML in a database? Or should you generate XML on-the-fly from your data source when XML is needed for data sharing?
  • Today you’re combining data from several sources and inserting the results into tables in MS Word. You’re converting those documents to PDF and distributing them to clients. Is that acceptable, or should you also save the result set in a database? What about the original Word document and the PDF? Are you required to save those too?
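For the second question, generating XML on the fly from a data source, a minimal sketch might look like the following. It assumes a SQLite database and uses only Python's standard library; the function name, tag names, and the idea of using column names as element names are illustrative choices, not a prescribed approach.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, query, root_tag="results", row_tag="row"):
    """Run a query and serialize its result set as an XML string."""
    cursor = conn.execute(query)
    # Column names become element names, so they must be valid XML names.
    columns = [description[0] for description in cursor.description]
    root = ET.Element(root_tag)
    for record in cursor:
        row = ET.SubElement(root, row_tag)
        for name, value in zip(columns, record):
            # NULL values become empty elements; everything else is stringified.
            ET.SubElement(row, name).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")
```

With this approach the database stays the system of record and the XML exists only for the duration of the exchange. Whether that is acceptable depends on the compliance questions raised in the list above.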

If you make the mistake of treating all of your information assets as either 100% structured or 100% unstructured, you are going to severely limit your ability to manage, protect, and eventually tap those resources.

Take it from someone who has been down that road before, and lived to talk about it.

This post was originally published in Data Mobility Group’s first blog, “Perspectives on Storage”, on March 7th, 2004.

  © 2002-2009 Data Mobility Group, LLC. All Rights Reserved.