Data modeling

June 10, 2009

The Seven Top Data Delusions

The world of data is full of delusions - false beliefs or ideas about data. These are fueled by the mountains of data-related white papers, articles, blogs, and marketing material. If I "google" any data topic, like master data or BI, millions of hits are returned. As I skim through these, nearly all are regurgitations of the last - thus the data delusions continue to grow. It is interesting how much we assume to be true simply because we read it in print.

Below are the seven most popular delusions I continue to see:

Data Delusion One: If the data is there then it must have been deemed good data. There are no secret data police monitoring the data in most organizations. A great deal of incorrect data lives within the data stores.

Data Delusion Two: If it looks right then it must be. Typically, data is considered "poor quality" when it obviously looks incorrect or is known to be incorrect. Often data can "look" right when it is not. How do you know whether the answer a computer system returns to your question is correct? If you already knew the correct answer, you would not need to ask.

Data Delusion Three: A new tool/technology will fix the data problems. There continues to be a belief that the tools/technology will auto-magically figure out if the data is correct or belongs together. Unfortunately, success is always dependent on the quality of what goes in: garbage in, garbage out is still true.

Data Delusion Four: Data is a computer phenomenon like software or hardware. Many of the definitions support this, but data existed long before computers were ever imagined. Data is a representation of the real-world organization, its things, people, locations and events. Computers help to automate the processing of data.

Data Delusion Five: "Cleaning" the data fixes it. There is always a reason data becomes corrupted. It does not just magically happen. Data errors or poor quality data are a symptom of a problem, rarely the problem itself. Fixing a symptom does not fix the problem - it’s like taking an aspirin for a brain tumor.

Data Delusion Six: The data meaning can be deduced from its name/definition. Even in the rare case when a data store has been diligently modeled from a business standpoint and implemented accordingly, the data system deteriorates over time. Many of the data stores in our organizations have never been designed/modeled in the first place. The data field names and sparse definitions were often the best guess by the programmer at the time.

Data Delusion Seven: Data can be managed/integrated/cleaned at the level of individual attributes/columns. The data attributes/columns are intended for description purposes. They are relative to what they are describing, as well as to the relationships/dependencies of the things they are describing. When data attributes/columns are taken out of this context and treated individually, they can lose much of their meaning, and thus integrity.
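The point of Delusion Seven can be made concrete with a small Python sketch. Everything in it is an assumption made up for illustration: the two tables, the column name `state`, and the values. It only shows that pooling a column by name, without the entity it describes, merges values that mean different things.

```python
# Hypothetical rows; every table, column, and value here is invented for illustration.
customers = [
    {"customer_id": 1, "state": "CA"},   # US state: California
    {"customer_id": 2, "state": "OR"},   # US state: Oregon
]
orders = [
    {"order_id": 10, "customer_id": 1, "state": "OPEN"},
    {"order_id": 11, "customer_id": 2, "state": "OR"},   # order state: "on request"
]

# Column-level view: pool every "state" value and ignore which entity it describes.
pooled_states = sorted({row["state"] for row in customers + orders})

# Context-preserving view: keep the entity alongside each value.
states_in_context = sorted(
    {(entity, row["state"])
     for entity, rows in (("customer", customers), ("order", orders))
     for row in rows}
)

print(pooled_states)       # "OR" appears once, its two meanings merged
print(states_in_context)   # ("customer", "OR") and ("order", "OR") stay distinct
```

In the pooled view, the Oregon customer and the "on request" order become indistinguishable, so any cleansing or matching rule applied to that column alone would treat them as the same thing.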

April 17, 2009

Unfortunate Issues for a National Health IT Network

Noah Stokes comments on GBN about my life-threatening encounter with a poorly designed health information technology (HIT) system. His thoughts give us pause to consider seriously the cost in human life of HIT failures. Noah opines that the UK's NHS Programme for IT presents a wonderful vision. I reply, so does the US version. Noah also believes that the "UK now reaps the benefits" of a fully integrated national HIT network. Unfortunately, serious challenges persist with the UK's National Health Service programme for IT, at the scale of cost and error that I posited for Mr. Obama's version of a national HIT network ("Dear Mr. President ...").

Richard Woods ("Darling swings the axe", The Sunday Times, April 12, 2009) wrote, "Or take the grandiose plan to create a central NHS computer system. Originally budgeted at £2.3 billion, it is now expected to cost £12.7 billion. It is years behind schedule, may never work as promised and is seen by many doctors as a waste of money." In an earlier article, The Sunday Times' Jonathan Ungoed-Thomas and Lois Rogers reported their concerns with the massive NHS IT programme ("Focus: Anatomy of a £15bn gamble", April 16, 2009). These issues exist even though the NHS IT programme is run centrally and is therefore controllable to reasonable data engineering standards. Contrast the UK's NHS situation with that in the US, where HIT is highly fragmented, IT systems vary within single organizations, and no one authority owns this stuff.

It would be wonderful if HIT could deliver President Obama's, or former PM Tony Blair's, vision. It would also be wonderful if HIT were about "saving lives". Tragically, the opposite is too often true. Poorly implemented HIT can do great harm (e.g., see the posts by Dr. Scot Silverstein, Drexel University's Medical Informatics Director: http://www.ischool.drexel.edu/faculty/ssilverstein/failurecases/).

As IT professionals, we must resist the allure of wealth, prestige, and "solving big problems" when the cost of our exuberance is measurable in human lives. We must, in the case of HIT, proceed as trusted engineers, as builders of the great bridges over which our loved ones drive. We must rationally and soberly assess the HIT challenges we face before we start coding. If we are not up to this challenge, then we have a professional obligation to walk away from the fun and the money. While a national HIT network may someday deliver "life-saving" technology, achieving that vision will be a long, arduous, and expensive journey that requires of IT professionals a renewed emphasis on user interface design and data engineering fundamentals.

April 10, 2009

Dear Mr. President, a Data Model for my Electronic Health Records Nearly Killed Me

Blogger: Joe Bugajski

Mr. President, your historic economic stimulus package (The American Recovery and Reinvestment Act of 2009) appropriated $19 billion for health information technology ("Technology Gets a Piece of Stimulus", New York Times, January 25, 2009). This week, your Director of the Office of Management and Budget (OMB), Peter Orszag, shockingly held that half of the US operating deficit can disappear with lower healthcare costs, and that these will come through electronic healthcare records (Daily Show, 6 April 2009). Today, the Wall Street Journal wrote that you proudly proclaimed that electronic healthcare records for the members of the US military, like my youngest son, and continuing through Veterans Affairs "will provide a 'seamless system' to facilitate information sharing and cut red tape, ending the need for veterans to transfer military records to receive benefits". Whereas Star Wars and Stargate movie fantasies provide great fun, witnessing you, a world leader, spew delusional visions of a nation-covering, interoperable, secure, private, reliable, accurate, and instantaneous electronic healthcare data network is at best terrifying and at worst pernicious.

Two months ago, and for 100 hours, I battled for my life with a networked, state-of-the-art, secure, electronic healthcare record system. It connected two of arguably the most advanced medical facilities in the world. One was a modern clinic built by generous benefactors from the computer industry. The other was a world-renowned hospital at a top ten university. Both facilities bristled with brain-power and hi-tech gadgets. All records were "computerized" - that was what was so very wrong with the care that I did not receive, although competent and attentive nursing staff managed my care as they shuffled me around the emergency room (ER) and intensive care unit (ICU). (For more about what went wrong and how it went wrong, read my personal blog.)

The reason things went so wrong during my clinic, then hospital, stay, and the reason, Mr. President, your grand vision of a universal IT health data network is so screwy, is that healthcare data cannot be reliably modeled. Unreliable data models for health records, like those at the two world-class, completely electronic facilities that "cared for me", accelerate prescription errors, prevent staff from efficiently delivering services, heighten life-threatening risks, and dramatically increase costs. Data models are the technical instructions for software to make "computerized health records" possible. Indeed, data models make all computer records possible. According to my friend, colleague, and data modeling guru, Joe Maguire, "data modeling is a family of techniques used to describe the kinds of information that are important to an enterprise". Healthcare data is clearly important, but that fact alone does not make for reliable data models. Good models require stable data and good data modelers. Unfortunately, healthcare data is unstable. Sadly, good data modelers are scarce.

The first problem with modeling healthcare data is that models must represent certain concepts (and not others) that will remain stable and true long enough to be built into computer software then used by healthcare providers and patients. Mr. Obama, have you noticed just how much knowledge has accumulated, is accumulating, and will accumulate in the medical sciences? Knowledge is codified using words - medical knowledge uses copious quantities of difficult words taken from several languages. Words that recur frequently in a particular context become imbued with a meaning that includes the context (e.g., the White House). Such words then come to symbolize a bigger idea than originally intended (i.e., a "house" that happens to be "white", versus your administration and not the house). In medicine, how many stable words exist? These words - nay, well-formed concepts, repeatable, agreed by the medical community - can be modeled and added to computers to store records. Go one step further. Specialization in ER, ICU, cardiac care, pulmonology, oncology, radiology, and other medical subjects exists because the cumulative knowledge defines a large ontology. The ontology, taxonomy, skills, and knowledge of a medical subject area then can be referenced with one word - the name of the specialty. Unfortunately, words that refer to a concept in one specialty often mean something different in another specialty.

The second problem is the lack of good modelers. These people, specialists in data engineering, a subfield of software engineering, transform concepts into graphical and lexical patterns that are used to create computer records. The concepts they model are words (nouns and verbs) plus concepts used by practitioners to describe a patient's medical condition, or a critical care pathway, medications, instructions to patients and nursing staff, tests, and diagnoses. Who among us has the modeling skills to encode this data? As information varies across specialties, how should it be encoded? Empirical evidence suggests that engineers who built the electronic health records network at the two facilities that "cared for me" tried to do this, but they failed. Their data model had irreconcilable silos of information spread across specialties and expressed as incomplete taxonomies (entities), inadequate ontologies (attributes), and poor associativity (relationships). Hence, when programmers added those bad data models to the health information systems, those systems later lost critical information about patients' conditions, listed wrong medications, and isolated prior diagnoses from current observations; in short, they made very bad medicine.

Please do not misunderstand, Mr. President, the medical personnel at the clinic and the hospital were professional, competent, and knowledgeable. It is just that when they interacted with me and other patients, then translated that interaction into electronic health information systems, there was always a fight. Indeed, healthcare professionals wasted between 40% and 60% of the time they had allotted to patient care making electronic health records work, and then only very poorly.

Since the time of my illness, I have met and spoken with a dozen medical professionals and healthcare IT experts. They unanimously confirmed my sickbed analysis of the faults with electronic health records. Most longed for handwritten charts hanging at the foot of every patient's bed (see Professor Dr. Armstrong-Coben’s New York Times Op-Ed) - now, so do I.

Mr. President, before your administration pours billions of our grandchildren's yet-to-be-earned dollars into the biggest, scariest, and most wasteful boondoggle of an IT project the world has ever seen, please instruct your health IT experts to carefully analyze the strengths, weaknesses, opportunities, and threats (SWOT) associated with building a national health information network using today's technology. Tell them to take the simplest steps first. Make them prove results in small projects. Insist that your experts read my paper, "Data Integration: Fantasies and Facts". It explains how to start and manage a large-scale data integration project.

If our nation simply accepts your vision while healthcare IT vendors collect a lion's share of the $20 billion stimulus bounty, individuals and businesses will pay higher medical costs, patients will receive inferior care, medical professionals will lose more of their precious time fighting IT systems instead of delivering better care, and you will be a one-term president.

With profound admiration and respect,

Joe Bugajski

July 15, 2008

Oslo: Road to Microsoft's Cloud [eWeek]

"Oslo" is not exclusively data modeling-related, but it will, if successful, be a major modeling milestone. Oslo will be a key theme at Microsoft's PDC (Professional Developers Conference) in October. See the full eWeek article for more details.

"Modeling today for most people is really about application workflow," Martin said. "But that misses the large point," he said, noting that as companies move into a services world where many of the components of applications they use were not written by the organization, there is a need for modeling to help bring those services together more easily than having to write code. Also, in a virtualized environment where an application outgrows its hardware capacity and has to move to cloud-based assets, modeling can help to set up how the application should scale. In addition, for managing these complex services-based applications, enterprises will need to get policy information for the services that is easier done via modeling rather than writing code, Martin said.

"When you move from service orientation to virtualization, modeling goes from a 'nice-to-have' to a 'must-have.'"

Oslo: Road to Microsoft's Cloud

July 09, 2008

DataFlux Community of Experts » Blog Archive » Is Data Modelling Dead?

A stark data modeling reality check; read the full post for details. Via Microsoft IT CTO Barry Briggs, who noted "What is actually staggering is that the core intellectual asset of a company is encapsulated in its data models."

What staggers me in this day and age is the number of companies I am encountering that do not have a data modeling tool. In my opinion to not have such a product is a serious oversight. In these companies it seems that it is often up to development teams to do the data modeling for a specific system — be it operational or analytical. Yet when I speak to developers in these teams, many of them have never had any training in data modeling and almost all of them are not really focussed on that activity but instead have a primary skill in object oriented application development. So in these cases I tend to ask what tool they use for data modeling. The answer is either no tool at all or that they use an object oriented development tool to define objects such as customer, product, order etc.

DataFlux Community of Experts » Blog Archive » Is Data Modelling Dead?

July 07, 2008

Book review: A Developer's Guide to Data Modeling for SQL Server: Covering SQL Server 2005 and 2008

This book is a timely and helpful overview of the strategic importance of logical and physical data modeling. It also provides some useful insights into new modeling-related features in SQL Server 2008.

While some aspects of the book are likely to be somewhat controversial (e.g., the use of plural entity and table names [I prefer singular; the authors use plural], and the use of SQL views for logical/physical data independence [I agree with the authors on the role and power of views, but some people prefer other abstraction mechanisms]), I believe the book should be required reading for anyone responsible for data modeling and database design aspects of working with SQL Server.

For a more in-depth and vendor/product-independent resource on conceptual and logical data modeling, see Mastering Data Modeling: A User-Driven Approach by John Carlis and Joe Maguire. I may be a bit biased on the Carlis/Maguire book, as John Carlis was one of my graduate school professors in the mid-1980s, and Joe Maguire is a Burton Group colleague, but the modeling techniques described in their book have been very productive for me in a wide range of modeling endeavors over the last 20+ years.

You can also download a free sample Burton Group research document on data modeling (written by Joe Maguire), titled "Data Modeling -- a Necessary and Rewarding Aspect of Data Management", from this page. The document abstract:

Data modeling has evolved from an arcane technique for database designers into an entire family of interrelated techniques that serves many constituencies, including techno-phobic business stakeholders and users. The new maturity of modeling tools and techniques arrives in the nick of time, because new technical and regulatory realities demand that enterprises maintain scrupulous awareness of their data and how it is used. Data modeling is no longer for databases only, no longer for technologists only, and no longer optional.

Blogger: Peter O'Kelly

Amazon.com link: A Developer's Guide to Data Modeling for SQL Server: Covering SQL Server 2005 and 2008 (The Addison-Wesley Microsoft Technology Series): Eric Johnson, Joshua Jones: Books

June 19, 2008

DMS 2008 theme #2, part 2: XQuery to the rescue

(Oops -- sorry about letting so much time sneak by since part 1 on this topic; I got detoured preparing for our Burton Group Catalyst conference next week and attending Microsoft TechEd last week, but the DMS team is committed to making this blog a sustained and regularly updated conversation on data-centric topics, so expect the post frequency to increase as we get into and beyond Catalyst.)

Part 1 on this topic concluded with a summary of ideal attributes for an XML query (general-purpose XML content manipulation, actually) language.  XQuery does a great job of addressing those goals, and its role in the common XML processing pipeline is depicted below:

[Figure: XQuery's role in the common XML processing pipeline]

Specifically:

  • XQuery is an efficient and effective means of working with information resources including databases, documents, and programming language data structures.
  • It doesn't require developer brain transplants, but it is generally more accessible to people familiar with SQL than to people who have been working with content/document-focused systems in the past.
  • XQuery has strong potential to replace the use of multiple programming languages often used for XML query and structural transformation operations. In this respect, it's a lot like the shift to SQL more than 20 years ago, in that a lot of difficult-to-maintain procedural code can be replaced with a more declarative, set-oriented (and easier to maintain) approach.
  • XQuery is a W3C Recommendation, building on other W3C work including XPath (updated in conjunction with XQuery) and XML Schema (the XSD data type model, although XQuery is not exclusively tied to XSD), along with other related standards such as SQL.
  • XQuery is not, despite its name, just for queries. It is an XML data manipulation language, designed for declarative expressions that can be optimized by servers, but it also includes variables, conditional expressions, function and module declarations, and extension points. The W3C XQuery working group is also expanding XQuery to include insert, update, and delete operations.
  • XQuery is not a replacement for SQL; it's designed to be used in conjunction with SQL, as the languages are designed for different data models -- SQL is for the extended relational model, and XQuery is for (ideally well-structured/schema-described) XML content.
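The procedural-versus-declarative contrast in the list above can be sketched loosely in Python. This is not XQuery: it uses the limited XPath subset supported by the standard library's `xml.etree.ElementTree` as a stand-in, and the order document, element names, and function names are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; element and attribute names are invented.
ORDERS_XML = """
<orders>
  <order id="1" status="open"><total>120.50</total></order>
  <order id="2" status="closed"><total>75.00</total></order>
  <order id="3" status="open"><total>19.99</total></order>
</orders>
"""

def open_order_ids_procedural(root):
    """Walk the children by hand and test each element explicitly."""
    ids = []
    for child in root:
        if child.tag == "order" and child.get("status") == "open":
            ids.append(child.get("id"))
    return ids

def open_order_ids_declarative(root):
    """State what is wanted as a path expression and let the library find it."""
    return [order.get("id") for order in root.findall("./order[@status='open']")]

root = ET.fromstring(ORDERS_XML)
procedural = open_order_ids_procedural(root)
declarative = open_order_ids_declarative(root)
print(procedural, declarative)
```

Both functions return the same ids, but the second states the selection as a single path expression, which is the kind of code XQuery lets an engine optimize instead of leaving the traversal to the programmer.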

Overall, XQuery has significant potential to simplify XML query and structural transformation concerns, and, as support for XQuery increases in a variety of software product categories, XQuery is poised to become, for XML content, what SQL is for relational databases.

Blogger: Peter O'Kelly
