External User-facing Search and Taxonomies

Search has become faster, cheaper and more intelligent since the days of inverted single-word engines, so why not just use a search engine? Why bother with taxonomy? Let’s briefly revisit what search is supposed to do. A search engine needs to make a pretty good guess at what a user wants to find, an unspoken intent expressed in staccato keywords. It then needs to match the user query with some content (documents, data, digital objects, people with expertise, ad serving, product information) so the user can take some action such as read, buy, forward, share, comment or browse. Sometimes the match is exact, sometimes the query terms are a partial match, and sometimes there is no match.

In other words, search is not a perfect art. The search engine is only one term in the equation; the others are the quality of the content and the quality of the query. Good search needs good content, no matter how great the technology.

Someone responsible for search implementation has limited control over two of the key ingredients of search – the technology and the content. This is why taxonomy plays a role: it can help describe concepts that are not in the content or in the metadata about the content. (Metadata is particularly useful when digitizing non-digital objects.) Taxonomy is not always necessary. If you can write custom content with very precise vocabulary using Search Engine Optimization (SEO) techniques, you might not need a taxonomy. But some documents, such as emails or reports, cannot be altered; changing them would be a significant protocol violation, or even illegal. So when is search not enough?

1)  Developing effective measures to assess when search is not enough – the 80/20 rule

As part of some of my early work in faceted taxonomies, I spent some time at MIT on a research project that compared results from two setups: a system based on search engine technology alone, and one where the query could be enhanced by adding taxonomy terms. For this experiment we had the advantage of using a system, the brainchild of Wendi Pohs, in which two search engines using the same technology processed similar documents and served a user interface with a simple search box like Google’s. One engine processed news feeds. These feeds were added quickly with no intervention, loaded directly into the search engine. What our research found was that a search engine without a taxonomy, left unattended, flatlined: recall never improved beyond 75-80%. Lee Romero, who is a keen observer of search, has recently written an excellent blog post observing this same flatline phenomenon.
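For readers who want to watch for the same flatline in their own logs, recall is simple to compute once you have a set of documents judged relevant for a test query. The sketch below is only an illustration and not the tooling used in the MIT project: the function name `recall` and the document IDs are hypothetical.

```python
def recall(retrieved, relevant):
    """Fraction of the judged-relevant documents that the engine returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # nothing was judged relevant, so report 0 rather than divide by zero
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical judgments for one test query: five documents are relevant,
# and the engine returned four of them plus one irrelevant document.
relevant_docs = {"d1", "d2", "d3", "d4", "d5"}
retrieved_docs = ["d1", "d2", "d3", "d4", "d9"]

print(recall(retrieved_docs, relevant_docs))
```

Averaging this figure over a fixed set of test queries, and re-running it on a schedule, is one way to see whether the number is stuck.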

What to do when you want to do better than 80% and move the flatline

In the same experiment, we created a second engine which, using the same software, had a taxonomy function that let us insert taxonomy terms into the index. These terms were selected from query logs and analytic reports: they were unmatched terms, misspellings and abbreviations. Adding taxonomy terms carried an added cost, but there was no impact on the speed or performance of search, since both setups used the same engine.

The taxonomy was divided into classes such as product, company, or subject. Each term was connected to other terms through user-defined cross-connections (associative terms), and the system was smart enough to use them to infer other relevant terms. At least one of the terms in a linked set had to be tagged by an indexer. So, if a product was tagged, we could infer that the product was <made_by> a company, thus speeding the tagging process. Taggers could override the automated suggestions and/or add new rules, by the way. This way we could ensure exhaustive indexing at low cost and effort. The taxonomy-controlled section paid back this effort: a search on this section would recall content that matched user-query terms about 90% of the time. The taxonomy-controlled part of the database could also be improved over time. We also worked hard to acquire content, good content, in many formats that would improve the quality of the database and thus of what goes into an engine.
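The <made_by> inference can be pictured as a lookup that runs after the indexer has tagged a product. This is a minimal sketch under assumptions of my own, not the system we used: the table `MADE_BY`, the function `suggest_tags` and the product and company names are all hypothetical.

```python
# Hypothetical associative links: a tagged product implies the company that makes it.
MADE_BY = {
    "WidgetPro": "Acme Corp",
    "GadgetMax": "Globex",
}


def suggest_tags(manual_tags, overrides=None):
    """Return the indexer's tags plus any company inferred through <made_by>.

    `overrides` lets a tagger replace a suggestion, or suppress it with None.
    """
    overrides = overrides or {}
    tags = set(manual_tags)
    for tag in manual_tags:
        company = overrides.get(tag, MADE_BY.get(tag))
        if company is not None:
            tags.add(company)
    return sorted(tags)


print(suggest_tags(["WidgetPro", "widgets"]))
print(suggest_tags(["WidgetPro"], overrides={"WidgetPro": None}))
```

The first call adds the inferred company to the indexer's two tags; the second shows a tagger overriding the suggestion so that only the manual tag remains.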

By using reports, tools and measurements, we were able to proactively add equivalents and monitor emerging terms. Dips in performance triggered action to understand what was changing in the user’s world: was it query terms, a search for emerging content, or other unmet needs?
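One report that is easy to build is a count of queries that returned nothing, since those are the first candidates for new equivalents. The sketch below assumes a query log that has already been reduced to (query, result count) pairs; that layout, the sample queries and the function name `unmatched_terms` are hypothetical.

```python
from collections import Counter

# Hypothetical query log: (query text, number of results returned).
query_log = [
    ("notebook pc", 0),
    ("laptop", 42),
    ("notebook pc", 0),
    ("lapttop", 0),
    ("java api", 17),
]


def unmatched_terms(log, top_n=10):
    """Count the queries that returned nothing, most frequent first."""
    misses = Counter(query for query, hits in log if hits == 0)
    return misses.most_common(top_n)


print(unmatched_terms(query_log))
```

A taxonomy editor would review the top of this list and decide which entries are misspellings, which are synonyms for an existing term, and which point to missing content.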

Errors were due to 1) missing content, 2) wrong application, 3) new terms or spelling errors that could be quickly added to the taxonomy, and 4) new and emerging trends that users were identifying but that had not yet been captured in the taxonomy – all issues that could be identified and corrected. For example, in the recent flu season, search engines would eventually learn that H1N1 was the preferred term for Swine Flu, but in some cases it was much easier for a trained taxonomy editor to surgically make this connection (especially in a fast-moving news and business cycle). In a search-engine-only scenario, these errors are not always identifiable, nor actionable.
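The surgical fix an editor makes in a case like H1N1 amounts to one new row in an equivalence table. Here is a small sketch of the idea; the table contents and the function names `normalize` and `index_terms` are hypothetical, and a real engine would apply the mapping at index or query time in its own way.

```python
# Hypothetical equivalence table: each variant seen in the query logs
# (older name, misspelling, abbreviation) points at one preferred term.
EQUIVALENTS = {
    "swine flu": "h1n1",
    "h1n1 flu": "h1n1",
    "taxonmy": "taxonomy",
}


def normalize(query):
    """Map a raw user query to its preferred term, if the taxonomy knows it."""
    key = " ".join(query.lower().split())  # lowercase and collapse whitespace
    return EQUIVALENTS.get(key, key)


def index_terms(preferred_term):
    """All strings that should lead to the preferred term: itself plus its variants."""
    variants = {v for v, p in EQUIVALENTS.items() if p == preferred_term}
    return sorted(variants | {preferred_term})


print(normalize("Swine  Flu"))
print(index_terms("h1n1"))
```

Adding the next emerging term is a one-line change to the table, which is why an editor can respond faster than an engine that has to learn the association from usage.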

Set realistic goals and explanations for what taxonomy can do

ROI discussions often mean conversations that start or end with “Taxonomy can increase sales by improving conversion” or “lower costs.” Here are a few reasons that might be more honest and even more compelling:

Help with Ambiguity and Add Precision – Use Faceted Navigation: Search engines have a hard time differentiating between key concepts and terms that share a name. I remember the early days when a term like “ASK” would bring a search engine to its knees because it couldn’t tell the difference between the name of a system command and the name of a computer company. By sorting terms into facets, we could help differentiate and resolve ambiguity by navigating the user to the right facet and by tagging more precisely. A developer looking for information on Java applications shouldn’t be sent to Java the island. Taxonomy can help keep users searching down paths that are likely to lead to useful results. That’s productivity.

Implement Universal Search: A taxonomy can be implemented independently from the content, which means it can be used across content types (blogs, videos, email), creating a common set of concepts from which to generate user-centered search. That’s efficiency and smart use of limited resources. You need to have common metadata or RDF to take advantage of universal search, but there are standards such as Dublin Core that can help jumpstart that conversation.

Think Scalability and Reuse: A central, faceted taxonomy can be reused across applications. The best practice, however, is to create smaller taxonomies that are divided into homogeneous facets. Designing monolithic, spaghetti-like taxonomies will, in the end, create more work, produce bad inferences, and sour you on the whole project. Reuse and scalability avoid redundant efforts. Cost savings.

Use Taxonomies to Manage Change: Since taxonomy is independent of the content, you can change the concepts in the taxonomy without impacting the content. Taxonomies are NOT static. For example, many organizations need to change organizational names. These names can be subsumed in a taxonomy without impacting the existing content. It’s a safer and more secure way to handle change.
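The change-management point is easiest to see when content is tagged with a stable concept ID rather than with the label itself. The following is a sketch under that assumption; the IDs, labels, file name and function names are hypothetical.

```python
# Hypothetical taxonomy: content stores only the concept ID; labels live here.
taxonomy = {
    "org-001": {"preferred": "Information Services", "former": []},
}
content_tags = {"annual-report.pdf": ["org-001"]}


def rename_concept(tax, concept_id, new_label):
    """Change the preferred label, keeping the old one as a searchable former name."""
    concept = tax[concept_id]
    concept["former"].append(concept["preferred"])
    concept["preferred"] = new_label


def labels_for(tax, tags, doc):
    """Resolve a document's concept IDs to their current preferred labels."""
    return [tax[cid]["preferred"] for cid in tags[doc]]


rename_concept(taxonomy, "org-001", "Digital Services")
print(labels_for(taxonomy, content_tags, "annual-report.pdf"))
print(taxonomy["org-001"]["former"])
```

The document's tags were never edited, yet it now displays under the new name and can still be found under the old one.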

Create a technical and cost plan to integrate taxonomy while maintaining speed and performance, and not adding to overhead costs.

Implementing taxonomy within search can be done at various price points. A solution like Vivisimo is not within every budget, but I’ve found there are other low-cost and effective alternatives. Here are some technical considerations in adding taxonomy:

  • You don’t need a high-end faceted navigation tool to get the benefits of faceted navigation. Faceted navigation allows a user to narrow or broaden a query at the time of search. This can be done in many CMS systems, including Drupal. WordPress, which is what I use for this blog, has a taxonomy module and allows multiple authors.
  • Add custom fields or metadata for tagging that can be loaded into the search engine to improve search (as Solr does).
  • If you have the budget and a requirement for high throughput, as with auto-classification and text analytics tools such as Nstein, Teragram or Vivisimo, then taxonomy is still very useful for improving the precision of results and for making collections within document sets.

The bottom line is that whichever search engine you use, you should be confident that 80% of the time the user will get what they want. If you need to find ways to improve the user experience beyond that, taxonomy is one highly viable, low-cost and effective option. Taxonomy might be worth looking at as a way to insert a pacemaker into the heart of a search engine that seems to have flatlined.

Once you have a backbone with classy taxonomies and metadata, you can then proceed to the creative activity of beautiful designs of navigation paths for your end users. But keep your eye… For more on search and taxonomies, see also my prior book review of Peter Morville’s Search Patterns.

~ Marlene Rockmore

Search Patterns and Faceted Taxonomies

Peter Morville and Jeffery Callender have produced a beautiful manifesto for improving search, called Search Patterns: Design for Discovery (O’Reilly, 2010). It is an ode to making complex data beautiful and navigable in user interfaces. It’s nice to see O’Reilly produce a book with visual flair.

But once you journey through the many beautiful interfaces and design principles on how to present data, you realize that there is still a need to understand that data presentation is related to data organization. Morville hints at how data is organized to facilitate these interfaces. In Chapter 2 on the anatomy of search, the authors write that sites should “embrace faceted navigation… Global facets might include topic, format, date and author.” Morville downplays the role of formal hierarchies, focusing instead on the user experience of multiple interactions, from “pearl growing” to browsing to managing your data, that work towards a more immediate user experience. Faceted navigation is described as “arguably the most significant search innovation of the past decade” (p 95), but there is only one short chapter, called Engines for Discovery, that discusses how to create faceted navigation.

The data organization that combines the product taxonomy with other facets is called “unified discovery.” In the engines of this discovery (Chapter 6), and this is where we get into the expanded role of the taxonomist, the task is to add facets for

  • Category: broad classifications that vary by application,
  • Topics:  the smaller areas of common interest  such as specific cars or books or recipes
  • Format: how data is formatted whether as content, video, or idea
  • Audience:  the fundamental activity of understanding the needs of who might need the data, from scholar and expert to novice browser

This global “one size fits all” recommendation leaves out Time and Chance: time being when an object was produced, and chance being whether it happens to be highly respected and relevant to the needs of users. Date and date range are an important global facet. Whether there is an “out of the box” global taxonomy is probably up for debate. Facets, including how many there are and how they are labeled, need to be validated by user need, application and content. A global model is a good starting point, but will probably need to be tuned. Search across health care policies, for example, probably requires facets on diseases, symptoms and treatments, and additional resources. Determining the top categories can take some time, so that these categories reflect common shared knowledge and vocabulary. The number of top facets does not have to be 5, or 7 plus or minus 2, but rather what is needed by the application and its users and to organize the content. Get over fixed universality rules and instead collect more data about user needs and content.

These navigations rely on separate and distinct data structures which allow users to navigate and refine queries before they are passed to the underlying database or data structures. These data structures need to be maintained, governed and analyzed. Over time, the richer this conceptual metadata, the better the search experience – better techniques for creating and using metadata are only around the corner.
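The data structure behind a faceted interface does not have to be elaborate. The sketch below is my own illustration rather than anything taken from the book: it assumes each record carries one value per facet, and the records, facet names and function names are hypothetical.

```python
from collections import Counter

# Hypothetical records: each carries one value per facet.
records = [
    {"title": "Java concurrency guide", "topic": "programming", "format": "book"},
    {"title": "Java travel diary", "topic": "travel", "format": "video"},
    {"title": "Java in ten minutes", "topic": "programming", "format": "video"},
]


def refine(items, **selected):
    """Keep only the records that match every selected facet value."""
    return [r for r in items if all(r.get(f) == v for f, v in selected.items())]


def facet_counts(items, facet):
    """How many records fall under each value of a facet, for display beside the links."""
    return Counter(r[facet] for r in items)


print(facet_counts(records, "topic"))
print([r["title"] for r in refine(records, topic="programming", format="video")])
```

The counts drive the navigation the user sees, and each click adds one more facet value to the refinement before the query reaches the underlying store.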

On taxonomies and ontologies, the authors specifically argue that there may be other approaches to disambiguating terms (like Java the programming language from Java the island) based on clues like user and context rather than vocabularies:

“It’s not that there’s no value in parsing sentences for meaning or developing thesauri (or ontologies) that map equivalent, hierarchical, and associative relationships.  These approaches can add value, especially within verticals with limited formal vocabularies, like medicine, law and engineering.  It’s just that less obvious approaches like employing query-query reformulation and post-query click data to drive autosuggest – may deliver better results at lower costs. And we should be wary of claims that computers “understand meaning,” at least until they get a whole lot better at filtering spam.” (p. 162)

While these ideas are valid, they lose the essential wisdom of why librarians adopted taxonomies and spent so long building a body of standards for taxonomy creation. One thing librarians have long known about taxonomies is that they have a shelf-life beyond a specific application – that they can be used to share data across applications, communities and across the globe.

If we are to move the beauty of Morville and Callender’s interfaces to uses beyond e-commerce and towards accessible, lower cost applications, we are going to have to understand the data structures behind these beautiful designs, and reach some shared understandings about how they should be built. Search-side approaches to search are wise, but they depend on a good design for faceted navigation, one whose categories have been validated against users’ needs. The skills of the taxonomist can be applied to search-side information design.

One discussion I enjoyed was on the under-appreciated role of color as a “quick way to reference the major categories and key players.” (p.15) I have often thought that it might be useful to have a color attribute when defining a facet or category so that all the terms and concepts within a facet share the same color. That would help in visual sorting of ideas, which is an idea Morville and Callender explore more on the following pages. Sites without a visual library of photos but only ideas and concepts could become more visual through the use of color-coding. It would be useful if blogs and databases looked at ways of adding color so that similar concepts in a facet or category can also be grouped by shared color.

To move to the next level, where we move search patterns from e-commerce to other uses, such as health care or better access to government information, and more widely adopt better and more visual search designs, we have to broaden the understanding of how to create and validate faceted navigation and categories and what the supporting data structures need to be. Perhaps O’Reilly’s next book should be on the common data structures for design for discovery, such as the art of taxonomy and ontology.

Search Patterns is a valuable little  book  to stimulate creative juices.  The link  to buy Search Patterns is at http://searchpatterns.org/

Thank you to Andy Oram, a mensch of an editor at O’Reilly.

~ Marlene Rockmore

Everybody’s an Ontologist

Clay Shirky is right. “Here comes everybody.” WordPress has just released an amazing version that makes it easy for anyone to make a high-quality website that includes hierarchical search through topics. That means everybody can enrich their content by enriching the concepts associated with their content and pages. There are several nice features in WordPress categorization widgets:

  • Anyone who has the patience to play with the categorization widgets in the dashboard can build a topical or indented navigation that is more intuitive.
  • Concepts are not exposed until content is added, so “form follows content.”
  • As best as I can tell from some research searches, the more specific the concept tags, the more likely you are to be retrieved in a search (that is, the more specific I am as a searcher, the more likely I am to find good content, and the more specific you are as the content provider or blogger, the more likely you are to be found). This is an old indexer’s rule, which was to “index to the narrowest term.”
  • You can set the postings so the parent category does or does not retrieve the child categories, so you can choose whether to enforce inheritance.
  • If you index to concepts from different facets, you have implemented a faceted search.
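The inheritance choice in the list above can be sketched outside of WordPress as a small tree walk. This is not WordPress code and makes no claim about how WordPress stores categories: the tree, the posts and the function names are hypothetical.

```python
# Hypothetical category tree (child -> parent) and post assignments.
PARENT = {
    "Weatherization": "Home Energy",
    "Solar": "Home Energy",
    "Home Energy": None,
}
POSTS = {
    "Insulating an attic": "Weatherization",
    "Choosing solar panels": "Solar",
    "Reading your utility bill": "Home Energy",
}


def descendants(category):
    """The category itself plus every category beneath it."""
    found = {category}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def posts_in(category, include_children=True):
    """Posts filed under a category, optionally including its child categories."""
    wanted = descendants(category) if include_children else {category}
    return sorted(title for title, cat in POSTS.items() if cat in wanted)


print(posts_in("Home Energy"))
print(posts_in("Home Energy", include_children=False))
```

With inheritance on, a reader who picks the broad category sees everything indexed to its narrower terms; with it off, they see only what was filed at that exact level.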

There’s been a raging debate about whether taxonomies have any value whatsoever and whether they should be implemented in RDF, custom XML, SOX or even an RDBMS. It doesn’t matter. Taxonomies are agnostic. They have a fundamental hierarchical structure. But the next step is to have taxonomies move towards ontology, and teach everyone to be an ontologist.

Why does this matter? Because semantic technologies don’t have to belong to the privileged. Anyone who understands the <subject>-<predicate>-<object> construction that you learned in English class can start to build a model and enrich their content. WordPress has a clever implementation of something called XFN that will help build networks of FOAF files to find out who is linking to whom, but there need to be many more templates to help locate critical information for our everyday lives. For example, take an information need as critical as elder care. For the sandwich generation, it is important to find out <who> <provides> <service> for an aging parent, as well as to rate the quality of the provider. For parents of students with special education needs, what about content that is categorized by <who> <provides> <services>, which could help share better information about expertise and programs that support students with a range of individual needs? What if it was easier to link a teacher to <resources> <in a> <content area>? What if an excellent teacher could put their lesson plans online and receive royalties by doing what publishers do and adding better indexing concepts in their metadata?
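To show how little machinery the <who> <provides> <service> pattern needs, here is a sketch that stores statements as plain three-part tuples and answers a "who" question by matching on two of the parts. It is not RDF tooling, and the providers, services and function name are hypothetical.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate object")

# Hypothetical statements about elder-care providers.
triples = [
    Triple("Provider A", "provides", "in-home nursing"),
    Triple("Provider B", "provides", "adult day care"),
    Triple("Provider A", "serves", "North County"),
]


def who(predicate, obj, statements):
    """Answer '<who> <predicate> <object>?' by pattern-matching the triples."""
    return [t.subject for t in statements if t.predicate == predicate and t.object == obj]


print(who("provides", "in-home nursing", triples))
```

Each new kind of question (who serves which area, who is rated by whom) is another predicate in the same list, which is what makes the model approachable for non-specialists.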

Taxonomy support in WordPress is very exciting but basic. WordPress allows hierarchical categories but has no synonym support. For example, you cannot treat “groups”, “organizations”, “communities”, and “clubs” as synonyms for one concept. You cannot add “Home Energy Efficiency” as a synonym for “Weatherization” (or the British spelling “Weatherisation”). “Home energy efficiency” would have to be a child of the term “weatherization.” And voila, if you can add synonyms or concepts to your categorization in WordPress, you probably can improve the status of your content in your community.

However, even a modest level of categorization can achieve sophisticated results. By simply ensuring that content is categorized to multiple relevant facets, such as the <who> category, <services> category, geography or other attribute (or any category that matters in your domain), the content is enriched and potentially more “findable.” A fully-formed taxonomy should be categorized and have links between categories as in RDF. That’s not part of the taxonomist’s vocabulary, though that’s what the ontologists do.

One impediment to progress will be the inability to import and export taxonomies between applications. Like everything else, taxonomies are a commodity, so there needs to be work on data interchange in the tools provided by vendors. But there are existing standards that can help engineers in developing these interchange standards. Many years ago, I worked with Instructional Materials Standards (IMS) for interchange of educational content objects, but the XML technology had not yet evolved. Since then, I have been through many painful migrations of educational learning objects between learning platforms, and have watched the exorbitant increases in textbook costs. The time has come to help encourage the exchange of information. It does not diminish the work of categorizing and analysis; it just shifts the work from one cost center to “everybody.”

The content management tools and blogs are evolving quickly. For example, Drupal is evolving to allow semantic support of RDF values. O’Reilly’s Kurt Cagle has pointed out that content enrichment is probably the hot trend in 2009. O’Reilly has taken a great step towards the librarian’s belief in metadata and descriptive cataloging by adding metadata to its publications in RDF and aligning the RDF with Dublin Core metadata standards.

But the next step is to improve metadata and keyword tagging so content, regardless of format, is findable.

Categorization is an important step on the way to analysis, empathy and knowledge, so be intellectually adventurous and find out what happens when you try to categorize tags. All of this takes a bit of thought to research the current state of information in a field (market research), to find out what is the language of the field, how users search for information and then some creativity to brainstorm the possible terms and then to categorize them.

So for everybody to succeed, we need to talk about content and concept enrichment. It no longer takes any great skill to build a taxonomy, but it does take a little bit of thought and patience, some interface testing and some valuable content that is worth the analysis to preserve and share.

Is Taxonomy Dead?

Recently, Theresa Regli announced in a CMS Watch piece on predictions for 2009 that taxonomy is dead and that metadata is the future. The argument for the death sentence is that taxonomies are viewed as too authoritarian, that it might be possible to auto-generate topics and concepts through computer processes, and finally that the work of taxonomists is to police vocabulary rather than to invite multiple views of information. So let’s examine this assumption by confronting a challenging information problem: health care insurance information systems.

To begin, let’s take a look at some of the heavily-used consumer websites for health care information, such as the Medicare website (www.medicare.gov) and the widely-touted Massachusetts Health Care Program. In each system, take the challenge of seeing what you can find out about benefits for specific conditions like a type of cancer, asthma or allergies. Try to figure out what the coverage is for routine office visits.

What you will notice is that both Medicare and the Massachusetts State public-facing information sources are hard to search.

Medicare Home Page with Search Tools

Buried in Medicare under “Search Tools – Find Out What Medicare Covers” and under “Find Out What Medicare Covers” is a picklist of about 150 alphabetically-arranged terms. A picklist is not a taxonomy. Let’s see what the picklist offers:

  • Multiple terms for Wheelchairs, Power Operated Vehicles (POVs) and Motorized Wheelchairs, which are POVs. There are also multiple synonymous terms for Office Visit.
  • No overarching concept for “Equipment.”
  • One term for all Surgical Services, but no specificity of terms by Surgical Specialty. That might lead to an assumption that all surgical services are covered.
  • Important concepts are missing. There is no entry for “Asthma” or “Psoriasis” or “Dermatology” or many other common complaints or hundreds of procedures.
  • Multiple terms for Lab Tests and Diagnostic Procedures with no overarching category, and none of these terms are linked to standard medical coding systems.
  • Over time, it becomes difficult to scroll through hundreds of unorganized terms.
  • Picklists are not compatible with web accessibility needs, which are particularly important for the audience of a health care (or any) website.

One of the problems is that taxonomists have NOT been involved in solving these serious information problems. What would a taxonomist do? Taxonomists help design other ways that users, such as consumers, patients, caretakers, advocates, doctors, insurance companies, and policy analysts look for information. They group terms in meaningful categories based on proven methodologies that are used to analyze predictable categories of knowledge. Taxonomists perform gap analysis to identify missing concepts. Some taxonomists work with auto-classification and ontological tools to develop rules and semantic models.

Wouldn’t it be useful to have a health care information system that looks at care through various levels of modeling, starting with “point of need” categories such as Routine Care, Non-Routine Care, Emergency Care, Rehabilitation and Restorative Care, Chronic Care (including preexisting conditions), and Life-Threatening and Palliative Care? At the lower, concrete levels, this taxonomy would connect to the detailed services, which could then be connected to cost control data.

Look at www.cancer.gov, which, while not providing health care insurance benefits, at least promotes finding information by type of cancer. http://www.Cancer.gov has a taxonomy that is faceted in that it is organized by types of cancer. Here is a good example of taxonomy at work and of what taxonomy can do to help make these interfaces simpler and more friendly to their audience.

I am a fan of faceted taxonomies, but now I am of the belief that simply categorizing a term to a canonical form might be sufficient, because it captures the context of the term in one moment in time. But as many as 80% of the categories of knowledge are predictable based on our shared knowledge and can be suggested as part of the web interface design process. But taxonomies also need to be friendly to user terminology. Who cares if an office visit to the doctor is called “Wellness Visit”, “Routine Visit” or “A day at the beach”, as long as the terms link back to the same basic concept?

Is taxonomy dead? Old-style authoritarian taxonomies are gone, but taxonomies as models of how we think are very much alive and very necessary to improve public access to important information. Words matter. Long live taxonomy!

A pdf version of this article will be available on website  http://www.SynecdocheConsulting.com