External User-facing Search and Taxonomies

Search has become faster, cheaper and more intelligent since the days of inverted single-word engines, so why not just use a search engine? Why bother with taxonomy? Let’s briefly revisit what search is supposed to do. A search engine needs to make a pretty good guess at what a user wants to find, an unspoken intent expressed in staccato keywords. It then needs to match the user query with some content (documents, data, digital objects, people with expertise, ad serving, product information) so the user can take some action, such as read, buy, forward, share, comment or browse. Sometimes the match is exact, sometimes the query terms are a partial match, and sometimes there is no match.

In other words, search is not a perfect art. The search engine is not the only term in this equation; the quality of the content and of the query matter too. Good search needs good content, no matter how great the technology.

Someone responsible for search implementation has limited control over two of the key ingredients of search: the technology and the content. This is why taxonomy plays a role: it can help describe concepts that are not in the content or in the metadata about the content. (Metadata is particularly useful for digitizing non-digital objects.) Taxonomy is not always necessary. If you can write custom content with very precise vocabulary using Search Engine Optimization (SEO) techniques, you might not need a taxonomy. But some documents, such as emails or reports, cannot be altered; changing them would be a significant protocol violation, or even illegal. So when is search not enough?

1)  Developing effective measures to assess when search is not enough – the 80/20 rule

As part of some of my early work in faceted taxonomies, I spent some time at MIT on a research project that compared results from a system based on search engine technology alone with results from one where the query could be enhanced by adding taxonomy terms. For this experiment, we had the advantage of using a system, the brainchild of Wendi Pohs, in which two search engines using the same technology processed similar documents and served them to a user interface with a simple search box like Google’s. One engine processed news feeds. These feeds were added quickly with no intervention, loaded directly into the search engine. What our research found was that a search engine without a taxonomy, left unattended, flatlined: recall never improved beyond 75-80%. Lee Romero, who is a keen observer of search, has recently written an excellent blog post observing this same flatline phenomenon.

What to do when you want to do better than 80% and move the flatline

In the same experiment, we created a second engine, using the same software, that had a taxonomy function with which we inserted taxonomy terms into the index. These terms were selected from query logs and analytic reports: unmatched terms, misspellings and abbreviations. Adding taxonomy terms had a cost, but there was no impact on the speed or performance of search, since both used the same engine.
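To make the idea of inserting taxonomy terms into an index concrete, here is a minimal sketch in Python. It is not the system from the experiment: the `EQUIVALENTS` table, the document ids and the function names are all hypothetical, and a real engine would do this inside its own indexing pipeline. It only shows how mapping variants to a preferred term at both index time and query time lets different wordings find the same content.

```python
from collections import defaultdict

# Hypothetical equivalents of the kind harvested from query logs:
# variant (synonym, abbreviation, misspelling) -> preferred taxonomy term.
EQUIVALENTS = {
    "swine flu": "h1n1",
    "h1n1 flu": "h1n1",
    "ibm corp": "ibm",
    "wheatherization": "weatherization",  # misspelling
}


def preferred(term):
    """Normalize case and spacing, then map a variant to its preferred term."""
    key = " ".join(term.lower().split())
    return EQUIVALENTS.get(key, key)


def build_index(docs):
    """docs maps a document id to its tagged terms; returns term -> set of ids."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[preferred(term)].add(doc_id)
    return index


def search(index, query):
    """Look the query up under its preferred term."""
    return sorted(index.get(preferred(query), set()))


# Illustrative documents only.
docs = {
    "news-1": ["H1N1", "vaccine"],
    "news-2": ["Swine Flu", "schools"],
    "news-3": ["IBM Corp", "earnings"],
}
index = build_index(docs)
print(search(index, "swine flu"))
```

With the equivalents in place, a query for either "swine flu" or "H1N1" reaches both flu stories; without them, each query would reach only one.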

The taxonomy was divided into classes such as product, company, or subject. Each term was connected to other terms by user-defined cross-connections (associative terms), and the system was smart enough to use them to infer other relevant terms. At least one of the terms in a linked set had to be tagged by an indexer. So, if a product was tagged, we could infer that the product was <made_by> a company, thus speeding the tagging process. Taggers could override the automated suggestions and/or add new rules, by the way. This way we could ensure exhaustive indexing at low cost and effort. The taxonomy-controlled section paid back this effort: a search on this section would recall content that matched user-query terms about 90% of the time. And the taxonomy-controlled part of the database could still be improved. We also worked hard to acquire content, good content, in many formats that would improve the quality of the database and thus of what goes into an engine.
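The <made_by> inference can be sketched in a few lines of Python. This is an illustration under assumptions, not the original tool: the product and company names, the `RELATIONS` table and the `suggest_tags` and `apply_overrides` helpers are invented for the example. The point is only that one tag from an indexer can pull in linked terms, and that a tagger can still reject a suggestion.

```python
# Hypothetical associative links: (source term, relation) -> target term.
RELATIONS = {
    ("Acme Widget", "made_by"): "Acme Corp",
    ("Globex Router", "made_by"): "Globex Inc",
}


def suggest_tags(indexer_tags):
    """Return the indexer's tags plus any terms inferred from the links."""
    suggested = set(indexer_tags)
    for (source, _relation), target in RELATIONS.items():
        if source in indexer_tags:
            suggested.add(target)
    return suggested


def apply_overrides(suggested, rejected=()):
    """Let a tagger drop automated suggestions they disagree with."""
    return set(suggested) - set(rejected)


tags = suggest_tags(["Acme Widget"])
print(sorted(tags))
final = apply_overrides(tags, rejected=["Acme Corp"])
```

Here the indexer tags only the product, and the company comes along for free unless the tagger overrides it.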

By using reports, tools and measurements, we were able to proactively add equivalents and monitor emerging terms. Dips in performance triggered action to understand what was changing in the user’s world: was it query terms, a search for emerging content, or other unmet needs?
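One simple report of this kind is a count of queries that match nothing in the taxonomy. The sketch below assumes a plain list of logged query strings and a set of known terms; both are made up for illustration, and real analytics reports would carry far more detail.

```python
from collections import Counter

# Hypothetical inputs: terms already in the taxonomy, and a raw query log.
known_terms = {"h1n1", "vaccine", "ibm"}
query_log = ["H1N1", "swine flu", "Swine Flu", "vaccine", "swine  flu", "ibm corp"]


def unmatched_queries(log, known):
    """Count normalized queries and return the unmatched ones, most frequent first."""
    counts = Counter(" ".join(q.lower().split()) for q in log)
    return [(query, n) for query, n in counts.most_common() if query not in known]


report = unmatched_queries(query_log, known_terms)
print(report)
```

A term that keeps rising in this report is a candidate for a new equivalent or a new concept.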

Errors were due to 1) missing content, 2) wrong application, 3) new terms or spelling errors that could be quickly added to the taxonomy, and 4) new and emerging trends that users were identifying but that had not yet been captured in the taxonomy. All of these issues could be identified and corrected. For example, in the recent flu season, search engines would eventually learn that H1N1 was the preferred term for Swine Flu, but in some cases it was much easier for a trained taxonomy editor to surgically make this connection (especially in a fast-moving news and business cycle). In a search-engine-only scenario, these errors are not always identifiable or actionable.

Set realistic goals and explanations for what taxonomy can do

ROI discussions often mean conversations that start or end with “Taxonomy can increase sales by improving conversion” or “lower costs.” Here are a few reasons that might be more honest and even more compelling:

Help with Ambiguity and Add Precision (Use Faceted Navigation): Search engines have a hard time differentiating between key concepts and terms. I remember the early days when a term like “ASK” would bring a search engine to its knees because it couldn’t tell the difference between the name of a system command and a computer company. By sorting terms into facets, we could help resolve ambiguity by navigating the user to the right facet and by tagging more precisely. A developer looking for information on Java applications shouldn’t be sent to Java the island. Taxonomy can help keep users searching down paths that are likely to lead to useful results. That’s productivity.
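As a rough sketch of how facets separate the senses of an ambiguous term, consider the following Python. The `FACETED_TERMS` table and the facet names are assumptions made for the example; a real taxonomy would hold these senses as separate concepts in separate facets.

```python
# Hypothetical faceted vocabulary: surface term -> {facet: concept}.
FACETED_TERMS = {
    "java": {
        "technology": "Java (programming language)",
        "geography": "Java (island)",
    },
    "ask": {
        "command": "ASK (system command)",
        "company": "ASK (computer company)",
    },
}


def resolve(term, facet=None):
    """Return every sense of a term, or only the sense in the chosen facet."""
    senses = FACETED_TERMS.get(term.lower(), {})
    if facet is None:
        return sorted(senses.values())
    return [senses[facet]] if facet in senses else []


print(resolve("Java"))                # ambiguous: both senses come back
print(resolve("Java", "technology"))  # the developer's sense only
```

Without a facet the query is ambiguous; once the user picks (or is navigated to) the technology facet, the island drops out.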

Implement Universal Search: A taxonomy can be implemented independently of the content, which means it can be used across content types (blogs, videos, email), creating a common set of concepts from which to generate user-centered search. That’s efficiency and smart use of limited resources. You need common metadata or RDF to take advantage of universal search, but there are standards such as Dublin Core that can help jumpstart that conversation.

Think Scalability and Reuse: Taxonomy can be used across applications, which means a central, faceted taxonomy can be reused by other applications. The best practice, however, is to create smaller taxonomies that are divided into homogeneous facets. Designing monolithic, spaghetti-like taxonomies will, in the end, create more work, produce bad inferences, and sour you on the whole project. Reuse and scalability avoid redundant effort. That’s cost savings.

Use Taxonomies to Manage Change: Since taxonomy is independent of the content, you can change the concepts in the taxonomy without impacting the content. Taxonomies are NOT static. For example, many organizations need to change organizational names. These names can be subsumed in a taxonomy without impacting the existing content. It’s a safer and more secure way to handle change.
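One common way to get this independence, sketched here as an assumption rather than a description of any particular product, is to tag content with stable concept ids and keep the names in the taxonomy. The ids, labels and helper names below are hypothetical.

```python
# Hypothetical taxonomy keyed by a stable concept id.
taxonomy = {
    "org-001": {"preferred": "Research Division", "former": []},
}

# Content is tagged with the id, never with the label itself.
content_tags = {
    "report-17": ["org-001"],
    "memo-42": ["org-001"],
}


def rename(concept_id, new_label):
    """Change the preferred label and keep the old one as a former name."""
    entry = taxonomy[concept_id]
    entry["former"].append(entry["preferred"])
    entry["preferred"] = new_label


def find_concept(label):
    """Find a concept id by its current or any former label."""
    wanted = label.lower()
    for concept_id, entry in taxonomy.items():
        labels = [entry["preferred"]] + entry["former"]
        if wanted in (name.lower() for name in labels):
            return concept_id
    return None


def search(label):
    """Return the documents tagged with the concept that the label names."""
    concept_id = find_concept(label)
    if concept_id is None:
        return []
    return sorted(doc for doc, tags in content_tags.items() if concept_id in tags)


rename("org-001", "Innovation Lab")
print(search("Innovation Lab"))
print(search("Research Division"))
```

The rename touches one taxonomy entry; the tagged documents are untouched and remain findable under both the old and the new name.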

Create a technical and cost plan to integrate taxonomy while maintaining speed and performance, and not adding to overhead costs.

Implementing taxonomy within search can be done at various price points. A solution like Vivisimo is not within every budget, but there are low-cost and effective alternatives. Here are some technical considerations in adding taxonomy:

  • You don’t need a high-end faceted navigation tool to get the benefits of faceted navigation. Faceted navigation allows a user to narrow or broaden a query at the time of search. This can be done in many CMS systems, including Drupal. WordPress, which is what I use for this blog, has a taxonomy module and allows multiple authors.
  • Add custom fields or metadata for tagging that can be loaded into the search engine to improve search (as Solr does).
  • If you have the budget and a requirement for high throughput, as with the auto-classification and text analytics in Nstein, Teragram or Vivisimo, then taxonomy is still very useful for improving the precision of results and for making collections within document sets.

The bottom line is that whichever search engine you use, you should be confident that 80% of the time the user will get what they want. If you need to find ways to improve the user experience, taxonomy is one highly viable, low-cost and effective option. Taxonomy might be worth looking at as a way to insert a pacemaker into the heart of a search engine that seems to have flatlined.

Once you have a backbone with classy taxonomies and metadata, you can then proceed to the creative activity of designing beautiful navigation paths for your end users. For more on search and taxonomies, see also my prior book review of Peter Morville’s Search Patterns.

~ Marlene Rockmore


Everybody’s an Ontologist

Clay Shirky is right. “Here comes everybody.” WordPress has just released an amazing version that makes it easy for anyone to make a high-quality website that includes hierarchical search through topics. That means everybody can enrich their content by enriching the concepts associated with their content and pages. There are several nice features in the WordPress categorization widgets:

  • Anyone who has the patience to play with the categorization widgets in the dashboard can build a topical or indented navigation that is more intuitive.
  • Concepts are not exposed until content is added, so “form follows content.”
  • As best as I can tell from some research searches, the more specific the concept tags, the more likely the content is to be retrieved in a search. That is, the more specific I am as a searcher, the more likely I am to find good content, and the more specific you are as the content provider or blogger, the more likely you are to be found. This is an old indexer’s rule: “index to the narrowest term.”
  • You can set the postings so the parent category does or does not retrieve the child categories; thus you can choose whether to enforce inheritance.
  • If you index to concepts from different facets, you have implemented a faceted search.
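The inheritance choice in the list above can be sketched in Python. This is not WordPress code or its API; the `PARENT` map, the post ids and the `posts_in` function are hypothetical, and the sketch only shows the difference between a parent category that retrieves its children and one that does not.

```python
# Hypothetical category tree: child -> parent (None marks a top-level category).
PARENT = {
    "home energy": None,
    "weatherization": "home energy",
    "insulation": "weatherization",
}

# Hypothetical posts, each indexed to its narrowest category.
posts = {
    "post-1": {"insulation"},
    "post-2": {"weatherization"},
    "post-3": {"home energy"},
}


def descendants(category):
    """Return the category together with every category beneath it."""
    found = {category}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def posts_in(category, include_children=True):
    """List posts in a category, optionally inheriting from child categories."""
    wanted = descendants(category) if include_children else {category}
    return sorted(post for post, cats in posts.items() if cats & wanted)


print(posts_in("weatherization"))
print(posts_in("weatherization", include_children=False))
```

With inheritance on, indexing a post to the narrowest term still lets it surface under every broader category above it.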

There’s been a raging debate about the value of taxonomies: whether they should be implemented in RDF, custom XML, SOX or even an RDBMS, or whether they have any value whatsoever. It doesn’t matter. Taxonomies are agnostic. They have a fundamental hierarchical structure. But the next step is to have taxonomies move towards ontology, and to teach everyone to be an ontologist.

Why does this matter? Because semantic technologies don’t have to belong to the privileged. Anyone who understands the <subject>-<predicate>-<object> construction learned in English class can start to build a model and enrich their content. WordPress has a clever implementation of something called XFN that will help build networks of FOAF files to find out who is linking to whom, but there need to be many more templates to help locate critical information for our everyday lives. For example, take an information need as critical as elder care. For the sandwich generation, it is important to find out <who><provides><service> for an aging parent, as well as to rate the quality of the provider. For parents of children with special education needs, what about content that is categorized by <who><provides><services>, so that it is easier to share information about expertise and programs that can support students with a range of individual needs? What if it was easier to link a teacher to <resources><in a><content area>? What if an excellent teacher could put their lesson plans online and receive royalties by doing what publishers do and adding better indexing concepts to their metadata?
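A <who><provides><service> statement is small enough to model directly. The sketch below stores statements as (subject, predicate, object) tuples in plain Python; it does not use RDF tooling, and every provider name and service in it is invented for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical statements in the <who> <provides> <service> pattern.
TRIPLES = [
    Triple("Maple Street Home Care", "provides", "in-home nursing"),
    Triple("Maple Street Home Care", "located_in", "Springfield"),
    Triple("Riverside Learning Center", "provides", "speech therapy"),
    Triple("Riverside Learning Center", "provides", "tutoring"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part of the pattern that is given."""
    results = []
    for triple in triples:
        if subject is not None and triple.subject != subject:
            continue
        if predicate is not None and triple.predicate != predicate:
            continue
        if obj is not None and triple.obj != obj:
            continue
        results.append(triple)
    return results


def who_provides(service):
    """Answer the question: who provides this service?"""
    return [t.subject for t in match(TRIPLES, predicate="provides", obj=service)]


print(who_provides("speech therapy"))
```

Leaving a slot empty turns the statement into a question, which is all a template for "who provides what" really needs.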

Taxonomy support in WordPress is very exciting but basic. WordPress allows hierarchical categories but has no synonym support. For example, you cannot have synonyms for concepts like “groups”, “organizations”, “communities”, and “clubs.” You cannot add “Home Energy Efficiency” as a synonym for “Weatherization” (or the British spelling “Weatherisation”). “Home energy efficiency” would have to be a child of the term “weatherization.” And voila, if you could add synonyms or concepts to your categorization in WordPress, you could probably improve the status of your content in your community.

However, even a modest level of categorization can achieve sophisticated results. By simply ensuring that content is categorized to multiple relevant facets, such as the <who> category, the <services> category, geography or another attribute (or any category that matters in your domain), the content is enriched and potentially more “findable.” A fully-formed taxonomy should be categorized and have links between categories, as in RDF. That’s not part of the taxonomist’s vocabulary, though it is what ontologists do.

One impediment to progress will be the inability to import and export taxonomies between applications. Like everything else, taxonomies are a commodity, so there needs to be work on data interchange in the tools provided by vendors. But there are existing standards that can help engineers develop these interchange standards. Many years ago, I worked with Instructional Materials Standards (IMS) for interchange of educational content objects, but the XML technology had not yet matured. Since then, I have been through many painful migrations of educational learning objects between learning platforms, and have watched the exorbitant increases in textbook costs. The time has come to encourage the exchange of information. It does not diminish the work of categorizing and analysis; it just shifts the work from one cost center to “everybody.”

The content management tools and blogs are evolving quickly. For example, Drupal is evolving to allow semantic support of RDF values. O’Reilly’s Kurt Cagle has pointed out that content enrichment is probably the hot trend in 2009. O’Reilly has taken a great step toward the librarian’s belief in metadata and descriptive cataloging by adding metadata to its publications in RDF and aligning the RDF with Dublin Core metadata standards.

But the next step is to improve metadata and keyword tagging so content, regardless of format, is findable.

Categorization is an important step on the way to analysis, empathy and knowledge, so be intellectually adventurous and find out what happens when you try to categorize tags. All of this takes a bit of thought to research the current state of information in a field (market research), to find out what is the language of the field, how users search for information and then some creativity to brainstorm the possible terms and then to categorize them.

So for everybody to succeed, we need to talk about content and concept enrichment. It no longer takes any great skill to build a taxonomy, but it does take a little bit of thought and patience, some interface testing and some valuable content that is worth the analysis to preserve and share.