External User-facing Search and Taxonomies

Search has become faster, cheaper and more intelligent since the days of single-word inverted-index engines, so why not just use a search engine? Why bother with taxonomy? Let’s briefly revisit what search is supposed to do. A search engine needs to make a pretty good guess at what a user wants to find, an unspoken intent expressed in staccato keywords. It then needs to match the user query with some content (documents, data, digital objects, people with expertise, ads, product information) so the user can take some action such as read, buy, forward, share, comment or browse. Sometimes the match is exact, sometimes the query terms are a partial match, and sometimes there is no match.

In other words, search is not a perfect art. The search engine is only one term in the equation; the quality of the content and of the query matter too. Good search needs good content, no matter how great the technology.

Someone responsible for search implementation has limited control over two of the key ingredients of search: the technology and the content. This is why taxonomy plays a role: it can help describe concepts that are not in the content or in the metadata about the content. (Metadata is particularly useful when digitizing non-digital objects.) Taxonomy is not always necessary. If you can write custom content with very precise vocabulary using Search Engine Optimization (SEO) techniques, you might not need a taxonomy. But some documents, such as emails or reports, cannot be altered; changing them would be a significant protocol violation, and might even be illegal. So when is search not enough?

1)  Developing effective measures to assess when search is not enough – the 80/20 rule

As part of some of my early work in faceted taxonomies, I spent some time at MIT on a research project that compared results from a system based on search engine technology alone with results from one where the query could be enhanced by adding taxonomy terms. For this experiment we had the advantage of using a system, the brainchild of Wendi Pohs, in which 2 search engines using the same technology processed similar documents and served a user interface with a simple search box like Google’s. One engine processed news feeds. These feeds were added quickly with no intervention, loaded directly into the search engine. Our research found that search engines without a taxonomy, left unattended, flatlined: recall never improved beyond 75-80%. Lee Romero, who is a keen observer of search, has recently written an excellent blog post observing this same flatline phenomenon.

What to do when you want to do better than 80% and move the flatline

In the same experiment, we created a second engine which, using the same software, had a taxonomy function where we inserted taxonomy terms into the index. These terms were selected from query logs and analytic reports: they were unmatched terms, misspellings and abbreviations. Adding taxonomy terms had a cost, but there was no impact on the speed or performance of search, since both systems used the same engine.

The taxonomy was divided into classes such as product, company, or subject. Each term was connected to other terms through user-defined cross-connections (associative terms), and the system was smart enough to use these links to infer other relevant terms. At least one of the terms in a linked set had to be tagged by an indexer. So, if a product was tagged, we could infer that the product was <made_by> a company, thus speeding the tagging process. Taggers could override the automated suggestions and add new rules, by the way. This way we could ensure exhaustive indexing at low cost and effort. The taxonomy-controlled section paid back this effort: a search on this section would recall content that matched user-query terms about 90% of the time. And the taxonomy-controlled part of the database could keep being improved. We also worked hard to acquire content, good content, in many formats that would improve the quality of the database and thus of what goes into the engine.
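The inference step described above can be sketched in a few lines. This is only an illustration, not the system used in the experiment: the `MADE_BY` table, the product and company names, and the `overrides` mechanism are all hypothetical.

```python
# Hypothetical <made_by> links between a product term and a company term.
MADE_BY = {
    "WidgetPro": "Acme Corp",
    "GadgetMax": "Globex",
}

def infer_tags(manual_tags, overrides=None):
    """Return the indexer's tags plus any company tags inferred via <made_by>.

    `overrides` lets a tagger replace a suggestion (product -> other company)
    or suppress it (product -> None).
    """
    overrides = overrides or {}
    tags = set(manual_tags)
    for tag in manual_tags:
        company = overrides.get(tag, MADE_BY.get(tag))
        if company:
            tags.add(company)
    return tags

# The indexer tags only the product; the company tag is inferred.
doc_tags = infer_tags({"WidgetPro"})
```

The indexer supplies one term from the linked set and the rule supplies the rest, which is where the saving in tagging effort comes from.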

By using reports, tools and measurements, we were able to proactively add equivalents and monitor emerging terms.    Dips in performance triggered action to understand what was changing in the user’s world – was it query terms, a search for emerging content, or other unmet needs.

Errors were due to 1) missing content, 2) wrong application, 3) new terms or spelling errors that could be quickly added to the taxonomy, and 4) new and emerging trends that users were identifying that had not yet been captured in the taxonomy: all issues that could be identified and corrected. For example, in the recent flu season, search engines would eventually learn that H1N1 was the preferred term for Swine Flu, but in some cases it was much easier for a trained taxonomy editor to surgically make this connection (especially in a fast-moving news and business cycle). In a search-engine-only scenario, these errors are not always identifiable, nor actionable.
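As a rough sketch of the H1N1 case, assume the taxonomy editor maintains a small table of equivalents that maps variants to a preferred term, and that unmatched queries are logged for review. The table contents, the toy `index`, and the function names are hypothetical.

```python
# Hypothetical equivalents maintained by a taxonomy editor:
# variant term -> preferred term.
EQUIVALENTS = {
    "swine flu": "h1n1",
    "h1n1 flu": "h1n1",
    "hini": "h1n1",  # illustrative misspelling of the kind found in query logs
}

def normalize_query(query):
    """Lowercase, collapse whitespace, and map a variant to its preferred term."""
    cleaned = " ".join(query.lower().split())
    return EQUIVALENTS.get(cleaned, cleaned)

def search(query, index, unmatched):
    """Look up the normalized query; record misses so an editor can review them."""
    term = normalize_query(query)
    results = index.get(term, [])
    if not results:
        unmatched.append(term)
    return results

index = {"h1n1": ["doc-17", "doc-42"]}  # toy index: term -> document ids
unmatched_log = []
hits = search("Swine  Flu", index, unmatched_log)
misses = search("bird flu", index, unmatched_log)
```

The `unmatched_log` list plays the role of the query-log report: each entry is a candidate for a new equivalent or a sign of missing content.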

Set realistic goals and explanations for what taxonomy can do

ROI discussions often mean conversations that start or end with “Taxonomy can increase sales by improving conversion” or “lower costs.” Here are a few reasons that might be more honest and even more compelling:

Help with Ambiguity and Add Precision, Using Faceted Navigation: Search engines have a hard time differentiating between key concepts and terms. I remember in the early days when a term like “ASK” would bring a search engine to its knees because it couldn’t tell the difference between the name of a system command and a computer company. By sorting terms into facets, we could help differentiate and resolve ambiguity by navigating the user to the right facet and by tagging more precisely. A developer looking for information on Java applications shouldn’t be sent to Java the island. Taxonomy can help keep users searching down paths that might lead to useful results. That’s productivity.

Implement Universal Search: A taxonomy can be implemented independently of the content, which means it can be used across content types (blogs, videos, email), creating a common set of concepts from which to generate user-centered search. That’s efficiency and smart use of limited resources. You need to have common metadata or RDF to take advantage of universal search, but there are standards such as Dublin Core that can help jumpstart that conversation.

Think Scalability and Reuse: A central, faceted taxonomy can be reused across applications. The best practice, however, is to create smaller taxonomies that are divided into homogeneous facets. Designing monolithic, spaghetti-like taxonomies will, in the end, create more work, produce bad inferences, and sour you on the whole project. Reuse and scalability avoid redundant efforts. Cost savings.

Use Taxonomies to Manage Change: Since taxonomy is independent of the content, you can change the concepts in the taxonomy without impacting the content. Taxonomies are NOT static. For example, many organizations need to change organizational names. These names can be subsumed in a taxonomy without impacting the existing content. It’s a safer and more secure way to handle change.
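One way to picture this independence is to tag content with a stable concept identifier rather than with the name itself. The sketch below assumes that design; the identifier scheme, labels and file name are invented for illustration.

```python
# Hypothetical concept store: content points at a stable id, never at a label.
concepts = {
    "org:001": {"pref_label": "Information Systems", "alt_labels": set()},
}
content_tags = {"report-2009.pdf": {"org:001"}}

def rename_concept(concepts, concept_id, new_label):
    """Change the preferred label and keep the old name as an alternative label."""
    concept = concepts[concept_id]
    concept["alt_labels"].add(concept["pref_label"])
    concept["pref_label"] = new_label

def find_content(label, concepts, content_tags):
    """Return documents tagged with any concept that carries this label."""
    wanted = label.lower()
    ids = {
        cid for cid, c in concepts.items()
        if wanted == c["pref_label"].lower()
        or wanted in {alt.lower() for alt in c["alt_labels"]}
    }
    return sorted(doc for doc, tags in content_tags.items() if tags & ids)

# The department is renamed; the tagged content is never touched.
rename_concept(concepts, "org:001", "Digital Services")
```

After the rename, a search on either the new or the old name reaches the same document, and `content_tags` is unchanged.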

Create a technical and cost plan to integrate taxonomy while maintaining speed and performance, and not adding to overhead costs.

Implementing taxonomy within search can be done at various price points. A solution like Vivisimo is not within every budget, but there are low-cost and effective alternatives. Here are some technical considerations in adding taxonomy.

  • You don’t need a high-end faceted navigation tool to get the benefits of faceted navigation. Faceted navigation allows a user to narrow, broaden or expand a query at the time of search. This can be done in many CMS systems, including Drupal. WordPress, which is what I use for this blog, has a taxonomy module and allows multiple authors.
  • Add custom fields or metadata for tagging that can be loaded into the search engine to improve search (as Solr does).
  • If you have the budget and a requirement for high throughput, as in auto-classification and text analytics with tools such as Nstein, Teragram or Vivisimo, then taxonomy is still very useful for improving the precision of results and for making collections within document sets.

The bottom line is that whichever search engine you use, you should be confident that 80% of the time the user will get what they want. If you need to find ways to improve the user experience beyond that, taxonomy is one highly viable, low-cost and effective option. Taxonomy might be worth looking at as a way to insert a pacemaker into the heart of a search engine that seems to have flatlined.

Once you have a backbone with classy taxonomies and metadata, you can then proceed to the creative activity of beautiful designs of navigation paths for your end users.  But keep your eye  For more on search and taxonomies, see also my prior book review of Peter Morville’s  Search Patterns.

~ Marlene Rockmore


Facets=Classes=Sets


I just returned from an intense training in semantic web technologies through TopQuadrant, and I learned much more about what goes on “under the covers.” The course explained how semantic technologies can support machine-to-machine applications. One important lesson was that facets are similar to classes, which in turn are similar to the mathematical idea of a set. This post discusses why taxonomists and programmers need to think of classes, facets and sets as similar ideas.

Using semantic tools requires building a conceptual model, which is a collection of classes. Building useful models that are semantically enabled requires learning the basic semantic toolkit:

  • RDF (Resource Description Framework). In RDF, one creates classes and designs relations between individual members of a class and between classes. Two RDF-related standards come up most often: RDFa, which embeds RDF in web pages, and RDFS (RDF Schema), which can be used to express the ontology (concept mapping) as a schema representing the underlying data. RDF describes directed graphs that can be expressed as triples. Using RDF, one can read in a data store such as a spreadsheet and quickly generate a starter taxonomy (which still needs to be validated with use case scenarios).
  • SKOS (Simple Knowledge Organization System) represents traditional taxonomies in RDF. SKOS handles basic thesaurus-type relations such as broader/narrower concepts, alternative labels and related concepts. In SKOS the related concept would have its own Uniform Resource Identifier (URI). SKOS can only describe a concept with broader and narrower concepts and with preferred and alternative labels, and cannot associate a concept with an OWL class.
  • SPARQL is a specialized query language designed to query triple stores. A semantically enabled application is one whose data can be converted into an RDF graph, which can then be displayed visually as a graph and queried using SPARQL.
  • OWL (Web Ontology Language) is the underlying language for describing models. OWL is required to handle more complexity such as restrictions, cardinality, and inferencing.

Most everything conceptually in RDF, SKOS, and the underlying modeling language OWL, once you get under the covers, will be familiar to taxonomists. Some details can confuse you, but don’t let the lack of consistent naming conventions deter you. For example, in OWL the top-level class is called owl:Thing, while a class defined in the RDF Schema language is an rdfs:Class. Oh well, confusing, but don’t let that deter you from appreciating the power of this approach. A Thing is still a class, which is similar to a facet.

Here are some examples of how OWL and taxonomies are similar. The bolded print is the OWL property.

SubClassOf defines narrow term in a set

Inverse of creates reciprocal relations

Transitivity allows navigation of a hierarchy, so that if A is related to B and B is related to C, then A is related to C. A SPARQL query that chains through a hierarchy can potentially consist of 2 lines.

Restrictions are similar to slot facets or attributes, which are properties that limit the set
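To make the transitivity idea concrete, here is a small sketch in plain Python rather than OWL or SPARQL. It treats subclass statements as (narrower, broader) pairs and walks up the hierarchy; the class names are made up for the example.

```python
# Hypothetical subclass statements as (narrower, broader) pairs.
SUBCLASS_OF = [
    ("Laptop", "Computer"),
    ("Computer", "Electronics"),
    ("Electronics", "Product"),
]

def ancestors(term, pairs):
    """Return every broader class reachable from `term` by chaining the pairs."""
    parents = {}
    for child, parent in pairs:
        parents.setdefault(child, set()).add(parent)
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:  # also guards against cycles
                found.add(parent)
                stack.append(parent)
    return found

laptop_classes = ancestors("Laptop", SUBCLASS_OF)
```

Only three direct links are stated, yet a Laptop is found to be a Product. An inferencing engine does this kind of chaining automatically, which is why messy hierarchies produce messy inferences.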

Here are some reasons to utilize classes in semantic technologies as a best practice.  Without implementing classes and modeling, these outcomes would be hard to achieve:

Form follows function: Instead of designing big monolithic hierarchical taxonomies, think in terms of classes or facets, which are groupings of individual members in a set. These smaller, faster sets will be easier to export, import, edit and share. Perhaps facets should be called fast sets, or fasets! Plus, the facets (classes) can become fields in a web form. The possibilities for reuse and design open many options.

Scalability and Reuse: Since concepts and the associated classes are independent of data and content, the concepts and classes can be changed, such as changing an organization name, renaming key terms, or adapting new ideas, without changing underlying queries and systems architecture. This is scalable.

Change Schema Without Changing Content: Conceptual mapping can be developed independently, and designed and changed in RDF Schema or OWL, without changing the underlying data.

Precision: Because an individual concept can be easily manipulated as a member of a set, or of multiple sets, the concept can have a more accurate definition. For example, take a term like “Chevy Chase.” By associating “Chevy Chase” with the class Person, one can distinguish Chevy Chase, the comedian, from Chevy Chase, Maryland, which belongs to the class Location. Furthermore, ideally each unique concept of Chevy Chase would have its own namespace or Uniform Resource Identifier (URI). A concept can be created independently of the content and without tight coupling into a hierarchy, while still associating clearly with the appropriate facet or class. This same logic can be applied to more amorphous, squishy terms like “Compensation,” “Performance,” “Management” or “Quality,” which can be deconstructed into more specific variants like “Executive Compensation” vs. “Non-exempt Pay and Benefits.” RDFS can be used to link to the more appropriate term with a unique URI.
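The Chevy Chase example might be sketched like this. The `example.org` URIs and the record layout are placeholders, not identifiers from any real vocabulary.

```python
# Two distinct concepts that share one label; the URIs are placeholders.
CONCEPTS = [
    {"uri": "http://example.org/person/chevy-chase",
     "label": "Chevy Chase", "cls": "Person"},
    {"uri": "http://example.org/location/chevy-chase-md",
     "label": "Chevy Chase", "cls": "Location"},
]

def lookup(label, cls=None):
    """Return the URIs of concepts with this label, optionally limited to one class."""
    return [
        c["uri"] for c in CONCEPTS
        if c["label"] == label and (cls is None or c["cls"] == cls)
    ]

ambiguous = lookup("Chevy Chase")          # both concepts match the bare label
person = lookup("Chevy Chase", "Person")   # the class resolves the ambiguity
```

The label alone is ambiguous, while the combination of label and class points to exactly one URI.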

Facilitate Linked Data: If taxonomies and data can be shared, it is faster to build serious applications that can solve real and acute problems. In our class, we built applications that mapped where free wifi hot spots were next to swimming pools and taquerias, but we also did a serious social policy application where we mapped cities in the United States that had increases in complaints about housing discrimination based on sexual orientation, national origin, race and other grounds, taking data from multiple, reputable sources and applying a common conceptual model.

There are some new challenges for taxonomists, especially in understanding the importance of inferencing. What developers who work with OWL report is that many inferencing errors can be traced back to bad, messy taxonomies where there are too many broad terms. In other words, avoid complex polyhierarchies.

To create taxonomies that are ready for the semantic future, the better practice is to arrange concepts into facets (which can be equated with classes or sets) and to avoid complex polyhierarchies (a concept with too many parents). This will allow taxonomies to play well with applications such as user interface design and machine-readable applications. The first step is to stop thinking about a taxonomy as a monolithic hierarchy, and rather to look at it as a collection of classes (or facets), where a class is a set with individual members. If models and taxonomies can be easily built and used to connect across data worksheets and resolve issues, applications based on linked data can be built quickly.

To try semantic tools such as SKOS editors, download a trial copy of TopBraid Composer Free Edition.

~Marlene Rockmore

Is GoodRelations a Game Changer?

One ontology worth watching might be GoodRelations, which is being implemented by Best Buy. The central component of Best Buy’s architecture is this ontology, developed by Martin Hepp, who presented at SemTech in San Francisco last week via Skype from Munich, Germany. GoodRelations is a retail ontology which uses RDFa in XHTML webpages to populate a global ontology. But why would a major retailer use this architecture?

Best Buy discovered that it was impossible to be the top dog in search engine optimization (SEO) in every search category for every product. To do this, they needed to have finely tuned individual pages. They also wanted to provide immediate content about “open box” items, that is, returned items at local stores. They were looking for a solution that could add more granularity, precision and localization, but still enable global search and have metadata that was controlled by the enterprise.

GoodRelations is a retail ontology, which offers facets or classes, metadata descriptions and attributes that are common in the retail industry. It is expressed in RDFa, which is a flavor of RDF that works in web browsers. Yahoo SearchMonkey supports RDFa, Facebook directed graphs will support RDF, and Google snippets also support RDFa.

Because there is common metadata, it is easy for employees or customers (who are called “user agents” in the semantic world) to tag content via templates which populate the RDF.  RDFa can be maintained in a corporate or enterprise repository which can be configured as needed for distribution in the enterprise.

In the GoodRelations RDF, the additional metadata might include price, color, dimensions, model and other attributes that interest consumers. GoodRelations is an ontology that can be shared by any retail enterprise in any country. The cost per webpage, once implemented, is minimal because “user agents” are familiar with how to complete forms over the web. The RDFa can then be appended to a page written in XHTML or HTML5. The HTML code for adding the specific metadata attributes is about 30-50 lines. This creates HTML that has more granularity than a typical keywords meta tag. The high costs are in the metadata management.
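To give a feel for what a form-driven template might emit, here is a hedged sketch that builds a small RDFa fragment from form values. This is not Best Buy’s template. The `offering_rdfa` function and the sample values are hypothetical, and the GoodRelations terms used (`gr:Offering`, `gr:name`, `gr:description`, `gr:hasPriceSpecification`, `gr:UnitPriceSpecification`, `gr:hasCurrency`, `gr:hasCurrencyValue`) should be checked against the GoodRelations documentation before use.

```python
from html import escape

# Namespace of the GoodRelations vocabulary.
GR_NS = "http://purl.org/goodrelations/v1#"

def offering_rdfa(name, description, price, currency):
    """Build an RDFa fragment describing one offer (illustrative template only)."""
    return (
        f'<div xmlns:gr="{GR_NS}" typeof="gr:Offering">\n'
        f'  <span property="gr:name">{escape(name)}</span>\n'
        f'  <span property="gr:description">{escape(description)}</span>\n'
        f'  <div rel="gr:hasPriceSpecification">\n'
        f'    <div typeof="gr:UnitPriceSpecification">\n'
        f'      <span property="gr:hasCurrency" content="{escape(currency)}"></span>\n'
        f'      <span property="gr:hasCurrencyValue">{price:.2f}</span>\n'
        f'    </div>\n'
        f'  </div>\n'
        f'</div>'
    )

# Values as they might arrive from a web form filled in by store staff.
snippet = offering_rdfa(
    "Open-box camera", "Returned item, tested & repackaged", 199.0, "USD"
)
```

A real page would carry more attributes (color, dimensions, model), which is how the markup grows to the 30-50 lines mentioned above.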

Adding RDFa as metadata to a webpage should be easy to adopt because it works in the current web paradigm. Google supports RDFa markup that can be appended to a webpage, through a feature called Google Rich Snippets. Within Rich Snippets, RDFa competes with another format, microformats. The problem is that every domain needs a shared set of metadata attributes to enable search across smaller domains. Google is rolling out examples of RDFa for restaurants, and currently has only 2500 marked-up pages. To see an example of snippets, try a search on Google for “Baked Ziti.” Drupal 7 also offers RDF, and has been implemented in http://www.whitehouse.gov, as part of the Obama Administration transparency initiative.

Why does this interest me as a classy taxonomist (and future ontologist)? Clearly, this technology has evolved to a point of adoption, but further adoption depends on political and organizational work to get other applications to take the risk of trying RDFa. RDFa depends on common adoption of similar metadata, and that requires political and organizational skills to define and manage common metadata knowledge models. First, taxonomists understand vocabulary and metadata as a way to capture common knowledge. Second, if this innovation becomes more widely adopted and gains traction, there may be interest in building similar processes in other applications where information has to be shared.

Further, if RDFa, coupled with ontology and metadata management, makes data management and querying easier through SPARQL, then more attention can be paid to the political and organizational work of working with local agencies to contribute good data and content.

There is a long way to go to make this vision a reality: browsers have to adopt RDFa, applications have to prove its viability, and ontologies in other domains need to be created. But in the long run, this might be a more democratic way to extend information access on the web.

However, to move toward this vision, faceted navigation and defining common metadata and taxonomies are a good intermediate step. By creating faceted taxonomies and browsing, and by collecting data, user communities are moving towards understanding which search fields, common language, and unambiguous terms matter to their users. A little semantics goes a long way.

~Marlene Rockmore

Thinking in Triples : Quick start for adapting taxonomies for semantic web

It’s time to get comfortable with ontologies, RDFa, SPARQL and OWL.    After a few days at  Drupal Design Camp at MIT and SemTech09  in San Jose,   I’m convinced more than ever that  it’s time to start thinking about ontologies.      It’s time to think in triples.

Why do RDF and ontologies matter? To understand why RDF matters, it might help to define ontology. An ontology consists of concepts that are fully described and where all the ambiguity has been resolved. Extracting meaningful links from databases and putting these concepts in a separate search structure solves many problems. It makes search engine indexes richer than standalone keywords that have no context, it simplifies building indexes for programmers, it allows filtering of data by facets, and it enables visual interfaces (in the future; that’s the dream). Thinking through the conceptual links can give structure to unstructured data so it can be interpreted and analyzed by programs. For the last year Yahoo has experimented with RDF-enhanced search in a project called SearchMonkey. SearchMonkey users add structured metadata to their web pages using XHTML/RDFa, which enhances their search and changes how their data is displayed in the results.

What’s RDFa? RDF stands for Resource Description Framework. What the ‘a’ stands for was not clear to the people I asked. Drupal’s Benjamin Melancon said it might be the first version. At SemTech in San Jose, it was suggested that it stands for RDF+XHTML. The W3C expands it as “RDF in attributes.” RDFa is a set of HTML attributes that link a concept on a page to a schema or data definition. RDFa is used to create a statement that consists of a subject-predicate-object (or a “triple”). The important thing here is the predicate, which is the link between a subject and an object. For example, in the triple “<person> <has> <skills>”, has is the predicate. Take triples such as “<person> <works at> <company>” or “<celebrity> <is dating> <celebrity>”. The verbs works at and is dating are the predicates. Instead of using the ANSI standard library language of broad and narrow terms, ontologies are implemented using XHTML/RDFa/XML as the enabling technology and can create these lively predicates.
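Thinking in triples does not require any special tooling to try out. The sketch below stores a few made-up triples as plain Python tuples and filters them by any combination of subject, predicate and object; real triple stores and SPARQL do far more, so treat this as an illustration of the shape of the data only.

```python
# Made-up statements in (subject, predicate, object) form.
TRIPLES = [
    ("Alice", "has", "Python skills"),
    ("Alice", "works at", "Acme Corp"),
    ("Bob", "works at", "Acme Corp"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that was specified."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

# Who works at Acme Corp? The predicate carries the meaning of the link.
employees = [s for s, _, _ in match(TRIPLES, predicate="works at", obj="Acme Corp")]
```

A broader/narrower thesaurus link is just one predicate among many in this model, which is the flexibility an ontology adds.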

How does an ontology differ from a taxonomy or thesaurus? Simply put, ontologies allow hierarchical relations just like taxonomies, but there is also some flexibility in defining links or connections between terms. That’s the use of predicates.

CEOs and CIOs are recognizing the value of taxonomies and ontologies in managing information. Times are changing when business managers start saying that adding or modifying a term in the taxonomy can be a faster change than trying to modify a database. Ontologies and taxonomies are perceived as responsive to changes in concepts, as opposed to databases, which have a static structure and a query language that has to be modified through an IT process. The taxonomy can be modified by a “user” or subject matter expert, without programming intervention. That’s empowering.

Here are some simple ways to get started without learning any RDFa:

  • If you have a taxonomy, pay attention to ambiguous terms. Create categories (also called facets) where terms can be placed comfortably. Don’t put square pegs in round holes. For example, if you have a building products application, you can classify “Green” under “Building Products” and under “Color.” Green building products and the color green are 2 separate, distinct concepts, just as Lincoln, the American President, and Lincoln, Nebraska are distinct terms. Don’t forget you are classifying concepts, unique ideas, not keywords! By classifying terms to a category, you give terms context and meaning.
  • Connect terms with links between concepts. No term should stand alone without a relationship to another term or category, and every term should be disambiguated by being linked to larger concepts. Try to have at least 3 touchpoints for your term, such as a broader category, a synonym and a link or predicate. If you are uncertain about how to classify a term, put it in an “emerging concepts” category while you get some more input about intent. Simple relationships to look for are hierarchical relations, such as broad term, parent-child, part-of or type-of, and synonyms, where terms are same-as or very closely related in meaning.
  • Research context and intent! Find out how your users are looking for information. How do they want to use information? What types of analysis are they doing? Collect this important user-centered research to begin to capture awareness of the situational and contextual process. That means the term has been placed in a context and also reflects intent, or how the term will be used. Context and intent are important to resolving ambiguity. Context is about location, process or role, time, or situation: terminology sits in a context or data structure that captures meaning. For example, think of a term that has local variations, such as “milk shake.” In most of the United States a milk shake has ice cream, but in Boston you need to order a “frappe.” Otherwise, you’ll get shaken milk. Intent looks at information from different user perspectives. Got an upcoming New Product Announcement? The engineer cares about how it is built, the CEO looks at the revenue, and the lawyer looks at the contracts and licenses. From each perspective, the term New Product Announcement has different meanings.
  • Try blogging tools to see how taxonomy works in user interfaces and how easy it is to add and modify concepts on the fly. TypePad, Movable Type and Drupal blogging software all support RDFa. Drupal can be downloaded from Acquia.com.
  • Try using taxonomy management tools: test drive a taxonomy tool such as Data Harmony or Synaptica, or try one of the free ontology tools. TopBraid Composer is available for free, as is Protégé from Stanford University. You might find that traditional taxonomy tools such as Data Harmony and Synaptica are sometimes easier to learn and can produce OWL and SKOS output, which is compatible with XHTML.
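The “3 touchpoints” guideline from the list above lends itself to a simple check, even in a spreadsheet-sized taxonomy. The following sketch assumes each term records its broader categories, synonyms and related links; the sample terms and the field names are invented.

```python
# Hypothetical taxonomy entries: each term lists its connections.
TAXONOMY = {
    "Green (color)": {"broader": ["Color"], "synonyms": ["Verdant"], "related": ["Paint"]},
    "Frappe": {"broader": ["Beverages"], "synonyms": ["Milk shake"], "related": []},
    "Widget": {"broader": [], "synonyms": [], "related": []},
}

def touchpoints(entry):
    """Count the broader categories, synonyms and related links of one term."""
    return sum(len(entry[key]) for key in ("broader", "synonyms", "related"))

def triage(taxonomy, minimum=3):
    """Split terms into well-connected ones and candidates for 'emerging concepts'."""
    connected, emerging = [], []
    for term, entry in taxonomy.items():
        (connected if touchpoints(entry) >= minimum else emerging).append(term)
    return connected, emerging

connected, emerging = triage(TAXONOMY)
```

Terms that land in `emerging` are not wrong; they are the ones that still need input about context and intent before they are placed.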

Here’s the best part. Taking a step back to use good methodology, including understanding information problems, capturing views of information based on user needs, and disambiguating and categorizing terminology, is the best practice for taxonomies in whatever form, regardless of whether the vocabulary is a list, taxonomy, thesaurus or ontology.

~    Marlene Rockmore
