Skills of a Classy Taxonomist

At SemTech in June 2010, several speakers, including Professor Deb McGuinness, drew a very clear line between what a taxonomist does and what an ontologist does. Taxonomists build hierarchies, and ontologists determine classes or categories. In other words, ontologies are neat and unambiguous, and taxonomies are a bit messy.

Defining classes, or ontology work, typically precedes building the taxonomy. Defining the classes is like writing a specification for the taxonomy; in fact, defining classes is the same as defining facets. The goal of taxonomists and ontologists should be a specific, unambiguous description of each term, so that the pathways for finding and organizing content are clear; adding an ontology ensures that each term is placed in the most specific category. I would argue that no taxonomy is useful unless it is faceted, that is, divided into classes. Taxonomies work best when their terms share homogeneous properties, and when they are small and focused.

By using class analysis, or facet analysis,  several problems are solved:

1) Clarify specific terms by situation or function: If I am interested in Java as a programming language, I want to see material related to Java as software, not as slang for coffee or an island in Indonesia. If I am looking for “drill bits,” it might be important to understand whether the drill bits are for my home electric screwdriver or for an oil rig. Classes capture these distinctions, and help to create precise, specific tagging and information retrieval.

2) Ease long-term maintenance: Christine Connors points to a simple but common example in which people’s names are included as narrow terms under a role, such as “Hillary Clinton” under “Secretary of State” or “Charles Windsor” under “Prince of Wales.” The problem is that when the people filling these roles change, there is a maintenance headache. A classy taxonomy recognizes that there is a separate class for <people> as an entity, as distinguished from <role>. <People> and <Role> can be connected by a predicate such as <isA>. These distinctions are necessary for fast-changing information (such as who is dating whom in an entertainment application, or who owns whom in a business application).

Abstraction: <person> <has> <role>

Instance: Hillary Clinton <is>  Secretary of State

3) Facilitate sharing and importing taxonomies: A taxonomy that is specified by a class description will be more homogeneous, have shared properties, and be more focused. This makes it easier to import with less cleanup and review. It also facilitates the use of SKOS, for example. Messy taxonomies are harder to merge.
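The person and role separation in item 2 can be sketched in a few lines of Python. This is a minimal illustration, not a real triple store: the hasRole predicate name, the helper functions, and the successor “Jane Doe” are all assumptions made up for the example.

```python
# Two separate classes (facets). Neither hierarchy contains the other's terms.
people = {"Hillary Clinton", "Charles Windsor"}
roles = {"Secretary of State", "Prince of Wales"}

# Links between the classes, stored as (person, predicate, role) triples.
# "hasRole" is an illustrative predicate name, not taken from a standard.
holds = {
    ("Hillary Clinton", "hasRole", "Secretary of State"),
    ("Charles Windsor", "hasRole", "Prince of Wales"),
}


def current_holder(role, triples):
    """Return the people linked to a role, sorted for stable output."""
    return sorted(p for (p, pred, r) in triples if pred == "hasRole" and r == role)


def reassign(role, new_person, triples, known_people):
    """Return new (triples, people) in which only new_person holds the role."""
    kept = {t for t in triples if not (t[1] == "hasRole" and t[2] == role)}
    return kept | {(new_person, "hasRole", role)}, known_people | {new_person}


# "Jane Doe" is a hypothetical successor, used only for illustration.
new_holds, new_people = reassign("Secretary of State", "Jane Doe", holds, people)
print(current_holder("Secretary of State", new_holds))
```

When the office changes hands, only one link changes. The role hierarchy itself is never edited, which is the maintenance benefit the example is meant to show.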

Anyone working with semantic technologies will tell you that most problems in inference happen when hierarchies in source taxonomies create odd associations by inserting a narrow or broad term. A taxonomist needs to be attentive to inferences in order to prevent false statements. Professor Deb McGuinness calls this issue “truth maintenance.”

To keep these categories clear and distinct, ontologists rely on building a conceptual model, or picture, of the domain (see the earlier post, Taxonomies and Modelling). Modeling draws on skills most taxonomists already have: most have been taught how to capture vocabulary and how to identify facets.

Elaine Kendall of Sandpiper Software, which makes a concept-modelling tool, suggested that “one could build an ontology in 2 hours.” With the new generation of tools that can create RDF/OWL from data and content, this statement might be true.

    With good modelling tools that automatically generate RDF/OWL, such as those from TopQuadrant, taxonomists might be able to slide into the needed role of ontologist. To extend their skills, taxonomists need to understand some basic RDF/OWL concepts: what a class is, what a property is, what a slot facet is, what class inheritance is, what is meant by reciprocal and inverse properties, and how to write a SPARQL query. But more importantly, a classy taxonomist can become a facilitator who helps build bridges between user and development communities and helps diagnose and prevent technical problems.
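For readers new to these concepts, here is a rough sketch of class inheritance, inverse properties, and a query-like pattern match in plain Python. It is not RDF/OWL or SPARQL, and every class name, property name, and helper function below is an assumption invented for illustration.

```python
# Class hierarchy: child class -> parent class (all names are illustrative).
SUBCLASS_OF = {"Politician": "Person", "Person": "Agent"}

# Inverse property pairs: asserting one direction implies the other.
INVERSE_OF = {"hasRole": "isRoleOf", "isRoleOf": "hasRole"}


def superclasses(cls):
    """Walk up the hierarchy and return every ancestor of cls, nearest first."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def with_inverses(triples):
    """Return the triples plus the inverse statement for each known property."""
    result = set(triples)
    for s, p, o in triples:
        if p in INVERSE_OF:
            result.add((o, INVERSE_OF[p], s))
    return result


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None plays the part of a query variable."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


facts = with_inverses({("Hillary Clinton", "hasRole", "Secretary of State")})
print(superclasses("Politician"))
print(match(facts, s="Secretary of State"))
```

The match function stands in for the simplest kind of SPARQL query, a single triple pattern with variables. A real ontology tool would also apply the inheritance and inverse rules automatically through a reasoner.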

    A taxonomist who is trained in ontologies should be able to:

    • Create processes to identify the requirements for each class
    • Develop metrics to assess good results
    • Identify what vocabularies are needed, evaluate existing vocabularies, and import and adapt them to current needs
    • Ensure the integrity and focus of vocabularies, particularly when sourced from an outside vendor
    • Develop processes to keep vocabularies current, and use metrics to “measure and improve” them
    • Work as part of the development team to identify whether a source vocabulary might contribute to false inferences

    The taxonomist works with different user communities as well as developers and helps bridge the gap between what users and experts know and what is needed to build a useful application. A classy taxonomist has a well-rounded set of skills for working with development teams and user organizations to build intelligent systems.

    Is GoodRelations a Game Changer?

    One ontology worth watching might be GoodRelations, which is being implemented by Best Buy. GoodRelations, the central component of Best Buy’s architecture, was developed by Martin Hepp, who presented at SemTech in San Francisco last week via Skype from Munich, Germany. It is a retail ontology that uses RDFa embedded in XHTML web pages to populate a global ontology. But why would a major retailer use this architecture?

    Best Buy discovered that it was impossible to be the top dog in search engine optimization (SEO) in every search category for every product. To compete, they needed finely tuned individual pages. They also wanted to provide immediate content about “open box” items, meaning returned items available at local stores. They were looking for a solution that could add more granularity, precision and localization, but still enable global search and keep metadata under the control of the enterprise.

    GoodRelations is a retail ontology that offers facets or classes, metadata descriptions and attributes that are common in the retail industry. It is expressed in RDFa, a way of embedding RDF in the attributes of web pages. Yahoo SearchMonkey supports RDFa, Facebook’s Open Graph will support RDF, and Google Rich Snippets also support RDFa.

    Because there is common metadata, it is easy for employees or customers (who are called “user agents” in the semantic world) to tag content via templates which populate the RDF.  RDFa can be maintained in a corporate or enterprise repository which can be configured as needed for distribution in the enterprise.

    In the GoodRelations RDF, the additional metadata might include price, color, dimensions, model and other attributes that interest consumers. GoodRelations is an ontology that can be shared across any retail enterprise in any country. The cost per webpage, once implemented, is minimal because “user agents” are familiar with how to complete forms over the web. The RDFa can then be added to a page written in XHTML or HTML5. The HTML code for adding the specific metadata attributes is about 30-50 lines. This creates HTML that has more granularity than a typical <keyword> metatag. The high costs are in the metadata management.

    Adding RDFa as metadata to a webpage should be easy to adopt because it works in the current web paradigm. Google supports RDFa markup that can be added to a webpage through a feature called Google Rich Snippets. Within Rich Snippets, RDFa competes with another format called microformats. The problem is that every domain needs a shared set of metadata attributes to enable search across smaller domains. Google is rolling out examples of RDFa for restaurants, but currently has only 2500 marked-up pages. To see an example of snippets, try a search on Google for “Baked Ziti.” Drupal 7 also offers RDF, and has been implemented at http://www.whitehouse.gov as part of the Obama Administration transparency initiative.
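To give a feel for what such markup looks like, here is a hedged sketch that renders a small RDFa fragment from form-style input. The product, price, and function name are invented, and the handful of GoodRelations terms used (gr:Offering, gr:name, gr:hasPriceSpecification, gr:UnitPriceSpecification, gr:hasCurrency, gr:hasCurrencyValue) should be checked against the GoodRelations specification before real use. A production page would carry many more attributes than this.

```python
from html import escape

GR_NS = "http://purl.org/goodrelations/v1#"  # GoodRelations namespace


def offering_rdfa(name, price, currency):
    """Render a small RDFa fragment describing one product offer."""
    return "\n".join([
        f'<div xmlns:gr="{GR_NS}" typeof="gr:Offering">',
        f'  <span property="gr:name">{escape(name)}</span>',
        '  <div rel="gr:hasPriceSpecification">',
        '    <div typeof="gr:UnitPriceSpecification">',
        f'      <span property="gr:hasCurrency" content="{escape(currency)}"></span>',
        f'      <span property="gr:hasCurrencyValue">{price:.2f}</span>',
        '    </div>',
        '  </div>',
        '</div>',
    ])


# Hypothetical "open box" item, entered the way a store employee might fill in a form.
snippet = offering_rdfa("Open-box digital camera", 129.99, "USD")
print(snippet)
```

The point of the template approach is that the person filling in the form never sees the markup. The vocabulary choices are made once, centrally, which is where the metadata management cost sits.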

    Why does this interest me as a classy taxonomist (and future ontologist)? Clearly, this technology has evolved to a point of adoption, but further adoption depends on getting other applications to take the risk of trying RDFa. RDFa depends on common adoption of similar metadata, and that requires political and organizational skills to define and manage common metadata knowledge models. First, taxonomists understand vocabulary and metadata as a way to capture common knowledge. Second, if this innovation gains traction, there may be interest in building similar processes in other applications where information has to be shared.

    Further, if RDFa, coupled with ontology and metadata management, makes data management and querying easier through SPARQL, then more attention can be paid to the political and organizational work of getting local agencies to contribute good data and content.

    There is a long way to go to make this vision a reality: browsers have to adopt RDFa, applications have to prove its viability, and ontologies in other domains need to be created. But in the long run, this might be a more democratic way to extend information access on the web.

    However, to move toward this vision, faceted navigation and the definition of common metadata and taxonomies are a good intermediate step. By creating faceted taxonomies and browsing, and by collecting data, user communities move towards understanding which search fields, common language, and unambiguous terms matter to their users. A little semantics goes a long way.

    ~Marlene Rockmore

    Taxo-ology

    This week, I am at the 2010 Semantic Technology conference, where there are technologists who have built ontologies. So this seems like the place to find out what exactly the difference is between an ontology and a taxonomy, and what skills will matter.

    In the ontology world, a taxonomy, strictly speaking, is a hierarchical arrangement of terms. Taxonomists populate term nodes, decide on the form of each term and its variants, equivalents, and semi-equivalents, and create hierarchies. Ontologists do the heavy lifting: they decide what the classes will be, define the links, and generate RDF and OWL.

    But there is a bright spot in this rather dull picture of taxonomy work. The most progressive and insightful taxonomists insist on sorting terms into facets or classes. These facets are derived from an analysis of user needs, content, and domain knowledge. The core of an ontologist’s work is also to define classes or facets and the links between them. These links can then be inherited or asserted between classes. A taxonomist who hasn’t thought about classes and design will create a taxonomy that looks like spaghetti, and an ontologist who lacks that skill can create an ontology that makes bad inferences and assertions.

    The bottom line is that there is overlap between taxonomy and ontology, so I would like to suggest a term to describe this synergy: Taxo-ology. By thinking in terms of Taxo-ology, we can begin to find overlap and synergy between taxonomists and ontologists:

    • Facets and classes:  Both taxonomists and ontologists need to create classes in which to classify terms.
    • Discipline in Creating Homogeneous Hierarchies: Hierarchies, ideally, should have homogeneous properties. For example, Secretary of State is a constitutional office of the United States; Hillary Rodham Clinton is filling that role, but it is one of many roles she has had. Christine Connors, a semantic web guru, uses “Prince of Wales” as her example. That role is there whether or not Charles is Prince. It is part of the institution of English Monarchy. Even for the practical reason of long-term maintenance, these entities need to be in their own class (facet) and linked.
    • Greater Use of Linkages using Associative Relationships: Once terms are sorted into homogeneous buckets, associative relationships (sometimes with semantic labels for the relationship) can be used to link between classes or between term nodes within a class.
    • Better Skill Sets: Someone who is a Taxo-ologist knows how to use rich ontology tools, like TopBraid, and understands OWL and XML output, but can also adapt to other tools and content management software such as auto-categorization. A taxo-ologist can apply the best practices of building classes/facets, homogeneous hierarchies, and associative relationships.
    • Better models for paying Taxo-ologists: Taxonomists sometimes get paid by the number of terms built out, but in the world of taxo-ology, compensation needs to be based on results. Sometimes the result is strategic (is our organization collecting, sharing and exchanging information about changing market, technical and economic conditions?) and sometimes tactical (the need to find the right SOP at the right time). Search is a great example of how less is more: good taxo-ologists can make smaller, sleeker taxonomies that can be used to auto-tag concepts across facets, or smaller taxonomies that match more user queries because of their use of variants.
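The last point, small vocabularies that auto-tag across facets and match queries through variants, can be sketched roughly as follows. The vocabulary, facet names, and functions are all made up for illustration, and the matching is deliberately naive (whole lowercase phrases separated by spaces, with no handling of punctuation or stemming).

```python
# A small faceted vocabulary: facet -> preferred term -> variants.
# All terms are illustrative, not drawn from a real taxonomy.
VOCAB = {
    "Programming Language": {"Java": ["java language", "java se", "jdk"]},
    "Beverage": {"Coffee": ["java", "espresso", "cup of joe"]},
    "Place": {"Java (island)": ["java island", "jawa"]},
}


def build_index(vocab):
    """Map each lowercased label or variant to (facet, preferred term) pairs."""
    index = {}
    for facet, terms in vocab.items():
        for preferred, variants in terms.items():
            for label in [preferred, *variants]:
                index.setdefault(label.lower(), []).append((facet, preferred))
    return index


def tag(text, index):
    """Return sorted (facet, preferred term) pairs whose labels occur in text."""
    padded = f" {text.lower()} "
    hits = set()
    for label, targets in index.items():
        if f" {label} " in padded:
            hits.update(targets)
    return sorted(hits)


INDEX = build_index(VOCAB)
print(tag("Install the JDK before lunch", INDEX))
print(tag("java", INDEX))
```

A variant such as “JDK” resolves to a single facet, while the bare word “java” comes back under two facets. Surfacing that ambiguity, so that the interface or the user can choose, is what the facets buy you.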

    Taxologist seems like a good word to help bridge the gap between these disciplines, but there needs to be discussion and synergy between the taxonomy community and the ontology world. Taxonomists need to apply more discipline to how they do their work and embrace the autocategorization and semantic tools that make it easier to process content. The semantic world can save some time in its development process by learning from the practical experience taxonomists have built up by working in enterprises and libraries, doing card sorts, understanding user experience, analyzing content, and merging all that with domain knowledge.

    My goal this week is to find out more about what will help semantic technologies gain more traction, what are the practical, killer applications, and what are the future skills.    Be sure to stop by Christine’s booth to find out more about how ontologists can help with strategic information management and technical integration with semantic web technologies.

    ~ Marlene Rockmore (blogging from SemTech San Francisco 2010)

    First Aid for the Accidental Taxonomist

    Many successful information systems utilize taxonomies and metadata, but finding taxonomists to support this work usually happens by accident. Taxonomy design and development is a specialized skill – maybe even a talent. A large organization may employ information architects, SharePoint architects, content managers, and corporate librarians, but these people most likely lack strong taxonomy experience. Although the closest matching formal education for taxonomy work is a master’s in library and information science, many corporate librarians specialize in research (such as business intelligence) and may know about classification and organization of information only from a single library school course. Information architects’ experience with taxonomies may be limited to small taxonomies that fit within the limits of menu labels.

    When an organization decides it needs an enterprise taxonomy or needs to leverage and redesign existing taxonomies, then any of these aforementioned types of employees are often pressed into working as taxonomists, without prior experience.   This is what I refer to as the “accidental taxonomist,” as in the title of my recent book (see Heather Hedden, The Accidental Taxonomist, Information Today Inc., 2010 ).

    While reading my book is a good idea for anyone who becomes an accidental taxonomist, a book alone cannot teach all the needed skills. Designing and building taxonomies is a process that is fraught with decision-making. If a taxonomist or taxonomy manager is not a defined position within an organization, the “accidental taxonomists” who temporarily assume this role, no matter how skilled, still have their regular jobs to do and may not be able to devote the needed time for the taxonomy.

    Starting with a good taxonomy foundation will make it easier to maintain the taxonomy. It will save money, time and resources to get some outside help, especially during the initial stage of taxonomy development, which requires the greatest investment of hours.

    How can a taxonomy consultant help?

    • How should terms be assigned to facets?
    • Should hierarchies be deeper or broader?
    • Is a complex hierarchy needed, or will simpler arrangements work?
    • Should taxonomy term labels be complex or simple?
    • What governance is needed for long-term management of the taxonomy?
    • Who should be on the governance team, and what training is needed?

    Related to different levels of experience, there is also a distinction between explicit knowledge, which may be explained in a book, and tacit knowledge, which is gained through expertise and is more difficult to explain or document. Taxonomists are trained to follow the industry standard guidelines, such as ANSI/NISO Z39.19-2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. But these are just “guidelines,” and in practical applications the guidelines may need to be modified slightly, such as when there are significant constraints on the taxonomy design. Knowing where and when it is appropriate to bend the rules, and when it is not, is a part of tacit knowledge. Having the right knowledge, however, does not necessarily mean the taxonomy work gets done.

    Even within the narrower area of taxonomy expertise, it often helps to discuss and work out issues among multiple people who have an understanding of taxonomies. Taxonomy work in a large organization can be a team effort. It requires different skills and perspectives to serve all its goals. In addition to the lead taxonomist with an information science background, other people needed include information architects and user experience professionals to ensure that the taxonomy fits well into the user interface and is easy to use, subject matter experts as authorities on the terminology, and IT professionals for the technical implementation of the taxonomy.

    If you are the sole taxonomist in your organization, you may want to consult with other, outside taxonomists, such as through online discussion groups, to bounce your ideas off them and get additional feedback based on their varied expertise. It’s hard to work alone without support.

    If you are at the SLA Annual Conference in New Orleans on June 16, come hear my talk, “Taxonomy Made Easy: An Introduction to Taxonomies for the Accidental Taxonomist.” SLA members are mostly corporate librarians, who are likely candidates to become accidental taxonomists. I’ll help you develop your own taxonomy skills and also identify where you might need to talk to your management about consulting with others skilled in taxonomies. First aid for the accidental taxonomist is always available!

    ~ Submitted by Heather Hedden

    Google’s Wonderwheel

    Google is trying out a new feature.  Click on other options on a search and then try the option called “WonderWheel.”    Here is a Wonderwheel result for the concept “Dog Parks.”

    Google's WonderWheel for "Dog Parks" May 5, 2010

    I’m curious whether anyone has advice on how to optimize a site to be classified in the Wonderwheel: can I use metadata, words in the text, SEO tricks?

    If anyone has insights into how Wonderwheel works, please post or contact me offline.

    ~  Marlene Rockmore

    Search Patterns and Faceted Taxonomies

    Peter Morville and Jeffery Callender have produced a beautiful manifesto for improving search, called Search Patterns: Design for Discovery (O’Reilly, 2010). It is an ode to making complex data beautiful and navigable in user interfaces. It’s nice to see O’Reilly produce a book with visual flair.

    But once you journey through the many beautiful interfaces and design principles on how to present data, you realize that there is still a need to understand that data presentation is related to data organization. Morville hints at how data is organized to facilitate these interfaces. In Chapter 2, on the anatomy of search, the authors write that sites should “embrace faceted navigation… Global facets might include topic, format, date and author.” Morville downplays the role of formal hierarchies, focusing instead on the user experience of multiple interactions, from “pearl growing” to browsing to managing your data, to work towards a more immediate user experience. Faceted navigation is described as “arguably the most significant search innovation of the past decade” (p 95), but there is only one short chapter, called Engines for Discovery, that discusses how to create faceted navigation.

    The data organization that combines the product taxonomy with other facets is called “unified discovery.” In the engines of this discovery (Chapter 6), and this is where we get into the expanded role of the taxonomist, the task is to add facets for:

    • Category: broad classifications that vary by application,
    • Topics:  the smaller areas of common interest  such as specific cars or books or recipes
    • Format: how data is formatted whether as content, video, or idea
    • Audience:  the fundamental activity of understanding the needs of who might need the data, from scholar and expert to novice browser

    This global “one size fits all” recommendation leaves out Time and Chance: time, meaning when an object was produced, and chance, meaning whether it happens to be highly respected and relevant to the needs of users. Date and date range are an important global facet. Whether there is an “out of the box” global taxonomy is probably up for debate. Facets, including how many there are and how they are labeled, need to be validated against user need, application and content. A global model is a good starting point, but will probably need to be tuned. Search across health care policies, for example, probably requires facets for diseases, symptoms, treatments, and additional resources. Determining the top categories can take some time, so that these categories reflect common shared knowledge and vocabulary. The top facets do not have to be 5 or 7 plus or minus 2, but rather whatever is needed by the application, by the users, and to organize the content. Get over fixed universality rules and instead collect more data about user needs and content.

    These navigations rely on separate and distinct data structures which allow users to navigate and refine queries before they are passed to the underlying database or data structures. These data structures need to be maintained, governed and analyzed. Over time, the richer this conceptual metadata, the better the search experience; better techniques for creating and using metadata are just around the corner.
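As a rough picture of the kind of data structure that sits behind faceted navigation, here is a minimal sketch in Python. The records, facet names, and function names are invented for illustration; a real search engine would hold this in an index rather than a list of dictionaries.

```python
from collections import Counter

# Illustrative records; the facet names echo the facets discussed above.
ITEMS = [
    {"title": "Dog park rules", "topic": "Parks", "format": "PDF", "audience": "Public"},
    {"title": "Dog park map", "topic": "Parks", "format": "Map", "audience": "Public"},
    {"title": "Leash law summary", "topic": "Law", "format": "PDF", "audience": "Expert"},
]


def refine(items, **selected):
    """Keep only items whose facet values match every selected facet."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count remaining items per value, as shown beside each facet link."""
    return Counter(item[facet] for item in items if facet in item)


parks = refine(ITEMS, topic="Parks")
print([item["title"] for item in parks])
print(facet_counts(parks, "format"))
```

Each click on a facet value is just another call to refine, and the counts tell the user what is left before they commit. The quality of the experience depends entirely on how well the facet values were assigned in the first place, which is the taxonomist’s part of the job.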

    On taxonomies and ontologies, the authors specifically argue that there may be other approaches to disambiguating terms (like Java the programming language from Java the island) based on clues like user and context rather than vocabularies:

    “It’s not that there’s no value in parsing sentences for meaning or developing thesauri (or ontologies) that map equivalent, hierarchical, and associative relationships.  These approaches can add value, especially within verticals with limited formal vocabularies, like medicine, law and engineering.  It’s just that less obvious approaches like employing query-query reformulation and post-query click data to drive autosuggest – may deliver better results at lower costs. And we should be wary of claims that computers “understand meaning,” at least until they get a whole lot better at filtering spam.” (p. 162)

    While these ideas are valid, they lose the essential wisdom of why librarians adopted taxonomies and spent so long building a body of standards for taxonomy creation. One thing librarians have long known about taxonomies is that they have a shelf-life beyond a specific application – that they can be used to share data across applications, communities and across the globe.

    If we are to move the beauty of Morville and Callender’s interfaces to uses beyond e-commerce and towards accessible, lower cost applications, we are going to have to understand the data structures behind these beautiful designs, and reach some shared understandings about how they should be built. Search-side approaches to search are wise, but they depend on a good design for faceted navigation, one whose categories have been validated against users’ needs. The skills of the taxonomist can be applied to search-side information design.

    One discussion I enjoyed was on the under-appreciated role of color as a “quick way to reference the major categories and key players.” (p.15) I have often thought that it might be useful to have a color attribute when defining a facet or category so that all the terms and concepts within a facet share the same color. That would help in visual sorting of ideas, which is an idea Morville and Callender explore more on the following pages. Sites without a visual library of photos but only ideas and concepts could become more visual through the use of color-coding. It would be useful if blogs and databases looked at ways of adding color so that similar concepts in a facet or category also share a color.

    To move to the next level, where we move search patterns from e-commerce to other uses, such as health care or better access to government information, and more widely adopt better and more visual search designs, we have to broaden the understanding of how to create and validate faceted navigation and categories and what the supporting data structures need to be. Perhaps O’Reilly’s next book should be on the common data structures for design for discovery, such as the art of taxonomy and ontology.

    Search Patterns is a valuable little  book  to stimulate creative juices.  The link  to buy Search Patterns is at http://searchpatterns.org/

    Thank you to Andy Oram, a mensch of an editor at O’Reilly.

    ~ Marlene Rockmore

    The Mars Test

    A recent segment on NPR asked New Yorker writer Peter Hessler, who has lived in China for the past 15 years, what it was like to re-enter life in the United States and how the United States looks to Chinese citizens. Hessler discussed how hard it is for the rest of the world to understand our complex system of checks and balances, of federal, state and local power, and of influential groups with non-governmental status. That raised the question of what governmental websites do to help orient visitors to the basic organization and framework of government.

    What if we were visiting from Mars? What would we learn from our governmental websites about how the United States is organized? The Mars test, in taxonomy and information design, is also called the ‘mental model.’ A mental model uses common knowledge or frameworks for creating website navigation. So a good place to start designing a US Government website might be 4th grade civics, which distinguishes the Executive, Legislative, and Judicial branches and explains which responsibilities belong to the federal government and which functions are reserved for state government.

    Here is the US Government portal called USA.gov.  Does it pass the Mars test?

    USA.gov on April 16, 2010

    It is a directory-like interface that is organized, it seems to me, around arbitrary topics with no association to government agencies. Where would I even begin to find out about the President of the United States, the new health care bill, or the Supreme Court? How do you find a local government office, like my legislator’s office or the Social Security office? In a week when a United States Supreme Court justice retired and volcanic ash disrupted air travel, there is no acknowledgment of these events and no links to related websites. The site in fact gives the impression that the lights are on but nobody is home.

    USA.gov is actually experimenting with some sophisticated clustering software, Vivisimo (vivisimo.com). This clustering application illustrates how clustering results can be customized, in this case by topic, by agency and by source. While the topic clusters are generated automatically on the fly, the agency and source filters are generated from HTML metatags.

    The United Kingdom is experimenting with its own clustered interface, but its site also uses RDFa and shared metadata. This system has the advantage of a reusable metadata model that allows state and local agencies to map their content to the governmental model. This promotes “harmonization” and cooperation in supplying data between federal and state government. Because of this harmonization through shared metadata, direct.gov.uk can enable features such as search by zip code for local offices that deliver state and local services. Even better, the interface looks like someone is minding the store and cares what content appears on the website.

    Direct.Gov.UK April 16, 2010
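The harmonization idea, where each agency maps its own field names onto one shared model so that a single search can run across all of them, can be sketched as below. This is only an illustration under assumed names: the agencies, field names, records, and functions are hypothetical and do not describe how any government site is actually built.

```python
# Shared model fields every agency maps into (hypothetical names).
SHARED_FIELDS = ("service", "postcode", "office")

# Each agency declares how its local field names map to the shared model.
FIELD_MAPS = {
    "state_dmv": {"svc_name": "service", "zip": "postcode", "branch": "office"},
    "city_hall": {"offering": "service", "postal_code": "postcode", "desk": "office"},
}


def harmonize(agency, record):
    """Rename an agency's local fields to the shared model's field names."""
    mapping = FIELD_MAPS[agency]
    shared = {mapping[key]: value for key, value in record.items() if key in mapping}
    missing = [field for field in SHARED_FIELDS if field not in shared]
    if missing:
        raise ValueError(f"{agency} record lacks shared fields: {missing}")
    return shared


def offices_near(records, postcode):
    """Return offices from harmonized records that serve a postcode."""
    return sorted(r["office"] for r in records if r["postcode"] == postcode)


RECORDS = [
    harmonize("state_dmv", {"svc_name": "Licenses", "zip": "02139", "branch": "DMV Central"}),
    harmonize("city_hall", {"offering": "Permits", "postal_code": "02139", "desk": "Permit Desk"}),
]
print(offices_near(RECORDS, "02139"))
```

The agencies keep their own systems and vocabulary. Only the mapping is shared, which is why the organizational work of agreeing on the model matters more than the code.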

    I am not opposed to clustering. Clustering promises to be a great technology for quickly retrieving masses of documents and content, but a little upfront work is needed to steer automated technologies toward useful categories that reflect our shared knowledge and common sense. This work would help in creating automated systems that sort results into useful buckets that clarify content and help users find government assistance and solutions.

    Search.usa.gov is actually an exciting engine that has clustered over 50 million government documents. However, it needs a friendlier, warmer interface. For example, search for Supreme Court, and the results mix state courts with the United States Supreme Court. Wouldn’t the search experience be improved if the portal helped users filter searches to distinguish between federal and state courts?

    Using common models through taxonomies and shared metadata might not only help the visitors from Mars.  It might also help citizens of the United States find a clearly navigable path based on stuff they learned in 4th grade.
