XML for the absolute beginner

HTML and the World Wide Web are everywhere. As an example of their ubiquity, I'm going to Central America for Easter this year, and if I want to, I'll be able to surf the Web, read my e-mail, and even do online banking from Internet cafés in Antigua Guatemala and Belize City. (I don't intend to, however, since doing so would take time away from a date I have with a palm tree and a rum-filled coconut.)

And yet, despite the omnipresence and popularity of HTML, it is severely limited in what it can do. It's fine for disseminating informal documents, but HTML is now being used to do things it was never designed for. Trying to design heavy-duty, flexible, interoperable data systems from HTML is like trying to build an aircraft carrier with hacksaws and soldering irons: the tools (HTML and HTTP) just aren't up to the job.

The good news is that many of the limitations of HTML have been overcome in XML, the Extensible Markup Language. XML is easily comprehensible to anyone who understands HTML, but it is much more powerful. More than just a markup language, XML is a metalanguage -- a language used to define new markup languages. With XML, you can create a language crafted specifically for your application or domain.

XML will complement, rather than replace, HTML. Whereas HTML is used for formatting and displaying data, XML represents the contextual meaning of the data.

This article will present the history of markup languages and how XML came to be. We'll look at sample data in HTML and move gradually into XML, demonstrating why it provides a superior way to represent data. We'll explore the reasons you might need to invent a custom markup language, and I'll teach you how to do it. We'll cover the basics of XML notation and how to display XML with two different sorts of style languages. Then, we'll dive into the Document Object Model, a powerful tool for manipulating documents as objects (or manipulating object structures as documents, depending on how you look at it). We'll go over how to write Java programs that extract information from XML documents, with a pointer to a free program useful for experimenting with these new concepts. Finally, we'll take a look at an Internet company that's basing its core technology strategy on XML and Java.

Is XML for you?

Though this article is written for anyone interested in XML, it has a special relationship to the JavaWorld series on XML JavaBeans. (See Resources for links to related articles.) If you've read that series and aren't quite "getting it," this article should clarify how to use XML with beans. If you are getting it, this article serves as the perfect companion piece to the XML JavaBeans series, since it covers topics left untouched there. And, if you're one of the lucky few who still have the XML JavaBeans articles to look forward to, I recommend that you read the present article first as introductory material.

A note about Java

There's so much recent XML activity in the computer world that even an article of this length can only skim the surface. Still, the whole point of this article is to give you the context you need to use XML in your Java program designs. This article also covers how XML operates with existing Web technology, since many Java programmers work in such an environment.

XML opens Internet and Java programming to portable, nonbrowser functionality. XML frees Internet content from the browser in much the same way Java frees program behavior from the platform. XML makes Internet content available to real applications.

Java is an excellent platform for using XML, and XML is an outstanding data representation for Java applications. I'll point out some of Java's strengths with XML as we go along.

Let's begin with a history lesson.

The origins of markup languages

The HTML we all know and love (well, that we know, anyway) was originally designed by Tim Berners-Lee at CERN (le Conseil Européen pour la Recherche Nucléaire, or the European Laboratory for Particle Physics) in Geneva to allow physics nerds (and even non-nerds) to communicate with each other. HTML was released in December 1990 within CERN and became publicly available in the summer of 1991 for the rest of us. CERN and Berners-Lee gave away the specifications for HTML, HTTP, and URLs, in the fine old tradition of Internet share-and-enjoy.

Berners-Lee defined HTML in SGML, the Standard Generalized Markup Language. SGML, like XML, is a metalanguage -- a language used for defining other languages. Each language so defined is called an application of SGML. HTML is an application of SGML.

SGML emerged from research done primarily at IBM on text document representation in the late '60s. IBM created GML ("Generalized Markup Language"), a predecessor language to SGML, and in 1978 the American National Standards Institute (ANSI) created its first version of SGML. An initial version was released in 1983, the draft standard followed in 1985, and the final standard was published in 1986. Interestingly, the first SGML standard was published using an SGML system developed by Anders Berglund at CERN, the organization that, as we've seen, gave us HTML and the Web.

SGML is widely used in big industries and governments, such as large aerospace, automotive, and telecommunications companies. SGML is used as a document standard at the United States Department of Defense and the Internal Revenue Service. (For readers outside the US, the IRS is the tax guys.)

Albert Einstein said everything should be made as simple as possible, and no simpler. The reason SGML isn't found in more places is that it's extremely sophisticated and complex. And HTML, which you can find everywhere, is very simple; for a lot of applications, it's too simple.

HTML: All form and no substance

HTML is a language designed to "talk about" documents: headings, titles, captions, fonts, and so on. It's heavily oriented toward document structure and presentation.

Admittedly, artists and hackers have been able to work miracles with the relatively dull tool called HTML. But HTML has serious drawbacks that make it a poor fit for designing flexible, powerful, evolutionary information systems. Here are a few of the major complaints:

  • HTML isn't extensible

    An extensible markup language would allow application developers to define custom tags for application-specific situations. Unless you're a 600-pound gorilla (and maybe not even then), you can't require all browser manufacturers to implement all the markup tags necessary for your application. So, you're stuck with what the big browser makers, or the W3C (World Wide Web Consortium), will let you have. What we need is a language that allows us to make up our own markup tags without having to call the browser manufacturer.

  • HTML is very display-centric

    HTML is a fine language for display purposes, unless you require a lot of precise formatting or transformation control (in which case it stinks). HTML represents a mixture of document logical structure (titles, paragraphs, and such) with presentation tags (bold, image alignment, and so on). Since almost all of the HTML tags have to do with how to display information in a browser, HTML is useless for other common network applications, such as data replication or application services. We need a way to unify these common functions with display, so the same server used to browse data can also, for example, perform enterprise business functions and interoperate with legacy systems.

  • HTML isn't usually directly reusable

    Creating documents in word-processors and then exporting them as HTML is somewhat automated but still requires, at the very least, some tweaking of the output in order to achieve acceptable results. If the data from which the document was produced change, the entire HTML translation needs to be redone. Web sites that show the current weather around the globe, around the clock, usually handle this automatic reformatting very well. The content and the presentation style of the document are separated, because the system designers understand that their content (the temperatures, forecasts, and so on) changes constantly. What we need is a way to specify data presentation in terms of structure, so that when data are updated, the formatting can be "reapplied" consistently and easily.

  • HTML only provides one 'view' of data

    It's difficult to write HTML that displays the same data in different ways based on user requests. Dynamic HTML is a start, but it requires an enormous amount of scripting and isn't a general solution to this problem. (Dynamic HTML is discussed in more detail below.) What we need is a way to get all the information we may want to browse at once, and look at it in various ways on the client.

  • HTML has little or no semantic structure

    Most Web applications would benefit from an ability to represent data by meaning rather than by layout. For example, it can be very difficult to find what you're looking for on the Internet, because there's no indication of the meaning of the data in HTML files (aside from META tags, which are usually misleading). Type "red" into a search engine, and you'll get links to Red Skelton, red herring, red snapper, the red scare, Red Letter Day, and probably a page or two of "Books I've Red." HTML has no way to specify what a particular page item means. A more useful markup language would represent information in terms of its meaning. What we need is a language that tells us not how to display information, but rather, what a given block of information is, so we know what to do with it.

SGML has none of these weaknesses, but in order to be general, it's hair-tearingly complex (at least in its complete form). The language used to format SGML (its "style language"), called DSSSL (Document Style Semantics and Specification Language), is extremely powerful but difficult to use. How do we get a language that's roughly as easy to use as HTML but has most of the power of SGML?

Origins of XML

As the Web exploded in popularity and people all over the world began learning about HTML, they fairly quickly started running into the limitations outlined above. Heavy-metal SGML wonks, who had been working with SGML for years in relative obscurity, suddenly found that everyday people had some understanding of the concept of markup (that is, HTML). SGML experts began to consider the possibility of using SGML on the Web directly, instead of using just one application of it (again, HTML). At the same time, they knew that SGML, while powerful, was simply too complex for most people to use.

In the summer of 1996, Jon Bosak (currently online information technology architect at Sun Microsystems) convinced the W3C to let him form a committee on using SGML on the Web. He created a high-powered team of muckety-mucks from the SGML world. By November of that year, these folks had created the beginnings of a simplified form of SGML that incorporated tried-and-true features of SGML but with reduced complexity. This was, and is, XML.

In March 1997, Bosak released his landmark paper, "XML, Java and the Future of the Web" (see Resources). Now, two years later (a very long time in the life of the Web), Bosak's short paper is still a good, if dated, introduction to why using XML is such an excellent idea.

SGML was created for general document structuring, and HTML was created as an application of SGML for Web documents. XML is a simplification of SGML for general Web use.

An XML conceptual example

All this talk of "inventing your own tags" is pretty foggy: What kind of tags would a developer want to invent and how would the resulting XML be used? In this section, we'll go over an example that compares and contrasts information representation in HTML and XML. In a later section ("XSL: I like your style") we'll go over XML display.

First, we'll take an example of a recipe, and display it as one possible HTML document. Then, we'll redo the example in XML and discuss what that buys us.

HTML example

Take a look at the little chunk of HTML in Listing 1:

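    <HTML>
    <HEAD>
    <TITLE>Lime Jello Marshmallow Cottage Cheese Surprise</TITLE>
    </HEAD>
    <BODY>
    <H3>Lime Jello Marshmallow Cottage Cheese Surprise</H3>
    My grandma's favorite (may she rest in peace).
    <H4>Ingredients</H4>
    <TABLE BORDER="1">
    <TR><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
    <TR><TD>1</TD><TD>box</TD><TD>lime gelatin</TD></TR>
    <TR><TD>500</TD><TD>g</TD><TD>multicolored tiny marshmallows</TD></TR>
    <TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>
    <TR><TD></TD><TD>dash</TD><TD>Tabasco sauce (optional)</TD></TR>
    </TABLE>
    <H4>Instructions</H4>
    <OL>
    <LI>Prepare lime gelatin according to package instructions...</LI>
    </OL>
    </BODY>
    </HTML>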

Listing 1. Some HTML

(A printable version of this listing can be found at example.html.)

Looking at the HTML code in Listing 1, it's probably clear to just about anyone that this is a recipe for something (something awful, but a recipe nonetheless). In a browser, our HTML produces something like this:

Lime Jello Marshmallow Cottage Cheese Surprise

My grandma's favorite (may she rest in peace).

Ingredients

    Qty   Units   Item
    1     box     lime gelatin
    500   g       multicolored tiny marshmallows
    500   ml      cottage cheese
          dash    Tabasco sauce (optional)

Instructions

  1. Prepare lime gelatin according to package instructions...

Listing 2. What the HTML in Listing 1 looks like in a browser

Now, there are a number of advantages to representing this recipe in HTML, as follows:

  • It's fairly readable. The markup may be a little cryptic, but if it's laid out properly it's pretty easy to follow.

  • The HTML can be displayed by just about any HTML browser, even one without graphics capability. That's an important point: The display is browser-independent. If there were a photo of the results of making this recipe (and one certainly hopes there isn't), it would show up in a graphical browser but not in a text browser.

  • You could use a cascading style sheet (CSS -- we'll talk a bit about those below) for general control over formatting.

There's one major problem with HTML as a data format, however. The meaning of the various pieces of data in the document is lost. It's really hard to take general HTML and figure out what the data in the HTML mean. The fact that there's an <Ingredient> of this recipe with a <Qty> (quantity) of 500 ml (the unit) of cottage cheese would be very hard to extract from this document in a way that's generally meaningful.

Now, the idea of data in an HTML document meaning something may be a bit hard to grasp. Web pages are fine for the human reader, but if a program is going to process a document, it requires unambiguous definitions of what the tags mean. For instance, the <TITLE> tag in an HTML document encloses the title of the document. That's what the tag means, and it doesn't mean anything else. Similarly, an HTML <TR> tag means "table row," but that's of little use if your program is trying to read recipes in order to, say, create a shopping list. How could a program find a list of ingredients from a Web page formatted in HTML?

Sure, you could write a program that grabs the headers out of the document, reads the table column headers, figures out the quantities and units of each ingredient, and so on. The problem is, everyone formats recipes differently. What if you're trying to get this information from, say, the Julia Child Web site, and she keeps messing around with the formatting? If Julia changes the order of the columns or stops using tables, she'll break your program! (Though it has to be said: If Julia starts publishing recipes like this, she may want to think about changing careers.)
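To make that fragility concrete, here is a minimal sketch of such a scraper. It is an illustration only: the class name HtmlRecipeScraper, the regular expression, and the two sample tables are assumptions made up for this example, not code from any real site. The scraper simply trusts that the three table cells arrive in the order quantity, units, item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical screen scraper: assumes every ingredient is a table row
// holding exactly three <td> cells in the order quantity, units, item.
public class HtmlRecipeScraper {

    private static final Pattern ROW = Pattern.compile(
        "<tr>\\s*<td>(.*?)</td>\\s*<td>(.*?)</td>\\s*<td>(.*?)</td>\\s*</tr>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one "qty units item" string per data row, trusting the column order blindly. */
    public static List<String> ingredients(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            result.add((m.group(1) + " " + m.group(2) + " " + m.group(3)).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String original =
            "<table><tr><th>Qty</th><th>Units</th><th>Item</th></tr>"
            + "<tr><td>500</td><td>ml</td><td>cottage cheese</td></tr></table>";
        // Same data, but the author moved the Item column to the front.
        String reordered =
            "<table><tr><th>Item</th><th>Qty</th><th>Units</th></tr>"
            + "<tr><td>cottage cheese</td><td>500</td><td>ml</td></tr></table>";

        System.out.println(ingredients(original));   // [500 ml cottage cheese]
        System.out.println(ingredients(reordered));  // [cottage cheese 500 ml]
    }
}
```

The second call still "works," which is the worst kind of failure: the program now believes the quantity is "cottage cheese" and the item is "ml," and nothing in the HTML tells it otherwise.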

Now, imagine that this recipe page came from data in a database and you'd like to be able to ship this data around. Maybe you'd like to add it to your huge recipe database at home, where you can search and use it however you like. Unfortunately, your input is HTML, so you'll need a program that can read this HTML, figure out what all the "Ingredients," "Instructions," "Units," and so forth are, and then import them to your database. That's a lot of work. Especially since all of that semantic information -- again, the meaning of the data -- existed in that original database but was obscured in the process of being transformed into HTML.

Now, imagine you could invent your own custom language for describing recipes. Instead of describing how the recipe was to be displayed, you'd describe the information structure in the recipe: how each piece of information would relate to the other pieces.

XML example

Let's just make up a markup language for describing recipes, and rewrite our recipe in that language, as in Listing 3.

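    <?xml version="1.0"?>
    <Recipe>
       <Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name>
       <Description>
          My grandma's favorite (may she rest in peace).
       </Description>
       <Ingredients>
          <Ingredient>
             <Qty unit="box">1</Qty>
             <Item>lime gelatin</Item>
          </Ingredient>
          <Ingredient>
             <Qty unit="g">500</Qty>
             <Item>multicolored tiny marshmallows</Item>
          </Ingredient>
          <Ingredient>
             <Qty unit="ml">500</Qty>
             <Item>Cottage cheese</Item>
          </Ingredient>
          <Ingredient>
             <Qty unit="dash"/>
             <Item optional="1">Tabasco sauce</Item>
          </Ingredient>
       </Ingredients>
       <Instructions>
          <Step>
             Prepare lime gelatin according to package instructions
          </Step>
       </Instructions>
    </Recipe>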

Listing 3. A custom markup language for recipes

It will come as little surprise to you, being the astute reader you are, that this recipe in its new format is actually an XML document. Maybe the fact that the file started with the odd header

    <?xml version="1.0"?>

gave it away; in fact, every XML file should begin with this header. We've simply invented markup tags that have a particular meaning; for example, "An <Ingredient> is a <Qty> (quantity in specified units) of a single <Item>, which is possibly optional." Our XML document describes the information in the recipe in terms of recipes, instead of in terms of how to display the recipe (as in HTML). The semantics, or meaning of the information, is maintained in XML because that's what the tag set was designed to do.
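As a preview of why this matters to a program, here is a minimal sketch that turns a recipe in this format into a shopping list. It assumes the tag and attribute names Ingredient, Qty, Item, and unit; the class name RecipeShoppingList is made up for illustration; and it uses the JAXP DOM parser (javax.xml.parsers) that ships with current JDKs, which is not necessarily the parser you'll end up choosing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; the element names are assumptions about the recipe language.
public class RecipeShoppingList {

    /** Returns one "quantity unit item" line per <Ingredient> element. */
    public static List<String> shoppingList(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<String> lines = new ArrayList<>();
        NodeList ingredients = doc.getElementsByTagName("Ingredient");
        for (int i = 0; i < ingredients.getLength(); i++) {
            Element ingredient = (Element) ingredients.item(i);
            // Look the pieces up by name, so their order and layout don't matter.
            Element qty = (Element) ingredient.getElementsByTagName("Qty").item(0);
            Element item = (Element) ingredient.getElementsByTagName("Item").item(0);
            String line = qty.getTextContent().trim() + " " + qty.getAttribute("unit")
                + " " + item.getTextContent().trim();
            lines.add(line.trim());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
            + "<Recipe><Ingredients>"
            + "<Ingredient><Qty unit=\"ml\">500</Qty><Item>Cottage cheese</Item></Ingredient>"
            + "<Ingredient><Qty unit=\"dash\"/><Item optional=\"1\">Tabasco sauce</Item></Ingredient>"
            + "</Ingredients></Recipe>";
        for (String line : shoppingList(xml)) {
            System.out.println(line);
        }
        // Prints:
        // 500 ml Cottage cheese
        // dash Tabasco sauce
    }
}
```

Because the program asks for elements by name, swapping the order of <Qty> and <Item> inside an <Ingredient> would not change its output, which is exactly the robustness the HTML version lacked.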

Notes on notation

It's important to get some nomenclature straight. In Figure 1, you see a start tag, which begins an enclosed area of text, known as an Item, according to the tag name. As in HTML, XML tags may include a list of attributes (each consisting of an attribute name and an attribute value). The Item defined by the tag ends with the end tag.

Not every tag encloses text. In HTML, the <BR> tag means "line break" and contains no text. In XML, such elements aren't allowed. Instead, XML has empty tags, denoted by a slash before the final right-angle bracket in the tag. Figure 2 shows an empty tag from our XML recipe. Note that empty tags may have attributes. This empty tag example is standard XML shorthand for <Qty unit="dash"></Qty>.
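A quick way to convince yourself that the empty-tag form really is just shorthand is to parse both spellings and compare what the parser hands back. The sketch below does that with the JDK's JAXP DOM parser; the class name EmptyTagDemo and the Qty/unit names are assumptions chosen to match the recipe example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo: both spellings of an empty element yield the same structure.
public class EmptyTagDemo {

    /** Parses a one-element document and returns its root element. */
    static Element parseRoot(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
    }

    /** True if both documents have the same tag name and unit attribute, and no content. */
    public static boolean sameEmptyElement(String first, String second) throws Exception {
        Element a = parseRoot(first);
        Element b = parseRoot(second);
        return a.getTagName().equals(b.getTagName())
            && a.getAttribute("unit").equals(b.getAttribute("unit"))
            && !a.hasChildNodes()
            && !b.hasChildNodes();
    }

    public static void main(String[] args) throws Exception {
        String shorthand = "<Qty unit=\"dash\"/>";
        String longhand = "<Qty unit=\"dash\"></Qty>";
        System.out.println(sameEmptyElement(shorthand, longhand)); // true
    }
}
```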

In addition to these notational differences from HTML, the structural rules of XML are more strict. Every XML document must be well-formed. What does that mean? Read on!

Ooh-la-la! Well-formed XML

The concept of well-formedness comes from mathematics: It's possible to write mathematical expressions that don't mean anything. For example, the expression

2 ( + + 5 (=) 9 > 7

looks (sort of) like math, but it isn't math because it doesn't follow the notational and structural rules for a mathematical expression (not on this planet, at least). In other words, the "expression" above isn't well-formed. Mathematical expressions must be well-formed before you can do anything useful with them, because expressions that aren't well-formed are meaningless.

A well-formed XML document is simply one that follows all of the notational and structural rules for XML. Programs that intend to process XML should reject any input XML that doesn't follow the rules for being well-formed. The most important of these rules are as follows:

  • No unclosed tags

    You can get away with all kinds of wacko stuff in HTML. For example, in most HTML browsers, you can "open" a list item with <LI> and never "close" it with </LI>. The browser just figures out where the </LI> would be and automatically inserts it for you. XML doesn't allow this kind of sloppiness. Every start tag must have a corresponding end tag. This is because part of the information in an XML file has to do with how different elements of information relate to one another, and if the structure is ambiguous, so is the information. So, XML simply doesn't allow ambiguous structure. This nonambiguous structure also allows XML documents to be processed as data structures (trees), as I'll explain shortly in the discussion of the Document Object Model.

  • No overlapping tags

    A tag that opens inside another tag must close before the containing tag closes. For example, the sequence

    <Tomato>Let's call <Potato>the whole thing</Tomato> off</Potato>

    isn't well-formed because <Potato> opens inside of <Tomato> but doesn't close inside of <Tomato>. The correct sequence must be

    <Tomato>Let's call <Potato>the whole thing</Potato> off</Tomato>

    In other words, the structure of the document must be strictly hierarchical.

  • Attribute values must be enclosed in quotes

    Unlike HTML, XML doesn't allow "naked" attribute values (i.e., HTML tags like <TABLE BORDER=1>, where there are no quotes around the attribute value). Every attribute value must have quotes (<TABLE BORDER="1">).

  • The text characters (<) and (&) must always be represented by 'character entities'

    To represent the left-angle bracket and the ampersand in the text part of the XML (not in the markup), you must use the special character entities (&lt;) and (&amp;), respectively. These characters are special characters for XML. The right-angle bracket and the double quote have entities of their own, (&gt;) and (&quot;), and using them is a safe habit, but they're strictly required only in a few places (for example, a double quote inside an attribute value that is itself delimited by double quotes). An XML file using a bare left-angle bracket or ampersand in the text enclosed in tags isn't well-formed, and correctly designed XML parsers will produce an error for such input.

  • 'Well-formed' means 'parsable'

    A generic XML parser is a program or class that can read any well-formed XML at its input. Many vendors now offer XML parsers in Java for free (you'll find links to these packages in Resources at the bottom of this article). XML parsers recognize well-formed documents and produce error messages (much like a compiler would) when they receive input that isn't well-formed. As we'll see, this functionality is very handy for the programmer: You simply call the parser you've selected and it takes care of the error detection and so on. While all XML parsers check the well-formedness of documents (meaning, as we've seen, that all the tags make sense, are nested properly, and so on), validating XML parsers go one step further. Validating parsers also confirm whether the document is valid; that is, that the structure and number of tags make sense.

    For example, most browsers will display a document that (nonsensically) has two <TITLE> elements, but how can this be? Only one title or no title makes sense.

    For another example, imagine that in Listing 3 the "cottage cheese" ingredient looked like this:

      <Ingredient>
         <Qty unit="ml">500<Qty>9</Qty></Qty>
         <Item>Cottage cheese</Item>
      </Ingredient>

    This XML document is certainly well-formed, but it doesn't make sense. It isn't structurally valid. It is nonsense for a <Qty> to contain a <Qty>. What's the <Qty> of this <Ingredient>?

    The problem is, we have a document that's well-formed, but it isn't very useful because the XML doesn't make sense. We need a way to specify what makes an XML document valid. For example, how can we specify that a <Qty> tag may contain only text (and not any other elements) and report as errors any other case?

    The answer to this question lies in something called the document type definition, which we'll look at next.
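Before moving on, the rules in this section can be tried out directly. The sketch below is an illustration under stated assumptions: the class name WellFormedCheck and its two methods are made up for this example, the element names follow the recipe language, and the parser is the JAXP DOM parser bundled with current JDKs. The first method asks the parser whether a document is well-formed. The second is a hand-rolled stand-in for one validity rule ("a Qty may contain only text"), the kind of check that a document type definition lets you declare instead of program.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for experimenting with the well-formedness rules.
public class WellFormedCheck {

    private static Document parse(String xml) throws SAXException {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if the parser accepts the text as well-formed XML. */
    public static boolean isWellFormed(String xml) {
        try {
            parse(xml);
            return true;
        } catch (SAXException notWellFormed) {
            return false;
        }
    }

    /** Hand-rolled validity rule: no Qty element may contain another element. */
    public static boolean qtyHoldsOnlyText(String xml) throws SAXException {
        NodeList quantities = parse(xml).getElementsByTagName("Qty");
        for (int i = 0; i < quantities.getLength(); i++) {
            NodeList children = quantities.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                if (children.item(j).getNodeType() == Node.ELEMENT_NODE) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(isWellFormed("<UL><LI>one</UL>"));                 // false: unclosed tag
        System.out.println(isWellFormed("<A>x <B>y</A> z</B>"));              // false: overlapping tags
        System.out.println(isWellFormed("<TABLE BORDER=1></TABLE>"));         // false: naked attribute value
        System.out.println(isWellFormed("<Step>heat if temp < 5</Step>"));    // false: bare left-angle bracket
        System.out.println(isWellFormed("<Step>heat if temp &lt; 5</Step>")); // true

        String odd = "<Ingredient><Qty unit=\"ml\">500<Qty>9</Qty></Qty>"
            + "<Item>Cottage cheese</Item></Ingredient>";
        System.out.println(isWellFormed(odd));      // true: well-formed...
        System.out.println(qtyHoldsOnlyText(odd));  // false: ...but it breaks our rule
    }
}
```

Writing a method like qtyHoldsOnlyText for every rule of every markup language gets old quickly, which is the motivation for declaring such rules once and letting a validating parser enforce them.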




