
How can we create universally acceptable standards for electronic publishing?


Mark Bide, Mark Bide and Associates

Introduction

I begin my brief presentation about the development of standards for electronic publishing by explaining a little about myself, which should help to explain my prejudices. Although I have been working as a consultant for the last three years, the first twenty years of my working life were spent in (primarily academic) publishing management. I am a publisher and tend to bring a publisher's perspective to the issues of electronic publishing.

My work as a consultant is primarily concerned with the impact of technology on the management of the publishing process, which explains my interest in standards for electronic publishing.

Standards for Electronic Tables of Contents

When I was first asked to talk about standards at this conference, I struggled to find the correct approach; eventually I decided that I should focus on a case study, looking at two projects in which I have been involved, to develop standards for the storage and transmission of Electronic Tables of Contents, or EToCs. The first, started in 1994, was to develop a draft standard for EToCs for serial publications; the second, which started in 1995, was a follow-on project to develop a draft standard for EToCs for books. Both projects have been run by Book Industry Communication (BIC), a UK cross-industry standards body jointly sponsored by the Publishers Association, the Library Association, the Booksellers Association and the British Library. The projects have been funded by the British National Bibliography Research Fund.(1) These were the first two BIC projects directly related to developing standards for publishing content; its previous work (since its establishment in 1991) had been primarily concerned with developing standards for electronic commerce and for publishers' bibliographic databases. Out of this work came the realisation that some standards for publishing content would be essential if users were to have any hope of accurately identifying and locating the particular content in which they had an interest.

From here it was only a short step to the realisation that the first area where standards would be essential would be metadata (information about information). The main reasons for this are doubtless obvious. It is particularly important to have standards where information from more than one source has to be combined in order to give the data enhanced value; this is very clearly the case with metadata. Indeed, it may be argued that metadata is the only area in which anything approaching tightly defined standards can be applied.

So we resolved that the first place to start would be with Tables of Contents, more specifically with the ToCs for serials, where there was clearly an already-established commercial demand.

Commercial justification

It is perhaps appropriate here to say a few words about the commercial aspects of standards development. I have a very definite prejudice in favour of developing standards only where they are likely to be adopted; this suggests that they will be most effective where they fulfil an existing demand. While our reports are not intended to suggest any particular commercial model for the exploitation of EToCs, they do assume that their adoption is of commercial value.

While I was researching the serials EToC project, I spoke at length to a number of commercial suppliers of EToC data to libraries. Among the things which I discovered was that at least three of these services were using the same supplier in the Philippines to re-key much the same data at much the same time. This inefficiency is in no one's interest, except that of the supplier in the Philippines. However, it may not at first sight be obvious to everyone in the supply chain that somebody else's inefficiency damages everyone else in the chain. Indeed, I often find when dealing with supply chain issues of this kind that there is a real unwillingness to work towards solutions which apparently only help someone else in the chain. Yet there is only one source of money in any supply chain. From this, it should be quite obvious that money spent on supporting inefficiency in the infrastructure is money wasted for everyone in the chain. Part of the purpose of developing standards is to reduce that waste.

In looking at standards for EToCs, we therefore have to look for the point at which we can capture the earliest possible quality-controlled and standardised keystroke. This implies capture of the keystroke at the publisher.

What is a Serial Table of Contents?

When I started the serials project, I had a very clear idea of what a serial ToC looked like: a list of article titles and authors' names, very like those which appear on the cover or first page of a typical journal. However, it rapidly became apparent that my idea of a ToC was inadequate for what was needed. It was clear that what we needed was a standard for the complete article "header", down to and including the abstract.
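
To make that scope concrete, the fragment below sketches in SGML the kind of information such a header must carry. It is purely illustrative: the element names are invented for this example and are not drawn from MAJOUR or any other published DTD.

    <artheader>
      <jtitle>Journal of Illustrative Studies</jtitle>
      <issn>0000-0000</issn>
      <volume>12</volume>
      <issue>3</issue>
      <arttitle>An Example Article Title</arttitle>
      <authgrp>
        <author><fname>A.</fname><surname>Author</surname></author>
        <affil>An Example University</affil>
      </authgrp>
      <abstract>A short abstract summarising the argument of the article.</abstract>
    </artheader>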

This complicates the commercial issues further, since publishers have increasingly been asking whether abstracts themselves may have value, as well as complete journal articles. Again, I should stress that the development of the standard is separate from the development of the commercial model. If I may be allowed a brief aside at this point, it does seem to me that it is very much in a publisher's interest to promote the usage of their content in libraries which subscribe to their serials or purchase their books. I am therefore extremely surprised that some publishers attempt to discourage libraries from photocopying and circulating the ToCs from printed journals to which they subscribe. It would similarly seem to me to be of obvious value to allow all users on a particular site free and unrestricted access to an EToC for a publication to which the library subscribes.

The technological platform

The decision that we needed to develop a standard covering the entire article header meant that the choice of technological platform was essentially made for us. Since we were determined to capture the publishers' keystroke, this implied a technology which could be embedded in the publishers' production process. This meant that we had no realistic option except to adopt SGML.

However, we also knew that the adoption of SGML standards was not easy. A good deal of standards work had already been done by publishers and more was in progress. This work had proved the difficulty for large publishing houses of developing a single standard SGML Document Type Definition for use within a single publishing programme, let alone a standard which would work for several publishing houses.

The closest anyone had come to developing such a standard was the work of the European Working Group on SGML, a group of European publishers and others with an interest in SGML standardisation. This group has successfully developed a journal header DTD, called MAJOUR, on which the DTDs used by most major serials publishers are based.(2) I stress that they are based on MAJOUR: no publisher that I have spoken to has used MAJOUR entirely unaltered.
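
For readers unfamiliar with SGML, a DTD consists of declarations defining which elements may appear, and in what order and combination. The declarations below, which use the same invented element names as the illustrative header fragment earlier, sketch the style involved; they are not the actual MAJOUR declarations.

    <!ELEMENT artheader - - (jtitle, issn, volume, issue, arttitle, authgrp+, abstract?)>
    <!ELEMENT authgrp   - - (author+, affil?)>
    <!ELEMENT author    - - (fname?, surname)>

It is precisely because every house wants a slightly different content model (an extra element here, a different optionality there) that no two publishers' declarations are identical.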

At the same time as we were doing our work on standards, another group, the OASIS group of publishers (the name stands for "The Organisation for Article Standards in Science", a loose association of STM publishers), was also working to develop standards for a "minimum data set" for journal article headers. This work was also loosely built on MAJOUR, but was not intended to develop a standard SGML DTD.

At a very well attended international open meeting of publishers and others held in London in September 1994, we got a surprisingly unanimous decision from delegates that we should adopt the MAJOUR DTD as our interchange standard (a standard for storage and transmission). This would not imply that any publisher had to use "pure" MAJOUR for their own internal purposes, a path which they had found to be impossible. It simply meant that all publishers would use DTDs which could be converted to MAJOUR for interchange purposes. Since most publishers had enhanced MAJOUR in order to make it more complex, this backwards compatibility should not prove too difficult to achieve, although it inevitably leads to some loss of data.
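
As a hypothetical illustration of what such a conversion involves (all tag names are invented for the example): the publisher's richer in-house markup is mapped down to the interchange elements, and anything with no equivalent in the interchange DTD is simply discarded, which is where the loss of data occurs.

    In-house record:                          Converted for interchange:
    <ti.art lang="en">A Title</ti.art>        <arttitle>A Title</arttitle>
    <au.internal>K1234</au.internal>          (no interchange equivalent: dropped)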

There was one major proviso: that MAJOUR should prove to be compatible with the OASIS group's minimum data set, which had not at that point been completed.(3) This is undoubtedly one of the biggest difficulties which arise when you attempt to develop standards: in order to ensure their acceptability, you are frequently waiting for others to complete their own work so that it can be incorporated into your standard.

Identification of fragments

The same proved to be the case in another vitally important area for metadata standards: the question of individual article identification. Here, we were keen to work closely with SISAC, the American serials standards body (the Serials Industry Systems Advisory Committee). They had been extremely supportive of the work which we were doing with EToCs, and for it to be effective we would need their endorsement. Several years earlier, they had developed a standard which many publishers had adopted for the identification of individual journal issues and of articles within issues: the SICI, the Serials Item and Contribution Identifier, NISO standard Z39.56. There were several other contenders, including the BIBLID;(4) however, the SICI had the advantage of being in active use, something which does not appear to be the case for other identification schemes. The SICI had also proved to have some significant disadvantages from the point of view of publishers, primarily to do with the identification of individual articles prior to publication. We were promised that these problems would be addressed in a review of the SICI which was then due to complete by the end of 1994. It will perhaps come as no surprise that this revision is now more than 12 months late, which has been a significant source of delay to us.

There is a considerable amount of work currently in hand to develop universal schemes for the identification of digital content of all types: text, graphics, sound, moving images. The schemes proposed fall into two different camps: those that use "intelligent" numbers (like the CAE) and those which use entirely dumb numbers. I do not have time in this presentation to cover this topic in any detail.

However, what seems clear to me is that we do not have the luxury of being able to wait for a universal solution, which will take years and possibly decades to devise. Publishers and users of publishers' products need a pragmatic solution which can be put into place very quickly.

The SICI, in the light of the revision which should now be completed shortly, has several advantages (an illustrative example follows the list):

1. It is a "semi-intelligent" code, based on the ISSN.
2. It now has two forms, both of equal validity. One of these allows the publisher to assign an arbitrary sequential number, which enables an individual article to be identified unambiguously before pagination of the serial issue in which it is to appear; it also allows for the identification of an article that might never appear in print.
3. It deals with the major problem of "legacy documents": how are we to identify the large number of articles in journals that exist in print (and probably only in print)? The SICI can be accurately reconstructed from the printed document by the user.
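
By way of illustration, a SICI is built up from segments along the following lines. The example is constructed for this paper, not a real registered identifier, and the check character is shown as # rather than computed:

    0000-0000(19960315)12:3<62:AEAT>2.0.TX;2-#

    0000-0000    the ISSN of the serial
    (19960315)   chronology: the cover date of the issue
    12:3         enumeration: volume 12, issue 3
    <62:AEAT>    the contribution: starting page and a code derived from the title
    2.0.TX       control segment: code structure, derivative part and medium (printed text)
    ;2           the version of the standard
    -#           check character (not computed for this illustration)

As I understand the revision, the second of the two forms mentioned above would instead carry a publisher-assigned number in the angle-bracketed segment, which is what allows the identifier to be created before pagination.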

There is little doubt that, along with other identification schemes, the SICI will eventually disappear, or rather be subsumed into some larger scheme. However, until that happens, it has advantages which we see in no other identification scheme. We have therefore recommended its acceptance as the standard.

Some complications

If this looks like a remarkably straightforward route to the development of the draft standard, there are nonetheless a couple of issues which have arisen and which are still taking some time to sort out. The first of these relates to identifiers. A group of publishers, led by Elsevier in the United States, has proposed an identification scheme called the PII, which is very similar but not identical to the SICI. It is clearly essential that there should not be two similar but incompatible identification schemes in use; we are currently unclear as to why this alternative scheme is considered necessary and are hoping to bring the two standards together.
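
For comparison, and as far as I understand the proposal, the PII is a fixed-length code rather than a segmented one. The formatted example below is constructed purely for illustration, again with an uncomputed check character:

    S0000-0000(96)00001-#

    S            a serial item (B would denote a book item)
    0000-0000    the ISSN
    (96)         the year in which the identifier was assigned
    00001        a publisher-assigned item number
    #            check character (not computed here)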

The second problem has been with the compatibility between the OASIS minimum data set and the MAJOUR DTD. The incompatibilities, although apparently relatively minor (revolving primarily around the question of compulsory and optional data elements), turned out not to be susceptible to compromise.

The SSSH!

Here, the only possible solution has been to develop a variant of the MAJOUR DTD, to be known as the SSSH! (or "Simplified SGML for Serial Headers"). This variant, developed by Francis Cave, one of the UK's leading SGML experts, will be published in our report.

EToCs for Books

In parallel with the later stages of the development of the serial EToC standard, we have also been developing a standard for EToCs for books. In this case, the EToC DTD will again convey more information than simply the ToC as it appears in the printed book.(5) Since we could find no existing standards which approached our requirements, this work had to be done from scratch. The DTD (as yet unnamed), developed by my co-author Ken Moore, will allow for the transmission of a full publisher's bibliographical record as well as the ToC itself.

There have been a number of problems with developing this standard. One of the first difficulties which we had to overcome was how to encode the ToC elements themselves. Unlike the elements of two journal article headers, the elements in two books' tables of contents have no precise semantic equivalence. A chapter in one book is not necessarily the same logical division of the text as a chapter in another book. Indeed, it would be possible to divide up precisely the same book into different "chapters".

A book ToC is an intellectual construct; the only way of handling this that we could identify was to follow the logic of the author or publisher who had devised the ToC in the first place, and assign essentially arbitrary "levels" to the different elements in the ToC. This preserves the hierarchical relationship between the elements within an individual ToC, but can imply no relationship between elements in two different ToCs.
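
A hypothetical fragment may make this clearer (the real DTD is, as noted, as yet unnamed, and these tags are invented):

    <toc>
      <entry level="1"><etitle>Part I: Background</etitle>
        <entry level="2"><etitle>Chapter 1: Origins</etitle>
          <entry level="3"><etitle>1.1 Early work</etitle></entry>
          <entry level="3"><etitle>1.2 Later developments</etitle></entry>
        </entry>
      </entry>
    </toc>

The level numbers order the entries within this ToC only; a level-2 entry here and a level-2 entry in another book's ToC need not correspond to the same kind of textual division.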

There is one other problem worth referring to here. Unlike the serials publishers, few book publishers will, as yet, be able to deliver EToCs conforming to the standard, not least because of the total lack of integration of their information systems. Nevertheless, there are many who recognise the importance of developing the ability to deliver this information properly, and we are confident that over time the standard will prove to be of value.

Fragment identification in books

Again, we have some difficulty with identifiers for "fragments" of books. There is no existing standard which fulfils all our requirements, so we have recommended the development of a standard similar to the SICI, based on the ISBN rather than the ISSN. This could be developed very quickly and would give us a temporary, pragmatic solution to fragment identification equivalent to that which the SICI provides for serials.
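
No such identifier yet exists, but by analogy with the SICI one might imagine something along these lines, entirely hypothetically, with a placeholder ISBN and an uncomputed check character:

    0-000000-00-0<3.2>;1-#

that is, the ISBN of the book, followed by a designation of the fragment (here chapter 3, section 2), a version number and a check character.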

Lessons for the development of standards for electronic publishing

What have I learned about standards development for electronic publishing? I claim no unique expertise but there are certain things which have become clear to me.

Cross-sectoral bodies like BIC and SISAC are much more appropriate fora for the development of standards for electronic publishing than groups which represent just one part of the information chain. Securing international co-operation is also essential; this is generally relatively easy, because the various international groups are accustomed to working together. At the moment we have too many groups (typically groups of librarians or groups of publishers) attempting to establish standards without input from others. The process is slow, because you are often dependent on other people's work to complete your own. However, it is better to wait than to constantly re-invent the wheel and risk alienating a key group; unless standards are developed which are acceptable to everyone in the information chain and which take into account all views, they will not be used.

There are limits to what can be standardised. Where there is a clear necessity to mix data from different sources, standardisation is essential. However, standards covering full content will prove elusive: if developing SGML standards for serial article headers and book ToCs has been relatively difficult, developing standards for full text would be impractical. On the plus side, however, it is clear that publishers are tending to adopt standard technological platforms (SGML and PDF), and furthermore in certain areas are trying hard to develop standard approaches to issues like installation procedures for CD-ROM products.

There are other areas like copyright management where a similar, co-operative and international approach will be essential.

Throughout this paper, I have referred to the standards set out in our report as "drafts". This is very important. It is only through use that they can be validated; we are optimistic that both standards will be extensively piloted during 1996 and their conceptual value proved. It may then be possible to establish commercial models for their exploitation.

NOTES

(1) This report is to be published in: Bide M et al, EToCs for Books and Serials: Standards for Structure and Transmission, Book Industry Communication (expected early 1996).

(2) European Working Group on SGML, The MAJOUR Header DTD, 1991. DTD and manual available from STM, Keizersgracht 462, Amsterdam. There is also an international standard, ISO 12083, based on work by the Association of American Publishers. As with MAJOUR, publishers are generally not using ISO 12083 in unaltered form.

(3) For a detailed discussion of the work of the OASIS group, see: Jarvis J, Standardising "headers": OASIS, Learned Publishing 8 (3), 141-143, 1995; and Morgan C, OASIS in context, Learned Publishing 8 (4), 213-215, 1995.

(4) Bibliographic identification of contributions in serials and books, ISO 9115, 1987.

(5) Proof of the considerable value of adding ToC data to OPACs comes from an unpublished report: Dacey J, Dempsey L, Prowse S and Walker S, Online Access to Subject-Enriched Bibliographic Records, unpublished report to the British Library Research and Development Department. I am very grateful to the authors, who gave me access to this report.

(c) 1996 Mark Bide

