
How are status and integrity of documents guaranteed in Networks?


P. Luksch and G. F. Schultheiß, Fachinformationszentrum Karlsruhe

Introduction

"Everything you know about intellectual property is wrong" is the subtitle of an article published by Nicholas Negroponte, head of MITīs Media Lab, in WIRED magazine 1. This extreme view of Negroponte predicting the end of copyright in digital networked environments is not very popular among publishers and copyright lawyers. Indeed, the Internet experience of the past few years has not only revealed the tremendous potential of computer networks to improve public access to a variety of information from and to anywhere in the world but has also stimulated the fear of illegal copying and unlimited dissemination of information once released into the network.

The protection of intellectual property is one of the foundations of our economic system and should not be put at risk simply because technological developments seem to evolve faster than the corresponding legal regulations. Today most standards for the protection of documents apply to printed media and are not easily transferable to digital media.

The impact of digital technologies on copyright law is an issue that has been addressed by various organizations. The World Intellectual Property Organization (WIPO) has sponsored a symposium on this topic [2]. Recently, the White House Information Infrastructure Task Force (IITF) released a report [3] written by its Working Group on Intellectual Property Rights. The report explains that intellectual property law does apply in networked environments and makes legislative recommendations to Congress for adapting the law to the digital age.

Except in France and Spain, there are no clear legal regulations for the electronic distribution of copyright-protected works in European countries. The European Commission (DG XIII) is still in the process of examining "Possible Options for Introducing a European Electronic Copyright".

The IITF report also points out that legal regulations alone are not sufficient to provide effective copyright protection. Additional effort in the fields of technology and education is required to successfully address the issue of intellectual property protection.

Content providers (authors and publishers) want to be confident that the technologies developed to distribute their works will be secure and that documents placed on these systems will remain authentic and unaltered.

This paper focuses on the technological means of guaranteeing the security and integrity of documents in networks and reviews current techniques for encryption and digital signatures. Before discussing protection methods for intellectual works, one should try to define what exactly we want to protect.

What is a document in the digital medium?

One of the requirements of copyright protection is fixation in a tangible medium of expression. In digital form, a work is simply a bit sequence. According to the IITF report, this representation fits within the list of permissible manners of fixation. Acknowledging that the forms of fixation and the methods or media used are almost unlimited, the House Report [4] allows "that a work may be fixed in words, numbers, notes, sounds, pictures, or many other graphic or symbolic indicia, and may be embodied in a physical object in written, printed, photographic, sculptural, punched, magnetic, or any other stable form and may be capable of perception either directly or by means of any machine or device now known or later developed." The same report states that a transmission itself is not a fixation. While a transmission may result in a fixation, a work is not fixed by virtue of the transmission alone.

The copyright problems related to the "plasticity" of works in digital form are discussed by Pamela Samuelson [5], who gives examples of intellectual works with an undefined fixation or questionable authorship. Who is the author of works that are automatically generated by a tool or computer program?
Scientific visualization programs like those commonly used in chemistry allow users to generate a graphical representation from a table of numerical data, thereby converting what is viewed as a "literary work" into a "pictorial work".

In the scientific and technical community the typical intellectual work is an article consisting of text, tables, formulas and figures. The inclusion of audio or video components is still an exception. The coding scheme of such documents may vary (plain ASCII text, SGML, TeX, etc.).
However, a prerequisite for safeguarding and metering documents is an effective form of article identification. This identification should be as widely available as the ISBN, ISSN, or ISRN for books, serials, or reports. The issue of a Universal Article Identification (UAI) is currently under discussion.

Independently of the article identification, a standard is required for the information to be included in the document header. A list proposed by Michael Jensen and discussed by David Worlock [6] consists of the following items:

ISBN, ISSN, or other identifier

The above list describes the final version of a publication. However, a scientific publication is usually created over a period ranging from a few weeks to one or two years and passes through several document versions with essentially the same content. The author may send the initial version of the paper to a preprint server, the paper is revised during the review process, and even the last version released by the author might differ from the article distributed by the publisher. Besides the copyright issue, the technical problem of identifying the status or version of a document is not resolved. Versioning techniques are well established for software products and technical product descriptions, but not in the field of electronic documents.

What are the threats and vulnerabilities of the Internet?

The Internet was designed using "open system" techniques, with UNIX as the predominant operating system. It is based on the TCP/IP protocol and is therefore subject to the vulnerabilities inherent in this protocol. Unfortunately, the expansion of the Internet has been accompanied by the emergence of individuals whose primary purpose is to access systems and restricted information and, in some cases, to damage data or systems. Additional threats range from computer viruses to equipment failures.

What legal constraints and technological means are available for preserving data securely and for preventing its disclosure, unauthorized modification, or illicit distribution?

Long term preservation of electronic documents

In his article "Ensuring the longevity of digital documents", Rothenberg [7] describes a fictional scenario in which his grandchildren find a CD-ROM and a letter dated 1995 while exploring the attic of his house in the year 2045. Of course, they are unable to read the CD, which is supposed to contain the key to their grandfather's fortune. This story is intended to illustrate the problems of reading electronic documents after a long period of time.

One aspect of preservation is the physical lifetime of the storage media, which varies from a few years for magnetic tapes or disks to several decades for optical disks. Another limitation is the time until a storage medium becomes obsolete: most personal computers sold nowadays have no drive to read the 5.25 inch floppies that appeared on the market some 10 years ago. However, neither the fragility nor the lifetime of digital media constitutes the main problem in the conservation of data. The bit stream retrieved from a medium has to be interpreted, and most files contain information that is interpretable solely by the software that created them. It is almost impossible to read a multimedia file without the appropriate software tool.

The preservation of data requires the availability of a complete instrumental chain: the storage medium itself, a device able to read it, a hardware and operating system platform supporting that device, and the application software needed to interpret the stored format.

Optical media like CD-ROM and WORM are not rewritable and are therefore secure against unauthorized alteration of their content. The expected lifetime of traditional mass-produced CDs, as well as of gold-plated writable CD-ROMs, is estimated at more than 100 years under optimal storage conditions.
The lifetime of magnetic media like DAT (Digital Audio Tape) or DLT (Digital Linear Tape) is shorter; the tapes need to be rewound periodically and recopied after a few years.

The choice of the hardware and system platform should be based on widespread technologies and standards. The data written on the storage medium should be readable by most common computer and operating system platforms.

The question of which format should be used for long-term archiving of documents is essential for publishers, libraries and other archiving institutions.

The most appropriate formats for long-term archiving are SGML (Standard Generalized Markup Language), PostScript and the Adobe Acrobat format. If logical links between different parts of a document have to be preserved, HTML can be used.

SGML is able to describe the structure of a document in detail and is an adequate base format from which various products can be derived. Unfortunately, there are no popular tools for converting documents from other formats into SGML. Nevertheless, publishers are increasingly using SGML for in-house production.

PostScript can be produced by most word processors and is commonly used for printing. One advantage is that PostScript is an ASCII-based description format. On the other hand, storing a document in PostScript requires a lot of space.

The Acrobat format, also called PDF (Portable Document Format), recently introduced by Adobe, can be generated from PostScript. The original layout of the document is preserved, but the files require only a fraction of the space of the corresponding PostScript files. Another advantage is that free readers are available for the common platforms (PC, Macintosh, and UNIX).

None of the formats mentioned above (except Acrobat, to some extent) is capable of handling hypertext documents such as those available through the World Wide Web. It is still an open question how to preserve documents in which parts of the content are given by links to other documents residing on computers spread all over the world, given that most of these links will be obsolete in 10 or more years.

A project on the long-term preservation of electronic material at the Norwegian National Library [8] came to the following conclusions:

Writable CD-ROMs are used as the storage medium (ISO 9660).
Originally unformatted texts are preserved unformatted (usually using the ISO Latin 1 character set).
Formatted texts are converted to the Adobe Acrobat format and preserved together with the Acrobat readers for the various operating platforms.

One must continuously monitor whether the available computer hardware and operating system platforms can interface with the storage devices needed to read the preserved information. In addition, one must ensure that the preserved data formats are readable by existing computers. Whenever a format or a device is abandoned by its producer, the preserved information must be converted to a new standard.

The Commission on Preservation and Access and the Research Libraries Group have created a Task Force which issued a report [9] on the preservation of digital information. According to this report, repositories for archival functions should prove that they meet or exceed the standards and criteria of an independently administered program for archival certification. The report also calls for a critical fail-safe mechanism for certified archives.

Safeguards and protection techniques

In order to guarantee privacy and data integrity, both access to documents and their transmission have to be controlled.

Access control mechanisms

A primary condition for protecting documents is to run a secure site. The security of the site can be enhanced by using a firewall. The local area network of the site's computers should be placed inside the firewall, while the server that has to be accessed by external users should be placed outside the firewall:

LAN <------> FIREWALL <------> SERVER <------> OUTSIDE

This is called a "sacrificial lamb" configuration. There is a risk that the server will be broken into, but this will not affect the security of the inner network.

In order to control access to confidential documents on the server, the following access restrictions can be used:

Restriction by IP address, subnet, or domain

Only browsers connecting from defined IP (Internet) addresses, subnets or domains can access individual documents or directories. This is not a very secure method, because there are ways for a hacker to forge an IP address. To be safe, IP address restriction should be combined with an identification of the user.
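To illustrate the idea, the following minimal sketch (in Python, used here purely for illustration) checks a client address against an allowed subnet; the subnet 192.0.2.0/24 and the addresses are hypothetical examples:

    import ipaddress

    # Hypothetical campus subnet allowed to access the documents.
    ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

    def is_allowed(client_ip):
        """Return True if the client address falls in an allowed subnet."""
        address = ipaddress.ip_address(client_ip)
        return any(address in network for network in ALLOWED_NETWORKS)

    print(is_allowed("192.0.2.17"))    # True  - inside the subnet
    print(is_allowed("203.0.113.9"))   # False - outside the subnet

Since the client address can be forged, such a check should be only one layer of the defence.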

Restriction by user login ID and password

A restriction by user name and password can also be applied at the document or directory level. One problem is that the same password is sent over the network every time a document is accessed, making it vulnerable to interception by hackers.
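The weakness is easy to see: in the HTTP Basic authentication scheme, the user name and password are merely base64-encoded, not encrypted, so anyone intercepting the request can recover them. A minimal demonstration (the credentials are of course invented):

    import base64

    # What the browser sends in the Authorization header (Basic scheme):
    credentials = base64.b64encode(b"alice:secret").decode()
    print(credentials)                     # 'YWxpY2U6c2VjcmV0'

    # An eavesdropper reverses the encoding trivially:
    print(base64.b64decode(credentials))   # b'alice:secret'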

Encryption techniques

Encryption is probably the most effective method of ensuring the privacy and integrity of documents transmitted through networks. It can be augmented by authentication methods that ensure the identity of the sender (digital signature) and that the document received matches the document sent (hashing).

Encryption techniques use "keys" to control access to encrypted data. Symmetric encryption uses a single private key to both encrypt and decrypt information. Public key schemes are a form of asymmetric cryptography in which a mathematical algorithm is used to generate two related keys for each individual user: the public key (which can be disclosed) is used to encrypt, and the private key, which has to remain secret, is used to decrypt.

Private key schemes

The DES (Data Encryption Standard) was developed and adopted as a standard by the US government in 1977. It uses a relatively short 56-bit key and is therefore no longer considered very secure. The DES system was originally created for electronic funds transfer. One of the main disadvantages of private key systems is that prior to the transmission of documents, a relationship between the user and the source has to be established that guarantees a secure way to exchange the key.
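The principle of a single shared key is easy to demonstrate. The following sketch uses the symmetric recipe (Fernet) from the third-party Python package cryptography as a modern stand-in for DES; note that the key distribution problem described above remains, since both parties must hold the same key:

    from cryptography.fernet import Fernet

    # The same secret key must be held by sender and recipient,
    # which is exactly the key distribution problem noted above.
    key = Fernet.generate_key()
    cipher = Fernet(key)

    token = cipher.encrypt(b"confidential document")
    assert cipher.decrypt(token) == b"confidential document"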

Public key schemes

The first public key systems were described in 1976. The idea of public key schemes is that no secret key has to be exchanged and that anyone who has the public key of a third party can send that party encrypted information and be assured of privacy and integrity. One of the most popular systems is PGP (Pretty Good Privacy), developed by Zimmermann [10]. Another system, which is likely to become a standard for commercial applications, is the RSA system [11]. The idea is that a public key is used to encrypt the data, which can only be decrypted by the holder of the associated private key. For publishers, this system offers the advantage that account-holding libraries can be equipped with private keys.
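A minimal sketch of the public key idea, again using the Python cryptography package (RSA with OAEP padding); the message bytes are merely illustrative:

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # The recipient generates the key pair and publishes only the public key.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    # Anyone can encrypt with the public key ...
    ciphertext = public_key.encrypt(b"for the key holder only", oaep)

    # ... but only the holder of the private key can decrypt.
    assert private_key.decrypt(ciphertext, oaep) == b"for the key holder only"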

Most national security services have strong concerns about encryption, and in some countries (e.g. France) it is illegal to encrypt information using private keys. The US policy is to allow network transmissions to be closed to third parties, but they should remain open to the government.

Encryption as described above has to be applied to entire documents and is sometimes impractical or costly. Hash functions can be employed to protect documents against alteration. The document itself remains unencrypted, but a small string (of at least 160 bits) produced by the hashing algorithm is added to the document. This string can include a digital signature in encrypted form. Like a handwritten signature, the digital signature is unique to the electronic document it signs. If the document is altered or tampered with in some way, the signature will not verify at the user's site.
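The effect of hashing is easily demonstrated. The sketch below computes a 160-bit SHA-1 fingerprint of a document; changing even a single character of the document yields a completely different fingerprint:

    import hashlib

    document = b"final version of the article"
    digest = hashlib.sha1(document).hexdigest()    # 160-bit fingerprint

    # Even a one-character change produces an entirely different digest.
    tampered = b"Final version of the article"
    assert hashlib.sha1(tampered).hexdigest() != digest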

The National Institute of Standards and Technology (NIST) has developed the Digital Signature Standard (DSS), based on the Digital Signature Algorithm (DSA). At IBM Almaden, Dwork and Naor [14] have developed a digital signature scheme which is claimed to be unbreakable by any malicious attack.
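A sketch of signing and verification with DSA, again using the Python cryptography package (this illustrates the algorithm family, not the specific schemes cited above; the document bytes are invented):

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import dsa

    private_key = dsa.generate_private_key(key_size=2048)
    public_key = private_key.public_key()

    document = b"signed article"
    signature = private_key.sign(document, hashes.SHA256())

    # Verification fails if the document or the signature was altered.
    try:
        public_key.verify(signature, b"signed article (altered)", hashes.SHA256())
    except InvalidSignature:
        print("document was tampered with")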

Authentication systems

The process of verifying a user's identity is called authentication. In traditional systems the user's identity is verified by checking a password typed in at login time. As mentioned earlier, this is not a secure method for computer networks, because passwords can be intercepted. More secure authentication systems use encryption and also provide confidentiality.

Kerberos, developed in Project Athena at MIT, is an authentication service that allows a process (a client) running on behalf of a user to prove its identity to a verifier (a server) without sending data across the network that might allow an attacker to impersonate the principal.

The detailed protocol is described by Neuman and Ts'o [12], the developers of Kerberos.
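The full Kerberos protocol is beyond the scope of this paper, but its core idea, proving knowledge of a secret without transmitting it, can be sketched as a simple challenge-response exchange (an illustration of the principle, not of the Kerberos protocol itself):

    import hashlib
    import hmac
    import os

    # A secret established out of band, e.g. derived from the user's password.
    secret = b"shared secret key"

    # The server issues a fresh random challenge ...
    challenge = os.urandom(16)

    # ... and the client answers with a keyed hash of it;
    # the secret itself never crosses the network.
    response = hmac.new(secret, challenge, hashlib.sha256).digest()

    # The server computes the same value and compares in constant time.
    expected = hmac.new(secret, challenge, hashlib.sha256).digest()
    assert hmac.compare_digest(response, expected)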

However, one should not forget that encryption only protects the document between the vendor and the key-holding recipient. Once received and decrypted, the document can be printed, copied or retransmitted to third parties.

Techniques against illicit document dissemination

In order to discourage the illicit distribution of a document, it can be marked, in a way that is imperceptible to the user, with a unique codeword that identifies the registered user to whom the document was sent. If a suspected copy of the document is found, the codeword can be decoded and the registered owner identified.
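As a simple illustration of the principle, the following sketch hides a per-user binary codeword in the inter-word spacing of a text, a crude variant of the coding methods discussed below:

    import re

    def embed(text, codeword):
        """Encode each codeword bit as one (0) or two (1) spaces between words."""
        words = text.split()
        assert len(words) > len(codeword)
        gaps = [" " * (1 + int(bit)) for bit in codeword]
        gaps += [" "] * (len(words) - 1 - len(codeword))
        return words[0] + "".join(g + w for g, w in zip(gaps, words[1:]))

    def extract(marked, nbits):
        """Recover the codeword from the spacing of a suspected copy."""
        gaps = re.findall(r" +", marked)
        return "".join("1" if len(g) == 2 else "0" for g in gaps[:nbits])

    marked = embed("the quick brown fox jumps over the lazy dog", "1011")
    assert extract(marked, 4) == "1011"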

Different coding methods are proposed by Brassil [13], together with experimental results. The techniques are:

Line-shift coding (text lines are shifted vertically by a small, imperceptible amount)
Word-shift coding (words are shifted horizontally within a line)
Feature coding (individual character features, such as the height of letter ascenders, are altered)