Copyright on terminology and similar data

(contributed by Christian Galinski, Infoterm, Vienna)

  1. Today's Copyright Problems in General
  2. Copyright for IKR (Information and Knowledge Representation) "Units"
  3. Preparation of New Entries/Records
  4. Copyright Situation - Analysis
  5. Recommendations
  6. References


This summary report and its recommendations are based on a number of documents prepared by high-level expert groups as well as on the experience gained since 1988 with organizing the preparation of the "Guide to Terminology Agreements" in close cooperation with a number of international organisations of the UN system and many pertinent institutions/organisations as well as with organizing or coorganizing the

The report focusses primarily on the copyright aspect, and only secondarily on other IPRs and neighbouring rights, although any of them can - under certain circumstances and for certain aspects or types of data - also apply to terminology. The contents of the report also apply to data similar to terminology, such as documentation languages (e.g. documentation thesauri) and - to a minor extent - bibliographic and factual data, if expressed largely by terminology or textual data.

Emphasis is further laid on legal, technical and ethical aspects as well as on solutions to immediate practical problems rather than on wider dimensions. Such wider dimensions include the protection of intellectual property and the respective intellectual property rights (IPRs) as a major socio-economic and R&D-political issue of the future European Information Infrastructure, not to mention a major area of conflict between "information rich" and "information poor" countries on a global scale, as implied in the slogan "free flow of information".

  1. Today's copyright problems in general

    By means of digitalization any representation of information and knowledge (IKR), whether simple or highly complex,

    • is reduced to a harmonized form, viz. a string of 0s and 1s,
    • in which it can be stored on a variety of data carriers for use and re-use.
    All data, therefore, inevitably undergo - and in fact have to undergo - conversion and processing for various purposes during and after input. In order to be displayed in intelligible form, the data again have to pass a number of conversion processes.

    Today traditional "analogue copying" is the exception rather than the rule; most of today's copying processes require conversion into a digitalized representation from the outset. The very concept of "copy" has therefore undergone substantial changes. The means (or elements of these means, viz. computer hardware and software) for converting and processing data are themselves also subject to various kinds of IPRs. This digitalization, accompanied by the gradual convergence of information and communication technologies (ICTs), is turning the hitherto largely static and linear IKRs into dynamic (e.g. by applying hypermedia and multimedia) and spatial (multi-dimensional) representations.

    Increasingly unrestricted possibilities to manipulate (viz. modify, convert or transform) IKRs make it difficult to distinguish clearly between the "original" (i.e. a "work" meeting the legal requirements with regard to "originality" constituting copyright) and its offspring. Moreover, the transfer of information via wide-area information networks (WANs) - and the more so via future information super-highways - linking together information producers, re-users (i.e. modifiers, converters, transformers etc.) and users creates a global socio-economic situation in which information ultimately becomes a "raw material". This raw material can be further processed - in principle freely - into a fully marketable commodity on the one hand, and into value-added products and services on the other hand (including hitherto unknown kinds of exploitation).

    The subject matter of the international protection of literary and artistic property, better known as copyright, covers works represented in the form of words, music, pictures, three-dimensional objects, or combinations thereof. Practically all national copyright laws provide for the protection among others of:

    • literary works (also extended to translations),
    • musical works,
    • artistic works,
    • maps and technical drawings,
    • photographic works,
    • audiovisual works.
    Some even protect works of applied art, choreographic works etc., while others also regard sound recordings, broadcasts and computer programs as works. Terminological data can - at least in principle - comprise any kind of such representations and works. Because of advances in the field of ICTs, discrepancies and inconsistencies have emerged within most national legislations, in addition to the existing ones between the legislations and jurisdictions of different countries.

    IPRs are closely linked to the "originator" or "creator" of a "work" ("author" in the case of copyright) who "owns" the respective IPR as soon as the work is created or registered. IPRs can be - and as a rule are - transferred to an "exploiter" for commercial or non-commercial exploitation. Users, too, have certain rights (e.g. citation right) and obligations (e.g. to pay fees).

  2. Copyright for IKR "units"

    Specialized information and knowledge (including terminology) can be represented by a variety of linguistic and non-linguistic symbols. In addition there are different IKR levels, such as the basic level of conceptual knowledge (represented by terms etc.) and higher levels for propositional knowledge, sets of propositions, theories etc. expressed by texts, formulae etc. IKRs form IKR units, of which those of the higher levels are as a rule covered by copyright anyhow. Any such IKR unit (e.g. when contained in a database) can be decomposed into smaller and less complex units or elements. In this connection the "smallest unit" that can constitute an IPR unit poses problems of definition, identification and law.

    A terminology database (TDB) is a peculiar kind of database for factual data on concepts (represented by linguistic symbols, such as terms, definitions etc., and non-linguistic symbols, such as graphical symbols, images, formulae etc.). Depending on the data model, TDBs (and other databases having similar characteristics from a copyright perspective) are composed of entries and/or records, which in turn are composed of fields. In TDBs fields and their values (data elements) can be linked in many ways, from a formal point of view as well as from a contents point of view (among others by hyperlinks). Entries/records, too, can and often must have formal or other links - sometimes across different files. In most cases the links have to be established in the course of recording, otherwise the complexity of conceptual knowledge is not sufficiently retained. These links, therefore, in a way also represent intellectual property.
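
    The composition described above (files containing entries/records, which contain fields, connected by links) can be sketched as a small data model. This is only an illustrative sketch, not the structure of any particular TDB: the class names, field names, relation names and sample values are all assumptions made for the example, and it assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """One field of an entry: a name, a value and an optional source code."""
    name: str          # e.g. "term", "definition" (hypothetical field names)
    value: str
    source: str = ""   # short code identifying where the value came from


@dataclass
class Link:
    """A typed link from one entry to another, possibly in a different file."""
    relation: str      # e.g. "broader", "see-also" (hypothetical relation names)
    target_file: str
    target_id: str


@dataclass
class Entry:
    """A record in one file of the database."""
    entry_id: str
    fields: list[Field] = field(default_factory=list)
    links: list[Link] = field(default_factory=list)

    def sources(self) -> set[str]:
        """Return the distinct source codes contributing to this entry."""
        return {f.source for f in self.fields if f.source}


# Invented sample entry whose fields stem from two different sources.
entry = Entry(
    entry_id="T-0001",
    fields=[
        Field("term", "data carrier", source="SRC-A"),
        Field("definition", "medium on which data can be stored", source="SRC-B"),
    ],
    links=[Link("broader", "terminology", "T-0002")],
)
print(sorted(entry.sources()))
```

    Keeping the links as explicit objects rather than as free text makes visible that they are recorded data in their own right, which is the point made above about links representing intellectual property.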

    Moreover, the fields of individual entries/records can be taken from different sources (which, therefore, belong to different copyright owners). Last but not least the research efforts carried out in order to find methods for creating new databases by "automatically" re-using individual data stemming from a multitude of existing records - possibly of different databases selected and transformed according to "intelligent" routines, thus creating "new knowledge" - will further complicate the copyright situation.


    The types of data that may - not today, but maybe in the future - qualify as IPR units in advanced TDBs are among others:

    • linguistic knowledge representations, such as
      • terms (incl. abbreviations, synonyms, nomenclature terms etc.),
      • thesaurus descriptors, class names of a classification scheme etc.,
      • definitions and other types of textual concept description,
      • statements (representing a proposition),
      • contexts/cotexts,
    • non-linguistic knowledge representations such as
      • formulae (e.g. in mathematics, chemistry etc.),
      • alphanumeric codes (or equivalent codes, such as bar-codes),
      • graphical symbols,
      • complex graphs, figures etc. (e.g. flow charts etc.),
      • images.
    In new multimedia encyclopedias animations or moving pictures, sound and other kinds of representation can be found, which sooner or later will also find their way into TDBs.

    TDBs as a rule consist of several different files, such as

    • terminology files (with entries/records whose data centre around the term and concept description as the most important elements to represent the concept in question),
    • bibliographic data files (with records containing data on "documents", such as author, title, publisher, abstract etc. to name only a few),
    • documentation language files (with records containing class names and/or descriptors etc. for indexing and retrieval purposes).
    Often different documentation languages are used for the indexing of terminological and bibliographical records.

    The documentation language entries are applied for indexing and retrieval purposes to both terminological data and bibliographic data. In addition there may be phraseology files (containing phraseological units together with contexts etc.), factual data files (containing numerical, graphical or textual information) etc. All files and entries/records also comprise data for administrative and other formal purposes, which are less subject to copyright, but indispensable for data management, identification, data security and other purposes not related to contents. Any field containing linguistic representations can as a rule occur with different language equivalents or translations (which may stem from different sources or originators) for the respective field contents. In the case of different writing systems such foreign language elements can also occur in transliterated or transcribed form.


    Bibliographic data published in hardcopy form are under copyright in the same way as any monograph or periodical. Individual entries can be used for quotation purposes and other kinds of individual use or re-use. In the case of high-level bibliographies with comprehensive entries also comprising keywords, descriptors and/or abstracts, individual entries may be subject to copyright. Lengthy extractions of entries are subject to copyright anyhow. The same applies to bibliographic data available or accessible in electronic form.

    Terminological data can be found among others in:

    • technical dictionaries (of all sorts and kinds, in monolingual and multilingual works),
    • terminology standards (or terminology chapters of subject standards),
    • specialized lexicons (with the same variety as dictionaries),
    • grey literature (e.g. as "dependent" parts of publications),
    • TDBs and electronic dictionaries.
    The situation is similar with regard to documentation language collections.

    According to existing copyright provisions and jurisdictional practice only the data collection as a whole is subject to copyright in some countries. If individual entries/records contain sentences representing full statements these could in principle be protected under copyright. This situation may radically change with new legislation at European level formulating a sui generis law for the protection of databases.


    Strictly speaking any selection and import/input of data from a given publication or database into a(nother) database is normally based on a different data model, which inevitably requires conversion processes, in other words "manipulation" of data. The original copyright is thus easily superseded by the new "original", although this may be largely derived from the original work.

    Scanning of data on hardcopy, which is still fairly close to copying, is possible without major problems only in cases where meticulously designed and unambiguous rules with respect to the representation and layout of data were strictly applied. It is already difficult with bibliographic data (which have a comparatively unsophisticated structure) and has so far proved next to impossible with terminological and lexicographical data, where in addition to the "visible" data much information, especially links between fields etc., is hidden (and more or less only "retrievable" via the human eye by the human brain). For quite some time to come scanning will, therefore, normally not be feasible for economic reasons (i.e. costs due to post-editing etc.).

  3. Preparation of new entries/records

    Given a scenario in which new terminological entries/records are prepared by a working group or commission in a systematic way (i.e. in the form of proper terminology work according to ISO 10241),
    • some of the data will be selected from a preparatory data collection comprising data from several sources,
    • some or most of the data will be adapted or essentially reformulated according to the concept system in question,
    • some terms will be recognized as inappropriate or even misleading and, therefore, be recoined.
    The collaborating experts provide their expertise in a joint effort for a common goal. Usually they are organized and coordinated by a secretariat, possibly having a mandate from an authority.

    In such a scenario, which is becoming common in many subject fields, a detailed and highly consistent methodology (at least for the group coordinator and database manager) needs to be designed, which allows the storage of a maximum of additional data (e.g. source, originator, dates etc.) until the final stage of a given project is reached. Then most of these data can be dropped according to defined rules. From then on these entries/records and the respective set of mutually related entries/records constitute new copyright.
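
    The idea of carrying working metadata through a project and dropping it at the final stage can be illustrated with a short sketch. The key names, the choice of which keys count as project-only, and the sample values are all invented for the example; real working rules would have to define them.

```python
# Hypothetical set of keys that exist only for the duration of the project.
WORKING_KEYS = {"originator", "date_entered", "working_notes"}


def finalize(record: dict) -> dict:
    """Return a copy of the record without the project-only metadata."""
    return {k: v for k, v in record.items() if k not in WORKING_KEYS}


# Invented working record: content fields plus project-only metadata.
working = {
    "term": "data carrier",
    "definition": "medium on which data can be stored",
    "source": "SRC-A",
    "originator": "expert 3",
    "date_entered": "1994-05-02",
    "working_notes": "definition reformulated to fit the concept system",
}

final = finalize(working)
print(final)
```

    In this sketch the source code is deliberately not treated as project-only, so it survives into the final record, while the working record itself is left unchanged and can be archived.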

    As a rule, the individual contributions of the experts will be amalgamated into a joint copyright, whose "owner" is the agency holding the secretariat or the authority from which the secretariat has received its mandate. Authoritative sources for some of the data (especially for standardized terms and definitions) should be identified by means of an unambiguous source code (which, if necessary, would refer the user to a more detailed source indication, e.g. a bibliographic record).
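
    One way to keep source codes unambiguous and resolvable to a fuller source indication is a small registry that refuses to assign the same code twice. The following is a minimal sketch under that assumption; the class name, the code format and the bibliographic data are hypothetical placeholders.

```python
class SourceRegistry:
    """Maps short source codes to fuller bibliographic records."""

    def __init__(self):
        self._records = {}

    def register(self, code: str, record: dict) -> None:
        # Refuse to reuse a code, so that every code stays unambiguous.
        if code in self._records:
            raise ValueError(f"source code already in use: {code}")
        self._records[code] = record

    def resolve(self, code: str) -> dict:
        """Return the bibliographic record behind a source code."""
        return self._records[code]


registry = SourceRegistry()
# Placeholder bibliographic record, not a real publication.
registry.register(
    "SRC-A",
    {"title": "Example terminology standard", "publisher": "Example body"},
)
print(registry.resolve("SRC-A")["title"])
```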

  4. Copyright Situation - Analysis

    Unless the whole data collection, or at least a large portion thereof, is imported into one's own database, terminological, documentation language and bibliographical records enjoy only a very low level of copyright protection. This situation might change with the new sui generis regulations for databases under consideration at EU level.

    Highly elaborate bibliographical records (including indexing terms and abstracts) are in some cases or countries copyright protected, in others not. It is rather difficult, however, to ensure and enforce this protection.

    Textual information in terminology and lexicography can be protected by copyright, if explicit or implicit statements are included. Definitions as a rule are not protected so far.

    If external data are input into a database, as a rule new copyright is generated due to changes of layout, a different data structure (incl. links within the record and between records), the need for additional information etc.

    If data from various sources are merged, even more changes, additional data and other human interventions are required. This further reduces the copyright protection of the original data.

    In cases of highly authoritative data, such as standardized terminology, the copyright situation is not clear at all. In some countries standards, including terminology standards, are not considered regular publications available through normal book market channels. From a formal point of view they can, therefore, often be considered as "grey literature".


    Manipulation of data in order to avoid copyright issues and the suppression of information on the source should be highly discouraged as unethical for several reasons:
    • the proper quoting of sources is (at least in the case of authoritative sources) an indicator for the high quality of data,
    • for one's own data management purposes as well as for the user of the data the identification of sources is not only convenient, but also a necessity (e.g. for checking and re-checking purposes),
    • any change in definitions or other descriptive parts of the record should be unambiguously marked as such in order not to mislead the user etc.
    Especially in the case of "sensitive" data, to which anyhow before long product liability regulations etc. will become applicable, manipulation and deprecation of data is unethical.

    If taken to the extreme the following additional features might become necessary in terminology management systems in the future:

    • indication of the source (in the form of a code as short as possible and nevertheless unambiguous in the respective data environment) of the contents of a field or of a combination of fields (if they constitute a meaningful whole together),
    • subdivision of fields into subfields may become necessary, if changes are carried out on individual portions of the field; these subdivisions might have to be supplemented by the same additional information as the field as a whole,
    • "inheritance" routines to attach certain information (record identifier, record or field source, date of preparation/revision etc.) to fields of the record in the course of merging, downloading etc.
    In this connection a minimum of rules (even for decision in case of conflicting principles) need to be drafted.
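
    The "inheritance" routine mentioned in the list above can be sketched as follows: before fields from several records are merged, record-level information is copied onto each field, unless the field already carries its own. This is a minimal illustration only; the class, the function names, the rule that field-level values take precedence, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FieldValue:
    name: str
    value: str
    source: Optional[str] = None      # field-level source code, if any
    record_id: Optional[str] = None   # filled in by inherit()
    revised: Optional[str] = None     # date of preparation/revision


def inherit(record_id, record_source, revised, fields):
    """Copy record-level provenance onto each field, keeping field-level values."""
    return [
        replace(
            f,
            source=f.source or record_source,
            record_id=f.record_id or record_id,
            revised=f.revised or revised,
        )
        for f in fields
    ]


def merge(*field_lists):
    """Concatenate fields from several records; provenance travels with each field."""
    return [f for fields in field_lists for f in fields]


# Invented sample data: two records from different sources.
a = inherit("R-17", "SRC-A", "1993-11", [FieldValue("term", "data carrier")])
b = inherit(
    "R-42", "SRC-B", "1994-02",
    [FieldValue("definition", "medium on which data can be stored", source="SRC-C")],
)
merged = merge(a, b)
for f in merged:
    print(f.name, f.source, f.record_id, f.revised)
```

    Letting an existing field-level source win over the record-level one is one possible rule for the conflicting-principles case; a real system would need to fix such rules explicitly.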


    Indication of the source in case of authoritative data is
    • a PR and marketing support to the originator of the data
    • a quality indicator of one's own data.
    There should be a mutual interest between those creating large TDBs and the originators of high-quality terminology data to cooperate without jeopardizing the cooperation by too many (financial and non-financial) conditions.

    In the case of re-use of terminological data of a lower degree of authoritativeness the originator of the data could be provided with feedback in the form of well documented changes (modifications, revisions, etc.) of all records to which this is applicable. Although the work on the respective data generates a new copyright, it would be useful, if some sort of acknowledgement is attached to the new records.

  5. Recommendations

    Collective/cooperative preparation of data
    Copyright ownership is reserved from the very beginning to the coordinator (coordinating institution/organisation) or agency which grants the mandate. An acknowledgement mentioning the names of the experts involved in the project could/should be placed in the publication and on the respective data file or sub-file.

    Selection and import of data
    If data are obtained (in book form or as a file or database), the selection and input/import of data for internal purposes (e.g. for starting a terminology project) is not subject to copyright. Prior agreement should nevertheless be considered for ethical reasons.

    Selection and input/import of dictionary data
    A copyright violation could be argued only if the whole collection of terminological (thesaurus or similar) data selected from a dictionary containing only terms (descriptors) and equivalents is input/imported into a structure and data model similar to the original. As soon as the data are substantially changed in the course of conversion and revision, new copyright is constituted.

    Selection and input/import of complex terminology entries/records
    Even if substantial changes are carried out in the course of the conversion necessitated by the input/import, a copyright violation could be argued. As soon as revision sets in, the original copyright is superseded by newly constituted copyright. For ethical and strategic reasons it is nevertheless advisable to acknowledge the source, especially in the case of high-quality data.

    Re-use of "authoritative" terminological data
    The re-use of authoritative data should be acknowledged in all cases. Although copyright might not apply, it is advisable to obtain agreement for using the data. This could be done by informing the respective authority, at the latest after the respective project has been completed.

    Textual data
    For the re-use of textual data, such as whole statements (in some cases also for formulae and other non-linguistic representations), permission must be obtained. Definitions taken over from (especially authoritative) sources should at least be identified by a source code.

    Documentation of use and re-use of data
    Any use and re-use of external data should be well documented even if distributed only in closed circles (such as a project team or working group). Everybody obtaining data for working on them should sign a form stating that he/she may not use the data for purposes outside of the project and may not pass them on without the explicit permission of the coordinator. A set of working rules drafted to this effect should be strictly observed.

    Application of the "Guide to Terminology Agreements"
    In any case in which a copyright violation might be argued, a simple agreement drafted on the basis of the "Guide to Terminology Agreements" could be proposed to the originator (or authorized exploiter of the copyright) from whom the data are taken.

    Acknowledgment for the re-use of standardized terminologies
    As the originators of standardized and other highly authoritative terminologies most probably will benefit as much from the re-use (and explicit acknowledgement of use) of their data, a free of charge re-use of data should be negotiated. The acknowledgement should for ethical and strategic reasons be mentioned even if the entries/records were altered substantially in the course of the terminology project.

    Working rules
    The above listed recommendations can (should) be reformulated in the course of practical experience into detailed working rules, in order to have a consistent approach and to save time by using past experience in equivalent or similar cases as well as to establish precedents.

