A Method to Import, Process, and Export Arbitrary XML™ Files with SAS®

Content Provider	Semantic Scholar
Author	Palmer, Michael Collins
Copyright Year	2001
Abstract	The release of the FDA- and industry-supported CDISC XML clinical data standard will make SAS's XML capabilities increasingly important. XML-formatted data are text with the hierarchical, tagged structure of markup languages, not the row-and-column style of SAS datasets. SAS has some experimental XML capabilities, but it cannot import, process, and export arbitrary XML vocabularies as XML. This paper presents a method to import, process, and export arbitrary, data-centric XML files. The method consists of an indexing algorithm, construction of a canonical instance, and processing of populated instances. Populated instances are put into a canonical form, and the algorithm is applied. The resulting indexed instance is pruned to just the records that contain data and stored as a SAS dataset with the index as a key, shorn of the markup structure that is hard to work with in SAS. The pruned instance can be mapped into SAS datasets and processed with the index, preserving the XML structure.

INTRODUCTION

Historically, the processing of clinical data from initial collection to submission documents has depended on proprietary or ad hoc standards and specifications. New and continuing forces in the pharmaceutical industry, including mergers, acquisitions, outsourcing, use of central clinical laboratories, international regulatory harmonization, and the FDA's mandate to accept electronic submissions, have focused attention, sometimes unwanted, on the time, labor, and expense required by the historical practices. Recently, the pharmaceutical industry, in the form of CDISC (Clinical Data Interchange Standards Consortium, Inc.), has embraced XML as an enabling technology to help streamline and improve the industry's historical practices. The bottom line for those of us working with SAS is that we're going to be faced with the necessity of importing, processing, and exporting XML.
LEGACY PRACTICES

SAS does not have the capability to import, process as XML, and export arbitrary XML-formatted data. Beginning with version 8, SAS has limited capabilities to import and export XML using an XML engine in the LIBNAME statement. ODS offers some XML export capabilities for SAS output, but not for data per se. DATA step programming, particularly its text processing capabilities, can be used because XML files are always just plain text. On the XML side, several approaches exist for database programming and could be evaluated for implementation in SAS (see reference 1).

NEW TECHNOLOGY

The technology presented in this paper translates an instance of XML-formatted data into a flat file representation, which can be processed alone or with other similarly translated data and then translated back to valid and well-formed XML of the same vocabulary. This work has been developed as part of Zurich Biostatistics' Tekoa Technology.

XML (eXtensible Markup Language) is a non-proprietary, platform-independent meta-language for hierarchically structuring information. XML vocabularies, generally defined by Document Type Definitions (DTDs) or schemas (see reference 2), use the meta-language to define elements and the rules on how to use them. An XML instance is a text file of data conforming to a specific vocabulary. XML is extensible in the sense that anyone with an understanding of the rules of XML can make up their own XML vocabulary. XML is a markup language because it provides the means to describe documents of data. The contents of XML documents can go into a database with relative ease, or the content of a database can go into XML documents.

The World Wide Web Consortium (W3C) guided the creation of XML in a joint effort of many industry and academic contributors. The XML standard was released in February 1998. The W3C is a non-profit consortium made up primarily of corporate and other organizations with an interest in the development of the World Wide Web.
The technology discussed in this paper has been implemented in part for the OASIS table model in Tekoa Technology table automation software (see reference 3). This paper is a generalization of that earlier work. It falls into the "elements as fields" strategy discussed by Quin (reference 1). The method described here is essentially a variation of the "Edge" approach described by Tian et al. (see reference 4). A significant difference is that they consider the case of DTD-less XML fragments, whereas we consider the case of DTD-defined XML and attempt to exploit the existence of the DTD with our canonical document.

THE NEED

Data-centric XML (see reference 5) is driving standards development in pharma and many other industries. Each standard is a separate and distinct vocabulary, often rich enough to support complex data like clinical studies but leading to tremendous variability in file structure when compared to legacy flat files. SAS professionals will increasingly be called upon to import, export, and process XML-formatted data that conform to these emerging standards. The technology discussed in this paper is a general way to handle these vocabularies.

IMPORT OF XML

When compared to the traditional row-and-column structure, XML's hierarchical, markup language file structure guided by DTDs looks structureless. Data arrive in a text stream of named values. The hierarchical relationships between the named values must be ascertained by examining the markup language that accompanies the data itself. Translating this to the typical SAS dataset requires converting the hierarchical relationships and the data to a row-column structure.

PROCESSING XML-BASED DATA

The hierarchical relationships among data elements in XML-formatted data must be used to identify data types, such as blood pressure data for patient x on study day y.
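To make the point about identifying a value by its place in the hierarchy concrete, the following is a minimal sketch in Python (not the SAS implementation the paper describes). The vocabulary (`study`, `patient`, `visit`, `sbp`, `dbp`) and the function name `flatten` are invented for illustration and do not come from the paper or from any CDISC standard. Each row pairs a value with the chain of ancestors that gives it meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary, invented for illustration only.
XML = """<study>
  <patient id="x">
    <visit day="1"><sbp>120</sbp><dbp>80</dbp></visit>
    <visit day="2"><sbp>118</sbp><dbp>79</dbp></visit>
  </patient>
</study>"""

def flatten(elem, ancestors=()):
    """Yield one (path, text) row per element that carries text content."""
    # Label the element with its tag plus its attributes in sorted order.
    label = elem.tag + "".join(
        f"[@{k}={v}]" for k, v in sorted(elem.attrib.items())
    )
    path = ancestors + (label,)
    text = (elem.text or "").strip()
    if text:
        yield "/".join(path), text
    for child in elem:
        yield from flatten(child, path)

rows = list(flatten(ET.fromstring(XML)))
for path, value in rows:
    print(path, value)
```

Each row is now in row-column form, but the "key" is a whole path rather than a single field, which is the difficulty the following paragraphs address.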
In practical terms, programs cannot simply, for example, merge datasets to bring together demographic and efficacy data, because the keys for the merge exist in a tree, not in records in a flat file. Whatever processing is done must preserve the hierarchical relationships so that the data can be exported as XML (that is, round tripping). This means that combining data from different sources requires a common, instance-independent way of defining the hierarchy.

EXPORT OF XML

The preservation of the hierarchical relationships during processing makes it possible to export valid and well-formed XML documents.

IMPLEMENTATION

The initial, limited implementation of Tekoa Technology for the OASIS table model was completely in base SAS version 8.

COMPONENTS

In the terminology of Tekoa Technology, a populated instance is translated into a pruned instance, and vice versa. The translation is mediated by a particular representation of the DTD. In Tekoa Technology terminology, this representation is called the canonical document.

CANONICAL DOCUMENT

A canonical document is, in effect, an empty record for a data-centric DTD. Like an empty record for a row-column file, it defines a data structure for import, export, and processing. The canonical document exists as XML. In the canonical document, each element type and attribute in the DTD, or in a well-defined subset of the DTD, appears in the hierarchy that the DTD defines. To make the representation of attributes canonical, they appear in lexicographic order. A canonical document may be built for a subset of a DTD as long as the resulting instances are valid for the DTD. A canonical document includes a multi-field index for each element and attribute, and the index reflects the element's or attribute's place in the data hierarchy.
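The lexicographic ordering of attributes mentioned above could be sketched as follows. This is an illustrative Python fragment, not the paper's SAS code; the element and attribute names and the helper `sort_attributes` are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

def sort_attributes(elem):
    """Reorder the attributes of elem and all descendants by attribute name."""
    for node in elem.iter():
        items = sorted(node.attrib.items())
        node.attrib.clear()
        node.attrib.update(items)  # dicts keep insertion order
    return elem

# Hypothetical element with attributes in arbitrary order.
doc = ET.fromstring('<entry valign="top" colname="c1" align="left">text</entry>')
canonical = ET.tostring(sort_attributes(doc), encoding="unicode")
print(canonical)
```

Fixing the attribute order this way means two documents that differ only in how their attributes were written end up with the same layout, which is what any position-based index needs.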
Among the implementation choices not discussed here but important in any implementation are wrapping content in dummy elements, converting null element types (empty elements in XML terminology) to their equivalent non-null form, and converting attributes to elements. Just as an empty record serves as a data layout template for flat files, a canonical document is a template for XML data files. Every element and attribute in the canonical document must appear in every populated instance of the canonical document, even if it has no content or value in the instance, i.e., is null. This is necessary for the indexing to be invariant across instances. This invariance must be maintained so that data from different instances with the same index will be of the same type. As mentioned above, this approach has been implemented in a table automation project for the OASIS table model, and it should work with a broader class of data-centric DTDs.

POPULATED INSTANCE

A populated instance is simply a canonical document populated with data where the data exist, and null where they do not.

PRUNED INSTANCE

The pruned instance is a flat file representation of the populated instance. One transforms a populated instance into a pruned instance by removing the markup language, leaving the index and content, and adding a document order counter for each indexed element of content. Since the index is often shorter than the markup language it replaces, this can result in substantially smaller files than the original XML.

INDEXING ALGORITHM

The indexing algorithm is a way to represent a nested data structure with numbers. To generate the index conceptually, the canonical document is laid out with each element indented if it is nested in the previous element and otherwise not indented. Attributes are treated as if they are indented in the element that contains them.
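The indented layout just described can be produced mechanically. The sketch below is in Python for illustration only; the canonical document shown is a hypothetical one invented for this example, and `layout` is not a name from the paper. Attributes are listed one level deeper than the element that carries them, in lexicographic order.

```python
import xml.etree.ElementTree as ET

# Hypothetical canonical document: every element and attribute present, all null.
CANONICAL = (
    '<study><patient id=""><visit day=""><sbp/><dbp/></visit></patient></study>'
)

def layout(elem, depth=0):
    """Yield (depth, name) for an element, its attributes, then its children."""
    yield depth, elem.tag
    for name in sorted(elem.attrib):
        yield depth + 1, "@" + name  # attributes are indented inside their element
    for child in elem:
        yield from layout(child, depth + 1)

lines = list(layout(ET.fromstring(CANONICAL)))
for depth, name in lines:
    print("  " * depth + name)
```

The deepest indentation level reached (here 3, counting from 0) determines how many fields the index needs.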
The index has one field per indentation level, and the value in a field at any point is the number of elements that have occurred in that field since the last element that contained the field. The index for an element is the concatenation of all the fields. A null value for a field occurs when an element contains elements corresponding to that field, or is at the same indentation level as an element that contains elements corresponding to that field. The relationship between any two records with the same canonical document can be determined by comparing their index values. This is
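The index and the pruning step can be illustrated with a short sketch. This is one possible reading of the description above, written in Python rather than SAS, and it should not be taken as the author's implementation. It assumes that the field at an item's own indentation level is a counter that restarts under each new parent, that shallower fields carry the ancestors' counters, and that deeper fields are null (`None`). The instance, and the names `walk`, `index_records`, `prune`, and `contains`, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical populated instance; <dbp/> is present but null.
POPULATED = (
    '<study><patient id="x"><visit day="1">'
    "<sbp>120</sbp><dbp/></visit></patient></study>"
)

def walk(elem, depth=0):
    """Yield (depth, name, value) in document order; attributes sit one level down."""
    yield depth, elem.tag, (elem.text or "").strip()
    for name in sorted(elem.attrib):
        yield depth + 1, "@" + name, elem.attrib[name]
    for child in elem:
        yield from walk(child, depth + 1)

def index_records(root):
    """Attach a multi-field index (one field per indentation level) to each item."""
    nodes = list(walk(root))
    width = max(depth for depth, _, _ in nodes) + 1
    counters = [0] * width
    records = []
    for depth, name, value in nodes:
        counters[depth] += 1
        for deeper in range(depth + 1, width):
            counters[deeper] = 0  # restart counts below this item
        index = tuple(counters[i] if i <= depth else None for i in range(width))
        records.append((index, name, value))
    return records

def prune(records):
    """Drop records without content and add a document order counter."""
    kept = [r for r in records if r[2] != ""]
    return [(n, index, value) for n, (index, _, value) in enumerate(kept, start=1)]

def contains(a, b):
    """True if the item indexed a is an ancestor of the item indexed b."""
    pa = [f for f in a if f is not None]
    pb = [f for f in b if f is not None]
    return len(pa) < len(pb) and pb[: len(pa)] == pa

records = index_records(ET.fromstring(POPULATED))
pruned = prune(records)
for row in pruned:
    print(row)
```

Under these assumptions the markup disappears from the pruned rows, yet comparing two index values is enough to tell whether one item is nested inside another, for example `contains((1, 1, 2, None), (1, 1, 2, 2))` for `visit` and `sbp`.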
File Format	PDF HTM / HTML
Alternate Webpage(s)	https://www.lexjansen.com/pharmasug/2001/Proceed/AppDev/ad09_palmer.pdf
Language	English
Access Restriction	Open
Content Type	Text
Resource Type	Article