
Designing submission datasets using Define-XML

May 28, 2014 12:00:00 AM




Kevin Burges



Discover Define-XML.

By Mark Wheeldon and Kevin Burges.

Previously, Define.pdf was used as a submission deliverable to enable FDA reviewers to understand the content of submitted proprietary datasets. Define-XML was introduced by CDISC as a machine-readable alternative, and this has replaced Define.pdf as a mandatory part of an eCTD-based FDA submission. Define-XML is still often seen as an afterthought, only produced when datasets are ready to be submitted.



Since its inception in 2000, the Clinical Data Interchange Standards Consortium (CDISC) has developed and supported globally-adopted data standards to improve clinical trial efficiency. Clinical data standards are now recognized as playing a vital role in the entire, end-to-end clinical trial process, as well as introducing operational and time savings in the development of new drugs.


Note

One of the most widely used standards today is Define-XML. Based on a recent poll, 71% of those currently using CDISC have implemented Define-XML.


This paper explores alternative uses of Define-XML, from study set-up, through the production of CDISC Study Data Tabulation Model (SDTM) datasets in study conduct and analysis, and on to submission. In addition, the new capabilities of Define-XML 2.0 will be explored.


Define-XML: The myth and the reality

It is commonly thought that Define-XML is simply a way to document what datasets look like: the names and labels of datasets and variables, what terminology is used, and so on. Define-XML does this in great detail, but treating this dataset metadata purely as a dataset descriptor overlooks its true potential.

More progressive uses of Define-XML focus on designing datasets upfront, alongside CRFs. Used this way, Define-XML can optimize the end-to-end clinical trial process in the following ways:

  • Establish Libraries and Templates
    • Re-use of all dataset designs (SDTM, proprietary EDC, Analysis (ADaM)) end-to-end
    • Storage of any type of dataset design
    • Standards governance of dataset designs
    • Dataset mappings and presentation of Define.pdf can also be re-used
  • Study Set-up and Build
    • Streamline design from protocol to submission
    • Findings, interventions, and events core variables facilitate the design of new CRF content.
    • Configuration of data transformation tools using machine-readable Define-XML
  • Study Conduct and Analysis
    • Define-XML metadata-driven dataset conversions
    • Initiation and automation of clinical data warehouse load
    • Automated validation of study designs versus standards and delivered datasets against upfront dataset designs
  • Submission
    • Auto-generation of human-readable Define.pdf/Define.html with a table of contents, bookmarking, and annotated CRF hyperlinks.



Streamlining study set-up and CRF design



Figure 1. Streamlining study set-up and CRF design

Designing submission datasets using Define-XML and SDTM at the start of a study (see Figure 1, “Target Define.xml/Target SDTM spec”) aids form creation and makes it simple to check that CRFs contain all the information required to generate the datasets.

SDTM gives all the Identifier, Topic, Qualifier, and Timing variables which can all be used when designing the form (see Figure 1, “ODM/CRF and eCRFs”).

SDTM datasets designed upfront aid annotation of CRFs with SDTM variables (see Figure 2 below). This has the additional benefit of providing a basic mapping between the forms and the datasets. CDISC also permits Define-XML to be extended, which allows additional metadata to be stored, such as complex dataset mappings (e.g. how data from two sources may be merged into a single dataset). We call this extension “mapping.xml”; it is discussed in more detail later.
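The coverage check described above can be illustrated with a minimal Python sketch. The variable names, the form field names, and the `uncovered_variables` helper are all hypothetical; in practice the target variables would come from the target Define-XML and the annotations from the annotated CRF metadata.

```python
# Hypothetical target design: SDTM variables the VS dataset is expected to contain.
target_variables = {"VSTESTCD", "VSORRES", "VSORRESU", "VSPOS", "VSDTC"}

# Hypothetical CRF annotations: form field name -> SDTM variable it feeds.
crf_annotations = {
    "SYSBP_RESULT": "VSORRES",
    "SYSBP_UNIT": "VSORRESU",
    "POSITION": "VSPOS",
    "VISIT_DATE": "VSDTC",
}


def uncovered_variables(targets, annotations):
    """Return the target variables that no CRF field is annotated with."""
    return sorted(set(targets) - set(annotations.values()))


missing = uncovered_variables(target_variables, crf_annotations)
print(missing)  # ['VSTESTCD']
```

A variable reported here is either missing from the form or has to be derived or assigned during dataset creation, so the list is a prompt for review rather than an error report.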



Figure 2. SDTM annotated CRF



Facilitating EDC data conversions

Define-XML is, in fact, not limited to just describing CDISC SDTM and ADaM dataset structures. EDC systems export their own proprietary dataset formats which can be described using the Define-XML model. With the right tools, a Define-XML describing the EDC export datasets can be automatically generated from CRFs/eCRFs (see Figure 1, “Source Define.xml”). This can then be displayed in a friendly HTML or PDF format (Figure 1, “Source Proprietary Spec”) allowing visibility, at the start of a study, of the datasets that will be delivered by the EDC system.

The Source Proprietary dataset spec enables upfront mapping of EDC datasets to SDTM datasets. These mappings can be described (and made machine-executable) using “mapping.xml”, and human-readable SDTM mapping specifications can be produced from it automatically, aiding review and approval of mappings.

In addition, mapping.xml provides a machine-executable format that can be processed by data transformation code to enable the automatic conversion of datasets in SAS®, Informatica® or other commercially available tools.

Mapping.xml is not limited to just mapping EDC to SDTM datasets but can also be used to convert data in the CDISC Operational Data Model (ODM) to SDTM and also to convert SDTM datasets into data marts for loading into a clinical data warehouse, for example.
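The idea of metadata-driven conversion can be sketched in a few lines of Python. This is not the mapping.xml format, which is not shown in this paper; the `mapping` dictionary, the source column names, and the `STUDY01` prefix are assumptions made purely for illustration.

```python
# Hypothetical mapping metadata: target SDTM variable -> (source column, transform).
# This stands in for whatever a real mapping file would contain.
mapping = {
    "USUBJID": ("SUBJECT", lambda value: f"STUDY01-{value}"),
    "VSTESTCD": ("TEST", str.upper),
    "VSORRES": ("RESULT", str),
}


def convert(rows, mapping):
    """Build target rows by applying each mapping rule to the source row."""
    return [
        {target: transform(row[source]) for target, (source, transform) in mapping.items()}
        for row in rows
    ]


# Hypothetical EDC export rows.
edc_rows = [{"SUBJECT": "001", "TEST": "sysbp", "RESULT": 120}]
sdtm_rows = convert(edc_rows, mapping)
print(sdtm_rows)
```

The conversion code itself contains no study-specific logic; changing the mapping metadata changes the output, which is what makes the approach re-usable from study to study.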



Figure 3. Data flows from data capture to datasets and how CDISC metadata fits in.


The diagram above shows the flow of data from data capture through to CDISC datasets, and the part CDISC metadata plays: CDISC ODM in designing data capture forms, and Define-XML in designing destination datasets. All of this vendor-neutral metadata can form the basis of form and dataset libraries which can be re-used from study to study.



Creating and re-using dataset libraries

Define-XML is the perfect vehicle to store libraries of datasets (EDC, SDTM, ADaM, other), mappings, page links to CRF variables, and so on for re-use from one study to the next.

We have illustrated that a metadata-driven approach using Define-XML can optimize a single study from set-up to submission, but creating libraries of metadata that can be re-used will make future studies even more efficient.

Libraries of data acquisition forms, proprietary EDC datasets, SDTM datasets, ADaM, and dataset mappings ensure that only new content has to be created for each study and that only these new objects must be validated from study to study.

Figure 4 below shows the effect of re-use on only one part of the end-to-end clinical trial, database build. After building a library based on 5 studies, around 70% of the metadata for a study can be re-used from the library, rising to around 85% after 35 studies.

Our clients have reported 70% re-use end-to-end using libraries in this manner.



Figure 4. Standardization and Re-use.




Automating dataset validation

Another major advantage of defining datasets upfront is that, with a prospective definition of the intended datasets, it becomes possible to machine-validate study dataset designs for conformance to external standards, and to check that populated datasets match the original specifications. Data quality and submission compliance are built in upfront, with less reliance on downstream validation.

Now it is possible to automatically perform the following validation tasks:

1. Comparing Study Dataset Designs and Controlled Terminology to External and Internal standards

First and foremost when designing SDTM datasets and creating controlled terms, it is imperative that these comply with the latest and/or chosen version of SDTM or National Cancer Institute Controlled Terminology (NCI CT). During the dataset design phase, automatic comparisons and compliance checks can be made with the appropriate version of CDISC SDTM, ADaM, and NCI-CT.

In addition, companies are moving towards developing their own domains that comply with CDISC SDTM but whose content falls outside the standard safety domains; for example, specialist findings domains may be required for a particular therapeutic area. The same situation also occurs for controlled terms, as the NCI-CT coverage of controlled terminology is still quite small. Again, companies can compare study dataset designs against their own data standards (stored in Define-XML) to check for differences and either accept or reject them accordingly.

2. Comparing “As Specified” Study Dataset Specifications against “As Delivered” Study Dataset Designs

Increasingly, studies are outsourced to Contract Research Organizations (CROs), and this leads to an increased sponsor burden in two areas: (a) upfront specification of deliverables and (b) downstream validation of those deliverables.

Figure 1 shows the creation of a machine-executable target SDTM Define-XML specification and a corresponding human-readable target SDTM specification (in HTML, PDF, Word or Excel) which would be given to a CRO to describe what would be expected in delivered datasets.

Now, when CROs return the datasets they should also provide study dataset metadata (an “as delivered” Define-XML). With both “as specified” and “as delivered” study dataset metadata available, it is easy to compare the study dataset metadata to verify that the “as delivered” dataset actually matches what was specified.
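A comparison of this kind could look roughly like the following Python sketch. It uses heavily trimmed fragments rather than complete Define-XML files, compares only `DataType` and `Length`, and the OIDs and the `compare` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"


def item_defs(define_xml):
    """Map each ItemDef OID to the attributes being compared."""
    root = ET.fromstring(define_xml)
    return {
        item.get("OID"): (item.get("DataType"), item.get("Length"))
        for item in root.iter(f"{ODM}ItemDef")
    }


def compare(specified_xml, delivered_xml):
    """Report variables that are missing, unexpected, or changed on delivery."""
    spec, deliv = item_defs(specified_xml), item_defs(delivered_xml)
    return {
        "missing": sorted(spec.keys() - deliv.keys()),
        "unexpected": sorted(deliv.keys() - spec.keys()),
        "changed": sorted(oid for oid in spec.keys() & deliv.keys() if spec[oid] != deliv[oid]),
    }


# Trimmed, hypothetical fragments standing in for full Define-XML documents.
TEMPLATE = '<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">{}</ODM>'
specified = TEMPLATE.format(
    '<ItemDef OID="IT.VS.VSORRES" Name="VSORRES" DataType="text" Length="20"/>'
    '<ItemDef OID="IT.VS.VSPOS" Name="VSPOS" DataType="text" Length="10"/>'
)
delivered = TEMPLATE.format(
    '<ItemDef OID="IT.VS.VSORRES" Name="VSORRES" DataType="text" Length="40"/>'
    '<ItemDef OID="IT.VS.VSLOC" Name="VSLOC" DataType="text" Length="10"/>'
)

report = compare(specified, delivered)
print(report)
```

A real comparison would also need to cover datasets, code lists, value level metadata, and so on, and would match on something more robust than OIDs if the two files were produced by different systems.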

3. Comparing Dataset Data to Dataset Metadata
Again, having upfront target SDTM Define-XML available allows automated comparison of delivered datasets against study dataset metadata, either “as specified” or “as delivered”. Comparing data to “as specified” Define-XML verifies that the data matches what was originally intended/specified, whilst comparing data to “as delivered” Define-XML ensures that the data matches the dataset definition. This is important as it will ultimately be this “as delivered” Define-XML that will be submitted to the FDA.
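As a rough illustration, checking data against dataset metadata might be sketched as follows in Python. The `metadata` dictionary is a hypothetical stand-in for values read from a Define-XML file, and only data type and text length are checked.

```python
# Hypothetical variable metadata, as it might be read from ItemDef elements.
metadata = {
    "VSSEQ": {"DataType": "integer", "Length": 8},
    "VSTESTCD": {"DataType": "text", "Length": 8},
    "VSSTRESN": {"DataType": "float", "Length": 8},
}


def conforms(value, meta):
    """Check one value against its declared data type (and length, for text)."""
    if value is None or value == "":
        return True  # missing values are not checked in this sketch
    data_type = meta["DataType"]
    if data_type == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if data_type == "float":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, str) and len(value) <= meta["Length"]


def check_rows(rows, metadata):
    """Return (row index, variable, problem) for every value that fails a check."""
    issues = []
    for index, row in enumerate(rows):
        for variable, value in row.items():
            if variable not in metadata:
                issues.append((index, variable, "not in metadata"))
            elif not conforms(value, metadata[variable]):
                issues.append((index, variable, "value does not match metadata"))
    return issues


rows = [
    {"VSSEQ": 1, "VSTESTCD": "SYSBP", "VSSTRESN": 120.0},
    {"VSSEQ": "2", "VSTESTCD": "DIASTOLICBP", "VSSTRESN": 80.0},
]
issues = check_rows(rows, metadata)
print(issues)
```

Running the same check twice, once with the “as specified” metadata and once with the “as delivered” metadata, gives the two comparisons described above.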



Define-XML 2.0: What’s changed?



Figure 6. Changes in Define-XML 2.0

Define-XML 2.0 introduces many improvements over the original, which are summarized in Figure 6. These changes are focused around removing ambiguity, improving value level metadata, linking to CDISC/NCI Controlled Terminology standards, and linking to annotated CRFs.



Define-XML: Technical deep dive


Value level metadata re-imagined

Value Level Metadata is needed where values in a table column may have different metadata depending on the row they are in. For example, the content of VSORRES might be DataType="integer" for one value of VSTESTCD, but DataType="float" for another.

Define-XML 1.0 supports this by providing a Value List that defines the content of VSORRES for each test code. However, it does this “by convention”, with no clear, unambiguous way to know exactly what the Value List is defining. It also doesn’t describe how Value Lists can be used to provide Value Level Metadata for multiple columns, for example, where VSORRES and VSPOS both have different definitions, depending on the test code.

Different organizations interpreted the specification in different ways and ended up with incompatible implementations. Define-XML 2.0 removes all this ambiguity by allowing Value Lists to be provided explicitly for each variable in the dataset. This allows full description of the metadata for any value in any variable.


Where clauses

There is a new mechanism to describe the conditions under which each value definition is applicable. Where Clauses define a condition, such as “Where VSTESTCD=SYSBP”. These conditions are linked to the Value Level Metadata so that it is unambiguously known when each definition applies. Compound conditions can be used such as “Where VSTESTCD=SYSBP and VSPOS=SITTING” (see Figure 7).



Figure 7. Compound Where Clause in Define-XML 2.0


These were the most requested features for Define-XML 2.0, and also provide support for ADaM parameter-level metadata.
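To make the compound condition concrete, the Python sketch below evaluates a Where Clause against a single data row. The XML fragment is an illustrative approximation of Define-XML 2.0 markup and should be checked against the specification; the OIDs and the `applies` helper are hypothetical, and only the `EQ` and `IN` comparators are handled.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}
DEF_ITEM_OID = "{http://www.cdisc.org/ns/def/v2.0}ItemOID"

# Illustrative fragment: "Where VSTESTCD=SYSBP and VSPOS=SITTING".
WHERE_CLAUSE = """
<def:WhereClauseDef xmlns="http://www.cdisc.org/ns/odm/v1.3"
                    xmlns:def="http://www.cdisc.org/ns/def/v2.0"
                    OID="WC.VS.SYSBP.SITTING">
  <RangeCheck Comparator="EQ" SoftHard="Soft" def:ItemOID="IT.VS.VSTESTCD">
    <CheckValue>SYSBP</CheckValue>
  </RangeCheck>
  <RangeCheck Comparator="EQ" SoftHard="Soft" def:ItemOID="IT.VS.VSPOS">
    <CheckValue>SITTING</CheckValue>
  </RangeCheck>
</def:WhereClauseDef>
"""

# Hypothetical lookup from item OID to variable name (normally taken from ItemDef).
OID_TO_VARIABLE = {"IT.VS.VSTESTCD": "VSTESTCD", "IT.VS.VSPOS": "VSPOS"}


def applies(where_clause_xml, row, oid_to_variable):
    """Return True when every RangeCheck in the clause holds for the row."""
    clause = ET.fromstring(where_clause_xml)
    for check in clause.findall("odm:RangeCheck", NS):
        comparator = check.get("Comparator")
        if comparator not in ("EQ", "IN"):
            raise ValueError(f"Comparator {comparator} is not handled in this sketch")
        variable = oid_to_variable[check.get(DEF_ITEM_OID)]
        allowed = {value.text for value in check.findall("odm:CheckValue", NS)}
        if row.get(variable) not in allowed:
            return False
    return True


sitting = {"VSTESTCD": "SYSBP", "VSPOS": "SITTING", "VSORRES": "120"}
standing = {"VSTESTCD": "SYSBP", "VSPOS": "STANDING", "VSORRES": "125"}
print(applies(WHERE_CLAUSE, sitting, OID_TO_VARIABLE))   # True
print(applies(WHERE_CLAUSE, standing, OID_TO_VARIABLE))  # False
```

All checks in a clause must hold, which is how a compound "and" condition is expressed in this sketch.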



The Value List mechanism is focused around variables, grouping together the definitions of all the possible values for a given variable. Using the information provided by the new Where Clause mechanism, Value Level Metadata can be displayed in a transposed format that shows how a particular “Slice” of a dataset looks. Rather than seeing how a specific variable looks for each condition, a Slice shows how the whole dataset looks for a given condition.

Slices and Value Lists are simply two different ways of looking at the same underlying metadata. Figure 8 shows Value Level Metadata represented as Value Lists, specifically an example set of Values for the VSORRES and VSORRESU Variables.



Figure 8. Value Level Metadata represented as Value Lists


Viewing this metadata as a Slice presents what the entire Domain looks like for a given condition, as shown in Figure 9.



Figure 9. Value Level Metadata represented as a Slice
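The relationship between the two views can be shown with a short Python sketch that transposes one into the other. The conditions, data types, and lengths below are made-up values for illustration only.

```python
# Hypothetical value level metadata in "Value List" form:
# variable -> condition -> definition.
value_lists = {
    "VSORRES": {
        "VSTESTCD=SYSBP": {"DataType": "integer", "Length": 3},
        "VSTESTCD=TEMP": {"DataType": "float", "Length": 5},
    },
    "VSORRESU": {
        "VSTESTCD=SYSBP": {"DataType": "text", "Length": 4},
        "VSTESTCD=TEMP": {"DataType": "text", "Length": 1},
    },
}


def to_slices(value_lists):
    """Transpose variable -> condition -> definition into condition -> variable -> definition."""
    slices = {}
    for variable, definitions in value_lists.items():
        for condition, definition in definitions.items():
            slices.setdefault(condition, {})[variable] = definition
    return slices


slices = to_slices(value_lists)
for condition, variables in slices.items():
    print(condition, variables)
```

No metadata is added or lost in the transposition, which is the sense in which Slices and Value Lists are two views of the same thing.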


Enumerated items

In Define-XML 1.0, Controlled Terminology definitions had to have both a coded value and a decode given. In most cases, these were the same, as SDTM uses controlled lists of values rather than codes and decodes. Define-XML 2.0 leverages the new enumerated items mechanism in ODM 1.3.1, so Controlled Terminology can be defined simply as a list of allowable values if there is no code/decode relationship. Figure 10 shows how this is used to describe a Severity Code List.



Figure 10. Use of Enumerated Items
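The difference between the two styles of code list can be sketched in Python as follows. The XML is an illustrative, trimmed approximation (element and attribute names should be checked against the ODM and Define-XML specifications), and the OIDs and the `allowed_values` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Illustrative fragment: one enumerated list and one code/decode list.
CODE_LISTS = """
<MetaDataVersion xmlns="http://www.cdisc.org/ns/odm/v1.3" OID="MDV.EXAMPLE" Name="Example">
  <CodeList OID="CL.SEVERITY" Name="Severity" DataType="text">
    <EnumeratedItem CodedValue="MILD"/>
    <EnumeratedItem CodedValue="MODERATE"/>
    <EnumeratedItem CodedValue="SEVERE"/>
  </CodeList>
  <CodeList OID="CL.NY" Name="No Yes Response" DataType="text">
    <CodeListItem CodedValue="N">
      <Decode><TranslatedText xml:lang="en">No</TranslatedText></Decode>
    </CodeListItem>
    <CodeListItem CodedValue="Y">
      <Decode><TranslatedText xml:lang="en">Yes</TranslatedText></Decode>
    </CodeListItem>
  </CodeList>
</MetaDataVersion>
"""


def allowed_values(code_list):
    """Return {coded value: decode or None} for either style of code list."""
    values = {
        item.get("CodedValue"): None
        for item in code_list.findall("odm:EnumeratedItem", NS)
    }
    for item in code_list.findall("odm:CodeListItem", NS):
        decode = item.findtext("odm:Decode/odm:TranslatedText", namespaces=NS)
        values[item.get("CodedValue")] = decode
    return values


root = ET.fromstring(CODE_LISTS)
code_lists = {cl.get("OID"): allowed_values(cl) for cl in root.findall("odm:CodeList", NS)}
print(code_lists)
```

An enumerated list carries only the allowable values, so a consumer no longer has to deal with decodes that merely repeat the coded value.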


Standardized controlled terminology

Define-XML 2.0 allows linking of code lists and even individual codes to the published CDISC and NCI Controlled Terminology standards using Aliases. It also allows codes to be flagged as “extended values” where a sponsor has added additional codes to an extensible code list from those Controlled Terminology standards.

Figure 11 shows how this is used to reference the SCTESTCD codes both at the Code List and the Code List Item level by adding an Alias that points at the standard “C” codes.



Figure 11. Referencing NCI Controlled Terminology
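A reader of such metadata could pick out the standard codes and the sponsor extensions along the following lines. This Python sketch is illustrative only: the `nci:ExtCodeID` alias context and the `def:ExtendedValue` attribute are written from our understanding of Define-XML 2.0 and should be verified against the specification, and the coded values and "C" codes are deliberate placeholders, not real NCI codes.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}
EXTENDED = "{http://www.cdisc.org/ns/def/v2.0}ExtendedValue"

# Illustrative fragment; coded values and C-codes are placeholders.
CODE_LIST = """
<CodeList xmlns="http://www.cdisc.org/ns/odm/v1.3"
          xmlns:def="http://www.cdisc.org/ns/def/v2.0"
          OID="CL.SCTESTCD" Name="Subject Characteristic Code" DataType="text">
  <EnumeratedItem CodedValue="STANDARD_TERM">
    <Alias Context="nci:ExtCodeID" Name="C_PLACEHOLDER_TERM"/>
  </EnumeratedItem>
  <EnumeratedItem CodedValue="SPONSOR_TERM" def:ExtendedValue="Yes"/>
  <Alias Context="nci:ExtCodeID" Name="C_PLACEHOLDER_LIST"/>
</CodeList>
"""


def nci_code(element):
    """Return the code from the element's own NCI alias, if it has one."""
    for alias in element.findall("odm:Alias", NS):
        if alias.get("Context") == "nci:ExtCodeID":
            return alias.get("Name")
    return None


def summarize(code_list_xml):
    """Collect the code list level code plus code and extension flag per term."""
    code_list = ET.fromstring(code_list_xml)
    terms = {
        item.get("CodedValue"): {
            "nci_code": nci_code(item),
            "extended": item.get(EXTENDED) == "Yes",
        }
        for item in code_list.findall("odm:EnumeratedItem", NS)
    }
    return {"nci_code": nci_code(code_list), "terms": terms}


summary = summarize(CODE_LIST)
print(summary)
```

Terms flagged as extended have no standard code to point at, which makes them easy to list for review against the published terminology.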


Enhanced data types and data type guidance

The latest release of Define-XML introduces a richer set of data types and defines how these should be used in relation to the SAS Char and Num data types that are used in SDTM. This allows for better specification of the expected data and, as a result, the possibility of better checking of the data against those data types.
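As a minimal sketch of what this enables, the Python below assumes a simplified rule in which the `integer` and `float` data types are held as SAS Num and every other data type as SAS Char; the exact guidance should be taken from the Define-XML 2.0 specification. The variable list and both helper functions are hypothetical, and only three data types are checked.

```python
from datetime import date


def expected_sas_type(define_data_type):
    """Assumed rule: integer and float are stored as Num, everything else as Char."""
    return "Num" if define_data_type in {"integer", "float"} else "Char"


def value_matches(value, define_data_type):
    """Rough check of a character value against a few data types; others always pass."""
    try:
        if define_data_type == "integer":
            int(value)
        elif define_data_type == "float":
            float(value)
        elif define_data_type == "date":
            date.fromisoformat(value)  # complete ISO 8601 dates only
    except ValueError:
        return False
    return True


# Hypothetical variables and their declared Define-XML data types.
variables = {"VSSEQ": "integer", "VSSTRESN": "float", "VSDTC": "datetime", "VSORRES": "text"}
sas_types = {name: expected_sas_type(data_type) for name, data_type in variables.items()}
print(sas_types)
print(value_matches("2014-05-28", "date"), value_matches("28/05/2014", "date"))
```

The point is that a richer declared data type lets a checker go beyond "is it Char or Num" and test the content of the value itself.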


Enhanced linking

Define-XML 2.0 introduces the ability to link to a specific page or pages in a document. This is used in several places:

  • Origin can now link to the specific page in an annotated CRF that a variable was collected on, as shown in Figure 12
  • Comments can now link to sections in supplemental documents that provide information about variables
  • Methods can now link to sections in supplemental documents that describe the derivation of a value, as shown in Figure 13.



Figure 12. Linking to pages in an annotated CRF



Figure 13. Linking to sections in a supplemental document


A similar mechanism can be used to link ADaM variables to their predecessors as shown in Figure 14.



Figure 14. Linking to ADaM predecessor variables
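The page-level links described in this section could be turned into clickable hyperlinks along the following lines. The Python sketch below is illustrative: the `def:DocumentRef` and `def:PDFPageRef` markup is written from our understanding of Define-XML 2.0 and should be checked against the specification, the leaf ID and file name are hypothetical, and only physical page references are handled.

```python
import xml.etree.ElementTree as ET

NS = {"def": "http://www.cdisc.org/ns/def/v2.0"}

# Illustrative fragment: a variable collected on pages 12 and 13 of the annotated CRF.
ORIGIN = """
<def:Origin xmlns:def="http://www.cdisc.org/ns/def/v2.0" Type="CRF">
  <def:DocumentRef leafID="LF.ACRF">
    <def:PDFPageRef PageRefs="12 13" Type="PhysicalRef"/>
  </def:DocumentRef>
</def:Origin>
"""

# Hypothetical lookup from leaf ID to file location (normally taken from the leaf definitions).
LEAF_HREFS = {"LF.ACRF": "acrf.pdf"}


def page_links(origin_xml, leaf_hrefs):
    """Return one PDF link per referenced physical page."""
    origin = ET.fromstring(origin_xml)
    links = []
    for document in origin.findall("def:DocumentRef", NS):
        href = leaf_hrefs[document.get("leafID")]
        for reference in document.findall("def:PDFPageRef", NS):
            if reference.get("Type") == "PhysicalRef":
                pages = reference.get("PageRefs", "").split()
                links.extend(f"{href}#page={page}" for page in pages)
    return links


links = page_links(ORIGIN, LEAF_HREFS)
print(links)  # ['acrf.pdf#page=12', 'acrf.pdf#page=13']
```

The same approach would apply to document references on Comments and Methods, since they point at documents in the same way.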


Enhanced derivations

The old Define-XML 1.0 method of specifying derivations has been replaced by an improved implementation taken from the ODM 1.3 standard. This enables a variable that appears in multiple datasets to have a different derivation for each dataset it is present in.


Clarification of how to specify split domains

Define-XML 1.0 was released before split domains were introduced in SDTM 1.2, and so it does not define how the various properties of a domain should be used to specify both the core domain code (e.g., “QS”) and the extended split domain code (e.g., “QSCG”). Define-XML 2.0 now properly defines how this information should be specified.
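As an illustration, the Python sketch below assumes that the core domain code is carried in the `Domain` attribute of a dataset definition and the split dataset code in its `Name`; this reading should be confirmed against the Define-XML 2.0 specification. The fragment is trimmed, and the `QSMM` dataset and all OIDs are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

# Illustrative, trimmed fragment: two split QS datasets and one ordinary dataset.
DATASETS = """
<MetaDataVersion xmlns="http://www.cdisc.org/ns/odm/v1.3" OID="MDV.EXAMPLE" Name="Example">
  <ItemGroupDef OID="IG.QSCG" Name="QSCG" Domain="QS" SASDatasetName="QSCG" Repeating="Yes"/>
  <ItemGroupDef OID="IG.QSMM" Name="QSMM" Domain="QS" SASDatasetName="QSMM" Repeating="Yes"/>
  <ItemGroupDef OID="IG.VS" Name="VS" Domain="VS" SASDatasetName="VS" Repeating="Yes"/>
</MetaDataVersion>
"""


def datasets_by_domain(metadata_xml):
    """Group dataset names under the core domain code they belong to."""
    root = ET.fromstring(metadata_xml)
    grouped = {}
    for dataset in root.findall("odm:ItemGroupDef", NS):
        grouped.setdefault(dataset.get("Domain"), []).append(dataset.get("Name"))
    return grouped


grouped = datasets_by_domain(DATASETS)
print(grouped)  # {'QS': ['QSCG', 'QSMM'], 'VS': ['VS']}
```

With both codes available explicitly, a tool can treat the split datasets as parts of one domain without having to infer that from naming patterns.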


Clarification of extending Define-XML

Due to ambiguity in Define-XML 1.0, there have been varying opinions on what the core Define-XML model includes, whether other parts of the underlying ODM model can be used, and what extensions to the model can be used for. Define-XML 2.0 states clearly that:

  • Anything not defined in the specification is considered an extension, even if it is part of the underlying ODM model
  • Use of extensions from the underlying ODM model is not prohibited; however, they have no meaning with regard to the standard, and their meaning must be agreed between the sender and receiver of the metadata
  • Extensions that duplicate functionality in the core Define-XML 2.0 model are not allowed – this is to ensure all users apply the same mechanism for all functionality defined in the model
  • Extensions cannot fundamentally change the meaning of the model, i.e. if all extensions are removed, the metadata essentially must have the same meaning as it did with the extensions present.

These clarifications were added to prevent fragmented implementations of Define-XML and as such, allow applications to be confident of the meaning of a piece of Define-XML metadata.
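The rule that removing all extensions must leave the meaning intact suggests a simple test: strip the extensions and confirm the core metadata is still there. The Python sketch below is a simplification, in that it treats only content from foreign XML namespaces as an extension and does not detect unused parts of the underlying ODM model. The `ext` namespace, its element and attribute, and the `strip_extensions` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces treated as core in this sketch (ODM, Define-XML, XLink, xml:lang).
CORE_NAMESPACES = {
    "http://www.cdisc.org/ns/odm/v1.3",
    "http://www.cdisc.org/ns/def/v2.0",
    "http://www.w3.org/1999/xlink",
    "http://www.w3.org/XML/1998/namespace",
}

# Illustrative fragment with a hypothetical vendor extension element and attribute.
ITEM_DEF = """
<ItemDef xmlns="http://www.cdisc.org/ns/odm/v1.3"
         xmlns:def="http://www.cdisc.org/ns/def/v2.0"
         xmlns:ext="http://example.com/ns/mapping"
         OID="IT.VS.VSORRES" Name="VSORRES" DataType="text" Length="20"
         ext:Reviewed="Yes">
  <Description>
    <TranslatedText xml:lang="en">Result or Finding in Original Units</TranslatedText>
  </Description>
  <ext:Mapping Source="EDC.VITALS.RESULT"/>
  <def:Origin Type="CRF"/>
</ItemDef>
"""


def namespace_of(qualified_name):
    """Return the namespace URI of an ElementTree tag or attribute name, or None."""
    if qualified_name.startswith("{"):
        return qualified_name[1:].split("}", 1)[0]
    return None


def strip_extensions(element):
    """Remove, in place, attributes and child elements from non-core namespaces."""
    for name in list(element.attrib):
        namespace = namespace_of(name)
        if namespace is not None and namespace not in CORE_NAMESPACES:
            del element.attrib[name]
    for child in list(element):
        if namespace_of(child.tag) in CORE_NAMESPACES:
            strip_extensions(child)
        else:
            element.remove(child)
    return element


stripped = strip_extensions(ET.fromstring(ITEM_DEF))
print([child.tag for child in stripped])
```

After stripping, the variable is still fully described by its core attributes, description, and origin, which is what the last rule above requires.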


Defining a model, not a view

The Define-XML 1.0 specification was intended to define the model that describes a set of datasets; however, due to the way it was presented, it was commonly interpreted to define how dataset metadata should be displayed for viewing.

Define-XML 2.0 tries to make it clear that it is the model that is being defined, not how it should be displayed. Define-XML 2.0 includes a stylesheet that demonstrates how the dataset metadata can be displayed; however, this display format is not part of the standard, and implementers are free to display the dataset metadata in any way that is suitable for the receiver.



Define-XML 2.0 is based on, and largely similar to, the Define-XML 1.0 model; however, it is not completely backward-compatible with it. Compatibility has been sacrificed in order to produce a cleaner, less ambiguous model. For example, using a value list on the --TESTCD variable to define the contents of the --ORRES or other variable is no longer permitted; a value list now always describes the variable that references it. To provide Value Level Metadata for multiple variables, simply attach a value list to each. Existing Define-XML 1.0 files will require updating to make them Define-XML 2.0 compliant. This updating process is fairly simple and can be automated.

Due to the ambiguities in Define-XML 1.0, it would not be possible to provide a single upgrade routine that would correctly upgrade all files from all systems, so it is left to system implementers to provide upgrade routines for their implementation of Define-XML 1.0 if they feel this is required.



Define-XML should not be thought of as just a submission deliverable, but as a CDISC model that helps optimize the whole end-to-end clinical trial process. It can be used to establish dataset libraries that promote study-to-study re-use, as well as driving efficiencies through expedited study set-up and streamlined dataset conversions in study conduct and analysis. Using Define-XML at the start of a new study design makes it possible to machine-validate dataset deliverables, so that data quality and submission compliance are built in, with less reliance on downstream validation.

Define-XML 2.0 provides a substantially enhanced and more robust mechanism for describing dataset metadata by allowing full specification of value and parameter level metadata for any variable, and improves interoperability and machine readability by removing ambiguity. This will lead to increased opportunities for automation and as such, drive further efficiencies in the study process.



Mark Wheeldon, Kevin Burges “Discover Define-XML”, March 14, 2013, www.formedix.com/.
Schering-Plough, “Effect of Standards Libraries and Re-use in CDMS design”, Dublin, May 2003.
Stephane Auger, Danone Research, September 2013
CDISC. “Define-XML 2.0”. 2013-03-05. Available at http://cdisc.org/define-xml.
CDISC. “ODM”. Available at http://www.cdisc.org/odm



The authors would like to acknowledge the help of all of the Formedix team in preparing this paper and clients who use Define-XML upfront in their studies today.


Recommended reading
  • CDISC Define-XML Specification and Implementation Guides
  • CDISC SDTM Standard and Implementation Guide.


SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

