Unpacking the Complexities of Clinical Data Submission Standards
There is a storm brewing in the world of CDM (Clinical Data Management) and regulatory submissions. Since I came across the news that the FDA are considering Dataset-JSON as a new standard for submission, I have done a lot of research to distil the wealth of information at our fingertips and understand what the potential impact might be.
Summary of Research
The most important lessons I learned during my research are:
- Dataset-JSON is part of the CDISC ODM v2.0 standard
- SAS can already import JSON with the LIBNAME JSON engine
- FDA will conduct further testing before changing regulations
Current Submission Formats
Since the 1999 guidance for Providing Regulatory Submissions in Electronic Format, the FDA has mandated study data be submitted in SAS XPORT version 5 files.
Furthermore, according to the guidance Providing Regulatory Submissions In Electronic Format – Standardized Study Data, and the more detailed requirements in the Study Data Technical Conformance Guide, the following exchange format standards are supported by the FDA:
- Extensible Mark-up Language (XML)
- Portable Document Format (PDF)
- SAS Transport Format (XPORT)
XML is widely accepted as the de facto format for describing metadata about datasets for analysis and reporting to regulatory agencies and comes in the form of a Define-XML file. The Define-XML file is a critical component of clinical trial data packages such as SDTM (data from raw sources) and ADaM (data for analysis) datasets as defined by CDISC.
PDF is also accepted as a standard for providing documents in electronic format to the International Conference on Harmonisation (ICH). However, the problem with PDF is that data cannot be extracted and analysed very easily. On the plus-side, the format is useful for maintaining an archive of study data and can be stored in a hierarchical fashion to replicate the organisation of a clinical trial.
XPORT (XPT) is an open file format published by SAS Institute for the exchange of study data. XPT version 5 has been in use since 1989 and version 8 was introduced in 2012. It would appear that the FDA have not accepted v8 XPT files because there is not a compelling enough reason to do so. The following table summarises the few differences between the two versions, suggesting some reasons why the new version has not been widely adopted.
SAS Property | Version 5 | Version 8 |
Variable Names | Up to 8 characters | Up to 32 characters |
Character Variable Lengths | Up to 200 bytes | Up to 32,767 bytes |
Variable Labels | Up to 40 characters | Up to 256 characters |
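The limits in the table above lend themselves to a simple automated check. The following is a minimal sketch in Python, not an official validator: the `XPT_LIMITS` dictionary and the `check_variable` helper are hypothetical names introduced here for illustration, and the numbers are copied straight from the table.

```python
# Limits copied from the comparison table; all names here are illustrative only.
XPT_LIMITS = {
    5: {"name": 8, "label": 40, "char_bytes": 200},
    8: {"name": 32, "label": 256, "char_bytes": 32767},
}


def check_variable(name, label, char_length, version=5):
    """Return a list of human-readable limit violations for one variable."""
    limits = XPT_LIMITS[version]
    problems = []
    if len(name) > limits["name"]:
        problems.append(f"name '{name}' exceeds {limits['name']} characters")
    if len(label) > limits["label"]:
        problems.append(f"label for '{name}' exceeds {limits['label']} characters")
    if char_length > limits["char_bytes"]:
        problems.append(f"length of '{name}' exceeds {limits['char_bytes']} bytes")
    return problems


if __name__ == "__main__":
    # A name that fits version 5, and one that only fits version 8.
    print(check_variable("USUBJID", "Unique Subject Identifier", 40))
    print(check_variable("TREATMENT_ARM", "Planned Treatment Arm", 200))
```

A variable name such as TREATMENT_ARM fails the version 5 check on name length alone, which is why abbreviated names are so common in submitted datasets.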
Don’t forget FDA 21 CFR Part 11, which applies to anyone who maintains records or submits information to the FDA. It requires all datasets submitted in electronic format to provide an accurate and complete copy of the data, suitable for inspection, review, and copying.
Answering Some Questions
What are the Limitations of SAS XPORT v5?
In 2017 PHUSE released a whitepaper discussing suggestions for replacing the SAS version 5 transport format. It listed the following limitations that a new submission standard would need to overcome before being adopted by the FDA.
- Limited variable types means some data is incorrectly interpreted.
- Limited to US ASCII encoding which prevents use of double-byte characters used in many parts of the world.
- Field names, variable names, variable labels, and character widths are restricted which reduces how much information can be conveyed.
- Inefficient use of storage space, because fixed-width columns take up their full allocated width even when values are short or empty.
- Datasets cannot be compressed, which makes large files awkward to transfer and store, especially when the maximum file size is 5 gigabytes.
- Lacks a robust metadata layer which means external files like Define-XML must be kept synchronised.
- Only works for 2-dimensional data structures which means more complex structures such as nested objects, graphs, trees, and linked lists are not possible.
- The format cannot be extended which means it is less compatible with modern technology.
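The encoding limitation in the list above is easy to demonstrate. The following minimal Python sketch uses a hypothetical helper, `non_ascii_values`, and made-up site names to flag text that cannot be represented in US ASCII, and then shows that the same text survives a round trip through JSON.

```python
import json


def non_ascii_values(values):
    """Return the values that contain characters outside US ASCII."""
    return [v for v in values if not v.isascii()]


if __name__ == "__main__":
    # Made-up site names for illustration only.
    site_names = ["Boston General", "Universitätsklinikum Köln", "東京大学病院"]
    print(non_ascii_values(site_names))

    # JSON text is Unicode, so a round trip preserves every character.
    encoded = json.dumps(site_names, ensure_ascii=False)
    assert json.loads(encoded) == site_names
```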
Who is JSON?!
JSON (JavaScript Object Notation), pronounced "Jason", was derived from the object syntax of the JavaScript programming language, but it is a language-independent format that most programming languages can read and write. JSON is a lightweight format for storing and transporting data and is human-readable, which makes it easy to understand.
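To give a feel for how readable JSON is, here is a small Python sketch using the standard `json` module. The record is made up for illustration and is not taken from any real study or CDISC standard. It also shows nesting (a list of visits inside a subject), which a flat two-dimensional file cannot express directly.

```python
import json

# A made-up record for illustration; not taken from any real study.
record = {
    "subject": "001",
    "age": 54,
    "visits": [
        {"name": "Screening", "day": -7},
        {"name": "Baseline", "day": 1},
    ],
}

text = json.dumps(record, indent=2)  # serialise to human-readable text
restored = json.loads(text)          # parse the text back into Python objects

print(text)
```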
What is Dataset-JSON?
A dataset is the term used to describe a collection of structured data in a single file. According to CDISC, Dataset-JSON seeks to address the limitations of SAS V5 XPORT.
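To make the idea of a tabular dataset in JSON more concrete, the sketch below uses a deliberately simplified payload: column metadata followed by rows of values. This layout is an assumption made for illustration, and the helper name `rows_as_dicts` is hypothetical. The actual Dataset-JSON schema is defined by CDISC and contains more attributes, so treat this as a rough picture rather than a conformant file.

```python
import json

# Simplified, hypothetical payload: column metadata plus rows of values.
# The real Dataset-JSON schema is defined by CDISC and has more attributes.
payload = """
{
  "name": "DM",
  "label": "Demographics",
  "columns": [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
    {"name": "AGE", "label": "Age", "dataType": "integer"}
  ],
  "rows": [
    ["STUDY1-001", 54],
    ["STUDY1-002", 61]
  ]
}
"""


def rows_as_dicts(dataset):
    """Pair each row's values with the column names, in order."""
    names = [col["name"] for col in dataset["columns"]]
    return [dict(zip(names, row)) for row in dataset["rows"]]


dataset = json.loads(payload)
records = rows_as_dicts(dataset)
print(records)
```

Keeping the column descriptions in the same file as the values is what lets a format like this carry its own metadata, rather than relying entirely on a separate file that has to be kept in sync.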
What about Dataset-XML and Define-XML?
Dataset-XML was created by CDISC in 2014 as a replacement for SAS XPT V5 and removed many of the V5 transport file restrictions.
Dataset-XML and Define-XML are different but complementary standards; Define-XML metadata describes the Dataset-XML dataset content. Dataset-XML can represent any tabular dataset and supports all language encodings supported by XML.
What about ODM?
ODM (Operational Data Model) is another CDISC standard. It is used for exchanging and archiving clinical and translational research data, and has been implemented by many electronic data capture (EDC) tools. The latest version is 2.0, released in 2023, which added the option of transporting data in XML or JSON format. To be clear, ODM is not intended for submission to the FDA.
Will we still need to use SAS?
In short, yes.
SAS will continue to be used for statistical analysis before submitting data to the FDA.
It is expected that clinical technology companies will provide XPT files alongside JSON in the transition period between JSON being accepted and XPT files being retired as the required format for submissions. The good news is that SAS can already import JSON for review using the LIBNAME JSON engine. Once the data has been reviewed satisfactorily in SAS, it is anticipated that the same data will be submitted to the FDA.
Fitting it all Together
As a clinical data manager, you might be asking: how does this impact me? The following points are a good place to start.
- First, you need to be aware of these potential changes coming down the line. I recommend monitoring this FDA Data Standards page.
- Second, the FDA will provide plenty of notice before mandating a new standard. The Study Data for Submission to CDER and CBER page of the FDA website says at least one year’s notice will be given before a new version of a standard is required.
- Third, Dataset-XML was not accepted, so it is possible that Dataset-JSON will not be the next standard adopted by the FDA for submissions.
As of August 2024 the FDA stated they will conduct further testing to evaluate Dataset-JSON’s capability to support the submission of regulatory study data. I, for one, will be closely monitoring this project to see what the outcome is!
Further Reading
Whitepaper describing working with Dataset-JSON using SAS – submitted to PharmaSUG (May-2022)
https://www.pharmasug.org/proceedings/2022/AD/PharmaSUG-2022-AD-150.pdf
Example JSON datasets for ODM V2 in GitHub (Aug-2023)
https://github.com/cdisc-org/DataExchange-DatasetJson
Post about Draft Dataset-JSON API on LinkedIn (Mar-2024)
https://www.linkedin.com/pulse/dataset-json-api-sam-hume-jdeye/
Slides about Dataset-JSON as an alternative transport format for regulatory submissions pilot (Apr-2024)
https://www.cdisc.org/sites/default/files/2024-04/2024-EU-Interchange-Session-6B-SamHume-final.pdf
CORE (CDISC Open Rules Engine) is free and open software for testing study data for conformance to CDISC standards as well as to regulatory and sponsor-specific conformance rule sets (Jul-2024)
https://www.cdisc.org/core
CDISC Public review webinar on Dataset-JSON v1.1 (Sep-2024)
https://www.cdisc.org/events/webinar/dataset-json-v1-1-public-review-webinar