This article describes the current version HICDEP 1.100. For a more detailed version history, please refer to the ChangeLog.
The table pages referenced in the overview describe the specific tables' structure in detail and present a list of suggested codes, both standard and human readable.
All codes apart from trivial no, yes or unknown codes are presented as lookup tables, the usage of these are described in the the article Considerations for using the format to create a database.
Along with the basic structure described in each ?Core fields? section, additional fields containing additional or more specific data are described in the ?Additional fields? sections. These fields were taken from several cohort collaborations but with the required changes that were needed for the specific data structures. This is presented to the reader to show that the core structure is not a fixed proposal but rather a basic structure, which can be altered by adding fields.
Issues regarding duplicates are discussed in Considerations For Data Management.
|tblART||holds type of antiretroviral drug, start and stop dates and reason for stopping|
|tblART_MUM||Antiretroviral Medication of mother in cases where mother is not enrolled in cohort|
|tblBAS||holds basic information such as demographics, basic clinical information and date of AIDS diagnosis|
|tblCANC||holds type and date of diagnosis of cancer|
|tblCENTER||holds information about the Center (e.g. geographical localisation, type of clinic) where the patient is receiving HIV care|
|tblCEP||holds type and date of clinical events and procedures including serious non-AIDS conditions. Former known as tblAE (adverse event).|
|tblDELIVERY_CHILD||holds delivery information related to the child|
|tblDELIVERY_MUM||holds delivery information related to the mother|
|tblDIS||holds type and date of CDC-C diseases and malignancies.|
|tblLAB||holds type, date, value and unit of laboratory tests.|
|tblLAB_BP||holds date, diastolic and systolic values and unit of blood pressure measurements.|
|tblLAB_CD4||holds date and value of CD4 measurements.|
|tblLAB_RNA||holds date, value, detection limit and type of viral assay.|
|tblLAB_RES||holds background information on the resistance test, laboratory, library, kit, software and type of test|
|tblLAB_RES_LVL_1||holds nucleoside sequence for the PRO and RT sequences|
|tblLAB_RES_LVL_2||holds mutations and positions of these.|
|tblLAB_RES_LVL_3||holds resistance result in relation to antiretroviral drug.|
|tblLAB_VIRO||holds test results for viro-/serological tests (hepatitis etc.)|
|tblLTFU||holds data on death and drop-out|
|tblMED||holds type, start and stop dates for other medication/treatments.|
|tblNEWBORN||holds information related to newborns|
|tblNEWBORN_ABNORM||holds information related to abnormalities of newborns|
|tblOVERLAP||holds information on the patient's participation in other cohorts|
|tblPREG||holds general pregnancy-related information|
|tblPREG_OBS||holds information on obstetrical problems during pregnancy|
|tblPREG_OUT||describes the pregnancy outcome|
|tblPROGRAM||holds information on the program with which the center is associated|
|tblREFILL||holds information on prescription refills|
|tblSAMPLES||holds information on the storage of blood, urine and other biological samples|
|tblVIS||holds visit related information such as weight, wasting, smoking, occupational status etc.|
Diagram based on HICDEP release 1.100
The data collected in HIV collaborations is presented on the following pages in a set of data files/tables. Typically data would be put into one data file that would hold one line/record per patient where each field is represented as a separate column in that dataset. Often a dataset could contain more than 3000 columns of data.
The implication of going from thousands of fields to fewer fields means that data is in fact transposed from the flat format into the normalised format.
Example of a flat file structure:
The normalised structure would then be like this:
The type of measurement is identified through the TYPE_ID field. Here 1 codes for ALAT and 2 codes for ASAT:
|1||ALAT - Alanin-Aminotransferase|
|2||ASAT - Aspartat aminotransferase|
To enable a normalised structure that minimises the number of columns dramatically, the one file solution must be broken into several minor tables. These breakdowns are driven by the different data characteristics.
Each table has a basic structure that includes the patient identifier, a code that represents e.g. drug, adverse event or laboratory test performed. Along with this combination values like date, result, unit etc are present for each record.
A record for a laboratory measurement would include:
A record for usage of an antiretroviral drug would include:
These issues imply that a set of distinct tables must be generated based on the ?nature? of the data. Since laboratory, medication and event data both cannot and should not be mixed at least 3 tables must be designed. Additionally there are other types of information that need their own domains: background information on the patient (height, birth date etc.), visit related data (weight, blood pressure, wasting etc.), and resistance testing (the latter requires more consideration due to the diversity of data present).
In this protocol further separation of data into different tables are presented. These separations are not only based on the rules for the relational model and normalisation, but they are ?culturally? related.
For example: antiretroviral treatment medication is kept in one table and other medication in another table; CD4 cell measurements and HIV-RNA measurements are put into separate tables, that are also different from the general laboratory table. These separations are done simply because data in these tables are of distinct importance in analysis and often are gathered more frequently and with more attention than other variables.
Although it is best to have precise dates in the format of YEAR-MONTH-DAY ?ISO standard, it might be that some cohorts are limited to representing date data at the level of the month only, or information kept on the patient in the charts only defines dates to the month and in some cases only to the year. To solve this a set of date codes are presented here.
In this case the date should be coded as the 15th of the month ? so that 1999-12-?? becomes 1999-12-15. This enables the date to be no more than 15 days away from the actual date.
Best approach to this is to apply something similar, as with unknown dates, this would then mean that 1999-??-?? becomes 1999-07-01.
If the year is unknown but the presence of the date value is needed as in case of opportunistic infections or adverse events (see later in this document) a fictive date should be used that couldn?t be mistaken with an actual date. An unknown year should be coded as 1911-11-11.
An alternative to the above is to apply an additional field to each date field for which it is known that there might be issues regarding the precision of the dates. The field is then used to specify at which degree of the day, month or year the date is precise:
|Code||Precision of date|
|<||Before this date|
|D||Exact to the date|
|M||Exact to the month|
|Y||Exact to the year|
|>||After this date|
The Data Transfer Protocol for IeDEA Multi-regional Collaborations suggests that such a precision annotation variable should have the same name as the date variable with the additional Suffix _A. For example, the precision of BIRTH_D will be annotated using an optional variable with the name BIRTH_D_A.
The coding system is the official standard for coding of diseases, however there is a wide set of ?homebrew? codes used within the HIV field in data coding in general, often it?s a 3 or 4 letter code which is an abbreviation for the AIDS defining disease. ICD-10 doesn?t have single codes that represent all single CDC-C events and as a consequence of this a list of 3 to 4 letter codes is the recommended way of coding for all CDC stage C events
ICD-10 codes are however the recommended for codes AE?s since it would become impossible for this protocol to maintain a complete list of all possible AE?s. ICD-10 is also recommended for causes of death.
ATC is a hierarchical structure for coding medication. The structure and hierarchy are best explained with an example of how a drug code is defined. Here it is on Indinavir:
This hierarchy has some benefits as will be explained later, but one of its limitations is that it?s impossible to ?read? the code compared to the widely used 3 letter mnemonic codes for antiretroviral drugs.
The difference is that the IDV code is easily readable, where the ATC code is not; going from a flat file structure to a normalised structure the human readable aspect becomes increasingly important. In the flat file format the column names and the possibility of labels makes data more or less readable; in the normalised format only the coding can help. Because of this the 3 letter codes are being presented in this document. However it must be stressed that usage of the ATC coding should be used to diminish the risk of several homebrew and non-compatible coding schemes.
Currently however, the ATC scheme does not provide sufficient detail on the specific drugs, there is e.g. no official way to code Saquinavir as hard or soft gel. Thus a slight alteration to the set of codes will be presented in the sections of the ART and MED tables. The alterations are designed to extend the existing structure of ATC.
One of the benefits is that the structure of ATC allows easier statistics on e.g. drug classes
Although the codes might be harder to read they provide grouping mechanisms in the way they are coded. Interested readers should go to the ?ATC Website to learn about the structure of ATC. A fully updated database of ATC codes and DDD (Defined Daily Dosage) is available for querying.
It is often necessary to code for values like ?Yes?, ?No? and ?Unknown?, this document suggests that the following codes should be used:
Unknown should be used to identify the difference between a value that has not yet been collected (Empty) and a value that cannot be collected (Unknown). Empty values should be required where Unknown values make little sense to keep querying for a value.
Example ? weight:
Depending on the unit in which weight is measured, a different value for Unknown should be applied. In the case of kg the ?Unknown? code should be 999 and not just 9 or 99, the last two could be actual values.
Blank values, for SAS users also known as "." and for database programmers known as NULL, should be used wherever specified in this protocol. However, sometimes it might be more correct just to omit the record if no value has been recorded, test has not been performed etc.
In order to verify the consistency and correctness of the data, QA checks are made before the data is used. The QA checks applying to a given table are listed at the bottom of its article. Additionally, a list of all QA checks, including checks which do not directly apply to the HICDEP tables themselves, is available here.