ISATAB-RDF-correspondence.txt
http://isatab.sourceforge.net/specifications.html
This is an excerpt of the ISA-TAB description in http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf
The lines marked with * are comments, attempting to map ISA-TAB fields to an RDF representation.
Study:
characteristics; subjects
=============
4.1.1 Ontology source section
Term source name, file, source version, description
* There is no reason for a separate section in an RDF representation.
* Ontologies can be imported and referred to by the standard means.
* Terms from ontologies are referred to by standard means (URIs).
=============
4.1.2. Investigation section
* Provides means to group several studies together. May be useful in TB, but optional.
* Investigation file can be generated on the fly, when exporting as ISA-TAB is requested.
* Or later an Investigation resource could be introduced.
4.1.2.1 INVESTIGATION
Investigation Identifier
A locally unique identifier or an accession number provided by a repository.
Investigation Title
A concise name given to the investigation
Investigation Description
A textual description of the investigation
Investigation Submission Date
The date on which the investigation was reported to the repository.
Investigation Public Release Date
The date on which the investigation should be released publicly.
* If Investigation is a resource, the fields above could be predefined properties.
* If generated on the fly, could be requested to be entered upon export (does it make sense?)
4.1.2.2 INVESTIGATION PUBLICATIONS
Each publication associated with an Investigation has its own column in the Investigation Publication section.
Such publications are specifically dealing with the investigation as a whole. Publications relating to the specific
Studies may be referenced in the Study sections. Information may be supplied using as many additional
columns as needed.
Investigation PubMed ID
The PubMed IDs of the described publication(s) associated with this investigation.
Investigation Publication DOI
A Digital Object Identifier (DOI) for that publication (where available).
Investigation Publication Author List
The list of authors associated with that publication.
Investigation Publication Title
The title of publication associated with the investigation.
Investigation Publication Status
A term describing the status of that publication (i.e. submitted, in preparation, published).
Investigation Publication Status Term Accession Number
The accession number from the Term Source associated with the selected term.
Investigation Publication Status Term Source REF
Identifies the controlled vocabulary or ontology that this term comes from. The Source REF has to
match one of the Term Source Names declared in the ontology section 4.1.1.
* This refers to the investigation as a whole.
* It may not be relevant to the initial phase of data modeling, but it is definitely worth having
* the possibility to group studies and assign a higher-level context,
* especially for (future) publications, which will likely refer to several studies.
4.1.2.3 INVESTIGATION CONTACTS
Investigation Person Last Name
...
Investigation Person Address
The address of a person associated with the investigation.
Investigation Person Affiliation
The organization affiliation for a person associated with the investigation.
Investigation Person Roles
Investigation Person Roles Term Accession Number
* Same arguments as for the previous section. Person roles could come from an ontology.
==================
4.1.3 Study section
-Study Identifier
A unique identifier: either a temporary identifier supplied by users or one generated by a repository or
other database. For example, it could be an identifier complying with the LSID specification.
*An LSID is represented as a Uniform Resource Name (URN) with the following format.
URN:LSID:<Authority>:<Namespace>:<ObjectID>[:<Version>]
http://en.wikipedia.org/wiki/LSID
*We could generate LSID-style identifiers from internal identifiers, versions, etc.
*Or use a centralized service to generate LSIDs?
* dc:identifier
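* A minimal sketch of building and parsing identifiers in the URN:LSID:<Authority>:<Namespace>:<ObjectID>[:<Version>]
* format quoted above. The authority "toxbank.example" and the namespace "study" are hypothetical placeholders,
* and the parser only checks the colon-separated shape, not every rule of the LSID specification.

```python
import re
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str] = None  # the version part is optional


# Only the colon-separated shape is checked here (an assumption for this sketch).
_LSID_RE = re.compile(
    r"^urn:lsid:([^:]+):([^:]+):([^:]+)(?::([^:]+))?$", re.IGNORECASE
)


def format_lsid(lsid: Lsid) -> str:
    """Serialize an Lsid as urn:lsid:<authority>:<namespace>:<object>[:<version>]."""
    parts = ["urn", "lsid", lsid.authority, lsid.namespace, lsid.object_id]
    if lsid.version is not None:
        parts.append(lsid.version)
    return ":".join(parts)


def parse_lsid(text: str) -> Lsid:
    """Parse an LSID string; raise ValueError if it does not have the expected shape."""
    match = _LSID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an LSID: {text!r}")
    return Lsid(*match.groups())


# Hypothetical study identifier built from an internal id and a version.
study_lsid = format_lsid(Lsid("toxbank.example", "study", "42", "1"))
```

* The same Lsid tuple could be filled from internal identifiers or from a central service, whichever option is chosen.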
-Study Title
A concise phrase used to encapsulate the purpose and goal of the study.
* Use Dublin Core metadata, described at http://dublincore.org/documents/dcmi-terms/
* e.g. dc:title or dcterms:title
* The notes below use dcterms: as prefix for http://purl.org/dc/terms
-Study Description
A textual description of the study, with components such as objective or goals.
* http://purl.org/dc/terms/abstract , or shortly dcterms:abstract
* or http://purl.org/dc/terms/description ?
-Study Submission Date
The date on which the study is submitted to an archive.
* http://purl.org/dc/terms/created ; dcterms:created
*or http://purl.org/dc/terms/dateSubmitted dcterms:dateSubmitted ??
-Study Public Release Date
The date on which the study should be released publicly.
* dcterms:available http://purl.org/dc/terms/available
-Study File Name
A field to specify the name of the Study file corresponding the definition of that Study. There can be
only one file per cell. In case, implementers wish to split the Study Files on their nodes (i.e Source
Name and Sample Name), a process which results in multiple files being necessary to report the
same information, they should create a bundle archive with files and report the name of the archive,
thereby complying with the one file only rule.
* Study resource URI , or/and file name when exported as ISA-TAB
* The study resource should probably follow the same scheme: /study/{id} returning metadata and /study/{id}/file
* returning content
----------
TB fields:
hasAuthor, hasKeyword, hasOwner, isSummarySearchable, priorVersion, versionInfo
----------
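* A sketch of how the Dublin Core suggestions for the Study fields above could be turned into triples.
* Where the notes list alternatives (abstract vs. description, created vs. dateSubmitted), one option is picked
* here purely for illustration; the study URI http://example.org/study/1 is a hypothetical placeholder.
* Output is plain N-Triples text built with the standard library only.

```python
DCTERMS = "http://purl.org/dc/terms/"

# One candidate predicate per ISA-TAB Study field (illustrative choice, not a decision).
STUDY_FIELD_TO_PREDICATE = {
    "Study Identifier": DCTERMS + "identifier",
    "Study Title": DCTERMS + "title",
    "Study Description": DCTERMS + "abstract",
    "Study Submission Date": DCTERMS + "created",
    "Study Public Release Date": DCTERMS + "available",
}


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def study_to_ntriples(study_uri, fields):
    """Return one N-Triples line per non-empty, mapped Study field."""
    lines = []
    for name, predicate in STUDY_FIELD_TO_PREDICATE.items():
        value = fields.get(name)
        if value:
            lines.append(f'<{study_uri}> <{predicate}> "{escape_literal(value)}" .')
    return lines


example = study_to_ntriples(
    "http://example.org/study/1",  # hypothetical study resource URI
    {"Study Title": 'A "test" study', "Study Submission Date": "2011-05-01"},
)
```

* Fields without a mapping (e.g. the TB-specific ones) are simply skipped by this sketch.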
==================
4.1.3.2 STUDY DESIGN DESCRIPTORS
-Study Design Type
A term allowing the classification of the study based on the overall experimental design, e.g crossover
design or parallel group design. The term can be free text or from, for example, a controlled
vocabulary or an ontology. If the latter source is used the Term Accession Number and Term Source
REF fields below are required.
-Study Design Type Term Accession Number
The accession number from the Term Source associated with the selected term.
-Study Design Type Term Source REF
Identifies the controlled vocabulary or ontology that this term comes from. The Study Design Term
Source REF has to match one of the Term Source Names declared in the ontology section 4.1.1.
* This is definitely an entry from an ontology, ideally of Study design types
* In practice this probably has to include entries from different ontologies
* Suggestions of ontologies ???
* tb:hasStudyDesignType
-------
4.1.3.3 STUDY PUBLICATIONS
* decide on one of the citation ontologies
* or use http://purl.org/dc/terms/bibliographicCitation (dcterms:bibliographicCitation)
* perhaps a DOI is sufficient?
-Study PubMed ID
The PubMed IDs of the publication(s) associated with this study (where available).
-Study Publication DOI
A Digital Object Identifier (DOI) for this publication (where available).
-Study Publication Author List
The list of authors associated with this publication.
-Study Publication Title
The title of this publication.
-Study Publication Status
A term describing the status of this publication (i.e. submitted, in preparation, published). The term
can be free text or from, for example, a controlled vocabulary or an ontology. If the latter source is
used the Term Accession Number and Term Source REF fields below are required.
-Study Publication Status Term Accession Number
The accession number from the Term Source associated with the selected term.
-Study Publication Status Term Source REF
Identifies the controlled vocabulary or ontology that this term comes from. The Source REF has to
match one of the Term Source Names declared in the ontology section 4.1.1.
------
4.1.3.4 STUDY FACTORS
-Study Factor Name
The name of one factor used in the Study and/or Assay files. A factor corresponds to an independent
variable manipulated by the experimentalist with the intention to affect biological systems in a way
that can be measured by an assay. The value of a factor is given in the Study or Assay file,
accordingly. If both Study and Assay have a Factor Value (see section 4.2.5 and 4.3.1.5,
respectively), these must be different.
* EFO ontology
* Analogous to ot:Feature
* Could use
* <study> tb:hasFactor <thefactor>.
* <thefactor> dc:title <here-the-factor-name>.
* <thefactor> rdf:type <http://www.ebi.ac.uk/efo/EFO_0000001>. //this is the top-level experimental factor in EFO
* ??? derive from ot:Feature or vice versa
-Study Factor Type
A term allowing the classification of this factor into categories. The term can be free text or from, for
example, a controlled vocabulary or an ontology. If the latter source is used the Term Accession
Number and Term Source REF fields below are required.
- Study Factor Type Term Accession Number
The accession number from the Term Source associated with the selected term.
- Study Factor Type Term Source REF
Identifies the controlled vocabulary or ontology that this term comes from. The Source REF has to
match one of the Term Source Name declared in the ontology section 4.1.1.
* The three entries above could be replaced by subclassing (or linking in another way)
* to an ontology class, e.g.
* <thefactor> rdf:type <http://www.ebi.ac.uk/efo/EFO_0002605>. //efo:gene name
* <thefactor> rdf:type <http://purl.obolibrary.org/obo/OBI_0100026>. //efo:organism
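* The factor triples sketched in this section could be generated as follows. The tb: namespace URI
* (http://example.org/toxbank#) and the study/factor URIs are hypothetical; the EFO class URI is the one
* quoted above. The optional factor_type argument stands in for the Study Factor Type term.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
EFO_EXPERIMENTAL_FACTOR = "http://www.ebi.ac.uk/efo/EFO_0000001"
TB = "http://example.org/toxbank#"  # hypothetical namespace for the tb: prefix


def literal(text):
    """Quote a plain string as an N-Triples literal (escapes backslash and quote only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def factor_ntriples(study, factor, name, factor_type=EFO_EXPERIMENTAL_FACTOR):
    """Link a study to a factor, name the factor, and type it with an ontology class."""
    return [
        f"<{study}> <{TB}hasFactor> <{factor}> .",
        f"<{factor}> <{DC_TITLE}> {literal(name)} .",
        f"<{factor}> <{RDF_TYPE}> <{factor_type}> .",
    ]


triples = factor_ntriples(
    "http://example.org/study/1",      # hypothetical study URI
    "http://example.org/factor/dose",  # hypothetical factor URI
    "dose",
)
```

* Passing a more specific class URI as factor_type replaces the generic top-level experimental factor.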
------
4.1.3.5 STUDY ASSAYS
The Study Assay section declares and describes each of the Assay files associated with the current Study.
* This seems just to establish links between studies and assays
* Isn't it better to have Assay metadata in the Assays themselves, and just link Study and Assay URIs?
* <study1> tb:hasAssay <Assay1>.
Study Assay Measurement Type
A term to qualify the endpoint, or what is being measured (e.g. gene expression profiling or protein
identification). The term can be free text or from, for example, a controlled vocabulary or an ontology.
If the latter source is used the Term Accession Number and Term Source REF fields below are
required.
-Study Assay Measurement Type Term Accession Number
The accession number from the Term Source associated with the selected term.
-Study Assay Measurement Type Term Source REF
The Source REF has to match one of the Term Source Name declared in the ontology section 4.1.1.
* <assay> tb:hasMeasurementType <assay-measurement-ontology-entry>. //could EFO be used again?
-Study Assay Technology Type
Term to identify the technology used to perform the measurement, e.g. DNA microarray, mass
spectrometry. The term can be free text or from, for example, a controlled vocabulary or an ontology.
If the latter source is used the Term Accession Number and Term Source REF fields below are
required.
-Study Assay Technology Type Term Accession Number
The accession number from the Term Source associated with the selected term.
-Study Assay Technology Type Term Source REF
Identifies the controlled vocabulary or ontology that this term comes from. The Source REF has to
match one of the Term Source Names declared in the ontology section 4.1.1.
-Study Assay Technology Platform
Manufacturer and platform name, e.g. Bruker AVANCE
* <assay> tb:hasTechnology <assay-technology-ontology-entry>. //could EFO be used again?
- Study Assay File Name
A field to specify the name of the Assay file corresponding the definition of that assay. There can be
only one file per cell. In case, implementers wish to split the Assay Files on their nodes, a process
which results in multiple files being necessary to report the same information, they should create a
bundle archive with files and report the name of the archive, thereby complying with the one file only
rule.
* Assay URI or file when exporting as ISA-TAB
* Again /assay/{id} for metadata and /assay/{id}/file for files
------
4.1.3.6 STUDY PROTOCOLS
* Protocols - we have a separate resource for protocols - could just link them to studies - or to assays - or both?
* <study>
-Study Protocol Name
The name of the protocols used within the ISA-TAB document. The names are used as identifiers
within the ISA-TAB document and will be referenced in the Study and Assay files in the Protocol REF
columns. Names can be either local identifiers, unique within the ISA Archive which contains them, or
fully qualified external accession numbers.
* <protocol> dcterms:title "the-name".
* <study> tb:hasProtocol <protocol>.
-Study Protocol Type
Term to classify the protocol. The term can be free text or from, for example, a controlled vocabulary
or an ontology. If the latter source is used the Term Accession Number and Term Source REF fields
below are required.
-Study Protocol Type Term Accession Number
The accession number from the Term Source associated with the selected term.
-Study Protocol Type Term Source REF
Identifies the controlled vocabulary or ontology that this term comes from. The Source REF has to
match one of the Term Source Name declared in the ontology section 4.1.1.
* <protocol> rdf:type <protocol in EFO> .
* <protocol in EFO> - any of subclasses of http://purl.obolibrary.org/obo/OBI_0000272
* or protocols, defined in other ontologies
-Study Protocol Description
A free-text description of the protocol.
* <protocol> dcterms:abstract "the-description".
-Study Protocol URI
Pointer to protocol resources external to the ISA-TAB that can be accessed by their Uniform
Resource Identifier (URI).
* http://toxbank.*/protocol/{id}
-Study Protocol Version
An identifier for the version to ensure protocol tracking.
* <protocol> dcterms:hasVersion "the-version".
-Study Protocol Parameters Name
A semicolon-delimited (";") list of parameter names, used as an identifier within the ISA-TAB
document. These names are used in the Study and Assay files (in the "Parameter Value [<parameter
name>]" column heading) to list the values used for each protocol parameter. Refer to section
Multiple values fields in the Investigation File on how to encode multiple values in one field and match
term sources
* This could go into Protocol Templates (/protocol/{id}/template), together with parameter definitions
* Values are study specific
-Study Protocol Parameters Term Accession Number
The accession number from the Term Source associated with the selected term.
-Study Protocol Parameters Term Source REF
Identifies the controlled vocabulary or ontology that this term comes from. The Source REF has to
match one of the Term Source Name declared in the ontology section 4.1.1.
* These are likely the "fixed factors" discussed, for protocols.
*
* <protocol> tb:hasParameter <ProtocolParameter>.
* <ProtocolParameter> rdf:type <entry-from-protocol-ontology>
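* A sketch of reading the semicolon-delimited parameter fields described above. It assumes that the names,
* accession numbers and source REFs are aligned by position and that an empty entry means "no term";
* that alignment rule is an assumption of this sketch (see the "Multiple values fields" section of the
* specification for the actual encoding). The sample values are made up.

```python
from itertools import zip_longest
from typing import List, NamedTuple


class ProtocolParameter(NamedTuple):
    name: str
    accession: str
    source_ref: str


def split_cell(cell: str) -> List[str]:
    """Split a semicolon-delimited cell, keeping empty entries so positions stay aligned."""
    if not cell.strip():
        return []
    return [part.strip() for part in cell.split(";")]


def parse_parameters(names: str, accessions: str, source_refs: str) -> List[ProtocolParameter]:
    """Pair each parameter name with the accession and source REF at the same position."""
    rows = zip_longest(
        split_cell(names), split_cell(accessions), split_cell(source_refs), fillvalue=""
    )
    # Entries without a name cannot be referenced from "Parameter Value [...]" headers.
    return [ProtocolParameter(*row) for row in rows if row[0]]


params = parse_parameters("dose; exposure time", "ACC:1;", "SRC;")
```

* Each ProtocolParameter could then become one <protocol> tb:hasParameter <ProtocolParameter> link.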
-Study Protocol Components Name
A semicolon-delimited (";") list of a protocol's components; e.g. instrument names, software names,
and reagents names. Refer to section Multiple values fields in the Investigation File on how to encode
multiple components in one field and match term sources.
-Study Protocol Components Type
Term to classify the protocol components listed for example, instrument, software, detector or
reagent. The term can be free text or from, for example, a controlled vocabulary or an ontology. If the
latter source is used the Term Accession Number and Term Source REF fields below are required.
-Study Protocol Components Type Term Accession Number
The accession number from the Source associated to the selected terms.
-Study Protocol Components Type Term Source REF
Identifies the controlled vocabulary or ontology that this term comes from. The Source REF has to
match a Term Source Name previously declared in the ontology section 4.1.1.
* These are likely the "fixed factors" discussed, for protocols.
*
* <protocol> tb:hasComponent <ProtocolComponent>.
* <ProtocolComponent> rdf:type <entry-from-protocol-ontology>
------
4.1.3.7 STUDY CONTACTS
* we could address these by referring to user URI
* <study> tb:hasOwner <user>.
* <study> tb:hasAuthor <user>.
-Study Person Last Name
The last name of a person associated with the study.
-Study Person First Name
The first name of a person associated with the study.
-Study Person Mid Initials
The middle initials of a person associated with the study.
-Study Person Email
The email address of a person associated with the study
-Study Person Phone
The telephone number of a person associated with the study.
-Study Person Fax
The fax number of a person associated with the study.
-Study Person Address
The address of a person associated with the study.
-Study Person Affiliation
The organization affiliation for a person associated with the study.
-Study Person Roles
Term to classify the role(s) performed by this person in the context of the study, which means that the
roles reported here need not correspond to roles held within their affiliated organization. Multiple
annotations or values attached to one person may be provided by using a semicolon (";") as a
separator, for example: "submitter;funder;sponsor". The term can be free text or from, for example, a
controlled vocabulary or an ontology. If the latter source is used the Term Accession Number and
Term Source REF fields below are required.
-Study Person Roles Term Accession Number
The accession number from the Term Source associated with the selected term.
-Study Person Roles Term Source REF
Identifies the controlled vocabulary or ontology that this term comes from. The Source REF has to
match one of the Term Source Name declared in the ontology section 4.1.1.
----------
TB fields:
hasKeyword
Could be represented by dcterms:subject
http://dublincore.org/usage/terms/history/#subjectT-001
isSummarySearchable, <<< this actually says if the metadata could be searched and displayed or not
priorVersion, versionInfo
dcterms:isVersionOf
dcterms:hasVersion
http://dublincore.org/documents/dcmi-terms/#terms-isVersionOf
http://dublincore.org/documents/dcmi-terms/#terms-hasVersion
----------
=================
4.2 Study file
In the Study file the fields are organized on a per-row basis, the first row containing column headers. The
Study file contains contextualizing information for one or more assays. The sections below describe in details
the Study file columns headers, organizing them as nodes (potentially containing the string Name or File,
previously referred to as sections), and attributes (previously referred to as fields) for nodes and node
processing events, qualifiers for node attributes, and other valid fields.
* Studies generally describe the subjects of the study and how they are prepared.
* The results are in the assay files.
------
4.2.1 Study nodes
* ??? These are in fact the steps of the study!
* Protocols do not necessarily correspond to assays; there might be protocols that do not lead to data generation
* (e.g. feeding animals).
* And there could be protocols that are data processing events, not operating on biological entities.
-Source Name
Sources are considered as the starting biological material used in a study. Source items can be
qualified using the following headers:
--Characteristics [ ],
--Material Type,
--Term Source REF,
--Term Accession Number,
--Unit,
--Provider,
--Description,
--Comment [ ].
-Sample Name
Samples represent major outputs resulting from a protocol application other than the special case
outputs of Extract or a Labeled Extract. Sample items can be qualified using the following headers:
--Characteristics [ ]
--Material Type
--Term Accession Number
--Term Source REF
--Unit
--Comment [ ].
------
4.2.2 Attributes of Study nodes
* Are these Study "factors" ?
-Material Type
Used as an attribute column following Source Name, or Sample Name. The term can be free text or
from, for example, a controlled vocabulary or an ontology (e.g. whole organism, organism part). If the
latter source is used the Term Accession Number and Term Source REF fields below are required.
-Characteristics [<category term>]
Used as an attribute column following Source Name, Sample Name. This column contains terms
describing each material according to the characteristics category indicated in the column header. For
example, a column header "Characteristics [organism part]" would contain terms describing an
organism part. The term can be free text or from, for example, a controlled vocabulary or an ontology.
If the latter source is used the Term Accession Number and Term Source REF fields are required. If
the characteristic being reported is a measurement, a Unit column (with qualifying Term Source REF
and Term Accession Number) may also be used.
- Provider ([STDYID])
Used as an attribute column following Source Name. It can be appended with the [STDYID] tag only
in the context of referencing Sources used in a SDTM data submission. This mechanism would allow
listing subject identifiers coming from several different SDTM submissions. See section 5.3 for
implementation guidelines.
-------
4.2.3 Attributes of processing events for Study nodes
-Protocol REF
One or more Protocol REF columns should be used to specify the method used to transform a
material or a data node. This column contains a reference to a Protocol Name (previously defined in
the Investigation File) or to accession numbers of protocols already present in public repositories.
Protocol REF can be further refined with the following elements:
-Parameter Value [<parameter term>]
The field allows reporting on the values taken by the parameter when applying a protocol.
Note that the term between [ ] must map to one (and only one) of parameters defined in the
investigation file. Values can be qualitative or quantitative. Refer to section 5 on design
pattern for additional information.
-Performer
Name of the operator who carried out the protocol. This allows account to be taken of
operator effects and can be part of a quality control data tracking.
-Date
The date on which a protocol is performed. This allows account to be taken of day effects and
can be part of a quality control data tracking. Dates should be reported in ISO format (YYYY-MM-DD).
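* A small sketch of checking that a Date cell is in the ISO format (YYYY-MM-DD) required above before it is
* stored, e.g. as a typed literal. The function name is illustrative. The shape is checked explicitly so that
* variants such as "2011-5-1" or "20110501" are rejected rather than silently accepted.

```python
import re
from datetime import date

_ISO_DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_iso_date(cell: str) -> date:
    """Return a date for a strict YYYY-MM-DD string; raise ValueError otherwise."""
    match = _ISO_DATE_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"not in YYYY-MM-DD format: {cell!r}")
    year, month, day = (int(group) for group in match.groups())
    # date() itself rejects impossible values such as month 13 or 30 February.
    return date(year, month, day)


performed_on = parse_iso_date("2011-05-01")
```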
------
4.2.4 Qualifiers for the Study nodes attributes
-Unit
Used if the terms provided in the Characteristics [ ], Parameter Value [ ] or FactorValue [ ] column
classify data that are dimensional.
-Term Accession Number
The accession number from the Term Source associated with the selected term, if this is from, for
example, a controlled vocabulary or an ontology. Qualifies the following headers Characteristics [ ],
Material Type, Parameter Value [ ] or FactorValue [ ] and Unit.
-Term Source REF
Identifies the controlled vocabulary or ontology that the selected term comes from. The Source REF
has to match a Term Source Name previously declared in the ontology section 4.1.1.
------
4.2.5 Other Study file fields
- Factor Value [<factor name>]
A factor is an independent variable manipulated by an experimentalist with the intention to affect
biological systems in a way that can be measured by an assay. This field holds the actual data for the
Factor Value named between the square brackets (as declared in the Investigation file, section
4.1.3.4); for example, Factor Value [compound]. Qualifiers for Factor Values, are: Unit, Unit Term
Accession Number and Unit Term Source REF in case of quantitative values, and Term Accession
Number and Term Source REF in case of qualitative values. See section 5.2 for examples.
* make use of a construct similar to ot:FeatureValue to hold values
- Comment
Comment columns can be added to provide additional information, but only when no other appropriate field exists.
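* A sketch of splitting bracketed column headers such as "Factor Value [compound]" or
* "Characteristics [organism part]" into the header kind and the term between the square brackets.
* The set of accepted kinds is limited to the bracketed headers named in this section, and the
* "FactorValue" spelling (without a space) is accepted because the text above uses it as well.

```python
import re
from typing import Optional, Tuple

_HEADER_RE = re.compile(
    r"^\s*(Characteristics|Parameter Value|Factor ?Value|Comment)\s*\[\s*(.*?)\s*\]\s*$"
)


def parse_qualified_header(header: str) -> Optional[Tuple[str, str]]:
    """Return (kind, term) for a bracketed header, or None for any other column."""
    match = _HEADER_RE.match(header)
    if match is None:
        return None
    kind, term = match.group(1), match.group(2)
    if kind == "FactorValue":
        kind = "Factor Value"  # normalise the spelling without a space
    return kind, term


factor_header = parse_qualified_header("Factor Value [compound]")
```

* The term returned for a Factor Value header is what would be matched against the Study Factor Name
* declared in the Investigation file (section 4.1.3.4).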
=================
4.3 Assay file
In the Assay file the fields are organized on a per-row basis, the first row containing column headers.
The Assay file represents a set of assays, defined by the
-- endpoint measured (i.e. gene expression) and
-- the technology employed (i.e. DNA microarray), as described in the Investigation file.
An assay file can refer to one or more external data files, see section 2.4.
* hence /assay/{id} -> technology & endpoint; /assay/{id}/file -> the content
------
4.3.1 Generic Assay file structure
This section holds a list of (generic) column headers to describe several types of assays, organized
as:
-nodes (containing the string Name or File),
-attributes for these nodes,
-attributes for node processing events,
-qualifiers for the nodes attributes and other valid fields.
* everything which contains "Name" or "File" is considered an identifier!
* The assay does not contain the data itself, it is still metadata!
* Basically, assay names are identifiers for the combination of input factors
* pointing to the data files.
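* A sketch of picking out the node (identifier) columns from an Assay file header row, following the rule
* above that nodes contain the string Name or File. As a simplification, this sketch treats a column as a
* node only when its header ends with the word "Name" or "File"; the example header row is made up.

```python
from typing import List, Tuple


def is_node_column(header: str) -> bool:
    """True if the header's last word is 'Name' or 'File' (simplified node rule)."""
    words = header.strip().split()
    return bool(words) and words[-1] in ("Name", "File")


def node_columns(headers: List[str]) -> List[Tuple[int, str]]:
    """Return (column index, header) for every node column, in file order."""
    return [
        (index, header.strip())
        for index, header in enumerate(headers)
        if is_node_column(header)
    ]


header_row = [
    "Sample Name",
    "Protocol REF",
    "Extract Name",
    "Material Type",
    "Assay Name",
    "Raw Data File",
]
nodes = node_columns(header_row)
```

* Read left to right, the node columns give the path from the sample to the data file for each row.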
4.3.1.1 Assay nodes
-Sample Name (or Source Name)
Sample Name is used as an identifier to refer to from within the Study file. Thereby associating it with
the Source Name in that Study file. Source Name in the Assay file can only be qualified with
Comment [ ]. However, in experiments where the source (starting biological material used in a study)
is also the sample (measured in the assay) the Source Name acts as the identifier to link the Assay to
its Study file. For example, a whole body scan by NMR where the body is both source and sample,
(i.e., used in toto, rather than in part).
* the row is the bioentity
-Extract Name (where applicable)
Used as an identifier within an Assay file. This column contains user-defined names for each portion
of extracted material. Valid optional qualifying headers for Extract are Characteristics [ ], Material
Type, Description, and Comment [ ].
-Labeled Extract Name (where applicable)
Used as an identifier within an Assay file. Labeled Extract Name can be qualified using the following
headers: Label, Characteristics [ ], Material Type, Description, Comment [ ].
-Assay Name
Used as an identifier within the Assay file. This column contains user-defined names for each assay.
Qualifying headers for Assay Name are Performer, Date, and Comment [ ].
* (partial) assay identifier
-Data Transformation Name
Used as an identifier within the Assay file. This column contains user-defined names for each
normalization event. Qualifying headers for Data Transformation Name are Performer, Date, and
Comment [ ].
* thought this should be yet another protocol - is there an example?
-Image File (where applicable)
Column to provide names (or URIs) of the image files generated by an assay. The optional qualifying
header for Image File is Comment [ ]. For submission or transfer, image files can be packaged with
ISA-TAB files into an ISArchive, see section 2.4.
-Raw Data File
Column to provide name (or URI) of raw data files. The optional qualifying header for Raw Data File is
Comment [ ]. For submission or transfer, data files can be packaged with ISA-TAB files into an
ISArchive, see section 2.4.
* the data file itself
-Data Transformation Name
Used as an identifier within the Assay file. This column contains a user-defined name for each data
transformation applied.
* not a protocol?
-Normalization Name
Used as an identifier within the Assay file. This column contains a user-defined name for each
normalization applied.
* not a protocol?
-Derived Data File
Column to provide name (or URI) of files resulting from data transformation or processing. The
optional qualifying header for Derived Data File is Comment [ ]. For submission or transfer, data files
can be packaged with ISA-TAB files into an ISArchive, see section 2.4.
* more data files, result of applying processing on the raw data file.
* then there are >1 steps inside the assay ...
------
4.3.1.2 Attributes of Assay nodes
-Material Type
In the Assay file, this is used as an attribute column for Sample Name (unless the sample has already
been described in the Study file), Extract Name, or Labeled Extract Name. The term can be free text
or from, for example, a controlled vocabulary or an ontology (e.g. whole organism, organism part). If
the latter source is used the Term Accession Number and Term Source REF fields below are
required.
-Characteristics [<category term>]
Used as a qualifying field following Sample Name (unless the sample has already been described in
the Study file), Extract Name, or Labeled Extract Name. This column contains terms describing each
material according to the characteristics category indicated in the column header. For example, a
column header "Characteristics [purity]" would contain terms describing the purity of that portion of
material. The term can be free text or from, for example, a controlled vocabulary or an ontology. If the
latter source is used the Term Accession Number and Term Source REF fields are required. If the
characteristic being reported is a measurement, a Unit column (with qualifying Term Accession
Number and Term Source REF) may also be used.
-Label (where applicable)
Used as an attribute column following Labeled Extract Name to indicate a chemical or biological
marker, such as a radioactive isotope or a fluorescent dye which is bound to a material in order to
make it detectable by some assay technology (e.g. P33, biotin, GFP). The term can be free text or
from, for example, a controlled vocabulary or an ontology. If the latter source is used the Term
Accession Number and Term Source REF fields are required.
-------
4.3.1.3 Attributes of processing events for Assay nodes
-Protocol REF
One or more Protocol REF columns should be used to specify the method used to transform a
material or a data node. Each column should contain a reference to a Protocol Name (defined in the
Investigation File) or to accession numbers of protocols already present in public repositories.
Protocol REF can be further refined with the following elements:
* could link to /protocol/{id}
-Parameter Value [<parameter term>]
The field allows reporting of the values taken by a parameter when applying a protocol, where
Parameters have been declared. Values can be qualitative or quantitative. Refer to section 5
on design pattern for examples.
* the value of protocol parameter....
-Performer
Name of the operator who carried out the protocol. This allows accounting for the operator
effect and can be part of a quality control data tracking. If several operators need to be
reported, a semi-colon (;) U+003B separated list of values is accepted.
* link to the user ?
-Date
Used to report the date on which a protocol is performed. This allows accounting for a day
effect and can be part of a quality control data tracking. Date should be reported in ISO format
(YYYY-MM-DD).
*
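* A minimal Python sketch of reading the Performer and Date cells described above. The function
  names and the sample values are assumptions made for illustration; they are not part of ISA-TAB.

```python
from datetime import datetime

def split_performers(cell):
    """Split a Performer cell on semicolons (U+003B), dropping empty entries."""
    return [name.strip() for name in cell.split(";") if name.strip()]

def parse_protocol_date(cell):
    """Parse a Date cell written as YYYY-MM-DD; raises ValueError if it cannot be parsed."""
    return datetime.strptime(cell.strip(), "%Y-%m-%d").date()

# Hypothetical cell values
performers = split_performers("Alice Smith; Bob Jones")
performed_on = parse_protocol_date("2008-11-03")
```

  Each performer could then become a separate link from the processing event to a user resource.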
------
4.3.1.4 Qualifiers for the Assay nodes attributes
-Unit
Used if the terms provided in the Characteristics [ ], Parameter Value [ ] or Factor Value [ ] column
classify data that are dimensional.
* Perhaps fix the unit into the parameter value definition?
-Term Accession Number
The accession number from the Term Source associated with the selected term, if this comes from,
for example, a controlled vocabulary or an ontology. Qualifies the following headers: Characteristics [ ],
Material Type, Parameter Value [ ], Factor Value [ ] and Unit.
-Term Source REF
Identifies the controlled vocabulary or ontology that the selected term comes from. The Source REF
has to match a Term Source Name declared in the ontology section 4.1.1.
* link to ontology; same as for the factors, as commented in the study section above
-------
4.3.1.5 Other Assay file fields
-Factor Value [<factor name>]
When Factor Value [ ] columns are used in an Assay file, they should reference technical variations (such as
software, instrument or protocol variations). In that sense, they are intended to allow reporting of studies where
technical fine tuning plays a key role in the quality of the measured signal, for instance in the case of
technique optimization.
This field holds the actual data for the Factor Value named between the square brackets (as declared
in the Investigation file, section 4.1.3.4); for example, Factor Value [compound]. Qualifiers for Factor
Values are: Unit, Unit Term Accession Number and Unit Term Source REF in case of quantitative
values, and Term Accession Number and Term Source REF in case of qualitative values. See section
5.2 for examples.
* how these differ from Study factors? Looks like minor variations should be reported on Assay level?
-Comment
Comment columns can be added to provide additional information, only when an appropriate field
does not exist.
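* Several headers above carry a term in square brackets (Characteristics [ ], Parameter Value [ ],
  Factor Value [ ], Comment [ ]). A minimal sketch of splitting such a header into its kind and its
  bracketed term; the regular expression and the function name are assumptions, not taken from the spec.

```python
import re

# kind, optionally followed by "[term]"; surrounding whitespace is ignored
HEADER_RE = re.compile(
    r"^\s*(?P<kind>[^\[\]]+?)\s*(?:\[\s*(?P<term>[^\]]*?)\s*\])?\s*$"
)

def parse_header(header):
    """Return (kind, term); term is None when the header has no brackets."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"unrecognised column header: {header!r}")
    return match.group("kind"), match.group("term")

kind, term = parse_header("Factor Value [compound]")
```

  The kind could select the RDF property and the term could name the factor, parameter or
  characteristic being qualified.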
-------
4.3.2 Assay file with Technology Type: DNA microarray hybridization
Extending the list given in section 4.3.1, this type of Assay file contains additional nodes and nodes relabeled
with more specific names, in line with MAGE-TAB (3):
* specific for microarray
-Hybridization Assay Name (in place of Assay Name)
Used as an identifier within the Assay file. This column contains a user-defined name for each
hybridization. Qualifying headers for Hybridization Assay Name item include Array Design REF or
Array Design File.
-Scan Name
Used as an identifier within the Assay file. This column contains a user-defined name for each Scan
event.
-Array Data File (in place of Raw Data File)
Column to provide name (or URI) of raw array data files.
-Derived Array Data File (in place of Derived Data File)
Column to provide name (or URI) of data files resulting from data transformation or processing.
-Array Data Matrix File
Column to provide name (or URI) of raw data matrix files.
-Derived Array Data Matrix File
Column to provide name (or URI) of processed data matrix files, resulting from data transformation or
processing. Where data from multiple hybridizations is stored in a single file, the data should be
mapped to the appropriate hybridization (or scan, or normalization) via the Data Matrix format itself
[9].
- Array Design File
Column to provide name of file containing the array design, used for a particular hybridization. For
submission or transfer, ADF files can be packaged with ISA-TAB files into an ISArchive, see section
2.4.
- Array Design REF
This column is used to reference the identifier (or accession number) of an existing array design.
------
4.3.3 Assay file with Technology Type: Gel electrophoresis
Extending the list given in section 4.3.1, this type of Assay file contains additional nodes and nodes relabeled
with more specific names, in line with other domain-specific formats and associated reference repositories (6,
10, 12, 22).
* technology specific. Could it be defined in an ontology which technology needs which fields?
* Or perhaps already is?
- Gel Electrophoresis Assay Name (in place of Assay Name)
Used as an identifier within the Assay file. This column contains user-defined names for each
electrophoresis gel assay. For 2-dimensional gels, the following qualifying headers can be used
instead:
- First Dimension
The term can be free text or from, for example, a controlled vocabulary or an ontology. If the
latter source is used the Term Accession Number and Term Source REF fields are required.
- Second Dimension
The term can be free text or from, for example, a controlled vocabulary or an ontology. If the
latter source is used the Term Accession Number and Term Source REF fields are required.
- Scan Name
Used as an identifier within the Assay file. This column contains user-defined names for each Scan
event.
- Spot Picking File
Column to provide name (or URI) of files holding protein spot coordinates and metadata for use by
spot picking instruments.
The node attributes and remaining fields are as described in section 4.3.1.
-------
4.3.4 Assay file with Technology Type: Mass Spectrometry (MS)
Extending the list given in section 4.3.1, this type of Assay file contains additional nodes and nodes relabeled
with more specific names, in line with other domain-specific formats and associated reference repositories (6,
12,18).
-MS Assay Name (in place of Assay Name)
Used as an identifier within the Assay file. This column contains user-defined names for each MS
Assay.
-Raw Spectral Data File (in place of Raw Data File)
Column to provide name (or URI) of raw spectral data files.
-Derived Spectral Data File (in place of Derived Data File)
Column to provide name (or URI) of derived spectral data files, resulting from data transformation or
processing.
When Mass Spectrometry is used in proteomics the following data files are required, according to PSI
specifications and Pride submission requirements (6, 10):
-Peptide Assignment File
Column to provide name (or URI) of file(s) containing peptide assignments.
-Protein Assignment File
Column to provide name (or URI) of file(s) containing protein assignments.
-Post Translational Modification Assignment File
Column to provide name (or URI) of file(s) containing posited post-translational modifications.
Capturing data resulting from the use of mass spectrometry in metabol/nomics requires a settled
definition for a Metabolite Assignment File (inter alia); such a file is currently under development in
collaboration with the Metabolomics Standards Initiative (MSI, 18).
The node attributes and remaining fields are as described in section 4.3.1.
------
4.3.5 Assay file with Technology Type: Nuclear Magnetic Resonance spectroscopy (NMR)
Extending the list given in section 4.3.1, this type of Assay file contains additional nodes and nodes relabeled
with more specific names, in line with other domain-specific formats (12,18).
* technology specific
- NMR Assay Name (in place of Assay Name)
Used as an identifier within the Assay file. This column contains user-defined names for each NMR
Assay.
-Free Induction Decay Data File (in place of Raw Data File)
Column to provide name (or an URI) of files corresponding to the free induction decay data files.
-Acquisition Parameter Data File
Column to provide name (or an URI) of files corresponding to the acquisition parameters data files.
-Derived Spectral Data File (in place of Derived Data File)
Column to provide name (or an URI) of files corresponding to derived spectral data files, resulting
from data transformation or processing.
Capturing data resulting from the use of NMR spectrometry in metabol/nomics requires a settled
definition for a Metabolite Assignment File (inter alia); such a file is currently under development in
collaboration with the Metabolomics Standards Initiative (MSI, 18).
The node attributes and remaining fields are as described in section 4.3.1.
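* The "in place of" relabelings listed in sections 4.3.2 to 4.3.5 can be collected in one lookup table.
  A minimal sketch; the dictionary layout and the function name are assumptions, while the header
  names are the ones listed above.

```python
# technology type -> {generic header from section 4.3.1: technology-specific header}
RELABELED = {
    "DNA microarray hybridization": {
        "Assay Name": "Hybridization Assay Name",
        "Raw Data File": "Array Data File",
        "Derived Data File": "Derived Array Data File",
    },
    "Gel electrophoresis": {
        "Assay Name": "Gel Electrophoresis Assay Name",
    },
    "Mass Spectrometry (MS)": {
        "Assay Name": "MS Assay Name",
        "Raw Data File": "Raw Spectral Data File",
        "Derived Data File": "Derived Spectral Data File",
    },
    "Nuclear Magnetic Resonance spectroscopy (NMR)": {
        "Assay Name": "NMR Assay Name",
        "Raw Data File": "Free Induction Decay Data File",
        "Derived Data File": "Derived Spectral Data File",
    },
}

def column_for(technology, generic_header):
    """Return the technology-specific header, or the generic one if it is not relabeled."""
    return RELABELED.get(technology, {}).get(generic_header, generic_header)
```

  With such a table the RDF mapping could be written once against the generic headers of section 4.3.1.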
------
4.3.6 Assay file with Technology Type: High throughput sequencing
Under development in collaboration with the Genomics Standards Consortium (GSC, 19) and reference
repository (20).
* the spec file is from 2008 - looking for updates!
* 2011 publication http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001088
* http://standardsingenomics.org/index.php/sigen
* http://bioinformatics.oxfordjournals.org/content/26/18/2354.full.pdf+html
==========
5. Design patterns for Study and Assay files
5.1 Representing node processing
Conceptually, a protocol takes one or more inputs (biological material or data) and generates one or more
outputs (biological material or data). Therefore protocols correspond to edges in the experimental graph, while
materials and data correspond to the nodes. The following example shows how to represent the
transformation of a node by referencing its protocol (declared in the STUDY PROTOCOLS section of the
Investigation file), together with the optional Performer and Date columns:
Node Name | Protocol REF | Performer | Date | Node Name
* This is quite close to the OpenTox concept of a processing algorithm: dataset -- (algorithm) --> dataset
* The links could be made explicit in RDF
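* A minimal sketch of making these links explicit, with one processing event per row that points to its
  input node, output node and protocol. The example.org namespace, the property names and the sample
  row are placeholders (assumptions), not an agreed vocabulary.

```python
from urllib.parse import quote

BASE = "http://example.org/isa/"  # hypothetical namespace

def uri(kind, name):
    """Build a resource URI such as .../protocol/{id} from a cell value."""
    return f"{BASE}{kind}/{quote(name, safe='')}"

def row_to_triples(row, event_id):
    """row = (input node, Protocol REF, Performer, Date, output node)."""
    source, protocol, performer, date, target = row
    event = uri("event", str(event_id))
    triples = [
        (event, BASE + "input", uri("node", source)),
        (event, BASE + "output", uri("node", target)),
        (event, BASE + "protocol", uri("protocol", protocol)),
    ]
    # Performer and Date are optional columns; emit them only when filled in
    if performer:
        triples.append((event, BASE + "performer", performer))
    if date:
        triples.append((event, BASE + "date", date))
    return triples

# Hypothetical row
triples = row_to_triples(("Source1", "PoolP1", "Jane Doe", "2008-11-03", "Sample1"), 1)
```

  Modelling the event as its own resource keeps Performer and Date attached to the protocol
  application rather than to either node.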
------
5.1.1 Pooling or joining nodes
Two or more parent nodes contribute to the creation of the same child node.
Example1:
Source Name | Protocol REF | Sample Name
Source1 | PoolP1 | Sample1
Source2 | PoolP1 | Sample1
Source3 | PoolP1 | Sample1
Source4 | PoolP1 | Sample1
Example2: several labeled extract materials that are put on the same gel
Labeled Extract Name | Label | Protocol REF | Gel Electrophoresis Assay Name
LabeledExtract1 | Cy3 | GelP | Gel1
LabeledExtract2 | Cy3 | GelP | Gel1
LabeledExtract3 | Cy5 | GelP | Gel1
* TODO represent the workflow in RDF
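* As a first step towards that, a minimal sketch that detects pooling in Example1 by grouping parent
  nodes under each (protocol, child) pair. The function name and the tuple layout are assumptions.

```python
from collections import defaultdict

# Rows of Example1: (Source Name, Protocol REF, Sample Name)
rows = [
    ("Source1", "PoolP1", "Sample1"),
    ("Source2", "PoolP1", "Sample1"),
    ("Source3", "PoolP1", "Sample1"),
    ("Source4", "PoolP1", "Sample1"),
]

def group_inputs(rows):
    """Map each (protocol, child) pair to the list of parents that feed it."""
    inputs = defaultdict(list)
    for parent, protocol, child in rows:
        inputs[(protocol, child)].append(parent)
    return dict(inputs)

# A child with more than one parent under the same protocol is a pooling event
pooled = {key: parents for key, parents in group_inputs(rows).items() if len(parents) > 1}
```

  Each entry of pooled could then become a single processing event with several inputs and one output.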
------
5.1.2 Splitting nodes
One node is processed and generates two or more child nodes.
Example 1:
Shows two splitting events, where two samples (child nodes) are derived from the same source (parent node).
Source Name | Material Type | Protocol REF | Sample Name
Animal1 | whole organism | organ_removal_Pl | Animal1.liver
Animal1 | whole organism | organ_removal_Pk | Animal1.kidney
Animal2 | whole organism | organ_removal_Pl | Animal2.liver
Animal2 | whole organism | organ_removal_Pk | Animal2.kidney
Example 2:
Hybridization Assay Name | Scan Name | Image File
Hybridization1 | Scan1 | Image1.tiff
Hybridization1 | Scan1 | Image2.tiff
* TODO represent the workflow in RDF
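* The converse check for splitting, sketched on Example 1: collect the distinct children of each
  parent, whatever protocol produced them. The function name and the tuple layout are assumptions;
  the Material Type column is left out for brevity.

```python
# Rows of Example 1: (Source Name, Protocol REF, Sample Name)
rows = [
    ("Animal1", "organ_removal_Pl", "Animal1.liver"),
    ("Animal1", "organ_removal_Pk", "Animal1.kidney"),
    ("Animal2", "organ_removal_Pl", "Animal2.liver"),
    ("Animal2", "organ_removal_Pk", "Animal2.kidney"),
]

def children_by_parent(rows):
    """Map each parent node to its distinct child nodes, in row order."""
    children = {}
    for parent, _protocol, child in rows:
        seen = children.setdefault(parent, [])
        if child not in seen:
            seen.append(child)
    return children

# A parent with more than one distinct child has been split
split = {parent: kids for parent, kids in children_by_parent(rows).items() if len(kids) > 1}
```

  Here the two children of each animal come from different protocols, so each row would still map to
  its own processing event.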
-------
5.2 Representing node qualifiers
How to represent qualitative values for three different nodes (Characteristics [ ],
Factor Value [ ] and Parameter Value [ ]), when terms for the values are from a controlled vocabulary or
ontology:
Example:
Characteristics [organism part] | Term Accession Number | Term Source REF
Liver | CARO:123424 | CARO
Factor Value [compound] | Term Accession Number | Term Source REF
Aspirin | CHEBI:15365 | CHEBI
Parameter Value [detector] | Term Accession Number | Term Source REF
Channeltron | PSI:1000107 | PSI
* This should be trivial in RDF , explained above for factors
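* A minimal sketch of that mapping: the qualified value becomes a link to an ontology term, and the
  free-text value becomes the term's label. The example.org URIs stand in for the real Term Source
  locations declared in the Investigation file, and the qualifier property naming is an assumption;
  only rdfs:label is an existing property.

```python
from urllib.parse import quote

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.org/isa/"  # hypothetical namespace

# Hypothetical stand-ins for the Term Source declarations of section 4.1.1
TERM_SOURCES = {
    "CARO": "http://example.org/ontology/CARO/",
    "CHEBI": "http://example.org/ontology/CHEBI/",
    "PSI": "http://example.org/ontology/PSI/",
}

def qualifier_triples(node_uri, header, value, accession, source_ref):
    """Turn one (value, Term Accession Number, Term Source REF) group into triples."""
    # Accessions in the example look like "CHEBI:15365"; keep the part after the colon
    local_id = accession.split(":", 1)[-1]
    term_uri = TERM_SOURCES[source_ref] + local_id
    predicate = BASE + "qualifier/" + quote(header, safe="")
    return [
        (node_uri, predicate, term_uri),
        (term_uri, RDFS_LABEL, value),
    ]

triples = qualifier_triples(
    BASE + "node/Sample1", "Factor Value [compound]", "Aspirin", "CHEBI:15365", "CHEBI"
)
```

  A Term Source REF that was never declared raises KeyError here, which mirrors the requirement that
  it match a declared Term Source Name.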
----