APPENDIX B: Metadata Table and Data Dictionary

The following metadata tables apply to the data that can be accessed both in SciServer and through using the APIs. For each API call, results can be filtered by agency designation using url parameter agency={AGENCY}. For example, https://prod.democratizing-data.tacc.utexas.edu/authors?agency=USDA

In each case below, the API returns data in the schema listed in the table, unless noted otherwise. The API endpoint can be found below each table.

agency_run: the table of runs for the different agencies and their datasets

https://prod.democratizing-data.tacc.utexas.edu/agency_runs

Column name
Description
Data type
Length
Is nullable

id

unique identifier to the agency_run table

bigint

0

NO

agency

name of the agency for which the run was performed

varchar

32

NO

version

version of the run for the agency; allows multiple versions on the same datasets, or possibly new runs for the same agency but with different input datasets

varchar

32

NO

run_date

approximate date the run was performed

date

0

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

asjc: the table with the All Science Journal Classification codes which define the research area of a journal and the articles it contains

https://prod.democratizing-data.tacc.utexas.edu/asjc A publication's research areas are defined through Elsevier's All Science Journal Classification scheme. More information about this classification system can be found here: https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus/

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

code

the All Science Journal Classification code which defines the research area of a journal and the articles it contains; there may be more than one ASJC code for each journal/publication, and the 334 codes are used here to provide a relatively precise definition of research area

bigint

0

NO

label

the All Science Journal Classification name

nvarchar

-1

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

author: the table with author information

https://prod.democratizing-data.tacc.utexas.edu/authors

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

external_id

id assigned by Elsevier to the author; this allow different authors to be identified across publications, even if they used different name variations

varchar

128

YES

given_name

the unique given name of the author as determined by Elsevier

nvarchar

150

YES

family_name

the unique family name of the author as determined by Elsevier

nvarchar

150

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

author_affiliation: the table linking authors to their affiliations in a publication

https://prod.democratizing-data.tacc.utexas.edu/author_affiliations

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

publication_author_id

identifies the publication_author here linked to the publication affiliation

bigint

0

YES

publication_affiliation_id

identifies the publication_affiliation entry here linked to a publication author

bigint

0

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

date

0

NO

dataset_alias: the datasets provided by an agency for a particular run and possible aliases

https://prod.democratizing-data.tacc.utexas.edu/dataset_aliases (Note that the endpoint https://prod.democratizing-data.tacc.utexas.edu/datasets joins dataset_alias and agency_run, the schema being a concatenation of the two.) (To filter dataset_alias by other parameters, use https://prod.democratizing-data.tacc.utexas.edu/topics/{topic_id}/datasets, https://prod.democratizing-data.tacc.utexas.edu/authors/{author_id}/datasets, https://prod.democratizing-data.tacc.utexas.edu/publications/{publication_id}/datasets)

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

alias_id

the alias_id as provided by Elsevier

bigint

0

NO

parent_alias_id

identifies the parent dataset entry in this table, as identified by the alias_id

bigint

0

YES

alias

the name of the data set or the alias depending on whether alias_id==parent_alias_id or not

varchar

160

YES

alias_type

indicates the alias type; options include main, alias, acronym, doi

varchar

50

YES

url

URL to information about the dataset

varchar

2048

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

dyad: the core table with dyads representing dataset references

https://prod.democratizing-data.tacc.utexas.edu/dyads

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

publication_id

foreign key the publication table's id column, identifying the publication within which the dataset reference represented by this dyad was identified

bigint

0

NO

elsevier_id

REMOVE

int

0

NO

dataset_alias_id

foreign key to the dataset_alias table's id column identifying the match made between this dyad and a dataset alias provided by an agency; if no such match was found this column has a NULL

bigint

0

YES

alias_id

the intrinsic id assigned by Elsevier to the dataset alias, corresponding to the alias_id column in the dataset_alias table

bigint

0

YES

mention_candidate

the phrase in the publication that was deemed by the algorithm to reflect a reference to a dataset

varchar

1028

NO

snippet*

snippet of text surrounding the mention_candidate, meant to provide contextual information to reviewers/validators

varchar

-1

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

is_fuzzy

column has value 1 if the matching between mention candidate and the dataset alias was performed using a fuzzy algorithm, 0 otherwise

bit

0

YES

fuzzy_score

in case the matching between mention candidate and the dataset alias was performed using a fuzzy algorithm, this column stores the score indicating how certain match was deemed to be

real

0

YES

*This column is set to NULL in the publicly available databases, and is only shown in restricted use databases.

dyad_model: the table with model scores for particular entries in the dyad table

https://prod.democratizing-data.tacc.utexas.edu/dyad_models

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

dyad_id

identifies the dyad

bigint

0

NO

model_id

identifies the model

bigint

0

NO

score

the score of this model for the dyad

real

0

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

issn: the table with ISSN/ISBN codes for the journal

https://prod.democratizing-data.tacc.utexas.edu/issns

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

journal_id

foreign key to the journal table's id column, identifying the journal for this ISSN

bigint

0

YES

ISSN

the ISSN/ISBN codes for the referenced journal/source

varchar

13

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

journal: the table linking publications to the journal in which they appeared

https://prod.democratizing-data.tacc.utexas.edu/journals

Column name
Description
Data type
Length
Is nullable

id

unique identifier and primary key for this table

bigint

0

NO

run_id

foreign key, identifier of the agency run for which this entry was determined

bigint

0

NO

publisher_id

foreign key to the publisher table, identifying the publisher for this journal at the time the agency run was executed

bigint

0

YES

external_id

the scopus ID for the journal/source

varchar

128

YES

title

the name of the journal/source that the publication was published in

varchar

1028

NO

cite_score

citescore is an Elsevier derived metric that measures the relative standing of a journal

decimal

0

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

model: the table with the Kaggle models that are run

https://prod.democratizing-data.tacc.utexas.edu/models

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

name

the name of the model

varchar

32

NO

github_commit_url

the github url where the commit for this model can be found

varchar

1024

YES

description

description of the model

nvarchar

-1

YES

last_updated_date

last time the row was updated; generally the time of the creation of the row

datetime

0

NO

publication: the publications discovered in a run and their metadata

https://prod.democratizing-data.tacc.utexas.edu/publications (To filter publications using other parameters, use https://prod.democratizing-data.tacc.utexas.edu/topics/{topic_id}/publications, https://prod.democratizing-data.tacc.utexas.edu/authors/{author_id}/publications, https://prod.democratizing-data.tacc.utexas.edu/datasets/{parent_alias_id}/publications)

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

journal_id

foreign key to the journal for this publication

bigint

0

YES

external_id

the scopus ID (for Elsevier publications) of this publication

varchar

128

YES

title

title of the publication

varchar

400

YES

doi

DOI of the publication

varchar

80

YES

year

the year that the publication was published as recorded in scopus

int

0

YES

month

the month of publication, which may not be available; this will be an integer value, such as 1 = January

int

0

YES

pub_type

the type of publication, which includes: article, review, book, book chapter, letter

varchar

30

YES

citation_count

the number of times this publication is cited in scopus

int

0

YES

fw_citation_impact

the Field Weighted Citation impact (FWCI) for the publication, which is a measure for how impactful or important a publication is as measured through normalized citations; the number of times cited is divided by the expected number of citations of articles in the same year, subject and publication type, and the world average across papers is 1.0 for this measure; this measure also changes over time

float

0

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

publication_affiliation: the table with affiliations linked to a publication

https://prod.democratizing-data.tacc.utexas.edu/publication_affiliations

Column name
Description
Data type
Length
Is nullable

id

unique identifier of the affiliation table

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

external_id

id assigned by Elsevier to this affiliation

varchar

128

YES

institution_name

the name of the institution to which an author was associated

nvarchar

750

YES

address

the address of the author, most likely that of their institution

nvarchar

750

YES

country_code

the three let6ter country code associated to the affiliation, most likely of the institution

nvarchar

10

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

city

the city for the address or institute in this affiliation

nvarchar

128

YES

state

if appropriate, the state for the address or institute in this affiliation

nvarchar

128

YES

postal_code

if appropriate, the postal code for this affiliation

nvarchar

64

YES

publication_asjc: the table linking a publication to its ASJC code(s)

https://prod.democratizing-data.tacc.utexas.edu/publication_asjcs

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

publication_id

foreign key to publication table's id column, identifying the publication in this relation

bigint

0

NO

asjc_id

foreign key to the ASJC

bigint

0

NO

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

publication_author: the table linking publication and author tables

https://prod.democratizing-data.tacc.utexas.edu/publication_authors

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

publication_id

foreign key to the publication for this author

bigint

0

NO

author_id

foreign key to the table with scopus author entries

bigint

0

NO

author_position

position of author in the list of authors on the publication

int

0

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

publication_topic: the table identifying the topic assigned to a publication

https://prod.democratizing-data.tacc.utexas.edu/publication_topics

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

publication_id

foreign key to the topic table's id column identifying the publication in this relation between publications and topics

bigint

0

NO

topic_id

foreign key to the topic table's id column identifying the topic in this relation between publications and topics

bigint

0

NO

score

TBD

real

0

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

publisher: the table with a list of publishers that can be linked to journals

https://prod.democratizing-data.tacc.utexas.edu/publishers

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

external_id

external identifier of this publisher in Elsevier's scopus repository

nvarchar

128

YES

name

name of the publisher

nvarchar

120

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

topic: the table with Topics defined by Elsevier, consistinging of topic names with three concatenated keywords

https://prod.democratizing-data.tacc.utexas.edu/topics Topics are defined through Elsevier's Topic Prominence in Science methodology. More information about these Topics can be found here: https://www.elsevier.com/solutions/scival/features/topic-prominence-in-science (To filter topic by other parameters, use https://prod.democratizing-data.tacc.utexas.edu/authors/{author_id}/topics, https://prod.democratizing-data.tacc.utexas.edu/datasets/{parent_alias_id}/topics, https://prod.democratizing-data.tacc.utexas.edu/publications/{publication_id}/topics)

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

keywords

a topic is defined in Elsevier by three keywords; keywords are separated with a pipe symbol

varchar

1028

YES

external_topic_id

external identifier of this topic provided by Elsevier

varchar

128

YES

prominence

TBD

real

0

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

publication_ufc: the table with unified fingerprint concepts assigned to a publication

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

publication_id

foreign key to the publication table's id column identifying the publication in this relation between publications and UFC fingerprints

bigint

0

NO

concept_id

TBD

bigint

0

YES

concept_name

TBD

varchar

1024

NO

rank

TBD

float

0

YES

a_freq

TBD

integer

0

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

The following metadata tables apply only to the data that can be accessed in SciServer restricted use databases.

reviewer: the table with reviewers assigned to validate dyads in the publication_dataset_alias table

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

susd_user_id

foreign key to the susd_user table's id column, identifying the user corresponding to this reviewer

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

snippet_validation: the table containing validation results for dyads provided by reviewers

Column name
Description
Data type
Length
Is nullable

id

unique identifier for this entry

bigint

0

NO

run_id

identifies the agency run for which this entry was determined, foreign key to agency_run.id

bigint

0

NO

reviewer_id

foreign key to the reviewer table's id column, identifying the reviewer assigned to validate this snippet

bigint

0

NO

dyad_id

foreign key to the dyad table's id column, identifying the dyad that is being validated

bigint

0

NO

is_dataset_reference

if the value in this column is 1, it indicates the dyad indeed has been identified a reference to a dataset; if 0, it is not a dataset reference; if -1, the reviewer was unsure about it

smallint

0

YES

agency_dataset_identified

if the value in this column is 1, it indicates the dyad indeed identified the specific dataset provided by the agency; if 0, it is not a reference to that dataset; if -1, the reviewer was unsure about it

smallint

0

YES

notes

any notes the reviewer attached to the dyad being reviewed

nvarchar

-1

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

susd_user: the table identifying users of the validation tool

Column name
Description
Data type
Length
Is nullable

id

unique identifier of this table

bigint

0

NO

first_name

first name of the individual

varchar

100

YES

last_name

surname of the individual

varchar

100

YES

email

email of the user

varchar

100

YES

password

encrypted password of the user

varchar

100

YES

last_updated_date

last time the row was updated; generally the time of creation of the row

datetime

0

NO

Last updated