Category description and use cases

Entity extraction, or named entity recognition (NER), is a type of natural language processing (NLP) that attempts to identify and classify entities or concepts, like people, places, organizations, products, and topics, in unstructured text. These extracted entities could then be reviewed (and possibly normalized) and added to item or collection descriptions as access points or tags. Another possible use case is reviewing extracted entities as a way to assess the accuracy of a speech-to-text transcription. If a collection manager is familiar with a collection, unusual entities may serve as a flag for poor transcription. 

Workflow example:

Audio is passed through a segmenter MGM to label speech, silence and music. If necessary, the audio file is split into segments of speech. A new file composed of only the speech segments is sent through a speech-to-text MGM to generate transcripts. If necessary, timestamps are adjusted to restore original segments of silence and music. The transcript is converted to plain text and sent through an entity extraction MGM to extract entity types of interest to the user. Output can be used to generate lists of terms for users to review, timed-text transcripts with entity annotations (JSON) or entities with time offset annotations (JSON). 
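The last step of this workflow, attaching time offsets to extracted entities, might be sketched as follows. This is only an illustration: the word-level transcript structure (`text`, `start`, `end` per word) and the single-space joining of words into plain text are assumptions, not something specified by the workflow above.

```python
def build_char_index(words):
    """Join words into plain text and record each word's character span and times."""
    spans, parts, pos = [], [], 0
    for word in words:
        begin = pos
        end = begin + len(word["text"])
        spans.append((begin, end, word["start"], word["end"]))
        parts.append(word["text"])
        pos = end + 1  # assumption: exactly one space between words
    return " ".join(parts), spans


def add_times(entity, spans):
    """Return a copy of the entity with start/end times from the words it overlaps."""
    overlapping = [s for s in spans
                   if s[0] < entity["endOffset"] and s[1] > entity["beginOffset"]]
    timed = dict(entity)
    if overlapping:
        # Times are written as strings in seconds (s.fff), as in the output standard.
        timed["start"] = f"{overlapping[0][2]:.3f}"
        timed["end"] = f"{overlapping[-1][3]:.3f}"
    return timed


# Hypothetical word-level transcript, for illustration only.
words = [
    {"text": "Visit", "start": 0.0, "end": 0.4},
    {"text": "New", "start": 0.5, "end": 0.7},
    {"text": "York", "start": 0.7, "end": 1.1},
    {"text": "today", "start": 1.2, "end": 1.6},
]
text, spans = build_char_index(words)
entity = {"text": "New York", "type": "Location", "beginOffset": 6, "endOffset": 14}
timed_entity = add_times(entity, spans)
```

If silence and music segments were removed before transcription, the word times would need to be shifted back to the original timeline before this step.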

AMPPD parameters:

  • Score threshold?
  • Entity types to use from output?
  • Language
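If the first two parameters are adopted, they could be applied to output in the standard format roughly as shown here. This is a sketch under assumptions: the function name and argument names are hypothetical, and keeping entities that carry no score is a design choice made for the example, since `score` is optional in the output standard.

```python
def filter_entities(entities, score_threshold=None, entity_types=None):
    """Keep entities of the wanted types whose score meets the threshold."""
    kept = []
    for entity in entities:
        if entity_types is not None and entity["type"] not in entity_types:
            continue
        score = entity.get("score")
        # Entities without a score are kept (assumption: score is optional).
        if (score_threshold is not None and score is not None
                and score["scoreValue"] < score_threshold):
            continue
        kept.append(entity)
    return kept


entities = [
    {"text": "John Dewey", "type": "Person", "beginOffset": 14, "endOffset": 24,
     "score": {"type": "relevance", "scoreValue": 0.8}},
    {"text": "student success", "type": "Concept", "beginOffset": 56, "endOffset": 71,
     "score": {"type": "relevance", "scoreValue": 0.1}},
    {"text": "Bloomington", "type": "Location", "beginOffset": 93, "endOffset": 104},
]
kept = filter_entities(entities, score_threshold=0.5,
                       entity_types={"Person", "Location"})
```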

Output standard

Summary: 

Element | Datatype | Obligation | Definition
media | object | required | Wrapper for metadata about the source media file.
media.filename | string | required | Filename of the source file.
media.characters | integer | required | Number of text characters from the file document evaluated by the NLP tool.
entities | array of objects | required | Wrapper for entities extracted from the document.
entities[*] | object | required | An entity extracted from the document.
entities[*].text | string | required | The entity text extracted from the document.
entities[*].type | string | required | The type of entity as classified by the tool or service.
entities[*].beginOffset | integer | required | The start character within the document of the extracted entity.
entities[*].endOffset | integer | required | The end of the entity string within the document (i.e. the offset of the character immediately after the last character of the entity).
entities[*].start | string (s.fff) | optional | The start time of entity string within the media, referenced from the timecoded transcript or video OCR, in seconds.
entities[*].end | string (s.fff) | optional | The end time of entity string within the media, referenced from the timecoded transcript or video OCR, in seconds.
entities[*].subtype | array of strings | optional | A list of subtypes of the entity type (string). (Provided by some NLP services, ex. IBM Watson.)
entities[*].nounType | string (common or proper) | optional | Whether the entity name is common or proper. (Used by Google NLP.)
entities[*].score | object | optional | A confidence or relevance score for the entity.
entities[*].score.type | string (confidence or relevance) | required | The type of score, confidence or relevance. Confidence indicates the NLP service’s confidence of correctly detecting the type of entity while relevance indicates the importance of an entity to the document. (Of the candidates we tested, AWS Comprehend “score” would map to “confidence” while IBM Watson’s “relevance” and Google NLP’s “salience” would map to “relevance”.)
entities[*].score.scoreValue | number | required | The score value, typically a float in the range of 0-1.
entities[*].normalizedForm | object | optional | A normalized form of the entity. Some services group mentions of similar terms and label them with a normalized term that may or may not correspond to an entity from an external knowledge base or graph.
entities[*].normalizedForm.text | string | required | The normalized text form of the entity.
entities[*].normalizedForm.externalEntities | array | optional | A list of corresponding entity ids and/or urls from external knowledge bases or graphs.
entities[*].normalizedForm.externalEntities[*] | object | required | A corresponding entity id and/or url from an external knowledge base or graph.
entities[*].normalizedForm.externalEntities[*].source | string | required | The source of the external entity, ex. “Wikipedia”.
entities[*].normalizedForm.externalEntities[*].id | string | optional | An id for a corresponding external entity.
entities[*].normalizedForm.externalEntities[*].url | string | optional | A URL for a corresponding external entity.

Schema
{
    "$schema": "http://json-schema.org/schema#",
    "type": "object",
    "title": "Entity Extraction Schema",
    "required": [
        "media",
        "entities"
    ],
    "properties": {
        "media": {
            "type": "object",
            "title": "Media",
            "description": "Wrapper for metadata about the source media file.",
            "required": [
                "filename",
                "characters"
            ],
            "properties": {
                "filename": {
                    "type": "string",
                    "title": "Filename",
                    "description": "Filename of the source file.",
                    "default": "",
                    "examples": ["myfile.txt"]
                },
                "characters": {
                    "type": "integer",
                    "title": "Characters",
                    "description": "Number of text characters from the file document evaluated by the NLP tool.",
                    "default": 0,
                    "examples": [47026]
                }
            }
        },
        "entities": {
            "type": "array",
            "title": "Entities",
            "description": "Wrapper for entities extracted from the document.",
            "items": {
                "type": "object",
                "required": [
                    "text",
                    "type",
                    "beginOffset",
                    "endOffset"],
                "properties": {
                    "text": {
                        "type": "string",
                        "title": "Text",
                        "description": "The entity text extracted from the document.",
                        "default": "",
                        "examples": ["New York"]
                    },
                    "type": {
                        "type": "string",
                        "title": "Type",
                        "description": "The type of entity as classified by the tool or service.",
                        "default": "",
                        "examples": ["PERSON", "COMMERCIAL_ITEM"]
                    },
                    "beginOffset": {
                        "type": "integer",
                        "title": "Begin offset",
                        "description": "The start character within the document of the extracted entity.",
                        "default": 0,
                        "examples": [4637]
                    },
                    "endOffset": {
                        "type": "integer",
                        "title": "End offset",
                        "description": "The end of the entity string within the document, (i.e. the offset of the character immediately after the last character of the entity).",
                        "default": 0,
                        "examples": [4645]
                    },
                    "start": {
                        "type": "string",
                        "title": "Start",
                        "description": "The start time of entity string within the media, referenced from the timecoded transcript or video OCR, in seconds.",
                        "default": "",
                        "examples": ["837.834"]
                    },
                    "end": {
                        "type": "string",
                        "title": "End",
                        "description": "The end time of entity string within the media, referenced from the timecoded transcript or video OCR, in seconds.",
                        "default": "",
                        "examples": ["838.79"]
                    },
                    "subtype": {
                        "type": "array",
                        "title": "Subtype",
                        "description": "A list of subtypes of the entity type (string). (Provided by some NLP services, ex. IBM Watson.)",
                        "items": {
                            "type": "string"
                        }
                    },
                    "nounType": {
                        "type": "string",
                        "title": "Noun type",
                        "description": "Whether the entity name is common or proper. (Used by Google NLP.)",
                        "enum": [
                            "proper",
                            "common"
                        ]
                    },
                    "score": {
                        "type": "object",
                        "title": "Score",
                        "description": "A confidence or relevance score for the entity.",
                        "required": [
                            "type",
                            "scoreValue"
                        ],
                        "properties": {
                            "type": {
                                "type": "string",
                                "title": "Type",
                                "description": "The type of score, confidence or relevance. Confidence indicates the NLP service’s confidence of correctly detecting the type of entity while relevance indicates the importance of an entity to the document. (Of the candidates we tested, AWS Comprehend “score” would map to “confidence” while IBM Watson’s “relevance” and Google NLP’s “salience” would map to “relevance”.)",
                                "enum": [
                                    "confidence",
                                    "relevance"
                                ]
                            },
                            "scoreValue": {
                                "type": "number",
                                "title": "Score value",
                                "description": "The score value, typically a float in the range of 0-1.",
                                "default": 0,
                                "examples": [0.437197]
                            }
                        }
                    },
                    "normalizedForm": {
                        "type": "object",
                        "title": "Normalized form",
                        "description": "A normalized form of the entity. Some services group mentions of similar terms and label them with a normalized term that may or may not correspond to an entity from an external knowledge base or graph.",
                        "required": ["text"],
                        "properties": {
                            "text": {
                                "type": "string",
                                "title": "Text",
                                "description": "The normalized text form of the entity.",
                                "default": "",
                                "examples": ["New York City"]
                            },
                            "externalEntities": {
                                "type": "array",
                                "title": "External entities",
                                "description": "A list of corresponding entity ids and/or urls from external knowledge bases or graphs.",
                                "items": {
                                    "type": "object",
                                    "required": ["source"],
                                    "anyOf": [
                                        {
                                            "properties": {
                                                "source": {
                                                    "type": "string",
                                                    "title": "Source",
                                                    "description": "The source of the external entity, ex. “Wikipedia”.",
                                                    "default": "",
                                                    "examples": ["Google Knowledge Graph"]
                                                },
                                                "id": {
                                                    "type": "string",
                                                    "title": "Id",
                                                    "description": "An id for a corresponding external entity.",
                                                    "default": "",
                                                    "examples": ["/m/09c7w0"]
                                                }
                                            }
                                        },
                                        {
                                            "properties": {
                                                "source": {
                                                    "type": "string",
                                                    "title": "Source",
                                                    "description": "The source of the external entity, ex. “Wikipedia”.",
                                                    "default": "",
                                                    "examples": ["Dbpedia"]
                                                },
                                                "url": {
                                                    "type": "string",
                                                    "title": "url",
                                                    "description": "A URL for a corresponding external entity.",
                                                    "default": "",
                                                    "examples": ["http://dbpedia.org/resource/New_York_City"]
                                                }
                                            }  
                                        },
                                        {
                                            "properties": {
                                                "source": {
                                                    "type": "string",
                                                    "title": "Source",
                                                    "description": "The source of the external entity, ex. “Wikipedia”.",
                                                    "default": "",
                                                    "examples": ["Wikipedia"]
                                                },
                                                "id": {
                                                    "type": "string",
                                                    "title": "Id",
                                                    "description": "An id for a corresponding external entity.",
                                                    "default": "",
                                                    "examples": ["New_York_City"]
                                                },
                                                "url": {
                                                    "type": "string",
                                                    "title": "url",
                                                    "description": "A URL for a corresponding external entity.",
                                                    "default": "",
                                                    "examples": ["https://en.wikipedia.org/wiki/New_York_City"]
                                                }
                                            }  
                                        }
                                    ]
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

JSON Schema
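A quick structural check of an output file against the required fields of the standard might look like the following. This is a hand-rolled sketch using only the standard library, not a full JSON Schema validation; the function name `check_output` is hypothetical, and a JSON Schema validator library would be the more complete route.

```python
import json

REQUIRED_MEDIA = ("filename", "characters")
REQUIRED_ENTITY = ("text", "type", "beginOffset", "endOffset")
REQUIRED_SCORE = ("type", "scoreValue")


def check_output(doc):
    """Return a list of problems; an empty list means all required fields are present."""
    problems = []
    media = doc.get("media")
    if not isinstance(media, dict):
        problems.append("media: missing or not an object")
    else:
        problems += [f"media.{key}: missing" for key in REQUIRED_MEDIA if key not in media]

    entities = doc.get("entities")
    if not isinstance(entities, list):
        problems.append("entities: missing or not an array")
        return problems

    for i, entity in enumerate(entities):
        problems += [f"entities[{i}].{key}: missing"
                     for key in REQUIRED_ENTITY if key not in entity]
        # score is optional, but when present both of its fields are required.
        score = entity.get("score")
        if score is not None:
            problems += [f"entities[{i}].score.{key}: missing"
                         for key in REQUIRED_SCORE if key not in score]
    return problems


sample = json.loads("""
{
  "media": {"filename": "myfile.txt", "characters": 47582},
  "entities": [
    {"text": "John Dewey", "type": "Person", "beginOffset": 14, "endOffset": 24}
  ]
}
""")
problems = check_output(sample)
```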

Sample output – minimum

{
	"media": {
		"filename": "myfile.txt",
		"characters": 47582
	},
	"entities": [
		{
			"text": "John Dewey",
			"type": "Person",
			"beginOffset": 14,
			"endOffset": 24
		},
		{
			"text": "student success",
			"type": "Concept",
			"beginOffset": 56,
			"endOffset": 71
		},
		{
			"text": "Bloomington",
			"type": "Location",
			"beginOffset": 93,
			"endOffset": 104
		},
		{
			"text": "New York",
			"type": "Location",
			"beginOffset": 155,
			"endOffset": 163
		}
	]
}


Sample output - full

{
	"media": {
		"filename": "myfile.txt",
		"characters": 47582
	},
	"entities": [
		{
			"text": "John Dewey",
			"type": "Person",
			"beginOffset": 14,
			"endOffset": 24,
			"start": "22.888",
			"end": "23.555",
			"nounType": "proper",
			"score": {
				"type": "relevance",
				"scoreValue": 0.0001677875
			},
			"normalizedForm": {
				"text": "John Dewey",
				"externalEntities": [
					{
						"source": "Google Knowledge Graph",
						"id": "/m/04411"
					},
					{
						"source": "Wikipedia",
						"url": "https://en.wikipedia.org/wiki/John_Dewey"
					}]
			}
		},
		{
			"text": "student success",
			"type": "Concept",
			"beginOffset": 56,
			"endOffset": 71,
			"start": "32.888",
			"end": "33.555",
			"nounType": "common",
			"score": {
				"type": "relevance",
				"scoreValue": 0.00011485724
			}
		},
		{
			"text": "Bloomington",
			"type": "Location",
			"beginOffset": 93,
			"endOffset": 104,
			"start": "402.788",
			"end": "403.955",
			"nounType": "proper",
			"subtype": ["City"],
			"score": {
				"type": "relevance",
				"scoreValue": 0.0001677875
			},
			"normalizedForm": {
				"text": "Bloomington",
				"externalEntities": [
					{
						"source": "Google Knowledge Graph",
						"id": "/m/0snty"
					},
					{
						"source": "Wikipedia",
						"url": "https://en.wikipedia.org/wiki/Bloomington,_Indiana"
					}]
			}
		},
		{
			"text": "New York",
			"type": "Location",
			"beginOffset": 155,
			"endOffset": 163,
			"start": "837.834",
			"end": "838.455",
			"nounType": "proper",
			"subtype": [
				"PoliticalDistrict",
				"GovernmentalJurisdiction",
				"PlaceWithNeighborhoods",
				"WineRegion",
				"FilmScreeningVenue",
				"City"],
			"score": {
				"type": "relevance",
				"scoreValue": 0.433819
			},
			"normalizedForm": {
				"text": "New York City",
				"externalEntities": [
					{
						"source": "Google Knowledge Graph",
						"id": "/m/02_286"
					},
					{
						"source": "Wikipedia",
						"url": "https://en.wikipedia.org/wiki/New_York_City"
					}]
			}
		}
	]
}


Recommended tool(s)

AWS Comprehend

Official documentation:  https://aws.amazon.com/comprehend/

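Language: Entity recognition supports a limited set of languages; the 100-language figure applies to Comprehend's dominant-language detection. Supported languages by feature: https://docs.aws.amazon.com/comprehend/latest/dg/how-languages.html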

Description: 

Cost: Requests for Entity Recognition, Sentiment Analysis, Syntax Analysis, Key Phrase Extraction, and Language Detection are measured in units of 100 characters, with a 3-unit (300-character) minimum charge per request, at $0.0001 per unit (for up to 10M units monthly).
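As a worked example of the pricing rule above, the cost of one request can be estimated as below. The figures come from the cost note; rounding partial units up to the next whole unit is an assumption, and the function names are hypothetical.

```python
import math

UNIT_CHARS = 100         # one unit = 100 characters
MIN_UNITS = 3            # minimum charge per request
PRICE_PER_UNIT = 0.0001  # USD, for the first 10M units per month


def comprehend_units(characters):
    """Billable units for one request (assumes partial units round up)."""
    return max(MIN_UNITS, math.ceil(characters / UNIT_CHARS))


def comprehend_cost(characters):
    """Estimated cost in USD for one request."""
    return comprehend_units(characters) * PRICE_PER_UNIT


units = comprehend_units(47582)  # a 47,582-character transcript
cost = comprehend_cost(47582)
print(f"{units} units, about ${cost:.4f}")
```

Under these assumptions a 47,582-character transcript is 476 units, or roughly $0.05 per feature requested.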

Social impact: 

Notes: 

Installation & requirements

AWS Comprehend is run via the AWS Console or AWS Comprehend API. Interaction with the API can be made through the AWS Command Line Interface (CLI) or by invoking scripts with AWS Lambda functions. AWS offers SDKs in a variety of programming languages. For testing, the AWS CLI was used. 

Each plain text file is uploaded to an S3 bucket, then referenced in a call to the API using the StartEntitiesDetectionJob method. (This method is used for texts over 5000 characters; for texts under 5000 characters, the DetectEntities method can be used.)

aws comprehend start-entities-detection-job --data-access-role-arn=arn:aws:iam::[access_role] --language-code=en --input-data-config S3Uri=s3://[input_bucket]/myfile.txt,InputFormat=ONE_DOC_PER_FILE --output-data-config S3Uri=s3://[output_bucket]/


This should return a job id and job status. Example:

{"JobId": "4ee1548fec685f96f76361276e588eba", "JobStatus": "SUBMITTED"}


To check on the status of a job, use the DescribeEntitiesDetectionJob method with the job id returned by the initial request:

aws comprehend describe-entities-detection-job --job-id=1337d49aa78c092d510ef8394b545a13


When the job is complete the output is sent to the S3 bucket listed in the OutputDataConfig parameter from the initial request.
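The status check can also be scripted. The sketch below polls until the job reaches a terminal state; it assumes `client` is a boto3 Comprehend client (for example `boto3.client("comprehend")`), which is not imported here so the function can be exercised with a stand-in. The function name, polling interval, and retry limit are hypothetical choices.

```python
import time

# Job states after which polling should stop.
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(client, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll DescribeEntitiesDetectionJob until the job finishes or polling times out."""
    for _ in range(max_polls):
        response = client.describe_entities_detection_job(JobId=job_id)
        properties = response["EntitiesDetectionJobProperties"]
        if properties["JobStatus"] in TERMINAL_STATES:
            return properties
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Usage (requires AWS credentials and the boto3 package):
#   import boto3
#   properties = wait_for_job(boto3.client("comprehend"), "<job id>")
#   print(properties["JobStatus"], properties["OutputDataConfig"]["S3Uri"])
```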

Parameters

Full list of parameters: https://docs.aws.amazon.com/comprehend/latest/dg/API_StartEntitiesDetectionJob.html

  • LanguageCode: required; English (en) was used for testing

Input formats

Plain text

Entity types

Type | Description
COMMERCIAL_ITEM | A branded product
DATE | A full date (for example, 11/25/2017), day (Tuesday), month (May), or time (8:30 a.m.)
EVENT | An event, such as a festival, concert, election, etc.
LOCATION | A specific location, such as a country, city, lake, building, etc.
ORGANIZATION | Large organizations, such as a government, company, religion, sports team, etc.
OTHER | Entities that don't fit into any of the other entity categories
PERSON | Individuals, groups of people, nicknames, fictional characters
QUANTITY | A quantified amount, such as currency, percentages, numbers, bytes, etc.
TITLE | An official name given to any creation or creative work, such as movies, books, songs, etc.


AWS Comprehend entity types mapped to common types used for testing.

{'COMMERCIAL_ITEM':'concept',
'DATE':'do not use',
'EVENT':'event',
'LOCATION':'location',
'ORGANIZATION':'organization',
'OTHER':'concept',
'PERSON':'person',
'QUANTITY':'do not use',
'TITLE':'concept'}


Example Usage

AWS Comprehend Example
aws comprehend start-entities-detection-job --data-access-role-arn=arn:aws:iam::[access_role] --language-code=en --input-data-config S3Uri=s3://[input_bucket]/myfile.txt,InputFormat=ONE_DOC_PER_FILE --output-data-config S3Uri=s3://[output_bucket]/

Example Output

AWS Comprehend Output
{
    "Entities": [
        {
            "BeginOffset": 16,
            "EndOffset": 20,
            "Score": 0.930534839630127,
            "Text": "17th",
            "Type": "DATE"
        },
        {
            "BeginOffset": 22,
            "EndOffset": 34,
            "Score": 0.9784671664237976,
            "Text": "This morning",
            "Type": "DATE"
        },
        {
            "BeginOffset": 72,
            "EndOffset": 101,
            "Score": 0.7616077661514282,
            "Text": "Bryan Administration Building",
            "Type": "ORGANIZATION"
        },
        {
            "BeginOffset": 156,
            "EndOffset": 166,
            "Score": 0.6579800844192505,
            "Text": "Ballantine",
            "Type": "LOCATION"
        },
        {
            "BeginOffset": 171,
            "EndOffset": 183,
            "Score": 0.9638428092002869,
            "Text": "Rawles Halls",
            "Type": "LOCATION"
        },
        {
            "BeginOffset": 185,
            "EndOffset": 198,
            "Score": 0.9551753997802734,
            "Text": "Five students",
            "Type": "QUANTITY"
        },
        {
            "BeginOffset": 234,
            "EndOffset": 246,
            "Score": 0.8644406795501709,
            "Text": "this morning",
            "Type": "DATE"
        }
    ]
}
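Output like this could be converted to the output standard along the following lines. This is a sketch, not the project's conversion script: the function name is hypothetical, the service's own type is kept in `type` (as the standard describes), and the common-type mapping from the section above is used only to drop types marked 'do not use', which is an assumption about how the mapping would be applied.

```python
# Mapping of AWS Comprehend types to the common types used for testing.
AWS_TYPE_MAP = {
    'COMMERCIAL_ITEM': 'concept', 'DATE': 'do not use', 'EVENT': 'event',
    'LOCATION': 'location', 'ORGANIZATION': 'organization', 'OTHER': 'concept',
    'PERSON': 'person', 'QUANTITY': 'do not use', 'TITLE': 'concept',
}


def convert_comprehend(aws_output, filename, characters):
    """Convert AWS Comprehend entity output to the entity extraction standard."""
    entities = []
    for item in aws_output["Entities"]:
        # Unknown types are treated like 'do not use' (assumption).
        if AWS_TYPE_MAP.get(item["Type"], 'do not use') == 'do not use':
            continue
        entities.append({
            "text": item["Text"],
            "type": item["Type"],
            "beginOffset": item["BeginOffset"],
            "endOffset": item["EndOffset"],
            # AWS Comprehend "Score" maps to a confidence score.
            "score": {"type": "confidence", "scoreValue": item["Score"]},
        })
    return {"media": {"filename": filename, "characters": characters},
            "entities": entities}


aws_output = {"Entities": [
    {"BeginOffset": 16, "EndOffset": 20, "Score": 0.930534839630127,
     "Text": "17th", "Type": "DATE"},
    {"BeginOffset": 72, "EndOffset": 101, "Score": 0.7616077661514282,
     "Text": "Bryan Administration Building", "Type": "ORGANIZATION"},
    {"BeginOffset": 156, "EndOffset": 166, "Score": 0.6579800844192505,
     "Text": "Ballantine", "Type": "LOCATION"},
]}
# The filename and character count here are placeholder values.
result = convert_comprehend(aws_output, "myfile.txt", 250)
```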

SpaCy

Official documentation: https://spacy.io

Language: English

Description: 

Cost: Free (open source)

Social impact: 

Notes: 

Installation & requirements

Install SpaCy as a Python library using pip or other preferred method. 

Download models: https://spacy.io/usage/models (We used en_core_web_lg for testing.)

Example:

python -m spacy download en_core_web_lg

Parameters

Model: pass the model name as an argument when loading the SpaCy pipeline.

Example:

nlp = spacy.load("en_core_web_lg")

Input formats

Plain text

Entity types

Full list at https://spacy.io/api/annotation#named-entities

TYPE | DESCRIPTION
PERSON | People, including fictional.
NORP | Nationalities or religious or political groups.
FAC | Buildings, airports, highways, bridges, etc.
ORG | Companies, agencies, institutions, etc.
GPE | Countries, cities, states.
LOC | Non-GPE locations, mountain ranges, bodies of water.
PRODUCT | Objects, vehicles, foods, etc. (Not services.)
EVENT | Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART | Titles of books, songs, etc.
LAW | Named documents made into laws.
LANGUAGE | Any named language.
DATE | Absolute or relative dates or periods.
TIME | Times smaller than a day.
PERCENT | Percentage, including "%".
MONEY | Monetary values, including unit.
QUANTITY | Measurements, as of weight or distance.
ORDINAL | “first”, “second”, etc.
CARDINAL | Numerals that do not fall under another type.


SpaCy types mapped to common types used for testing:

{'PERSON':'person',
'NORP':'concept',
'FAC':'concept',
'ORG':'organization',
'GPE':'location',
'LOC':'location',
'PRODUCT':'concept',
'EVENT':'event',
'WORK_OF_ART':'concept',
'LAW':'concept',
'LANGUAGE':'concept',
'DATE':'do not use',
'TIME':'do not use',
'PERCENT':'do not use',
'MONEY':'do not use',
'QUANTITY':'do not use',
'ORDINAL':'do not use',
'CARDINAL':'do not use'}


Example Usage

SpaCy Example
import json

import spacy

# Load the model downloaded during installation (en_core_web_lg was used for testing).
nlp = spacy.load("en_core_web_lg")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

# Metadata about the source document.
media = {"filename": "myfile.txt", "characters": len(text)}

# doc.ents yields entity spans; offsets are character positions in the text.
entities = []
for ent in doc.ents:
    entities.append({
        "text": ent.text,
        "type": ent.label_,
        "beginOffset": ent.start_char,
        "endOffset": ent.end_char,
    })

result = {"media": media, "entities": entities}
print(json.dumps(result, indent=4))

Example Output

SpaCy Output
{
    "media": {
        "filename": "myfile.txt",
        "characters": 54
    },
    "entities": [
        {
            "text": "Apple",
            "type": "ORG",
            "beginOffset": 0,
            "endOffset": 5
        },
        {
            "text": "U.K.",
            "type": "GPE",
            "beginOffset": 27,
            "endOffset": 31
        },
        {
            "text": "$1 billion",
            "type": "MONEY",
            "beginOffset": 44,
            "endOffset": 54
        }
    ]
}

Other evaluated tools

Stanford CoreNLP

Official documentation: https://nlp.stanford.edu/software/index.shtml

Language: Java, with bindings or translations for Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages

Cost: Free (open source)

Social impact: 

Notes: Comes with three models: a 3-class model (Location, Person, Organization), a 4-class model (Location, Person, Organization, Misc), and a 7-class model (Location, Person, Organization, Money, Percent, Date, Time). Supported languages: Arabic, Chinese, English, French, German, Spanish. Third-party models are available for Russian and Swedish.

Installation & requirements

https://stanfordnlp.github.io/CoreNLP/

Input formats

plain text

IBM Watson Natural Language Understanding

Official documentation: https://cloud.ibm.com/catalog/services/natural-language-understanding

Language: web service

Cost: Lite plan: 30,000 NLU items/month free. (An NLU item is based on the number of data units enriched and the number of enrichment features applied. A data unit is 10,000 characters or less.) For example, extracting Entities and Sentiment from 15,000 characters of text is (2 Data Units * 2 Enrichment Features) = 4 NLU Items.

Social impact: On their website, "By default, all Watson services log requests and their results. Logging is done only to improve the services for future users. The logged data is not shared or made public. To prevent IBM from accessing your data for general service improvements, please visit this site: https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#data-collection"

Notes: Entities include Anatomy, Award, Broadcaster, Company, Crime, Drug, EmailAddress, Facility, GeographicFeature, HealthCondition, Hashtag, IPAddress, JobTitle, Location, Movie, MusicGroup, NaturalEvent, Organization, Person, PrintMedia, Quantity, Sport, SportingEvent, TelevisionShow, TwitterHandle, Vehicle, and subtypes of each of these. Results include relevance, subtypes, DBpedia names and links, and counts. Supported languages: Arabic, Chinese, Dutch, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish.

Installation & requirements

web service

Input formats

plain text, raw html, public URL, 50000 character limit


Google Cloud Natural Language API

Official documentation: https://cloud.google.com/natural-language/

Language: web service

Cost: Calculated in terms of “units,” where each document sent to the API for analysis is at least one unit. One unit per 1,000 characters. First 5000 units free, then $1/1000 units.
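The unit rule above can be turned into a rough estimate for a batch of documents. The rates are taken from the cost note; treating the free allowance as a single pool for the whole batch, rounding partial units up, and pricing billable units pro rata are assumptions, and the function names are hypothetical.

```python
import math


def google_units(characters):
    """Units for one document: one per 1,000 characters, at least one."""
    return max(1, math.ceil(characters / 1000))


def google_cost(document_lengths, free_units=5000, price_per_1000_units=1.0):
    """Return (total units, estimated cost in USD) for a batch of documents."""
    total_units = sum(google_units(n) for n in document_lengths)
    billable_units = max(0, total_units - free_units)
    return total_units, billable_units * price_per_1000_units / 1000


# 200 transcripts of 47,582 characters each (illustrative numbers).
total_units, cost = google_cost([47582] * 200)
print(f"{total_units} units, about ${cost:.2f}")
```

Under these assumptions each transcript is 48 units, so the batch is 9,600 units, of which 4,600 are billable.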

Social impact: 

Notes: Categories: Person, Organization, Event, Location, Consumer good, Work of art, Quantity, and Other
- salience (relevance to document)
- Wikipedia URL
- Google Knowledge Graph ID
- distinguishes between proper and common nouns

Supported languages vary by feature, but generally include: English, Chinese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish

Installation & requirements

web service

Input formats

plain text, html

Evaluation summary

Scripts for converting MGM output formats and comparing results are on the project GitHub.

Analysis of custom vocabulary usage with SpaCy is in the project Google Drive.

Precision, Recall, and F1 scores for ground truth testing are in the project Google Drive.
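For reference, precision, recall, and F1 for one document can be computed from sets of extracted and ground truth entities as sketched below. The matching rule used here (exact match on lowercased text plus common type) is an assumption for illustration and may differ from the criterion used in the project's testing.

```python
def entity_key(entity):
    """Key used to decide whether two entities match (assumed matching rule)."""
    return (entity["text"].lower(), entity["type"].lower())


def precision_recall_f1(predicted, truth):
    """Compute precision, recall, and F1 over two lists of entities."""
    predicted_keys = {entity_key(e) for e in predicted}
    truth_keys = {entity_key(e) for e in truth}
    true_positives = len(predicted_keys & truth_keys)
    precision = true_positives / len(predicted_keys) if predicted_keys else 0.0
    recall = true_positives / len(truth_keys) if truth_keys else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative data only, not project results.
truth = [
    {"text": "John Dewey", "type": "person"},
    {"text": "student success", "type": "concept"},
    {"text": "Bloomington", "type": "location"},
    {"text": "New York", "type": "location"},
]
predicted = [
    {"text": "John Dewey", "type": "person"},
    {"text": "New York", "type": "location"},
    {"text": "Indiana", "type": "location"},
]
precision, recall, f1 = precision_recall_f1(predicted, truth)
```

Here two of the three predictions are correct and two of the four ground truth entities are found, giving a precision of 2/3, a recall of 1/2, and an F1 of 4/7.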