Frank Wang


Two Small Datasets for Semantic Entity Segmentation in Natural Language

  • data
  • NLP

I recently wrote about a lightweight model that I used for segmenting semantic entities in text. In that discussion, we described an application of this model to extracting references to monetary values. I have now uploaded the data used to train that model to GitHub, along with a second dataset for segmenting out references to dates. This post serves as a reference for interpreting these tiny (\(O(100)\)) datasets as provided.

General Schema

Both datasets covered here consist of scraped submission titles from the r/borrow subreddit, a community that coordinates personal loans between redditors. Each scraped submission title comes with a label marking the segments of the string that contain the semantic entities of interest. We serialize our input/label pairs using YAML, which trades off storage efficiency for human readability, and deliver them as a series of relatively small files. Each file looks something like this:


-
    input: '[PAID] (u/verydisappointing) ($180 GBP + Int.) (EARLY)'
    labels:
        -
            - 31
            - 34
-
    input: '[REQ] Need $25 for food in Manchester, CT. Can pay back $30 in a week'
    labels:
        -
            - 12
            - 14
        -
            - 57
            - 59
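Once a file has been parsed (e.g. with `yaml.safe_load` from PyYAML), each record is just a mapping with an `input` string and a `labels` list. A minimal sketch of pulling out the labeled substrings, using a record mirroring the second example above:

```python
# One record mirroring the second example above, as it would appear
# after parsing the YAML (e.g. with yaml.safe_load from PyYAML).
records = [
    {
        "input": "[REQ] Need $25 for food in Manchester, CT. Can pay back $30 in a week",
        "labels": [[12, 14], [57, 59]],
    },
]

def extract_entities(record):
    """Return the labeled substrings for one input/labels pair."""
    return [record["input"][start:end] for start, end in record["labels"]]

for record in records:
    print(extract_entities(record))  # ['25', '30']
```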

As you can see, each label is given as a list of integer pairs marking the beginning and end of each semantic entity, which we can extract as follows:


>>> s = '[REQ] Need $25 for food in Manchester, CT. Can pay back $30 in a week.'
>>> labels = [[12, 14], [57, 59]]
>>>
>>> s[labels[0][0]:labels[0][1]]
'25'
>>> s[labels[1][0]:labels[1][1]]
'30'

It is worth noting that the scraped input titles in these datasets are not sampled uniformly from the pool of titles. Each input was chosen by hand from the complete set, with the goal of selecting titles whose semantic entities are somewhat non-typical. In the two sections below, we discuss some examples of what sorts of expressions we consider non-typical, alongside the guidelines we use for defining ground-truth labels in ambiguous cases.

Monetary Reference Data

Our initial release of monetary reference labels consists of 80 labeled strings. References to money are labeled as sequences of consecutive numerical characters, possibly along with decimal points (.) and commas (,). Detections do not include currency symbols that may accompany the amounts. Below, you can see a selection of example strings with the labels in bold:



    [REQ] ($150) (#Akron, OH, USA)(payback $60 on 1/4, 1/18, and 2/1)

    [REQ] (200) - (Dallas, TX, USA), (01/15/2018), (PayPal)

    [REQ] (80) - (Largo, FL, USA), (Repay $100 on 10/12/2018), (PayPal)

    [REQ] (1000.00) - (#Saco, Maine, USA), (1250.00 by 9/30/17), (Paypal)

    [REQ] (3,000) - (#Fort Dodge, Iowa, U.S.), (03/01/18), (PayPal)
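To make the convention concrete, here is a small sanity check one might run over the data: every labeled monetary span should consist only of digits, periods, and commas. The record and index pairs below are our own illustrative values, not entries from the released files.

```python
import re

# Per the convention above, labeled monetary spans should contain
# only digits, '.', and ',' -- never currency symbols.
SPAN_CHARS = re.compile(r"^[0-9.,]+$")

# An illustrative record in the dataset's schema; the label indices
# here are ours, chosen to select '150' and '60' in the string below.
record = {
    "input": "[REQ] ($150) (#Akron, OH, USA)(payback $60 on 1/4, 1/18, and 2/1)",
    "labels": [[8, 11], [40, 42]],
}

spans = [record["input"][start:end] for start, end in record["labels"]]
assert all(SPAN_CHARS.match(span) for span in spans)
print(spans)  # ['150', '60']
```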

In selecting titles for this dataset, we looked for examples covering a few particular considerations which seem to summarize most atypical cases.

  1. Monetary references with and without currency markers, with markers of different kinds, and with markers in different positions.
  2. Titles with references to different dates.
  3. Titles with and without the presence of . and , characters.
  4. A diverse selection of monetary amounts.

In practice, we have found that our data is somewhat insufficient with respect to considerations (1) and (4), though the details of that question deserve their own discussion.

Date Reference Data

In addition to our monetary reference dataset, we are also releasing 151 labeled strings segmenting out date references. Each segment is intended to reference a specific day of a calendar year (with or without the year specified), including any user-included punctuation. References to days of the week (i.e. Monday, Tuesday, etc.) are not included, but we do include labels for generic months without a day specified (e.g. "Will repay in June"). Some examples follow:



    (£350) - (#Runcorn, Cheshire, UK), (£600 returned in two parts 26th October & 26th Novmeber), (Paypal)

    [REQ] ($210) - (#santa clara, CA, USA), (3/17), (paypal repay $235)

    [REQ] ($400) - (#Hyderabad, Telangana, India), (15 January), (PayPal)

    REQ] ($500 ) - (#Manassas, Virginia, USA), (Jan 1-2), (Paypal/will pay back in crypto)

    [REQ] ($150) - (#Myrtle beach, Sc, USA), ($170 ByJune 6th), (Paypal/Pre-Arranged

    [REQ] ($120) - (#Little Rock, Ar, US), (Repayment Date 7-14-17 ), ( payback total $130)

    [REQ] (€200) - (#Dublin, Ireland), (19th of May), (PayPal)
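As with the monetary data, each record pairs a title with character offsets, and date spans keep the user's own ordinal suffixes and connecting words. The indices below are our own illustration, chosen to select the "19th of May" span from the last example above.

```python
# An illustrative record in the date dataset's schema; the label
# indices are ours, chosen to select the full '19th of May' span,
# including the ordinal suffix and connecting 'of'.
record = {
    "input": "[REQ] (\u20ac200) - (#Dublin, Ireland), (19th of May), (PayPal)",
    "labels": [[36, 47]],
}

spans = [record["input"][start:end] for start, end in record["labels"]]
print(spans)  # ['19th of May']
```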

The data is divided into 12 files, one per month, each containing 10-15 labeled strings. Strings containing dates from multiple months are included in the file of the earliest month referenced. In addition to drawing examples from a broad range of months, we also made efforts to include a strong mixture of spelled-out dates, numerical dates (of different formats), and abbreviations and alternative spellings of month names.