
SSF Statistics
--------------

NB: reading this document does *not* make your runs manual.  The
attached ssf-training-examples span the entire time frame of the
corpus, so they can only be used on an hour-by-hour basis in an SSF
system.  Using this data in a CCR system makes its runs manual, even
if the data is used hour-by-hour.

This is the first year of KBA Streaming Slot Filling.  We generated
truth data from a combination of early run submissions and the vital
documents identified by NIST assessors during the first phase of
assessing.

As expected, the SSF slots have a long tail in both dimensions:
entities and slots.  The long tail means that systems with very
specifically written extraction routines tied to specific slot types
could miss slots with large numbers of instances.

The two-parameter nature of SSF data enables multiple analyses.

Macro-averaging across slots will put equal weight on CauseOfDeath and
Contact_Meet_PlaceTime, even though they have very different numbers
of instances in the truth data.  Similarly, macro-averaging across
entities will put equal weight on entities with very different numbers
of slot fills.
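
To make the weighting concrete, here is a minimal sketch that compares
macro- and micro-averaged recall over two slot types.  The truth counts
are the eval counts for those two slots listed later in this document;
the "found" counts and the function names are invented for illustration
and do not describe any real run or the official scorer.

```python
# Truth counts are the eval-data counts from this document.
truth = {"Contact_Meet_PlaceTime": 232, "CauseOfDeath": 1}

# Hypothetical system output: number of truth fills it recovered per slot.
found = {"Contact_Meet_PlaceTime": 116, "CauseOfDeath": 1}


def macro_recall(truth, found):
    """Average the per-slot recalls, so every slot type counts equally."""
    per_slot = [found[slot] / truth[slot] for slot in truth]
    return sum(per_slot) / len(per_slot)


def micro_recall(truth, found):
    """Pool all fills first, so every fill counts equally."""
    return sum(found.values()) / sum(truth.values())


print("macro recall:", macro_recall(truth, found))
print("micro recall:", micro_recall(truth, found))
```

In this made-up case the single CauseOfDeath fill pulls the macro
average well above the micro average, which is the effect described
above.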

Each slot fill has these properties:

  * equivalence class identifier "equiv_id_XX" 

  * begin/end, which are the earliest and latest date_hour containing
    documents that substantiate the slot fill

  * aliases, equivalent strings

  * beginState/endState, each one of "<", "|", ">", "?", or the empty
    string, which signify that the slot fill began (or ended) before
    the corpus, during the corpus, after the corpus, unclear, and not
    assessed, respectively

  * stream_ids, which identify only a subset of the documents in the
    corpus that substantiate this slot fill
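
As a rough illustration of the properties above, the sketch below
decodes the state symbols and tests whether a fill is an evaluation
target.  The key names mirror the property list, but the actual layout
of the JSON file, the date_hour format, and every value in the example
record are assumptions; check them against the data before relying on
this.

```python
# Symbols and meanings as given in the property list above.
STATE_MEANINGS = {
    "<": "before the corpus",
    "|": "during the corpus",
    ">": "after the corpus",
    "?": "unclear",
    "": "not assessed",
}

# Hypothetical record: key names and all values are placeholders.
example_fill = {
    "equiv_id": "equiv_id_01",
    "begin": "2012-03-01-05",
    "end": "2012-03-02-17",
    "aliases": ["example alias", "another alias"],
    "beginState": "|",
    "endState": "",
    "stream_ids": ["<stream_id_1>", "<stream_id_2>"],
}


def describe_states(fill):
    """Return the (begin, end) state meanings for one slot fill."""
    return (STATE_MEANINGS[fill["beginState"]],
            STATE_MEANINGS[fill["endState"]])


def is_eval_target(fill):
    """A fill is an eval target when it began during the corpus."""
    return fill["beginState"] == "|"


print(describe_states(example_fill), is_eval_target(example_fill))
```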

Assessors focused on slot fills that were likely to change.  However,
about three-quarters of the couple thousand fills either changed
before the corpus started or were hard to ascertain.  The roughly one
quarter that clearly began during the corpus were marked with "|" and
are the target objective of the SSF evaluation.  These examples have
been omitted from the enclosed data dump, which does include the other
slot fills:

     trec-kba-ssf-training-examples-2013-07-16.json 

The slots with endState="|" have nuanced semantics that make them
harder to use for evaluation, so we have included them in the training
examples.

A successful true-positive assertion for SSF will identify a byte
range in a corpus text that overlaps one of the alias strings for a
slot fill identified by the assessors as beginning during the corpus.
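
The overlap test described above can be sketched as follows.  This is
only an illustration: it assumes half-open byte ranges (start, end),
UTF-8 encoded text, and exact matching of alias strings, none of which
is specified here, and the function names are hypothetical.

```python
def alias_spans(text, alias):
    """Yield half-open (start, end) byte spans of each alias occurrence."""
    if not alias:
        return  # an empty alias would match everywhere, so skip it
    start = text.find(alias)
    while start != -1:
        yield (start, start + len(alias))
        start = text.find(alias, start + 1)


def overlaps(a, b):
    """True when two half-open spans share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


def is_true_positive(text, asserted_span, aliases):
    """True when the asserted span overlaps any alias occurrence in text."""
    return any(
        overlaps(asserted_span, span)
        for alias in aliases
        for span in alias_spans(text, alias.encode("utf-8"))
    )


# Made-up document text and alias, for illustration only.
doc = "The entity visited Fargo on Monday.".encode("utf-8")
print(is_true_positive(doc, (17, 21), ["Fargo"]))
print(is_true_positive(doc, (0, 3), ["Fargo"]))
```

A stricter scorer might require a minimum amount of overlap rather than
a single shared byte; that choice is left open here.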

We are organizing several metrics to handle the different forms of
event detection and resolution in this data, including the
complexities of string overlap, finding the earliest substantiation,
and coreference resolution of slot fills.  Metrics will focus on:

  * recall more than precision, since the assessors did not find all
    possible slot fills, and systems will likely return some correct
    slot fills that are not in the truth data.

  * precision of coreference resolution of slots with multiple
    equivalent strings, aka "aliases"

  * finding the earliest evidence for a slot fill
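
For the last point, one simple way to compare a system's first
assertion against the begin date_hour of a slot fill is sketched below.
It assumes date_hour strings look like "2012-03-01-05" (year, month,
day, hour); that layout, the function names, and the idea of measuring
lateness in hours are assumptions for illustration and not the
official metric.

```python
from datetime import datetime

# Assumed date_hour layout, e.g. "2012-03-01-05".
DATE_HOUR_FORMAT = "%Y-%m-%d-%H"


def parse_date_hour(date_hour):
    """Parse an assumed YYYY-MM-DD-HH string into a datetime."""
    return datetime.strptime(date_hour, DATE_HOUR_FORMAT)


def earliest(date_hours):
    """Return the earliest date_hour string from an iterable."""
    return min(date_hours, key=parse_date_hour)


def hours_late(system_date_hour, truth_begin):
    """Hours between the truth begin and the system's first assertion."""
    delta = parse_date_hour(system_date_hour) - parse_date_hour(truth_begin)
    return delta.total_seconds() / 3600.0


# Made-up values, for illustration only.
asserted = ["2012-03-02-17", "2012-03-01-08", "2012-03-01-23"]
first = earliest(asserted)
print(first, hours_late(first, "2012-03-01-05"))
```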




Summary Counts:
---------------
  387	distinct (entity, slot) pairs
  2152	distinct fills
  470	fills that begin during the corpus (eval data)
  1682	fills that do not begin during the corpus (released as training data)
  48	fills that end during but do not begin during the corpus
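
As a quick consistency check on the counts above, the sketch below
confirms that the eval and training fills add up to the total and
prints each share.  The variable and function names are hypothetical;
the numbers are taken from the list above.

```python
# Counts copied from the summary list above.
total_fills = 2152
begin_during = 470        # eval data
not_begin_during = 1682   # released as training data


def share(part, whole):
    """Fraction of the whole represented by part."""
    return part / whole


# The two groups partition the distinct fills.
assert begin_during + not_begin_during == total_fills

print("eval share:     {:.1%}".format(share(begin_during, total_fills)))
print("training share: {:.1%}".format(share(not_begin_during, total_fills)))
```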



Summary Counts of Eval Data
---------------------------
These slot fills have beginState="|", meaning the slot
value was acquired during the corpus time range.

num eval fills per slot type:
232 Contact_Meet_PlaceTime
96 Affiliate
70 Contact_Meet_Entity
27 AssociateOf
19 AwardsWon
10 Titles
6 TopMembers
4 FoundedBy
2 DateOfDeath
2 EmployeeOf
1 SignificantOther
1 CauseOfDeath

num entities per slot with eval data:
50 Contact_Meet_PlaceTime
41 Affiliate
12 AwardsWon
10 AssociateOf
10 Contact_Meet_Entity
8 Titles
3 TopMembers
2 DateOfDeath
2 FoundedBy
1 SignificantOther
1 EmployeeOf
1 CauseOfDeath

num begin-during slot fills per entity:
67 http://en.wikipedia.org/wiki/Jamie_Parsley
34 http://en.wikipedia.org/wiki/Hjemkomst_Center
24 https://twitter.com/RonFunches
18 http://en.wikipedia.org/wiki/Ed_Bok_Lee
18 http://en.wikipedia.org/wiki/Richard_Edlund
18 http://en.wikipedia.org/wiki/Marion_Technical_Institute
14 https://twitter.com/FrankandOak
13 http://en.wikipedia.org/wiki/DeAnne_Smith
13 http://en.wikipedia.org/wiki/Great_American_Brass_Band_Festival
12 http://en.wikipedia.org/wiki/Drew_Wrigley
11 http://en.wikipedia.org/wiki/Jennifer_Baumgardner
10 http://en.wikipedia.org/wiki/Jeremy_McKinnon
9 http://en.wikipedia.org/wiki/Paul_Marquart
8 http://en.wikipedia.org/wiki/Clark_Blaise
8 https://twitter.com/RobCaud
7 http://en.wikipedia.org/wiki/Appleton_Museum_of_Art
7 http://en.wikipedia.org/wiki/James_McCartney
6 http://en.wikipedia.org/wiki/Geoffrey_E._Hinton
6 https://twitter.com/GandBcoffee
6 http://en.wikipedia.org/wiki/Charles_Bronfman
6 http://en.wikipedia.org/wiki/Ken_Freedman
6 http://en.wikipedia.org/wiki/Joshua_Boschee
5 http://en.wikipedia.org/wiki/Th%C3%A9o_Mercier
5 http://en.wikipedia.org/wiki/Matt_Witten
5 http://en.wikipedia.org/wiki/Shafi_Goldwasser
5 https://twitter.com/AlexJoHamilton
5 http://en.wikipedia.org/wiki/Bob_Bert
4 http://en.wikipedia.org/wiki/Barbara_Liskov
4 http://en.wikipedia.org/wiki/Haven_Denney
4 http://en.wikipedia.org/wiki/Travis_Mays
4 http://en.wikipedia.org/wiki/Joey_Mantia
4 http://en.wikipedia.org/wiki/Lewis_and_Clark_Landing
4 http://en.wikipedia.org/wiki/Gran%C3%A3_y_Montero
4 http://en.wikipedia.org/wiki/Scotiabank_Per%C3%BA
4 http://en.wikipedia.org/wiki/Paul_Johnsgard
3 http://en.wikipedia.org/wiki/Sara_Bronfman
3 http://en.wikipedia.org/wiki/Lake_Weir_High_School
3 http://en.wikipedia.org/wiki/Stevens_Cooperative_School
3 https://twitter.com/BlossomCoffee
3 http://en.wikipedia.org/wiki/Luz_del_Sur
3 https://twitter.com/MissMarcel
3 https://twitter.com/tonyg203
3 http://en.wikipedia.org/wiki/Buddy_MacKay
3 http://en.wikipedia.org/wiki/Don_Garlits_Museum_of_Drag_Racing
2 https://twitter.com/KentGuinn4Mayor
2 http://en.wikipedia.org/wiki/Jeff_Severson
2 http://en.wikipedia.org/wiki/Carey_McWilliams_(marksman)
2 http://en.wikipedia.org/wiki/Gretchen_Hoffman
2 http://en.wikipedia.org/wiki/William_H._Miller_(writer)
2 http://en.wikipedia.org/wiki/Ruben_J._Ramos
2 https://twitter.com/BobStovall
2 http://en.wikipedia.org/wiki/L%C3%A9on_Bottou
2 http://en.wikipedia.org/wiki/Jack_Lazorko
2 http://en.wikipedia.org/wiki/Clare_Bronfman
2 http://en.wikipedia.org/wiki/Edgar_Bronfman,_Jr.
2 http://en.wikipedia.org/wiki/Benjamin_Bronfman
2 https://twitter.com/CorbinSpeedway
2 http://en.wikipedia.org/wiki/SIMSA
2 http://en.wikipedia.org/wiki/Mark_SaFranko
2 http://en.wikipedia.org/wiki/Jasper_Schneider
2 http://en.wikipedia.org/wiki/Phyllis_Lambert
2 http://en.wikipedia.org/wiki/Cementos_Lima
2 http://en.wikipedia.org/wiki/Susan_Krieg
2 http://en.wikipedia.org/wiki/Edgar_Bronfman,_Sr.
2 http://en.wikipedia.org/wiki/Fernando_J._Corbat%C3%B3
1 http://en.wikipedia.org/wiki/Elysian_Charter_School
1 http://en.wikipedia.org/wiki/Joshua_Zetumer
1 http://en.wikipedia.org/wiki/Reid_Nichols
1 http://en.wikipedia.org/wiki/Intergroup_Financial_Services
1 http://en.wikipedia.org/wiki/Maurice_Fitzgibbons
1 http://en.wikipedia.org/wiki/Judd_Davis
1 http://en.wikipedia.org/wiki/Frank_Winters
1 http://en.wikipedia.org/wiki/Bernard_Kenny
1 http://en.wikipedia.org/wiki/Gwena%C3%ABlle_Aubry
1 http://en.wikipedia.org/wiki/Chuck_Pankow
1 http://en.wikipedia.org/wiki/Zoubin_Ghahramani
1 http://en.wikipedia.org/wiki/Blair_Thoreson
1 http://en.wikipedia.org/wiki/Pat_Dapuzzo
1 http://en.wikipedia.org/wiki/Dunkelvolk
1 http://en.wikipedia.org/wiki/Juris_Hartmanis
1 http://en.wikipedia.org/wiki/Tilo_Rivas
1 http://en.wikipedia.org/wiki/Yann_LeCun
1 http://en.wikipedia.org/wiki/Keri_Hehn
1 http://en.wikipedia.org/wiki/Ana%C3%AFs_Croze
1 http://en.wikipedia.org/wiki/Copper_Basin_Railway
1 http://en.wikipedia.org/wiki/Angelo_Savoldi
1 http://en.wikipedia.org/wiki/Weehawken_Cove
1 http://en.wikipedia.org/wiki/Jim_Poolman
