
This directory contains a simple example system illustrating the
input/output of the task described here:
http://trec-kba.org/kba-ccr-2012.shtml

To run it, you must first install the 'thrift' Python module.

toy_kba_system.py is the primary example; it uses toy_kba_algorithm.py.

You can run it like this:

    python toy_kba_system.py --max 10000000 --cutoff 100 filter-topics.sample-trec-kba-targets-2012.json tiny-kba-stream-corpus filter-run.toy_1.txt

You can get sample topics here:

    http://trec-kba.org/schemas/v1.0/filter-topics.sample-trec-kba-targets-2012.json


Note: the TREC KBA 2012 corpus was originally in JSON.  We have since
converted it to Thrift, a serialization format.  In addition to being
>10x faster to deserialize than JSON, Thrift supports binary strings,
which lets us drop the Python string-escape trick for putting byte
arrays into JSON strings.
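
To see why byte arrays are awkward in JSON, here is a minimal sketch in
Python 3.  It is only an illustration: it maps bytes to text through
latin-1, which is an assumed stand-in for the string-escape codec and
not necessarily how the original JSON corpus was encoded.

```python
import json

# Arbitrary bytes, including NUL, a high byte, a quote, a backslash
# and a newline.
raw = bytes([0, 255, 34, 92, 10])

# json.dumps cannot serialize a bytes value directly.
try:
    json.dumps({"body": raw})
    direct_ok = True
except TypeError:
    direct_ok = False

# Workaround (an assumption for illustration): map each byte to the
# code point with the same value, let json escape the resulting text,
# and reverse the mapping after loading.
encoded = json.dumps({"body": raw.decode("latin-1")})
decoded = json.loads(encoded)["body"].encode("latin-1")
```

With a binary string field in Thrift, the bytes are stored as they are
and no such round trip is needed.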

You can read more about Thrift here:

    http://thrift.apache.org

The crucial file for using Thrift with the TREC KBA 2012 corpus is:

    http://trec-kba.org/schemas/v1.0/kba.thrift

It is a human-readable file.  Using it, you can generate a client
library for loading the KBA corpus in most programming languages.  For
example, to get Python or Java clients, you would run:

    thrift -r --gen py   kba.thrift
    thrift -r --gen java kba.thrift
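
Once the Python client has been generated, reading a corpus file could
look roughly like the sketch below.  It rests on several assumptions:
that the file holds structs serialized back-to-back with Thrift's
binary protocol, that it has already been opened as an uncompressed
binary file object, and that you pass in a struct class from the
generated code.  The function name read_structs is made up for this
example.

```python
def read_structs(fileobj, struct_cls):
    """Yield struct_cls instances deserialized back-to-back from fileobj.

    fileobj    -- a file object opened in binary mode
    struct_cls -- a class from the code generated by 'thrift --gen py'
    """
    # Imported lazily so this sketch can be loaded even where the
    # 'thrift' module is not installed yet.
    from thrift.protocol import TBinaryProtocol
    from thrift.transport import TTransport

    transport = TTransport.TFileObjectTransport(fileobj)
    # Assumption: the data was written with the binary protocol.
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    while True:
        item = struct_cls()
        try:
            item.read(protocol)
        except EOFError:
            # The transport raises EOFError once the input is exhausted.
            return
        yield item
```

To use it, look in the generated gen-py directory for the actual module
and class names, import the struct class you need, and iterate over
read_structs(open(path, 'rb'), that_class).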


Last modified June 6, 2012 by jrf
