In this assignment, you will implement a supervised machine learning approach to a named entity recognition problem. The particular NER task we’ll be tackling is to find all the references to genes in a set of biomedical journal article abstracts.
The training material consists of around 13,000 sentences with gene references tagged with IOB tags. Since we’re only dealing with one kind of entity here, there are only three tag types in the data (I, O, or B). The training data is formatted as one token per line with the correct tag (tab separated); sentences are separated by blank lines. An example is shown below. In this example, there is one gene mention, “human APEX gene”, with the corresponding tag sequence B I I.
1	Structure	O
2	,	O
3	promoter	O
4	analysis	O
5	and	O
6	chromosomal	O
7	assignment	O
8	of	O
9	the	O
10	human	B
11	APEX	I
12	gene	I
13	.	O
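The one-token-per-line format described above can be read with a short parser. The following is a minimal sketch, assuming each non-blank line has exactly three tab-separated fields (index, token, tag); the function name `read_tagged_sentences` is hypothetical and not part of any provided starter code.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_tagged_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group 'index<TAB>token<TAB>tag' lines into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: three tab-separated columns (index, token, tag).
        _, token, tag = line.split("\t")
        current.append((token, tag))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


sample = "1\tStructure\tO\n2\t,\tO\n\n1\thuman\tB\n2\tAPEX\tI\n3\tgene\tI\n"
parsed = read_tagged_sentences(sample.splitlines())
```

With a real file, the same function could be called as `read_tagged_sentences(open(path, encoding="utf-8"))`, where `path` points at the training data.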
In this assignment, you are free to use any relevant machine learning approach or framework you like. The following are three possible approaches:
1. Use an HMM-based approach to tagging as we discussed for POS tagging using either a bigram or trigram approach. To do this, you’ll need to collect the relevant counts, possibly implement some smoothing, and decode using Viterbi.
2. Use a feature-based approach (as in HW 2) with a sliding window over the input sequences. Features for NER problems usually include the words themselves and their shape (upper, lower, mixed case, camel case, includes digits, etc.). The particular ML algorithm used is less important than the features in this approach.
3. Use a pre-trained language model like BERT to fine-tune a sequence labeler.
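For the feature-based option, the word-shape and sliding-window ideas might look like the sketch below. The feature names, the `<PAD>` marker, and the window size are illustrative assumptions, not requirements of the assignment.

```python
import re


def word_shape(token: str) -> str:
    """Map characters to X (upper), x (lower), d (digit) and collapse repeats."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)  # "XXXX" -> "X"


def window_features(tokens, i, window=1):
    """Features for tokens[i] drawn from a window of +/- `window` tokens."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset}]"] = tokens[j].lower()
            feats[f"shape[{offset}]"] = word_shape(tokens[j])
        else:
            # Position falls outside the sentence.
            feats[f"word[{offset}]"] = "<PAD>"
    feats["has_digit"] = any(c.isdigit() for c in tokens[i])
    return feats


example = window_features(["the", "human", "APEX", "gene"], 2)
```

Feature dictionaries of this kind can be turned into vectors with, for example, sklearn's `DictVectorizer` before being passed to a classifier.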
The first approach can definitely be coded from scratch without the need for additional frameworks. For the second approach you might build on your HW 2 solution, or you could make use of relevant libraries in sklearn. For either of these approaches you’ll have to deal with the issue of unknown words.
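As a rough illustration of the HMM route, here is a sketch of a bigram tagger with Viterbi decoding. It makes two choices the assignment leaves open: add-one smoothing, and mapping training words seen fewer than `min_count` times to an `<UNK>` symbol so that unknown test words have an emission estimate. Both are assumptions for the sake of the example; other smoothing and unknown-word strategies may work better.

```python
import math
from collections import Counter, defaultdict

TAGS = ["I", "O", "B"]
START, UNK = "<s>", "<UNK>"


def train_hmm(sentences, min_count=2):
    """Collect bigram transition and emission counts from (token, tag) sentences."""
    word_counts = Counter(tok for sent in sentences for tok, _ in sent)
    vocab = {w for w, c in word_counts.items() if c >= min_count} | {UNK}
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        prev = START
        for tok, tag in sent:
            trans[prev][tag] += 1
            emit[tag][tok if tok in vocab else UNK] += 1
            prev = tag
    return vocab, trans, emit


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(tokens, vocab, trans, emit):
    """Return the most probable tag sequence for `tokens`."""
    if not tokens:
        return []
    words = [t if t in vocab else UNK for t in tokens]
    n_words, n_tags = len(vocab), len(TAGS)
    # best[i][tag] = (log score of best path ending in tag at i, previous tag)
    best = [{t: (log_prob(trans[START], t, n_tags)
                 + log_prob(emit[t], words[0], n_words), None) for t in TAGS}]
    for i in range(1, len(words)):
        column = {}
        for tag in TAGS:
            e = log_prob(emit[tag], words[i], n_words)
            column[tag] = max(
                (best[i - 1][p][0] + log_prob(trans[p], tag, n_tags) + e, p)
                for p in TAGS
            )
        best.append(column)
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


toy = [[("the", "O"), ("human", "B"), ("APEX", "I"), ("gene", "I"), (".", "O")]] * 3
model = train_hmm(toy, min_count=1)
decoded = viterbi(["the", "human", "APEX", "gene", "."], *model)
```

Working in log space avoids underflow on long sentences. A trigram version would track pairs of previous tags in the same table.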
The third approach will require you to make use of pretrained language models and frameworks. A good place to start would be Hugging Face, which has a demo/tutorial for NER tagging that you’re free to use. You’ll probably need access to additional compute beyond your laptop for this. For this approach, unknown words are dealt with via sub-word modeling, but that introduces its own complications.
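One complication of sub-word modeling is presumably that the tags are given per word while the model sees sub-word pieces. The sketch below shows one common way to handle this: label only the first piece of each word and mask the rest. It assumes `word_ids` is a list like the one returned by the `word_ids()` method of Hugging Face fast tokenizers (one word index per piece, `None` for special tokens), and it uses -100, the default `ignore_index` of PyTorch's cross-entropy loss. The function name and label mapping are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

LABEL_TO_ID = {"O": 0, "B": 1, "I": 2}  # illustrative mapping


def align_labels(word_ids, word_labels, label_to_id=LABEL_TO_ID):
    """Expand per-word labels to per-piece labels, masking non-initial pieces."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special token such as [CLS] or [SEP].
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First piece of a word carries the word's label.
            aligned.append(label_to_id[word_labels[wid]])
        else:
            # Continuation piece of the same word.
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned


# "human APEX gene" where "APEX" is split into two pieces.
piece_labels = align_labels([None, 0, 1, 1, 2, None], ["B", "I", "I"])
```

At prediction time the same mapping is used in reverse: the tag predicted for the first piece of each word becomes that word's output tag.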
As noted in Ch 8, evaluation of these kinds of systems is not based on per-tag accuracy (you can do pretty well on that basis by just saying O all the time). What we really want to optimize is recall, precision, and F-measure at the gene level. Remember that precision is the ratio of genes correctly identified by your system to the total number of genes your system found, and recall is the ratio of correctly identified genes to the total that you should have found. The F1 measure given in the book is just the harmonic mean of these two.
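To make the gene-level definitions concrete, here is a sketch of entity-level scoring. It is not the provided eval script and may differ from it in edge cases; in particular, treating an I that follows an O as the start of a new mention is an assumption.

```python
def extract_spans(tags):
    """Return the set of (start, end) gene spans in one tag sequence (end exclusive)."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.add((start, i))  # a B directly after a mention closes it
            start = i
        elif tag == "O" and start is not None:
            spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def precision_recall_f1(gold_sents, pred_sents):
    """Exact-match span scores over parallel lists of tag sequences."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Four of five tags are right, but the predicted span is too short,
# so no gene counts as correctly identified.
partial = precision_recall_f1([["O", "B", "I", "I", "O"]],
                              [["O", "B", "I", "O", "O"]])
```

The example shows why per-tag accuracy is misleading here: a prediction can agree with the gold tags almost everywhere and still get no credit at the gene level.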
The output of your system should mirror the training set format exactly. Given a set of test sentences, emit tagged sentences with three fields per line (line number <tab> token <tab> IOB tag), with a blank line separating sentences. The test sentences will be missing the tag column; your system should fill it in.
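A sketch of producing that output is shown below. It assumes the line number restarts at 1 for each sentence, as the training example suggests; if the test file numbers lines differently, echo its first column instead. The function name is hypothetical.

```python
def format_tagged_sentences(sentences):
    """Render (token, tag) sentences as 'index<TAB>token<TAB>tag' lines."""
    lines = []
    for sent in sentences:
        # Assumption: indices restart at 1 within each sentence.
        for i, (token, tag) in enumerate(sent, start=1):
            lines.append(f"{i}\t{token}\t{tag}")
        lines.append("")  # blank line after every sentence
    return "".join(line + "\n" for line in lines)


output_text = format_tagged_sentences(
    [[("human", "B"), ("APEX", "I")], [(".", "O")]]
)
```

It is worth running your output through the evaluation script on a dev set before submitting, since a formatting mismatch would affect the score regardless of tagging quality.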
I’ll use F1 as computed by the eval script to evaluate your systems on the withheld test data. You should create training and dev sets from the data I’m providing for use in developing your system. I’m providing an evaluation script you can use along with the training data.
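Creating the training and dev sets can be as simple as a seeded shuffle over sentences. The 10% dev fraction and the function name below are arbitrary illustrative choices.

```python
import random


def train_dev_split(sentences, dev_fraction=0.1, seed=0):
    """Shuffle sentences reproducibly and hold out a fraction as a dev set."""
    shuffled = list(sentences)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


train_set, dev_set = train_dev_split(list(range(100)))
```

Splitting at the sentence level keeps each sentence intact in one set, and fixing the seed makes results comparable across runs.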
Submit a report describing what you did, your Python code, and your test results. I will post a test set before the due date. Run the test data through your system and submit the output. The output format of your system should be the same as the input training data: one token per line, tab separated, with the tag in the final column and a blank line separating sentences.