parsers.nlm: handle maths in abstract
Signed-off-by: Szymon Łopaciuk <[email protected]>
szymonlopaciuk committed Jan 17, 2018
1 parent d745f78 commit 3106759
Showing 4 changed files with 22 additions and 18 deletions.
16 changes: 14 additions & 2 deletions hepcrawl/parsers/nlm.py
@@ -17,7 +17,7 @@

 from inspire_schemas.api import LiteratureBuilder
 from inspire_utils.date import PartialDate
-from inspire_utils.helpers import maybe_int
+from inspire_utils.helpers import remove_tags
 from inspire_utils.name import ParsedName

 from ..utils import get_node
@@ -99,7 +99,19 @@ def bulk_parse(cls, nlm_records, source=None):

     @property
     def abstract(self):
-        return self.root.xpath('normalize-space(./Abstract)').extract_first()
+        abstract_node = self.root.xpath('./Abstract')
+
+        if not abstract_node:
+            return None
+
+        abstract = self.normalize_space(
+            remove_tags(
+                abstract_node[0],
+                allowed_tags=['sup', 'sub'],
+                allowed_trees=['math'],
+            )
+        )
+        return abstract

     @property
     def title(self):
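
The change above swaps a plain `normalize-space()` XPath call for tag stripping that keeps `<sup>`/`<sub>` and whole `<math>` trees. As a rough illustration of that idea only, here is a minimal standard-library sketch. It is not the `inspire_utils` implementation: `strip_tags_sketch`, `_serialize` and `normalize_space` are hypothetical names, and the real `remove_tags` may treat attributes, escaping and whitespace differently.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def _serialize(node):
    """Serialize `node` and its whole subtree, without its trailing tail text."""
    attrs = ''.join(
        ' {}={}'.format(key, quoteattr(value)) for key, value in node.attrib.items()
    )
    inner = escape(node.text or '') + ''.join(
        _serialize(child) + escape(child.tail or '') for child in node
    )
    if not inner:
        return '<{}{}/>'.format(node.tag, attrs)
    return '<{0}{1}>{2}</{0}>'.format(node.tag, attrs, inner)


def strip_tags_sketch(node, allowed_tags=(), allowed_trees=()):
    """Hypothetical stand-in for tag stripping; assumptions are noted inline."""
    if node.tag in allowed_trees:
        # Allowed trees are kept verbatim, including all descendants.
        return _serialize(node)
    inner = escape(node.text or '') + ''.join(
        strip_tags_sketch(child, allowed_tags, allowed_trees)
        + escape(child.tail or '')
        for child in node
    )
    if node.tag in allowed_tags:
        # Assumption: allowed tags keep their name but lose their attributes.
        return '<{0}>{1}</{0}>'.format(node.tag, inner)
    # Any other tag is dropped; only its content survives.
    return inner


def normalize_space(text):
    """Collapse runs of whitespace, similar to XPath normalize-space()."""
    return ' '.join(text.split())


xml = (
    '<Abstract>\n'
    '  <AbstractText Label="OBJECTIVES">Maths, <math><mi>x</mi><int/></math> and '
    '<sup>superscript</sup>, <b>bold</b>.</AbstractText>\n'
    '  <AbstractText Label="METHODS">Second   part.</AbstractText>\n'
    '</Abstract>'
)
result = normalize_space(
    strip_tags_sketch(
        ET.fromstring(xml),
        allowed_tags=['sup', 'sub'],
        allowed_trees=['math'],
    )
)
print(result)
```

In this sketch the `<AbstractText>` and `<b>` wrappers disappear, `<sup>` stays, and the `<math>` subtree is emitted unchanged, which is the same shape as the new `abstract` value in `expected.yaml`.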
16 changes: 7 additions & 9 deletions tests/unit/responses/iop/expected.yaml
@@ -1,12 +1,10 @@
-abstract: Somatic BRAF mutation in colon cancer essentially excludes Lynch
-syndrome. We compared BRAF V600E immunohistochemistry (IHC) with BRAF
-mutation in core, biopsy, and whole-section slides to determine whether
-IHC is similar and to assess the cost-benefit of IHC. Resection cases
-(2009-2013) with absent MLH1 and PMS2 and prior BRAF mutation polymerase
-chain reaction results were chosen (n = 57). To mimic biopsy specimens,
-tissue microarrays (TMAs) were constructed. In addition, available biopsies
-performed prior to the resection were available in 15 cases. BRAF V600E IHC
-was performed and graded on TMAs, available biopsy specimens, and
+abstract: This is a sample text containing maths,
+<math><int/><mfrac><mi>f(x)</mi><mi>1+x</mi></mfrac><mi>dx</mi></math> and <sup>superscript text</sup>.
+Resection cases (2009-2013) with absent MLH1 and PMS2 and prior BRAF mutation
+polymerase chain reaction results were chosen (n = 57). To mimic biopsy
+specimens, tissue microarrays (TMAs) were constructed. In addition, available
+biopsies performed prior to the resection were available in 15 cases. BRAF
+V600E IHC was performed and graded on TMAs, available biopsy specimens, and
 whole-section slides. Mutation status was compared with IHC, and
 cost-benefit analysis was performed. BRAF V600E IHC was similar in TMAs,
 biopsy specimens, and whole-section slides, with only four (7%) showing
2 changes: 1 addition & 1 deletion tests/unit/responses/iop/xml/test_standard.xml
@@ -112,7 +112,7 @@
 </PubDate>
 </History>
 <Abstract>
-<AbstractText Label="OBJECTIVES">Somatic BRAF mutation in colon cancer essentially excludes Lynch syndrome. We compared BRAF V600E immunohistochemistry (IHC) with BRAF mutation in core, biopsy, and whole-section slides to determine whether IHC is similar and to assess the cost-benefit of IHC.</AbstractText>
+<AbstractText Label="OBJECTIVES">This is a sample text containing maths, <math><int/><mfrac><mi>f(x)</mi><mi>1+x</mi></mfrac><mi>dx</mi></math> and <sup>superscript text</sup>.</AbstractText>
 <AbstractText Label="METHODS">Resection cases (2009-2013) with absent MLH1 and PMS2 and prior BRAF mutation polymerase chain reaction results were chosen (n = 57). To mimic biopsy specimens, tissue microarrays (TMAs) were constructed. In addition, available biopsies performed prior to the resection were available in 15 cases. BRAF V600E IHC was performed and graded on TMAs, available biopsy specimens, and whole-section slides. Mutation status was compared with IHC, and cost-benefit analysis was performed.</AbstractText>
 <AbstractText Label="RESULTS">BRAF V600E IHC was similar in TMAs, biopsy specimens, and whole-section slides, with only four (7%) showing discordance between IHC and mutation status. Using BRAF V600E IHC in our Lynch syndrome screening algorithm, we found a 10% cost savings compared with mutational analysis.</AbstractText>
 <AbstractText Label="CONCLUSIONS">BRAF V600E IHC was concordant between TMAs, biopsy specimens, and whole-section slides, suggesting biopsy specimens are as useful as whole sections. IHC remained cost beneficial compared with mutational analysis, even though more patients needed additional molecular testing to exclude Lynch
6 changes: 0 additions & 6 deletions tests/unit/test_iop.py
@@ -46,12 +46,6 @@ def record():
     return parsed_item.record


-def test_abstract(record):
-    """Test extracting abstract."""
-    assert "abstract" in record
-    assert record["abstract"].startswith("Somatic BRAF mutation")
-
-
 def test_title(record):
     """Test extracting title."""
     title = 'A Modified Lynch Syndrome Screening Algorithm in Colon Cancer: BRAF Immunohistochemistry Is Efficacious and Cost Beneficial.'