How can I extract references and citing sentences from a PDF in Python?
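Please refer to the code below. You will need to install the "PyPDF2" package with the command that follows. The code uses the older PyPDF2 1.x interface (PdfFileReader, extractText and the PyPDF2.pdf module), so install a 1.x release if a newer one reports these names as missing or deprecated.
pip install PyPDF2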
The code for reading the references is as follows. Note that PDF files are structurally complex, so the extracted text depends on how each file was produced. I tested this code on the file https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf.
import PyPDF2
from PyPDF2.pdf import *  # to import the helpers used in the original `extractText`

# --- functions ---

def myExtractText(self):
    # copy of the original `extractText`, extended to break the line on `Tm`
    text = u_("")
    content = self["/Contents"].getObject()
    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    for operands, operator in content.operations:
        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"
        # new code to add `\n` when text moves to a new line
        elif operator == b_("Tm"):
            text += '\n'
    return text

# --- main ---

text = ''
# `with` closes the file even if extraction fails
with open('16HLT-hierarchical-attention-networks.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    for page in pdfReader.pages:
        text += page.extractText()  # original function
        #text += myExtractText(page)  # modified function

# keep only the text after the word `References`
pos = text.lower().find('references')
if pos != -1:  # leave the text unchanged if the word is not found
    text = text[pos + len('references'):].lstrip()

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')
Change the PDF file path to match the location of your file.
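The code above only prints the reference list, so here is a rough sketch of how the citing sentences might be pulled out of the same extracted text. It rests on several assumptions: the function names split_body_and_references and citing_sentences are hypothetical, the last occurrence of the word "references" is taken to be the section heading, citations are assumed to be in author-year style such as (Kim, 2014) or Kim (2014), and sentences are split with a naive regular expression. Numeric citations like [12] would need a different pattern.

```python
import re

# Assumption: a parenthesised group that ends in a 4-digit year is a citation,
# e.g. "(Kim, 2014)", "(Pang and Lee, 2008; Maas et al., 2011)" or "(2014)".
CITATION = re.compile(r'\([^()]*\b(?:19|20)\d{2}[a-z]?\)')

# Naive sentence boundary: end punctuation, whitespace, then a capital letter.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def split_body_and_references(text):
    """Split extracted text at the last occurrence of the word 'references'.

    Returns (body, references). If the word is not found, the whole text is
    returned as the body and the references string is empty.
    """
    matches = list(re.finditer(r'references', text, flags=re.IGNORECASE))
    if not matches:
        return text, ''
    last = matches[-1]
    return text[:last.start()], text[last.end():].strip()


def citing_sentences(body):
    """Return the sentences of `body` that contain an author-year citation."""
    # PDF extraction breaks lines mid-sentence, so collapse all whitespace first.
    flat = ' '.join(body.split())
    sentences = SENTENCE_END.split(flat)
    return [s for s in sentences if CITATION.search(s)]
```

With the text variable built in the main part of the script (before it is cut down to the reference list), this could be used as body, refs = split_body_and_references(text) followed by citing_sentences(body). Because extracted PDF text is often missing spaces or line breaks, expect to tune both regular expressions for each file.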