Question

How can I extract the references and the citing sentences from a PDF in Python?

Solutions

Expert Solution

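Please refer to the code given below. It requires the "PyPDF2" package. The code uses the legacy PyPDF2 1.x API (PdfFileReader, extractText and the PyPDF2.pdf module), which later releases renamed or removed, so install a 1.x release, for example:

pip install PyPDF2==1.26.0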

The code for reading the references is given below. Note that PDF files are structurally complex, so the extracted text depends on how a particular file was produced. I tested this code on the file https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf.

import PyPDF2
from PyPDF2.pdf import *  # import the helpers used in the original `extractText`

# --- functions ---

def myExtractText(self):  
    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    
    for operands, operator in content.operations:
        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        # new code to add `\n` when text moves to new line
        elif operator == b_("Tm"):
            text += '\n'
            
    return text
    
# --- main ---

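text = ''

# open the file in a `with` block so that it is closed automatically
with open('16HLT-hierarchical-attention-networks.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    for page in pdfReader.pages:
        text += page.extractText()  # original function
        #text += myExtractText(page)  # modified function

# keep only the text after the last occurrence of the word `References`
# (the word may also appear earlier, in the body of the paper)
pos = text.lower().rfind('references')
if pos == -1:
    # without this check, `find`/`rfind` returning -1 would silently cut the text
    print('No "References" heading found')
    text = ''
else:
    text = text[pos + len('references'):].lstrip()

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')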
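
PyPDF2 has since been continued under the name pypdf, where PdfFileReader became PdfReader and extractText became extract_text. The following is a minimal sketch of the same steps with the newer names, assuming pypdf is installed (pip install pypdf). The helper names text_after_references and read_pdf_text are illustrative, not part of any library, and taking the last occurrence of the word "references" as the heading is a heuristic that may fail on some papers.

```python
def text_after_references(text):
    """Return the text after the last 'references' occurrence, or '' if absent."""
    # Assumption: the last occurrence of the word is the section heading.
    pos = text.lower().rfind("references")
    if pos == -1:
        return ""
    return text[pos + len("references"):].strip()


def read_pdf_text(path):
    """Concatenate the extracted text of every page of the PDF at `path`."""
    from pypdf import PdfReader  # third-party package: pip install pypdf

    reader = PdfReader(path)
    # Fall back to '' for pages where no text could be extracted.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Calling text_after_references(read_pdf_text('16HLT-hierarchical-attention-networks.pdf')) should then give roughly the same reference text as the code above, although the exact line breaks depend on the PDF and on the library version.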

In the main section, change the file path passed to open() to the location of your own PDF file.
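The code above only returns the reference list. For the citing sentences, one possible approach is to take the text before the "References" heading, split it into sentences, and keep the sentences that contain a citation marker. The sketch below assumes two common marker styles, numeric such as [3] or [3, 7] and author-year such as (Kim, 2014), and uses a naive sentence splitter. The function names and regular expressions are illustrative assumptions and will need tuning for a real paper.

```python
import re

# Numeric citations such as [3], [3, 7] or [3-5]
NUMERIC_CITE = re.compile(r"\[\d+(?:\s*[-,]\s*\d+)*\]")
# Author-year citations such as (Kim, 2014) or (Bahdanau et al., 2014)
AUTHOR_YEAR_CITE = re.compile(r"\([^()]*\b(?:19|20)\d{2}[a-z]?[^()]*\)")


def split_body_and_references(text):
    """Split extracted text at the last 'references' occurrence (heuristic)."""
    pos = text.lower().rfind("references")
    if pos == -1:
        return text, ""
    return text[:pos], text[pos + len("references"):]


def find_citing_sentences(body):
    """Return the sentences of `body` that contain a citation marker."""
    # Rejoin words hyphenated across line breaks, then flatten whitespace.
    flat = re.sub(r"-\n(?=[a-z])", "", body)
    flat = re.sub(r"\s+", " ", flat)
    # Naive split: ., ! or ? followed by whitespace and a capital letter.
    # Abbreviations and initials ("e.g. The", "J. Smith") will be split wrongly.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", flat)
    return [
        s.strip()
        for s in sentences
        if NUMERIC_CITE.search(s) or AUTHOR_YEAR_CITE.search(s)
    ]
```

Here `text` would be the full extracted text, before it is cut at the word `References`: pass it to split_body_and_references and then pass the first returned value to find_citing_sentences. The author-year pattern also matches any parenthesis that contains a four-digit year, so some false positives are to be expected.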

