Question

In: Computer Science

How can I extract references and citing sentences from a PDF in Python?


Solutions

Expert Solution

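Please refer to the code below. It requires the "PyPDF2" package, which can be installed with the following command:

pip install PyPDF2

Note that the code uses the older PyPDF2 1.x interface ("PyPDF2.pdf", "PdfFileReader", "extractText"). If it fails on a recent release, installing an older version, for example with "pip install PyPDF2==1.26.0", may be necessary.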

The code for reading the references is given below. Note that PDF files are complex, and the extracted text depends heavily on how each file is structured. I tested this code on the file https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf.

import PyPDF2
from PyPDF2.pdf import *  # to import the helpers used in the original `extractText` (u_, b_, ContentStream, TextStringObject)

# --- functions ---

def myExtractText(self):  
    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    
    for operands, operator in content.operations:
        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        # new code to add `\n` when text moves to new line
        elif operator == b_("Tm"):
            text += '\n'
            
    return text
    
# --- main ---

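text = ''

# the file must stay open while the pages are read, so the loop is inside `with`
with open('16HLT-hierarchical-attention-networks.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    for page in pdfReader.pages:
        text += page.extractText()  # original function
        #text += myExtractText(page)  # modified function

# get only the text after the last occurrence of the word `References`
# (an earlier occurrence may be an ordinary word in the body of the paper)
pos = text.lower().rfind('references')
if pos != -1:
    text = text[pos+len('references'):]
else:
    print('Word `References` not found, keeping the full text')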
    
# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

Change the PDF file path in the code to match the location of your file.
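
The code above only recovers the reference list. For the citing sentences, one possible approach is to take the text before the "References" heading, undo the line wrapping, split it into sentences, and keep the sentences that contain a citation marker. The following is only a rough sketch and was not tested on the PDF mentioned above. The names CITATION_RE, split_body_and_references and citing_sentences are hypothetical helpers, the sentence splitting is deliberately naive, and the patterns assume citations look like "(Kim, 2014)", "Kim (2014)" or "[3]".

```python
import re

# Assumed citation styles (adjust for your documents):
#   author-year in parentheses, e.g. "(Kim, 2014)", "(2014)", "(A et al., 2015; B, 2016)"
#   bracketed numbers, e.g. "[3]", "[1, 2]", "[4-6]"
CITATION_RE = re.compile(
    r"\([^()]*\b(?:19|20)\d{2}[a-z]?\)"
    r"|\[\d+(?:\s*[,-]\s*\d+)*\]"
)

def split_body_and_references(text):
    """Split the text at the last occurrence of 'references' (case-insensitive).

    Returns (body, references). If the word is not found, the whole text
    is returned as the body and the references part is empty.
    """
    pos = text.lower().rfind("references")
    if pos == -1:
        return text, ""
    return text[:pos], text[pos + len("references"):]

def citing_sentences(body):
    """Return the sentences of `body` that contain at least one citation marker."""
    # Collapse newlines and repeated spaces so wrapped lines form whole sentences.
    flat = re.sub(r"\s+", " ", body)
    # Naive split: after '.', '!' or '?' when followed by whitespace and a capital letter.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", flat)
    return [s.strip() for s in sentences if CITATION_RE.search(s)]

if __name__ == "__main__":
    # Made-up sample text standing in for the output of the PDF extraction step.
    sample = (
        "Text classification is a fundamental task.\n"
        "Many approaches use CNNs (Kim, 2014) for\n"
        "this task. Attention was introduced earlier [3]. We propose a new model.\n"
        "References\n"
        "Yoon Kim. 2014. Convolutional neural networks."
    )
    body, references = split_body_and_references(sample)
    for sentence in citing_sentences(body):
        print(sentence)
```

To use it with the code above, pass the full extracted text (before it is cut at "References") to split_body_and_references and then call citing_sentences on the returned body. Expect to tune both patterns, because extracted PDF text often contains broken words, missing spaces, or abbreviations that confuse the sentence split.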

