
Extract Academic Publication Information From IDEAS

I want to extract the list of publications from a specific IDEAS page. I want to retrieve the name of each paper, its authors, and its year. However, I am a bit stuck in doing

Solution 1:

You can get the desired information like this:

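from requests import get
import pprint
from bs4 import BeautifulSoup

url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
response.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(response.text, 'html.parser')
container = soup.select_one("#content")

title_list = []
author_list = []
# Each h3 inside #content is parsed as a publication year.
year_list = [int(h.text) for h in container.find_all('h3')]
# Each div.panel-body contributes one list of titles and one list of authors.
for panel in container.select("div.panel-body"):
    # Paper titles are the texts of the links in the panel.
    title_list.append([x.text for x in panel.find_all('a')])
    # The author string is the text node right after each <i> element.
    author_list.append([x.next_sibling.strip() for x in panel.find_all('i')])
# Combine into (year, titles, authors) tuples.
result = list(zip(year_list, title_list, author_list))

pp = pprint.PrettyPrinter(indent=4, width=250)
pp.pprint(result)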

outputs:

[   (   2020,
        ['The Role Of Public Procurement As Innovation Lever: Evidence From Italian Manufacturing Firms', 'A voyage in the role of territory: are territories capable of instilling their peculiarities in local production systems'],
        ['Francesco Crespi & Serenella Caravella', 'Cristina Vaquero-Piñeiro']),
    (   2019,
        [   'Probability Forecasts and Prediction Markets',
            'R&D Financing And Growth',
            'Mission-Oriented Innovation Policies: A Theoretical And Empirical Assessment For The Us Economy',
            'Public Investment Fiscal Multipliers: An Empirical Assessment For European Countries',
            'Consumption Smoothing Channels Within And Between Households',
            'A critical analysis of the secular stagnation theory',
            'Further evidence of the relationship between social transfers and income inequality in OECD countries',
            'Capital accumulation and corporate portfolio choice between liquidity holdings and financialisation'],
        [   'Julia Mortera & A. Philip Dawid',
            'Luca Spinesi & Mario Tirelli',
            'Matteo Deleidi & Mariana Mazzucato',
            'Enrico Sergio Levrero & Matteo Deleidi & Francesca Iafrate',
            'Simone Tedeschi & Luigi Ventura & Pierfederico Asdrubal',
            'Stefano Di Bucchianico',
            "Giorgio D'Agostino & Luca Pieroni & Margherita Scarlato",
            'Giovanni Scarano']),
    (   2018, ...

I got the years with a list comprehension over the h3 elements. For the titles and authors, I looped over each div element with the class panel-body and appended one list per div to title_list and author_list, again using list comprehensions; the authors come from the next_sibling of each i element. Then I zipped the three lists and cast the result to a list. Finally, I pretty-printed the result.
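The result above is grouped by year, while the question asks for the name, authors, and year of each paper. As a minimal sketch, the grouped tuples could be flattened into one record per paper. The helper name flatten_by_paper and the record keys are hypothetical, and the sample data is a shortened copy of the output shown above. The sketch assumes that the titles and authors in each group are aligned by position and that author names are separated by " & ".

```python
def flatten_by_paper(result):
    """Turn (year, titles, authors) groups into one dict per paper.

    Hypothetical helper: assumes titles[i] belongs with authors[i].
    zip() stops at the shorter list, so mismatched lengths drop entries.
    """
    records = []
    for year, titles, authors in result:
        for title, author_str in zip(titles, authors):
            records.append({
                "year": year,
                "title": title,
                # Split the combined author string into individual names.
                "authors": author_str.split(" & "),
            })
    return records


# Shortened sample in the same shape as the scraped result.
sample = [
    (2020,
     ['The Role Of Public Procurement As Innovation Lever: '
      'Evidence From Italian Manufacturing Firms'],
     ['Francesco Crespi & Serenella Caravella']),
    (2019,
     ['Probability Forecasts and Prediction Markets',
      'R&D Financing And Growth'],
     ['Julia Mortera & A. Philip Dawid',
      'Luca Spinesi & Mario Tirelli']),
]

records = flatten_by_paper(sample)
for rec in records:
    print(rec["year"], rec["title"], rec["authors"])
```

Only the author strings are split, so an ampersand inside a title such as 'R&D Financing And Growth' is left untouched.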

