
Beautifulsoup Parsing - Dealing With Superscript?

This is the HTML segment I am trying to extract information from: Market Cap (intraday)5

Solution 1:

The span is not a sibling of the <sup> element; it is a child of the next sibling of the <sup>'s grandparent (thanks, 1.618).

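from bs4 import BeautifulSoup as bs

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bs("""<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)
<font size="-1"><sup>5</sup></font>:</td><td class="yfnc_tabledata1">
<span id="yfs_j10_aal">33.57B</span></td></tr>""", "html.parser")

# <sup> -> <font> -> <td>, then the next <td> in the row, then its <span>
soup.find("sup", text="5").parent.parent.find_next_sibling("td").find("span").text
# u'33.57B'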
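
To see why two `.parent` hops are needed, it may help to walk the tree one step at a time. What follows is only a sketch on the same snippet, assuming Python 3, a Beautiful Soup version that accepts `string=` (the newer name for `text=`), and the built-in `html.parser` backend. The intermediate variable names are illustrative, and the snippet is wrapped in `<table><tr>` purely to make it well-formed.

```python
from bs4 import BeautifulSoup

html = """<table><tr><td class="yfnc_tablehead1" width="74%">Market Cap (intraday)
<font size="-1"><sup>5</sup></font>:</td><td class="yfnc_tabledata1">
<span id="yfs_j10_aal">33.57B</span></td></tr></table>"""

soup = BeautifulSoup(html, "html.parser")

sup = soup.find("sup", string="5")            # the superscript footnote marker
font = sup.parent                             # <font>, the parent of <sup>
label_td = font.parent                        # <td class="yfnc_tablehead1">, the grandparent
value_td = label_td.find_next_sibling("td")   # the next <td> in the same row
market_cap = value_td.find("span").text       # text of the <span> inside that cell

print(sup.name, font.name, label_td.name, value_td["class"], market_cap)
```

Each variable corresponds to one link in the one-line chain, which makes it easier to spot which hop returns `None` if the markup differs from the snippet.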

Since you seem to have problems with it, here's my full test script (using python-requests), which reliably works for me:

import requests
from bs4 import BeautifulSoup as bs

url = "https://finance.yahoo.com/q/ks?s=AAL+Key+Statistics"

r = requests.get(url)

soup = bs(r.text, "html.parser")

HTML_MarketCap = soup.find("sup", text="5").parent.parent.find_next_sibling("td").find("span").text

print(HTML_MarketCap)
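
The one-line chain raises `AttributeError` as soon as any step returns `None`, for example when the fetched page does not contain the expected markup. Below is a hedged sketch of a guarded variant, assuming Python 3 and the `html.parser` backend. `extract_market_cap` is a hypothetical helper name, not part of Beautiful Soup, and it uses `find_parent("td")` in place of the two `.parent` hops.

```python
from bs4 import BeautifulSoup


def extract_market_cap(html):
    """Return the Market Cap text, or None if the expected markup is missing."""
    soup = BeautifulSoup(html, "html.parser")

    # Locate the footnote marker; only the first <sup>5</sup> is considered.
    sup = soup.find("sup", string="5")
    if sup is None:
        return None

    # Climb to the enclosing label cell, however deeply <sup> is nested.
    label_td = sup.find_parent("td")
    if label_td is None:
        return None

    # The value lives in the next <td> of the same row.
    value_td = label_td.find_next_sibling("td")
    if value_td is None:
        return None

    span = value_td.find("span")
    return span.get_text(strip=True) if span is not None else None


sample = """<table><tr><td class="yfnc_tablehead1" width="74%">Market Cap (intraday)
<font size="-1"><sup>5</sup></font>:</td><td class="yfnc_tabledata1">
<span id="yfs_j10_aal">33.57B</span></td></tr></table>"""

print(extract_market_cap(sample))
print(extract_market_cap("<p>no table here</p>"))
```

With `requests`, the helper would be called as `extract_market_cap(r.text)`; a `None` result then signals that the page layout is not what the snippet assumes.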

Solution 2:

Alternatively, you can simply use find_next() after locating the <sup>5</sup> element, like this:

from bs4 import BeautifulSoup

s = '''<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td><td class="yfnc_tabledata1"><span id="yfs_j10_aal">33.57B</span></td></tr>'''

soup = BeautifulSoup(s, "html.parser")

sup = soup.find('sup', text='5')

sup.find_next('span')
Out[5]: <span id="yfs_j10_aal">33.57B</span>

sup.find_next('span').text
Out[6]: u'33.57B'


>>> help(sup.find_next)

Help on method find_next in module bs4.element:

find_next(self, name=None, attrs={}, text=None, **kwargs) method of bs4.element.Tag instance
    Returns the first item that matches the given criteria and appears after this Tag in the document.
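
The phrase "appears after this Tag in the document" is what separates `find_next` from `find_next_sibling`. The following sketch contrasts the two on a cut-down version of the snippet, assuming Python 3, `string=` support, and the `html.parser` backend; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<table><tr><td>Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>
<td><span id="yfs_j10_aal">33.57B</span></td></tr></table>"""

soup = BeautifulSoup(html, "html.parser")
sup = soup.find("sup", string="5")

# find_next_sibling only looks at nodes that share <sup>'s parent (<font>).
# <sup> is the only child there, so nothing is found.
sibling_hit = sup.find_next_sibling("span")

# find_next scans everything that comes later in document order,
# regardless of nesting depth, so it reaches the <span> in the next cell.
document_hit = sup.find_next("span")

print(sibling_hit)
print(document_hit.text)
```

The trade-off is that `find_next` returns the first later `<span>` anywhere in the document, so it relies on no other `<span>` appearing between the marker and the value.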
