pycharmers.utils.soup_utils module

pycharmers.utils.soup_utils.str2soup(string)

Convert a string to soup, removing extra tags such as <html>, <body>, and <head>.

Parameters

string (str) – The HTML or XML string to parse.

Returns

A data structure representing a parsed HTML or XML document.

Return type

bs4.BeautifulSoup
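The wrapper removal described above can be approximated with plain Beautiful Soup. This is a minimal sketch, not the actual implementation of str2soup: the helper name strip_wrappers is hypothetical, and the "html.parser" backend is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup


def strip_wrappers(markup: str) -> BeautifulSoup:
    """Parse markup and drop any <html>, <head>, and <body> wrappers.

    Hypothetical helper for illustration; not part of pycharmers.
    """
    soup = BeautifulSoup(markup, "html.parser")
    for name in ("html", "head", "body"):
        for tag in soup.find_all(name):
            tag.unwrap()  # keep the children, discard the wrapper itself
    return soup


page = "<html><head><title>Python-Charmers</title></head><body><p>hi</p></body></html>"
print(strip_wrappers(page))
```

Tag.unwrap() replaces a tag with its own contents, so only the wrapper elements disappear and the document order of the remaining elements is preserved.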

Examples

>>> from pycharmers.utils import str2soup
>>> string = "<title>Python-Charmers</title>"
>>> type(string)
str
>>> soup = str2soup(string)
>>> soup
<title>Python-Charmers</title>
>>> type(soup)
bs4.BeautifulSoup
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(string)
<html><head><title>Python-Charmers</title></head></html>
pycharmers.utils.soup_utils.split_section(section, name=None, attrs={}, recursive=True, text=None, **kwargs)

Split a bs4.BeautifulSoup object at the elements that match the given filters.

Parameters
  • section (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.

  • name (str) – A filter on tag name.

  • attrs (dict) – A dictionary of filters on attribute values.

  • recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.

  • text (str) – An inner text.

  • kwargs (dict) – A dictionary of filters on attribute values.
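The filter parameters above mirror those of Beautiful Soup's own find and find_all. The following sketch shows how each one narrows a search. It uses only bs4 calls, and the markup is made up for illustration; how pycharmers forwards these arguments internally is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up markup: one <p> directly under <div>, one nested inside <span>.
markup = '<div><p class="a">x</p><span><p class="b">y</p></span></div>'
div = BeautifulSoup(markup, "html.parser").div

first_p = div.find(name="p")                         # filter on tag name
by_attrs = div.find(name="p", attrs={"class": "b"})  # filter via attrs dict
by_kwargs = div.find(name="p", class_="b")           # same filter via kwargs
direct_ps = div.find_all(name="p", recursive=False)  # direct children only
all_ps = div.find_all(name="p")                      # recursive (default)

print(first_p, by_attrs, len(direct_ps), len(all_ps))
```

The attrs dictionary and the keyword form express the same attribute filter; class_ is spelled with a trailing underscore because class is a reserved word in Python.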

Returns

A list of the pieces produced by splitting section at the matching elements.

Return type

list

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import split_section
>>> section = BeautifulSoup("""
... <section>
...   <div>
...     <h2>Title</h2>
...     <div>
...     <p>aaaaaaaaaaaaaaaaaaaaaa</p>
...     <div>
...     <img/>
...     </div>
...     <p>bbbbbbbbbbbbbbbbbbbbbb</p>
...     </div>
...   </div>
... </section>
... """)
>>> len(split_section(section, name="img"))
3
>>> split_section(section, name="img")
[<section>
<div>
<h2>Title</h2>
<div>
<p>aaaaaaaaaaaaaaaaaaaaaa</p>
<div>
</div></div></div></section>,
<img/>,
<p>bbbbbbbbbbbbbbbbbbbbbb</p>
]
pycharmers.utils.soup_utils.group_soup_with_head(soup, name=None, attrs={}, recursive=True, text=None, **kwargs)

Group the contents of a bs4.BeautifulSoup object by head element.

Parameters
  • soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.

  • name (str) – A filter on tag name.

  • attrs (dict) – A dictionary of filters on attribute values.

  • recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.

  • text (str) – An inner text.

  • kwargs (dict) – A dictionary of filters on attribute values.

Returns

A list of <section> elements, each containing one matched head element followed by the content up to the next head.

Return type

list
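The grouping idea can be sketched with plain Beautiful Soup as follows. This is an illustration under stated assumptions, not the pycharmers implementation: the helper name group_by_heading is hypothetical, the headings are assumed to be top-level children of the soup (as they are with "html.parser"), and any content before the first heading is left out of the result.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def group_by_heading(soup, head_name):
    """Wrap each head_name tag and its following siblings in a <section>.

    Hypothetical helper for illustration; not part of pycharmers.
    """
    groups = []
    current = None
    for node in list(soup.children):  # snapshot, because nodes are moved below
        if isinstance(node, Tag) and node.name == head_name:
            current = soup.new_tag("section")  # start a new group
            groups.append(current)
        if current is not None:
            current.append(node.extract())  # move node into the open group
    return groups


soup = BeautifulSoup("<h2>AAA</h2><p>a</p><h2>BBB</h2><p>b</p>", "html.parser")
sections = group_by_heading(soup, "h2")
print(sections)
```

Each heading opens a new group, and every later sibling is moved into the most recently opened group until the next heading appears.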

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import group_soup_with_head
>>> section = BeautifulSoup("""
... <h2>AAA</h2>
... <div>
...   <p>aaaaaaaaaaaaaaaaaaaaaa</p>
... </div>
... <h2>BBB</h2>
... <div>
...   <p>bbbbbbbbbbbbbbbbbbbbbb</p>
... </div>
... """)
>>> sections = group_soup_with_head(section, name="h2")
>>> len(sections)
2
>>> sections
[<section><h2>AAA</h2><div>
<p>aaaaaaaaaaaaaaaaaaaaaa</p>
</div>
</section>,
<section><h2>BBB</h2><div>
<p>bbbbbbbbbbbbbbbbbbbbbb</p>
</div>
</section>]
pycharmers.utils.soup_utils.replace_soup_tag(soup, new_name, new_namespace=None, new_nsprefix=None, new_attrs={}, new_sourceline=None, new_sourcepos=None, new_kwattrs={}, old_name=None, old_attrs={}, old_recursive=True, old_text=None, old_limit=None, old_kwargs={}, **kwargs)

Replace old tags with new tags.

  • Arguments named old_XXX specify how to find the old tags.

  • Arguments named new_XXX specify how to create the new tags.

Parameters
  • soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.

  • old_name (str) – A filter on tag name.

  • old_attrs (dict) – A dictionary of filters on attribute values.

  • old_recursive (bool) – If this is True, .find_all will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.

  • old_text (str) – An inner text.

  • old_limit (int) – Stop looking after finding this many results.

  • old_kwargs (dict) – A dictionary of filters on attribute values.

  • new_name (str) – The name of the new Tag.

  • new_namespace (str) – The URI of the new Tag’s XML namespace, if any.

  • new_nsprefix (str) – The prefix for the new Tag’s XML namespace, if any.

  • new_attrs (dict) – A dictionary of this Tag’s attribute values; can be used instead of kwattrs for attributes like ‘class’ that are reserved words in Python.

  • new_sourceline (int) – The line number where this tag was (purportedly) found in its source document.

  • new_sourcepos (int) – The character position within sourceline where this tag was (purportedly) found.

  • new_kwattrs (dict) – Keyword arguments for the new Tag’s attribute values.
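The find-then-create pattern described by the old_XXX and new_XXX arguments can be sketched with plain Beautiful Soup. This is a simplified illustration, not the pycharmers implementation: the helper name swap_tags is hypothetical, and it supports only a tag-name filter and an attribute dictionary.

```python
from bs4 import BeautifulSoup


def swap_tags(soup, old_name, new_name, new_attrs=None):
    """Replace every old_name tag with a new_name tag, keeping its children.

    Hypothetical helper for illustration; not part of pycharmers.
    """
    for old in soup.find_all(old_name):           # "how to find old tags"
        new = soup.new_tag(new_name, attrs=dict(new_attrs or {}))  # "how to create new tags"
        for child in list(old.contents):          # snapshot, children are moved
            new.append(child.extract())
        old.replace_with(new)
    return soup


soup = BeautifulSoup("<h2>AAA</h2><h3>BBB</h3>", "html.parser")
print(swap_tags(soup, old_name="h3", new_name="h2", new_attrs={"class": "x"}))
```

Creating a fresh tag rather than renaming the old one means the new tag starts with only the attributes passed in, which matches the separation between old_XXX and new_XXX arguments.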

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import replace_soup_tag
>>> section = BeautifulSoup("""
... <h2>AAA</h2>
... <div>
...   <p>aaaaaaaaaaaaaaaaaaaaaa</p>
... </div>
... <h3>BBB</h3>
... <div>
...   <p>bbbbbbbbbbbbbbbbbbbbbb</p>
... </div>
... """)
>>> section = replace_soup_tag(soup=section, old_name="h3", new_name="h2")
>>> section
<html><body><h2>AAA</h2>
<div>
<p>aaaaaaaaaaaaaaaaaaaaaa</p>
</div>
<h2>BBB</h2>
<div>
<p>bbbbbbbbbbbbbbbbbbbbbb</p>
</div>
</body></html>
pycharmers.utils.soup_utils.find_target_text(soup, name=None, attrs={}, recursive=True, text=None, default='__NOT_FOUND__', strip=True, **kwargs)

Find target element, and get all child strings from it.

Parameters
  • soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.

  • name (str) – A filter on tag name.

  • attrs (dict) – A dictionary of filters on attribute values.

  • recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.

  • text (str) – An inner text.

  • default (str) – Default return value if element not found.

  • strip (bool) – Whether to use str_strip

  • kwargs (dict) – A dictionary of filters on attribute values.

Returns

The text of the matched element, or default if no element is found.

Return type

str
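The lookup-with-fallback behavior can be sketched with plain Beautiful Soup. This is a minimal illustration, not the pycharmers implementation: the helper name text_or_default is hypothetical, and str.strip stands in for str_strip, whose exact behavior is not described here.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, name, default="__NOT_FOUND__", strip=True):
    """Return the text of the first name tag, or default if there is none.

    Hypothetical helper for illustration; not part of pycharmers.
    """
    tag = soup.find(name)
    if tag is None:
        return default
    text = tag.get_text()  # concatenates all child strings
    return text.strip() if strip else text


soup = BeautifulSoup("<h2>AAA</h2><div> <p>aaa</p> </div>", "html.parser")
print(repr(text_or_default(soup, "div")))
print(repr(text_or_default(soup, "div", strip=False)))
print(repr(text_or_default(soup, "divdiv", default="not found")))
```

Returning a sentinel string instead of raising lets callers chain lookups over many pages without wrapping each one in a try block.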

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import find_target_text
>>> section = BeautifulSoup("""
... <h2>AAA</h2>
... <div> <p>aaaaaaaaaaaaaaaaaaaaaa</p></div>
... """)
>>> find_target_text(soup=section, name="div")
'aaaaaaaaaaaaaaaaaaaaaa'
>>> find_target_text(soup=section, name="div", strip=False)
' aaaaaaaaaaaaaaaaaaaaaa '
>>> find_target_text(soup=section, name="divdiv", default="not found")
'not found'
pycharmers.utils.soup_utils.find_all_target_text(soup, name=None, attrs={}, recursive=True, text=None, default='__NOT_FOUND__', strip=True, joint='', **kwargs)

Find all target elements, get the child strings from each, and join them with joint.

Parameters
  • soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.

  • name (str) – A filter on tag name.

  • attrs (dict) – A dictionary of filters on attribute values.

  • recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.

  • text (str) – An inner text.

  • default (str) – Default return value if element not found.

  • strip (bool) – Whether to use str_strip

  • joint (str) – Inserted between target strings.

  • kwargs (dict) – A dictionary of filters on attribute values.

Returns

The texts of all matched elements joined with joint, or default if no element is found.

Return type

str
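Collecting and joining the text of several matches can be sketched with plain Beautiful Soup. This is a minimal illustration, not the pycharmers implementation: the helper name join_texts is hypothetical, and returning default when nothing matches is an assumption based on the parameter description.

```python
from bs4 import BeautifulSoup


def join_texts(soup, name, default="__NOT_FOUND__", joint="", **filters):
    """Join the stripped text of every matching tag with joint.

    Hypothetical helper for illustration; not part of pycharmers.
    """
    tags = soup.find_all(name, **filters)
    if not tags:
        return default  # assumed fallback when no element matches
    return joint.join(tag.get_text().strip() for tag in tags)


soup = BeautifulSoup(
    '<div><p class="lang en">Hello</p><p class="lang es">Hola</p></div>',
    "html.parser",
)
print(join_texts(soup, "p", joint=", ", class_="lang"))
print(join_texts(soup, "p", joint=", ", class_="es"))
```

Because class is a multi-valued attribute in HTML, class_="lang" matches every paragraph that lists lang among its classes, while class_="es" matches only one.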

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import find_all_target_text
>>> section = BeautifulSoup("""
... <div>
...   <p class="lang en">Hello</p>
...   <p class="lang zh-CN">你好</p>
...   <p class="lang es">Hola</p>
...   <p class="lang fr">Bonjour</p>
...   <p class="lang ja">こんにちは</p>
... </div>
... """)
>>> find_all_target_text(soup=section, name="p", class_="lang", joint=", ")
'Hello, 你好, Hola, Bonjour, こんにちは'
>>> find_all_target_text(soup=section, name="p", class_="es", joint=", ")
'Hola'
pycharmers.utils.soup_utils.find_target_id(soup, key, name=None, attrs={}, recursive=True, text=None, default=None, strip=True, **kwargs)

Find the target element and get the value of its key attribute (for example, id or src).

Parameters
  • soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.

  • key (str) – Name of the attribute to read, such as "id" or "src".

  • name (str) – A filter on tag name.

  • attrs (dict) – A dictionary of filters on attribute values.

  • recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.

  • text (str) – An inner text.

  • default (str) – Default return value if element not found.

  • strip (bool) – Whether to use str_strip

  • kwargs (dict) – A dictionary of filters on attribute values.

Returns

The value of the key attribute, or default if the element is not found.

Return type

str
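Reading an attribute with a fallback can be sketched with plain Beautiful Soup. This is a minimal illustration, not the pycharmers implementation: the helper name attr_or_default is hypothetical, and the image URL is a made-up placeholder.

```python
from bs4 import BeautifulSoup


def attr_or_default(soup, name, key, default=None):
    """Return attribute key of the first name tag, or default.

    Hypothetical helper for illustration; not part of pycharmers.
    """
    tag = soup.find(name)
    if tag is None:
        return default
    return tag.get(key, default)  # Tag.get behaves like dict.get


soup = BeautifulSoup('<div><img id="icon" src="https://example.com/a.png"></div>', "html.parser")
print(attr_or_default(soup, "img", "id"))
print(attr_or_default(soup, "img", "src"))
print(attr_or_default(soup, "img", "alt", default="no alt"))
```

The same fallback covers both failure cases here: the element is missing, or the element exists but lacks the requested attribute.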

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import find_target_id
>>> section = BeautifulSoup("""
... <h2>IMAGE</h2>
... <div>
...   <img id="apple-touch-icon" src="https://iwasakishuto.github.io/images/apple-touch-icon/Python-Charmers.png">
... </div>
... """)
>>> find_target_id(soup=section, name="img", key="id")
'apple-touch-icon'
>>> find_target_id(soup=section, name="img", key="src")
'https://iwasakishuto.github.io/images/apple-touch-icon/Python-Charmers.png'
pycharmers.utils.soup_utils.get_soup(url, driver=None, features='lxml', timeout=1)

Fetch the page at url and parse its source into soup.

Parameters
  • url (str) – URL.

  • driver (WebDriver) – webdriver

  • features (str) – Desirable features of the parser to be used. This may be the name of a specific parser (“lxml”, “lxml-xml”, “html.parser”, or “html5lib”) or it may be the type of markup to be used (“html”, “html5”, “xml”). It’s recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.

Returns

A data structure representing a parsed HTML or XML document.

Return type

bs4.BeautifulSoup
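The fetch-and-parse step can be sketched with the standard library and Beautiful Soup. This is an illustration under stated assumptions, not the pycharmers implementation: the helper name fetch_soup and its opener parameter are hypothetical, the driver argument is not modeled, and treating timeout as a network timeout in seconds is an assumption. The default parser here is "html.parser" so that lxml is not required.

```python
import io
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, features="html.parser", timeout=1, opener=urlopen):
    """Download url and parse the response body.

    Hypothetical helper for illustration; not part of pycharmers.
    opener is injectable so the function can be exercised offline.
    """
    with opener(url, timeout=timeout) as response:
        html = response.read()
    return BeautifulSoup(html, features)


# Offline usage: a fake opener that returns canned bytes instead of
# touching the network. The URL is a placeholder and is never requested.
def fake_opener(url, timeout):
    return io.BytesIO(b"<title>Python-Charmers</title>")


soup = fetch_soup("https://example.invalid/", opener=fake_opener)
print(soup.title.get_text())
```

Naming the parser explicitly, as the features description recommends, keeps the parsed tree the same across environments that have different parsers installed.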