pycharmers.utils.soup_utils module

pycharmers.utils.soup_utils.str2soup(string)

Convert a string to soup, removing extra wrapper tags such as <html>, <body>, and <head>.

Parameters: string (str) – An HTML or XML string to parse.
Returns: A data structure representing a parsed HTML or XML document.
Return type: bs4.BeautifulSoup

Examples

>>> from pycharmers.utils import str2soup
>>> string = "<title>Python-Charmers</title>"
>>> type(string)
str
>>> soup = str2soup(string)
>>> soup
<title>Python-Charmers</title>
>>> type(soup)
bs4.BeautifulSoup
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(string)
<html><head><title>Python-Charmers</title></head></html>
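For comparison, here is a minimal sketch that uses only bs4 and is not the pycharmers implementation. The wrapper tags are added by the parser, so a parser that leaves fragments unwrapped, such as "html.parser", gives a similar result for this input.

```python
from bs4 import BeautifulSoup

string = "<title>Python-Charmers</title>"

# "html.parser" does not wrap a fragment in <html>/<head>/<body>,
# whereas lxml and html5lib add the missing wrapper tags.
soup = BeautifulSoup(string, "html.parser")
print(soup)  # <title>Python-Charmers</title>
```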

pycharmers.utils.soup_utils.split_section(section, name=None, attrs={}, recursive=True, text=None, **kwargs)

Split a bs4.BeautifulSoup object at the tags that match the filter.

Parameters

section (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – A filter on the inner text of the tag.
kwargs (dict) – A dictionary of filters on attribute values.

Returns

A list of elements without filter tag elements.

Return type

list

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import split_section
>>> section = BeautifulSoup("""
... <section>
...   <div>
...     <h2>Title</h2>
...     <div>
...     <p>aaaaaaaaaaaaaaaaaaaaaa</p>
...     <div>
...     <img/>
...     </div>
...     <p>bbbbbbbbbbbbbbbbbbbbbb</p>
...     </div>
...   </div>
... </section>
... """)
>>> len(split_section(section, name="img"))
3
>>> split_section(section, name="img")
[<section>
<div>
<h2>Title</h2>
<div>
<p>aaaaaaaaaaaaaaaaaaaaaa</p>
<div>
</div></div></div></section>,
<img/>,
<p>bbbbbbbbbbbbbbbbbbbbbb</p>
]

pycharmers.utils.soup_utils.group_soup_with_head(soup, name=None, attrs={}, recursive=True, text=None, **kwargs)

Group the contents of a bs4.BeautifulSoup object into sections, based on head tags.

Parameters

soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – A filter on the inner text of the tag.
kwargs (dict) – A dictionary of filters on attribute values.

Returns

A list of <section> elements, each starting with a matched head tag followed by the content up to the next head.

Return type

list

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import group_soup_with_head
>>> section = BeautifulSoup("""
... <h2>AAA</h2>
... <div>
...   <p>aaaaaaaaaaaaaaaaaaaaaa</p>
... </div>
... <h2>BBB</h2>
... <div>
...   <p>bbbbbbbbbbbbbbbbbbbbbb</p>
... </div>
... """)
>>> sections = group_soup_with_head(section, name="h2")
>>> len(sections)
2
>>> sections
[<section><h2>AAA</h2><div>
<p>aaaaaaaaaaaaaaaaaaaaaa</p>
</div>
</section>,
<section><h2>BBB</h2><div>
<p>bbbbbbbbbbbbbbbbbbbbbb</p>
</div>
</section>]
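The grouping idea can be sketched with plain bs4 as follows. This is an illustration under assumptions, not the pycharmers implementation: group_by_head is a hypothetical helper, and it only looks at siblings of each head tag rather than handling arbitrary nesting.

```python
import copy

from bs4 import BeautifulSoup


def group_by_head(soup, name):
    """Hypothetical helper: one <section> per `name` tag plus what follows it."""
    sections = []
    for head in soup.find_all(name):
        section = soup.new_tag("section")
        # Copy the nodes so the original tree is left untouched.
        section.append(copy.copy(head))
        for sibling in head.next_siblings:
            if getattr(sibling, "name", None) == name:
                break  # the next head starts a new group
            section.append(copy.copy(sibling))
        sections.append(section)
    return sections


html = "<h2>AAA</h2><div><p>aaa</p></div><h2>BBB</h2><div><p>bbb</p></div>"
sections = group_by_head(BeautifulSoup(html, "html.parser"), "h2")
print(len(sections))  # 2
```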

pycharmers.utils.soup_utils.replace_soup_tag(soup, new_name, new_namespace=None, new_nsprefix=None, new_attrs={}, new_sourceline=None, new_sourcepos=None, new_kwattrs={}, old_name=None, old_attrs={}, old_recursive=True, old_text=None, old_limit=None, old_kwargs={}, **kwargs)

Replace old tags with new tags.

Arguments named old_XXX specify how to find the old tags.
Arguments named new_XXX specify how to create the new tags.

Parameters

old_name (str) – A filter on tag name.
old_attrs (dict) – A dictionary of filters on attribute values.
old_recursive (bool) – If this is True, .find_all will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
old_limit (int) – Stop looking after finding this many results.
old_kwargs (dict) – A dictionary of filters on attribute values.
new_name (str) – The name of the new Tag.
new_namespace (str) – The URI of the new Tag’s XML namespace, if any.
new_nsprefix (str) – The prefix for the new Tag’s XML namespace, if any.
new_attrs (dict) – A dictionary of this Tag’s attribute values; can be used instead of kwattrs for attributes like ‘class’ that are reserved words in Python.
new_sourceline (str) – The line number where this tag was (purportedly) found in its source document.
new_sourcepos (str) – The character position within sourceline where this tag was (purportedly) found.
new_kwattrs (dict) – Keyword arguments for the new Tag’s attribute values.

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import replace_soup_tag
>>> section = BeautifulSoup("""
... <h2>AAA</h2>
... <div>
...   <p>aaaaaaaaaaaaaaaaaaaaaa</p>
... </div>
... <h3>BBB</h3>
... <div>
...   <p>bbbbbbbbbbbbbbbbbbbbbb</p>
... </div>
... """)
>>> section = replace_soup_tag(soup=section, old_name="h3", new_name="h2")
>>> section
<html><body><h2>AAA</h2>
<div>
<p>aaaaaaaaaaaaaaaaaaaaaa</p>
</div>
<h2>BBB</h2>
<div>
<p>bbbbbbbbbbbbbbbbbbbbbb</p>
</div>
</body></html>
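When only the tag name needs to change, plain bs4 offers a simpler route, shown here as a sketch. It is not how replace_soup_tag is necessarily implemented: assigning to Tag.name renames a tag in place and keeps its attributes and children, whereas the new_XXX arguments above describe building a new tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2>AAA</h2><h3>BBB</h3>", "html.parser")

# Rename every <h3> in place; attributes and children are preserved.
for tag in soup.find_all("h3"):
    tag.name = "h2"

print(soup)  # <h2>AAA</h2><h2>BBB</h2>
```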

pycharmers.utils.soup_utils.find_target_text(soup, name=None, attrs={}, recursive=True, text=None, default='__NOT_FOUND__', strip=True, **kwargs)

Find target element, and get all child strings from it.

Parameters

soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – A filter on the inner text of the tag.
default (str) – Default return value if the element is not found.
strip (bool) – Whether to strip the returned text with str_strip.
kwargs (dict) – A dictionary of filters on attribute values.

Returns

text

Return type

str

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import find_target_text
>>> section = BeautifulSoup("""
... <h2>AAA</h2>
... <div> <p>aaaaaaaaaaaaaaaaaaaaaa</p></div>
... """)
>>> find_target_text(soup=section, name="div")
'aaaaaaaaaaaaaaaaaaaaaa'
>>> find_target_text(soup=section, name="div", strip=False)
' aaaaaaaaaaaaaaaaaaaaaa '
>>> find_target_text(soup=section, name="divdiv", default="not found")
'not found'
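The find-then-fallback pattern behind this function can be sketched with plain bs4. text_or_default is a hypothetical helper, and it relies on the strip option of bs4's get_text instead of pycharmers' str_strip, so whitespace handling may differ from find_target_text.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, name, default="__NOT_FOUND__", strip=True):
    """Hypothetical helper: text of the first `name` tag, or `default`."""
    target = soup.find(name)
    if target is None:
        return default  # nothing matched the filter
    return target.get_text(strip=strip)


soup = BeautifulSoup("<h2>AAA</h2><div> <p>aaa</p></div>", "html.parser")
print(text_or_default(soup, "div"))                          # aaa
print(text_or_default(soup, "divdiv", default="not found"))  # not found
```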

pycharmers.utils.soup_utils.find_all_target_text(soup, name=None, attrs={}, recursive=True, text=None, default='__NOT_FOUND__', strip=True, joint='', **kwargs)

Find all target elements, and join the child strings from them.

Parameters

soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – A filter on the inner text of the tag.
default (str) – Default return value if no element is found.
strip (bool) – Whether to strip the returned text with str_strip.
joint (str) – Inserted between target strings.
kwargs (dict) – A dictionary of filters on attribute values.

Returns

text

Return type

str

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import find_all_target_text
>>> section = BeautifulSoup("""
... <div>
...   <p class="lang en">Hello</p>
...   <p class="lang zh-CN">你好</p>
...   <p class="lang es">Hola</p>
...   <p class="lang fr">Bonjour</p>
...   <p class="lang ja">こんにちは</p>
... </div>
... """)
>>> find_all_target_text(soup=section, name="p", class_="lang", joint=", ")
'Hello, 你好, Hola, Bonjour, こんにちは'
>>> find_all_target_text(soup=section, name="p", class_="es", joint=", ")
'Hola'
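A rough equivalent using only bs4 is sketched below. join_texts is a hypothetical helper and the implementation inside pycharmers may differ. It collects the text of every match and joins the pieces with joint.

```python
from bs4 import BeautifulSoup


def join_texts(soup, name, joint="", default="__NOT_FOUND__", **kwargs):
    """Hypothetical helper: join the text of all matching tags with `joint`."""
    texts = [tag.get_text(strip=True) for tag in soup.find_all(name, **kwargs)]
    return joint.join(texts) if texts else default


soup = BeautifulSoup(
    '<div><p class="lang en">Hello</p><p class="lang es">Hola</p>'
    '<p class="lang fr">Bonjour</p></div>',
    "html.parser",
)
print(join_texts(soup, "p", joint=", ", class_="lang"))  # Hello, Hola, Bonjour
print(join_texts(soup, "p", joint=", ", class_="es"))    # Hola
```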

pycharmers.utils.soup_utils.find_target_id(soup, key, name=None, attrs={}, recursive=True, text=None, default=None, strip=True, **kwargs)

Find the target element, and get the value of the attribute named by key from it.

Parameters

soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
key (str) – Name of the attribute to read (for example, id or src).
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – A filter on the inner text of the tag.
default (str) – Default return value if the element is not found.
strip (bool) – Whether to strip the returned text with str_strip.
kwargs (dict) – A dictionary of filters on attribute values.

Returns

The attribute value, or default if the element is not found.

Return type

str

Examples

>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import find_target_id
>>> section = BeautifulSoup("""
... <h2>IMAGE</h2>
... <div>
...   <img id="apple-touch-icon" src="https://iwasakishuto.github.io/images/apple-touch-icon/Python-Charmers.png">
... </div>
... """)
>>> find_target_id(soup=section, name="img", key="id")
'apple-touch-icon'
>>> find_target_id(soup=section, name="img", key="src")
'https://iwasakishuto.github.io/images/apple-touch-icon/Python-Charmers.png'
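Reading an attribute from the first match looks roughly like this in plain bs4. attr_or_default is a hypothetical helper, the markup is a shortened placeholder, and the handling of strip is left out.

```python
from bs4 import BeautifulSoup


def attr_or_default(soup, key, name, default=None):
    """Hypothetical helper: value of attribute `key` on the first `name` tag."""
    target = soup.find(name)
    if target is None:
        return default
    # Tag.get returns `default` when the attribute itself is missing.
    return target.get(key, default)


soup = BeautifulSoup('<div><img id="icon" src="icon.png"></div>', "html.parser")
print(attr_or_default(soup, key="id", name="img"))   # icon
print(attr_or_default(soup, key="src", name="img"))  # icon.png
```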

pycharmers.utils.soup_utils.get_soup(url, driver=None, features='lxml', timeout=1)

Scrape and get page source from url.

Parameters

url (str) – URL.
driver (WebDriver) – webdriver
features (str) – Desirable features of the parser to be used. This may be the name of a specific parser (“lxml”, “lxml-xml”, “html.parser”, or “html5lib”) or it may be the type of markup to be used (“html”, “html5”, “xml”). It’s recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup
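
No example is given for get_soup, so the following is only a sketch of the path without a driver, written with the standard library and bs4. fetch_soup is a hypothetical stand-in, not the pycharmers implementation, and it defaults to "html.parser" rather than 'lxml' to avoid an extra dependency.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, features="html.parser", timeout=1):
    """Hypothetical stand-in for get_soup when no WebDriver is used."""
    # Download the raw page source, giving up after `timeout` seconds.
    with urlopen(url, timeout=timeout) as response:
        html = response.read()
    # Parse the bytes with the requested parser.
    return BeautifulSoup(html, features)
```

Pages that render their content with JavaScript are not handled by this sketch, which is presumably the case the driver argument is meant for.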
