pycharmers.utils.soup_utils module
-
pycharmers.utils.soup_utils.str2soup(string)
Convert a string to soup and remove extra tags such as <html>, <body>, and <head>.
- Parameters
string (str) – The markup string to parse.
- Returns
A data structure representing a parsed HTML or XML document.
- Return type
bs4.BeautifulSoup
Examples
>>> from pycharmers.utils import str2soup
>>> string = "<title>Python-Charmers</title>"
>>> type(string)
str
>>> soup = str2soup(string)
>>> soup
<title>Python-Charmers</title>
>>> type(soup)
bs4.BeautifulSoup
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(string)
<html><head><title>Python-Charmers</title></head></html>
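The wrapper removal described above can be approximated with plain Beautiful Soup calls. The following is a minimal sketch, not the actual pycharmers implementation: the helper name strip_wrapper_tags is hypothetical, and the sketch assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup


def strip_wrapper_tags(markup: str) -> BeautifulSoup:
    """Parse markup and unwrap <html>, <head>, and <body> (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    for name in ("html", "head", "body"):
        for tag in soup.find_all(name):
            # unwrap() removes the tag itself but keeps its children in place.
            tag.unwrap()
    return soup


soup = strip_wrapper_tags("<html><head><title>Python-Charmers</title></head></html>")
print(soup)
```

Unwrapping keeps the inner content intact, so only the document-level wrappers disappear.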
-
pycharmers.utils.soup_utils.split_section(section, name=None, attrs={}, recursive=True, text=None, **kwargs)
Split a bs4.BeautifulSoup at the elements that match the filter.
- Parameters
section (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – An inner text.
kwargs (dict) – A dictionary of filters on attribute values.
- Returns
A list of the pieces obtained by splitting section at each matching element.
- Return type
list
Examples
>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import split_section
>>> section = BeautifulSoup("""
... <section>
...   <div>
...     <h2>Title</h2>
...     <div>
...       <p>aaaaaaaaaaaaaaaaaaaaaa</p>
...       <div>
...         <img/>
...       </div>
...       <p>bbbbbbbbbbbbbbbbbbbbbb</p>
...     </div>
...   </div>
... </section>
... """)
>>> len(split_section(section, name="img"))
3
>>> split_section(section, name="img")
[<section> <div> <h2>Title</h2> <div> <p>aaaaaaaaaaaaaaaaaaaaaa</p> <div> </div></div></div></section>, <img/>, <p>bbbbbbbbbbbbbbbbbbbbbb</p> ]
-
pycharmers.utils.soup_utils.group_soup_with_head(soup, name=None, attrs={}, recursive=True, text=None, **kwargs)
Group the elements of a bs4.BeautifulSoup based on head tags.
- Parameters
soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – An inner text.
kwargs (dict) – A dictionary of filters on attribute values.
- Returns
A list of <section> elements, each starting with a matching head element followed by the content up to the next head.
- Return type
list
Examples
>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import group_soup_with_head
>>> section = BeautifulSoup("""
... <h2>AAA</h2>
... <div>
...   <p>aaaaaaaaaaaaaaaaaaaaaa</p>
... </div>
... <h2>BBB</h2>
... <div>
...   <p>bbbbbbbbbbbbbbbbbbbbbb</p>
... </div>
... """)
>>> sections = group_soup_with_head(section, name="h2")
>>> len(sections)
2
>>> sections
[<section><h2>AAA</h2><div> <p>aaaaaaaaaaaaaaaaaaaaaa</p> </div> </section>, <section><h2>BBB</h2><div> <p>bbbbbbbbbbbbbbbbbbbbbb</p> </div> </section>]
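The grouping shown in the example can be sketched with plain Beautiful Soup calls. This is an illustrative approximation only, not the pycharmers implementation: the helper name group_by_head is hypothetical, it only looks at top-level nodes, and it assumes the "html.parser" backend.

```python
from bs4 import BeautifulSoup, Tag


def group_by_head(markup: str, head_name: str) -> list:
    """Wrap each head tag and the nodes that follow it in a new <section> (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    groups = []
    current = None
    # Copy the children into a list first, because nodes are moved while iterating.
    for node in list(soup.children):
        if isinstance(node, Tag) and node.name == head_name:
            current = soup.new_tag("section")
            groups.append(current)
        if current is not None:
            # extract() detaches the node from the soup before it is re-parented.
            current.append(node.extract())
    return groups


sections = group_by_head(
    "<h2>AAA</h2><div><p>a</p></div><h2>BBB</h2><div><p>b</p></div>", "h2"
)
print(len(sections))
print(sections[0])
```

Nodes that appear before the first head are left out of every group in this sketch.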
-
pycharmers.utils.soup_utils.replace_soup_tag(soup, new_name, new_namespace=None, new_nsprefix=None, new_attrs={}, new_sourceline=None, new_sourcepos=None, new_kwattrs={}, old_name=None, old_attrs={}, old_recursive=True, old_text=None, old_limit=None, old_kwargs={}, **kwargs)
Replace old tags with new tags.
Arguments named old_XXX specify how to find the old tags.
Arguments named new_XXX specify how to create the new tags.
- Parameters
old_name (str) – A filter on tag name.
old_attrs (dict) – A dictionary of filters on attribute values.
old_recursive (bool) – If this is True, .find_all will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
old_limit (int) – Stop looking after finding this many results.
old_kwargs (dict) – A dictionary of filters on attribute values.
new_name (str) – The name of the new Tag.
new_namespace (str) – The URI of the new Tag’s XML namespace, if any.
new_nsprefix (str) – The prefix for the new Tag’s XML namespace, if any.
new_attrs (dict) – A dictionary of this Tag’s attribute values; can be used instead of kwattrs for attributes like ‘class’ that are reserved words in Python.
new_sourceline (str) – The line number where this tag was (purportedly) found in its source document.
new_sourcepos (str) – The character position within sourceline where this tag was (purportedly) found.
new_kwattrs (dict) – Keyword arguments for the new Tag’s attribute values.
Examples
>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import replace_soup_tag
>>> section = BeautifulSoup("""
... <h2>AAA</h2>
... <div>
...   <p>aaaaaaaaaaaaaaaaaaaaaa</p>
... </div>
... <h3>BBB</h3>
... <div>
...   <p>bbbbbbbbbbbbbbbbbbbbbb</p>
... </div>
... """)
>>> section = replace_soup_tag(soup=section, old_name="h3", new_name="h2")
>>> section
<html><body><h2>AAA</h2> <div> <p>aaaaaaaaaaaaaaaaaaaaaa</p> </div> <h2>BBB</h2> <div> <p>bbbbbbbbbbbbbbbbbbbbbb</p> </div> </body></html>
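For the simple case of changing only the tag name, the same effect can be sketched directly in Beautiful Soup by assigning to Tag.name. This is a swapped-in, simpler technique, not how replace_soup_tag is implemented: it renames tags in place rather than creating new ones, and the helper name rename_tags is hypothetical.

```python
from bs4 import BeautifulSoup


def rename_tags(soup: BeautifulSoup, old_name: str, new_name: str) -> BeautifulSoup:
    """Rename every old_name tag to new_name in place (hypothetical helper)."""
    for tag in soup.find_all(old_name):
        # Assigning to .name changes the tag while keeping attributes and children.
        tag.name = new_name
    return soup


soup = BeautifulSoup("<h2>AAA</h2><h3>BBB</h3>", "html.parser")
print(rename_tags(soup, old_name="h3", new_name="h2"))
```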
-
pycharmers.utils.soup_utils.find_target_text(soup, name=None, attrs={}, recursive=True, text=None, default='__NOT_FOUND__', strip=True, **kwargs)
Find the target element and get all child strings from it.
- Parameters
soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – An inner text.
default (str) – Default return value if element not found.
strip (bool) – Whether to apply str_strip to the result.
kwargs (dict) – A dictionary of filters on attribute values.
- Returns
text
- Return type
str
Examples
>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import find_target_text
>>> section = BeautifulSoup("""
... <h2>AAA</h2>
... <div> <p>aaaaaaaaaaaaaaaaaaaaaa</p></div>
... """)
>>> find_target_text(soup=section, name="div")
'aaaaaaaaaaaaaaaaaaaaaa'
>>> find_target_text(soup=section, name="div", strip=False)
' aaaaaaaaaaaaaaaaaaaaaa '
>>> find_target_text(soup=section, name="divdiv", default="not found")
'not found'
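The find-then-extract pattern above can be sketched with Beautiful Soup's find and get_text. This is a rough approximation under stated assumptions, not the pycharmers code: the helper name find_text is hypothetical, and plain str.strip stands in for str_strip, whose exact behavior is not documented here.

```python
from bs4 import BeautifulSoup


def find_text(soup, name, default="__NOT_FOUND__", strip=True, **filters):
    """Return the text of the first matching element, or default (hypothetical helper)."""
    target = soup.find(name, **filters)
    if target is None:
        return default
    text = target.get_text()
    # str.strip() is an assumption standing in for pycharmers' str_strip.
    return text.strip() if strip else text


soup = BeautifulSoup("<h2>AAA</h2><div> <p>aaa</p> </div>", "html.parser")
print(repr(find_text(soup, "div")))
print(repr(find_text(soup, "div", strip=False)))
print(repr(find_text(soup, "divdiv", default="not found")))
```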
-
pycharmers.utils.soup_utils.find_all_target_text(soup, name=None, attrs={}, recursive=True, text=None, default='__NOT_FOUND__', strip=True, joint='', **kwargs)
Find all target elements, get the child strings from each, and join them with joint.
- Parameters
soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – An inner text.
default (str) – Default return value if element not found.
strip (bool) – Whether to apply str_strip to the result.
joint (str) – Inserted between target strings.
kwargs (dict) – A dictionary of filters on attribute values.
- Returns
text
- Return type
str
Examples
>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import find_all_target_text
>>> section = BeautifulSoup("""
... <div>
...   <p class="lang en">Hello</p>
...   <p class="lang zh-CN">你好</p>
...   <p class="lang es">Hola</p>
...   <p class="lang fr">Bonjour</p>
...   <p class="lang ja">こんにちは</p>
... </div>
... """)
>>> find_all_target_text(soup=section, name="p", class_="lang", joint=", ")
'Hello, 你好, Hola, Bonjour, こんにちは'
>>> find_all_target_text(soup=section, name="p", class_="es", joint=", ")
'Hola'
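The class_ filtering in the example relies on Beautiful Soup matching any single CSS class of a multi-valued class attribute. A minimal sketch of the find-all-and-join idea follows; the helper name join_texts is hypothetical, and returning default when nothing matches is an assumption about the intended behavior.

```python
from bs4 import BeautifulSoup


def join_texts(soup, name, joint="", default="__NOT_FOUND__", **filters):
    """Join the stripped text of every matching element (hypothetical helper)."""
    targets = soup.find_all(name, **filters)
    if not targets:
        # Assumption: fall back to default when no element matches.
        return default
    return joint.join(t.get_text(strip=True) for t in targets)


soup = BeautifulSoup(
    '<div><p class="lang en">Hello</p><p class="lang es">Hola</p></div>',
    "html.parser",
)
print(join_texts(soup, "p", joint=", ", class_="lang"))
print(join_texts(soup, "p", joint=", ", class_="es"))
```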
-
pycharmers.utils.soup_utils.find_target_id(soup, key, name=None, attrs={}, recursive=True, text=None, default=None, strip=True, **kwargs)
Find the target element and get the value of its key attribute (for example id).
- Parameters
soup (bs4.BeautifulSoup) – A data structure representing a parsed HTML or XML document.
key (str) – Name of the attribute to read, such as id or src.
name (str) – A filter on tag name.
attrs (dict) – A dictionary of filters on attribute values.
recursive (bool) – If this is True, .find will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
text (str) – An inner text.
default (str) – Default return value if element not found.
strip (bool) – Whether to apply str_strip to the result.
kwargs (dict) – A dictionary of filters on attribute values.
- Returns
text.
- Return type
str
Examples
>>> from bs4 import BeautifulSoup
>>> from pycharmers.utils import find_target_id
>>> section = BeautifulSoup("""
... <h2>IMAGE</h2>
... <div>
...   <img id="apple-touch-icon" src="https://iwasakishuto.github.io/images/apple-touch-icon/Python-Charmers.png">
... </div>
... """)
>>> find_target_id(soup=section, name="img", key="id")
'apple-touch-icon'
>>> find_target_id(soup=section, name="img", key="src")
'https://iwasakishuto.github.io/images/apple-touch-icon/Python-Charmers.png'
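As the example shows, key can name any attribute, not only id. The lookup can be sketched with Beautiful Soup's Tag.get; the helper name find_attribute is hypothetical and this is not the pycharmers implementation.

```python
from bs4 import BeautifulSoup


def find_attribute(soup, name, key, default=None, **filters):
    """Return attribute key of the first matching element, or default (hypothetical helper)."""
    target = soup.find(name, **filters)
    if target is None:
        return default
    # Tag.get() returns default when the attribute is missing.
    return target.get(key, default)


soup = BeautifulSoup(
    '<div><img id="apple-touch-icon" src="Python-Charmers.png"></div>', "html.parser"
)
print(find_attribute(soup, "img", key="id"))
print(find_attribute(soup, "img", key="src"))
```

Note that Beautiful Soup returns a list for multi-valued attributes such as class, so a caller expecting a string may need to join the values.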
-
pycharmers.utils.soup_utils.get_soup(url, driver=None, features='lxml', timeout=1)
Fetch the page source from url and parse it into soup.
- Parameters
url (str) – URL.
driver (WebDriver) – webdriver
features (str) – Desirable features of the parser to be used. This may be the name of a specific parser (“lxml”, “lxml-xml”, “html.parser”, or “html5lib”) or it may be the type of markup to be used (“html”, “html5”, “xml”). It’s recommended that you name a specific parser, so that Beautiful Soup gives you the same results across platforms and virtual environments.
- Returns
A data structure representing a parsed HTML or XML document.
- Return type
BeautifulSoup
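A fetch-and-parse step of this kind can be sketched with the standard library and Beautiful Soup. This is only an illustration under stated assumptions: the helper name fetch_soup is hypothetical, it uses urllib rather than whatever pycharmers uses internally, it ignores the driver option, and it defaults to "html.parser" so that it runs without lxml installed.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url: str, features: str = "html.parser", timeout: float = 1) -> BeautifulSoup:
    """Download url and parse the response body (hypothetical helper, no WebDriver support)."""
    with urlopen(url, timeout=timeout) as response:
        markup = response.read()  # raw bytes; Beautiful Soup detects the encoding
    return BeautifulSoup(markup, features)


# A data: URL keeps this demonstration offline; a real call would pass an http(s) URL.
soup = fetch_soup("data:text/html,<p>hi</p>")
print(soup.p.get_text())
```

Naming the parser explicitly, as the features description recommends, keeps the parsed tree consistent across environments.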