{"id":9715,"date":"2023-10-08T18:28:05","date_gmt":"2023-10-08T16:28:05","guid":{"rendered":"https:\/\/via-internet.de\/blog\/?p=9715"},"modified":"2023-10-14T17:36:41","modified_gmt":"2023-10-14T15:36:41","slug":"the-complete-beautifulsoup-cheatsheet-with-examples","status":"publish","type":"post","link":"https:\/\/via-internet.de\/blog\/2023\/10\/08\/the-complete-beautifulsoup-cheatsheet-with-examples\/","title":{"rendered":"BeautifulSoup | Complete Cheatsheet with Examples"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Installation<\/h2>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip install beautifulsoup4\n<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from bs4 import BeautifulSoup\n<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Creating a BeautifulSoup Object<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Parse HTML string:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">html = \"&lt;p>Example paragraph&lt;\/p>\"\nsoup = BeautifulSoup(html, 'html.parser')<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Parse from file:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">with open(\"index.html\") as file:\n  soup = BeautifulSoup(file, 'html.parser')<\/pre>\n\n\n\n<h2 
class=\"wp-block-heading\">BeautifulSoup Object Types<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When parsing documents and navigating the parse trees, you will encounter the following main object types:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tag<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A Tag corresponds to an HTML or XML tag in the original document:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup = BeautifulSoup('&lt;p>Hello World&lt;\/p>')\np_tag = soup.p\n\np_tag.name # 'p'\np_tag.string # 'Hello World'<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Tags contain nested Tags and NavigableStrings.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">NavigableString<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A NavigableString represents text content without tags:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup = BeautifulSoup('Hello World')\ntext = soup.string\n\ntext # 'Hello World'\ntype(text) # bs4.element.NavigableString<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">BeautifulSoup<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The BeautifulSoup object represents the parsed document as a whole. 
It is the root of the tree:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup = BeautifulSoup('&lt;html>...&lt;\/html>')\n\nsoup.name # '[document]'\nsoup.head # &lt;head> Tag element\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Comment<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Comments in HTML are also available as Comment objects:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;!-- This is a comment -->\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Python:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import re\n\ncomment = soup.find(string=re.compile('This is'))\ntype(comment) # bs4.element.Comment\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Knowing these core object types helps when analyzing, searching, and navigating parsed documents.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Searching the Parse Tree<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">By Name<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">HTML:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;div>\n  &lt;p>Paragraph 1&lt;\/p>\n  &lt;p>Paragraph 2&lt;\/p>\n&lt;\/div>\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Python:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" 
data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">paragraphs = soup.find_all('p')\n# &lt;p>Paragraph 1&lt;\/p>, &lt;p>Paragraph 2&lt;\/p>\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">By Attributes<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">HTML:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;div id=\"content\">\n  &lt;p>Paragraph 1&lt;\/p>\n&lt;\/div>\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Python:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">div = soup.find(id=\"content\")\n# &lt;div id=\"content\">...&lt;\/div>\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">By Text<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">HTML:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;p>This is some text&lt;\/p>\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Python:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Passing the tag name together with string= returns the Tag;\n# string= alone would return only the matching NavigableString.\np = soup.find('p', string=\"This is some text\")\n# &lt;p>This is some text&lt;\/p>\n<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Searching with CSS Selectors<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">CSS selectors provide a very powerful way 
to search for elements within a parsed document.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Some examples of CSS selector syntax:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">By Tag Name<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select all &lt;p> tags:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"p\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">By ID<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select element with ID &#8220;main&#8221;:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"#main\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">By Class Name<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select elements with class &#8220;article&#8221;:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\".article\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">By Attribute<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select tags with a &#8220;data-category&#8221; attribute:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"[data-category]\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Descendant Combinator<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select paragraphs inside 
divs:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"div p\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Child Combinator<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select paragraphs that are direct children of a div:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"div > p\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Adjacent Sibling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select an h2 that immediately follows an h1:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"h1 + h2\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">General Sibling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select every h2 sibling that comes anywhere after an h1:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"h1 ~ h2\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">By Text<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select elements containing text (the older :contains() spelling is deprecated in favor of :-soup-contains()):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\":-soup-contains('Some text')\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">By Attribute 
Value<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select input with type submit:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"input[type='submit']\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Pseudo-classes<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select each paragraph that is the first &lt;p> within its parent:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"p:first-of-type\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Chaining<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Select the first paragraph that is a direct child of an article:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.select(\"article > p:nth-of-type(1)\")\n<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Accessing Data<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">HTML:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;p class=\"content\">Some text&lt;\/p>\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Python:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">p = soup.find('p')\np.name # \"p\"\np.attrs # {\"class\": [\"content\"]} (class is multi-valued, so it is a list)\np.string # 
\"Some text\"\n<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">The Power of find_all()<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;find_all()&nbsp;method is one of the most useful and versatile searching methods in BeautifulSoup.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Returns All Matches<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">find_all()&nbsp;returns a list of all matching elements:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">all_paras = soup.find_all('p')\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This gives you all paragraphs on a page.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Flexible Queries<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You can pass a wide range of queries to&nbsp;find_all(), including:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Name &#8211; <code>find_all('p')<\/code><\/li>\n<li>Attributes &#8211; <code>find_all('a', class_='external')<\/code><\/li>\n<li>Text &#8211; <code>find_all(string=re.compile('summary'))<\/code><\/li>\n<li>Limit &#8211; <code>find_all('p', limit=2)<\/code><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Useful Features<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Some useful things you can do with&nbsp;find_all():<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Get a count &#8211; <code>len(soup.find_all('p'))<\/code><\/li>\n<li>Iterate through results &#8211; <code>for p in soup.find_all('p'):<\/code><\/li>\n<li>Convert to text &#8211; <code>[p.get_text() for p in soup.find_all('p')]<\/code><\/li>\n<li>Extract attributes &#8211; <code>[a['href'] for a in soup.find_all('a', href=True)]<\/code><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Why It&#8217;s Useful<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In summary,&nbsp;find_all()&nbsp;is useful because:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>It returns all matching elements.<\/li>\n<li>It supports diverse and powerful queries.<\/li>\n<li>It makes extracting and processing result data easy.<\/li>\n<\/ul>\n\n\n\n<p 
class=\"wp-block-paragraph\">Whenever you need to get a collection of elements from a parsed document,&nbsp;find_all()&nbsp;will likely be your go-to tool.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Navigating Trees<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Traverse up and sideways through related elements.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Modifying the Parse Tree<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">BeautifulSoup provides several methods for editing and modifying the parsed document tree.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">HTML:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;p>Original text&lt;\/p>\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Python:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">p = soup.find('p')\np.string = \"New text\"\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Edit Tag Names<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Change an existing tag name:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tag = soup.find('span')\ntag.name = 'div'\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Edit Attributes<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Add, modify or delete attributes of a tag:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" 
data-enlighter-group=\"\">tag['class'] = 'header' # set attribute\ntag['id'] = 'main'\n\ndel tag['class'] # delete attribute\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Edit Text<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Change text of a tag:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tag.string = \"New text\"\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Append text to a tag:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tag.append(\"Additional text\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Insert Tags<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Insert a new tag:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">new_tag = soup.new_tag(\"h1\")\ntag.insert_before(new_tag)\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Delete Tags<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Remove a tag entirely:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tag.extract()\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Wrap\/Unwrap Tags<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Wrap another tag around:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" 
data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tag.wrap(soup.new_tag('div'))\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Unwrap a tag, replacing it with its contents:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tag.unwrap()\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Modifying the parse tree is very useful for cleaning up scraped data or extracting the parts you need.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Outputting HTML<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Input HTML:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;p>Hello World&lt;\/p>\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Python:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(soup.prettify())\n\n# &lt;p>\n#  Hello World\n# &lt;\/p>\n<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Integrating with Requests<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Fetch a page:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import requests\nfrom bs4 import BeautifulSoup\n\nres = requests.get(\"https:\/\/example.com\")\nres.raise_for_status() # stop early on HTTP errors\nsoup = BeautifulSoup(res.text, 'html.parser')\n<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Parsing Only Parts of a Document<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When dealing with large 
documents, you may want to parse only a fragment rather than the whole thing. BeautifulSoup supports this through the SoupStrainer class, which takes the same arguments as find_all() rather than CSS selectors.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There are a few ways to parse only parts of a document:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Basic Usage<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create a strainer for one tag name and pass it as parse_only:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from bs4 import BeautifulSoup, SoupStrainer\n\nonly_tables = SoupStrainer(\"table\")\n# parse_only is ignored by html5lib, so name a parser that supports it\nsoup = BeautifulSoup(doc, 'html.parser', parse_only=only_tables)\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This will parse only the &lt;table> tags from the document.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">By Tag Name<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Parse only specific tags:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">only_divs = SoupStrainer(\"div\")\nsoup = BeautifulSoup(doc, 'html.parser', parse_only=only_divs)\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">By Function<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Pass a function that decides which strings should be kept:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def is_short_string(string):\n  # string can be None, so guard before calling len()\n  return string is not None and len(string) &lt; 20\n\nonly_short_strings = SoupStrainer(string=is_short_string)\nsoup = BeautifulSoup(doc, 'html.parser', parse_only=only_short_strings)\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This keeps only strings shorter than 20 characters.<\/p>\n\n\n\n<h3 
class=\"wp-block-heading\">By Attributes<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Parse tags that contain specific attributes:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">has_data_attr = SoupStrainer(attrs={\"data-category\": True})\nsoup = BeautifulSoup(doc, 'html.parser', parse_only=has_data_attr)\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Multiple Conditions<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You can combine multiple conditions in a single strainer:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">strainer = SoupStrainer(\"div\", id=\"main\")\nsoup = BeautifulSoup(doc, 'html.parser', parse_only=strainer)\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This will parse only &lt;div id=\"main\"> elements.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Parsing only the parts you need can help reduce memory usage and improve performance when scraping large documents.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Dealing with Encoding<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When parsing documents, you may encounter encoding issues. 
Here are some ways to handle encoding:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Specify at Parse Time<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Pass the from_encoding parameter when creating the BeautifulSoup object:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup = BeautifulSoup(doc, from_encoding='utf-8')\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This handles any decoding needed when initially parsing the document.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Encode Tag Contents<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You can encode the contents of a tag:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">tag.string.encode(\"utf-8\")\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Use this when outputting tag strings.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Encode Entire Document<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To encode the entire BeautifulSoup document:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">soup.encode(\"utf-8\")\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This returns a byte string with the encoded document.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pretty Print with Encoding<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Specify encoding when pretty printing<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" 
data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># prettify() takes encoding=, not encoder=, and then returns bytes\npretty_bytes = soup.prettify(encoding=\"utf-8\")\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Unicode Dammit<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">BeautifulSoup&#8217;s <em>UnicodeDammit<\/em> class can detect and convert incoming documents to Unicode:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from bs4 import UnicodeDammit\n\ndammit = UnicodeDammit(doc)\ntext = dammit.unicode_markup # decoded str, not a BeautifulSoup object\ndammit.original_encoding # the encoding that was detected\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This converts even poorly encoded documents to Unicode.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Properly handling encoding ensures your scraped data is decoded and output correctly when using BeautifulSoup.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Installation Creating a BeautifulSoup Object Parse HTML string: Parse from file: BeautifulSoup Object Types When parsing documents and navigating the parse trees, you will encounter the following main object types: Tag A Tag corresponds to an HTML or XML tag in the original document: Tags contain nested Tags and NavigableStrings. NavigableString A NavigableString represents text content without tags: BeautifulSoup The BeautifulSoup object represents the parsed document as a whole. It is the root of the tree: Comment Comments in HTML are also available as Comment objects: Knowing these core object types helps when analyzing, searching, and navigating parsed documents. 
Searching [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[158,65],"tags":[],"class_list":["post-9715","post","type-post","status-publish","format-standard","hentry","category-beautifulsoup","category-python"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts\/9715","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/comments?post=9715"}],"version-history":[{"count":6,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts\/9715\/revisions"}],"predecessor-version":[{"id":9724,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts\/9715\/revisions\/9724"}],"wp:attachment":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/media?parent=9715"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/categories?post=9715"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/tags?post=9715"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}