{"id":10262,"date":"2024-07-14T11:01:53","date_gmt":"2024-07-14T09:01:53","guid":{"rendered":"https:\/\/via-internet.de\/blog\/?p=10262"},"modified":"2024-07-14T11:35:37","modified_gmt":"2024-07-14T09:35:37","slug":"daily-ai-analyse-webpages-with-ai","status":"publish","type":"post","link":"https:\/\/via-internet.de\/blog\/2024\/07\/14\/daily-ai-analyse-webpages-with-ai\/","title":{"rendered":"Daily AI: Analyse WebPages with AI"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) by providing powerful capabilities for understanding and generating human language. Open-source LLMs have democratized access to these technologies, allowing developers and researchers to innovate and apply these models in various domains. In this blog post, we will explore Ollama, a framework for working with LLMs, and demonstrate how to load webpages, parse them, build embeddings, and query the content using Ollama.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Understanding Large Language Models (LLMs)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">LLMs are neural networks trained on vast amounts of text data to understand and generate human language. They can perform tasks such as translation, summarization, question answering, and more. Popular LLMs include GPT-3, BERT, and their open-source counterparts like GPT-Neo and BERT variants. These models have diverse applications, from chatbots to automated content generation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Introducing Ollama<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Ollama is an open-source framework designed to simplify the use of LLMs in various applications. It provides tools for training, fine-tuning, and deploying LLMs, making it easier to integrate these powerful models into your projects. 
With Ollama, you can leverage the capabilities of LLMs to build intelligent applications that understand and generate human language.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Example<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The following example from the <a href=\"https:\/\/github.com\/ollama\/ollama\/blob\/main\/docs\/tutorials\/langchainpy.md\">ollama documentation<\/a> demonstrates how to use the LangChain framework in conjunction with the Ollama library to load a web page, process its content, create embeddings, and perform a query on the processed data. Below is a detailed explanation of the script&#8217;s functionality and the technologies used.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Technologies Used<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>LangChain<\/strong>: A framework for building applications powered by large language models (LLMs). It provides tools for loading documents, splitting text, creating embeddings, and querying data.<\/li>\n\n\n\n<li><strong>Ollama<\/strong>: A library for working with LLMs and embeddings. In this script, it&#8217;s used to generate embeddings for text data.<\/li>\n\n\n\n<li><strong>BeautifulSoup (bs4)<\/strong>: A library used for parsing HTML and XML documents. It\u2019s essential for loading and processing web content.<\/li>\n\n\n\n<li><strong>ChromaDB<\/strong>: A vector database used for storing and querying embeddings. 
It allows efficient similarity searches.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Code Breakdown<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Imports and Setup<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The script starts by importing the necessary modules and classes: <code>Ollama<\/code>, <code>WebBaseLoader<\/code>, <code>RecursiveCharacterTextSplitter<\/code>, <code>OllamaEmbeddings<\/code>, <code>Chroma<\/code>, and <code>RetrievalQA<\/code>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from langchain_community.llms import Ollama\nfrom langchain_community.document_loaders import WebBaseLoader\nfrom langchain.text_splitter import RecursiveCharacterTextSplitter\nfrom langchain_community.embeddings import OllamaEmbeddings\nfrom langchain_community.vectorstores import Chroma\nfrom langchain.chains import RetrievalQA<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Loading the Web Page<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The script uses <code>WebBaseLoader<\/code> to load the content of a webpage. 
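As background on what a loader such as `WebBaseLoader` has to do: the page is fetched over HTTP and its markup is reduced to plain text. The stdlib-only sketch below (the `strip_html` helper and the sample markup are our own illustration, not part of the WebBaseLoader API, which uses BeautifulSoup internally) shows the text-extraction half:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self._parts.append(data)

def strip_html(html: str) -> str:
    """Reduce an HTML document to whitespace-normalized plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser._parts).split())

# Fetching the page itself needs network access, e.g.:
#   import urllib.request
#   html = urllib.request.urlopen(url).read().decode("utf-8")
sample = "<html><body><h1>The Odyssey</h1><script>track()</script><p>Tell me, O Muse</p></body></html>"
print(strip_html(sample))  # The Odyssey Tell me, O Muse
```

This only sketches the idea; the real loader also handles encodings, headers, and malformed markup.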
In this case, it loads the text of &#8220;The Odyssey&#8221; by Homer from Project Gutenberg.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(\"- get web page\")\n\nloader = WebBaseLoader(\"https:\/\/www.gutenberg.org\/files\/1727\/1727-h\/1727-h.htm\")\ndata = loader.load()<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Splitting the Document<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Due to the large size of the document, it is split into smaller chunks using <code>RecursiveCharacterTextSplitter<\/code>. This ensures that the text can be processed more efficiently.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(\"- split documents\")\n\ntext_splitter=RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\nall_splits = text_splitter.split_documents(data)<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Creating Embeddings and Storing Them<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The script creates embeddings for the text chunks using the Ollama library and stores them in ChromaDB, a vector database. 
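What the splitter does can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in for <code>RecursiveCharacterTextSplitter<\/code> (the real class first tries separators such as paragraph and sentence boundaries before falling back to characters); the <code>split_text<\/code> helper is hypothetical:

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters, stepping back by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Tiny demonstration with 4-character chunks overlapping by 1:
chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

The overlap parameter exists so that a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks; the example script uses chunk_size=500 with no overlap.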
This step involves instantiating an embedding model (<code>nomic-embed-text<\/code>) and using it to generate embeddings for each text chunk.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(\"- create vectorstore\")\n\noembed = OllamaEmbeddings(base_url=\"http:\/\/localhost:11434\", model=\"nomic-embed-text\")\nvectorstore = Chroma.from_documents(documents=all_splits, embedding=oembed)<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Performing a Similarity Search<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A question is formulated, and the script uses the vector database to perform a similarity search. It retrieves chunks of text that are semantically similar to the question.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(\"- ask for similarities\")\n\nquestion=\"Who is Neleus and who is in Neleus' family?\"\ndocs = vectorstore.similarity_search(question)\nnrofdocs=len(docs)\nprint(f\"{question}: {nrofdocs}\")<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Creating an Ollama Instance and Defining a Retrieval Chain<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The script initializes an instance of the Ollama model and sets up a retrieval-based question-answering (QA) chain. 
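Under the hood, the similarity search above ranks chunks by vector distance. A minimal sketch of cosine-similarity ranking in plain Python (the toy three-dimensional vectors and the <code>top_k<\/code> helper are illustrative only; Chroma's actual index is far more sophisticated):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k chunk ids whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda cid: cosine_similarity(query, store[cid]), reverse=True)
    return ranked[:k]

# Toy vector store: chunk id -> embedding (real embeddings have hundreds of dimensions)
store = {
    "chunk-neleus":  [0.9, 0.1, 0.0],
    "chunk-cyclops": [0.0, 1.0, 0.2],
    "chunk-ithaca":  [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, store))  # ['chunk-neleus', 'chunk-ithaca']
```

The QA chain then feeds the top-ranked chunks, together with the question, into the LLM as context.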
This chain is used to process the question and retrieve the relevant parts of the document.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(\"- create ollama instance\")\nollama = Ollama(\n    base_url='http:\/\/localhost:11434',\n    model=\"llama3\"\n)\n\nprint(\"- get qachain\")\nqachain=RetrievalQA.from_chain_type(ollama, retriever=vectorstore.as_retriever())<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Running the Query<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, the script invokes the QA chain with the question and prints the result.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(\"- run query\")\nres = qachain.invoke({\"query\": question})\n\nprint(res['result'])<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Result<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Now let&#8217;s look at the impressive result:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"368\" src=\"https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-10.59.11-1024x368.png\" alt=\"\" class=\"wp-image-10265\" srcset=\"https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-10.59.11-1024x368.png 1024w, https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-10.59.11-300x108.png 300w, https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-10.59.11-768x276.png 768w, 
https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-10.59.11-1536x553.png 1536w, https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-10.59.11-2048x737.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Try another example: ask a Wikipedia page<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In this example, we are going to use LangChain and Ollama to learn about something just a touch more recent. In August 2023, there was a series of wildfires on Maui. There is no way an LLM trained before that time can know about this, since its training data would not include anything as recent as that.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So we can find the <a href=\"https:\/\/en.wikipedia.org\/wiki\/2023_Hawaii_wildfires\">Wikipedia article about the fires<\/a> and ask questions about the contents.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">url = \"https:\/\/en.wikipedia.org\/wiki\/2023_Hawaii_wildfires\"<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">question=\"When was Hawaii's request for a major disaster declaration approved?\"<\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"200\" src=\"https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-11.31.10-1024x200.png\" alt=\"\" class=\"wp-image-10267\" 
srcset=\"https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-11.31.10-1024x200.png 1024w, https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-11.31.10-300x59.png 300w, https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-11.31.10-768x150.png 768w, https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-11.31.10-1536x300.png 1536w, https:\/\/via-internet.de\/blog\/wp-content\/uploads\/2024\/07\/Bildschirmfoto-2024-07-14-um-11.31.10-2048x400.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) by providing powerful capabilities for understanding and generating human language. Open-source LLMs have democratized access to these technologies, allowing developers and researchers to innovate and apply these models in various domains. In this blog post, we will explore Ollama, a framework for working with LLMs, and demonstrate how to load webpages, parse them, build embeddings, and query the content using Ollama. Understanding Large Language Models (LLMs) LLMs are neural networks trained on vast amounts of text data to understand and generate human language. 
They can perform tasks [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[135,164,65],"tags":[],"class_list":["post-10262","post","type-post","status-publish","format-standard","hentry","category-daily","category-ollama","category-python"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts\/10262","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/comments?post=10262"}],"version-history":[{"count":3,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts\/10262\/revisions"}],"predecessor-version":[{"id":10269,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts\/10262\/revisions\/10269"}],"wp:attachment":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/media?parent=10262"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/categories?post=10262"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/tags?post=10262"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}