
Data Acquisition

This step scrapes article URLs from the cryptocurrency news website thenewscrypto.com for five specified coins: Dogecoin, Bitcoin, Ethereum, Solana, and Hamster. The script fetches up to 10 article links per coin from the first page of search results and saves them into individual CSV files (e.g., dogecoin_news_urls.csv). It incorporates error handling, logging, and a user-agent header that mimics a browser request, making the scraping robust and reliable. A sketch of the overall flow appears after the feature list below.

Key Features

  • Scrapes only the first page of search results per coin.

  • Limits extraction to 10 articles per page to maintain consistency.

  • Uses a user-agent header to avoid being blocked by the website.

  • Implements logging to track progress and errors.

  • Saves results in CSV format with coin names and URLs.
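
Taken together, the description above suggests a flow like the following. This is a minimal sketch only: the function names get_article_link() and scrape_articles_for_coin() appear in the sections below, but their exact signatures and the HEADERS, COINS, and MAX_ARTICLES values are assumptions, not the project's actual code.

import csv
import logging

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s - %(levelname)s - %(message)s")

# Assumed constants; the actual script may define them differently.
HEADERS = {"User-Agent": "Mozilla/5.0"}
COINS = ["dogecoin", "bitcoin", "ethereum", "solana", "hamster"]
MAX_ARTICLES = 10


def get_article_link(coin, page=1):
    """Fetch one page of search results and return up to MAX_ARTICLES article URLs."""
    url = f"https://thenewscrypto.com/page/{page}/?s={coin}"
    logging.info("Scraping %s - Page %d", coin, page)
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        logging.error("Request failed for %s: %s", url, exc)
        return []
    soup = BeautifulSoup(response.text, "html.parser")
    titles = soup.find_all("h3", class_="card-title fs-17")
    return [t.find("a")["href"] for t in titles[:MAX_ARTICLES] if t.find("a")]


def scrape_articles_for_coin(coin):
    """Write the scraped links for one coin to <coin>_news_urls.csv."""
    links = get_article_link(coin)
    with open(f"{coin}_news_urls.csv", "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Coin", "Article URL"])
        for link in links:
            writer.writerow([coin, link])
    logging.info("Found %d articles for %s", len(links), coin)


for coin in COINS:
    scrape_articles_for_coin(coin)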

  1. Requests: https://pypi.org/project/requests/

  • Used for sending HTTP requests to the website and retrieving the HTML content of the search results.

  • Used in the get_article_link() function to fetch the webpage content for a given coin and page number via requests.get(url, headers=HEADERS, timeout=10).

  • Efficiency:

  • Requests simplifies HTTP interactions, abstracting away low-level socket programming and connection handling.

  • This makes it highly efficient for fetching web pages compared to manual alternatives like urllib.

  • Reliability:

  • The library handles redirects, connection errors, and timeouts gracefully.

  • The timeout=10 parameter ensures the script doesn’t hang indefinitely, while the try-except block catches RequestException errors (e.g., network failures); a sketch of this pattern follows the usage example below.

  • Flexibility:

  • The ability to pass custom headers (e.g., HEADERS) is crucial for avoiding anti-scraping measures like IP blocking, as many websites reject requests without a valid user-agent.

  • Without requests, the script would require complex manual HTTP request construction, reducing readability and increasing the likelihood of errors. Its use ensures reliable page retrieval, critical for subsequent parsing.

import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumed browser-like user agent; the script's HEADERS may differ

response = requests.get("https://thenewscrypto.com/page/1/?s=bitcoin",
                        headers=HEADERS, timeout=10)

print(response.text)  # Outputs the HTML content of the Bitcoin search results page
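
To illustrate the reliability points above, here is a hedged sketch of the try-except pattern around requests.get described earlier; the exact structure in the script may differ, and the HEADERS value is an assumption.

import logging
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumed value

url = "https://thenewscrypto.com/page/1/?s=bitcoin"
try:
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions
except requests.RequestException as exc:  # covers timeouts, connection errors, HTTP errors
    logging.error("Failed to fetch %s: %s", url, exc)
else:
    html = response.text  # HTML of the search results page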

  2. csv: https://docs.python.org/3/library/csv.html

  • Provides functionality to read and write CSV (Comma-Separated Values) files.

  • Used in the scrape_articles_for_coin() function to save scraped article URLs into a CSV file (e.g., dogecoin_news_urls.csv) with columns "Coin" and "Article URL" via csv.writer.

  • CSV Writer: Creates a writer object to write rows to the CSV file.

  • newline="": Ensures consistent line endings across platforms (e.g., Windows, Linux).

  • Encoding: Uses utf-8 to support special characters in URLs or coin names.

  • Simplicity: csv provides a straightforward interface for writing tabular data, eliminating the need to manually format comma-separated strings. This reduces errors like missing delimiters or improper escaping.

  • Portability: The library ensures the output CSV files are compatible with standard tools (e.g., Excel, pandas), making them reusable for further processing.

  • Robustness: Handles edge cases like special characters or quotes in URLs automatically, thanks to utf-8 encoding and built-in escaping mechanisms.

  • Context in Script: Without csv, the script would need to manually write comma-separated lines (e.g., f.write(f"{coin},{link}\n")), increasing complexity and error risk. Its use ensures clean, structured output files, critical for downstream tasks like PDF conversion or analysis.

with open("bitcoin_news_urls.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)

    writer.writerow(["Coin", "Article URL"])
    writer.writerow(["bitcoin", "https://thenewscrypto.com/bitcoin-news"])

  3. logging: https://docs.python.org/3/library/logging.html

  • Enables logging of events (info, warnings, errors) during execution for debugging and monitoring.

  • Logs messages like "Scraping bitcoin - Page 1", or errors if a request fails.

  • Configured at the start with logging.basicConfig() and used throughout to log scraping progress (e.g., logging.info), warnings (e.g., logging.warning), and errors (e.g., logging.error).

  • Level Setting: Sets the logging level to INFO so that info, warning, and error messages are all captured.

  • Format: Customizes log messages with timestamp, level, and message (%(asctime)s - %(levelname)s - %(message)s).

  • Log Types: Uses info for progress, warning for non-critical issues, and error for exceptions.

  • Debugging: Provides a detailed trace of the scraping process (e.g., "Scraping bitcoin - Page 1"), making it easy to identify where failures occur (e.g., HTTP errors).

  • Monitoring: Logs the number of articles found per page, helping verify the script’s success without manual inspection of output files.

  • Robustness: Captures exceptions (e.g., network timeouts) with logging.error, ensuring issues are documented rather than silently failing.

  • Without logging, debugging would rely on print statements, which lack timestamps and severity levels, making it harder to track issues in a long-running process. Its use enhances maintainability and reliability, especially for scraping multiple coins.

  • Example:

logging.info("Scraping bitcoin - Page 1")
# Output: 2025-03-26 10:00:00,123 - INFO - Scraping bitcoin - Page 1

logging.error("Network error occurred")
# Output: 2025-03-26 10:00:01,456 - ERROR - Network error occurred

  4. bs4 (Beautiful Soup): https://www.crummy.com/software/BeautifulSoup/bs4/doc/

  • Used for parsing the HTML content of the scraped pages and extracting article links.

  • Ease of Use: BeautifulSoup simplifies HTML parsing with an intuitive API, allowing quick extraction of article links without writing complex regular expressions or DOM traversal logic.

  • Flexibility: Handles malformed HTML gracefully, which is common on real-world websites, ensuring the script doesn’t break if the page structure varies slightly.

  • Precision: The use of class_="card-title fs-17" targets specific article titles, reducing the risk of scraping irrelevant links (e.g., ads or navigation).

  • Without bs4, parsing HTML would require manual string manipulation or a less robust library like re, increasing complexity and fragility. Its effectiveness lies in reliably extracting structured data from unstructured HTML, making it indispensable for web scraping.

from bs4 import BeautifulSoup

html = '<h3 class="card-title fs-17"><a href="https://thenewscrypto.com/bitcoin-news">Bitcoin News</a></h3>'

soup = BeautifulSoup(html, "html.parser")

article = soup.find("h3", class_="card-title fs-17")

link = article.find("a")["href"]

print(link)  # Outputs: https://thenewscrypto.com/bitcoin-news
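
Building on the single-title example above, a full search-results page can be handled with find_all. This is a hedged sketch of how the per-page extraction with the 10-article cap from the feature list might look; the extract_links() helper and the page_html variable are illustrative, not the project's actual code.

from bs4 import BeautifulSoup

MAX_ARTICLES = 10  # cap from the feature list above


def extract_links(page_html):
    """Return up to MAX_ARTICLES article URLs from one search-results page."""
    soup = BeautifulSoup(page_html, "html.parser")
    titles = soup.find_all("h3", class_="card-title fs-17")
    links = []
    for title in titles[:MAX_ARTICLES]:
        anchor = title.find("a")
        if anchor and anchor.get("href"):
            links.append(anchor["href"])
    return links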

5. Text Extraction:

A. trafilatura: https://trafilatura.readthedocs.io/en/latest/

A web scraping and content extraction library designed to fetch and extract clean, readable text from HTML pages, removing boilerplate (e.g., ads, navigation). It specializes in extracting the main article content efficiently, which makes it ideal for news articles.

  • Used in process_coin() to download article pages with trafilatura.fetch_url(url) and extract their main content with trafilatura.extract(downloaded); a minimal sketch follows the Effectiveness notes below.

  • Key Features Utilized:

  • URL Fetching: Downloads HTML content from a given URL.

  • Content Extraction: Strips away non-essential elements to isolate the main article text.

Effectiveness

  • Specialization: Unlike general-purpose scraping tools like BeautifulSoup, trafilatura is optimized for extracting main article content, making it highly effective for news articles where boilerplate removal is critical. It uses heuristics and machine learning to identify primary content, reducing manual parsing effort.

  • Robustness: Handles diverse webpage structures and gracefully returns None if downloading or extraction fails, allowing the script to continue without crashing. This is evident in the checks for downloaded and content.

  • Efficiency: Combines fetching and extraction in a streamlined process, minimizing the need for separate HTTP requests (e.g., via requests) and parsing steps. This reduces latency compared to a requests + BeautifulSoup approach.

  • Context in Script: Without trafilatura, the script would need to fetch pages with requests and manually parse HTML with BeautifulSoup, requiring custom logic to filter boilerplate. Its use ensures clean, relevant text extraction, crucial for generating meaningful PDFs for downstream analysis.
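
A minimal sketch of the fetch-and-extract pattern described above; the URL is a placeholder, and the None checks mirror the checks for downloaded and content mentioned earlier.

import trafilatura

url = "https://thenewscrypto.com/bitcoin-news"  # placeholder article URL

downloaded = trafilatura.fetch_url(url)          # returns None if the download fails
if downloaded is not None:
    content = trafilatura.extract(downloaded)    # returns None if extraction fails
    if content is not None:
        print(content[:200])                     # first 200 characters of the clean article text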

B. pdfkit: https://pypi.org/project/pdfkit/

Converts HTML files to PDF format using the wkhtmltopdf command-line tool (requires wkhtmltopdf to be installed separately).
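
A hedged example of the conversion step; the file names are placeholders, and wkhtmltopdf must be installed and available on the PATH.

import pdfkit

# Convert a saved HTML article into a PDF; wkhtmltopdf does the actual rendering.
pdfkit.from_file("bitcoin_article.html", "bitcoin_article.pdf")

# Alternatively, convert an HTML string (e.g., extracted text wrapped in minimal markup).
pdfkit.from_string("<h1>Bitcoin News</h1><p>Article text...</p>", "bitcoin_article_from_string.pdf")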

