# Data Preprocessing

This step cleans, tokenizes, stems, and lemmatizes the text while preserving cryptocurrency-specific terms, preparing it for downstream tasks like vectorization or sentiment analysis. It processes the text sentence-by-sentence, provides progress updates, and saves the cleaned output to cleaned\_data.txt.

Key Features:

1. Splits text into sentences using a simple newline split for efficiency.
2. Applies comprehensive cleaning: lowercase conversion, special character removal, tokenization, stemming, and lemmatization.
3. Uses a progress bar to monitor processing of large datasets.
4. Preserves a predefined set of cryptocurrency terms to maintain domain-specific context.

Libraries Used:

1. **nltk (Natural Language Toolkit):**  <https://www.nltk.org/>

This library is used for tasks like tokenization, stopword removal, and lemmatization. It provides pre-built tools and resources (e.g., tokenizers, stopwords, lemmatizers) that simplify text preprocessing, avoiding the need to write these from scratch.

* Provides a broad suite of natural language processing tools and resources, serving as the backbone for tokenization, stopwords, and stemming/lemmatization in this script.

* Used indirectly through its submodules (*nltk.tokenize*, *nltk.corpus*, *nltk.stem*) and to download required resources (*punkt*, *stopwords*, *wordnet*) via *nltk.download()*.

* **Resource Downloads:** Ensures necessary datasets are available for tokenization and lemmatization.

* **Foundation:** Acts as the parent library for specialized NLP tasks.

* **Versatility:** *nltk* is a comprehensive NLP library, offering pre-built tools that save time compared to custom implementations. Its modular design allows specific submodules to handle distinct tasks.

* **Reliability:** The downloaded resources (*punkt, stopwords, wordnet*) are well-tested and widely used, ensuring consistent preprocessing results.

* Without **nltk**, the script would require manual tokenization, stopword lists, and lemmatization logic, significantly increasing complexity. Its role is foundational, enabling the use of specialized submodules for efficient text processing.

**Example:**

```
import nltk

# Download tokenization, stopword, and lemmatization resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

```

2. **re:** <https://docs.python.org/3/library/re.html>

* Provides regular expression matching operations to remove unwanted characters from text.
* Used in *preprocess\_sentence()* to remove non-ASCII characters and punctuation via *re.sub()*.
  * **Pattern Matching:** Removes specific character sets (e.g., non-ASCII, non-alphanumeric) using regex patterns.
  * **Substitution:** Replaces matched patterns with spaces or empty strings.
* **Precision:** *re* allows fine-grained control over text cleaning, effectively targeting non-ASCII characters (`[^\x00-\x7F]`) and punctuation (`[^a-zA-Z0-9\s]`).
* **Speed:** Regular expressions are optimized for string operations, making them faster than iterative character checks for large texts.

Without *re*, cleaning special characters would require slower, less flexible methods (e.g., multiple *replace()* calls). Its use ensures robust text normalization, critical for consistent tokenization and analysis.

```
import re

sentence = "Bitcoin’s price rose â€™ today!"

# Remove non-ASCII characters (replaced with spaces)
sentence = re.sub(r"[^\x00-\x7F]+", " ", sentence)

# Remove punctuation (everything except letters, numbers, and whitespace)
sentence = re.sub(r"[^a-zA-Z0-9\s]", "", sentence)

# Collapse the leftover whitespace
sentence = re.sub(r"\s+", " ", sentence).strip()

print(sentence)  # Outputs: "Bitcoin s price rose today"

```

3. **string:** <https://docs.python.org/3/library/string.html>

* Supplies string manipulation utilities, specifically for punctuation removal.
* **Usage in Code:** Used in *preprocess\_sentence()* to create a translation table for removing punctuation via *str.maketrans("", "", string.punctuation)*.
* **Punctuation Constants:** Provides *string.punctuation*, a predefined set of punctuation characters (i.e., `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``).
* **Translation Table:** Generates a mapping to strip punctuation efficiently.
* **Simplicity:** *string.punctuation* offers a ready-made list of characters to remove, avoiding the need to define them manually.
* **Efficiency:** The *translate()* method with a translation table is faster than regex or iterative replacements for punctuation removal.

While *re* could handle punctuation removal, *string* provides a cleaner, more efficient alternative for this specific task. Its use complements *re* by focusing on punctuation, enhancing the overall cleaning process.

```
import string

sentence = "Bitcoin, Ethereum, and Solana!"

# Remove all punctuation
sentence = sentence.translate(str.maketrans("", "", string.punctuation))

print(sentence)  # Outputs: "Bitcoin Ethereum and Solana"

```
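The efficiency claim can be checked with a quick micro-benchmark (illustrative only; absolute timings vary by machine, but both approaches must produce identical output):

```
import re
import string
import timeit

sentence = "Bitcoin, Ethereum, and Solana!" * 100

table = str.maketrans("", "", string.punctuation)
pattern = re.compile("[" + re.escape(string.punctuation) + "]")

# Both approaches remove exactly the same characters
assert sentence.translate(table) == pattern.sub("", sentence)

# translate() is typically the faster of the two for this task
t_translate = timeit.timeit(lambda: sentence.translate(table), number=1000)
t_regex = timeit.timeit(lambda: pattern.sub("", sentence), number=1000)
print(f"translate: {t_translate:.4f}s  regex: {t_regex:.4f}s")
```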

4. **nltk.tokenize:** <https://www.nltk.org/api/nltk.tokenize.html>

* **Purpose:** Provides tokenization tools to split text into words or sentences.
* **Usage in Code:** Used in *preprocess\_sentence()* to tokenize a sentence into words via *word\_tokenize(sentence)*.
  * **Word Tokenization:** Splits a sentence into individual words based on the *punkt* tokenizer.
* **Accuracy:** *word\_tokenize* leverages the *punkt* model, trained on diverse English texts, to accurately handle punctuation and contractions (e.g., "Bitcoin's" → \["Bitcoin", "'s"]).
* **Convenience:** Eliminates the need for custom tokenization logic, which would be error-prone and less robust (e.g., a simple *split()* misses edge cases).
* Without *nltk.tokenize*, the script might use *split()*, which fails on complex punctuation or multi-word crypto terms. Its effectiveness ensures precise word-level processing, critical for stemming and lemmatization.

```
from nltk.tokenize import word_tokenize

sentence = "Bitcoin's price rose today!"

words = word_tokenize(sentence)

print(words)  # Outputs: ['Bitcoin', "'s", 'price', 'rose', 'today', '!']

```

5. **nltk.corpus:** <https://www.nltk.org/api/nltk.corpus.html>

* Provides access to linguistic resources, such as a list of stopwords.
* Used to load English stopwords into a set via *stopwords.words("english")* in the setup.
  * **Stopwords:** Supplies a precompiled list of common English words (e.g., "the", "is", "and") to filter out.
* **Efficiency**: Loading stopwords into a set enables fast lookup during processing, crucial for large datasets.
* **Standardization**: Uses a widely accepted stopword list, ensuring consistency with NLP best practices.
* Without *nltk.corpus*, the script would require a custom stopword list, risking omissions or inconsistencies. Its use reduces noise in the cleaned text, improving the quality of downstream analysis.

```
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

print("the" in stop_words)      # Outputs: True

print("bitcoin" in stop_words)  # Outputs: False

```

6. **nltk.stem:** <https://www.nltk.org/api/nltk.stem.html>

* **Purpose:** Provides stemming and lemmatization tools to reduce words to their base forms.
* **Usage in Code:** Used in *preprocess\_sentence()* to initialize and apply *WordNetLemmatizer* and *PorterStemmer* via *stemmer.stem(word)* and *lemmatizer.lemmatize(stemmed\_word)*.
* **Key Features Utilized:**
  * **PorterStemmer:** Applies rule-based stemming (e.g., "running" → "run").
  * **WordNetLemmatizer:** Uses WordNet to lemmatize words to their dictionary form (e.g., "better" → "good" when tagged as an adjective via *pos="a"*; without a tag it treats words as nouns).
* **Normalization:** Combining stemming and lemmatization ensures words are reduced to consistent base forms, improving text similarity for analysis (e.g., "mining" and "mined" → "mine").
* **Precision:** *WordNetLemmatizer* leverages a lexical database for accurate base forms, while *PorterStemmer* handles simpler cases efficiently.
* Without *nltk.stem*, words like "prices" and "pricing" would remain distinct, reducing the effectiveness of vectorization or retrieval. Its use enhances text consistency while preserving crypto terms.

```
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()

stemmer = PorterStemmer()

word = "mining"

stemmed = stemmer.stem(word)

lemmatized = lemmatizer.lemmatize(stemmed)

print(stemmed, lemmatized)  # Outputs: mine mine

```
