Welcome to Affaan's research repository, housing research projects conducted at UCSD, UW, and independently. This repository reflects a commitment to pushing the boundaries of knowledge and innovation, particularly in quantitative finance, economics, and their intersection with cutting-edge technology. Below is an overview of the current projects and their corresponding descriptions.
Research Project 1: Discovering Market Manipulation Using Sentiment Analysis in Microcap Cryptocurrencies
This project aims to explore the relationship between Reddit discussions and the price movements of microcap cryptocurrencies. Leveraging various APIs, including CoinMarketCap, Twitter, and Reddit, the study focuses on identifying correlations, understanding sentiment, and attempting to infer causality.
The project follows a systematic approach, divided into multiple steps:
- Sample Selection: Selecting 500 microcap cryptocurrencies with a market capitalization of under 1 million USD.
- Data Collection: Gathering mentions from Reddit posts and constructing a rich dataset.
- Data Preprocessing: Cleaning and aggregating the data, including sentiment analysis.
- Correlation Analysis: Investigating linear relationships between post counts and price.
- Statistical Significance Testing: Confirming the statistical significance of the correlations.
- Multivariate Regression Analysis: Conducting a panel data regression model to understand influences on price, including lagged price, Reddit posts, and total market cap.
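As a concrete illustration of the regression step, the snippet below is a minimal sketch, not the project's actual code: a pooled OLS with coin fixed effects, regressing price on lagged price, Reddit post counts, and total market cap. The toy panel, column names, and clustering choice are assumptions for illustration.

```python
# Minimal sketch of the panel regression step (illustrative data and column names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Build a toy panel of three hypothetical microcaps over 60 days.
rng = np.random.default_rng(0)
frames = []
for coin in ["COINA", "COINB", "COINC"]:
    frames.append(pd.DataFrame({
        "coin": coin,
        "date": pd.date_range("2023-01-01", periods=60, freq="D"),
        "price": 1.0 + np.cumsum(rng.normal(0, 0.01, 60)),
        "reddit_posts": rng.poisson(5, 60),
        "total_market_cap": rng.normal(1e9, 1e7, 60),
    }))
panel = pd.concat(frames).sort_values(["coin", "date"])

# Lag price within each coin and drop the first observation per coin.
panel["price_lag1"] = panel.groupby("coin")["price"].shift(1)
panel = panel.dropna(subset=["price_lag1"])

# Coin fixed effects via C(coin); standard errors clustered by coin.
model = smf.ols(
    "price ~ price_lag1 + reddit_posts + total_market_cap + C(coin)",
    data=panel,
).fit(cov_type="cluster", cov_kwds={"groups": panel["coin"].astype("category").cat.codes})
print(model.summary())
```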
Key findings:
- Positive Sentiment: An overwhelmingly positive sentiment across the posts.
- Strong Correlations: A few microcaps showed strong positive correlations between Reddit discussions and price.
- Statistical Significance: The significant correlations suggest potential underlying relationships.

Limitations and future directions:
- Model Complexity: A more complex model could refine insights.
- Data Scope: Additional data sources and variables may enhance the analysis.
- Focused Analysis: Further studies could explore the mechanisms driving specific correlations.
This project offers valuable insights into the dynamic interplay between social media and financial markets, particularly within the realm of microcap cryptocurrencies. It serves as a foundation for further research and innovation in the field of financial technology and market behavior.
A proposed extension of this work is outlined below.
Title: Investigating Market Manipulation in Cryptocurrencies through Advanced Sentiment Analysis and Machine Learning
Objective: To identify and mitigate market manipulation in the cryptocurrency sector through advanced sentiment analysis and machine learning techniques.
Scope:
- Expanded Sample Size: Increasing the sample size to 2000 microcap cryptocurrencies.
- Enhanced Data Sources: Incorporating additional social media platforms such as Telegram and Discord.
- Advanced Analytical Techniques: Using machine learning algorithms like LSTM for time series analysis and BERT for more nuanced sentiment analysis.
- Collaborative Efforts: Partnering with financial institutions and academic bodies to validate findings and explore practical applications.
Expected Outcomes:
- A comprehensive understanding of market manipulation tactics in cryptocurrencies.
- Development of predictive models to identify potential manipulation.
- Policy recommendations for regulators to mitigate market manipulation.
Methodology:
- Data Collection: Extensive data collection from multiple social media platforms and cryptocurrency exchanges.
- Sentiment Analysis: Advanced sentiment analysis using state-of-the-art NLP techniques; a minimal example follows this list.
- Machine Learning Models: Training and testing various machine learning models to predict price movements and identify manipulation.
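As a small illustration of the sentiment-analysis step, the sketch below scores example posts with a pre-trained transformer via Hugging Face's pipeline API. The default model and the sample posts are assumptions for illustration, not the project's configuration.

```python
# Minimal sketch: scoring post sentiment with a pre-trained transformer pipeline.
from transformers import pipeline

# The default sentiment model is used here for illustration; a finance- or
# crypto-specific model could be substituted.
sentiment = pipeline("sentiment-analysis")

posts = [
    "This microcap is going to the moon, huge volume incoming!",
    "Dev team abandoned the project, stay away.",
]
for post in posts:
    result = sentiment(post)[0]  # each result has a 'label' and a 'score'
    print(f"{result['label']} ({result['score']:.2f}): {post}")
```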
Collaborators: Financial institutions, academic researchers, and regulatory bodies.
Research Project 2: Comparing Neural Networks with Newton’s Method Gradient Descent, SVM, and Random Forest for Max-Flow Optimization in Supply Chain Networks
This study embarks on a comparative analysis of machine learning techniques for optimizing supply chain networks, focusing on the max-flow problem. Utilizing three key datasets - "Daily Demand Forecasting Orders" from Ferreira et al. (2017), "Wholesale Customers" from Cardoso (2014), and "Online Retail" (2015) - the research aims to model and predict critical factors influencing supply chain efficiency.
The project is systematically structured as follows:
- Data Preparation: Initial Exploratory Data Analysis (EDA) for understanding dataset characteristics, followed by data cleaning and preprocessing.
- Model Selection: Employing Neural Networks using Newton's Method Gradient Descent, Support Vector Machines (SVM), and Random Forests (RF) for the analysis.
- Data Splitting: Dividing datasets into different training-testing splits (80/20, 50/50, 20/80) to evaluate model performance.
- Model Training and Evaluation: Training each classifier and assessing performance based on accuracy, F-score, Lift, ROC Area, average precision, and other metrics.
- Visualization and Analysis: Plotting convergence rates and performance metrics to compare classifiers visually.
- Hyperparameter Tuning and Cross-Validation: Using GridSearchCV to optimize hyperparameters for each classifier.
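The sketch below illustrates the tuning-and-evaluation loop with GridSearchCV for SVM and Random Forest on an 80/20 split. The synthetic feature matrix and parameter grids are placeholders, not the project's datasets or settings.

```python
# Minimal sketch of hyperparameter tuning and evaluation for two of the classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data stands in for the prepared supply-chain features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

searches = {
    "SVM": GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}, cv=5),
    "Random Forest": GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=5,
    ),
}
for name, search in searches.items():
    search.fit(X_train, y_train)                    # cross-validated grid search
    preds = search.best_estimator_.predict(X_test)  # evaluate the best model on the held-out split
    print(f"{name}: best params {search.best_params_}, test F1 {f1_score(y_test, preds):.3f}")
```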
Key findings:
- Neural Networks: Demonstrated high accuracy but required substantial computational resources for convergence.
- SVM: Showed consistent and robust performance across different datasets.
- Random Forest: Achieved high training accuracy, indicating proficiency in capturing complex data patterns but also a potential for overfitting.

Recommendations:
- Computational Efficiency: Logistic Regression and SVM are preferable for limited computational power, while Neural Networks are suitable with ample computational resources.
- Model Stability: Further investigation into feature relationships and model stability is necessary for robust predictive performance.
- Future Research: Exploring hybrid models or ensemble techniques combining the strengths of these classifiers could enhance supply chain optimization.
The study provides a nuanced understanding of how different machine learning models can optimize various aspects of supply chain management. While Random Forest stands out for handling complex datasets, Neural Networks show promise for future applications with sufficient computational support. Logistic Regression and SVM offer accessible alternatives for scenarios with computational or data limitations.
Research Project 3: Enhancing Handwritten Quranic Arabic Recognition through Deep Learning: A Novel Approach Integrating Tajweed-Sensitive Convolutional Neural Networks
This project embarks on an unprecedented exploration into the recognition of handwritten Quranic Arabic, with a special emphasis on integrating the complex rules of Tajweed. Utilizing a rich dataset of Arabic Handwritten Characters compiled by El-Sawy, Loey, and El-Bakry (2017), the research employs advanced convolutional neural networks (CNNs) architectures, particularly focusing on ResNet, to significantly advance the field of Arabic handwritten character recognition.
The project is structured as follows:
- Data Loading and Preprocessing: Utilizing the Arabic Handwritten Characters Dataset, which contains 16,800 characters from 60 participants, the data undergoes augmentation, normalization, and reshaping to fit the model requirements.
- Model Architecture: Adopting ResNet with modifications, including custom layers for Tajweed recognition, to handle the complexity of Quranic Arabic script (a minimal residual-block sketch follows this list).
- Hyperparameter Optimization: Employing GridSearchCV and other techniques to fine-tune parameters for optimal performance.
- Training and Validation: Utilizing an 80/10/10 split for training, validation, and testing, with a focus on accuracy, precision, recall, and F1 score.
- Comparative Analysis: Conducting a detailed comparison between AlexNet and ResNet architectures to evaluate performance improvements.
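To illustrate the architecture step, below is a minimal residual-CNN sketch in Keras for 32x32 grayscale character images with 28 output classes (dimensions assumed from the published dataset description). It is not the project's actual Tajweed-aware ResNet; the custom Tajweed layers are omitted.

```python
# Minimal residual-CNN sketch (not the project's exact architecture).
from tensorflow.keras import layers, models

def residual_block(x, filters):
    """A basic two-convolution residual block with an identity (or 1x1) shortcut."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

inputs = layers.Input(shape=(32, 32, 1))             # 32x32 grayscale characters (assumed)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = residual_block(x, 32)
x = layers.MaxPooling2D()(x)
x = residual_block(x, 64)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(28, activation="softmax")(x)  # 28 Arabic letter classes

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```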
Key findings:
- Accuracy: Achieved an accuracy of over 91% in recognizing handwritten Quranic Arabic with integrated Tajweed rules using ResNet.
- Model Robustness: Enhanced through advanced data augmentation techniques and iterative refinement.
- Comparative Analysis: ResNet outperformed AlexNet in both accuracy and efficiency, demonstrating its superior capability in handling complex recognition tasks.

Future directions:
- Comprehensive Optimization: Expanding hyperparameter tuning to include all conceivable combinations for maximal accuracy.
- Advanced Architectures: Further exploring deeper networks and hybrid models to push performance boundaries.
- Real-world Applications: Developing practical tools for educational and religious use, enhancing accessibility to Quranic texts.
This project significantly contributes to the field of computational linguistics and artificial intelligence by addressing the complex task of Quranic Arabic recognition with integrated Tajweed rules. The insights gained lay the groundwork for further advancements and practical applications, aiming to enrich the educational and religious experiences of the global Muslim community.
Research Project 4: Efficient Sampling and Analysis of Large-Scale User ID Datasets for Return-Migration Research
This project, developed for the Economics Department Research Lab Winter 2024, focuses on efficient sampling and analysis of large-scale user ID datasets. It is designed to work with the San Diego Supercomputer Center, specifically for research on the economic impact of return migrants. The project combines econometric techniques with Natural Language Processing (NLP) and leverages advanced computing methods for data handling and analysis.
The Python script samplethenconcatenate.py is the core of the sampling process. It uses Dask for distributed computing to efficiently sample and merge large Parquet files; a minimal sketch of the routine appears after the feature list below.
Key functions and features:
`sample_and_merge_optimized_dask(folder_path, output_dir, sample_fraction=0.1)`:
- Reads Parquet files from a specified folder
- Samples a fraction (default 10%) of the data
- Saves the sampled data as a new Parquet file

Arguments:
- `--folder_path`: Path to the folder containing Parquet files
- `--output_dir`: Directory to save the output Parquet file
- `--sample_fraction`: Fraction of data to sample (default: 0.1)
Additional features:
- Memory usage tracking using `psutil`
- Progress bar for visual feedback during processing
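The sketch below approximates the routine described above under stated assumptions: it reads Parquet files with Dask, samples a fraction, writes the result, and reports memory use. The actual samplethenconcatenate.py may differ in structure and options.

```python
# Minimal sketch of the sampling routine (the real script may differ).
import argparse

import dask.dataframe as dd
import psutil
from dask.diagnostics import ProgressBar

def sample_and_merge_optimized_dask(folder_path, output_dir, sample_fraction=0.1):
    ddf = dd.read_parquet(folder_path)                     # lazily read all Parquet files in the folder
    sampled = ddf.sample(frac=sample_fraction, random_state=42)
    with ProgressBar():                                    # progress bar while the task graph executes
        sampled.to_parquet(output_dir, write_index=False)  # save the sampled data as new Parquet files
    print(f"Memory in use after sampling: {psutil.virtual_memory().percent}%")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--folder_path", required=True, help="Folder containing Parquet files")
    parser.add_argument("--output_dir", required=True, help="Directory for the sampled output")
    parser.add_argument("--sample_fraction", type=float, default=0.1, help="Fraction of data to sample")
    args = parser.parse_args()
    sample_and_merge_optimized_dask(args.folder_path, args.output_dir, args.sample_fraction)
```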
The bash script job_submission_script.sh is used to submit the sampling job to the SLURM workload manager on the supercomputer.
Key features:
- Sets job name and output file
- Navigates to the correct directory
- Executes the Python script with appropriate arguments
This script verifies the sampled dataset by counting the total number of rows.
Key features:
- Uses Dask for efficient processing of large Parquet files
- Configures Dask for optimized query planning
- Loads the sampled dataset and computes the total row count
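A minimal sketch of the verification step is shown below; the output path is a placeholder. Recent Dask releases enable the optimized query planner by default, so no extra configuration is shown here.

```python
# Minimal sketch of the row-count verification (paths are placeholders).
import dask.dataframe as dd

sampled = dd.read_parquet("sampled_output/")  # hypothetical directory written by the sampling job
total_rows = len(sampled)                     # triggers the distributed count
print(f"Total sampled rows: {total_rows:,}")
```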
Workflow:
- The bash script submits the job to SLURM, which executes samplethenconcatenate.py.
- samplethenconcatenate.py reads the Parquet files, samples 10% of the data, and saves the result.
- The data verification script then loads the sampled dataset and computes the total number of rows.
Results:
- Original dataset: 763,504,805 User IDs
- Sampled dataset: 76,350,473 User IDs (approximately 10%)
- Runtime reduced from 2.5 hours to 26 minutes
Technologies used:
- Unix/Terminal: Used for job submission and cluster interaction
- Python: Core language for script development
- Dask: Utilized for distributed computing and efficient data handling
- SLURM: Workload manager for job submission on the supercomputer
- Parquet: Efficient columnar storage format for large datasets
Key achievements:
- Developed a Parallelized Computing Algorithm for efficient data sampling
- Significantly reduced runtime from 2.5 hours to 26 minutes
- Achieved 10% sampling of an 800 million user dataset without increasing storage and memory costs
- Set a new benchmark for processing scalability and speed in large-scale data analysis
To run the sampling pipeline:
- Ensure all dependencies are installed (Dask, psutil).
- Submit the job using the provided bash script: `sbatch job_submission_script.sh`
- Once the job completes, verify the results using the data verification script.
Future work:
- Further optimization of the sampling algorithm for even larger datasets
- Integration with NLP techniques for in-depth analysis of migration trends
- Development of advanced econometric models leveraging the sampled data
Note: This project is part of ongoing research on the economic impact of return migrants, blending econometric techniques with NLP and leveraging advanced computing resources for comprehensive migration analysis.
Research Project 5: Scraping and Classifying Violence-Related News from O Globo's Rio de Janeiro Section
This project, developed for the Economics Department Undergraduate Research Lab Summer 2024, is designed to scrape and analyze articles from O Globo's Rio de Janeiro section, focusing on violence-related news. The scraper collects articles, processes their content, and classifies them based on their likelihood of being related to violent events.
main.py is the entry point of the application. It orchestrates the entire scraping and data processing pipeline.
Key functions:
`main()`: Coordinates the scraping process, data saving, and output generation.
- Sets up scraping parameters (start page, max pages, time range)
- Initializes CSV file for data storage
- Calls `scrape_oglobo()` from scraper.py
- Handles exceptions and keyboard interrupts
- Provides a summary of scraped data
- Saves data to CSV and DataFrame formats
scraper.py contains the core logic for scraping articles from O Globo's website.
Key functions:
`scrape_oglobo(days=365, start_page=1, max_pages=None)`:
- Scrapes articles from specified pages
- Utilizes BeautifulSoup for HTML parsing
- Extracts basic article information (title, URL, publication date)
- Calls `get_article_content()` to fetch full article text
- Uses `ViolenceClassifier` to predict violence likelihood
- Extracts various data points using functions from data_processors.py
- Yields a dictionary of extracted and processed data for each article
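The sketch below shows the general shape of such a scraping loop with requests and BeautifulSoup. The listing URL pattern and CSS selector are hypothetical placeholders, not O Globo's actual markup, and the real `scrape_oglobo()` does considerably more.

```python
# Minimal scraping-loop sketch (URL pattern and selector are hypothetical).
import requests
from bs4 import BeautifulSoup

def scrape_listing(page=1):
    url = f"https://oglobo.globo.com/rio/pagina-{page}"  # placeholder listing URL
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a.feed-post-link"):         # assumed CSS selector for article links
        yield {"title": link.get_text(strip=True), "url": link.get("href")}

for article in scrape_listing(page=1):
    print(article["title"], "->", article["url"])
```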
content_extractor.py handles the retrieval and parsing of full article content.
Key functions:
`get_article_content(url, max_retries=3)`:
- Fetches the full text of an article from its URL
- Implements retry logic for failed requests
- Uses various CSS selectors to locate article content
- Cleans the extracted text by removing unwanted elements and whitespace
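A minimal sketch of this retry-and-parse pattern is shown below; the CSS selectors and backoff interval are illustrative assumptions.

```python
# Minimal sketch of fetching article text with retries (selectors are assumed).
import time

import requests
from bs4 import BeautifulSoup

def get_article_content(url, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            body = soup.select_one("article") or soup.select_one("div.content-text")  # assumed selectors
            return body.get_text(" ", strip=True) if body else ""
        except requests.RequestException:
            if attempt == max_retries:
                raise                # give up after the final attempt
            time.sleep(2 * attempt)  # simple linear backoff between retries
```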
data_processors.py contains functions for processing and extracting specific data from article content.
Key functions:
- `extract_location(text)`: Identifies mentioned locations in the article
- `extract_police_involvement(text)`: Determines if police are mentioned
- `extract_gang_involvement(text)`: Checks for mentions of gang activity
- `extract_victims(text)`: Attempts to count the number of victims mentioned
- `extract_gender(text)`: Extracts gender information of individuals mentioned
- `determine_violence_level(text)`: Categorizes the level of violence (High/Medium/Low)
- `determine_violence_type(text)`: Identifies the type of violent event
- `is_violence_related(text)`: Determines if the article is related to violence
- `extract_journalist(text)`: Extracts the name of the journalist
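As an illustration of this kind of rule-based extraction, the sketch below uses keyword lists and a regular expression; the Portuguese terms and patterns are illustrative assumptions, not the project's actual dictionaries.

```python
# Minimal sketch of rule-based extraction (keyword lists are illustrative).
import re

POLICE_TERMS = ["polícia", "policial", "bope"]

def extract_police_involvement(text):
    lowered = text.lower()
    return any(term in lowered for term in POLICE_TERMS)

def extract_victims(text):
    # Very rough count of phrases like "2 mortos" or "3 feridos" (digits only).
    matches = re.findall(r"(\d+)\s+(?:mortos?|feridos?|vítimas?)", text.lower())
    return sum(int(count) for count in matches)

sample = "Tiroteio deixa 2 mortos e 3 feridos; a polícia investiga o caso."
print(extract_police_involvement(sample), extract_victims(sample))  # True 5
```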
utils.py provides utility functions used across the project.
Key functions:
- `extract_date(text)`: Extracts dates mentioned in the text
- `get_coordinates(location)`: Geocodes location names to coordinates
- `clean_text(text)`: Cleans and normalizes text
- `format_date(date)`: Formats datetime objects to standard string format
- `extract_important_metadata(content)`: Extracts key metadata from article content
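The sketch below shows one way `get_coordinates` could work using geopy's Nominatim geocoder; the user_agent string and the city suffix in the query are assumptions for illustration.

```python
# Minimal geocoding sketch with geopy (user_agent and query format are assumed).
from geopy.geocoders import Nominatim

def get_coordinates(location):
    geolocator = Nominatim(user_agent="oglobo-violence-scraper")       # placeholder agent string
    result = geolocator.geocode(f"{location}, Rio de Janeiro, Brazil")  # bias the query toward Rio
    return (result.latitude, result.longitude) if result else None

print(get_coordinates("Copacabana"))
```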
violence_classifier.py implements machine learning models to classify articles based on their likelihood of being related to violent events. This file offers two classification approaches: a simpler scikit-learn based model and a more robust BERT-based transformer model.
Key classes:
- `SimpleViolenceClassifier`:
  - Uses scikit-learn's TfidfVectorizer and MultinomialNB for classification
  - `train_classifier()`: Trains the model on a predefined set of examples
  - Suitable for quick classification with lower computational requirements
- `RobustViolenceClassifier`:
  - Utilizes a pre-trained Portuguese BERT model ("neuralmind/bert-base-portuguese-cased")
  - Capable of fine-tuning on domain-specific data for improved accuracy
  - Key methods:
    - `fine_tune(texts, labels)`: Fine-tunes the model on a manually labeled dataset of violence and non-violence related articles
    - `predict_violence_likelihood(text)`: Predicts the likelihood of an article being violence-related
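For the simpler approach, the sketch below wires TfidfVectorizer and MultinomialNB into a pipeline; the two training examples are illustrative placeholders rather than the project's actual labeled set.

```python
# Minimal sketch of the TF-IDF + Naive Bayes approach (training data is illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "Tiroteio entre facções deixa feridos na zona norte",   # violence-related
    "Prefeitura anuncia novo programa cultural no centro",  # not violence-related
]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# Probability that a new headline is violence-related.
print(clf.predict_proba(["Operação policial termina em confronto armado"])[0][1])
```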
Usage notes:
- The BERT-based classifier requires pre-training or fine-tuning on a manual sample of marked violence and non-violence articles before use in the main scraping process.
- Fine-tuning process:
  - Collect a diverse set of articles from O Globo or similar sources
  - Manually label these articles (1 for violence-related, 0 for non-violence)
  - Use the `fine_tune` method with this labeled dataset
  - The fine-tuned model can then be used for predictions during the scraping process
Choosing between classifiers:
- SimpleViolenceClassifier: Use for faster processing and when computational resources are limited
- RobustViolenceClassifier: Prefer for higher accuracy, especially when dealing with nuanced or context-dependent violence references in Portuguese text
Note: The BERT-based classifier requires more computational resources, especially during the fine-tuning process. GPU acceleration is recommended for efficient fine-tuning and faster inference.
Workflow:
- main.py initiates the scraping process by calling `scrape_oglobo()` from scraper.py.
- scraper.py fetches article listings, extracts basic info, and calls `get_article_content()` from content_extractor.py to get full article text.
- The scraped content is then processed using various functions from data_processors.py and utils.py to extract relevant information.
- violence_classifier.py is used to predict the likelihood of each article being related to a violent event.
- All extracted and processed data is compiled into a dictionary for each article and yielded back to main.py.
- main.py saves this data to CSV and DataFrame formats, providing a summary of the scraped articles.
To run the scraper:
- Ensure all dependencies are installed (requests, beautifulsoup4, pandas, numpy, scikit-learn, geopy).
- Run `python main.py` from the command line.
- The script will start scraping articles, processing them, and saving the results to 'violence_data.csv' and 'violence_data_df.csv'.
Note: Respect O Globo's robots.txt and terms of service when using this scraper. Implement appropriate delays between requests to avoid overloading their server.
Note: This repository is continuously updated with new research and findings. Please feel free to explore the content and contribute or provide feedback.
This project is licensed under the MIT License - see the LICENSE.md file for details.
For any inquiries or collaboration, please feel free to contact Affaan at afmustafa@ucsd.edu.