Previous Projects
Movie Recommendations with Sentiment Insights
Description:
This recommendation system with sentiment analysis is an integrated application that provides personalized movie suggestions using detailed data from the TMDB API, including titles, genres, and posters. It enhances the user experience further with a sentiment analysis model trained on IMDb reviews, offering insights into viewer sentiment for each movie. An autocomplete feature improves usability by helping users quickly locate desired titles. By combining local datasets with web-scraped reviews, this robust recommendation engine adapts to user preferences, creating an engaging, data-driven discovery experience.
Key Features:
Movie Recommendation: Utilizes the TMDB API to retrieve detailed movie information, including titles, genres, runtimes, and posters, to offer users relevant recommendations.
Sentiment Analysis: Employs a Naive Bayes model trained on user reviews to assess sentiment, providing insights into viewer opinions and enhancing the recommendation experience (a minimal sketch follows this list).
Autocomplete Functionality: Features real-time title suggestions, enabling users to quickly locate movies of interest.
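To make the sentiment component concrete, here is a minimal scikit-learn sketch of a Naive Bayes classifier trained on labeled review text. The file name, column names, and vectorizer choice are assumptions for illustration, not the project's actual code.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical review file with a "review" text column and a binary "sentiment" label.
df = pd.read_csv("imdb_reviews.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["sentiment"], test_size=0.2, random_state=42)

# Vectorize the text and fit Naive Bayes in one pipeline.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print(clf.predict(["A gripping, beautifully shot film."]))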
Gen-AI Based Custom-Trained Medical Chatbot
Description:
An end-to-end conversational AI chatbot designed to answer medical-related queries using a large language model (LLM) and a custom knowledge base. It provides accurate and context-aware responses by leveraging "The Gale Encyclopedia of Medicine" as its primary data source. The chatbot features a user-friendly interface and supports flexible customization for enhanced user interaction.
Tech Stack Highlights:
Llama-2-7B-Chat-GGML - Large language model fine-tuned for conversational understanding.
LangChain - Framework for building applications powered by language models.
Flask - Web framework used to create the user interface.
Pinecone - Vector database for efficient storage and retrieval of embeddings.
Python - Core programming language for logic and functionality.
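A minimal retrieval-and-answer sketch of this architecture, assuming the classic LangChain 0.0.x and pinecone-client 2.x APIs, a GGML model file loaded through ctransformers, and that the encyclopedia has already been chunked, embedded, and upserted; the index name, embedding model, and file path are hypothetical.

import pinecone
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA

pinecone.init(api_key="...", environment="...")  # credentials omitted

# Embeddings must match those used when the index was built (model name assumed).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
docsearch = Pinecone.from_existing_index("medical-chatbot", embeddings)

# Load the quantized GGML chat model locally via ctransformers (path assumed).
llm = CTransformers(model="llama-2-7b-chat.ggmlv3.q4_0.bin",
                    model_type="llama", config={"max_new_tokens": 256})

# Retrieve the top-k encyclopedia chunks and let the LLM answer from them.
qa = RetrievalQA.from_chain_type(
    llm=llm, retriever=docsearch.as_retriever(search_kwargs={"k": 3}))
print(qa.run("What are the common symptoms of anemia?"))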
Real-Time Data Analysis Pipeline
Description:
Ecom-Realview is a real-time data analysis pipeline designed for processing and analyzing e-commerce data from ingestion to visualization. It leverages Kafka for event streaming, Spark for real-time processing, Cassandra for raw data storage, MySQL for structured data storage, and Superset for interactive dashboards. This pipeline supports real-time decision-making and insights through end-to-end data flow and visualization.
Tech Stack Highlights:
Apache Kafka - Real-time event streaming and message brokering.
Apache Spark Streaming - Real-time data ingestion, processing, and analytics.
Apache Cassandra - NoSQL database for raw data storage, offering scalability and high availability.
MySQL - Relational database for structured data storage and advanced querying.
Apache Superset - Dashboarding tool for data visualization and real-time analytics.
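As a sketch of the ingestion-and-processing leg, the PySpark Structured Streaming snippet below consumes and parses events from Kafka. The broker address, topic name, and event schema are assumptions, the console sink stands in for the pipeline's Cassandra and MySQL sinks, and running it requires the spark-sql-kafka connector package on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("EcomRealview").getOrCreate()

# Hypothetical order-event schema; the real pipeline's fields may differ.
schema = (StructType()
          .add("order_id", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

# Consume raw events from Kafka (broker and topic are assumed).
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "ecom-orders")
       .load())

# Parse the JSON payload into typed columns.
orders = (raw.select(from_json(col("value").cast("string"), schema).alias("o"))
          .select("o.*"))

# Console sink for illustration; the real pipeline writes to Cassandra and MySQL.
orders.writeStream.outputMode("append").format("console").start().awaitTermination()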
End-to-End Insurance Premium Prediction
Description:
This end-to-end project utilizes MLOps (Machine Learning Operations) principles to streamline the development, deployment, and maintenance of an insurance premium prediction model. By predicting insurance costs based on individual health data, the project empowers users to make informed insurance decisions, comparing premiums across providers with an understanding of their health factors. MLOps is integrated to automate the pipeline from data ingestion to model deployment, ensuring a consistent and scalable solution that can be updated with new data seamlessly.
Tech Stack:
Programming Language: Python 3.7
Frontend: HTML, CSS
Backend: Flask, RESTful APIs
Machine Learning: Scikit-Learn (Regression models for premium prediction)
Data Analysis: Pandas, NumPy
MLOps Tools: Docker (containerization), GitHub Actions (CI/CD), Heroku (deployment)
Development Tools: Jupyter Notebook, Git
Deployment Platform: Heroku
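As a hedged sketch of the modeling core, the snippet below fits a scikit-learn regression pipeline on health features. The column names follow the widely circulated open insurance dataset (age, bmi, smoker, ...), and both the schema and the choice of GradientBoostingRegressor are assumptions about the project's actual code.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical health dataset; "charges" stands in for the premium target.
df = pd.read_csv("insurance.csv")
X, y = df.drop(columns=["charges"]), df["charges"]

# Scale numeric features and one-hot encode categorical ones.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker", "region"]),
])
model = Pipeline([("pre", pre), ("reg", GradientBoostingRegressor(random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("R^2:", r2_score(y_test, model.predict(X_test)))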
AI-Based Schedule Generator
Description:
Techno Green Phase 2 advances precision agriculture by building an AI-powered irrigation schedule generator tailored for greenhouse farming. This project is a continuation of the initial "Techno Green" research on automated irrigation systems. It uses machine learning to predict optimized irrigation schedules based on crop-specific and environmental data. Through a user-friendly interface built with Flask, the system allows farmers to generate custom irrigation schedules by entering a crop name, ensuring accessible technology adoption.
Tech Stack:
Machine Learning: Decision Tree Regression (for predictive scheduling based on environmental factors)
Data Processing: Python libraries (pandas, NumPy) for dataset preparation
Development: HTML, CSS, Python, Flask
Data Collection and Management: Custom agricultural datasets (temperature, humidity, soil type, etc.)
Evaluation Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared (R²) for assessing model performance
Key Project Features:
Automated dataset creation and preprocessing
Decision Tree Regression model for non-linear, interpretable predictions (see the sketch after this list)
CSV output with an irrigation schedule
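A minimal sketch of the prediction step under assumed column names (temperature, humidity, soil_type) and an assumed target of irrigation minutes; the dataset path is hypothetical.

import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical dataset with environmental features and an irrigation-minutes target.
df = pd.read_csv("irrigation_data.csv")
X = pd.get_dummies(df[["temperature", "humidity", "soil_type"]])  # one-hot soil_type
y = df["water_minutes"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train, y_train)

pred = tree.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
print("R2: ", r2_score(y_test, pred))

# Emit the schedule as CSV, mirroring the project's output format.
pd.DataFrame({"predicted_minutes": tree.predict(X)}).to_csv("schedule.csv", index=False)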
Landing Forecaster for SpaceX's Falcon 9
Description:
This capstone project focuses on predicting the successful landing of SpaceX’s Falcon 9 rocket’s first stage, a crucial factor in estimating the cost-efficiency of SpaceX’s reusable rockets. Given that SpaceX can significantly reduce launch costs by reusing the first stage, accurately forecasting a successful landing can help a competing startup make informed, competitive bids for rocket launches. The project takes on the end-to-end data science process, emulating the real-world experience of working with complex datasets and developing predictive models.
Tech Stack:
Programming Languages: Python
Data Analysis and Wrangling: Pandas, NumPy, SQL
Data Visualization: Matplotlib, Seaborn, Plotly
Machine Learning Algorithms: Logistic Regression, Decision Trees, Support Vector Machines (SVM)
Deployment and MLOps Tools: Docker (for containerization), GitHub Actions (for CI/CD)
Development Tools: Jupyter Notebook, Git
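A minimal sketch of the landing classifier using one of the listed algorithms, assuming a prepared feature table; the file name, the binary "class" landing label, and the hyperparameter grid are assumptions for illustration.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical prepared feature table with a binary "class" landing label.
df = pd.read_csv("falcon9_features.csv")
X, y = df.drop(columns=["class"]), df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Scale features and tune the regularization strength with cross-validation.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print("Test accuracy:", grid.score(X_test, y_test))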
Most Recent Highlight:
This project investigates gender bias in professor ratings and identifies key predictors of teaching effectiveness using data from RateMyProfessors.com (RMP). I analyzed disparities in ratings, examined descriptive tags, and built predictive models to uncover patterns in evaluations. The findings aim to improve fairness in academia and provide actionable insights for educators and institutions.
The dataset was created using the following steps:
Web Scraping:
Scraped professor profiles, ratings, tags, and institutional data from RMP using Python libraries like BeautifulSoup and Selenium (illustrated after these steps).
Data Collation:
Aggregated individual ratings into average scores (e.g., quality, difficulty) and cleaned duplicates.
Anonymization:
Removed identifying information such as professor names to preserve privacy.
Validation:
Cross-checked with a manually collected sample, achieving a 98% match rate for accuracy.
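Purely for illustration, the sketch below shows the shape of the scraping step with requests and BeautifulSoup; the URL and CSS selectors are placeholders, since RMP's real markup differs, changes often, and dynamically loaded content is what Selenium handled.

import requests
from bs4 import BeautifulSoup

def scrape_profile(url: str) -> dict:
    # Fetch and parse one professor page (selectors are hypothetical placeholders).
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "avg_quality": float(soup.select_one(".quality-score").get_text(strip=True)),
        "avg_difficulty": float(soup.select_one(".difficulty-score").get_text(strip=True)),
        "tags": [t.get_text(strip=True) for t in soup.select(".tag")],
    }

profile = scrape_profile("https://www.ratemyprofessors.com/professor/12345")  # example ID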
Data Challenges:
Missing Data:
Several records were incomplete, with missing values for critical fields like average ratings and difficulty levels.
Problem: Missing values reduce the dataset’s analytical integrity and may lead to biased conclusions.
Extreme Ratings:
Some professors had ratings based on only one or two students, leading to extreme averages of 1 or 5.
Problem: Such low-sample averages are unreliable and can skew statistical analysis.
False Positives in Statistical Analysis:
The large dataset provided sufficient statistical power, increasing the risk of false positives when conducting multiple hypothesis tests.
Problem: Without proper corrections, spurious results could compromise the validity of findings.
Raw Tag Counts:
Tags like "Tough Grader" were recorded as raw counts, biased by the number of ratings a professor received.
Problem: Professors with more ratings appeared to have higher tag counts, distorting meaningful comparisons.
Ambiguous Gender Entries:
Some records had unclear or conflicting gender classifications.
Problem: This ambiguity could mislead gender-based analyses, reducing credibility.
Preprocessing Decisions:
Handling Missing Data:
Do: Rows with missing average ratings or difficulty values were dropped to ensure data integrity.
Why: Excluding incomplete records guarantees that analyses rely on complete and valid information.
Find: This step reduced noise in the dataset, allowing for more reliable statistical results.
Filtering Extreme Ratings:
Do: Professors with fewer than five ratings were excluded.
Why: A minimum threshold prevents skewed averages caused by a small number of ratings, fostering more stable insights.
Find: After filtering, the average rating distribution remained centered around ~3.5, reflecting typical evaluations.
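Both cleaning steps reduce to a few lines of pandas, assuming hypothetical column names (avg_rating, avg_difficulty, num_ratings):

import pandas as pd

df = pd.read_csv("rmpCapstoneNum.csv")  # column names below are assumptions

# Drop rows missing the fields the analysis depends on.
df = df.dropna(subset=["avg_rating", "avg_difficulty"])

# Exclude professors with fewer than five ratings to stabilize averages.
df = df[df["num_ratings"] >= 5]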
Mitigating False Positives:
Do: Applied a stricter significance threshold (α = 0.005) for all statistical tests and employed a Bonferroni correction for multiple comparisons.
Why: These adjustments reduce the likelihood of false positives, ensuring robust and credible findings.
Find: The adjusted threshold identified significant patterns while avoiding overinterpretation.
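A sketch of the correction logic with SciPy, using stand-in data in place of the real per-professor tag proportions; the corrected per-test bar divides α = 0.005 by the 20 tag tests:

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Stand-in tag proportions for two groups; the real values come from the dataset.
male_props = {f"tag_{i}": rng.beta(2, 8, 300) for i in range(20)}
female_props = {f"tag_{i}": rng.beta(2, 8, 280) for i in range(20)}

ALPHA = 0.005
threshold = ALPHA / len(male_props)  # Bonferroni: divide by the 20 tag tests

for tag, male in male_props.items():
    stat, p = mannwhitneyu(male, female_props[tag])
    if p < threshold:
        print(tag, "is significant at the corrected threshold:", p)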
Normalizing Tags:
Do: Converted raw tag counts to proportions (tag count ÷ total ratings).
Why: Normalizing tags ensures fair comparisons, regardless of the number of ratings a professor received.
Find: This step revealed meaningful differences in tag usage across genders, with certain characteristics like "Tough Grader" appearing more frequently.
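The normalization itself is a single vectorized division, assuming the tag and numerical files are row-aligned and that the total-ratings column is named num_ratings:

import pandas as pd

tags = pd.read_csv("rmpCapstoneTags.csv")  # raw per-professor tag counts
num = pd.read_csv("rmpCapstoneNum.csv")    # assumed to hold a num_ratings column

# Divide each tag count by that professor's total number of ratings.
tag_props = tags.div(num["num_ratings"], axis=0)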
Resolving Ambiguous Gender Entries:
Do: Created binary columns (gender00, gender11) and assigned ambiguous cases a value of 0 for both Male and Female.
Why: Explicitly handling ambiguous entries prevents misclassification and strengthens gender-based analyses.
Find: Gender data became cleaner and more interpretable, supporting accurate statistical tests.
Seeding the Random Number Generator:
Do: Seeded the RNG with a team member's unique identifier (N-number).
Why: Ensures reproducibility of results, allowing consistent splits in train-test datasets and bootstrapping processes.
Find: Reproducibility helped maintain computational integrity throughout the project.
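For instance (the seed literal is a placeholder for the actual N-number, and the data is synthetic):

import numpy as np
from sklearn.model_selection import train_test_split

SEED = 12345678  # placeholder for the team member's actual N-number
rng = np.random.default_rng(SEED)

X = rng.normal(size=(100, 3))     # stand-in feature matrix
y = rng.integers(0, 2, size=100)  # stand-in labels

# The same seed drives the split and every bootstrap draw, so runs are repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
boot_idx = rng.integers(0, len(X_train), size=len(X_train))  # one bootstrap resample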
Objectives
With this dataset in hand, the project aims to answer the following questions:
Gender Bias in Ratings:
Activists have asserted that there is a strong gender bias in student evaluations of professors, with male professors enjoying a boost in ratings. However, skeptics have criticized the quality of such research, citing small sample sizes, failure to control for confounders, and potential p-hacking. This project examines whether there is evidence of a pro-male gender bias in the dataset.
A significance test is required to answer this.
Variance in Ratings:
Is there a gender difference in the spread (variance/dispersion) of the ratings distribution? Statistical significance of observed differences in variance will be evaluated.
Effect Size Estimation:
What is the likely size of both of these effects (gender bias in average rating, gender bias in spread of average rating)? This will be estimated using a 95% confidence interval.
Gender Differences in Tags:
Are there gender differences in the tags awarded by students? Each of the 20 tags will be tested for potential gender differences, identifying the most and least gendered tags based on statistical significance (lowest and highest p-values, respectively).
Gender Differences in Difficulty Ratings:
Is there a gender difference in terms of average difficulty? Statistical testing will be used to determine significance.
Quantifying Effect Size in Difficulty:
What is the likely size of the gender difference in difficulty ratings at a 95% confidence level?
Predicting Average Ratings:
Build a regression model to predict average ratings using all numerical predictors from the rmpCapstoneNum.csv file. The model should include metrics such as R² and RMSE, and address collinearity concerns to identify the most significant predictors.
Predicting Ratings Using Tags:
Build a regression model to predict average ratings using all tags from the rmpCapstoneTags.csv file. Compare this model's performance (e.g., R², RMSE) with the numerical model and identify which tags are most strongly predictive of ratings.
Predicting Difficulty Using Tags:
Build a regression model to predict average difficulty using all tags from the rmpCapstoneTags.csv file. Include metrics like R² and RMSE, and identify the most significant predictors while addressing collinearity.
Predicting Pepper Badges:
Build a classification model to predict whether a professor receives a "pepper" badge using all available factors (tags and numerical). Evaluate model quality with metrics such as AUROC and address class-imbalance concerns (a minimal sketch follows).
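A hedged sketch of this final model, assuming the two files are row-aligned, that the numerical file carries a binary pepper column, and that class imbalance is addressed with class_weight="balanced", one common choice:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

num = pd.read_csv("rmpCapstoneNum.csv")
tags = pd.read_csv("rmpCapstoneTags.csv")
X = pd.concat([num.drop(columns=["pepper"]), tags], axis=1)  # "pepper" column assumed
y = num["pepper"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# class_weight="balanced" reweights the minority class to counter imbalance.
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=1000, class_weight="balanced"))
clf.fit(X_train, y_train)
print("AUROC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))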