jushijun@gmail.com ⋄ Toronto, ON ⋄ linkedin.com/in/shijunju ⋄ medium.com/@jushijun
SUMMARY
Results-driven Data Scientist, Data Engineer, and AI Engineer with a Ph.D. in Economics and experience in AI, machine learning, and data analysis. Proficient in Python, with a strong background in statistical modeling and AI technologies. Proven ability to conduct complex AI projects and deliver innovative solutions that drive business outcomes.
- Keywords: Data Engineer, AWS, Azure, LLM, LLMOps, RAG, Speech-to-text, Fine-tuning, LoRA, QLoRA, Economics
RELEVANT EXPERIENCE
Building a Course-Specific AI Study Assistant: Integrating RAG, AWS, GitHub CI/CD, and Docker; Canada; Mar 2025
- Developed a course-specific AI study assistant using Retrieval-Augmented Generation (RAG) to deliver tailored, real-time responses for students.
- Integrated AWS for scalable cloud deployment, implemented GitHub CI/CD for automated testing and deployment, and utilized Docker for containerization, ensuring consistency across environments.
- Showcased skills in AI integration, cloud computing, DevOps, and containerization while addressing educational challenges.
Building a Weather Data Pipeline with Apache Airflow, AWS, and Amazon RDS; Canada; Mar 2025
- Designed and implemented an automated weather data pipeline using Apache Airflow, AWS EC2, S3, and Amazon RDS (PostgreSQL).
- Orchestrated ETL workflows to extract data from the OpenWeatherMap API, transform it, and load it into a cloud database, ensuring scalability and reliability.
- Demonstrated proficiency in Python, SQL, and cloud-based data engineering, with a focus on workflow automation and data integrity.
Building an Incremental Data Pipeline with dbt, Snowflake, and Amazon S3; Canada; Mar 2025
- Configured AWS CLI and Snowflake for seamless data transfer.
- Wrote Python scripts to generate and upload realistic order data to S3.
- Implemented dbt models for data transformation and incremental loading.
- Validated the pipeline's ability to handle both new and historical data correctly.
Building an End-to-End Real-Time Streaming Data Pipeline with Azure; Canada; Feb 2025
- Designed and implemented an end-to-end real-time streaming data pipeline on Azure to ingest, process, and visualize weather data, utilizing services like Databricks, Functions, Event Hubs, Key Vault, and Fabric.
- Optimized costs through architectural decisions and ensured security using Key Vault for secret management.
- Delivered interactive Power BI reports with real-time updates and configured alerts for weather conditions.
KaggleX Fellowship: LLM Fine-tuning, PyTorch, Keras; Remote; Sep 2024 – Dec 2024 (Presentation Video, PPT with links to code and models, Project Details)
- Selected as one of 81 fellowship recipients from over 3,000 applicants for the KaggleX fellowship program.
- Final project, ReguGuard AI, was selected as one of 16 final showcases at the end-of-program celebration.
- Developed ReguGuard AI, a chatbot for financial risk compliance, by fine-tuning large language models (Gemma-2b-en and Gemma-7b) with LoRA and QLoRA on GPU and TPU.
- Achieved 78.6% accuracy with the fine-tuned model.
- Adapted LlamaIndex's RAFT DatasetPack module and used the OpenAI API to generate over 14,000 question-answer pairs for training.
- Conducted extensive testing on model parameters, enhancing understanding of machine learning performance metrics.
Skinopathy: Student AI Researcher; Speech-to-text model Fine-tuning, PyTorch, QLoRA; Remote; Sep 2024 – Dec 2024
- Fine-tuned the OpenAI Whisper Large-v3 Turbo model to improve audio recognition of dermatology terminology using QLoRA.
- Achieved a 33% reduction in Word Error Rate (WER), lowering it from 14.5% to 9.5%.
- Cleaned and processed over 100 dermatology documents, which were used to synthesize more than 780 English audio samples with varied accents and genders using the Google Text-to-Speech API for training purposes.
Data Engineering Projects: Microsoft Fabric, Azure Data Factory; Canada; Aug 2024
- Optimized an end-to-end data engineering pipeline leveraging Azure services (Medium Article):
  - Designed and implemented ETL pipelines to ingest on-premises SQL Server data into Azure Data Lake via Azure Data Factory.
  - Performed data transformations from the Bronze to the Gold layer with Azure Databricks.
  - Loaded transformed data into Azure Synapse Analytics.
  - Developed interactive dashboards and reports using Power BI.
- Developed a sentiment analysis pipeline leveraging Microsoft Fabric (Medium Article):
  - Orchestrated data ingestion from the Bing API using Data Factory and stored the data in OneLake.
  - Performed transformations with Synapse Data Engineering.
  - Applied machine learning models for sentiment analysis via Synapse Data Science.
  - Created interactive visualizations and reports using Power BI.
AI Projects: AutoML, LLM, NLP; Canada; May 2024 – Jul 2024
- Ranked in the Top 1% (15th out of 1,847) in the 2024 KaggleX Skill Assessment Challenge, showcasing exceptional data science skills.
- Successfully combined AutoML frameworks (AutoGluon, LightAutoML) to predict used car prices, demonstrating proficiency in automated machine learning techniques.
- Executed comprehensive topic modeling on 3,000+ multiple-choice questions using BERTopic, providing insights into educational content trends.
3auk.com, LiuXueWangXiao.com: Founder; China; Jun 2021 – Aug 2023
- Founded and led an online learning platform, enhancing user experience through data-driven design and interactive features.
- Spearheaded the development of educational tools, fostering engagement and improving learning outcomes.
Educational Applications: Programmer; JS, Unity(C#); China; Jun 2019 – Aug 2021
- Designed and implemented a Virtual Reality Classroom, integrating immersive technology to enhance educational experiences.
- Developed cloud-based applications that personalized learning and improved efficiency, showcasing strong programming and analytical skills.
Web Scraping and Data Collection: Programmer; China; Jan 2020 – Feb 2020
- Utilized Python (Scrapy) and MongoDB for data collection and transformation, demonstrating expertise in data engineering and management.
TECHNICAL SKILLS
- Programming Languages: Python, PySpark, R, SQL, SAS, LaTeX, JavaScript
- Data Analytics / AI Tools: AutoML (AutoGluon, LightAutoML, MLJAR), PyTorch, Keras, LoRA, QLoRA, BERTopic
- Big Data, Cloud Technologies: Azure (AI Services, Data Factory, Data Lake, Databricks, Synapse Analytics, OpenAI, Power BI), AWS Machine Learning
EDUCATION
- Graduate Certificate in Artificial Intelligence, Georgian College, Canada - Dec 2024 (expected)
- Graduate Certificate in Marketing Analytics, Centennial College, Canada - May 2024
- Ph.D. in Economics, University of Pittsburgh, United States - 2015
- B.A. with Honorary M.A. in Economics, University of Edinburgh, Scotland - Jun 2006
PROFESSIONAL CERTIFICATIONS & TRAINING
- Passed all three levels of Chartered Financial Analyst (CFA) exams
LANGUAGES
- English (fluent), Mandarin Chinese (native)