CV
EDUCATION
Columbia University, New York, NY
Master of Science, Data Science, Expected Dec 2024
University of Liverpool, Liverpool, England
Bachelor of Science, Mathematics with Finance, Oct 2021 - Jun 2023
- GPA: 3.85/4.0 (Rank: Top 1%)
Xi’an Jiaotong-Liverpool University, Suzhou, China
Bachelor of Economics, Financial Mathematics, Sep 2019 - Jun 2023
SKILLS
Languages & Tools: Python (Pandas, NumPy, Seaborn, Matplotlib, PySpark), SQL, C++, R (Tidyverse, ggplot2), MATLAB, Git
Machine & Deep Learning Frameworks: scikit-learn, PyTorch, TensorFlow, spaCy, Keras, Hadoop, Hugging Face
Cloud Computing: Amazon Web Services (SageMaker, Athena, etc.), Azure, Google Cloud Platform (GCP), Unix
GenAI Related: Faiss, ChromaDB, GPT API, LlamaIndex, LangChain, Streamlit
Collaboration & Workflow: Agile, Confluence, Jira
PROFESSIONAL EXPERIENCE
Trepp, Inc., New York, NY
Data Science Intern
May - Aug 2024
- Developed a chatbot application integrating PDFs and text data, using AWS and Python for end-to-end implementation
- Explored advanced techniques for extracting content from unstructured prospectus PDFs, conducting exploratory and statistical analyses on large datasets
- Leveraged advanced prompt engineering and the GPT-4o mini Batch API to clean unstructured in-house credit story text, achieving 95% accuracy in data quality and reducing cost by 90%
- Developed and implemented a RAG model to generate CMBS-specific answers with ChromaDB on AWS SageMaker
- Researched and developed experimental evaluation datasets for the RAG application with the help of the GPT API, paving the way for future evaluation datasets in the industry
- Enhanced the RAG model with techniques such as metadata tagging and filtering and post-retrieval reranking, increasing domain-specific retrieval accuracy from 60% to 90%
Shengang Securities Co., Ltd (AUM: $20B), Shanghai, China
Data Science Intern
Jun - Sep 2023
- Developed a Python pipeline to analyze 60+ mutual funds, retrieving 9 performance metrics with SQL, and scored funds to enhance portfolio managers’ decision efficiency by 50%
- Utilized historical financial data to refine trading strategies, reducing risks for the Asset Management Department
- Preprocessed 6k+ daily rebar prices spanning 15 years, along with large financial news datasets, using the Wind API
- Developed and backtested SMA crossover strategies combined with sentiment analysis, using Python and an LLM API, reducing drawdown rates by 30% on a hypothetical $1M asset
- Visualized backtesting results with Matplotlib line plots and assessed the strategies' efficiency through Monte Carlo simulations
Harvard University, Remote
Research Assistant (Funded by Microsoft Research)
May 2022 - Mar 2023
- Aided in exploring discriminative language models’ (DLM) potential in low-resource biomedical scenarios
- Conducted probe studies on DLMs’ zero-shot performance in biomedical tasks using prompt tuning, and recommended continual pretraining with domain vocabulary to the Microsoft NLP group for BioDLM development
- Validated the new model’s superior few-shot accuracy, indicating potential savings of millions in data costs for the low-resource biomedical industry
- Deployed in-house LLMs with PyTorch and HuggingFace on Azure VMware, tracking progress through GitHub
- Co-authored a publication at the Association for Computational Linguistics (ACL), a top-tier NLP conference
Xi’an Jiaotong-Liverpool University, Suzhou, China
Research Assistant
Jan - Sep 2021
- Built a pipeline to make binary classification predictions about insurance purchases, leading to a potential 2M profit increase
- Preprocessed 50k+ product entries and performed EDA on 10+ features (e.g. marital status, holding duration) with Matplotlib and Pandas to assess skewness and normality and to remove possible outliers
- Led the team in building models, including undersampling the imbalanced training data, feature engineering, and tuning 4 models: Logistic Regression, Decision Tree, Bagging, and Random Forest
- Evaluated the 4 models via confusion matrix and F1 score; Logistic Regression performed best with an F1 of 0.697, outpacing the others by 16%