Big Data Analytics for Healthcare

3.69 / 5 rating | 4.11 / 5 difficulty | 27.24 hrs / week

Quick Facts and Resources

Name: Big Data Analytics for Healthcare
Listed As: CSE-6250
Credit Hours: 3
Available to: CS and AN students
Description: Big data systems, scalable machine learning algorithms, health analytic applications, electronic health records.
Syllabus: Syllabus
Textbooks: No textbooks found.

d3eVBAwQdb8Rfx/qdWkkyQ== | 2025-05-13T18:52:48Z | spring 2025
Quite easy to get an A, very easy to get a B. Homework code is graded using an autograder, so you can just resubmit until correct.

Definitely a lower workload than previous reviews suggest, but very front loaded (the final paper was more chill). If you don't come in with strong coding experience, you will struggle on the HW because the course offers no guidance. The lectures only discuss theory, and the HW are skeleton code you have to figure out. I felt like the HW was just me bashing my keyboard until the autograder passed.

Horribly organized course. Instructions for almost everything were unclear, and TAs were unresponsive. The final felt more like random trivia than an evaluation of our understanding. There's no guidance on how to study, and it included topics from the "optional" labs and several things that were never explicitly covered. The other assignments are graded pretty leniently, so try to get 100s on all the homeworks to have some leeway here.

The assignments were:
1. Data ETL and prediction in Python
2. Data ETL and logistic regression in PySpark (including some calculus to derive formulas)
3. Rule-based and clustering methods for diabetes phenotyping in PySpark
4. Deep learning for mortality prediction (MLPs, CNNs, and RNNs), including a Kaggle competition among the class.

Final project: replicating an ML paper with a teammate. If you pick a paper with a repo, this is pretty painless (but still time-consuming).

TL;DR: Poorly organized course, but easy to do well if you put in the time (or have lots of ML/DL coding experience). I got an A but feel like I learned almost nothing.
Rating: 2 / 5 | Difficulty: 3 / 5 | Workload: 13 hours / week

xCxGr03Uzsb+ePAog5Lzxw== | 2025-05-01T18:01:50Z | spring 2025
This class wasn't as useful as I thought it would be. Coming in I expected to learn a lot about Big Data and how to apply concepts to different projects but it was mostly just PySpark assignments. Exam is short but not awful if you study with the lectures.
Rating: 2 / 5 | Difficulty: 2 / 5 | Workload: 15 hours / week

fCf6jfUwtOyCfGWgLWlHUQ== | 2025-04-26T16:14:04Z | spring 2025
I believe the course content was changed a few semesters ago. The workload is now lighter, and I believe it can be managed alongside another class if you are planning that. Overall, if anybody is looking for the basics of big data tools, healthcare data concepts, and ML, this course serves as a very good introduction and a starting point for more detail. The course videos introduce each topic well. The TAs are very helpful if you are struggling with concepts. The only change I would make: if the course could be restored to its old difficulty and time commitment, it would be even better.
Rating: 5 / 5 | Difficulty: 1 / 5 | Workload: 5 hours / week

3qQmCBJOJRzqHJmRUisYTw== | 2025-01-09T15:06:27Z | fall 2024
I'd say the content is not bad honestly - it’s good, even the assignments, exam and final project are nice - there’s a lot of learning.

But a few factors made the course very disappointing for me - doing this with the Bayes course, the TA and staff involvement was the polar opposite. There have been literally zero discussions about things, instead the open questions asked by the students are left unanswered for days, sometimes never answered. And there are many errors in the hws - which the TAs don’t care to rectify even though it would be very easy to do (although not all TAs, a couple were really good). It’s like they literally want to spend the least time possible for the course.

Other cons - some of the lectures are nice and cover a huge breadth of topics that are very interesting and relevant to big data and even system design. But I feel the lectures are too shallow to cover such a wide variety of topics, some of which are really complicated concepts. For most of the ones that were new to me, I had to supplement with YouTube videos. I really like the labs, but they are outdated and totally forgotten - I really wish they would put more of a spotlight on them and improve them.

I am really triggered by how badly such a great and important course was managed and run. Like I said, I can assure everyone the quality of the topics covered is great; there’s a lot to be learned, which could be a great value add to ML, DL, etc.
Rating: 2 / 5 | Difficulty: 3 / 5 | Workload: 15 hours / week

1+UQ8mEBeh72I1+cEeM8Dg== | 2025-01-01T03:40:00Z | fall 2024
My prior experience consists of a bachelor's in CS from Georgia Tech, having taken undergrad AI, ML, and CV and 1 YoE in data engineering. This was my first course in OMSCS.

Overall, having received a high A, I felt that this course was not that difficult; however, the homework at some points felt time consuming.

Firstly, there are four homework assignments. The homework places a lot of emphasis on joins and filtering data with Python (mostly pandas and PySpark), and the last one is DL related. There are coding, calculations, and reports in each homework assignment. Some of the calculations in the later HW felt like busy work and took a while. The homeworks took roughly 10-20 hours each, and there are 2 weeks given to complete each one.

There is a wider variety of topics covered in the short lecture videos. I found that the lectures are a bit disconnected from the homeworks, as the lectures are mostly high-level information. There are also ungraded hands-on labs exploring topics like Hadoop, Scala, and DL in a provided Docker image. The final consists of multiple choice questions based on information from the lectures, so simply watching the videos and reviewing before the final should prepare you just fine.

The final project is straightforward and graded leniently. If you want to have an easy time on this, select a paper with a code repo provided.
Rating: 3 / 5 | Difficulty: 3 / 5 | Workload: 17 hours / week

1OH+fPR2qV+jaLfsUVKZdQ== | 2024-12-18T04:48:54Z | fall 2024
This course has changed a lot from what I can gather from past warnings and reviews. It's not hard, and not especially time consuming.

Generally, the most challenging part of the assignments was data wrangling, but this seemed to be of secondary importance as far as the lectures went. Assignments were reasonable; sometimes I got a few points taken off for things I wasn't quite sure were actually wrong, but I did solidly get an A overall.

The group project, as many mentioned, is also graded leniently, and more on process than results, which does seem sensible.

The final is absolute garbage, and is a mix of trivia questions from the lectures and trivia questions which may once have been in the lectures.

I think this course would do well as a lighter/one semester alternative to ML/DL with some healthcare focus. I think that MPH/Epi students could use something like this, actually.

There are some missed opportunities; I'd love to explore the semantic attributes of medical coding/ontologies, but that's not what this course is really about. Additionally, it'd be nice if the course went more practical into MLE kind of stuff. However, the course does neither of these things now, so if you've taken ML/DL already, I'm not sure what you'd get out of this, especially if you're not in health.
Rating: 3 / 5 | Difficulty: 3 / 5 | Workload: 10 hours / week

vXVdSdTLx9SqaING9vcP6g== | 2024-04-03T03:07:51Z | fall 2023
The course syllabus has changed: Scala and Hadoop were removed, and the course is now manageable in under 10 hours per week.

However, I felt like I didn't learn much. There's some simple Spark processing and a really cool Kaggle competition, but most of the content is simply data wrangling with Spark / NumPy / pandas.

The project on healthcare paper reproduction is also interesting. Choose a paper that is not too difficult and has a GitHub repo with clean code, and you should do fine.

Overall, I felt I've learnt more about Machine Learning than Big Data.
Rating: 3 / 5 | Difficulty: 3 / 5 | Workload: 8 hours / week

e0KUjdqAVcl35UfRdTIymQ== | 2023-11-09T22:12:38Z | fall 2023
My perspective is a CS student in the ML specialization taking this course. Prior to this course, I had taken ML, RL, and DL.

The bad:

First, the homework assignments were not something I was a huge fan of. They had a somewhat similar template to DL: a little bit easier, but they came with some tedious aspects. They seem to have changed the assignments this semester, and some directions were ambiguous and not well communicated, which meant spending many extra hours. I think I got something from the HW assignments given my prior background, namely getting really good at joins and filtering of data.

I think you can get more out of them if you want to. For me, they were a blur of stress and work to slash through.

The good:

The final project was the best part of the class. We replicated a research paper. I felt like I really learned a lot about some specific deep learning algorithms, data processing, natural language processing, and SQL based on the nature of our selected paper. Your results will vary based on the paper you select and the work you put into it. I felt this part of the class alone was worth doing and will have career/resume benefits for me.

The lectures were concise. Some bits weren't great: a complex topic would be thrown at you from a bird's-eye view, such that you will not learn it without a lot of your own research or prior exposure. All in all, though, I think the lectures were a good aspect of this class. A lot of them were good reviews of topics from ML/DL, and would be a decent first exposure if you didn't have that background.

The neutral:

I have not taken the final yet but based on other reviews and the format, it doesn't seem to be a major component of the class. Mainly a reason to watch the lectures at least once.

Grading:

Grading is generous if you do the work. But yes, there is a lot of work.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 25 hours / week

fSCmQXFvyfSsoe2WxkpPJA== | 2023-05-11T17:20:36Z | spring 2023
To succeed in this course, students need to have certain skills and knowledge beforehand. They should be familiar with concepts such as classification and clustering in machine learning and data mining. Proficiency in programming languages like Scala, Python, and Java is also necessary. Knowing how to work with data and understand the ETL process, including skills in SQL and NoSQL like MongoDB, is recommended.

Having these skills is important to do well in the course. Without them, it can be overwhelming, like drinking from a fire hose. The course requires students to go through lectures, understand technology, and implement what they learn on their own. The course covers medical data properties and data mining issues related to healthcare applications such as predictive modeling, computational phenotyping, and patient similarity. Students will also learn about big data analytics technology and its uses, which can also be applied in other sectors.

The course includes five homework assignments (50%), a project (25%), a final exam (20%), and participation (5%).

Homework 1: On the very first day of class, we were tasked with completing the CITI certification to ensure the utmost care and respect when handling sensitive medical data. But that was just the beginning! We delved into the exciting world of descriptive statistics, feature engineering, predictive modeling, and model validation, all leading up to the ultimate challenge: creating the best model. Personally, I had a blast and was able to complete all my schoolwork in under 20 hours. However, for those who are not well-versed in Python, sklearn, NumPy, and SQL fundamentals, it may be a challenging journey ahead.

Homework 2: Just like homework one, but with a PySpark twist! PySpark is an awesome Python interface for Apache Spark that lets us build Spark applications and analyze data using Python APIs. As I dove deeper into the project, I got to explore some really interesting concepts like RDD, Spark, and execution plans. And the best part? We didn't just stop at descriptive analysis - we also tackled feature engineering, created an svmlight-format dataset, and even implemented SGD Logistic Regression. Needless to say, this project was a standout experience!

Homework 3: We got to work with Scala, implementing rule-based phenotyping and then diving into unsupervised phenotyping using clustering with K-Means, GMM, and Streaming K-Means. I have mixed feelings about Scala; while it's type safe, debugging can be challenging. With enough practice, though, I believe anyone could become comfortable with it. Personally, I still prefer PySpark. Overall, it was a pretty cool assignment!

Homework 4: We tackled a Scala-based homework that involved some exciting graph modeling. Our task was to represent Electronic Health Record (EHR) data using the GraphX model. To accomplish this, we implemented both the Random Walk with Restart and Power Iteration algorithms.

Homework 5: Out of all the homework assignments, the one that brought me the most joy was working with PyTorch to build various models, including MLPs, CNNs, RNNs, and custom RNNs. It felt like a refresher of the deep learning class I took before, but I managed to complete it in just one week thanks to the techniques I learned there. The highlight of the task for me was calculating the trainable parameters and FLOPS for these complex models, which made the whole experience a lot of fun!

More reviews of the OMSA computational data track here: https://www.linkedin.com/pulse/georgia-tech-omsa-program-review-sid-gudiduri/
Rating: 5 / 5 | Difficulty: 3 / 5 | Workload: 12 hours / week

up31qYQKCikpIZNkgnjHYg== | 2022-12-24T18:05:51Z | fall 2022
I really enjoyed this course, as I work for a health provider and the homework was exciting. TAs were very helpful. I like the course design.

If you are trying to enroll in this class, I would suggest you pick up Python, PyTorch, Scala, and Spark. HW is weighted at 50% and is time-consuming, so don't wait until the last day to start it. I did, and I lost 30% on one homework. I got 100% on the Final Project and 16/20 on the Final exam. So, I ended up with an A :)
Rating: 5 / 5 | Difficulty: 4 / 5 | Workload: 16 hours / week

Georgia Tech Student | 2022-04-18T06:02:04Z | spring 2022
This class could have been really great. The material was interesting and the assignments were pretty cool to learn about.

Unfortunately, the teaching staff for this class was just awful. Nothing they posted was very clear, and it took way longer to understand what they wanted than it should have. There were CONSTANT bugs in the homework. For example, the autograder would throw a weird error if you were using a different package version than the one it was using. However, there was no strict package version requirement, and we were left to fend for ourselves to find the solution.

Really needs a teaching overhaul.
Rating: 2 / 5 | Difficulty: 3 / 5 | Workload: 12 hours / week

Georgia Tech Student | 2022-02-25T07:05:04Z | spring 2022
This course isn't for students with limited experience in machine learning and programming, because it covers a ton of material. It is not possible for people like that to figure it all out within a reasonable time. As an experienced data analyst with solid SQL practice and some experience with Python from other courses, though, I found this a useful course.

Also, the final project is to reproduce a recent paper. I think you will need a good understanding of deep learning to do so, as deep learning is becoming increasingly hot in the field and what was covered in the class is not enough. If you happen not to enjoy deep learning, then you will struggle in the final project.

Pros:

The homework basically guides you through doing machine learning on Hadoop, Spark, and Solr with the different tools/languages supported by each platform. You will learn how working on a big data platform differs from traditional class projects.

Labs are very helpful. You can tweak the code provided in labs in most of the homework.

Cons:

It was a pain in the butt to set up the environment if you are using a new Mac, especially for HW2. I spent more than 50 hours on that homework, while for the other homeworks the time spent, including labs, was usually around 30 hours.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 30 hours / week

Georgia Tech Student | 2022-01-26T16:38:37Z | fall 2021
- BD4H is a fairly heavy course even if you have good experience with coding. Assignments start with Python, then move to Hive, Hadoop, and Scala. You have a group project, which involves a good amount of coding and writing if you do it properly.
- I typically spent 15-16 hrs a week on this course, including lectures and assignments.
- I don’t think I would have been able to manage another course with BD4H, especially in my first semester.
- But in the end, it was a great course. I learned so much about health standards, how ML is being used in the healthcare industry, and how I can actually use it. And I had a great time learning so many different programming paradigms. I got an A in the end, and I feel it is achievable with decent effort.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 15 hours / week

Georgia Tech Student | 2022-01-14T22:03:31Z | spring 2022
The instructions for the homework are poorly designed. There are quite a few mistakes in them, and the TAs seem to have a difficult time understanding our questions as well. I will update as the class progresses.
Rating: 1 / 5 | Difficulty: 4 / 5 | Workload: 25 hours / week

Georgia Tech Student | 2022-01-03T13:11:01Z | fall 2021
This is my first course and it was not as hard as people made it out to be. I came into this with 1 year working after a stats & algorithm bachelor (programming-related courses include 3 using Java and 3 using R). Did not do any prep, but I had half a year of fiddling-with-Python experience to build predictive models, and a tiny bit of Spark knowledge through a 3-week project.
- The assignments were not hard. As I'm not an experienced programmer, imho the most important prerequisite is the ability to read code and think logically. The skeleton code is extremely well-structured, and reading into the provided tests helped a great deal. The sunlab materials are also great. I only learnt Scala from there, but with the provided fundamentals, I'm quite confident I could proceed if Scala is needed for any future project.
- The grading is rather generous, especially for the project. As long as the process was explained well and neat comments were included in the code, you are good to go. Got an A in the end even though my final exam score was 23/30.
- The TA actually helped. I had the most difficulties with the deep learning assignment and failed some test on Gradescope while passing locally. I emailed the TA but afterwards did not want to go through all the trouble to have it regraded. The TA actually told me that if it was him/her, he/she'd give it a shot. Proceeded and got 13 points extra!
- The lectures were not great, but illustrative, clear and fun enough. I like the comparison of Bagging as a Japanese car and Boosting as the Italian sport car.
- The pace is not too harsh. My job has been relaxed in the last few months, so I have had quite some free time. There are 3 weeks for the 4th assignment, which is too long; that extra time would be better spent on the 5th one.
In general, I don't see myself applying most of the knowledge from the course anytime soon, but I actually think it is a good first course: a deeper understanding of the core of all this hyped big data stuff, coding experience, teamwork experience, and a heated-up studying rhythm to crawl through the rest of OMSCS.
Rating: 4 / 5 | Difficulty: 3 / 5 | Workload: 10 hours / week

Georgia Tech Student | 2021-12-22T01:13:16Z | fall 2021
To SURVIVE this class, come in with these 4 criteria, all completed before starting.
- Either ML (OMSCS) or CDA (OMSA). On paper there is no need to, but the lectures are so dry and sleep-inducing that you will wish you had such a background.
- Scala Functional Programming in IntelliJ and/or the free chapters of Hands-on Scala Programming by Li Haoyi. Those eagle-eyed Asians amongst the community would know that the author of the latter is the son of the current Prime Minister of Singapore, who himself was the Senior Wrangler of Cambridge. So, legit stuff.
- The in-house, specially curated Sunlabs. Your homeworks are based on them; there is no way you can run from them. At least it's hands-on here so you won't fall asleep. This includes how to set up your own Docker environment, which you will NEED.
- DL, or HDDA, or a healthcare professional background. I put this as any one of them because if you don't have at least one of these, you're gonna be playing a huge game of catch-up overall.
If you lack any of these points, you will regret it, instantly.

You will know it when the semester starts, right when HW1 is released: you will wish that you had prepped during
- the late Summer - in the case of the Fall session; or
- the few weeks of Christmas - in the case of Spring session.
Obviously if you pair this with another course while working full-time, RIP to your social life.

So, you have been warned.
- Lectures are dry and sleep-inducing. I have repeated that countless times, hopefully that sticks into your head.
- The homeworks take you way deeper than the lecture content. Start them early, like literally on the day they are released.
But why did I like this class?
- Cause only the strong-willed ones take it and you can network with the cream of the crop pretty well.
- Even though it takes a hell load of time, the efforts which you have contributed to your own project are way worth it, and could even be cited by other people in the future.
How to survive the exams?
- Use the expertise of your teammates (they are not just here for the projects, y'know?) to find out what the generally asked quiz questions in each domain are.
- Hopefully, one is good at general ML knowledge, another is great at Big Data tech like RDD/Scala, and another is good at DL.
- Help one another by framing questions that are within the confines of the syllabus content.
- You can be sure that at least some such questions will appear on the 20% final exam.
Rating: 5 / 5 | Difficulty: 3 / 5 | Workload: 100 hours / week

Georgia Tech Student | 2021-12-21T02:25:37Z | fall 2021
This is a review for people with experience (especially work experience) with both Machine Learning and Spark (especially with the Scala API)

This is not a difficult or time consuming course if you have several years of hands on Spark data science work. It is still worth taking though to fill in gaps.

The good
- Exposure to GraphX is interesting
- Fairly complete overview of big data tools
- Tons of hands on programming
- Projects are interesting enough to motivate learning the tools
- Scope is so broad that depth suffers (in particular treatment of supervised and unsupervised learning methods is minimal)
- Provides some interesting historical context
- Health focus is not too obtrusive
The bad
- Lots of the technologies are outdated. Why are we using the RDD API for Spark? Why are we not using Pyspark at all, especially as it gets closer to feature parity with Scala? Why are we using an ancient pre 1.0 version of PyTorch?
- Group project
- Final exam is trivial to the point of irrelevance
Rating: 4 / 5 | Difficulty: 3 / 5 | Workload: 8 hours / week

Georgia Tech Student | 2021-12-03T21:32:42Z | fall 2021
The course gives a broad survey of many topics in big data, healthcare, machine learning, and deep learning. You will get the chance to practice Scala, and I find it useful to learn since I am in the data analytics field. The homework is not hard, but it is very lengthy. Be prepared to spend a lot of hours to finish the work. It does help you practice Scala, Spark, and Hadoop.

It might be quite an easy course for classmates who have worked with Spark/Hadoop for a very long time.

The project is fun, probably because I got a good team and we were all in the same time zone.

All in all I like the course. It is not hard, but it is a lot of work and takes a lot of time.

If you are doing only one course per semester, you can spend time deep diving into each area each week.

If you prefer to take two courses, it is doable as well. I did this course together with HDDA, so I could only spend 1 week on each course to ensure I submitted all HW on time.
Rating: 5 / 5 | Difficulty: 3 / 5 | Workload: 25 hours / week

Georgia Tech Student | 2021-09-25T01:16:54Z | fall 2021
Might come back to write more, but I really, really do NOT LIKE this course. Lectures are dry (not helpful for homework either, very high level, not helpful for anything really), and the homework/project is not interesting. I'm taking computer networks at the same time and have to say, compared to big data, computer networks is A LOT more fun....

Really really dry course and looks useless so far. Sigh
Rating: 1 / 5 | Difficulty: 3 / 5 | Workload: 12 hours / week

Georgia Tech Student | 2021-05-09T21:27:43Z | spring 2021
For some background, I have previously taken AI4R/RAIT, RL, AI, and ML before this course, so I came in with a reasonably strong ML/python background. Plus my full time job involves quite a bit of SQL. I mention this as your perceived difficulty in this course is heavily dependent on your prior experience with the technologies.

The course consists of five homework assignments, a research project, and a final exam, covering a number of technologies: python/pandas, scikit, hadoop, pig, spark, scala, and pytorch.

Homework 1: If you've taken CS7641, this will be a breeze, I finished it in a day. It's basically A1-lite, or a brief dive into supervised learning using pandas and scikit. If you haven't taken 7641, you are being thrown a lot to deal with right away and if your python is weak, you'll struggle. I did not use the provided docker environment for this, just Pycharm and I felt that was easier. This one uses gradescope, so it's real easy to check your work.

Homework 2: Hadoop, lots of hadooping, in many ways it's implementing HW1, but with hadoop/HIVE. Hadoop is not a particularly pleasant language to work in, but useful. To get a headstart do the sun labs.

Homework 3: Scala phenotyping and GMM/K-Means. Scala sucks, there's no real way around it. You really should start this one early and if you can take a scala prep class ("Big Data Analysis with Scala and Spark" on Coursera is good), you'll be thankful for it after.

Homework 4: More Scala! Lots of graph theory, but... Scala sucks. More or less the same tips as HW3, but also the provided unit tests ARE NOT ENOUGH. You should make your own to verify your results with.

Homework 5: More pandas, but now with PyTorch. I really enjoyed this one. You cover MLPs, CNNs, RNNs, and do some prediction with them. You don't need a GPU, but it helps. One tip: when the assignment is finally graded, the models are tested against hidden score thresholds in Gradescope, and these are much higher than what you can see in Gradescope before the due date. Train your model until it's getting pretty good percentages.

Final Project: Get yourself a good team and start early. Also, if you take on something that requires training a model, hope someone has a strong GPU or set up a Google Colab. Fortunately I had a 3080, but I was still running tests that took 10 hours on it. Follow the requirements given and you'll be fine.

Final Exam: This kind of sucked. It was not at all based on the homeworks, but just on the lectures. So really covering material you have not done any work on, and at times it felt like a vocab quiz. If they want to cover the material on the exam, the homeworks should reflect the material, or they should make the exam based on the homeworks. I think they wanted to make the lectures not totally ignorable, but... I think they failed in that regard that now the lectures are just something you need to watch and memorize everything for the exam.

Things I really liked:
- Covering a number of technologies I've never used before, Hadoop, Scala, PyTorch, it was nice to have a chance to try them
- The Final Project was a lot of fun to try new things and work on a problem without strict guidance, lots of room to explore
Things that could be improved:
- For the most part, the lectures are not really that useful for completing the homeworks/project. The only reason you really need to watch them is for the final and the occasional problem on the homework.
- Scala could probably be replaced now. I think PySpark is generally preferred to Scala these days, and if the course used it instead, it would both be more useful and allow for Gradescope grading, which would make it much easier to grade.
- The final needs to be more relevant to the homeworks, either by adding lecture content to the homeworks, or changing the exam to be based on the homeworks. It sucks that the lectures are not that useful to the class, but... the solution then is to make new lectures that are useful.
Tips to do well:
- As much as you can, go through the sunlabs in advance. They're all public and help out a ton with the homeworks. Figure out the docker env if you haven't used docker before
- Start all the homeworks early, some of them take a long time to figure out
- If you can, have a strong GPU for HW5/final project. Not a requirement, but it helps.
- For the final, the best I can say is rewatch the lectures and take notes. Everything is fair game for the final.
- Your participation is based on piazza's statistics (go to View Statistics on the top of Piazza to see yours). Add Piazza to your list of daily links to check, view all the posts, and post on things and you'll get full credit. If you have a question, someone else does too probably.
I can't speak to what technologies are most useful in healthcare or big data tech, but it felt like this course could use a refresher to bring it up to technologies used more frequently today. Regardless, you will learn plenty in this class and I would say it's a very worthwhile course to anyone who wants to actually implement ML on something more practical than random problems.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 20 hours / week

Georgia Tech Student | 2021-05-08T07:25:12Z | spring 2021
This course was hard. I am a health scientist but came into it with no Hive, Pig, or Spark experience and no Hadoop experience. It was hard because I had to learn the languages in a short amount of time. In one assignment I had to implement the Google PageRank algorithm with health data in Scala and GraphX in 9 days (this is not an easy thing to do, especially if you are new to the algorithm and the language). This course pretty much consumed my entire life for the semester.

Nevertheless, I learned a lot. When I look at Data Science job postings I can now say I have some training/experience in Hadoop, Spark, Hive, Pig, PySpark, and Scala.

But this is the strange thing. These applications seem critical for industry, so why is there not a dedicated course in the OMS program that teaches them? I had to learn about these in Data Visualization (through PySpark) and Big Data for Health (which is an elective). And maybe many students don't take the Big Data for Health course because they think it is about health. But it is not. If they don't take it, then how do they learn about Big Data applications like Spark, Hadoop, etc.? In this course, the health part relates only to the fact that you work with health data (and some of the online lectures speak about health).

It seems strategic to rename this course to "Concepts and Applications in Big Data" or something similar. It could be useful to remove the Health part of the title so as to appeal to a wider range of students. To me, it seems like you would need some training in Hadoop, Spark, Scala, Hive/Pig, etc. to be competitive on the job market.
Rating: 4 / 5Difficulty: 5 / 5Workload: 30 hours / week
Georgia Tech Student2021-05-02T06:20:51Zspring 2021
This is my first course in the program - well because everything else was full except for this one.

The course content overall is good - I really learnt a lot from the course. I came in as a data scientist who had spent 2 years in the industry, knowing almost all the machine learning algorithms mentioned in class, but still found myself in a rush doing homework, the final project and the final exam. The graph theory part was my favorite, yet the most hated, section of the course. Graph theory is an absolute must-learn for data scientists like me, but WHY IN SCALA? Saying Scala is the worst language in this world is an understatement. Who is still using Scala these days?

The communication/interaction/organization is pretty bad. TA answers the questions on piazza, but that support pretty much felt non-existent. Grade distribution is pretty bad - a 30% final exam that lies right in between the week when we finish our Homework 5 and final project. It just felt like we were finishing the project in 1 week, fighting for a 15% grade with an effort that warrants a 50% grade.

If this course got rid of Scala and that superfluous final exam, it would be a great course IMO.
Rating: 2 / 5Difficulty: 3 / 5Workload: 30 hours / week
Georgia Tech Student2021-01-29T05:51:44Zfall 2020
This was my first OMSCS class. This was a very time-consuming class. There was an assignment every two weeks, so you'd alternate between a light week and a heavy-load week, and that averages out to about 30 hours per week. Expect more hours in the week before the final project is due.

You will pull through if you spend a lot of time on weekends reading through every Piazza post and debugging code. However, I feel like I spent most of my time debugging instead of learning. For example, a good chunk of time in homework 1 was spent reading pandas documentation and getting the right table format in a pandas DataFrame. I was doing all the individual assignments by myself; maybe by working with someone you know you can save a lot of time.

I would say: only take this if you're interested in ML and data science and plan to go into that field. It will make you appreciate how easy it is to mess up your analysis and model, so that'd help you be a better data scientist in the future. If you do take the class, don't try to take another class. Just this one.

I didn't plan to go into data science and I took this because I already had some experience in ML and data science (Python, some statistics and ML understanding) from my bioinformatics master's, and I thought it'd be a good warmup class. Now I know I definitely don't want to be in the field :)

Thinking back, the main thing I got out of the class was understanding neural networks better (though I probably won't use the knowledge), when I used to only know them on a more superficial level. I also enjoyed coding in Scala, since the way that language works was a new concept to me. Other than that, as a software engineer I don't think I could reuse a lot of what I learned in the class.

At least succeeding in this class made me more confident taking two classes at a time in the following semester (now).
Rating: 3 / 5Difficulty: 4 / 5Workload: 30 hours / week
Georgia Tech Student2021-01-13T15:38:11Zfall 2020
In the end, I can't say I know much about either big data or health analytics. You touch many technologies at a very superficial level. Some practical tips can be read here
Rating: 2 / 5Difficulty: 3 / 5Workload: 12 hours / week
Georgia Tech Student2020-12-15T20:12:09Zfall 2020
This is a mix between a survey and a practical course, with emphasis on health-informatics techniques using big data tools. The lecture material was a touch light (in coverage of the field) but covered most of the important techniques in detail. The frameworks used were still relevant as of 2020, and the reasons for choosing them were clearly articulated.

Contrary to other reviews, I think including Pig is worthwhile: as a dataflow language it represents an important framework paradigm -- it is a declarative, managed dataflow computational engine. Spark is its 'modern' replacement, but it is much more akin to a functional distributed parallel engine (with fault tolerance), i.e. it is lower level than Pig.

The real gold in this course is its coverage of two important topics: tensors, and message passing graph networks. These two topics are essential to a whole -family- of ML algorithms.

The assignments were not difficult. They were topically relevant. The final project (team-based) went well (with some of the usual hiccups you read about in team-based projects).

Overall it was an enjoyable course.

If you are in the ML specialization I would consider this course a 'strong take', simply for exposure to issues you will encounter applying ML in a big data setting.
Rating: 5 / 5Difficulty: 2 / 5Workload: 20 hours / week
Georgia Tech Student2020-12-10T13:34:45Zfall 2020
About me:
- OMSA program, 3rd semester
- Strong Python
- No prior Spark experience, not so strong with math
- Grade: A

This class is focused around how to implement big data systems. This is through frameworks like Spark or even just getting you out of the habit of using for loops. The 5 homework assignments are long and take at least 20 hours of effort each but I felt they were really effective in teaching me how to implement the concepts. The assignments mainly focus around Spark and Python. There are some basic proofs on things like stochastic gradient descent. The effort level drops off significantly after the last homework assignment is submitted.

The project was not such a big deal. I felt like the grading was very, very lenient. Frankly, my group did not accomplish much at all but we still went through the motions and got good marks. The final was kind of bogus. It asked questions like “what is the name of this platform”. This is just memorization but more than that, it really did not focus on what the rest of the class was about.

Overall, I learned a lot in this course and it was less work than I expected. It was a good mix of theory and practical application (very heavy on coding which I liked personally).
Rating: 5 / 5Difficulty: 4 / 5Workload: 20 hours / week
Georgia Tech Student2020-06-28T17:25:40Zspring 2020
Not my review; someone posted this very detailed review on the OMSA Slack and I thought it had to be copied here.

quote

Well, I'll start off with: you don't have to be in, interested in, or affiliated with healthcare to take this course. Healthcare is more or less the "use case" example for this course. Tbh, I knew little to nothing about the healthcare industry before the course, and after the course, I know little to nothing about the healthcare industry. Next, if you've ever looked at the reviews for the course on the omscentral.com website, you'll see that the course is rated as the toughest course for both OMSA and OMSCS. This can POSSIBLY be true. It actually depends on you and what skills you already have. If you're really really really good with Java, Scala, or Python AND great at learning things on the fly AND familiar with SQL AND have a solid background/understanding of the machine learning process, the course won't be "hard" for you. It'll just be time consuming. In fact the course isn't "hard" at all. It's just really time consuming.

Now, if you're lacking in 2 out of the 3 things I listed above, DO NOT DO THIS COURSE, YOU WILL FAIL. The course isn't meant to be an intro to healthcare or to specific algorithms; it's a survey of big data ecosystems and techniques. For example, the first homework alone threw Python, Hadoop MapReduce, HDFS, Pig Latin, and Hive at you all at once. You get 2 weeks for each HW. And trust me, for most of them, you will need the 2 weeks. This is because you are learning the whole Hadoop and Spark ecosystem. You will do a lot of self learning. Fortunately in my class, there was a lot of great communication on Piazza, and the TAs were quick to respond and help.

There are 5 HWs and a group project. The HWs have you basically implement, from scratch, distributed machine learning on Hadoop and Spark. Scala was used a lot. If you're familiar with Python, there are some similarities in syntax. If you're familiar with Java, Scala is abstracted Java, so you should feel at home using JARs and SBT to compile your code. Ask specific questions if you have them. But in all, I highly recommend this course if you have a sufficient background. I learned a lot of useful cool things, and it is def a resume booster. Doing anything distributed is a resume booster. But be prepared to dedicate a lot of time to it. I took it paired with MGMT 8803; thankfully 8803 was a joke and didn't require much time. Otherwise, I would recommend you take this course by itself if you're on the fence about any of the requirements I stated. If you lack the background, don't be discouraged. Just get comfortable with the end-to-end machine learning process, get very good with one of the 3 languages I mentioned, and get basic SQL knowledge. I do recommend everyone take this course. It's the route all the big companies are moving due to the sheer amount of data being generated. Gone are the days of doing meaningful data science on your laptop.

I was so inspired by the course that I built my own Hadoop and Spark cluster using Raspberry Pi 4s and external SSD cards. No matter what, it'll be time consuming. Someone ran a poll on Piazza on hours spent on the HWs for the course and I think the average was close to 30 hours a week. There were 120+ students in my class. It was mostly OMSCS students. But from your background, I'd say you'd do just fine. Also, be comfortable with the command line and with Docker. The environments were provided as a Docker container, and HDFS is pure command line. All the test scripts were command line. Submitting Spark jobs is command line. If you use Linux or bash a lot, you'll have no problems. Not saying CLI and Docker are a hard requirement. But if you are comfortable with them, that's less time you'll waste setting up your environment. That was probably the biggest gripe on Piazza: the troubles some students were having just getting their environment running. But I'd say those students were the ones who were greatly lacking in foundational skills.

One other piece of advice: find your project members early. Like by 2nd week of class early. And find good ones. Someone started a spreadsheet on Google Docs for everyone to put their name, location, email, job, technical skills and LinkedIn page, so we all could check each other out and form groups based off skills and location (timezone differences and all). Once you have your team, use them to work on the HWs. My butt was saved greatly by one of my teammates on a HW when we compared solutions and we were greatly off. I missed a complete step, and it would've cost me 30 points. They're a good sanity check to make sure you're headed down the right path. One thing I would've liked to see on that Google spreadsheet would be how many classes each student was taking this semester. Had a teammate bail on us and drop the course because he was overloaded. Had I known that, I wouldn't have even bothered with him. And had another member ghost from time to time because he was taking the 2nd hardest class with this one. He was still a big help, but it was a little annoying that he would respond only sporadically, or not at all, because he was too consumed with the other course to bother.

unquote
Rating: 4 / 5Difficulty: 3 / 5Workload: 20 hours / week
Georgia Tech Student2020-06-21T02:34:12Zspring 2020
The course has awful graders and TAs and lectures but once you look past that it is one of the few courses where you truly learn practical skills that you can carry with you to an ML job.

A few things to note. The course is nowhere near as hard as people make it out to be. I would not consider this to be a hard course as much as a very fast bootcamp. All HWs need only one solid week from start to finish.

The final project is free points for all. They don't care as long as you did something. You can spend as little as 25 hours or as much as 100 hours and still get an A either way. However, you will miss out on learning state of the art Deep Learning techniques if you just care about a grade. I spent 100 hours on the final project and learned so much but others threw in a Pytorch wrapper and trained a pretrained CNN in 10 hours and called it a day. Both of us got an A easily but I feel I learned much more.

So getting an A in this course is not hard regardless of what anyone else might say. I knew no spark, graphx, scala, pig, pytorch and needed a week for each HW. (I do however use Python, SQL, Hive daily for my job)

Having said that I took this towards the end after taking ML, AI, CV, RL etc. Not taking those earlier will make this course a lot harder.

One way to improve this course, I feel, would be to spend more time on Spark. Drop the Pig assignment (it's not really used anyway) and instead maybe focus more on deploying ML models online using Flask, Docker, Kubernetes, etc., and this course would be perfect.
Rating: 4 / 5Difficulty: 3 / 5Workload: 15 hours / week
Georgia Tech Student2020-05-20T09:31:26Zspring 2020
Great course.

Pros: Touches a wide range of topics (pandas, hadoop, scala, deep learning with pytorch).

Cons: Group project. This course is difficult and a lot of people will drop out. 2 people in my team dropped out and the final person didn't contribute. I spent many nights working on the project alone. But YMMV.

Overall great course. But brace yourself for a lot of work!!
Rating: 5 / 5Difficulty: 4 / 5Workload: 20 hours / week
Georgia Tech Student2020-05-14T19:19:22Zspring 2020
Workload: There are 5 homework assignments. I spent 60 hours on the first one and 30 hours on each of the next three. For the last one I spent some extra hours on the Kaggle competition otherwise it should be no more time consuming than hw2 to hw4. The workload for the team project can vary. Homework is reasonably interesting especially when there are two Kaggle competitions involved.

Project: The project is very helpful. It provides you with an overview of several interesting topics in the healthcare sector, including papers you should be aware of, the workflow you should follow and the structure your report should have. You can take the material and develop your own research project after this course, which I think is the main purpose of the project. The project we submitted was half-baked at best, given the time constraint in the course. If you put the project in an 8903 I think you will have time to do the project justice.

The teaching staff: Contrary to what earlier comments suggested, I found that the TA team was solid. Before the first assignment was due there was only one TA, who was obviously overwhelmed. After that other TAs started showing up on Piazza and things got back to normal. The software was somewhat up to date, e.g., we used 2.2.x for Spark while the most recent released version was 2.4.x.
Rating: 4 / 5Difficulty: 4 / 5Workload: 20 hours / week
Georgia Tech Student2020-05-11T02:27:01Zspring 2020
Wow, what a class. It was tough but you learn soo much. I felt like I actually learned skills that I could apply to a real job and real data analysis. There's an emphasis on Spark, Scala, and distributed large scale data analysis. It was tough. I had to relearn the chain rule and try to understand stochastic gradient descent, but it was quite a great course. Personally, I felt that the Andrew Ng videos were excellent in helping me understand quite a few of the concepts taught in the course.
Rating: 5 / 5Difficulty: 5 / 5Workload: 38 hours / week
Georgia Tech Student2020-04-27T19:02:49Zspring 2020
This class has pros and cons.

Pros: The TAs are extremely helpful - Our TAs were very responsive and friendly

The homework workload is pretty manageable. - I took this with another class and was afraid after reading all these reviews. Maybe I came in with a better background than most, but I think 10 hours a week (15 max) was plenty to do well and get an easy A. I would do the homeworks right when they came out, which was more fun for me: as you get close to the deadline, it seems all the answers start appearing on Piazza. That doesn't help you learn much, but if you like a high grade on every assignment, remember that!

You learn python really well and become ok in scala. Definitely should be proficient in both by the end of the class.

A great team will go a long way in this class for the final project. I loved my team!

Cons: The TAs made the class too easy at several points. I won't call out people, but let's just say we had people complain they didn't realize we had to use a big data tool on our final project... it's literally a big data class :) There were several moments like this, or people not following instructions, and then the TAs decided to not take points off. This is great from a GPA standpoint but it does feel like it diminishes the 'graduate level' part of the program.

The lecture videos are pretty irrelevant. A couple times they helped but it was very rare.

Overall, I think this class is hyped a bit too much in these reviews in terms of the workload. Is it a lot when the assignment first comes out? Sure. Is it a lot over 2 / 2.5 weeks when most of the answers in some form or another end up appearing on piazza? Nah.

Finally, some encouragement: you can totally do this class! I have full confidence anyone in this program (especially in the ML track) can do this class without going crazy. Just remember: You don't have to struggle alone. There are a ton of smart people in this class who have probably struggled on the exact same part. Reach out, make friends and you shouldn't have any major issues. You will get really good at some data science work and if you are interested in big data, being better at python, graph networks, etc. this is a great class to get some experience in them!
Rating: 3 / 5Difficulty: 2 / 5Workload: 10 hours / week
Georgia Tech Student2020-04-02T22:42:52Zspring 2020
About me: non-CS undergrad, no professional ML or SWE experience. ML specialization. This was my first class (alongside GIOS). I took the semester off of work to focus on school; I am quite confident that I wouldn't have survived if I hadn't.

Take the name of this class seriously: it is supposed to be a "big data" theory class, not a machine learning class. While the homeworks and lectures take you through a survey of different technologies, the overall goal is to make you think through things like pipeline architecture and computational complexity, using machine learning and healthcare data as a springboard for accessing those concepts. Does it succeed in those respects? Somewhat, I think. The lectures are split roughly 33/33/33 between describing learning algorithms, describing big data technologies, and discussing healthcare applications. Each homework assignment has you performing essentially the same task - ETL and some kind of ML on a health dataset - using a different set of tools, but asks you to approach the problem from slightly different angles to tease out differences in computational resource expenditure between those toolsets and contexts. There is a fair amount of lecture and assignment time devoted to working out pretty standard ML algorithms (which will probably feel unnecessary for those who have already taken standalone ML coursework), but some attention is given to thinking about how those algorithms perform at scale. However, because you are basically picking up a different programming language or framework with every assignment, I spent much of my time working out syntax, reading documentation, and experimenting with code in notebooks rather than thinking rigorously through the 'big data' components, which all tended to feel fairly straightforward, and the writeup that is submitted alongside the code tended to be something I pulled together at the end.

Personally, I actually enjoyed this. I was happy to get more Python practice, and while I'm certainly not ready to apply Hadoop/Spark at a professional level, I now know a lot more about the ecosystem than I did before. While the lectures were a little slow, I think that watching them on 2x speed and learning what you can afford to skip yields a pretty informative experience. The labs were also extremely helpful, often showing you step-by-step instructions for completing significant chunks of the homework, although they could use a proofread (I may be more pedantic about this than most). In spite of all of that, I feel comfortable saying that the homework is hard, and certainly time-consuming. A component (maybe 25%) of each assignment involves applying or implementing some learning algorithms; I found this the least satisfying section because it was the hardest part to verify. I only lost points on this section in one assignment, but I was worried about my chances with each of my submissions. Fortunately, there is an autograder covering the non-ML elements for HW1 and HW5.

I also like the project. I am not sure if all students were aware, but the TAs all recorded lectures describing each of the topic areas, and it was pretty easy to use that material to find a good research area and get working. Obviously, your experience probably depends a lot on your team (I liked mine), but it seems like the grading is pretty reasonable, so as long as you don't blow it off, you will come out okay. I'm not really sure what other teams did, but my project was much less about "big data" (the ETL was fairly trivial) and was more focused on deep learning, since that's what my team was interested in. I appreciated that level of flexibility.

The test was canceled for me thanks to COVID19; I'm a little relieved, because that was definitely the big question-mark in my mind. It's not altogether clear to me what would be tested or how, since the lectures are fairly high-level and the assignments are, mostly, pretty low-level/applied.

Lastly, I'll mention that before the semester started, I sat down and watched like 3 hours of YouTube videos about Docker, and then spent some time getting the class environment (posted on the course website) set up so that I didn't have to worry about it while class was in session, and I'm glad that I did.
Rating: 4 / 5Difficulty: 4 / 5Workload: 25 hours / week
Georgia Tech Student2019-12-16T16:01:51Zfall 2019
This was a tough class, but it was fair in my opinion. Definitely get started on things as soon as you can as they usually take longer than expected. The assignments are released on a schedule so you cannot work ahead. TAs are good. Lectures were pretty good though you definitely learn more from the projects and the labs. Note you can look at labs now at http://www.sunlab.org/teaching/cse6250/fall2019/env/ The group project was interesting.

The one part I did not love is they tried an exam this semester. Some of the questions were good, but some seemed more like trivia where you could narrow it down to two options by knowing the material but then just had to guess.

Small amount of extra credit offered for placing high in the in-class Kaggle competitions.
Rating: 4 / 5Difficulty: 4 / 5Workload: 20 hours / week
Georgia Tech Student2019-12-06T16:03:37Zfall 2019
BD4H is a beast

The primary aspects of its beasthood are:
1. Computing environments (Docker, HDFS, local machine)
2. Plethora of languages (Pig, Grunt, SQL, Spark, Python, Scala)
3. Quantity of homeworks (every two weeks)
Video lectures were very much in the "CSE-style" that I've seen before, top-level and conceptual, providing an idea of "what" and "why" but hardly ever "how" to do something.

Homeworks are every 2 weeks through about 2/3 of the term, then the focus shifts to the project.

Project is fairly open-ended, with some structure provided. Make your own teams.

Final exam was new for the course. ProctorTrack, 60 minutes, 30 multiple-choice questions, closed everything (one sheet of blank paper allowed, but no help there). Mean is hovering around 22/30 and a lot of grumbling from students.

Piazza was a bit of a mess (1500+ questions/notes from 347 students), though responses were fairly timely.

I paired this with ISYE 8803 High Dimensional Data Analysis, which was a mistake. I recommend doing CSE 6250 as a lone offering or pairing with a lighter course.
Rating: 4 / 5Difficulty: 5 / 5Workload: 20 hours / week
Georgia Tech Student2019-11-06T14:06:58Zspring 2018
1. It would be good if the course were revised to use more up-to-date versions of the tools, e.g. Spark 2 with DataFrames instead of Spark 1 with RDDs.
2. I am lucky that I took the course early; I heard that they have now even added an exam to it. The workload seems to have gone from very heavy to insane....
3. This course is famous for its workload and difficulty. Make sure that you are well prepared before taking it, preferably in the middle or later half of the program.
4. The group project is for groups of at most 4 people. It definitely helps if you can form a reliable group in advance.
5 individual assignments + 1 group project.
Rating: 4 / 5Difficulty: 4 / 5Workload: 30 hours / week
Georgia Tech Student2019-09-28T13:56:15Zfall 2019
Context:
- 3 years experience as a data scientist
- Strong background applying and teaching machine learning
- Strong background in Python and moderate experience with big data tools
Pros
- Lots of technologies covered. Each assignment introduces 1-2 new technologies
Cons
- Lectures are only surface level and useless if you know anything about data science or machine learning.
- The labs are the only place where you learn to use the technologies, but they are outdated. Moreover, the techniques are deprecated, even in the environment you are given to use.
- The environment you are expected to use is also outdated, making troubleshooting a challenge.
- The homeworks force you to code in a cookie cutter skeleton code, which forces you to perform things that invalidate your models just so you can meet test criteria (i.e. fitting and scaling train/test sets independently).
I came into this class expecting to learn how to use a bunch of Big Data tools. I was not wrong, but I was not expecting such an inefficient system.

The material of the class is not challenging. It's very simple conceptually, but don't expect to learn any implementation from the lectures. The labs are supposed to teach you the technologies, but the techniques used in them are deprecated, even in the environment they give you to use. If you use Google to learn the current best practices, there is no guarantee the environment they give you supports those operations.

Suggestions to the staff
- Keep the environment for the class up to date
- Scrap the high level ML lectures, and teach the technology theory instead
- Keep the labs up to date with technology implementation. There is no reason the labs should contain techniques that are deprecated in the configured environment.
- If you are going to include machine learning as part of the class, make sure the homeworks don't violate simple ML principles, e.g. by rescaling your test data based on the test values.
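
To make the scaling complaint concrete, here is a minimal sketch in plain Python of the usual practice: learn the scaling statistics from the training set only and reuse them on the test set. The helper names and the numbers are invented for illustration; this is not the course's skeleton code.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and standard deviation from the given values."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if the values are constant
    return mu, sigma


def transform(values, mu, sigma):
    """Standardize values with statistics that were learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Made-up feature values for illustration.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 7.0]

# Recommended: fit on the training data, then apply the same statistics to both sets.
mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)

# Anti-pattern: fitting a second scaler on the test data. The test values are
# re-centered around zero, so the model never sees that they lie above the
# training range.
test_mu, test_sigma = fit_scaler(test)
test_scaled_independently = transform(test, test_mu, test_sigma)
```

With the training statistics, the test values land well above the scaled training range, as they should. Scaled independently, they collapse to values centered on zero and look like ordinary training points, which is why evaluation numbers obtained that way are hard to trust.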
Rating: 1 / 5Difficulty: 5 / 5Workload: 30 hours / week
Georgia Tech Student2019-05-10T17:43:05Zspring 2019
If you read the reviews, the feedback is pretty spot on. Therefore I will not add to it. However, I will list the reasons I loved this class. First, this is a hands-on class. You have to learn new things and program with them. I loved that approach. Second, I learnt a lot of new things. Even where there was some theory, we had to implement it. That added a new dimension to the learning. Third, the project was awesome. I had a very good team, but again, generally OMSCS students are all very smart (exceptions are always there) and want to learn/study. Therefore it was a very engaging project.
Rating: 5 / 5Difficulty: 5 / 5Workload: 30 hours / week
Georgia Tech Student2019-05-08T03:12:28Zspring 2019
This class was a firehose. Most of my learning and time spent was through completing the homework rather than through the lectures. A successful student would do the following 2 things:
1. Start homework immediately when it is released and work on it every waking hour until complete -- do not wait until last weekend or it's too late
2. Read every post and reply on Piazza -- there are often hints or ideas that can save hours and hours of troubleshooting
I did each of the above, and I was still awake past midnight each time an assignment was due.

This was an amazing first class for me as it gave me the breadth of experience with topics I was hoping to learn:
- big data technologies like Spark, Hive, Pig, and Hadoop
- machine learning and pandas
- deep learning with neural networks
Overall, it was a great experience, and I would recommend the class for someone with both the time and the interest.
Rating: 5 / 5Difficulty: 5 / 5Workload: 25 hours / week
Georgia Tech Student2019-05-06T05:17:55Zspring 2019
Spring 2019 just finished and I received an A in this class. This was my first semester in OMSCS and I decided to take this class by itself after reading previous reviews and I believe I made the right decision.

I am working full-time as an engineer for my job, I have a decent coding and database background, but I have never worked with Scala, Hadoop, Hive, and pig.

My job was awesome in how flexible they were about letting me work from home when I needed to finish some assignments, particularly the final project that I did.

I will be straightforward: If you aren't familiar with python, scala, SQL, and pyspark, you will most likely have a hard time in the areas you don't know, however if you are already working as a data scientist/ML engineer, or familiar working in these languages, you'll most likely manage just fine.

There are labs which you need to cover on your own time, I highly recommend you go through all of the labs because that is where you will find code samples and examples on how to execute certain functions which should give you an idea on how to attempt to do the assignments.

I really struggled on the Scala portion of the assignments because I had never worked with that language before. I really disliked having to learn a new language while also trying to figure out how to create a ML model in that language. The Python portions were difficult but manageable because I always had an idea of what I needed to do to get the work completed, and I enjoyed them the most because I really feel like I learned and improved my knowledge and skills. The Scala assignments, though, had me feeling like I was running around like a headless chicken at times, desperately searching Google and Stack Overflow for answers on error messages and the correct syntax to call a function. I would be up burning the midnight oil, mindlessly copying and pasting snippets of how to do a table join in Scala, hoping that it would work. I don't think I learned anything on these assignments, at least code-wise. I understood the concepts already, but if I had to write a program in Scala, I'd be totally lost and need to start from scratch. I would imagine the same would have happened to me with Python/SQL if I had not already done a lot of work in them.

Despite that, I still really enjoyed this class. If you are interested in data science/machine learning, this is a great class to take. You will get the opportunity to do assignments that feel like they have real meat to them, and you'll gain skills you can easily take back to your professional field, or that may prepare you for your next job. There was a lot of hacking away at code and testing to get things working, and it felt very similar to doing real work on the job because you are working with real medical data for this class.

A few notes I will make about the class:
1. Piazza was the go-to and very active with TAs very engaged. If you want to succeed in the class, I recommend you stay active on Piazza and try not to fall behind. Start the assignments early so you can ask your questions early or you will regret it...
2. I would not recommend this as a first class (which is what I did), nor would I recommend taking it in tandem with another class. The workload was really exhausting. I think good pre-reqs for this class would be Machine Learning and ML4T, but unfortunately both were completely full by the time my registration opened, so I took the first class that was available to me.
3. When you create your homework assignment to submit, make sure you are able to run it end-to-end before you submit it. I had an issue with one of my assignments where the code failed without some local files because I had changed the paths for testing. This resulted in a nasty deduction on my 2nd assignment, which had me doing an uphill climb the rest of the semester.
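To illustrate the path problem in note 3, here is a minimal sketch of one way to avoid hard-coded local paths: resolve input files relative to the script itself and fail early with a clear message. The `data` directory, the file name, and the `resolve_data_path` helper are hypothetical and are not taken from the course's skeleton code.

```python
from pathlib import Path


def resolve_data_path(filename, base_dir=None):
    """Return <base_dir>/data/<filename>, raising early if the file is missing.

    base_dir defaults to the directory containing this script, so the code
    does not depend on an absolute path that exists only on one machine.
    The "data" folder name is an assumption made for this example.
    """
    if base_dir is None:
        base = Path(__file__).resolve().parent
    else:
        base = Path(base_dir)
    path = base / "data" / filename
    if not path.is_file():
        raise FileNotFoundError(f"expected input file is missing: {path}")
    return path
```

Running the whole submission from a fresh copy of the directory, before uploading, is what surfaces a `FileNotFoundError` like this one while it can still be fixed.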
I think I got lucky on my group project: I had very smart and engaged teammates who all contributed and did their part, and I was very satisfied with what was submitted. The only advice I can give on this is to be proactive and search for group members early, and to be honest and specific about your strengths when you are looking for team members so that others can adequately determine if they are well suited to working with you.
Rating: 4 / 5 | Difficulty: 5 / 5 | Workload: 40 hours / week
Georgia Tech Student | 2019-05-04T00:16:42Z | spring 2018
I was very excited for this course and ended up with a neutral feeling towards it.

I think the course instructions/environment setup must have been reworked relative to the earliest reviews, as I did not find that aspect to be more than a mild inconvenience early on.

This course is heavily geared towards practicing with tools: Spark, a little Hadoop/MapReduce, and ML in Python (scikit-learn, PyTorch, pandas).

It was helpful in getting used to the idea of building an ETL pipeline, and it did get me off the ground in deep learning. The application examples and project topics are really interesting and needed. The TAs are great. The lectures are overviews: they will give you awareness of various data science topics, but you'll have to study on your own if you want to understand more deeply. However, beyond that, I felt like most material was basically what you would learn if you forced yourself to work 20+ hours a week on problems from O'Reilly books. That can be good or bad depending on your goals, but I found the applied focus a little less engaging after taking the less-applied (but still somewhat applied) CS7641 last fall.

The best way to describe this course would be as a sequel to DVA/CSE6242 with less D3 and more machine learning + Spark.

On a practical note, I was decent in Python but had no Scala experience and was fine. If you know some Java, that will help with Scala but overall the course does not require software development, just data munging within the context of skeleton code.

The project will determine a lot of your experience, so YMMV. If you want to really learn big data tools, make sure you pick a project with large data sets that will force you to use a cluster. The project I worked on ended up fitting in the memory of a typical laptop, which made it harder to prioritize implementing a pipeline in Spark. The second reason for this is that all homework uses small data and can be run locally. That makes sense for a number of reasons, but I think you have to feel the pain of a large dataset to really start caring about using Spark correctly on a cluster.
Rating: 4 / 5 | Difficulty: 3 / 5 | Workload: 25 hours / week
Georgia Tech Student | 2019-04-27T23:44:13Z | fall 2018
The class is pretty much a semester long overview of a bunch of different technologies used in the big data industry / academia. There are no exams, five projects each relating to a different big data technology, and one final group project. The lectures are not related to the projects at all, but they may be useful if you don't have any kind of background in machine learning or data science.

While the workload seems light on paper, the projects are not easy: each one takes a lot of time and requires a non-trivial understanding of the technology at play. I had some understanding of about 50% of the technologies coming into the class, and there were still quite a few times where I seriously considered dropping this class because the projects were so time consuming. However, if you're interested in working in data science, (IMO) this class is a must to get your feet wet.

Keep in mind that the TAs and Professor are pretty inconsistent in this class, so don't expect a lot of help over Piazza.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 20 hours / week
Georgia Tech Student | 2019-02-25T04:59:23Z | spring 2019
Really active TAs, but homework assignments are really time consuming. One could get ahead by doing all the sunlab tutorials and getting your environments set up. http://www.sunlab.org/teaching/cse6250/spring2019/env/

There's no environment to upload and test that your code runs so it's pretty frustrating when you find out your code didn't run.

Overall a good class after taking ML if you have the time.
Rating: 4 / 5 | Difficulty: 5 / 5 | Workload: 30 hours / week
Georgia Tech Student | 2019-01-12T15:54:29Z | fall 2017
This course was amazing! Contrary to what was stated in some of the reviews, the assignments were well prepared and the Docker image given really helped with the environment setup. I honestly don't remember struggling that much to set up my environment (except Spark in IntelliJ, but that is also not that bad). In this course, you will learn most of the big buzzwords in Big Data such as Hive, Pig, Spark, SparkML, and PyTorch. You will learn even more if your group project uses other frameworks (my team used a combination of PySpark, TensorFlow, and Keras).

To succeed in this course:
1. Be ready to invest a huge amount of time in the group project, which includes writing the proposal, draft, final paper, presentation slides, and videos.
2. You should be comfortable creating a quick prototype and analyzing the results directly.
3. It is beneficial to have a proactive mindset so it is easier to allocate the tasks, as you might find yourself lost if your group moves at a faster pace.
4. It is also important to decide your research objective early and stick to it. The TA grades you more on how you utilize the big data stack and less on how accurate or advanced your algorithms are (at least they did for my group). After all, it is only a 2-month research project.
Overall, I think it is one of the must-take courses in OMSCS.
Rating: 5 / 5 | Difficulty: 5 / 5 | Workload: 30 hours / week
Georgia Tech Student | 2019-01-08T15:02:51Z | spring 2017
Cool class, but there were a lot of issues with setup.
Rating: 4 / 5 | Difficulty: 5 / 5 | Workload: 30 hours / week
Georgia Tech Student | 2018-12-21T00:27:43Z | summer 2018
Overall this is a useful course for getting introduced to the data analysis area. After the course, you will have some idea of Spark and big data pipelines.

Assignments: Every assignment is tough and time-consuming, especially if you are not familiar with pandas, NumPy, and the Spark API. You will spend a lot of time on Stack Overflow.

Video: not quite useful for the course
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 20 hours / week
Georgia Tech Student | 2018-12-18T15:30:42Z | fall 2018
Please read all the other reviews carefully before you choose the course. I feel all reviews are relatively fair.

I will just generally say this course will definitely force you to learn something that is very practical, but you will learn it the hard way (I believe no matter what your background is).

Is it worth it? It depends on your own purpose and expectations.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 30 hours / week
Georgia Tech Student | 2018-12-11T07:02:40Z | fall 2018
This course is very true to what you read about it here. A lot of technologies are introduced across the homeworks, from Python, Hadoop, Spark, and Scala to deep learning in PyTorch. There are very few lectures, but there is a lab portal for environment setup and labs, which was extremely helpful. Make sure you do the labs ahead of time if possible. There are 5 homeworks and a group project.

You cannot slack in this course on homeworks AT ALL. If you do not start a homework the day you get it, you will not be able to finish on time. There is no exam, but after finishing the homeworks, people posted comments on Piazza like "Feeling like superman in the house". Two late days are allowed for all homeworks combined, so make sure to use them wisely on the later homeworks, as they are tougher.

The instructors/TAs are sometimes not present for days, but in general they were helpful in whatever way they could be. People had a lot of complaints about that, but all in all, if I look back, I enjoyed the course a lot. The deep learning homework was new and extra tough, but I appreciate the learning and being able to do it first hand. At times, the course will be very overwhelming on top of work, life, and family. But be ready to sacrifice a lot if you want to succeed in this course.

There's also a group project at the end. It has the usual issues of group formation, participation, ownership, etc. Personally I am not a fan of group projects, and that has not changed after this course.
Rating: 5 / 5 | Difficulty: 5 / 5 | Workload: 35 hours / week
Georgia Tech Student | 2018-08-10T03:35:25Z | spring 2018
I would not recommend this as a first course.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 20 hours / week
Georgia Tech Student | 2018-05-01T03:38:15Z | spring 2018
Context
- Software engineer with many years of experience.
- Significant experience with SQL, Unix, Python, Git, Docker.
- Some experience with Scala.
- No prior experience with Spark, Hive, Pig, Hadoop
- Taken (ML4T, RL, AI) previously.
Pros
- Hands-on experience with the Data Science Pipeline: data exploration and pre-processing, feature construction, model training, etc.
- Hands-on experience with Big Data Tools, brief enough to touch on several tools, but difficult enough to force you to think through the problems and grasp the basics.
- Interesting and relevant Healthcare related Problems. You will actually use ML techniques to tackle these types of problems.
- A taste of the variety of problems in Healthcare that can be tackled with ML.
- The Homework is well structured, from easiest to hardest.
- The project is based on some suggested topics, but it is fairly open ended.
- The professor participated more in Piazza than in other courses. The TAs were usually responsive.
- An automated environment with all the required software for Homework was provided. After the initial setup investment it worked the rest of the semester without issues for me.
- A server with enough resources was provided to do our projects, turned out to be really useful.
Cons
- The auto grading process is tricky; it is really hard to know how/what caused point deductions, even when all local tests passed. You have a 50% chance of getting a very good grade or a really bad grade, regardless of your efforts and local checks.
- The grading and re-grading experience depended on the particular style of the given TA. They should have common principles and be more consistent.
- The automated environment setup could be tricky for some people. In my case, I spent like 5 hours setting it up, then it worked fine.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 20 hours / week
Georgia Tech Student | 2018-04-30T20:14:33Z | spring 2018
Took this course in my first semester in OMS. It was a useful course to me because I got to learn big data technologies like Spark, Hadoop, Scala, etc., but you could learn more by just reading a book on each of those subjects, as this course does not go deep into each technology.

There were a lot of difficulties setting up the environment in the beginning. TAs were kinda responsive, but not very helpful. The homeworks were not structured well at all! I did have plenty of time to complete each homework, but I had no idea whether my code was right or wrong! There was no feedback or guidance on this part. So, for many assignments, students who had checked their code on a few test cases learned after submission that they had missed a minor edge case which was only in the TA's test set, and lost marks due to that. IMHO, if you actually want to learn big data technologies, grab a few books with big data tech names on them and just do the programs in the book. You'll learn a lot more that way.

The only interesting part of the course was the project (which fortunately was a huge chunk of your final grade). You are allowed to choose from a wide range of pre-selected topics. The project can be done in a group of 3 or solo. You can also get access to datasets which are normally not available for free access online. I would say this was the only useful thing I got out of this course.

I would assume this course is going to be restructured or rebuilt soon because of the complaints that were made during my semester. Overall, if you are not comfortable learning and implementing with 3-4 technologies in a short period of time, you are going to have a hard time. If you want to take this course, pair it with some lighter course and read additional big data material on your own to get the most out of this course.
Rating: 3 / 5 | Difficulty: 3 / 5 | Workload: 20 hours / week
Georgia Tech Student | 2018-04-26T18:28:06Z | spring 2018
Important note: I took another course which was very time consuming (not GA) as well while I was taking this course, so that may influence my view of this course as well. And just for you to know I had taken more than half the labs of this course and all OMSCS ML-related courses before day 1 of this course.

Do you want to learn Big Data? OK, then take the best course or MOOC you can find online or offline or start some coding by yourself. Don't take this course. That's my best advice. Why?
- Professor is almost never around. This is perhaps the biggest issue, and from my point of view, where all the other problems come from (he just doesn't see what is going on). You don't see him around in Piazza and he doesn't respond to emails. I've been in a very few other courses where the professor is never around, but this course takes it to the next level. He participates with teams when it's time to start the project though. I can attest to that.
- It's outdated. They use Spark 1.3 in a virtual environment for homeworks. The latest Spark version at the time of this writing is 2.3. So, it's mainly focused on RDDs, which are being left behind in favor of DataFrames. You can still use DataFrames in Spark 1.3, but they're not mature enough. The same happens with other tools used. Some of the tools are not even the most used tool of their type out there. So sometimes you feel like you're learning it mainly for academic purposes or to get the grade. Having to give almost 40 hours every week to something like that can be frustrating.
- Feedback takes a long time to arrive. There were times I had to wait a lot to get an official response, way more than in other courses. TAs do this; I don't remember having seen the professor do it.
- Fighting against the environment. They give you an environment you have to set up at the start of the semester. It's a good help, but you'll end up fighting that environment the rest of the semester.
- So much work is worth very few points. 4 Homeworks take lots and lots of hours every week, but they are worth 10% of the whole grade each. Only 40% of the grade after 8 weeks of no social life, maybe 200 hours of work or more (each homework is given 2 weeks).
- Forget you had a family. Your family will miss you for the whole semester. If you have a spouse, he/she won't be happy you decided to take this course.
From my point of view, this is one of the most important subjects for Machine Learning Specialization because of how often you are asked to know this stuff for a job, and I liked the fact that GT had a course on this. I waited to have a good amount of ML knowledge before taking it, and I think the intention is good, but this course needs a revamp desperately and a professor with time well spent on Piazza.

This course was way too hard, and half of it was because of the wrong reasons (getting environment to work correctly, lack of documentation, lack of lecture on subjects needed for homeworks).

I think they are thinking of improving this course; I hope they do. I could have taken 2 other courses in the time it took to finish this course.
Rating: 1 / 5 | Difficulty: 5 / 5 | Workload: 40 hours / week
Georgia Tech Student | 2018-04-04T22:56:28Z | fall 2017
This was a very fast paced class. There is a mountain of technical + industry vertical knowledge that is being covered in the first few weeks and students are expected to be self-starters. If you expect the lectures to cover the material you will be disappointed. I spent close to 20 hours every week learning + iterating on the homeworks/projects. It has been a great experience but the time commitment is a must.
Rating: 4 / 5 | Difficulty: 5 / 5 | Workload: 20 hours / week
Georgia Tech Student | 2017-12-12T18:48:45Z | fall 2017
I came into the class already knowing SQL and Spark in Scala fairly well, and had taken the rest of the ML specialization already, so take my workload and difficulty ratings with a grain of salt. Here are some things that might be helpful to know:
- You need to know how to calculate partial derivatives, use the chain rule, etc. You have to derive various update equations and then code them. In other words, if you weren't able to derive the equations, you wouldn't be able to do the entire rest of the homework and would get a terrible score. That could really rain on someone's parade.
- You are required to use Scala for Spark (except on the Group Project). You are not allowed to use PySpark.
- There are 4 homework assignments and then a group project. No exams.
- You need to know SQL for most of the homework.
- The first homework uses pandas and sklearn. The rest do not.
- The second homework uses Hive, Pig, MapReduce, and Scala Spark in Zeppelin. You don't need to know Hive, Pig, or MapReduce very well. Just do the practice lab before the course starts and you will be fine on those.
- The 3rd and 4th homeworks are Scala Spark. You need to know it pretty well.
- The 4th homework uses Spark's GraphX.
- The Group Project is worth a lot of your grade. A lot of people did projects based on Deep Learning.
WARNING: when testing your homework using their test scripts, do not assume your code is correct if it passes. Their tests are inadequate. Many students complained of having passed all the tests but then losing lots of points.
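As an illustration of the "derive the update equation, then code it" step in the first bullet above: the review does not say which equations were assigned, so this sketch assumes a standard case, one stochastic gradient descent step for logistic regression with log loss. The function names and the learning rate are hypothetical, and it is written in Python for brevity even though the course required Scala for Spark.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_step(weights, features, label, learning_rate=0.1):
    """One SGD update for logistic regression on a single example.

    With p = sigmoid(w . x) and loss L = -[y*log(p) + (1 - y)*log(1 - p)],
    the chain rule gives dL/dw_j = (p - y) * x_j, so each weight moves by
    -learning_rate * (p - y) * x_j.
    """
    p = sigmoid(sum(w * x for w, x in zip(weights, features)))
    return [w - learning_rate * (p - label) * x
            for w, x in zip(weights, features)]
```

Starting from zero weights, `sgd_step([0.0, 0.0], [1.0, 2.0], 1)` predicts p = 0.5, so each weight moves by 0.1 * 0.5 * x_j toward the positive label.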

Pros of the course: Great practical experience in big data. The homework was very worthwhile and I enjoyed it.

Cons of the course: This course could be great if anyone on the teaching staff would dedicate even a little bit of effort. They will give you a setup for Zeppelin using Docker that takes many hours to complete, and then you find out at the end that the setup doesn't work, and in fact, when they gave it to you, THEY ALREADY KNEW IT DIDN'T WORK because it didn't work in the previous semesters either. And they still keep handing it out. No respect for students' time whatsoever. The instructions in general are frequently ambiguous or wrong, and no efforts are made to correct them. The TAs make brief appearances on Piazza to give 3-word responses that do not answer the question that was asked. The professor of record has no interaction with the class whatsoever -- not even a single Piazza post -- except that each group was allowed to meet with him about the group project for 10 minutes over 3 days of his choosing. Then after we all cleared out our schedules for the great honor, at the last minute, he cancelled and told everybody scheduled on that day they needed to move to day x, which of course most of us could not do without any advance notice. So I never had a single interaction with him. Lastly, the pacing of the course is bizarre -- almost backwards. The more time a homework or task takes, the less time you are given to do it. And the easier it is, the more time you are given to do it. The homework and project are not given out ahead of time so there is no way for students to correct for this.

I still liked the class despite all this, in that it really did help me become confident with big data technologies and Scala Spark in particular. It just frustrates me to no end that the course could be SO much better with just a little effort.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 20 hours / week
Georgia Tech Student | 2017-10-30T09:53:55Z | fall 2017
Pros: challenging and interesting. You will learn the basics of the big data ecosystem, Hadoop, and Spark, learn the basics of Scala, and refresh your machine learning algorithms (it is hard to believe one can survive without taking some basic ML course).

Cons: the professor was never seen before the team project, and some TAs are not engaging (I am, however, very impressed with one TA who helped with nice instructions to set up Zeppelin in Vagrant when the original setup did not work for most). Grading is something you should worry about since the tests are not well designed: you may end up with lots of points deducted even though your code passed the sample tests they provided. The teaching material is severely insufficient compared to what you need to accomplish the homework. Homework instructions are at times obviously unclear or insufficient.
Rating: 3 / 5 | Difficulty: 4 / 5 | Workload: 30 hours / week
Georgia Tech Student | 2017-09-19T23:47:57Z | spring 2017
I dropped this class about a month into the semester so I'm not sure if things get better, but so far the class has been pretty awful. The professor has only made 3 very minor appearances on Piazza, only one TA is active, and the lectures are short and have almost nothing to do with the course content. Everything that I've learned has been through self-learning, which 1) is not a very efficient way to learn a large amount of new material as I find myself constantly reinventing the wheel and 2) I can do this for free on my own time, focusing on exactly what I'm interested in. The assignments do force you to learn new technologies and techniques, but I'm sure that I could have learned a lot more in a lot less time if there was guidance.

I don't have a specific interest in data science, but in general I can't imagine myself liking any course where there's no meaningful instruction and instead we're given time-consuming (40+ hour) homeworks to hack away at using whatever resources we can find. The amount of work required for the amount I was learning was too great and in all likelihood none of this will ever be useful to me professionally so I cut my losses and dropped. If you actually work in this area or want to work in this area, the course might be worth it since it forces you to use a lot of different technologies, but otherwise I wouldn't recommend it.

You'll learn a lot in this class like the other reviews state, but it's because you'll spend almost as much time on it as you do at your full time job, not because it's actually a good class. On the bright side it's probably an easy A if you have 600 hours to devote to it over the semester.
Rating: 1 / 5 | Difficulty: 5 / 5 | Workload: 35 hours / week
Georgia Tech Student | 2017-08-10T21:28:16Z | spring 2017
I'll try to share my personal experience (which is very different from the other reviews). I'm aware that my background definitely helped a lot, so I hope this review could be helpful for people with a background similar to mine.

The main skills needed for this class are:
- Experience with the Linux CLI, environment variables, how to install packages/libraries, etc. Also knowing how to use a Virtual Machine.
- Some background in Machine Learning: what cross-validation is, typical metrics, how to train a model, and how to run predictions on it for supervised and unsupervised methods.
- Experience working with data in both databases and structured files: SQL, aggregation functions, joins, constraints, etc.
- Having the maturity/confidence to quickly learn on your own how to hack your way in some new programming language.
I have 15 years of industry experience with databases (not big data) and scripting languages in Linux, and I had previously taken ML, ML4T, and RL. The assignments took me about 20-30 hours each (spread over 2 weeks, a workload of 15 hours/week).

On the other hand I can see how for someone with no experience AT ALL in the skills mentioned above, this class could take as much as 30-40 hours/week.

Some people have mentioned "you HAVE to LEARN Pig/Hbase/Hive". I don't think that's exactly the case. I found myself googling things like "how to left join pig" and adapting some snippet from S.O. On the other hand, if you don't really know what an outer join, a group by, etc. are, of course it will be hard to look for something you don't understand.
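For readers unsure whether they have the join and group-by background this paragraph assumes, here is a minimal sketch of a left outer join followed by a group by, using Python's built-in `sqlite3` module. The `patients` and `visits` tables and their contents are made up for the example; they are not course data, and the course tools (Pig, Hive, Spark) use their own syntax for the same idea.

```python
import sqlite3


def visit_counts():
    """Count visits per patient, keeping patients who have no visits."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE visits (visit_id INTEGER PRIMARY KEY, patient_id INTEGER);
        INSERT INTO patients VALUES (1, 'A'), (2, 'B'), (3, 'C');
        INSERT INTO visits VALUES (10, 1), (11, 1), (12, 3);
    """)
    # LEFT JOIN keeps every patient row; unmatched visit columns are NULL,
    # and COUNT(v.visit_id) ignores NULLs, so patient 2 gets a count of 0.
    rows = conn.execute("""
        SELECT p.patient_id, COUNT(v.visit_id) AS n_visits
        FROM patients AS p
        LEFT JOIN visits AS v ON v.patient_id = p.patient_id
        GROUP BY p.patient_id
        ORDER BY p.patient_id
    """).fetchall()
    conn.close()
    return rows


print(visit_counts())  # [(1, 2), (2, 0), (3, 1)]
```

An inner join would drop patient 2 entirely, which is the kind of difference that is hard to search for without already knowing the vocabulary.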

Completing the labs before the class started helped me a lot: http://www.sunlab.org/teaching/cse6250/spring2017/lab That way, I didn't waste time setting up the environment for each assignment (the labs are closely related to assignments 2, 3, and 4).

As other reviews have mentioned, you are mostly on your own in the 2nd part of the class, but I think the grading was very lenient. Sadly I ran out of space for this review.
Rating: 4 / 5 | Difficulty: 4 / 5 | Workload: 15 hours / week
Georgia Tech Student | 2017-06-08T19:39:29Z | spring 2017
This is a very hard class, but you will learn a lot and feel proud at the end. The homeworks require you to know Python, Hive, Pig, Scala, and Spark. I knew the first 3 and took the "labs" before starting the semester. The labs are basically non-graded tutorials you do on your own to learn Scala/Spark and other tools. Even with the labs and some background knowledge (I took Machine Learning before), the homeworks could take you 60 hrs. My life could have been much easier if I had known Scala beforehand, but I learned it while taking the class. If you survive the 4 homeworks with a decent grade, the second part is much easier: basically a project (a paper) on a healthcare-related topic from a list of topics suggested by the Professor. TLDR: Do the labs and learn basic Scala before the semester starts and you will be fine. Still time consuming.
Rating: 4 / 5 | Difficulty: 5 / 5 | Workload: 40 hours / week
Georgia Tech Student | 2017-05-01T21:14:40Z | spring 2017
As mentioned by others, this is a VERY challenging course unless you already use big data technologies and know Scala. I had completed half of the labs prior to taking the course and have taken all other ML / data courses in the program, so the machine learning and general data concepts weren't extremely challenging. The catch, though, was trying to drink from a fire hose in order to complete the homework assignments. The challenge w/ the assignments was picking up Scala and working with provided virtualized / containerized environments (for Hadoop, Spark, Scala, Zeppelin, etc.) that weren't fully functional or compatible with the requirements of the assignment. Get ready to troubleshoot the tools you're given.

The instructor and TA were not very interactive on Piazza at all. IF we got an answer, it often didn't answer the question asked, wasn't complete, or was so general that the question needed to be asked again from a few additional angles. Many times there just wasn't a response. We completed our final project submissions and hadn't received feedback on an assignment submitted about 1.5 months prior... the most recent grading feedback was 2 months ago. TA office hours were nonexistent... they really were "message the TA and hope for a response". The prof did host a couple of opportunities to have a brief chat session to discuss the project, but this was about it.

Scala is nice now that I've done it, but was a bit of a paradigm shift for me, so work on it ahead of time. Aim to complete all of the labs prior to the beginning of the semester. If one could hang with the class pace of assignments for a little over half of the semester, then the 2nd portion was dedicated to an individual/group project and was much more tolerable as you could select/control the technology and environment to work within. Also, as w/ any course w/ a team project, vet a good team early.

You will pick up a lot of pertinent technology, but will work for it (find help early).
Rating: 4 / 5 | Difficulty: 5 / 5 | Workload: 40 hours / week
Georgia Tech Student | 2016-12-20T18:41:01Z | fall 2016
Let me start out by saying that this course is "lost", literally.

At the beginning of the semester, the instructor wrote "You are expected to learn additional materials beyond lecture and solve the programming problems through largely self-learning." I would highly emphasize "self-learning." There is so small a connection between the video lectures and the assignments that you may as well skip all the videos. And it shows. At the end of the semester, a student-run survey found that 15 students learned more on their own versus 2 students who learned more from instruction. This is a loss of instructional goal.

The course is composed of 2 unrelated parts. The first part has 4 assignments. The second part has 1 open-ended project. You are assumed to know Python, scikit-learn, HDFS, HDMR, Hive, Pig, Scala, Spark, GraphX, MLlib, and Zeppelin. The provided environments are versions of both local and AWS Docker containers, and Vagrant images with CentOS, CoreOS, and Ubuntu. This gives the feeling of a kid playing with various toys who can't decide which one to take home. The issue is worse when you find out that the provided scripts don't even work out of the box! It is a loss of focus.

The open-ended project suffers from many problems. While the assignments are graded based on accomplishments and correctness, the project is graded like an English literature class. You can earn a good chunk of points by making a video capturing your presentation skills, even if your project is a complete failure. Yet you don't earn any more points for taking more innovative or difficult approaches than other students. This is a loss of consistency.

This course is completely lost in the big world of data science. You'll be wandering aimlessly too if you follow this course.

Having said that, this course would be an easy A if that was all you'd care about. You just simply have to work really hard on it. It is a "work hard, not smart" kind of course that loses my interest quickly.

Good luck!
Rating: 1 / 5 | Difficulty: 3 / 5 | Workload: 35 hours / week
Georgia Tech Student | 2016-12-17T07:02:14Z | fall 2016
This is definitely THE class you need to take if you are into big data technologies. It is the most brutal and most rewarding class that I have taken in OMSCS, and it is the hardcore class I expect to see in a top MSCS program. I personally learned a ton from this course and loved it very much. But be aware of the survival bias. Every two weeks, there were students withdrawing from this class, and only 30+ survived till the end of this semester.

The first half of the class is about 4 homeworks. Within two weeks, you need to learn new tools, watch tutorials online (in addition to lectures and labs), and finish the coding assignment for each homework. Don't be misled by the difficulty of the first homework, which takes less than 20 hours of work if you are good at Python. Homeworks 2-4 easily cost you 60-80 hours each. If there were no such intensity, I would not have learned so much in less than 2 months.

The second half of the class is about a research project on a topic of your choice. Personally, it's my favorite part. At this point, most survivors should be familiar with all the big data tools (Spark, Hadoop, etc.) and getting ready for the real meat of this class. There is a pool of topics for you to select from, and you are also free to come up with your own topic. Your task is to use the big data tools you've learned to "reproduce and improve" a recently published journal paper in the field of Big Data for Healthcare. Given the complexity and sheer size of the data, even partially "reproducing" others' work is not as easy as it sounds. You can secure a good grade by simply reproducing the result itself. From what I have read as a reviewer (you can see all papers of the class), most people couldn't get to the point of matching the published papers. "Improving" basically means a publishable work, and very few (maybe 2-3 per semester) can meet that standard.

You need true passion and serious commitment to succeed in this class.
Rating: 5 / 5Difficulty: 5 / 5Workload: 40 hours / week
Georgia Tech Student2016-05-10T18:39:56Zspring 2016
This has been the most intensive, hardcore, and outrageously awesome course ever. Unfortunately, I could not enjoy it as much because I took 6505 CCA along with this course, but I still got a lot out of it.

Things I liked
- hands on exposure with real big data in a practical field (medical)
- exposure to technologies that are currently being used in the field today.
- the guest lecturers.
- Hands on experience on what machine learning really is capable of doing.
- good community of help through Piazza.
Things that I wasn't so fond of but are good
- hacking away, just trying to get things set up before I could even run my code. You should really have experience working on a Linux server so you can start using AWS and the Hadoop environment.
- Trying to differentiate equations that weren't ever mentioned in the course lectures or course prerequisites
- making matrix multiplication work through Spark.
- EasyChair. It was anything but easy... or a chair.
Things I didn't like
- sometimes I felt TAs were very short. I can't entirely blame them, there is a lot on their plate. I wanted to turn in a newer version of my final project which I knew was still within the deadline. Not much luck talking about it.
- Class still has a lot of kinks that need to be worked out. How to give feedback on the final project wasn't very clear until the last moment.
- First homework assignment was using the Python pandas/numpy environment. Easy, know it already. Anyone that has already taken Machine Learning probably already knows about that environment. It would have been better if there was an assignment on the AWS EMR environment with a little bit of Python to follow through.
- Main lectures were interesting but almost completely useless.
I was deceived. Somewhere, probably on the syllabus, I remember reading that this course would most likely take over 15 hours every week. So I assumed it wouldn't be too bad. 20 or so a week was something I could manage. It was more than double that.
Rating: 4 / 5Difficulty: 5 / 5Workload: 45 hours / week
Georgia Tech Student2016-05-09T22:28:07Zspring 2016
This is the most "practical" course I've taken so far (out of 9). It is very harsh at the beginning if you're not well prepared (to meet the pre-reqs). I had some experience with Hadoop, MapReduce and Pig Latin before, thus the first 2 homeworks were not that bad. But learning Scala and Spark within 2 weeks was brutal. The TAs, professor and especially classmates are very helpful. The second half of the course is about the project. I did learn a lot about writing and coding. It is also almost self-paced, which made life much easier compared to the first half. It is a very useful course if you plan to learn big data and use the related tools. But be prepared: it is not an easy one.
Rating: 5 / 5Difficulty: 4 / 5Workload: 20 hours / week
Georgia Tech Student2016-05-07T19:55:08Zspring 2016
This course is very hands-on and you'll get a very rewarding learning experience, but it's best to take it with another easy course or on its own.

The difficulty highly depends on how much time you can allocate. Nothing is impossible but every single part of every assignment is time consuming. There are 4 homework assignments due every 2 weeks for the first part of the course and a project for the second half. Every assignment involves serious effort. Expect to spend several full days on each. This course covers several big data tools, but it's not a programming course. So you need to learn on your own well enough to finish the assignments. Make sure you start early. Assignments are assessed by autograding using different data sets than the ones you have, so I kind of miss the Udacity autograder that I used to hate, since at least you know you get it when you get it. The project focuses more on your ability to interpret data and understand the topic you choose. Your report/paper will be graded for the project, not your code.

The instructors and TAs are very active on Piazza and helpful. It's a new course so there are some problems every once in a while, so make sure you keep an eye on the announcements and pinned messages, in case there's any code or instruction update. The teaching staff are aware of the problems and are trying their best not to make things confusing. It's not perfect but they listen to suggestions and will likely make this course more organized.
Rating: 5 / 5Difficulty: 5 / 5Workload: 30 hours / week
Georgia Tech Student2016-05-02T19:09:51Zspring 2016
This is purposely designed to be a "learn by doing" course and that comes with a lot of pros and cons. On the positive side, you get your hands dirty with the nitty gritty details of running big data experiments on Hadoop and Spark. I personally did not have much big data experience but by the time I completed the homework assignments I was ready to design and implement a large scale experiment on Hadoop for my project without major issues.

On the negative side, the lectures and other course content are pretty thin. There was a fairly large gap between the content the course provides and what was needed to actually complete the assignments, so most of what I learned in the course was through looking up tutorials and other course materials on the Internet. By far the biggest negative was the workload required for the first three homework assignments: I spent 40+ hours per week on the assignments over the six weeks that were given to complete them. The assignments themselves were not necessarily difficult to complete (although one assignment required deriving an equation used throughout the rest of the assignment, which the course itself in no way prepares you for), but it's very much as if an entire semester's worth of assignments is crammed into that first part of the course. There were some other minor negatives to the course, mostly in the form of the course being disorganized and instructions being vague or conflicting.

Overall I would say this could be a great course and I personally learned a lot from it, but I can't recommend it in its current form.
Rating: 2 / 5Difficulty: 5 / 5Workload: 35 hours / week
Georgia Tech Student2016-05-02T00:17:55Zspring 2016
This is a hard course and it lacked a bit of organization. It's probably because it was the first iteration for OMS, but the TAs were very helpful. The assignments are meant to be hard so you actually spend your time working and learning. I liked it that way. I had to actually implement a lot of it on my own with little help from Piazza, which I thought was very good. You might not find code snippets online either, since we are using multiple technologies.

If you want to go in prepared, this is what you need to know. These are not buzzwords, but actual tech stuff I used in this course.
1. Python, sklearn, pandas, matplotlib (HW1)
2. Hadoop, Hive, Pig (HW2)
3. Scala, Spark, MLlib, GraphX (HW3, HW4, project)
4. Vagrant (I didn't have to go beyond setting it up and running some scripts)
5. aws-docker (I didn't really use it, just played with it for the project)
6. Health informatics and machine learning concepts
7. IntelliJ and PyCharm were the editors I used. They make your life a tad bit easier.
You are allowed to use any of the big data frameworks for the project, but I used Scala and Spark.

It's a great course and I took it alongside ML. They kind of complemented each other and their deadlines didn't overlap much, so it worked for me.
Rating: 4 / 5Difficulty: 5 / 5Workload: 30 hours / week
Georgia Tech Student2016-04-30T22:50:33Zspring 2016
This course is tough but you will learn a lot. Don't take this with any other tough course. It would be too hard to manage your time between office, family and studies. But all that said, this is the course if you really want to learn the big data tools. If you don't have any ML experience or advanced statistics experience, you will suffer in the beginning and you will have to work hard to get the homeworks done on time. If you are taking this course, make sure you start working on assignments as soon as they are released. They are not easy to work on if you start in the last 4-5 days.
Rating: 5 / 5Difficulty: 5 / 5Workload: 35 hours / week
Georgia Tech Student2016-04-30T06:57:38Zspring 2015
Summary: Learned a ton, very interesting, very rewarding. Difficult, very time consuming. Grading seems lenient.

The first half of the class involves 4 homework assignments, each given 2 weeks to complete. These are all pretty time consuming and require the full 2 weeks to complete. Just make sure you don't try to start them late. The second half of the class involves a more open-ended project. The workload was probably lower in the second half. Also, there is lots of extra credit available throughout the class. It seems a sizeable number of students did not finish all of the assignments, but if you do, there is a good chance you can get >100%.

There is not much info in the lectures, but there is some good info in the labs. The professor wants students to learn by doing. Personally, I agree that this is the best way to learn (by trying to tackle something I am not quite sure how to do yet), but I can understand some students preferring a more structured approach, where they are told more explicitly how to do what they need to do. I would say the most important skill in this class is just to be a quick learner.

Overall, I learned more in this class than any other I have taken. I would recommend this class if you have any interest in using big data tools. Just don't expect an easy class. I would probably take this class by itself or paired with an easier class.
Rating: 5 / 5Difficulty: 5 / 5Workload: 40 hours / week
Georgia Tech Student2016-04-30T02:23:52Zspring 2016
There are 4 assignments and 1 final project and I wanted to quit 5 times. All of them are excruciating and there are 0 scores all the time, even during the stage of final project.

Compared with Machine Learning, the workload for this one is waaaay heavier and the assignments are much more difficult. For ML, the assignment does not have a fixed answer, so you can basically write it in any reasonable way; however, for all 4 assignments here, your scripts must generate a set of unique right answers (some of them need to be within a certain range), and if your script does not compile, you get 0 (because they are auto-graded). For each homework, there are several questions. Although the TAs claimed that they would grade them independently, it is highly likely that, if you cannot solve the 1st one, you cannot go any further. The videos on Udacity are "useless", you can barely learn anything from them; but you will learn everything else via assignments, final project, labs, and student discussion posts on Piazza.

This course is about Health, but it's not healthy at all to take it - so many sleepless nights, an excruciating debugging process, having no idea where to go, etc. etc. :)
Rating: 5 / 5Difficulty: 5 / 5Workload: 40 hours / week
Georgia Tech Student2016-04-24T17:23:09Zspring 2016
Do you want a challenge? Do you want to use all the big data software we hear so much about (e.g. Hadoop, Spark)? Do you also want less sleep? :)

If you answered yes to all of those questions, then take this class. The class was incredibly interesting, with lots of work and investigation required by the student. The lectures are really light on material; as the professor says, it is expected that students go out and learn and experiment on their own.

I would highly suggest being familiar with Python, some Scala, and Machine Learning concepts BEFORE taking this course. I really did love the course, it was incredibly challenging, but very rewarding at the end.

The professor does need to get things a bit more organized in terms of expectations of deliverables, but given this was the first semester and the professor seems to react well to lots of questions and a bit of criticism, I am certain it will improve.
Rating: 5 / 5Difficulty: 5 / 5Workload: 35 hours / week
Georgia Tech Student2016-03-25T16:29:21Zspring 2016
A few things on expectations before taking this class:

You should already have a background in ML as the topic is lightly covered in class. Also, you should already know Python, Pandas and Scala. Surprisingly, very little Java is used.

As far as the big data aspect, you will have to learn on your own as the class doesn't teach as much (there are some labs available for you to do on your own but you're better off with vendor based training or other tutorials).
Rating: N/ADifficulty: 5 / 5Workload: 41 hours / week
Georgia Tech Student2016-03-19T18:35:33Zspring 2016
The amount of work expected in this class is insane for a 3 credit class. When the professor designed the course, I doubt he realized how time-consuming it would actually be.

It took me 2-2.5x as much time as Machine Learning (7641) did. As mentioned below, the time-spent-to-amount-learned ratio is not very good. I probably spent about 5% of the time actually learning and 95% of the time hacking away / throwing everything against the wall to get something to work. Many classmates had similar feelings.

One of the (4) homeworks took me about 12 hours just to get the very first part correct - worth 5 points; and that step was necessary to move on with the rest of the assignment. On another assignment I was stuck on the first section for an entire week and spent around 30 hours trying to figure out what was wrong. The students on Piazza and the TAs are helpful, but when you are truly stuck, too bad - you'll get a bad grade. What's worse is that the TAs don't share the correct solutions with you afterwards. Because of this, I feel like there's a lot I didn't take away from the class.

The class frequently has guest lecturers, which seem extremely interesting. Unfortunately, the instructors never shared any of the videos past March 1st, so the majority of talks were never shared with OMSCS students. Students asked about this on Piazza (several times) with no response from the teaching team.

The final project is enjoyable, but my mentor didn't make things easy. For the project, my mentor was supposed to upload a large .csv file to a Postgres database, but 2 days before the due date he still couldn't get it working -- which led to me having to re-write a significant portion of my code. I spent about 20 hours debugging this, but still couldn't get it working - because of that, I was forced to use a smaller dataset (which I'll be punished for).

This course has a lot of potential, but desperately needs to be re-designed and have the professor more actively involved.
Rating: 1 / 5Difficulty: 5 / 5Workload: 40 hours / week
Georgia Tech Student2016-03-19T10:38:46Zspring 2016
It's a great course. I had some experience with big data tools, Python and ML before I opted for this course and yet I got hammered. Sometimes, it's very frustrating. The assignments take a lot of time, nothing from the videos or labs prepares you enough, and you need to figure out a lot of stuff on your own. I am currently halfway through the course and have learnt a lot. The course could be structured a little better.
Rating: N/ADifficulty: 5 / 5Workload: 50 hours / week
Georgia Tech Student2016-03-15T09:21:25Zspring 2016
This course, as it was for Spring 2016, is an enormous amount of work. Every homework introduced multiple different tools, techniques and programming languages. You need to have a strong mathematical background in linear algebra. You should also be familiar with Python (numpy and pandas especially) and Scala. I would strongly recommend doing Martin Odersky's Coursera Scala course, a linear algebra refresher and ML before considering this course.
Rating: N/ADifficulty: 5 / 5Workload: 35 hours / week
Georgia Tech Student2016-03-14T15:47:52Zspring 2016
This class has been difficult for me because I lack some of the required foundation in both Math and Machine Learning. The lectures are somewhat superficial but easy to follow. The majority of learning (so far) has taken place in the labs (super helpful) and the assignments. I have found other students and TAs in the class to be very helpful on Piazza. I think the material covered (big data tools such as Hadoop, Hive, Pig, Spark) has been very cool, but I would have preferred more of a Health Informatics focus. If you are not comfortable with mathematical notation (like me!) I would recommend brushing up before taking this course.
Rating: N/ADifficulty: 5 / 5Workload: 30 hours / week
Georgia Tech Student2016-03-09T16:29:54Zspring 2016
This is the first class that I have taken the first semester it was offered, so I'm not sure if they're all this half-baked and poorly planned to start, but I would not recommend this course to anyone until the professor fixes the multitude of issues. The workload in this class is insane - 40+ hours a week always - which I would not be as upset about if I were learning a lot too, but you definitely do not get out of this class what you put in. Projects are expected to be completed so quickly you often don't have time to actually learn the tools you're using. The labs don't teach you anything beyond how to copy the code snippets the professor gives you, and the lectures have essentially nothing to do with the projects. It's a pity, because this class seems so interesting in theory, but the execution is remarkably poor. I dropped this course because I feel like I'm putting in too much work to be learning so little - hoping it gets better so I can retake another semester.
Rating: N/ADifficulty: 5 / 5Workload: 50 hours / week
Georgia Tech Student2016-03-09T16:06:21Zspring 2016
By far the most time-consuming and frustrating class I've taken in this program (out of 7). Granted, I took it the first semester it was offered in OMS so I was expecting some issues with course content, coordination, and communication. However, it was much worse than I expected and I decided to drop the class and possibly come back to it in a future semester if I get the sense things have improved. This could be a really good class but, as it was, I spent 95% of my time hacking and 5% learning, which is not a good educational return on my time. Some things to know or review prior to this class: SQL (particularly joins), how to derive equations, Vagrant VM. Also, be aware that you will be using newer data tools and technologies which could be very valuable to learn, but good documentation seemed to be very lacking and the labs barely scratched the surface of what you will need to know to get through the homework. I would strongly recommend taking this class on its own.
Rating: N/ADifficulty: 5 / 5Workload: 40 hours / week
Georgia Tech Student2016-03-08T23:39:23Zspring 2016
This class will require you to learn and work on the problems with all the time that you have. You also need to work with the TAs and professor to clarify a lot of issues in the assignments. The assignments are well-structured but miss a lot of information. There is so much research that you have to do in each assignment. I strongly recommend taking this class by itself. You will learn a lot by the end of the course.
Rating: N/ADifficulty: 5 / 5Workload: 50 hours / week
Georgia Tech Student2016-03-01T19:50:21Zspring 2016
I like this class a bit. The only downside honestly is the pace: it's way too fast. The content is interesting and challenging (like a challenge you want to take on) but it can at times be very hard, as it requires a good math and programming background, so you might get stuck on either the math or the programming at times. If the pace was slower, the workload would be better, and the class would be easier. Get a study group, and find someone in real life who can help with either the math or the programming portion of the content. The tools you learn here are things you would use in real life.
Rating: N/ADifficulty: 5 / 5Workload: 30 hours / week
Georgia Tech Student2016-02-29T06:40:13Zspring 2016
Of the 10 OMSCS courses I have taken, this has by far the heaviest workload. For me, it is approximately double that of CS7641 (ML), which is itself fairly heavy. Some students are putting in over 40 hours per week. In this class, you will learn surprisingly little about BD or ML - I would estimate roughly 20% of your time covers ML or BD. Instead, you will be mired in details such as debugging the class virtual environment or tangling with syntax and type coercion for the new language of the week. Inexplicably, the instructor thinks this is a good thing.

This class is not far from being completely self-guided as the lectures provide little value. Consequently, we are often unprepared for homework assignments. This leads to focusing on getting things to work rather than learning how to solve problems the right way. If you want to learn BD, you're better off going directly to Google - this class is just a detour for the same.
Rating: N/ADifficulty: 5 / 5Workload: 25 hours / week
Georgia Tech Student2016-02-26T00:57:22Zspring 2016
For me, this is the single hardest class I have taken in the program (it is my 10th). The material is very difficult and the assignments have unclear directions. You can get stuck on something for 10 hours that is only worth 5% of a project, but completing it is required to proceed to the other parts of the project. I have spent 30 hours on each of the first two projects and wasn't able to get a B on either. The teaching team is very responsive on Piazza but it can be difficult to find answers there because it seems a lot of students are struggling.
Rating: N/ADifficulty: 5 / 5Workload: 30 hours / week
Georgia Tech Student2016-02-23T13:42:22Zspring 2016
This class is amazing. You will learn SO much about data mining, ETL, and modern model building techniques. However, you will be expected to learn a LOT by yourself, and FAST. Before touching this class I would make sure you are able to write both Python and Scala. In addition, familiarity with Vagrant and general Linux admin tasks is pretty much mandatory. Finally, if it's been a while since you've taken an advanced statistics course, you should look into that-- you'll be expected to derive some pretty advanced formulas (partial derivatives included) and implement them in code. If you have these prereqs, then learning the Hadoop and Spark ecosystems shouldn't be too bad. Do not take with another high-load class (like IOS, ML, or CCA).
Rating: N/ADifficulty: 5 / 5Workload: 30 hours / week
