fix: general fixes #45

Merged 2 commits on Oct 25, 2024

Changes from all commits
24 changes: 18 additions & 6 deletions ds-with-mac/content/posts/unpopular-opinion-hard-good-ds/index.md
@@ -156,7 +156,7 @@ Here are some key challenges this hype has created:
**The Rise of the Self-Proclaimed "AI Specialist" 🧑‍💻**:
* *Overnight Experts*: The commoditization of AI, primarily through LLMs, has made these powerful tools much more accessible. With platforms like Cursor[^2] simplifying coding tasks, it has become easier for anyone to claim technical expertise. This has led to a surge of people rebranding themselves as "AI specialists" after taking a short course in "Prompt Engineering" [^3] or reading a few blogs. However, knowing how to ask ChatGPT relevant questions doesn't make someone an expert in, say, finance, healthcare, or any other <cite>domain[^4]</cite>. This overconfidence and influx of self-proclaimed experts can dilute the quality of AI projects, leading to a false sense of competence that ultimately hampers real progress.

- * *Misaligned Skills*: Companies often have teams that can use AI tools but need more expertise to build, fine-tune, or deploy robust models. This disconnect can lead to poorly executed projects that fail to deliver business value. For instance, I've seen scenarios where senior engineers dismiss AI as a "fad" because it threatens the traditional ways they've worked, a mental defense mechanism to avoid learning new skills. However, on the flip side, many organizations can use a full-fledged data scientist or AI specialist from the start. Instead, they might be better off getting their data infrastructure in place and starting with simpler rule-based systems or hiring a well-rounded "AI Engineer" [^5] who can bridge the gap between full-stack engineering and AI deployment.
+ * *Misaligned Skills*: Companies often have teams that can use AI tools but lack the expertise to build, fine-tune, or deploy robust models. This disconnect can lead to poorly executed projects that fail to deliver business value. For instance, I've seen scenarios where senior engineers dismiss AI as a "fad" because it threatens the traditional ways they've worked, a mental defense mechanism to avoid learning new skills. However, on the flip side, many organizations **can't or shouldn't** use a full-fledged data scientist or AI specialist from the start. Instead, they might be better off getting their data infrastructure in place and starting with simpler rule-based systems or hiring a well-rounded "AI Engineer" [^5] who can bridge the gap between full-stack engineering and AI deployment.

**Over-Reliance on Plug-and-Play AI Solutions 🔌**:
As many of you have undoubtedly noticed, AI—particularly Generative AI (GenAI)—has become increasingly commoditized. This means that AI can now be integrated into existing systems as a module or component, making it more accessible and widespread. While this democratization is a positive step for increasing AI adoption, it has also led to a surge in so-called **"OpenAI or GPT wrappers"**—applications that add a basic layer over pre-existing models, like ChatGPT, without offering significant value beyond the core functionalities.
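
To make the "wrapper" critique concrete, here's a minimal sketch of what many of these products amount to: a hard-coded prompt around a hosted model API. The `openai` client usage is real, but the product framing, model choice, and function name are hypothetical illustrations, not taken from any specific app:

```python
# A minimal "GPT wrapper": the entire product is a system prompt plus a
# pass-through call to someone else's model. Assumes `pip install openai`
# and an OPENAI_API_KEY in the environment; all names are illustrative.
from openai import OpenAI

client = OpenAI()

def summarize_contract(text: str) -> str:
    """The whole 'app': one fixed prompt, no domain data, no evaluation."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "You are a contract-summarization assistant."},
            {"role": "user", "content": f"Summarize this contract:\n\n{text}"},
        ],
    )
    return response.choices[0].message.content
```

If a product adds nothing beyond this (no proprietary data, no evaluation, no workflow integration), it's hard to see where the lasting value lies.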
@@ -193,7 +193,7 @@ We've already discussed the hype vs. reality problem surrounding the AI craze. H

* **The Role of Translational Roles**: Bridging the gap between technical teams and business units is essential. Roles like data translators or analytics leads can help set realistic expectations, articulate business needs to technical teams, and ensure that AI initiatives are developed with a clear understanding of their intended outcomes. These roles are vital in translating business problems into technical solutions and vice versa, preventing misalignment from the start.

- These issues aren't going to disappear just because we're using LLMs. This is one reason why AI agents are not yet ready to replace coders or developers. As long as project managers and business leaders in 2024 struggle to communicate what they want to build, the most sophisticated AI (yeah, talking about you, Devin) will still require skilled professionals to bridge these gaps and deliver meaningful, well-aligned solutions.
+ These issues aren't going to disappear just because we're using LLMs. This is one reason why AI agents are not yet ready to replace coders or developers. As long as project managers and business leaders in 2024 struggle to communicate what they want to build, the most sophisticated AI (yeah, talking about you, Devin and similar coding agents) will still require skilled professionals to bridge these gaps and deliver meaningful, well-aligned solutions.

[^2]: https://www.cursor.com/
[^3]: Mark my words, you are not an "AI" specialist because you know some prompt engineering.
Expand Down Expand Up @@ -250,9 +250,9 @@ Data Scientist was once hailed as the sexiest job of the 21st century [^7], but

Garbage in equals garbage out—let's repeat it: GIGO, GIGO. Data quality is and will remain a critical issue at many companies, even if you use all the cool LLM-based features available today. If there's no data strategy or plan to make data accessible, the quality of the model doesn't matter. From my experience, almost every place I've worked has had issues with data, whether it's about quality, accessibility, or integration.

- There's a long-standing belief that a Data Scientist spends 80% of their time cleaning data and only 20% on actual analysis and modeling. This idea, popularized through various surveys, still holds some truth, even though things have drastically improved over recent years. For example, a study by CrowdFlower (now Figure Eight) found that data scientists typically spent around 60% of their time on cleaning and organizing data, which contributed to the often-quoted "80/20" split when considering other data preparation tasks like data collection[[4](https://www.figure-eight.com/data-scientists-spend-most-time-cleaning-data/)][[5](https://www.forbes.com/sites/bernardmarr/2018/01/19/data-preparation-is-still-80-of-the-work-in-data-science-ai/)].
+ There's a long-standing belief that a Data Scientist spends **80%** of their time cleaning data and only **20%** on actual analysis and modeling. This idea, popularized through various surveys, still holds some truth, even though things have drastically improved over recent years. For example, a study by CrowdFlower (now Figure Eight) found that data scientists typically spent around **60%** of their time on cleaning and organizing data, which contributed to the often-quoted "80/20" split when considering other data preparation tasks like data collection[[4](https://www.figure-eight.com/data-scientists-spend-most-time-cleaning-data/)][[5](https://www.forbes.com/sites/bernardmarr/2018/01/19/data-preparation-is-still-80-of-the-work-in-data-science-ai/)]. I don't think I've ever spent 80% of my time cleaning data, but I've still had to put significant time and effort into it.

- However, it's still surprising that so many companies don't fully understand their data, where it resides, how it's generated, and its quality. Without a clear data management strategy, even the most advanced machine learning models will struggle to produce reliable, actionable insights.
+ However, it's still surprising that so many companies don't fully understand their data, where it resides, how it's generated, and its quality. Without a clear data management strategy, even the most advanced machine learning models will struggle to produce reliable, actionable insights. Why not just adopt a data catalog solution, you ask? Why so few companies do is beyond me.
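
Getting a first handle on this doesn't require anything fancy. Here is a hedged, minimal `pandas` sketch of the kind of data-quality audit that surfaces the worst offenders; the file name and columns are hypothetical:

```python
# Minimal data-quality audit: a first pass at "know your data".
# Assumes pandas; the CSV path and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Per-column profile: type, share of missing values, cardinality.
report = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_pct": (df.isna().mean() * 100).round(1),
    "n_unique": df.nunique(),
})
print(report.sort_values("missing_pct", ascending=False))

# Simple red flags: exact duplicate rows and fully empty columns.
print(f"duplicate rows: {df.duplicated().sum()}")
print(f"empty columns: {list(df.columns[df.isna().all()])}")
```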


## Issue #5: The need for deep domain knowledge
@@ -268,7 +268,9 @@ However, being a data scientist, it's challenging to be a legal or finance exper

### The Role of Domain Experts in AI Projects
* **Contextual Understanding**: Domain experts provide the context often missing in pure data analysis. For instance, a legal expert can help interpret regulatory data accurately, while a healthcare professional can ensure that AI models for medical diagnostics are clinically sound.

* **Fine-Tuning AI Models**: When building LLMs or other AI solutions, domain knowledge can aid in fine-tuning, ensuring that the models generate outputs that align with industry standards and real-world applications. Without this, even the best models may miss critical nuances.

* **Mitigating Risks and Ensuring Compliance**: In sectors like finance, healthcare, and law, there are strict compliance requirements. Domain experts can help data scientists navigate these regulations, ensuring that AI models are effective and legally compliant.

While LLMs and other AI tools continue to advance, deep domain knowledge remains crucial for success. For Data Scientists, collaboration with domain experts is not just a best practice—it's a necessity. The synergy between technical skills and domain expertise will drive innovative, effective, and responsible AI as we push toward more sophisticated AI solutions.
@@ -294,13 +296,18 @@ However, it can be rather discombobulating for practitioners to differentiate be

### Why These 'Ops' Matter for AI Success
* **End-to-End Integration**: Regardless of the terminology, the key is to think beyond just model development. This means seamlessly integrating your data pipelines, model training processes, deployment strategies, and post-deployment monitoring. Whether it's DataOps for managing data workflows, MLOps for machine learning lifecycle management, or LLMOps for overseeing large language models, the goal remains the same: reliable and scalable operations.

* **Monitoring and Maintenance**: AI models, especially those involving LLMs, are inherently stochastic and can behave unpredictably in production environments. Continuous monitoring and maintenance are essential to catch performance drifts, unexpected behavior, or data quality issues. This includes setting up alerts, tracing problems back to their source, and having protocols to handle model updates or rollbacks. For example, a model trained on static data may begin to underperform as new patterns emerge in live data, requiring retraining or fine-tuning.

* **Version Control and Experimentation Tracking**: An essential part of operationalizing ML systems is maintaining version control for models, datasets, and code. Experimentation tracking allows practitioners to compare different model versions, making reverting to previous iterations easier if new deployments fail. This practice ensures consistency and traceability across the AI lifecycle, which is especially important in regulated industries like finance or healthcare (a minimal tracking sketch follows this list).

* **DevOps Practices Applied to AI**: At its core, MLOps, DataOps, and LLMOps are extensions of DevOps principles. They adopt the best practices of DevOps—such as Continuous Integration/Continuous Deployment (CI/CD), automated testing, and infrastructure as code (IaC)—and apply them to AI workflows. The main difference lies in the need to handle additional layers of complexity, such as data preprocessing, model training, and model performance monitoring. For instance, deploying a traditional software application might not require constant retraining, whereas AI models often do.
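
To ground the experimentation-tracking point, here is a minimal sketch using MLflow, one common (but by no means the only) choice; the experiment name, parameters, and synthetic data are assumptions for illustration:

```python
# Minimal experiment tracking with MLflow: log params, metrics, and the
# trained model so runs can be compared and rolled back later.
# Assumes `pip install mlflow scikit-learn`; data here is synthetic.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, random_state=42)

mlflow.set_experiment("demo-regressor")  # hypothetical experiment name
with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestRegressor(**params, random_state=42).fit(X, y)

    mlflow.log_params(params)
    mlflow.log_metric("train_mse", mean_squared_error(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")  # versioned, reloadable artifact
```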

### Key Challenges in Operationalizing AI Systems
* **Data Drift and Concept Drift**: One of the most significant issues for production AI systems is data drift—when the data entering the model in production changes over time, leading to performance degradation. Detecting and managing this is a core aspect of MLOps and LLMOps, requiring automated monitoring systems and mechanisms to trigger retraining when necessary (see the sketch after this list).

* **Infrastructure Scaling**: As models, especially LLMs, can be computationally expensive, scaling infrastructure becomes challenging. Companies must ensure their infrastructure can handle peak loads without incurring unsustainable costs. Solutions like autoscaling, serverless architectures, and distributed computing are often integrated to manage this effectively.

* **Cost Management**: The computational cost can be significant, particularly with LLMs. Practitioners must know how to optimize models, compress them, or implement cost-effective solutions such as on-demand cloud services to manage expenses without compromising performance.
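
As a hedged sketch of the drift check mentioned above: a two-sample Kolmogorov–Smirnov test comparing a feature's training-time distribution against live traffic. The synthetic data and alert threshold are illustrative assumptions, not a recommended production setup:

```python
# Minimal data-drift check: compare a feature's production distribution
# against its training reference with a two-sample KS test.
# Assumes numpy/scipy; the 0.01 alert threshold is an arbitrary example.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training data
live = rng.normal(loc=0.4, scale=1.2, size=1_000)       # shifted live traffic

statistic, p_value = ks_2samp(reference, live)
if p_value < 0.01:  # hypothetical alert threshold
    print(f"Drift detected (KS={statistic:.3f}, p={p_value:.1e}); "
          "flag the model for investigation or retraining.")
```

In practice, a check like this would run on a schedule per feature, with alerts wired into whatever monitoring stack the team already uses.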

While the terminology may vary, the principles remain consistent: AI systems require robust operational frameworks to function effectively in production. Whether it's called MLOps, DataOps, AIOps, or LLMOps, what matters is adopting a holistic approach to ensure these systems are scalable, reliable, and adaptable. After all, production environments are always different from development, and planning for those differences separates successful deployments from failed experiments.
@@ -311,18 +318,23 @@ While the terminology may vary, the principles remain consistent: AI systems req

{{< figure src="/pic_issue_7_v2.png" alt="#Issue7" title="Fig 7. Too many languages, frameworks and models to keep track of. Source: Author." style="display: block; margin-left: auto; margin-right: auto; width: 50%; max-width: 50%;" >}}

- If you have chosen the path of a Data Scientist, you're likely someone who enjoys learning and experimenting with new technology. However, compared to a few years ago, the pace of change in this field has accelerated drastically. We see new research papers released almost daily and new libraries that promise to do things better than before. Emerging programming languages have also entered the mix—should you stick with `Python,` or explore newer ones like `Rust,` `TypeScript,` or even `Zig`? Regarding databases, should you continue using `Postgres` with `pgvector` enabled, or should you switch to a newer vector database like `Qdrant`?
+ If you have chosen the path of a Data Scientist, you're likely someone who enjoys learning and experimenting with new technology. However, compared to a few years ago, the pace of change in this field has accelerated drastically. We see new research papers released almost daily and new libraries that promise to do things better than before. Emerging programming languages have also entered the mix—should you stick with `Python`, or explore newer ones like `Rust`, `TypeScript`, or even `Zig`? Regarding databases, should you continue using `Postgres` with `pgvector` enabled, or should you switch to a newer vector database like `Qdrant`?

The choices don't stop there. Should you buy or build? Fine-tune or prompt-engineer, especially as LLM capabilities continue to improve? What tasks are still considered core to Data Science? Sure, evaluation (evals) is essential, but how engaging is it to write evals all day? And with the rise of techniques like using LLMs as evaluators, is there even a need for traditional approaches anymore?
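
On the "LLMs as evaluators" point, here is a minimal sketch of the idea: one model grading another model's output against a rubric. The `openai` client call is real API usage, but the judge prompt, model name, and scoring scheme are assumptions for illustration:

```python
# Minimal "LLM as judge" eval: ask a model to grade an answer on a 1-5
# scale and return just the digit. Assumes `pip install openai` and an
# OPENAI_API_KEY; the rubric and model choice are illustrative.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> int:
    prompt = (
        "Rate the following answer for factual accuracy and completeness "
        "on a 1-5 scale. Reply with a single digit and nothing else.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())
```

The obvious caveat is that the judge inherits the blind spots of the model doing the judging, which is exactly why the "do we still need traditional evals?" question isn't settled.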

- My point is that **technology**—and **Data Science** with it—continues to change rapidly. If you caught the recent announcements from Anthropic, demonstrating <cite>Claude taking over your computer [^8]</cite>, it raises an even bigger question: *do we still need programmers*, or will AI soon take over these tasks? We'll likely need to be present and involved, but filtering out what's genuinely relevant from the constant stream of new developments is becoming increasingly challenging.
+ My point is that **technology**—and **Data Science** with it—continues to change rapidly, and we as practitioners need to stay ahead of the curve and adopt a continuous-learning mindset. If you caught the recent announcements from Anthropic, demonstrating <cite>Claude taking over your computer [^8]</cite>, it raises an even bigger question: *do we still need programmers*, or will AI soon take over these tasks? We'll likely need to be present and involved, but filtering out what's genuinely relevant from the constant stream of new developments is becoming increasingly challenging.

### Key Challenges and Considerations
* **Overwhelming Choice of Tools and Technologies**: With the rapid release of new programming languages, frameworks, and libraries, Data Scientists face the daunting task of deciding which tools to invest their time in. Should you learn Rust for performance benefits or stick to Python, which remains the industry standard for data science? Every new tool claims to be better, but not all are worth the investment.

* **Fragmentation and Integration**: The sheer number of tools can lead to fragmentation, where teams might struggle to integrate different systems. A new vector database might be faster, but if it doesn't play well with existing data pipelines, it can lead to more issues than it solves.

* **Evolving Skillsets**: The skillset required for Data Scientists continues to evolve. It's no longer just about building models; understanding how to fine-tune LLMs, manage data infrastructure, and even craft prompts has become part of the job. This broadening of roles can lead to skill overload, where it's difficult to specialize in any one area.

* **Balancing Innovation and Practicality**: The fast pace of change means that businesses often feel pressured to adopt the latest technologies, even if they don't fully understand their benefits. This can lead to premature adoption of tools that might not be the best fit, resulting in wasted resources and time. Finding the balance between staying innovative and sticking with practical, proven solutions is critical.

* **Filtering Noise from Genuine Innovation**: With the constant stream of new research, it's hard to distinguish what is genuinely innovative from what is just noise. Not every new paper or tool will be groundbreaking, and data professionals must develop a critical eye for what will benefit their work versus what is a distraction.

* **The Future of Programming Roles**: Technologies like Anthropic's Claude, which demonstrate AI taking over complex tasks like operating a computer, raise questions about the future of programming and data science roles. Will the role of a Data Scientist evolve into something entirely different, where human input is more about guiding and supervising AI systems rather than building them?

The rapid pace of technological change presents both opportunities and challenges. While it drives the field of Data Science forward, it also demands constant learning, adaptation, and discernment. Success will depend on navigating this evolving landscape, balancing innovation with practicality, and staying focused on core skills while remaining open to new possibilities.