Is raku Dan RubberSonic?

Intro

Raku is a great software language that draws on the scripting heritage of perl. The newly minted raku Dan modules provide Data ANalytics (geddit?) capabilities to raku for data science and engineering use cases. (Intro slides & demo video here). This is one in an occasional series of blog posts that seek to explore the whys & wherefores of raku Dan.

If you want to check out more, warts and all, visit /r/rakulang on Reddit and ask away!

Disclosure: I am the author of the raku Dan module family. ~p6steve

Why Dan?

It is tempting to say we need Dan for Raku just like Python needs Pandas – to meet the needs of the large community of Data Scientists. There has been some healthy debate in the raku community about how this is valuable functionality that is otherwise lacking. But raku Dan is a necessary, not a sufficient, reason to switch.

The vast majority of Data Scientists are Python+Pandas users. To them, the question Why raku::Dan? is synonymous with the question Why raku? or even Why not Python?

Why not Python?

The majority of this vast majority (i) is preoccupied with ascending the Python+Pandas learning curve, (ii) is preoccupied with getting the data science bit (the problem) to work rather than the toolset, (iii) values their team’s time investment in getting here and does not wish to risk a new and unproven path, or (iv) simply cannot imagine a better way than Python+Pandas. There is no compelling reason to change that outweighs the perceived negatives, and so they won’t.

A significant minority of this vast majority is looking for a better way. Some value speed and are trying Python+Polars. Changing to a similar, but faster library than Pandas is less disruptive than changing language.

Some are trying a purpose-built Data Science language like Julia to get both improved speed and a domain-oriented coding experience.

It seems that the reasons for making a change are a balance of:

  • raw speed
  • coding experience

SIDEBAR:- Notably, there is some limit to “raw speed at all costs” for most changers – otherwise folk would always choose R+data.table or Rust+Polars (the Polars library is written in Rust and uses Arrow2 fast columnar structures – bypassing Python would be an obvious speed up that would avoid the GIL and other overhead).

How Sonic is Raku?

Here, I use the term ‘Sonic’ to reflect raw speed.

Raku is not renowned for its build or execution speed – it’s a scripting language with on-the-fly build-and-run capability, after all. However, it does have very powerful & deep concurrency support, such as hyper and map/reduce operations, which are available in raku Dan from the outset:

say ~my \quants = Series.new([100, 15, 50, 15, 25]);
say ~my \prices = Series.new([1.1, 4.3, 2.2, 7.41, 2.89]); 
say ~my \costs  = Series.new( quants >>*<< prices );

Yes – you can hyper a Dan::Series!
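For readers coming from Python, here is a rough, stdlib-only sketch of what that elementwise multiply is doing (Pandas would vectorise it as quants * prices on two Series; this plain-list version is just for illustration):

```python
# Rough Python analogue of the Raku hyper example above (illustration only;
# Pandas/Polars would vectorise this as `quants * prices` on two Series).
quants = [100, 15, 50, 15, 25]
prices = [1.1, 4.3, 2.2, 7.41, 2.89]

# Elementwise multiply – the job that >>*<< does across two Dan::Series
costs = [q * p for q, p in zip(quants, prices)]
print(costs)
```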

There are also some promising developments in sight for Raku (such as RakuAST) and better VM optimisations.

This table illustrates how raku fits:

Purpose       1990s   2010s     next…?   Technology         Result
Script code   Perl    Python3   Raku     Interpreter / VM   Flexible
System code   C       C++       Rust     Compiler           Fast

Polars is fast because Rust is fast. I love Rust and I believe that Rust and Raku is a great fit. That’s why raku Dan::Polars (a binding from Dan to Rust) is a vital member of the raku Dan module family and I am looking forward to its intro soon. With raku Dan::Polars we will get the same access to Rust and Arrow2 raw speed that Python/Polars has.

SIDEBAR:- a Rust compile (‘cargo build’ specifically) for dan/src/lib.rs (the custom Rust library that connects raku Dan::Polars to Rust/Polars) takes 2m 51s from scratch and 19.31s if I make a small change to the source file (incremental build). In contrast, the similar sized raku Dan/Polars.rakumod build never has a comparable initial compile overhead and an incremental change takes only 1.714s to recompile and run. Boy do I love scripting languages that avoid that 19s dev cycle time!!

So – raku is about as Sonic as Python.

How Rubber is Raku?

Here, I use the term ‘Rubber’ to reflect flexible coding experience.

I have spent the last 6 months learning Rust. Rust is cool. But Rust is hard. Despite the alleviations of generic types, traits and so on, Rust is a strongly typed world (as you would expect from something that can make system code safely).

But a Rust newbie like me can spend ages hammering down type errors and untangling the <T>, into(), clone(), mut, Some(None), unwrap() knots just to get a few lines of code working. Strong typing means thinking as much about the language-imposed patterns as about the problem-solution.

Obviously I am a born scripter. I like to write a line or two that works in practice and then grow my code from there, refactoring with more structure as I start to repeat stuff and need to harden against errors. So the raku model of “get some code working, then gradually layer in type safety as required” is a very natural fit.

This, I submit, is a crucial distinction for Data Science coders.

Strong typing puts substantial cognitive load on the coder and makes dealing with real-world data sources very onerous.

No wonder Python became the ‘darling child’ of the data scientist community. No wonder other strongly typed, though performant, languages of the day like Java or C++ didn’t pass muster. Simply put, data science needs a flexible coding experience – it’s where the rubber hits the road.

So – raku is about as Rubbery as Python.

Why Raku?

Let’s say I have 5m sales records – maybe from a csv, some historic data files, or a data warehouse. Can I be sure that there are absolutely no errors such as ‘1’ (Int) instead of ‘1.0’ (Double), or an ‘l’ (Str), or maybe ‘𝟙’ (Unicode)?

Let’s see how that looks in raku:

> my @a = <1 1.0 l 𝟙>
[1 1.0 l 𝟙]                    #look I can suck up anything
> @a.map(*.^name)
(IntStr RatStr Str IntStr)     #I can just store them as allomorphs
> @a.are
(Str)                          #and check for common parent type
> @a.grep(Real)
(1 1.0 𝟙)                      #I can extract the numbers
> @a.map({$_ ~~ Real ?? .Num !! NaN})
(1 1 NaN 1)                    #or make them all Num (f64)

The .are method picks the narrowest type that is common to all the array items.
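For Python+Pandas readers, a hand-rolled, stdlib-only sketch of the same triage (the as_number helper is hypothetical) gives a feel for how much the allomorphs and .are are doing for free:

```python
# Hand-rolled Python sketch of the Raku triage above (hypothetical helper;
# Raku's allomorphs and .are do this classification automatically, and
# Raku's val() also handles Unicode numerals).
import math

def as_number(item):
    """Return the numeric value of item, or None if it is not numeric."""
    try:
        return float(item)
    except (TypeError, ValueError):
        return None

raw = ["1", "1.0", "l"]

converted = [as_number(x) for x in raw]                       # classify
cleaned = [n if n is not None else math.nan for n in converted]  # -> NaN
numbers = [n for n in cleaned if not math.isnan(n)]           # extract
```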

And it plays into the super well-structured raku gradual type system.

OK – so now I have a hierarchical type space for progressive munging and a set of type tests and an easy way to do type coercion / error rejection.

Raku also lets me step up the enforcement by gradually adding types to my data structures.

Let’s gradually introduce typed variables. my Num @r = @a declares a new array @r whose items must be of type Num. It will reject other item types. This is a powerful way to specify and manage a “contract” between data capture / data munging phases and later data analysis code.

> my @a = <1 1.0 l 𝟙>
[1 1.0 l 𝟙]
> my Num @r = @a
Type check failed in assignment to @r; expected Num but got IntStr (IntStr.new(1, "1"))
  in block <unit> at <unknown file> line 1
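Python has no typed arrays at runtime, so a rough analogue of that contract (the checked_floats helper below is hypothetical, not a real library call) is explicit validation on assignment:

```python
# A rough Python stand-in for Raku's `my Num @r = @a` contract:
# validate on assignment, reject anything that is not already a float.
def checked_floats(items):
    """Return a list of floats, raising TypeError on any non-float item."""
    out = []
    for item in items:
        if not isinstance(item, float):
            raise TypeError(
                f"expected float but got {type(item).__name__}: {item!r}")
        out.append(item)
    return out

ok = checked_floats([1.0, 2.5])      # passes the contract
try:
    checked_floats([1.0, "l"])       # a stray Str-like item is rejected
except TypeError as err:
    rejected = str(err)
```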

Now I can clean up my act like this: (i) replace non-Real items with NaN, then (ii) coerce all the remaining items to Num. I already know there will be no coercion failures, since every remaining item matches type Real.

> @a.map({$_=NaN if not $_~~Real})
(NaN)
> my Num() @r = @a
[1 1 NaN 1]

Here are some of the bits of raku I am using to do the work:

  • $_ is raku for “this item” aka the topic variable as you iterate through a .map
  • ~~ is the raku smartmatch operator that simplifies checking types
  • Num() is a raku coercion type that says “I will take anything that will coerce into this type” and auto performs the coercion if needed on assignment
  • IntStr is a raku allomorph that bridges the Stringy and Numeric type Roles – so I can read e.g. a csv column of decimals without caring whether they will be needed as text or numbers later on (and I can contract and curate for that)
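A Python sketch of what an allomorph buys you (the IntStrish class is hypothetical, not a real library type): one value that answers both as text and as a number, decided at the point of use.

```python
# Hypothetical Python sketch of Raku's IntStr allomorph: a single value
# that behaves as a Str or as an Int depending on how it is used.
class IntStrish:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text            # used as text: behaves like a Str

    def __int__(self):
        return int(self.text)       # used numerically: behaves like an Int

cell = IntStrish("1")               # e.g. one cell read from a csv column
as_text, as_number = str(cell), int(cell)
```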

Many other horrors lurk in data munging and capture. Using raku and its gradual type system is a great way to contract and curate a data pipeline.

Here we have only scratched the surface of the full range of capabilities of raku and gradual types. What about DateTime formats? What about text extraction, regexes and unicode? A lot more for next time…

Raku is RubberSonic

In summary, this blog has been trying to work out why raku::Dan in particular, and therefore raku in general, may have resonance with some of the community of Data Scientists that typically use Python+Pandas today.

It has shown that when it comes to raw speed, Raku (when equipped with Dan::Polars) is about as Sonic as Python.

It has shown that when it comes to a flexible coding experience, Raku is about as Rubbery as Python.

This matters, I think, because the RubberSonic combination places raku Dan squarely in the tradition of the first Python/Pandas pioneers. It is a very good impedance match for the needs of data scientists – where the language does not get in the way of the solution.

Here I have set out a case that raku gradual typing and other capabilities not provided by the Python language can substantially improve consistency and control during the data capture and data munging phases of data analysis.

So, yes, raku Dan is definitely RubberSonic and is a very good fit for the needs of data scientists who feel constrained by the limits of the Python language around concurrency, gradual typing, formal OO with encapsulation, unicode and so on…

Please do leave a comment here, or come and join the raku debate over on reddit.

~p6steve

(c)2022 Henley Cloud Consulting Ltd.
