Writing an IR from Scratch and survive to write a post (long)

C++’s mascot is an obese sick rat with a missing foot*, because it has 1000+ line compiler errors (the stress makes you overeat and damages your immune system) and footguns.

EDIT: Source (I didn’t make up the C++ part)

armchair_progamer@programming.dev · edit-2 1 month ago

I could understand method = associated function whose first parameter is named self, so it can be called like self.foo(…). This would mean functions like Vec::new aren’t methods. But the author’s requirement also excludes functions that take generic arguments like Extend::extend.

However, even the above definition gives old terminology new meaning. In traditionally OOP languages, all functions in a class are considered methods, those only callable from an instance are “instance methods”, while the others are “static methods”. So translating OOP terminology into Rust, all associated functions are still considered methods, and those with/without method call syntax are instance/static methods.

Unfortunately I think that some people misuse “method” to only refer to “instance method”, even in the OOP languages, so to be 100% unambiguous the terms have to be:

Associated function: function in an impl block.
Static method: associated function whose first argument isn’t self (even if it takes Self under a different name, like Box::leak).
Instance method: associated function whose first argument is self, so it can be called like self.foo(…).
Object-safe method: a method callable from a trait object.

armchair_progamer@programming.dev · edit-2 2 months ago

Java the language, in human form.

armchair_progamer@programming.dev · edit-2 2 months ago

I find writing the parser by hand (recursive descent) to be easiest. Sometimes I use a lexer generator, or if there isn’t a good one (idk for Scheme), write the lexer by hand as well. Define a few helper functions and macros to remove most of the boilerplate (you really benefit from Scheme here), and you almost end up writing the rules out directly.

Yes, you need to manually implement choice and figure out what/when to lookahead. Yes, the final parser won’t be as readable as a BNF specification. But I find writing a hand-rolled parser generator for most grammars, even with the manual choice and lookahead, is surprisingly easy and fast.

The problem with parser generators is that, when they work they work well, but when they don’t work (fail to generate, the generated parser tries to parse the wrong node, the generated parser is very inefficient) it can be really hard to figure out why. A hand-rolled parser is much easier to debug, so when your grammar inevitably has problems, it ends up taking less time in total to go from spec to working hand-rolled vs. spec to working parser-generator-generated.

The hand-rolled rules may look something like (with user-defined macros and functions define-parse, parse, peek, next, and some simple rules like con-id and name-id as individual tokens):

; pattern			::= [ con-id ] factor "begin" expr-list "end"
(define-parse pattern
  (mk-pattern
    (if (con-id? (peek)) (next))
    (parse factor)
    (do (parse “begin”) (parse expr-list) (parse “end”))))

; factor 			::= name-id 
; 			 | symbol-literal
; 			 | long-name-id 
; 			 | numeric-literal 
;	 		 | text-literal 
; 			 | list-literal 
; 			 | function-lambda 
; 			 | tacit-arg
; 			 | '(' expr ')' 
(define-parse factor
  (case (peek)
    [name-id? (if (= “.” (peek2)) (mk-long-name-id …) (next))]
    [literal? (next)]
    …))

Since you’re using Scheme, you can almost certainly optimize the above to reduce even more boilerplate.

Regarding LLMs: if you start to write the parser with the EBNF comments above each rule like above, you can paste the EBNF in and Copilot will generate rules for you. Alternatively, you can feed a couple EBNF/code examples to ChatGPT and ask it to generate more.

In both cases the AI will probably make mistakes on tricky cases, but that’s practically inevitable. An LLM implemented in an error-proof code synthesis engine would be a breakthrough; and while there are techniques like fine-tuning, I suspect they wouldn’t improve the accuracy much, and certainly would amount to more effort in total (in fact most LLM “applications” are just a custom prompt on plain ChatGPT or another baseline model).

armchair_progamer@programming.dev · 2 months ago

armchair_progamer@programming.dev · edit-2 2 months ago

My general take:

A codebase with scalable architecture is one that stays malleable even when it grows large and the people working on it change. At least relative to a codebase without scalable architecture, which devolves into “spaghetti code”, where nobody knows what the code does or where the behaviors are implemented, and small changes break seemingly-unrelated things.

Programming language isn’t the sole determinant of a codebase’s scalability, especially if the language has all the general-purpose features one would expect nowadays (e.g. Java, C++, Haskell, Rust, TypeScript). But it’s a major contributor. A “non-scalable” language makes spaghetti design decisions too easy and scalable design decisions overly convoluted and verbose. A scalable language does the opposite, nudging developers towards building scalable software automatically, at least relative to a “non-scalable” language and when the software already has a scalable foundation.

armchair_progamer@programming.dev · edit-2 2 months ago

public class AbstractBeanVisitorStrategyFactoryBuilderIteratorAdapterProviderObserverGeneratorDecorator {
    // boilerplate goes here
}

armchair_progamer@programming.dev · 3 months ago

The ultimate backdoor runner (and cryptominer)

armchair_progamer@programming.dev · edit-2 4 months ago

I believe the answer is yes, except that we’re talking about languages with currying, and those can’t represent a zero argument function without the “computation” kind (remember: all functions are Arg -> Ret, and a multi-argument function is just Arg1 -> (Arg2 -> Ret)). In the linked article, all functions are in fact “computations” (the two variants of CompType are Thunk ValType and Fun ValType CompType). The author also describes computations as “a way to add side-effects to values”, and the equivalent in an imperative language to “a value which produces side-effects when read” is either a zero-argument function (getXYZ()), or a “getter” which is just syntax sugar for a zero-argument function.

The other reason may be that it’s easier in an IR to represent computations as intrinsic types vs. zero-argument closures. Except if all functions are computations, then your “computation” type is already your closure type. So the difference is again only if you’re writing an IR for a language with currying: without CBPV you could just represent closures as things that take one argument, but CBPV permits zero-argument closures.

armchair_progamer@programming.dev · 4 months ago

Go as a backend language isn’t super unusual, there’s at least one other project (https://borgo-lang.github.io) which chosen it. And there are many languages which compile to JavaScript or C, but Go strikes a balance between being faster than JavaScript but having memory management vs. C.

I don’t think panics revealing the Go backend are much of an issue, because true “panics” that aren’t handled by the language itself are always bad. If you compile to LLVM, you must implement your own debug symbols to get nice-looking stack traces and line-by-line debugging like C and Rust, otherwise debugging is impossible and crashes show you raw assembly. Even in Java or JavaScript, core dumps are hard to debug, ugly, and leak internal details; the reason these languages have nice exceptions, is because they implement exceptions and detect errors on their own before they become “panics”, so that when a program crashes in java (like tries to dereference null) it doesn’t crash the JVM. Golang’s backtrace will probably be much nicer than the default of C or LLVM, and you may be able to implement a system like Java which catches most errors and gives your own stacktrace beforehand.

Elm’s kernel controversy is also something completely different. The problem with Elm is that the language maintainers explicitly prevented people from writing FFI to/from JavaScript except in the maintainers’ own packages, after allowing this feature for a while, so many old packages broke and were unfixable. And there were more issues: the language itself was very limited (meaning JS FFI was essential) and the maintainers’ responses were concerning (see “Why I’m leaving Elm”). Even Rust has features that are only accessible to the standard library and compiler (“nightly”), but they have a mechanism to let you use them if you really want, and none of them are essential like Elm-to-JS FFI, so most people don’t care. Basically, as long as you don’t become very popular and make a massively inconvenient, backwards-incompatible change for purely design reasons, you won’t have this issue: it’s not even “you have to implement Go FFI”, not even “if you do implement Go FFI, don’t restrict it to your own code”, it’s “don’t implement Go FFI and allow it everywhere, become very popular, then suddenly restrict it to your own code with no decent alternatives”.

armchair_progamer@programming.dev · 4 months ago

https://github.com/cell-lang/example-online-forum

https://github.com/cell-lang/example-imdb

A more complex example, the compiler itself is written in Cell: https://github.com/cell-lang/compiler/tree/master

The getting started page has download links for the code generators (C++, Java, and C#) and setup instructions.

armchair_progamer@programming.dev · 4 months ago

Another resource is PLDB which is more of an index: it has many more languages, but less detail and curation.

armchair_progamer@programming.dev · 4 months ago

You probably need to annotate recursive function return values. I know in some languages like Swift and Kotlin, at least in some cases this is required; and other languages have you annotate every function’s return value (or at least, every function which isn’t a lambda expression). So IMO it’s not even as much of a drawback as the other two.

armchair_progamer@programming.dev · 5 months ago

Code is data.

I would just transmit the program as source (literally send y = mx + b) instead of trying to serialize it into JSON or anything. You’ll have to write a parser anyways, printing is very easy, and sending as source minimizes the points of failure.

Your idea isn’t uncommon, but there’s no standard set of operations because it varies depending on what you code needs to achieve. For instance, your code needs to make plots, other programs send code for plugins, deployment steps, or video game enemies.

There is a type of language for what you’re trying to achieve, the embedded scripting language (“embedded” in that it’s easy to interpret and add constants/hooks from a larger program). And the most well-known and well-supported embedded scripting language is Lua. Alternatively, assuming you’re only using basic math operators, I recommend you just make your own tiny language, it may actually be easier to do so than integrate Lua.

armchair_progamer@programming.dev · 6 months ago

IMO the 1 and only important reason is that PHP today is much different than PHP of the past. PHP’s notoriety comes from its early days, but now I hear it’s another general-purpose language with modern design, good IDE support, and tons of online resources. Plus it’s a explicitly designed for server-side scripting, so if that’s your goal it will be the best (most straightforward and supported) choice.

https://www.reddit.com/r/webdev/comments/wt6wam/newbie_here_is_using_php_still_fine/

https://phptherightway.com/

armchair_progamer@programming.dev · edit-2 7 months ago

No, that would probably be shuai.fu@monash.edu. I just saw it on HN.

armchair_progamer@programming.dev · edit-2 7 months ago

The original (C) Cocinelle page may have better use cases. Cocinelle for Rust is in the early stages, Cocinelle for C has been around for a long time.

armchair_progamer@programming.dev · edit-2 7 months ago

Yes. Unfortunately I’m not very familiar with GHC and HLS internals.

What I do know is that when GHC is invoked with a certain flag, as it compiles each Haskell file, it generates a .hie file with intermediate data which can be used by IDEs. So HLS (and maybe JetBrains Haskell) works by invoking GHC to parse and type-check the active haskell file, and then queries the .hie file to get diagnostics, symbol lookups, and completions. There’s also a program called hiedb which can query .hie files on the command line.

Rust’s language server is completely different (salsa.rs). The common detail is that both of them have LLVM IR, but also higher-level IRs, and the compiler passes information in the higher-level IRs to IDEs for completions/lookups/diagnostics/etc. (in Haskell via the generated .hie files)

Moderates